Making OCaml native code 0.5% shorter on Mac OS X
By pascal on Saturday, January 7 2012, 13:00 - Permalink
Mac OS X and assembly labels
A few months ago, I was moving things around in Zarith. It's a good way to relax, not unlike gardening. And then I noticed something strange.
- On Mac OS X, a label in assembly code is marked as local by prefixing it with "L", unlike another common convention of using ".".
- On Mac OS X, the assembler has some sort of strange limitation that makes it generate long variants of jump instructions if the destination label is not local, even if the label is in the same compilation unit (and its relative position known).
Here is a hand-written assembly snippet to show this:
testl %eax, %eax jne L6 testl %ebx, %ebx jne .L7 movl $0, %ecx L6: movl $0, %eax .L7: movl $0, %ebx
The two above facts together mean that on Mac OS X, the snippet is compiled into object code that can then be disassembled as:
0x0000000000000057 <main+24>: test %eax,%eax 0x0000000000000059 <main+26>: jne 0x68 <main+41> 0x000000000000005b <main+28>: test %ebx,%ebx 0x000000000000005d <main+30>: jne 0x63 <main+36> 0x0000000000000063 <main+36>: mov $0x0,%ecx 0x0000000000000068 <main+41>: mov $0x0,%eax 0x000000000000006d <.L7+0>: mov $0x0,%ebx
You may notice that since
.L7 is not a local label,
gdb considers that it may be the name you want to see at address 0x6d. This is just a heuristic of little importance. The second conditional jump at <main+30> appears to be going to <main+36>, but this is just because we are looking at an unlinked object file. The destination has been left blank in the object file, and since it is expressed as a relative offset, the default value 0 makes it look like the destination is the instruction that immediately follows. More to the point, the first conditional jump at <main+26> occupies two bytes, because the assembler sees that the destination is close and that the relative offset fits into one byte, whereas the second conditional jump at <main+30> occupies 6 bytes, leaving room for a 4-byte encoding of the target address.
Enter OCaml. The plot thickens.
Antoine Miné and Xavier Leroy then respectively contributed the following additional facts:
- OCaml generates labels intended to be local with a ".L" prefix on Mac OS X. This at first sight seems inefficient, since it leads the assembler to use the long encoding of jumps all the time, even when the destination is nearby.
- But in fact, Mac OS X's linker complains when you have subtracted local labels in the file being linked, so that if you intend to subtract some of the labels you are generating, you shouldn't make them local anyway.
Subtracting addresses indicated by local labels is something that OCaml does in the process of generating meta-data accompanying the code in the assembly file. Thus, on Mac OS X, OCaml is prevented from using proper local labels with an "L" prefix.
Mac OS X's compilation chain is, of course, being obtuse. It could generate the short jump variants when the non-local destination label happens to be known and nearby. It could as well compute whatever subtractions between local labels occur in an assembly file while it is being assembled, instead of leaving them for later and then complaining that it's impossible. The usual GNU compilation suite on a modern Linux distribution gets both of these features right, and Mac OS X's gets both of them wrong, leaving the OCaml native compiler no choice but to generate the inefficiently compiled non-local labels. Mac OS X deserves all the blame here.
Solution: a hack
Are we destined to keep ugly 6-byte jumps in our OCaml-generated native code on Mac OS X then? No, because I made a hack. In Frama-C's Makefile, I changed the rule to compile .ml files into the .cmx, .o, ..., natively compiled versions thus:
--- share/Makefile.common (revision 16792) +++ share/Makefile.common (working copy) @@ -306,7 +306,10 @@ %.cmx: %.ml $(PRINT_OCAMLOPT) $@ - $(OCAMLOPT) -c $(OFLAGS) $< + $(OCAMLOPT) -S -c $(OFLAGS) $< + sed -f /Users/pascal/ppc/sed_asm \ + < $(patsubst %.ml,%.s,$<) > $(patsubst %.ml,%.1.S,$<) + gcc -c $(patsubst %.ml,%.1.S,$<) -o $(patsubst %.ml,%.o,$<) # .o are generated together with .cmx, but %.o %.cmx: %.ml only confuses # make when computing dependencies...
ocamlopt's -S option to generate the assembly file from the .ml source code file. Then, a sed script is applied to the assembly file to modify it a little. And finally, the modified assembly file is compiled (gcc can be used for this).
The sed commands to transform the assembly file are these:
s/^[.]\(L[0-9]*\):/.\1: \1:/g s/\([[:space:]]*j.*[[:space:]]*\)[.]\(L[0-9]*\)$/\1\2/g
The first command transform all label declarations (e.g.
.L100:) into a double declaration
.L100: L100:. The two labels thus indicate the same location, but the second one is local whereas the first one isn't.
The second command transforms labels when used inside jump instructions, so that
jne .L100 is transformed into
jne L100. Crucially, it does not transform labels elsewhere, for instance when referenced in the meta-data that the OCaml compiler generates, where the compiler may have to subtract one label's address from another's.
On Mac OS X, the described trick makes Ocaml-generated native code smaller:
-rwxr-xr-x 1 pascal staff 11479984 Jan 6 23:24 bin/toplevel.old -rwxr-xr-x 1 pascal staff 11414512 Jan 7 00:12 bin/toplevel.opt
The nearly 65000 bytes of difference between the two version represent the accumulation of all the inefficiently assembled jumps in a large piece of software such as Frama-C.
If you play with the above compilation rule and sed script, do not expect much in terms of speed improvements: changing an inefficient encoding into an efficient one of the same instruction helps the processor, but only marginally. I guess it would be measurable, but not with my usual protocol of launching three of each and keeping the median measurement. I would have to learn about confidence intervals, which does not sound fun (not like gardening at all). Instead, I will avoid making the claim that this hack improves execution speed, and I will just postpone a bit more the moment I have to learn about statistics.