← More big round numbers Static analysis benchmarks →

Making OCaml native code 0.5% shorter on Mac OS X

Pascal Cuoq - 7th Jan 2012

Mac OS X and assembly labels

A few months ago, I was moving things around in Zarith. It's a good way to relax not unlike gardening. And then I noticed something strange.

On Mac OS X a label in assembly code is marked as local by prefixing it with "L" unlike another common convention of using ".".
On Mac OS X the assembler has some sort of strange limitation that makes it generate long variants of jump instructions if the destination label is not local even if the label is in the same compilation unit (and its relative position known).

Here is a hand-written assembly snippet to show this:

	testl	%eax  %eax 
	jne	L6 
	testl	%ebx  %ebx 
	jne	.L7 
	movl	$0  %ecx 
L6: 
	movl	$0  %eax 
.L7: 
	movl	$0  %ebx

The two above facts together mean that on Mac OS X the snippet is compiled into object code that can then be disassembled as:

0x0000000000000057 <main+24>:	test   %eax %eax 
0x0000000000000059 <main+26>:	jne    0x68 <main+41> 
0x000000000000005b <main+28>:	test   %ebx %ebx 
0x000000000000005d <main+30>:	jne    0x63 <main+36> 
0x0000000000000063 <main+36>:	mov    $0x0 %ecx 
0x0000000000000068 <main+41>:	mov    $0x0 %eax 
0x000000000000006d <.L7+0>:	mov    $0x0 %ebx

You may notice that since .L7 is not a local label gdb considers that it may be the name you want to see at address 0x6d. This is just a heuristic of little importance. The second conditional jump at <main+30> appears to be going to <main+36> but this is just because we are looking at an unlinked object file. The destination has been left blank in the object file and since it is expressed as a relative offset the default value 0 makes it look like the destination is the instruction that immediately follows. More to the point the first conditional jump at <main+26> occupies two bytes because the assembler sees that the destination is close and that the relative offset fits into one byte whereas the second conditional jump at <main+30> occupies 6 bytes leaving room for a 4-byte encoding of the target address.

Enter OCaml. The plot thickens.

Antoine Miné and Xavier Leroy then respectively contributed the following additional facts:

OCaml generates labels intended to be local with a ".L" prefix on Mac OS X. This at first sight seems inefficient since it leads the assembler to use the long encoding of jumps all the time even when the destination is nearby.
But in fact Mac OS X's linker complains when you have subtracted local labels in the file being linked so that if you intend to subtract some of the labels you are generating you shouldn't make them local anyway.

Subtracting addresses indicated by local labels is something that OCaml does in the process of generating meta-data accompanying the code in the assembly file. Thus on Mac OS X OCaml is prevented from using proper local labels with an "L" prefix.

Mac OS X's compilation chain is of course being obtuse. It could generate the short jump variants when the non-local destination label happens to be known and nearby. It could as well compute whatever subtractions between local labels occur in an assembly file while it is being assembled instead of leaving them for later and then complaining that it's impossible. The usual GNU compilation suite on a modern Linux distribution gets both of these features right and Mac OS X's gets both of them wrong leaving the OCaml native compiler no choice but to generate the inefficiently compiled non-local labels. Mac OS X deserves all the blame here.

Solution: a hack

Are we destined to keep ugly 6-byte jumps in our OCaml-generated native code on Mac OS X then? No because I made a hack. In Frama-C's Makefile I changed the rule to compile .ml files into the .cmx .o ... natively compiled versions thus:

--- share/Makefile.common	(revision 16792) 
+++ share/Makefile.common	(working copy) 
@@ -306 7 +306 10 @@ 
 %.cmx: %.ml 
 	$(PRINT_OCAMLOPT) $@ 
-	$(OCAMLOPT) -c $(OFLAGS) $< 
+	$(OCAMLOPT) -S -c $(OFLAGS) $< 
+	sed -f /Users/pascal/ppc/sed_asm \ 
+            < $(patsubst %.ml %.s $<) > $(patsubst %.ml %.1.S $<) 
+	gcc -c $(patsubst %.ml %.1.S $<) -o $(patsubst %.ml %.o $<) 
 # .o are generated together with .cmx  but %.o %.cmx: %.ml only confuses 
 # make when computing dependencies...

This uses ocamlopt's -S option to generate the assembly file from the .ml source code file. Then a sed script is applied to the assembly file to modify it a little. And finally the modified assembly file is compiled (gcc can be used for this).

The sed commands to transform the assembly file are these:

s/^[.]\(L[0-9]*\):/.\1: \1:/g 
s/\([[:space:]]*j.*[[:space:]]*\)[.]\(L[0-9]*\)$/\1\2/g

The first command transform all label declarations (e.g. .L100:) into a double declaration .L100: L100:. The two labels thus indicate the same location but the second one is local whereas the first one isn't.

The second command transforms labels when used inside jump instructions so that jne .L100 is transformed into jne L100. Crucially it does not transform labels elsewhere for instance when referenced in the meta-data that the OCaml compiler generates where the compiler may have to subtract one label's address from another's.

Results

On Mac OS X the described trick makes Ocaml-generated native code smaller:

-rwxr-xr-x  1 pascal  staff  11479984 Jan  6 23:24 bin/toplevel.old 
-rwxr-xr-x  1 pascal  staff  11414512 Jan  7 00:12 bin/toplevel.opt

The nearly 65000 bytes of difference between the two version represent the accumulation of all the inefficiently assembled jumps in a large piece of software such as Frama-C.

Conclusion

If you play with the above compilation rule and sed script do not expect much in terms of speed improvements: changing an inefficient encoding into an efficient one of the same instruction helps the processor but only marginally. I guess it would be measurable but not with my usual protocol of launching three of each and keeping the median measurement. I would have to learn about confidence intervals which does not sound fun (not like gardening at all). Instead I will avoid making the claim that this hack improves execution speed and I will just postpone a bit more the moment I have to learn about statistics.