Dead code elimination dead end: OCaml doesn't tree-shake

If you write code for the web, one of your most common output concerns is the size of the JS bundle you ship to customers. Bigger bundles means longer loading times, and more customers turning away.

I work mostly on the backend building serverless applications, and interestingly the same concerns apply. Lambda cold starts affect user experience in much the same way by slowing down request times when a warm lambda container is not available¹. Large package sizes makes your cold start longer and hence makes API requests slower.

Although there are solutions like provisioned concurrency that can help reduce the number of customers experiencing cold starts, it's an expensive workaround and it can't eliminate them entirely, so it makes sense to minimise this cause.

I've been creating AWS SDKs for OCaml with the intention of using them in serverless applications, so output binary size has been on my mind. I've been wondering how you minimise binary size in OCaml, and have led myself down this rabbit hole.

Native code output - the ELF format

ELF is the principle binary format for executables on Linux, which is the main OS available for Lambda, so that is what I have been analysing (the principles are the same for Windows and MacOS).

An ELF file consists of sections, and the main sections that contribute to the bulk of its size with an OCaml application are:

.text: the compiled code
.data: strings i.e. any static strings (or other data values) used in the application
.rela.dyn: dynamic relocation information
.strtab .symtab: strings associated with symbol tables

We can see the relative sizes of these with the bloaty tool:

➜  aws-smsdk git:(master) ✗ bloaty _build/default/awssdklib_examples/dynamodb_createTableAndWriteData.exe
    FILE SIZE        VM SIZE
 --------------  --------------
  21.4%  2.70Mi  34.6%  2.70Mi    .text
  15.6%  1.97Mi  25.2%  1.97Mi    .data
   9.5%  1.20Mi  15.4%  1.20Mi    .rela.dyn
   9.2%  1.16Mi   0.0%       0    .debug_info
   7.4%   958Ki   0.0%       0    .strtab
   6.5%   835Ki   0.0%       0    .symtab
   5.9%   757Ki   0.0%       0    .debug_line
   4.4%   569Ki   7.1%   569Ki    .dynstr
   4.1%   525Ki   6.6%   525Ki    .eh_frame
   3.4%   444Ki   0.0%       0    .debug_str
   3.3%   429Ki   5.4%   429Ki    .dynsym
   3.3%   425Ki   0.0%       0    .debug_loclists
   1.2%   150Ki   1.9%   150Ki    .gnu.hash
   1.1%   146Ki   0.0%       0    .debug_abbrev
   0.9%   116Ki   1.5%   116Ki    .eh_frame_hdr
   0.7%  85.4Ki   0.0%       0    .debug_aranges
   0.6%  74.5Ki   0.7%  55.6Ki    [24 Others]
   0.0%       0   0.9%  70.1Ki    .bss
   0.5%  66.6Ki   0.0%       0    .debug_rnglists
   0.5%  66.3Ki   0.8%  66.3Ki    .rodata
   0.4%  46.2Ki   0.0%       0    .debug_ranges
 100.0%  12.6Mi 100.0%  7.81Mi    TOTAL

What makes a large binary

The example in the previous section is one my OCaml AWS SDK examples. Although this example is only a toy interacting with the DynamoDB SDK, it's already around 12MiB.

The reason why can be found by inspecting the symbol table using nm (the middle column is the size):

nm -S -t d --size-sort _build/default/awssdklib_examples/dynamodb_createTableAndWriteData.exe
0000000006770248 0000000000039064 D camlYojson__Basic.frametable
0000000006462896 0000000000038976 D camlAws_SmSdk_Client_DynamoDB__Operations.frametable
0000000008262688 0000000000035840 b all_domains
0000000008142104 0000000000030376 D camlStdlib__Scanf.frametable
0000000007505048 0000000000030104 D camlBase__List.frametable
0000000006817688 0000000000030008 D camlXmlm.frametable
0000000007661312 0000000000029992 D camlBase__String.frametable
0000000007581192 0000000000027360 D camlBase__Sequence.frametable
0000000006510496 0000000000023344 D camlAws_SmSdk_Client_DynamoDB__Builders.frametable
0000000007268440 0000000000023136 D camlUri.frametable
0000000004255920 0000000000021790 T camlBase__String.entry
0000000007355984 0000000000020056 D camlFmt.frametable
0000000007175832 0000000000019552 D camlCstruct.frametable
0000000006944072 0000000000018944 D camlPsq.frametable
0000000008116288 0000000000018464 D camlStdlib__Format.frametable
0000000002783520 0000000000018041 T camlAws_SmSdk_Client_DynamoDB__Deserializers.entry
0000000002654256 0000000000017848 T camlAws_SmSdk_Client_DynamoDB__Serializers.entry

OCaml statically links all its libraries to create self-contained executables, so we'd expect to see all our linked libraries included in the output.

In the above snapshot, the top contributors are data sections (D) from our AWS DynamoDB SDK (camlAws_smSdk_dynamoDB_*). We can also see dependencies like xmlm (camlXmlm*), an XML parser used by the AWS SDK, but not required for the DynamoDB SDK which uses JSON (via Yojson i.e. camlYojson_*) for its protocol.

As suspected, OCaml does not throw away modules when it statically links, even if they are unused in the imported libraries. Its compiler will only exclude entire libraries if they are not required by any of the code in the executable module.

How to tree-shake in binaries

I've used the terms tree shaking and dead code elimination interchangably in this article. Although they are both concerned with the removal of unused code, I'm concerned with the macro-optimisation of eliminating unused functions and modules

On the web, tree shaking is usually done by bundlers like esbuild or webpack. They parse the TypeScript or JavaScript included and make an assessment of declarations which don't appear to have any global effects and are unreferenced by the code.

Unused code is dropped, and can be additionally made smaller by obfuscating symbols and removing whitespace. The same tools can be used when deploying NodeJS code to AWS Lambda to help reduce cold starts by minimising package size ².

In compiled languages targeting native code, dead-code elimination can be done at two levels:

in the compiler: gcc provides flags like -fdce which eliminate unused code inside functions
in the linker: ldd provides --gc-sections, which removes unused ELF sections from the code

The latter, --gc-sections only works when the compiler splits up code and data over multiple sections. This can be achieved with -ffunction-sections -fdata-sections³ which put each function and string into their own ELF section.

Although this can make the overall binary larger and disable some relative location optimisations, the linker can shrink statically compiled code if there are enough unused sections. This is advantegous when we want to optimise for download and startup time, instead of runtime performance.

The curious `-function-sections` option in OCaml

It's not documented, but OCaml also has a -function-sections option (thanks prometheansacrifice) that achieves much the same thing. When you compile a library with it, you can see the effect in bloaty:

   FILE SIZE        VM SIZE
 --------------  --------------
  15.1%   293Ki  42.5%   293Ki    .data
   9.5%   184Ki   0.0%       0    .strtab
   8.3%   161Ki   0.0%       0    .rela.data
   7.9%   153Ki   0.0%       0    [ELF Section Headers]
   5.6%   108Ki   0.0%       0    .shstrtab
   5.4%   105Ki   0.0%       0    .symtab
   4.9%  96.0Ki   0.0%       0    [AR Symbol Table]
   2.2%  43.0Ki   6.2%  43.0Ki    .eh_frame
   1.7%  33.3Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Deserializers.entry
   1.7%  33.3Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Serializers.entry
   1.4%  27.5Ki   0.0%       0    .rela.eh_frame
   0.9%  17.6Ki   2.5%  17.6Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB__Deserializers.entry
   0.9%  17.4Ki   2.5%  17.4Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB__Serializers.entry
   0.8%  15.1Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Builders.entry
   0.7%  12.9Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB.entry
   0.6%  12.1Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Operations.entry
   0.5%  9.74Ki   1.4%  9.74Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB.entry
   0.5%  9.23Ki   1.3%  9.23Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB__Operations.entry
   0.4%  7.52Ki   1.1%  7.51Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB__Builders.entry
   0.2%  3.47Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Deserializers.table_description_of_yo
json_1510
   0.2%  3.05Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Deserializers.export_description_of_y
ojson_2796
   0.2%  2.95Ki   0.4%  2.95Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB__Serializers.table_description_to_yojson_14
30
   0.1%  2.84Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Serializers.table_description_to_yojs
on_1430
   0.1%  2.77Ki   0.0%       0    .rela.rodata
   0.1%  2.77Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Deserializers.import_table_descriptio
n_of_yojson_2718
   0.1%  2.59Ki   0.4%  2.58Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB__Serializers.export_description_to_yojson_2
170
   0.1%  2.48Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Serializers.export_description_to_yoj
son_2170
   0.1%  2.44Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Deserializers.query_input_of_yojson_2
360
   0.1%  2.34Ki   0.3%  2.34Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB__Serializers.import_table_description_to_yo
json_2127
   0.1%  2.30Ki   0.0%       0    .rela.text.caml.camlAws_SmSdk_Client_DynamoDB__Deserializers.scan_input_of_yojson_22
59
   0.1%  2.30Ki   0.3%  2.29Ki    .text.caml.camlAws_SmSdk_Client_DynamoDB__Deserializers.table_description_of_yojson_
1510

(notice all the .rela.text and .text sections, referring to the code)

Although these are now split up, running the linker with --gc-sections (and --print-gc-sections to show what is removed) demonstrates this has no meaningful effect, even though we are only using a small part of the SDK.

BEFORE

-r-xr-xr-x 1 chris chris 13212208 Oct 22 21:22 _build/default/awssdklib_examples/dynamodb_createTableAndWriteData.exe

AFTER

-r-xr-xr-x 1 chris chris 13193568 Oct 22 21:28 _build/default/awssdklib_examples/dynamodb_createTableAndWriteData.exe

Unfortunately, I don't understand enough about the compiler internals in order to explain why this is the case. I suspect it's something to do with the .data section (there is no equivalent -data-sections option in OCaml) containing references that prevent its elimination, but I can't be sure.

Previous efforts

Some discussion on X pointed me to an old PR that aimed to address this issue through a flambda optimiser stage in the compiler.

As far as I can tell, it works by pulling all the libraries together into one .cmxa (the OCaml library file) and then eliminating unused functions before passing it to the linker. This PR has yet to be worked on and merged and seems to have fallen by the wayside (I can't work out if its approach is acceptable to compiler maintainers, or if it just needs to be brought up to date and reviewed).

Other avenues of enquiry

I've also tried to understand bits of the compiler to see if there's anything obvious I can do, but becoming a compiler developer is not on my list of priorities right now.

It's also possible that none of this matters: I haven't proven if these sorts of binary sizes will have much of an impact on cold starts of Lambda functions that are powered by OCaml.

Package size has an impact for other languages like NodeJS which have to load a lot of files from the filesystem and parse and compile their code on startup, but compiled languages like OCaml or Rust simply need their binaries to be copied into memory.

Another possibility is that even though there are lots of redundant strings, Lambda deployment packages are zipped, and the binary compresses well (to about 32% of its size!). That already eliminates a lot of unwanted download time. A single executable also means less filesystem operations (bundled NodeJS applications also benefit from this).

Once I've released the SDK and started building more examples, I think I'll revisit this with a more scientific experiment comparing OCaml binary sizes and Lambda settings to see its effect on cold starts.

Until then, more hacking on the SDK (which is almost ready for a preview release).

Lambda functions are backed by ephemeral containers that start up on-demand and disappear when they are no longer being used. Those containers are reused as more requests come in, so those requests avoid startup time, but eventually the container will disappear. Lambda function code is deployed to object buckets and downloaded to the container when it starts for the first time, so cold starts are made longer by larger code bundles downloading over the network and being unzipped. Large package sizes is one of the few causes of cold starts - the other is insufficient provisioned memory (which affects the selected CPU speed too). ↩
They can also make deployment quicker because less code needs to be uploaded over the network. ↩
https://gcc.gnu.org/onlinedocs/gcc-10.1.0/gcc/Optimize-Options.html ↩