Exploring Mach-O, Part 4

January 16, 2022

This is part 4 of a 4-part series exploring the structure of the Mach-O file format. Here are links to part 1, part 2, and part 3.

When we started this series we didn’t know anything about Mach-O other than some vague idea that it’s a binary file format used by macOS. By now, we have a much better understanding of how Mach-O is laid out and how the operating system uses the information in a Mach-O file to run a program.

We learned that every 64-bit Mach-O file has a 32-byte header right at the beginning (the 32-bit variant of the header is 28 bytes). The first 4 bytes of the header (and thus the first 4 bytes of every Mach-O file) are a magic number that indicates that the file is, in fact, encoded using Mach-O. Much to our delight, we found that this magic number is 0xfeedface for 32-bit binaries and 0xfeedfacf for 64-bit binaries.

This header tells us a lot more about our file too, including the type of CPU it is compiled for and the size of the load commands that come directly after it.
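As a quick recap of the header layout, here is a minimal sketch in Python (rather than the Zig we used in the series) that reads the fixed-size header with the standard struct module. The function name parse_header and the returned dictionary are made up for illustration, and the sketch assumes a little-endian file, which is what you get on x86 and ARM64.

```python
import struct

MH_MAGIC = 0xFEEDFACE     # 32-bit Mach-O
MH_MAGIC_64 = 0xFEEDFACF  # 64-bit Mach-O

# Field order shared by the 32-bit and 64-bit headers.
FIELD_NAMES = ("magic", "cputype", "cpusubtype", "filetype",
               "ncmds", "sizeofcmds", "flags")


def parse_header(data: bytes) -> dict:
    """Parse a little-endian Mach-O header from the start of `data`.

    Hypothetical helper for illustration; not the API from the series.
    """
    (magic,) = struct.unpack_from("<I", data, 0)
    if magic == MH_MAGIC_64:
        fmt = "<IiiIIIII"  # 8 fields, the last one is reserved
    elif magic == MH_MAGIC:
        fmt = "<IiiIIII"   # 7 fields, no reserved field
    else:
        raise ValueError(f"not a little-endian Mach-O file: magic {magic:#x}")
    fields = struct.unpack_from(fmt, data, 0)
    header = dict(zip(FIELD_NAMES, fields))
    # The load commands start right after the header.
    header["header_size"] = struct.calcsize(fmt)
    return header
```

Passing the first few dozen bytes of a Mach-O file to this function yields the CPU type and the total size of the load commands (sizeofcmds), which is everything a parser needs before it starts walking the commands themselves.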

After the header comes a sequence of load commands. Every load command starts with the same two fields: the command type (which we encoded as a Command enum in Zig) and the total size in bytes of the command. Each command is parsed slightly differently, and some commands (such as segment_command) are followed directly by additional data structures.
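Because every load command begins with its type and its size, walking the whole list only requires reading those two fields and skipping ahead. The following is a minimal Python sketch of that loop, assuming a 64-bit little-endian file; iter_load_commands is a hypothetical name, not something from the Zig parser we wrote.

```python
import struct


def iter_load_commands(data: bytes, header_size: int = 32):
    """Yield (cmd, cmdsize, offset) for each load command.

    Assumes a 64-bit little-endian Mach-O file held entirely in `data`.
    """
    # ncmds and sizeofcmds sit at byte offsets 16 and 20 of the header.
    ncmds, sizeofcmds = struct.unpack_from("<II", data, 16)
    offset = header_size
    end = header_size + sizeofcmds
    for _ in range(ncmds):
        # Every load command starts with the same two 32-bit fields.
        cmd, cmdsize = struct.unpack_from("<II", data, offset)
        if cmdsize < 8 or offset + cmdsize > end:
            raise ValueError(f"malformed load command at offset {offset}")
        yield cmd, cmdsize, offset
        # cmdsize covers the whole command, including any trailing
        # structures such as the sections of a segment command.
        offset += cmdsize
```

A caller can dispatch on cmd and parse the bytes at offset however that particular command requires, while unknown commands are skipped safely because their size is always known.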

There are a lot of different load commands that a Mach-O file can contain: our Command enum has 54 different values. Not every Mach-O file contains all of these commands, of course. We primarily investigated the Segment commands, which are an important concept for understanding how a Mach-O file is laid out. Each segment can have zero or more sections, and in our tiny exit program we found 3 segments: __PAGEZERO, __TEXT, and __LINKEDIT. More complex programs will have more segments, notably a __DATA segment, which our program lacks.
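To make the segment command concrete, here is a small Python sketch that decodes the fixed 72-byte part of a 64-bit segment command. The Segment tuple and parse_segment_64 are illustrative names of our own choosing, and the sketch again assumes a little-endian file.

```python
import struct
from typing import NamedTuple

LC_SEGMENT_64 = 0x19
# cmd, cmdsize, segname[16], vmaddr, vmsize, fileoff, filesize,
# maxprot, initprot, nsects, flags
SEGMENT_64_FMT = "<II16sQQQQiiII"


class Segment(NamedTuple):
    name: str
    vmaddr: int
    vmsize: int
    fileoff: int
    filesize: int
    nsects: int


def parse_segment_64(data: bytes, offset: int) -> Segment:
    """Decode the 64-bit segment command that starts at `offset`."""
    (cmd, _cmdsize, segname, vmaddr, vmsize, fileoff, filesize,
     _maxprot, _initprot, nsects, _flags) = struct.unpack_from(
        SEGMENT_64_FMT, data, offset)
    if cmd != LC_SEGMENT_64:
        raise ValueError(f"not a 64-bit segment command: {cmd:#x}")
    # The name is a fixed 16-byte field padded with NUL bytes.
    name = segname.split(b"\x00", 1)[0].decode("ascii")
    return Segment(name, vmaddr, vmsize, fileoff, filesize, nsects)
```

The nsects field says how many section structures follow the segment command, which is why a segment such as __PAGEZERO with zero sections is just the bare 72 bytes.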

Immediately following the load commands are the contents of the segments. We found that our file contains a large block of contiguous zeros, and we discovered that this is because the header and load commands are themselves part of the __TEXT segment. The sections within the __TEXT segment are aligned to the end of the segment, and because segments must be page aligned, there can often be large chunks of unused space between the end of the load commands and the start of the first section in the __TEXT segment. This problem is “worse” on systems with larger page sizes, such as Apple’s ARM64 machines (which use a 16 KiB page size rather than the 4 KiB pages that are standard on x86).
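The size of that block of zeros is simple arithmetic. As a rough Python sketch: round the bytes actually used up to a whole number of pages, and whatever is left over is padding. The function name and the byte counts below are invented for illustration, and the sketch ignores the alignment requirements of the individual sections.

```python
def text_segment_padding(header_and_cmds: int, sections_size: int,
                         page_size: int) -> int:
    """Estimate the unused bytes inside a page-aligned __TEXT segment.

    Simplified model: the header and load commands sit at the start,
    the sections sit at the end, and everything in between is zeros.
    """
    used = header_and_cmds + sections_size
    # Round `used` up to the next multiple of the page size.
    segment_size = -(-used // page_size) * page_size
    return segment_size - used


# Illustrative numbers only: 776 bytes of header plus load commands
# and 100 bytes of sections.
padding_4k = text_segment_padding(776, 100, 4 * 1024)
padding_16k = text_segment_padding(776, 100, 16 * 1024)
```

With these made-up sizes, the same tiny program carries 3220 bytes of padding with 4 KiB pages and 15508 bytes with 16 KiB pages, which is why the gap looks so much larger on ARM64.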

We learned how to use Zig to parse binary file formats, such as using packed structs to represent bit fields and validating enum values using the std.meta.intToEnum standard library function. We also explored some cool parts of Zig’s comptime features to make an elegant and easy-to-use API for our parser.

But wait, there’s more

If you followed along then you know there is a lot we didn’t cover. There are still a lot of load commands we didn’t even talk about, and who knows what kind of goodies those involve.

We also didn’t talk about universal binaries. Since adopting Mach-O, Apple has moved the Mac between ISAs twice: first from PowerPC to Intel x86, and then again to ARM. Because of this, they figured out a long time ago how to combine binaries built for two or more architectures into a single file. Apple refers to this as a “universal binary”, which is a packaged archive of multiple Mach-O files. Universal binaries have their own format, including their own header and magic number (spoiler alert: the universal binary magic number is 0xcafebabe. Is that even better than 0xfeedface?).
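If you want a head start on that rabbit hole, here is a minimal Python sketch that lists the architectures inside a universal binary. It assumes the classic 32-bit layout: a big-endian header holding the magic number and an entry count, followed by one 20-byte entry per architecture. The name parse_fat_archs and the dictionary keys are our own choices for illustration.

```python
import struct

FAT_MAGIC = 0xCAFEBABE


def parse_fat_archs(data: bytes) -> list:
    """Return one dict per architecture slice in a universal binary.

    The universal header is stored big-endian, regardless of the
    byte order used by the Mach-O files it contains.
    """
    magic, nfat_arch = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError(f"not a universal binary: magic {magic:#x}")
    archs = []
    for i in range(nfat_arch):
        # Each entry is 20 bytes and follows the 8-byte header.
        cputype, cpusubtype, offset, size, align = struct.unpack_from(
            ">iiIII", data, 8 + 20 * i)
        archs.append({
            "cputype": cputype,
            "cpusubtype": cpusubtype,
            "offset": offset,  # where this slice's Mach-O file begins
            "size": size,      # length of the slice in bytes
            "align": align,    # alignment as a power of two
        })
    return archs
```

Each entry’s offset points at an ordinary Mach-O file, so everything from the earlier parts of this series applies to the bytes found there. One caveat: Java class files begin with the same 0xcafebabe value, so a robust tool needs more than the magic number to tell the two apart.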

The point is, there are many rabbit holes for you to follow if you’re interested. I hope you do, and if you do, I hope you write about it!

Thanks for reading!

Last modified on January 19, 2022