Exploring Mach-O, Part 2

January 16, 2022

This is part 2 of a four-part series exploring the structure of the Mach-O file format. Here are links to part 1, part 3, and part 4.

Last time, we created our own tiny Mach-O executable. This program doesn’t do anything useful; it’s simply the smallest executable we can use to examine what the Mach-O file format looks like.

From here on, we’ll write our own primitive parser to examine the contents of our program. I’m going to write my Mach-O parser in Zig. Why Zig? Mostly because I really like it and I find it’s quite easy to get simple things like this up and running. It’s also particularly well suited to tasks like this¹.

Note to readers in the future: Zig does not yet have a stable 1.0 release. While the language itself is fairly stable at this point, the standard library often has breaking changes. I’ll do my best to keep the code in this article up-to-date, but be warned that it may not work by the time you read it. You can find the full source code on sourcehut.

First things first, let’s bootstrap an executable:

$ mkdir macho
$ cd macho
$ zig init-exe
info: Created build.zig
info: Created src/main.zig
info: Next, try `zig build --help` or `zig build run`

We’ll keep main.zig simple: it will call our parsing function and then print the parsed data. We’ll put the guts of our parser in src/macho.zig.

/// src/macho.zig

// First, import std, cuz we're gonna need it
const std = @import("std");

// Now, let's create a Parser struct
pub const Parser = struct {
    /// Field definitions
    // Our parser will hold a slice of bytes
    data: []const u8,

    /// Functions
    ...
};

To start off, we define our Parser struct with a single field: a slice of bytes (u8 in Zig). We mark this slice as const because we don’t plan to mutate the data; we are simply interpreting it.

Our parser will work by implementing a few parse functions that read a certain number of bytes off the front of the data slice and interpret those bytes as a given value. Let’s first add an init function to initialize a Parser object from a slice of bytes.

/// src/macho.zig

pub const Parser = struct {
    /// Field definitions
    ...

    /// Functions
    pub fn init(data: []const u8) Parser {
        return Parser{
            .data = data,
        };
    }
};

Next, let’s add a simple parseLiteral function:

/// src/macho.zig

pub const Parser = struct {
    /// Field definitions
    ...

    /// Functions
    pub fn init(...) { ... }

    pub fn parseLiteral(self: *Parser, comptime T: type) !T {

    }
};

Let’s explain what’s going on here. The first argument to our function is a pointer to a Parser object (*Parser). Because this function is defined within the Parser struct itself, Zig treats this argument as a “receiver”, meaning we can use standard method call syntax (e.g. parser.parseLiteral()). The object on which this method is called is used as the first argument (self). We use a pointer to Parser because this function will mutate the parser object (by removing bytes from the data byte slice).

Next, the second argument is a comptime argument, which means it has to be known at compile time. It also has type type, which may be confusing at first. In Zig’s comptime, types are just values, which means you can do things like

const MySuperCoolType = u32;
const y: MySuperCoolType = 42;

This also means that we can accept a type as an argument to our function. This is how Zig does polymorphism. In our case, we accept a type T which is also the return type. So this function will read some bytes off the front of our data byte slice, interpret those bytes as a T, and then return the value.
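To make the idea of passing a type as an argument concrete, here is a minimal sketch that is separate from the parser. The function name largest is made up for illustration.

```zig
const std = @import("std");

// Hypothetical helper for illustration: `T` is an ordinary argument whose
// value happens to be a type, and it must be known at compile time.
fn largest(comptime T: type, a: T, b: T) T {
    return if (a > b) a else b;
}

pub fn main() void {
    // The compiler instantiates the function once per type it is called with.
    const small = largest(u8, 3, 200);
    const big = largest(u64, 1 << 40, 7);
    std.debug.print("{d} {d}\n", .{ small, big });
}
```

The type argument also shows up in the parameter and return types, which is exactly what parseLiteral does with its T.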

Finally, the return type !T means that this function returns a T or an error. If you’re familiar with Rust, this is similar to Result<T, Error>.

This is what the implementation looks like:

    pub fn parseLiteral(self: *Parser, comptime T: type) !T {
        const size = @sizeOf(T);
        if (self.data.len < size) {
            return error.NotEnoughBytes;
        }

        const bytes = self.data[0..size];
        self.data = self.data[size..];
        return std.mem.bytesToValue(T, bytes);
    }

First, we use the builtin @sizeOf function to get the size of the type T in bytes. We then ensure that our byte slice has enough data in it: if it does not, we return a NotEnoughBytes error.

Next, we grab size bytes from the front of our byte slice and shrink the slice so that it no longer includes them. Finally, we call std.mem.bytesToValue(T, bytes) to reinterpret those bytes as a T.
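To watch the slice shrink as values are read, here is a small self-contained sketch. It repeats a trimmed copy of the Parser so that it compiles on its own; the input bytes are made up, and the expected magic value assumes a little-endian host.

```zig
const std = @import("std");

// A trimmed copy of the Parser from this article, repeated so that the
// sketch compiles on its own.
const Parser = struct {
    data: []const u8,

    fn init(data: []const u8) Parser {
        return Parser{ .data = data };
    }

    fn parseLiteral(self: *Parser, comptime T: type) !T {
        const size = @sizeOf(T);
        if (self.data.len < size) {
            return error.NotEnoughBytes;
        }

        const bytes = self.data[0..size];
        self.data = self.data[size..];
        return std.mem.bytesToValue(T, bytes);
    }
};

pub fn main() !void {
    // Made-up input: the 64-bit magic as it is stored on disk, plus one byte.
    const input = [_]u8{ 0xcf, 0xfa, 0xed, 0xfe, 0x2a };
    var parser = Parser.init(&input);

    const magic = try parser.parseLiteral(u32); // consumes four bytes
    const extra = try parser.parseLiteral(u8); // consumes the last byte
    std.debug.print("0x{x} {d} remaining={d}\n", .{ magic, extra, parser.data.len });

    // Nothing is left, so the next read reports an error instead of reading
    // past the end of the slice.
    if (parser.parseLiteral(u8)) |_| {
        std.debug.print("unexpected success\n", .{});
    } else |err| {
        std.debug.print("{s}\n", .{@errorName(err)});
    }
}
```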

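Is this safe to do? When we’re parsing integers (which we’ll be doing a lot of), this is fine, so long as the bytes are in the byte order we expect. bytesToValue interprets them in the host’s native byte order, and both architectures that macOS currently runs on (x86_64 and arm64) are little endian, so this is not an issue when the parser runs on the same kind of machine that produced the file. We can also parse structs that are made up strictly of integers or arrays of integers for the same reason.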
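If the parser ever had to run on a big-endian host, one option would be to spell out the byte order instead of relying on the host. The sketch below is an illustration only and is not used by the parser in this article; the helper name readU32LittleEndian is made up.

```zig
const std = @import("std");

// Hypothetical helper: assemble a u32 from four bytes stored least
// significant byte first. The shifts make the byte order explicit, so the
// result is the same on any host.
fn readU32LittleEndian(bytes: *const [4]u8) u32 {
    return @as(u32, bytes[0]) |
        (@as(u32, bytes[1]) << 8) |
        (@as(u32, bytes[2]) << 16) |
        (@as(u32, bytes[3]) << 24);
}

pub fn main() void {
    // The first four bytes of a 64-bit Mach-O file, in file order.
    const on_disk = [_]u8{ 0xcf, 0xfa, 0xed, 0xfe };
    std.debug.print("0x{x}\n", .{readU32LittleEndian(&on_disk)});
}
```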

There are also no lifetime concerns here: bytesToValue creates a copy of the bytes being interpreted, but even if it didn’t, the byte slice that our parser is operating on has the same lifetime as our program since it is created in main() and is not released until the program exits.

If we want to parse an enum, then we need to validate that the value we’re parsing is a valid enum value. We will do this later when we introduce a parseEnum function.

In our main.zig file, we can test this out by adding some boilerplate to open and mmap a file:

/// src/main.zig

const std = @import("std");

const macho = @import("macho.zig");

pub fn main() anyerror!void {
    // Read the first command line argument. If it doesn't exist, return an
    // error
    var args = std.process.args();
    _ = args.skip();
    const fname = args.nextPosix() orelse {
        std.debug.print("Missing required argument: FILENAME\n", .{});
        return error.MissingArgument;
    };

    // Open the file. We use `defer` to ensure the file is closed when the
    // variable goes out of scope. The `try` keyword is syntactic sugar that
    // uses the result of the function if no error occurs; otherwise, it
    // returns whatever error value the function itself returned (if you're
    // familiar with Rust, this is like the `?` operator).
    var file = try std.fs.cwd().openFile(fname, .{});
    defer file.close();

    // This is a standard mmap(2) call. If you're unfamiliar with mmap, give
    // `man 2 mmap` a read. This memory maps the file's contents into our
    // program's virtual memory space. This gives us access to the bytes
    // without having to copy them. Again, we use `defer` to "clean up" the
    // mmap when `data` goes out of scope.
    const data = try std.os.mmap(null, try file.getEndPos(), std.os.PROT.READ, std.os.MAP.PRIVATE, file.handle, 0);
    defer std.os.munmap(data);

    // Finally, we initialize our parser.
    var parser = macho.Parser.init(data);

    // Let's read the magic number from the data
    const magic = try parser.parseLiteral(u32);
    if (magic != 0xfeedfacf) {
        return error.BadMagic;
    }

    std.debug.print("0x{x}\n", .{magic});
}

We can compile our program by running

$ zig build

If it compiles without error (which it should), the macho executable can be found at zig-out/bin/macho:

$ zig-out/bin/macho
Missing required argument: FILENAME
error: MissingArgument
/Users/greg/src/gpanders.com/macho/src/main.zig:13:9: 0x104f6bb9f in main (macho)
        return error.MissingArgument;
        ^

As expected, we get a MissingArgument error since we did not supply a required argument. Let’s give it our exit binary:

$ zig-out/bin/macho ../exit
0xfeedfacf

Hey, that’s the magic number! It looks like we are successfully able to parse integers. Before we move on, let’s see what happens if we give macho a file that isn’t a Mach-O object:

$ ./zig-out/bin/macho build.zig
error: BadMagic
/Users/greg/src/gpanders.com/macho/src/main.zig:38:9: 0x1003d7de7 in main (macho)
        return error.BadMagic;
        ^

Good, as expected we get a BadMagic error.

Inviting friends to the party

One of the great things about using Zig is how well it interacts with C headers and libraries. This will come in handy as we flesh out our parser, because we are going to need lots of enums. But the enum values are already defined for us in /usr/include/mach-o/loader.h (and a few other places). So rather than redo all that work ourselves, we can simply import the existing C header files and reuse those definitions.

First, we need to tell Zig where to look for these header files. In build.zig there is a block of lines that looks like this:

const exe = b.addExecutable("macho", "src/main.zig");
exe.setTarget(target);
exe.setBuildMode(mode);
exe.install();

Just above the line exe.install(), add

exe.addIncludeDir("/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/");

Now, back in src/macho.zig, we can include the C headers:

/// src/macho.zig

const std = @import("std");

// NEW!
const loader = @cImport(@cInclude("mach-o/loader.h"));
const machine = @cImport(@cInclude("mach/machine.h"));

Zig translates the C code found in those two headers, and their declarations can be accessed under the respective namespace.

For example, in /usr/include/mach-o/loader.h we find the line

#define	MH_MAGIC	0xfeedface	/* the mach magic number */

We can access this value from Zig using

loader.MH_MAGIC

Let’s leverage this to create some data structures.

First, we’ll create a Zig version of the mach_header_64 struct. This will allow us to use Zig’s much better type system.

/// src/macho.zig

pub const MachHeader64 = struct {
    magic: u32,
    cputype: CpuType,
    cpusubtype: u32,
    filetype: Filetype,
    ncmds: u32,
    sizeofcmds: u32,
    flags: Flags,
    reserved: u32 = undefined,
};

Now we need to define the CpuType and Filetype enums, as well as the Flags struct. (The valid values of cpusubtype depend on the value of cputype, so for simplicity we’ll parse it as a raw number rather than defining an actual enum.)

This part is a bit tedious. We simply need to copy the #defines from loader.h and machine.h into our Zig file and transform them into valid Zig syntax. This is no match for some :s Vim-fu, but if you’re following along, feel free to simply copy and paste these definitions:

/// src/macho.zig

const Filetype = enum(u32) {
    object = loader.MH_OBJECT,
    execute = loader.MH_EXECUTE,
    fvmlib = loader.MH_FVMLIB,
    core = loader.MH_CORE,
    preload = loader.MH_PRELOAD,
    dylib = loader.MH_DYLIB,
    dylinker = loader.MH_DYLINKER,
    bundle = loader.MH_BUNDLE,
    dylib_stub = loader.MH_DYLIB_STUB,
    dsym = loader.MH_DSYM,
    kext_bundle = loader.MH_KEXT_BUNDLE,
    fileset = loader.MH_FILESET,
};

const CpuType = enum(u32) {
    // @bitCast here is necessary to convert CPU_TYPE_ANY (-1) to unsigned int
    // (all 1's)
    any = @bitCast(u32, machine.CPU_TYPE_ANY),
    vax = machine.CPU_TYPE_VAX,
    mc680x0 = machine.CPU_TYPE_MC680x0,
    x86 = machine.CPU_TYPE_X86,
    x86_64 = machine.CPU_TYPE_X86 | machine.CPU_ARCH_ABI64,
    mc98000 = machine.CPU_TYPE_MC98000,
    hppa = machine.CPU_TYPE_HPPA,
    arm = machine.CPU_TYPE_ARM,
    arm64 = machine.CPU_TYPE_ARM | machine.CPU_ARCH_ABI64,
    arm64_32 = machine.CPU_TYPE_ARM | machine.CPU_ARCH_ABI64_32,
    mc88000 = machine.CPU_TYPE_MC88000,
    sparc = machine.CPU_TYPE_SPARC,
    i860 = machine.CPU_TYPE_I860,
    powerpc = machine.CPU_TYPE_POWERPC,
    powerpc64 = machine.CPU_TYPE_POWERPC | machine.CPU_ARCH_ABI64,
};

const Flags = packed struct {
    noundefs: bool,
    incrlink: bool,
    dyldlink: bool,
    bindatload: bool,
    prebound: bool,
    split_segs: bool,
    lazy_init: bool,
    twolevel: bool,
    force_flat: bool,
    nomultidefs: bool,
    nofixprebinding: bool,
    prebindable: bool,
    allmodsbound: bool,
    subsections_via_symbols: bool,
    canonical: bool,
    weak_defines: bool,
    binds_to_weak: bool,
    allow_stack_execution: bool,
    root_safe: bool,
    setuid_safe: bool,
    no_reexported_dylibs: bool,
    pie: bool,
    dead_strippable_dylib: bool,
    has_tlv_descriptors: bool,
    no_heap_execution: bool,
    app_extension_safe: bool,
    nlist_outofsync_with_dyldinfo: bool,
    sim_support: bool,
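    _: u3, // bits 28-30 are unused
    dylib_in_cache: bool, // MH_DYLIB_IN_CACHE is 0x80000000, the top bit
};

The enum(u32) syntax above means that we want those enums to be represented using a u32. The packed keyword on the Flags struct creates a bitfield: each bool field in this struct occupies exactly one bit, starting from the least significant bit, so the field order must match the bit values of the MH_* flags in loader.h. The _: u3 field covers the three unused bits between MH_SIM_SUPPORT (0x08000000) and MH_DYLIB_IN_CACHE (0x80000000), which brings the struct to a full 32 bits.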
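A quick way to convince yourself of the layout is to check it on a smaller type. The sketch below uses made-up stand-ins (Kind and Bits) rather than the real Filetype and Flags, and assumes the same std.mem.bytesToValue used elsewhere in this article.

```zig
const std = @import("std");

// Small stand-ins for Filetype and Flags; the names and values are made up.
const Kind = enum(u32) { first = 1, second = 2 };

const Bits = packed struct {
    lowest: bool, // bit 0 (value 0x1)
    middle: bool, // bit 1 (value 0x2)
    highest: bool, // bit 2 (value 0x4)
    _: u5, // pad to 8 bits
};

pub fn main() void {
    // 0b101 sets bits 0 and 2, so the first and third fields read as true.
    const raw = [_]u8{0b101};
    const bits = std.mem.bytesToValue(Bits, &raw);
    std.debug.print("{} {} {}\n", .{ bits.lowest, bits.middle, bits.highest });

    // The enum takes four bytes and the packed struct takes eight bits.
    std.debug.print("{d} {d}\n", .{ @sizeOf(Kind), @bitSizeOf(Bits) });
}
```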

Note that we are able to use the definitions from the C header files to define our enums. That means we don’t need to know or care what those values are.

Now, let’s add another function to our parser to parse enum values.

/// src/macho.zig

pub const Parser = struct {
    /// Field definitions
    ...


    /// Functions
    pub fn init(...) { ... }

    pub fn parseLiteral(...) { ... }

    pub fn parseEnum(self: *Parser, comptime T: type) !T {

    }
};

Note that this function signature is identical to that of parseLiteral. We could combine these into a single function and use Zig’s type reflection to handle literal values and enums differently. Perhaps we’ll do that later, but for now let’s keep things simple and just use separate functions.

The implementation looks like this:

/// src/macho.zig

pub fn parseEnum(self: *Parser, comptime T: type) !T {
    const size = @sizeOf(T);
    if (self.data.len < size) {
        return error.NotEnoughBytes;
    }

    const bytes = self.data[0..size];
    const tag = std.mem.bytesToValue(std.meta.Tag(T), bytes);
    const val = std.meta.intToEnum(T, tag) catch |err| {
        std.debug.print("{} is not a valid value for {s}\n", .{
            tag,
            @typeName(T),
        });
        return err;
    };
    self.data = self.data[size..];
    return val;
}

Once again, we check to make sure our byte slice has enough data in it. This time, we convert our bytes into the “tag” type of our enum, which we get using the std.meta.Tag standard library function. For the CpuType enum we defined earlier, this is u32. Once we convert the bytes into the tag type, we can convert the tag value into an enum value using std.meta.intToEnum. This will convert a number like 7 into an enum value like x86. If the enum doesn’t have a value corresponding to the given tag value, intToEnum returns an error. In that case, we print a helpful error message, and return the same error from our function. Note that this should not happen for a well-formed Mach-O file; if it does, it most likely indicates a bug in our parser or a value missing from our enum definitions.
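Here is a minimal sketch of the two outcomes, using a two-value stand-in for CpuType. It assumes a Zig version that still ships std.meta.intToEnum, as the one used in this article does; the values 7 and 12 are those of CPU_TYPE_X86 and CPU_TYPE_ARM in mach/machine.h, and 99 is an arbitrary invalid value.

```zig
const std = @import("std");

// A two-value stand-in for CpuType.
const Cpu = enum(u32) { x86 = 7, arm = 12 };

pub fn main() void {
    const inputs = [_]u32{ 7, 12, 99 };
    for (inputs) |raw| {
        // intToEnum returns an error union: either a Cpu or InvalidEnumTag.
        if (std.meta.intToEnum(Cpu, raw)) |cpu| {
            std.debug.print("{d} -> {s}\n", .{ raw, @tagName(cpu) });
        } else |err| {
            std.debug.print("{d} -> {s}\n", .{ raw, @errorName(err) });
        }
    }
}
```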

It’s also a good idea to encapsulate the parsing logic for our MachHeader64 struct within the struct definition itself. Let’s add a parse method to that struct:

/// src/macho.zig

pub const MachHeader64 = struct {
    /// Field definitions
    ...

    /// Functions
    pub fn parse(parser: *Parser) !MachHeader64 {

    }
};

This parse function will take a Parser object and return a MachHeader64. This is what the implementation looks like:

/// src/macho.zig

pub fn parse(parser: *Parser) !MachHeader64 {
    const magic = try parser.parseLiteral(u32);
    if (magic != loader.MH_MAGIC_64) {
        return error.BadMagic;
    }

    const cputype = try parser.parseEnum(CpuType);
    const cpusubtype = try parser.parseLiteral(u32);
    const filetype = try parser.parseEnum(Filetype);
    const ncmds = try parser.parseLiteral(u32);
    const sizeofcmds = try parser.parseLiteral(u32);
    const flags = try parser.parseLiteral(Flags);
    const reserved = try parser.parseLiteral(u32);

    return MachHeader64{
        .magic = magic,
        .cputype = cputype,
        .cpusubtype = cpusubtype,
        .filetype = filetype,
        .ncmds = ncmds,
        .sizeofcmds = sizeofcmds,
        .flags = flags,
        .reserved = reserved,
    };
}

Hopefully nothing here is surprising. We first parse the magic number and return a BadMagic error if it doesn’t match what we expect. Note that now instead of hardcoding 0xfeedfacf we are using the definition from the loader.h header file.

Next we use our parseEnum and parseLiteral functions to parse each field of the struct and finally place each value into the returned struct value.

Let’s update main.zig:

/// src/main.zig

var parser = macho.Parser.init(data);

const header = try macho.MachHeader64.parse(&parser);
std.debug.print("{}\n", .{header});

If we build and run our program now, we should see a lot more information (note that we are using the shorthand zig build run to both build and run the program at once):

$ zig build run -- ../exit
MachHeader64{ .magic = 4277009103, .cputype = CpuType.arm64, .cpusubtype = 0,
.filetype = Filetype.execute, .ncmds = 16, .sizeofcmds = 744, .flags = Flags{
.noundefs = true, .incrlink = false, .dyldlink = true, .bindatload = false,
.prebound = false, .split_segs = false, .lazy_init = false, .twolevel = true,
.force_flat = false, .nomultidefs = false, .nofixprebinding = false,
.prebindable = false, .allmodsbound = false, .subsections_via_symbols = false,
.canonical = false, .weak_defines = false, .binds_to_weak = false,
.allow_stack_execution = false, .root_safe = false, .setuid_safe = false,
.no_reexported_dylibs = false, .pie = true, .dead_strippable_dylib = false,
.has_tlv_descriptors = false, .no_heap_execution = false, .app_extension_safe
= false, .nlist_outofsync_with_dyldinfo = false, .sim_support = false,
.dylib_in_cache = false, ._ = 0 }, .reserved = 0 }

Woof, that’s not pretty. This “raw” representation is good for debugging, but it’s not very nice to look at.

We can tell print how we want our MachHeader64 struct to be formatted by adding a format() method to MachHeader64.

/// src/macho.zig

pub const MachHeader64 = struct {
    /// Field definitions
    ...

    /// Functions
    pub fn parse(...) { ... }

    pub fn format(
        value: MachHeader64,
        comptime fmt: []const u8,
        options: std.fmt.FormatOptions,
        writer: anytype,
    ) !void {
        _ = fmt;
        _ = options;
        try std.fmt.format(writer, "magic: 0x{x}\n", .{value.magic});
        try std.fmt.format(writer, "cputype: {}\n", .{value.cputype});
        try std.fmt.format(writer, "cpusubtype: 0x{x}\n", .{value.cpusubtype});
        try std.fmt.format(writer, "filetype: {}\n", .{value.filetype});
        try std.fmt.format(writer, "ncmds: {d}\n", .{value.ncmds});
        try std.fmt.format(writer, "sizeofcmds: {d}\n", .{value.sizeofcmds});
        try std.fmt.format(writer, "flags: ", .{});
        inline for (std.meta.fields(Flags)) |field| {
            if (field.field_type == bool and @field(value.flags, field.name)) {
                try std.fmt.format(writer, " ", .{});
                inline for (field.name) |c| {
                    try std.fmt.format(writer, "{c}", .{std.ascii.toUpper(c)});
                }
            }
        }
    }
};

The first two lines of this function suppress Zig’s compiler errors for unused function arguments. Next, we simply print each struct field in a manner appropriate for that field’s type. For the bit flags, we want a neat representation that only shows flags that are set. We do this using an inline for loop, which is a loop that is unrolled at compile time. We get a slice of all of the fields of the Flags struct with std.meta.fields, iterate over each of them, and print the field’s name if its corresponding bit is set.
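The field-walking part can be tried in isolation. The sketch below collects the names into a buffer instead of a writer so that it does not depend on the formatting API; the Demo struct and the setFieldNames helper are made up for illustration.

```zig
const std = @import("std");

// Made-up stand-in for Flags: a few bools plus a non-bool padding field.
const Demo = struct {
    noundefs: bool,
    incrlink: bool,
    pie: bool,
    _: u3 = 0,
};

// Hypothetical helper: writes the upper-cased names of all bool fields that
// are true into `buf`, separated by spaces, and returns the used part.
fn setFieldNames(value: anytype, buf: []u8) ![]const u8 {
    var len: usize = 0;
    // `inline for` is unrolled at compile time, once per struct field.
    inline for (std.meta.fields(@TypeOf(value))) |field| {
        const field_value = @field(value, field.name);
        // This check is resolved at compile time, so non-bool fields such as
        // the padding are skipped entirely.
        if (@TypeOf(field_value) == bool) {
            if (field_value) {
                const space: usize = if (len == 0) 0 else 1;
                if (len + space + field.name.len > buf.len) return error.NoSpaceLeft;
                if (space == 1) {
                    buf[len] = ' ';
                    len += 1;
                }
                for (field.name) |c| {
                    buf[len] = std.ascii.toUpper(c);
                    len += 1;
                }
            }
        }
    }
    return buf[0..len];
}

pub fn main() !void {
    var buf: [64]u8 = undefined;
    const demo = Demo{ .noundefs = true, .incrlink = false, .pie = true };
    std.debug.print("flags: {s}\n", .{try setFieldNames(demo, &buf)});
}
```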

Now when we rebuild and run we should see

$ zig build run -- ../exit
magic: 0xfeedfacf
cputype: CpuType.arm64
cpusubtype: 0x0
filetype: Filetype.execute
ncmds: 16
sizeofcmds: 744
flags: NOUNDEFS DYLDLINK TWOLEVEL PIE

Much nicer. This gives us a much better view into the data contained within our Mach-O file.

We see our old friend feedfacf, as well as a few things that we already know, such as the CPU type (arm64) and the filetype (execute, indicating this file is an executable). Following that we see that this Mach-O has 16 load commands which take up 744 total bytes. Finally, this Mach-O file has the NOUNDEFS, DYLDLINK, TWOLEVEL, and PIE flags set:

#define	MH_NOUNDEFS	0x1		/* the object file has no undefined
					   references */
#define MH_DYLDLINK	0x4		/* the object file is input for the
					   dynamic linker and can't be staticly
					   link edited again */
#define MH_TWOLEVEL	0x80		/* the image is using two-level name
					   space bindings */
#define	MH_PIE 0x200000			/* When this bit is set, the OS will
					   load the main executable at a
					   random address.  Only used in
					   MH_EXECUTE filetypes. */
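Put together, those four definitions imply a raw flags word of 0x00200085 for this file. The sketch below checks that arithmetic with plain bit masks; the constant values are copied from the #defines quoted above, and the hasFlag helper is made up for illustration.

```zig
const std = @import("std");

// Values copied from the #defines quoted above.
const MH_NOUNDEFS: u32 = 0x1;
const MH_DYLDLINK: u32 = 0x4;
const MH_TWOLEVEL: u32 = 0x80;
const MH_PIE: u32 = 0x200000;

// Hypothetical helper: true when every bit of `mask` is set in `flags`.
fn hasFlag(flags: u32, mask: u32) bool {
    return (flags & mask) == mask;
}

pub fn main() void {
    // OR-ing the masks together gives the 32-bit word stored in the header.
    const flags = MH_NOUNDEFS | MH_DYLDLINK | MH_TWOLEVEL | MH_PIE;
    std.debug.print("0x{x:0>8}\n", .{flags});
    std.debug.print("pie={} incrlink={}\n", .{ hasFlag(flags, MH_PIE), hasFlag(flags, 0x2) });
}
```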

Before we move on to parsing the load commands, let’s make one more improvement to our Parser struct. We added a parse method to the MachHeader64 struct which let us encapsulate the parsing logic for that data type. We want to extend this to all of the structs that we expect to parse, and we want to be able to easily dispatch the right method from our Parser struct.

To do this, we will use more comptime polymorphism. Let’s add a parseStruct method to our Parser:

/// src/macho.zig

pub const Parser = struct {
    /// Field definitions
    ...

    /// Functions
    pub fn init(...) { ... }
    pub fn parseLiteral(...) { ... }
    pub fn parseEnum(...) { ... }

    pub fn parseStruct(self: *Parser, comptime T: type) !T {

    }
};

We’ll handle this by checking if the type T has a parse method and, if so, simply call that:

/// src/macho.zig

pub fn parseStruct(self: *Parser, comptime T: type) !T {
    if (comptime !std.meta.trait.hasFn("parse")(T)) {
        @compileError(@typeName(T) ++ " does not have a parse() method");
    }

    const val = try T.parse(self);
    return val;
}

Note that in this function we are not checking to make sure our byte slice has enough bytes, nor are we updating self.data. That is because both of these things will be done by the struct’s own parse function.
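The dispatch idea can also be sketched without the parser. Note the swap: this version uses the @hasDecl builtin in place of std.meta.trait.hasFn, and @hasDecl only accepts container types (structs, enums, unions, and opaques). The Pair type and the parseWith function are made up for illustration.

```zig
const std = @import("std");

// Made-up type that knows how to parse itself from a byte slice.
const Pair = struct {
    first: u8,
    second: u8,

    pub fn parse(bytes: []const u8) !Pair {
        if (bytes.len < 2) return error.NotEnoughBytes;
        return Pair{ .first = bytes[0], .second = bytes[1] };
    }
};

// Hypothetical dispatcher in the spirit of parseStruct: reject types without
// a `parse` declaration at compile time, otherwise hand off to it.
fn parseWith(comptime T: type, bytes: []const u8) !T {
    if (!@hasDecl(T, "parse")) {
        @compileError(@typeName(T) ++ " does not have a parse() declaration");
    }
    return T.parse(bytes);
}

pub fn main() !void {
    const pair = try parseWith(Pair, &[_]u8{ 0x10, 0x20 });
    std.debug.print("{d} {d}\n", .{ pair.first, pair.second });
}
```

Because the check happens at compile time, calling parseWith with a struct that lacks parse fails the build rather than failing at run time.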

Notice that parseStruct has the exact same signature as both parseLiteral and parseEnum. Perhaps this is a good time to consolidate all of these into a single parse function.

/// src/macho.zig

pub const Parser = struct {
    /// Field definitions
    ...

    /// Functions
    pub fn init(...) { ... }

    pub fn parse(self: *Parser, comptime T: type) !T {
        if (comptime std.meta.trait.hasFn("parse")(T)) {
            return try T.parse(self);
        }

        const size = @sizeOf(T);
        if (self.data.len < size) {
            return error.NotEnoughBytes;
        }

        const bytes = self.data[0..size];

        const val = switch (@typeInfo(T)) {
            .Enum => |info| blk: {
                const tag = std.mem.bytesToValue(info.tag_type, bytes);
                break :blk std.meta.intToEnum(T, tag) catch |err| {
                    std.debug.print("{} is not a valid value for {s}\n", .{
                        tag,
                        @typeName(T),
                    });
                    return err;
                };
            },
            .Int, .Struct => std.mem.bytesToValue(T, bytes),
            else => @compileError("cannot parse a value of type " ++ @typeName(T)),
        };

        self.data = self.data[size..];
        return val;
    }
};

Here we see a few more Zig features. First, we use a comptime expression in an if statement to check whether or not our T type has a parse method.

Next, we use a switch on @typeInfo(T), which is a tagged union. The switch branches are the different tags. The captured value (|info|) in the .Enum branch is the payload; in this case, the actual type info of the enum. This allows us to replace the call to std.meta.Tag with simply info.tag_type.

We also make use of Zig’s labeled break to return a value from a block expression. The blk: label gives a name to the succeeding block, and break :blk ... uses the following value as the value of the block itself. As before, if std.meta.intToEnum encounters an error, the entire function returns an error. Otherwise, we remove the bytes from our byte slice and return the parsed value.
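Labeled blocks are easier to see in a smaller function. In this sketch the block either produces a value with break :blk or leaves the whole function with return, which is the same shape as the .Enum branch. The wordSize function is made up for illustration; the two magic values are MH_MAGIC and MH_MAGIC_64 from loader.h.

```zig
const std = @import("std");

// Hypothetical helper: pointer width in bytes implied by a Mach-O magic number.
fn wordSize(magic: u32) !u32 {
    // The block is an expression; `break :blk value` supplies its result.
    const size: u32 = blk: {
        if (magic == 0xfeedface) break :blk 4; // MH_MAGIC (32-bit)
        if (magic == 0xfeedfacf) break :blk 8; // MH_MAGIC_64 (64-bit)
        // Returning from inside the block leaves the function entirely, so
        // `size` is never assigned on this path.
        return error.BadMagic;
    };
    return size;
}

pub fn main() !void {
    std.debug.print("{d}\n", .{try wordSize(0xfeedfacf)});
}
```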

Be sure to update the parse function in MachHeader64 to use this new variant as well:

/// src/macho.zig

pub const MachHeader64 = struct {
    /// Field definitions
    ...

    /// Functions
    pub fn parse(parser: *Parser) !MachHeader64 {
        const magic = try parser.parse(u32);
        if (magic != loader.MH_MAGIC_64) {
            return error.BadMagic;
        }

        const cputype = try parser.parse(CpuType);
        const cpusubtype = try parser.parse(u32);
        const filetype = try parser.parse(Filetype);
        const ncmds = try parser.parse(u32);
        const sizeofcmds = try parser.parse(u32);
        const flags = try parser.parse(Flags);
        const reserved = try parser.parse(u32);

        return MachHeader64{
            .magic = magic,
            .cputype = cputype,
            .cpusubtype = cpusubtype,
            .filetype = filetype,
            .ncmds = ncmds,
            .sizeofcmds = sizeofcmds,
            .flags = flags,
            .reserved = reserved,
        };
    }
};

With this abstraction in place, we can update main.zig again:

/// src/main.zig

var parser = macho.Parser.init(data);

const header = try parser.parse(macho.MachHeader64);
std.debug.print("{}\n", .{header});

This will come in handy as we move on to load commands and begin parsing a larger variety of data structures.

Let’s stop here for now. In part 3, we’ll parse the load commands and see what else lies in store in Mach-O.

¹ Full disclosure: I first tried writing the parser in Rust. However, I found the experience pretty frustrating. I don’t want to put all the blame on Rust, as it’s extremely likely that I just wasn’t approaching the task in an idiomatic way, but it felt like the language was fighting me at every turn (and I don’t mean the borrow checker – that part I have no trouble with). After battling to get something working for a couple of days, I finally gave up and did it in Zig and had something working in under an hour.

Last modified on December 15, 2024