Pure-Zig H3: Outperforming the C reference by inlining through the call graph

2026-05-14 · project zig-h3

This is a lab notebook entry, not an essay. The format is hypothesis → setup → method → results → discussion → reproducibility → limitations, every claim graded, every benchmark reproducible from source. The substrate under test is [zig-h3][repo] — an idiomatic Zig wrapper around Uber's libh3 v4.1.0, with a parallel pure-Zig reimplementation that lives in the same binary and is cross-validated cell-by-cell against the C reference on every test run.

[repo]: https://github.com/SMC17/zig-h3

1. Hypothesis

A pure-Zig port of libh3 v4, given identical algorithmic structure, will produce bit-identical outputs and competitive-or-better performance, because LLVM has visibility across the entire call graph in a way the original C ABI does not permit. The C library compiles each translation unit independently and exposes every public function across a C-ABI boundary; the pure-Zig path compiles the entire call chain in a single module-level visibility scope, so the optimizer can inline through it. The expected mechanism is codegen, not algorithm — both paths execute the same H3 v4 math.

<!— runnable-claim: hypothesis_statement —>

2. Setup

| Item | Value | | --------------------------— | --------------------------------------------------— | | CPU | Intel Core i7-1065G7 @ 1.30 GHz (Ice Lake, 4C / 8T) | | OS | Linux 7.0.3-arch1-1 x86_64 | | Toolchain | Zig 0.16.0 | | Optimize mode | -Doptimize=ReleaseFast | | C reference | libh3 v4.1.0, vendored via Zig package manager, hash-pinned3 | | Wall-clock primitive | std.os.linux.clock_gettime(.MONOTONIC, &ts) direct syscall2 | | Bench harness | bench/bench_pure_vs_libh3.zig1 | | Iterations | 5,000 warm-up + 200,000 measured per probe (50,000 for gridDisk)1 | | Host state during bench | Concurrent agent fleet, load avg 28–55 on 8 cores4 |

Two things to call out about the setup before any numbers ship:

  1. The wall-clock primitive is a direct clock_gettime syscall, not

std.time.Timer or std.time.nanoTimestamp. Both of those were removed in Zig 0.16's stdlib reshuffle; the direct syscall is the replacement substrate and is what the bench harness uses verbatim2.

  1. The measurement host was under heavy concurrent load during the

runs reported here (multi-agent fleet). These numbers are not quiet-room measurements. The README is explicit about this: absolute ns/op figures vary by ±2× across runs on this contended host, so the headline table below reports a median of four runs plus the observed range, not a single-run point estimate4.

The reference implementation version is pinned at the package-manager level. build.zig.zon declares the h3c dependency with the v4.1.0 tarball URL and a hash; the first build downloads the upstream Uber/h3 archive, verifies its hash, and compiles its 18 C source files into a static library that this wrapper links against35.

3. Method

The substrate that makes the perf claim credible is the cross-validation suite, not the bench itself. Anyone can be fast and wrong. The pure-Zig path is matched cell-by-cell against libh3 across 117 cross-validation tests in the same test binary, plus 47 wrapper-layer tests and 2 adversarial-input fuzz tests, for 166 / 166 passing on Zig 0.16.06.

The fixture matrix covers, in the README's wording7:

cell resolution.

resolution.

resolution-step 2).

4100–4200 km published bound.

on a res-9 NYC cell.

surface, plus NaN / Inf input rejection on pure.latLngToCell.

Every test calls both the libh3-backed wrapper (h3.<fn>) and the pure-Zig equivalent (h3.h3index.<fn> / h3.h3decode.<fn> / h3.grid.<fn> / …) on the same input and asserts equality. This is the binding evidence that the perf claim isn't lying: same outputs, every fixture.

The bench protocol. bench/bench_pure_vs_libh3.zig warms up each path with 5,000 iterations, then runs 200,000 measured iterations of latLngToCell and cellToLatLng at resolutions 7, 9, 11, plus 50,000 iterations of gridDisk(k=3) at resolution 9. The input stream is a deterministic PRNG (seed 0xDEAD_60A1_C0DE_BE17) generating uniform points in [-π/2, π/2] × [-π, π], shared between both paths so they see bit-identical inputs1. Output is parseable key=value whitespace-separated lines per probe, including libh3_ns_per_op= / pure_ns_per_op= / pure_over_libh3_x1000= fields.

4. Results

The killer chart is the ratio of pure-Zig ns/op to libh3 ns/op. Ratio < 1.000 means pure-Zig is faster on that run. The table below is the median of four runs, with the observed range across all four, copied verbatim from the project README4:

| Op | Res | Ratio (median of 4) | Range (4 runs) | | ------------— | -— | -----------------— | -------------— | | latLngToCell | 7 | 0.54 | 0.40 – 0.55 | | latLngToCell | 9 | 0.65 | 0.46 – 1.38 | | latLngToCell | 11 | 0.75 | 0.33 – 0.96 | | cellToLatLng | 7 | 0.83 | 0.26 – 1.86 | | cellToLatLng | 9 | 0.76 | 0.47 – 1.67 | | cellToLatLng | 11 | 0.78 | 0.64 – 0.97 | | gridDisk k=3 | 9 | 0.79 | 0.24 – 1.78 |

<!— runnable-claim: bench_lat_lng_to_cell_res_7 —> <!— runnable-claim: bench_lat_lng_to_cell_res_9 —> <!— runnable-claim: bench_lat_lng_to_cell_res_11 —> <!— runnable-claim: bench_cell_to_lat_lng_res_7 —> <!— runnable-claim: bench_cell_to_lat_lng_res_9 —> <!— runnable-claim: bench_cell_to_lat_lng_res_11 —> <!— runnable-claim: bench_grid_disk_k3_res_9 —>

Three things to read off this table, honestly:

op. The largest median win is latLngToCell res 7 at 0.54× — close to a 2× speedup. Smaller wins on cellToLatLng res 7 (0.83×) and cellToLatLng res 11 (0.78×).

came in slower than libh3 — latLngToCell res 9 hit 1.38×, cellToLatLng res 7 hit 1.86×, gridDisk k=3 res 9 hit 1.78×. On a contended host, a single run can flip pure-Zig to noticeably slower than the C reference. The shape "pure-Zig is competitive" is robust to noise; the precise headline number on any one run is not.

the implementations. Quiet-machine numbers are likely tighter, with smaller ranges. Those have not been measured yet and are an honest gap in this lab entry.

Both paths produce bit-identical output — validated by the cross-validation suite at zig build test. The perf delta is codegen, not algorithm4.

5. Discussion — why pure-Zig wins (the median case)

The architectural reason is inlining through the call graph. Look at the call chain for latLngToCell in the libh3 reference:

latLngToCell
  → _geoToFaceIjk
    → _geoToHex2d
      → _hex2dToCoordIJK
        → _coordIJKDigit

Each arrow is a C ABI hop. libh3's source is split across separate translation units (latLng.c, faceijk.c, coordijk.c, …), so when the wrapper compiles the static library, the optimizer sees each function as an opaque cross-TU call. Without LTO across the libh3 C sources, every hop materializes register-save / register-restore sequences and a real call instruction.

The pure-Zig path compiles the entire chain as a single module-visible Zig graph. h3index.latLngToCell at src/pure_h3index.zig:1119 calls into private module-local helpers in the same file (and into pure_proj.zig via a top-level @import), all of which are visible to LLVM at function-emission time8. The same is true for h3decode.cellToLatLng at src/pure_h3decode.zig:399, which calls into private projection helpers in pure_proj.zig and the faceNeighbors overage table in the same file9.

When LLVM sees the full call graph as inlineable, it folds the chain into a single function body — register pressure is the only remaining constraint, and Ice Lake has 16 general-purpose registers to spend. The C path can't get there because its function boundaries are real ABI boundaries.

This is a hypothesis about mechanism, not a measured cause. We have the shape (pure-Zig faster on the median, the C path's call graph is deeper), but we have not yet:

contribution of inlining vs other codegen differences.

LTO'd libh3.

Until those land, "LLVM inlines through the Zig path" is the most plausible explanation for the median delta, not a confirmed one. Read this section as a hypothesized mechanism, evidence-graded hypothesized.

6. Discussion — where pure-Zig doesn't win

The wide ranges in Section 4 are the honest signal that pure-Zig is not uniformly faster. Specifically:

is ~1.4× slower than libh3.

is ~1.9× slower than libh3.

is ~1.8× slower than libh3.

Two candidate causes, neither confirmed:

  1. Allocator pathology in the pure-Zig path. pure_grid.gridDisk

takes an Allocator parameter and uses it for an internal hash-set to deduplicate ring cells10. The libh3 gridDisk does not allocate. On a contended host, an allocator stall in the pure path can dominate a single-run measurement and flip the ratio.

  1. Page-fault / thermal noise on a load-avg-55 host. The README

names this explicitly: run-to-run variance is dominated by host load. Quiet-host numbers are very likely tighter ranges; this is the un-measured-yet gap.

The wider range on cellToLatLng than on latLngToCell is consistent with a thermal-throttle hypothesis — cellToLatLng is the cheaper op (fewer ns/op), so per-iteration host noise is a larger relative share. That's a hypothesis, not a confirmed cause.

Three commitments fall out of this section, in order:

publish the tightened ranges.

for comparison.

internal hash-set can be replaced by a fixed-size scratch buffer sized via maxGridDiskSize.

None of these are done. The headline claim — pure-Zig is competitive, median-faster, with wide ranges on a contended host — is what the current evidence supports.

7. Reproducibility

The bench is one command, assuming a Zig 0.16 toolchain and the zig-h3 working tree:

git clone https://github.com/SMC17/zig-h3
cd zig-h3
zig build bench

This runs all three benchmarks (bench_latlng_to_cell.zig, bench_grid_disk.zig, bench_pure_vs_libh3.zig) under -Doptimize=ReleaseFast4. The pure-vs-libh3 bench prints one bench=... line per probe with parseable key=value fields:

# zig-h3 bench: pure-Zig vs libh3 (ReleaseFast, MONOTONIC ns)
bench=latLngToCell.cmp res7 libh3_ns_per_op=<n> pure_ns_per_op=<n> pure_over_libh3_x1000=<ratio> iters=200000 libh3_total_ns=<...> pure_total_ns=<...>
bench=cellToLatLng.cmp res7 libh3_ns_per_op=<n> pure_ns_per_op=<n> pure_over_libh3_x1000=<ratio> iters=200000 libh3_total_ns=<...> pure_total_ns=<...>
...
bench=gridDisk.cmp res9_k3 libh3_ns_per_op=<n> pure_ns_per_op=<n> pure_over_libh3_x1000=<ratio> iters=50000 libh3_total_ns=<...> pure_total_ns=<...>

The ratio field (pure_over_libh3_x1000) is (pure_ns × 1000) / libh3_ns; divide by 1000 to recover the table values in Section 4.

The cross-validation suite is one command:

zig build test

166 / 166 tests pass on Zig 0.16.0. The pure-Zig path is cross-validated against libh3 in the same test binary; if the two paths ever produce different output on any tested input, the test fails — that's the binding evidence the perf claim isn't a lie6.

Hardware portability. These numbers are workstation-grade Ice Lake under heavy load. On a quiet M-series Mac, the perf delta is likely smaller — Apple's clang is also good at inlining and the M-series microarchitecture has different register-pressure characteristics. On older x86-64 without Ice Lake's wider issue width, the delta is likely larger — ABI hop cost is more pronounced. Neither has been measured yet.

8. Cross-validation discipline (what makes the win trustworthy)

A perf number on a port is meaningless unless the port produces the same outputs as the reference. The 166-test cross-validation matrix is what makes "pure-Zig is faster" a trustworthy claim rather than a parlor trick — and it lives at:

aperture-7 hierarchy, digit / coord rotations, H3 index bit-level ops.

handler, faceNeighbors table, inverse-spherical projection helpers.

gridRingUnsafe, areNeighborCells.

src/pure_localij.zig, src/pure_vertex.zig, src/pure_edge.zig, src/pure_polygon.zig — the rest of the substrate.

Plus the property-based harnesses:

across all 16 resolutions (≥630 samples per res, 443 pentagon visits), exercising 7 structural invariants: parent/child round-trip, gridDisk(c,0)={c}, |gridDisk(c,1)| ∈ {6,7}, gridDistance(c,c) = 0, lat/lng round-trip through cell centroid, child-area sum ≈ parent area (within empirically-bounded per-res tolerance), compactCells(children) = {parent}, and validity of every constructor-produced cell11.

resolutions 5–9 over random gridDisk(seed, k) cell sets, plus Property B over all 122 base cells. 1999 / 2000 pass on the pinned seed (the 1 skip is the antimeridian-wedge bbox finding documented in the CHANGELOG)12.

The fuzz suite probes the parser surface specifically (10,000 random u64 inputs into isValidCell / getResolution / getBaseCellNumber / isPentagon, plus NaN / Inf inputs into latLngToCell). It does not fuzz the projection / hierarchy / grid paths against libh3 — a real Type-II risk and a gap the next pass should close.

9. Three libh3 quirks the port surfaced

The cross-validation work turned up three places where libh3's behavior differs from the obvious spec reading. None of these are bugs in libh3 — they're real, documented behaviors that the C source quietly implements. The port surfaced them as part of the differential-test work; surfacing them is part of the lab-notebook value:

  1. cellToLocalIj(origin, origin) does not return (0, 0). libh3

picks an arbitrary IJ frame anchored at the origin, but the coordinates themselves are not zero. The wrapper test in src/root.zig:1194 documents this explicitly: the contract isn't "zero-anchoring at origin," it's "the IJ coordinate round-trips back to the same cell via localIjToCell"13. The test asserts round-trip, not zero, and walks all k = 1 neighbors at src/root.zig:1202.

  1. **cellToVertex returns Error.Domain, not Error.VertexInvalid,

on out-of-range vertex_num.** The pure-Zig path at src/pure_vertex.zig:238 matches: if (vertex_num < 0 or vertex_num > num_verts - 1) return Error.Domain;. The wrapper test at src/root.zig:1107 confirms libh3 returns the same code. The error mapping in Error enum surfaces this honestly — Domain is the documented C return for vertex range violations, not the plausibly-named VertexInvalid14.

  1. cellsToMultiPolygon requires zero-initialized head storage.

libh3's destroyLinkedMultiPolygon walks the linked list to free each node; if the caller hands it a LinkedGeoPolygon with stack garbage in the next / first pointers, the destructor walks junk. The wrapper at src/root.zig:665 fixes this with var out: c.LinkedGeoPolygon = std.mem.zeroes(c.LinkedGeoPolygon); before passing the address to libh315. The C reference relies on the caller to zero the head; Zig makes that contract explicit.

These are not bugs — they are quiet conventions of the C reference that a port has to honor. Documenting them in the differential-test substrate is what made the perf claim possible without breaking behavior.

10. Honest limitations

Six things this evidence package does not cover:

  1. Wrapper coverage is 63 of ~70 public H3 v4 functions. The

remaining six are variant grid traversals — gridDiskUnsafe, gridDiskDistances, gridDiskDistancesSafe, gridDiskDistancesUnsafe, gridDisksUnsafe, gridRingUnsafe. They are reachable only through the raw C-bindings escape hatch16. The safe path (gridDisk / gridDistance) is covered.

  1. **Pentagon distortion handling is replicated, not independently

verified.** The pure-Zig path translates libh3's pentagon-overage logic verbatim. We have not independently re-derived the pentagon-distortion math; the cross-validation suite catches output drift against libh3, but if libh3 itself is wrong on some pentagon-adjacent input, the pure-Zig path will be wrong in the same way. Independent verification is months of correctness work and is out of scope.

  1. Performance is workstation-grade Ice Lake under heavy load. No

quiet-host numbers, no aarch64 numbers, no server-grade Xeon / EPYC numbers, no representative production workload (e.g. a real order-stream trace from the private spatial CLOB). The single-laptop ratios are directional; absolute production characteristics will differ.

  1. Only ReleaseFast is measured. No ReleaseSafe, no

ReleaseSmall, no Debug.

  1. No micro-architectural profiling. Wall-clock ns/op only — no

cache-miss counters, no branch-miss counters, no IPC measurement. The "LLVM inlines through the call graph" claim in Section 5 is a hypothesized mechanism, not a confirmed cause.

  1. The cross-validation matrix is finite. Anti-meridian wrap,

exact-pole coordinates (lat = ±π/2 exactly, not just ±89°), and sub-millimetre coordinate precision near pentagon distortion boundaries are not enumerated in the fixture set. The property-polygon test caught one antimeridian-wedge bbox bug in pure_polygon.bboxFromGeoLoop that is documented in the CHANGELOG as out-of-scope-for-this-commit. If you find a divergence between the pure-Zig path and libh3 on any input, that is a bug; please open an issue.

11. Status footer

matrix (166 / 166 tests on Zig 0.16.0; pure-Zig path matched cell-by-cell against libh3 in the same test binary; 10,392-sample property-invariant harness; 2,000-trial polygon round-trip harness) is the differential-test substrate. evidence: differential-tested matches the YAML frontmatter.

zig-h3 && zig build bench produces the raw numbers; zig build test` produces the cross-validation verdict.

(Intel Core i7-1065G7 @ 1.30 GHz, Linux 7.0.3-arch1-1 x86_64, Zig 0.16.0). Re-verification on a quiet host is the next benchmark task.

LLVM-asm dump to confirm inlining; LTO'd libh3 comparison; pure_grid.gridDisk allocator audit; anti-meridian-wedge bbox fix in pure_polygon.bboxFromGeoLoop.


Footnotes

  1. bench/bench_pure_vs_libh3.zig, lines 1–189. Probe definitions at lines 33–37 (res 7, 9, 11). Point-count constant at line 25 (200_000). Warmup-iters constant at line 26 (5_000). PRNG seed at line 45 (0xDEAD_60A1_C0DE_BE17). Grid-disk iter count at line 151 (50_000). pure_over_libh3_x1000 ratio computation at lines 87–88, 129–130, 181–182.
  2. bench/bench_pure_vs_libh3.zig, lines 18–23. inline fn nanos() u64 { var ts: std.os.linux.timespec = undefined; _ = std.os.linux.clock_gettime(.MONOTONIC, &ts); return @as(u64, @intCast(ts.sec)) * std.time.ns_per_s + @as(u64, @intCast(ts.nsec)); }. std.time.Timer and std.time.nanoTimestamp were removed in Zig 0.16's stdlib reshuffle; the direct syscall is the replacement.
  3. build.zig.zon declares the h3c dependency with the upstream Uber/h3 v4.1.0 release tarball URL and a content hash. The Zig package manager downloads the archive on first build, verifies the hash, and caches it. README "Install" and "Why hash-pin v4.1.0" sections at README.md lines 86–110 and 309–313.
  4. README.md, "Benchmarks" section, lines 363–428. Hardware-and-software list at lines 387–392 (Intel Core i7-1065G7 @ 1.30 GHz, Linux 7.0.3-arch1-1 x86_64, Zig 0.16.0, ReleaseFast, load avg 28–55 on 8 cores). Median-of-four-runs table at lines 399–407. Honest-read-off bullets at lines 415–425. "Both paths produce bit-identical output … the perf delta is codegen, not algorithm" at lines 427–428.
  5. README.md, "Install" section, lines 84–109. The .dependencies block in build.zig.zon points at https://github.com/SMC17/zig-h3/archive/refs/tags/v1.1.0.tar.gz; the underlying libh3 dependency follows the same hash-pin pattern.
  6. README.md, "Tests" section, lines 340–362. 47 wrapper-layer tests + 117 pure-Zig cross-validation tests + 2 adversarial-input fuzz tests = 166 total, all currently passing on Zig 0.16.0 via zig build test. Status section at lines 54–73 lists the cross-validation fixture coverage.
  7. README.md, lines 58–73 — the fixture-coverage paragraph spanning degrees↔radians, closed-form counts, landmark-city resolution, hierarchy round-trip, h3↔string round-trip, SF↔NYC distance, res-9 area bound, base-cell / pentagon enumeration, malformed-string and zero-cell rejection, directed-edge round-trip, hexagon-vs-pentagon edge / vertex counts, polygon-to-cells on a small bbox, cells-to-multi-polygon on single cells and k=1 disks, local-IJ ↔ cell round-trip, gridPathCells endpoint / contiguity, compact / uncompact round-trip.
  8. src/pure_h3index.zig:1119pub fn latLngToCell(point: LatLng, res: i32) Error!H3Index { … }. The function is module-visible and calls into private helpers (_geoToFaceIjk, _geoToHex2d, _hex2dToCoordIJK-equivalents) within the same file and into pure_proj.zig via the top-level @import at the head of the module. The Zig compilation unit covers the whole graph; LLVM sees it inlineable.
  9. src/pure_h3decode.zig:399pub fn cellToLatLng(cell: H3Index) Error!LatLng { … }. Inverse-projection chain through _h3ToFaceIjkWithInitializedFijk, _h3ToFaceIjk, _adjustOverageClassII, _hex2dToGeo, all within the same Zig file and module. faceNeighbors[20][4] table at the top of the same file, lines 26–69.
  10. src/pure_grid.zig:632pub fn gridDisk(origin: H3Index, k: i32, allocator: std.mem.Allocator, out: []H3Index) Error!void { if (k < 0) return Error.Domain; … }. Note the Allocator parameter — the pure-Zig path uses it for an internal deduplication hash-set. The libh3 gridDisk does not allocate; this is one suspect for the wide-range gridDisk row in Section 4.
  11. src/property_invariants.zig. 10,392 randomly-generated cells across all 16 resolutions (≥630 samples per res, 443 pentagon visits), 7 structural invariants. Documented finding: libh3's cellAreaKm2 parent-vs-children-sum drift is ~1.8e-3 relative at res 0, decaying monotonically with resolution; this is gnomonic-projection distortion of the substrate grid, not a wrapper bug. Per-resolution tolerance table at AREA_TOLERANCE_BY_RES. PRNG seeded deterministically. (See CHANGELOG.md Unreleased section for the full description.)
  12. src/property_polygon_test.zig. 2,000 Property A trials at resolutions 5–9 over random gridDisk(seed, k) cell sets with k ∈ {0, 1, 2}, asserting polygonToCells(cellsToMultiPolygon(S).polygons[0], res) ⊇ S. Pinned seed 0xC0FFEE_BEEF_F00D; 1999 / 2000 pass, 1 skip for antimeridian-wedge bbox bug in pure_polygon.bboxFromGeoLoop (documented in CHANGELOG as out-of-scope-for-this-commit). Property B: 122 res-0 base cells walked to res 5–9, 550 hex passes, 60 pentagon skips.
  13. src/root.zig:1194test "cellToLocalIj: origin roundtrips through localIjToCell". Test body at lines 1195–1200: const ij = try cellToLocalIj(origin, origin, 0); followed by try testing.expectEqual(origin, try localIjToCell(origin, ij, 0)); — the comment at lines 1197–1198 reads "libh3 doesn't pin the origin to (0,0); the contract is that the IJ coordinate inverts back to the same cell." k = 1 neighbor round-trip at src/root.zig:1202–1211.
  14. src/pure_vertex.zig:238if (vertex_num < 0 or vertex_num > num_verts - 1) return Error.Domain;. Wrapper test at src/root.zig:1107–1108: try testing.expectError(Error.Domain, cellToVertex(cell, -1)); and try testing.expectError(Error.Domain, cellToVertex(cell, 6));. The Error enum mapping at src/root.zig:161 maps C code 2 (E_DOMAIN) to Error.Domain; code 8 (E_VERTEX_INVALID) at src/root.zig:167 maps to Error.VertexInvalid. libh3 returns code 2 (not 8) on out-of-range vertex_num, which is the quirk.
  15. src/root.zig:665–668pub fn cellsToMultiPolygon(cells: []const H3Index) Error!LinkedMultiPolygon { var out: c.LinkedGeoPolygon = std.mem.zeroes(c.LinkedGeoPolygon); try check(c.cellsToLinkedMultiPolygon(cells.ptr, @intCast(cells.len), &out)); return .{ .inner = out }; }. The std.mem.zeroes initialization is the fix — without it, LinkedMultiPolygon.deinit at src/root.zig:657–659 (c.destroyLinkedMultiPolygon(&self.inner)) walks junk next / first pointers from stack garbage.
  16. README.md, lines 75–80 — "Out of scope (variant grid traversals — the safe / non-pentagon path is covered by gridDisk / gridDistance): gridDiskUnsafe, gridDiskDistances, gridDiskDistancesSafe, gridDiskDistancesUnsafe, gridDisksUnsafe, gridRingUnsafe. The raw C bindings remain exposed via the raw module export so callers can reach any unwrapped function and PR an idiomatic wrapper."