"SOVEREIGN AUDIT 06"

Sovereign Audit 06: The 38-Microsecond Mind

2026-05-12 · 15 min read · 3636 words

The robotics industry has settled on a comfortable shorthand: that 30-millisecond control loops are "fast enough" for physical intelligence, that a Vision-Language-Action stack can live in a data center and reach the body through a network call. Google DeepMind's AutoRT [9] and the Physical Intelligence π-zero [10] family both publish latency budgets in that 30–300ms regime. In the language of physical control, that is an eternity; it is also a number with a specific provenance, and the 30ms figure deserves to be cited honestly rather than waved at as a rhetorical anchor.

The essay you are reading is the second pass at a claim I first published as sovereign-audit-04. That earlier post asserted a 38.9μs Sovereign VLA latency without showing the measurement, the code, or the limits of what was measured. A hostile external review (Grok, May 2026) flagged it correctly: bold number, no methodology, no repo link, no admitted scope. This rewrite is the apology for that earlier version, and the standing reproducible artifact for the current one.

What this essay covers, in order: what was actually measured, by what code, on what hardware, with what variance, against what baseline, with what known caveats, and how to rerun it.

What the 37.6μs measures (and what it doesn't)

The benchmark lives at src/bench_vla.zig in the <canonical-tree> working tree [1]. It runs a tight loop of 1,000 iterations after a 10-iteration warm-up, and for each iteration it calls generateVLABlock(allocator), a function that constructs a 12-layer symbolic attention block in the Kircher Ark symbolic graph, then lowers it to a PTX string targeting sm_75 (the NVIDIA T4 compute capability) [2].

The measurement primitive is clock_gettime(CLOCK_MONOTONIC) bracketing the single generateVLABlock call. This matters: the 37.6μs figure is wall-clock CPU latency for symbolic kernel synthesis, on the host side. It is not the GPU execution time of the resulting kernel. That distinction is the most important caveat in this essay, and I am putting it in bold here because the earlier version of this post elided it.

What the 37.6μs does claim, with code-level fidelity:

What it does not claim:

With those scopes named: here is the measurement.

I. Methodology

Hardware

The 12-layer block

generateVLABlock is 12 lines of essential Zig (src/cognitive/learning/crl.zig, lines 46–64) [2]:

pub fn generateVLABlock(allocator: std.mem.Allocator) ![]const u8 {
    var scrl = SCRL.init(allocator);
    defer scrl.deinit();

    const x = try scrl.ark.param("X");
    var current = x;

    // Generate 12 "Layers" of symbolic morphisms
    // Each layer is an attention-like projection (mul + add)
    for (0..12) |i| {
        var buf: [16]u8 = undefined;
        const w_name = try std.fmt.bufPrint(&buf, "W{d}", .{i});
        const w = try scrl.ark.param(w_name);
        const projection = try scrl.ark.mul(current, w);
        current = try scrl.ark.add(projection, x); // Skip connection
    }

    return scrl.ark.compilePtx("vla_12_layer");
}

Each "layer" is a fused multiply-add with a residual connection back to the input, the smallest topology that resembles an attention block while remaining honest about scale. The KircherArk graph deduplicates identical nodes (the node_cache: AutoHashMap in ptx_ark.zig) and constant-folds at AST construction time (x + 0 = x, x * 1 = x, x * 0 = 0), so by the time compilePtx walks the node list, the IR is already simplified.
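The two simplifications described above can be sketched in a few lines. This is a minimal illustration in Python rather than Zig, and it is not the KircherArk implementation: the class TinyArk, its method names, and build_block are hypothetical stand-ins. Only the folding rules (x + 0 = x, x * 1 = x, x * 0 = 0) and the idea of a node cache keyed on node structure come from the text.

```python
class TinyArk:
    """Hypothetical stand-in for a deduplicating symbolic graph (not the real KircherArk)."""

    def __init__(self):
        self.nodes = []        # node id -> structural key, e.g. ("mul", 0, 1)
        self.node_cache = {}   # structural key -> node id, so equal nodes are shared

    def _intern(self, key):
        # Reuse the id of a structurally identical node, otherwise create one.
        if key not in self.node_cache:
            self.node_cache[key] = len(self.nodes)
            self.nodes.append(key)
        return self.node_cache[key]

    def param(self, name):
        return self._intern(("param", name))

    def const(self, value):
        return self._intern(("const", float(value)))

    def _is_const(self, node_id, value):
        return self.nodes[node_id] == ("const", float(value))

    def add(self, a, b):
        # Fold x + 0 = x at construction time.
        if self._is_const(a, 0.0):
            return b
        if self._is_const(b, 0.0):
            return a
        return self._intern(("add", a, b))

    def mul(self, a, b):
        # Fold x * 0 = 0 and x * 1 = x at construction time.
        if self._is_const(a, 0.0) or self._is_const(b, 0.0):
            return self.const(0.0)
        if self._is_const(a, 1.0):
            return b
        if self._is_const(b, 1.0):
            return a
        return self._intern(("mul", a, b))


def build_block(ark, layers=12):
    """Same topology as the Zig listing: current = current * W_i + X, repeated."""
    x = ark.param("X")
    current = x
    for i in range(layers):
        w = ark.param(f"W{i}")
        current = ark.add(ark.mul(current, w), x)
    return current
```

In this sketch a 12-layer block produces 1 + 3 × 12 = 37 nodes (X, then one weight, one mul and one add per layer), and building the same block a second time on the same graph adds none, because every node is found in the cache. The sketch does not canonicalize operand order, so add(a, b) and add(b, a) are stored as different nodes; whether the real node_cache does so is not stated in the text.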

The measurement loop

From src/bench_vla.zig, lines 17–42 [1]:

// Warmup
for (0..10) |_| {
    const ptx = try zig_canonical.learning.SCRL.generateVLABlock(allocator);
    allocator.free(ptx);
}

for (0..1000) |i| {
    var ts_start: std.os.linux.timespec = undefined;
    _ = std.os.linux.clock_gettime(std.os.linux.CLOCK.MONOTONIC, &ts_start);
    const start_time = @as(u64, @intCast(ts_start.sec)) * 1_000_000_000
                     + @as(u64, @intCast(ts_start.nsec));

    const ptx = try zig_canonical.learning.SCRL.generateVLABlock(allocator);

    var ts_end: std.os.linux.timespec = undefined;
    _ = std.os.linux.clock_gettime(std.os.linux.CLOCK.MONOTONIC, &ts_end);
    const end_time = @as(u64, @intCast(ts_end.sec)) * 1_000_000_000
                   + @as(u64, @intCast(ts_end.nsec));
    const elapsed = end_time - start_time;
    allocator.free(ptx);

    times[i] = elapsed;
    sum_ns += elapsed;
    if (elapsed < min_ns) min_ns = elapsed;
    if (elapsed > max_ns) max_ns = elapsed;
}

Three things to call out about this loop:

  1. The warm-up is short (10 iterations). That is enough to bring the allocator's free-lists and the libc/CRT pages into cache, but it's not enough to defeat a determined adversarial reviewer who wants to see longer warm-up. A v2 of the bench should ramp warm-up to 1,000 and report iteration-binned tail latencies.
  2. The allocator is an arena (init.arena.allocator() in main). That means the per-iteration allocator.free(ptx) is cheap: arena free is essentially a no-op. The flip side is that an arena does not return memory until it is reset, so its footprint can grow across the 1,000 iterations instead of staying bounded by the largest PTX string, and an occasional arena resize shows up in the max latency. Production code generating PTX on a hot path would either reuse a fixed-size buffer or hand the allocation off; arena timing is a fair approximation of "no allocator pathology," not a claim that allocation has been eliminated.
  3. The measurement scope is exactly one generateVLABlock call. That includes graph construction, deduplication, register allocation, and PTX string emission. It does not include ptxas compilation, cuModuleLoadDataEx, or kernel launch; those are measured separately, downstream.

Reported figures

bench_vla prints, at the end of the run:

The 37.6μs figure cited in the summary is the average reported by this loop on the measurement host. Min hovers in the high-20μs range; max occasionally spikes to ~1.08ms when an arena resize or a page fault catches the loop. Tail behavior matters for control loops. The max-latency outlier is more important than the average for hard real-time, and section VI flags this honestly.
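To make the tail visible next to the mean, the per-iteration samples can be summarized with percentiles and a per-bin worst case. The sketch below is in Python and is not part of bench_vla; the function names, the nearest-rank percentile definition, and the bin size of 100 iterations are assumptions chosen for illustration.

```python
import math


def percentile(sorted_ns, p):
    """Nearest-rank percentile (0 < p <= 100) of an ascending-sorted list."""
    # round() guards against float noise such as 999.0000000000001 before ceil().
    rank = max(1, math.ceil(round(p * len(sorted_ns) / 100.0, 9)))
    return sorted_ns[rank - 1]


def summarize(samples_ns, bin_size=100):
    """Mean/min/max plus tail percentiles and the worst sample in each bin."""
    ordered = sorted(samples_ns)
    return {
        "mean": sum(samples_ns) / len(samples_ns),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "p99.9": percentile(ordered, 99.9),
        # Worst case per block of consecutive iterations: shows whether spikes
        # cluster early (warm-up effects) or are spread across the run.
        "bin_max": [max(samples_ns[i:i + bin_size])
                    for i in range(0, len(samples_ns), bin_size)],
    }


# Synthetic data for illustration only: one 1.08 ms spike among 30 us samples.
report = summarize([1_080_000] + [30_000] * 999)
```

With a single spike in 1,000 samples, the nearest-rank p99.9 in this sketch still lands on an ordinary sample, which is why the max and the per-bin worst case are reported alongside the percentiles.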

II. The PTX the bench produces

The kernel that comes out of generateVLABlock is a 12-fused-layer projection. The emitter (compilePtx in ptx_ark.zig, lines 192–266) [3] separates register classes (.f32, .u32, .pred) and emits a single-entry-point .visible .entry vla_12_layer block.

For a smaller but fully-formed example of the Ark's output, atomic_dot.ptx is checked into the repo [6]. This is a warp-reduction dot product, generated by KircherArk.generateAtomicDot, and used as the "Action Tokenizer baseline" by the live harness:

.version 8.0
.target sm_80
.address_size 64

.visible .entry atomic_dot(
  .param .u64 A,
  .param .u64 B,
  .param .u64 C,
  .param .u64 N
)
{
  .reg .pred %p<3>;
  .reg .f32 %f<5>;
  .reg .b32 %r<10>;
  .reg .b64 %rd<10>;
  mov.u32 %r0, %tid.x;
  mov.u32 %r1, %ntid.x;
  mov.u32 %r2, %ctaid.x;
  mad.lo.u32 %r3, %r2, %r1, %r0;
  ld.param.u64 %rd1, [A];
  ld.param.u64 %rd2, [B];
  ld.param.u64 %rd3, [C];
  ld.param.u32 %r4, [N];
  setp.ge.u32 %p1, %r3, %r4;
  @%p1 ret;
  mul.wide.u32 %rd4, %r3, 4;
  add.u64 %rd5, %rd1, %rd4;
  add.u64 %rd6, %rd2, %rd4;
  ld.global.nc.f32 %f0, [%rd5];
  ld.global.nc.f32 %f1, [%rd6];
  mul.f32 %f2, %f0, %f1;
  shfl.sync.bfly.b32 %f3, %f2, 16, 31, 0xffffffff;
  add.f32 %f2, %f2, %f3;
  shfl.sync.bfly.b32 %f3, %f2, 8, 31, 0xffffffff;
  add.f32 %f2, %f2, %f3;
  shfl.sync.bfly.b32 %f3, %f2, 4, 31, 0xffffffff;
  add.f32 %f2, %f2, %f3;
  shfl.sync.bfly.b32 %f3, %f2, 2, 31, 0xffffffff;
  add.f32 %f2, %f2, %f3;
  shfl.sync.bfly.b32 %f3, %f2, 1, 31, 0xffffffff;
  add.f32 %f2, %f2, %f3;
  and.b32 %r5, %r0, 31;
  setp.eq.u32 %p2, %r5, 0;
  @%p2 atom.global.add.f32 %f4, [%rd3], %f2;
  ret;
}

(Note: the file committed to the repo targets sm_80. The Ark itself emits sm_75 for the in-process VLA path. Both are real and both are valid PTX; the sm_80 artifact is a manually curated reference that uses ld.global.nc non-coherent loads and a slightly tighter shuffle pattern. Neither feature is exclusive to Ampere (both are also available on the T4's sm_75); the sm_80 target is kept as a forward-target reference.)

The key architectural property is the warp-level reduction: 32 threads collapse to a single partial sum via shfl.sync butterfly, and only lane 0 of each warp issues the global atomic add. This issues 32× fewer global-memory atomics than a one-atomic-per-thread version and is the kind of optimization that human-written CUDA usually only acquires after a profiling pass; emitting it directly from a symbolic graph is the engineering point of the Kircher Ark.
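The butterfly pattern in the listing (offsets 16, 8, 4, 2, 1, each followed by an add) can be checked on the host with a small emulation. This is a sketch in plain Python, not generated code: butterfly_reduce and block_dot are hypothetical helper names, and the "atomic add" is modeled as an ordinary accumulation with a counter.

```python
def butterfly_reduce(lane_values):
    """Emulate shfl.sync.bfly + add across one 32-lane warp; returns per-lane results."""
    assert len(lane_values) == 32
    vals = list(lane_values)
    for offset in (16, 8, 4, 2, 1):
        # Each lane adds the value held by its partner lane (lane XOR offset),
        # reading from the previous step's values as the hardware shuffle does.
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(32)]
    return vals


def block_dot(a, b):
    """Dot product where only lane 0 of each warp performs the (modeled) atomic add."""
    assert len(a) == len(b) and len(a) % 32 == 0
    total, atomics = 0.0, 0
    for base in range(0, len(a), 32):
        products = [a[base + i] * b[base + i] for i in range(32)]
        total += butterfly_reduce(products)[0]  # lane 0 holds the warp sum
        atomics += 1                            # one atomic per warp, not per thread
    return total, atomics
```

After the five steps every lane holds the full warp sum, so reading lane 0 is sufficient. For 64 elements the sketch performs 2 accumulations where a per-thread version would perform 64, which is the 32× reduction in atomic operations described above.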

III. The kernel-by-kernel pipeline (where 37.6μs sits)

The bench measures one stage of a larger pipeline. The full inference path lives in src/vla_pipeline.zig and breaks down as follows:

+---------------------------+----------------------------+
|  STAGE                    |  WHAT IT DOES              |
+---------------------------+----------------------------+
|  1. KircherArk synthesis  |  Symbolic graph → PTX      |
|     (= the 37.6μs bench)  |  string. CPU side.         |
+---------------------------+----------------------------+
|  2. ptxas / cuModule load |  PTX → SASS, JIT-cached    |
|     (init-time, one-shot) |  per-process.              |
+---------------------------+----------------------------+
|  3. copyToDevice (input)  |  Host → device DMA, f32×N. |
+---------------------------+----------------------------+
|  4. LayerNorm kernel      |  N=32, 1×1×1 grid, 32×1×1  |
|                           |  block.                    |
+---------------------------+----------------------------+
|  5. Softmax kernel        |  N=32, 1×1×1 grid, 32×1×1  |
|                           |  block.                    |
+---------------------------+----------------------------+
|  6. Causality Guard       |  N=1 lane. Filters         |
|                           |  invariant-violating       |
|                           |  actions.                  |
+---------------------------+----------------------------+
|  7. copyToHost (output)   |  Device → host DMA.        |
+---------------------------+----------------------------+
|  8. postAction (motor)    |  argmax → JSON-RPC over    |
|                           |  unix socket to agent.     |
+---------------------------+----------------------------+

Stages 4–6 are the genuine GPU work and are launched via Driver.launchExt [5]. The current input shape is N=32 (a 32-wide attention head, the smallest non-trivial configuration). Stage 1 is the bench's scope. Stage 2 happens at process-init and is amortized; it is excluded from steady-state latency. Stages 3 and 7 are the DMA hops, and for a 32×f32 = 128-byte payload across PCIe Gen3 x16 they are dominated by per-transfer fixed overhead (~5–15μs) rather than bandwidth.
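The claim that the DMA hops are overhead-bound rather than bandwidth-bound follows from simple arithmetic. The sketch below (Python, illustrative only) uses the theoretical PCIe Gen3 x16 link rate of 8 GT/s per lane with 128b/130b encoding and ignores packet and protocol overhead; the constant and function names are assumptions, and the 5μs figure is the low end of the range quoted above.

```python
PCIE3_TRANSFERS_PER_S = 8e9   # PCIe Gen3: 8 GT/s per lane
LANES = 16
ENCODING = 128 / 130          # 128b/130b line encoding


def wire_time_ns(payload_bytes):
    """Theoretical serialization time at full link rate, ignoring protocol overhead."""
    bytes_per_s = PCIE3_TRANSFERS_PER_S * LANES * ENCODING / 8
    return payload_bytes / bytes_per_s * 1e9


payload_bytes = 32 * 4                 # 32 x f32
t_ns = wire_time_ns(payload_bytes)     # roughly 8 ns on the wire
fixed_overhead_ns = 5_000              # low end of the ~5-15 us range in the text
overhead_ratio = fixed_overhead_ns / t_ns
```

At the theoretical link rate, 128 bytes occupy the wire for about 8 ns, so even the low end of the fixed per-transfer cost is several hundred times larger than the bandwidth term.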

The honest summary: 37.6μs is the dominant term for kernel synthesis when the host is doing JIT-style emission of new kernels per step, which is the SCRL/Kircher use case where the graph topology changes with the learned policy. For a static, pre-compiled kernel, stage 1 disappears entirely after warm-up and the steady-state path is stages 3–8.

IV. The T4 execution path (separate measurement, separate harness)

The CPU-side bench answers "how fast can we generate a PTX kernel." The T4 execution side is answered by src/harness_vla_live.zig [4], which is a different binary and a different measurement.

Critically, harness_vla_live uses CUDA events (not clock_gettime) for timing:

const start_event = try d.createEvent();
const end_event = try d.createEvent();
defer d.destroyEvent(start_event);
defer d.destroyEvent(end_event);

// ...

try d.recordEvent(start_event, null);
try d.launchExt(func, .{ grid_size, 1, 1 }, .{ block_size, 1, 1 }, &params);
try d.recordEvent(end_event, null);
try d.synchronizeEvent(end_event);

const physical_ms = try d.eventElapsedTime(start_event, end_event);
const physical_us = physical_ms * 1000.0;

This is the right primitive for GPU timing: cuEventRecord enqueues a timestamp into the command stream itself, so cuEventElapsedTime returns the duration between the two events as seen by the device, not by the host. CPU-side wall-clock timing would fold host-side launch overhead and driver-queue stalls into the number, and because kernel launches are asynchronous it would not capture kernel execution at all without an explicit synchronize.

The harness loads generateAtomicDot()'s PTX (the warp-reduction dot product shown above) and launches it against a real sensor log read from <internal-lab>/runs/sample.bin. Grid size is (n_elements + 255) / 256, block size is 256. The harness prints the measured physical_us to stdout and gates a SOVEREIGN status flag on physical_us < 50.0.
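The launch geometry and the status gate described above reduce to two one-line computations. The sketch below is in Python for illustration; the names grid_size and sovereign are hypothetical, and only the block size of 256, the ceiling-division formula, and the 50μs threshold come from the text.

```python
BLOCK_SIZE = 256


def grid_size(n_elements, block_size=BLOCK_SIZE):
    """Ceiling division: the smallest number of blocks whose threads cover n_elements."""
    return (n_elements + block_size - 1) // block_size


def idle_threads(n_elements, block_size=BLOCK_SIZE):
    """Threads in the last block that have no element to process."""
    return grid_size(n_elements, block_size) * block_size - n_elements


def sovereign(physical_us, gate_us=50.0):
    """Status gate: strictly below the threshold, as in physical_us < 50.0."""
    return physical_us < gate_us
```

For example, 1,000 elements need 4 blocks, leaving 24 threads in the last block without work; in the PTX listing those threads are the ones that take the early `@%p1 ret` bounds-check exit.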

What this means for the headline claim: the figure in the title (37.6μs, rounded to 38) refers to the kernel-generation bench, which is the more thoroughly measured side (1,000-iteration loop, min/max/mean reported, gated on a 1ms bound). The T4-side execution number (the actual SASS-to-result latency on hardware) is sampled by harness_vla_live but is not yet aggregated into a 1,000-iteration distribution in the public repo. Calling the full end-to-end claim "T4-verified" requires running the harness with iteration aggregation, which is the next bench-engineering task. The honest 2026-05-12 status is: PTX generation is benchmarked rigorously; PTX execution is wired up live but has not yet been published as a distribution.

V. The cloud-baseline comparison, audited

The "797× faster than 30ms cloud baseline" framing needs the same audit. Where does 30ms come from?

A fair comparison would be: "our 12-layer PTX-native projection generates and (on the T4 execution path) launches faster than a 30ms VLA inference call." That's true. But the apples-to-apples concern stands: a 12-layer attention block is not the same model as a 5-billion-parameter RT-2 head, and conflating them invites the parlor-trick critique I was rightly given on the first version.

Where the comparison is legitimate: at the control-loop level, a robot's planner does not need a 5B-parameter VLA to decide whether to move 5mm left. A 12-layer projection is sufficient for many fine-control morphisms, and if a small specialized kernel produces an action of acceptable quality, then running it at up to roughly 26,000 Hz (the reciprocal of 37.6μs, which counts synthesis alone and is therefore an upper bound) instead of 33 Hz is the architecturally correct choice. The argument is about appropriate model scale per control horizon, not about model-vs-model raw capability.
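The two rates quoted above are reciprocals of the corresponding periods. A short Python sketch makes the arithmetic explicit; loop_rate_hz is a hypothetical helper, and it assumes the stages of a loop run back to back so that their latencies add.

```python
def loop_rate_hz(stage_latencies_s):
    """Maximum loop rate when the listed stages run sequentially once per iteration."""
    return 1.0 / sum(stage_latencies_s)


synthesis_only = loop_rate_hz([37.6e-6])   # about 26,600 Hz
cloud_call = loop_rate_hz([30e-3])         # about 33 Hz

# Every additional stage lowers the ceiling. Doubling the period, for
# instance, halves the achievable rate.
doubled_period = loop_rate_hz([37.6e-6, 37.6e-6])
```

The 26,000 Hz figure is therefore a ceiling derived from the synthesis bench alone; any latency contributed by the remaining pipeline stages reduces it in proportion to the total period.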

VI. Honest limitations

Five things this essay's claim does not cover, in plain language:

  1. No published end-to-end T4 latency distribution. harness_vla_live emits one measurement per run. There is no committed log file showing 1,000-iteration mean/min/max from the GPU. Producing that log is the next benchmark task.
  2. The bench measures a single function call, not a full inference. Stages 2–8 of the pipeline above (PTX compile, DMA, three kernel launches, action-emit) are not in the 37.6μs envelope. A full sensor-to-servo wall-clock is the sum of all stages; we have measurements for stage 1 (this bench) and stage 4/5/6 (per-kernel via harness_vla_live-style events), but no aggregated full-loop number.
  3. The 12-layer block is the smallest VLA, not the largest. A real-world physical-intelligence stack on a hand-grade manipulator may need more than 12 attention layers, or a richer non-linearity than the mul + add + skip pattern used here. Scaling the block adds proportional generation time (this bench grows roughly linearly in layer count at the symbolic-emit level, sublinearly at the deduplicated-AST level).
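  4. The CPU host matters. The 37.6μs figure is on one specific machine. A slower CPU will be proportionally slower for the symbolic-emit path; a faster one, proportionally faster. The right way to report this in v2 of the bench is to add a core-cycle count next to nanoseconds (taken from a cycle performance counter, since RDTSC on current x86 parts ticks at a fixed reference rate rather than the core clock); that makes results easier to compare across clock speeds, though not across microarchitectures.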
  5. The "invalidates the cloud" framing was wrong. A 37.6μs PTX generator does not "invalidate" cloud AI for physical robotics. It does demonstrate that for some control-horizon problems, a hardware-native sovereign stack is the architecturally correct choice. That is a much smaller and much more defensible claim than the original version made, and it is the one I am willing to stand behind in 2026-05.

VII. Reproducibility

The bench is one command, assuming the <canonical-tree> working tree and a Zig 0.15.x toolchain:

cd <canonical-tree>
zig build bench_vla

The build step is defined in build.zig at lines 448–462 [7], wires src/bench_vla.zig into an executable with the zig_canonical module imported, and runs it once. The output is a single block of text:

============================================================
              S-VLA LATENCY HARDWARE BENCHMARK
------------------------------------------------------------
  Target: 12-Layer Sovereign VLA (SASS/PTX Generation)
  Iterations: 1000
  Average Latency: 0.0376 ms (37648 ns)
  Min Latency: 0.0287 ms
  Max Latency: 1.0809 ms
------------------------------------------------------------
  RESULT: SUCCESS | Sub-1ms Invariant VERIFIED
  Google AutoRT Latency: ~30.0 ms
  Gemini S-VLA Latency:  ~0.0376 ms
  Advantage: ~797.9x Faster
============================================================

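(Exact numbers vary by host. The ratio is computed at runtime from the measured mean. Note that in this run the max, 1.0809 ms, is above 1 ms while the result line still reports SUCCESS, so the sub-1ms check evidently applies to the average rather than to the worst case.)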
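The printed advantage can be recomputed from the output block itself. The Python sketch below parses the "Average Latency" line from the sample run and derives the ratio against the 30.0 ms baseline in two ways; the regular expression and the function names are assumptions for illustration, not part of bench_vla.

```python
import re

SAMPLE = """\
  Average Latency: 0.0376 ms (37648 ns)
  Google AutoRT Latency: ~30.0 ms
"""


def parse_mean(text):
    """Return (mean in ms as printed, mean in ns as printed) from the bench output."""
    m = re.search(r"Average Latency:\s*([0-9.]+) ms \((\d+) ns\)", text)
    if m is None:
        raise ValueError("no 'Average Latency' line found")
    return float(m.group(1)), int(m.group(2))


def advantage(baseline_ms, mean_ns):
    """Baseline divided by measured mean, both expressed in nanoseconds."""
    return baseline_ms * 1e6 / mean_ns


mean_ms, mean_ns = parse_mean(SAMPLE)
from_ns = advantage(30.0, mean_ns)          # uses the 37648 ns figure
from_ms = advantage(30.0, mean_ms * 1e6)    # uses the rounded 0.0376 ms figure
```

For the sample run, the nanosecond figure gives a ratio of about 796.9 and the rounded millisecond figure gives about 797.9. The printed "~797.9x" matches the latter, so the two displayed mean values and the displayed ratio agree only to about three significant figures.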

For the live T4 execution path (requires libcuda + an actual NVIDIA GPU):

cd <canonical-tree>
zig build harness_vla_live

The build step is at build.zig lines 484–508. It links cuda as a system library, sets LD_LIBRARY_PATH=/usr/lib so libcuda's runtime dlopen of libnvidia-ptxjitcompiler.so.1 resolves, and runs the binary. On a no-CUDA host, the binary fails to load (DT_NEEDED libcuda), which is by design: the host-side bench is the always-runnable surface, the live-GPU step is opt-in.

The Python FFI bridge at scripts/kircher_lerobot_bridge.py [8] exposes ark_init, ark_param, ark_const_f32, ark_mul, ark_add, ark_derive, and ark_compile_ptx to a LeRobot-style consumer; the canonical use case is generating an action-tokenizer kernel from Python policy code and shipping the PTX over the FFI boundary.

VIII. What the 37.6μs is, finally

It is the measured wall-clock time for a Zig host process to construct, deduplicate, register-allocate, and emit as a PTX string a 12-layer attention-block kernel targeting NVIDIA T4 hardware, averaged over 1,000 iterations after a 10-iteration warm-up, on a single CPU thread using CLOCK_MONOTONIC.

It is not a 200-billion-parameter model. It is not a 30-millisecond cloud round-trip. It is not even GPU execution time. It is the kernel-generation step of a pipeline whose other stages are wired up, measured separately, but not yet aggregated as an end-to-end distribution.

That is what the substrate has proven so far. The next milestone, a 1,000-iteration T4-event distribution from harness_vla_live against a real sensor log, committed to the repo with a checked-in log file, is the artifact that earns the larger end-to-end claim. Until that lands, the honest framing of this essay is: kernel synthesis averages under 50μs and is reproducible from source today.

Sources

Code (primary artifact)

External baselines

Cross-references


  1. <canonical-tree>/src/bench_vla.zig lines 1–67. The measurement is clock_gettime(CLOCK_MONOTONIC) bracketing a single generateVLABlock call, repeated 1,000 times after a 10-iteration warm-up. Wall-clock measurement on the host CPU.
  2. <canonical-tree>/src/cognitive/learning/crl.zig lines 46–64. The function generateVLABlock constructs 12 mul+add+skip layers in the KircherArk symbolic graph, then calls compilePtx("vla_12_layer") which emits a .target sm_75 PTX module.
  3. <canonical-tree>/src/gpu/ptx_ark.zig. The KircherArk struct: node-deduplicated AST (lines 44–72), constant-folding at construction (add lines 104–117, mul lines 119–135), register-class-separated emitter (compilePtx lines 192–266), .target sm_75 PTX header emission (line 230), checked-in warp-reduction example (generateAtomicDot lines 268–343).
  4. <canonical-tree>/src/harness_vla_live.zig lines 1–94. The live T4 path: queries compute_capability_major/minor via the CUDA driver API at runtime (lines 22–25), loads generateAtomicDot's PTX, launches against <internal-lab>/runs/sample.bin, times with cuEventRecord + cuEventElapsedTime. Latency gate at <50μs on line 84.
  5. <canonical-tree>/src/vla_pipeline.zig lines 1–165. The full kernel-by-kernel pipeline: 4 kernels (DCT, LayerNorm, Softmax, CausalityGuard) loaded from generated PTX, three sequential launchExt invocations, action emit over unix socket. Input N=32.
  6. <canonical-tree>/atomic_dot.ptx, 46 lines, manually curated. .target sm_80, uses ld.global.nc non-coherent loads and shfl.sync.bfly.b32 butterfly reduction, single global atomic per warp via lane-0 predicate.
  7. <canonical-tree>/build.zig. bench_vla exec definition at lines 448–462, runs src/bench_vla.zig. harness_vla_live exec definition at lines 484–508, links libcuda, sets LD_LIBRARY_PATH=/usr/lib so the libcuda dlopen of libnvidia-ptxjitcompiler.so.1 resolves.
  8. <canonical-tree>/scripts/kircher_lerobot_bridge.py lines 1–52. ctypes loader for libzig_canonical.so, exposes ark_init, ark_param, ark_const_f32, ark_mul, ark_add, ark_compile_ptx. Example at the bottom generates an "action_tokenizer" PTX from a Python script.
  9. Ahn et al., "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents" (DeepMind, 2024). Control-loop latency figures in the 30–300ms range; the 30ms anchor used in this essay is the low end of the disclosed range and is sensitive to model size, batch, and host substrate.
  10. Physical Intelligence, "π0: A Vision-Language-Action Flow Model for General Robot Control" (2024). Inference latency in the same regime as RT-2-class systems; explicit framing that a foundation-model VLA + action head is not a single-microsecond operation. The wall-clock figures are not directly comparable to a 12-layer PTX kernel and the comparison in this essay should be read as architectural (model-size-per-control-horizon), not capability-equivalent.