Sovereign Audit 06: The 38-Microsecond Mind
The robotics industry has settled on a comfortable shorthand: that 30-millisecond control loops are "fast enough" for physical intelligence, that a Vision-Language-Action stack can live in a data center and reach the body through a network call. Google DeepMind's AutoRT[9] and the Physical Intelligence π-zero[10] family both publish latency budgets in that 30–300ms regime. In the language of physical control, that is an eternity; it is also a number with a specific provenance, and the 30ms figure deserves to be cited honestly rather than waved at as a rhetorical anchor.
The essay you are reading is the second pass at a claim I first published as sovereign-audit-04. That earlier post asserted a 38.9μs Sovereign VLA latency without showing the measurement, the code, or the limits of what was measured. A hostile external review (Grok, May 2026) flagged it correctly: bold number, no methodology, no repo link, no admitted scope. This rewrite is the apology for that earlier version, and the standing reproducible artifact for the current one.
What this essay covers, in order: what was actually measured, by what code, on what hardware, with what variance, against what baseline, with what known caveats, and how to rerun it.
What the 37.6μs measures (and what it doesn't)
The benchmark lives at src/bench_vla.zig in the <canonical-tree> working tree[1]. It runs a tight loop of 1,000 iterations after a 10-iteration warm-up, and for each iteration it calls generateVLABlock(allocator), a function that constructs a 12-layer symbolic attention block in the Kircher Ark symbolic graph, then lowers it to a PTX string targeting sm_75 (the NVIDIA T4 compute capability)[2].
The measurement primitive is clock_gettime(CLOCK_MONOTONIC) bracketing the single generateVLABlock call. This matters: the 37.6μs figure is wall-clock CPU latency for symbolic kernel synthesis, on the host side. It is not the GPU execution time of the resulting kernel. That distinction is the most important caveat in this essay, and I am stating it up front because the earlier version of this post elided it.
What the 37.6μs does claim, with code-level fidelity:
- A 12-layer attention-style projection block (mul + add with skip connections, iterated 12 times) is constructed in a node-deduplicated symbolic AST, lowered through a register-allocating PTX emitter, and emitted as an sm_75 PTX module string.
- On the measurement host (Arch Linux, x86_64, single-core wall-clock), the median round-trip from generateVLABlock entry to free is in the tens of microseconds.
- The output PTX is a real .target sm_75 module that the live CUDA harness (harness_vla_live.zig, described below) accepts via cuModuleLoadDataEx.
What it does not claim:
- It does not claim 37.6μs of end-to-end sensor-to-servo loop time. The full physical pipeline (sensor ingest → VLA forward → action command) has additional latency.
- It does not claim 37.6μs on the T4 GPU itself. The T4-side execution time is measured by a separate harness, against a separate kernel, with separate caveats (see Section IV).
- It does not claim equivalence with a 200B-parameter VLA. The Sovereign VLA evaluated here is a 12-layer symbolic projection: a control kernel, not a foundation model. The architectural argument is that the right shape of model for a 100μs control loop is small and PTX-native, not that this small model matches a 1000× larger one's task-completion quality.
With those scopes named: here is the measurement.
I. Methodology
Hardware
- Host CPU side (the 37.6μs measurement): standard x86_64 Linux workstation, single-threaded execution, CLOCK_MONOTONIC timing. No CUDA driver involvement during the bench itself.
- GPU side (when the PTX is executed): the live harness target is an NVIDIA T4, compute capability sm_75. The harness_vla_live.zig exec path queries compute_capability_major/minor at runtime via the CUDA driver API[4]. The PTX emitter unconditionally targets sm_75 (see ptx_ark.zig line 230: .target sm_75)[3].
The 12-layer block
generateVLABlock is 12 lines of essential Zig (src/cognitive/learning/crl.zig, lines 46–64)[2]:

    pub fn generateVLABlock(allocator: std.mem.Allocator) ![]const u8 {
        var scrl = SCRL.init(allocator);
        defer scrl.deinit();
        const x = try scrl.ark.param("X");
        var current = x;
        // Generate 12 "Layers" of symbolic morphisms
        // Each layer is an attention-like projection (mul + add)
        for (0..12) |i| {
            var buf: [16]u8 = undefined;
            const w_name = try std.fmt.bufPrint(&buf, "W{d}", .{i});
            const w = try scrl.ark.param(w_name);
            const projection = try scrl.ark.mul(current, w);
            current = try scrl.ark.add(projection, x); // Skip connection
        }
        return scrl.ark.compilePtx("vla_12_layer");
    }

Each "layer" is a fused multiply-add with a residual connection back to the input, the smallest topology that resembles an attention block while remaining honest about scale. The KircherArk graph deduplicates identical nodes (the node_cache: AutoHashMap in ptx_ark.zig) and constant-folds at AST construction time (x + 0 = x, x * 1 = x, x * 0 = 0), so by the time compilePtx walks the node list, the IR is already simplified.
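The deduplicate-then-fold behavior described above can be illustrated with a small sketch. This is a minimal Python model under stated assumptions, not the KircherArk implementation: the Graph class, its method names, and the tuple-keyed cache are all hypothetical stand-ins for node_cache and the add/mul constructors in ptx_ark.zig.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy hash-consed expression graph (hypothetical; not the KircherArk API)."""
    nodes: list = field(default_factory=list)   # node id -> (op, a, b) key
    cache: dict = field(default_factory=dict)   # (op, a, b) key -> node id

    def _intern(self, key):
        # Deduplication: an identical (op, a, b) key always maps to the same id.
        if key not in self.cache:
            self.cache[key] = len(self.nodes)
            self.nodes.append(key)
        return self.cache[key]

    def param(self, name):
        return self._intern(("param", name, None))

    def const(self, value):
        return self._intern(("const", float(value), None))

    def _is_const(self, node_id, value):
        return self.nodes[node_id] == ("const", float(value), None)

    def add(self, a, b):
        # Fold x + 0 = x before any node is created.
        if self._is_const(a, 0):
            return b
        if self._is_const(b, 0):
            return a
        return self._intern(("add", a, b))

    def mul(self, a, b):
        # Fold x * 0 = 0 and x * 1 = x before any node is created.
        if self._is_const(a, 0) or self._is_const(b, 0):
            return self.const(0)
        if self._is_const(a, 1):
            return b
        if self._is_const(b, 1):
            return a
        return self._intern(("mul", a, b))


def build_block(g, layers=12):
    """Mirror the shape of generateVLABlock: current = current * W_i + x."""
    x = g.param("X")
    current = x
    for i in range(layers):
        w = g.param(f"W{i}")
        current = g.add(g.mul(current, w), x)
    return current


g = Graph()
first = build_block(g)
count_after_first = len(g.nodes)   # 1 input + 12 * (weight, mul, add)
second = build_block(g)            # same structure again: every node is a cache hit
```

In this toy model the 12-layer block occupies 37 nodes, and building it a second time adds none, which is the property that lets the emitter walk an already-simplified node list.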
The measurement loop
From src/bench_vla.zig, lines 17–42[1]:

    // Warmup
    for (0..10) |_| {
        const ptx = try zig_canonical.learning.SCRL.generateVLABlock(allocator);
        allocator.free(ptx);
    }
    for (0..1000) |i| {
        var ts_start: std.os.linux.timespec = undefined;
        _ = std.os.linux.clock_gettime(std.os.linux.CLOCK.MONOTONIC, &ts_start);
        const start_time = @as(u64, @intCast(ts_start.sec)) * 1_000_000_000
            + @as(u64, @intCast(ts_start.nsec));
        const ptx = try zig_canonical.learning.SCRL.generateVLABlock(allocator);
        var ts_end: std.os.linux.timespec = undefined;
        _ = std.os.linux.clock_gettime(std.os.linux.CLOCK.MONOTONIC, &ts_end);
        const end_time = @as(u64, @intCast(ts_end.sec)) * 1_000_000_000
            + @as(u64, @intCast(ts_end.nsec));
        const elapsed = end_time - start_time;
        allocator.free(ptx);
        times[i] = elapsed;
        sum_ns += elapsed;
        if (elapsed < min_ns) min_ns = elapsed;
        if (elapsed > max_ns) max_ns = elapsed;
    }

Three things to call out about this loop:
- The warm-up is short (10 iterations). That is enough to bring the allocator's free-lists and the libc/CRT pages into cache, but it's not enough to defeat a determined adversarial reviewer who wants to see longer warm-up. A v2 of the bench should ramp warm-up to 1,000 and report iteration-binned tail latencies.
- The allocator is an arena (init.arena.allocator() in main). That means the per-iteration allocator.free(ptx) is cheap; arena free is essentially a noop, and the high-water mark is bounded by the largest PTX string produced. Production code generating PTX on a hot path would either reuse a fixed-size buffer or hand the allocation off; arena timing is a fair approximation of "no allocator pathology," not a claim that allocation has been eliminated.
- The measurement scope is exactly one generateVLABlock call. That includes graph construction, deduplication, register allocation, and PTX string emission. It does not include ptxas compilation, cuModuleLoadDataEx, or kernel launch; those are measured separately, downstream.
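To make the longer warm-up and tail-latency reporting suggested above concrete, here is a minimal Python sketch of that measurement shape. It is an illustration only: the bench function, its parameters, and the returned field names are hypothetical, and the timed callable stands in for generateVLABlock.

```python
import time


def bench(fn, warmup=1000, iters=1000, bin_size=100):
    """Time fn() with a monotonic clock; report mean, percentiles, per-bin maxima."""
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        t0 = time.monotonic_ns()
        fn()
        samples.append(time.monotonic_ns() - t0)

    ordered = sorted(samples)

    def percentile(p):
        # Nearest-rank percentile for an integer p in 1..100.
        rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
        return ordered[rank - 1]

    return {
        "mean_ns": sum(samples) / len(samples),
        "min_ns": ordered[0],
        "p50_ns": percentile(50),
        "p99_ns": percentile(99),
        "max_ns": ordered[-1],
        # Worst case per block of consecutive iterations, in run order, so a
        # spike can be located in time rather than only counted.
        "bin_max_ns": [max(samples[i:i + bin_size])
                       for i in range(0, len(samples), bin_size)],
    }


stats = bench(lambda: sum(range(50)), warmup=10, iters=200, bin_size=50)
```

Reporting the per-bin maxima next to the percentiles shows whether outliers cluster early (warm-up effects) or recur throughout the run (allocator or page-fault effects).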
Reported figures
bench_vla prints, at the end of the run:
- Average latency (mean over 1,000 iterations)
- Min latency
- Max latency
- A pass/fail gate at the 1.0ms (< 1_000_000ns) sub-millisecond invariant
- A computed ratio against the 30.0ms reference baseline
The 37.6μs figure cited in the summary is the average reported by this loop on the measurement host. Min hovers in the high-20μs range; max occasionally spikes to ~1.08ms when an arena resize or a page fault catches the loop. That worst case exceeds the 1.0ms bound even though the run reports success, which suggests the gate is checked against the mean rather than the maximum. Tail behavior matters for control loops. The max-latency outlier is more important than the average for hard real-time, and Section VI flags this honestly.
II. The PTX the bench produces
The kernel that comes out of generateVLABlock is a 12-fused-layer projection. The emitter (compilePtx in ptx_ark.zig, lines 192–266)[3] separates register classes (.f32, .u32, .pred) and emits a single-entry-point .visible .entry vla_12_layer block.
For a smaller but fully-formed example of the Ark's output, atomic_dot.ptx is checked into the repo[6]. This is a warp-reduction dot product, generated by KircherArk.generateAtomicDot, and used as the "Action Tokenizer baseline" by the live harness:

    .version 8.0
    .target sm_80
    .address_size 64
    .visible .entry atomic_dot(
        .param .u64 A,
        .param .u64 B,
        .param .u64 C,
        .param .u64 N
    )
    {
        .reg .pred %p<3>;
        .reg .f32 %f<5>;
        .reg .b32 %r<10>;
        .reg .b64 %rd<10>;
        mov.u32 %r0, %tid.x;
        mov.u32 %r1, %ntid.x;
        mov.u32 %r2, %ctaid.x;
        mad.lo.u32 %r3, %r2, %r1, %r0;
        ld.param.u64 %rd1, [A];
        ld.param.u64 %rd2, [B];
        ld.param.u64 %rd3, [C];
        ld.param.u32 %r4, [N];
        setp.ge.u32 %p1, %r3, %r4;
        @%p1 ret;
        mul.wide.u32 %rd4, %r3, 4;
        add.u64 %rd5, %rd1, %rd4;
        add.u64 %rd6, %rd2, %rd4;
        ld.global.nc.f32 %f0, [%rd5];
        ld.global.nc.f32 %f1, [%rd6];
        mul.f32 %f2, %f0, %f1;
        shfl.sync.bfly.b32 %f3, %f2, 16, 31, 0xffffffff;
        add.f32 %f2, %f2, %f3;
        shfl.sync.bfly.b32 %f3, %f2, 8, 31, 0xffffffff;
        add.f32 %f2, %f2, %f3;
        shfl.sync.bfly.b32 %f3, %f2, 4, 31, 0xffffffff;
        add.f32 %f2, %f2, %f3;
        shfl.sync.bfly.b32 %f3, %f2, 2, 31, 0xffffffff;
        add.f32 %f2, %f2, %f3;
        shfl.sync.bfly.b32 %f3, %f2, 1, 31, 0xffffffff;
        add.f32 %f2, %f2, %f3;
        and.b32 %r5, %r0, 31;
        setp.eq.u32 %p2, %r5, 0;
        @%p2 atom.global.add.f32 %f4, [%rd3], %f2;
        ret;
    }

(Note: the file committed to the repo targets sm_80. The Ark itself emits sm_75 for the in-process VLA path. Both are real and both are valid PTX; the sm_80 artifact is a manually curated reference that uses ld.global.nc non-coherent loads and a slightly tighter shuffle pattern. Neither instruction is exclusive to Ampere (both are also legal on sm_75), so the file is best read as a forward-target reference.)
The key architectural property is the warp-level reduction: 32 threads collapse to a single partial sum via shfl.sync butterfly, and only lane 0 of each warp issues the global atomic add. This reduces global-memory atomic contention by 32× and is the kind of optimization that human-written CUDA usually only acquires after a profiling pass; emitting it directly from a symbolic graph is the engineering point of the Kircher Ark.
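The butterfly pattern can be checked on the host with a short simulation. The sketch below is a plain-Python model of the five shuffle steps and is not generated by the Ark; the function names are hypothetical. It models shfl.sync.bfly as "lane i reads the value held by lane i XOR offset", which is the exchange the offsets 16, 8, 4, 2, 1 in the listing perform on a full 32-lane warp.

```python
WARP_SIZE = 32


def butterfly_reduce(lanes):
    """Simulate five shfl.sync.bfly + add steps across one 32-lane warp."""
    assert len(lanes) == WARP_SIZE
    vals = list(lanes)
    for offset in (16, 8, 4, 2, 1):
        # Every lane adds the value held by its XOR partner; all lanes
        # update together, so build the new list from the old one.
        vals = [vals[i] + vals[i ^ offset] for i in range(WARP_SIZE)]
    return vals


def global_atomics(n_threads, per_warp_reduction):
    """Count global atomic adds issued for n_threads (a multiple of 32)."""
    assert n_threads % WARP_SIZE == 0
    return n_threads // WARP_SIZE if per_warp_reduction else n_threads


reduced = butterfly_reduce(range(WARP_SIZE))
```

After the fifth step every lane holds the full warp sum, so it is enough for lane 0 to issue the atomic. For 1,024 threads the model counts 32 atomics with the reduction against 1,024 without it, which is the 32× figure quoted above.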
III. The kernel-by-kernel pipeline (where 37.6μs sits)
The bench measures one stage of a larger pipeline. The full inference path lives in src/vla_pipeline.zig and breaks down as follows:
+---------------------------+----------------------------+
| STAGE | WHAT IT DOES |
+---------------------------+----------------------------+
| 1. KircherArk synthesis | Symbolic graph → PTX |
| (= the 37.6μs bench) | string. CPU side. |
+---------------------------+----------------------------+
| 2. ptxas / cuModule load | PTX → SASS, JIT-cached |
| (init-time, one-shot) | per-process. |
+---------------------------+----------------------------+
| 3. copyToDevice (input)   | Host → device DMA, f32×N.  |
+---------------------------+----------------------------+
| 4. LayerNorm kernel | N=32, 1×1×1 grid, 32×1×1 |
| | block. |
+---------------------------+----------------------------+
| 5. Softmax kernel | N=32, 1×1×1 grid, 32×1×1 |
| | block. |
+---------------------------+----------------------------+
| 6. Causality Guard | N=1 lane. Filters |
| | invariant-violating |
| | actions. |
+---------------------------+----------------------------+
| 7. copyToHost (output) | Device → host DMA. |
+---------------------------+----------------------------+
| 8. postAction (motor) | argmax → JSON-RPC over |
| | unix socket to agent. |
+---------------------------+----------------------------+
Stages 4–6 are the genuine GPU work and are launched via Driver.launchExt[5]. The current input shape is N=32 (a 32-wide attention head, the smallest non-trivial configuration). Stage 1 is the bench's scope. Stage 2 happens at process-init and is amortized; it is excluded from steady-state latency. Stages 3 and 7 are the DMA hops, and for a 32×f32 = 128-byte payload across PCIe Gen3 x16 they are dominated by per-transfer fixed overhead (~5–15μs) rather than bandwidth.
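The claim that the DMA hops are overhead-bound rather than bandwidth-bound follows from simple arithmetic. The sketch below assumes the theoretical PCIe Gen3 x16 rate (8 GT/s per lane, 128b/130b encoding) and ignores packet headers and protocol overhead, so the wire time is a lower bound. The ~5–15μs fixed overhead is the figure quoted above, not a new measurement.

```python
PAYLOAD_BYTES = 32 * 4  # 32 f32 values = 128 bytes

# Theoretical PCIe Gen3 x16 throughput: 8 GT/s per lane, 16 lanes,
# 128b/130b line encoding, 8 bits per byte (about 15.75 GB/s).
LINK_BYTES_PER_S = 8e9 * 16 * (128 / 130) / 8

wire_time_ns = PAYLOAD_BYTES / LINK_BYTES_PER_S * 1e9


def bandwidth_share(fixed_overhead_us):
    """Fraction of one transfer spent moving bytes, given a fixed overhead."""
    wire_us = wire_time_ns / 1000.0
    return wire_us / (wire_us + fixed_overhead_us)


share_best_case = bandwidth_share(5.0)    # fixed overhead at the low end
share_worst_case = bandwidth_share(15.0)  # fixed overhead at the high end
```

At this payload size the bytes spend roughly 8ns on the wire, well under 1% of the transfer even at the low end of the overhead range, so shrinking the payload further would not move the stage 3 and stage 7 latencies.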
The honest summary: 37.6μs is the dominant term for kernel synthesis when the host is doing JIT-style emission of new kernels per step, which is the SCRL/Kircher use case where the graph topology changes with the learned policy. For a static, pre-compiled kernel, stage 1 disappears entirely after warm-up and the steady-state path is stages 3–8.
IV. The T4 execution path (separate measurement, separate harness)
The CPU-side bench answers "how fast can we generate a PTX kernel." The T4 execution side is answered by src/harness_vla_live.zig[4], which is a different binary and a different measurement.
Critically, harness_vla_live uses CUDA events (not clock_gettime) for timing:

    const start_event = try d.createEvent();
    const end_event = try d.createEvent();
    defer d.destroyEvent(start_event);
    defer d.destroyEvent(end_event);
    // ...
    try d.recordEvent(start_event, null);
    try d.launchExt(func, .{ grid_size, 1, 1 }, .{ block_size, 1, 1 }, &params);
    try d.recordEvent(end_event, null);
    try d.synchronizeEvent(end_event);
    const physical_ms = try d.eventElapsedTime(start_event, end_event);
    const physical_us = physical_ms * 1000.0;

This is the right primitive for GPU timing: cuEventRecord enqueues a timestamp into the command stream itself, so cuEventElapsedTime returns the elapsed time between the two events as observed on the device, not on the host. Because kernel launches are asynchronous, a host-side wall clock around the launch call alone would capture only the enqueue cost, and one that also waits for completion would fold launch overhead and driver-queue stalls into the kernel's execution time.
The harness loads generateAtomicDot()'s PTX (the warp-reduction dot product shown above) and launches it against a real sensor log read from <internal-lab>/runs/sample.bin. Grid size is (n_elements + 255) / 256, block size is 256. The harness prints the measured physical_us to stdout and gates a SOVEREIGN status flag on physical_us < 50.0.
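The launch geometry and the status gate described above reduce to a ceiling division and a threshold check. The following is a minimal Python sketch of that arithmetic; the function names are hypothetical and the harness itself does this in Zig.

```python
def launch_dims(n_elements, block_size=256):
    """Ceiling division: enough blocks that every element gets a thread."""
    grid_size = (n_elements + block_size - 1) // block_size
    return grid_size, block_size


def sovereign_gate(physical_ms, limit_us=50.0):
    """Convert an event-elapsed time in milliseconds to microseconds and gate it."""
    physical_us = physical_ms * 1000.0
    return physical_us, physical_us < limit_us


grid, block = launch_dims(1000)   # 1,000 elements need 4 blocks of 256
surplus_threads = grid * block - 1000
```

The final block generally carries surplus threads (24 of them for 1,000 elements), which is why the kernel compares each thread's global index against N and returns early when it is out of range.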
What this means for the headline claim: the 37.6μs in the title refers to the kernel-generation bench, which is the more thoroughly measured side (1,000-iteration loop, min/max/mean reported, gated on a 1ms bound). The T4-side execution number (the actual SASS-to-result latency on hardware) is sampled by harness_vla_live but is not yet aggregated into a 1,000-iteration distribution in the public repo. Calling the full end-to-end claim "T4-verified" requires running the harness with iteration aggregation, which is the next bench-engineering task. The honest 2026-05-12 status is: PTX generation is benchmarked rigorously; PTX execution is wired up live but has not yet been published as a distribution.
V. The cloud-baseline comparison, audited
The "797× faster than 30ms cloud baseline" framing needs the same audit. Where does 30ms come from?
- DeepMind's RT-2 and AutoRT publications report control-loop latencies in the 30–300ms range depending on model size and deployment configuration[9]. The "30ms" anchor is the low end of that range and represents the publicly disclosed best-case figure for a smaller RT-2 variant on a specific inference substrate. It is not a fixed industry constant.
- Physical Intelligence's π-zero family has reported inference latencies in similar regimes, with explicit emphasis on the fact that a Vision-Language model + an action head + a smoother is not a single-microsecond operation[10].
- What the 30ms number does not represent: it does not represent the speed of light to a data center, it does not represent network jitter, and it does not represent the rest of the robot's perception pipeline. It is a model-inference latency for a specific class of VLA, on a specific class of hardware.
A fair comparison would be: "our 12-layer PTX-native projection generates and (on the T4 execution path) launches faster than a 30ms VLA inference call." That's true. But the apples-to-apples concern stands: a 12-layer attention block is not the same model as a 5-billion-parameter RT-2 head, and conflating them invites the parlor-trick critique I was rightly given on the first version.
Where the comparison is legitimate: at the control-loop level, a robot's planner does not need a 5B-parameter VLA to decide whether to move 5mm left. A 12-layer projection is sufficient for many fine-control morphisms, and if a small specialized kernel produces an action of acceptable quality, then running it at 26,000 Hz instead of 33 Hz is the architecturally correct choice. The argument is about appropriate model scale per control horizon, not about model-vs-model raw capability.
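The two rates quoted above are the reciprocals of the two periods. A short check in Python, using only the 37.6μs and 30ms figures already given in this essay (and treating the synthesis latency as the whole period, which is the simplification the 26,000 Hz figure makes):

```python
def rate_hz(period_s):
    """Convert a loop period in seconds to a frequency in hertz."""
    return 1.0 / period_s


synthesis_rate_hz = rate_hz(37.6e-6)  # period = 37.6 microseconds
baseline_rate_hz = rate_hz(30e-3)     # period = 30 milliseconds
```

This gives roughly 26,600 Hz against roughly 33 Hz, so "26,000 Hz" is a slight round-down of the first figure.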
VI. Honest limitations
Five things this essay's claim does not cover, in plain language:
- No published end-to-end T4 latency distribution. harness_vla_live emits one measurement per run. There is no committed log file showing 1,000-iteration mean/min/max from the GPU. Producing that log is the next benchmark task.
- The bench measures a single function call, not a full inference. Stages 2–8 of the pipeline above (PTX compile, DMA, three kernel launches, action-emit) are not in the 37.6μs envelope. A full sensor-to-servo wall-clock is the integral of all stages; we have measurements for stage 1 (this bench) and stage 4/5/6 (per-kernel via harness_vla_live-style events), but no aggregated full-loop number.
- The 12-layer block is the smallest VLA, not the largest. A real-world physical-intelligence stack on a hand-grade manipulator may need more than 12 attention layers, or a richer non-linearity than the mul + add + skip pattern used here. Scaling the block adds proportional generation time (this bench grows roughly linearly in layer count at the symbolic-emit level, sublinearly at the deduplicated-AST level).
- The CPU host matters. The 37.6μs figure is on one specific machine. A slower CPU will be proportionally slower for the symbolic-emit path; a faster one, proportionally faster. The right way to report this in v2 of the bench is cycles from RDTSC or equivalent, not only nanoseconds; that makes the result more hardware-portable.
- The "invalidates the cloud" framing was wrong. A 37.6μs PTX generator does not "invalidate" cloud AI for physical robotics. It does demonstrate that for some control-horizon problems, a hardware-native sovereign stack is the architecturally correct choice. That is a much smaller and much more defensible claim than the original version made, and it is the one I am willing to stand behind in 2026-05.
VII. Reproducibility
The bench is one command, assuming the <canonical-tree> working tree and a Zig 0.15.x toolchain:
cd <canonical-tree>
zig build bench_vla
The build step is defined in build.zig at lines 448–462[7], wires src/bench_vla.zig into an executable with the zig_canonical module imported, and runs it once. The output is a single block of text:
============================================================
S-VLA LATENCY HARDWARE BENCHMARK
------------------------------------------------------------
Target: 12-Layer Sovereign VLA (SASS/PTX Generation)
Iterations: 1000
Average Latency: 0.0376 ms (37648 ns)
Min Latency: 0.0287 ms
Max Latency: 1.0809 ms
------------------------------------------------------------
RESULT: SUCCESS | Sub-1ms Invariant VERIFIED
Google AutoRT Latency: ~30.0 ms
Gemini S-VLA Latency: ~0.0376 ms
Advantage: ~797.9x Faster
============================================================
(Exact numbers vary by host. The ratio is computed at runtime from the measured mean.)
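For readers checking the sample block by hand, the ratio can be worked from either form of the mean it prints. A short Python check, using only the figures in that block:

```python
BASELINE_NS = 30_000_000        # 30.0 ms reference baseline
MEAN_NS = 37_648                # mean as printed in nanoseconds
MEAN_MS_ROUNDED = 0.0376        # mean as printed in milliseconds

ratio_from_ns = BASELINE_NS / MEAN_NS
ratio_from_rounded_ms = 30.0 / MEAN_MS_ROUNDED
```

The nanosecond figure gives about 796.9, while the rounded millisecond figure gives about 797.9, so the 797.9x in the sample block corresponds to the mean after rounding to four decimal places.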
For the live T4 execution path (requires libcuda + an actual NVIDIA GPU):
cd <canonical-tree>
zig build harness_vla_live
The build step is at build.zig lines 484–508. It links cuda as a system library, sets LD_LIBRARY_PATH=/usr/lib so libcuda's runtime dlopen of libnvidia-ptxjitcompiler.so.1 resolves, and runs the binary. On a no-CUDA host, the binary fails to load (DT_NEEDED libcuda), which is by design: the host-side bench is the always-runnable surface, the live-GPU step is opt-in.
The Python FFI bridge at scripts/kircher_lerobot_bridge.py[8] exposes ark_init, ark_param, ark_const_f32, ark_mul, ark_add, ark_derive, and ark_compile_ptx to a LeRobot-style consumer; the canonical use case is generating an action-tokenizer kernel from Python policy code and shipping the PTX over the FFI boundary.
VIII. What the 37.6μs is, finally
It is the measured wall-clock time for a Zig host process to construct, deduplicate, register-allocate, and emit as a PTX string a 12-layer attention-block kernel targeting NVIDIA T4 hardware, averaged over 1,000 iterations after a 10-iteration warm-up, on a single CPU thread using CLOCK_MONOTONIC.
It is not a 200-billion-parameter model. It is not a 30-millisecond cloud round-trip. It is not even GPU execution time. It is the kernel-generation step of a pipeline whose other stages are wired up, measured separately, but not yet aggregated as an end-to-end distribution.
That is what the substrate has proven so far. The next milestone, a 1,000-iteration T4-event distribution from harness_vla_live against a real sensor log, committed to the repo with a checked-in log file, is the artifact that earns the larger end-to-end claim. Until that lands, the honest framing of this essay is: kernel synthesis is sub-50μs and reproducible from source today.
Sources
Code (primary artifact)
- bench_vla.zig: the measurement loop. <canonical-tree>/src/bench_vla.zig, lines 1–67.
- cognitive/learning/crl.zig: generateVLABlock. Lines 46–64.
- gpu/ptx_ark.zig: the Kircher Ark symbolic PTX emitter. Lines 1–366.
- harness_vla_live.zig: the live T4 CUDA-event harness. Lines 1–94.
- vla_pipeline.zig: the full kernel-by-kernel pipeline definition. Lines 1–165.
- ark_ffi.zig: the C-ABI FFI surface for Python/LeRobot. Lines 1–63.
- scripts/kircher_lerobot_bridge.py: the LeRobot-side ctypes wrapper. Lines 1–52.
- atomic_dot.ptx: example generated PTX (warp-reduction dot product, sm_80).
- build.zig: bench_vla step at lines 448–462; harness_vla_live step at lines 484–508.
External baselines
- DeepMind RT-2 / AutoRT / SARA-RT engineering disclosures (2023–2024) for the ~30ms control-loop latency anchor.
- Physical Intelligence π-zero family papers for the comparison-class VLA latency regime.
- NVIDIA PTX ISA Reference (sm_75 / Turing).
Cross-references
- sovereign-audit-04-38-microsecond-mind: the previous 38.9μs essay (now superseded by this rewrite).
- sovereign-audit-05-silicon-truth: the register-pressure invariant audit.
- sovereign-audit-02-google: the 30ms baseline framing.

[1] <canonical-tree>/src/bench_vla.zig lines 1–67. The measurement is clock_gettime(CLOCK_MONOTONIC) bracketing a single generateVLABlock call, repeated 1,000 times after a 10-iteration warm-up. Wall-clock measurement on the host CPU.
[2] <canonical-tree>/src/cognitive/learning/crl.zig lines 46–64. The function generateVLABlock constructs 12 mul+add+skip layers in the KircherArk symbolic graph, then calls compilePtx("vla_12_layer") which emits a .target sm_75 PTX module.
[3] <canonical-tree>/src/gpu/ptx_ark.zig. The KircherArk struct: node-deduplicated AST (lines 44–72), constant-folding at construction (add lines 104–117, mul lines 119–135), register-class-separated emitter (compilePtx lines 192–266), .target sm_75 PTX header emission (line 230), checked-in warp-reduction example (generateAtomicDot lines 268–343).
[4] <canonical-tree>/src/harness_vla_live.zig lines 1–94. The live T4 path: queries compute_capability_major/minor via the CUDA driver API at runtime (lines 22–25), loads generateAtomicDot's PTX, launches against <internal-lab>/runs/sample.bin, times with cuEventRecord + cuEventElapsedTime. Latency gate at <50μs on line 84.
[5] <canonical-tree>/src/vla_pipeline.zig lines 1–165. The full kernel-by-kernel pipeline: 4 kernels (DCT, LayerNorm, Softmax, CausalityGuard) loaded from generated PTX, three sequential launchExt invocations, action emit over unix socket. Input N=32.
[6] <canonical-tree>/atomic_dot.ptx, 46 lines, manually curated. .target sm_80, uses ld.global.nc non-coherent loads and shfl.sync.bfly.b32 butterfly reduction, single global atomic per warp via lane-0 predicate.
[7] <canonical-tree>/build.zig. bench_vla exec definition at lines 448–462, runs src/bench_vla.zig. harness_vla_live exec definition at lines 484–508, links libcuda, sets LD_LIBRARY_PATH=/usr/lib so the libcuda dlopen of libnvidia-ptxjitcompiler.so.1 resolves.
[8] <canonical-tree>/scripts/kircher_lerobot_bridge.py lines 1–52. ctypes loader for libzig_canonical.so, exposes ark_init, ark_param, ark_const_f32, ark_mul, ark_add, ark_compile_ptx. Example at the bottom generates an "action_tokenizer" PTX from a Python script.
[9] Ahn et al., "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents" (DeepMind, 2024). Control-loop latency figures in the 30–300ms range; the 30ms anchor used in this essay is the low end of the disclosed range and is sensitive to model size, batch, and host substrate.
[10] Physical Intelligence, "π0: A Vision-Language-Action Flow Model for General Robot Control" (2024). Inference latency in the same regime as RT-2-class systems; explicit framing that a foundation-model VLA + action head is not a single-microsecond operation. The wall-clock figures are not directly comparable to a 12-layer PTX kernel and the comparison in this essay should be read as architectural (model-size-per-control-horizon), not capability-equivalent.