AI inference in Zig — a 4-repo stack from weights to tokens

2026-05-26 · project sovereign-stack

Four small Zig 0.16 projects, AGPL-3.0, that compose into an end-to-end LLM inference path on CPU. Each one shipped as a standalone library; together they form a coherent stack from on-disk weights to streamed tokens. No GPU yet — that comes next.

The point of publishing them as a stack rather than four unrelated repos is composability: each one is a strict layer, and each layer depends only on the layers below it.

The four repos

safetensors-zig: https://github.com/SMC17/safetensors-zig

tokenizers-zig: https://github.com/SMC17/tokenizers-zig

vllm-zig: https://github.com/SMC17/vllm-zig

faiss-zig: https://github.com/SMC17/faiss-zig

From git clone to inference in five commands

The composition is the artifact. Reproducing end-to-end inference on TinyLlama-1.1B-Chat through this stack is the stack-level claim.

git clone https://github.com/SMC17/vllm-zig && cd vllm-zig

huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --local-dir tests/fixtures/tinyllama-real/

zig build -Doptimize=ReleaseFast

zig build test --summary all

zig build infer-tinyllama

That is the stack-level claim: not a per-component speedup, but an end-to-end pure-Zig CPU inference path, reproducible in under ten minutes on a commodity x86_64 Linux laptop with Zig 0.16 on PATH. Per-token latency on consumer Ice Lake is roughly half a second. The point is that the path exists, in Zig, with an architecture that can be audited end-to-end in an afternoon.

For the retrieval-augmented version, faiss-zig composes on top once you have embeddings to index — same build.zig.zon dependency pattern, no new toolchain.

The composition path

The four layers chain into a single inference run on a CPU laptop:

1. safetensors-zig parses TinyLlama-1.1B-Chat model.safetensors (2.2 GB, 201 tensors, BF16) into typed tensor views.

2. tokenizers-zig loads the model's tokenizer.json and encodes the prompt into token IDs.

3. vllm-zig runs the forward pass through 22 transformer blocks (RoPE rotary embeddings, grouped-query attention with KV cache, SwiGLU FFN, RMSNorm), produces logits, and samples the next token.

4. (Optional) faiss-zig does the RAG retrieval step that feeds prompt context.
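Two of the small primitives named in steps 1 and 3 can be sketched in a few lines. This is a minimal illustration, not code from vllm-zig or safetensors-zig: the function names bf16ToF32 and rmsNorm are hypothetical, and the sketch assumes BF16 weights are widened to f32 before use and that RMSNorm follows the usual definition (scale each element by the reciprocal root-mean-square, then by a learned weight).

```zig
const std = @import("std");

/// Widen one bfloat16 value (given as its raw 16 bits) to f32.
/// BF16 is the upper 16 bits of an IEEE-754 f32, so widening is a shift.
/// Hypothetical helper, not the safetensors-zig API.
pub fn bf16ToF32(bits: u16) f32 {
    const wide: u32 = @as(u32, bits) << 16;
    return @bitCast(wide);
}

/// RMSNorm: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
/// Hypothetical helper, not the vllm-zig API.
pub fn rmsNorm(out: []f32, x: []const f32, weight: []const f32, eps: f32) void {
    std.debug.assert(out.len == x.len and weight.len == x.len);

    // Mean of squares over the hidden dimension.
    var sum_sq: f32 = 0;
    for (x) |v| sum_sq += v * v;
    const mean_sq = sum_sq / @as(f32, @floatFromInt(x.len));

    // Scale every element by the reciprocal RMS and the learned weight.
    const inv_rms = 1.0 / @sqrt(mean_sq + eps);
    for (out, x, weight) |*o, v, w| o.* = v * inv_rms * w;
}
```

Both functions take caller-owned slices and allocate nothing, which is the shape a forward pass wants for per-token work.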

End-to-end works today on a single CPU thread; per-token latency on TinyLlama-1.1B is roughly half a second on commodity Ice Lake. That's slow next to llama.cpp with Q4 quantisation, but this path is f32 and largely unoptimised. Within the stack, the SIMD matmul runs 9-16x faster than the scalar baseline, and the persistent thread pool is 1.41x faster end-to-end than the ad-hoc-spawn variant.
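To make the SIMD-versus-scalar comparison concrete, here is a minimal sketch of the inner loop of a matmul, a dot product, written both ways with Zig's built-in @Vector type. It is not the vllm-zig kernel: the names dotScalar and dotSimd and the 8-lane width are assumptions chosen for illustration.

```zig
const std = @import("std");

// Assumed vector width: 8 x f32 fills one 256-bit register.
const lanes = 8;
const Vec = @Vector(lanes, f32);

/// Reference implementation: one multiply-add per iteration.
pub fn dotScalar(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);
    var acc: f32 = 0;
    for (a, b) |x, y| acc += x * y;
    return acc;
}

/// SIMD variant: `lanes` multiply-adds per iteration, then a scalar tail.
pub fn dotSimd(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);
    var acc: Vec = @splat(0.0);
    var i: usize = 0;

    // Main loop over full vector-width chunks.
    while (i + lanes <= a.len) : (i += lanes) {
        const va: Vec = a[i..][0..lanes].*;
        const vb: Vec = b[i..][0..lanes].*;
        acc += va * vb;
    }

    // Horizontal sum of the lanes, then the leftover elements.
    var sum: f32 = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) sum += a[i] * b[i];
    return sum;
}
```

The two variants add terms in a different order, so on general float data they agree only to within rounding. An agreement test between them should therefore compare with a tolerance instead of exact equality.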

What each repo measures

Each repo ships its own BENCH.md with reproducible numbers captured against a reference implementation:

safetensors-zig: parses the 2.2 GB TinyLlama safetensors blob in 241 µs (open + index + view 201 tensors). The parse hot loop is ~5x faster than the HF Rust upstream on the Llama-shape fixture at v0.3.

tokenizers-zig: full encode pipeline (with TemplateProcessing wrap) at ~14.6 µs on BERT-base-uncased (123-byte fixture); encodeWithOffsets at ~7.9 µs. Roughly 5x faster than the HF Rust upstream on WordPiece. 189 tests + 600-iter property fuzz pass at v0.26.

vllm-zig: TinyLlama greedy decode at ~492 ms/token on the v0.0.4 ad-hoc-spawn path; the v0.0.5 persistent pool closes the decode regression (1.41x end-to-end vs v0.0.4). 69 unit tests + SIMD ↔ scalar matmul agreement at 7 (M,K,N) shape variants.

faiss-zig: FlatIndex queries at ~2.1M scored vectors/sec scalar single-thread; HNSW at 70%+ top-10 recall vs FlatIndex; v0.7 IVFPQ delivers 16.94x memory compression with the trade documented (R@10 ≈ 0.10 on uniform-random data, recovers on real clustered embeddings).
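The safetensors-zig figure is easier to read with the file layout in mind. A safetensors file starts with an 8-byte little-endian header length, followed by a JSON header describing each tensor, followed by the raw tensor bytes. Indexing 201 tensors only requires reading that header; the 2.2 GB of weights can stay where they are and be exposed as views. The sketch below shows that first step under the assumption that the file bytes are already in memory (for example via a memory map). The name headerJson is hypothetical and not the safetensors-zig API.

```zig
const std = @import("std");

pub const HeaderError = error{ TooShort, HeaderOutOfBounds };

/// Return the JSON header slice of a safetensors buffer without copying.
/// Layout: [u64 little-endian header length][JSON header][tensor data].
pub fn headerJson(file_bytes: []const u8) HeaderError![]const u8 {
    if (file_bytes.len < 8) return error.TooShort;

    // First 8 bytes: length of the JSON header in bytes.
    const n = std.mem.readInt(u64, file_bytes[0..8], .little);

    // Reject lengths that would run past the end of the buffer.
    if (n > file_bytes.len - 8) return error.HeaderOutOfBounds;

    const len: usize = @intCast(n);
    return file_bytes[8..][0..len];
}
```

Everything after the returned slice is tensor data, addressed by the byte offsets recorded in the JSON header.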

The numbers are specific to Ice Lake; what carries over is the measurement discipline. Reproduce on your own hardware with zig build bench in each repo.

What this stack is and isn't

It is: a CPU-first AGPL Zig stack that loads real model weights and runs real inference end-to-end, with reproducible benchmarks at every layer, no Python in the serving path, and no GPU runtime dependency.

It isn't: production-grade GPU serving. There is no CUDA kernel, no PagedAttention runtime, no continuous batching, no quantisation. The architecture document in vllm-zig (ARCHITECTURE.md) names the GPU work as a separate phase gated on this CPU substrate being correct first.

It isn't: a replacement for vLLM or llama.cpp. Those are mature production systems. This is a four-repo substrate that demonstrates the Zig side of the stack can be coherent and fast enough to be worth optimising further.

The Mercantile Thesis connection

This page lives in /lab/ and reads as pure AI-infra engineering. The strategic frame that makes it one project rather than four is named in the Mercantile Thesis and the Field Statement. Quantitative Mercantilism is the discipline of owning the bottleneck the rest of the economy has to route through. In 2026, capital is chasing the AI utility layer — foundation models — and ignoring the appliance layer: sovereign deployment, hardware-native runtime, multi-agent orchestration. The error is structurally identical to Edison-Electric selling kits in 1887 while Westinghouse quietly owned the transformer.

The four-repo Zig stack is the appliance-layer move: the kernels and primitives an AI-serving system needs in order to run without paying rent on every token to a Python-on-rented-GPU substrate. Alignment evaluation, on-device inference, sovereign deployment, agent-oriented composition: each routes through these primitives. This is the engineering side of the Mercantile Thesis: the appliance Edison did not build, applied to the AI stack.

Why Zig

Three reasons that matter for inference infrastructure:

No hidden allocations. Every allocator is explicit; you see the exact heap behaviour of the forward pass.

Comptime as the type system. Tensor shapes are checked at compile time in many paths; dimension mismatches that would be a Python RuntimeError become Zig compile errors.

C interop without FFI tax. The path to wrapping CUDA / cuBLAS / cuDNN / MLX without writing a Python C extension is an extern fn declaration, not a setuptools / pybind11 exercise.
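The first two reasons can be shown together in one small sketch. It uses a hypothetical Matrix type whose shape is part of its type, and a matmul whose parameter types share the inner dimension. None of these names come from the four repos; they are assumptions made for illustration.

```zig
const std = @import("std");

/// A row-major f32 matrix whose shape is part of its type.
/// Hypothetical type for illustration only.
pub fn Matrix(comptime rows: usize, comptime cols: usize) type {
    return struct {
        data: []f32,

        const Self = @This();

        /// The only heap allocation, made through the caller's allocator.
        pub fn init(allocator: std.mem.Allocator) !Self {
            const data = try allocator.alloc(f32, rows * cols);
            @memset(data, 0);
            return .{ .data = data };
        }

        pub fn deinit(self: Self, allocator: std.mem.Allocator) void {
            allocator.free(self.data);
        }

        /// Pointer to the element at row r, column c.
        pub fn at(self: Self, r: usize, c: usize) *f32 {
            return &self.data[r * cols + c];
        }
    };
}

/// out = a * b. The inner dimension k appears in both operand types,
/// so a call with mismatched shapes is rejected by the compiler.
pub fn matmul(
    comptime m: usize,
    comptime k: usize,
    comptime n: usize,
    out: Matrix(m, n),
    a: Matrix(m, k),
    b: Matrix(k, n),
) void {
    for (0..m) |i| {
        for (0..n) |j| {
            var acc: f32 = 0;
            for (0..k) |p| acc += a.at(i, p).* * b.at(p, j).*;
            out.at(i, j).* = acc;
        }
    }
}
```

With a of type Matrix(2, 3), the call matmul(2, 3, 2, out, a, a) does not compile, because the second operand must be a Matrix(3, 2). The multiplication itself allocates nothing; the only heap traffic is the init and deinit calls the caller makes with an allocator it chose.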

The trade-off is ecosystem age: Zig 0.16 is still pre-1.0, and the standard library has API churn each release. The repos target 0.16.0 specifically.

Roadmap

Public lanes after this post:

GPU kernels for vllm-zig (Ampere+ first; the ARCHITECTURE.md Phase 2 plan).

Quantised inference (int8 / nf4) for vllm-zig so token latency on commodity hardware becomes presentable against llama.cpp.

Multi-model support in tokenizers-zig beyond Llama-family (Qwen, Mistral are present; add Gemma and Phi families).

IVF-PQ tuning in faiss-zig to close the gap to faiss-cpu on million-scale collections.