AI inference in Zig — a 4-repo stack from weights to tokens
Four small Zig 0.16 projects, AGPL-3.0, that compose into an end-to-end LLM inference path on CPU. Each one shipped as a standalone library; together they form a coherent stack from on-disk weights to streamed tokens. No GPU yet — that comes next.
The point of publishing them as a stack rather than four unrelated repos is composability: each one is a strict layer and the layers above only depend on what's below.
The four repos
| Layer | Repo | Latest tag | What it does |
|-------|------|-----------:|--------------|
| Weight loading | [SMC17/safetensors-zig][st] | v0.3.0 | Pure-Zig reader for the HuggingFace safetensors format |
| Tokenization | [SMC17/tokenizers-zig][tk] | v0.26.0 | BPE / WordPiece / Unigram. Full HF Encoding parity + sub-token offsets. |
| Model + forward | [SMC17/vllm-zig][vl] | v0.0.5 | TinyLlama forward pass: RoPE + GQA + multi-thread matmul + sampler |
| Vector retrieval | [SMC17/faiss-zig][fa] | v0.7.0 | FlatIndex + HNSW + IVFFlat + IVFPQ |
[st]: https://github.com/SMC17/safetensors-zig
[tk]: https://github.com/SMC17/tokenizers-zig
[vl]: https://github.com/SMC17/vllm-zig
[fa]: https://github.com/SMC17/faiss-zig
From git clone to inference in five commands
The composition is the artifact. Reproducing end-to-end inference on TinyLlama-1.1B-Chat through this stack is the stack-level claim.
git clone https://github.com/SMC17/vllm-zig && cd vllm-zig
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--local-dir tests/fixtures/tinyllama-real/
zig build -Doptimize=ReleaseFast
zig build test --summary all
zig build infer-tinyllama
The claim is not a per-component speedup. It is that an end-to-end, pure-Zig CPU inference path exists and is reproducible in under ten minutes on a commodity x86_64 Linux laptop with Zig 0.16 on PATH. Per-token latency on consumer Ice Lake is roughly half a second. The architecture is small enough to audit end-to-end in an afternoon.
For the retrieval-augmented version, faiss-zig composes on top once you have embeddings to index — same build.zig.zon dependency pattern, no new toolchain.
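To make the retrieval step concrete, here is a minimal sketch of exact (flat) nearest-neighbour search, the baseline that approximate indexes are compared against. It is written in Python for brevity and is not faiss-zig's API: the function names, the squared-L2 metric, and the toy vectors are all assumptions made for illustration.

```python
import heapq


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(index_vectors, query, k):
    """Exact top-k search: score every stored vector, keep the k nearest.

    Returns a list of (distance, vector_id) pairs, nearest first.
    """
    scored = ((l2_squared(vec, query), i) for i, vec in enumerate(index_vectors))
    return heapq.nsmallest(k, scored)


# Toy "embeddings" (hypothetical); a real RAG setup would get these
# from an embedding model.
docs = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
]
hits = flat_search(docs, query=[0.9, 0.1], k=2)
print([doc_id for _, doc_id in hits])
```

The ids returned by a search like this would select the passages that get prepended to the prompt before tokenization.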
The composition path
The four layers chain into a single inference run on a CPU laptop:
- safetensors-zig parses TinyLlama-1.1B-Chat's model.safetensors (2.2 GB, 201 tensors, BF16) into typed tensor views.
- tokenizers-zig loads the model's tokenizer.json and encodes the prompt into token IDs.
- vllm-zig runs the forward pass through 22 transformer blocks (RoPE rotary embeddings, grouped-query attention with KV cache, SwiGLU FFN, RMSNorm), produces logits, and samples the next token.
- (Optional) faiss-zig does the RAG retrieval step that feeds prompt context.
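As a rough illustration of the first step, the safetensors container is an 8-byte little-endian header length, a JSON header, then raw tensor bytes. The sketch below parses that layout and widens BF16 values to f32. It is Python rather than Zig, it is not safetensors-zig's API, and the helper names and the one-tensor in-memory file are hypothetical.

```python
import json
import struct


def parse_safetensors_header(blob: bytes):
    """Return (header_dict, data_start) for a safetensors byte buffer.

    Layout: 8-byte little-endian u64 header length N, then N bytes of JSON,
    then the raw tensor data. Each tensor's data_offsets are relative to
    the start of that data region.
    """
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8 : 8 + header_len].decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header, 8 + header_len


def bf16_to_f32(raw: bytes):
    """Decode little-endian BF16 values by widening each to an f32 bit pattern."""
    out = []
    for (bits,) in struct.iter_unpack("<H", raw):
        out.append(struct.unpack("<f", struct.pack("<I", bits << 16))[0])
    return out


def tensor_values(blob: bytes, header: dict, data_start: int, name: str):
    """Slice one tensor's bytes out of the buffer and decode it (BF16 only here)."""
    info = header[name]
    assert info["dtype"] == "BF16", "this sketch only decodes BF16"
    begin, end = info["data_offsets"]
    return bf16_to_f32(blob[data_start + begin : data_start + end])


# Build a tiny in-memory file with one hypothetical 1x2 BF16 tensor: [1.0, -2.0].
payload = struct.pack("<HH", 0x3F80, 0xC000)
meta = json.dumps(
    {"w": {"dtype": "BF16", "shape": [1, 2], "data_offsets": [0, 4]}}
).encode()
blob = struct.pack("<Q", len(meta)) + meta + payload

header, data_start = parse_safetensors_header(blob)
print(header["w"]["shape"], tensor_values(blob, header, data_start, "w"))
```

Because the header already carries every tensor's byte range, a reader can index a multi-gigabyte file without touching the tensor data, which is consistent with parse times being far smaller than a full read of the file.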
End-to-end inference works today on a single CPU thread; per-token latency on TinyLlama-1.1B is roughly half a second on commodity Ice Lake. That is slow next to Q4-quantised llama.cpp, but this path is f32 and unoptimised. The SIMD matmul runs 9-16x faster than the scalar baseline, and the multi-thread pool is 1.41x faster end-to-end than the single-thread spawn variant.
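A back-of-envelope calculation helps put the half-second figure in context. The sketch below uses only the 2.2 GB BF16 checkpoint size and the latency quoted above; everything else is an assumption: that weights are held as f32 in memory, that one decode step reads every weight once, and that 4-bit weights cost exactly half a byte each.

```python
# Inputs taken from the text; everything derived below is a rough estimate.
BF16_FILE_BYTES = 2.2e9   # on-disk checkpoint size
SECONDS_PER_TOKEN = 0.5   # "roughly half a second"

params = BF16_FILE_BYTES / 2   # BF16 stores 2 bytes per parameter
f32_bytes = params * 4         # assumption: weights widened to f32 in memory
q4_bytes = params * 0.5        # idealised 4-bit weights, ignoring scale overhead

# Assumption: one decode step reads every weight once.
effective_gb_per_s = f32_bytes / SECONDS_PER_TOKEN / 1e9

print(f"parameters        ~ {params / 1e9:.1f} B")
print(f"f32 weight bytes  ~ {f32_bytes / 1e9:.1f} GB per token")
print(f"effective traffic ~ {effective_gb_per_s:.1f} GB/s at {SECONDS_PER_TOKEN} s/token")
print(f"f32 vs 4-bit      ~ {f32_bytes / q4_bytes:.0f}x more bytes per token")
```

Under those assumptions an f32 path moves about eight times as many weight bytes per token as a 4-bit one, so part of the gap to a quantised runtime comes from the data format rather than from kernel quality.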
What each repo measures
Each repo ships its own BENCH.md with reproducible numbers captured against a reference implementation:
- safetensors-zig: parses the 2.2 GB TinyLlama safetensors blob in 241 µs (open + index + view 201 tensors). The parse hot loop is ~5x faster than the HF Rust upstream on the Llama-shape fixture at v0.3.
- tokenizers-zig: full encode pipeline (with TemplateProcessing wrap) at ~14.6 µs on BERT-base-uncased (123-byte fixture); encodeWithOffsets at ~7.9 µs. Roughly 5x faster than the HF Rust upstream on WordPiece. 189 tests + 600-iter property fuzz pass at v0.26.
- vllm-zig: TinyLlama greedy decode at ~492 ms/token on the v0.0.4 ad-hoc-spawn path; the v0.0.5 persistent pool closes the decode regression (1.41x end-to-end vs v0.0.4). 69 unit tests + SIMD ↔ scalar matmul agreement at 7 (M,K,N) shape variants.
- faiss-zig: FlatIndex queries at ~2.1M scored vectors/sec, scalar and single-threaded; HNSW at 70%+ top-10 recall vs FlatIndex; v0.7 IVFPQ delivers 16.94x memory compression with the recall trade-off documented (R@10 ≈ 0.10 on uniform-random data, recovering on real clustered embeddings).
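For readers unfamiliar with what the WordPiece numbers above are timing, here is a minimal sketch of greedy longest-match-first WordPiece with character offsets, the textbook algorithm used by BERT-style tokenizers. It is Python, not tokenizers-zig's implementation; the function name, the toy vocabulary, and the offset convention (half-open character ranges) are assumptions.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece over a single pre-split word.

    Returns (token, (start, end)) pairs with character offsets into `word`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [(unk, (0, len(word)))]
        pieces.append((match, (start, end)))
        start = end
    return pieces


vocab = {"un", "##aff", "##able", "[UNK]"}  # toy vocabulary, not BERT's
print(wordpiece("unaffable", vocab))
print(wordpiece("xyz", vocab))
```

A full encode pipeline adds normalization, pre-tokenization into words, and the template step that wraps the sequence in special tokens; this sketch covers only the per-word matching loop.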
The numbers are specific to Ice Lake. What carries over is the measurement discipline: reproduce on your own hardware with zig build bench in each repo.
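The SIMD ↔ scalar matmul agreement check listed for vllm-zig follows a common pattern: run a simple reference kernel and an optimised kernel on the same inputs and compare within a tolerance. The sketch below shows that pattern in Python. The lane-wise accumulation is only a stand-in for a SIMD kernel, and the seven shapes and the tolerance are hypothetical, not the ones used in the repo.

```python
import random


def matmul_scalar(a, b, m, k, n):
    """Reference triple loop over row-major flat lists: C[m,n] = A[m,k] @ B[k,n]."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            c[i * n + j] = acc
    return c


def matmul_chunked(a, b, m, k, n, lanes=8):
    """Stand-in for a SIMD kernel: accumulate `lanes` partial sums, then reduce.

    The different summation order is why exact equality is the wrong test.
    """
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            partial = [0.0] * lanes
            for p in range(k):
                partial[p % lanes] += a[i * k + p] * b[p * n + j]
            c[i * n + j] = sum(partial)
    return c


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two flat lists."""
    return max(abs(u - v) for u, v in zip(x, y))


# Hypothetical shapes, chosen to include K values that are not multiples of `lanes`.
SHAPES = [(1, 1, 1), (1, 8, 1), (2, 7, 3), (4, 16, 4), (3, 33, 5), (8, 64, 2), (5, 100, 9)]

rng = random.Random(0)  # fixed seed so the check is reproducible
worst = 0.0
for m, k, n in SHAPES:
    a = [rng.uniform(-1, 1) for _ in range(m * k)]
    b = [rng.uniform(-1, 1) for _ in range(k * n)]
    diff = max_abs_diff(matmul_scalar(a, b, m, k, n), matmul_chunked(a, b, m, k, n))
    worst = max(worst, diff)

print(f"worst abs difference over {len(SHAPES)} shapes: {worst:.3e}")
assert worst < 1e-9
```

Python floats are 64-bit, so the tolerance here is tight; an f32 kernel would need a looser bound that scales with K.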
What this stack is and isn't
It is: a CPU-first AGPL Zig stack that loads real model weights and runs real inference end-to-end, with reproducible benchmarks at every layer, no Python in the serving path, and no GPU runtime dependency.
It isn't: production-grade GPU serving. There is no CUDA kernel, no PagedAttention runtime, no continuous batching, no quantisation. The architecture document in vllm-zig (ARCHITECTURE.md) names the GPU work as a separate phase gated on this CPU substrate being correct first.
It isn't: a replacement for vllm or llama.cpp. Those are mature production systems. This is a four-repo substrate that demonstrates the Zig-side of the stack can be coherent and fast enough to be worth optimising further.
The Mercantile Thesis connection
This page lives in /lab/ and reads as pure AI-infra engineering. The strategic frame that makes it one project rather than four is named in the Mercantile Thesis and the Field Statement. Quantitative Mercantilism is the discipline of owning the bottleneck the rest of the economy has to route through. In 2026, capital is chasing the AI utility layer — foundation models — and ignoring the appliance layer: sovereign deployment, hardware-native runtime, multi-agent orchestration. The error is structurally identical to Edison-Electric selling kits in 1887 while Westinghouse quietly owned the transformer.
The four-repo Zig stack is the appliance-layer move: the kernels and primitives an AI-serving system needs in order to run without paying rent on every token to a Python-on-rented-GPU substrate. Alignment evaluation, on-device inference, sovereign deployment, and agent-oriented composition each route through these primitives. This is the engineering side of the Mercantile Thesis: the appliance Edison did not build, applied to the AI stack.
Why Zig
Three reasons that matter for inference infrastructure:
- No hidden allocations. Every allocator is explicit; you see the exact heap behaviour of the forward pass.
- Comptime as the type system. Tensor shapes are checked at compile time in many paths; dimension mismatches that would be a Python RuntimeError become Zig compile errors.
- C interop without FFI tax. The path to wrapping CUDA / cuBLAS / cuDNN / MLX without writing a Python C extension is an extern fn declaration, not a setuptools / pybind11 exercise.
The trade-off is ecosystem age: Zig 0.16 is still pre-1.0, and the standard library has API churn each release. The repos target 0.16.0 specifically.
Roadmap
Public lanes after this post:
- GPU kernels for vllm-zig (Ampere+ first; the ARCHITECTURE.md Phase 2 plan).
- Quantised inference (int8 / nf4) for vllm-zig so token latency on commodity hardware becomes presentable against llama.cpp.
- Multi-model support in tokenizers-zig beyond Llama-family (Qwen, Mistral are present; add Gemma and Phi families).
- IVF-PQ tuning in faiss-zig to close the gap to faiss-cpu on million-scale collections.