Tokenizer Caching
SMG provides a two-level tokenizer cache that reduces tokenization overhead for repeated content. In typical production workloads, this achieves 60-90% cache hit rates.
Before you begin
- Completed the Getting Started guide
- Using gRPC workers (tokenization happens at the gateway)
- --model-path configured so SMG can load the tokenizer
How It Works
| Cache Level | Strategy | Best For |
|---|---|---|
| L0 (Exact Match) | Hash-based O(1) lookup for identical strings | Repeated system prompts, batch inference |
| L1 (Prefix Match) | Boundary-aligned prefix matching, tokenizes only the suffix | Multi-turn conversations, growing contexts |
In a multi-turn conversation, L1 avoids re-tokenizing the entire history: only the new messages are tokenized.
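The two lookup levels can be illustrated with a minimal Python sketch. This is not SMG's implementation: the `TwoLevelCache` class, the `<|eot|>` boundary marker, and the character-level `toy_tokenize` function are all hypothetical stand-ins. The sketch also has no eviction, whereas SMG bounds L0 by entry count and L1 by memory. It only shows the lookup order: exact match first, then the longest cached boundary-aligned prefix, then tokenization of whatever remains.

```python
from typing import Callable, Dict, List

BOUNDARY = "<|eot|>"  # hypothetical special-token marker treated as a safe split point


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one token id per character, so any split is safe."""
    return [ord(ch) for ch in text]


def boundary_ends(text: str) -> List[int]:
    """Return the offsets just past each BOUNDARY marker in text."""
    ends: List[int] = []
    pos = text.find(BOUNDARY)
    while pos != -1:
        end = pos + len(BOUNDARY)
        ends.append(end)
        pos = text.find(BOUNDARY, end)
    return ends


class TwoLevelCache:
    """Illustrative L0 (exact) + L1 (prefix) cache with no eviction."""

    def __init__(self, tokenize: Callable[[str], List[int]]) -> None:
        self.tokenize = tokenize
        self.l0: Dict[str, List[int]] = {}  # full string -> tokens
        self.l1: Dict[str, List[int]] = {}  # boundary-aligned prefix -> tokens
        self.stats = {"l0": 0, "l1": 0, "miss": 0}
        self.chars_tokenized = 0  # how much text actually reached the tokenizer

    def encode(self, text: str) -> List[int]:
        # L0: hash lookup on the whole string.
        if text in self.l0:
            self.stats["l0"] += 1
            return self.l0[text]

        # L1: look for the longest cached prefix that ends at a boundary.
        ends = boundary_ends(text)
        start, tokens = 0, []
        for end in reversed(ends):
            cached = self.l1.get(text[:end])
            if cached is not None:
                start, tokens = end, list(cached)
                self.stats["l1"] += 1
                break
        else:
            self.stats["miss"] += 1

        # Tokenize only the uncached remainder, one boundary segment at a time,
        # recording each new boundary-aligned prefix for later requests.
        for end in ends + [len(text)]:
            if end <= start:
                continue
            piece = text[start:end]
            tokens = tokens + self.tokenize(piece)
            self.chars_tokenized += len(piece)
            if end in ends:
                self.l1[text[:end]] = list(tokens)
            start = end

        self.l0[text] = tokens
        return tokens


cache = TwoLevelCache(toy_tokenize)
turn1 = "system: be brief" + BOUNDARY + "user: hi" + BOUNDARY
turn2 = turn1 + "assistant: hello" + BOUNDARY + "user: bye" + BOUNDARY

cache.encode(turn1)  # nothing cached yet: full tokenization
cache.encode(turn1)  # identical string: served from L0
cache.encode(turn2)  # history matches a cached prefix: only the new turns are tokenized
print(cache.stats, cache.chars_tokenized, len(turn2))
```

In this toy run the tokenizer sees each character of the conversation only once, even though the history is submitted again with every turn. A real subword tokenizer can only be split at boundaries where tokenizing the two halves separately gives the same result as tokenizing the whole, which is why the prefix match is boundary-aligned.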
Enable Caching
Both cache levels are disabled by default. Enable them with CLI flags:
L0 Only (Exact Match)
Best for workloads with many identical prompts (system prompts, batch processing):
smg \
--worker-urls grpc://worker:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 10000
L0 + L1 (Exact + Prefix Match)
Best for multi-turn chat applications:
smg \
--worker-urls grpc://worker:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 20000 \
--tokenizer-cache-enable-l1 \
--tokenizer-cache-l1-max-memory 104857600
Configuration Reference
L0 Cache
| Parameter | Default | Description |
|---|---|---|
| --tokenizer-cache-enable-l0 | false | Enable exact match cache |
| --tokenizer-cache-l0-max-entries | 10000 | Maximum number of cached entries |
Each entry uses ~2.2 KB of memory.
L1 Cache
| Parameter | Default | Description |
|---|---|---|
| --tokenizer-cache-enable-l1 | false | Enable prefix match cache |
| --tokenizer-cache-l1-max-memory | 52428800 (50 MB) | Maximum memory in bytes |
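The L1 limit is given as a raw byte count. The default 52428800 equals 50 × 1024 × 1024, and the 104857600 in the L0 + L1 example equals 100 × 1024 × 1024. A small conversion helper can reduce typing mistakes. The `mib_to_bytes` function below is a hypothetical convenience, not part of SMG:

```python
def mib_to_bytes(mib: int) -> int:
    """Convert a size in units of 1024 * 1024 bytes to a byte count."""
    return mib * 1024 * 1024


# Print the flag value for a few common sizes.
for mib in (25, 50, 100, 200):
    print(f"{mib} MB -> --tokenizer-cache-l1-max-memory {mib_to_bytes(mib)}")
```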
Memory Planning
L0 Sizing
| Entries | Memory | Recommended For |
|---|---|---|
| 1,000 | ~2.2 MB | Development, testing |
| 10,000 | ~22 MB | Standard production |
| 25,000 | ~55 MB | High-repetition workloads |
| 50,000 | ~110 MB | Large-scale deployments |
Set L0 entries to 1-2x the number of unique system prompt variants in your workload.
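The sizing rule and the table can be combined into a quick estimate. The sketch below assumes the ~2.2 KB per-entry figure from this guide and decimal units (1 MB = 1000 KB), which is what the table rows imply. The function names and the 1.5x default headroom are illustrative choices, not SMG settings:

```python
import math

L0_ENTRY_KB = 2.2  # approximate per-entry cost stated in this guide


def l0_memory_mb(max_entries: int) -> float:
    """Estimated L0 memory in MB for a given --tokenizer-cache-l0-max-entries."""
    return max_entries * L0_ENTRY_KB / 1000


def suggest_l0_entries(unique_prompts: int, headroom: float = 1.5) -> int:
    """Apply the 1-2x rule to the number of unique system prompt variants."""
    if not 1.0 <= headroom <= 2.0:
        raise ValueError("headroom should stay within the 1-2x guideline")
    return math.ceil(unique_prompts * headroom)


entries = suggest_l0_entries(4000)
print(entries, round(l0_memory_mb(entries), 1))
```

With 4,000 unique prompt variants and 1.5x headroom this suggests 6,000 entries, or roughly 13 MB.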
L1 Sizing
| Memory | Recommended For |
|---|---|
| 25 MB | Memory-constrained environments |
| 50 MB | Standard deployments (default) |
| 100 MB | Multi-turn conversation heavy |
| 200 MB | Long context applications |
Estimate ~1 KB per active conversation context for L1 sizing.
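The per-conversation estimate translates into a byte budget as follows. This is a rough sketch: the ~1 KB figure comes from this guide, treating 1 KB as 1024 bytes is an assumption, and both function names are hypothetical:

```python
L1_BYTES_PER_CONTEXT = 1024  # ~1 KB per active conversation context (assumes 1 KB = 1024 bytes)


def l1_max_memory_bytes(active_conversations: int) -> int:
    """Byte budget for --tokenizer-cache-l1-max-memory given a conversation count."""
    return active_conversations * L1_BYTES_PER_CONTEXT


def contexts_that_fit(max_memory_bytes: int) -> int:
    """Approximate number of conversation contexts a given budget can hold."""
    return max_memory_bytes // L1_BYTES_PER_CONTEXT


print(l1_max_memory_bytes(100_000))  # budget for 100,000 active conversations
print(contexts_that_fit(52428800))   # capacity of the 50 MB default
```

Under these assumptions the 50 MB default covers about 51,000 active conversation contexts. Actual usage depends on conversation length, so treat the result as a starting point and adjust it against observed hit rates.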
Recommended Configurations
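For workloads with repeated system prompts: start from the L0 Only (Exact Match) command above and size --tokenizer-cache-l0-max-entries using the L0 Sizing table.
For chat applications with growing conversation history: start from the L0 + L1 (Exact + Prefix Match) command above and size --tokenizer-cache-l1-max-memory using the L1 Sizing table.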
Next Steps
- Tokenizer Caching Concepts — Cache architecture, special token boundaries, monitoring metrics, PromQL queries
- gRPC Workers — Enable gateway-level tokenization with gRPC mode
- Load Balancing — Choose a routing policy (cache-aware routing uses tokenizer results)