Tokenizer Caching

SMG provides a two-level tokenizer cache that reduces tokenization overhead for repeated content. In typical production workloads, the cache achieves hit rates of 60-90%.

Before you begin

You have completed the Getting Started guide.
You are running gRPC workers (in gRPC mode, tokenization happens at the gateway).
--model-path is configured so SMG can load the tokenizer.

How It Works

Cache Level	Strategy	Best For
L0 (Exact Match)	Hash-based O(1) lookup for identical strings	Repeated system prompts, batch inference
L1 (Prefix Match)	Boundary-aligned prefix matching, tokenizes only the suffix	Multi-turn conversations, growing contexts

In a multi-turn conversation, L1 avoids re-tokenizing the entire history: the cached prefix is reused and only the new messages are tokenized.

Enable Caching

Both cache levels are disabled by default. Enable them with CLI flags:

L0 Only (Exact Match)

Best for workloads with many identical prompts (system prompts, batch processing):

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 10000

L0 + L1 (Exact + Prefix Match)

Best for multi-turn chat applications:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 20000 \
  --tokenizer-cache-enable-l1 \
  --tokenizer-cache-l1-max-memory 104857600
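
The L1 limit is passed in bytes, and the values used on this page are binary multiples (104857600 is 100 × 1024 × 1024, and the 52428800 default is 50 × 1024 × 1024). As a small sketch, the value can be derived with shell arithmetic instead of being typed by hand. The variable names here are illustrative and are not SMG settings.

```shell
# Desired L1 budget in MiB (illustrative variable, not an SMG option).
L1_MAX_MIB=100

# Convert to bytes: 1 MiB = 1024 * 1024 bytes.
L1_MAX_BYTES=$((L1_MAX_MIB * 1024 * 1024))

# Print the value to pass to --tokenizer-cache-l1-max-memory.
echo "$L1_MAX_BYTES"
```

The printed number can then be supplied as the argument to --tokenizer-cache-l1-max-memory in the command above.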

Configuration Reference

L0 Cache

Parameter	Default	Description
`--tokenizer-cache-enable-l0`	`false`	Enable exact match cache
`--tokenizer-cache-l0-max-entries`	`10000`	Maximum number of cached entries

Each entry uses ~2.2 KB of memory.

L1 Cache

Parameter	Default	Description
`--tokenizer-cache-enable-l1`	`false`	Enable prefix match cache
`--tokenizer-cache-l1-max-memory`	`52428800` (50 MB)	Maximum memory in bytes

Memory Planning

L0 Sizing

Entries	Memory	Recommended For
1,000	~2.2 MB	Development, testing
10,000	~22 MB	Standard production
25,000	~55 MB	High-repetition workloads
50,000	~110 MB	Large-scale deployments

Set --tokenizer-cache-l0-max-entries to 1-2x the number of unique system prompt variants in your workload.
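
The sizing rule and the ~2.2 KB per-entry figure can be combined into a quick estimate. The following is a rough sketch, assuming a hypothetical workload with 8,000 unique system prompt variants and the upper end of the 1-2x range. The variable names are illustrative, and the result is an approximation in decimal megabytes, matching the table above.

```shell
# Hypothetical workload figure: replace with your own measurement.
UNIQUE_PROMPTS=8000
# Headroom multiplier from the 1-2x guideline (upper end chosen here).
HEADROOM=2

L0_MAX_ENTRIES=$((UNIQUE_PROMPTS * HEADROOM))

# ~2.2 KB per entry, kept in integer arithmetic as 22/10 KB.
L0_EST_KB=$((L0_MAX_ENTRIES * 22 / 10))
# Decimal megabytes, rounded down.
L0_EST_MB=$((L0_EST_KB / 1000))

echo "--tokenizer-cache-l0-max-entries $L0_MAX_ENTRIES (~${L0_EST_MB} MB)"
```

With these assumed inputs the estimate is 16,000 entries at roughly 35 MB, which falls between the 10,000 and 25,000 rows of the table.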

L1 Sizing

Memory	Recommended For
25 MB	Memory-constrained environments
50 MB	Standard deployments (default)
100 MB	Multi-turn conversation heavy
200 MB	Long context applications

Estimate ~1 KB per active conversation context for L1 sizing.
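
As a back-of-the-envelope sketch of that estimate, the L1 budget in bytes can be computed from the expected number of active conversations. The conversation count below is a hypothetical input, and treating 1 KB as 1024 bytes is an assumption, since the guideline is approximate.

```shell
# Hypothetical number of concurrently active conversation contexts.
ACTIVE_CONVERSATIONS=80000
# Assumption: ~1 KB per context, taken here as 1024 bytes.
BYTES_PER_CONTEXT=1024

L1_MAX_BYTES=$((ACTIVE_CONVERSATIONS * BYTES_PER_CONTEXT))

# Candidate value for --tokenizer-cache-l1-max-memory.
echo "$L1_MAX_BYTES"
```

For these assumed inputs the result is 81920000 bytes (about 78 MiB), which sits between the 50 MB and 100 MB rows of the table, so rounding up to the 100 MB tier would leave some headroom.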

Recommended Configurations

High-Throughput Chat

For workloads with repeated system prompts:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 50000

Multi-Turn Conversations

For chat applications with growing conversation history:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 20000 \
  --tokenizer-cache-enable-l1 \
  --tokenizer-cache-l1-max-memory 104857600

Memory-Constrained

Moderate benefit with minimal memory:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 5000

Next Steps

Tokenizer Caching Concepts — Cache architecture, special token boundaries, monitoring metrics, PromQL queries
gRPC Workers — Enable gateway-level tokenization with gRPC mode
Load Balancing — Choose a routing policy (cache-aware routing uses tokenizer results)