Tokenizer Caching

SMG implements a two-level tokenizer cache that dramatically reduces tokenization overhead for repeated content, achieving 60-90% cache hit rates in typical production workloads.


Overview

L0 Cache (Exact Match)

Hash-based O(1) lookup for complete tokenization results. Achieves 60-90% hit rate for repeated prompts like system instructions.

L1 Cache (Prefix Match)

Boundary-aligned prefix matching that tokenizes only the suffix on hit. Ideal for multi-turn conversations with growing context.

Memory Efficient

~2.2 KB per L0 entry, with configurable L1 memory bounds. Cache budgets scale from ~36 MB for small deployments to ~210 MB for large ones.

Observable

Full Prometheus metrics for hit rates, memory usage, and cache sizing. Monitor and tune in real-time.


Why Cache Tokenization?

Tokenization (converting text to token IDs) happens on every request. While individual tokenization is fast (~1-5 ms), the cost adds up at scale.
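
For example, at 1,000 requests per second and ~3 ms per call, tokenization alone amounts to roughly 3 CPU-seconds of work every wall-clock second.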

System Prompts

Same instructions sent with every request. Perfect for L0 exact-match caching.

Multi-Turn Conversations

Growing context with shared prefix. L1 cache tokenizes only new messages.

RAG Applications

Common document snippets across queries. Both L0 and L1 provide benefits.

Batch Processing

Similar prompt templates with variable parts. High L0 hit rates.


Cache Architecture

Tokenization Cache Architecture

L0 Cache (Exact Match)

Router-level cache storing complete tokenization results for exact string matches.

  • Hash-based O(1) lookup
  • ~2.2KB per entry
  • 60-90% hit rate for repeated prompts
  • LRU eviction when full

Best for: Repeated system prompts, identical requests, batch inference
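
A minimal sketch of this behavior (Python; the class and method names are hypothetical illustrations, not SMG's actual implementation): an ordered hash map keyed by the exact input string, evicting the least recently used entry once the configured capacity is exceeded.

from collections import OrderedDict

class L0Cache:
    """Sketch of an exact-match tokenization cache with LRU eviction."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # exact input text -> token IDs

    def get(self, text):
        tokens = self._entries.get(text)     # O(1) hash lookup
        if tokens is not None:
            self._entries.move_to_end(text)  # mark as most recently used
        return tokens

    def put(self, text, tokens):
        self._entries[text] = tokens
        self._entries.move_to_end(text)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used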

L1 Cache (Prefix Match)

Router-level cache storing tokens at special token boundaries for prefix reuse.

  • Tokenize only the suffix on hit
  • Cross-request deduplication
  • Memory-bounded (configurable)
  • Automatic boundary detection

Best for: Multi-turn conversations, growing contexts, incremental content
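
The prefix idea can be sketched the same way (Python; hypothetical names, with memory bounding and real special-token detection omitted): token IDs are stored for prefixes that end at a boundary token, and a hit means only the remaining suffix is tokenized.

class L1Cache:
    """Sketch of boundary-aligned prefix caching (unbounded, illustrative)."""

    def __init__(self, tokenize, boundary="<|im_end|>"):
        self.tokenize = tokenize  # fallback tokenizer: str -> list of token IDs
        self.boundary = boundary  # special token marking safe split points
        self._prefixes = {}       # boundary-aligned prefix text -> token IDs

    def encode(self, text):
        # Find the longest cached prefix (boundary-aligned by construction).
        best = max((p for p in self._prefixes if text.startswith(p)),
                   key=len, default="")
        # Tokenize only the suffix. Splitting at a special-token boundary is
        # what guarantees prefix tokens + suffix tokens == whole-text tokens.
        tokens = self._prefixes.get(best, []) + self.tokenize(text[len(best):])
        # Remember the longest boundary-aligned prefix of this text.
        end = text.rfind(self.boundary)
        if end != -1 and end + len(self.boundary) > len(best):
            cut = end + len(self.boundary)
            # A real implementation would slice `tokens` rather than retokenize.
            self._prefixes[text[:cut]] = self.tokenize(text[:cut])
        return tokens

The linear scan over stored prefixes keeps the sketch short; a real implementation would use an indexed structure and enforce the configured memory bound.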


Special Token Boundaries (L1)

L1 cache identifies boundaries at special tokens for efficient prefix matching:

| Model Family | Boundary Tokens | Example |
|---|---|---|
| ChatML (Qwen, Yi) | <\|im_start\|>, <\|im_end\|> | Each message boundary |
| Llama 3 | <\|begin_of_text\|>, <\|eot_id\|>, <\|start_header_id\|> | Text start, turn end |
| GPT | <\|endoftext\|> | Document end |
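
For instance, in a ChatML prompt every <|im_end|> marks a safe split point: the cached tokens for everything up to the last <|im_end|> can be reused verbatim, and only the text after it needs tokenizing.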

Multi-Turn Conversation Example

Consider how caching helps a typical chat application:

Turn 1 (Cold)

System: You are a helpful assistant.
User: What is Python?

L0: Miss → Full tokenization (~3ms)
L1: Miss → Store at boundaries

Turn 2 (Warm)

System: You are a helpful assistant.
User: What is Python?
Assistant: Python is a programming language...
User: How do I install it?

L0: Miss (text changed)
L1: Hit! → Only tokenize new content (~0.5ms)

Result: Turn 2 tokenizes only ~20% of the content, saving ~2.5ms per request.


Configuration

Model & Tokenizer Paths

--model-path

HuggingFace model ID or local path to load the tokenizer from.

Option: --model-path
Default: None

Usage:

# HuggingFace model ID (downloads automatically)
smg --model-path meta-llama/Llama-3.1-8B-Instruct ...

# Local path to model directory
smg --model-path /models/llama-3.1-8b-instruct ...

# Local path to tokenizer.json file
smg --model-path /models/llama-3.1-8b-instruct/tokenizer.json ...

When pointing to a directory, SMG automatically searches for:

  1. tokenizer.json (HuggingFace fast tokenizer format)
  2. tokenizer_config.json (fallback)
  3. vocab.json (fallback)
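
A sketch of that search order (Python; resolve_tokenizer_file is a hypothetical helper for illustration, not SMG's code):

from pathlib import Path

def resolve_tokenizer_file(model_path):
    """Resolve a tokenizer file using the documented search order."""
    path = Path(model_path)
    if path.is_file():  # e.g. a direct path to tokenizer.json
        return path
    for name in ("tokenizer.json", "tokenizer_config.json", "vocab.json"):
        candidate = path / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no tokenizer found under {model_path}")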

--tokenizer-path

Explicit path to a tokenizer file. Overrides --model-path for tokenizer loading.

Option: --tokenizer-path
Default: None

When to use:

  • When the tokenizer is stored separately from the model
  • When using a custom tokenizer with a standard model
  • When the model directory structure is non-standard
# Use model for metadata but separate tokenizer
smg \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-path /custom/tokenizers/llama3-tokenizer.json \
  ...

Chat Templates

Chat templates convert structured messages (system, user, assistant roles) into the prompt format expected by specific models. SMG uses Jinja2 templates, the same format used by HuggingFace Transformers.

--chat-template

Path to a Jinja2 chat template file.

Option: --chat-template
Default: Auto-discovered from model

Template discovery priority:

  1. Explicit --chat-template path (highest priority)
  2. chat_template.json in model directory
  3. chat_template.jinja in model directory
  4. Any .jinja file in model directory
  5. chat_template field in tokenizer_config.json
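
The same cascade can be sketched in code (Python; discover_chat_template is hypothetical, and reading the chat_template field assumes the usual HuggingFace JSON layout):

import json
from pathlib import Path

def discover_chat_template(model_dir, explicit=None):
    """Resolve a chat template using the documented priority order."""
    if explicit:  # 1. explicit --chat-template path wins
        return Path(explicit).read_text()
    root = Path(model_dir)
    if (root / "chat_template.json").exists():  # 2. JSON wrapper
        return json.loads((root / "chat_template.json").read_text())["chat_template"]
    if (root / "chat_template.jinja").exists():  # 3. raw template file
        return (root / "chat_template.jinja").read_text()
    jinja_files = sorted(root.glob("*.jinja"))   # 4. any .jinja file
    if jinja_files:
        return jinja_files[0].read_text()
    config = root / "tokenizer_config.json"      # 5. tokenizer_config field
    if config.exists():
        return json.loads(config.read_text()).get("chat_template")
    return None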

Template Variables

Chat templates use Jinja2 syntax with access to:

| Variable | Description |
|---|---|
| messages | Array of message objects with role and content |
| add_generation_prompt | Boolean to add assistant prompt prefix |
| tools | Optional array of tool definitions |
| documents | Optional array of document context |

Template Examples

ChatML (Qwen, Yi)

{%- for message in messages %}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{% endfor %}
{%- if add_generation_prompt %}
<|im_start|>assistant
{% endif %}

Llama 3

<|begin_of_text|>{% for message in messages %}
<|start_header_id|>{{ message.role }}<|end_header_id|>

{{ message.content }}<|eot_id|>
{% endfor %}
{% if add_generation_prompt %}<|start_header_id|>assistant<|end_header_id|>

{% endif %}
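
These templates can be exercised directly with the jinja2 library (SMG uses the same Jinja2 format as HuggingFace Transformers). A minimal sketch rendering the ChatML template above, with whitespace control simplified for readability:

from jinja2 import Template

CHATML = (
    "{% for message in messages %}"
    "<|im_start|>{{ message.role }}\n"
    "{{ message.content }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
print(Template(CHATML).render(messages=messages, add_generation_prompt=True))
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# What is Python?<|im_end|>
# <|im_start|>assistant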

L0 Cache Configuration

The L0 cache stores complete tokenization results for exact string matches.

--tokenizer-cache-enable-l0

Enable the L0 exact match cache.

Option: --tokenizer-cache-enable-l0
Default: false

--tokenizer-cache-l0-max-entries

Maximum number of entries in the L0 cache.

Option: --tokenizer-cache-l0-max-entries
Default: 10000

L1 Cache Configuration

The L1 cache stores tokenization results at special token boundaries.

--tokenizer-cache-enable-l1

Enable the L1 prefix matching cache.

Option: --tokenizer-cache-enable-l1
Default: false

--tokenizer-cache-l1-max-memory

Maximum memory for the L1 cache in bytes.

Option: --tokenizer-cache-l1-max-memory
Default: 52428800 (50 MB)
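
The value is a plain byte count: for example, 100 MB is 100 × 1024 × 1024 = 104857600 bytes, the value used in the multi-turn configuration below.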

Memory Planning

L0 Cache Sizing

Each L0 entry uses approximately 2.2 KB:

| Entries | Memory | Recommended For |
|---|---|---|
| 1,000 | ~2.2 MB | Development, testing |
| 10,000 | ~22 MB | Standard production |
| 25,000 | ~55 MB | High-repetition workloads |
| 50,000 | ~110 MB | Large-scale deployments |
| 100,000 | ~220 MB | Enterprise with many prompt variants |

Sizing Guideline

Set L0 entries to 1-2x the number of unique system prompt variants in your workload.

L1 Cache Sizing

L1 cache is bounded by total memory:

| Memory | Recommended For |
|---|---|
| 25 MB | Memory-constrained environments |
| 50 MB | Standard deployments (default) |
| 100 MB | Multi-turn conversation heavy |
| 200 MB | Long context applications |

Sizing Guideline

Estimate ~1 KB per active conversation context for L1 sizing.
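
At that estimate, the 50 MB default covers on the order of 50,000 concurrent conversation contexts.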

Total Cache Budget

Small Deployment

  • L0: 5,000 entries (~11 MB)
  • L1: 25 MB
  • Total: ~36 MB

Medium Deployment

  • L0: 25,000 entries (~55 MB)
  • L1: 50 MB
  • Total: ~105 MB

Large Deployment

  • L0: 50,000 entries (~110 MB)
  • L1: 100 MB
  • Total: ~210 MB

High-Throughput Chat

For workloads with repeated system prompts.

smg \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 50000

Expected: 60-90% cache hit rate

Multi-Turn Conversations

For chat applications with varying conversation lengths.

smg \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 20000 \
  --tokenizer-cache-enable-l1 \
  --tokenizer-cache-l1-max-memory 104857600

Expected: L0 catches exact repeats, L1 accelerates prefix sharing

Memory-Constrained

For deployments with limited memory.

smg \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 5000

Expected: Moderate benefit with minimal memory

No Caching

For stateless deployments or when memory is critical.

smg \
  --model-path meta-llama/Llama-3.1-8B-Instruct
# Caching is disabled by default

Use when: Diverse, unique requests dominate


Complete Example

Production configuration with tokenizer and caching:

smg \
  --worker-urls http://worker1:8000 http://worker2:8000 \
  --policy cache_aware \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --chat-template /templates/llama3.jinja \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 25000 \
  --tokenizer-cache-enable-l1 \
  --tokenizer-cache-l1-max-memory 104857600 \
  --host 0.0.0.0 \
  --port 8080

Monitoring & Observability

SMG exposes comprehensive Prometheus metrics for cache monitoring:

Cache Metrics

| Metric | Description |
|---|---|
| smg_tokenizer_cache_l0_hits_total | L0 cache hit count |
| smg_tokenizer_cache_l0_misses_total | L0 cache miss count |
| smg_tokenizer_cache_l0_entries | Current L0 cache entries |
| smg_tokenizer_cache_l1_hits_total | L1 cache hit count |
| smg_tokenizer_cache_l1_misses_total | L1 cache miss count |
| smg_tokenizer_cache_l1_memory_bytes | Current L1 memory usage |

Useful PromQL Queries

L0 Hit Rate

rate(smg_tokenizer_cache_l0_hits_total[5m]) /
(rate(smg_tokenizer_cache_l0_hits_total[5m]) +
 rate(smg_tokenizer_cache_l0_misses_total[5m]))

Combined Hit Rate

(rate(smg_tokenizer_cache_l0_hits_total[5m]) +
 rate(smg_tokenizer_cache_l1_hits_total[5m])) /
(rate(smg_tokenizer_cache_l0_hits_total[5m]) +
 rate(smg_tokenizer_cache_l0_misses_total[5m]))
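
The denominator is intentional: every request consults L0 first, so total requests equal L0 hits plus L0 misses, and L1 hits are a subset of L0 misses (as in the turn-by-turn flow above), so the ratio never exceeds 1.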

Alert Thresholds

| Metric | Warning | Critical | Action |
|---|---|---|---|
| L0 hit rate | <50% | <30% | Review prompt patterns |
| L1 hit rate | <30% | <15% | Check conversation patterns |
| L0 entries | >90% capacity | >95% | Increase max-entries |
| L1 memory | >80% limit | >90% | Increase max-memory |

Integration with Other Caching Layers

Tokenizer caching is part of SMG's three-level caching strategy:

| Layer | What's Cached | Benefit |
|---|---|---|
| Tokenizer L0/L1 | Token IDs | Skip tokenization |
| Router radix tree | Prefix → worker mapping | Consistent routing decisions |
| Worker KV cache | Attention states | Skip prefill computation |

Synergy with Cache-Aware Routing

When using the cache_aware routing policy, tokenizer cache results feed directly into the radix tree for routing decisions. This creates a powerful optimization chain where cached tokens determine worker selection for maximum KV cache reuse.


What's Next?

Cache-Aware Routing

Maximize KV cache hits with prefix-based worker affinity.

Cache-Aware Routing →

Metrics Reference

Complete list of cache-related metrics.

Metrics Reference →

Load Balancing

Compare all available routing policies.

Load Balancing →