Skip to content

Tokenizer Caching

SMG provides a two-level tokenizer cache that reduces tokenization overhead for repeated content. In typical production workloads, this achieves 60-90% cache hit rates.

Before you begin

  • Completed the Getting Started guide
  • Using gRPC workers (tokenization happens at the gateway)
  • --model-path configured so SMG can load the tokenizer

How It Works

Cache Level Strategy Best For
L0 (Exact Match) Hash-based O(1) lookup for identical strings Repeated system prompts, batch inference
L1 (Prefix Match) Boundary-aligned prefix matching, tokenizes only the suffix Multi-turn conversations, growing contexts

On a multi-turn conversation, L1 avoids re-tokenizing the entire history — only new messages are tokenized.


Enable Caching

Both cache levels are disabled by default. Enable them with CLI flags:

L0 Only (Exact Match)

Best for workloads with many identical prompts (system prompts, batch processing):

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 10000

L0 + L1 (Exact + Prefix Match)

Best for multi-turn chat applications:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 20000 \
  --tokenizer-cache-enable-l1 \
  --tokenizer-cache-l1-max-memory 104857600

Configuration Reference

L0 Cache

Parameter Default Description
--tokenizer-cache-enable-l0 false Enable exact match cache
--tokenizer-cache-l0-max-entries 10000 Maximum number of cached entries

Each entry uses ~2.2 KB of memory.

L1 Cache

Parameter Default Description
--tokenizer-cache-enable-l1 false Enable prefix match cache
--tokenizer-cache-l1-max-memory 52428800 (50 MB) Maximum memory in bytes

Memory Planning

L0 Sizing

Entries Memory Recommended For
1,000 ~2.2 MB Development, testing
10,000 ~22 MB Standard production
25,000 ~55 MB High-repetition workloads
50,000 ~110 MB Large-scale deployments

Set L0 entries to 1-2x the number of unique system prompt variants in your workload.

L1 Sizing

Memory Recommended For
25 MB Memory-constrained environments
50 MB Standard deployments (default)
100 MB Multi-turn conversation heavy
200 MB Long context applications

Estimate ~1 KB per active conversation context for L1 sizing.


For workloads with repeated system prompts:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 50000

For chat applications with growing conversation history:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 20000 \
  --tokenizer-cache-enable-l1 \
  --tokenizer-cache-l1-max-memory 104857600

Moderate benefit with minimal memory:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 5000

Next Steps

  • Tokenizer Caching Concepts — Cache architecture, special token boundaries, monitoring metrics, PromQL queries
  • gRPC Workers — Enable gateway-level tokenization with gRPC mode
  • Load Balancing — Choose a routing policy (cache-aware routing uses tokenizer results)