How to Optimize Token Usage with Milvus (Step by Step)
Handling token usage efficiently with Milvus can reduce unnecessary compute costs and make your embeddings, and thus your vector search, faster and smarter. Many folks treat token optimization in Milvus as a black box; I'm going to show you exactly how to cut down token bloat in your RAG pipelines, vector search, and downstream querying without sacrificing precision.
Prerequisites
- Python 3.11+
- Milvus Server 2.2.9+ (latest stable as of March 21, 2026)
- pymilvus>=2.2.9
- Basic familiarity with embeddings and vector search concepts
- Access to GPU or CPU-based vector encoding (like OpenAI embeddings, Huggingface models, or similar)
- Familiarity with token limits of your LLMs (e.g., GPT-4’s 8k tokens) and how they drive cost and latency
What You’re Actually Building
We’re crafting a vector search pipeline that trims the fat from your text inputs so Milvus only stores what truly enriches your query context, all while balancing embedding quality. If you’ve ever pumped all your source documents straight into Milvus and watched costs and token counts explode, this is the fix.
Step-by-Step
Step 1: Measure Your Tokens Before Committing
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))
sample_text = "This paragraph will be tokenized and counted to prevent wasting tokens."
print(f"Token count: {count_tokens(sample_text)}")
Why: If you blindly send massive texts as-is to the embedding stage, you’re burning tokens you don’t need to pay for or store. The GPT-2 tokenizer is a cheap, easy proxy that maps roughly to OpenAI-style token counts. This initial counting stage prevents overly long chunks from sneaking into Milvus.
Errors you’ll hit: Using a tokenizer that doesn’t match your LLM leads to under- or over-counting. For example, Huggingface tokenizers for T5 differ significantly from GPT-3/4 tokenizers. Always check which tokenizer aligns with your model’s usage.
Step 2: Chunk Text Intelligently – Go Semantic Over Static
def chunk_text(text: str, max_tokens: int = 500):
    words = text.split()
    chunks = []
    current_chunk = []
    current_tokens = 0
    for word in words:
        word_tokens = count_tokens(word)
        if current_tokens + word_tokens > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_tokens = word_tokens
        else:
            current_chunk.append(word)
            current_tokens += word_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
long_text = " ".join(["word"] * 2000) # Example long text
split_chunks = chunk_text(long_text, max_tokens=500)
print(f"Created {len(split_chunks)} chunks.")
Why: Splitting by token counts instead of fixed character lengths prevents accidentally blowing token limits at embedding or querying time. I’ve seen pipelines crash or degrade because token counts spiked unexpectedly when spaces or UTF-8 characters appeared. Semantic chunking (splitting on sentence boundaries or paragraph breaks) often works better, but a simple token cap works reliably.
Errors you’ll hit: Naive character chops create poor query matches — context fragments don’t represent coherent meaning. Oversized chunks cause embedding API errors or push you out of free tiers fast.
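A rough sketch of the semantic alternative, splitting on sentence boundaries with a naive regex (swap in nltk or spaCy for messier text). The word-count version of count_tokens below is a stand-in for illustration; use your real tokenizer-based counter:

```python
import re

def count_tokens(text: str) -> int:
    # stand-in for a real tokenizer-based counter
    return len(text.split())

def chunk_by_sentences(text: str, max_tokens: int = 500):
    # naive sentence splitter: break after ., !, ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_tokens = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and current_tokens + n > max_tokens:
            # flush the current chunk before it exceeds the budget
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sent)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "First sentence here. Second one follows! A third? " * 50
print(len(chunk_by_sentences(text, max_tokens=60)))
```

Because chunks never break mid-sentence, each one stays a coherent unit of meaning, which tends to produce better query matches than the word-level cap above.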
Step 3: Deduplicate Before Pushing to Milvus
from hashlib import sha256
def deduplicate_chunks(chunks):
    seen = set()
    unique_chunks = []
    for chunk in chunks:
        fingerprint = sha256(chunk.encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            unique_chunks.append(chunk)
            seen.add(fingerprint)
    return unique_chunks
unique_chunks = deduplicate_chunks(split_chunks)
print(f"Deduplicated to {len(unique_chunks)} unique chunks.")
Why: Redundancy is the enemy. I can’t stress this enough—many real datasets have weird repetition, whether corrupted PDFs or logs. Deduplication avoids wasted embedding and Milvus storage tokens, saving compute dollars and preventing search noise later.
Errors you’ll hit: Skipping deduplication fills Milvus with duplicate vectors, slows down search, and inflates storage. Your token-based budgeting will explode.
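One caveat: an exact SHA-256 only catches byte-identical chunks. Real datasets (corrupted PDFs, logs) often repeat the same text with different whitespace or casing. A sketch of a normalized fingerprint that catches those trivially reformatted repeats:

```python
from hashlib import sha256
import re

def normalize(chunk: str) -> str:
    # collapse whitespace and lowercase so reformatted repeats collide
    return re.sub(r"\s+", " ", chunk).strip().lower()

def deduplicate_normalized(chunks):
    seen, unique = set(), []
    for chunk in chunks:
        fp = sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if fp not in seen:
            seen.add(fp)
            unique.append(chunk)  # keep the first-seen original form
    return unique

print(deduplicate_normalized(["Hello  World", "hello world", "bye"]))
# the second entry is dropped as a near-duplicate of the first
```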
Step 4: Encode Chunks Into Vectors Efficiently
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def encode_chunks(chunks):
    embeddings = model.encode(chunks, convert_to_tensor=True)
    return embeddings
embeddings = encode_chunks(unique_chunks)
print(f"Produced embeddings shape: {embeddings.shape}")
Why: Choose smaller, faster embedding models unless you specifically need the biggest transformers for semantic precision. For most applications, models like “all-MiniLM-L6-v2” strike the best compromise between vector dimension (384), speed, and token budget. High-dimensional embeddings aren’t always better; they can bloat your Milvus index and slow down search.
Errors you’ll hit: Attempting OpenAI embedding for thousands of long chunks without preprocessing will burn tokens and hit API rate limits quickly. Also, embedding without batching reduces throughput.
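sentence-transformers accepts a batch_size argument on encode directly, but for embedding APIs that don’t, a generic batching helper keeps each call small. A minimal sketch (the commented usage assumes the model and unique_chunks names from this tutorial):

```python
def batched(items, batch_size=64):
    # yield fixed-size slices so each embedding call stays bounded
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# usage against an embedding model or API:
#   all_embeddings = []
#   for batch in batched(unique_chunks, 64):
#       all_embeddings.extend(model.encode(batch))
print([len(b) for b in batched(list(range(10)), 4)])
```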
Step 5: Store With Metadata to Filter Context
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
connections.connect("default", host="localhost", port="19530")
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
    FieldSchema(name="token_count", dtype=DataType.INT64),
    FieldSchema(name="chunk_text", dtype=DataType.VARCHAR, max_length=4096)  # 500-token chunks easily exceed 1024 characters
]
schema = CollectionSchema(fields, "Document chunks with token metadata")
collection = Collection("doc_chunks", schema)
token_counts = [count_tokens(chunk) for chunk in unique_chunks]
entities = [
    embeddings.tolist(),  # column order must match the schema, skipping the auto_id field
    token_counts,
    unique_chunks
]
collection.insert(entities)
collection.create_index("embedding", {"index_type": "IVF_FLAT", "params": {"nlist": 128}, "metric_type": "L2"})
collection.load()
Why: Storing token counts alongside embeddings lets you filter or rank chunks cheaply without re-tokenizing later. The token metadata cuts down queries that try to cram in too much context and gives you control over Milvus’s payload size at search time.
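Milvus can also apply boolean filter expressions server-side at search time, so oversized chunks never leave the database. A sketch of building that expression for the token_count field defined above (the search call itself needs a live Milvus server, so it’s shown as a comment):

```python
def token_filter_expr(max_chunk_tokens: int) -> str:
    # boolean expression for the `expr` parameter of collection.search
    return f"token_count <= {max_chunk_tokens}"

# usage against the collection defined above (requires a running server):
#   collection.search([query_embedding], "embedding",
#                     param={"metric_type": "L2", "params": {"nprobe": 10}},
#                     limit=5, expr=token_filter_expr(400),
#                     output_fields=["chunk_text"])
print(token_filter_expr(400))
```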
Step 6: Query With Token Budgets in Mind
def search_similar(query: str, top_k=5, max_query_tokens=1000):
    query_token_count = count_tokens(query)
    if query_token_count > max_query_tokens:
        raise ValueError(f"Query exceeds token budget: {query_token_count} > {max_query_tokens}")
    query_embedding = model.encode([query])[0].tolist()
    results = collection.search(
        [query_embedding],
        "embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["token_count", "chunk_text"]
    )
    filtered_results = [res for res in results[0] if res.entity.get("token_count", 0) + query_token_count < max_query_tokens]
    return filtered_results
query = "Efficient token usage with Milvus"
result_docs = search_similar(query)
for hit in result_docs:
    print(hit.entity.get("chunk_text"))
Why: Your LLM’s context window is precious. If you forget to check your query tokens plus relevant chunk tokens, you blow your limits—leading to errors or truncated prompts. Milvus’s filtering based on stored token metadata helps you dynamically stay under budget.
Errors you’ll hit: Passing too large combined token sets into your generator leads to failed completions or weird context jumps. I once had a system crash after ignoring token caps. Not fun.
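Note that the per-hit filter in search_similar checks each chunk against the budget independently; selecting several chunks can still overrun the combined limit. A greedy packer that respects the cumulative budget, assuming hits arrive in relevance order as (token_count, chunk_text) pairs:

```python
def pack_within_budget(hits, query_tokens: int, max_tokens: int = 1000):
    # hits: (token_count, chunk_text) pairs, best match first
    selected, used = [], query_tokens
    for token_count, text in hits:
        if used + token_count > max_tokens:
            continue  # skip chunks that would blow the combined budget
        selected.append(text)
        used += token_count
    return selected

hits = [(300, "a"), (500, "b"), (400, "c")]
print(pack_within_budget(hits, query_tokens=100, max_tokens=1000))
# "c" is skipped: 100 + 300 + 500 + 400 would exceed 1000
```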
The Gotchas
- Embedding API Token Counting Is Hurting You: OpenAI embeddings count tokens you don’t always expect, like implicit prompt tokens or separators. Always do dry runs with counts per chunk before bulk embedding.
- Milvus Storage Costs Climb Fast: Milvus’s Apache-2.0 licensed repo milvus-io/milvus has 43,421 stars—yeah, it’s popular—but the vector dimension and number of vectors you store cause rapid RAM/storage usage. Oversized vectors without token pruning inflate costs.
- Tokenizers Don’t Agree: If your chunk creation tokenizer and LLM tokenizer mismatch, you’ll either overestimate or underestimate tokens. Use the exact tokenizer your LLM requires.
- Index Creation Time & Memory: Using high nlist values in IVF_FLAT indexes improves recall but adds latency and RAM draw. Find your sweet spot. I usually start at nlist=128.
- Chunk Coherence vs Size: Bigger chunks hold more context but cost more tokens. Smaller chunks cause fragmentation and drop precision. Experiment.
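To find the nlist sweet spot mentioned above, it helps to sweep a few candidate index configurations and benchmark recall and latency on your own data. A small helper producing params in the shape collection.create_index expects (the candidate values are illustrative starting points, not recommendations):

```python
def ivf_flat_params(nlist: int) -> dict:
    # index params dict in the shape pymilvus' create_index expects
    return {"index_type": "IVF_FLAT", "params": {"nlist": nlist}, "metric_type": "L2"}

# higher nlist = finer partitioning (better recall with enough nprobe,
# but more RAM and longer index build time)
for nlist in (128, 512, 2048):
    print(ivf_flat_params(nlist))
```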
Full Code
from transformers import GPT2TokenizerFast
from sentence_transformers import SentenceTransformer
from hashlib import sha256
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
# Step 1: Initialize tokenizer and models
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = SentenceTransformer('all-MiniLM-L6-v2')
def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

def chunk_text(text: str, max_tokens: int = 500):
    words = text.split()
    chunks = []
    current_chunk = []
    current_tokens = 0
    for word in words:
        word_tokens = count_tokens(word)
        if current_tokens + word_tokens > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_tokens = word_tokens
        else:
            current_chunk.append(word)
            current_tokens += word_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

def deduplicate_chunks(chunks):
    seen = set()
    unique_chunks = []
    for chunk in chunks:
        fingerprint = sha256(chunk.encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            unique_chunks.append(chunk)
            seen.add(fingerprint)
    return unique_chunks

def encode_chunks(chunks):
    embeddings = model.encode(chunks, convert_to_tensor=True)
    return embeddings
# Connect to Milvus
connections.connect("default", host="localhost", port="19530")
# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
    FieldSchema(name="token_count", dtype=DataType.INT64),
    FieldSchema(name="chunk_text", dtype=DataType.VARCHAR, max_length=4096)  # 500-token chunks easily exceed 1024 characters
]
schema = CollectionSchema(fields, "Document chunks with token metadata")
collection = Collection("doc_chunks", schema)

def insert_chunks(chunks):
    unique_chunks = deduplicate_chunks(chunks)
    embeddings = encode_chunks(unique_chunks)
    token_counts = [count_tokens(chunk) for chunk in unique_chunks]
    entities = [
        embeddings.tolist(),  # column order must match the schema, skipping the auto_id field
        token_counts,
        unique_chunks
    ]
    collection.insert(entities)
    collection.create_index("embedding", {"index_type": "IVF_FLAT", "params": {"nlist": 128}, "metric_type": "L2"})
    collection.load()
def search_similar(query: str, top_k=5, max_query_tokens=1000):
    query_token_count = count_tokens(query)
    if query_token_count > max_query_tokens:
        raise ValueError(f"Query token count ({query_token_count}) exceeds limit ({max_query_tokens})")
    query_embedding = model.encode([query])[0].tolist()
    results = collection.search(
        [query_embedding],
        "embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["token_count", "chunk_text"]
    )
    filtered_results = [res for res in results[0] if res.entity.get("token_count", 0) + query_token_count < max_query_tokens]
    return filtered_results
# Example Usage
if __name__ == "__main__":
    raw_text = ("This is an example paragraph that we'd like to chunk, deduplicate, embed, and store. " * 100)
    chunks = chunk_text(raw_text)
    insert_chunks(chunks)
    query = "example paragraph token usage"
    found = search_similar(query)
    for hit in found:
        print(hit.entity.get("chunk_text"))
What’s Next
Now that you’ve tamed token bloat feeding into Milvus, the logical next step is to implement dynamic query prompt trimming—meaning your application should monitor combined token length (query plus retrieved context) and remove or paraphrase low-value chunks automatically before calling your LLM. This will save you dollars and prevent runtime token limit errors in production.
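A minimal sketch of that trimming loop, assuming each retrieved chunk carries a relevance score and a token count (the dict shape and field names here are illustrative, not from any particular library):

```python
def trim_context(chunks, budget: int):
    # chunks: dicts with "score", "tokens", "text" (illustrative shape)
    kept = sorted(chunks, key=lambda c: c["score"], reverse=True)
    while kept and sum(c["tokens"] for c in kept) > budget:
        kept.pop()  # drop the lowest-scored chunk until we fit the budget
    return [c["text"] for c in kept]

chunks = [
    {"score": 0.9, "tokens": 400, "text": "high"},
    {"score": 0.5, "tokens": 400, "text": "mid"},
    {"score": 0.2, "tokens": 400, "text": "low"},
]
print(trim_context(chunks, budget=900))
# the lowest-scored chunk is dropped to get under 900 tokens
```

Paraphrasing or summarizing dropped chunks instead of discarding them is the natural next refinement, at the cost of an extra LLM call.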
FAQ
Q: How do I confirm that my token counts match the LLM’s internal counting?
A: The safest bet is to use the tokenizer provided by your LLM. For OpenAI models, tiktoken is the canonical tokenizer. GPT-2 tokenizer is a reasonable proxy but not exact. Always run test cases with your model's counting to verify.
Q: What’s the maximum number of vectors Milvus can handle before performance suffers?
A: Milvus is optimized for millions of vectors, but practically, your index type, vector dimension, and hardware dictate performance. For example, IVF_FLAT with nlist=128 is manageable at a few million vectors on decent servers, but latency and RAM can spike without batching and pruning.
Q: Can I automate token pruning at insertion time?
A: Absolutely, but be wary. You can drop or summarize chunks exceeding token limits before embedding, but over-pruning reduces semantic richness, hurting downstream search quality. Use adaptive thresholds tuned on your dataset.
Milvus Stats Overview
| Metric | Value | Comment |
|---|---|---|
| GitHub Stars | 43,421 | Indicates high adoption and community support |
| Forks | 3,909 | Demonstrates active contribution and custom use cases |
| Open Issues | 1,098 | Signposts ongoing development and bug tracking |
| License | Apache-2.0 | Permissive license favorable for enterprise use |
| Last Updated | March 21, 2026 | Project is actively maintained |
Recommendations for Different Developer Types
1. The Startup Hacker: If you’re building quick MVPs, focus on off-the-shelf embedding models like 'all-MiniLM-L6-v2' and basic token chunking to keep drift and costs down. Use Milvus's built-in indexing and keep an eye on your token usage with simple counters.
2. The Data Scientist: Experiment with semantic chunking approaches—try sentence boundary detection or paragraph encoding—to improve embedding fidelity. Incorporate token count metadata for pruning at query time. You might also look into custom embedding fine-tuning based on chunk token complexity.
3. The Enterprise Engineer: Build an adaptive pipeline that incorporates real-time token budget monitoring, chunk deduplication, dynamic vector dimensionality, and index tuning on Milvus. Integrate with your LLM pipelines tightly to prevent overrun scenarios and optimize compute spend.
Data as of March 21, 2026. Sources: https://github.com/milvus-io/milvus, Milvus LangChain Token Limits, Milvus LLM Optimization
Related Articles
- Containerizing Agents with Docker Compose
- AI agent deployment disaster recovery
- Scaling AI agents horizontally