How to Optimize Token Usage with Milvus (Step by Step)

📖 11 min read · 2,059 words · Updated Mar 21, 2026

Handling token usage efficiently with Milvus can reduce unnecessary compute costs and make your embeddings—and thus your vector search—faster and smarter. While many folks treat token optimization in Milvus as a black box, I’m going to show you exactly how to cut down token bloat in your RAG pipelines, vector search, and downstream querying without sacrificing precision.

Prerequisites

  • Python 3.11+
  • Milvus Server 2.2.9+
  • pymilvus>=2.2.9
  • Basic familiarity with embeddings and vector search concepts
  • Access to GPU or CPU-based vector encoding (like OpenAI embeddings, Huggingface models, or similar)
  • Familiarity with token limits of your LLMs (e.g., GPT-4’s 8k tokens) and how they drive cost and latency

What You’re Actually Building

We’re crafting a vector search pipeline that trims the fat from your text inputs so Milvus only stores what truly enriches your query context, all while balancing embedding quality. If you’ve ever pumped all your source documents straight into Milvus and watched costs and token counts explode, this is the fix.

Step-by-Step

Step 1: Measure Your Tokens Before Committing

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

sample_text = "This paragraph will be tokenized and counted to prevent wasting tokens."
print(f"Token count: {count_tokens(sample_text)}")

Why: If you blindly send massive texts as-is to the embedding stage, you’re burning tokens you don’t need to pay for or store. The GPT-2 tokenizer is a cheap, easy proxy that maps roughly to OpenAI-style token counts. This initial counting stage prevents overly long chunks from sneaking into Milvus.

Errors you’ll hit: Using a tokenizer that doesn’t match your LLM leads to under- or over-counting. For example, Huggingface tokenizers for T5 differ significantly from GPT-3/4 tokenizers. Always check which tokenizer aligns with your model’s usage.

Step 2: Chunk Text Intelligently – Go Semantic Over Static

def chunk_text(text: str, max_tokens: int = 500):
    words = text.split()
    chunks = []
    current_chunk = []
    current_tokens = 0
    for word in words:
        word_tokens = count_tokens(word)
        if current_tokens + word_tokens > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_tokens = word_tokens
        else:
            current_chunk.append(word)
            current_tokens += word_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

long_text = " ".join(["word"] * 2000) # Example long text
split_chunks = chunk_text(long_text, max_tokens=500)
print(f"Created {len(split_chunks)} chunks.")

Why: Splitting by token counts instead of fixed character lengths prevents accidentally blowing token limits at embedding or querying time. I’ve seen pipelines crash or degrade because token counts spiked unexpectedly when whitespace or multi-byte UTF-8 characters appeared. Semantic chunking (on sentence boundaries or paragraph breaks) often works better, but a simple token cap works reliably.

Errors you’ll hit: Naive character chops create poor query matches — context fragments don’t represent coherent meaning. Oversized chunks cause embedding API errors or push you out of free tiers fast.
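If you want to try the semantic route, here is a minimal sketch of sentence-boundary chunking. It uses a naive regex sentence splitter and a whitespace word count as a stand-in for a real tokenizer, so treat both as assumptions to swap out (use your model's tokenizer for real counts):

```python
import re

def chunk_by_sentences(text: str, max_tokens: int = 500) -> list[str]:
    """Pack whole sentences into chunks that stay under a token budget.

    Uses a whitespace word count as a cheap token proxy; swap in your
    model's real tokenizer for production counts.
    """
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        sentence_tokens = len(sentence.split())  # proxy for a real token count
        if current and current_tokens + sentence_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += sentence_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "Milvus stores vectors. Tokens cost money. " * 100
chunks = chunk_by_sentences(text, max_tokens=50)
print(len(chunks), max(len(c.split()) for c in chunks))
```

Because whole sentences move together, no chunk ends mid-thought, which tends to produce more coherent query matches than word-level splitting.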

Step 3: Deduplicate Before Pushing to Milvus

from hashlib import sha256

def deduplicate_chunks(chunks):
    seen = set()
    unique_chunks = []
    for chunk in chunks:
        fingerprint = sha256(chunk.encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            unique_chunks.append(chunk)
            seen.add(fingerprint)
    return unique_chunks

unique_chunks = deduplicate_chunks(split_chunks)
print(f"Deduplicated to {len(unique_chunks)} unique chunks.")

Why: Redundancy is the enemy. I can’t stress this enough—many real datasets have weird repetition, whether corrupted PDFs or logs. Deduplication avoids wasted embedding and Milvus storage tokens, saving compute dollars and preventing search noise later.

Errors you’ll hit: Skipping deduplication fills Milvus with duplicate vectors, slows down search, and inflates storage. Your token-based budgeting will explode.
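Exact SHA-256 hashing only catches byte-identical copies. A cheap extension, sketched below, is to hash a normalized form so that casing and whitespace variants (common in PDF extraction) collapse together; the `normalize` helper is my own illustration, not part of any library:

```python
from hashlib import sha256

def normalize(chunk: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies
    (extra spaces, casing from PDF extraction) hash to the same value."""
    return " ".join(chunk.lower().split())

def deduplicate_normalized(chunks):
    seen, unique = set(), []
    for chunk in chunks:
        fp = sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if fp not in seen:
            seen.add(fp)
            unique.append(chunk)  # keep the original text, hash the normalized form
    return unique

docs = ["Token  budgets matter.", "token budgets matter.", "Milvus scales."]
print(deduplicate_normalized(docs))
```

True near-duplicate detection (e.g. MinHash) goes further, but normalization alone already catches the most common repetition for free.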

Step 4: Encode Chunks Into Vectors Efficiently

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def encode_chunks(chunks):
    embeddings = model.encode(chunks, convert_to_tensor=True)
    return embeddings

embeddings = encode_chunks(unique_chunks)
print(f"Produced embeddings shape: {embeddings.shape}")

Why: Choose smaller, faster embedding models unless you specifically need the biggest transformers for semantic precision. For most applications, models like “all-MiniLM-L6-v2” strike the best compromise between vector dimension (384), speed, and token budget. High-dimensional embeddings aren’t always better; they can bloat your Milvus index and slow down search.

Errors you’ll hit: Attempting OpenAI embedding for thousands of long chunks without preprocessing will burn tokens and hit API rate limits quickly. Also, embedding without batching reduces throughput.

Step 5: Store With Metadata to Filter Context

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
    FieldSchema(name="token_count", dtype=DataType.INT64),
    FieldSchema(name="chunk_text", dtype=DataType.VARCHAR, max_length=1024)
]

schema = CollectionSchema(fields, "Document chunks with token metadata")
collection = Collection("doc_chunks", schema)

token_counts = [count_tokens(chunk) for chunk in unique_chunks]

# Field order must match the schema (minus the auto_id primary key):
# embedding, token_count, chunk_text.
entities = [
    embeddings.tolist(),
    token_counts,
    unique_chunks
]

collection.insert(entities)
collection.create_index("embedding", {"index_type": "IVF_FLAT", "params": {"nlist": 128}, "metric_type": "L2"})
collection.load()

Why: Storing token counts alongside embeddings lets you filter or rank chunks cheaply without re-tokenizing later. The token metadata cuts down queries that try to cram too much context and gives you control over Milvus’s payload size at search time.

Step 6: Query With Token Budgets in Mind

def search_similar(query: str, top_k=5, max_query_tokens=1000):
    query_token_count = count_tokens(query)
    if query_token_count > max_query_tokens:
        raise ValueError(f"Query exceeds token budget: {query_token_count} > {max_query_tokens}")

    query_embedding = model.encode([query])[0].tolist()
    results = collection.search(
        [query_embedding],
        "embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["token_count", "chunk_text"]
    )

    filtered_results = [res for res in results[0] if res.entity.get("token_count", 0) + query_token_count < max_query_tokens]
    return filtered_results

query = "Efficient token usage with Milvus"
result_docs = search_similar(query)
for hit in result_docs:
    print(hit.entity.get("chunk_text"))

Why: Your LLM’s context window is precious. If you forget to check your query tokens plus relevant chunk tokens, you blow your limits—leading to errors or truncated prompts. Milvus’s filtering based on stored token metadata helps you dynamically stay under budget.

Errors you’ll hit: Passing too large combined token sets into your generator leads to failed completions or weird context jumps. I once had a system crash after ignoring token caps. Not fun.
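Note that the filter above checks each chunk against the budget individually; for multi-chunk prompts you want the combined total under budget. A minimal sketch of greedy budget packing, written against plain `(token_count, chunk_text)` pairs so it runs without Milvus (the real hits come from `search_similar`):

```python
def pack_context(hits, query_tokens: int, max_tokens: int = 1000):
    """Greedily keep the best-ranked hits whose combined token count,
    plus the query itself, stays under the budget.

    Each hit is a (token_count, chunk_text) pair, assumed already sorted
    best-first, as a vector search returns them.
    """
    kept, used = [], query_tokens
    for token_count, chunk_text in hits:
        if used + token_count > max_tokens:
            continue  # skip this hit but still consider smaller ones after it
        kept.append(chunk_text)
        used += token_count
    return kept, used

hits = [(400, "chunk A"), (500, "chunk B"), (300, "chunk C")]
context, total = pack_context(hits, query_tokens=50, max_tokens=1000)
print(context, total)
```

The `continue` (rather than `break`) lets a small, lower-ranked chunk slip into leftover budget that a large one could not use.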

The Gotchas

  1. Embedding API Token Counts Can Surprise You: OpenAI embeddings count tokens you don’t always expect, like implicit prompt tokens or separators. Always do dry runs with counts per chunk before bulk embedding.
  2. Milvus Storage Costs Climb Fast: Milvus’s Apache-2.0 licensed repo milvus-io/milvus has 43,421 stars—yeah, it’s popular—but the vector dimension and number of vectors you store cause rapid RAM/storage usage. Oversized vectors without token pruning inflate costs.
  3. Tokenizers Don’t Agree: If your chunk creation tokenizer and LLM tokenizer mismatch, you’ll either overestimate or underestimate tokens. Use the exact tokenizer your LLM requires.
  4. Index Creation Time & Memory: Using high nlist values in IVF_FLAT indexes improves recall but adds latency and RAM draw. Find your sweet spot. I usually start at nlist=128.
  5. Chunk Coherence vs Size: Bigger chunks hold more context but cost more tokens. Smaller chunks cause fragmentation and drop precision. Experiment.
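For gotchas 1 and 2, a dry-run cost projection is cheap insurance. A minimal sketch; the per-1k-token price is a placeholder you should replace with your provider's current rate:

```python
def estimate_embedding_cost(chunk_token_counts, price_per_1k_tokens: float):
    """Sum token counts and project the embedding bill before any API call.

    price_per_1k_tokens is a placeholder; plug in your provider's current
    rate, which changes over time.
    """
    total_tokens = sum(chunk_token_counts)
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens

counts = [480, 510, 230, 505]  # token counts from the dry-run pass
tokens, cost = estimate_embedding_cost(counts, price_per_1k_tokens=0.0001)
print(tokens, cost)
```

Run this over the full corpus before bulk embedding; if the projected bill or token total looks wrong, you catch it before spending anything.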

Full Code

from transformers import GPT2TokenizerFast
from sentence_transformers import SentenceTransformer
from hashlib import sha256
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Step 1: Initialize tokenizer and models
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = SentenceTransformer('all-MiniLM-L6-v2')

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

def chunk_text(text: str, max_tokens: int = 500):
    words = text.split()
    chunks = []
    current_chunk = []
    current_tokens = 0
    for word in words:
        word_tokens = count_tokens(word)
        if current_tokens + word_tokens > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_tokens = word_tokens
        else:
            current_chunk.append(word)
            current_tokens += word_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

def deduplicate_chunks(chunks):
    seen = set()
    unique_chunks = []
    for chunk in chunks:
        fingerprint = sha256(chunk.encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            unique_chunks.append(chunk)
            seen.add(fingerprint)
    return unique_chunks

def encode_chunks(chunks):
    embeddings = model.encode(chunks, convert_to_tensor=True)
    return embeddings

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
    FieldSchema(name="token_count", dtype=DataType.INT64),
    FieldSchema(name="chunk_text", dtype=DataType.VARCHAR, max_length=1024)
]
schema = CollectionSchema(fields, "Document chunks with token metadata")

collection = Collection("doc_chunks", schema)

def insert_chunks(chunks):
    unique_chunks = deduplicate_chunks(chunks)
    embeddings = encode_chunks(unique_chunks)
    token_counts = [count_tokens(chunk) for chunk in unique_chunks]

    # Field order must match the schema (minus the auto_id primary key):
    # embedding, token_count, chunk_text.
    entities = [
        embeddings.tolist(),
        token_counts,
        unique_chunks
    ]

    collection.insert(entities)
    collection.create_index("embedding", {"index_type": "IVF_FLAT", "params": {"nlist": 128}, "metric_type": "L2"})
    collection.load()

def search_similar(query: str, top_k=5, max_query_tokens=1000):
    query_token_count = count_tokens(query)
    if query_token_count > max_query_tokens:
        raise ValueError(f"Query token count ({query_token_count}) exceeds limit ({max_query_tokens})")

    query_embedding = model.encode([query])[0].tolist()
    results = collection.search(
        [query_embedding],
        "embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["token_count", "chunk_text"]
    )

    filtered_results = [res for res in results[0] if res.entity.get("token_count", 0) + query_token_count < max_query_tokens]
    return filtered_results

# Example Usage
if __name__ == "__main__":
    raw_text = ("This is an example paragraph that we'd like to chunk, deduplicate, embed, and store. " * 100)
    chunks = chunk_text(raw_text)
    insert_chunks(chunks)

    query = "example paragraph token usage"
    found = search_similar(query)
    for hit in found:
        print(hit.entity.get("chunk_text"))

What’s Next

Now that you’ve tamed token bloat feeding into Milvus, the logical next step is to implement dynamic query prompt trimming—meaning your application should monitor combined token length (query plus retrieved context) and remove or paraphrase low-value chunks automatically before calling your LLM. This will save you dollars and prevent runtime token limit errors in production.
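A minimal sketch of that trimming step, dropping the lowest-value chunks until the prompt fits. The scoring and the `(score, token_count, text)` shape are my own illustration; a paraphrase or summarize step could replace outright dropping:

```python
def trim_to_budget(scored_chunks, query_tokens: int, max_tokens: int):
    """Drop the lowest-scoring chunks until query + context fits the window.

    scored_chunks: list of (score, token_count, text); higher score means
    more valuable context.
    """
    kept = sorted(scored_chunks, key=lambda c: c[0], reverse=True)
    while kept and query_tokens + sum(t for _, t, _ in kept) > max_tokens:
        kept.pop()  # the lowest-scoring chunk goes first
    return [text for _, _, text in kept]

scored = [(0.9, 600, "high value"), (0.4, 500, "low value"), (0.7, 300, "mid value")]
print(trim_to_budget(scored, query_tokens=100, max_tokens=1200))
```

Run this as the last gate before the LLM call so no upstream change can sneak an oversized prompt past it.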

FAQ

Q: How do I confirm that my token counts match the LLM’s internal counting?

A: The safest bet is to use the tokenizer provided by your LLM. For OpenAI models, tiktoken is the canonical tokenizer. GPT-2 tokenizer is a reasonable proxy but not exact. Always run test cases with your model's counting to verify.

Q: What’s the maximum number of vectors Milvus can handle before performance suffers?

A: Milvus is optimized for millions of vectors, but practically, your index type, vector dimension, and hardware dictate performance. For example, IVF_FLAT with nlist=128 is manageable at a few million vectors on decent servers, but latency and RAM can spike without batching and pruning.

Q: Can I automate token pruning at insertion time?

A: Absolutely, but be wary. You can drop or summarize chunks exceeding token limits before embedding, but over-pruning reduces semantic richness, hurting downstream search quality. Use adaptive thresholds tuned on your dataset.
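One way to make the threshold adaptive, sketched with the standard library: derive the cap from the observed distribution of chunk sizes rather than hard-coding it. The percentile choice and the stand-in distribution are assumptions to tune on your data:

```python
import statistics

def adaptive_token_cap(token_counts, percentile: float = 0.95):
    """Derive a pruning threshold from the dataset itself: chunks above the
    chosen percentile of observed token counts get flagged for summarization
    instead of being cut by a fixed, hand-tuned cap."""
    cuts = statistics.quantiles(token_counts, n=100)  # 99 percentile cut points
    return cuts[int(percentile * 100) - 1]

counts = list(range(100, 600, 5))  # stand-in distribution of chunk sizes
cap = adaptive_token_cap(counts, percentile=0.95)
oversized = [c for c in counts if c > cap]
print(cap, len(oversized))
```

Recompute the cap whenever the corpus changes materially, so the pruning rate stays stable as document mix shifts.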

Milvus Stats Overview

  • GitHub Stars: 43,421 (indicates high adoption and community support)
  • Forks: 3,909 (active contribution and custom use cases)
  • Open Issues: 1,098 (ongoing development and bug tracking)
  • License: Apache-2.0 (permissive, favorable for enterprise use)
  • Last Updated: March 21, 2026 (project is actively maintained)

Recommendations for Different Developer Types

1. The Startup Hacker: If you’re building quick MVPs, focus on off-the-shelf embedding models like 'all-MiniLM-L6-v2' and basic token chunking to keep drift and costs down. Use Milvus's built-in indexing and keep an eye on your token usage with simple counters.

2. The Data Scientist: Experiment with semantic chunking approaches—try sentence boundary detection or paragraph encoding—to improve embedding fidelity. Incorporate token count metadata for pruning at query time. You might also look into custom embedding fine-tuning based on chunk token complexity.

3. The Enterprise Engineer: Build an adaptive pipeline that incorporates real-time token budget monitoring, chunk deduplication, dynamic vector dimensionality, and index tuning on Milvus. Integrate with your LLM pipelines tightly to prevent overrun scenarios and optimize compute spend.

Data as of March 21, 2026. Sources: https://github.com/milvus-io/milvus, Milvus LangChain Token Limits, Milvus LLM Optimization

✍️ Written by Jake Chen, AI technology writer and researcher.