Building a Semantic Search with Weaviate
We’re building a semantic search application on Weaviate that lets users query a dataset and get back contextually similar results. Why? Because traditional keyword-based search is often frustrating and ineffective.
Prerequisites
- Docker 20.10+, Docker Compose 1.29+
- Go 1.19+ (if you’re working with custom Weaviate setups)
- Python 3.11+, pip install "weaviate-client>=3.0.0,<4.0.0" (the code in this post uses the v3 client API, which newer v4 releases no longer provide)
Step 1: Set Up Weaviate with Docker Compose
First things first, let’s get Weaviate running. We’ll use Docker Compose for a quick setup. This makes it easy to run the database locally and get development going in no time.
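version: '3.8'
services:
  weaviate:
    image: semitechnologies/weaviate:latest
    environment:
      - QUERY_DEFAULTS_LIMIT=25
      - AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true
      # Keep the data path in sync with the volume mount below
      - PERSISTENCE_DATA_PATH=/var/lib/weaviate/data
      # We supply our own vectors, so no vectorizer module is needed
      - DEFAULT_VECTORIZER_MODULE=none
    ports:
      - "8080:8080"
    volumes:
      - ./data:/var/lib/weaviate/data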
Run this command to start Weaviate:
docker-compose up -d
Why Docker? It isolates your environment, allowing you to focus on development without having to deal with installation issues. I once decided to install everything manually for a project and ended up with a non-functioning mess. Lesson learned: Docker is a lifesaver.
Step 2: Confirm Weaviate is Running
After your Weaviate instance is up, let’s confirm it’s running correctly. You can hit the following endpoint:
curl http://localhost:8080/v1/schema
If you see a JSON response, you’re good to go. If not, run docker ps to check whether the container is still running. Sometimes Docker just decides to stop for no reason, like that friend who says they’ll come to your party but ghosts you at the last minute.
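If you would rather script this check than eyeball a curl response, here is a minimal sketch that polls Weaviate's readiness endpoint (/v1/.well-known/ready) until it answers. The URL assumes the local setup from Step 1, and the helper names is_ready and wait_until_ready are my own, not part of any Weaviate library.

```python
import time
import urllib.error
import urllib.request

# Assumed address: the local instance started by the Docker Compose file in Step 1.
READY_URL = "http://localhost:8080/v1/.well-known/ready"


def is_ready(url=READY_URL, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_ready(check=is_ready, attempts=10, delay=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready() right after docker-compose up -d gives the container a few seconds to boot before the rest of your script starts talking to it.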
Step 3: Create a Class in Weaviate
Now, let’s define a class that you will use for storing data. This is essentially a schema that tells Weaviate how to handle your data objects.
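import weaviate

# Connect to the local instance (v3 Python client API)
client = weaviate.Client("http://localhost:8080")

# create_class() registers one class; schema.create() expects a full schema with a "classes" list
client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",  # vectors are supplied by us, not by a Weaviate module
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        },
        {
            "name": "embedding",
            "dataType": ["number[]"]
        }
    ]
})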
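Why this structure? The “content” property holds your text data, and the “embedding” property keeps a copy of each document’s vector representation. Keep in mind that Weaviate’s vector search runs against the vector attached to the object itself, not against ordinary properties, so the same vector also has to be passed as the object’s vector when the object is created. Expect to run into open issues if you try to couple Weaviate with a different data structure.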
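One thing that bit me: running the schema step twice typically fails because the class already exists. A small guard like the sketch below avoids that. It assumes the v3 client, where client.schema.get() returns a dict with a "classes" list and client.schema.create_class() registers a single class; the helper names class_exists and ensure_class are hypothetical.

```python
def class_exists(schema, class_name):
    """Return True if `schema` (the dict from client.schema.get()) lists class_name."""
    return any(c.get("class") == class_name for c in schema.get("classes", []))


def ensure_class(client, class_obj):
    """Create the class only when it is missing. Returns True if it was created."""
    if class_exists(client.schema.get(), class_obj["class"]):
        return False
    # Assumption: v3 client method for registering a single class definition.
    client.schema.create_class(class_obj)
    return True
```

With this in place, ensure_class(client, {"class": "Document", ...}) is safe to call every time the script starts.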
Step 4: Add Data to Weaviate
Let’s put some documents into your class. This snippet adds data, along with their embeddings, which can either be generated manually using a model like BERT or through an external service.
import numpy as np

documents = [
    {"content": "This is a document about AI.", "embedding": np.random.rand(300).tolist()},
    {"content": "Another paper discussing machine learning.", "embedding": np.random.rand(300).tolist()}
]

for doc in documents:
    # Pass the embedding as the object's vector so near-vector search can use it
    client.data_object.create(doc, class_name="Document", vector=doc["embedding"])
Notice how I used random embeddings (not recommended for production)? If I had a dime for every time I’ve done something dumb like that, I could retire. In real applications, these embeddings should come from a pretrained transformer model.
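If you are not ready to wire in a real model yet, a deterministic placeholder is at least easier to debug than random numbers, because the same text always maps to the same vector. The sketch below uses the hashing trick on whitespace-separated tokens. To be clear, this is a toy stand-in of my own (toy_embedding is a hypothetical helper): it captures word overlap, not meaning, and it is not how Weaviate or any transformer model produces embeddings.

```python
import hashlib
import math


def toy_embedding(text, dim=300):
    """Deterministic bag-of-words vector via the hashing trick (illustrative only)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Hash each token to a stable bucket index in [0, dim).
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    # L2-normalize so vectors of different-length texts are comparable.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

Swapping np.random.rand(300).tolist() for toy_embedding(doc_text) keeps the 300-dimensional shape used in this post while making repeated runs reproducible.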
Step 5: Querying with Semantic Search
Finally, you can search for documents semantically using Weaviate’s vector search capabilities. The following example queries the database:
query_vector = np.random.rand(300).tolist() # Replace with your actual query embedding
response = client.query.get("Document", ["content"]).with_near_vector({"vector": query_vector}).do()
print(response)
This allows you to find documents based on the meaning behind the query rather than just matching keywords. It’s a huge win for applications that deal with complex datasets and natural language.
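To make "based on meaning" a bit more concrete: a near-vector query ranks stored vectors by how close they are to the query vector, and Weaviate's default distance metric is cosine. The sketch below does the same ranking by brute force in plain Python. It is only an illustration of the idea; Weaviate itself uses an approximate index rather than comparing against every vector, and the function names here are my own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, items):
    """items is a list of (label, vector) pairs; returns labels, most similar first."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored]
```

For example, with a query of [1.0, 0.0], a document at [0.9, 0.1] ranks above one at [0.0, 1.0] because it points in nearly the same direction.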
The Gotchas
- Document Size: If your document is too large, you might hit the limit—Weaviate has constraints on the size of individual objects. Split long documents into smaller parts when needed.
- Embedding Quality: Garbage in, garbage out. Poor-quality embeddings will result in irrelevant search results. Be meticulous when choosing a model for generating embeddings.
- Indexing Time: Depending on the size of your dataset, indexing can take time. Don’t expect instant results after adding a bunch of documents. Patience is key here.
- Open Issues: Regularly check for open issues on the Weaviate GitHub page. At the time of writing, there are 579 open issues, some of which could impact your project.
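For the document size gotcha, splitting long text into overlapping chunks before embedding is a common approach. Here is a minimal sketch that splits on whitespace; the function name and the default sizes (200 words per chunk, 20 words of overlap) are arbitrary assumptions of mine, not Weaviate limits.

```python
def split_into_chunks(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words, sharing `overlap` words between neighbors."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be stored as its own Document object with its own embedding, and the overlap helps keep sentences that straddle a boundary searchable.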
Full Code Example
import weaviate
import numpy as np

# Create Weaviate client (v3 Python client API)
client = weaviate.Client("http://localhost:8080")

# Create the class; create_class() takes a single class definition
client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",  # vectors are supplied by us
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        },
        {
            "name": "embedding",
            "dataType": ["number[]"]
        }
    ]
})

# Add documents with random embeddings (placeholder only, not for production)
documents = [
    {"content": "This is a document about AI.", "embedding": np.random.rand(300).tolist()},
    {"content": "Another paper discussing machine learning.", "embedding": np.random.rand(300).tolist()}
]

for doc in documents:
    # Pass the embedding as the object's vector so near-vector search can use it
    client.data_object.create(doc, class_name="Document", vector=doc["embedding"])

# Querying using a random embedding
query_vector = np.random.rand(300).tolist()
response = client.query.get("Document", ["content"]).with_near_vector({"vector": query_vector}).do()
print(response)
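Printing the raw response works, but you usually want just the matching text. With the v3 client, the result of .do() is a plain dict shaped like the GraphQL response, with hits under data, Get, then the class name, and an "errors" key when the query fails. The helper below is a sketch built on that assumption; extract_contents is a hypothetical name.

```python
def extract_contents(response, class_name="Document", prop="content"):
    """Pull one property out of every hit in a Get query response dict."""
    if "errors" in response:
        # GraphQL-level failures are reported in the payload rather than raised.
        raise RuntimeError(f"Weaviate query failed: {response['errors']}")
    objects = response.get("data", {}).get("Get", {}).get(class_name) or []
    return [obj.get(prop) for obj in objects]
```

Calling extract_contents(response) on the result above gives a list of content strings in the order Weaviate returned them.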
What’s Next?
Try integrating a proper embedding generation process using a model like Sentence Transformers. That’ll enhance your search results significantly.
FAQ
1. Can Weaviate handle large datasets?
Yes, but performance will depend on your hardware and the quality of embeddings. Always monitor performance when scaling up.
2. Is Weaviate free to use?
Weaviate is open-source, so yes, but make sure to check the licensing. Currently, it’s released under the BSD-3-Clause license.
3. How do I visualize the vector data?
You can use Weaviate’s built-in GraphQL interface or create your own dashboards using tools like Grafana connected with Weaviate’s API.
Data Sources
Last updated May 03, 2026. Data sourced from official docs and community benchmarks.