RAG Using zvec Vector Datastore and Local Model
The zvec library implements a lightweight, lightning-fast, in-process vector database. Alibaba released zvec in February 2026. We will see how to use zvec and then build a high-performance RAG system, using the tiny model qwen3:1.7b as part of the application.
Note: The source code for this example can be found in Ollama_in_Action_Book/source-code/RAG_zvec/app.py. Not all the code in this file is listed here.
Introduction and Architecture
Building a Retrieval-Augmented Generation (RAG) pipeline entirely locally ensures absolute data privacy, eliminates API latency costs, and provides full control over the embedding and generation models. In this chapter, we construct a fully offline RAG system utilizing Ollama for both embeddings (embeddinggemma) and inference (qwen3:1.7b), paired with zvec, a lightweight, high-performance local vector database.
The architecture follows a classic two-phase RAG pattern, with a third step added to improve the user experience:
- Ingestion: Parse local text files, chunk the content, generate embeddings via Ollama, and index them into zvec.
- Retrieval & Generation: Embed the user query, perform a similarity search in zvec, and save the retrieved top-k chunks for processing by a local Ollama chat model.
- Presentation: Use a small LLM (qwen3:1.7b) to process the retrieved chunks, taking the user's original query into account, and format a subset of the text in the returned chunks for the user to read.
Design Analysis: Dependency Minimization
A notable design choice in our implementation is the reliance on Python’s standard library for network calls. By utilizing urllib.request instead of third-party libraries like requests or the official ollama-python client library, the dependency footprint is minimized exclusively to zvec. This reduces virtual environment overhead and potential version conflicts, prioritizing a lean deployment.
Implementation Walkthrough
Here we look at some of the code in the source file app.py.
Embedding and Chunking Strategy
The ingestion phase relies on a fixed-size overlapping window strategy. Here is an implementation of a chunking strategy:
```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
```
Analysis of code:
- Chunk Size (500 chars): This relatively small chunk size yields high-granularity embeddings. It reduces the risk of retrieving “diluted” context where a single chunk contains multiple disparate concepts.
- Overlap (50 chars): Crucial for preventing context loss at the boundaries of chunks. It ensures that a semantic concept bisected by a hard character limit is still captured cohesively in at least one chunk.
- Embedding Model: The system uses embeddinggemma. The Ollama API endpoint (/api/embeddings) is called directly. If the server fails to respond, a fallback zero-vector [0.0] * 768 is returned to prevent pipeline crashes, though logging or raising an exception might be preferred in production.
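The get_embedding function itself is not listed here; based on the description above (a direct call to /api/embeddings with a zero-vector fallback), it presumably looks something like the following sketch. The parameter names and timeout value are our assumptions, not code from app.py:

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434"

def get_embedding(text, model="embeddinggemma", dim=768):
    """Request an embedding from the local Ollama server.

    Falls back to a zero vector of the expected dimensionality if the
    server is unreachable, mirroring the behavior described above.
    """
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as res:
            body = json.loads(res.read().decode("utf-8"))
            return body["embedding"]
    except Exception:
        # Fallback zero vector keeps the pipeline alive; in production,
        # logging or raising would make failures visible.
        return [0.0] * dim
```

Note that a zero vector will still be indexed and can match queries spuriously, which is another reason to prefer explicit failure handling in production.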
Vector Storage with zvec
The zvec integration demonstrates a strictly typed, schema-driven approach to local vector storage.
```python
schema = zvec.CollectionSchema(
    name="example",
    vectors=zvec.VectorSchema("embedding", zvec.DataType.VECTOR_FP32, 768),
    fields=zvec.FieldSchema("text", zvec.DataType.STRING),
)
```
Analysis of code:
- Dimensionality Matching: The vector schema is hardcoded to 768 dimensions (FP32), which strictly matches the output tensor of the embeddinggemma model. Any change to the embedding model in the configuration must be accompanied by a corresponding update to this schema.
- Storage Path: The database is initialized locally at ./zvec_example. The implementation includes a defensive teardown (shutil.rmtree) of existing databases on startup. This is excellent for testing and iterative development, though destructive in a persistent production environment.
The following function builds the index using an embedding model for the local Ollama server:
```python
def build_index():
    """Index all text files from the data directory into zvec."""
    # Define collection schema (embeddinggemma: 768 dimensions)
    schema = zvec.CollectionSchema(
        name="example",
        vectors=zvec.VectorSchema("embedding", zvec.DataType.VECTOR_FP32, 768),
        fields=zvec.FieldSchema("text", zvec.DataType.STRING),
    )

    db_path = "./zvec_example"
    if os.path.exists(db_path):
        import shutil
        shutil.rmtree(db_path)

    collection = zvec.create_and_open(path=db_path, schema=schema)

    docs = []
    doc_count = 0
    for root, _, files in os.walk(config["data_dir"]):
        for file in files:
            if file.lower().endswith(config["extensions"]):
                try:
                    file_path = Path(root) / file
                    with open(file_path, "r", encoding="utf-8") as f:
                        content = f.read()
                    chunks = chunk_text(content)
                    for i, chunk in enumerate(chunks):
                        embedding = get_embedding(chunk)
                        docs.append(zvec.Doc(
                            id=f"{file}_{i}",
                            vectors={"embedding": embedding},
                            fields={"text": chunk},
                        ))
                    doc_count += len(chunks)
                except Exception:
                    pass  # skip unreadable files; consider logging in production

    if docs:
        collection.insert(docs)
    print(f"Indexed {doc_count} chunks from {config['data_dir']}")
    return collection
```
This function build_index initializes a local vector database and populates it with document embeddings. Specifically, it executes four main operations:
- Schema & Storage Initialization: Defines a strict schema for zvec (768-dimensional FP32 vectors and a string metadata field) and destructively recreates the local database directory (./zvec_example).
- File Traversal: Recursively walks a configured target directory (config["data_dir"]) to locate specific file types.
- Transformation & Embedding: Reads each file, splits it into overlapping chunks, and retrieves the vector embedding for each chunk via an external call (get_embedding).
- Batch Insertion: Accumulates all processed chunks and their embeddings into a single memory list (docs), then performs a bulk insert into the zvec collection.
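The config dictionary referenced throughout is not listed here. Based on the keys the code uses (data_dir, extensions, chat_model) and the example run below, it presumably resembles the following; the exact values in app.py may differ:

```python
# Illustrative configuration; the actual values in app.py may differ.
config = {
    "data_dir": "../data",        # directory walked for source documents
    "extensions": (".txt",),      # file suffixes accepted by str.endswith
    "chat_model": "qwen3:1.7b",   # Ollama model used for generation
}
```

Note that str.endswith accepts a tuple of suffixes, which is why extensions is a tuple rather than a single string.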
Retrieval and LLM Synthesis
The synthesis phase bridges the vector database and the Generative LLM. Function search identifies matching text chunks in the vector database:
```python
def search(collection, query, topk=5):
    """Search the zvec collection for chunks relevant to the query."""
    query_vector = get_embedding(query)
    results = collection.query(
        zvec.VectorQuery("embedding", vector=query_vector),
        topk=topk,
    )
    chunks = []
    for res in results:
        text = res.fields.get("text", "") if res.fields else ""
        if text:
            chunks.append(text)
    return chunks
Function search performs a Top-K retrieval. The default topk=5 retrieves roughly 2,500 characters of context. This easily fits within the context window of modern small models like qwen3:1.7b without causing attention dilution (“lost in the middle” syndrome).
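The arithmetic behind that claim is simple; the chars-per-token ratio below is a rough heuristic, not a property of any particular tokenizer:

```python
# Back-of-envelope estimate of retrieved context size for topk=5.
chunk_size = 500                    # characters per chunk (from chunk_text)
topk = 5
context_chars = chunk_size * topk   # 2,500 characters of retrieved context
approx_tokens = context_chars // 4  # rough heuristic: ~4 chars per token
print(context_chars, approx_tokens)  # 2500 625
```

At roughly 625 tokens, the retrieved context leaves ample headroom in qwen3:1.7b's context window for the system prompt, question, and answer.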
System Prompt Engineering and Using a Small LLM to Prepare Output for a User
The ask_ollama function utilizes strict prompt constraints: “Answer the user’s question using ONLY the context provided below. If the context does not contain enough information, say so.” This significantly mitigates hallucination by forcing the model to ground its response exclusively in the retrieved data.
```python
def ask_ollama(question, context_chunks):
    """Send retrieved chunks + user question to the Ollama chat model."""
    context = "\n\n---\n\n".join(context_chunks)
    system_prompt = (
        "You are a helpful assistant. Answer the user's question using ONLY "
        "the context provided below. If the context does not contain enough "
        "information, say so. Be concise and accurate.\n\n"
        f"Context:\n{context}"
    )
    url = f"{OLLAMA_BASE}/api/chat"
    payload = json.dumps({
        "model": config["chat_model"],
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    }).encode("utf-8")
    req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req) as res:
            body = json.loads(res.read().decode("utf-8"))
            return body["message"]["content"]
    except Exception as e:
        return f"Error calling Ollama chat: {e}"
```
Function ask_ollama uses stateless execution: The /api/chat call sets “stream”: False and does not maintain a conversation history array across loop iterations. This makes it a pure Q&A interface rather than a continuous chat, ensuring each answer is cleanly tied to a fresh zvec retrieval.
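If multi-turn memory were ever desired, one approach (a sketch of ours, not code from app.py) is to thread a growing messages list through each call. Here the transport is injected as a callable so the conversation logic stays independent of the network:

```python
def chat_with_history(history, question, send_fn):
    """Append the user turn, obtain a reply via send_fn, and record it.

    history  -- list of {"role": ..., "content": ...} dicts, mutated in place
    send_fn  -- callable taking the full messages list and returning reply
                text (e.g. a wrapper around the /api/chat call shown above)
    """
    history.append({"role": "user", "content": question})
    answer = send_fn(history)
    history.append({"role": "assistant", "content": answer})
    return answer
```

Each turn could still refresh the system prompt with fresh zvec retrieval; only the user/assistant turns would accumulate.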
Example Run
To run the pipeline, ensure the Ollama daemon is running locally on port 11434 and that both models (embeddinggemma and qwen3:1.7b) have been pulled. Place your .txt files in the ../data directory and execute the script. The system will build the index and immediately drop you into a REPL loop for interactive querying.
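The REPL loop itself is not listed above. A minimal sketch might look like the following, with the retrieval and generation functions injected as parameters so the loop has no hidden dependencies (the names chat_loop, search_fn, and ask_fn are ours, not from app.py):

```python
def chat_loop(collection, search_fn, ask_fn, input_fn=input, print_fn=print):
    """Simple question/answer loop: retrieve, generate, repeat until 'quit'."""
    while True:
        question = input_fn("You> ").strip()
        if question.lower() in ("quit", "exit"):
            print_fn("Goodbye!")
            break
        if not question:
            continue
        chunks = search_fn(collection, question)
        print_fn("Assistant>", ask_fn(question, chunks))
```

In app.py this would be invoked with the search and ask_ollama functions shown earlier.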
Here is an example run:
```text
$ uv run app.py
Using CPython 3.12.12
Creating virtual environment at: .venv
Installed 2 packages in 31ms
Building zvec index from text files …
Indexed 9 chunks from ../data

RAG chat ready (model: qwen3:1.7b)
Type your question, or 'quit' to exit.

You> who says economics is bullshit?

Assistant> The context mentions Pauli Blendergast, an economist at the University of Krampton Ohio, who is noted for stating that "economics is bullshit." No other individuals are explicitly cited in the provided text.

You> what procedures are performed in chemistry labs?

Assistant> The provided context does not contain information about procedures performed in chemistry labs. It focuses on economic concepts, microeconomics, macroeconomics, and related topics, but does not mention chemistry or laboratory procedures.

You> how do microeconomics and macroeconomics differ?

Assistant> Microeconomics focuses on individual agents (e.g., households, firms) and specific markets, analyzing decisions like pricing, resource allocation, and consumer behavior. Macroeconomics examines the entire economy, addressing broader issues such as unemployment, inflation, growth, and fiscal/money policy. While microeconomics deals with "how" resources are used, macroeconomics focuses on "what" the economy produces and "how collectively" it functions.

You> quit
Goodbye!
```
Dear reader, notice that there was no information in the indexed text to answer the second example query, and the program correctly refused to hallucinate (make up) an answer.
Wrap Up for RAG Using zvec Vector Datastore and Local Model
In this chapter, we built a completely offline, privacy-preserving RAG architecture by bridging Alibaba’s recently released in-process vector database, zvec, with local Ollama inference. By intentionally minimizing external dependencies and utilizing a strictly typed, schema-driven datastore, we eliminated the network overhead and deployment bloat typical of client-server vector databases. The fixed-size overlapping chunking strategy, combined with the 768-dimensional embeddinggemma model, ensures high-fidelity semantic retrieval. Simultaneously, the compact qwen3:1.7b model demonstrates that a heavily constrained, prompt-engineered generation phase can effectively synthesize retrieved context without hallucination.
The resulting pipeline serves as a robust, lightweight foundation for edge-deployable AI applications. Because the entire storage and inference stack executes locally within the same process, the pattern is exceptionally portable, fast, and secure. Moving forward, this baseline implementation can be extended to handle more complex retrieval requirements, such as integrating dynamic semantic chunking, implementing Reciprocal Rank Fusion (RRF) for hybrid multi-vector queries, or introducing multi-turn conversational memory. Ultimately, combining embedded vector storage with small-parameter LLMs proves that high-performance, domain-specific RAG does not require massive cloud infrastructure.