
Chapter 1: RAG Fundamentals
1.1 What Is RAG
RAG (Retrieval-Augmented Generation) is a technical architecture that combines information retrieval with large language model generation. The core idea is simple: instead of letting an LLM make up an answer from thin air, first retrieve relevant passages from an external knowledge base, then generate the answer based on those passages.
Take a typical example. A user asks, “How do you calculate the D0 collection-entry rate?” A standard LLM might answer, “Roughly overdue orders divided by total orders,” which is vague and may be wrong. A RAG system first retrieves the exact definition from the metric dictionary, such as “D0 order collection entry = 1 - dpd0_repay / dpd0_cnt,” and then generates an accurate answer from that evidence. The answer becomes traceable, the definition can be verified, and hallucinations are reduced significantly.
1.2 Core Components
A RAG system has five core stages:
1. Chunking: Split long documents into smaller passages suitable for retrieval. Chunk quality directly determines retrieval quality. Common strategies include splitting by heading hierarchy, by paragraph, or by a fixed character count. If chunks are too large, they introduce noise and may exceed context limits; if too small, they lose contextual relationships.
2. Embedding: Map text into dense vectors so semantically similar text ends up close in vector space. Embeddings are the foundation of vector retrieval and define how well the system understands meaning. Common models include OpenAI’s text-embedding-3 family (strong quality but paid), BGE-small-zh (open-source and CPU-friendly), and m3e-base (good for Chinese).
3. Retrieval: Given a user query, retrieve the most relevant top-k passages from a vector database or index. Retrieval is the heart of RAG because it determines what knowledge the LLM gets to see.
| Method | Principle | Strengths | Weaknesses |
|---|---|---|---|
| BM25 | Term frequency + inverse document frequency | No embedding model required, runs locally, strong on domain-specific terminology | Cannot handle synonyms or semantic similarity well |
| Vector retrieval | Semantic vector similarity | Understands meaning, supports synonym matching | Requires an embedding model and vector database |
| Hybrid retrieval | BM25 + vector fusion | Combines both strengths for broader recall | More complex to implement |
4. Prompt Assembly: Combine retrieval results, the user question, and system instructions into the LLM input. Prompt design determines whether the model behaves as intended. Common strategies include enforcing answer structure (conclusion first, evidence second), confidence constraints (say “I don’t know” when evidence is insufficient), scope boundaries (explicitly state what cannot be answered), and citation formatting (answers must include sources).
5. LLM Generation: Produce the final answer from the assembled prompt. The generation model can be deployed locally (for example, Qwen, LLaMA, or ChatGLM) or called through an API (for example, GPT-4 or Claude).
1.3 Chunking Strategies
The right chunking strategy depends on document structure and business context:
| Strategy | Principle | Best For | Potential Issues |
|---|---|---|---|
| By heading hierarchy | Split on # headings |
Markdown and structured documents | A single heading may contain no meaningful content |
| By paragraph | Split on paragraph breaks | Documents with clear paragraph structure | Paragraphs may still be too long |
| Fixed size | Hard split by character count (for example, 900 chars) | General-purpose scenarios | May break semantic units |
| Recursive splitting | Recursively split by paragraph, then sentence | General-purpose scenarios | More complex to implement |
In practice, chunk quality matters more than algorithm choice. Common optimizations include filtering out meaningless heading-only fragments, cleaning Markdown noise (list markers, numbering, code markers), filtering very short content (fewer than 40 characters), and filtering by content density (for example, chunks where heading lines account for more than 70%).
1.4 Choosing a Retrieval Strategy
Vector retrieval and BM25 are not mutually exclusive. In real systems, you can keep both and switch between them through configuration.
When BM25 works well: business documents full of domain-specific terms such as “collection-entry rate,” “spread,” or “average amount per case.” For exact-match recall, BM25 can perform as well as vector retrieval and does not require an extra embedding model, which makes it a good fit for resource-constrained environments. Its downside is weak handling of synonyms and conversational phrasing, such as “How much did they borrow?” versus “contract amount.”
When vector retrieval works well: scenarios that require stronger semantic understanding, such as knowledge bases rich in synonyms, multilingual documents, or cases where semantic ranking matters. The cost is the need for an embedding model and vector database.
A stronger setup is hybrid retrieval plus reranking: use BM25 or vector retrieval in stage one to recall the top 20 candidates quickly, then rerank them with a CrossEncoder or BGE-Rerank model and keep the top 4. This two-stage pipeline usually improves relevance substantially.
1.5 Evaluation Methods
RAG evaluation is usually split into retrieval quality and generation quality.
Mainstream Evaluation Frameworks
| Framework | Core Idea |
|---|---|
| RAGAS | Uses an LLM to score faithfulness, answer relevance, and context relevance |
| ARES | Uses an LLM as a judge and scores answers against reference answers |
| Trulens | Provides measurable dashboards for both retrieval and generation |
| LangSmith | Supports tracing and evaluation for production RAG systems, including human annotation workflows |
Retrieval-Side Evaluation
| Metric | Meaning | Evaluation Goal |
|---|---|---|
| Recall@K | Ratio of top-k results that contain the correct answer | Whether recall is comprehensive |
| MRR | Mean reciprocal rank of the correct answer | Whether the best result appears near the top |
| NDCG | Relevance-weighted ranking quality | Overall ranking quality |
Generation-Side Evaluation
| Method | Description | Limitation |
|---|---|---|
| LLM-as-Judge | Use GPT-4 or Claude to evaluate generated answer quality | Prompt-template-driven and somewhat subjective |
| BLEU/ROUGE | Automatic scoring based on n-gram overlap | Cannot judge factual correctness |
| Human evaluation | Random sampling with manual quality review | Expensive and hard to scale |
The hard part of generation evaluation is this: automated metrics such as BLEU and ROUGE only measure textual overlap. They cannot tell whether an answer is factually correct or semantically aligned. In business scenarios with high accuracy requirements, manual spot checks remain indispensable.
My RAG Evaluation Practice
Combining industry practice with practical resource constraints, I use the following approach:
- Retrieval side: quantify retrieval quality with Recall@K and MRR, and regularly inspect the relevance of the top 6 results
- Generation side: rely mainly on manual review, sampling 20 to 30 real queries in each iteration
- End-to-end: collect bad cases from real meeting-query scenarios and keep iterating on the system
1.6 Optimization Directions
RAG is a system that needs continuous iteration. Common optimization directions include:
- Retrieval side: query rewriting (make the query more retrieval-friendly), query expansion (add synonyms), hybrid retrieval plus reranking
- Generation side: Self-RAG to let the model decide whether retrieval is needed, and Corrective-RAG to detect hallucinations and retrieve again
- Knowledge-base side: layered knowledge (rank by importance or confidence), and incremental indexing (auto-update when new documents arrive)
Chapter 2: Background and Motivation
2.1 Business Context
I work as a business analyst in the lending industry. In my daily work, I rely on a multi-agent system for data analysis and report generation. Among those agents, the asset agent is the one I use most often, mainly for two core scenarios:
- Meeting-time data lookup: quickly query asset data by specific dimensions during meetings, such as collection-entry rate or spread for a given channel, package, or customer segment
- Report generation: repeatedly query data and summarize it under fixed definitions when producing daily or weekly reports
2.2 The Capabilities and Evolution of the Asset Agent
The asset agent is responsible for full-spectrum asset analysis questions: disbursement scale, collection-entry rate, spread, channel distribution, package performance, and more. In meetings, when I need to answer questions like, “What was the D0 collection-entry rate for front-loaded new customers this week, and how did it change week over week?”, the agent has to pull data from a database or Excel files, filter dimensions, calculate metrics, and present a conclusion.
The previous implementation depended on asset-analysis skills. Those skills encapsulated all definitions, metric formulas, and analysis logic in advance, and the agent would call the relevant skill based on the user’s question.
However, as the business expanded, the dimensions and metrics embedded in the skills kept growing:
| Dimension / Metric | Continuously Growing Content |
|---|---|
| Channels | Facebook / Google / TikTok / Organic / non-paid, plus 7 paid packages |
| Packages | 7 paid packages + APK packages |
| Customer segments | Existing customers / new customers (pure new / non-pure new) / multi-loan existing / multi-loan new |
| Products | Front-loaded / back-loaded à single-loan / multi-loan |
| Metrics | D-1 / D0 collection entry, D0 spread, cumulative spread, average amount per case, bad-debt rate, etc. |
| RG grades | A / B / C / D / E / F / G |
The limitations of the skills-based approach became increasingly obvious:
- Poor readability: definitions existed only as code, so callers could get results but could not directly inspect the reasoning process
- High maintenance cost: when dimensions and metrics grew, the skills code had to be updated in sync, and reuse across scenarios was hard
- Limited extensibility: when adding new analysis dimensions or handling exploratory questions, the skills structure was too rigid
That is why I needed a RAG knowledge base as a complement to skills: move descriptive knowledge such as definitions, standards, and SOPs into the knowledge base and let RAG provide question-answer retrieval, while keeping structured computation logic in skills. Each part does what it is best at.
2.3 Pressure on Knowledge Management
Even after introducing a knowledge base, managing the knowledge itself remains difficult. Asset-analysis knowledge is scattered across multiple places:
- Feishu docs: metric dictionaries, dimension definitions, analysis SOPs
- Code comments: some business logic is embedded directly in Python scripts
- DM conversations: temporary definition agreements that exist only in message history
This creates several problems:
- Inconsistent definitions: the same term may be defined differently across documents, which can lead the agent to answer inconsistently
- Unsynced updates: code logic changes, but documentation does not, causing knowledge drift
- Poor reusability: it is hard to reuse definition knowledge quickly in new scenarios such as automated report generation
- High maintenance cost: it takes too long to understand the agent’s actual capability boundaries
The number of asset-analysis dimensions and metrics will keep growing, and the complexity of knowledge management will keep rising with it. I needed a systematic knowledge-management mechanism, and that was the original driver behind RAG plus a knowledge vector store.
2.4 Why RAG
Across the industry, mainstream knowledge-management stacks have shifted toward RAG. In AI-native applications and enterprise knowledge-base Q&A scenarios, RAG has become the default architectural choice in practice:
| Approach | Industry Status |
|---|---|
| Pure document search | Last-generation approach, lacks semantic understanding |
| Hard-coded rules | Expensive to maintain, difficult to scale |
| Pure LLM memory | Limited by context window, cannot handle large knowledge sets |
| RAG | Mainstream industry solution, suitable for almost all knowledge-Q&A scenarios |
RAG’s core strengths, natural-language Q&A, updatable knowledge, and traceable answers, align exactly with the needs of fast meeting-time lookup and report generation.
As an AI application builder, I also see hands-on RAG experience as essential. My learning goal was clear: chunking strategy, retrieval pipeline, prompt constraints, and evaluation can only really be understood by building them yourself and validating them in a real asset-agent workflow.
Chapter 3: System Build Process
3.1 Overview of the Build Flow
Environment setup (install dependencies / Ollama / pull model)
â
Configuration management (write config.yaml)
â
Knowledge-base construction (organize markdown files)
â
Index build (python build_index_v2.py)
â
Start service (python app.py)
â
Validation tests (/health â /search â /ask)
3.2 Environment Setup
Python dependencies: fastapi uvicorn pydantic pyyaml jieba rank_bm25 chromadb requests
Ollama model: use ollama pull qwen2.5:0.5b. One model serves two purposes, both embedding and generation, which saves resources.
3.3 Project Structure
asset_rag/
âââ app.py # FastAPI service, core inference logic
âââ build_index_v2.py # Index build script
âââ config.yaml # All configurable parameters
âââ prompt.md # System prompt template
âââ index.json # BM25 index (generated after build)
âââ chroma_db/ # Vector database (when vector mode is enabled)
asset_knowledge_base/ # Knowledge base, separate from the service
âââ 01_metric_definitions/
âââ 02_dimension_definitions/
âââ 03_data_dictionary/
âââ 04_analysis_sop/
Design idea: keep the knowledge base independently maintained, and let the service pick up new knowledge automatically after the index is rebuilt.
3.4 Configuration Management
Centralize all parameters in config.yaml:
ollama_base_url: http://127.0.0.1:11434
ollama_model: qwen2.5:0.5b
top_k: 6
max_context_chunks: 4
chunk_max_chars: 900
use_vector: false # Switch retrieval mode
3.5 Core Code: Service Entry Point
app = FastAPI(title='asset-local-rag')
@app.post('/ask')
def ask(req: AskRequest):
hits = retrieve(req.question, req.top_k or TOP_K)
answer = call_ollama(req.question, hits)
return {'answer': answer, 'citations': hits}
3.6 Core Code: Unified Retrieval Entry
def retrieve(query: str, top_k: int):
if USE_VECTOR:
return retrieve_vector(query, top_k) # Chroma vectors
else:
return retrieve_bm25(query, top_k) # BM25
BM25 retrieval: uses rank_bm25 plus jieba tokenization. It performs well on domain-specific terms and does not require an embedding model.
Vector retrieval: enabled when use_vector: true. Ollama is used for embeddings, and the vectors are stored in Chroma.
3.7 Core Code: Prompt Assembly
def call_ollama(question: str, contexts: list):
context_text = '\n\n'.join([
f"[Source: {c['path']}]\n{c['text']}"
for c in contexts[:MAX_CONTEXT]
])
prompt = f"{PROMPT}\n\nKnowledge Passages:\n{context_text}\n\nUser Question: {question}"
resp = requests.post(f'{OLLAMA_URL}/api/generate',
json={'model': OLLAMA_MODEL, 'prompt': prompt})
return resp.json()['response'].strip()
Key design choice: pass at most 4 chunks into the LLM, and label each chunk with its source.
3.8 Index Construction
python3 build_index_v2.py
build_index_v2.py reads all markdown files, splits them by # headings, filters meaningless fragments and Markdown noise, and outputs index.json.
3.9 Startup and Validation
python3 app.py
# Health check
curl http://127.0.0.1:8787/health
# RAG QA test
curl -X POST http://127.0.0.1:8787/ask \
-H "Content-Type: application/json" \
-d '{"question": "How do you calculate the D0 collection-entry rate?"}'
Chapter 4: Knowledge Base Construction
4.1 Design Principles
The knowledge base is the “brain” of the RAG system, and its quality directly determines retrieval quality. I follow three principles:
- Clear boundaries: the first version only covers asset-data-related content
- Layered structure: organize content by “definitions â dimensions â data dictionary â analysis SOP”
- Consistent format: use uniform Markdown, and give each concept its own
#heading
4.2 Directory Structure
asset_knowledge_base/
âââ 01_metric_definitions/metric_dictionary.md # Metric definitions and formulas
âââ 02_dimension_definitions/dimension_dictionary.md # Channels, customer segments, packages, etc.
âââ 03_data_dictionary/asset_data.md # Asset data field descriptions
âââ 04_analysis_sop/ # Analysis approaches for lower disbursement, worsening first delinquency, worsening spread, etc.
The layering serves different purposes: the definition layer explains “what it is,” the data layer explains “what fields exist,” and the application layer explains “how to use it.”
4.3 Markdown Writing Guidelines
Give each concept its own # heading so it does not get split across different chunks:
## D0 Order Collection Entry Rate
Definition: the share of orders that remain unpaid on the due date.
Formula: D0 order collection entry = 1 - dpd0_repay / dpd0_cnt
Avoid writing pure bullet lists. Each list item should include explanatory text.
4.4 Defining Knowledge Boundaries
| Can Answer | Cannot Answer |
|---|---|
| Disbursement scale, average amount per case, collection-entry rate, spread | Approval rate, approval workflow |
| Definitions of channels, packages, and customer segments | Collection strategy |
| Metric definitions and data field explanations | operation_data |
Boundary control is implemented in the prompt layer. The knowledge base itself does not need special handling.
4.5 Maintenance Workflow
Organize markdown files â commit to asset_knowledge_base/ â run build_index_v2.py â service picks up changes automatically
Chapter 5: Chunking Strategy Iteration
5.1 Why Chunking Matters
Chunking is one of the easiest parts of RAG to underestimate, even though it has an outsized effect. Chunk quality directly determines whether retrieved content is complete.
5.2 v1: Heading-Based Splits Plus Hard Character Limits
Split on # headings. If a section is within 900 characters, keep it as a single chunk; if it is longer, split it further by accumulating paragraphs.
Problems exposed by v1:
| Problem | Symptom | Impact |
|---|---|---|
| Heading-only fragments | A heading like # Secondary Heading becomes a chunk even with no body text |
Retrieval returns empty content |
| Markdown noise | List markers and numbering interfere with embeddings | Vector quality drops |
| Broken context | Definitions and formulas get split apart | Retrieved content is incomplete |
5.3 v2: Filtering Plus Cleanup
Before building the index, filter meaningless fragments such as text under 40 characters or chunks where heading lines exceed 70%, and clean Markdown noise such as list markers, numbering, and code markers.
def is_meaningful_chunk(text: str) -> bool:
if len(stripped) < 40: return False
if title_lines / len(lines) > 0.7: return False
return True
5.4 Reflection
Even with v2, a hardcoded 900-character split still has limitations. Formulas and explanations may end up in different chunks, tables can get broken apart, and cross-section references can fail. v2 is a practical answer under resource constraints, not the optimal one.
Chapter 6: Retrieval Pipeline Design
6.1 BM25 Retrieval
BM25 is a sparse retrieval algorithm and can be understood as an upgraded TF-IDF: tokenize the query, calculate TF, introduce IDF to penalize common words, and normalize for document length.
Strengths: no embedding model required, runs locally, and performs well on domain-specific terms such as “D0 collection entry” or “average amount per case.”
Weaknesses: it only matches literal wording, so synonyms such as “average amount per case” versus “average contract amount” cannot be recalled well.
6.2 Vector Retrieval
Switch it on or off through use_vector: true/false in config.yaml. Ollama handles embedding generation, and Chroma stores the vectors.
Downside: it requires an extra embedding model and vector database. In practice, it ran slowly on a 2 vCPU / 3.8 GB machine.
6.3 Unified Retrieval Entry
def retrieve(query: str, top_k: int):
if USE_VECTOR:
return retrieve_vector(query, top_k)
else:
return retrieve_bm25(query, top_k)
Both modes are switched through configuration and are transparent to upper layers.
6.4 Example Retrieval Result
{
"id": "01_metric_definitions/metric_dictionary.md#3",
"path": "01_metric_definitions/metric_dictionary.md",
"text": "## D0 Order Collection Entry\n\nDefinition: still unpaid on the due date...\nFormula: D0 order collection entry = 1 - dpd0_repay / dpd0_cnt",
"score": 8.67
}
score is the BM25 relevance score and can be used as a confidence signal.
Chapter 7: Prompt Engineering and Safety Constraints
7.1 System Prompt Design
Prompt design determines whether the model behaves the way you expect. I follow three principles:
- Whitelist the scope: explicitly list what topics can be answered
- Blacklist refusal cases: explicitly list out-of-scope domains and standard refusal language
- Constrain uncertainty: if evidence is insufficient, the model must not invent an answer and should say “I don’t know”
7.2 My Prompt
You are the local RAG QA model for the asset agent.
Strict rules:
- You may answer only based on the provided knowledge passages
- You may answer only within the first-version scope of nigeria_asset.asset_data
- For questions about approval rate, collection execution, or other out-of-scope topics, you must answer: "The current knowledge base does not support this question"
- Do not fabricate fields, definitions, or conclusions
7.3 User Prompt Assembly
context_text = '\n\n'.join([
f"[Source: {c['path']}]\n{c['text']}"
for c in contexts[:MAX_CONTEXT] # At most 4 chunks
])
prompt = f"{PROMPT}\n\nKnowledge Passages:\n{context_text}\n\nUser Question: {question}"
Key design choice: MAX_CONTEXT = 4 avoids exceeding the model’s context limit, and each chunk includes a source label for traceability.
7.4 Boundary Test Results
| Question | Expected Behavior | Actual |
|---|---|---|
| “How do you calculate D0 collection entry?” | Answer from the metric dictionary | â |
| “What was this week’s approval rate?” | “Not supported by the knowledge base” | â |
| “Which orders are overdue?” | Refuse and say database lookup is required | â |
Chapter 8: Evaluation Results and Bottleneck Analysis
8.1 Subjective Score: 6 / 10
| Dimension | Score | Notes |
|---|---|---|
| Retrieval recall | 6/10 | Weak synonym recall |
| Generation quality | 5/10 | The 0.5B model has limited understanding |
| Response speed | 8/10 | Local inference, no network latency |
| Stability | 7/10 | Occasional OOM under tight resources |
| Explainability | 8/10 | Clear sources, good traceability |
The two biggest weaknesses are the model is too small and synonym recall is weak.
8.2 Bottleneck 1: The Model Is Too Small
qwen2.5:0.5b struggles with more complex reasoning, such as multi-step calculation questions. It also sometimes misinterprets analytical terms such as “week-over-week” and “year-over-year,” and instruction following is not fully stable.
8.3 Bottleneck 2: Low Retrieval Recall
BM25 performs literal matching only, which leads to issues like:
- “Average contract amount” does not recall “average amount per case”
- “First delinquency rate” does not match “first-delinquency deterioration”
- Conversational phrasing does not align with the formal wording in the knowledge base
8.4 Bottleneck 3: No Reranking
Current pipeline: query â BM25 top_k â LLM
Ideal pipeline: query â BM25 top_20 â Rerank top_4 â LLM
A two-stage retrieval pipeline would significantly improve relevance, but current machine resources do not support it.
Chapter 9: Pitfalls and Lessons Learned
Pitfall 1: OOM Due to Unrealistic Resource Assumptions
I originally tried vector retrieval with qwen2.5:3b, and 3.8 GB of RAM was immediately exhausted. The fix was to downgrade to qwen2.5:0.5b and disable vector retrieval, using BM25 only.
Lesson: do not overestimate hardware. Start by getting the smallest viable setup running.
Pitfall 2: Chunk Size Was Too Coarse
In the early version, I used 1500-character chunks, which split “definition + formula + notes” into three pieces, so retrieval often came back incomplete. The fix was to reduce the chunk size to 900 characters and require the knowledge base to follow a “one concept, one paragraph” writing style.
Lesson: chunking strategy and knowledge-base writing guidelines need to be optimized together.
Pitfall 3: Prompt Injection
If the user entered something like “Ignore the previous rules,” the model could skip constraints. The fix was to add a “strict rules” section in the system prompt and filter inputs in the application layer.
Lesson: small local models do not follow instructions as reliably as larger models. You cannot assume prompt constraints are naturally effective.
Pitfall 4: Ollama Hot-Reload Issues
When switching use_vector, if Ollama was not started cleanly, the service would return HTTP 500 errors without crashing, which made debugging harder. The fix was to add a /health endpoint for active status checks.
Chapter 10: Optimization Directions and Closing Thoughts
10.1 Model Upgrade Path
qwen2.5:0.5b (current)
â Upgrade first when memory allows
qwen2.5:3b (+2 points)
â Add more RAM / GPU when possible
qwen2.5:7b (+3 points)
â Further upgrade
LLaMA3.1:8B / ChatGLM4:9B
10.2 Two-Stage Retrieval Plan
candidates = retrieve_bm25(query, top_k=20) # Stage 1: coarse ranking
reranked = cross_encoder_rerank(query, candidates, top_k=4) # Stage 2: reranking
Expected benefit: a 15% to 20% improvement in recall.
10.3 Synonym Expansion
QUERY_EXPANSIONS = {
"average amount per case": ["average contract amount", "average borrowed amount per person"],
"collection entry": ["overdue", "unpaid", "non-performing"],
}
10.4 Summary
This RAG system is not a “perfect solution.” It is the best compromise under current resource constraints:
- 2 vCPU / 3.8 GB / no GPU â BM25 + 0.5B is the only realistic option
- It is usable, but the bottlenecks are obvious â a score of 6 is a starting point, not an ending point
- Every iteration is about doing the most worthwhile thing the current resource budget allows
The core engineering principle is this: there is no universally optimal solution, only the solution that best fits the current stage.