The rise of open-weight large language models (LLMs) like Llama 3, Mistral, and Gemma has democratized AI development. But while these models are powerful out of the box, their true potential is unlocked when you customize them with your own data. Whether you’re building a domain-specific chatbot, a customer support system, or an internal knowledge assistant, understanding how to enhance LLMs with your data is crucial.
In this post, we'll explore four complementary approaches: prompt context injection, Retrieval-Augmented Generation (RAG), fine-tuning, and the Model Context Protocol (MCP). Each has its place in the AI engineer's toolkit, and choosing the right one, or the right combination, can make the difference between a mediocre application and a game-changing one.
## Prompt Context Injection

What it is: Directly including relevant information in your prompt alongside the user's query.
How it works:

```python
context = """
Company Policy: Employees can take up to 15 days of PTO per year.
PTO requests must be submitted at least 2 weeks in advance.
"""
user_question = "How many vacation days do I get?"
prompt = f"{context}\n\nQuestion: {user_question}\nAnswer:"
```
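Once the prompt is assembled, it can go to any open-weight model. A minimal sketch with the Hugging Face `transformers` pipeline (the model name is just an example; use whatever you run locally):

```python
from transformers import pipeline

# Any local open-weight instruct model works here; this one is only an example
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

result = generator(prompt, max_new_tokens=200, return_full_text=False)
print(result[0]["generated_text"])
```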
Pro tip: Structure your context with clear delimiters and headers. Models respond better to organized information:

```python
prompt = f"""
<knowledge_base>
Product: Widget Pro X
Price: $299
Features: Waterproof, 10-hour battery, wireless charging
</knowledge_base>
<user_query>
{user_question}
</user_query>
Provide a helpful answer based only on the knowledge base above.
"""
```
## Retrieval-Augmented Generation (RAG)

What it is: A hybrid approach that combines a vector database with an LLM. When a user asks a question, RAG retrieves the most relevant documents from your knowledge base and injects them into the prompt.

How it works:

```python
from sentence_transformers import SentenceTransformer
import chromadb
# 1. Embed your documents
embedder = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
    "Python 3.12 introduces improved error messages",
    "FastAPI is a modern web framework for Python",
    "Docker containers provide consistent environments"
]
client = chromadb.Client()
collection = client.create_collection("docs")
for i, doc in enumerate(documents):
    embedding = embedder.encode(doc)
    collection.add(
        embeddings=[embedding.tolist()],
        documents=[doc],
        ids=[f"doc_{i}"]
    )
# 2. Retrieve relevant context
query = "How do I build web APIs in Python?"
query_embedding = embedder.encode(query)
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=3
)
# 3. Inject into prompt
context = "\n".join(results['documents'][0])
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

User Query → Embedding Model → Vector Search → Top-K Documents → LLM Prompt → Response

A few techniques can take retrieval quality further:
Hybrid search (semantic + keyword):

```python
# Combine dense (vector) and sparse (BM25) retrieval
from rank_bm25 import BM25Okapi
vector_results = semantic_search(query, top_k=10)
keyword_results = bm25_search(query, top_k=10)
final_results = rerank(vector_results + keyword_results, top_k=5)
```
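A more concrete sketch that reuses the `documents`, `embedder`, and `collection` objects from the RAG example above and merges the two rankings with reciprocal rank fusion (RRF):

```python
from rank_bm25 import BM25Okapi

def rrf_fusion(rankings, k=60, top_k=5):
    # Documents that rank highly in either list float to the top
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Sparse (keyword) ranking: BM25 over whitespace-tokenized documents
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
bm25_order = bm25.get_scores(query.lower().split()).argsort()[::-1]
keyword_ranking = [documents[i] for i in bm25_order]

# Dense (semantic) ranking from the vector store
vector_ranking = collection.query(
    query_embeddings=[embedder.encode(query).tolist()],
    n_results=len(documents)
)["documents"][0]

final_results = rrf_fusion([keyword_ranking, vector_ranking])
```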
HyDE (Hypothetical Document Embeddings):

```python
# Generate a hypothetical answer first, then search for similar documents
hypothetical_answer = llm.generate(f"Answer this question: {query}")
results = vector_search(hypothetical_answer)  # Often more accurate!
```
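A slightly more concrete version against the Chroma collection from earlier; `generate_answer` is a placeholder for whatever LLM call you already have available:

```python
def hyde_search(query, n_results=3):
    # 1. Ask the LLM for a plausible answer (generate_answer is a placeholder;
    #    the answer doesn't need to be correct, just topically on target)
    hypothetical_answer = generate_answer(f"Answer this question: {query}")

    # 2. Embed the hypothetical answer and search with it instead of the raw query
    hyde_embedding = embedder.encode(hypothetical_answer)
    return collection.query(
        query_embeddings=[hyde_embedding.tolist()],
        n_results=n_results
    )
```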
Parent-child chunking:

```python
# Retrieve small chunks but provide larger context to LLM
chunk = retrieve_best_chunk(query)
parent_document = get_parent_document(chunk)
context = parent_document  # Full context for better answers
```
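A sketch of one way to wire this up with Chroma metadata; `parent_docs` and `chunk_text` are stand-ins for your own corpus and text splitter:

```python
# Full documents, keyed by ID (the "parents")
parent_docs = {
    "handbook": "...full employee handbook text...",
    "api_guide": "...full API guide text..."
}

chunk_collection = client.create_collection("chunks")
for parent_id, text in parent_docs.items():
    for i, chunk in enumerate(chunk_text(text, chunk_size=200)):  # chunk_text: your splitter
        chunk_collection.add(
            embeddings=[embedder.encode(chunk).tolist()],
            documents=[chunk],
            metadatas=[{"parent_id": parent_id}],  # remember where each chunk came from
            ids=[f"{parent_id}_chunk_{i}"]
        )

# Match against the small chunks, but hand the LLM the full parent document
hit = chunk_collection.query(query_embeddings=[embedder.encode(query).tolist()], n_results=1)
context = parent_docs[hit["metadatas"][0][0]["parent_id"]]
```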
## Fine-Tuning

What it is: Retraining the model's weights on your specific dataset to make it better at your task.

How it works:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import Dataset
# 1. Prepare your dataset
data = [
    {"input": "How do I reset my password?",
     "output": "Click 'Forgot Password' on the login page and follow the email instructions."},
    {"input": "What's your return policy?",
     "output": "We accept returns within 30 days with original receipt and packaging."},
    # ... hundreds or thousands more examples
]
dataset = Dataset.from_list(data)
# 2. Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
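tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Turn each Q&A pair into one training text and tokenize it
# (the template below is illustrative; match whatever format you prompt with at inference)
def tokenize(example):
    text = f"Question: {example['input']}\nAnswer: {example['output']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, remove_columns=["input", "output"])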
# 3. Configure training
training_args = TrainingArguments(
    output_dir="./finetuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10
)
# 4. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()
```

For most use cases, you don't need full fine-tuning. LoRA (Low-Rank Adaptation) fine-tunes only a small fraction of the parameters:

```python
from peft import LoraConfig, get_peft_model
# Train only a tiny fraction of the parameters
lora_config = LoraConfig(
    r=8,  # Rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output (approximate): trainable params: 4.2M || all params: 8B || trainable%: 0.05
```
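Only the small LoRA adapter needs to be saved and shipped after training. A quick sketch of the save/load round trip with `peft` (the paths here are placeholders):

```python
# Save just the adapter weights (typically a few megabytes)
model.save_pretrained("./lora-adapter")

# Later, attach the adapter to a freshly loaded base model for inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Optionally merge the adapter into the base weights for simpler deployment
model = model.merge_and_unload()
```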
## Model Context Protocol (MCP)

What it is: MCP is an emerging standard (developed by Anthropic) that allows LLMs to securely connect to external data sources and tools in real time.

Think of it as "plug-and-play APIs for AI models": instead of embedding knowledge in the prompt or training on it, you give the model access to live data sources.
How it works:

```python
# MCP server exposing your database (sketched with FastMCP from the official
# Python SDK; `db` stands in for your own async database client)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("company-database")

@mcp.tool()
async def query_customer_info(customer_id: str) -> dict:
    """Fetch customer information from our CRM"""
    return await db.customers.find_one({"id": customer_id})

@mcp.tool()
async def get_order_status(order_id: str) -> dict:
    """Check the status of an order"""
    return await db.orders.find_one({"id": order_id})

# Start the server (stdio transport by default)
mcp.run()
```

```python
# Client side (schematic) - the LLM uses these tools dynamically
from mcp import Client
client = Client()
client.connect("company-database")
# The LLM can now call these tools during inference!
response = llm.chat(
    "What's the status of order #12345 for customer John?",
    tools=client.list_tools()
)
# Behind the scenes, the LLM will:
# 1. Recognize it needs customer and order data
# 2. Call get_order_status("12345")
# 3. Call query_customer_info based on the order result
# 4. Synthesize a natural language response
```
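In the same schematic style, the host application typically drives a loop like the one below: let the model respond, execute any MCP tool calls it requests, feed the results back, and repeat until it produces a final answer. The `llm` and `client` objects (and the shape of `reply`) are the illustrative ones from the example above, not a specific SDK.

```python
async def run_with_tools(user_message: str, max_steps: int = 5) -> str:
    # Schematic pseudocode: `llm` and `client` are the illustrative objects above
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=client.list_tools())
        if not reply.tool_calls:
            return reply.content  # no tools requested: this is the final answer
        messages.append(reply)    # keep the assistant's tool-call turn in the history
        for call in reply.tool_calls:
            result = await client.call(call.name, **call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": str(result)})
    return "Could not complete the request within the step limit."
```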
User Query → LLM → Decides to call MCP tool → Tool executes → Result injected → LLM continues → Final response

How the two approaches compare:

| Aspect | RAG | MCP |
|---|---|---|
| Data freshness | Periodic updates | Real-time |
| Actions | Read-only | Read + write |
| Scope | Document search | Any API/database |
| Latency | Single retrieval | Multiple tool calls |
| Use case | Knowledge Q&A | Agentic workflows |
In practice, the most powerful systems use multiple techniques together:

```python
class CustomerSupportAgent:
    def __init__(self):
        # load_model, RAGSystem, and MCPClient are placeholders for your own components
        self.llm = load_model("llama-3-8b-finetuned")          # Fine-tuned on support tone
        self.rag = RAGSystem(vector_db="faqs")                 # RAG for documentation
        self.mcp = MCPClient(["crm", "orders", "inventory"])   # MCP for live data

    async def answer(self, query: str, customer_id: str):
        # 1. RAG: Search documentation
        docs = self.rag.retrieve(query, top_k=3)

        # 2. MCP: Fetch customer context
        customer = await self.mcp.call("get_customer", customer_id)
        orders = await self.mcp.call("get_recent_orders", customer_id)

        # 3. Fine-tuned model: Generate response with all context
        context = f"""
        Documentation: {docs}
        Customer: {customer['name']}, tier: {customer['tier']}
        Recent orders: {orders}
        """
        response = await self.llm.generate(
            prompt=f"{context}\n\nCustomer question: {query}\nResponse:",
            max_tokens=500
        )
        return response
```
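Hypothetical usage, with a made-up customer ID purely for illustration:

```python
import asyncio

agent = CustomerSupportAgent()
reply = asyncio.run(agent.answer("Why hasn't my order shipped yet?", customer_id="cust_4821"))
print(reply)
```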
A rough guide for choosing an approach:

| Requirement | Recommended Approach |
|---|---|
| Quick prototype | Prompt context injection |
| Large knowledge base | RAG |
| Specific output format | Fine-tuning (LoRA) |
| Real-time data | MCP |
| Domain-specific language | Fine-tuning + RAG |
| Multi-step workflows | MCP + Fine-tuning |
| Cost-sensitive | Prompt context or RAG |
| Action-capable agent | MCP |
Whichever combination you choose, track a few key metrics:

```python
# Track key metrics
metrics = {
    "retrieval_precision": 0.85,  # RAG: Are we finding the right docs?
    "answer_accuracy": 0.92,      # LLM: Is the final answer correct?
    "hallucination_rate": 0.03,   # Are we making things up?
    "latency_p95": 1.2,           # Seconds to respond
    "cost_per_query": 0.002       # USD
}
```
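A minimal sketch of how one of these numbers might be computed, assuming you keep a small labeled evaluation set of queries with known-relevant document IDs:

```python
def retrieval_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    return sum(doc_id in relevant_ids for doc_id in retrieved_ids) / len(retrieved_ids)

# Example: two of the three retrieved docs are labeled relevant -> ~0.67
print(retrieval_precision(["doc_1", "doc_4", "doc_9"], {"doc_1", "doc_2", "doc_9"}))
```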
The techniques we've covered are just the beginning. Enhancing open-source LLMs with your own data isn't a one-size-fits-all problem, and each approach (prompt context injection, RAG, fine-tuning, and MCP) has unique strengths.
The best solutions combine multiple techniques, tailored to your specific requirements. Start simple, measure rigorously, and evolve as you learn what works for your use case.
The era of open-weight models has made AI accessible to everyone. Now it’s up to us to make it useful by grounding it in our unique data and domains.
What approach are you using? Share your experiences in the comments below!