Lightweight RAG System: Overcoming 12-Hour Runtime Constraints in Google Colab
Retrieval-Augmented Generation (RAG) has become a game-changer in modern NLP applications. Yet for researchers and practitioners without enterprise-grade infrastructure, running RAG on free-tier cloud services like Google Colab presents serious challenges. Colab's 12-hour runtime limit, 15 GB RAM ceiling, storage that evaporates after each session—combine these constraints and designing an optimal RAG system becomes a genuine puzzle.
In this article, we'll show you how to build an efficient RAG system under these resource constraints: from chunking strategies to quantized models to caching, all with practical code examples.
The Problem: Colab's Limitations
When you attempt to run RAG on Colab, you'll quickly hit the following roadblocks (a quick way to check what your session actually provides is shown after the list):
1. Memory Pressure: An embedding model, a retriever, and an LLM all compete for GPU memory, so your VRAM fills up quickly. For example: sentence-transformers/all-MiniLM-L6-v2 (~80 MB) + Mistral-7B (~14 GB in fp16, before quantization) + the vector index itself...
2. Runtime Limitations: 12 hours isn't enough for fine-tuning on large corpora. Your automation scripts die when the session ends.
3. Storage Volatility: The /content directory gets wiped after each session. Computed embeddings, cached indices—all gone. Endless recomputation = wasted time.
4. Network I/O: Model downloads, data fetching—internet bandwidth becomes the bottleneck.
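Before designing around these limits, it helps to know exactly what the current session gives you. A minimal check, assuming a standard Colab runtime where psutil is preinstalled and nvidia-smi is available on GPU machines:

import psutil
import subprocess

# Total and available system RAM in GB
ram = psutil.virtual_memory()
print(f"RAM: {ram.available / 1e9:.1f} GB free of {ram.total / 1e9:.1f} GB")

# GPU memory, if a GPU runtime is attached
try:
    gpu_info = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=True
    )
    print("GPU memory (total, free):", gpu_info.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("No GPU detected; running on a CPU runtime")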
Solution 1: Intelligent Chunking Strategies
In RAG, retrieval quality depends on chunking quality. Naive chunking (e.g., fixed 512-token windows) often falls short.
Semantic Chunking
Split chunks at semantic boundaries rather than artificial token limits:
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunking(text, model_name="all-MiniLM-L6-v2", threshold=0.5):
    """
    Splits text based on semantic similarity.
    If the cosine similarity between consecutive sentences drops below
    the threshold, a new chunk begins.
    """
    model = SentenceTransformer(model_name)

    # Sentence splitting (basic version; enhance with spacy/nltk)
    sentences = text.replace(".", ".\n").split("\n")
    sentences = [s.strip() for s in sentences if s.strip()]

    if len(sentences) <= 1:
        return [text]

    # Encode embeddings
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        # Cosine similarity between consecutive sentence embeddings
        similarity = np.dot(embeddings[i], embeddings[i - 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1])
        )

        if similarity < threshold:
            # Start a new chunk at the semantic boundary
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks

# Usage
text = "..."  # Long document
chunks = semantic_chunking(text, threshold=0.6)
print(f"Chunks created: {len(chunks)}")
Sliding Window with Overlap
Add overlap to prevent information loss at chunk boundaries:
def chunking_with_overlap(text, chunk_size=512, overlap=100):
    """
    Fixed-size chunks (counted in whitespace-split words) with overlap.
    Example: chunk 1 covers words 0-511, chunk 2 covers words 412-923
    (100-word overlap).
    """
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks
Colab best practice: Semantic chunking costs more compute up front, but it typically produces fewer, more coherent chunks, which improves the trade-off between embedding time and retrieval accuracy. The comparison sketch below shows how to measure this on your own data.
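A quick way to compare the two strategies is to run both on the same document and count chunks; document.txt here is just a placeholder for any long text of your own:

# Compare chunk counts from both strategies on the same document.
# document.txt is a placeholder for any long text of your own.
with open("document.txt") as f:
    text = f.read()

semantic_chunks = semantic_chunking(text, threshold=0.6)
overlap_chunks = chunking_with_overlap(text, chunk_size=512, overlap=100)

print(f"Semantic chunking:        {len(semantic_chunks)} chunks")
print(f"Sliding window + overlap: {len(overlap_chunks)} chunks")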
Solution 2: Quantized Models
Full-precision (float32) models consume enormous GPU memory. Quantization compresses model weights to int8 or int4—drastic memory savings with minimal performance loss.
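A rough back-of-the-envelope estimate makes the savings concrete: weight memory is simply parameter count times bytes per parameter (ignoring activations, KV cache, and framework overhead).

# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores activations, KV cache, and framework overhead.
params = 7_000_000_000  # e.g., a 7B-parameter model

for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype:>7}: ~{params * bytes_per_param / 1e9:.1f} GB")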
Quantization with ONNX Runtime
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export the embedding model to ONNX (feature extraction = raw token embeddings)
ort_model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

# Dynamic int8 quantization of the exported ONNX graph
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./minilm-onnx-int8", quantization_config=qconfig)

# Load the quantized model; the inference API stays the same
quantized_model = ORTModelForFeatureExtraction.from_pretrained("./minilm-onnx-int8")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = quantized_model(**inputs)  # token embeddings in outputs.last_hidden_state
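Note that, unlike SentenceTransformer, the exported ONNX model returns token-level embeddings, so a pooling step is needed to get one vector per sentence. A minimal mean-pooling helper, as a sketch (embed_sentences below is a hypothetical helper, not a library API):

def embed_sentences(texts):
    # Mean-pool token embeddings into one vector per sentence,
    # ignoring padding positions via the attention mask.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = quantized_model(**enc)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    summed = (out.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return (summed / counts).detach().numpy()

vectors = embed_sentences(["Hello world", "Bonjour le monde"])
print(vectors.shape)  # (2, 384) for all-MiniLM-L6-v2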
GPTQ Quantization (for LLMs)
For larger language models, use GPTQ (more aggressive quantization):
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Use a checkpoint that has already been quantized with GPTQ;
# from_quantized() loads existing quantized weights, it does not quantize on the fly.
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,  # Colab compatibility
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Inference
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Colab benefit: the 4-bit GPTQ build of Mistral-7B needs only around 4-5 GB for its weights, versus ~14 GB in fp16 (and ~28 GB at full float32 precision), which leaves comfortable headroom within Colab's limits. You can confirm the footprint in your own session with the check below.
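A minimal VRAM check, assuming a CUDA runtime, run right after loading the quantized model:

import torch

# VRAM currently held by tensors vs. reserved by the CUDA allocator.
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"Allocated: {allocated:.2f} GB | Reserved: {reserved:.2f} GB")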
Solution 3: Caching & Persistence
Recomputing embeddings in Colab wastes time. Add a caching layer:
SQLite-based Cache
import sqlite3
import json
import hashlib
from datetime import datetime

class EmbeddingCache:
    def __init__(self, db_path="embeddings_cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()
        self._create_table()

    def _create_table(self):
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS embeddings (
                text_hash TEXT PRIMARY KEY,
                embedding TEXT,
                created_at TIMESTAMP
            )
        """)
        self.conn.commit()

    def _hash_text(self, text):
        return hashlib.md5(text.encode()).hexdigest()

    def get(self, text):
        """Retrieve embedding from cache"""
        hash_val = self._hash_text(text)
        self.cursor.execute(
            "SELECT embedding FROM embeddings WHERE text_hash = ?",
            (hash_val,)
        )
        result = self.cursor.fetchone()
        if result:
            return json.loads(result[0])
        return None

    def set(self, text, embedding):
        """Store embedding in cache"""
        hash_val = self._hash_text(text)
        self.cursor.execute("""
            INSERT OR REPLACE INTO embeddings (text_hash, embedding, created_at)
            VALUES (?, ?, ?)
        """, (hash_val, json.dumps(embedding.tolist()), datetime.now()))
        self.conn.commit()

# Usage
cache = EmbeddingCache("rag_cache.db")
model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding_cached(text):
    cached = cache.get(text)
    if cached is not None:
        return np.array(cached)
    embedding = model.encode(text)
    cache.set(text, embedding)
    return embedding
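A quick way to confirm the cache is doing its job: encode the same text twice and compare timings; the second call should bypass the encoder entirely. A small sketch:

import time

# First call: cache miss, runs the encoder. Second call: cache hit, reads from SQLite.
sample = "Service requests are processed within 5 days."

start = time.time()
get_embedding_cached(sample)
print(f"Cache miss: {time.time() - start:.3f}s")

start = time.time()
get_embedding_cached(sample)
print(f"Cache hit:  {time.time() - start:.3f}s")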
Persisting to Google Drive
Use Google Drive to persist data across sessions:
from google.colab import drive
import pickle

# Mount drive
drive.mount('/content/drive', force_remount=True)

# Save vector database
def save_vector_db(vector_db, path="/content/drive/MyDrive/rag_index.pkl"):
    with open(path, "wb") as f:
        pickle.dump(vector_db, f)
    print(f"Saved to {path}")

def load_vector_db(path="/content/drive/MyDrive/rag_index.pkl"):
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None
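At the start of each new session, try loading the saved index first and rebuild only if nothing is found. A sketch of the load-or-rebuild pattern:

# Load-or-rebuild pattern: reuse a previously saved index when available.
vector_db = load_vector_db()

if vector_db is None:
    print("No saved index found; rebuilding from scratch...")
    # ... run chunking, embedding, and indexing here, then save_vector_db(...)
else:
    print("Index restored from Drive; skipping recomputation.")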
Practical Example: End-to-End Lightweight RAG
Let's put it all together:
# 1. Setup
!pip install sentence-transformers faiss-cpu optimum[onnxruntime]
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from google.colab import drive
drive.mount('/content/drive')
# 2. Models (lightweight embedding model; swap in the quantized ONNX variant from Solution 2 if memory is tight)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# 3. Data preparation
documents = [
    "The municipality is the center of public services.",
    "Citizens can submit requests online.",
    "Service requests are processed within 5 days.",
    # ... more documents
]

# 4. Chunking
chunks = []
for doc in documents:
    doc_chunks = semantic_chunking(doc, threshold=0.7)
    chunks.extend(doc_chunks)

print(f"Total chunks: {len(chunks)}")
# 5. Embedding + FAISS Index
embeddings = embedding_model.encode(chunks, show_progress_bar=True)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings.astype('float32'))
# 6. Retrieval function
def retrieve(query, k=5):
    query_embedding = embedding_model.encode([query])
    distances, indices = index.search(query_embedding.astype('float32'), k)

    results = []
    for dist, idx in zip(distances[0], indices[0]):
        results.append({
            "chunk": chunks[idx],
            "distance": float(dist)
        })
    return results
# 7. Generation (with Colab-compatible models)
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",  # Lightweight extractive QA model
)

def rag_query(question):
    # Retrieve
    retrieved = retrieve(question, k=3)
    context = " ".join([r["chunk"] for r in retrieved])

    # Generate (extractive answer from the retrieved context)
    result = qa_pipeline(
        question=question,
        context=context,
        max_seq_len=512
    )
    return {
        "answer": result["answer"],
        "confidence": result["score"],
        "context": context
    }

# Test
response = rag_query("How long does it take to process service requests?")
print(response)
# 8. Save for next session
# Note: a raw FAISS index is a SWIG object and cannot be pickled directly,
# so serialize it to a byte array first.
save_vector_db({
    "index": faiss.serialize_index(index),
    "chunks": chunks,
    "embeddings": embeddings
}, path="/content/drive/MyDrive/rag_system.pkl")
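In a fresh session, the saved state can then be restored and the FAISS index rebuilt from its serialized bytes; a sketch, assuming the file above exists on your Drive:

# Restore the saved RAG state in a new session
saved = load_vector_db(path="/content/drive/MyDrive/rag_system.pkl")

if saved is not None:
    chunks = saved["chunks"]
    embeddings = saved["embeddings"]
    # Rebuild the FAISS index from its serialized byte array
    index = faiss.deserialize_index(saved["index"])
    print(f"Restored {index.ntotal} vectors and {len(chunks)} chunks")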
Colab Runtime Optimization Tips
- Model Download Caching:
import os

# Set before importing transformers/sentence-transformers so the cache path takes effect
os.environ['HF_HOME'] = '/content/drive/MyDrive/hf_cache'
# Models download to Drive once; later sessions reuse them without re-downloading
- Batch Processing:
def batch_encode(texts, batch_size=32):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        embeddings = embedding_model.encode(batch)
        all_embeddings.append(embeddings)
    return np.vstack(all_embeddings)
- Selective GPU Usage:
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"  # Fallback
model = model.to(device)
Conclusion
Lightweight RAG in Colab is possible; it just requires smart design decisions. Chunk documents along semantic boundaries, protect memory with quantized models, and avoid recomputation with caching and Drive persistence.
This combination gives you:
- ✅ Stable operation throughout 12-hour sessions
- ✅ Staying within ~15 GB memory limits
- ✅ Persistence across sessions
- ✅ Reasonable retrieval quality
In our next article, we'll combine real-time automation with GitHub Actions for 24/7 operations. Stay tuned!