Lightweight RAG System: Overcoming 12-Hour Runtime Constraints in Google Colab
Retrieval-Augmented Generation (RAG) has become a game-changer in modern NLP applications. Yet for researchers and practitioners without enterprise-grade infrastructure, running RAG on free-tier cloud services like Google Colab presents serious challenges. Colab's 12-hour runtime limit, 15 GB RAM ceiling, storage that evaporates after each session—combine these constraints and designing an optimal RAG system becomes a genuine puzzle.
In this article, we'll show you how to build an efficient RAG system under these resource constraints: from chunking strategies to quantized models to caching, all with practical code examples.
The Problem: Colab's Limitations
When you attempt to run RAG on Colab, you'll quickly hit the following roadblocks (a quick way to check what your session actually provides is shown after the list):
1. Memory Pressure: An embedding model, a retriever, and an LLM all compete for GPU memory, so your VRAM fills up quickly. For example: sentence-transformers/all-MiniLM-L6-v2 (~80 MB) + Mistral-7B (~14 GB in fp16, before quantization) + the vector index itself...
2. Runtime Limitations: 12 hours isn't enough for fine-tuning on large corpora. Your automation scripts die when the session ends.
3. Storage Volatility: The /content directory gets wiped after each session. Computed embeddings, cached indices—all gone. Endless recomputation = wasted time.
4. Network I/O: Model downloads, data fetching—internet bandwidth becomes the bottleneck.
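Before designing around these limits, it helps to know exactly what the current session gives you. A minimal check, assuming a standard Colab runtime where psutil is preinstalled and nvidia-smi is available on GPU machines:

import psutil
import subprocess

# Total and available system RAM in GB
ram = psutil.virtual_memory()
print(f"RAM: {ram.available / 1e9:.1f} GB free of {ram.total / 1e9:.1f} GB")

# GPU memory, if a GPU runtime is attached
try:
    gpu_info = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=True
    )
    print("GPU memory (total, free):", gpu_info.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("No GPU detected; running on a CPU runtime")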
Solution 1: Intelligent Chunking Strategies
In RAG, retrieval quality depends on chunking quality. Naive chunking (e.g., fixed 512-token windows) often falls short.
Semantic Chunking
Split chunks at semantic boundaries rather than artificial token limits:
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunking(text, model_name="all-MiniLM-L6-v2", threshold=0.5):
    """
    Splits text based on semantic similarity.
    If the cosine similarity between consecutive sentences drops below
    the threshold, a new chunk begins.
    """
    model = SentenceTransformer(model_name)

    # Sentence splitting (basic version; enhance with spacy/nltk)
    sentences = text.replace(".", ".\n").split("\n")
    sentences = [s.strip() for s in sentences if s.strip()]

    if len(sentences) <= 1:
        return [text]

    # Encode embeddings
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        # Cosine similarity between consecutive sentence embeddings
        similarity = np.dot(embeddings[i], embeddings[i - 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1])
        )

        if similarity < threshold:
            # Start a new chunk at the semantic boundary
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks

# Usage
text = "..."  # Long document
chunks = semantic_chunking(text, threshold=0.6)
print(f"Chunks created: {len(chunks)}")
Sliding Window with Overlap
Add overlap to prevent information loss at chunk boundaries:
def chunking_with_overlap(text, chunk_size=512, overlap=100):
    """
    Fixed-size chunks (counted in whitespace-split words) with overlap.
    Example: chunk 1 covers words 0-511, chunk 2 covers words 412-923
    (100-word overlap).
    """
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks
Colab best practice: Semantic chunking costs more compute up front, but it typically produces fewer, more coherent chunks, which improves the trade-off between embedding time and retrieval accuracy. The comparison sketch below shows how to measure this on your own data.
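A quick way to compare the two strategies is to run both on the same document and count chunks; document.txt here is just a placeholder for any long text of your own:

# Compare chunk counts from both strategies on the same document.
# document.txt is a placeholder for any long text of your own.
with open("document.txt") as f:
    text = f.read()

semantic_chunks = semantic_chunking(text, threshold=0.6)
overlap_chunks = chunking_with_overlap(text, chunk_size=512, overlap=100)

print(f"Semantic chunking:        {len(semantic_chunks)} chunks")
print(f"Sliding window + overlap: {len(overlap_chunks)} chunks")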
Solution 2: Quantized Models
Full-precision (float32) models consume enormous GPU memory. Quantization compresses model weights to int8 or int4—drastic memory savings with minimal performance loss.
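A rough back-of-the-envelope estimate makes the savings concrete: weight memory is simply parameter count times bytes per parameter (ignoring activations, KV cache, and framework overhead).

# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores activations, KV cache, and framework overhead.
params = 7_000_000_000  # e.g., a 7B-parameter model

for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype:>7}: ~{params * bytes_per_param / 1e9:.1f} GB")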
Quantization with ONNX Runtime
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export the embedding model to ONNX (feature extraction = raw token embeddings)
ort_model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

# Dynamic int8 quantization of the exported ONNX graph
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./minilm-onnx-int8", quantization_config=qconfig)

# Load the quantized model; the inference API stays the same
quantized_model = ORTModelForFeatureExtraction.from_pretrained("./minilm-onnx-int8")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = quantized_model(**inputs)  # token embeddings in outputs.last_hidden_state
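Note that, unlike SentenceTransformer, the exported ONNX model returns token-level embeddings, so a pooling step is needed to get one vector per sentence. A minimal mean-pooling helper, as a sketch (embed_sentences below is a hypothetical helper, not a library API):

def embed_sentences(texts):
    # Mean-pool token embeddings into one vector per sentence,
    # ignoring padding positions via the attention mask.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = quantized_model(**enc)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    summed = (out.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return (summed / counts).detach().numpy()

vectors = embed_sentences(["Hello world", "Bonjour le monde"])
print(vectors.shape)  # (2, 384) for all-MiniLM-L6-v2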
GPTQ Quantization (for LLMs)
For larger language models, use GPTQ (more aggressive quantization):
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Use a checkpoint that has already been quantized with GPTQ;
# from_quantized() loads existing quantized weights, it does not quantize on the fly.
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,  # Colab compatibility
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Inference
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Colab benefit: the 4-bit GPTQ build of Mistral-7B needs only around 4-5 GB for its weights, versus ~14 GB in fp16 (and ~28 GB at full float32 precision), which leaves comfortable headroom within Colab's limits. You can confirm the footprint in your own session with the check below.
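A minimal VRAM check, assuming a CUDA runtime, run right after loading the quantized model:

import torch

# VRAM currently held by tensors vs. reserved by the CUDA allocator.
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"Allocated: {allocated:.2f} GB | Reserved: {reserved:.2f} GB")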
Solution 3: Caching & Persistence
Recomputing embeddings in Colab wastes time. Add a caching layer:
SQLite-based Cache
import sqlite3
import json
import hashlib
from datetime import datetime

class EmbeddingCache:
    def __init__(self, db_path="embeddings_cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()
        self._create_table()

    def _create_table(self):
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS embeddings (
                text_hash TEXT PRIMARY KEY,
                embedding TEXT,
                created_at TIMESTAMP
            )
        """)
        self.conn.commit()

    def _hash_text(self, text):
        return hashlib.md5(text.encode()).hexdigest()

    def get(self, text):
        """Retrieve embedding from cache"""
        hash_val = self._hash_text(text)
        self.cursor.execute(
            "SELECT embedding FROM embeddings WHERE text_hash = ?",
            (hash_val,)
        )
        result = self.cursor.fetchone()
        if result:
            return json.loads(result[0])
        return None

    def set(self, text, embedding):
        """Store embedding in cache"""
        hash_val = self._hash_text(text)
        self.cursor.execute("""
            INSERT OR REPLACE INTO embeddings (text_hash, embedding, created_at)
            VALUES (?, ?, ?)
        """, (hash_val, json.dumps(embedding.tolist()), datetime.now()))
        self.conn.commit()

# Usage
cache = EmbeddingCache("rag_cache.db")
model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding_cached(text):
    cached = cache.get(text)
    if cached is not None:
        return np.array(cached)
    embedding = model.encode(text)
    cache.set(text, embedding)
    return embedding
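A quick way to confirm the cache is doing its job: encode the same text twice and compare timings; the second call should bypass the encoder entirely. A small sketch:

import time

# First call: cache miss, runs the encoder. Second call: cache hit, reads from SQLite.
sample = "Service requests are processed within 5 days."

start = time.time()
get_embedding_cached(sample)
print(f"Cache miss: {time.time() - start:.3f}s")

start = time.time()
get_embedding_cached(sample)
print(f"Cache hit:  {time.time() - start:.3f}s")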
Persisting to Google Drive
Use Google Drive to persist data across sessions:
from google.colab import drive
import pickle

# Mount drive
drive.mount('/content/drive', force_remount=True)

# Save vector database
def save_vector_db(vector_db, path="/content/drive/MyDrive/rag_index.pkl"):
    with open(path, "wb") as f:
        pickle.dump(vector_db, f)
    print(f"Saved to {path}")

def load_vector_db(path="/content/drive/MyDrive/rag_index.pkl"):
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None
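At the start of each new session, try loading the saved index first and rebuild only if nothing is found. A sketch of the load-or-rebuild pattern:

# Load-or-rebuild pattern: reuse a previously saved index when available.
vector_db = load_vector_db()

if vector_db is None:
    print("No saved index found; rebuilding from scratch...")
    # ... run chunking, embedding, and indexing here, then save_vector_db(...)
else:
    print("Index restored from Drive; skipping recomputation.")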
Practical Example: End-to-End Lightweight RAG
Let's put it all together:
# 1. Setup
!pip install sentence-transformers faiss-cpu optimum[onnxruntime]
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from google.colab import drive
drive.mount('/content/drive')
# 2. Models (lightweight embedding model; swap in the quantized ONNX variant from Solution 2 if memory is tight)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# 3. Data preparation
documents = [
    "The municipality is the center of public services.",
    "Citizens can submit requests online.",
    "Service requests are processed within 5 days.",
    # ... more documents
]

# 4. Chunking
chunks = []
for doc in documents:
    doc_chunks = semantic_chunking(doc, threshold=0.7)
    chunks.extend(doc_chunks)

print(f"Total chunks: {len(chunks)}")
# 5. Embedding + FAISS Index
embeddings = embedding_model.encode(chunks, show_progress_bar=True)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings.astype('float32'))
# 6. Retrieval function
def retrieve(query, k=5):
    query_embedding = embedding_model.encode([query])
    distances, indices = index.search(query_embedding.astype('float32'), k)

    results = []
    for dist, idx in zip(distances[0], indices[0]):
        results.append({
            "chunk": chunks[idx],
            "distance": float(dist)
        })
    return results
# 7. Generation (with Colab-compatible models)
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",  # Lightweight extractive QA model
)

def rag_query(question):
    # Retrieve
    retrieved = retrieve(question, k=3)
    context = " ".join([r["chunk"] for r in retrieved])

    # Generate (extractive answer from the retrieved context)
    result = qa_pipeline(
        question=question,
        context=context,
        max_seq_len=512
    )
    return {
        "answer": result["answer"],
        "confidence": result["score"],
        "context": context
    }

# Test
response = rag_query("How long does it take to process service requests?")
print(response)
# 8. Save for next session
# Note: a raw FAISS index is a SWIG object and cannot be pickled directly,
# so serialize it to a byte array first.
save_vector_db({
    "index": faiss.serialize_index(index),
    "chunks": chunks,
    "embeddings": embeddings
}, path="/content/drive/MyDrive/rag_system.pkl")
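In a fresh session, the saved state can then be restored and the FAISS index rebuilt from its serialized bytes; a sketch, assuming the file above exists on your Drive:

# Restore the saved RAG state in a new session
saved = load_vector_db(path="/content/drive/MyDrive/rag_system.pkl")

if saved is not None:
    chunks = saved["chunks"]
    embeddings = saved["embeddings"]
    # Rebuild the FAISS index from its serialized byte array
    index = faiss.deserialize_index(saved["index"])
    print(f"Restored {index.ntotal} vectors and {len(chunks)} chunks")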
Colab Runtime Optimization Tips
- Model Download Caching:
import os

# Set before importing transformers/sentence-transformers so the cache path takes effect
os.environ['HF_HOME'] = '/content/drive/MyDrive/hf_cache'
# Models download to Drive once; later sessions reuse them without re-downloading
- Batch Processing:
def batch_encode(texts, batch_size=32):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        embeddings = embedding_model.encode(batch)
        all_embeddings.append(embeddings)
    return np.vstack(all_embeddings)
- Selective GPU Usage:
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"  # Fallback
model = model.to(device)
Conclusion
Lightweight RAG in Colab is possible; it just requires smart design decisions. Chunk documents along semantic boundaries, protect memory with quantized models, and avoid recomputation with caching and Drive persistence.
This combination gives you:
- ✅ Stable operation throughout 12-hour sessions
- ✅ Staying within ~15 GB memory limits
- ✅ Persistence across sessions
- ✅ Reasonable retrieval quality
In our next article, we'll combine real-time automation with GitHub Actions for 24/7 operations. Stay tuned!