Sema Translation API - Complete Documentation

Welcome to the documentation for the Sema Translation API, an enterprise-grade translation service that supports 200+ languages, runs on custom HuggingFace models, and focuses on African languages.

📚 Documentation Overview

This documentation covers all aspects of the Sema Translation API, from custom model implementation to advanced deployment scenarios and future application ideas.

🚀 Core Documentation

Custom Models Implementation

Essential Reading - Detailed documentation of how we implemented custom HuggingFace models:

  • Unified sematech/sema-utils repository structure
  • CTranslate2 optimization for 2-4x faster inference
  • Model loading pipeline and caching strategy
  • Performance benchmarks and monitoring
  • Model update and versioning process

API Capabilities

Complete overview of enhanced API features:

  • 55+ African languages (updated from 23)
  • Server-side performance timing
  • Language detection with confidence scores
  • Comprehensive language metadata system

Future Considerations

Roadmap and application ideas:

  • Authentication & user management with Supabase
  • Database integration and caching strategies
  • Document translation and real-time streaming
  • Innovative application ideas (chatbots, education, government services)

Deployment Architecture

Infrastructure and deployment details:

  • HuggingFace Spaces deployment process
  • Performance characteristics and resource requirements
  • Monitoring with Prometheus and structured logging
  • CI/CD pipeline and scaling considerations

📖 Additional Documentation

Project Overview

High-level project introduction and goals

API Reference

Complete endpoint documentation with examples

🌟 Key Achievements & Features

Custom HuggingFace Models Integration

  • Unified Repository: sematech/sema-utils containing all models
  • Optimized Performance: CTranslate2 INT8 quantization (75% size reduction)
  • Automatic Updates: HuggingFace Hub integration with version management
  • Enterprise Caching: Intelligent model caching and loading strategies

Enhanced African Language Support

  • 55+ African Languages: Complete FLORES-200 African language coverage
  • Regional Distribution: West, East, Southern, Central, and North Africa
  • Multiple Scripts: Latin, Arabic, Ethiopic, Tifinagh support
  • Cultural Context: Native names and regional information

Performance & Monitoring

  • Server-Side Timing: Request performance tracking in headers and responses
  • Prometheus Metrics: Comprehensive monitoring and analytics
  • Request Tracking: Unique request IDs for debugging
  • Health Monitoring: System status and model availability checks

🔧 Technical Implementation Highlights

Model Architecture

Custom HuggingFace Models (sematech/sema-utils)
├── Translation: NLLB-200 3.3B (CTranslate2 optimized)
├── Language Detection: FastText LID.176
├── Tokenization: SentencePiece
└── Language Database: FLORES-200 complete
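
To make the stack above more concrete, here is a minimal sketch of how SentencePiece tokens are commonly framed for an NLLB-200 model served through CTranslate2: the source language code is prepended, `</s>` is appended, and the target language code is supplied as a target prefix. The function names are hypothetical, and nothing here is confirmed to match the Sema API's internal code. `translator` and `sp` stand in for a `ctranslate2.Translator` and a `sentencepiece.SentencePieceProcessor` loaded from the model files.

```python
from typing import Any, List


def frame_source(tokens: List[str], src_lang: str) -> List[str]:
    """Wrap SentencePiece tokens in the NLLB layout: <lang> tokens </s>."""
    return [src_lang] + tokens + ["</s>"]


def strip_target_prefix(hypothesis: List[str], tgt_lang: str) -> List[str]:
    """Drop the leading target-language token that the decoder echoes back."""
    if hypothesis and hypothesis[0] == tgt_lang:
        return hypothesis[1:]
    return hypothesis


def translate_text(translator: Any, sp: Any, text: str,
                   src_lang: str, tgt_lang: str) -> str:
    """Hypothetical helper: tokenize, translate, and detokenize one string.

    `translator` is expected to behave like ctranslate2.Translator and
    `sp` like sentencepiece.SentencePieceProcessor.
    """
    source = frame_source(sp.encode(text, out_type=str), src_lang)
    results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
    best = strip_target_prefix(results[0].hypotheses[0], tgt_lang)
    return sp.decode(best)
```

Passing the translator and tokenizer in as arguments keeps the framing logic easy to test without loading the multi-gigabyte model.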

Performance Metrics

  • Model Size: 2.5GB (optimized from 6.6GB)
  • Inference Speed: 0.2-2.5 seconds depending on text length
  • Memory Usage: ~3.2GB for models, 50-100MB per request
  • Language Detection: 0.01-0.05 seconds with 99%+ accuracy
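
The memory figures above imply a simple budget: a fixed cost for the loaded models plus a per-request cost. The sketch below turns that into a rough ceiling on concurrent requests. The function name and the host size used in the note afterwards are hypothetical, and the estimate ignores the operating system and the web server's own overhead.

```python
def max_concurrent_requests(total_ram_gb: float,
                            model_ram_gb: float = 3.2,
                            per_request_mb: float = 100.0) -> int:
    """Estimate how many requests fit in RAM once the models are loaded.

    Defaults use the figures quoted above, taking the upper end (100MB)
    of the per-request range to stay conservative.
    """
    headroom_mb = (total_ram_gb - model_ram_gb) * 1024
    if headroom_mb <= 0:
        return 0
    return int(headroom_mb // per_request_mb)
```

On a hypothetical 16GB host this gives 131 concurrent requests at 100MB each, which is best read as an upper bound rather than a capacity guarantee.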

API Enhancements

  • Request Timing: Server-side performance measurement
  • Language Metadata: Complete language information system
  • Error Handling: Comprehensive validation and error responses
  • Rate Limiting: 60 requests/minute with graceful degradation
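
A client can stay under the 60 requests/minute limit by throttling itself. The sketch below is one possible client-side approach using a sliding window; the class name is hypothetical, and how the server responds once the limit is exceeded is not covered here.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Client-side throttle: at most `limit` calls per `window` seconds."""

    def __init__(self, limit: int = 60, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._sent: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            return 0.0
        return self.window - (now - self._sent[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self._sent.append(self.clock())
```

Before each request, call `wait_time()`, sleep for the returned number of seconds if it is non-zero, then call `record()`. The injectable `clock` exists only to make the class testable without real delays.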

🚀 Quick Start Examples

Basic Translation with Timing

curl -v -X POST "https://sematech-sema-api.hf.space/api/v1/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Habari ya asubuhi", "target_language": "eng_Latn"}'

# Response includes timing information:
# X-Response-Time: 1.234s
# X-Request-ID: 550e8400-e29b-41d4-a716-446655440000
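
The same call can be made from Python with only the standard library. This is a sketch rather than an official client: the helper names are hypothetical, and the parser assumes `X-Response-Time` always has the `<number>s` form shown above.

```python
import json
import urllib.request
from typing import Mapping, Optional

BASE_URL = "https://sematech-sema-api.hf.space/api/v1"


def build_translate_request(text: str, target_language: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"text": text, "target_language": target_language})
    return urllib.request.Request(
        f"{BASE_URL}/translate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response_time(headers: Mapping[str, str]) -> Optional[float]:
    """Convert an X-Response-Time value such as '1.234s' to seconds."""
    raw = headers.get("X-Response-Time")
    if raw is None:
        return None
    try:
        return float(raw.strip().rstrip("s"))
    except ValueError:
        return None  # header present but not in the assumed '<number>s' form


def send_translate(text: str, target_language: str, timeout: float = 30.0):
    """Send the request; returns (parsed JSON body, server time in seconds)."""
    req = build_translate_request(text, target_language)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp), parse_response_time(resp.headers)
```

Calling `send_translate("Habari ya asubuhi", "eng_Latn")` performs the network request; the two helper functions can be used and tested offline.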

African Languages Discovery

# Get all 55+ African languages
curl "https://sematech-sema-api.hf.space/api/v1/languages/african"

# Search for specific African languages
curl "https://sematech-sema-api.hf.space/api/v1/languages/search?q=Akan"
curl "https://sematech-sema-api.hf.space/api/v1/languages/search?q=Bambara"
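
When building these URLs in code, the search term should be URL-encoded so that language names containing spaces or accents survive intact. A small sketch with hypothetical helper names:

```python
from urllib.parse import urlencode

BASE_URL = "https://sematech-sema-api.hf.space/api/v1"


def african_languages_url() -> str:
    """URL for the African-languages listing endpoint shown above."""
    return f"{BASE_URL}/languages/african"


def language_search_url(query: str) -> str:
    """URL for the search endpoint, with the query term safely encoded."""
    return f"{BASE_URL}/languages/search?{urlencode({'q': query})}"
```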

Language Detection with Confidence

curl -X POST "https://sematech-sema-api.hf.space/api/v1/detect-language" \
  -H "Content-Type: application/json" \
  -d '{"text": "Habari ya asubuhi"}'

# Returns: detected language, confidence score, timing information
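
A client will usually want the detection result as a typed object. The sketch below parses a response body into a small dataclass. The field names `detected_language` and `is_english` follow the chatbot example in this document, while `confidence` is an assumed name; check the API reference for the actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    detected_language: str        # FLORES-200 code, e.g. "swh_Latn"
    confidence: Optional[float]   # assumed field name; None if absent
    is_english: bool


def parse_detection(payload: str) -> Detection:
    """Parse a detect-language JSON body into a Detection.

    Field names are assumptions, not a confirmed schema.
    """
    data = json.loads(payload)
    code = data["detected_language"]
    return Detection(
        detected_language=code,
        confidence=data.get("confidence"),
        # Fall back to the language code if the flag is missing.
        is_english=bool(data.get("is_english", code == "eng_Latn")),
    )
```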

🎯 Application Use Cases

1. Multilingual Chatbot Implementation

async def process_user_input(user_text):
    # detect_language, translate, and llm_chat are application-level helper
    # coroutines (not defined here) that wrap the API endpoints and your LLM.

    # 1. Detect language
    detection = await detect_language(user_text)

    # 2. Decide processing flow
    if detection.is_english:
        response = await llm_chat(user_text)
    else:
        # Translate → Process → Translate back
        english_input = await translate(user_text, "eng_Latn")
        english_response = await llm_chat(english_input)
        response = await translate(english_response, detection.detected_language)

    return response
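
The flow above can be exercised without network access by passing the three helpers in as arguments. The version below is a self-contained sketch: the `Detection` shape and the `fake_*` stand-ins are hypothetical and exist only to show the control flow. In a real application they would be replaced by calls to the detect-language and translate endpoints and to an LLM.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Detection:
    detected_language: str
    is_english: bool


async def process_user_input(
    user_text: str,
    detect_language: Callable[[str], Awaitable[Detection]],
    translate: Callable[[str, str], Awaitable[str]],
    llm_chat: Callable[[str], Awaitable[str]],
) -> str:
    """Detect, then either chat directly or translate, chat, translate back."""
    detection = await detect_language(user_text)
    if detection.is_english:
        return await llm_chat(user_text)
    english_input = await translate(user_text, "eng_Latn")
    english_response = await llm_chat(english_input)
    return await translate(english_response, detection.detected_language)


# Stand-in helpers so the flow can be run offline.
async def fake_detect(text: str) -> Detection:
    return Detection("swh_Latn", is_english=False)


async def fake_translate(text: str, target: str) -> str:
    return f"[{target}] {text}"


async def fake_chat(text: str) -> str:
    return f"reply to: {text}"


def demo() -> str:
    """Run the non-English branch end to end with the stand-in helpers."""
    return asyncio.run(
        process_user_input("Habari ya asubuhi", fake_detect, fake_translate, fake_chat)
    )
```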

2. African News Platform

  • Aggregate news from multiple African countries
  • Translate between African languages
  • Provide summaries in user's preferred language

3. Educational Platform

  • Interactive language learning with African languages
  • Cultural context and pronunciation guides
  • Progress tracking across multiple languages

4. Government Services

  • Multilingual official document translation
  • Emergency notifications in local languages
  • Citizen services in preferred languages

📊 API Statistics & Metrics

Language Coverage

  • Total Languages: 200+ (FLORES-200 complete)
  • African Languages: 55+ (updated from 23)
  • Writing Scripts: Latin, Arabic, Ethiopic, Tifinagh, Cyrillic, Han, etc.
  • Geographic Regions: Comprehensive global coverage

Performance Benchmarks

  • Translation Speed: 0.2-2.5s depending on text length
  • Language Detection: 0.01-0.05s with 99%+ accuracy
  • Model Efficiency: 75% size reduction with maintained quality
  • Concurrent Handling: Linear scaling with available resources

Quality Metrics

  • BLEU Scores: Industry-standard translation quality
  • African Languages: Specialized cultural context preservation
  • Uptime: 99.9% target availability
  • Error Rate: <1% under normal load

🔮 Future Roadmap

Immediate (3-6 months)

  • User authentication and usage tracking
  • Database integration with PostgreSQL
  • Redis caching for improved performance
  • Advanced monitoring dashboards

Medium-term (6-12 months)

  • Document translation with formatting preservation
  • Real-time translation streaming via WebSocket
  • Domain-specific models (medical, legal, technical)
  • Mobile SDK development

Long-term (1-2 years)

  • AI-powered translation ecosystem
  • Enterprise integration platform
  • African language research contributions
  • Voice-to-voice translation capabilities

🛠️ Development & Deployment

Local Development

# Clone and setup
git clone https://github.com/lewiskimaru/sema.git
cd sema/backend/sema-api

# Install dependencies
pip install -r requirements.txt

# Run locally
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Testing

# Run comprehensive tests
python tests/test_african_languages_update.py
python tests/test_performance_timing.py
python tests/simple_test.py

Deployment

  • Platform: HuggingFace Spaces
  • Auto-deployment: Git integration
  • Model Updates: Automatic from sematech/sema-utils
  • Monitoring: Prometheus metrics and health checks

📞 Support & Resources

Documentation Links

Model Repository

Community & Support

  • GitHub Repository: Complete source code and issue tracking
  • Model Contributions: Community-driven improvements
  • Research Collaboration: Academic partnerships for African language research

The Sema Translation API represents a significant advancement in African language technology, combining custom HuggingFace models with enterprise-grade infrastructure to serve diverse global communities.

Documentation last updated: June 2024 | API Version: 2.0.0