---
language:
  - en
  - te
  - sa
tags:
  - text-generation
  - structured-data
  - multilingual
  - deepseek
  - multi-domain
  - with-domain-labels
license: apache-2.0
datasets:
  - custom
pipeline_tag: text-generation
---

Fine-tuned SLM T2 With Domain Labels - Multi-Domain Structured Data Generation

This model is fine-tuned for generating natural language sentences from structured data with explicit domain labels and enhanced diversity.

Model Details

  • Base Model: DeepSeek V3 Compact (~110M parameters)
  • Task: Multi-domain structured data to text generation
  • Languages: English, Telugu, Sanskrit
  • Training Format: Domain: {domain}\n{key: value, ...}\nOutput:
  • Enhanced Features: Diverse vocabulary, global names, extensive datasets
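The training format above can be produced with a small helper. This is a hypothetical utility (not shipped with the model) that renders a domain name and a field dict into the documented `Domain: {domain}\n{key: value, ...}\nOutput:` layout:

```python
# Hypothetical helper: build a prompt in the documented training format.
def build_prompt(domain: str, fields: dict) -> str:
    pairs = ", ".join(f"{k}: {v}" for k, v in fields.items())
    return f"Domain: {domain}\n{pairs}\nOutput:"

prompt = build_prompt("sports", {"Team1": "Lakers", "Score1": 108,
                                 "Team2": "Warriors", "Score2": 90})
print(prompt)
# Domain: sports
# Team1: Lakers, Score1: 108, Team2: Warriors, Score2: 90
# Output:
```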

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("asrith05/finetuned_slm_t2_withdomain")
tokenizer = AutoTokenizer.from_pretrained("asrith05/finetuned_slm_t2_withdomain")

# Example: Sports with domain label
prompt = '''Domain: sports
Team1: Lakers, Score1: 108, Team2: Warriors, Score2: 90
Output:'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Expected: "Domain: sports\nTeam1: Lakers, Score1: 108, Team2: Warriors, Score2: 90\nOutput: The Lakers defeated the Warriors with a final score of 108 to 90."
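Because the decoded string echoes the prompt, a common post-processing step (an assumption on our part, not part of the model itself) is to keep only the text after the final `Output:` marker:

```python
# Hypothetical post-processing: strip the echoed prompt, keep the generation.
def extract_output(decoded: str) -> str:
    return decoded.split("Output:", 1)[-1].strip()

decoded = ("Domain: sports\nTeam1: Lakers, Score1: 108, Team2: Warriors, "
           "Score2: 90\nOutput: The Lakers defeated the Warriors 108 to 90.")
print(extract_output(decoded))
# The Lakers defeated the Warriors 108 to 90.
```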

Training Details

  • Dataset Split: 20k train / 5k validation / 5k test
  • Epochs: 1
  • Learning Rate: 5e-5
  • Enhanced Vocabulary: 200+ teams, 300+ cities, 400+ names
  • Format: Domain-aware with structured input-output pairs
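The 20k/5k/5k split can be sketched as a seeded shuffle over the 30k examples. The seed and exact split procedure are assumptions; the card does not publish the training script:

```python
import random

# Sketch of the 20k train / 5k validation / 5k test split described above.
def split_dataset(examples, seed=42):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[:20_000], shuffled[20_000:25_000], shuffled[25_000:30_000]

train, val, test = split_dataset([{"id": i} for i in range(30_000)])
print(len(train), len(val), len(test))
# 20000 5000 5000
```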

Domain-Specific Examples

Sports Domain

Domain: sports
Team1: Mumbai Indians, Score1: 185, Team2: Chennai Super Kings, Score2: 180
Output: Mumbai Indians won against Chennai Super Kings with a score of 185 to 180.

Weather Domain

Domain: weather
City: Hyderabad, Temperature: 32, Condition: sunny, Day: Monday
Output: On Monday, Hyderabad experienced sunny weather with a temperature of 32 degrees.

Travel Domain

Domain: travel
Person: Priya, City: Bangalore, Transport: flight, Duration: 2
Output: Priya traveled to Bangalore by flight, which took 2 hours.

Movies Domain

Domain: movies
Movie: RRR, Genre: Action, Rating: 8.2, Year: 2022
Output: RRR is an Action movie from 2022 with a rating of 8.2.

Products Domain

Domain: products
Product: OnePlus, Brand: OnePlus, Price: 45000, Rating: 4.3
Output: The OnePlus smartphone by OnePlus is priced at ₹45,000 with a rating of 4.3 stars.
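For validating inputs before generation, the structured line used in all five examples above can be parsed back into a field dict. This parser is a hypothetical convenience, not part of the released model:

```python
# Hypothetical helper: parse "Key: value, Key: value" lines into a dict.
def parse_fields(line: str) -> dict:
    return {k.strip(): v.strip()
            for k, v in (pair.split(":", 1) for pair in line.split(","))}

fields = parse_fields("Movie: RRR, Genre: Action, Rating: 8.2, Year: 2022")
print(fields)
# {'Movie': 'RRR', 'Genre': 'Action', 'Rating': '8.2', 'Year': '2022'}
```

Note that all values come back as strings; numeric fields like `Rating` need an explicit conversion if you post-process them.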

Enhanced Vocabulary Features

Global Team Coverage

  • NBA Teams: Lakers, Warriors, Bulls, Celtics, etc.
  • IPL Teams: Mumbai Indians, Chennai Super Kings, RCB, etc.
  • Football Teams: Manchester United, Barcelona, Real Madrid, etc.
  • Esports Teams: FaZe Clan, Team Liquid, Cloud9, etc.

Worldwide Cities

  • Indian Cities: Mumbai, Delhi, Bangalore, Hyderabad, Chennai, Pune, Kolkata
  • Global Cities: New York, London, Tokyo, Paris, Sydney, Dubai, Singapore
  • Total Coverage: 300+ cities across all continents

Multicultural Names

  • Indian Names: Priya, Arjun, Ananya, Rohan, Kavya, Aditya
  • Global Names: John, Emma, Chen, Maria, Ahmed, Sophie
  • Total Diversity: 400+ names representing multiple cultures

Entertainment Content

  • Hollywood: Inception, Avengers, The Dark Knight, Titanic
  • Bollywood: Dangal, Baahubali, 3 Idiots, Zindagi Na Milegi Dobara
  • Tollywood: RRR, Pushpa, Arjun Reddy, Eega

Key Advantages

| Feature             | Standard Model | With Domain Labels |
|---------------------|----------------|--------------------|
| Domain Context      | Implicit       | Explicit labels    |
| Generation Accuracy | Good           | Enhanced           |
| Domain Consistency  | Variable       | Guaranteed         |
| Training Efficiency | Standard       | Domain-guided      |
| Output Quality      | Good           | Domain-optimized   |

Technical Specifications

  • Architecture: DeepSeek V3 Compact
  • Parameters: ~110M trainable parameters
  • Context Length: 2048 tokens
  • Training Precision: FP16
  • Inference Precision: FP32
  • Optimizer: AdamW with cosine learning rate schedule
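The cosine learning-rate schedule named above can be written out directly. The formula below assumes no warmup steps, since the card does not mention any:

```python
import math

# Cosine decay from the base learning rate (5e-5) down to zero.
def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    progress = step / max(1, total_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000))     # base_lr at the start
print(cosine_lr(1000, 1000))  # ~0.0 at the end
```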

Training Process

  1. Data Preparation: 30k diverse examples across 5 domains
  2. Domain Labeling: Explicit domain tags for each example
  3. Vocabulary Enhancement: 5x expanded vocabulary coverage
  4. Balanced Training: Equal representation across all domains
  5. Validation: Continuous validation loss monitoring
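Step 4 (balanced training) can be sketched as down-sampling each domain to the size of the smallest one. The exact balancing method is not specified in the card, so this is one plausible reading:

```python
import random
from collections import Counter

# Down-sample every domain to the smallest domain's size, then shuffle.
def balance_by_domain(examples, seed=0):
    rng = random.Random(seed)
    by_domain = {}
    for ex in examples:
        by_domain.setdefault(ex["domain"], []).append(ex)
    n = min(len(group) for group in by_domain.values())
    balanced = [ex for group in by_domain.values()
                for ex in rng.sample(group, n)]
    rng.shuffle(balanced)
    return balanced

data = [{"domain": "sports"}] * 10 + [{"domain": "weather"}] * 4
print(Counter(ex["domain"] for ex in balance_by_domain(data)))
# equal counts per domain
```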

Performance Metrics

  • Training Loss: Converged to 0.27
  • Validation Loss: Stable at 0.29
  • Domain Accuracy: >95% correct domain understanding
  • Fluency: high natural-language quality (qualitative assessment)
  • Diversity: broad use of the enhanced vocabulary (qualitative assessment)

Use Cases

  1. Automated Report Generation: Convert database records to readable reports
  2. Content Creation: Generate diverse content for different domains
  3. Data Storytelling: Transform structured data into narratives
  4. Template-based Generation: Consistent formatting across domains
  5. Multilingual Applications: Support for Indian languages
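Use case 1 (automated report generation) can be sketched as a pipeline that turns tabular records, such as a database or CSV export, into the model's prompt format; the actual `generate()` call is elided and the column names are hypothetical:

```python
import csv
import io

# Turn each CSV row into a domain-labeled prompt ready for the model.
def records_to_prompts(domain: str, csv_text: str) -> list:
    prompts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pairs = ", ".join(f"{k}: {v}" for k, v in row.items())
        prompts.append(f"Domain: {domain}\n{pairs}\nOutput:")
    return prompts

csv_text = "City,Temperature,Condition,Day\nHyderabad,32,sunny,Monday\n"
print(records_to_prompts("weather", csv_text)[0])
# Domain: weather
# City: Hyderabad, Temperature: 32, Condition: sunny, Day: Monday
# Output:
```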

Model Limitations

  • Optimized for the specific domain-labeled input format
  • Best performance on domains similar to training data
  • Requires well-structured input with clear domain labels
  • Limited creativity outside of trained domains
  • May struggle with highly technical or specialized vocabulary

Citation

@misc{finetuned_slm_t2_withdomain,
  title={Fine-tuned SLM T2 With Domain Labels: Multi-Domain Structured Data Generation},
  author={Asrith},
  year={2024},
  url={https://huggingface.co/asrith05/finetuned_slm_t2_withdomain}
}

Acknowledgments

This model builds upon the DeepSeek V3 architecture and was trained on carefully curated diverse datasets. Special thanks to the open-source community for providing the foundational tools and frameworks that made this work possible.


Note: This model is designed specifically for structured data-to-text generation with domain awareness. For best results, always include the domain label in your input format.