Fine-tuned SLM T2 With Domain Labels - Multi-Domain Structured Data Generation

This model is fine-tuned for generating natural language sentences from structured data with explicit domain labels and enhanced diversity.

Model Details

  • Base Model: DeepSeek V3 Compact (~110M parameters)
  • Task: Multi-domain structured data to text generation
  • Languages: English, Telugu, Sanskrit
  • Training Format: Domain: {domain}\n{key: value, ...}\nOutput:
  • Enhanced Features: Diverse vocabulary, global names, extensive datasets
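As a sketch of the training format above, a small helper can assemble a prompt from a domain tag and a dict of fields (format_example is a hypothetical name, not part of the model repo):

```python
def format_example(domain, fields):
    """Build a prompt in the card's training format:
    Domain: {domain}\n{key: value, ...}\nOutput:
    (Hypothetical helper illustrating the documented format.)"""
    pairs = ", ".join(f"{k}: {v}" for k, v in fields.items())
    return f"Domain: {domain}\n{pairs}\nOutput:"

prompt = format_example("sports", {"Team1": "Lakers", "Score1": 108,
                                   "Team2": "Warriors", "Score2": 90})
print(prompt)
```

Field order in the prompt follows insertion order of the dict, matching the examples later in this card.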

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("asrith05/finetuned_slm_t2_withdomain")
tokenizer = AutoTokenizer.from_pretrained("asrith05/finetuned_slm_t2_withdomain")

# Example: Sports with domain label
prompt = '''Domain: sports
Team1: Lakers, Score1: 108, Team2: Warriors, Score2: 90
Output:'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Example output (the decoded string echoes the prompt):
# "Domain: sports\nTeam1: Lakers, Score1: 108, Team2: Warriors, Score2: 90\nOutput: The Lakers defeated the Warriors with a final score of 108 to 90."
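Since the decoded string echoes the full prompt, downstream code usually wants only the generated sentence. A minimal sketch of stripping the echo (extract_output is a hypothetical helper):

```python
def extract_output(decoded):
    """Return only the generated sentence: everything after the
    final 'Output:' marker that the prompt format guarantees."""
    return decoded.split("Output:")[-1].strip()

decoded = ("Domain: sports\n"
           "Team1: Lakers, Score1: 108, Team2: Warriors, Score2: 90\n"
           "Output: The Lakers defeated the Warriors 108 to 90.")
print(extract_output(decoded))  # The Lakers defeated the Warriors 108 to 90.
```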

Training Details

  • Dataset Split: 20k train / 5k validation / 5k test
  • Epochs: 1
  • Learning Rate: 5e-5
  • Enhanced Vocabulary: 200+ teams, 300+ cities, 400+ names
  • Format: Domain-aware with structured input-output pairs

Domain-Specific Examples

Sports Domain

Domain: sports
Team1: Mumbai Indians, Score1: 185, Team2: Chennai Super Kings, Score2: 180
Output: Mumbai Indians won against Chennai Super Kings with a score of 185 to 180.

Weather Domain

Domain: weather
City: Hyderabad, Temperature: 32, Condition: sunny, Day: Monday
Output: On Monday, Hyderabad experienced sunny weather with a temperature of 32 degrees.

Travel Domain

Domain: travel
Person: Priya, City: Bangalore, Transport: flight, Duration: 2
Output: Priya traveled to Bangalore by flight, which took 2 hours.

Movies Domain

Domain: movies
Movie: RRR, Genre: Action, Rating: 8.2, Year: 2022
Output: RRR is an Action movie from 2022 with a rating of 8.2.

Products Domain

Domain: products
Product: OnePlus, Brand: OnePlus, Price: 45000, Rating: 4.3
Output: The OnePlus smartphone by OnePlus is priced at ₹45,000 with a rating of 4.3 stars.
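The five domain examples above all follow the same prompt format, so they can be batched for generation. A sketch that rebuilds the card's example inputs as prompts (the record values are taken from the examples above; the helper code is illustrative, not part of the repo):

```python
# Records mirroring the card's domain examples, rendered into the
# "Domain: ...\nkey: value, ...\nOutput:" training format.
records = [
    ("sports",   {"Team1": "Mumbai Indians", "Score1": 185,
                  "Team2": "Chennai Super Kings", "Score2": 180}),
    ("weather",  {"City": "Hyderabad", "Temperature": 32,
                  "Condition": "sunny", "Day": "Monday"}),
    ("travel",   {"Person": "Priya", "City": "Bangalore",
                  "Transport": "flight", "Duration": 2}),
    ("movies",   {"Movie": "RRR", "Genre": "Action",
                  "Rating": 8.2, "Year": 2022}),
    ("products", {"Product": "OnePlus", "Brand": "OnePlus",
                  "Price": 45000, "Rating": 4.3}),
]
prompts = [
    "Domain: %s\n%s\nOutput:" % (d, ", ".join(f"{k}: {v}" for k, v in f.items()))
    for d, f in records
]
print(prompts[0])
```

Each prompt can then be tokenized and passed to model.generate() as shown in the Usage section.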

Enhanced Vocabulary Features

Global Team Coverage

  • NBA Teams: Lakers, Warriors, Bulls, Celtics, etc.
  • IPL Teams: Mumbai Indians, Chennai Super Kings, RCB, etc.
  • Football Teams: Manchester United, Barcelona, Real Madrid, etc.
  • Esports Teams: FaZe Clan, Team Liquid, Cloud9, etc.

Worldwide Cities

  • Indian Cities: Mumbai, Delhi, Bangalore, Hyderabad, Chennai, Pune, Kolkata
  • Global Cities: New York, London, Tokyo, Paris, Sydney, Dubai, Singapore
  • Total Coverage: 300+ cities across all continents

Multicultural Names

  • Indian Names: Priya, Arjun, Ananya, Rohan, Kavya, Aditya
  • Global Names: John, Emma, Chen, Maria, Ahmed, Sophie
  • Total Diversity: 400+ names representing multiple cultures

Entertainment Content

  • Hollywood: Inception, Avengers, The Dark Knight, Titanic
  • Bollywood: Dangal, Baahubali, 3 Idiots, Zindagi Na Milegi Dobara
  • Tollywood: RRR, Pushpa, Arjun Reddy, Eega

Key Advantages

Feature                Standard Model   With Domain Labels
Domain Context         Implicit         Explicit labels
Generation Accuracy    Good             Enhanced
Domain Consistency     Variable         Guaranteed
Training Efficiency    Standard         Domain-guided
Output Quality         Good             Domain-optimized

Technical Specifications

  • Architecture: DeepSeek V3 Compact
  • Parameters: ~110M trainable parameters
  • Context Length: 2048 tokens
  • Training Precision: FP16
  • Inference Precision: FP32
  • Optimizer: AdamW with cosine learning rate schedule
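The cosine learning-rate schedule listed above can be sketched in plain Python (a simplified form with the card's 5e-5 peak; the warmup length and exact decay used in training are not stated in the card):

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Cosine decay from base_lr toward 0, with optional linear warmup.
    Illustrative sketch of the schedule named in the card, not the
    exact training configuration."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000))     # peak rate, 5e-05
print(cosine_lr(1000, 1000))  # decays to ~0 at the end of training
```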

Training Process

  1. Data Preparation: 30k diverse examples across 5 domains
  2. Domain Labeling: Explicit domain tags for each example
  3. Vocabulary Enhancement: 5x expanded vocabulary coverage
  4. Balanced Training: Equal representation across all domains
  5. Validation: Continuous validation loss monitoring

Performance Metrics

  • Training Loss: Converged to 0.27
  • Validation Loss: Stable at 0.29
  • Domain Accuracy: >95% correct domain understanding
  • Fluency Score: High natural language quality
  • Diversity Index: Enhanced vocabulary utilization

Use Cases

  1. Automated Report Generation: Convert database records to readable reports
  2. Content Creation: Generate diverse content for different domains
  3. Data Storytelling: Transform structured data into narratives
  4. Template-based Generation: Consistent formatting across domains
  5. Multilingual Applications: Support for Indian languages

Model Limitations

  • Optimized for the specific domain-labeled input format
  • Best performance on domains similar to training data
  • Requires well-structured input with clear domain labels
  • Limited creativity outside of trained domains
  • May struggle with highly technical or specialized vocabulary

Related Models in the Series

Citation

@model{finetuned_slm_t2_withdomain,
  title={Fine-tuned SLM T2 With Domain Labels: Multi-Domain Structured Data Generation},
  author={Asrith},
  year={2024},
  url={https://huggingface.co/asrith05/finetuned_slm_t2_withdomain}
}

Acknowledgments

This model builds upon the DeepSeek V3 architecture and was trained on carefully curated diverse datasets. Special thanks to the open-source community for providing the foundational tools and frameworks that made this work possible.


Note: This model is designed specifically for structured data-to-text generation with domain awareness. For best results, always include the domain label in your input format.
