---
language:
  - en
  - te
  - sa
tags:
  - text-generation
  - structured-data
  - multilingual
  - deepseek
  - multi-domain
  - with-domain-labels
license: apache-2.0
datasets:
  - custom
pipeline_tag: text-generation
---

Fine-tuned SLM T2 With Domain Labels - Multi-Domain Structured Data Generation

This model is fine-tuned for generating natural language sentences from structured data with explicit domain labels and enhanced diversity.

Model Details

  • Base Model: DeepSeek V3 Compact (~110M parameters)
  • Task: Multi-domain structured data to text generation
  • Languages: English, Telugu, Sanskrit
  • Training Format: Domain: {domain}\n{key: value, ...}\nOutput:
  • Enhanced Features: Diverse vocabulary, global names, extensive datasets
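The training format above can be produced with a small helper. This is a hypothetical utility (not shipped with the model) that renders a domain name and a field dict into the documented `Domain: {domain}\n{key: value, ...}\nOutput:` layout:

```python
# Hypothetical helper: build a prompt in the documented training format.
def build_prompt(domain: str, fields: dict) -> str:
    pairs = ", ".join(f"{k}: {v}" for k, v in fields.items())
    return f"Domain: {domain}\n{pairs}\nOutput:"

prompt = build_prompt("sports", {"Team1": "Lakers", "Score1": 108,
                                 "Team2": "Warriors", "Score2": 90})
print(prompt)
# Domain: sports
# Team1: Lakers, Score1: 108, Team2: Warriors, Score2: 90
# Output:
```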

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("asrith05/finetuned_slm_t2_withdomain")
tokenizer = AutoTokenizer.from_pretrained("asrith05/finetuned_slm_t2_withdomain")

# Example: Sports with domain label
prompt = '''Domain: sports
Team1: Lakers, Score1: 108, Team2: Warriors, Score2: 90
Output:'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Expected: "Domain: sports\nTeam1: Lakers, Score1: 108, Team2: Warriors, Score2: 90\nOutput: The Lakers defeated the Warriors with a final score of 108 to 90."
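Because the decoded string echoes the prompt, a common post-processing step (an assumption on our part, not part of the model itself) is to keep only the text after the final `Output:` marker:

```python
# Hypothetical post-processing: strip the echoed prompt, keep the generation.
def extract_output(decoded: str) -> str:
    return decoded.split("Output:", 1)[-1].strip()

decoded = ("Domain: sports\nTeam1: Lakers, Score1: 108, Team2: Warriors, "
           "Score2: 90\nOutput: The Lakers defeated the Warriors 108 to 90.")
print(extract_output(decoded))
# The Lakers defeated the Warriors 108 to 90.
```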

Training Details

  • Dataset Split: 20k train / 5k validation / 5k test
  • Epochs: 1
  • Learning Rate: 5e-5
  • Enhanced Vocabulary: 200+ teams, 300+ cities, 400+ names
  • Format: Domain-aware with structured input-output pairs
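The 20k/5k/5k split can be sketched as a seeded shuffle over the 30k examples. The seed and exact split procedure are assumptions; the card does not publish the training script:

```python
import random

# Sketch of the 20k train / 5k validation / 5k test split described above.
def split_dataset(examples, seed=42):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[:20_000], shuffled[20_000:25_000], shuffled[25_000:30_000]

train, val, test = split_dataset([{"id": i} for i in range(30_000)])
print(len(train), len(val), len(test))
# 20000 5000 5000
```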

Domain-Specific Examples

Sports Domain

Domain: sports
Team1: Mumbai Indians, Score1: 185, Team2: Chennai Super Kings, Score2: 180
Output: Mumbai Indians won against Chennai Super Kings with a score of 185 to 180.

Weather Domain

Domain: weather
City: Hyderabad, Temperature: 32, Condition: sunny, Day: Monday
Output: On Monday, Hyderabad experienced sunny weather with a temperature of 32 degrees.

Travel Domain

Domain: travel
Person: Priya, City: Bangalore, Transport: flight, Duration: 2
Output: Priya traveled to Bangalore by flight, which took 2 hours.

Movies Domain

Domain: movies
Movie: RRR, Genre: Action, Rating: 8.2, Year: 2022
Output: RRR is an Action movie from 2022 with a rating of 8.2.

Products Domain

Domain: products
Product: OnePlus, Brand: OnePlus, Price: 45000, Rating: 4.3
Output: The OnePlus smartphone by OnePlus is priced at ₹45,000 with a rating of 4.3 stars.
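For validating inputs before generation, the structured line used in all five examples above can be parsed back into a field dict. This parser is a hypothetical convenience, not part of the released model:

```python
# Hypothetical helper: parse "Key: value, Key: value" lines into a dict.
def parse_fields(line: str) -> dict:
    return {k.strip(): v.strip()
            for k, v in (pair.split(":", 1) for pair in line.split(","))}

fields = parse_fields("Movie: RRR, Genre: Action, Rating: 8.2, Year: 2022")
print(fields)
# {'Movie': 'RRR', 'Genre': 'Action', 'Rating': '8.2', 'Year': '2022'}
```

Note that all values come back as strings; numeric fields like `Rating` need an explicit conversion if you post-process them.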

Enhanced Vocabulary Features

Global Team Coverage

  • NBA Teams: Lakers, Warriors, Bulls, Celtics, etc.
  • IPL Teams: Mumbai Indians, Chennai Super Kings, RCB, etc.
  • Football Teams: Manchester United, Barcelona, Real Madrid, etc.
  • Esports Teams: FaZe Clan, Team Liquid, Cloud9, etc.

Worldwide Cities

  • Indian Cities: Mumbai, Delhi, Bangalore, Hyderabad, Chennai, Pune, Kolkata
  • Global Cities: New York, London, Tokyo, Paris, Sydney, Dubai, Singapore
  • Total Coverage: 300+ cities across all continents

Multicultural Names

  • Indian Names: Priya, Arjun, Ananya, Rohan, Kavya, Aditya
  • Global Names: John, Emma, Chen, Maria, Ahmed, Sophie
  • Total Diversity: 400+ names representing multiple cultures

Entertainment Content

  • Hollywood: Inception, Avengers, The Dark Knight, Titanic
  • Bollywood: Dangal, Baahubali, 3 Idiots, Zindagi Na Milegi Dobara
  • Tollywood: RRR, Pushpa, Arjun Reddy, Eega

Key Advantages

| Feature             | Standard Model | With Domain Labels |
|---------------------|----------------|--------------------|
| Domain Context      | Implicit       | Explicit labels    |
| Generation Accuracy | Good           | Enhanced           |
| Domain Consistency  | Variable       | Guaranteed         |
| Training Efficiency | Standard       | Domain-guided      |
| Output Quality      | Good           | Domain-optimized   |

Technical Specifications

  • Architecture: DeepSeek V3 Compact
  • Parameters: ~110M trainable parameters
  • Context Length: 2048 tokens
  • Training Precision: FP16
  • Inference Precision: FP32
  • Optimizer: AdamW with cosine learning rate schedule
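The cosine learning-rate schedule named above can be written out directly. The formula below assumes no warmup steps, since the card does not mention any:

```python
import math

# Cosine decay from the base learning rate (5e-5) down to zero.
def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    progress = step / max(1, total_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000))     # base_lr at the start
print(cosine_lr(1000, 1000))  # ~0.0 at the end
```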

Training Process

  1. Data Preparation: 30k diverse examples across 5 domains
  2. Domain Labeling: Explicit domain tags for each example
  3. Vocabulary Enhancement: 5x expanded vocabulary coverage
  4. Balanced Training: Equal representation across all domains
  5. Validation: Continuous validation loss monitoring
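Step 4 (balanced training) can be sketched as down-sampling each domain to the size of the smallest one. The exact balancing method is not specified in the card, so this is one plausible reading:

```python
import random
from collections import Counter

# Down-sample every domain to the smallest domain's size, then shuffle.
def balance_by_domain(examples, seed=0):
    rng = random.Random(seed)
    by_domain = {}
    for ex in examples:
        by_domain.setdefault(ex["domain"], []).append(ex)
    n = min(len(group) for group in by_domain.values())
    balanced = [ex for group in by_domain.values()
                for ex in rng.sample(group, n)]
    rng.shuffle(balanced)
    return balanced

data = [{"domain": "sports"}] * 10 + [{"domain": "weather"}] * 4
print(Counter(ex["domain"] for ex in balance_by_domain(data)))
# equal counts per domain
```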

Performance Metrics

  • Training Loss: Converged to 0.27
  • Validation Loss: Stable at 0.29
  • Domain Accuracy: >95% correct domain understanding
  • Fluency: high natural-language quality (qualitative assessment)
  • Diversity: broad use of the enhanced vocabulary (qualitative assessment)

Use Cases

  1. Automated Report Generation: Convert database records to readable reports
  2. Content Creation: Generate diverse content for different domains
  3. Data Storytelling: Transform structured data into narratives
  4. Template-based Generation: Consistent formatting across domains
  5. Multilingual Applications: Support for Indian languages
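Use case 1 (automated report generation) can be sketched as a pipeline that turns tabular records, such as a database or CSV export, into the model's prompt format; the actual `generate()` call is elided and the column names are hypothetical:

```python
import csv
import io

# Turn each CSV row into a domain-labeled prompt ready for the model.
def records_to_prompts(domain: str, csv_text: str) -> list:
    prompts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pairs = ", ".join(f"{k}: {v}" for k, v in row.items())
        prompts.append(f"Domain: {domain}\n{pairs}\nOutput:")
    return prompts

csv_text = "City,Temperature,Condition,Day\nHyderabad,32,sunny,Monday\n"
print(records_to_prompts("weather", csv_text)[0])
# Domain: weather
# City: Hyderabad, Temperature: 32, Condition: sunny, Day: Monday
# Output:
```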

Model Limitations

  • Optimized for the specific domain-labeled input format
  • Best performance on domains similar to training data
  • Requires well-structured input with clear domain labels
  • Limited creativity outside of trained domains
  • May struggle with highly technical or specialized vocabulary

Citation

@misc{finetuned_slm_t2_withdomain,
  title={Fine-tuned SLM T2 With Domain Labels: Multi-Domain Structured Data Generation},
  author={Asrith},
  year={2024},
  url={https://huggingface.co/asrith05/finetuned_slm_t2_withdomain}
}

Acknowledgments

This model builds upon the DeepSeek V3 architecture and was trained on carefully curated diverse datasets. Special thanks to the open-source community for providing the foundational tools and frameworks that made this work possible.


Note: This model is designed specifically for structured data-to-text generation with domain awareness. For best results, always include the domain label in your input format.