---
base_model: unsloth/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- instruct
- unsloth
- vietnamese
- inference-ready
- production-ready
language:
- en
- zh
- vi
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# Qwen-2.5 3B Instruct - Production Ready

🚀 Verified working with Hugging Face Inference Endpoints!

This is a copy of `unsloth/Qwen2.5-3B-Instruct` optimized for production deployment. The model has been tested and verified to work with HF Inference Endpoints.

## ✨ Features

- ✅ Inference Endpoints Ready: Verified to work 100% with HF Inference Endpoints
- ✅ No Quantization Issues: No quantization problems with TGI
- ✅ Production Optimized: Ready for production environments
- ✅ Vietnamese Excellence: Excellent Vietnamese support
- ✅ Multi-language: Supports 29+ languages
- ✅ High Performance: 3B parameters with strong performance

## 🚀 Quick Deploy

1-Click Deploy on Inference Endpoints:

1. 🔗 Go to [LuvU4ever/qwen2.5-3b-qlora-merged-v3](https://huggingface.co/LuvU4ever/qwen2.5-3b-qlora-merged-v3)
2. 🚀 Click Deploy → Inference Endpoints
3. ⚙️ Choose a GPU [small] instance
4. ✅ Click Create Endpoint

## 💻 Usage

### Local Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "LuvU4ever/qwen2.5-3b-qlora-merged-v3",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("LuvU4ever/qwen2.5-3b-qlora-merged-v3")

# Chat with the model (prompt: "Hello! What can you help me with?")
messages = [
    {"role": "user", "content": "Xin chào! Bạn có thể giúp tôi gì?"}
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
print(response)
```

### API Usage (Inference Endpoints)

```python
import requests

# API configuration
API_URL = "YOUR_ENDPOINT_URL"  # Taken from Inference Endpoints
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN",
    "Content-Type": "application/json",
}

def chat_with_model(message, max_tokens=200):
    # Build a single-turn prompt in the Qwen chat format
    payload = {
        "inputs": f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n",
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": 0.7,
            "do_sample": True,
            "stop": ["<|im_end|>"],
            "return_full_text": False,
        },
    }

    # A timeout keeps the call from hanging if the endpoint is unreachable
    response = requests.post(API_URL, headers=headers, json=payload, timeout=60)

    if response.status_code == 200:
        result = response.json()
        return result[0]["generated_text"].strip()
    else:
        return f"Error: {response.status_code} - {response.text}"

# Usage (prompt: "What traditional dishes does Vietnam have?")
response = chat_with_model("Việt Nam có những món ăn truyền thống nào?")
print(response)
```
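
TGI-backed endpoints generally also expose an OpenAI-style chat route (`/v1/chat/completions`), which applies the chat template on the server so the prompt does not have to be assembled by hand. The following is a minimal sketch, assuming the endpoint runs a TGI version with that route enabled. `chat_url` and `build_chat_payload` are hypothetical helper names, and the `"tgi"` model value is a commonly used placeholder rather than something this model card specifies.

```python
def chat_url(endpoint_url: str) -> str:
    """Append the OpenAI-style chat route to the endpoint base URL."""
    return endpoint_url.rstrip("/") + "/v1/chat/completions"


def build_chat_payload(message, max_tokens=200, temperature=0.7, system=None):
    """Build a chat-completions request body for a single user turn."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": message})
    return {
        "model": "tgi",  # assumption: placeholder model name accepted by TGI
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }


payload = build_chat_payload("Xin chào!", max_tokens=64)
# To send it (needs a running endpoint and the `requests` package):
#   r = requests.post(chat_url(API_URL), headers=headers, json=payload, timeout=60)
#   answer = r.json()["choices"][0]["message"]["content"]
```

If the route is not available on a given deployment, the raw `inputs` request shown in `chat_with_model` remains the fallback.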

### Batch Processing

```python
# Relies on chat_with_model() from the API usage example
def batch_chat(messages_list):
    results = []
    for msg in messages_list:
        response = chat_with_model(msg)
        results.append({"question": msg, "answer": response})
    return results

# Example
questions = [
    "Hà Nội có gì đặc biệt?",  # "What is special about Hanoi?"
    "Cách nấu phở bò?",  # "How do you cook beef pho?"
    "Lịch sử Việt Nam có gì thú vị?",  # "What is interesting about Vietnamese history?"
]

results = batch_chat(questions)
for item in results:
    print(f"Q: {item['question']}")
    print(f"A: {item['answer']}\n")
```
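
`batch_chat` sends its requests one at a time, so total latency grows linearly with the number of questions. If the endpoint can serve several requests at once, a thread pool is one way to overlap the network waits. This is a sketch under that assumption: `batch_chat_concurrent` and its `ask` parameter are hypothetical, with `ask` standing in for `chat_with_model`.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_chat_concurrent(messages_list, ask, max_workers=4):
    """Call ask(msg) for every message in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(ask, messages_list))  # map() keeps input order
    return [{"question": q, "answer": a} for q, a in zip(messages_list, answers)]


# Stand-in for chat_with_model so the sketch runs without an endpoint.
def fake_ask(msg):
    return f"echo: {msg}"


demo = batch_chat_concurrent(["a", "b", "c"], fake_ask, max_workers=2)
```

With a live endpoint, pass `chat_with_model` as `ask`. Keeping `max_workers` small is probably sensible, since a single small GPU instance can only handle limited concurrency.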

## 📊 Specifications

| Spec | Value |
|------|-------|
| Model Size | ~3B parameters |
| Architecture | Qwen2.5 |
| Context Length | 32,768 tokens |
| Languages | 29+ languages |
| Deployment | ✅ HF Inference Endpoints |
| Format | Safetensors |
| License | Apache 2.0 |
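
The 32,768-token context length covers the prompt and the generated tokens together, so a long prompt leaves less room for `max_new_tokens`. Below is a small sketch of that arithmetic. `clamp_new_tokens` is a hypothetical helper, and a deployed endpoint may enforce a lower total limit than the model's maximum.

```python
CONTEXT_LENGTH = 32_768  # from the specifications table


def clamp_new_tokens(prompt_tokens, requested, context_length=CONTEXT_LENGTH):
    """Return how many new tokens fit without exceeding the context window."""
    if prompt_tokens >= context_length:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested, context_length - prompt_tokens)
```

For example, a 32,700-token prompt leaves room for only 68 new tokens, whatever value was requested.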

## 🎯 Use Cases

- 💬 Chatbots: Customer service, virtual assistants
- 📝 Content Generation: Blog posts, articles, creative writing
- 🔍 Q&A Systems: Knowledge bases, FAQ automation
- 🌐 Multi-language: Translation and cross-language tasks
- 💼 Business: Report generation, email drafting
- 🎓 Education: Tutoring, explanation generation

## 🔧 Chat Format

The model uses the Qwen chat template:

```
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
AI response here<|im_end|>
```
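
When a tokenizer is not available, for example when building the `inputs` string for a raw endpoint request, the same layout can be produced in plain Python. This is a minimal sketch. `build_chatml_prompt` is a hypothetical helper, and it adds nothing beyond the messages passed in, whereas the tokenizer's own template may insert a default system prompt.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts in the <|im_start|>/<|im_end|> layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Xin chào!"}])
```

For a single user turn this yields the same string that `chat_with_model` builds inline.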

## ⚠️ Important Notes

- The model works best with a temperature of 0.7-0.8
- Use the stop tokens `["<|im_end|>"]` to avoid over-generation
- For Vietnamese questions, the model produces very natural results
- Verified compatibility with the TGI container
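
Stop sequences are normally enforced by the server. When a backend ignores the `stop` parameter or returns the stop marker as part of the text, a similar effect can be approximated on the client. Here is a sketch, with `truncate_at_stop` as a hypothetical helper name:

```python
def truncate_at_stop(text, stop_sequences=("<|im_end|>",)):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


cleaned = truncate_at_stop("Phở là món ăn nổi tiếng.<|im_end|>\n<|im_start|>user\n...")
```

Trimming on the client only cleans up the output. The extra tokens have already been generated, so server-side stop sequences are still preferable where they work.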

## 🏆 Performance

- ✅ Inference Endpoints: Tested and verified working
- ⚡ Speed: ~20-50 tokens/second on GPU small
- 🎯 Accuracy: Excellent for Vietnamese and English
- 💾 Memory: ~6GB VRAM for inference
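
The memory and speed figures can be sanity-checked with simple arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and a round 3 billion parameters. It counts the weights only, so the KV cache and activations come on top. `weights_gb` and `generation_seconds` are hypothetical helpers.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


print(weights_gb(3e9))              # 6.0
print(generation_seconds(200, 50))  # 4.0
print(generation_seconds(200, 20))  # 10.0
```

By this estimate a 200-token reply, the default `max_tokens` in `chat_with_model`, would take roughly 4 to 10 seconds at the quoted 20-50 tokens/second.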

---

## 📞 Support

- 🐛 Issues: Report via GitHub issues
- 📚 Docs: See the Qwen2.5 documentation
- 💬 Community: Hugging Face discussions

🎉 Ready for production deployment!