🤖 LFM2-Arabic-Tuned Model for Data Extraction

This model is a fine-tuned version of LiquidAI/LFM2-2.6B, trained with LLaMA-Factory.

It was trained specifically to extract structured data (JSON) from Arabic text according to a given schema.

It achieves an eval_loss of 0.2957 on the evaluation set.

🚀 Model description

The model was fine-tuned to act as a precise text-analysis tool (NLP parser). Rather than replying as a general-purpose chat assistant, it is trained to follow a Pydantic-derived JSON Schema closely and to extract only the requested information.

Given raw Arabic text and an "extract the details" task, the model returns clean JSON containing:

story_title (the title)
story_keywords (the keywords)
story_summary (a five-point summary)
story_category (the category)
story_entities (the entities)
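
Before a reply is used downstream, it can help to confirm that all five fields are present. The following is a minimal stdlib-only sketch: `EXPECTED_FIELDS` and `missing_or_mistyped` are hypothetical names, and the string/list types are assumptions taken from the simplified example schema in the usage section, not from the full training schema.

```python
import json

# Assumed field types, based on the simplified example schema in this card.
EXPECTED_FIELDS = {
    "story_title": str,
    "story_keywords": list,
    "story_summary": list,
    "story_category": str,
    "story_entities": list,
}


def missing_or_mistyped(record: dict) -> list[str]:
    """Return the names of fields that are absent or have an unexpected type."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record or not isinstance(record[name], expected_type):
            problems.append(name)
    return problems


# Hypothetical, shortened model reply used only for illustration.
reply = json.loads('{"story_title": "عنوان", "story_keywords": ["مال"], "story_summary": []}')
print(missing_or_mistyped(reply))
```

An empty list means the reply has the expected shape; it does not say anything about the quality of the extracted values.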

🛠️ How to use

The model was trained with the qwen template. To use it correctly and get the best results, build the prompt in the same format that was used during training.
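
For orientation, the qwen template follows the ChatML layout, in which each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. The function below is a hand-written approximation for illustration only (`render_chatml` is a hypothetical helper, not a library function). In practice `tokenizer.apply_chat_template` is the authoritative source and may add extra tokens, such as a beginning-of-text marker.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Approximate a ChatML-style prompt from a list of role/content messages."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo = render_chatml([
    {"role": "system", "content": "You are a professional NLP data parser."},
    {"role": "user", "content": "# Story:\n..."},
])
print(demo)
```

This layout is why the clean-up step in the usage code splits the generated text on `<|im_start|>assistant\n` and `<|im_end|>`.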

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import json

# Model name on Hugging Face
model_id = "YoussefAI/LFM2-Arabic-Tuned-results"

# Load the model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

# Create the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=False,  # greedy decoding: we want accuracy, not creativity
)

# 1. The text to analyze
story_text = """
ذكرت مجلة فوربس أن العائلة تلعب دورا محوريا في تشكيل علاقة الأفراد بالمال،
حيث تتأثر هذه العلاقة بأنماط السلوك المالي المتوارثة عبر الأجيال.
التقرير الذي يستند إلى أبحاث الأستاذ الجامعي شاين إنيت حول
الرفاه المالي يوضح أن لكل شخص "شخصية مالية" تتحدد وفقا لطريقة
تفاعله مع المال، والتي تتأثر بشكل مباشر بتربية الأسرة وتجارب الطفولة.
"""

# 2. System message and output schema (must match what was used in training)
system_message = (
    "You are a professional NLP data parser.\n"
    "Follow the provided `Task` by the user and the `Output Scheme` to generate the `Output JSON`.\n"
    "Do not generate any introduction or conclusion."
)

# This is the schema the model "understands"
# (Note: this is only an example; the actual schema was more detailed)
output_scheme = """
{"properties": {"story_title": {"type": "string"}, "story_keywords": {"type": "array"}, "story_summary": {"type": "array"}, "story_category": {"type": "string"}, "story_entities": {"type": "array"}}, "required": ["story_title", "story_keywords", "story_summary", "story_category", "story_entities"]}
"""

# 3. Build the prompt in the same format as training
messages = [
    {"role": "system", "content": system_message},
    {
        "role": "user",
        "content": f"""# Story:
{story_text.strip()}

# Task:
Extrat the story details into a JSON.

# Output Scheme:
{output_scheme}

# Output JSON:
```json
"""
    }
]

# 4. Apply the chat template and generate the reply
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = pipe(prompt)
full_response = response[0]["generated_text"]

# 5. Clean the reply to keep only the JSON
try:
    assistant_part = full_response.split("<|im_start|>assistant\n")[-1]
    clean_response = assistant_part.split("<|im_end|>")[0].strip()

    # Remove the ```json fence markers
    if clean_response.startswith("```json"):
        clean_response = clean_response[7:]
    if clean_response.endswith("```"):
        clean_response = clean_response[:-3]

    # Pretty-print the JSON
    parsed_json = json.loads(clean_response)
    print(json.dumps(parsed_json, indent=2, ensure_ascii=False))

except Exception as e:
    print(f"Failed to parse JSON: {e}")
    print(f"Raw response: {full_response}")
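
The prefix and suffix checks above assume the reply starts and ends exactly with the fence markers. A slightly more tolerant alternative, sketched here under the assumption that the reply contains a single top-level JSON object, is to slice from the first `{` to the last `}`. `extract_json` is a hypothetical helper; it should be given only the assistant part of the output, because the prompt itself contains braces in the schema.

```python
import json


def extract_json(assistant_text: str) -> dict:
    """Parse the JSON object embedded in the assistant part of a reply."""
    # Drop anything after the end-of-turn marker, if present.
    text = assistant_text.split("<|im_end|>")[0]
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(text[start:end + 1])


fence = "`" * 3  # built at runtime to keep this example free of literal fences
raw = f'{fence}json\n{{"story_title": "مثال", "story_keywords": []}}\n{fence}<|im_end|>'
print(extract_json(raw))
```

`json.loads` still raises `json.JSONDecodeError` when the sliced text is not valid JSON, so the surrounding `try`/`except` remains useful.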

⚠️ Limitations

This model is not a chatbot. It was trained intensively on a single task (JSON extraction). If you ask it general questions (for example, "What is the capital of France?"), it will likely give a wrong answer or try to output an empty JSON.

📊 Training data

The model was trained on a private dataset of 5,000 records. The data was created by processing a variety of Arabic texts (from the file collected_arabic_texts (1).jsonl) with gpt-4o-mini to generate the reference answers (the JSON) for each text.
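
The exact record layout of the private dataset is not published. As an illustration only, the sketch below shows how one prompt/answer pair could be stored in the Alpaca-style format that LLaMA-Factory accepts (`system`, `instruction`, `input`, `output`). `build_record` and the sample values are hypothetical and are not taken from the actual dataset.

```python
import json


def build_record(system: str, prompt: str, answer: dict) -> dict:
    """Assemble one Alpaca-style training record (assumed layout)."""
    return {
        "system": system,
        "instruction": prompt,
        "input": "",
        # The reference answer is stored as a JSON string, keeping Arabic readable.
        "output": json.dumps(answer, ensure_ascii=False),
    }


record = build_record(
    system="You are a professional NLP data parser.",
    prompt="# Story:\n...\n\n# Task:\n...",
    answer={"story_title": "مثال", "story_keywords": ["مال"]},
)
# One record per line gives a JSONL file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```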

📉 Training results

Training Loss	Epoch	Step	Validation Loss
0.3167	0.08	100	0.3168
0.2952	0.16	200	0.2911
0.2794	0.24	300	0.2800
0.2486	0.32	400	0.2727
0.2464	0.4	500	0.2693
0.2500	0.48	600	0.2630
0.2442	0.56	700	0.2621
0.2595	0.64	800	0.2602
0.2609	0.72	900	0.2538
0.2523	0.8	1000	0.2477
0.2497	0.88	1100	0.2493
0.2432	0.96	1200	0.2458
0.1578	1.04	1300	0.2544
0.1580	1.12	1400	0.2541
0.1592	1.2	1500	0.2533
0.1576	1.28	1600	0.2517
0.1585	1.36	1700	0.2497
0.1557	1.44	1800	0.2538
0.1626	1.52	1900	0.2504
0.1467	1.6	2000	0.2530
0.1540	1.68	2100	0.2501
0.1562	1.76	2200	0.2482
0.1625	1.84	2300	0.2493
0.1534	1.92	2400	0.2489
0.1385	2.0	2500	0.2472
0.0656	2.08	2600	0.2894
0.0655	2.16	2700	0.2900
0.0705	2.24	2800	0.2938
0.0686	2.32	2900	0.2936
0.0638	2.4	3000	0.2942
0.0665	2.48	3100	0.2937
0.0715	2.56	3200	0.2938
0.0629	2.64	3300	0.2947
0.0709	2.72	3400	0.2944
0.0632	2.8	3500	0.2951
0.0654	2.88	3600	0.2956
0.0657	2.96	3700	0.2958
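
In the table, training loss keeps falling while validation loss bottoms out late in the first epoch and rises during the third, which suggests overfitting in the later steps. The short script below, using the validation values copied from the table, locates the step with the lowest validation loss; whether an earlier checkpoint actually extracts better JSON is not something the loss alone can confirm.

```python
# (step, validation_loss) pairs copied from the table above.
VAL_LOSS = [
    (100, 0.3168), (200, 0.2911), (300, 0.2800), (400, 0.2727), (500, 0.2693),
    (600, 0.2630), (700, 0.2621), (800, 0.2602), (900, 0.2538), (1000, 0.2477),
    (1100, 0.2493), (1200, 0.2458), (1300, 0.2544), (1400, 0.2541), (1500, 0.2533),
    (1600, 0.2517), (1700, 0.2497), (1800, 0.2538), (1900, 0.2504), (2000, 0.2530),
    (2100, 0.2501), (2200, 0.2482), (2300, 0.2493), (2400, 0.2489), (2500, 0.2472),
    (2600, 0.2894), (2700, 0.2900), (2800, 0.2938), (2900, 0.2936), (3000, 0.2942),
    (3100, 0.2937), (3200, 0.2938), (3300, 0.2947), (3400, 0.2944), (3500, 0.2951),
    (3600, 0.2956), (3700, 0.2958),
]

# Pick the pair with the smallest validation loss.
best_step, best_loss = min(VAL_LOSS, key=lambda pair: pair[1])
final_step, final_loss = VAL_LOSS[-1]

print(f"lowest validation loss: {best_loss} at step {best_step}")
print(f"final validation loss:  {final_loss} at step {final_step}")
```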

Framework versions

PEFT 0.17.1
Transformers 4.57.1
Pytorch 2.5.1+cu124
Datasets 3.2.0
Tokenizers 0.22.1
