Upload folder using huggingface_hub
- .gitignore +3 -0
- README.md +192 -12
- app.py +162 -0
- check_tokenization.py +103 -0
- dataset_config.json +40 -0
- hardware_config.json +49 -0
- requirements.txt +20 -0
- run_transformers_training.py +615 -0
- transformers_config.json +75 -0
- update_space.py +219 -0
.gitignore
ADDED
@@ -0,0 +1,3 @@
.env
*.pyc
__pycache__
README.md
CHANGED
@@ -1,12 +1,192 @@
# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/phi-4-research-assistant](https://huggingface.co/George-API/phi-4-research-assistant).

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

## Files

### Core Training Files
- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages

### Analysis & Utilities
- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)

## Setup

1. **Environment Setup**:
   ```bash
   python -m venv venv
   source venv/bin/activate  # or `venv\Scripts\activate` on Windows
   pip install -r requirements.txt
   ```

2. **Environment Variables**:
   Create a `.env` file with:
   ```
   HUGGINGFACE_TOKEN=your_token_here
   ```

3. **Verify Setup**:
   ```bash
   python check_tokenization.py  # Checks the tokenizer and analyzes token distributions
   ```

## How It Works

1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses the pre-quantized Unsloth 4-bit model for memory-efficient, faster training
4. **Checkpointing**: Saves regular checkpoints and pushes them to the Hub
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to the Hugging Face Hub

## Key Features

### Memory-Efficient Training

The training setup is optimized for A10G GPUs:
- Uses a pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed precision training
- Optimized batch sizes for maximum throughput
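As a rough illustration of how these settings map onto the `transformers` API (a minimal sketch, not the training script itself; it assumes the pre-quantized checkpoint and an environment with flash-attention support):

```python
# Sketch only: the real settings are applied by run_transformers_training.py via the config files.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/phi-4-unsloth-bnb-4bit",          # already quantized to 4-bit, no extra BitsAndBytesConfig needed
    device_map="auto",
    torch_dtype=torch.bfloat16,                 # bfloat16 mixed precision
    attn_implementation="flash_attention_2",    # flash attention (requires the flash-attn package)
)
model.gradient_checkpointing_enable()           # trade extra compute for lower activation memory
```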
### Sequential Processing

The training script ensures that chunks from the same research paper are processed together by:
- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)
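A minimal sketch of that ordering logic, using the script's `dataset` and `data_collator` objects (the actual implementation is in `run_transformers_training.py`):

```python
# Sketch only: sort by paper ID, then read the dataset strictly in order (no shuffling).
from torch.utils.data import DataLoader, SequentialSampler

dataset = dataset.sort("id")                 # keeps all chunks of a paper adjacent
train_loader = DataLoader(
    dataset,
    batch_size=16,
    sampler=SequentialSampler(dataset),      # yields indices 0, 1, 2, ... in order
    collate_fn=data_collator,
    drop_last=False,
)
```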
### Data Collator

The `SimpleDataCollator` class:
- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
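Conceptually, the collator does something like the following for each example (a reduced sketch; the full class appears in `run_transformers_training.py` later in this commit, where `format_phi_chat` is a method of the class):

```python
# Reduced sketch of the per-example work done by SimpleDataCollator.__call__.
def collate_one(example, tokenizer, max_length=2048):
    text = format_phi_chat(example["conversations"])   # "System: ...\n\nHuman: ...\n\nAssistant: ..."
    enc = tokenizer(text, add_special_tokens=True, truncation=True, max_length=max_length)
    enc["labels"] = enc["input_ids"].copy()             # causal LM: labels mirror the inputs
    return enc  # padding happens later; padded label positions get -100 so the loss ignores them
```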
### Checkpointing

The training process saves checkpoints:
- Every 200 steps
- Pushes to Hub on every save
- Maintains up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted
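These behaviours correspond to the checkpoint settings in `transformers_config.json`; roughly, in `TrainingArguments` terms (a sketch, with values taken from the config in this commit):

```python
# Checkpoint-related arguments as configured for this phase.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    save_strategy="steps",
    save_steps=200,             # write a checkpoint every 200 steps
    save_total_limit=5,         # keep only the 5 most recent checkpoints
    push_to_hub=True,
    hub_strategy="every_save",  # push each saved checkpoint to the Hub
)
# On restart, trainer.train(resume_from_checkpoint=...) continues from the latest checkpoint in output_dir.
```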
## Hardware Requirements

This training setup is optimized for:
- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB System RAM
- CUDA 11.8 or higher

Memory breakdown per GPU:
- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB

## Configuration

Key parameters in `transformers_config.json`:

- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5
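The effective batch size is simply the product of the per-device batch size, the gradient accumulation steps, and the GPU count (2x A10G per `hardware_config.json`):

```python
# How the effective batch size of 128 is derived.
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 2
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128
```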
The configuration is optimized for:
- Maximum memory efficiency with the pre-quantized model
- Stable training with a cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates

## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```

The script will:
1. Load the pre-quantized model and dataset
2. Apply optimized training parameters
3. Process the data sequentially
4. Train the model for 3 epochs
5. Save and push checkpoints to the Hub regularly

## Using the Model

After training, you can use the domain-adapted model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             torch_dtype="bfloat16")

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Chat Format Example

Phi-4 works best with its native chat template:

```python
from transformers import pipeline

pipeline = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": "bfloat16"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = pipeline(messages, max_new_tokens=256)
print(outputs[0]["generated_text"])
```

## Expected Outcomes

After domain adaptation, the model should:
- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2

## Next Steps

After completing domain adaptation:
1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning)
3. Use TensorBoard to analyze training metrics
app.py
ADDED
@@ -0,0 +1,162 @@
import gradio as gr
import os
import subprocess
import sys
import json
import re
from threading import Thread
import datetime
import torch
import threading

def load_env_variables():
    """Load environment variables from system or .env file."""
    if os.environ.get("SPACE_ID"):
        print("Running in Hugging Face Space")
        if "/" in os.environ.get("SPACE_ID", ""):
            username = os.environ.get("SPACE_ID").split("/")[0]
            os.environ["HF_USERNAME"] = username
            print(f"Set HF_USERNAME from SPACE_ID: {username}")
    else:
        try:
            from dotenv import load_dotenv
            env_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), ".env")
            if os.path.exists(env_path):
                load_dotenv(env_path)
                print(f"Loaded environment variables from {env_path}")
        except ImportError:
            print("python-dotenv not installed, skipping .env loading")

def check_environment():
    """Check the environment for GPU availability and other requirements."""
    env_info = {
        "System": {
            "Platform": sys.platform,
            "Python Version": sys.version.split()[0]
        },
        "GPU": {
            "CUDA Available": torch.cuda.is_available(),
            "Device Count": torch.cuda.device_count() if torch.cuda.is_available() else 0
        },
        "Environment Variables": {
            "HF_TOKEN": bool(os.environ.get("HF_TOKEN")),
            "HF_USERNAME": bool(os.environ.get("HF_USERNAME")),
            "HF_SPACE_NAME": bool(os.environ.get("HF_SPACE_NAME"))
        }
    }

    if torch.cuda.is_available():
        env_info["GPU"]["Device Name"] = torch.cuda.get_device_name(0)
        env_info["GPU"]["Memory (GB)"] = round(torch.cuda.get_device_properties(0).total_memory / (1024**3), 2)

    return env_info

def run_training_process():
    """Run the training process using the configuration files."""
    try:
        current_dir = os.path.dirname(os.path.abspath(__file__))
        training_script = os.path.join(current_dir, "run_transformers_training.py")

        # Start the training process
        process = subprocess.Popen(
            [sys.executable, training_script],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1
        )

        # Process the output line by line
        for line in process.stdout:
            print(line.strip())

        process.wait()
        return process.returncode
    except Exception as e:
        print(f"Error in training process: {e}")
        return 1

def start_training(learning_rate, num_train_epochs, per_device_train_batch_size,
                   gradient_accumulation_steps):
    """Start the training process with the specified parameters."""
    try:
        load_env_variables()
        current_dir = os.path.dirname(os.path.abspath(__file__))

        # Load and update transformers config
        with open(os.path.join(current_dir, "transformers_config.json"), "r") as f:
            config = json.load(f)

        # Update training parameters
        config["training"].update({
            "num_train_epochs": num_train_epochs,
            "learning_rate": learning_rate,
            "per_device_train_batch_size": per_device_train_batch_size,
            "gradient_accumulation_steps": gradient_accumulation_steps
        })

        # Update hub settings if username is available
        if os.environ.get("HF_USERNAME"):
            config["huggingface_hub"].update({
                "hub_model_id": f"{os.environ['HF_USERNAME']}/Phi4-Cognitive-Science"
            })

        # Save updated config
        with open(os.path.join(current_dir, "transformers_config.json"), "w") as f:
            json.dump(config, f, indent=4)

        # Start training in a separate thread
        thread = threading.Thread(target=run_training_process)
        thread.daemon = True
        thread.start()

        return "Training started! Check the Hugging Face Space logs for progress."
    except Exception as e:
        return f"Error starting training: {str(e)}"

with gr.Blocks(title="Phi-4 Training Interface") as demo:
    gr.Markdown("# Phi-4 Unsupervised Training for Cognitive Science")

    with gr.Tab("Training"):
        with gr.Row():
            with gr.Column():
                gr.Markdown("## Model Configuration")
                gr.Markdown("**Model**: unsloth/phi-4-unsloth-bnb-4bit")
                gr.Markdown("**Dataset**: George-API/cognitive-data")

                gr.Markdown("## Training Parameters")
                learning_rate = gr.Slider(minimum=1e-6, maximum=1e-4, value=2e-5, step=1e-6,
                                          label="Learning Rate")
                num_train_epochs = gr.Slider(minimum=1, maximum=5, value=3, step=1,
                                             label="Number of Epochs")
                per_device_train_batch_size = gr.Slider(minimum=4, maximum=24, value=12, step=4,
                                                        label="Per Device Train Batch Size (Unsloth Optimized)")
                gradient_accumulation_steps = gr.Slider(minimum=1, maximum=8, value=4, step=1,
                                                        label="Gradient Accumulation Steps")

                start_btn = gr.Button("Start Training", variant="primary")
                training_output = gr.Textbox(label="Training Output", interactive=False)

    with gr.Tab("Environment"):
        with gr.Row():
            with gr.Column():
                gr.Markdown("## Environment Information")
                env_info = gr.JSON(label="Environment Info")
                check_env_btn = gr.Button("Check Environment")

    # Set up event handlers
    start_btn.click(
        fn=start_training,
        inputs=[learning_rate, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps],
        outputs=training_output
    )

    check_env_btn.click(
        fn=check_environment,
        inputs=[],
        outputs=env_info
    )

if __name__ == "__main__":
    load_env_variables()
    demo.launch()
check_tokenization.py
ADDED
@@ -0,0 +1,103 @@
#!/usr/bin/env python

import json
from transformers import AutoTokenizer
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

def load_tokenizers():
    """Load both tokenizers."""
    print("Loading tokenizers...")
    phi_tokenizer = AutoTokenizer.from_pretrained(
        "unsloth/phi-4-unsloth-bnb-4bit",
        trust_remote_code=True
    )
    deepseek_tokenizer = AutoTokenizer.from_pretrained(
        "deepseek-ai/deepseek-llama-7b-base",
        trust_remote_code=True
    )
    return phi_tokenizer, deepseek_tokenizer

def analyze_token_counts(jsonl_path, phi_tokenizer, deepseek_tokenizer, sample_size=100):
    """Analyze token count differences between tokenizers."""
    token_counts = {
        'phi': [],
        'deepseek': [],
        'differences': []
    }

    print(f"Analyzing token counts from {jsonl_path}")
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        data = [json.loads(line) for line in f]

    # Take a random sample if sample_size specified
    if sample_size and sample_size < len(data):
        data = np.random.choice(data, sample_size, replace=False)

    for item in tqdm(data, desc="Processing entries"):
        text = item.get('text', '') or item.get('content', '')

        # Get token counts
        phi_tokens = len(phi_tokenizer.encode(text))
        deepseek_tokens = len(deepseek_tokenizer.encode(text))

        token_counts['phi'].append(phi_tokens)
        token_counts['deepseek'].append(deepseek_tokens)
        token_counts['differences'].append(phi_tokens - deepseek_tokens)

    return token_counts

def plot_comparison(token_counts):
    """Create visualization of token count differences."""
    plt.figure(figsize=(12, 6))

    # Plot token count distributions
    plt.subplot(1, 2, 1)
    plt.hist([token_counts['phi'], token_counts['deepseek']],
             label=['Phi-4', 'DeepSeek'], alpha=0.6)
    plt.title('Token Count Distribution')
    plt.xlabel('Number of Tokens')
    plt.ylabel('Frequency')
    plt.legend()

    # Plot differences
    plt.subplot(1, 2, 2)
    plt.hist(token_counts['differences'], bins=30)
    plt.title('Token Count Differences\n(Phi-4 minus DeepSeek)')
    plt.xlabel('Difference in Tokens')
    plt.ylabel('Frequency')

    plt.tight_layout()
    plt.savefig('tokenization_analysis.png')
    print("Saved visualization to tokenization_analysis.png")

def main():
    # Load tokenizers
    phi_tokenizer, deepseek_tokenizer = load_tokenizers()

    # Analyze token counts
    token_counts = analyze_token_counts(
        "../../../../data_processing/data/training_data.jsonl",
        phi_tokenizer,
        deepseek_tokenizer
    )

    # Calculate statistics
    phi_mean = np.mean(token_counts['phi'])
    deepseek_mean = np.mean(token_counts['deepseek'])
    diff_mean = np.mean(token_counts['differences'])
    diff_std = np.std(token_counts['differences'])

    print("\nAnalysis Results:")
    print(f"Phi-4 average tokens: {phi_mean:.1f}")
    print(f"DeepSeek average tokens: {deepseek_mean:.1f}")
    print(f"Average difference: {diff_mean:.1f} ± {diff_std:.1f}")
    print(f"Max Phi-4 tokens: {max(token_counts['phi'])}")
    print(f"Max DeepSeek tokens: {max(token_counts['deepseek'])}")

    # Create visualization
    plot_comparison(token_counts)

if __name__ == "__main__":
    main()
dataset_config.json
ADDED
@@ -0,0 +1,40 @@
{
    "dataset": {
        "name": "George-API/cognitive-data",
        "split": "train",
        "column_mapping": {
            "text": "conversations",
            "id": "id"
        },
        "processing": {
            "sort_by_id": true,
            "maintain_paper_order": true,
            "max_seq_length": 2048
        }
    },
    "data_formatting": {
        "chat_template": "phi",
        "roles": {
            "system": "System: {content}\n\n",
            "human": "Human: {content}\n\n",
            "assistant": "Assistant: {content}\n\n"
        },
        "metadata_handling": {
            "include_paper_id": true,
            "include_chunk_number": true,
            "metadata_format": "Paper ID: {paper_id} | Chunk: {chunk_number}"
        }
    },
    "data_loading": {
        "batch_size": 16,
        "shuffle": false,
        "drop_last": false,
        "num_workers": 2,
        "pin_memory": false
    },
    "validation": {
        "log_samples": 3,
        "log_interval": 50,
        "metrics": ["processed", "skipped", "avg_tokens", "unique_papers"]
    }
}
hardware_config.json
ADDED
@@ -0,0 +1,49 @@
{
    "hardware_name": "2xA10G",
    "specs": {
        "gpu_count": 2,
        "gpu_type": "A10G",
        "vram_per_gpu": 24,
        "total_vram": 48,
        "vcpu_count": 24,
        "ram": 92
    },
    "training_optimizations": {
        "per_device_batch_size": 16,
        "gradient_accumulation_steps": 4,
        "effective_batch_size": 128,
        "memory_optimizations": {
            "use_gradient_checkpointing": true,
            "pin_memory": true,
            "num_workers": 2
        },
        "distributed_settings": {
            "device_map": "auto",
            "ddp_find_unused_parameters": false
        }
    },
    "memory_breakdown": {
        "model_size": "~3.5GB (pre-quantized 4-bit)",
        "optimizer_states": "~1GB",
        "batch_memory_per_gpu": "~2GB",
        "peak_memory_estimate": "18-20GB",
        "safe_headroom": "4-6GB"
    },
    "compute_environment": "A10G_CLOUD",
    "distributed_type": "DATA_PARALLEL",
    "mixed_precision": "bf16",
    "num_gpus": 2,
    "training_parameters": {
        "per_device_train_batch_size": 16,
        "gradient_accumulation_steps": 4,
        "dataloader_num_workers": 2,
        "dataloader_pin_memory": true,
        "gradient_checkpointing": true,
        "max_grad_norm": 1.0
    },
    "memory_optimization": {
        "offload_to_cpu": false,
        "use_flash_attention": true,
        "use_gradient_checkpointing": true
    }
}
requirements.txt
ADDED
@@ -0,0 +1,20 @@
accelerate>=0.27.0
bitsandbytes>=0.41.0
datasets>=2.15.0
filelock>=3.13.1
gradio>=5.17.0
huggingface-hub>=0.19.0
matplotlib>=3.7.0
numpy>=1.24.0
packaging>=23.0
psutil>=5.9.0
python-dotenv>=1.0.0
pyyaml>=6.0.1
regex>=2023.0.0
requests>=2.31.0
safetensors>=0.4.1
tensorboard>=2.15.0
torch>=2.0.0
tqdm>=4.65.0
transformers>=4.36.0
typing-extensions>=4.8.0
run_transformers_training.py
ADDED
|
@@ -0,0 +1,615 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python
|
| 2 |
+
# coding=utf-8
|
| 3 |
+
|
| 4 |
+
import os
|
| 5 |
+
import sys
|
| 6 |
+
import json
|
| 7 |
+
import argparse
|
| 8 |
+
import logging
|
| 9 |
+
from datetime import datetime
|
| 10 |
+
|
| 11 |
+
import torch
|
| 12 |
+
from datasets import load_dataset
|
| 13 |
+
from transformers import (
|
| 14 |
+
AutoModelForCausalLM,
|
| 15 |
+
AutoTokenizer,
|
| 16 |
+
TrainingArguments,
|
| 17 |
+
Trainer,
|
| 18 |
+
TrainerCallback,
|
| 19 |
+
set_seed,
|
| 20 |
+
BitsAndBytesConfig
|
| 21 |
+
)
|
| 22 |
+
|
| 23 |
+
# Configure logging
|
| 24 |
+
logging.basicConfig(
|
| 25 |
+
level=logging.INFO,
|
| 26 |
+
format="%(asctime)s - %(levelname)s - %(message)s",
|
| 27 |
+
handlers=[logging.StreamHandler(sys.stdout)]
|
| 28 |
+
)
|
| 29 |
+
logger = logging.getLogger(__name__)
|
| 30 |
+
|
| 31 |
+
# Check for BitsAndBytes
|
| 32 |
+
try:
|
| 33 |
+
from transformers import BitsAndBytesConfig
|
| 34 |
+
bitsandbytes_available = True
|
| 35 |
+
except ImportError:
|
| 36 |
+
bitsandbytes_available = False
|
| 37 |
+
logger.warning("BitsAndBytes not available. 4-bit quantization will not be used.")
|
| 38 |
+
|
| 39 |
+
# Check for PEFT
|
| 40 |
+
try:
|
| 41 |
+
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
|
| 42 |
+
peft_available = True
|
| 43 |
+
except ImportError:
|
| 44 |
+
peft_available = False
|
| 45 |
+
logger.warning("PEFT not available. Parameter-efficient fine-tuning will not be used.")
|
| 46 |
+
|
| 47 |
+
# Import Unsloth
|
| 48 |
+
try:
|
| 49 |
+
from unsloth import FastLanguageModel
|
| 50 |
+
from unsloth.chat_templates import get_chat_template
|
| 51 |
+
unsloth_available = True
|
| 52 |
+
except ImportError:
|
| 53 |
+
unsloth_available = False
|
| 54 |
+
logger.warning("Unsloth not available. Please install with: pip install unsloth")
|
| 55 |
+
|
| 56 |
+
def load_env_variables():
|
| 57 |
+
"""Load environment variables from system, .env file, or Hugging Face Space variables."""
|
| 58 |
+
# Check if we're running in a Hugging Face Space
|
| 59 |
+
if os.environ.get("SPACE_ID"):
|
| 60 |
+
logging.info("Running in Hugging Face Space")
|
| 61 |
+
|
| 62 |
+
# Log the presence of variables (without revealing values)
|
| 63 |
+
logging.info(f"HF_TOKEN available: {bool(os.environ.get('HF_TOKEN'))}")
|
| 64 |
+
logging.info(f"HF_USERNAME available: {bool(os.environ.get('HF_USERNAME'))}")
|
| 65 |
+
|
| 66 |
+
# If username is not set, try to extract from SPACE_ID
|
| 67 |
+
if not os.environ.get("HF_USERNAME") and "/" in os.environ.get("SPACE_ID", ""):
|
| 68 |
+
username = os.environ.get("SPACE_ID").split("/")[0]
|
| 69 |
+
os.environ["HF_USERNAME"] = username
|
| 70 |
+
logging.info(f"Set HF_USERNAME from SPACE_ID: {username}")
|
| 71 |
+
else:
|
| 72 |
+
# Try to load from .env file if not in a Space
|
| 73 |
+
try:
|
| 74 |
+
from dotenv import load_dotenv
|
| 75 |
+
# Updated path to .env file in the new directory structure
|
| 76 |
+
env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "shared", ".env")
|
| 77 |
+
if os.path.exists(env_path):
|
| 78 |
+
load_dotenv(env_path)
|
| 79 |
+
logging.info(f"Loaded environment variables from {env_path}")
|
| 80 |
+
logging.info(f"HF_TOKEN loaded from .env file: {bool(os.environ.get('HF_TOKEN'))}")
|
| 81 |
+
logging.info(f"HF_USERNAME loaded from .env file: {bool(os.environ.get('HF_USERNAME'))}")
|
| 82 |
+
logging.info(f"HF_SPACE_NAME loaded from .env file: {bool(os.environ.get('HF_SPACE_NAME'))}")
|
| 83 |
+
else:
|
| 84 |
+
logging.warning(f"No .env file found at {env_path}")
|
| 85 |
+
except ImportError:
|
| 86 |
+
logging.warning("python-dotenv not installed, not loading from .env file")
|
| 87 |
+
|
| 88 |
+
if not os.environ.get("HF_USERNAME"):
|
| 89 |
+
logger.warning("HF_USERNAME is not set. Using default username.")
|
| 90 |
+
|
| 91 |
+
if not os.environ.get("HF_SPACE_NAME"):
|
| 92 |
+
logger.warning("HF_SPACE_NAME is not set. Using default space name.")
|
| 93 |
+
|
| 94 |
+
# Set HF_TOKEN for huggingface_hub
|
| 95 |
+
if os.environ.get("HF_TOKEN"):
|
| 96 |
+
os.environ["HUGGING_FACE_HUB_TOKEN"] = os.environ.get("HF_TOKEN")
|
| 97 |
+
|
| 98 |
+
def load_configs(base_path):
|
| 99 |
+
"""Load all configuration files."""
|
| 100 |
+
configs = {}
|
| 101 |
+
|
| 102 |
+
# List of config files to load
|
| 103 |
+
config_files = [
|
| 104 |
+
"transformers_config.json",
|
| 105 |
+
"hardware_config.json",
|
| 106 |
+
"dataset_config.json"
|
| 107 |
+
]
|
| 108 |
+
|
| 109 |
+
for config_file in config_files:
|
| 110 |
+
file_path = os.path.join(base_path, config_file)
|
| 111 |
+
try:
|
| 112 |
+
with open(file_path, "r") as f:
|
| 113 |
+
config_name = config_file.replace("_config.json", "")
|
| 114 |
+
configs[config_name] = json.load(f)
|
| 115 |
+
logger.info(f"Loaded {config_name} configuration from {file_path}")
|
| 116 |
+
except Exception as e:
|
| 117 |
+
logger.error(f"Error loading {config_file}: {e}")
|
| 118 |
+
raise
|
| 119 |
+
|
| 120 |
+
return configs
|
| 121 |
+
|
| 122 |
+
def parse_args():
|
| 123 |
+
parser = argparse.ArgumentParser(description="Fine-tune a language model on a text dataset")
|
| 124 |
+
parser.add_argument("--config_dir", type=str, default=".", help="Directory containing configuration files")
|
| 125 |
+
return parser.parse_args()
|
| 126 |
+
|
| 127 |
+
def load_model_and_tokenizer(config):
|
| 128 |
+
"""Load model and tokenizer with proper error handling and optimizations."""
|
| 129 |
+
try:
|
| 130 |
+
if config.get("use_unsloth", False) and unsloth_available:
|
| 131 |
+
logger.info("Using Unsloth optimizations")
|
| 132 |
+
model, tokenizer = FastLanguageModel.from_pretrained(
|
| 133 |
+
model_name=config.get("model_name"),
|
| 134 |
+
max_seq_length=config.get("max_seq_length", 2048),
|
| 135 |
+
dtype=None, # Let Unsloth choose optimal dtype
|
| 136 |
+
load_in_4bit=config.get("load_in_4bit", True),
|
| 137 |
+
device_map="auto",
|
| 138 |
+
)
|
| 139 |
+
|
| 140 |
+
# Apply Unsloth's training optimizations with config parameters
|
| 141 |
+
model = FastLanguageModel.get_peft_model(
|
| 142 |
+
model,
|
| 143 |
+
r=config.get("unsloth_r", 32),
|
| 144 |
+
target_modules=config.get("unsloth_target_modules",
|
| 145 |
+
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]),
|
| 146 |
+
lora_alpha=config.get("unsloth_alpha", 16),
|
| 147 |
+
lora_dropout=config.get("unsloth_dropout", 0.05),
|
| 148 |
+
bias="none",
|
| 149 |
+
use_gradient_checkpointing=config.get("gradient_checkpointing", True),
|
| 150 |
+
random_state=config.get("seed", 42),
|
| 151 |
+
)
|
| 152 |
+
logger.info("Unsloth optimizations applied successfully")
|
| 153 |
+
else:
|
| 154 |
+
if config.get("use_unsloth", False):
|
| 155 |
+
logger.warning("Unsloth requested but not available. Falling back to standard training.")
|
| 156 |
+
|
| 157 |
+
# Standard quantization setup
|
| 158 |
+
quantization_config = None
|
| 159 |
+
if config.get("load_in_4bit", False) and bitsandbytes_available:
|
| 160 |
+
logger.info("Using 4-bit quantization")
|
| 161 |
+
quantization_config = BitsAndBytesConfig(
|
| 162 |
+
load_in_4bit=True,
|
| 163 |
+
bnb_4bit_quant_type="nf4",
|
| 164 |
+
bnb_4bit_compute_dtype=torch.float16,
|
| 165 |
+
bnb_4bit_use_double_quant=True
|
| 166 |
+
)
|
| 167 |
+
|
| 168 |
+
# Load model with standard settings
|
| 169 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 170 |
+
config.get("model_name"),
|
| 171 |
+
quantization_config=quantization_config,
|
| 172 |
+
device_map="auto",
|
| 173 |
+
trust_remote_code=config.get("trust_remote_code", True),
|
| 174 |
+
use_cache=not config.get("gradient_checkpointing", True)
|
| 175 |
+
)
|
| 176 |
+
|
| 177 |
+
# Load tokenizer
|
| 178 |
+
tokenizer = AutoTokenizer.from_pretrained(
|
| 179 |
+
config.get("model_name"),
|
| 180 |
+
use_fast=config.get("use_fast_tokenizer", True),
|
| 181 |
+
trust_remote_code=config.get("trust_remote_code", True)
|
| 182 |
+
)
|
| 183 |
+
|
| 184 |
+
# Enable gradient checkpointing if requested
|
| 185 |
+
if config.get("gradient_checkpointing", True) and hasattr(model, "gradient_checkpointing_enable"):
|
| 186 |
+
model.gradient_checkpointing_enable(use_reentrant=False)
|
| 187 |
+
logger.info("Gradient checkpointing enabled")
|
| 188 |
+
|
| 189 |
+
# Set up tokenizer settings
|
| 190 |
+
if config.get("chat_template"):
|
| 191 |
+
if unsloth_available and config.get("use_unsloth", False):
|
| 192 |
+
chat_template = get_chat_template("phi")
|
| 193 |
+
tokenizer.chat_template = chat_template
|
| 194 |
+
else:
|
| 195 |
+
tokenizer.chat_template = config.get("chat_template")
|
| 196 |
+
logger.info(f"Set chat template to {config.get('chat_template')}")
|
| 197 |
+
|
| 198 |
+
# Ensure proper token settings
|
| 199 |
+
if tokenizer.pad_token_id is None:
|
| 200 |
+
tokenizer.pad_token_id = tokenizer.eos_token_id
|
| 201 |
+
logger.info(f"Set pad_token_id to eos_token_id: {tokenizer.pad_token_id}")
|
| 202 |
+
|
| 203 |
+
return model, tokenizer
|
| 204 |
+
|
| 205 |
+
except Exception as e:
|
| 206 |
+
logger.error(f"Error in model/tokenizer loading: {str(e)}")
|
| 207 |
+
raise
|
| 208 |
+
|
| 209 |
+
def load_dataset_with_mapping(dataset_config):
|
| 210 |
+
"""Load and prepare dataset with proper column mapping."""
|
| 211 |
+
try:
|
| 212 |
+
# Load dataset
|
| 213 |
+
dataset = load_dataset(
|
| 214 |
+
dataset_config["dataset"]["name"],
|
| 215 |
+
split=dataset_config["dataset"]["split"]
|
| 216 |
+
)
|
| 217 |
+
logger.info(f"Dataset loaded successfully with {len(dataset)} examples")
|
| 218 |
+
|
| 219 |
+
# Apply column mapping if specified
|
| 220 |
+
if "column_mapping" in dataset_config["dataset"]:
|
| 221 |
+
mapping = dataset_config["dataset"]["column_mapping"]
|
| 222 |
+
dataset = dataset.rename_columns({v: k for k, v in mapping.items()})
|
| 223 |
+
logger.info(f"Applied column mapping: {mapping}")
|
| 224 |
+
|
| 225 |
+
# Sort dataset if required
|
| 226 |
+
if dataset_config["dataset"]["processing"]["sort_by_id"]:
|
| 227 |
+
logger.info("Sorting dataset by ID to maintain paper chunk order")
|
| 228 |
+
dataset = dataset.sort("id")
|
| 229 |
+
|
| 230 |
+
# Log first few IDs to verify sorting
|
| 231 |
+
sample_ids = [example["id"] for example in dataset.select(range(min(5, len(dataset))))]
|
| 232 |
+
logger.info(f"First few IDs after sorting: {sample_ids}")
|
| 233 |
+
|
| 234 |
+
return dataset
|
| 235 |
+
|
| 236 |
+
except Exception as e:
|
| 237 |
+
logger.error(f"Error loading dataset: {str(e)}")
|
| 238 |
+
raise
|
| 239 |
+
|
| 240 |
+
def main():
|
| 241 |
+
# Set up logging
|
| 242 |
+
logger.info("Starting training process")
|
| 243 |
+
|
| 244 |
+
# Parse arguments
|
| 245 |
+
args = parse_args()
|
| 246 |
+
|
| 247 |
+
# Load environment variables
|
| 248 |
+
load_env_variables()
|
| 249 |
+
|
| 250 |
+
# Load all configurations
|
| 251 |
+
try:
|
| 252 |
+
configs = load_configs(args.config_dir)
|
| 253 |
+
logger.info("All configurations loaded successfully")
|
| 254 |
+
|
| 255 |
+
# Extract specific configs
|
| 256 |
+
model_config = configs["transformers"]
|
| 257 |
+
hardware_config = configs["hardware"]
|
| 258 |
+
dataset_config = configs["dataset"]
|
| 259 |
+
|
| 260 |
+
# Apply hardware-specific settings
|
| 261 |
+
per_device_batch_size = hardware_config["training_optimizations"]["per_device_batch_size"]
|
| 262 |
+
gradient_accumulation = hardware_config["training_optimizations"]["gradient_accumulation_steps"]
|
| 263 |
+
|
| 264 |
+
# Update model config with hardware settings
|
| 265 |
+
model_config["training"].update({
|
| 266 |
+
"per_device_train_batch_size": per_device_batch_size,
|
| 267 |
+
"gradient_accumulation_steps": gradient_accumulation,
|
| 268 |
+
"gradient_checkpointing": hardware_config["training_optimizations"]["memory_optimizations"]["use_gradient_checkpointing"]
|
| 269 |
+
})
|
| 270 |
+
|
| 271 |
+
except Exception as e:
|
| 272 |
+
logger.error(f"Error loading configurations: {e}")
|
| 273 |
+
return 1
|
| 274 |
+
|
| 275 |
+
# Set random seed for reproducibility
|
| 276 |
+
seed = model_config.get("seed", 42)
|
| 277 |
+
set_seed(seed)
|
| 278 |
+
logger.info(f"Set random seed to {seed}")
|
| 279 |
+
|
| 280 |
+
# Check if we're running in a Hugging Face Space
|
| 281 |
+
if os.environ.get("SPACE_ID") and not os.environ.get("HF_USERNAME"):
|
| 282 |
+
# Extract username from SPACE_ID
|
| 283 |
+
username = os.environ.get("SPACE_ID").split("/")[0]
|
| 284 |
+
logger.info(f"Extracted username from SPACE_ID: {username}")
|
| 285 |
+
|
| 286 |
+
# Set hub_model_id if not already set and push_to_hub is enabled
|
| 287 |
+
if model_config.get("push_to_hub", False) and not model_config.get("hub_model_id"):
|
| 288 |
+
model_name = model_config.get("model_name", "").split("/")[-1]
|
| 289 |
+
model_config["hub_model_id"] = f"{username}/finetuned-{model_name}"
|
| 290 |
+
logger.info(f"Set hub_model_id to {model_config['hub_model_id']}")
|
| 291 |
+
|
| 292 |
+
# Load model and tokenizer
|
| 293 |
+
logger.info(f"Loading model: {model_config.get('model_name')}")
|
| 294 |
+
|
| 295 |
+
try:
|
| 296 |
+
model, tokenizer = load_model_and_tokenizer(model_config)
|
| 297 |
+
logger.info("Model and tokenizer loaded successfully")
|
| 298 |
+
|
| 299 |
+
# Prepare model for k-bit training if using PEFT
|
| 300 |
+
if model_config.get("use_peft", False) and peft_available:
|
| 301 |
+
logger.info("Preparing model for parameter-efficient fine-tuning")
|
| 302 |
+
try:
|
| 303 |
+
model = prepare_model_for_kbit_training(model)
|
| 304 |
+
|
| 305 |
+
# Get target modules
|
| 306 |
+
target_modules = model_config.get("target_modules", ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"])
|
| 307 |
+
|
| 308 |
+
# Create LoRA config
|
| 309 |
+
lora_config = LoraConfig(
|
| 310 |
+
r=model_config.get("lora_r", 16),
|
| 311 |
+
lora_alpha=model_config.get("lora_alpha", 32),
|
| 312 |
+
lora_dropout=model_config.get("lora_dropout", 0.05),
|
| 313 |
+
bias="none",
|
| 314 |
+
task_type="CAUSAL_LM",
|
| 315 |
+
target_modules=target_modules
|
| 316 |
+
)
|
| 317 |
+
|
| 318 |
+
# Apply LoRA to model
|
| 319 |
+
model = get_peft_model(model, lora_config)
|
| 320 |
+
logger.info(f"Applied LoRA with r={model_config.get('lora_r', 16)}, alpha={model_config.get('lora_alpha', 32)}")
|
| 321 |
+
except Exception as e:
|
| 322 |
+
logger.error(f"Error setting up PEFT: {e}")
|
| 323 |
+
return 1
|
| 324 |
+
|
| 325 |
+
# Load dataset with proper mapping
|
| 326 |
+
try:
|
| 327 |
+
dataset = load_dataset_with_mapping(dataset_config)
|
| 328 |
+
logger.info("Dataset loaded and prepared successfully")
|
| 329 |
+
except Exception as e:
|
| 330 |
+
logger.error(f"Error loading dataset: {e}")
|
| 331 |
+
return 1
|
| 332 |
+
|
| 333 |
+
# Simple data collator that processes each entry independently
|
| 334 |
+
class SimpleDataCollator:
|
| 335 |
+
def __init__(self, tokenizer):
|
| 336 |
+
self.tokenizer = tokenizer
|
| 337 |
+
self.stats = {"processed": 0, "skipped": 0, "total_tokens": 0}
|
| 338 |
+
self.pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
|
| 339 |
+
self.prompt_counter = 0
|
| 340 |
+
self.paper_counters = {}
|
| 341 |
+
logger.info("SimpleDataCollator initialized - using phi-4 chat format")
|
| 342 |
+
|
| 343 |
+
def format_phi_chat(self, messages):
|
| 344 |
+
"""Format messages according to phi-4's chat template."""
|
| 345 |
+
formatted_chat = ""
|
| 346 |
+
for message in messages:
|
| 347 |
+
# Extract role and content
|
| 348 |
+
if isinstance(message, dict):
|
| 349 |
+
role = message.get("role", "").lower()
|
| 350 |
+
content = message.get("content", "")
|
| 351 |
+
else:
|
| 352 |
+
role = getattr(message, "role", "").lower()
|
| 353 |
+
content = getattr(message, "content", "")
|
| 354 |
+
|
| 355 |
+
# Format based on role
|
| 356 |
+
if role == "human" or role == "user":
|
| 357 |
+
formatted_chat += f"Human: {content}\n\n"
|
| 358 |
+
elif role == "assistant":
|
| 359 |
+
formatted_chat += f"Assistant: {content}\n\n"
|
| 360 |
+
elif role == "system":
|
| 361 |
+
# For system messages, we prepend them with a special format
|
| 362 |
+
formatted_chat = f"System: {content}\n\n" + formatted_chat
|
| 363 |
+
else:
|
| 364 |
+
logger.warning(f"Unknown role '{role}' - treating as system message")
|
| 365 |
+
formatted_chat += f"System: {content}\n\n"
|
| 366 |
+
|
| 367 |
+
return formatted_chat.strip()
|
| 368 |
+
|
| 369 |
+
def __call__(self, features):
|
| 370 |
+
batch = {"input_ids": [], "attention_mask": [], "labels": []}
|
| 371 |
+
|
| 372 |
+
for example in features:
|
| 373 |
+
try:
|
| 374 |
+
# Get ID and conversation fields
|
| 375 |
+
paper_id = example.get("id", "") if isinstance(example, dict) else getattr(example, "id", "")
|
| 376 |
+
conversation = example.get("conversations", []) if isinstance(example, dict) else getattr(example, "conversations", [])
|
| 377 |
+
|
| 378 |
+
if not conversation:
|
| 379 |
+
self.stats["skipped"] += 1
|
| 380 |
+
continue
|
| 381 |
+
|
| 382 |
+
# Increment counters
|
| 383 |
+
self.prompt_counter += 1
|
| 384 |
+
if paper_id not in self.paper_counters:
|
| 385 |
+
self.paper_counters[paper_id] = 0
|
| 386 |
+
self.paper_counters[paper_id] += 1
|
| 387 |
+
|
| 388 |
+
# Add metadata as system message
|
| 389 |
+
metadata = {
|
| 390 |
+
"role": "system",
|
| 391 |
+
"content": f"Paper ID: {paper_id} | Chunk: {self.paper_counters[paper_id]}"
|
| 392 |
+
}
|
| 393 |
+
|
| 394 |
+
# Format the conversation using phi-4's chat template
|
| 395 |
+
formatted_content = self.format_phi_chat([metadata] + conversation)
|
| 396 |
+
|
| 397 |
+
# Tokenize with the model's chat template
|
| 398 |
+
inputs = self.tokenizer(
|
| 399 |
+
formatted_content,
|
| 400 |
+
add_special_tokens=True,
|
| 401 |
+
truncation=True,
|
| 402 |
+
max_length=model_config.get("max_seq_length", 2048),
|
| 403 |
+
return_tensors=None, # Return list instead of tensors
|
| 404 |
+
)
|
| 405 |
+
|
| 406 |
+
input_ids = inputs["input_ids"]
|
| 407 |
+
attention_mask = inputs["attention_mask"]
|
| 408 |
+
|
| 409 |
+
if len(input_ids) > 0:
|
| 410 |
+
# For causal language modeling, labels are the same as inputs
|
| 411 |
+
labels = input_ids.copy()
|
| 412 |
+
|
| 413 |
+
batch["input_ids"].append(input_ids)
|
| 414 |
+
batch["attention_mask"].append(attention_mask)
|
| 415 |
+
batch["labels"].append(labels)
|
| 416 |
+
|
| 417 |
+
self.stats["processed"] += 1
|
| 418 |
+
self.stats["total_tokens"] += len(input_ids)
|
| 419 |
+
|
| 420 |
+
# Debug logging for first few examples
|
| 421 |
+
if self.stats["processed"] <= 3:
|
| 422 |
+
logger.info(f"Example {self.stats['processed']} format:")
|
| 423 |
+
logger.info(f"Paper ID: {paper_id} | Chunk: {self.paper_counters[paper_id]}")
|
| 424 |
+
logger.info(f"Token count: {len(input_ids)}")
|
| 425 |
+
logger.info(f"Content preview:\n{formatted_content[:500]}...")
|
| 426 |
+
else:
|
| 427 |
+
self.stats["skipped"] += 1
|
| 428 |
+
|
| 429 |
+
except Exception as e:
|
| 430 |
+
logger.warning(f"Error processing example: {str(e)[:100]}...")
|
| 431 |
+
self.stats["skipped"] += 1
|
| 432 |
+
continue
|
| 433 |
+
|
| 434 |
+
# Handle empty batches
|
| 435 |
+
if not batch["input_ids"]:
|
| 436 |
+
logger.warning("Empty batch, returning dummy tensors")
|
| 437 |
+
return {
|
| 438 |
+
"input_ids": torch.zeros((1, 1), dtype=torch.long),
|
| 439 |
+
"attention_mask": torch.zeros((1, 1), dtype=torch.long),
|
| 440 |
+
"labels": torch.zeros((1, 1), dtype=torch.long)
|
| 441 |
+
}
|
| 442 |
+
|
| 443 |
+
# Pad the batch
|
| 444 |
+
max_length = max(len(ids) for ids in batch["input_ids"])
|
| 445 |
+
|
| 446 |
+
for i in range(len(batch["input_ids"])):
|
| 447 |
+
padding_length = max_length - len(batch["input_ids"][i])
|
| 448 |
+
if padding_length > 0:
|
| 449 |
+
batch["input_ids"][i].extend([self.pad_token_id] * padding_length)
|
| 450 |
+
batch["attention_mask"][i].extend([0] * padding_length)
|
| 451 |
+
batch["labels"][i].extend([-100] * padding_length) # Don't compute loss on padding
|
| 452 |
+
|
| 453 |
+
# Convert to tensors
|
| 454 |
+
batch = {k: torch.tensor(v) for k, v in batch.items()}
|
| 455 |
+
|
| 456 |
+
# Log stats periodically
|
| 457 |
+
if self.stats["processed"] % 100 == 0 and self.stats["processed"] > 0:
|
| 458 |
+
logger.info(f"Data collator stats: processed={self.stats['processed']}, "
|
| 459 |
+
f"skipped={self.stats['skipped']}, "
|
| 460 |
+
f"avg_tokens={self.stats['total_tokens']/self.stats['processed']:.1f}, "
|
| 461 |
+
f"unique_papers={len(self.paper_counters)}")
|
| 462 |
+
|
| 463 |
+
return batch
|
| 464 |
+
|
| 465 |
+
# Create data collator
|
| 466 |
+
data_collator = SimpleDataCollator(tokenizer)
|
| 467 |
+
|
| 468 |
+
# Simple logging callback
|
| 469 |
+
class LoggingCallback(TrainerCallback):
|
| 470 |
+
def __init__(self):
|
| 471 |
+
self.last_log_time = datetime.now()
|
| 472 |
+
self.training_start_time = datetime.now()
|
| 473 |
+
|
| 474 |
+
def on_step_end(self, args, state, control, **kwargs):
|
| 475 |
+
# Log every 50 steps or every 5 minutes, whichever comes first
|
| 476 |
+
current_time = datetime.now()
|
| 477 |
+
time_diff = (current_time - self.last_log_time).total_seconds()
|
| 478 |
+
elapsed_time = (current_time - self.training_start_time).total_seconds() / 60 # in minutes
|
| 479 |
+
|
| 480 |
+
if state.global_step % 50 == 0 or time_diff > 300: # 300 seconds = 5 minutes
|
| 481 |
+
loss = state.log_history[-1]['loss'] if state.log_history else 'N/A'
|
| 482 |
+
lr = state.log_history[-1]['learning_rate'] if state.log_history else 'N/A'
|
| 483 |
+
|
| 484 |
+
if isinstance(loss, float):
|
| 485 |
+
loss_str = f"{loss:.4f}"
|
| 486 |
+
else:
|
| 487 |
+
loss_str = str(loss)
|
| 488 |
+
|
| 489 |
+
if isinstance(lr, float):
|
| 490 |
+
lr_str = f"{lr:.8f}"
|
| 491 |
+
else:
|
| 492 |
+
lr_str = str(lr)
|
| 493 |
+
|
| 494 |
+
logger.info(f"Step: {state.global_step} | Loss: {loss_str} | LR: {lr_str} | Elapsed: {elapsed_time:.2f} min")
|
| 495 |
+
self.last_log_time = current_time
|
| 496 |
+
|
| 497 |
+
# Set up training arguments
|
| 498 |
+
logger.info("Setting up training arguments")
|
| 499 |
+
training_args = TrainingArguments(
|
| 500 |
+
output_dir=model_config.get("output_dir", "./results"),
|
| 501 |
+
num_train_epochs=model_config.get("num_train_epochs", 3),
|
| 502 |
+
per_device_train_batch_size=model_config.get("per_device_train_batch_size", 4), # Use config value, can be > 1
|
| 503 |
+
gradient_accumulation_steps=model_config.get("gradient_accumulation_steps", 8),
|
| 504 |
+
learning_rate=model_config.get("learning_rate", 5e-5),
|
| 505 |
+
weight_decay=model_config.get("weight_decay", 0.01),
|
| 506 |
+
warmup_ratio=model_config.get("warmup_ratio", 0.1),
|
| 507 |
+
lr_scheduler_type=model_config.get("lr_scheduler_type", "cosine"),
|
| 508 |
+
logging_steps=model_config.get("logging_steps", 10),
|
| 509 |
+
save_strategy=model_config.get("save_strategy", "steps"), # Updated to use steps by default
|
| 510 |
+
save_steps=model_config.get("save_steps", 100), # Save every 100 steps by default
|
| 511 |
+
save_total_limit=model_config.get("save_total_limit", 3), # Keep last 3 checkpoints
|
| 512 |
+
fp16=model_config.get("fp16", True),
|
| 513 |
+
            bf16=model_config.get("bf16", False),
            max_grad_norm=model_config.get("max_grad_norm", 1.0),
            push_to_hub=model_config.get("push_to_hub", False),
            hub_model_id=model_config.get("hub_model_id", None),
            hub_token=os.environ.get("HF_TOKEN", None),
            report_to="tensorboard",
            remove_unused_columns=False,  # Keep the conversations column
            gradient_checkpointing=model_config.get("gradient_checkpointing", True),  # Enable gradient checkpointing
            dataloader_pin_memory=False,  # Reduce memory usage
            optim=model_config.get("optim", "adamw_torch"),
            ddp_find_unused_parameters=False,  # Improve distributed training efficiency
            dataloader_drop_last=False,  # Process all examples
            dataloader_num_workers=0,  # Sequential data loading
        )

        # Create a sequential sampler to ensure dataset is processed in order
        logger.info("Creating sequential sampler to maintain dataset order")

        # Create trainer with callback
        logger.info("Creating trainer")

        # Check if we should resume from checkpoint
        resume_from_checkpoint = False
        output_dir = model_config.get("output_dir", "./results")
        if os.path.exists(output_dir):
            checkpoints = [folder for folder in os.listdir(output_dir) if folder.startswith("checkpoint-")]
            if checkpoints:
                latest_checkpoint = max(checkpoints, key=lambda x: int(x.split("-")[1]))
                resume_from_checkpoint = os.path.join(output_dir, latest_checkpoint)
                logger.info(f"Found checkpoint: {resume_from_checkpoint}. Training will resume from this point.")

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            data_collator=data_collator,
            callbacks=[LoggingCallback()]
        )

        # Override the default data loader to disable shuffling
        # This is necessary because TrainingArguments doesn't have a direct shuffle parameter
        def get_train_dataloader_no_shuffle():
            """Create a train DataLoader with shuffling disabled."""
            logger.info("Creating train dataloader with sequential sampler (no shuffling)")

            # Create a sequential sampler to ensure dataset is processed in order
            train_sampler = torch.utils.data.SequentialSampler(dataset)

            return torch.utils.data.DataLoader(
                dataset,
                batch_size=training_args.per_device_train_batch_size,
                sampler=train_sampler,  # Use sequential sampler instead of shuffle parameter
                collate_fn=data_collator,
                drop_last=False,
                num_workers=0,
                pin_memory=False
            )

        # Replace the default data loader with our non-shuffling version
        trainer.get_train_dataloader = get_train_dataloader_no_shuffle

        # Start training
        logger.info("Starting training")
        logger.info(f"Processing with batch size = {training_args.per_device_train_batch_size}, each entry processed independently")

        # Create a lock file to indicate training is in progress
        lock_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), "TRAINING_IN_PROGRESS.lock")
        with open(lock_file, "w") as f:
            f.write(f"Training started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"Expected completion: After {training_args.num_train_epochs} epochs\n")
            f.write("DO NOT UPDATE OR RESTART THIS SPACE UNTIL TRAINING COMPLETES\n")
        logger.info(f"Created lock file: {lock_file}")

        try:
            trainer.train(resume_from_checkpoint=resume_from_checkpoint)
            logger.info("Training completed successfully")

            # Save model
            if model_config.get("push_to_hub", False):
                logger.info(f"Pushing model to hub: {model_config.get('hub_model_id')}")
                trainer.push_to_hub()
                logger.info("Model pushed to hub successfully")
            else:
                logger.info(f"Saving model to {model_config.get('output_dir', './results')}")
                trainer.save_model()
                logger.info("Model saved successfully")
        except Exception as e:
            logger.error(f"Training failed with error: {str(e)}")
            raise
        finally:
            # Remove the lock file when training completes or fails
            if os.path.exists(lock_file):
                os.remove(lock_file)
                logger.info(f"Removed lock file: {lock_file}")

        return 0

    except Exception as e:
        logger.error(f"Error in main training loop: {str(e)}")
        return 1


if __name__ == "__main__":
    sys.exit(main())
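The `get_train_dataloader` override above exists because `TrainingArguments` has no switch for disabling shuffling, so the script swaps in a `SequentialSampler` by hand. A minimal, self-contained sketch of the same idea on toy data (not the actual tokenized dataset) shows the order-preserving behaviour:

```python
from torch.utils.data import DataLoader, SequentialSampler

# Toy stand-in for the training dataset; any sized sequence works here.
toy_dataset = list(range(10))

# SequentialSampler yields indices 0, 1, 2, ..., so batches keep dataset order.
loader = DataLoader(toy_dataset, batch_size=4, sampler=SequentialSampler(toy_dataset))

for batch in loader:
    print(batch.tolist())
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]
```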
transformers_config.json
ADDED
@@ -0,0 +1,75 @@
{
    "model": {
        "name": "unsloth/phi-4-unsloth-bnb-4bit",
        "trust_remote_code": true,
        "use_fast_tokenizer": true
    },

    "tokenizer": {
        "chat_template": "phi",
        "max_seq_length": 2048,
        "padding_side": "right",
        "add_eos_token": true
    },

    "training": {
        "per_device_train_batch_size": 16,
        "gradient_accumulation_steps": 4,
        "learning_rate": 2e-5,
        "num_train_epochs": 3,
        "max_steps": -1,
        "logging_steps": 10,
        "save_steps": 200,
        "save_total_limit": 5,
        "push_to_hub": true,
        "hub_strategy": "every_save",
        "gradient_checkpointing": true,
        "optim": "adamw_torch",
        "lr_scheduler_type": "cosine",
        "warmup_ratio": 0.03,
        "weight_decay": 0.01,
        "max_grad_norm": 1.0,
        "neftune_noise_alpha": 5
    },

    "checkpointing": {
        "output_dir": "./results",
        "save_strategy": "steps",
        "save_steps": 100,
        "save_total_limit": 3,
        "hub_strategy": "every_save"
    },

    "unsloth": {
        "enabled": true,
        "r": 32,
        "alpha": 16,
        "dropout": 0.05,
        "target_modules": [
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj"
        ]
    },

    "logging": {
        "logging_steps": 50,
        "log_level": "info"
    },

    "huggingface_hub": {
        "push_to_hub": true,
        "hub_model_id": "phi-4-research-assistant",
        "hub_private_repo": true
    },

    "model_name_or_path": "unsloth/phi-4-unsloth-bnb-4bit",
    "model_revision": "main",
    "use_flash_attention": true,
    "torch_dtype": "bfloat16",
    "bf16": true
}
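A quick way to sanity-check the nested config above is to load it and read back the training section. The real parsing happens inside run_transformers_training.py, so this is only an illustrative sketch:

```python
import json

with open("transformers_config.json") as f:
    config = json.load(f)

training = config["training"]
print(training["per_device_train_batch_size"])  # 16
print(training["gradient_accumulation_steps"])  # 4

# Effective examples per optimizer step on a single device:
# 16 * 4 = 64 (larger again when training on multiple GPUs)
```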
update_space.py
ADDED
@@ -0,0 +1,219 @@
#!/usr/bin/env python

"""
Quick script to update your Hugging Face Space for phi-4-unsloth-bnb-4bit training.
This script handles the specific requirements for the 4-bit quantized Phi-4 model training,
including proper configuration and dependency management.
"""

import os
import sys
import json
import subprocess
import argparse
import logging
from pathlib import Path
from huggingface_hub import HfApi, login
import getpass

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)]
)
logger = logging.getLogger(__name__)

def load_env_variables():
    """Load environment variables from system or .env file."""
    # Check if we're running in a Hugging Face Space
    if os.environ.get("SPACE_ID"):
        logger.info("Running in Hugging Face Space")
        if "/" in os.environ.get("SPACE_ID", ""):
            username = os.environ.get("SPACE_ID").split("/")[0]
            os.environ["HF_USERNAME"] = username
            logger.info(f"Set HF_USERNAME from SPACE_ID: {username}")
    else:
        try:
            from dotenv import load_dotenv
            env_path = Path(__file__).parent.parent / ".env"
            if env_path.exists():
                load_dotenv(env_path)
                logger.info(f"Loaded environment variables from {env_path}")
            else:
                logger.warning(f"No .env file found at {env_path}")
        except ImportError:
            logger.warning("python-dotenv not installed, skipping .env loading")

    # Verify required variables
    required_vars = {
        "HF_TOKEN": os.environ.get("HF_TOKEN"),
        "HF_USERNAME": os.environ.get("HF_USERNAME"),
        "HF_SPACE_NAME": os.environ.get("HF_SPACE_NAME", "phi4-cognitive-training")
    }

    missing_vars = [k for k, v in required_vars.items() if not v]
    if missing_vars:
        raise ValueError(f"Missing required environment variables: {', '.join(missing_vars)}")

    return required_vars

def verify_configs():
    """Verify that all necessary configuration files exist and are valid."""
    current_dir = Path(__file__).parent
    required_files = [
        "transformers_config.json",
        "hardware_config.json",
        "dataset_config.json",
        "requirements.txt",
        "run_transformers_training.py"
    ]

    missing_files = []
    for file in required_files:
        if not (current_dir / file).exists():
            missing_files.append(file)

    if missing_files:
        raise FileNotFoundError(f"Missing required files: {', '.join(missing_files)}")

    # Verify JSON configs
    json_files = [f for f in required_files if f.endswith('.json')]
    for json_file in json_files:
        try:
            with open(current_dir / json_file) as f:
                json.load(f)
            logger.info(f"Verified {json_file} is valid JSON")
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON in {json_file}: {e}")

def update_requirements():
    """Update requirements.txt with necessary packages."""
    current_dir = Path(__file__).parent
    req_path = current_dir / "requirements.txt"

    required_packages = {
        "torch>=2.0.0",
        "transformers>=4.36.0",
        "accelerate>=0.27.0",
        "bitsandbytes>=0.41.0",
        "tensorboard>=2.15.0",
        "gradio>=5.17.0",
        "huggingface-hub>=0.19.0",
        "datasets>=2.15.0"
    }

    # Read existing requirements
    existing_requirements = set()
    if req_path.exists():
        with open(req_path) as f:
            existing_requirements = {line.strip() for line in f if line.strip()}

    # Add new requirements
    updated_requirements = existing_requirements.union(required_packages)

    # Write updated requirements
    with open(req_path, 'w') as f:
        for req in sorted(updated_requirements):
            f.write(f"{req}\n")

    logger.info("Updated requirements.txt with necessary packages")

def create_space(username, space_name):
    """Create or get a Hugging Face Space."""
    try:
        api = HfApi()
        space_id = f"{username}/{space_name}"
        logger.info(f"Checking Space {space_id}...")

        try:
            space_info = api.space_info(repo_id=space_id)
            logger.info(f"Space {space_id} exists")
            return space_info
        except Exception:
            logger.info(f"Creating new Space {space_id}...")
            space_info = api.create_repo(
                repo_id=space_id,
                repo_type="space",
                space_sdk="gradio",
                private=False
            )
            logger.info(f"Space {space_id} created successfully")
            return space_info
    except Exception as e:
        raise RuntimeError(f"Error with Space {username}/{space_name}: {e}")

def main():
    parser = argparse.ArgumentParser(description='Update Hugging Face Space for Phi-4 training')
    parser.add_argument('--space_name', type=str, help='Space name (default: from env)')
    parser.add_argument('--force', action='store_true', help='Skip confirmation')
    args = parser.parse_args()

    if not args.force:
        print("\n" + "!"*80)
        print("WARNING: Updating the Space will INTERRUPT any ongoing training!")
        print("Make sure all checkpoints are saved before proceeding.")
        print("!"*80 + "\n")

        confirm = input("Type 'update' to confirm: ")
        if confirm.lower() != 'update':
            logger.info("Update cancelled")
            return False

    try:
        # Load environment variables
        env_vars = load_env_variables()
        logger.info(f"Environment variables loaded: USERNAME={env_vars['HF_USERNAME']}, SPACE_NAME={env_vars['HF_SPACE_NAME']}")

        # Verify configurations
        verify_configs()
        logger.info("All configuration files verified successfully")

        # Update requirements
        update_requirements()
        logger.info("Requirements updated successfully")

        # Get space name
        space_name = args.space_name or env_vars["HF_SPACE_NAME"]
        logger.info(f"Using space name: {space_name}")

        # Login to Hugging Face
        logger.info("Logging in to Hugging Face...")
        login(token=env_vars["HF_TOKEN"])
        logger.info("Successfully logged in to Hugging Face")

        # Create/get space
        space_info = create_space(env_vars["HF_USERNAME"], space_name)
        logger.info(f"Space info: {space_info}")

        # Upload files
        current_dir = Path(__file__).parent
        logger.info(f"Uploading files from {current_dir} to Space {env_vars['HF_USERNAME']}/{space_name}...")

        # Create .gitignore
        with open(current_dir / ".gitignore", "w") as f:
            f.write(".env\n*.pyc\n__pycache__\n")
        logger.info("Created .gitignore file")

        api = HfApi()
        api.upload_folder(
            folder_path=str(current_dir),
            repo_id=f"{env_vars['HF_USERNAME']}/{space_name}",
            repo_type="space",
            ignore_patterns=[".env", "*.pyc", "__pycache__", "TRAINING_IN_PROGRESS.lock"]
        )

        logger.info("Files uploaded successfully")
        space_url = f"https://huggingface.co/spaces/{env_vars['HF_USERNAME']}/{space_name}"
        logger.info(f"Space URL: {space_url}")
        print(f"\nSpace created successfully! You can view it at:\n{space_url}")
        return True

    except Exception as e:
        logger.error(f"Error updating Space: {str(e)}")
        return False

if __name__ == "__main__":
    success = main()
    sys.exit(0 if success else 1)
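Based on the argument parser above, the script can be run directly with `python update_space.py`, in which case it asks for the literal confirmation `update` before touching the Space; passing `--force` skips that prompt, and `--space_name` overrides the `HF_SPACE_NAME` environment variable. Because uploading new files rebuilds the Space, run it only when no training job is in progress, which is exactly what the interruption warning and the `TRAINING_IN_PROGRESS.lock` ignore pattern are there to protect against.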