---
library_name: transformers
language:
- en
- fr
- it
- pt
- hi
- es
- th
- de
base_model:
- meta-llama/Llama-3.3-70B-Instruct
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
license: llama3.3
datasets:
- Salesforce/wikitext
model-index:
- name: ionos/Llama-3.3-70B-Instruct-FP8
  results:
  - task:
      type: text-generation
    dataset:
      name: gsm8k
      type: gsm8k
    metrics:
    - name: GSM8K (5-shot)
      type: GSM8K (5-shot)
      value: 48.37
    source:
      name: IONOS AI Model Hub
      url: https://cloud.ionos.com
  - task:
      type: text-generation
    dataset:
      name: hellaswag
      type: hellaswag
    metrics:
    - name: HellaSwag (10-shot)
      type: HellaSwag (10-shot)
      value: 74.27
    source:
      name: IONOS AI Model Hub
      url: https://cloud.ionos.com
  - task:
      type: text-generation
    dataset:
      name: mmlu
      type: mmlu
    metrics:
    - name: MMLU (5-shot)
      type: MMLU (5-shot)
      value: 80.67
    source:
      name: IONOS AI Model Hub
      url: https://cloud.ionos.com
---

# IONOS Llama 3.3 70B Instruct FP8

## Model Overview

**Description:**
IONOS Llama 3.3 70B Instruct FP8 is an FP8-quantized version of Meta's Llama 3.3 70B Instruct model, optimized for lower memory use and higher inference throughput. This auto-regressive language model uses an optimized transformer architecture and was quantized with SmoothQuant via LLM Compressor, making it well suited for production deployments while maintaining high accuracy.

**Third-Party Community Model:**
This model is based on Meta's Llama-3.3-70B-Instruct and has been quantized and optimized by IONOS to deliver an efficient, enterprise-ready solution. IONOS does not own or develop the original model architecture. For complete details about the base model, refer to the [Meta Llama-3.3-70B-Instruct Model Card](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).

## License and Terms of Use

**License:** This model is licensed under the [Llama 3.3 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE).

**Required Attribution:** "Built with Llama, quantized by IONOS"

**Intended Use:** Commercial and non-commercial applications, particularly for developers and enterprises seeking a production-ready, pre-quantized model for efficient deployment.
## Technical Specifications

**Model Architecture:**
- **Base Architecture:** Transformer (Llama 3.3)
- **Quantization Method:** SmoothQuant with LLM Compressor
- **Precision:** Weights and activations reduced from 16-bit to 8-bit (FP8)
- **Memory Efficiency:** Approximately 50% reduction in disk size and GPU memory requirements

**Input Specifications:**
- **Type:** Text
- **Format:** UTF-8 encoded strings
- **Context Window:** Up to 128,000 tokens
- **Input Structure:** 1D token sequences

**Output Specifications:**
- **Type:** Generated text
- **Format:** UTF-8 encoded strings
- **Output Structure:** 1D token sequences

## Platform Compatibility

**Supported Runtime Engines:**
- vLLM (recommended for production deployments)
- Compatible with standard transformer inference frameworks

## Implementation Examples

### IONOS AI Model Hub Integration

```python
import requests

# Configuration
IONOS_API_TOKEN = "your_api_token_here"
API_ENDPOINT = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"

# OpenAI-compatible chat completion request
response = requests.post(
    API_ENDPOINT,
    headers={
        "Authorization": f"Bearer {IONOS_API_TOKEN}",
        "Content-Type": "application/json"
    },
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "temperature": 0.7,
        "max_tokens": 1024,
        "top_p": 0.9
    }
)
response.raise_for_status()

# Print the assistant's reply
print(response.json()["choices"][0]["message"]["content"])
```

### vLLM Deployment

```python
from vllm import LLM, SamplingParams

def deploy_llama_model():
    """Deploy and run inference with IONOS Llama 3.3 70B Instruct FP8."""

    # Sample prompts for testing
    prompts = [
        "Explain the benefits of renewable energy",
        "Write a Python function to calculate fibonacci numbers",
        "Describe the process of machine learning model training",
        "What are the key principles of sustainable development?"
    ]

    # Configure sampling parameters
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        max_tokens=512
    )

    # Initialize the model. Even at FP8, the 70B weights occupy roughly
    # 70 GB, so shard across GPUs with tensor parallelism as needed.
    llm = LLM(
        model="ionos/Llama-3.3-70B-Instruct-FP8",
        tensor_parallel_size=4  # adjust to the number of available GPUs
    )

    # Generate responses
    outputs = llm.generate(prompts, sampling_params)

    # Display results
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt}")
        print(f"Response: {generated_text}")
        print("-" * 80)

if __name__ == '__main__':
    deploy_llama_model()
```

## Training and Optimization Details

**Quantization Process:**
This model was quantized with SmoothQuant, implemented through LLM Compressor. SmoothQuant redistributes quantization difficulty from activations to weights by applying mathematically equivalent transformations, enabling effective FP8 quantization. Calibration was performed on the WikiText dataset.

The quantization specifically targets the weights and activations of the linear operators within the transformer blocks, preserving model accuracy while significantly reducing computational requirements.
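For reference, a quantization flow of this shape can be expressed with LLM Compressor roughly as follows. This is a minimal sketch, not the exact recipe used to produce this checkpoint: the modifier parameters, calibration settings, and import paths are illustrative assumptions and vary across `llmcompressor` versions.

```python
# Illustrative SmoothQuant + FP8 oneshot flow with LLM Compressor.
# NOT the exact recipe behind this checkpoint; all parameter values
# below are assumptions chosen for demonstration.
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # import path varies by version

recipe = [
    # Shift quantization difficulty from activations to weights
    SmoothQuantModifier(smoothing_strength=0.8),
    # Quantize linear-operator weights and activations to FP8,
    # keeping the output head in higher precision
    QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.3-70B-Instruct",
    dataset="wikitext",              # calibration data, as described above
    recipe=recipe,
    max_seq_length=2048,             # assumed calibration sequence length
    num_calibration_samples=512,     # assumed number of calibration samples
    output_dir="Llama-3.3-70B-Instruct-FP8",
)
```

The calibrated scales and FP8 weights are written to `output_dir` in the compressed-tensors format, which vLLM can load directly.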
**Calibration Dataset:**
- **WikiText:** Used for SmoothQuant calibration to optimize quantization parameters

**Evaluation Datasets:**
- MMLU (Massive Multitask Language Understanding)
- GSM8K (Grade School Math 8K)
- HellaSwag (commonsense sentence completion)
- ARC Challenge (AI2 Reasoning Challenge)
- IFEVAL (Instruction Following Evaluation)

## Performance Benchmarks

Evaluation results comparing the FP8-quantized model against the original BF16-precision model (scores are reported for GSM8K, HellaSwag, and MMLU):

| Benchmark | Llama-3.3-70B BF16 | IONOS Llama-3.3-70B FP8 | Performance Retention | Difference |
|-----------|--------------------|-------------------------|-----------------------|------------|
| **GSM8K (5-shot)** | 48.14% | 48.37% | **100.5%** | **+0.23%** |
| **HellaSwag (10-shot)** | 75.01% | 74.27% | 99.0% | -0.74% |
| **MMLU (5-shot)** | 81.01% | 80.67% | 99.6% | -0.34% |
| **Average** | 68.05% | 67.77% | **99.6%** | -0.28% |

**Key Performance Insights:**
- Minimal accuracy degradation (maintains over 99% of original performance)
- Approximately 50% reduction in memory footprint
- Improved inference speed and throughput
- Preserved reasoning and instruction-following capabilities
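The reported scores can in principle be reproduced with EleutherAI's lm-evaluation-harness. Below is a minimal sketch using its Python API with the vLLM backend; the task names, `tensor_parallel_size` value, and few-shot handling are assumptions that may need adjusting for your harness version and hardware.

```python
# Illustrative benchmark-reproduction sketch using lm-evaluation-harness
# (v0.4+ API). Task names and arguments may differ between versions.
import lm_eval

# Each benchmark in the table above uses its own few-shot setting.
# Note: this reloads the model once per task, which is slow but simple.
for task, shots in [("gsm8k", 5), ("hellaswag", 10), ("mmlu", 5)]:
    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=(
            "pretrained=ionos/Llama-3.3-70B-Instruct-FP8,"
            "tensor_parallel_size=4"  # adjust to your GPU topology
        ),
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"][task])
```

---

*Building the future of sovereign AI in Europe.*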