---
library_name: transformers
language:
- en
- fr
- it
- pt
- hi
- es
- th
- de
base_model:
- meta-llama/Llama-3.3-70B-Instruct
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
license: llama3.3
datasets:
- Salesforce/wikitext
model-index:
- name: ionos/Llama-3.3-70B-Instruct-FP8
  results:
  - task:
      type: text-generation
    dataset:
      name: gsm8k
      type: gsm8k
    metrics:
    - name: GSM8K (5-shot)
      type: GSM8K (5-shot)
      value: 48.37
    source:
      name: IONOS AI Model Hub
      url: https://cloud.ionos.com
  - task:
      type: text-generation
    dataset:
      name: hellaswag
      type: hellaswag
    metrics:
    - name: HellaSwag (10-shot)
      type: HellaSwag (10-shot)
      value: 74.27
    source:
      name: IONOS AI Model Hub
      url: https://cloud.ionos.com
  - task:
      type: text-generation
    dataset:
      name: mmlu
      type: mmlu
    metrics:
    - name: MMLU (5-shot)
      type: MMLU (5-shot)
      value: 80.67
    source:
      name: IONOS AI Model Hub
      url: https://cloud.ionos.com
---

# IONOS Llama 3.3 70B Instruct FP8

## Model Overview

**Description:**
IONOS Llama 3.3 70B Instruct FP8 is an FP8-quantized version of Meta's Llama 3.3 70B Instruct model, optimized for lower memory use and higher inference throughput. This auto-regressive language model uses an optimized transformer architecture and was quantized with SmoothQuant via LLM Compressor, making it well suited for production deployments while maintaining high accuracy.

**Third-Party Community Model:**
This model is based on Meta's Llama-3.3-70B-Instruct and has been quantized and optimized by IONOS to deliver an efficient, enterprise-ready solution. IONOS does not own or develop the original model architecture. For complete details about the base model, refer to the [Meta Llama-3.3-70B-Instruct Model Card](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).

## License and Terms of Use

**License:** This model is licensed under the [Llama 3.3 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE).

**Required Attribution:** "Built with Llama, quantized by IONOS"

**Intended Use:** Commercial and non-commercial applications, particularly for developers and enterprises seeking a production-ready, pre-quantized model for efficient deployment.
## Technical Specifications

**Model Architecture:**
- **Base Architecture:** Transformer (Llama 3.3)
- **Quantization Method:** SmoothQuant with LLM Compressor
- **Precision:** Weights and activations reduced from 16-bit to 8-bit (FP8)
- **Memory Efficiency:** Approximately 50% reduction in disk size and GPU memory requirements

**Input Specifications:**
- **Type:** Text
- **Format:** UTF-8 encoded strings
- **Context Window:** Up to 128,000 tokens
- **Input Structure:** 1D token sequences

**Output Specifications:**
- **Type:** Generated text
- **Format:** UTF-8 encoded strings
- **Output Structure:** 1D token sequences

## Platform Compatibility

**Supported Runtime Engines:**
- vLLM (recommended for production deployments)
- Compatible with standard transformer inference frameworks

## Implementation Examples

### IONOS AI Model Hub Integration

```python
import requests

# Configuration
IONOS_API_TOKEN = "your_api_token_here"
API_ENDPOINT = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"

# OpenAI-compatible chat completion request
response = requests.post(
    API_ENDPOINT,
    headers={
        "Authorization": f"Bearer {IONOS_API_TOKEN}",
        "Content-Type": "application/json"
    },
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "temperature": 0.7,
        "max_tokens": 1024,
        "top_p": 0.9
    }
)
response.raise_for_status()

# Print the assistant's reply
print(response.json()["choices"][0]["message"]["content"])
```

### vLLM Deployment

```python
from vllm import LLM, SamplingParams

def deploy_llama_model():
    """Deploy and run inference with IONOS Llama 3.3 70B Instruct FP8."""

    # Sample prompts for testing
    prompts = [
        "Explain the benefits of renewable energy",
        "Write a Python function to calculate fibonacci numbers",
        "Describe the process of machine learning model training",
        "What are the key principles of sustainable development?"
    ]

    # Configure sampling parameters
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        max_tokens=512
    )

    # Initialize the model. Even at FP8, the 70B weights occupy roughly
    # 70 GB, so shard across GPUs with tensor parallelism as needed.
    llm = LLM(
        model="ionos/Llama-3.3-70B-Instruct-FP8",
        tensor_parallel_size=4  # adjust to the number of available GPUs
    )

    # Generate responses
    outputs = llm.generate(prompts, sampling_params)

    # Display results
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt}")
        print(f"Response: {generated_text}")
        print("-" * 80)

if __name__ == '__main__':
    deploy_llama_model()
```

## Training and Optimization Details

**Quantization Process:**
This model was quantized with SmoothQuant, implemented through LLM Compressor. SmoothQuant redistributes quantization difficulty from activations to weights by applying mathematically equivalent transformations, enabling effective FP8 quantization. Calibration was performed on the WikiText dataset.

The quantization specifically targets the weights and activations of the linear operators within the transformer blocks, preserving model accuracy while significantly reducing computational requirements.
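For reference, a quantization flow of this shape can be expressed with LLM Compressor roughly as follows. This is a minimal sketch, not the exact recipe used to produce this checkpoint: the modifier parameters, calibration settings, and import paths are illustrative assumptions and vary across `llmcompressor` versions.

```python
# Illustrative SmoothQuant + FP8 oneshot flow with LLM Compressor.
# NOT the exact recipe behind this checkpoint; all parameter values
# below are assumptions chosen for demonstration.
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # import path varies by version

recipe = [
    # Shift quantization difficulty from activations to weights
    SmoothQuantModifier(smoothing_strength=0.8),
    # Quantize linear-operator weights and activations to FP8,
    # keeping the output head in higher precision
    QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.3-70B-Instruct",
    dataset="wikitext",              # calibration data, as described above
    recipe=recipe,
    max_seq_length=2048,             # assumed calibration sequence length
    num_calibration_samples=512,     # assumed number of calibration samples
    output_dir="Llama-3.3-70B-Instruct-FP8",
)
```

The calibrated scales and FP8 weights are written to `output_dir` in the compressed-tensors format, which vLLM can load directly.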
**Calibration Dataset:**
- **WikiText:** Used for SmoothQuant calibration to optimize quantization parameters

**Evaluation Datasets:**
- MMLU (Massive Multitask Language Understanding)
- GSM8K (Grade School Math 8K)
- HellaSwag (commonsense sentence completion)
- ARC Challenge (AI2 Reasoning Challenge)
- IFEVAL (Instruction Following Evaluation)

## Performance Benchmarks

Evaluation results comparing the FP8-quantized model against the original BF16-precision model (scores are reported for GSM8K, HellaSwag, and MMLU):

| Benchmark | Llama-3.3-70B BF16 | IONOS Llama-3.3-70B FP8 | Performance Retention | Difference |
|-----------|--------------------|-------------------------|-----------------------|------------|
| **GSM8K (5-shot)** | 48.14% | 48.37% | **100.5%** | **+0.23%** |
| **HellaSwag (10-shot)** | 75.01% | 74.27% | 99.0% | -0.74% |
| **MMLU (5-shot)** | 81.01% | 80.67% | 99.6% | -0.34% |
| **Average** | 68.05% | 67.77% | **99.6%** | -0.28% |

**Key Performance Insights:**
- Minimal accuracy degradation (maintains over 99% of original performance)
- Approximately 50% reduction in memory footprint
- Improved inference speed and throughput
- Preserved reasoning and instruction-following capabilities
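The reported scores can in principle be reproduced with EleutherAI's lm-evaluation-harness. Below is a minimal sketch using its Python API with the vLLM backend; the task names, `tensor_parallel_size` value, and few-shot handling are assumptions that may need adjusting for your harness version and hardware.

```python
# Illustrative benchmark-reproduction sketch using lm-evaluation-harness
# (v0.4+ API). Task names and arguments may differ between versions.
import lm_eval

# Each benchmark in the table above uses its own few-shot setting.
# Note: this reloads the model once per task, which is slow but simple.
for task, shots in [("gsm8k", 5), ("hellaswag", 10), ("mmlu", 5)]:
    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=(
            "pretrained=ionos/Llama-3.3-70B-Instruct-FP8,"
            "tensor_parallel_size=4"  # adjust to your GPU topology
        ),
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"][task])
```

---

*Building the future of sovereign AI in Europe.*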