IONOS Llama 3.3 70B Instruct FP8

Model Overview

Description: IONOS Llama 3.3 70B Instruct FP8 is an FP8-quantized version of Meta's Llama 3.3 70B Instruct model. The underlying auto-regressive language model uses an optimized transformer architecture; the quantization was performed with SmoothQuant via LLM Compressor, roughly halving memory requirements while maintaining high accuracy, which makes the model well suited for production deployments.

Third-Party Community Model: This model is based on Meta's Llama-3.3-70B-Instruct and has been quantized and optimized by IONOS to deliver an efficient, enterprise-ready solution. IONOS does not own or develop the original model architecture. For complete details about the base model, refer to the Meta Llama-3.3-70B-Instruct Model Card.

License and Terms of Use

License: This model is licensed under the Llama 3.3 Community License

Required Attribution: "Built with Llama, quantized by IONOS"

Intended Use: Commercial and non-commercial applications, particularly suitable for developers and enterprises seeking production-ready, pre-quantized models for efficient deployment scenarios.

Technical Specifications

Model Architecture:

  • Base Architecture: Transformer (Llama 3.3)
  • Quantization Method: SmoothQuant with LLM Compressor
  • Precision Optimization: linear-layer weights and activations reduced from 16-bit (BF16) to 8-bit (FP8)
  • Memory Efficiency: Approximately 50% reduction in disk size and GPU memory requirements (see the rough estimate below)
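
A quick back-of-the-envelope check of the 50% figure, counting weight storage only (activations, KV cache, and the tensors kept in 16-bit make real footprints somewhat larger):

# Rough weight-storage estimate for this ~70B-parameter checkpoint
params = 70.6e9               # reported parameter count
bf16_gb = params * 2 / 1e9    # 2 bytes per parameter at 16-bit precision
fp8_gb = params * 1 / 1e9     # 1 byte per parameter at 8-bit precision
print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB")
# -> BF16 weights: ~141 GB, FP8 weights: ~71 GB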

Input Specifications:

  • Type: Text
  • Format: UTF-8 encoded strings
  • Context Window: Up to 128,000 tokens
  • Input Structure: 1D token sequences (see the tokenization sketch below)
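
To make the input structure concrete, the sketch below shows how chat messages become the 1D token sequence the model consumes; it assumes the FP8 repository ships the standard Llama 3.3 tokenizer and chat template:

from transformers import AutoTokenizer

# Assumption: the FP8 repo reuses the base model's tokenizer and chat template
tokenizer = AutoTokenizer.from_pretrained("ionos/Llama-3.3-70B-Instruct-FP8")

messages = [{"role": "user", "content": "Hello!"}]
token_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# A flat (1D) list of token ids; its length counts against the 128K context window
print(len(token_ids), token_ids[:8])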

Output Specifications:

  • Type: Generated text
  • Format: UTF-8 encoded strings
  • Output Structure: 1D token sequences

Platform Compatibility

Supported Runtime Engines:

  • vLLM (recommended for production deployments)
  • Compatible with standard transformer inference frameworks

Implementation Examples

IONOS AI Model Hub Integration

import requests

# Configuration
IONOS_API_TOKEN = "your_api_token_here"
API_ENDPOINT = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"

# API request
response = requests.post(
    API_ENDPOINT,
    headers={
        "Authorization": f"Bearer {IONOS_API_TOKEN}",
        "Content-Type": "application/json"
    },
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "temperature": 0.7,
        "max_tokens": 1024,
        "top_p": 0.9
    },
    timeout=120
)
response.raise_for_status()

# The endpoint follows the OpenAI chat-completions schema
print(response.json()["choices"][0]["message"]["content"])
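
Because the endpoint implements the OpenAI chat-completions protocol, the official openai Python client can also be pointed at it. A minimal sketch, assuming the same token and model identifier as above:

from openai import OpenAI

client = OpenAI(
    api_key="your_api_token_here",
    base_url="https://openai.inference.de-txl.ionos.com/v1"
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    temperature=0.7,
    max_tokens=1024,
    top_p=0.9
)
print(completion.choices[0].message.content)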

vLLM Deployment

from vllm import LLM, SamplingParams

def deploy_llama_model():
    """Deploy and run inference with IONOS Llama 3.3 70B Instruct FP8"""
    
    # Sample prompts for testing
    prompts = [
        "Explain the benefits of renewable energy",
        "Write a Python function to calculate fibonacci numbers",
        "Describe the process of machine learning model training",
        "What are the key principles of sustainable development?"
    ]
    
    # Configure sampling parameters
    sampling_params = SamplingParams(
        temperature=0.8, 
        top_p=0.95,
        max_tokens=512
    )
    
    # Initialize the model; a 70B checkpoint typically needs multiple GPUs,
    # so set tensor_parallel_size to match your hardware
    llm = LLM(
        model="ionos/Llama-3.3-70B-Instruct-FP8",
        # tensor_parallel_size=4,  # example value; adjust to your GPU count
    )
    
    # Generate responses
    outputs = llm.generate(prompts, sampling_params)
    
    # Display results
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt}")
        print(f"Response: {generated_text}")
        print("-" * 80)

if __name__ == '__main__':
    deploy_llama_model()
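
Since this is an instruction-tuned checkpoint, vLLM's chat interface is often a better fit than raw prompts because it applies the Llama 3.3 chat template automatically. A minimal variant using the same model and sampling settings as above:

from vllm import LLM, SamplingParams

llm = LLM(model="ionos/Llama-3.3-70B-Instruct-FP8")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

# chat() wraps the messages in the model's chat template before generating
outputs = llm.chat(
    [{"role": "user", "content": "Explain the benefits of renewable energy"}],
    sampling_params
)
print(outputs[0].outputs[0].text)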

Training and Optimization Details

Quantization Process: This model employs SmoothQuant quantization implemented through LLM Compressor. SmoothQuant redistributes quantization difficulty from activations to weights by applying mathematically equivalent transformations, enabling effective FP8 quantization. Calibration was performed on the WikiText dataset, and the process specifically targets the weights and activations of linear operators within transformer blocks, preserving model accuracy while significantly reducing computational requirements. A sketch of what such a recipe looks like appears after the calibration dataset list below.

Calibration Dataset:

  • WikiText: Used for SmoothQuant calibration to optimize quantization parameters
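
For illustration, a minimal LLM Compressor recipe of this shape might look like the sketch below. This is not IONOS's published recipe: the exact modifiers, hyperparameters, and dataset preprocessing are assumptions, and import paths vary between llm-compressor versions.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant migrates quantization difficulty from activations into weights
# via an equivalent rescaling; the linear layers are then quantized to FP8,
# leaving the output head in higher precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # strength is an assumption
    QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.3-70B-Instruct",
    dataset="wikitext",                  # calibration data, per this card
    recipe=recipe,
    output_dir="Llama-3.3-70B-Instruct-FP8",
    max_seq_length=2048,                 # assumption
    num_calibration_samples=512,         # assumption
)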

Evaluation Datasets:

  • MMLU (Massive Multitask Language Understanding)
  • GSM8K (Grade School Math 8K)
  • ARC Challenge (AI2 Reasoning Challenge)
  • IFEVAL (Instruction Following Evaluation)

Performance Benchmarks

Comprehensive evaluation results comparing the FP8-quantized model against the original BF16 precision:

Benchmark    Llama-3.3-70B BF16    IONOS Llama-3.3-70B FP8    Performance Retention    Difference
GSM8K        48.14%                48.37%                     100.5%                   +0.23%
HellaSwag    75.01%                74.27%                     99.0%                    -0.74%
MMLU         81.01%                80.67%                     99.6%                    -0.34%
Average      68.06%                67.77%                     99.6%                    -0.29%

Key Performance Insights:

  • Minimal accuracy degradation: retains 99.6% of the original average benchmark score
  • 50% reduction in memory footprint
  • Improved inference speed and throughput
  • Maintained reasoning and instruction-following capabilities

Building the future of sovereign AI in Europe.
