🧠 Model Description

The mcp-tool-use-quality-ranger-0.6b is a sequence classification model, fine-tuned from Qwen/Qwen3-0.6B, that evaluates the quality of function calls made by conversational AI systems against Model Context Protocol (MCP) tools.
It assesses whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

Max Context Length: 32,768 Tokens

It determines whether a given function call:

  • Selects the correct tool
  • Has correct parameter names and structure
  • Contains correct parameter values

It produces one of four possible classification labels:

| Label | Meaning |
|---|---|
| VALID_CALL ✅ | The tool name, parameters, and values are all correct, or no suitable tool exists and no function call is made. |
| TOOL_ERROR ❌ | The tool name does not exist or does not match the user intent. |
| PARAM_NAME_ERROR ❌ | The correct tool is used, but parameter names are missing, extra, or incorrect. |
| PARAM_VALUE_ERROR ❌ | Tool and parameter names are correct, but parameter values are wrong or incorrectly formatted. |
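In an agent loop, these labels are typically used to gate whether a proposed tool call is actually executed. A minimal sketch of such gating, assuming the predicted label string is already available (the callback names here are illustrative, not part of the model's API):

# Illustrative gating: run the tool call only when the ranker judges it VALID_CALL;
# otherwise hand the error label back to the agent so it can repair the call.
def gate_tool_call(label: str, execute_call, repair_call):
    if label == "VALID_CALL":
        return execute_call()
    # TOOL_ERROR / PARAM_NAME_ERROR / PARAM_VALUE_ERROR
    return repair_call(reason=label)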

πŸ”½ Quantized Version


πŸ“Š Benchmark Evaluation

The mcp-tool-use-quality-ranger-0.6b was evaluated in a binary classification setting, where a prediction counts as Correct if the model's function-call evaluation matches the gold label, and Incorrect otherwise.

| Model | #Params | Avg. Latency (sec) | Avg. Binary Accuracy | Qualifire mcp-tool-use-quality Benchmark (Binary Accuracy) | Limbic Benchmark (Binary Accuracy) |
|---|---|---|---|---|---|
| qualifire/mcp-tool-use-quality-ranger-4b [private] | 4B | 0.30 | 0.962 | 0.971 | 0.954 |
| qualifire/mcp-tool-use-quality-ranger-0.6b | 0.6B | 0.09 | 0.928 | 0.949 | 0.907 |
| gemini-2.5-flash | - | 4.87 | 0.858 | 0.871 | 0.845 |
| quotientai/limbic-tool-use-0.5B-32K | 0.5B | 0.79 | 0.798 | 0.708 | 0.887 |

πŸ“Œ Metrics Definitions

  • Avg. Binary Accuracy – Mean accuracy across all evaluated benchmarks, where predictions are mapped to binary outcomes as follows (a code sketch of this mapping follows the list):

    • Qualifire TUQ Benchmark

      • Correct → VALID_CALL
      • Incorrect → TOOL_ERROR, PARAM_NAME_ERROR, or PARAM_VALUE_ERROR
    • Limbic Benchmark

      • Correct → correct
      • Incorrect → incorrect_tool, incorrect_parameter_names, or incorrect_parameter_values
  • Qualifire TUQ Benchmark – the Qualifire Tool Selection Quality Benchmark.

  • Limbic Benchmark – the Limbic Eval Tool Use MCP Benchmark.
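A minimal sketch of this label-to-binary mapping and the resulting binary accuracy, assuming gold and predicted labels are available as plain strings (function and variable names are illustrative):

# Illustrative binary-accuracy computation under the label mappings above.
CORRECT_LABELS = {
    "VALID_CALL",  # Qualifire TUQ Benchmark
    "correct",     # Limbic Benchmark
}

def to_binary(label: str) -> bool:
    """Map a benchmark label to Correct (True) / Incorrect (False)."""
    return label in CORRECT_LABELS

def binary_accuracy(gold_labels, predicted_labels) -> float:
    """Fraction of examples where the predicted binary outcome matches the gold one."""
    pairs = list(zip(gold_labels, predicted_labels))
    hits = sum(to_binary(g) == to_binary(p) for g, p in pairs)
    return hits / len(pairs)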


πŸ“œ Evaluation Prompt Template

The model uses the following structured evaluation process:

  1. TOOL SELECTION

    • Check if the tool name exists in available_tools
    • Check if the tool purpose matches the user intent
    • Fail → TOOL_ERROR ❌
  2. PARAMETER STRUCTURE

    • All required parameters are present
    • No extra parameters
    • Parameter names exactly match the schema
    • Fail → PARAM_NAME_ERROR ❌
  3. PARAMETER VALUES

    • Values have correct data types
    • Values match the user request
    • No fabricated or incorrect values
    • Fail → PARAM_VALUE_ERROR ❌

If all checks pass → VALID_CALL ✅
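The same cascade can be written as a short decision function. A minimal sketch, assuming the three checks have already been computed as booleans from your own schema validation (the helper is illustrative, not the model's internal logic):

# Illustrative decision cascade mirroring the evaluation steps above.
def classify_call(tool_ok: bool, names_ok: bool, values_ok: bool) -> str:
    if not tool_ok:
        return "TOOL_ERROR"         # step 1: wrong or unknown tool
    if not names_ok:
        return "PARAM_NAME_ERROR"   # step 2: missing, extra, or misnamed parameters
    if not values_ok:
        return "PARAM_VALUE_ERROR"  # step 3: wrong types or values
    return "VALID_CALL"             # all checks passed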


πŸ“¦ Requirements

  • transformers>=4.51.0
  • huggingface_hub
  • torch
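A quick way to confirm the environment satisfies these requirements (a minimal sketch; only the transformers version floor comes from the list above, the other packages are unpinned):

# Sanity-check the installed dependencies against the requirements above.
from packaging import version
import transformers, torch, huggingface_hub

assert version.parse(transformers.__version__) >= version.parse("4.51.0"), \
    "transformers>=4.51.0 is required"
print("transformers", transformers.__version__)
print("torch", torch.__version__)
print("huggingface_hub", huggingface_hub.__version__)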

πŸ’» Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
from huggingface_hub import hf_hub_download

# Model name
model_name = "qualifire/mcp-tool-use-quality-ranger-0.6b"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Load prompt template
file_path = hf_hub_download(repo_id=model_name, filename="tsq_prompt_template.txt")
with open(file_path, encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()

# Example inputs
example_tools_list = '''[
  {
    "name": "order_food",
    "description": "Order food from a restaurant.\nArgs:\nrestaurant_url: URL of the restaurant\nitem_name: Name of the item to order",
    "inputSchema": {
      "type": "object",
      "title": "order_foodArguments",
      "required": ["item_url", "item_name"],
      "properties": {
        "item_url": {
          "type": "string",
          "title": "Item Url"
        },
        "item_name": {
          "type": "string",
          "title": "Item Name"
        }
      }
    }
  }
]'''


example_message_history = '''[
  {
    "role": "user",
    "content": "Could you please order 2 Margherita pizzas for delivery to 123 Main Street, Anytown?"
  },
  {
    "completion_message": {
      "content": {
        "type": "text",
        "text": ""
      },
      "role": "assistant",
      "stop_reason": "tool_calls",
      "tool_calls": [
        {
          "id": "call_p8yj1p",
          "function": {
            "name": "order_food",
            "arguments": {
              "item": "Margherita Pizza",
              "quantity": 3, 
              "delivery_address": "123 Main Street, Anytown"
            }
          }
        }
      ]
    }
  }
]'''

# Format input
example_input = PROMPT_TEMPLATE.format(
    message_history=example_message_history,
    available_tools=example_tools_list
)

# Get prediction
result = pipe(example_input)[0]
print(result)

✨ Example Output

{'label': 'PARAM_VALUE_ERROR', 'score': 0.8680815696716309}

The value for quantity should be 2, not 3. Therefore, the correct label is: PARAM_VALUE_ERROR.
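To inspect the scores the model assigns to all four labels rather than only the top prediction, the pipeline's top_k argument can be used (a small sketch; the exact output shape may vary by transformers version):

# Return scores for every label instead of only the highest-scoring one.
all_scores = pipe(example_input, top_k=None)
print(all_scores)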
