## 🧠 Model Description

The mcp-tool-use-quality-ranger-0.6b is a fine-tuned sequence classification model created to evaluate the quality of function calls in conversational AI systems. It is designed for evaluating function calls in the context of Model Context Protocol (MCP) tools: it can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

Max Context Length: 32,768 tokens

It determines whether a given function call:
- Selects the correct tool
- Has correct parameter names and structure
- Contains correct parameter values

It produces one of four possible classification labels:

| Label | Meaning |
|---|---|
| VALID_CALL | ✅ The tool name, parameters, and values are all correct, or no suitable tool exists and no function call is made. |
| TOOL_ERROR | ❌ The tool name does not exist or does not match the user intent. |
| PARAM_NAME_ERROR | ❌ The correct tool is used, but parameter names are missing, extra, or incorrect. |
| PARAM_VALUE_ERROR | ❌ Tool and parameter names are correct, but parameter values are wrong or incorrectly formatted. |
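
These label names are exposed through the model config's `id2label` mapping, as is standard for transformers sequence classifiers, so downstream code can branch on them. A quick check; the exact index order shown in the comment is an assumption:

```python
from transformers import AutoConfig

# Inspect the label set shipped with the model.
config = AutoConfig.from_pretrained("qualifire/mcp-tool-use-quality-ranger-0.6b")
print(config.id2label)
# Expected to contain the four labels above; the index order is not
# guaranteed, e.g. {0: 'VALID_CALL', 1: 'TOOL_ERROR', ...}
```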
## 💽 Quantized Version
## 🏆 Benchmark Evaluation

The mcp-tool-use-quality-ranger-0.6b was evaluated in a binary classification setting, where a prediction counts as Correct if the model's evaluation of the function call matched the gold label, and Incorrect otherwise.
| Model | #Params | Avg. Latency (sec) | Avg. Binary Accuracy | Qualifire mcp-tool-use-quality Benchmark Binary Accuracy | Limbic Benchmark Binary Accuracy |
|---|---|---|---|---|---|
| qualifire/mcp-tool-use-quality-ranger-4b [private] | 4B | 0.30 | 0.962 | 0.971 | 0.954 |
| qualifire/mcp-tool-use-quality-ranger-0.6b | 0.6B | 0.09 | 0.928 | 0.949 | 0.907 |
| gemini-2.5-flash | - | 4.87 | 0.858 | 0.871 | 0.845 |
| quotientai/limbic-tool-use-0.5B-32K | 0.5B | 0.79 | 0.798 | 0.708 | 0.887 |
## 📊 Metrics Definitions

Avg. Binary Accuracy: mean accuracy across all evaluated benchmarks, where predictions are mapped to binary outcomes as follows:

**Qualifire TUQ Benchmark**
- Correct → VALID_CALL
- Incorrect → TOOL_ERROR, PARAM_NAME_ERROR, or PARAM_VALUE_ERROR

**Limbic Benchmark**
- Correct → correct
- Incorrect → incorrect_tool, incorrect_parameter_names, or incorrect_parameter_values

Qualifire TUQ Benchmark refers to the Qualifire Tool Selection Quality Benchmark; Limbic Benchmark refers to the Limbic Eval Tool Use MCP Benchmark.
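
For concreteness, the mapping can be written out in a few lines of Python. This is a minimal sketch of the scoring logic described above, not the published evaluation harness, and all names in it are ours:

```python
# Minimal sketch of the binary mapping above (hypothetical helper names).
CORRECT_LABELS = {
    "qualifire_tuq": {"VALID_CALL"},
    "limbic": {"correct"},
}

def to_binary(label: str, benchmark: str) -> bool:
    # True = "Correct", False = "Incorrect" under the mapping above.
    return label in CORRECT_LABELS[benchmark]

def binary_accuracy(predictions, gold_labels, benchmark: str) -> float:
    # Fraction of examples where the predicted binary outcome matches
    # the gold binary outcome.
    pairs = list(zip(predictions, gold_labels))
    hits = sum(to_binary(p, benchmark) == to_binary(g, benchmark) for p, g in pairs)
    return hits / len(pairs)
```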
## 📄 Evaluation Prompt Template

The model uses the following structured evaluation process (a rule-based sketch of the same rubric follows below):

1. TOOL SELECTION
   - Check if the tool name exists in available_tools
   - Check if the tool purpose matches the user intent
   - Fail → TOOL_ERROR
2. PARAMETER STRUCTURE
   - All required parameters are present
   - No extra parameters
   - Parameter names exactly match the schema
   - Fail → PARAM_NAME_ERROR
3. PARAMETER VALUES
   - Values have correct data types
   - Values match the user request
   - No fabricated or incorrect values
   - Fail → PARAM_VALUE_ERROR

If all checks pass → VALID_CALL ✅
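
To make the rubric concrete, here is a rule-based sketch of the three checks. This is not the model's internals (the ranger is a learned classifier), every helper name below is hypothetical, and the judgment-based parts of the rubric (intent matching, value/request agreement) are deliberately left out:

```python
# Illustrative, rule-based version of the rubric above. The actual model
# is a learned classifier; this sketch only mirrors the labeling rules.
# All names here are hypothetical.

def evaluate_call(call: dict, available_tools: dict) -> str:
    # 1. TOOL SELECTION: the tool must exist (intent matching omitted).
    tool = available_tools.get(call["name"])
    if tool is None:
        return "TOOL_ERROR"

    schema = tool["inputSchema"]["properties"]
    required = set(tool["inputSchema"].get("required", []))
    args = call["arguments"]

    # 2. PARAMETER STRUCTURE: required names present, no extra or unknown names.
    if not required <= set(args) or not set(args) <= set(schema):
        return "PARAM_NAME_ERROR"

    # 3. PARAMETER VALUES: values must match the declared JSON Schema types
    #    (agreement with the user request needs judgment and is omitted).
    json_types = {"string": str, "integer": int, "number": (int, float), "boolean": bool}
    for name, value in args.items():
        expected = json_types.get(schema[name].get("type"), object)
        if not isinstance(value, expected):
            return "PARAM_VALUE_ERROR"

    return "VALID_CALL"
```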
## 📦 Requirements

```
transformers>=4.51.0
huggingface_hub
torch
```
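
These can be installed with `pip install "transformers>=4.51.0" huggingface_hub torch`.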
## 💻 Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
from huggingface_hub import hf_hub_download

# Model name
model_name = "qualifire/mcp-tool-use-quality-ranger-0.6b"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Load prompt template
file_path = hf_hub_download(repo_id=model_name, filename="tsq_prompt_template.txt")
with open(file_path, encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()

# Example inputs: one available tool, and a call that gets the quantity wrong
example_tools_list = '''[
  {
    "name": "order_food",
    "description": "Order food from a restaurant.\nArgs:\nrestaurant_url: URL of the restaurant\nitem_name: Name of the item to order",
    "inputSchema": {
      "type": "object",
      "title": "order_foodArguments",
      "required": ["item_url", "item_name"],
      "properties": {
        "item_url": {
          "type": "string",
          "title": "Item Url"
        },
        "item_name": {
          "type": "string",
          "title": "Item Name"
        }
      }
    }
  }
]'''

example_message_history = '''[
  {
    "role": "user",
    "content": "Could you please order 2 Margherita pizzas for delivery to 123 Main Street, Anytown?"
  },
  {
    "completion_message": {
      "content": {
        "type": "text",
        "text": ""
      },
      "role": "assistant",
      "stop_reason": "tool_calls",
      "tool_calls": [
        {
          "id": "call_p8yj1p",
          "function": {
            "name": "order_food",
            "arguments": {
              "item": "Margherita Pizza",
              "quantity": 3,
              "delivery_address": "123 Main Street, Anytown"
            }
          }
        }
      ]
    }
  }
]'''

# Format input
example_input = PROMPT_TEMPLATE.format(
    message_history=example_message_history,
    available_tools=example_tools_list
)

# Get prediction
result = pipe(example_input)[0]
print(result)
```
## ✨ Example Output

```
{'label': 'PARAM_VALUE_ERROR', 'score': 0.8680815696716309}
```

The value for quantity should be 2, not 3. Therefore, the correct label is PARAM_VALUE_ERROR.
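
If you want scores for all four labels rather than only the top one, the text-classification pipeline accepts `top_k=None`; the values in the comment below are illustrative only:

```python
# Return the score for every label instead of only the top one.
all_scores = pipe(example_input, top_k=None)
print(all_scores)
# Illustrative output shape (actual values will differ):
# [{'label': 'PARAM_VALUE_ERROR', 'score': 0.87},
#  {'label': 'PARAM_NAME_ERROR', 'score': 0.09}, ...]
```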