🧠 Model Description

The mcp-tool-use-quality-ranger-0.6b is a sequence classification model, fine-tuned from Qwen/Qwen3-0.6B, that evaluates the quality of function calls made by conversational AI systems against Model Context Protocol (MCP) tools.
It assesses whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

Max Context Length: 32,768 Tokens

It determines whether a given function call:

  • Selects the correct tool
  • Has correct parameter names and structure
  • Contains correct parameter values

It produces one of four possible classification labels:

| Label | Meaning |
|---|---|
| VALID_CALL ✅ | The tool name, parameters, and values are all correct, or no suitable tool exists and no function call is made. |
| TOOL_ERROR ❌ | The tool name does not exist or does not match the user intent. |
| PARAM_NAME_ERROR ❌ | The correct tool is used, but parameter names are missing, extra, or incorrect. |
| PARAM_VALUE_ERROR ❌ | Tool and parameter names are correct, but parameter values are wrong or incorrectly formatted. |
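In an agent loop, these labels are typically used to gate whether a proposed tool call is actually executed. A minimal sketch of such gating, assuming the predicted label string is already available (the callback names here are illustrative, not part of the model's API):

# Illustrative gating: run the tool call only when the ranker judges it VALID_CALL;
# otherwise hand the error label back to the agent so it can repair the call.
def gate_tool_call(label: str, execute_call, repair_call):
    if label == "VALID_CALL":
        return execute_call()
    # TOOL_ERROR / PARAM_NAME_ERROR / PARAM_VALUE_ERROR
    return repair_call(reason=label)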

πŸ”½ Quantized Version


πŸ“Š Benchmark Evaluation

The mcp-tool-use-quality-ranger-0.6b was evaluated in a binary classification setting, where a prediction counts as Correct if the model's function-call evaluation matches the gold label, and Incorrect otherwise.

| Model | #Params | Avg. Latency (sec) | Avg. Binary Accuracy | Qualifire mcp-tool-use-quality Benchmark (Binary Accuracy) | Limbic Benchmark (Binary Accuracy) |
|---|---|---|---|---|---|
| qualifire/mcp-tool-use-quality-ranger-4b [private] | 4B | 0.30 | 0.962 | 0.971 | 0.954 |
| qualifire/mcp-tool-use-quality-ranger-0.6b | 0.6B | 0.09 | 0.928 | 0.949 | 0.907 |
| gemini-2.5-flash | - | 4.87 | 0.858 | 0.871 | 0.845 |
| quotientai/limbic-tool-use-0.5B-32K | 0.5B | 0.79 | 0.798 | 0.708 | 0.887 |

πŸ“Œ Metrics Definitions

  • Avg. Binary Accuracy – Mean accuracy across all evaluated benchmarks, where predictions are mapped to binary outcomes as follows (a code sketch of this mapping follows the list):

    • Qualifire TUQ Benchmark

      • Correct → VALID_CALL
      • Incorrect → TOOL_ERROR, PARAM_NAME_ERROR, or PARAM_VALUE_ERROR
    • Limbic Benchmark

      • Correct → correct
      • Incorrect → incorrect_tool, incorrect_parameter_names, or incorrect_parameter_values
  • Qualifire TUQ Benchmark – the Qualifire Tool Selection Quality Benchmark.

  • Limbic Benchmark – the Limbic Eval Tool Use MCP Benchmark.
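A minimal sketch of this label-to-binary mapping and the resulting binary accuracy, assuming gold and predicted labels are available as plain strings (function and variable names are illustrative):

# Illustrative binary-accuracy computation under the label mappings above.
CORRECT_LABELS = {
    "VALID_CALL",  # Qualifire TUQ Benchmark
    "correct",     # Limbic Benchmark
}

def to_binary(label: str) -> bool:
    """Map a benchmark label to Correct (True) / Incorrect (False)."""
    return label in CORRECT_LABELS

def binary_accuracy(gold_labels, predicted_labels) -> float:
    """Fraction of examples where the predicted binary outcome matches the gold one."""
    pairs = list(zip(gold_labels, predicted_labels))
    hits = sum(to_binary(g) == to_binary(p) for g, p in pairs)
    return hits / len(pairs)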


πŸ“œ Evaluation Prompt Template

The model uses the following structured evaluation process:

  1. TOOL SELECTION

    • Check if the tool name exists in available_tools
    • Check if the tool purpose matches the user intent
    • Fail → TOOL_ERROR ❌
  2. PARAMETER STRUCTURE

    • All required parameters are present
    • No extra parameters
    • Parameter names exactly match the schema
    • Fail → PARAM_NAME_ERROR ❌
  3. PARAMETER VALUES

    • Values have correct data types
    • Values match the user request
    • No fabricated or incorrect values
    • Fail → PARAM_VALUE_ERROR ❌

If all checks pass → VALID_CALL ✅
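The same cascade can be written as a short decision function. A minimal sketch, assuming the three checks have already been computed as booleans from your own schema validation (the helper is illustrative, not the model's internal logic):

# Illustrative decision cascade mirroring the evaluation steps above.
def classify_call(tool_ok: bool, names_ok: bool, values_ok: bool) -> str:
    if not tool_ok:
        return "TOOL_ERROR"         # step 1: wrong or unknown tool
    if not names_ok:
        return "PARAM_NAME_ERROR"   # step 2: missing, extra, or misnamed parameters
    if not values_ok:
        return "PARAM_VALUE_ERROR"  # step 3: wrong types or values
    return "VALID_CALL"             # all checks passed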


πŸ“¦ Requirements

  • transformers>=4.51.0
  • huggingface_hub
  • torch
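A quick way to confirm the environment satisfies these requirements (a minimal sketch; only the transformers version floor comes from the list above, the other packages are unpinned):

# Sanity-check the installed dependencies against the requirements above.
from packaging import version
import transformers, torch, huggingface_hub

assert version.parse(transformers.__version__) >= version.parse("4.51.0"), \
    "transformers>=4.51.0 is required"
print("transformers", transformers.__version__)
print("torch", torch.__version__)
print("huggingface_hub", huggingface_hub.__version__)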

πŸ’» Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
from huggingface_hub import hf_hub_download

# Model name
model_name = "qualifire/mcp-tool-use-quality-ranger-0.6b"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Load prompt template
file_path = hf_hub_download(repo_id=model_name, filename="tsq_prompt_template.txt")
with open(file_path, encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()

# Example inputs
example_tools_list = '''[
  {
    "name": "order_food",
    "description": "Order food from a restaurant.\nArgs:\nrestaurant_url: URL of the restaurant\nitem_name: Name of the item to order",
    "inputSchema": {
      "type": "object",
      "title": "order_foodArguments",
      "required": ["item_url", "item_name"],
      "properties": {
        "item_url": {
          "type": "string",
          "title": "Item Url"
        },
        "item_name": {
          "type": "string",
          "title": "Item Name"
        }
      }
    }
  }
]'''


example_message_history = '''[
  {
    "role": "user",
    "content": "Could you please order 2 Margherita pizzas for delivery to 123 Main Street, Anytown?"
  },
  {
    "completion_message": {
      "content": {
        "type": "text",
        "text": ""
      },
      "role": "assistant",
      "stop_reason": "tool_calls",
      "tool_calls": [
        {
          "id": "call_p8yj1p",
          "function": {
            "name": "order_food",
            "arguments": {
              "item": "Margherita Pizza",
              "quantity": 3, 
              "delivery_address": "123 Main Street, Anytown"
            }
          }
        }
      ]
    }
  }
]'''

# Format input
example_input = PROMPT_TEMPLATE.format(
    message_history=example_message_history,
    available_tools=example_tools_list
)

# Get prediction
result = pipe(example_input)[0]
print(result)

✨ Example Output

{'label': 'PARAM_VALUE_ERROR', 'score': 0.8680815696716309}

The value for quantity should be 2, not 3. Therefore, the correct label is: PARAM_VALUE_ERROR.
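To inspect the scores the model assigns to all four labels rather than only the top prediction, the pipeline's top_k argument can be used (a small sketch; the exact output shape may vary by transformers version):

# Return scores for every label instead of only the highest-scoring one.
all_scores = pipe(example_input, top_k=None)
print(all_scores)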
