Model Card for LLaMA-3.2-3B Tool Caller
This model (LoRA adapter) is a fine-tuned version of LLaMA-3.2-3B that specializes in tool calling capabilities. It has been trained to decide when to use one of two available tools, `search_documents` or `check_and_connect`, based on user queries, responding with properly formatted JSON function calls.
Model Details
Model Description
This model is a Parameter-Efficient Fine-Tuning (PEFT) adaptation of LLaMA-3.2-3B focused on tool use. It employs Low-Rank Adaptation (LoRA) to efficiently fine-tune the base model for function calling capabilities.
- Developed by: Uness.fr
- Model type: Fine-tuned LLM (LoRA)
- Language(s) (NLP): English
- License: Llama 3.2 Community License (same as the base model)
- Finetuned from model: unsloth/Llama-3.2-3B-Instruct (4-bit quantized version)
Model Sources
- Repository: https://huggingface.co/asanchez75/Llama-3.2-3B-tool-search
- Base model: https://huggingface.co/unsloth/Llama-3.2-3B-Instruct
- Training dataset: https://huggingface.co/datasets/asanchez75/tool_finetuning_dataset
Uses
Direct Use
This model is designed to be used as an AI assistant that can intelligently determine when to call external tools. It specializes in two specific functions:
- `search_documents`: Triggered when users ask for medical information (prefixed with "Search information about")
- `check_and_connect`: Triggered when users ask about system status or connectivity
The model outputs properly formatted JSON function calls that can be parsed by downstream applications to execute the appropriate tools.
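As an illustration of that downstream parsing step, the sketch below decodes the JSON function call and dispatches to the matching tool; the `search_documents` and `check_and_connect` implementations here are hypothetical stand-ins for real backends.

```python
import json

# Hypothetical tool implementations; replace with real backends.
def search_documents(query: str) -> str:
    return f"<documents matching: {query}>"

def check_and_connect() -> str:
    return "connected"

TOOLS = {"search_documents": search_documents, "check_and_connect": check_and_connect}

def dispatch(model_output: str):
    """Parse the model's JSON function call and invoke the matching tool."""
    call = json.loads(model_output)  # e.g. {"name": "search_documents", "parameters": {"query": "..."}}
    return TOOLS[call["name"]](**call.get("parameters", {}))

print(dispatch('{"name": "check_and_connect", "parameters": {}}'))
```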
Downstream Use
This model can be integrated into:
- AI assistants that need to understand when to delegate tasks to external tools
Out-of-Scope Use
This model should not be used for:
- General text generation without tool calling
- Tasks requiring more than the two trained tools
- Critical systems where reliability is essential without human oversight
- Applications requiring factual accuracy guarantees
Bias, Risks, and Limitations
- The model inherits biases from the base LLaMA-3.2-3B model
- Performance depends on how similar user queries are to the training data format
- There's a strong dependency on the specific prefixing pattern used in training ("Search information about")
Recommendations
Users (both direct and downstream) should:
- Follow the same prompting patterns used in training for optimal results
- Include the "Search information about" prefix for queries intended for the search_documents tool
- Be aware that the model expects a specific system prompt format
- Test thoroughly before deployment in production environments
- Consider implementing fallback mechanisms for unrecognized query types (a sketch follows this list)
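As a sketch of that last recommendation, the wrapper below validates the model output and signals a fallback when the JSON is malformed or names an unknown tool; the function names and fallback behavior are illustrative assumptions, not part of the released code.

```python
import json

KNOWN_TOOLS = {"search_documents", "check_and_connect"}

def safe_tool_call(model_output: str):
    """Return a validated tool call dict, or None so the caller can fall back (e.g. answer without tools)."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # malformed JSON: trigger fallback
    if call.get("name") not in KNOWN_TOOLS:
        return None  # unknown tool: trigger fallback
    return call

call = safe_tool_call('{"name": "search_documents", "parameters": {"query": "hypertension treatment"}}')
if call is None:
    print("Falling back to a direct answer without tools.")
else:
    print(f"Dispatching to {call['name']} with {call.get('parameters', {})}")
```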
How to Get Started with the Model
Use the code below to get started with the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the LoRA adapter and tokenizer (requires `peft` so transformers can resolve the adapter weights)
model_path = "asanchez75/Llama-3.2-3B-tool-search"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # requires `accelerate`; remove to load on CPU
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Define the prompting format (must match training)
SYSTEM_PROMPT = """Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 18 May 2025"""
USER_INSTRUCTION_HEADER = """Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.
Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.
{ "type": "function", "function": { "name": "check_and_connect", "description": "check_and_connect", "parameters": { "properties": {}, "type": "object" } } }
{ "type": "function", "function": { "name": "search_documents", "description": "\n Searches for documents based on a user's query string. Use this to find information on a specific topic.\n\n ", "parameters": { "properties": { "query": { "description": "The actual search phrase or question. For example, 'What are the causes of climate change?' or 'population of Madre de Dios'.", "type": "string" } }, "required": [ "query" ], "type": "object" } } }
"""
# Example 1: Information query (add the prefix)
user_query = "What is the capital of France?"
formatted_query = f"Search information about {user_query}"  # Prefix routes the query to search_documents
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"{USER_INSTRUCTION_HEADER}{formatted_query}"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=128)
response = tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True)
print(response)
# Example 2: System status query (no prefix needed)
status_query = "Are we connected?"
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"{USER_INSTRUCTION_HEADER}{status_query}"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```
Training Details
Training Data
The model was trained on a custom dataset with 1,050 examples from asanchez75/tool_finetuning_dataset:
- 1,000 examples derived from the "maximedb/natural_questions" dataset, modified with the "Search information about" prefix
- 50 examples of system status queries for the "check_and_connect" tool
The dataset was created in JSONL format with each entry having a complete conversation structure including system, user, and assistant messages.
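For illustration, one entry in that JSONL file has roughly the structure below; the field names and content are assumptions for this sketch, so consult the dataset repository for the authoritative schema.

```python
import json

# Illustrative training entry (structure assumed, not copied from the released dataset).
example_entry = {
    "messages": [
        {"role": "system", "content": "Environment: ipython ..."},
        {"role": "user", "content": "Search information about causes of hypertension"},
        {
            "role": "assistant",
            "content": '{"name": "search_documents", "parameters": {"query": "causes of hypertension"}}',
        },
    ]
}
print(json.dumps(example_entry))  # one JSON object per line in the JSONL file
```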
Training Procedure
The model was fine-tuned using Unsloth's optimized implementation of LoRA over a 4-bit quantized version of LLaMA-3.2-3B-Instruct (a configuration sketch follows the hyperparameter list below).
Training Hyperparameters
- Training regime: 4-bit quantization with LoRA
- LoRA rank: 16
- LoRA alpha: 16
- LoRA dropout: 0
- Target modules: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"
- Learning rate: 2e-4
- Batch size: 2 per device
- Gradient accumulation steps: 4
- Warmup steps: 5
- Number of epochs: 3
- Optimizer: adamw_8bit
- Weight decay: 0.01
- LR scheduler: linear
- Max sequence length: 2048
- Packing: False
- Random seed: 3407
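The hyperparameters listed above correspond roughly to the Unsloth/TRL setup sketched below. This is an illustrative reconstruction, not the released training script: the dataset loading, field names, and exact TRL/Unsloth API (which vary across versions) are assumptions.

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Placeholder dataset: the real run used the 1,050-example JSONL dataset, formatted into chat text.
dataset = Dataset.from_list([{"text": "<formatted conversation>"}])

# Load the 4-bit quantized base model.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach the LoRA adapter with the reported configuration.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    random_state=3407,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumed field name
    max_seq_length=2048,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
trainer.train()
```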
Speeds, Sizes, Times
- Training hardware: [GPU type, e.g., NVIDIA A100, etc.]
- Training time: [Approximately X minutes based on training code output]
- Model size: Base model is 3B parameters; LoRA adapter is significantly smaller
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on sample inference examples from both categories:
- Information queries with "Search information about" prefix
- System status queries
Metrics
- Accuracy: Measured by whether the model correctly selects the appropriate tool for the query type
- Format correctness: Whether the JSON output is properly formatted and parsable
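A minimal harness for these two metrics could look like the sketch below; the test queries, expected tools, and the `generate_fn` callable are illustrative assumptions, not the actual evaluation set.

```python
import json

# Illustrative (query, expected_tool) pairs; not the actual evaluation examples.
TEST_CASES = [
    ("Search information about symptoms of diabetes", "search_documents"),
    ("Is the system online?", "check_and_connect"),
]

def evaluate(generate_fn):
    """generate_fn(query: str) -> str returns the raw model output for a query."""
    correct, parsable = 0, 0
    for query, expected in TEST_CASES:
        output = generate_fn(query)
        try:
            call = json.loads(output)
        except json.JSONDecodeError:
            continue  # counts against format correctness
        parsable += 1
        correct += int(call.get("name") == expected)
    n = len(TEST_CASES)
    print(f"Tool accuracy: {correct}/{n}  |  Format correctness: {parsable}/{n}")
```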
Results
Qualitative evaluation showed the model successfully distinguishes between:
- Queries that should trigger the `search_documents` tool (when prefixed appropriately)
- Queries that should trigger the `check_and_connect` tool
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [GPU model]
- Hours used: [Estimated from training time]
- Cloud Provider: [If applicable]
- Compute Region: [If applicable]
- Carbon Emitted: [Estimate if available]
Technical Specifications
Model Architecture and Objective
- Base architecture: LLaMA-3.2-3B
- Adaptation method: LoRA fine-tuning
- Objective: Train the model to output properly formatted JSON function calls based on input query type
Compute Infrastructure
Hardware
- The model was trained using CUDA-compatible GPU(s)
- Memory usage metrics are reported in the training script
Software
- Unsloth: Library for fast fine-tuning of LLaMA-family models
- PyTorch: Deep learning framework
- Transformers: Hugging Face's transformers library
- PEFT: Parameter-Efficient Fine-Tuning library
- TRL: Transformer Reinforcement Learning library
Framework versions
- PEFT 0.15.2
- Transformers [version]
- PyTorch [version]
- Unsloth [version]
Model Card Contact
[Your contact information]