Llama 3.2 1B E-commerce Intent (GPTQ 4-bit)

This is a fine-tuned version of meta-llama/Llama-3.2-1B, trained specifically for e-commerce intent detection. Given a catalog of products and a user's request, it outputs a structured JSON object describing the user's intent (add or remove), the product name, and the quantity.
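
For example, a request like "Add two bottles of shampoo to my cart" would produce a JSON object of the following shape (the exact field values here are illustrative; see the full inference example below for real output):

{"action": "add", "product": "Shampoo (400ml bottle)", "quantity": 2}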

This version of the model is quantized to 4-bit with GPTQ, making inference efficient in both memory usage and speed. The QLoRA adapter has already been merged into the final GPTQ model, so no separate adapter loading is required.
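
For reference, a merged fp16 model can be GPTQ-quantized to 4-bit with the transformers GPTQConfig API (this requires optimum and auto-gptq, as noted below). The local path and the "c4" calibration dataset in this sketch are assumptions for illustration, not a record of how this checkpoint was produced:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Hypothetical path to the fp16 model with the adapter already merged
merged_model_path = "path/to/merged-fp16-model"

tokenizer = AutoTokenizer.from_pretrained(merged_model_path)

# 4-bit GPTQ quantization; "c4" is a common calibration dataset (assumed here)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(
    merged_model_path,
    quantization_config=gptq_config,
    device_map="auto",
)

# Save the quantized model and tokenizer
quantized_model.save_pretrained("llama_3.2-1b-ecommerce-intent-gptq-4bit")
tokenizer.save_pretrained("llama_3.2-1b-ecommerce-intent-gptq-4bit")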

Model Description

The base model, Llama 3.2 1B, was fine-tuned using the QLoRA method on a synthetic dataset of 3000 examples. The training objective was to teach the model to ignore conversational pleasantries and strictly output a JSON object that can be directly parsed by a backend system for managing a shopping cart.

Dataset

The model was fine-tuned on a custom synthetic dataset of 3000 examples.

You can access the dataset here: jtlicardo/ecommerce-intent-3k
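
The dataset can be inspected with the datasets library (assuming the standard train split); it contains prompt/completion pairs, as described under Training Procedure below:

from datasets import load_dataset

# Load the synthetic intent-detection dataset from the Hub
dataset = load_dataset("jtlicardo/ecommerce-intent-3k", split="train")

print(dataset)      # number of rows and column names
print(dataset[0])   # first prompt/completion pair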

Intended Use & Limitations

This model is designed for a specific task: parsing user requests in an e-commerce context. It should not be used as a general-purpose chatbot.

  • Primary Use: Backend service for intent detection from user text.
  • Out-of-Scope: General conversation, answering questions, or any task not related to adding/removing items from a list.

How to Use

The model expects a prompt formatted in a specific way, following the TinyLlama-Chat template. You must provide the Catalog and the User request.

Important: You need to install optimum and auto-gptq to run this 4-bit GPTQ model.

pip install -q optimum auto-gptq transformers

Here's how to run inference in Python:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Model repository on the Hugging Face Hub
model_id = "jtlicardo/llama_3.2-1b-ecommerce-intent-gptq-4bit"

# Load the tokenizer and the 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16
)

# --- Define the prompt ---
catalog = """Catalog:
Shampoo (400ml bottle)
Hand Soap (250ml dispenser)
Peanut Butter (340g jar)
Headphones
Green Tea (25 tea bags)"""

user_query = "Could you please take off 4 pairs of headphons from my cart?"

# --- Format the prompt using the model's chat template ---
# The model was trained to see this structure.
prompt = f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"

# --- Generate the output ---
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = pipe(
    prompt,
    max_new_tokens=50,       # Max length of the JSON output
    do_sample=False,         # Use deterministic output
    temperature=None,        # Not needed for do_sample=False
    top_p=None,              # Not needed for do_sample=False
    return_full_text=False   # Only return the generated part
)

# The output will be a clean JSON string
generated_json = outputs[0]['generated_text'].strip()
print(generated_json)
# Expected output:
# {"action": "remove", "product": "Headphones", "quantity": 4}

Training Procedure

This model was fine-tuned using the trl library's SFTTrainer.

  • Method: QLoRA (4-bit quantization with LoRA adapters)
  • Dataset: A custom JSONL file with 3000 prompt/completion pairs.
  • Configuration: completion_only_loss=True was used to ensure the model only learned to generate the assistant's JSON response.
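
A minimal sketch of this setup with trl and peft follows. The LoRA rank, epochs, batch size, and other hyperparameters are illustrative assumptions, not the exact values used to train this model:

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base_model_id = "meta-llama/Llama-3.2-1B"

# 4-bit NF4 quantization of the base model for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# LoRA adapter configuration (illustrative values)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Prompt/completion pairs
dataset = load_dataset("jtlicardo/ecommerce-intent-3k", split="train")

training_args = SFTConfig(
    output_dir="llama-3.2-1b-ecommerce-intent",
    completion_only_loss=True,       # loss computed on the assistant's JSON only
    num_train_epochs=3,              # assumed value
    per_device_train_batch_size=4,   # assumed value
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
)
trainer.train()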