Model Information

A lightweight Llama-based model with a conversational tone of voice. Based on Llama-3.2-1B-Instruct, it was finetuned by the team at Restack to respond in a natural, conversational tone suitable for voice interactions.

The model is compatible with the Ultravox speech-to-text model. You can replace the Llama-3.2-1B-Instruct backbone of ultravox-v0_5-llama-3_2-1b with this model to obtain a speech-to-text model that responds in a conversational tone.
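As an illustration, a minimal sketch of that swap is below. It assumes the Ultravox checkpoint is loaded through the transformers pipeline with trust_remote_code=True (as shown on the Ultravox model card) and that the Ultravox wrapper exposes its Llama backbone as a language_model attribute; that attribute name is an assumption about the Ultravox implementation, not something guaranteed by this model.

import transformers

# Load the Ultravox speech model (custom code from the fixie-ai repository).
pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_5-llama-3_2-1b",
    trust_remote_code=True,
)

# Assumption: the Ultravox wrapper exposes its Llama backbone as `language_model`.
# Swap in the conversational finetune, leaving the audio encoder untouched.
pipe.model.language_model = transformers.AutoModelForCausalLM.from_pretrained(
    "restack/conversational-v1.1-Llama-3.2-1B-Instruct",
    torch_dtype=pipe.model.dtype,
)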

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --------------------------------------------------------------------------------------
# *** Settings

model_id = "restack/conversational-v1.1-Llama-3.2-1B-Instruct"

system_prompt = "You are a helpful assistant. You answer questions in a natural, conversational tone, like in a spoken conversation."

user_prompt = "In the context of machine learning, what is regularization?"

eot_token = "<|eot_id|>"
pad_token = "<|finetune_right_pad_id|>"
assistant_header = "<|start_header_id|>assistant<|end_header_id|>\n\n"

prompt_template = {
    "system": f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nPLACEHOLDER_SYSTEM_PROMPT{eot_token}",
    "user": f"<|start_header_id|>user<|end_header_id|>\n\nPLACEHOLDER_QUESTION{eot_token}",
    "assistant": f"{assistant_header}PLACEHOLDER_ANSWER{eot_token}",
}

torch_dtype = torch.bfloat16
max_length = 1024
temperature = 0.5

# --------------------------------------------------------------------------------------
# *** Load model

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# --------------------------------------------------------------------------------------
# *** Inference

pad_token_id = tokenizer.encode(pad_token, add_special_tokens=False)[0]

# Combine system prompt, user prompt, and assistant header.
messages = (
    prompt_template["system"].replace("PLACEHOLDER_SYSTEM_PROMPT", system_prompt)
    + prompt_template["user"].replace("PLACEHOLDER_QUESTION", user_prompt)
    + assistant_header
)

tokens = tokenizer(
    messages,
    add_special_tokens=False,
    return_tensors="pt",
)

input_ids = tokens["input_ids"].to(model.device)
attention_mask = tokens["attention_mask"].to(model.device)

model.eval()

generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=max_length,
    pad_token_id=pad_token_id,
    do_sample=True,  # sampling must be enabled for `temperature` to take effect
    temperature=temperature,
)

generated_text = tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=False,
)[0]

# Remove the prompt and special tokens, leaving only the assistant answer.
assistant_response = (
    generated_text.split(assistant_header)[-1].replace(eot_token, "").strip()
)

print(assistant_response)
# Model prediction:
# So, regularization is basically a way to prevent overfitting in machine learning.
# Think of it like this: if you're trying to fit a model to a bunch of data, it's easy
# to get it to fit the noise in the data instead of the actual pattern. That's called
# overfitting. Regularization helps by adding a penalty term to the loss function. It
# makes the model more simple and less likely to fit the noise, so it doesn't overfit.
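The manual template above reproduces the Llama 3.2 chat format by hand. Assuming the tokenizer ships with the standard Llama 3.2 chat template, the same prompt can be built more concisely with apply_chat_template; the resulting tensors drop into the generation code above unchanged.

# Equivalent prompt construction via the tokenizer's built-in chat template.
chat = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

inputs = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,  # appends the assistant header
    return_dict=True,
    return_tensors="pt",
)

input_ids = inputs["input_ids"].to(model.device)
attention_mask = inputs["attention_mask"].to(model.device)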

Details

{
    "base_model": "fixie-ai/ultravox-v0_5-llama-3_2-1b",
    "batch_config": {
        "accumulation_steps": 2,
        "batch_size": 16,
        "batch_size_val": 32
    },
    "learning_rate_params": {
        "div_factor": 25.0,
        "final_div_factor": 1000.0,
        "learning_rate": 0.0002,
        "lr_scheduler_name": "OneCycleLR",
        "pct_start": 0.3
    },
    "lora_alpha": 64,
    "lora_dropout": 0.2,
    "lora_r": 64,
    "n_epochs": 5,
    "padding_side": "left",
    "quantization_config": {
        "llm_int8_skip_modules": [
            "lm_head"
        ],
        "llm_int8_threshold": 6.0,
        "load_in_4bit": false,
        "load_in_8bit": true
    },
    "target_modules": [
        "k_proj",
        "q_proj",
        "v_proj"
    ],
    "torch_dtype": "torch.bfloat16",
    "train_bias": "lora_only",
}
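For reference, the LoRA and quantization entries above correspond to standard peft and transformers/bitsandbytes configuration objects. The sketch below shows how these hyperparameters are typically expressed; it is an illustrative mapping, not the actual Restack training code.

from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 8-bit base-model quantization, mirroring "quantization_config" above.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=["lm_head"],
)

# LoRA adapters on the attention projections, mirroring the listed values.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.2,
    target_modules=["k_proj", "q_proj", "v_proj"],
    bias="lora_only",
)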