SmolLM2-360M-Instruct-TIFA

Model Description

SmolLM2-360M-Instruct-TIFA is a fine-tuned version of unsloth/SmolLM2-360M-Instruct trained for TIFA (Text-to-Image Faithfulness Assessment). Given a text description, the model generates a structured set of evaluation questions that can be used to assess how faithfully a text-to-image model renders that description. Compared with the earlier 135M-parameter version, this release uses a larger 360M-parameter base model and a larger training set (10k examples vs 5k).

Intended Use

This model is designed to automatically generate evaluation questions for text-to-image models by creating four specific types of questions:

Negative question: Should have "no" as the answer (testing for absent elements)
Object/attribute identification: Should have a single-word answer taken directly from the description
Alternative object/attribute identification: Should have a single-word answer from the description that differs from the answer to the previous question
Positive question: Should have "yes" as the answer (testing for present elements)
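
The four rules above can be checked mechanically. The following is a minimal sketch, not part of this repository: `check_answers` is a hypothetical helper that takes a description and the four answers in order and lists any rule violations. It assumes simple regex tokenization and case-insensitive matching.

```python
import re

def check_answers(description, answers):
    """Return a list of rule violations for answers given as [Q1, Q2, Q3, Q4]."""
    # Assumption: words are lowercase runs of letters, digits, apostrophes, hyphens.
    words = set(re.findall(r"[a-z0-9'-]+", description.lower()))
    a1, a2, a3, a4 = (a.strip().lower() for a in answers)
    problems = []
    if a1 != "no":
        problems.append("Q1 answer must be 'no'")
    for label, ans in (("Q2", a2), ("Q3", a3)):
        if ans not in words:
            problems.append(f"{label} answer must be a single word from the description")
    if a2 == a3:
        problems.append("Q2 and Q3 must use different answers")
    if a4 != "yes":
        problems.append("Q4 answer must be 'yes'")
    return problems

desc = "a red car parked near a tall building"
print(check_answers(desc, ["no", "red", "building", "yes"]))  # valid set
print(check_answers(desc, ["no", "red", "red", "yes"]))       # Q2/Q3 duplicate
```

An empty list means all four rules hold; a multi-word answer such as "red car" is reported because it is not a single word from the description.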

Model Details

Base Model: unsloth/SmolLM2-360M-Instruct
Model Size: 360M parameters
Fine-tuning Method: LoRA (Low-Rank Adaptation)
Training Framework: Transformers + TRL + PEFT + Unsloth
License: apache-2.0

Training Details

Training Configuration

Training Method: Supervised Fine-Tuning (SFT) with LoRA
LoRA Configuration:
- r: 16
- lora_alpha: 32
- lora_dropout: 0.05
- Target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Training Parameters:
- Epochs: 3
- Learning Rate: 1e-4
- Batch Size: 8 (per device)
- Gradient Accumulation Steps: 2
- Max Sequence Length: 512
- Optimizer: AdamW
- Weight Decay: 0.01
- Warmup Steps: 200
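
To make the configuration above concrete, the sketch below derives a few quantities from it: the LoRA scaling factor, the effective batch size, and the approximate number of optimizer steps. The device count and the number of examples actually used for training are assumptions, since this card does not state either; the LoRA values are collected in a plain dictionary whose keys match arguments of `peft.LoraConfig`.

```python
import math

# Values copied from the configuration listed above.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}
# With PEFT installed, these could be passed as peft.LoraConfig(**lora_kwargs).

per_device_batch = 8
grad_accum_steps = 2
epochs = 3
warmup_steps = 200

# ASSUMPTIONS (not stated in this card): a single device, and all 10,000
# examples used for training with no held-out split.
num_devices = 1
num_examples = 10_000

lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]  # standard LoRA scaling alpha / r
effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"LoRA scaling (alpha / r): {lora_scaling}")
print(f"Effective batch size:     {effective_batch}")
print(f"Optimizer steps:          {total_steps}")
print(f"Warmup share of training: {warmup_fraction:.1%}")
```

Under these assumptions the run takes 1,875 optimizer steps, so the 200 warmup steps cover roughly the first 11% of training. A different device count or a train/validation split would change these numbers.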

Dataset

The model was trained on a structured dataset containing 10,000 examples created using Gemini, formatted as conversation data in JSONL format.
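
The exact schema of the training file is not published in this card. As an illustration only, the sketch below builds one JSONL line in the `messages` layout (a list of role/content turns) that is commonly used for conversational fine-tuning data. The key name `messages` and the shortened system prompt are assumptions; the user and assistant text reuse the example shown elsewhere in this card.

```python
import json

# HYPOTHETICAL record layout: the "messages" key and the shortened system
# prompt are assumptions, not the author's published schema.
record = {
    "messages": [
        {
            "role": "system",
            "content": "Write exactly four multiple-choice questions that "
                       "test if an image matches its description.",
        },
        {
            "role": "user",
            "content": 'Create four multiple-choice questions for this '
                       'description: "khaki triangles and azure crescents".',
        },
        {
            "role": "assistant",
            "content": "Q1: Are there any circles in the image?\nC: no, yes\nA: no\n"
                       "Q2: What color are the triangles?\nC: blue, khaki, red, green\nA: khaki\n"
                       "Q3: What shapes are azure colored?\nC: squares, triangles, crescents, circles\nA: crescents\n"
                       "Q4: Are there triangular shapes in the image?\nC: no, yes\nA: yes",
        },
    ]
}

# JSONL stores one JSON object per line; json.dumps escapes the embedded newlines.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
print(line[:80])
print([m["role"] for m in parsed["messages"]])
```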

Usage

Installation

pip install transformers torch accelerate

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_path = "kawchar85/SmolLM2-360M-Instruct-TIFA"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)

# Create pipeline
chat_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)

def get_message(desc):
    system_msg = """\
You are a helpful assistant. Your job is to write exactly four multiple-choice questions that test if an image matches its description.
Rules:
Q1: Focus on something NOT in the description. Answer must be 'no' (choices: no, yes).
Q2: Answer must be one exact word from the description; provide 4 UNIQUE choices.
Q3: Answer must be a DIFFERENT exact word from the description than what was used in Q2; provide 4 UNIQUE choices.
Q4: Focus on something present in the description. Answer must be 'yes' (choices: no, yes).
Make each question cover a distinct detail. Ensure all questions are meaningful, valid, and relevant to the description.
For description "a red car parked near a tall building":
Q1: Is there a person washing the car? 
C: no, yes
A: no
Q2: What color is the car?
C: blue, red, green, yellow
A: red
Q3: What type of structure is near the car?
C: house, building, garage, tree
A: building
Q4: Is there a car in the image?
C: no, yes
A: yes
"""
    
    user_msg = f'Create four multiple-choice questions for this description: "{desc}".'
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg}
    ]

# Generate evaluation questions
description = "khaki triangles and azure crescents"
messages = get_message(description)

output = chat_pipe(
    messages, 
    max_new_tokens=256, 
    do_sample=False,
)

print(output[0]["generated_text"])

Example Output

For the description "khaki triangles and azure crescents", the model generates:

Q1: Are there any circles in the image?
C: no, yes
A: no
Q2: What color are the triangles?
C: blue, khaki, red, green
A: khaki
Q3: What shapes are azure colored?
C: squares, triangles, crescents, circles
A: crescents
Q4: Are there triangular shapes in the image?
C: no, yes
A: yes
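
Downstream code usually needs this output as structured data. The sketch below is one possible way to parse the `Q`/`C`/`A` layout shown above; `parse_questions` is a hypothetical helper, not part of this repository, and it assumes each question, choice list, and answer sits on its own line.

```python
import re

def parse_questions(text):
    """Parse 'Qn: ...', 'C: ...', 'A: ...' lines into a list of dicts."""
    items, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Q(\d+):\s*(.*)", line)
        if match:
            # A new question starts a new record.
            current = {"index": int(match.group(1)), "question": match.group(2),
                       "choices": [], "answer": None}
            items.append(current)
        elif current is not None and line.startswith("C:"):
            current["choices"] = [c.strip() for c in line[2:].split(",") if c.strip()]
        elif current is not None and line.startswith("A:"):
            current["answer"] = line[2:].strip()
    return items

sample = """\
Q1: Are there any circles in the image?
C: no, yes
A: no
Q2: What color are the triangles?
C: blue, khaki, red, green
A: khaki
Q3: What shapes are azure colored?
C: squares, triangles, crescents, circles
A: crescents
Q4: Are there triangular shapes in the image?
C: no, yes
A: yes
"""

questions = parse_questions(sample)
for q in questions:
    # Flag any answer that is missing from its own choice list.
    status = "ok" if q["answer"] in q["choices"] else "answer not in choices"
    print(q["index"], q["answer"], status)
```

Checking that each answer appears among its choices is a cheap way to catch malformed generations before they reach an evaluation pipeline.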

Improvements Over Previous Version

This 360M parameter model offers several advantages over the 135M version:

Better consistency: More reliable generation of distinct Q2 and Q3 questions
Enhanced comprehension: Better understanding of complex descriptions
Improved question quality: More natural and varied question formulations
Larger training dataset: Trained on 10k examples (vs 5k) for better generalization

Limitations

The model is specialized for TIFA evaluation and may not perform well on general conversation tasks
Limited to generating 4-question evaluation sets in the trained format
Performance depends on the quality and diversity of the training dataset
Q2 and Q3 may still duplicate each other for complex descriptions, although this happens less often than with the 135M version

Technical Specifications

Architecture: Transformer-based language model (360M parameters)
Precision: FP16
Context Length: 512 tokens (the maximum sequence length used during fine-tuning)
Inference Speed: Optimized for quick question generation
Training Framework: Enhanced with Unsloth optimizations

Citation

@misc{smollm2-360m-it-tifa-2025,
  title={SmolLM2-360M-Instruct-TIFA: A Fine-tuned Model for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA}
}
