SmolLM2-360M-Instruct-TIFA
Model Description
SmolLM2-360M-Instruct-TIFA is a fine-tuned version of unsloth/SmolLM2-360M-Instruct trained for TIFA (Text-to-Image Faithfulness Assessment). Given a text description, it generates structured evaluation questions that measure how faithfully a text-to-image model renders that description. Compared with the earlier 135M-parameter version, this release uses a larger 360M-parameter base model and an expanded training set.
Intended Use
This model automatically generates evaluation questions for text-to-image models. For each description it produces four questions, one of each type:
- Negative question: Should have "no" as the answer (testing for absent elements)
- Object/attribute identification: Should have a single-word answer taken directly from the description
- Alternative object/attribute identification: Should have a different single-word answer, also taken from the description
- Positive question: Should have "yes" as the answer (testing for present elements)
Model Details
- Base Model: unsloth/SmolLM2-360M-Instruct
- Model Size: 360M parameters
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Training Framework: Transformers + TRL + PEFT + Unsloth
- License: apache-2.0
Training Details
Training Configuration
Training Method: Supervised Fine-Tuning (SFT) with LoRA
LoRA Configuration:
- r: 16
- lora_alpha: 32
- lora_dropout: 0.05
- Target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
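The LoRA settings above can be collected into a plain dictionary whose keys match the field names of PEFT's `LoraConfig`. This is a minimal sketch, not the author's training script: the helper `lora_param_count` and the 1024 x 1024 layer shape are hypothetical and only illustrate how the rank controls adapter size.

```python
# Hyperparameters copied from the LoRA configuration listed above.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer mapping d_in -> d_out.

    The adapter consists of A (r x d_in) and B (d_out x r).
    """
    return r * (d_in + d_out)


# Hypothetical layer shape, for illustration only (not taken from this card).
example_params = lora_param_count(1024, 1024, lora_kwargs["r"])

print(f"scaling = {scaling}, adapter params for a 1024x1024 layer = {example_params}")

# Assumption: with PEFT installed, the same values could be passed as
# LoraConfig(task_type="CAUSAL_LM", **lora_kwargs).
```

With r = 16 and lora_alpha = 32, the adapter update is scaled by 2, a common choice that keeps the effective update magnitude stable when the rank is changed.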
Training Parameters:
- Epochs: 3
- Learning Rate: 1e-4
- Batch Size: 8 (per device)
- Gradient Accumulation Steps: 2
- Max Sequence Length: 512
- Optimizer: AdamW
- Weight Decay: 0.01
- Warmup Steps: 200
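To put these numbers in context, the sketch below derives the effective batch size and the approximate number of optimizer steps. It assumes a single GPU and that all 10,000 examples reported in the Dataset section were used for training; neither assumption is confirmed by this card.

```python
import math

num_examples = 10_000    # dataset size reported in the Dataset section
per_device_batch = 8
grad_accum_steps = 2
epochs = 3
warmup_steps = 200
num_devices = 1          # assumption: single-GPU training

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Optimizer steps per epoch and over the whole run.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Share of the run spent in learning-rate warmup.
warmup_fraction = warmup_steps / total_steps

print(effective_batch, steps_per_epoch, total_steps, round(warmup_fraction, 3))
```

Under these assumptions, the 200 warmup steps cover roughly the first tenth of training.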
Dataset
The model was trained on a structured dataset of 10,000 examples generated with Gemini and stored as chat-style conversations in JSONL format.
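The exact record schema is not published here. As an illustration only, one JSONL line in the common chat format might look like the sketch below; the `messages` key, the role names, and the placeholder system prompt are assumptions, and the question text is reused from the prompt example in the Usage section.

```python
import json

# Hypothetical record layout: the "messages" key and role names follow the
# common chat-format convention and are NOT confirmed by this model card.
record = {
    "messages": [
        {"role": "system", "content": "<system prompt with the four question rules>"},
        {
            "role": "user",
            "content": 'Create four multiple-choice questions for this description: '
                       '"a red car parked near a tall building".',
        },
        {
            "role": "assistant",
            "content": (
                "Q1: Is there a person washing the car?\nC: no, yes\nA: no\n"
                "Q2: What color is the car?\nC: blue, red, green, yellow\nA: red\n"
                "Q3: What type of structure is near the car?\n"
                "C: house, building, garage, tree\nA: building\n"
                "Q4: Is there a car in the image?\nC: no, yes\nA: yes"
            ),
        },
    ]
}

# JSONL stores one JSON object per line; json.dumps escapes embedded newlines.
jsonl_line = json.dumps(record)

# Reading a line back yields the original conversation.
loaded = json.loads(jsonl_line)
roles = [m["role"] for m in loaded["messages"]]
print(roles)
```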
Usage
Installation
pip install transformers torch accelerate
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model_path = "kawchar85/SmolLM2-360M-Instruct-TIFA"
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)
# Create pipeline
chat_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)
def get_message(desc):
    # Build the chat messages: a system prompt with the rules and one worked
    # example, followed by the user request for the given description.
    system_msg = """\
You are a helpful assistant. Your job is to write exactly four multiple-choice questions that test if an image matches its description.
Rules:
Q1: Focus on something NOT in the description. Answer must be 'no' (choices: no, yes).
Q2: Answer must be one exact word from the description; provide 4 UNIQUE choices.
Q3: Answer must be a DIFFERENT exact word from the description than what was used in Q2; provide 4 UNIQUE choices.
Q4: Focus on something present in the description. Answer must be 'yes' (choices: no, yes).
Make each question cover a distinct detail. Ensure all questions are meaningful, valid, and relevant to the description.
For description "a red car parked near a tall building":
Q1: Is there a person washing the car?
C: no, yes
A: no
Q2: What color is the car?
C: blue, red, green, yellow
A: red
Q3: What type of structure is near the car?
C: house, building, garage, tree
A: building
Q4: Is there a car in the image?
C: no, yes
A: yes
"""
    user_msg = f'Create four multiple-choice questions for this description: "{desc}".'
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg}
    ]
# Generate evaluation questions
description = "khaki triangles and azure crescents"
messages = get_message(description)
output = chat_pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
print(output[0]["generated_text"])
Example Output
For the description "khaki triangles and azure crescents", the model generates:
Q1: Are there any circles in the image?
C: no, yes
A: no
Q2: What color are the triangles?
C: blue, khaki, red, green
A: khaki
Q3: What shapes are azure colored?
C: squares, triangles, crescents, circles
A: crescents
Q4: Are there triangular shapes in the image?
C: no, yes
A: yes
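Because the output follows a fixed Q/C/A layout, it can be parsed and checked against the four question rules before it is used for evaluation. The following is a minimal sketch that assumes the output matches this layout exactly; `Question`, `parse_questions`, and `check_rules` are hypothetical helper names, not part of the model or of any library.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    text: str
    choices: List[str]
    answer: str


def parse_questions(raw: str) -> List[Question]:
    """Parse 'Qn: ... / C: ... / A: ...' triples into Question objects."""
    questions, current = [], {}
    for line in raw.splitlines():
        match = re.match(r"^(Q\d|C|A):\s*(.*)$", line.strip())
        if not match:
            continue  # ignore anything outside the expected layout
        key, value = match.groups()
        if key.startswith("Q"):
            current = {"text": value}
        elif key == "C":
            current["choices"] = [c.strip() for c in value.split(",")]
        elif "text" in current and "choices" in current:
            # An "A:" line closes the current question.
            questions.append(Question(answer=value.strip(), **current))
            current = {}
    return questions


def check_rules(questions: List[Question], description: str) -> List[str]:
    """Return a list of rule violations; an empty list means all checks pass."""
    if len(questions) != 4:
        return [f"expected 4 questions, got {len(questions)}"]
    problems = []
    q1, q2, q3, q4 = questions
    if q1.answer != "no":
        problems.append("Q1 answer should be 'no'")
    if q4.answer != "yes":
        problems.append("Q4 answer should be 'yes'")
    words = set(re.findall(r"\w+", description.lower()))
    for label, q in (("Q2", q2), ("Q3", q3)):
        if q.answer.lower() not in words:
            problems.append(f"{label} answer is not a word from the description")
        if len(set(q.choices)) != 4:
            problems.append(f"{label} should have 4 unique choices")
        if q.answer not in q.choices:
            problems.append(f"{label} answer is missing from its choices")
    if q2.answer.lower() == q3.answer.lower():
        problems.append("Q2 and Q3 share the same answer")
    return problems


sample = """Q1: Are there any circles in the image?
C: no, yes
A: no
Q2: What color are the triangles?
C: blue, khaki, red, green
A: khaki
Q3: What shapes are azure colored?
C: squares, triangles, crescents, circles
A: crescents
Q4: Are there triangular shapes in the image?
C: no, yes
A: yes"""

parsed = parse_questions(sample)
problems = check_rules(parsed, "khaki triangles and azure crescents")
print(len(parsed), problems)
```

A non-empty result from `check_rules` can be used to discard or regenerate an output, for example when Q2 and Q3 end up with the same answer.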
Improvements Over Previous Version
This 360M parameter model offers several advantages over the 135M version:
- Better consistency: More reliable generation of distinct Q2 and Q3 questions
- Enhanced comprehension: Better understanding of complex descriptions
- Improved question quality: More natural and varied question formulations
- Larger training dataset: Trained on 10k examples (vs 5k) for better generalization
Limitations
- The model is specialized for TIFA evaluation and may not perform well on general conversation tasks
- Limited to generating 4-question evaluation sets in the trained format
- Performance depends on the quality and diversity of the training dataset
- Q2 and Q3 can still duplicate each other for complex descriptions, although this happens less often than with the 135M version
Technical Specifications
- Architecture: Transformer-based language model (360M parameters)
- Precision: FP16
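- Context Length: 512 tokens (the maximum sequence length used during fine-tuning)
- Inference Speed: No benchmark figures are reported; the small 360M-parameter size is intended to keep question generation fast
- Training Framework: Fine-tuned with Unsloth optimizations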
Citation
@misc{smollm2-360m-it-tifa-2025,
  title={SmolLM2-360M-Instruct-TIFA: A Fine-tuned Model for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA}
}