SmolLM2-1.7B-Instruct-TIFA-Random
Model Description
SmolLM2-1.7B-Instruct-TIFA-Random is a fine-tuned version of unsloth/SmolLM2-1.7B-Instruct trained specifically for TIFA (Text-to-Image Faithfulness Assessment) with flexible question generation. Unlike the earlier structured versions, this model generates diverse, natural evaluation questions without rigid formatting constraints, making it more adaptable to a wider range of evaluation scenarios.
Model Series: 135M | 360M | 1.7B-Structured | 1.7B-Random
Key Innovation: Flexible Structure
This model represents a paradigm shift from rigid question structures to flexible, natural question generation:
- Previous models: Fixed Q1/Q2/Q3/Q4 structure with predetermined answer types
- This model: Dynamic question generation focusing on visual verification without structural constraints
- Benefit: More natural, diverse questions that better reflect real-world evaluation needs
Intended Use
This model generates 4 visual verification questions for text-to-image evaluation, focusing on:
- Colors, shapes, objects, materials - Core visual elements
- Spatial relationships - Positioning and arrangement
- Presence/absence verification - What exists or doesn't exist
- Mixed question types - Both yes/no and multiple choice questions
- Natural diversity - Questions adapt to description content rather than following templates
Model Details
- Base Model: unsloth/SmolLM2-1.7B-Instruct
- Model Size: 1.7B parameters
- Fine-tuning Method: Enhanced LoRA with flexible structure training
- Training Framework: Transformers + TRL + PEFT + Unsloth
- License: apache-2.0
Training Details
Advanced Training Configuration
Training Method: Supervised Fine-Tuning with category-balanced validation
Enhanced LoRA Configuration:
- r: 32
- lora_alpha: 64
- lora_dropout: 0.05
- Target modules:
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Optimized Training Parameters:
- Epochs: 2
- Learning Rate: 5e-5
- Batch Size: 16
- Gradient Accumulation: 2 steps (effective batch size: 32)
- Max Sequence Length: 1024
- LR Scheduler: Cosine with 3% warmup
- Validation: Category-balanced evaluation every 250 steps
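Expressed as Transformers TrainingArguments, the schedule above looks roughly like the following. This is a sketch rather than the exact training script: the output directory name is hypothetical, a recent transformers release is assumed (for eval_strategy), and the SFTTrainer/dataset wiring is omitted.

from transformers import TrainingArguments

# Sketch of the reported hyperparameters; sequences are truncated to
# 1024 tokens at tokenization time (not a TrainingArguments field).
training_args = TrainingArguments(
    output_dir="smollm2-1.7b-tifa-random",  # hypothetical output path
    num_train_epochs=2,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,          # effective batch size: 16 * 2 = 32
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                      # 3% warmup
    eval_strategy="steps",                  # category-balanced validation
    eval_steps=250,
)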
Enhanced Dataset
- Size: 18,000 examples
- Structure: Flexible question generation without rigid templates
- Validation: Category-balanced split ensuring robust evaluation
- Coverage: Diverse visual elements, materials, spatial relationships, and verification tasks
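The training data itself is not distributed with this card. Assuming each record follows the same chat format used at inference time (see Usage below), a single example would look roughly like this; the field name, description, and questions here are illustrative assumptions, not actual dataset contents.

# Illustrative only: a plausible chat-formatted training record.
SYSTEM_PROMPT = "You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. ..."  # full prompt shown in Usage below

example = {
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": 'Create 4 visual verification questions for this description: '
                       '"a red bicycle leaning against a brick wall"',
        },
        {
            "role": "assistant",
            "content": (
                "Q1: What color is the bicycle?\n"
                "C: blue, red, green, black\n"
                "A: red\n"
                "Q2: What is the bicycle leaning against?\n"
                "C: fence, tree, brick wall, lamp post\n"
                "A: brick wall\n"
                "Q3: Is a bicycle present in the image?\n"
                "C: no, yes\n"
                "A: yes\n"
                "Q4: Is there a car in the scene?\n"
                "C: no, yes\n"
                "A: no"
            ),
        },
    ]
}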
Usage
Installation
pip install transformers torch accelerate
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_path = "kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)

# Create pipeline
chat_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)

def get_message(description):
    system = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.
Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain
Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]
Generate questions that test visual faithfulness between the description and image."""
    user_msg = f'Create 4 visual verification questions for this description: "{description}"'
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_msg}
    ]

# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
messages = get_message(description)

output = chat_pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
print(output[0]["generated_text"])
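Because the generated text follows the Q/C/A format specified in the system prompt, it can be post-processed into structured records. The parser below is a hypothetical convenience helper, not part of the model or this repository.

import re

# Hypothetical helper: turn "Q<n>: / C: / A:" blocks into dictionaries.
QCA_PATTERN = re.compile(
    r"Q\d+:\s*(?P<question>.+)\n"
    r"C:\s*(?P<choices>.+)\n"
    r"A:\s*(?P<answer>.+)"
)

def parse_questions(text):
    parsed = []
    for match in QCA_PATTERN.finditer(text):
        parsed.append({
            "question": match.group("question").strip(),
            "choices": [c.strip() for c in match.group("choices").split(",")],
            "answer": match.group("answer").strip(),
        })
    return parsed

questions = parse_questions(output[0]["generated_text"])
for q in questions:
    print(q)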
Example Outputs
For "a lighthouse overlooking the ocean":
Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse
Q2: What body of water is visible?
C: lake, river, ocean, pond
A: ocean
Q3: Is the lighthouse positioned above the water?
C: no, yes
A: yes
Q4: Are there any mountains in the scene?
C: no, yes
A: no
Citation
@misc{smollm2-1-7b-it-tifa-random-2025,
  title={SmolLM2-1.7B-Instruct-TIFA-Random: Flexible Question Generation for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random}
}