SmolLM2-1.7B-Instruct-TIFA-Random

Model Description

SmolLM2-1.7B-Instruct-TIFA-Random is a fine-tuned version of unsloth/SmolLM2-1.7B-Instruct specifically trained for TIFA (Text-to-Image Faithfulness Assessment) with flexible question generation. Unlike previous structured versions, this model generates diverse, natural evaluation questions without rigid formatting constraints, making it more adaptable for various evaluation scenarios.

Model Series: 135M | 360M | 1.7B-Structured | 1.7B-Random

Key Innovation: Flexible Structure

This model represents a paradigm shift from rigid question structures to flexible, natural question generation:

  • Previous models: Fixed Q1/Q2/Q3/Q4 structure with predetermined answer types
  • This model: Dynamic question generation focusing on visual verification without structural constraints
  • Benefit: More natural, diverse questions that better reflect real-world evaluation needs

Intended Use

This model generates 4 visual verification questions for text-to-image evaluation, focusing on:

  • Colors, shapes, objects, materials - Core visual elements
  • Spatial relationships - Positioning and arrangement
  • Presence/absence verification - What exists or doesn't exist
  • Mixed question types - Both yes/no and multiple choice questions
  • Natural diversity - Questions adapt to description content rather than following templates

Model Details

  • Base Model: unsloth/SmolLM2-1.7B-Instruct
  • Model Size: 1.7B parameters
  • Fine-tuning Method: Enhanced LoRA with flexible structure training
  • Training Framework: Transformers + TRL + PEFT + Unsloth
  • License: apache-2.0

Training Details

Advanced Training Configuration

  • Training Method: Supervised Fine-Tuning with category-balanced validation

  • Enhanced LoRA Configuration:

    • r: 32
    • lora_alpha: 64
    • lora_dropout: 0.05
    • Target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
  • Optimized Training Parameters:

    • Epochs: 2
    • Learning Rate: 5e-5
    • Batch Size: 16
    • Gradient Accumulation: 2 steps (effective batch size: 32)
    • Max Sequence Length: 1024
    • LR Scheduler: Cosine with 3% warmup
    • Validation: Category-balanced evaluation every 250 steps
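The effective batch size above follows from the per-device batch size times the gradient accumulation steps. As a sanity check, the hyperparameters can be collected in plain Python dicts (an illustrative sketch; these names are hypothetical and not taken from the actual training script):

```python
# Illustrative summary of the hyperparameters listed above.
# Key names are hypothetical, not from the real training script.
lora_config = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

train_config = {
    "num_epochs": 2,
    "learning_rate": 5e-5,
    "per_device_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "max_seq_length": 1024,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.03,
    "eval_steps": 250,
}

# Effective batch size = per-device batch size x accumulation steps
effective_batch_size = (
    train_config["per_device_batch_size"]
    * train_config["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 32
```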

Enhanced Dataset

  • Size: 18,000 examples
  • Structure: Flexible question generation without rigid templates
  • Validation: Category-balanced split ensuring robust evaluation
  • Coverage: Diverse visual elements, materials, spatial relationships, and verification tasks

Usage

Installation

pip install transformers torch accelerate

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_path = "kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)

# Create pipeline
chat_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)

def get_message(description):
    system = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.

Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain

Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]

Generate questions that test visual faithfulness between the description and image."""
    
    user_msg = f'Create 4 visual verification questions for this description: "{description}"'
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_msg}
    ]

# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
messages = get_message(description)

output = chat_pipe(
    messages, 
    max_new_tokens=256,
    do_sample=False,
)

print(output[0]["generated_text"])

Example Outputs

For "a lighthouse overlooking the ocean":

Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse

Q2: What body of water is visible?
C: lake, river, ocean, pond
A: ocean

Q3: Is the lighthouse positioned above the water?
C: no, yes
A: yes

Q4: Are there any mountains in the scene?
C: no, yes
A: no
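Because the model emits questions in the plain `Q[number]: / C: / A:` text format shown above, downstream evaluation usually needs them in structured form. A minimal parser sketch (the `parse_questions` helper is hypothetical, not part of the model repo):

```python
import re

def parse_questions(text):
    """Parse 'Q#: / C: / A:' formatted model output into a list of dicts."""
    questions = []
    # Each block is three consecutive lines: question, choices, answer.
    pattern = re.compile(
        r"Q\d+:\s*(?P<question>.+?)\s*\n"
        r"C:\s*(?P<choices>.+?)\s*\n"
        r"A:\s*(?P<answer>.+?)\s*(?:\n|$)"
    )
    for m in pattern.finditer(text):
        choices = [c.strip() for c in m.group("choices").split(",")]
        questions.append({
            "question": m.group("question"),
            "choices": choices,
            "answer": m.group("answer"),
        })
    return questions

sample = """Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse

Q2: Is the lighthouse positioned above the water?
C: no, yes
A: yes"""

parsed = parse_questions(sample)
print(len(parsed))  # 2
print(parsed[0]["answer"])  # lighthouse
```

Each parsed entry can then be handed to a VQA model for answering against the generated image.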

Citation

@misc{smollm2-1-7b-it-tifa-random-2025,
  title={SmolLM2-1.7B-Instruct-TIFA-Random: Flexible Question Generation for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random}
}

Model Series Comparison

| Model | Parameters | Dataset | Structure | Best For |
|-----------------|------------|---------|-------------|----------------------------------------|
| 135M | 135M | 5k | Fixed Q1-Q4 | Quick evaluation, resource-constrained |
| 360M | 360M | 10k | Fixed Q1-Q4 | Balanced performance |
| 1.7B-Structured | 1.7B | 10k | Fixed Q1-Q4 | Structured evaluation |
| 1.7B-Random | 1.7B | 18k | Flexible | Research, natural evaluation |