SmolLM2-1.7B-Instruct-TIFA-Random

Model Description

SmolLM2-1.7B-Instruct-TIFA-Random is a fine-tuned version of unsloth/SmolLM2-1.7B-Instruct specifically trained for TIFA (Text-to-Image Faithfulness Assessment) with flexible question generation. Unlike previous structured versions, this model generates diverse, natural evaluation questions without rigid formatting constraints, making it more adaptable for various evaluation scenarios.

Model Series: 135M | 360M | 1.7B-Structured | 1.7B-Random

Key Innovation: Flexible Structure

This model represents a paradigm shift from rigid question structures to flexible, natural question generation:

  • Previous models: Fixed Q1/Q2/Q3/Q4 structure with predetermined answer types
  • This model: Dynamic question generation focusing on visual verification without structural constraints
  • Benefit: More natural, diverse questions that better reflect real-world evaluation needs

Intended Use

This model generates 4 visual verification questions for text-to-image evaluation, focusing on:

  • Colors, shapes, objects, materials - Core visual elements
  • Spatial relationships - Positioning and arrangement
  • Presence/absence verification - What exists or doesn't exist
  • Mixed question types - Both yes/no and multiple choice questions
  • Natural diversity - Questions adapt to description content rather than following templates

Model Details

  • Base Model: unsloth/SmolLM2-1.7B-Instruct
  • Model Size: 1.7B parameters
  • Fine-tuning Method: Enhanced LoRA with flexible structure training
  • Training Framework: Transformers + TRL + PEFT + Unsloth
  • License: apache-2.0

Training Details

Advanced Training Configuration

  • Training Method: Supervised Fine-Tuning with category-balanced validation

  • Enhanced LoRA Configuration:

    • r: 32
    • lora_alpha: 64
    • lora_dropout: 0.05
    • Target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
  • Optimized Training Parameters:

    • Epochs: 2
    • Learning Rate: 5e-5
    • Batch Size: 16
    • Gradient Accumulation: 2 steps (effective batch size: 32)
    • Max Sequence Length: 1024
    • LR Scheduler: Cosine with 3% warmup
    • Validation: Category-balanced evaluation every 250 steps
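The effective batch size above follows from the per-device batch size times the gradient accumulation steps. As a sanity check, the hyperparameters can be collected in plain Python dicts (an illustrative sketch; these names are hypothetical and not taken from the actual training script):

```python
# Illustrative summary of the hyperparameters listed above.
# Key names are hypothetical, not from the real training script.
lora_config = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

train_config = {
    "num_epochs": 2,
    "learning_rate": 5e-5,
    "per_device_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "max_seq_length": 1024,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.03,
    "eval_steps": 250,
}

# Effective batch size = per-device batch size x accumulation steps
effective_batch_size = (
    train_config["per_device_batch_size"]
    * train_config["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 32
```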

Enhanced Dataset

  • Size: 18,000 examples
  • Structure: Flexible question generation without rigid templates
  • Validation: Category-balanced split ensuring robust evaluation
  • Coverage: Diverse visual elements, materials, spatial relationships, and verification tasks

Usage

Installation

pip install transformers torch accelerate

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_path = "kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)

# Create pipeline
chat_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)

def get_message(description):
    system = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.

Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain

Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]

Generate questions that test visual faithfulness between the description and image."""
    
    user_msg = f'Create 4 visual verification questions for this description: "{description}"'
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_msg}
    ]

# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
messages = get_message(description)

output = chat_pipe(
    messages, 
    max_new_tokens=256,
    do_sample=False,
)

print(output[0]["generated_text"])

Example Outputs

For "a lighthouse overlooking the ocean":

Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse

Q2: What body of water is visible?
C: lake, river, ocean, pond
A: ocean

Q3: Is the lighthouse positioned above the water?
C: no, yes
A: yes

Q4: Are there any mountains in the scene?
C: no, yes
A: no
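Because the model emits questions in the plain `Q[number]: / C: / A:` text format shown above, downstream evaluation usually needs them in structured form. A minimal parser sketch (the `parse_questions` helper is hypothetical, not part of the model repo):

```python
import re

def parse_questions(text):
    """Parse 'Q#: / C: / A:' formatted model output into a list of dicts."""
    questions = []
    # Each block is three consecutive lines: question, choices, answer.
    pattern = re.compile(
        r"Q\d+:\s*(?P<question>.+?)\s*\n"
        r"C:\s*(?P<choices>.+?)\s*\n"
        r"A:\s*(?P<answer>.+?)\s*(?:\n|$)"
    )
    for m in pattern.finditer(text):
        choices = [c.strip() for c in m.group("choices").split(",")]
        questions.append({
            "question": m.group("question"),
            "choices": choices,
            "answer": m.group("answer"),
        })
    return questions

sample = """Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse

Q2: Is the lighthouse positioned above the water?
C: no, yes
A: yes"""

parsed = parse_questions(sample)
print(len(parsed))  # 2
print(parsed[0]["answer"])  # lighthouse
```

Each parsed entry can then be handed to a VQA model for answering against the generated image.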

Citation

@misc{smollm2-1-7b-it-tifa-random-2025,
  title={SmolLM2-1.7B-Instruct-TIFA-Random: Flexible Question Generation for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random}
}

Model Series Comparison

| Model | Parameters | Dataset | Structure | Best For |
|-----------------|------------|---------|-------------|----------------------------------------|
| 135M | 135M | 5k | Fixed Q1-Q4 | Quick evaluation, resource-constrained |
| 360M | 360M | 10k | Fixed Q1-Q4 | Balanced performance |
| 1.7B-Structured | 1.7B | 10k | Fixed Q1-Q4 | Structured evaluation |
| 1.7B-Random | 1.7B | 18k | Flexible | Research, natural evaluation |