## Model Summary

VisionFashion is a dual-encoder model that learns a joint embedding space for fashion images and text descriptions.
It combines a Vision Transformer (ViT-B/32) image encoder with a BERT-base text encoder and is trained in two stages:

- CLIP-style contrastive pre-training on the DeepFashion-MultiModal dataset
- Task-specific fine-tuning for (i) category classification and (ii) attribute prediction

Source code is available at https://github.com/tugcantopaloglu/vision-fashion-paper-deeplearning/
The result is a single checkpoint that supports:
- Image → Text & Text → Image retrieval
- Category prediction (17 classes)
- Multi-label attribute prediction (92 attributes); a sketch of the prediction heads follows below
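The task heads are not documented in detail here, so the following is only a minimal sketch of how they could sit on top of the image embedding. The 512-dimensional embedding size, the head names, and the use of plain linear layers are assumptions, not confirmed properties of the released checkpoint.

```python
import torch
import torch.nn as nn

class VisionFashionHeads(nn.Module):
    """Hypothetical task heads on top of the shared image embedding (dimensions assumed)."""

    def __init__(self, embed_dim: int = 512, num_categories: int = 17, num_attributes: int = 92):
        super().__init__()
        self.category_head = nn.Linear(embed_dim, num_categories)   # 17-way category classification
        self.attribute_head = nn.Linear(embed_dim, num_attributes)  # 92 independent binary attributes

    def forward(self, img_emb: torch.Tensor):
        category_logits = self.category_head(img_emb)                   # argmax -> predicted category
        attribute_probs = torch.sigmoid(self.attribute_head(img_emb))   # per-attribute probabilities
        return category_logits, attribute_probs
```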
## Intended Uses & Limitations

| ✅ Intended Use | 🚫 Limitations |
|---|---|
| Fashion search & recommendation systems | Not suitable for medical or safety-critical use cases |
| Exploratory data analysis of clothing collections | Biased toward Western-style garments present in DeepFashion |
| Academic research on multi-modal learning | Fails on out-of-distribution domains (e.g. non-fashion images) |
## Ethical Considerations

The dataset contains images of real people; make sure you comply with its licence terms. Potential gender, body-type, and cultural biases inherited from the data have not been fully audited.
## How to Use

```python
from transformers import AutoTokenizer
from timm import create_model
import torch

# Text encoder
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Vision + Text dual encoder
model = create_model(
    "visionfashion_vit_bert",
    pretrained=True,
    checkpoint_path="visionfashion_vit_bert.pth",
).eval().to("cuda")

# Encode a batch
text = tok(["red floral summer dress"], return_tensors="pt").to("cuda")
# load_preprocessed_images is a user-supplied helper that returns a (B, 3, 224, 224)
# RGB tensor with values in [0, 1]
img = load_preprocessed_images(batch_paths).to("cuda")

with torch.no_grad():
    img_emb, txt_emb = model(img, text["input_ids"], text["attention_mask"])

# Cosine similarity = relevance score
scores = torch.matmul(img_emb, txt_emb.T)
```
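The dot product above equals cosine similarity only if the model already L2-normalizes its outputs, which is not stated here. A defensive variant that normalizes explicitly and ranks the images for the first text query (reusing the variables from the snippet above) could look like this:

```python
import torch.nn.functional as F

# Normalize so that the dot product is exactly cosine similarity
img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)

scores = img_emb @ txt_emb.T                        # (num_images, num_texts)
best = scores[:, 0].topk(k=min(5, scores.shape[0]))  # rank images for the first query
print(best.indices)                                 # indices of the top-matching images
```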
## Training Data

- Dataset: DeepFashion-MultiModal (380 k train / 27 k val / 40 k test)
- Images resized to 224 × 224 with random augmentation: random resized crop, horizontal flip, color jitter
- Captions cleaned and tokenised with the BERT tokenizer (max 40 tokens); a preprocessing sketch follows below
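A preprocessing pipeline consistent with the bullets above might look like the following sketch; the exact jitter strengths and padding strategy are assumptions, not taken from the training code.

```python
from torchvision import transforms
from transformers import AutoTokenizer

# Image side: random resized crop to 224 x 224, horizontal flip, color jitter (strengths assumed)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),  # RGB tensor with values in [0, 1], matching the usage snippet above
])

# Text side: BERT tokenizer, captions capped at 40 tokens
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
caption = tok(
    "red floral summer dress",
    padding="max_length",
    truncation=True,
    max_length=40,
    return_tensors="pt",
)
```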
## Training Procedure

| Stage | Loss | Epochs | Batch size | LR schedule | GPU |
|---|---|---|---|---|---|
| Contrastive | InfoNCE (τ = 0.07) | 20 | 512 | Cosine + warm-up | 1 × A100 80 GB |
| Fine-tune | CE (category) + BCE (attributes) | 10 | 256 | Cosine | same |

Mixed-precision (FP16) training with AdamW (β₁ = 0.9, β₂ = 0.999, weight decay 0.01).
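As a reference for the contrastive stage, here is a minimal sketch of a symmetric InfoNCE loss with temperature 0.07; the symmetric image/text formulation follows CLIP and is assumed rather than taken from the training code.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> matching text
    loss_t2i = F.cross_entropy(logits.T, targets)        # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```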
## Evaluation Results

| Task | Metric | Score | Split |
|---|---|---|---|
| Image → Text retrieval | Recall@10 | 0.549 ± 0.01 | test |
| Text → Image retrieval | Recall@10 | 0.554 ± 0.01 | test |
| Category prediction | Top-1 accuracy | 0.947 | test |
| Attribute prediction | Avg. Recall@5 | 0.729 | test |
See VisionFashion.pdf for ablation studies and qualitative retrieval examples.
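The retrieval recall numbers above can be computed with a generic Recall@K routine like the sketch below (not the authors' evaluation code), assuming one ground-truth caption per image and embeddings arranged so that row i of each matrix forms a matched pair.

```python
import torch

def recall_at_k(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int = 10) -> float:
    """Fraction of image queries whose paired caption appears in the top-k retrieved texts."""
    scores = img_emb @ txt_emb.T                         # (N, N) similarity matrix
    topk = scores.topk(k, dim=1).indices                 # top-k text indices per image query
    targets = torch.arange(scores.shape[0]).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```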
## Citation

```bibtex
@unpublished{topaloglu2025visionfashion,
  author = {Tuğcan Topaloğlu},
  title  = {{VisionFashion}: Multi-Modal Style Embedding Learning with Vision Transformers and BERT for Fashion Image Analysis and Recommendation},
  year   = {2025},
  note   = {Work in progress},
  url    = {https://huggingface.co/tugcantopaloglu/visionfashion}
}
```
## Licence

Code and weights are released under the MIT Licence.
The DeepFashion dataset is distributed under its own licence terms; please comply with them before redistribution.
## Contact / Questions

Open an issue in the GitHub repo or ping @tugcantopaloglu on the Hugging Face Hub.