## Model Summary

VisionFashion is a dual-encoder model that learns a joint embedding space for fashion images and text descriptions.
It combines a Vision Transformer (ViT-B/32) image encoder with a BERT-base text encoder and is trained in two stages:

- CLIP-style contrastive pre-training on the DeepFashion-MultiModal dataset
- Task-specific fine-tuning for (i) category classification and (ii) attribute prediction

Source code is available at https://github.com/tugcantopaloglu/vision-fashion-paper-deeplearning/
The result is a single checkpoint that supports:
- Image → Text & Text → Image retrieval
- Category prediction (17 classes)
- Multi-label attribute prediction (92 attributes); a sketch of the prediction heads follows below
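The task heads are not documented in detail here, so the following is only a minimal sketch of how they could sit on top of the image embedding. The 512-dimensional embedding size, the head names, and the use of plain linear layers are assumptions, not confirmed properties of the released checkpoint.

```python
import torch
import torch.nn as nn

class VisionFashionHeads(nn.Module):
    """Hypothetical task heads on top of the shared image embedding (dimensions assumed)."""

    def __init__(self, embed_dim: int = 512, num_categories: int = 17, num_attributes: int = 92):
        super().__init__()
        self.category_head = nn.Linear(embed_dim, num_categories)   # 17-way category classification
        self.attribute_head = nn.Linear(embed_dim, num_attributes)  # 92 independent binary attributes

    def forward(self, img_emb: torch.Tensor):
        category_logits = self.category_head(img_emb)                   # argmax -> predicted category
        attribute_probs = torch.sigmoid(self.attribute_head(img_emb))   # per-attribute probabilities
        return category_logits, attribute_probs
```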
## Intended Uses & Limitations

| ✅ Intended Use | 🚫 Limitations |
|---|---|
| Fashion search & recommendation systems | Not suitable for medical or safety-critical use cases |
| Exploratory data analysis of clothing collections | Biased toward Western-style garments present in DeepFashion |
| Academic research on multi-modal learning | Fails on out-of-distribution domains (e.g. non-fashion images) |
## Ethical Considerations

The dataset contains images of real people; make sure you comply with its licence terms. Potential gender, body-type, and cultural biases inherited from the data have not been fully audited.
## How to Use

```python
from transformers import AutoTokenizer
from timm import create_model
import torch

# Text encoder
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Vision + Text dual encoder
model = create_model(
    "visionfashion_vit_bert",
    pretrained=True,
    checkpoint_path="visionfashion_vit_bert.pth",
).eval().to("cuda")

# Encode a batch
text = tok(["red floral summer dress"], return_tensors="pt").to("cuda")
# load_preprocessed_images is a user-supplied helper that returns a (B, 3, 224, 224)
# RGB tensor with values in [0, 1]
img = load_preprocessed_images(batch_paths).to("cuda")

with torch.no_grad():
    img_emb, txt_emb = model(img, text["input_ids"], text["attention_mask"])

# Cosine similarity = relevance score
scores = torch.matmul(img_emb, txt_emb.T)
```
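The dot product above equals cosine similarity only if the model already L2-normalizes its outputs, which is not stated here. A defensive variant that normalizes explicitly and ranks the images for the first text query (reusing the variables from the snippet above) could look like this:

```python
import torch.nn.functional as F

# Normalize so that the dot product is exactly cosine similarity
img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)

scores = img_emb @ txt_emb.T                        # (num_images, num_texts)
best = scores[:, 0].topk(k=min(5, scores.shape[0]))  # rank images for the first query
print(best.indices)                                 # indices of the top-matching images
```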
## Training Data

- Dataset: DeepFashion-MultiModal (380 k train / 27 k val / 40 k test)
- Images resized to 224 × 224 with random augmentation: random resized crop, horizontal flip, color jitter
- Captions cleaned and tokenised with the BERT tokenizer (max 40 tokens); a preprocessing sketch follows below
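A preprocessing pipeline consistent with the bullets above might look like the following sketch; the exact jitter strengths and padding strategy are assumptions, not taken from the training code.

```python
from torchvision import transforms
from transformers import AutoTokenizer

# Image side: random resized crop to 224 x 224, horizontal flip, color jitter (strengths assumed)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),  # RGB tensor with values in [0, 1], matching the usage snippet above
])

# Text side: BERT tokenizer, captions capped at 40 tokens
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
caption = tok(
    "red floral summer dress",
    padding="max_length",
    truncation=True,
    max_length=40,
    return_tensors="pt",
)
```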
## Training Procedure

| Stage | Loss | Epochs | Batch size | LR schedule | GPU |
|---|---|---|---|---|---|
| Contrastive | InfoNCE (τ = 0.07) | 20 | 512 | Cosine + warm-up | 1 × A100 80 GB |
| Fine-tune | CE (category) + BCE (attributes) | 10 | 256 | Cosine | same |

Mixed-precision (FP16) training with AdamW (β₁ = 0.9, β₂ = 0.999, weight decay 0.01).
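As a reference for the contrastive stage, here is a minimal sketch of a symmetric InfoNCE loss with temperature 0.07; the symmetric image/text formulation follows CLIP and is assumed rather than taken from the training code.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> matching text
    loss_t2i = F.cross_entropy(logits.T, targets)        # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```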
## Evaluation Results

| Task | Metric | Score | Split |
|---|---|---|---|
| Image → Text retrieval | Recall@10 | 0.549 ± 0.01 | test |
| Text → Image retrieval | Recall@10 | 0.554 ± 0.01 | test |
| Category prediction | Top-1 accuracy | 0.947 | test |
| Attribute prediction | Avg. Recall@5 | 0.729 | test |
See VisionFashion.pdf for ablation studies and qualitative retrieval examples.
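The retrieval recall numbers above can be computed with a generic Recall@K routine like the sketch below (not the authors' evaluation code), assuming one ground-truth caption per image and embeddings arranged so that row i of each matrix forms a matched pair.

```python
import torch

def recall_at_k(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int = 10) -> float:
    """Fraction of image queries whose paired caption appears in the top-k retrieved texts."""
    scores = img_emb @ txt_emb.T                         # (N, N) similarity matrix
    topk = scores.topk(k, dim=1).indices                 # top-k text indices per image query
    targets = torch.arange(scores.shape[0]).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```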
## Citation

```bibtex
@unpublished{topaloglu2025visionfashion,
  author = {Tuğcan Topaloğlu},
  title  = {{VisionFashion}: Multi-Modal Style Embedding Learning with Vision Transformers and BERT for Fashion Image Analysis and Recommendation},
  year   = {2025},
  note   = {Work in progress},
  url    = {https://huggingface.co/tugcantopaloglu/visionfashion}
}
```
## Licence

Code and weights are released under the MIT Licence.
The DeepFashion dataset is distributed under its own licence terms; please comply with them before redistribution.
## Contact / Questions

Open an issue in the GitHub repo or ping @tugcantopaloglu on the Hugging Face Hub.