Model Summary

VisionFashion is a dual-encoder model that learns a joint embedding space for fashion images and text descriptions.
It combines a Vision Transformer (ViT-B/32) image encoder with a BERT-base text encoder and is trained in two stages:

  1. CLIP-style contrastive pre-training on the DeepFashion-MultiModal dataset
  2. Task-specific fine-tuning for (i) category classification and (ii) attribute prediction

Source code is available at https://github.com/tugcantopaloglu/vision-fashion-paper-deeplearning/

The result is a single checkpoint that supports:

  • Image → Text & Text → Image retrieval
  • Category prediction (17 classes)
  • Multi-label attribute prediction (92 attributes)

Intended Uses & Limitations

✅ Intended Uses

  • Fashion search & recommendation systems
  • Exploratory data analysis of clothing collections
  • Academic research on multi-modal learning

🚫 Limitations

  • Not suitable for medical or safety-critical use cases
  • Biased toward Western-style garments present in DeepFashion
  • Fails on out-of-distribution domains (e.g. non-fashion images)

Ethical Considerations

The dataset contains images of real people; ensure any use complies with its licence. Potential gender, body-type and cultural biases inherited from the data have not been fully audited.


How to Use

from transformers import AutoTokenizer
from timm import create_model
import torch
import torch.nn.functional as F

# Text encoder (BERT-base tokenizer)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Vision + Text dual encoder
model = create_model(
    "visionfashion_vit_bert",
    pretrained=True,
    checkpoint_path="visionfashion_vit_bert.pth"
).eval().to("cuda")

# Encode a batch
text = tok(["red floral summer dress"], return_tensors="pt").to("cuda")
# load_preprocessed_images is a user-supplied helper that returns a
# (N, 3, 224, 224) float tensor for the image paths in batch_paths
img = load_preprocessed_images(batch_paths).to("cuda")  # RGB, 224×224, 0-1

with torch.no_grad():
    img_emb, txt_emb = model(img, text["input_ids"], text["attention_mask"])

# L2-normalise so the dot product below is a cosine similarity
# (skip if the model already returns unit-norm embeddings)
img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)

# Cosine similarity = relevance score
scores = torch.matmul(img_emb, txt_emb.T)
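
To turn the score matrix into a retrieval result, rank the images per query. A small usage sketch building on the variables above (top_k_images is a hypothetical helper for illustration, not part of the released code):

import torch

def top_k_images(scores: torch.Tensor, k: int = 10):
    """Return the k best-matching image indices (and scores) for each text query.

    scores: (num_images, num_texts) cosine-similarity matrix from the snippet above.
    """
    per_query = scores.T                      # (num_texts, num_images)
    k = min(k, per_query.size(-1))            # guard against small image batches
    return per_query.topk(k, dim=-1)

# e.g. the five images most similar to "red floral summer dress":
# values, indices = top_k_images(scores, k=5)
# print(indices[0].tolist())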

Training Data

  • Dataset: DeepFashion-MultiModal (380 k train / 27 k val / 40 k test)
  • Images resized to 224 × 224; random augmentations: resized crop, horizontal flip, color jitter (a preprocessing sketch follows this list)
  • Captions cleaned & tokenised with the BERT tokenizer (max 40 tokens)
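
The exact pipeline lives in the GitHub repository; the torchvision sketch below approximates the augmentations and tokenisation listed above (jitter strengths are assumed values, not the repository's settings):

from torchvision import transforms
from transformers import AutoTokenizer

# Training-time image augmentation (jitter strengths are assumptions)
train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),                    # RGB tensor in [0, 1]
])

# Caption tokenisation: BERT tokenizer, truncated/padded to 40 tokens
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
caption = tok(
    "a red floral summer dress with short sleeves",
    padding="max_length",
    truncation=True,
    max_length=40,
    return_tensors="pt",
)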

Training Procedure

  • Stage 1 (contrastive pre-training): InfoNCE loss (τ = 0.07), 20 epochs, batch size 512, cosine LR schedule with warm-up, 1 × A100 80 GB
  • Stage 2 (fine-tuning): cross-entropy (category) + BCE (attributes), 10 epochs, batch size 256, cosine LR schedule, same GPU

Training uses mixed precision (FP16) with AdamW (β₁ = 0.9, β₂ = 0.999, weight decay 0.01).
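
For reference, a minimal sketch of a symmetric InfoNCE objective with τ = 0.07 as used in Stage 1; this is the generic CLIP-style formulation and may differ in detail from the repository's implementation:

import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / tau                       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # cross-entropy in both directions: image→text and text→image
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2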


Evaluation Results

  • Image → Text retrieval: Recall@10 = 0.549 ± 0.01 (test split)
  • Text → Image retrieval: Recall@10 = 0.554 ± 0.01 (test split)
  • Category prediction: Top-1 accuracy = 0.947 (test split)
  • Attribute prediction: average Recall@5 = 0.729 (test split)
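
Recall@10 counts a query as a hit when its ground-truth match appears among the 10 highest-scoring candidates. A minimal sketch of that computation, assuming a square similarity matrix where the correct match for query i is candidate i:

import torch

def recall_at_k(scores: torch.Tensor, k: int = 10) -> float:
    """scores[i, j] = similarity of query i and candidate j; candidate i is query i's ground truth."""
    top_k = scores.topk(k, dim=-1).indices                   # (num_queries, k)
    targets = torch.arange(scores.size(0), device=scores.device).unsqueeze(-1)
    hits = (top_k == targets).any(dim=-1)                    # hit if the ground truth is in the top k
    return hits.float().mean().item()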

See VisionFashion.pdf for ablation studies and qualitative retrieval examples.


Citation

@unpublished{topaloglu2025visionfashion,
  author  = {Tuğcan Topaloğlu},
  title   = {{VisionFashion}: Multi-Modal Style Embedding Learning with Vision Transformers and BERT for Fashion Image Analysis and Recommendation},
  year    = {2025},
  note    = {Work in progress},
  url     = {https://huggingface.co/tugcantopaloglu/visionfashion}
}

Licence

Code and weights are released under the MIT Licence.
The DeepFashion dataset follows its own licence terms—please comply with them before redistribution.


Contact / Questions

Open an issue in the GitHub repo or ping @tugcantopaloglu on the Hugging Face Hub.
