**VisionFashion** is a dual-encoder model that learns a joint embedding space for fashion **images** and **text descriptions**.
It combines a **Vision Transformer (ViT-B/32)** encoder with a **BERT-base** text encoder and is trained in two stages:

**Source code is available at https://github.com/tugcantopaloglu/vision-fashion-paper-deeplearning/**

1. **Contrastive pre-training** à la CLIP on the *DeepFashion-MultiModal* dataset
2. **Task-specific fine-tuning** for (i) *Category* classification and (ii) *Attribute* prediction.
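The first stage, CLIP-style contrastive pre-training, aligns matching image and text embeddings while pushing apart mismatched pairs via a symmetric cross-entropy over cosine similarities. A minimal NumPy sketch of that loss is below; the function name and the temperature value are illustrative, not taken from the repository:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style contrastive pre-training.

    img_emb, txt_emb: (batch, dim) arrays; matching image/text pairs
    share the same row index.
    """
    # L2-normalise so the dot product becomes cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = img @ txt.T / temperature

    # Cross-entropy with the diagonal (the true pairs) as targets.
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy check: identical embeddings (perfectly aligned pairs) should
# score a much lower loss than random, unrelated embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
loss_aligned = clip_contrastive_loss(emb, emb, temperature=0.01)
loss_random = clip_contrastive_loss(emb, rng.normal(size=(4, 8)))
```

In the actual model, `img_emb` would come from the ViT-B/32 encoder and `txt_emb` from the BERT-base encoder, with the temperature typically learned during training.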