tonywu71 committed
Commit 650243e · verified · 1 Parent(s): fdd3024

Update README.md

Files changed (1)
  1. README.md +1 -32
README.md CHANGED
@@ -18,40 +18,9 @@ It was introduced in the paper [ColPali: Efficient Document Retrieval with Vision Language Models
  This version is the untrained base version to guarantee deterministic projection layer initialization.
 
 
- ## Usage
-
- > [!WARNING]
- > This version should not be used: it is solely the base version, useful only for deterministic LoRA initialization.
-
-
- ## Model Training
-
- ### Dataset
- Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
- Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in [*ViDoRe*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) and in the train set to prevent evaluation contamination.
- A validation set is created with 2% of the samples to tune hyperparameters.
-
- *Note: Multilingual data is present in the pretraining corpus of the language model and most probably in its multimodal training.*
-
- ### Parameters
-
- Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters ([LoRA](https://arxiv.org/abs/2106.09685))
- with `alpha=32` and `r=32` on the transformer layers from the language model,
- as well as on the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer.
- We train on a 4-GPU setup with data parallelism, a learning rate of 5e-4 with linear decay and 2.5% warmup steps, and a batch size of 8.
-
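For illustration, here is a minimal sketch of how the adapter and optimizer settings described above might be expressed with Hugging Face `peft` and `transformers`. The attention-projection module names, the `custom_text_proj` layer name, and the per-device reading of the batch size are assumptions, not values stated in the README.

```python
# Hypothetical sketch of the adapter and optimizer setup described above.
# Assumptions (not stated in the README): the target module names, the
# projection layer name `custom_text_proj`, and batch size per device.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,                                  # LoRA rank, as stated
    lora_alpha=32,                         # LoRA alpha, as stated
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    modules_to_save=["custom_text_proj"],  # assumed name of the projection layer
)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    bf16=True,                             # train in bfloat16
    optim="paged_adamw_8bit",              # optimizer named in the README
    learning_rate=5e-4,
    lr_scheduler_type="linear",            # linear decay
    warmup_ratio=0.025,                    # 2.5% warmup steps
    per_device_train_batch_size=8,         # "batch size of 8"; per-device is an assumption
)
```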
- ## Usage
-
- This model should not be used as-is: it is the base model, used only for the initialisation of the linear head weights.
-
- ## Limitations
-
- - **Focus**: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- - **Support**: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism (sketched below), which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.
-
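For context on the Support limitation above, here is a minimal sketch of ColBERT-style late-interaction ("MaxSim") scoring, which is what a vector retrieval framework must support to serve such models. The function name and shapes are illustrative, not an API from this repository.

```python
# Minimal sketch of ColBERT-style late interaction ("MaxSim") scoring:
# every query-token embedding is compared against every document-patch
# embedding, so a document is represented by many vectors, not one.
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """q_emb: (num_query_tokens, dim); d_emb: (num_doc_patches, dim)."""
    sim = q_emb @ d_emb.T               # (num_query_tokens, num_doc_patches)
    return sim.max(dim=1).values.sum()  # best-matching patch per query token, summed

# Toy usage with random embeddings (dim chosen arbitrarily for the example):
q = torch.randn(16, 128)    # 16 query-token embeddings
d = torch.randn(1024, 128)  # 1024 document-patch embeddings
print(maxsim_score(q, d))
```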
  ## License
 
- ColQwen2's vision language backbone model (Qwen2-VL) is under `apache2.0` license. The adapters attached to the model are under MIT license.
+ ColSmol's vision language backbone model (SmolVLM) is under `apache2.0` license. The adapters attached to the model are under MIT license.
 
  ## Contact