Update README.md
README.md
CHANGED
@@ -18,40 +18,9 @@ It was introduced in the paper [ColPali: Efficient Document Retrieval with Vision Language Models]
 This version is the untrained base version to guarantee deterministic projection layer initialization.
 
 
-## Usage
-
-> [!WARNING]
-> This version should not be used: it is solely the base version, useful only for deterministic LoRA initialization.
-
-
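The warning above is the key point: a usable retriever is this base plus trained LoRA adapters. Below is a minimal sketch of that pattern with `peft`, assuming the checkpoint loads through plain `transformers` (the released models also ship dedicated classes in `colpali-engine`); both repo IDs are hypothetical placeholders, not taken from this card.

```python
# Sketch only: deterministic base + trained LoRA adapters = usable model.
# BASE_ID and ADAPTER_ID are hypothetical placeholders, not from this card.
import torch
from transformers import AutoModelForVision2Seq
from peft import PeftModel

BASE_ID = "vidore/colsmolvlm-base"      # hypothetical: this untrained base
ADAPTER_ID = "vidore/colsmol-adapters"  # hypothetical: a trained adapter repo

# Load the frozen base (with its deterministically initialized projection layer)...
base = AutoModelForVision2Seq.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
# ...then attach the trained adapters on top of it.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
```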
-## Model Training
-
-### Dataset
-Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
-Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in [*ViDoRe*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) and in the train set, to prevent evaluation contamination.
-A validation set is created with 2% of the samples to tune hyperparameters.
-
-*Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.*
-
-### Parameters
-
-Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters ([LoRA](https://arxiv.org/abs/2106.09685))
-with `alpha=32` and `r=32` on the transformer layers from the language model,
-as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer.
-We train on a 4-GPU setup with data parallelism, a learning rate of 5e-4 with linear decay and 2.5% warmup steps, and a batch size of 8.
-
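Translated into Hugging Face `LoraConfig`/`TrainingArguments`, those hyperparameters would look roughly like the sketch below. This is an illustrative reconstruction, not the authors' training script; the `target_modules` list and the per-device batch-size interpretation are assumptions.

```python
# Illustrative reconstruction of the reported hyperparameters; not the
# authors' training script.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    # Assumption: the card does not list which LM submodules are targeted.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = TrainingArguments(
    output_dir="colsmol-lora",      # hypothetical
    bf16=True,                      # bfloat16 training
    optim="paged_adamw_8bit",       # requires bitsandbytes
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,             # 2.5% warmup steps
    per_device_train_batch_size=8,  # assumption: per-device vs. global unstated
)
```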
-## Usage
-
-This model should not be used directly: it is the base model, used only for initialization of the linear head weights.
-
-## Limitations
-
-- **Focus**: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
-- **Support**: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.
-
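For readers unfamiliar with late interaction, here is a minimal sketch of the ColBERT-style MaxSim scoring the **Support** point refers to; the tensor shapes are illustrative only.

```python
# Minimal sketch of ColBERT-style late interaction (MaxSim) scoring between
# one multi-vector query and one multi-vector page. Shapes are illustrative.
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (n_query_tokens, dim); page_emb: (n_page_patches, dim)."""
    sim = query_emb @ page_emb.T        # every query-token / page-patch similarity
    return sim.max(dim=1).values.sum()  # best page match per query token, summed

query = torch.randn(20, 128)    # e.g. 20 query-token embeddings
page = torch.randn(1024, 128)   # e.g. 1024 image-patch embeddings
print(maxsim_score(query, page))
```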
 ## License
 
-
+ColSmol's vision language backbone model (ColSmolVLM) is under the `apache2.0` license. The adapters attached to the model are under the MIT license.
 
 ## Contact
 