Upload retriever to Hugging Face Hub
- .gitattributes +3 -0
- README.md +101 -0
- cls.id +1 -0
- config.yaml +37 -0
- database.lmdb/data.mdb +3 -0
- database.lmdb/lock.mdb +0 -0
- indexes/bm25/cls.id +1 -0
- indexes/bm25/config.yaml +10 -0
- indexes/bm25/context_mapping.pkl +3 -0
- indexes/bm25/data.csc.index.npy +3 -0
- indexes/bm25/indices.csc.index.npy +3 -0
- indexes/bm25/indptr.csc.index.npy +3 -0
- indexes/bm25/multi_field_index_config.yaml +5 -0
- indexes/bm25/params.index.json +12 -0
- indexes/bm25/vocab.index.json +3 -0
- indexes/contriever/cls.id +1 -0
- indexes/contriever/config.yaml +62 -0
- indexes/contriever/context_mapping.pkl +3 -0
- indexes/contriever/index.faiss +3 -0
- indexes/contriever/multi_field_index_config.yaml +5 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+database.lmdb/data.mdb filter=lfs diff=lfs merge=lfs -text
+indexes/bm25/vocab.index.json filter=lfs diff=lfs merge=lfs -text
+indexes/contriever/index.faiss filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,101 @@
+---
+language: en
+library_name: FlexRAG
+tags:
+- FlexRAG
+- retrieval
+- search
+- lexical
+- RAG
+- IR
+---
+
+# FlexRAG Retriever
+
+This is a FlexRetriever created with the [`FlexRAG`](https://github.com/ictnlp/FlexRAG) library (version `0.3.0`).
+
+## Retriever Attributes
+The `enwiki_2020_atlas` retriever is a FlexRetriever that provides access to the English Wikipedia corpus from December 2020. It is designed for information retrieval tasks, allowing users to search and retrieve relevant documents based on their queries.
+The corpus of this retriever was created by the [Atlas](https://github.com/facebookresearch/atlas) project and the index was built using the [FlexRAG](https://github.com/ictnlp/FlexRAG) library.
+
+| Corpus Attribute | Value |
+| ---------------- | --------------------------------------------------------------- |
+| Language | English |
+| Domain | Wikipedia |
+| Saved Fields | title, section, text |
+| Size | 33.1M (29.4M text, 3.8M infobox) |
+| Dump Date | Dec 2020 |
+| Provider | [Atlas](https://github.com/facebookresearch/atlas) |
+| License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
+
+
+| Index Attribute | Value |
+| --------------- | --------------------------------------------------------------- |
+| Index Name | bm25 |
+| Index Type | Sparse |
+| Index Method | Lucene |
+| Indexed Fields | title, section, text (concat) |
+| Preprocessing | LengthFilter(min_char=10, max_char=4096) |
+| Provider | [FlexRAG](https://github.com/ictnlp/flexrag) |
+| License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
+
+| Index Attribute | Value |
+| --------------- | --------------------------------------------------------------- |
+| Index Name | contriever |
+| Index Type | Dense |
+| Index Method | IVFPQ |
+| Indexed Fields | title, section, text (concat) |
+| Query Encoder | `facebook/contriever-msmarco` |
+| Passage Encoder | `facebook/contriever-msmarco` |
+| Preprocessing | LengthFilter(min_char=10, max_char=4096) |
+| Provider | [FlexRAG](https://github.com/ictnlp/flexrag) |
+| License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
+
+## Usage
+
+### Installation
+You can install the `FlexRAG` library with `pip`:
+
+```bash
+pip install flexrag faiss-cpu
+```
+
+### Loading the `FlexRAG` retriever
+
+You can use this retriever for information retrieval tasks. Here is an example:
+
+```python
+from flexrag.retriever import LocalRetriever
+
+
+# Load the retriever from the HuggingFace Hub
+retriever = LocalRetriever.load_from_hub("FlexRAG/enwiki_2020_atlas")
+
+
+# You can retrieve relevant documents now
+results = retriever.search("Who is Bruce Wayne?")
+```
+
+### Running the RAG demo with the retriever
+
+You can run the **GUI application** of the RAG assistant with this retriever. Here is an example:
+
+```bash
+python -m flexrag.entrypoints.run_interactive \
+assistant_type=modular \
+modular_config.used_fields=[title,text] \
+modular_config.retriever_type="FlexRAG/enwiki_2020_atlas" \
+modular_config.response_type=original \
+modular_config.generator_type=openai \
+modular_config.openai_config.model_name='gpt-4o-mini' \
+modular_config.openai_config.api_key=$OPENAI_KEY \
+modular_config.do_sample=False
+```
+
+## License
+As the corpus is based on the [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license, the retriever is also licensed under the same license.
+
+
+FlexRAG Related Links:
+* 📚[Documentation](https://flexrag.readthedocs.io/en/latest/)
+* 💻[GitHub Repository](https://github.com/ictnlp/flexrag)
cls.id
ADDED
@@ -0,0 +1 @@
+FlexRetriever
config.yaml
ADDED
@@ -0,0 +1,37 @@
+log_interval: 100000
+top_k: 10
+batch_size: 4096
+query_preprocess_pipeline:
+processor_type: []
+length_filter_config:
+max_tokens: null
+min_tokens: null
+max_chars: null
+min_chars: null
+max_bytes: null
+min_bytes: null
+tokenizer_config:
+tokenizer_type: moses
+hf_tokenizer_path: null
+tiktok_tokenizer_name: null
+lang: null
+token_normalize_config:
+lang: en
+penn: true
+norm_quote_commas: true
+norm_numbers: true
+pre_replace_unicode_punct: false
+post_remove_control_chars: false
+perl_parity: false
+truncate_config:
+max_chars: null
+max_bytes: null
+max_tokens: null
+tokenizer_config:
+tokenizer_type: moses
+hf_tokenizer_path: null
+tiktok_tokenizer_name: null
+lang: null
+retriever_path: null
+indexes_merge_method: linear
+used_indexes: null
database.lmdb/data.mdb
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3e5065b6ccae1bd8fe3f1755183ece35b474acc43c651b0f0171cca166f9715f
+size 31990173696
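The three added lines are a Git LFS pointer, not the LMDB data itself: a spec version, a SHA-256 object id, and the size of the real file in bytes. As a minimal sketch, the pointer can be parsed to see how large the actual download is. The helper name `parse_lfs_pointer` is hypothetical and is not part of FlexRAG or the Hub tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3e5065b6ccae1bd8fe3f1755183ece35b474acc43c651b0f0171cca166f9715f
size 31990173696
"""

info = parse_lfs_pointer(pointer)
size_gib = info["size"] / 2**30  # bytes to GiB
print(f"{info['hash_algo']} {info['digest'][:12]}... {size_gib:.1f} GiB")
```

The same helper applies to the other pointer files in this commit, which makes it easy to sum their sizes before cloning.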
database.lmdb/lock.mdb
ADDED
Binary file (8.19 kB)
indexes/bm25/cls.id
ADDED
@@ -0,0 +1 @@
+BM25Index
indexes/bm25/config.yaml
ADDED
@@ -0,0 +1,10 @@
+log_interval: 10000
+batch_size: 512
+index_path: /data/zhangzhuocheng/Lab/Python/LLM/datasets/RAG/Corpus/enwiki_2020_atlas/flex/indexes/bm25
+method: lucene
+idf_method: null
+backend: auto
+k1: 1.5
+b: 0.75
+delta: 0.5
+lang: english
indexes/bm25/context_mapping.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8c0cd52bb5be194a99594ca416f6b55846e1d3e98f02bde2c3fc83442ef3559c
+size 1046861765
indexes/bm25/data.csc.index.npy
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cd881d5b05243218044350a4ffcfc6a04529b02bfc2825fb84539510e23eb94f
+size 7344164228
indexes/bm25/indices.csc.index.npy
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:06568137080de8c414bec72a29673b5b073dca1f5d71bc1d3acc82b65e941463
+size 7344164228
indexes/bm25/indptr.csc.index.npy
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d717c1e86b0299ebf46cc692346272bae97dd350967993f979461d550634473c
+size 45940392
indexes/bm25/multi_field_index_config.yaml
ADDED
@@ -0,0 +1,5 @@
+indexed_fields:
+- title
+- section
+- text
+merge_method: concat
indexes/bm25/params.index.json
ADDED
@@ -0,0 +1,12 @@
+{
+"k1": 1.5,
+"b": 0.75,
+"delta": 0.5,
+"method": "lucene",
+"idf_method": "lucene",
+"dtype": "float32",
+"int_dtype": "int32",
+"num_docs": 35528846,
+"version": "0.2.11",
+"backend": "numpy"
+}
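To make the stored parameters concrete, here is a rough sketch of how `k1`, `b`, and `num_docs` enter a BM25 score under the Lucene variant as it is usually written in the BM25 literature. Whether the index library evaluates it in exactly this form is an assumption, and the document frequency, document length, and average length below are made-up illustrative values, not statistics from this corpus. The `delta` parameter is not used in this form of the formula.

```python
import math

# Values copied from params.index.json.
K1 = 1.5
B = 0.75
NUM_DOCS = 35_528_846


def lucene_idf(doc_freq: int, num_docs: int = NUM_DOCS) -> float:
    """Lucene-style IDF: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def lucene_tf(tf: float, doc_len: float, avg_doc_len: float,
              k1: float = K1, b: float = B) -> float:
    """Saturating term-frequency component with length normalization."""
    return tf / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))


def bm25_term_score(tf, doc_freq, doc_len, avg_doc_len):
    """Score contribution of one query term for one document."""
    return lucene_idf(doc_freq) * lucene_tf(tf, doc_len, avg_doc_len)


# Hypothetical numbers: a term found in 1,000 passages, occurring 3 times
# in a 120-token passage, with an assumed average passage length of 100.
score = bm25_term_score(tf=3, doc_freq=1_000, doc_len=120, avg_doc_len=100)
print(f"term score: {score:.3f}")
```

A document's score for a query is the sum of these per-term contributions, so rarer terms (small `doc_freq`) and shorter passages weigh more heavily.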
indexes/bm25/vocab.index.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1c15bb4c04602797055fe19a90ba6cea2c08bc96990495d84c3d1d1536682bc7
+size 235687356
indexes/contriever/cls.id
ADDED
@@ -0,0 +1 @@
+FaissIndex
indexes/contriever/config.yaml
ADDED
@@ -0,0 +1,62 @@
+log_interval: 100000
+batch_size: 2048
+index_path: null
+query_encoder_config:
+encoder_type: hf
+cohere_config: null
+hf_config:
+batch_size: 32
+log_interval: 1000
+model_path: facebook/contriever-msmarco
+tokenizer_path: null
+trust_remote_code: false
+device_id:
+- 0
+load_dtype: auto
+max_encode_length: 512
+encode_method: mean
+normalize: false
+prompt: ''
+task: ''
+hf_clip_config: null
+jina_config: null
+ollama_config: null
+openai_config: null
+sentence_transformer_config: null
+passage_encoder_config:
+encoder_type: hf
+cohere_config: null
+hf_config:
+batch_size: 32
+log_interval: 1000
+model_path: facebook/contriever-msmarco
+tokenizer_path: null
+trust_remote_code: false
+device_id:
+- 0
+- 1
+- 2
+- 3
+load_dtype: auto
+max_encode_length: 512
+encode_method: mean
+normalize: false
+prompt: ''
+task: ''
+hf_clip_config: null
+jina_config: null
+ollama_config: null
+openai_config: null
+sentence_transformer_config: null
+distance_function: IP
+index_type: auto
+n_subquantizers: 8
+n_bits: 8
+n_list: 1000
+factory_str: null
+index_train_num: -1
+n_probe: 512
+device_id: []
+k_factor: 10
+polysemous_ht: 0
+efSearch: 100
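The encoder settings in this config (`encode_method: mean`, `normalize: false`) together with `distance_function: IP` suggest that queries and passages are embedded by averaging token vectors and then compared by an unnormalized inner product. The toy sketch below illustrates that idea in plain Python, assuming masked mean pooling. The helper names and the two-dimensional vectors are made up for illustration, and this is not FlexRAG's implementation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens (mask value 1)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def inner_product(a, b):
    """Unnormalized dot product, i.e. the IP distance function."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical token embeddings; the last query token is padding.
query_tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
passage_tokens = [[0.5, 1.0], [1.5, 3.0]]
passage_mask = [1, 1]

query_vec = mean_pool(query_tokens, query_mask)        # [2.0, 1.0]
passage_vec = mean_pool(passage_tokens, passage_mask)  # [1.0, 2.0]
print(inner_product(query_vec, passage_vec))           # 4.0
```

Because the vectors are not normalized, the score depends on vector magnitude as well as direction, which is why the index has to use inner product and not cosine similarity.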
indexes/contriever/context_mapping.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8c0cd52bb5be194a99594ca416f6b55846e1d3e98f02bde2c3fc83442ef3559c
+size 1046861765
indexes/contriever/index.faiss
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4eb013a80c97be41a945ccc02f887651c7c135ab6f65bb0551c4a2e8818c8fea
+size 7130740032
indexes/contriever/multi_field_index_config.yaml
ADDED
@@ -0,0 +1,5 @@
+indexed_fields:
+- title
+- section
+- text
+merge_method: concat