nadiinchi committed
Commit 8b44a1f · verified · 1 parent: 94d8405

Update README.md

Files changed (1)
1. README.md +5 -3
README.md CHANGED

@@ -30,12 +30,14 @@ pipeline_tag: text-ranking
 
 # Model Card for XProvence-reranker
 
+<img src="https://cdn-uploads.huggingface.co/production/uploads/6273df31c3b822dad2d1eef2/4n-bxYfiMPC2LoLM2m7pg.png" alt="image/png" width="600">
+
 XProvence is a zero-cost **context pruning model** that integrates seamlessly with rerankers for retrieval-augmented generation,
 particularly **optimized for question answering**. Given a user question and a retrieved passage, XProvence **removes sentences
 from the passage that are not relevant to the user question**. This **speeds up generation** and **reduces context noise**, in
 a plug-and-play manner **for any LLM**.
 
-XProvence extends Provence by supporting 16 languages natively. It also supports 100+ languages through cross-lingual transfer, since
+XProvence is a multilingual version of [Provence](https://huggingface.co/naver/provence-reranker-debertav3-v1), supporting 16 languages natively. It also supports 100+ languages through cross-lingual transfer, since
 it is based on BGE-m3, which is pretrained on 100+ languages.
 
 
@@ -111,8 +113,8 @@ Interface of the `process` function:
 * Input: user question (e.g., a sentence) + retrieved context passage (e.g., a paragraph)
 * Output: pruned context passage, i.e., irrelevant sentences are removed + relevance score (can be used for reranking)
 * Model Architecture: The model was initialized from [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) and finetuned with two objectives: (1) output a binary mask which can be used to prune irrelevant sentences; and (2) preserve the initial reranking capabilities.
-* Training data: training queries from [MIRACL](https://huggingface.co/datasets/miracl/miracl) and documents from [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia), with synthetic silver labelling of which sentences to keep, produced using [aya-expanse-8b](https://huggingface.co/CohereLabs/aya-expanse-8b).
-* Languages covered: Arabic, Bengali, English, Spanish, Persian, Finnish, French, Hindi, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, Thai, Chinese
+* Training data: [MIRACL](https://huggingface.co/datasets/miracl/miracl), with synthetic silver labelling of which sentences to keep, produced using [aya-expanse-8b](https://huggingface.co/CohereLabs/aya-expanse-8b).
+* Languages in the training data: Arabic, Bengali, English, Spanish, Persian, Finnish, French, Hindi, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, Thai, Chinese
 * Context length: 8192 tokens (same as the pretrained BGE-m3 model)
 * Evaluation: we evaluate XProvence on 26 languages from 6 datasets. We find that XProvence prunes irrelevant sentences with little-to-no drop in performance across all languages, and outperforms existing baselines on the Pareto front.
 
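
For reference, here is a minimal usage sketch of the `process` interface described in the second hunk above. It assumes the XProvence checkpoint follows the original [Provence](https://huggingface.co/naver/provence-reranker-debertav3-v1) API: loading via `AutoModel` with `trust_remote_code=True`, and a `process(question, context)` method that returns the pruned passage plus a relevance score. The repo id, the output key names, and the example strings are illustrative assumptions, not taken from this card.

```python
from transformers import AutoModel

# Hypothetical repo id for this card's model; substitute the actual one.
# The `process` interface mirrors the documented Provence API and is
# assumed to carry over to XProvence.
model = AutoModel.from_pretrained("naver/XProvence-reranker", trust_remote_code=True)

question = "What goes on the bottom of shepherd's pie?"
context = (
    "Shepherd's pie is a traditional British dish. "    # likely pruned
    "Its base is minced lamb stewed with vegetables. "  # likely kept
    "It is usually served hot in the winter months."    # likely pruned
)

out = model.process(question, context)

# Provence returns a dict with the pruned passage and a relevance score
# (key names as in the Provence model card; assumed unchanged here).
print(out["pruned_context"])   # passage with irrelevant sentences removed
print(out["reranking_score"])  # relevance score, usable for reranking
```

Because the output is plain text plus a score, the pruned passage can be swapped into any LLM prompt in place of the full passage, and the score can drive reranking, which is what makes the pruning plug-and-play.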
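The "two objectives" in the Model Architecture bullet can be pictured as a cross-encoder with two heads. The sketch below is a conceptual illustration under that assumption, not the actual XProvence implementation: a token-level head emits the binary keep/drop mask over the passage, while a sequence-level head emits the reranking score that the finetuning tries to preserve.

```python
import torch.nn as nn
from transformers import AutoModel

class PruningReranker(nn.Module):
    """Conceptual dual-head cross-encoder: sentence-pruning mask + reranking score."""

    def __init__(self, backbone_name: str = "BAAI/bge-reranker-v2-m3"):
        super().__init__()
        # AutoModel loads the underlying XLM-RoBERTa encoder of the reranker.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        self.mask_head = nn.Linear(hidden, 2)  # per-token keep/drop logits
        self.rank_head = nn.Linear(hidden, 1)  # relevance score from [CLS]

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        mask_logits = self.mask_head(h)  # (batch, seq_len, 2) -> pruning mask
        score = self.rank_head(h[:, 0])  # (batch, 1) -> reranking score
        return mask_logits, score
```

At inference, the per-token keep/drop decisions would be aggregated per sentence to decide which sentences to remove, matching the card's description of sentence-level pruning.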