# Update README.md
All the benchmarks assess only the "trivial" mode, on questions requiring some form of multi-hop reasoning over sources (the answer is spread across several sources) as well as the discrimination of distractor sources.

## Deployment

The easiest way to deploy Pleias-RAG-1B is through [our official library](https://github.com/Pleias/Pleias-RAG-Library). It features an API-like workflow with standardized export of the structured reasoning/answer output in JSON format. A [Colab Notebook](https://colab.research.google.com/drive/1oG0qq0I1fSEV35ezSah-a335bZqmo4_7?usp=sharing) is available for quick tests and experimentation.
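
As a rough illustration of that workflow, here is a minimal sketch. The package and class names (`rag_library`, `RAGWithCitations`), the `generate` call and the response fields are assumptions for illustration; check the library's own README for the authoritative API.

```python
# Hypothetical sketch of the library workflow. The names below
# (RAGWithCitations, generate, response["processed"]["clean_answer"]) are
# assumptions; see https://github.com/Pleias/Pleias-RAG-Library for the
# authoritative API.
from rag_library import RAGWithCitations

# Load the model (Hugging Face repo id assumed for illustration).
rag = RAGWithCitations("PleIAs/Pleias-RAG-1B")

# A query plus pre-retrieved sources, including a distractor source
# the model is expected to discard.
query = "What is the capital of France?"
sources = [
    {"text": "Paris is the capital and most populous city of France.",
     "metadata": {"source": "encyclopedia entry 1"}},
    {"text": "Lyon is a major city in the Auvergne-Rhone-Alpes region.",
     "metadata": {"source": "encyclopedia entry 2"}},  # distractor
]

# The structured reasoning/answer output comes back as JSON-like data.
response = rag.generate(query, sources)
print(response["processed"]["clean_answer"])
```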
With 1.2B parameters, Pleias-RAG-1B can be readily deployed in many constrained infrastructures, including desktop systems on CPU RAM.

We also release an unquantized GGUF version for deployment on CPU. Our internal performance benchmarks suggest that waiting times are currently acceptable for most use cases, even under constrained RAM: about 20 seconds for a complex generation including reasoning traces on 8 GB of RAM or less. Since the model is unquantized, the quality of text generation should be identical to that of the original model.
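
Any llama.cpp-compatible runtime can serve that GGUF file. Below is a minimal CPU-only sketch using the `llama-cpp-python` bindings; the GGUF filename is a placeholder for the released file, and the prompt placeholder must be replaced with the model's actual query/sources format (the official library above handles that formatting for you).

```python
# Minimal CPU-only sketch with llama-cpp-python (pip install llama-cpp-python).
# The model_path is a placeholder for the GGUF file shipped with this release.
from llama_cpp import Llama

llm = Llama(
    model_path="pleias-rag-1b.gguf",  # path to the released GGUF file
    n_ctx=4096,       # context window large enough for query + sources
    n_threads=4,      # tune to the number of physical CPU cores
)

output = llm(
    "<query and sources, formatted as the model card specifies>",
    max_tokens=1024,   # leave room for the reasoning trace plus the answer
    temperature=0.0,   # deterministic decoding for reproducible answers
)
print(output["choices"][0]["text"])
```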