Update README.md
Pleias-RAG-350M is not simply a cost-effective version of larger models. We found that it correctly answered several hundred questions from HotPotQA that neither Llama-3-8b nor Qwen-2.5-7b could solve. Consequently, we encourage its use as part of multi-model RAG systems.
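As an illustration of that multi-model idea, here is a minimal sketch (not an official API) of a routing pattern that tries the small model first and escalates to a larger one. It assumes the `generate` interface shown in the deployment section below, and the acceptance check is a hypothetical placeholder:

```python
# Minimal multi-model routing sketch (illustration only, not an official API):
# answer with Pleias-RAG-350M first, escalate to a larger RAG model otherwise.
def answer_with_fallback(query, sources, small_rag, large_rag):
    response = small_rag.generate(query, sources)
    answer = response["processed"]["clean_answer"]
    if answer.strip():  # hypothetical acceptance check; refine for production
        return answer
    return large_rag.generate(query, sources)["processed"]["clean_answer"]
```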
## Use and deployment
The easiest way to deploy Pleias-RAG-350M is through [our official library](https://github.com/Pleias/Pleias-RAG-Library). It features an API-like workflow with standardized export of the structured reasoning/answer output into JSON format. A [Colab notebook](https://colab.research.google.com/drive/1oG0qq0I1fSEV35ezSah-a335bZqmo4_7?usp=sharing) is available for quick tests and experimentation.
A typical minimal example:

```python
# The import below assumes the package name from the official library repo;
# see https://github.com/Pleias/Pleias-RAG-Library for install instructions.
from pleias_rag_interface import RAGWithCitations

rag = RAGWithCitations("PleIAs/Pleias-RAG-350M")

# Define the query and the candidate sources
query = "What is the capital of France?"
sources = [
    {
        "text": "Paris is the capital and most populous city of France.",
        "metadata": {"source": "Geographic Encyclopedia", "reliability": "high"}
    },
    {
        "text": "London is the capital of the United Kingdom",
        "metadata": {"source": "Travel Guide", "year": 2020}
    }
]

# Generate a response
response = rag.generate(query, sources)

# Print the final answer with citations
print(response["processed"]["clean_answer"])
```
With the expected output:
```
The capital of France is Paris. This can be confirmed by the fact that Paris is explicitly stated to be "the capital and most populous city of France" [1].

**Citations**
[1] "Paris is the capital and most populous city of France" [Source 1]
```
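The full `response` object also carries the structured reasoning trace mentioned above. Below is a small sketch for persisting it, assuming the object is a JSON-serializable dictionary (only the `processed`/`clean_answer` keys appear in the example above):

```python
import json

# Persist the full structured output (reasoning trace + answer) to disk.
# Assumes the response dict from the example above is JSON-serializable.
with open("response.json", "w", encoding="utf-8") as f:
    json.dump(response, f, indent=2, ensure_ascii=False)
```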
With only 350 million parameters, Pleias-RAG-350M ranks among the *phone-sized SLMs*, a niche with very few alternatives (SmolLM, Qwen-0.5) and none that currently works well for retrieval-augmented generation.
We also release an unquantized [GGUF version](https://huggingface.co/PleIAs/Pleias-RAG-350M-gguf) for deployment on CPU. Our internal benchmarks suggest that waiting times are acceptable for most use cases, even under constrained RAM: about 20 seconds for a complex generation including reasoning traces on 8 GB of RAM or less. Since the model is unquantized, the quality of text generation should be identical to the original model.
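For a quick CPU test of the GGUF file, any GGUF-compatible runtime should work; the sketch below uses llama-cpp-python as one assumed option, with a hypothetical `filename` pattern (pick the actual file listed in the GGUF repository). The official library above remains the recommended path, since it builds the model's RAG prompt with sources and citations for you.

```python
# CPU-inference sketch with llama-cpp-python (an assumed runtime choice).
# Requires: pip install llama-cpp-python huggingface_hub
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="PleIAs/Pleias-RAG-350M-gguf",
    filename="*.gguf",  # hypothetical glob; pick the actual file in the repo
    n_ctx=4096,         # context window; adjust to your RAM budget
)

# Plain text completion; use the official library for proper RAG prompting.
out = llm("What is the capital of France?", max_tokens=256)
print(out["choices"][0]["text"])
```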