Pclanglais committed on
Commit e9e50d8 · verified · 1 Parent(s): d80b341

Update README.md

Files changed (1)
  1. README.md +34 -1
README.md CHANGED
@@ -87,9 +87,42 @@ All the benchmarks only assess the "trivial" mode on questions requiring some fo

Pleias-RAG-350M is not simply a cost-effective version of larger models. We found that it correctly answered several hundred questions from HotPotQA that neither Llama-3-8b nor Qwen-2.5-7b could solve. Consequently, we encourage its use as part of multi-model RAG systems.

- ## Deployment
+ ## Use and deployment
The easiest way to deploy Pleias-RAG-350M is through [our official library](https://github.com/Pleias/Pleias-RAG-Library). It features an API-like workflow with standardized export of the structured reasoning/answer output in JSON format. A [Colab Notebook](https://colab.research.google.com/drive/1oG0qq0I1fSEV35ezSah-a335bZqmo4_7?usp=sharing) is available for quick tests and experimentation.

+ A typical minimal example:
+
+ ```python
+ # RAGWithCitations is provided by the official Pleias-RAG-Library
+ # (https://github.com/Pleias/Pleias-RAG-Library); see its README for
+ # installation and the exact import path.
+ rag = RAGWithCitations("PleIAs/Pleias-RAG-350M")
+
+ # Define the query and the retrieved sources
+ query = "What is the capital of France?"
+ sources = [
+     {
+         "text": "Paris is the capital and most populous city of France.",
+         "metadata": {"source": "Geographic Encyclopedia", "reliability": "high"}
+     },
+     {
+         "text": "London is the capital of the United Kingdom",
+         "metadata": {"source": "Travel Guide", "year": 2020}
+     }
+ ]
+
+ # Generate a response
+ response = rag.generate(query, sources)
+
+ # Print the final answer with citations
+ print(response["processed"]["clean_answer"])
+ ```
+
+ With expected output:
+ ```
+ The capital of France is Paris. This can be confirmed by the fact that Paris is explicitly stated to be "the capital and most populous city of France" [1].
+
+ **Citations**
+ [1] "Paris is the capital and most populous city of France" [Source 1]
+ ```
+
With only 350 million parameters, Pleias-RAG-350M is classified among *phone-sized SLMs*, a niche with very few alternatives (SmolLM, Qwen-0.5) and none that currently works well for retrieval-augmented generation.

We also release an unquantized [GGUF version](https://huggingface.co/PleIAs/Pleias-RAG-350M-gguf) for deployment on CPU. Our internal performance benchmarks suggest that waiting times are acceptable for most uses, even under constrained RAM: about 20 seconds for a complex generation including reasoning traces with 8 GB of RAM or less. Since the model is unquantized, text generation quality should be identical to that of the original model.
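
As a rough sketch of CPU-only use, the GGUF file can be loaded with any llama.cpp-compatible runtime. The snippet below uses `llama-cpp-python`; the repository id comes from the link above, while the GGUF filename and the generation parameters are placeholders (assumptions) to adapt to the actual files listed on the model page.

```python
# CPU-only inference sketch with llama-cpp-python (pip install llama-cpp-python).
# "pleias-rag-350m.gguf" is a hypothetical filename: check the files listed at
# https://huggingface.co/PleIAs/Pleias-RAG-350M-gguf and use the real one.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="PleIAs/Pleias-RAG-350M-gguf",
    filename="pleias-rag-350m.gguf",  # hypothetical; replace with the actual GGUF file
    n_ctx=4096,                       # room for the query plus retrieved sources
    verbose=False,
)

# Plain completion call. The structured RAG prompt (query, sources, citation
# instructions) is normally assembled by the official Pleias-RAG-Library,
# so this only illustrates that the GGUF file runs on CPU.
result = llm("What is the capital of France?", max_tokens=256)
print(result["choices"][0]["text"])
```

Actual latency will vary with hardware, prompt length, and sampling settings.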