Update README.md
Pleias-RAG-350M is not simply a cost-effective version of larger models. We found that it correctly answered several hundred questions from HotPotQA that neither Llama-3-8b nor Qwen-2.5-7b could solve. Consequently, we encourage its use as part of multi-model RAG systems.
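As an illustration of that multi-model idea, here is a minimal sketch (not an official API) of a routing pattern that tries the small model first and escalates to a larger one. It assumes the `generate` interface shown in the deployment section below, and the acceptance check is a hypothetical placeholder:

```python
# Minimal multi-model routing sketch (illustration only, not an official API):
# answer with Pleias-RAG-350M first, escalate to a larger RAG model otherwise.
def answer_with_fallback(query, sources, small_rag, large_rag):
    response = small_rag.generate(query, sources)
    answer = response["processed"]["clean_answer"]
    if answer.strip():  # hypothetical acceptance check; refine for production
        return answer
    return large_rag.generate(query, sources)["processed"]["clean_answer"]
```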
## Use and deployment
The easiest way to deploy Pleias-RAG-350M is through [our official library](https://github.com/Pleias/Pleias-RAG-Library). It features an API-like workflow with standardized export of the structured reasoning/answer output into JSON format. A [Colab notebook](https://colab.research.google.com/drive/1oG0qq0I1fSEV35ezSah-a335bZqmo4_7?usp=sharing) is available for quick tests and experimentation.
A typical minimal example:

```python
# The import below assumes the package name from the official library repo;
# see https://github.com/Pleias/Pleias-RAG-Library for install instructions.
from pleias_rag_interface import RAGWithCitations

rag = RAGWithCitations("PleIAs/Pleias-RAG-350M")

# Define the query and the candidate sources
query = "What is the capital of France?"
sources = [
    {
        "text": "Paris is the capital and most populous city of France.",
        "metadata": {"source": "Geographic Encyclopedia", "reliability": "high"}
    },
    {
        "text": "London is the capital of the United Kingdom",
        "metadata": {"source": "Travel Guide", "year": 2020}
    }
]

# Generate a response
response = rag.generate(query, sources)

# Print the final answer with citations
print(response["processed"]["clean_answer"])
```
With the expected output:
```
The capital of France is Paris. This can be confirmed by the fact that Paris is explicitly stated to be "the capital and most populous city of France" [1].

**Citations**
[1] "Paris is the capital and most populous city of France" [Source 1]
```
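The full `response` object also carries the structured reasoning trace mentioned above. Below is a small sketch for persisting it, assuming the object is a JSON-serializable dictionary (only the `processed`/`clean_answer` keys appear in the example above):

```python
import json

# Persist the full structured output (reasoning trace + answer) to disk.
# Assumes the response dict from the example above is JSON-serializable.
with open("response.json", "w", encoding="utf-8") as f:
    json.dump(response, f, indent=2, ensure_ascii=False)
```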
With only 350 million parameters, Pleias-RAG-350M ranks among the *phone-sized SLMs*, a niche with very few alternatives (SmolLM, Qwen-0.5) and none that currently works well for retrieval-augmented generation.
We also release an unquantized [GGUF version](https://huggingface.co/PleIAs/Pleias-RAG-350M-gguf) for deployment on CPU. Our internal benchmarks suggest that waiting times are acceptable for most use cases, even under constrained RAM: about 20 seconds for a complex generation including reasoning traces on 8 GB of RAM or less. Since the model is unquantized, the quality of text generation should be identical to the original model.
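For a quick CPU test of the GGUF file, any GGUF-compatible runtime should work; the sketch below uses llama-cpp-python as one assumed option, with a hypothetical `filename` pattern (pick the actual file listed in the GGUF repository). The official library above remains the recommended path, since it builds the model's RAG prompt with sources and citations for you.

```python
# CPU-inference sketch with llama-cpp-python (an assumed runtime choice).
# Requires: pip install llama-cpp-python huggingface_hub
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="PleIAs/Pleias-RAG-350M-gguf",
    filename="*.gguf",  # hypothetical glob; pick the actual file in the repo
    n_ctx=4096,         # context window; adjust to your RAM budget
)

# Plain text completion; use the official library for proper RAG prompting.
out = llm("What is the capital of France?", max_tokens=256)
print(out["choices"][0]["text"])
```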