deepset

company
Verified

AI & ML interests

Semantic Search, Language models, Domain adaptation, Question Answering

deepset's activity

anakin87
posted an update 3 days ago
Hey, it has been a while... I was busy participating in 💎 Gemma competition!

Here's the idea: Gemma open models have a large vocabulary size (256K), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.

My submission: 💎🌍🇮🇹 Neogenesis - Post-Training Gemma for Italian and beyond
📓 Kaggle notebook: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond

In this notebook, I show how I improve the performance of Gemma 2 2B on Italian via Post-Training.
I believe this method is adaptable to other languages and model sizes.

Key Steps
📊 Choose reference metrics
🧑‍🔬 Data curation for Instruction Fine Tuning: identify existing datasets + generate synthetic data
🏋️‍♂️ Efficient Instruction Fine Tuning with Spectrum
🧑‍🔬 Data curation for Preference Tuning: identify existing datasets + generate synthetic data
👍👎 Efficient Direct Preference Optimization with Spectrum
📈 Evaluation


🤗 Hugging Face collection (with models and datasets): anakin87/gemma-neogenesis-67824b7bf13ac9cfe091fe2e

I'm also planning a 🎁 Gemma Giveaway (on LinkedIn - https://www.linkedin.com/in/stefano-fiorucci) in the next few days - sharing techniques, datasets, and models I used for my project... so stay tuned! 📻
anakin87
posted an update about 1 month ago
Tulu 3 SFT Mixture by AllenAI is a massive, good, multilingual dataset for fine-tuning Language Models.

Unfortunately, it was missing the "language" column.

I added it using the good old fastText.

Check out the dataset here 👉 anakin87/tulu-3-sft-mixture-with-language
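
A minimal sketch of how such a column could be added. It assumes each row is a dict with a `text` field (a hypothetical schema, not the actual Tulu 3 layout) and that the detector returns fastText-style labels such as `__label__en`. The `detect` callable and `toy_detect` are stand-ins for illustration; with fastText, `detect` would wrap the `predict` method of a loaded language-identification model.

```python
from typing import Callable, Iterable

LABEL_PREFIX = "__label__"  # fastText prefixes predicted labels with this string


def add_language_column(
    rows: Iterable[dict],
    detect: Callable[[str], str],
    text_field: str = "text",  # hypothetical field name
) -> list[dict]:
    """Return copies of the rows with an extra "language" key."""
    tagged = []
    for row in rows:
        # fastText's predict() rejects newlines, so flatten the text first.
        flat_text = row[text_field].replace("\n", " ")
        label = detect(flat_text)
        tagged.append({**row, "language": label.removeprefix(LABEL_PREFIX)})
    return tagged


def toy_detect(text: str) -> str:
    """Stand-in detector for demonstration only (not a real language model)."""
    return "__label__it" if "ciao" in text.lower() else "__label__en"


rows = [{"text": "Ciao, come stai?"}, {"text": "Hello,\nhow are you?"}]
tagged_rows = add_language_column(rows, toy_detect)
```

The original rows are left untouched; only the returned copies carry the new `language` key.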

anakin87
posted an update about 2 months ago
🐝🐝🐝 A Swarm of Agents with Llama 3.2, GPT-4o mini and Claude 3.5 Sonnet

TL;DR: I reimplemented the Swarm concept using Haystack, but made it work with both open and proprietary models 💫

✍️ blog article: https://haystack.deepset.ai/blog/swarm-of-agents
📓 notebook: https://haystack.deepset.ai/cookbook/swarm


Some time ago OpenAI published Swarm: an educational framework for building multi-agent systems.

Their approach focuses on two main concepts:
・ Routines: Each agent follows specific 📜 instructions and uses 🛠️ tools to execute them.
・ Handoffs 🤝: Agents can transfer control to one another using tool/function calling.
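
The two concepts above can be sketched in plain Python. This is not the Haystack or OpenAI Swarm API: the `Agent` class, `run_turn`, and the `decide` callable (a stand-in for the LLM's tool-calling decision) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

Action = tuple[str, str]  # (kind, value); kind is "reply", "tool" or "handoff"


@dataclass
class Agent:
    name: str
    instructions: str  # the routine the agent follows
    tools: dict[str, Callable[[], str]] = field(default_factory=dict)
    handoffs: set[str] = field(default_factory=set)  # agents it may transfer to


def run_turn(
    agents: dict[str, Agent],
    start: str,
    message: str,
    decide: Callable[[Agent, str], Action],
    max_hops: int = 5,
) -> tuple[str, str]:
    """Process one user message and return (answering agent, reply)."""
    current = agents[start]
    for _ in range(max_hops):
        kind, value = decide(current, message)
        if kind == "handoff":
            if value not in current.handoffs:
                raise ValueError(f"{current.name} cannot hand off to {value}")
            current = agents[value]  # transfer control, keep the same message
        elif kind == "tool":
            return current.name, current.tools[value]()
        else:
            return current.name, value
    raise RuntimeError("too many handoffs")


def toy_decide(agent: Agent, message: str) -> Action:
    """Stand-in for an LLM choosing a tool call, a handoff, or a reply."""
    if agent.name == "triage":
        return ("handoff", "sales" if "buy" in message else "repairs")
    if agent.name == "repairs":
        return ("tool", "look_up_item")
    return ("reply", "Happy to help with your purchase!")


agents = {
    "triage": Agent("triage", "Route the user.", handoffs={"sales", "repairs"}),
    "sales": Agent("sales", "Sell products."),
    "repairs": Agent(
        "repairs", "Help with broken products.",
        tools={"look_up_item": lambda: "item located"},
    ),
}
result = run_turn(agents, "triage", "My product is broken", toy_decide)
```

Because a handoff is just another action the model can choose, each agent can be backed by a different model without changing the loop.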


When I first read these ideas, I thought: simple but powerful! And they pair well with the recent unified tool support in Haystack.

🧑‍💻 So, I decided to re-implement these concepts using Haystack, and in just a few lines of code, I had a working prototype.

🆒 Bonus feature: this implementation isn't tied to a single model provider - different agents can be powered by different models!

I replicated the ACME customer service example from the original article, with 3 Agents:
🐝 Triage Agent - Llama 3.2 running on Ollama
🐝 Sales Agent - Anthropic Claude 3.5 Sonnet
🐝 Issues and Repairs Agent - OpenAI GPT-4o mini


Want to see the full implementation and give it a try? Check out the blog post and notebook! ✨

Update README.md

1
#29 opened 2 months ago by
Piterasny
anakin87
posted an update 3 months ago
Ok, you're finally convinced that synthetic data works... ⚗️

Now you want to generate an instruction dataset for fine-tuning in a language other than English.
But how do you get started?

I explore how to do this with Magpie in my new article
https://huggingface.co/blog/anakin87/multilingual-magpie

---

🐦‍⬛ What is Magpie?

It's a recent technique for creating synthetic instruction datasets.

Magpie is based on a simple but ingenious idea 👇
if you prompt an instruction-tuned model with a pre-query template, you can make it generate a plausible user query/instruction

Here's an example:
model: Llama-3-8B-Instruct
pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
generated user instruction: "What are some of the responsibilities of a commercial pilot?"

You can then feed this instruction back into the same model to get the assistant response.

By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.
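
A minimal sketch of this two-step loop. The `generate` callable is a stand-in for a call to the instruction-tuned model, and `magpie_pair`, `build_dataset`, and `toy_generate` are hypothetical names. In practice, step 2 would wrap the instruction in the model's chat template; that is omitted here to keep the sketch short.

```python
from typing import Callable

# Pre-query template from the example above (Llama 3 chat format).
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"


def magpie_pair(generate: Callable[[str], str]) -> dict[str, str]:
    # Step 1: the model completes the empty user turn with an instruction.
    instruction = generate(PRE_QUERY_TEMPLATE).strip()
    # Step 2: the instruction goes back to the same model for the response.
    response = generate(instruction).strip()
    return {"instruction": instruction, "response": response}


def build_dataset(generate: Callable[[str], str], n_samples: int) -> list[dict[str, str]]:
    """Repeat the two-step process to collect instruction/response pairs."""
    return [magpie_pair(generate) for _ in range(n_samples)]


def toy_generate(prompt: str) -> str:
    """Stand-in for a real model call, for demonstration only."""
    if prompt == PRE_QUERY_TEMPLATE:
        return "What are some of the responsibilities of a commercial pilot?"
    return f"(model answer to: {prompt})"


dataset = build_dataset(toy_generate, n_samples=3)
```

With a real model and sampling enabled, each call in step 1 would yield a different instruction, which is what makes the resulting dataset diverse.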

🪄 The authors demonstrate that using these datasets for Supervised Fine Tuning (SFT) can yield strong performance, even competitive with the original instruct model.


🧗 Generating non-English data

Most Language Models are primarily trained on English texts, so they tend to produce data in English.

How can we overcome this?

Earlier approaches were complex or costly.

Then @mrm8488 found a simple solution: add the target language to the pre-query template.
For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".
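
As a small illustration, the language hint can be appended programmatically. The helper name `pre_query_for` is hypothetical, and lowercasing the language name simply mirrors the Spanish example above.

```python
from typing import Optional

BASE_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"


def pre_query_for(language: Optional[str] = None) -> str:
    """Build the pre-query template, optionally steering the target language."""
    if language is None:
        return BASE_TEMPLATE
    # Append "<language>:" so the model continues the user turn in that language.
    return f"{BASE_TEMPLATE}{language.lower()}:"


spanish_template = pre_query_for("Spanish")
```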

This method works for Spanish and German!

❌ Unfortunately, it does not work well for other languages (🇮🇹, 🇳🇱, ...)

👇
anakin87
posted an update 4 months ago
🕵🏻 Agentic RAG with 🦙 Llama 3.2

I was excited to explore Llama 3.2, but as a simple 🇪🇺 EU guy, I don't have access to Meta's multimodal models 😿

🤔 So I thought: why not challenge the small 3B text model with Agentic RAG?

🎯 The plan:
- Build a system that tries to answer questions using a knowledge base.
- If the documents don't contain the answer, use Web search for additional context.
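
The plan can be sketched as a simple fallback, independent of any framework. Everything here is an assumption for illustration: the `retrieve`, `web_search`, and `llm` callables are stand-ins, and the `NO_ANSWER` sentinel is a hypothetical convention where the model is prompted to emit a fixed string when the documents do not contain the answer.

```python
from typing import Callable

NO_ANSWER = "no_answer"  # hypothetical sentinel the LLM is asked to emit


def answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    web_search: Callable[[str], list[str]],
    llm: Callable[[str, list[str]], str],
) -> tuple[str, str]:
    """Return (reply, source), trying the knowledge base before the web."""
    reply = llm(question, retrieve(question))
    if reply.strip().lower() != NO_ANSWER:
        return reply, "knowledge_base"
    # The documents did not contain the answer: fall back to web search.
    return llm(question, web_search(question)), "web"


# Toy stand-ins, for demonstration only.
KB = {"haystack": "Haystack is an open-source LLM orchestration framework."}


def toy_retrieve(question: str) -> list[str]:
    return [text for key, text in KB.items() if key in question.lower()]


def toy_web_search(question: str) -> list[str]:
    return [f"Web result about: {question}"]


def toy_llm(question: str, docs: list[str]) -> str:
    return docs[0] if docs else NO_ANSWER


kb_reply = answer("What is Haystack?", toy_retrieve, toy_web_search, toy_llm)
web_reply = answer("Who won the race?", toy_retrieve, toy_web_search, toy_llm)
```

The routing decision rests entirely on the model following the sentinel instruction, which is why instruction-following ability matters so much for a small model in this setup.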


Check out my experimental notebook here: 📓 https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/llama32_agentic_rag.ipynb


My stack:
🏗️ haystack (https://haystack.deepset.ai/): open-source LLM orchestration framework
🦙 meta-llama/Llama-3.2-3B-Instruct
🦆🌐 free DuckDuckGo API, integrated with Haystack

✨ The results? Encouraging - a few months ago, this level of performance from a small model would've been unthinkable!
This probably reflects the impressive IFEval score of the model (comparable to Llama 3.1 8B).