Nicolas Patry

Narsil

AI & ML interests

None yet

Recent Activity

updated a model 9 days ago
hf-internal-testing/llama3-tokenizer

Articles

Organizations

Hugging Face, Safetensors, BigScience Workshop, Hugging Face Internal Testing Organization, superb, Deepmind, Text Generation Inference, BigScience Catalogue Data Dev, HuggingFaceM4, Hugging Face H4, Hugging Face Extreme-Scale, H4 Red Team, Code Llama, gg-hf, On-device Squad, hsramall, Tinkering, gg-tt, Hugging Face Discord Community, Meta Llama, nltpt, s0409, kernels-community

Narsil's activity

reacted to their post with ❤️ 18 days ago
Performance leap: TGI v3 is out. It processes 3x more tokens and is 13x faster than vLLM on long prompts. Zero config!

3x more tokens.

By reducing our memory footprint, we’re able to ingest many more tokens, and more dynamically, than before. A single L4 (24GB) can handle 30k tokens on Llama 3.1-8B, while vLLM barely gets 10k. A lot of work went into reducing the footprint of the runtime, and its effects are best seen in smaller, constrained environments.
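
As a rough back-of-the-envelope illustration of why memory footprint bounds how many tokens fit on a card, the sketch below estimates the KV-cache cost per token. The architecture numbers (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumed from the published Llama 3.1-8B configuration, and the function name is hypothetical. It ignores model weights, activations, and runtime overhead, so it is not how TGI itself accounts for memory.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache needed for one token (assumed 16-bit values by default)."""
    # Each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1-8B shape: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

tokens = 30_000
total_gib = per_token * tokens / 2**30

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total_gib:.2f} GiB of KV cache for {tokens} tokens")
```

Under these assumptions the cache alone is a few GiB at 30k tokens, on top of roughly 16 GB of 16-bit weights, which is why every bit of runtime overhead saved matters on a 24GB card.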
13x faster

On long prompts (200k+ tokens), a conversation reply takes 27.5s in vLLM but only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Daniël de Kok for the beast data structure.
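
To make the "keep the initial conversation around" idea concrete, here is a minimal sketch of prefix caching using a plain trie over token ids. This is an illustration under my own assumptions, not TGI's implementation: the class and method names are hypothetical, and a real server would attach cached attention state to the nodes and evict old entries.

```python
class PrefixCache:
    """Toy prefix cache: a trie keyed on token ids (hypothetical, for illustration)."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        # Record that the state for this token sequence has been computed.
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def longest_prefix(self, tokens):
        # Return how many leading tokens of `tokens` are already cached.
        node = self.root
        matched = 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched


cache = PrefixCache()
conversation = [1, 15, 27, 42, 8]      # token ids of the conversation so far
cache.insert(conversation)

follow_up = conversation + [99, 100]   # same conversation plus a new reply
reused = cache.longest_prefix(follow_up)
to_compute = len(follow_up) - reused

print(f"reused {reused} cached tokens, {to_compute} left to process")
```

Only the tokens past the matched prefix need to be processed, which is where the speedup on long conversations comes from in this simplified picture.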
Zero config

That’s it. Remove all the flags you are using and you’re likely to get the best performance. By evaluating the hardware and model, TGI automatically selects the values that give the best performance. In production, we no longer set any flags in our deployments. We kept all existing flags around, since they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
posted an update about 1 month ago
New activity in huggingchat/chat-ui about 1 month ago

Your feedback on HuggingChat

264
#1 opened over 1 year ago by
victor
reacted to alex-abb's post with 🔥 7 months ago
Hi everyone!
I'm Alex, I'm 16, I've been doing an internship at Hugging Face for a little over a week and I've already learned a lot about using and prompting LLMs. With @victor as my tutor, I've just finished a Space that analyzes your feelings by prompting an LLM chat model. The aim is to extend it so that it can categorize Hugging Face posts.

alex-abb/LLM_Feeling_Analyzer
reacted to mitkox's post with ❤️ 7 months ago
I've made an on-device AI comparison between open source, Apple Intelligence, and Microsoft Copilot+ PC. This OS- and application-level integration will bring GenAI to everyone, be it consumers or businesses, over the next year.

Communities and BigTech hold divergent visions regarding the problems they aim to solve, ways to lock in users and enterprises, as well as their commercialization and GTM strategies.

I'm aware that this table has the potential to expand into an epic 30-page saga during an in-depth analysis, but hey, it's a beginning. Do you think I should throw in a few more comparisons? I'm all ears for your thoughts and critiques!

Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
reacted to dvilasuero's post with 🔥🤗 7 months ago
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we’ve been collaborating with Hugging Face on countless projects: being a launch partner for Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
upvoted an article 7 months ago

🧨 Diffusers welcomes Stable Diffusion 3

93
replied to flashback29's post 8 months ago

Are you sure you're using the appropriate token?
Does it still happen?

If it persists, the error most likely comes from the token not being the one you expect.
If it's really not that, we can double-check things.

New activity in 01-ai/Yi-1.5-34B-Chat 8 months ago

Adding a fast tokenizer

2
#10 opened 8 months ago by
Narsil
New activity in 01-ai/Yi-1.5-9B-Chat 8 months ago

Adding a fast tokenizer.

#10 opened 8 months ago by
Narsil

Add fast tokenizer

#9 opened 8 months ago by
Narsil
posted an update 8 months ago