joaogante (Joao Gante)

posted an update 2 months ago

Post

524

Let's go! Custom generation code has landed in transformers 🚀

Have you designed a new cool KV cache? Maybe you're comparing new test-time compute ideas you've been researching? Have you found a way to do diffusion with existing models? You can now easily share your findings with the community with custom generation code, sharing the well-known generate interface 🤓

In a nutshell, we have expanded the support of custom modeling code on the Hub with *model-agnostic* custom generation code. Write for one model, reuse with any model -- hopefully, this will democratize access to new generation ideas 🫡

As a creator, you gain the ability to get your ideas in transformers with minimal effort. You'll also have access to all Hub features: a landing page for your creation, discussions, usage metrics, ... 🤓

💎 Resources 💎
- docs: https://huggingface.co/docs/transformers/generation_strategies#custom-decoding-methods
- minimal example: transformers-community/custom_generate_example
- discussion: transformers-community/support#10

reacted to julien-c's post with 🔥 8 months ago

Post

10931

After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free and, barring blatant abuse, unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team

28 replies


reacted to m-ric's post with ❤️ about 1 year ago

Post

3243

𝐍𝐞𝐰 𝐝𝐞𝐜𝐨𝐝𝐢𝐧𝐠 𝐭𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞 𝐢𝐧 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬 𝐬𝐢𝐠𝐧𝐢𝐟𝐢𝐜𝐚𝐧𝐭𝐥𝐲 𝐫𝐞𝐝𝐮𝐜𝐞𝐬 𝐡𝐚𝐥𝐥𝐮𝐜𝐢𝐧𝐚𝐭𝐢𝐨𝐧𝐬 👏

DoLa decoding, which made a conference paper at ICLR '24, has just been merged in Transformers by @joaogante and Yung-Sung Chuang.
This new decoding method is simple yet extremely impressive!

Reminder: Decoder LLMs (the GPT kind of LLM, the most common one) generate their outputs one token at a time: at each step, given a current text, they compute a logit for each token in their vocabulary that should represent the probability of this token coming next.

Then they either pick the highest logit token (greedy decoding) or sample one with a probability defined by the logits (sampling).
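To make the two options concrete, here is a minimal sketch in plain Python. The logits are turned into probabilities with a softmax; the three-token vocabulary and its scores are made up for illustration, and the function names are hypothetical rather than part of any library.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, rng):
    """Sampling: draw an index with probability softmax(logits)[index]."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
rng = random.Random(0)        # seeded so the run is repeatable
greedy_token = greedy_pick(toy_logits)
sampled_tokens = [sample_pick(toy_logits, rng) for _ in range(1000)]
```

Greedy decoding always returns the same token for the same logits, while sampling returns each token roughly in proportion to its probability.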

The authors of DoLa wanted to improve that simple method.

They relied on the established fact that transformer LMs encode low-level info (like basic syntax) in early layers and higher-level info (like knowledge) in later layers.

💡 This gave them their key idea: During decoding, rather than picking the token with the highest logit, 𝘄𝗵𝘆 𝗻𝗼𝘁 𝗽𝗶𝗰𝗸 𝘁𝗵𝗲 𝘁𝗼𝗸𝗲𝗻 𝘄𝗶𝘁𝗵 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝗶𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗶𝗻𝗰𝗿𝗲𝗮𝘀𝗲 𝗶𝗻 𝗹𝗼𝗴𝗶𝘁 𝗮𝗰𝗿𝗼𝘀𝘀 𝗹𝗮𝘆𝗲𝗿𝘀?
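A rough sketch of that idea in plain Python, under simplifying assumptions: we pretend the logits of one early layer and of the final layer are already available, and we contrast their log-probabilities. The plausibility cutoff `alpha`, the function names, and the toy logits are illustrative choices, and the early layer is fixed here, whereas the actual method chooses the layer to contrast against dynamically.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits (numerically stable)."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def contrast_layers(early_logits, final_logits, alpha=0.1):
    """Score each token by how much its log-probability grew from the early
    layer to the final layer. Tokens whose final-layer probability is below
    alpha times the top probability are excluded (score of -inf)."""
    early = log_softmax(early_logits)
    final = log_softmax(final_logits)
    cutoff = max(final) + math.log(alpha)
    return [
        f - e if f >= cutoff else float("-inf")
        for e, f in zip(early, final)
    ]

# Hypothetical logits for a 4-token vocabulary.
early_logits = [3.0, 1.0, 0.5, 0.0]
final_logits = [3.2, 3.0, 0.5, -4.0]

scores = contrast_layers(early_logits, final_logits)
contrast_token = max(range(len(scores)), key=lambda i: scores[i])
greedy_token = max(range(len(final_logits)), key=lambda i: final_logits[i])
```

In this toy case greedy decoding picks token 0, which was already dominant in the early layer, while the contrast picks token 1, whose probability rose the most between the two layers.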

This gives impressive results:
🚀 𝟱% - 𝟮𝟬% 𝗯𝗮𝘀𝗲 𝗽𝗼𝗶𝗻𝘁𝘀 𝗶𝗻𝗰𝗿𝗲𝗮𝘀𝗲 𝗮𝗰𝗿𝗼𝘀𝘀 𝘁𝗵𝗲 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀
🚀 For instance on TruthfulQA / Open-ended, across all model sizes the increase in truthfulness is 14 base points, which is 𝗮𝗿𝗼𝘂𝗻𝗱 𝟰𝟬% 𝗶𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁 𝗰𝗼𝗺𝗽𝗮𝗿𝗲𝗱 𝘁𝗼 𝘀𝘁𝗮𝗻𝗱𝗮𝗿𝗱 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴!

🤔 Wouldn't decoding take longer because of this added contrasting step? 👉 𝗧𝗵𝗲 𝗿𝘂𝗻𝘁𝗶𝗺𝗲 𝗶𝗻𝗰𝗿𝗲𝗮𝘀𝗲 𝗶𝘀 𝗻𝗲𝗴𝗹𝗶𝗴𝗶𝗯𝗹𝗲, 𝟭 𝘁𝗼 𝟴% 𝗼𝗻𝗹𝘆.

Paper added to my collection 👉 m-ric/optimization-mechanics-661d543a5fc6ca1dc84284a0

2 replies


reacted to alex-abb's post with 🔥 about 1 year ago

Post

4940

Hi everyone!
I'm Alex, I'm 16, I've been an intern at Hugging Face for a little over a week and I've already learned a lot about using and prompting LLMs. With @victor as my tutor, I've just finished a Space that analyzes your feelings by prompting an LLM chat model. The aim is to extend it so that it can categorize Hugging Face posts.

alex-abb/LLM_Feeling_Analyzer

5 replies


posted an update about 1 year ago

Post

3858

New sampling strategy dropped in 🤗 transformers -- Min P sampling 🔥

Are you tired of having top_k arbitrarily discarding high-quality continuations? Or top_p forgetting to exclude low-probability tokens, derailing your generation? Try out the new min_p flag in generate, fresh from a PR merged today! 🥬

Min P consists of a dynamic token filter -- as opposed to Top K, which keeps the K most likely tokens, and Top P, which keeps the most likely tokens up to a fixed cumulative probability, both static filters. Min P takes a base probability (defined in the min_p flag) and multiplies it by the probability of the most likely token in the distribution for the next token. All tokens less likely than the resulting value are filtered. What happens with this strategy?
👉 High probability token present -> aggressive filter (we don't want to miss on that high-probability case and risk derailing generation)
👉 No high probability token present -> relaxed filter (there are many continuation possibilities that the model finds plausible)
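The filter described above can be sketched in a few lines of plain Python. This version works directly on probabilities for readability and is only an illustration, not the code merged in the PR; the function name and the two toy distributions are made up.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p * max(probs),
    then renormalize the survivors so they sum to 1."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# Hypothetical next-token distributions.
peaked = [0.90, 0.05, 0.03, 0.02]  # one high-probability token
flat = [0.30, 0.25, 0.25, 0.20]    # many plausible continuations

peaked_kept = sum(p > 0 for p in min_p_filter(peaked, 0.1))
flat_kept = sum(p > 0 for p in min_p_filter(flat, 0.1))
```

With `min_p=0.1`, the peaked distribution has a threshold of 0.09 and keeps a single token, while the flat one has a threshold of 0.03 and keeps all four.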

You should set min_p to a low value, between 0.05 and 0.1. It behaves particularly well for creative text generation when paired up with temperature > 1.

Kudos to @kalomaze and @menhguin for creating this technique 🔥 Read their discussion in the original issue for benchmarks (https://github.com/huggingface/transformers/issues/27670)

Copy-pasteable version of the example in the image below here: https://pastebin.com/VqXNtuxd

Have fun experimenting! 😎

posted an update over 1 year ago

Post

2711

Adding a long prompt can help you fight LLM hallucinations. However, if you know exactly how you want your LLM output constrained, there are much better strategies! 💪

Did you know you can force your LLM to ALWAYS generate a valid JSON file? Or to follow a well-defined answer template? You can do that and more with the 🤗 transformers-compatible outlines library.

It doesn't only allow you to master your LLM -- your text generation application will also become faster! 🔥 The more constrained your text generation is, the bigger speedups you'll see!
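To give a feel for the mechanism, here is a toy sketch in plain Python. It is not how outlines is implemented: the template, the stand-in scoring function, and all names are hypothetical. At each step only the tokens the template allows can be chosen, and when a single token is allowed the scoring call is skipped, which is one plausible source of the speedup mentioned above.

```python
import json

def constrained_decode(score_fn, allowed_per_step):
    """Pick, at each step, the best-scoring token among the allowed ones.
    When only one token is allowed, the scoring call is skipped."""
    output, model_calls = [], 0
    for allowed in allowed_per_step:
        if len(allowed) == 1:
            output.append(allowed[0])  # forced token, no scoring needed
            continue
        scores = score_fn(output)
        model_calls += 1
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        output.append(best)
    return output, model_calls

def toy_scores(prefix):
    """Hypothetical stand-in for a model: it would rather start with free text."""
    return {"Sure": 5.0, "true": 2.0, "false": 1.0, "{": 0.5}

# Template for a tiny JSON object: {"ok":true} or {"ok":false}
template = [["{"], ['"ok"'], [":"], ["true", "false"], ["}"]]

tokens, calls = constrained_decode(toy_scores, template)
result = "".join(tokens)
parsed = json.loads(result)  # succeeds because the template only yields valid JSON
```

Even though the stand-in model prefers the token "Sure", the output is always valid JSON, and only one of the five steps needed a scoring call.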

Follow @remi and other outlines folks to stay on top of the constrained generation game 🧠

reacted to m-ric's post with ❤️🔥 over 1 year ago

Post

1723

𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗯𝗲𝗮𝗺 𝘀𝗲𝗮𝗿𝗰𝗵 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝘄𝗼𝗿𝗸? ➡️ 𝙉𝙚𝙬 𝙫𝙞𝙨𝙪𝙖𝙡𝙞𝙯𝙖𝙩𝙞𝙤𝙣 𝙩𝙤𝙤𝙡! 👀

In Decoder-type LLMs like GPT4 or Mistral-Large, the output is generated one token (=word part) at a time. That's why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.

𝐒𝐨 𝐡𝐨𝐰 𝐢𝐬 𝐭𝐡𝐞 𝐧𝐞𝐱𝐭 𝐭𝐨𝐤𝐞𝐧 𝐬𝐞𝐥𝐞𝐜𝐭𝐞𝐝?

📊 Given its input sentence like "𝘞𝘩𝘢𝘵 𝘪𝘴 𝘵𝘩𝘦 7𝘵𝘩 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳? 𝘛𝘩𝘦 7𝘵𝘩 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳", the Decoder LLM generates, for each token in its vocabulary, a score that represents this token's probability of coming next.
For instance: "𝙞𝙨" gets score 0.56, and "𝙘𝙖𝙣" gets score 0.35.

🤑 𝐆𝐫𝐞𝐞𝐝𝐲 𝐝𝐞𝐜𝐨𝐝𝐢𝐧𝐠 is the naive option where you simply take the most probable next token at each step. But this creates paths that maximize very short-term rewards and may therefore overlook better long-term paths (like that time you played FIFA all evening and arrived unprepared for your school exam the next day).
In our example, the highest-scoring next token might be "𝙞𝙨", but this will strongly bias the LLM towards giving a hasty response. Conversely, starting with "𝙘𝙖𝙣" could lead to the completion "𝘣𝘦 𝘰𝘣𝘵𝘢𝘪𝘯𝘦𝘥 𝘧𝘳𝘰𝘮 𝘤𝘰𝘮𝘱𝘶𝘵𝘪𝘯𝘨 𝘱𝘳𝘦𝘷𝘪𝘰𝘶𝘴 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳𝘴 𝘧𝘪𝘳𝘴𝘵", which steers the LLM towards correct reasoning!

🗺️ 𝐁𝐞𝐚𝐦 𝐬𝐞𝐚𝐫𝐜𝐡 improves on greedy decoding by generating at each step several paths - called beams - instead of one. This allows the generation to explore a much larger space, thus find better completions. In our example, both the "𝙞𝙨" and the "𝙘𝙖𝙣" completion could be tested. ✅
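Here is a minimal beam search sketch in plain Python. The "model" is a hypothetical lookup table: the 0.56 and 0.35 scores come from the example above, and every other number and token is invented for illustration.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"is": 0.56, "can": 0.35, "was": 0.09},
    ("is",): {"13": 0.4, "8": 0.3, "21": 0.3},
    ("can",): {"be": 0.9, "not": 0.1},
    ("was",): {"13": 0.5, "8": 0.5},
}

def beam_search(model, num_beams, steps):
    """Keep the num_beams highest-scoring partial sequences at each step.
    A sequence's score is the sum of its token log-probabilities."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in model[tokens].items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

# With a single beam, beam search reduces to greedy decoding.
greedy_result = beam_search(TOY_MODEL, num_beams=1, steps=2)[0]
beam_result = beam_search(TOY_MODEL, num_beams=2, steps=2)[0]
```

With one beam the search commits to "is" and ends on a sequence with probability 0.224; with two beams it also explores "can" and finds "can be", whose probability is 0.315.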

👉 I've created a tool to let you visualize it, thank you @joaogante for your great help!
𝙏𝙧𝙮 𝙞𝙩 𝙝𝙚𝙧𝙚: m-ric/beam_search_visualizer

reacted to chiphuyen's post with ❤️🚀 over 1 year ago

Post

Huggingface is carrying the AI open source ecosystem https://huyenchip.com/2024/03/14/ai-oss.html

4 replies


replied to mayank-mishra's post over 1 year ago

In transformers the main blocker is backward compatibility -- we assume in many places that batched inputs come with fixed input length. Once we lift this requirement without breaking backward compatibility, it should be a nice addition! 👍

(Perhaps nested tensors will help)

reacted to trisfromgoogle's post with 🤗❤️ over 1 year ago

Post

I am thrilled to announce Gemma, new 2B and 7B models from Google, based on the same research and technology used to train the Gemini models! These models achieve state-of-the-art performance for their size, and are launched across Transformers, Google Cloud, and many other surfaces worldwide starting today.

Get started using and adapting Gemma in the model Collection: google/gemma-release-65d5efbccdbb8c4202ec078b

These launches are the product of an outstanding collaboration between the Google DeepMind and Hugging Face teams over the last few months -- very proud of the work both teams have done, from integration with Vertex AI to optimization across the stack. Read more about the partnership in the main launch by @philschmid @osanseviero @pcuenq on the launch blog: https://huggingface.co/blog/gemma

More information below if you are curious about training details, eval results, and safety characteristics!

Gemma Tech Report: https://goo.gle/GemmaReport
Launch announcement: https://blog.google/technology/developers/gemma-open-models/

6 replies


reacted to JustinLin610's post with ❤️ over 1 year ago

Post

Yesterday we released Qwen1.5. Maybe someday I can tell more about the experience. But this is at least a good release, even if it is not yet SOTA. There are not that many SOTA models, by the way. This time, we actually fixed a lot of problems.

1. Context lengths are finally unified for all sizes. Previously, a lot of users kept telling us that 14B only supports 2K (yeah, even dynamic NTK does not work that well and it can only extend the context to around 4-5K, let alone for those who know nothing about how to use dynamic NTK).

2. If you carefully use our base language models, you will find that they understand the special tokens of ChatML, which means that you can directly use LoRA to train on data in ChatML format. Why couldn't you do this before? Because if the base language model does not understand the special tokens, you need to train them, which means turning on training of the embeddings. This is disgusting and it often leads to problems when you use ZeRO3.

3. We did strengthen our base language models, except for 72B. You should find better base language models, especially for 7B and 14B. Why not 72B? Nah, hard to say, but we will make it better.

4. About the multilingual capabilities. Yes, we finally built our multilingual evaluation system and found that our new base language models perform well in multilingual evaluation for base language models. This told us that we should pay more attention to post-training with multilingual data, and we did that too. This is why, this time, we tell you something about multilingual performance. It is for sure much, much better than our models before this release.

5. Chat models are the most promising stuff. Before this release, we gave you the SFT models. But this time, we had very nice SFT+DPO models. Yeah not only annotators like them but also users like them. I am sure you developers will feel that way too.

5 replies


reacted to clefourrier's post with 🤗 over 1 year ago

Post

🔥 New LLM leaderboard on the hub: NPHardEval!

It uses questions of logic, of different mathematical complexities, as a proxy for reasoning abilities. It notably removes questions relying on arithmetic, to really focus on logical abilities.
What's interesting imo is the potential to really study a model's performance at different levels of complexity.

Bonus: Since the questions can be generated automatically, it's going to be dynamic, updated monthly! 🚀
NPHardEval/NPHardEval-leaderboard

Read more about how their questions are generated in the intro blog: https://huggingface.co/blog/leaderboards-on-the-hub-nphardeval

Congrats to @lizhouf , @wenyueH , @hyfrankl and their teams!

reacted to abidlabs's post with ❤️ over 1 year ago

Post

The next version of Gradio will be significantly more efficient (as well as a bit faster) for anyone who uses Gradio's streaming features. Looking at you chatbot developers @oobabooga @pseudotensor :)

The major change concerns how data is streamed. Gradio used to send the entire payload at each token, which is generally the most robust way to ensure all the data is correctly transmitted. We've now switched to sending "diffs": at each time step, we automatically compute the diff between the two most recent updates and only send the latest token (or whatever the diff may be). Coupled with the fact that we are now using SSE, which is a more robust communication protocol than WS (SSE will resend packets if there are any drops), we should have the best of both worlds: efficient *and* robust streaming.
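A toy sketch of the difference, in plain Python. It assumes append-only text, so the diff is simply the new suffix; Gradio's actual diff format is not described in this post and presumably handles more general changes. All function names are hypothetical.

```python
def stream_full(tokens):
    """Old behaviour as described above: resend the whole text at each step."""
    text = ""
    for token in tokens:
        text += token
        yield text

def stream_diffs(tokens):
    """New behaviour as described above: send only what was appended."""
    previous = ""
    for full_text in stream_full(tokens):
        yield full_text[len(previous):]
        previous = full_text

def apply_diffs(diffs):
    """Receiver side: rebuild the full text from the diffs."""
    text = ""
    for diff in diffs:
        text += diff
    return text

tokens = ["Hello", ",", " world", "!"]
full_payload = sum(len(message) for message in stream_full(tokens))
diff_payload = sum(len(message) for message in stream_diffs(tokens))
rebuilt = apply_diffs(stream_diffs(tokens))
```

For this four-token message, resending everything costs 36 characters in total, while the diffs cost 13, and the receiver still ends up with the same text.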

Very cool stuff @aliabid94 ! PR: https://github.com/gradio-app/gradio/pull/7102

reacted to clem's post with ❤️ over 1 year ago

Post

With the Google announcement last week, I think we're now officially the only AI startup out there who has commercial collaborations with all the major cloud providers (AWS, GCP, Azure) and hardware providers (Nvidia, AMD, Intel, Qualcomm,...), making our vision of being the independent and agnostic platform for all AI builders truer than ever!

Let's go!

replied to their post over 1 year ago

@MaziyarPanahi no accuracy penalty at all :) The only catch on the transformers side is that you are limited to a batch size of one (and even that is not a technical limitation of the technique -- we simply haven't built that code path yet)

posted an update over 1 year ago

Post

Up to 3x faster LLM generation with no extra resources/requirements - ngram speculation has landed in 🤗 transformers! 🏎️💨

All you need to do is to add prompt_lookup_num_tokens=10 to your generate call, and you'll get faster LLMs 🔥

How does it work? 🤔

Start with assisted generation, where a smaller model generates candidate sequences. The net result is a significant speedup if the model agrees with the candidate sequences! However, we do require a smaller model trained similarly 😕

The idea introduced (and implemented) by Apoorv Saxena consists of gathering the candidate sequences from the input text itself. If the latest generated ngram is in the input, use the continuation therein as a candidate! No smaller model is required while still achieving significant speedups 🔥
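The candidate-gathering step can be sketched in plain Python as follows. This is an illustration, not the code that landed in transformers: the function name, the n-gram size, and the token ids are made up, and for simplicity it searches the whole sequence so far rather than only the prompt. The candidate it returns would still have to be checked by the model, as in assisted generation.

```python
def find_candidate(token_ids, ngram_size=2, num_candidates=10):
    """Look for an earlier occurrence of the last ngram_size tokens and
    return the tokens that followed it, as a draft continuation."""
    if len(token_ids) < ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan from the most recent position to the oldest, skipping the tail itself.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return token_ids[follow:follow + num_candidates]
    return []

# Hypothetical token ids: the sequence ends with (7, 8), which also
# appeared earlier, followed by 9, 10, 11.
sequence = [1, 7, 8, 9, 10, 11, 2, 3, 7, 8]
candidate = find_candidate(sequence, ngram_size=2, num_candidates=3)
no_match = find_candidate([1, 2, 3, 4], ngram_size=2)
```

When the text repeats itself, as in summarization or code editing, the draft is often accepted; when there is no match the function returns an empty list and generation would simply proceed one token at a time.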

In fact, the penalty of gathering and testing the candidates is so small that you should use this technique whenever possible!

Here is the code example that produces the outputs shown in the video: https://pastebin.com/bms6XtR4

Have fun 🤗

3 replies


Joao Gante

AI & ML interests

Recent Activity

Organizations

joaogante's activity