Joao Gante

joaogante

joaogante's activity

reacted to m-ric's post with ❤️ 4 months ago
๐๐ž๐ฐ ๐๐ž๐œ๐จ๐๐ข๐ง๐  ๐ญ๐ž๐œ๐ก๐ง๐ข๐ช๐ฎ๐ž ๐ข๐ง ๐ญ๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐ฌ๐ข๐ ๐ง๐ข๐Ÿ๐ข๐œ๐š๐ง๐ญ๐ฅ๐ฒ ๐ซ๐ž๐๐ฎ๐œ๐ž๐ฌ ๐ก๐š๐ฅ๐ฅ๐ฎ๐œ๐ข๐ง๐š๐ญ๐ข๐จ๐ง๐ฌ ๐Ÿ‘

DoLa decoding, published as a conference paper at ICLR '24, has just been merged into Transformers by @joaogante and Yung-Sung Chuang.
This new decoding method is simple yet extremely impressive!

Reminder: Decoder LLMs (the GPT kind of LLM, the most common one) generate their outputs one token at a time: at each step, given the current text, they compute a logit for each token in their vocabulary, an unnormalized score that maps to the probability of this token coming next.

Then they either pick the highest logit token (greedy decoding) or sample one with a probability defined by the logits (sampling).
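
To make the two options concrete, here is a minimal sketch in plain Python. The four-token vocabulary and the logits are made up for illustration, and the helper names (`softmax`, `greedy_pick`, `sample_pick`) are hypothetical, not part of any library.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, rng):
    """Sampling: draw an index with probability given by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical vocabulary and made-up logits for a single decoding step.
vocab = ["is", "can", "was", "the"]
logits = [2.0, 1.5, 0.1, -1.0]

print(vocab[greedy_pick(logits)])                    # always "is"
print(vocab[sample_pick(logits, random.Random(0))])  # depends on the seed
```

Greedy decoding is deterministic, while sampling can return any token, with likelier tokens drawn more often.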

The authors of DoLa wanted to improve that simple method.

They built on the established observation that transformer LMs encode low-level information (such as basic syntax) in early layers and higher-level information, such as factual knowledge, in later layers.

💡 This gave them their key idea: during decoding, rather than picking the token with the highest logit, why not pick the token with the most impressive increase in logit across layers?

This gives impressive results:
๐Ÿš€ ๐Ÿฑ% - ๐Ÿฎ๐Ÿฌ% ๐—ฏ๐—ฎ๐˜€๐—ฒ ๐—ฝ๐—ผ๐—ถ๐—ป๐˜๐˜€ ๐—ถ๐—ป๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜€๐—ฒ ๐—ฎ๐—ฐ๐—ฟ๐—ผ๐˜€๐˜€ ๐˜๐—ต๐—ฒ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ๐˜€
๐Ÿš€ For instance on TruthfulQA / Open-ended, across all model sizes the increase in truthfulness is 14 base points, which is ๐—ฎ๐—ฟ๐—ผ๐˜‚๐—ป๐—ฑ ๐Ÿฐ๐Ÿฌ% ๐—ถ๐—บ๐—ฝ๐—ฟ๐—ผ๐˜ƒ๐—ฒ๐—บ๐—ฒ๐—ป๐˜ ๐—ฐ๐—ผ๐—บ๐—ฝ๐—ฎ๐—ฟ๐—ฒ๐—ฑ ๐˜๐—ผ ๐˜€๐˜๐—ฎ๐—ป๐—ฑ๐—ฎ๐—ฟ๐—ฑ ๐—ฑ๐—ฒ๐—ฐ๐—ผ๐—ฑ๐—ถ๐—ป๐—ด!

🤔 Wouldn't decoding take longer because of this added contrasting step? 👉 The runtime increase is negligible, 1 to 8% only.
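
The layer-contrast idea can be sketched in a few lines. This is a simplified illustration, not the Transformers implementation: it assumes a single fixed early layer (DoLa itself selects the early layer dynamically), the logits are made up, and `contrast_layers` and `alpha` are hypothetical names. The `alpha` cutoff keeps only tokens that the final layer already finds plausible, so that rare tokens cannot win on the contrast alone.

```python
import math

def log_softmax(logits):
    """Log-probabilities computed in a numerically stable way."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def contrast_layers(final_logits, early_logits, alpha=0.1):
    """Score each token by how much its log-probability grew between layers."""
    final_lp = log_softmax(final_logits)
    early_lp = log_softmax(early_logits)
    # Plausibility cutoff: final probability must be >= alpha * best probability.
    cutoff = max(final_lp) + math.log(alpha)
    return [
        f - e if f >= cutoff else float("-inf")
        for f, e in zip(final_lp, early_lp)
    ]

# Made-up logits: "Seattle" is already likely in the early layer,
# while "Olympia" only becomes likely in the final layer.
vocab = ["Seattle", "Olympia", "banana"]
final_logits = [2.0, 1.8, -3.0]
early_logits = [2.0, 0.2, -3.0]

scores = contrast_layers(final_logits, early_logits)
best = max(range(len(vocab)), key=lambda i: scores[i])
print(vocab[best])  # Olympia (greedy on final_logits alone would pick Seattle)
```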

Paper added to my collection 👉 m-ric/optimization-mechanics-661d543a5fc6ca1dc84284a0
reacted to alex-abb's post with 🔥 5 months ago
Hi everyone!
I'm Alex, I'm 16, I've been an intern at Hugging Face for a little over a week, and I've already learned a lot about using and prompting LLMs. With @victor as my tutor, I've just finished a Space that analyzes your feelings by prompting an LLM chat model. The aim is to extend it so that it can categorize Hugging Face posts.

alex-abb/LLM_Feeling_Analyzer
posted an update 6 months ago
New sampling strategy dropped in 🤗 transformers -- Min P sampling 🔥

Are you tired of top_k arbitrarily discarding high-quality continuations? Or top_p forgetting to exclude low-probability tokens, derailing your generation? Try out the new min_p flag in generate, fresh from a PR merged today! 🥬

Min P is a dynamic token filter, as opposed to Top K, which keeps the K most likely tokens, and Top P, which keeps the most likely tokens up to a fixed cumulative probability. Both of those are static filters. Min P takes a base probability (set with the min_p flag) and multiplies it by the probability of the most likely next token. All tokens less likely than the resulting value are filtered out. What happens with this strategy?
👉 High-probability token present -> aggressive filter (we don't want to miss out on that high-probability case and risk derailing generation)
👉 No high-probability token present -> relaxed filter (there are many continuation possibilities that the model finds plausible)

You should set min_p to a low value, between 0.05 and 0.1. It behaves particularly well for creative text generation when paired up with temperature > 1.
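
The filter described above fits in a few lines of plain Python. This is a sketch of the rule as stated in this post, not the transformers code path; `min_p_filter` is a hypothetical helper name and both distributions are made up.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * (top probability)."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]  # renormalize the surviving tokens

# One dominant token: the threshold is about 0.09, so only that token survives.
confident = [0.90, 0.05, 0.03, 0.02]
# No dominant token: the threshold is about 0.03, so every token survives.
flat = [0.30, 0.28, 0.22, 0.20]

print(min_p_filter(confident, min_p=0.1))
print(min_p_filter(flat, min_p=0.1))
```

The same min_p value therefore prunes hard when the model is confident and barely at all when it is not.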

Kudos to @kalomaze and @menhguin for creating this technique 🔥 Read their discussion in the original issue for benchmarks (https://github.com/huggingface/transformers/issues/27670)

Copy-pasteable version of the example in the image below here: https://pastebin.com/VqXNtuxd

Have fun experimenting! 😎
posted an update 7 months ago
Adding a long prompt can help you fight LLM hallucinations. However, if you know exactly how you want your LLM output constrained, there are much better strategies! 💪

Did you know you can force your LLM to ALWAYS generate a valid JSON file? Or to follow a well-defined answer template? You can do that and more with the 🤗 transformers-compatible outlines library.

It doesn't only allow you to master your LLM -- your text generation application will also become faster! 🔥 The more constrained your text generation is, the bigger speedups you'll see!
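
To illustrate the general principle, here is a toy sketch of template-constrained decoding. It does not use the outlines API and makes no claim about how outlines works internally: `TEMPLATE`, `fake_model_scores`, and `constrained_generate` are hypothetical, and the scores are made up. One way a speedup can arise, assumed here, is that steps with a single allowed token need no model call at all.

```python
import json

# Hypothetical token-level state machine for the template {"answer": "<yes|no>"}.
# Maps state -> {allowed token: next state}.
TEMPLATE = {
    0: {'{"answer": "': 1},
    1: {"yes": 2, "no": 2},
    2: {'"}': 3},
}
FINAL_STATE = 3

def fake_model_scores(text):
    """Stand-in for an LLM forward pass; returns made-up scores per token."""
    return {'{"answer": "': 0.1, "yes": 1.2, "no": 0.7,
            "maybe": 3.0, '"}': 0.2, "Hello": 2.5}

def constrained_generate():
    state, text, model_calls = 0, "", 0
    while state != FINAL_STATE:
        allowed = TEMPLATE[state]
        if len(allowed) == 1:
            # Only one legal token: emit it without consulting the model.
            token = next(iter(allowed))
        else:
            scores = fake_model_scores(text)
            model_calls += 1
            # Ignore every token the template forbids, however high its score.
            token = max(allowed, key=lambda t: scores[t])
        text += token
        state = allowed[token]
    return text, model_calls

text, calls = constrained_generate()
print(text)              # {"answer": "yes"}
print(json.loads(text))  # parses as valid JSON
print(calls)             # 1
```

The unconstrained favorite ("maybe") never appears, because it is not a legal token in any state of the template.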

Follow @remi and other outlines folks to stay on top of the constrained generation game 🧠
reacted to m-ric's post with ❤️🔥 8 months ago
๐—›๐—ผ๐˜„ ๐—ฑ๐—ผ๐—ฒ๐˜€ ๐—ฏ๐—ฒ๐—ฎ๐—บ ๐˜€๐—ฒ๐—ฎ๐—ฟ๐—ฐ๐—ต ๐—ฑ๐—ฒ๐—ฐ๐—ผ๐—ฑ๐—ถ๐—ป๐—ด ๐˜„๐—ผ๐—ฟ๐—ธ? โžก๏ธ ๐™‰๐™š๐™ฌ ๐™ซ๐™ž๐™จ๐™ช๐™–๐™ก๐™ž๐™ฏ๐™–๐™ฉ๐™ž๐™ค๐™ฃ ๐™ฉ๐™ค๐™ค๐™ก! ๐Ÿ‘€

In Decoder-type LLMs like GPT4 or Mistral-Large, the output is generated one token (=word part) at a time. That's why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.

๐’๐จ ๐ก๐จ๐ฐ ๐ข๐ฌ ๐ญ๐ก๐ž ๐ง๐ž๐ฑ๐ญ ๐ญ๐จ๐ค๐ž๐ง ๐ฌ๐ž๐ฅ๐ž๐œ๐ญ๐ž๐?

๐Ÿ“Š Given its input sentence like "๐˜ž๐˜ฉ๐˜ข๐˜ต ๐˜ช๐˜ด ๐˜ต๐˜ฉ๐˜ฆ 7๐˜ต๐˜ฉ ๐˜๐˜ช๐˜ฃ๐˜ฐ๐˜ฏ๐˜ข๐˜ค๐˜ค๐˜ช ๐˜ฏ๐˜ถ๐˜ฎ๐˜ฃ๐˜ฆ๐˜ณ? ๐˜›๐˜ฉ๐˜ฆ 7๐˜ต๐˜ฉ ๐˜๐˜ช๐˜ฃ๐˜ฐ๐˜ฏ๐˜ข๐˜ค๐˜ค๐˜ช ๐˜ฏ๐˜ถ๐˜ฎ๐˜ฃ๐˜ฆ๐˜ณ", the Decoder LLM generates, for each token in its vocabulary, a score that represents this token's probability of coming next.
For instance: "๐™ž๐™จ" gets score 0.56, and "๐™˜๐™–๐™ฃ" gets score 0.35.

๐Ÿค‘ ๐†๐ซ๐ž๐ž๐๐ฒ ๐๐ž๐œ๐จ๐๐ข๐ง๐  is the naive option where you simply take the next most probable token at each step. But this creates paths that maximize very short-term rewards, thus may overlook better paths for the long term (like this time when you played FIFA all evening and arrived unprepared to your school exam on the next day).
In our example, the next highest score token might be "๐™ž๐™จ", but this will strongly bias the LLM towards giving an hasty response. On the opposite, starting with "๐™˜๐™–๐™ฃ" could have been completed with "๐˜ฃ๐˜ฆ ๐˜ฐ๐˜ฃ๐˜ต๐˜ข๐˜ช๐˜ฏ๐˜ฆ๐˜ฅ ๐˜ง๐˜ณ๐˜ฐ๐˜ฎ ๐˜ค๐˜ฐ๐˜ฎ๐˜ฑ๐˜ถ๐˜ต๐˜ช๐˜ฏ๐˜จ ๐˜ฑ๐˜ณ๐˜ฆ๐˜ท๐˜ช๐˜ฐ๐˜ถ๐˜ด ๐˜๐˜ช๐˜ฃ๐˜ฐ๐˜ฏ๐˜ข๐˜ค๐˜ค๐˜ช ๐˜ฏ๐˜ถ๐˜ฎ๐˜ฃ๐˜ฆ๐˜ณ๐˜ด ๐˜ง๐˜ช๐˜ณ๐˜ด๐˜ต", which steers the LLM towards a correct reasoning!

๐Ÿ—บ๏ธ ๐๐ž๐š๐ฆ ๐ฌ๐ž๐š๐ซ๐œ๐ก improves on greedy decoding by generating at each step several paths - called beams - instead of one. This allows the generation to explore a much larger space, thus find better completions. In our example, both the "๐™ž๐™จ" and the "๐™˜๐™–๐™ฃ" completion could be tested. โœ…

👉 I've created a tool to let you visualize it, thank you @joaogante for your great help!
Try it here: m-ric/beam_search_visualizer
reacted to chiphuyen's post with ❤️🚀 8 months ago
replied to mayank-mishra's post 8 months ago

In transformers the main blocker is backward compatibility -- we assume in many places that batched inputs come with fixed input length. Once we lift this requirement without breaking backward compatibility, it should be a nice addition!

(Perhaps nested tensors will help)

reacted to trisfromgoogle's post with 🤗❤️ 9 months ago
I am thrilled to announce Gemma, new 2B and 7B models from Google, based on the same research and technology used to train the Gemini models! These models achieve state-of-the-art performance for their size, and are launched across Transformers, Google Cloud, and many other surfaces worldwide starting today.

Get started using and adapting Gemma in the model Collection: google/gemma-release-65d5efbccdbb8c4202ec078b

These launches are the product of an outstanding collaboration between the Google DeepMind and Hugging Face teams over the last few months -- very proud of the work both teams have done, from integration with Vertex AI to optimization across the stack. Read more about the partnership in the main launch by @philschmid @osanseviero @pcuenq on the launch blog: https://huggingface.co/blog/gemma

More information below if you are curious about training details, eval results, and safety characteristics!

Gemma Tech Report: https://goo.gle/GemmaReport
Launch announcement: https://blog.google/technology/developers/gemma-open-models/
reacted to JustinLin610's post with ❤️ 9 months ago
Yesterday we released Qwen1.5. Maybe someday I can tell more about the experience. But this is at least a good release, even if it is not yet SOTA. There are not so many SOTA models, by the way. This time, we actually fixed a lot of problems.

1. Context lengths are finally unified for all sizes. Previously, a lot of users kept telling us that 14B only supports 2K (yeah, even dynamic NTK does not work that well, and it can only be extended to around 4-5K, let alone for those who know nothing about how to use dynamic NTK).

2. If you carefully use our base language models, you will find that they understand the special tokens of ChatML, which means that you can directly use LoRA to train on data in ChatML format. Why couldn't you do this before? Because if the base language model does not understand the special tokens, you need to train them, which means turning on training of the embeddings. This is disgusting, and it often leads to problems when you use ZeRO3.

3. We did strengthen our base language models, except for 72. You should find better base language models, especially for 7 and 14. Why not 72? Nah, hard to say, but we will make it better.

4. About the multilingual capabilities. Yes, we finally built our multilingual evaluation system and found that our new base language models perform nicely in multilingual evaluation for base language models. This tells us that we should pay more attention to post-training with multilingual data. And we did that too. This is why, this time, we tell you something about multilingual performance. It is for sure much, much better than our models before this release.

5. Chat models are the most promising stuff. Before this release, we gave you the SFT models. But this time, we have very nice SFT+DPO models. Yeah, not only annotators like them but also users like them. I am sure you developers will feel that way too.

reacted to clefourrier's post with 🤗 9 months ago
🔥 New LLM leaderboard on the hub: NPHardEval!

It uses questions of logic, of different mathematical complexities, as a proxy for reasoning abilities. It notably removes questions relying on arithmetic, to really focus on logical abilities.
What's interesting imo is the potential to really study a model's performance at different levels of complexity.

Bonus: Since the questions can be generated automatically, it's going to be dynamic, updated monthly! 🚀
NPHardEval/NPHardEval-leaderboard

Read more about how their questions are generated in the intro blog: https://huggingface.co/blog/leaderboards-on-the-hub-nphardeval

Congrats to @lizhouf , @wenyueH , @hyfrankl and their teams!
reacted to abidlabs's post with ❤️ 10 months ago
The next version of Gradio will be significantly more efficient (as well as a bit faster) for anyone who uses Gradio's streaming features. Looking at you chatbot developers @oobabooga @pseudotensor :)

The major change concerns streaming: until now, Gradio sent the entire payload at each token. This is generally the most robust way to ensure all the data is correctly transmitted. We've now switched to sending "diffs" --> at each time step, we automatically compute the diff between the most recent updates and then only send the latest token (or whatever the diff may be). Coupled with the fact that we are now using SSE, which is a more robust communication protocol than WS (SSE will resend packets if there are any drops), we should have the best of both worlds: efficient *and* robust streaming.
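
The saving from sending diffs can be illustrated with a small sketch. This is not Gradio's actual wire format or code: `make_diff`, `apply_diff`, and the `op`/`data` fields are hypothetical, chosen only to show the idea for text that grows by appending.

```python
def make_diff(previous, current):
    """Server side: describe how to get from `previous` to `current`."""
    if current.startswith(previous):
        return {"op": "append", "data": current[len(previous):]}
    return {"op": "replace", "data": current}  # fallback: resend everything

def apply_diff(previous, diff):
    """Client side: rebuild the latest value from the last one plus the diff."""
    if diff["op"] == "append":
        return previous + diff["data"]
    return diff["data"]

# Successive states of a streamed chatbot message.
server_states = ["Hello", "Hello wor", "Hello world", "Hello world!"]

sent = client = ""
chars_full = chars_diff = 0
for state in server_states:
    diff = make_diff(sent, state)
    client = apply_diff(client, diff)
    sent = state
    chars_full += len(state)         # cost of resending the whole payload
    chars_diff += len(diff["data"])  # cost of sending only the diff

print(client)                  # Hello world!
print(chars_full, chars_diff)  # 37 12
```

The gap widens with message length, since the full-payload cost grows quadratically with the number of streamed tokens while the diff cost grows linearly.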

Very cool stuff @aliabid94 ! PR: https://github.com/gradio-app/gradio/pull/7102
reacted to clem's post with ❤️ 10 months ago
With the Google announcement last week, I think we're now officially the only AI startup out there who has commercial collaborations with all the major cloud providers (AWS, GCP, Azure) and hardware providers (Nvidia, AMD, Intel, Qualcomm,...), making our vision of being the independent and agnostic platform for all AI builders truer than ever!

Let's go!
replied to their post 10 months ago

@MaziyarPanahi no accuracy penalty at all :) The only catch on the transformers side is that you are limited to a batch size of one (and even that is not a technical limitation of the technique -- we simply haven't built that code path yet)

posted an update 10 months ago
Up to 3x faster LLM generation with no extra resources/requirements - ngram speculation has landed in 🤗 transformers! 🏎️💨

All you need to do is to add prompt_lookup_num_tokens=10 to your generate call, and you'll get faster LLMs 🔥

How does it work? 🤔

Start with assisted generation, where a smaller model generates candidate sequences. The net result is a significant speedup if the main model agrees with the candidate sequences! However, we do require a smaller model trained similarly 😕

The idea introduced (and implemented) by Apoorv Saxena consists of gathering the candidate sequences from the input text itself. If the latest generated ngram is in the input, use the continuation therein as a candidate! No smaller model is required while still achieving significant speedups 🔥

In fact, the penalty of gathering and testing the candidates is so small that you should use this technique whenever possible!
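
The candidate-gathering step can be sketched as a plain n-gram lookup. This is an illustration of the idea rather than the transformers implementation: `find_candidate_tokens`, `ngram_size`, and `num_pred_tokens` are hypothetical names, and words stand in for token ids. The main model would still have to verify the returned candidates, as in assisted generation.

```python
def find_candidate_tokens(tokens, ngram_size=3, num_pred_tokens=10):
    """Find the latest n-gram earlier in the sequence; return what followed it."""
    if len(tokens) <= ngram_size:
        return []
    ngram = tokens[-ngram_size:]
    # Scan backwards from the most recent position, skipping the trailing n-gram.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == ngram:
            follow = start + ngram_size
            return tokens[follow:follow + num_pred_tokens]
    return []  # no match: fall back to regular one-token-at-a-time decoding

# The last three tokens repeat an earlier part of the sequence.
tokens = "the quick brown fox jumps . the quick brown".split()
print(find_candidate_tokens(tokens, ngram_size=3, num_pred_tokens=2))
# ['fox', 'jumps']
```

The lookup only touches a list already in memory, which is why gathering candidates costs so little compared with a model forward pass.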

Here is the code example that produces the outputs shown in the video: https://pastebin.com/bms6XtR4

Have fun 🤗