Diff attention tends to have sparser maps than regular attention, but is not a form of "sparse attention", which is a different concept.
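For intuition, here is a minimal sketch (my own toy illustration, not any paper's reference code) of the distinction: differential attention still computes dense attention over every position and merely subtracts two softmax maps, which tends to push many weights toward zero, whereas sparse attention explicitly masks most positions out.

```python
# Toy contrast between differential attention and sparse attention.
# Shapes and the fixed lambda are simplifications; in the actual method
# lambda is learned, and this is not a reference implementation.
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Dense over all positions; subtracting two maps just makes many weights near-zero."""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

def sparse_attention(q, k, v, window=2):
    """'Sparse attention' proper: positions outside a local window are never attended."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5
    idx = torch.arange(q.size(-2))
    mask = (idx[:, None] - idx[None, :]).abs() > window  # True = masked out entirely
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```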

Mixtral Instruct isn't a base model though? It's IFT
Sounds neat but doesn't load for me
runtime error
curl: (6) Could not resolve host: huggingface.co
Warning: Problem : timeout. Will retry in 1 seconds. (the same error repeats until the retries are exhausted)

Created by enhancing the ARC dataset with AI-generated reasoning from Google's Gemini Pro, this resource aims to improve question answering models' ability to tackle complex science queries.
Features:
- 1068 training examples
- Detailed reasoning steps for nuanced understanding
- Questions spanning physics, chemistry, biology, & more!
Ideal for benchmarking QA models, enhancing model interpretability, and studying in-context examples.
Dive in and help your models learn the art of reasoning!
Explore more: Locutusque/arc-cot
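If you want to poke at it, here is a minimal loading sketch using the standard datasets API; the exact column names aren't shown above, so check the dataset card for the schema.

```python
# Minimal sketch: load the dataset and inspect one training example.
# Field names vary between datasets, so print an example rather than assuming a schema.
from datasets import load_dataset

ds = load_dataset("Locutusque/arc-cot", split="train")
print(len(ds))   # should be on the order of the 1068 examples mentioned above
print(ds[0])     # inspect the question / reasoning / answer fields
```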

vikhyatk/moondream2

This dataset uses a subset of HuggingFaceTB/cosmopedia, a synthetic textbook-quality dataset, and Genstruct to generate user/assistant response pairs.
My current results are mixed, but I'm excited to see how much work is happening around synthetic data generation in the community. The most crucial next step is more work on data filtering from cosmopedia.
Massive thanks to @euclaise @teknium and the other NousResearch folks for sharing this model ❤️
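For reference, a rough sketch of the kind of pipeline described above, assuming the NousResearch/Genstruct-7B checkpoint and a plain transformers generate() call; the [[[Title]]]/[[[Content]]] prompt layout is my assumption from memory of the Genstruct model card, so verify it there before relying on it.

```python
# Rough sketch: generate a user/assistant pair from a cosmopedia passage with Genstruct.
# The prompt layout below is an assumption; check the Genstruct model card for the real format.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Genstruct-7B")
model = AutoModelForCausalLM.from_pretrained("NousResearch/Genstruct-7B", device_map="auto")

# Stream one passage from a cosmopedia subset ("stanford" used here as an example config).
docs = load_dataset("HuggingFaceTB/cosmopedia", "stanford", split="train", streaming=True)
passage = next(iter(docs))["text"][:2000]

prompt = (
    "[[[Title]]] Sample passage\n"
    f"[[[Content]]] {passage}\n\n"
    "The following is an interaction between a user and an AI assistant "
    "that is related to the above text.\n\n[[[User]]] "
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))
```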

- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similarly sized models.
- The Stack v2 is a 4x larger dataset than The Stack v1, resulting in 900B unique code tokens
As always, we released everything from models and datasets to curation code. Enjoy!
StarCoder2 collection: bigcode/starcoder2-65de6da6e87db3383572be1a
Paper: https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view
Blog post: https://huggingface.co/blog/starcoder2
Code Leaderboard: bigcode/bigcode-models-leaderboard
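For anyone who wants to try it quickly, a minimal sketch with transformers; the 3B checkpoint is used to keep memory modest, and the prompt and generation settings are arbitrary.

```python
# Minimal smoke test for StarCoder2 (3B variant shown to keep memory requirements low).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"   # see the collection linked above
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0]))
```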

ChatMusician: Understanding and Generating Music Intrinsically with LLM (2402.16153)
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning of LLaMA2 on a text-compatible music representation, ABC notation, and music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in a zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B-token music-language corpora MusicPile, the collected MusicTheoryBench, code, model, and demo on GitHub.
You ought to link the actual paper https://huggingface.co/papers/2402.13753
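To make the "music as a second language" claim concrete, here is a minimal generation sketch: because ChatMusician emits ABC notation as ordinary text, a plain tokenizer and generate() call are enough. The m-a-p/ChatMusician repo id and the bare-string prompt are assumptions on my part; check the model card for the recommended chat format.

```python
# Sketch: ChatMusician treats ABC notation as plain text, so standard text generation works.
# Repo id and prompt style are assumptions; see the model card for the recommended format.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "m-a-p/ChatMusician"   # assumed repo id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

prompt = "Compose a short folk tune in D major and write it in ABC notation."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))
```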

The holy grail
The one cable to rule them all

"To prevent catastrophic forgetting, I used weight averaging between iterations."
Can you please elaborate!? Tnx
Language models tend to 'forget' information and skills when finetuned for too long. One way to prevent this is, instead of adopting the newly trained weights directly, to average the updated weights with the previous weights at each epoch (or, in this case, each generation+finetuning cycle) and use that average instead of the raw update.
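Here is a minimal sketch of that idea in PyTorch (my paraphrase, not the author's exact procedure):

```python
# Sketch of weight averaging between finetuning iterations: train a candidate update,
# then blend it with the previous weights instead of adopting the raw update.
import torch
import torch.nn as nn

def average_state_dicts(prev_sd, new_sd, alpha=0.5):
    """Elementwise blend of previous and freshly trained weights."""
    return {k: alpha * new_sd[k] + (1 - alpha) * prev_sd[k] for k in prev_sd}

# Toy illustration with a tiny model standing in for the LLM.
model = nn.Linear(8, 8)
prev_sd = {k: v.clone() for k, v in model.state_dict().items()}

# ... one generation + finetuning cycle would update `model` here ...
with torch.no_grad():
    for p in model.parameters():
        p.add_(0.1 * torch.randn_like(p))   # stand-in for a training update

# Load the average of old and new weights instead of the raw update.
model.load_state_dict(average_state_dicts(prev_sd, model.state_dict()))
```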

A related concept is prompt tuning: Before LoRA became common, parameter-efficient tuning was often done by training a soft prompt and prepending it to all sequences
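A minimal sketch of soft prompt tuning, using gpt2 as a small stand-in model: only the prepended "virtual token" embeddings would be trained, while the base model stays frozen.

```python
# Sketch of soft prompt tuning: learn a few virtual-token embeddings and prepend them
# to every sequence, keeping the base model frozen.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
for p in model.parameters():
    p.requires_grad_(False)   # freeze the base model

n_virtual = 8
emb_dim = model.get_input_embeddings().weight.size(1)
soft_prompt = nn.Parameter(torch.randn(n_virtual, emb_dim) * 0.02)  # the only trainable weights

def forward_with_soft_prompt(input_ids):
    tok_emb = model.get_input_embeddings()(input_ids)                   # (B, T, D)
    prefix = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prefix, tok_emb], dim=1)                 # prepend soft prompt
    return model(inputs_embeds=inputs_embeds)

ids = tok("Soft prompts steer a frozen model.", return_tensors="pt").input_ids
out = forward_with_soft_prompt(ids)
print(out.logits.shape)   # (1, n_virtual + seq_len, vocab_size)
```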

No, I did not say it is SOTA. It is impossible for such a small model to be very powerful, but it might be useful in some cases, I guess.
I believe their point is that it's SOTA for its size, not across all sizes

1. Context lengths are finally unified across all sizes. Previously, a lot of users kept telling us that the 14B only supports 2K (yeah, even dynamic NTK does not work that well there and can only extend it to around 4-5K, let alone for those who don't know how to use dynamic NTK at all).
2. If you look carefully at our base language models, you will find that they understand the special tokens of ChatML, which means you can directly use LoRA to train on data in ChatML format (see the sketch after this list). Why couldn't you do this before? Because if the base language model does not understand the special tokens, you need to train them, which means turning on training of the embeddings. That is painful and often leads to problems when you use ZeRO-3.
3. We strengthened our base language models, except for the 72B. You should find the base models better, especially the 7B and 14B. Why not the 72B? Hard to say, but we will make it better.
4. About multilingual capabilities: we finally built up our multilingual evaluation system and found that our new base language models perform well on multilingual evaluations for base models. This told us to pay more attention to post-training with multilingual data, and we did that too. This is why, this time, we can tell you something about multilingual performance. It is for sure much, much better than our models before this release.
5. Chat models are the most promising part. Before this release, we gave you SFT models; this time, we have very nice SFT+DPO models. Not only annotators but also users like them. I am sure you developers will feel that way too.
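Regarding point 2, here is a minimal sketch of what this buys you in practice, assuming a Qwen1.5 base checkpoint and the peft library; the repo id, target modules, and LoRA hyperparameters are illustrative, not a recommendation.

```python
# Sketch for point 2: because the base model's tokenizer and embeddings already cover the
# ChatML special tokens, you can LoRA-finetune on ChatML-formatted text without unfreezing
# the embedding layer. Repo id and LoRA settings below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

name = "Qwen/Qwen1.5-7B"   # assumed base checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

# The ChatML markers already map to real token ids, so no vocabulary resize or
# embedding retraining is needed.
print(tok("<|im_start|>user\nhello<|im_end|>").input_ids)

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # embeddings stay frozen; only LoRA adapters train
```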

Memphis is a series of models that advances what is possible with human-data-only training, offering good performance without relying on proprietary model outputs (e.g. GPT-generated datasets). I've developed a new iterative finetuning procedure to improve the reasoning ability of these models beyond what is possible using only SFT on the same data.
Currently, I've released two models: Memphis-CoT-3B, and Memphis-scribe-3B.
To create these models, I've created new datasets:
- euclaise/reddit-instruct : A dataset of instruction/QA-like data scraped from Reddit. A curated version, filtered using Lilac and neural embedding models, is available at euclaise/reddit-instruct-curated
- euclaise/TinyCoT : TinyCoT is a meta-dataset that aggregates a variety of different human-sourced reasoning data. It is a curated version of my previous MegaCoT dataset euclaise/MegaCoT, which contains 629k responses that get cut down to 28k for TinyCoT. There's also an intermediate version, euclaise/MiniCoT, which has 129k responses.
Memphis-CoT is trained on reddit-instruct, a filtered version of oasst2 sablo/oasst2_curated, and TinyCoT. Multiple iterations were performed on TinyCoT, while reddit-instruct and oasst2 were only used for the initial model.
Memphis-scribe further finetunes Memphis-CoT on more creative tasks. It was finetuned from Memphis-CoT on 18 different datasets, including datasets like euclaise/WritingPrompts_curated, lemonilia/LimaRP, and more.
To prevent catastrophic forgetting, I used weight averaging between iterations.
- euclaise/Memphis-CoT-3B
- euclaise/Memphis-scribe-3B

CroissantLLM is a truly bilingual language model trained on 3 trillion tokens of French and English data. In its size category (<2B), it is the best model in French, but it also rivals the best monolingual English models!
To train it, we collected, filtered and cleaned huge quantities of permissively licensed French data, across various domains (legal, administrative, cultural, scientific) and different text modalities (speech transcriptions, movie subtitles, encyclopedias, forums, webpages)...
Assessing LLM performance is not easy, especially outside of English, and to this end we crafted a novel evaluation benchmark, FrenchBench, aiming to assess the reasoning, factual knowledge, and linguistic capabilities of models in French!
The best current LLMs are hidden behind a shroud of mystery, trained with undisclosed training data mixes or strategies. We go the opposite way, releasing all of the project's artefacts (model checkpoints, data, training details, evaluation benchmarks...). We satisfy 81% of the Stanford FMTI transparency criteria, far ahead of even most open initiatives!
Beyond a powerful industrial resource, our transparent initiative is a stepping stone for many scientific questions! How does teaching a model two languages instead of one split its monolingual ability? Does training on so much French help the model integrate French-centric knowledge and cultural biases? How does the model memorize the training data?
Many more things to say; for those interested, I recommend checking out:
The blog post: https://huggingface.co/blog/manu/croissant-llm-blog
The 45-page report with lots of gems: https://arxiv.org/abs/2402.00786
Models, Data, Demo:


StableLM 3B benchmarks the best, although StableLM 2 1.6B and Qwen 1.8B crush it in GSM8K (albeit with more restrictive licenses).
For small tests I usually use falcon-rw-1b - permissive license, 1.3B params.
MiniMA 2 might be worth trying too - it's pruned from LLaMA, so you get the advantage of being compatible with LLaMA-based frameworks (although I had issues trying to get it to run in vLLM).
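For what it's worth, a minimal smoke-test sketch along those lines with falcon-rw-1b (the tiiuae/falcon-rw-1b repo id and generation settings are just what I'd reach for, not anything canonical):

```python
# Quick smoke test with a small permissively licensed model (falcon-rw-1b, ~1.3B params).
# Older transformers versions may need trust_remote_code=True for Falcon checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "tiiuae/falcon-rw-1b"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```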