Diff attention tends to have sparser maps than regular attention, but is not a form of "sparse attention", which is a different concept.
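For intuition, here is a minimal sketch (my own toy illustration, not any paper's reference code) of the distinction: differential attention still computes dense attention over every position and merely subtracts two softmax maps, which tends to push many weights toward zero, whereas sparse attention explicitly masks most positions out.

```python
# Toy contrast between differential attention and sparse attention.
# Shapes and the fixed lambda are simplifications; in the actual method
# lambda is learned, and this is not a reference implementation.
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Dense over all positions; subtracting two maps just makes many weights near-zero."""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

def sparse_attention(q, k, v, window=2):
    """'Sparse attention' proper: positions outside a local window are never attended."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5
    idx = torch.arange(q.size(-2))
    mask = (idx[:, None] - idx[None, :]).abs() > window  # True = masked out entirely
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```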

Mixtral Instruct isn't a base model though? It's IFT
Sounds neat but doesn't load for me
runtime error
curl: (6) Could not resolve host: huggingface.co
Warning: Problem : timeout. Will retry in 1 seconds. (the same error repeats until the retries are exhausted)

Created by enhancing the ARC dataset with AI-generated reasoning from Google's Gemini Pro, this resource aims to improve question answering models' ability to tackle complex science queries.
Features:
- 1068 training examples
- Detailed reasoning steps for nuanced understanding
- Questions spanning physics, chemistry, biology, & more!
Ideal for benchmarking QA models, enhancing model interpretability, and studying in-context examples.
Dive in and help your models learn the art of reasoning!
Explore more: Locutusque/arc-cot
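If you want to poke at it, here is a minimal loading sketch using the standard datasets API; the exact column names aren't shown above, so check the dataset card for the schema.

```python
# Minimal sketch: load the dataset and inspect one training example.
# Field names vary between datasets, so print an example rather than assuming a schema.
from datasets import load_dataset

ds = load_dataset("Locutusque/arc-cot", split="train")
print(len(ds))   # should be on the order of the 1068 examples mentioned above
print(ds[0])     # inspect the question / reasoning / answer fields
```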

vikhyatk/moondream2

This dataset uses a subset of HuggingFaceTB/cosmopedia, a synthetic textbook-quality dataset, and Genstruct to generate user/assistant response pairs.
My current results are mixed, but I'm excited to see how much work is happening around synthetic data generation in the community. The most crucial next step is more work on data filtering from cosmopedia.
Massive thanks to @euclaise @teknium and the other NousResearch folks for sharing this model ❤️
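For reference, a rough sketch of the kind of pipeline described above, assuming the NousResearch/Genstruct-7B checkpoint and a plain transformers generate() call; the [[[Title]]]/[[[Content]]] prompt layout is my assumption from memory of the Genstruct model card, so verify it there before relying on it.

```python
# Rough sketch: generate a user/assistant pair from a cosmopedia passage with Genstruct.
# The prompt layout below is an assumption; check the Genstruct model card for the real format.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Genstruct-7B")
model = AutoModelForCausalLM.from_pretrained("NousResearch/Genstruct-7B", device_map="auto")

# Stream one passage from a cosmopedia subset ("stanford" used here as an example config).
docs = load_dataset("HuggingFaceTB/cosmopedia", "stanford", split="train", streaming=True)
passage = next(iter(docs))["text"][:2000]

prompt = (
    "[[[Title]]] Sample passage\n"
    f"[[[Content]]] {passage}\n\n"
    "The following is an interaction between a user and an AI assistant "
    "that is related to the above text.\n\n[[[User]]] "
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))
```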

- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similarly sized models.
- The Stack v2 is a 4x larger dataset than The Stack v1, resulting in 900B unique code tokens
As always, we released everything from models and datasets to curation code. Enjoy!
StarCoder2 collection: bigcode/starcoder2-65de6da6e87db3383572be1a
Paper: https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view
Blog post: https://huggingface.co/blog/starcoder2
Code Leaderboard: bigcode/bigcode-models-leaderboard
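For anyone who wants to try it quickly, a minimal sketch with transformers; the 3B checkpoint is used to keep memory modest, and the prompt and generation settings are arbitrary.

```python
# Minimal smoke test for StarCoder2 (3B variant shown to keep memory requirements low).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"   # see the collection linked above
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0]))
```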

ChatMusician: Understanding and Generating Music Intrinsically with LLM (2402.16153)
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning of LLaMA2 on a text-compatible music representation, ABC notation, and music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in a zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B-token music-language corpora MusicPile, the collected MusicTheoryBench, code, model, and demo on GitHub.
You ought to link the actual paper https://huggingface.co/papers/2402.13753
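To make the "music as a second language" claim concrete, here is a minimal generation sketch: because ChatMusician emits ABC notation as ordinary text, a plain tokenizer and generate() call are enough. The m-a-p/ChatMusician repo id and the bare-string prompt are assumptions on my part; check the model card for the recommended chat format.

```python
# Sketch: ChatMusician treats ABC notation as plain text, so standard text generation works.
# Repo id and prompt style are assumptions; see the model card for the recommended format.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "m-a-p/ChatMusician"   # assumed repo id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

prompt = "Compose a short folk tune in D major and write it in ABC notation."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))
```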

The holy grail
The one cable to rule them all

"To prevent catastrophic forgetting, I used weight averaging between iterations."
Can you please elaborate!? Tnx
Language models tend to 'forget' information and skills when finetuned for too long. One way to prevent this is, instead of adopting the newly trained weights directly, to average the updated weights with the previous weights at each epoch (or, in this case, each generation+finetuning cycle) and use that average instead of the raw update.
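Here is a minimal sketch of that idea in PyTorch (my paraphrase, not the author's exact procedure):

```python
# Sketch of weight averaging between finetuning iterations: train a candidate update,
# then blend it with the previous weights instead of adopting the raw update.
import torch
import torch.nn as nn

def average_state_dicts(prev_sd, new_sd, alpha=0.5):
    """Elementwise blend of previous and freshly trained weights."""
    return {k: alpha * new_sd[k] + (1 - alpha) * prev_sd[k] for k in prev_sd}

# Toy illustration with a tiny model standing in for the LLM.
model = nn.Linear(8, 8)
prev_sd = {k: v.clone() for k, v in model.state_dict().items()}

# ... one generation + finetuning cycle would update `model` here ...
with torch.no_grad():
    for p in model.parameters():
        p.add_(0.1 * torch.randn_like(p))   # stand-in for a training update

# Load the average of old and new weights instead of the raw update.
model.load_state_dict(average_state_dicts(prev_sd, model.state_dict()))
```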

A related concept is prompt tuning: Before LoRA became common, parameter-efficient tuning was often done by training a soft prompt and prepending it to all sequences
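A minimal sketch of soft prompt tuning, using gpt2 as a small stand-in model: only the prepended "virtual token" embeddings would be trained, while the base model stays frozen.

```python
# Sketch of soft prompt tuning: learn a few virtual-token embeddings and prepend them
# to every sequence, keeping the base model frozen.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
for p in model.parameters():
    p.requires_grad_(False)   # freeze the base model

n_virtual = 8
emb_dim = model.get_input_embeddings().weight.size(1)
soft_prompt = nn.Parameter(torch.randn(n_virtual, emb_dim) * 0.02)  # the only trainable weights

def forward_with_soft_prompt(input_ids):
    tok_emb = model.get_input_embeddings()(input_ids)                   # (B, T, D)
    prefix = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prefix, tok_emb], dim=1)                 # prepend soft prompt
    return model(inputs_embeds=inputs_embeds)

ids = tok("Soft prompts steer a frozen model.", return_tensors="pt").input_ids
out = forward_with_soft_prompt(ids)
print(out.logits.shape)   # (1, n_virtual + seq_len, vocab_size)
```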

No, I did not say it is SOTA. It is impossible for such a small model to be very powerful, but it might be useful in some cases, I guess.
I believe their point is that it's SOTA for its size, not across all sizes

1. Context lengths are finally unified across all sizes. Previously, a lot of users kept telling us that the 14B only supports 2K (yeah, even dynamic NTK does not work that well there and can only extend it to around 4-5K, let alone for those who don't know how to use dynamic NTK at all).
2. If you look carefully at our base language models, you will find that they understand the special tokens of ChatML, which means you can directly use LoRA to train on data in ChatML format (see the sketch after this list). Why couldn't you do this before? Because if the base language model does not understand the special tokens, you need to train them, which means turning on training of the embeddings. That is painful and often leads to problems when you use ZeRO-3.
3. We strengthened our base language models, except for the 72B. You should find the base models better, especially the 7B and 14B. Why not the 72B? Hard to say, but we will make it better.
4. About multilingual capabilities: we finally built up our multilingual evaluation system and found that our new base language models perform well on multilingual evaluations for base models. This told us to pay more attention to post-training with multilingual data, and we did that too. This is why, this time, we can tell you something about multilingual performance. It is for sure much, much better than our models before this release.
5. Chat models are the most promising part. Before this release, we gave you SFT models; this time, we have very nice SFT+DPO models. Not only annotators but also users like them. I am sure you developers will feel that way too.
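Regarding point 2, here is a minimal sketch of what this buys you in practice, assuming a Qwen1.5 base checkpoint and the peft library; the repo id, target modules, and LoRA hyperparameters are illustrative, not a recommendation.

```python
# Sketch for point 2: because the base model's tokenizer and embeddings already cover the
# ChatML special tokens, you can LoRA-finetune on ChatML-formatted text without unfreezing
# the embedding layer. Repo id and LoRA settings below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

name = "Qwen/Qwen1.5-7B"   # assumed base checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

# The ChatML markers already map to real token ids, so no vocabulary resize or
# embedding retraining is needed.
print(tok("<|im_start|>user\nhello<|im_end|>").input_ids)

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # embeddings stay frozen; only LoRA adapters train
```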

Memphis is a series of models that advances what is possible with human-data-only training, offering good performance without relying on proprietary model outputs (e.g. GPT-generated datasets). I've developed a new iterative finetuning procedure to improve the reasoning ability of these models beyond what is possible using only SFT on the same data.
Currently, I've released two models: Memphis-CoT-3B, and Memphis-scribe-3B.
To create these models, I've created new datasets:
- euclaise/reddit-instruct : A dataset of instruction/QA-like data scraped from Reddit. A curated version, filtered using Lilac and neural embedding models, is available at euclaise/reddit-instruct-curated
- euclaise/TinyCoT : TinyCoT is a meta-dataset that aggregates a variety of different human-sourced reasoning data. It is a curated version of my previous MegaCoT dataset euclaise/MegaCoT, which contains 629k responses that get cut down to 28k for TinyCoT. There's also an intermediate version, euclaise/MiniCoT, which has 129k responses.
Memphis-CoT is trained on reddit-instruct, a filtered version of oasst2 sablo/oasst2_curated, and TinyCoT. Multiple iterations were performed on TinyCoT, while reddit-instruct and oasst2 were only used for the initial model.
Memphis-scribe further finetunes Memphis-CoT on more creative tasks. It was finetuned from Memphis-CoT on 18 different datasets, including datasets like euclaise/WritingPrompts_curated, lemonilia/LimaRP, and more.
To prevent catastrophic forgetting, I used weight averaging between iterations.
- euclaise/Memphis-CoT-3B
- euclaise/Memphis-scribe-3B

CroissantLLM is a truly bilingual language model trained on 3 trillion tokens of French and English data. In its size category (<2B), it is the best model in French, but it also rivals the best monolingual English models!
To train it, we collected, filtered and cleaned huge quantities of permissively licensed French data, across various domains (legal, administrative, cultural, scientific) and different text modalities (speech transcriptions, movie subtitles, encyclopedias, forums, webpages)...
Assessing LLM performance is not easy, especially outside of English, and to this end we crafted a novel evaluation benchmark, FrenchBench, aiming to assess the reasoning, factual knowledge, and linguistic capabilities of models in French!
The best current LLMs are hidden behind a shroud of mystery, trained with undisclosed training data mixes or strategies. We go the opposite way, releasing all of the project's artefacts (model checkpoints, data, training details, evaluation benchmarks...). We satisfy 81% of the Stanford FMTI transparency criteria, far ahead of even most open initiatives!
Beyond a powerful industrial resource, our transparent initiative is a stepping stone for many scientific questions! How does teaching a model two languages instead of one split its monolingual ability? Does training on so much French help the model integrate French-centric knowledge and cultural biases? How does the model memorize the training data?
Many more things to say; for those interested, I recommend checking out:
The blog post: https://huggingface.co/blog/manu/croissant-llm-blog
The 45-page report with lots of gems: https://arxiv.org/abs/2402.00786
Models, Data, Demo:


StableLM 3B benchmarks the best, although StableLM 2 1.6B and Qwen 1.8B crush it in GSM8K (albeit with more restrictive licenses).
For small tests I usually use falcon-rw-1b - permissive license, 1.3B params.
MiniMA 2 might be worth trying too - it's pruned from LLaMA, so you get the advantage of being compatible with LLaMA-based frameworks (although I had issues trying to get it to run in vLLM).
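For what it's worth, a minimal smoke-test sketch along those lines with falcon-rw-1b (the tiiuae/falcon-rw-1b repo id and generation settings are just what I'd reach for, not anything canonical):

```python
# Quick smoke test with a small permissively licensed model (falcon-rw-1b, ~1.3B params).
# Older transformers versions may need trust_remote_code=True for Falcon checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "tiiuae/falcon-rw-1b"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```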