anakin87
posted an update 3 days ago
Hey, it has been a while... I was busy participating in the 💎 Gemma competition!

Here's the idea: Gemma open models have a large vocabulary size (256K), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.

My submission: 💎🌍🇮🇹 Neogenesis - Post-Training Gemma for Italian and beyond
📓 Kaggle notebook: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond

In this notebook, I show how I improved the performance of Gemma 2 2B on Italian via post-training.
I believe this method is adaptable to other languages and model sizes.

๐˜’๐˜ฆ๐˜บ ๐˜š๐˜ต๐˜ฆ๐˜ฑ๐˜ด
๐Ÿ“Š Choose reference metrics
๐Ÿง‘โ€๐Ÿ”ฌ Data curation for Instruction Fine Tuning: identify existing datasets + generate synthetic data
๐Ÿ‹๏ธโ€โ™‚๏ธ Efficient Instruction Fine Tuning with Spectrum
๐Ÿง‘โ€๐Ÿ”ฌ Data curation for Preference Tuning: identify existing datasets + generate synthetic data
๐Ÿ‘๐Ÿ‘Ž Efficient Direct Preference Optimization with Spectrum
๐Ÿ“ˆ Evaluation
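
To make the two "efficient" steps a bit more concrete, here is a minimal sketch in plain Python. It is not the code from the notebook. The first function illustrates the general idea behind Spectrum (train only the modules with the highest signal-to-noise ratio and freeze the rest), and the second computes the standard DPO loss for a single preference pair. The function names, the module names, and the way SNR scores are passed in are assumptions made for illustration only; computing the SNR scores from the weights is left out.

```python
import math
from collections import defaultdict


def select_trainable(snr_by_module, top_fraction=0.25):
    """Spectrum-style selection: keep the highest-SNR modules of each type.

    snr_by_module maps a (hypothetical) module name such as
    "layers.3.mlp.down_proj" to a signal-to-noise score. The scores are
    taken as given; this sketch does not compute them.
    """
    by_type = defaultdict(list)
    for name, snr in snr_by_module.items():
        module_type = name.rsplit(".", 1)[-1]  # e.g. "down_proj"
        by_type[module_type].append((snr, name))

    trainable = set()
    for scored in by_type.values():
        # Keep at least one module per type, otherwise the top fraction.
        keep = max(1, round(len(scored) * top_fraction))
        scored.sort(reverse=True)
        trainable.update(name for _, name in scored[:keep])
    return trainable


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

In a real training run, every parameter whose module is not in the set returned by `select_trainable` would be frozen, so only a fraction of the model receives gradients. The loss drops below `log(2)` once the policy prefers the chosen response more strongly than the reference model does.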


🤗 Hugging Face collection (with models and datasets): anakin87/gemma-neogenesis-67824b7bf13ac9cfe091fe2e

I'm also planning a 🎁 Gemma Giveaway (on LinkedIn - https://www.linkedin.com/in/stefano-fiorucci) in the next few days - sharing techniques, datasets, and models I used for my project... so stay tuned! 📻