Gemstone-256x80_cooldown
Gemstone-256x80_cooldown is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths. This model has had its learning rate linearly decayed to zero over a span equal to 10% of the tokens seen at the start of the cooldown, with cooldown start points ranging from 10 to 100 billion tokens. Revisions are named "step_{x}_cooldown_{y}", where x is the step count at which the cooldown began and y is the current step count. The main branch is the final cooldown for this model, running from 100 billion to 110 billion tokens.
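For instance, the available cooldown revisions can be enumerated with huggingface_hub. This is a minimal sketch, assuming the repo id tomg-group-umd/Gemstone-256x80_cooldown (adjust if the actual path differs):

```python
from huggingface_hub import list_repo_refs

# Assumed repo id; replace with the actual Hugging Face path if it differs.
refs = list_repo_refs("tomg-group-umd/Gemstone-256x80_cooldown")
for branch in refs.branches:
    # Prints "main" plus revisions of the form "step_{x}_cooldown_{y}".
    print(branch.name)
```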
Training
We train with litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, using a global batch size of 2048.
Data
Training and validation data are taken from non-overlapping subsets of Dolma. As such, this is a base model, not an instruction-tuned model.
Using Gemstone-256x80_cooldown
The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.
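A minimal loading sketch with transformers is given below. The repo id and revision name are assumptions (check the repository for the exact names), and trust_remote_code=True is included in case the bundled modeling_gemma.py needs to be loaded:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tomg-group-umd/Gemstone-256x80_cooldown"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision="main",  # or a specific "step_{x}_cooldown_{y}" revision
    trust_remote_code=True,  # allows the custom modeling_gemma.py to load
)

inputs = tokenizer("The Gemstones are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Loading a checkpoint from any point along a cooldown works the same way: pass the corresponding revision name via the revision argument.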
Licence
This model is released under the Apache 2.0 licence.
Contact
Please feel free to contact us with any questions, or open a discussion thread.
Citation
@article{mcleish2024gemstones,
title={Gemstones: A Model Suite for Multi-Faceted Scaling Laws},
author={Sean McLeish and John Kirchenbauer and David Yu Miller and Siddharth Singh and Abhinav Bhatele and Micah Goldblum and Ashwinee Panda and Tom Goldstein},
journal={arXiv preprint arXiv:2502.06857},
year={2025}
}