Abstract
Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80\% barrier on the GSM8K benchmark is 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5\% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 ``teacher'' model (77.4\%), from which our models' training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, and 2) the use of a verifier, which selects the final output from multiple candidate generations.
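The verifier component described in the abstract can be sketched as follows: sample several candidate solutions from the generator, score each with the verifier, and keep the highest-scoring one. This is a minimal illustration, not the authors' implementation; `generate`, `score`, and the candidate count are hypothetical stand-ins for the 1.3B generator, the 1.3B verifier, and their decoding setup.

```python
from typing import Callable, List


def select_with_verifier(
    question: str,
    generate: Callable[[str], str],      # generator: question -> candidate solution
    score: Callable[[str, str], float],  # verifier: (question, solution) -> score
    num_candidates: int = 16,            # assumed value, for illustration only
) -> str:
    """Sample candidates and return the one the verifier scores highest."""
    candidates: List[str] = [generate(question) for _ in range(num_candidates)]
    return max(candidates, key=lambda solution: score(question, solution))
```

In practice the generator would be sampled with nonzero temperature so the candidates differ, and the verifier's score would come from a trained model rather than a hand-written function.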
Community
This will open the door to having very specific models running locally, making AI accessible to all children everywhere. Instead of needing an AI tutor with high generalization, we can have a tutor that answers the same questions that have been asked for many years. By using a verifier trained on a tiny amount of data from GSM, we can intentionally contaminate it, resulting in an SLM that is good at answering GSM-like questions. This is indeed a smart move!
Intentional overfitting or contamination can be beneficial, especially for educational AI tutors. For instance, Grade 7 math questions haven't changed significantly over time. A specialized AI tutor for this grade should focus on these specific questions, using overfitting as a tool for precision rather than generalization. This approach aligns with the educational domain's needs, ensuring that the AI remains focused on relevant material.
I wonder if you could share the TinyGSM dataset? I'd like to try your approach for other STEM topics and different grades, to have many SLMs, each specialized in one topic and one grade.
Thanks.
I apologize for the late response and thank you for your interest in our dataset!
Please find our dataset here: https://huggingface.co/datasets/TinyGSM/TinyGSM
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Math-Shepherd: A Label-Free Step-by-Step Verifier for LLMs in Mathematical Reasoning (2023)
- Learning From Mistakes Makes LLM Better Reasoner (2023)
- Mixed Distillation Helps Smaller Language Model Better Reasoning (2023)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators (2023)
- Outcome-supervised Verifiers for Planning in Mathematical Reasoning (2023)
Thank you for sharing the work!
Regarding the construction of the TinyGSM dataset used for training, I was wondering whether any checks were made to avoid coincidental leakage of duplicates or near-duplicates of GSM8K's test set. Since scale and diversity were the main objectives in creating the dataset, it might be worth checking.
Once the TinyGSM dataset is available, we can also run the check ourselves, like we did with other math reasoning datasets where we found this to be a common issue.
(I'm sorry that this is a very late reply!)
Thank you for the great question and for your interest in our work! We ran n-gram checks to make sure that there is no verbatim duplication, but we are not sure whether there is duplication by rephrasing. One possible way to check for semantic similarity is to compare embeddings from trained language models, but this doesn't work well for math: for example, changing the numbers makes the question a different problem while leaving its semantics essentially unchanged.
Our dataset is available here: https://huggingface.co/datasets/TinyGSM/TinyGSM
Any duplication checks are welcome and we'd love to learn about your findings. Thank you!
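For reference, a verbatim n-gram overlap check of the kind mentioned above can be sketched as follows. This is only an illustration: the n-gram size and whitespace tokenization are assumptions, not the authors' exact procedure.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams of `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_verbatim_overlap(train_text: str, test_text: str, n: int = 13) -> bool:
    """Flag a pair if any n-gram of the test problem appears verbatim in the training text."""
    return bool(ngrams(train_text, n) & ngrams(test_text, n))
```

Note that this catches only verbatim duplication; rephrased or number-perturbed duplicates would slip through, which is exactly the harder case discussed above.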
Dear Ronen Eldan, I would like to try finetuning this math GSM model to improve its performance. If the model is reliable, it could become a module in the TimeCapsuleTeacher(TM) platform for teaching math. Could you give me the model.py, train.py, and configs, plus remote access to a fast GPU/TPU, to finetune a separate version of the GSM model weights with my own MathTrain.txt and finetuning methods? Is there a simple way to automatically benchmark the finetuned model periodically during finetuning? (One local save of the model weights could support an inference-mode instance while finetuning proceeds.)
Thank you for your interest in our work!
Please note that our dataset is now available at https://huggingface.co/datasets/TinyGSM/TinyGSM .