---
license: apache-2.0
tags:
- sweelol-ai
- text-generation
- gemma
- distillation
- pruning
- lora
- prompt-tuning
---

# {model_name}
					
						
## Model Description

This model is part of the **Sweelol AI Hub** collection, resulting from experiments in efficient fine-tuning and knowledge distillation on the Gemma-3-270m architecture using the Databricks Dolly-15k dataset on Kaggle TPUs/GPUs.

**Full Research Notebook & Benchmark Results:** [Link to your final Kaggle Benchmark notebook here]

**Key Details:**
*   **Base Model:** `google/gemma-3-270m`
*   **Training Data:** Databricks Dolly-15k (subset)
*   **Fine-Tuning Method:** {method_description}
*   **Purpose:** {purpose}

This is a placeholder README. A detailed model card with full results and usage instructions will be added shortly.

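Until the full usage instructions land, here is a minimal loading sketch with the `transformers` library. The repository id below is a placeholder (not this model's real Hub id), and a recent `transformers` release with Gemma 3 support is assumed; if the checkpoint is published as a PEFT adapter rather than merged weights, it would instead need to be loaded with `peft.PeftModel.from_pretrained` on top of the base model.

```python
# Minimal sketch: load this checkpoint and generate a short completion.
# "sweelol/{model_name}" is a placeholder -- substitute the actual Hub repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sweelol/{model_name}"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Instruction: Briefly explain what knowledge distillation is.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```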
					
						
## Evaluation Results

This table compares the performance of this **Finetuned-Pruned** model against the original, un-tuned `google/gemma-3-270m` base model.

| Benchmark Task | Sweelol Finetuned-Pruned | Baseline (Gemma-3-270m) | Change |
| :--- | :--- | :--- | :--- |
| **Average MMLU (5 tasks)** | 25.18% | 24.88% | **+0.30%** |
| HellaSwag (Common Sense) | 29.50% | 43.50% | -14.00% |
| *MMLU Sub-task Breakdown:* | | | |
| MMLU - Formal Logic | **28.57%** | 25.40% | **+3.17%** |
| MMLU - High School Computer Science | **25.00%** | 24.00% | **+1.00%** |
| MMLU - Professional Law | 25.00% | 27.00% | -2.00% |
| MMLU - Abstract Algebra | 22.00% | 22.00% | 0.00% |
| MMLU - High School Mathematics | 21.00% | 26.00% | -5.00% |

#### Summary of Findings
Fine-tuning the pruned model yielded a modest overall gain on MMLU (+0.30% on average), driven largely by formal logic. However, like the pruned-only baseline, it suffered a significant drop in common-sense reasoning (HellaSwag).
					
						
## Evaluation

### Testing Data & Metrics

All models were evaluated on a comprehensive suite of tasks from the `lm-evaluation-harness`, including 5 diverse subsets of **MMLU** (for academic reasoning) and **HellaSwag** (for common-sense reasoning). The primary metric is zero-shot accuracy on a 200-sample subset of each task's test split.

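As an illustration of this setup, the sketch below runs the same style of evaluation through the `lm-evaluation-harness` Python API. The exact harness version and task names used for the reported numbers are not specified in this card, so treat the task list, the checkpoint id in `model_args`, and the `limit=200` sampling as assumptions that mirror the description above.

```python
# Sketch: zero-shot accuracy on a 200-sample subset per task with lm-evaluation-harness.
# Task names follow current harness conventions and may differ from those used for this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-3-270m",  # or the Sweelol checkpoint under test
    tasks=[
        "hellaswag",
        "mmlu_formal_logic",
        "mmlu_high_school_computer_science",
        "mmlu_professional_law",
        "mmlu_abstract_algebra",
        "mmlu_high_school_mathematics",
    ],
    num_fewshot=0,   # zero-shot
    limit=200,       # 200-sample subset of each task's test split
)
print(results["results"])
```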
					
						
### Results

This table summarizes the final benchmark scores for all models created in the **Sweelol AI Comparative Study**. All fine-tuned models were trained on a subset of the `databricks/databricks-dolly-15k` dataset.

| Model | Technique | Average MMLU | HellaSwag | MMLU CompSci | MMLU Logic | MMLU Law | MMLU Math | MMLU Algebra |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Baseline** | *(Pre-trained)* | 24.88% | **43.50%** | 24.00% | 25.40% | **27.00%** | **26.00%** | 22.00% |
| **Pruned-Baseline** | Pruning | **26.17%** | 29.50% | **28.00%** | **29.37%** | 26.00% | 24.50% | **23.00%** |
| **Prompt-Tune** | PEFT | 25.77% | 39.00% | 27.00% | **29.37%** | **27.50%** | 22.00% | **23.00%** |
| **Finetuned-Pruned** | Pruning + FT | 25.18% | 29.50% | 25.00% | 28.57% | 25.00% | 21.00% | 22.00% |
| **LoRA** | PEFT | 24.60% | 26.00% | 25.00% | 28.57% | 25.00% | 21.00% | 22.00% |
| **KD-Pruned** | Distillation | 23.98% | 33.00% | 26.00% | 25.40% | 25.00% | 21.50% | 22.00% |
| **Full-Finetune** | Full FT | 22.60% | 39.00% | 26.00% | 23.02% | 23.50% | 21.50% | 19.00% |

#### Summary of Key Findings

1.  **Pruning is a Superpower for Logic:** The `Pruned-Baseline` model, with no fine-tuning, was the **undisputed champion on average MMLU performance**. It achieved the highest scores in Formal Logic and Computer Science, suggesting that pruning enhances the model's core, pre-trained reasoning abilities.

2.  **Prompt Tuning is the Efficiency King:** The `Prompt-Tune` model was the second-best performer on MMLU and retained strong common-sense performance (HellaSwag). This makes it the most efficient and effective overall technique, delivering top-tier results with minimal training.

3.  **The "Alignment Tax" is Real:** Both `Full-Finetune` and `KD-Pruned` models, while trained on instruction data, showed a significant drop in performance on the MMLU reasoning tasks compared to the baseline. This is a classic example of the "alignment tax," where teaching a model to be a helpful assistant can sometimes dilute its raw, academic reasoning capabilities.

4.  **Common Sense is Fragile:** Techniques that heavily modified the model's structure or weights (`Pruning`, `LoRA`) resulted in a significant drop in performance on the `HellaSwag` common-sense benchmark. The `Baseline` model remains the champion of common sense.

This comprehensive benchmark provides a clear, data-driven guide for selecting the right optimization technique for a given task.

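For reference, the two PEFT techniques compared above are typically attached through the `peft` library roughly as sketched below. The rank, target modules, and virtual-token count shown are illustrative defaults, not the hyperparameters used in the Sweelol runs, which this card does not specify.

```python
# Sketch: the two PEFT setups compared above, configured with the peft library.
# Ranks, target modules, and virtual-token counts are illustrative, not the study's settings.
from peft import LoraConfig, PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

# LoRA: low-rank update matrices injected into the attention projections.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

# Prompt tuning: a small set of trainable virtual tokens prepended to every input.
pt_cfg = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,
    prompt_tuning_init=PromptTuningInit.RANDOM,
)

# Attach one adapter (swap in pt_cfg to prompt-tune instead); only the adapter weights train.
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```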
					
						
# Gemma 3 model card

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

**Resources and Technical Documentation**:

* [Gemma 3 Technical Report][g3-tech-report]
* [Responsible Generative AI Toolkit][rai-toolkit]
* [Gemma on Kaggle][kaggle-gemma]
* [Gemma on Vertex Model Garden][vertex-mg-gemma3]

**Terms of Use**: [Terms][terms]

**Authors**: Google DeepMind

## Model Information

Summary description and brief definition of inputs and outputs.

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

### Inputs and outputs

-   **Input:**
    -   Text string, such as a question, a prompt, or a document to be summarized
    -   Images, normalized to 896 x 896 resolution and encoded to 256 tokens each, for the 4B, 12B, and 27B sizes
    -   Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes

-   **Output:**
    -   Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
    -   Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes per request, subtracting the request input tokens

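As a quick illustration of the text-in, text-out interface described above, the snippet below runs the 270M pre-trained checkpoint through the `transformers` text-generation pipeline. It assumes a `transformers` release with Gemma 3 support; the prompt and generation settings are arbitrary examples.

```python
# Sketch: plain text generation with the 270M pre-trained checkpoint.
# Requires a transformers version with Gemma 3 support.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-3-270m")

prompt = "Summarize in one sentence: Gemma 3 models are lightweight open models from Google."
output = generator(prompt, max_new_tokens=48, do_sample=False)
print(output[0]["generated_text"])
```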
					
						
### Citation

```none
@article{gemma_2025,
    title={Gemma 3},
    url={https://arxiv.org/abs/2503.19786},
    publisher={Google DeepMind},
    author={Gemma Team},
    year={2025}
}
```

## Model Data

Data used for model training and how the data was processed.

### Training Dataset

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model was trained with 12 trillion tokens, the 4B model was trained with 4 trillion tokens, the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The knowledge cutoff date for the training data was August 2024. Here are the key components:

-   Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
-   Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
-   Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
-   Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

### Data Preprocessing

Here are the key data cleaning and filtering methods applied to the training data:

-   CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
-   Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
-   Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].
					
						
## Implementation Information

Details about the model internals.

### Hardware

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

-   Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
-   Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
-   Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
-   Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
-   These advantages are aligned with [Google's commitments to operate sustainably][sustainability].

					
						
### Software

Training was done using [JAX][jax] and [ML Pathways][ml-pathways].

JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including large language models like these ones.

Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; *"the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."*
					
						
## Evaluation

Model evaluation metrics and results.

### Benchmark Results

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked with **IT** are for instruction-tuned models. Evaluation results marked with **PT** are for pre-trained models.

#### Gemma 3 270M

| **Benchmark**             |  **n-shot**   | **Gemma 3 PT 270M** |
| :------------------------ | :-----------: | ------------------: |
| [HellaSwag][hellaswag]    |    10-shot    |                40.9 |
| [BoolQ][boolq]            |    0-shot     |                61.4 |
| [PIQA][piqa]              |    0-shot     |                67.7 |
| [TriviaQA][triviaqa]      |    5-shot     |                15.4 |
| [ARC-c][arc]              |    25-shot    |                29.0 |
| [ARC-e][arc]              |    0-shot     |                57.7 |
| [WinoGrande][winogrande]  |    5-shot     |                52.0 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[triviaqa]: https://arxiv.org/abs/1705.03551
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641

| **Benchmark**             |  **n-shot**   | **Gemma 3 IT 270M** |
| :------------------------ | :-----------: | ------------------: |
| [HellaSwag][hellaswag]    |    0-shot     |                37.7 |
| [PIQA][piqa]              |    0-shot     |                66.2 |
| [ARC-c][arc]              |    0-shot     |                28.2 |
| [WinoGrande][winogrande]  |    0-shot     |                52.3 |
| [BIG-Bench Hard][bbh]     |   few-shot    |                26.7 |
| [IF Eval][ifeval]         |    0-shot     |                51.2 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[piqa]: https://arxiv.org/abs/1911.11641
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
					
						
					
						
[ifeval]: https://arxiv.org/abs/2311.07911

#### Gemma 3 1B, 4B, 12B & 27B

##### Reasoning and factuality

| Benchmark                      | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|--------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
| [GPQA][gpqa] Diamond           | 0-shot |      19.2     |      30.8     |      40.9      |      42.4      |
| [SimpleQA][simpleqa]           | 0-shot |      2.2      |      4.0      |       6.3      |      10.0      |
| [FACTS Grounding][facts-grdg]  |    -   |      36.4     |      70.1     |      75.8      |      74.9      |
| [BIG-Bench Hard][bbh]          | 0-shot |      39.1     |      72.2     |      85.7      |      87.6      |
| [BIG-Bench Extra Hard][bbeh]   | 0-shot |      7.2      |      11.0     |      16.3      |      19.3      |
| [IFEval][ifeval]               | 0-shot |      80.2     |      90.2     |      88.9      |      90.4      |

| Benchmark                      | n-shot   | Gemma 3 PT 1B  | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------|:--------------:|:-------------:|:--------------:|:--------------:|
| [HellaSwag][hellaswag]         | 10-shot  |      62.3      |      77.2     |      84.2      |      85.6      |
| [BoolQ][boolq]                 | 0-shot   |      63.2      |      72.3     |      78.8      |      82.4      |
| [PIQA][piqa]                   | 0-shot   |      73.8      |      79.6     |      81.8      |      83.3      |
| [SocialIQA][socialiqa]         | 0-shot   |      48.9      |      51.9     |      53.4      |      54.9      |
| [TriviaQA][triviaqa]           | 5-shot   |      39.8      |      65.8     |      78.2      |      85.5      |
| [Natural Questions][naturalq]  | 5-shot   |      9.48      |      20.0     |      31.4      |      36.1      |
| [ARC-c][arc]                   | 25-shot  |      38.4      |      56.2     |      68.9      |      70.6      |
| [ARC-e][arc]                   | 0-shot   |      73.0      |      82.4     |      88.3      |      89.0      |
| [WinoGrande][winogrande]       | 5-shot   |      58.2      |      64.7     |      74.3      |      78.8      |
| [BIG-Bench Hard][bbh]          | few-shot |      28.4      |      50.9     |      72.6      |      77.7      |
| [DROP][drop]                   | 1-shot   |      42.4      |      60.1     |      72.2      |      77.2      |

[gpqa]: https://arxiv.org/abs/2311.12022
[simpleqa]: https://arxiv.org/abs/2411.04368
[facts-grdg]: https://goo.gle/FACTS_paper
[bbeh]: https://github.com/google-deepmind/bbeh
[ifeval]: https://arxiv.org/abs/2311.07911
[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

##### STEM and code

| Benchmark                  | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|----------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] (Pro)         | 0-shot |      14.7     |      43.6     |      60.6      |      67.5      |
| [LiveCodeBench][lcb]       | 0-shot |      1.9      |      12.6     |      24.6      |      29.7      |
| [Bird-SQL][bird-sql] (dev) |    -   |      6.4      |      36.3     |      47.9      |      54.4      |
| [Math][math]               | 0-shot |      48.0     |      75.6     |      83.8      |      89.0      |
| HiddenMath                 | 0-shot |      15.8     |      43.0     |      54.5      |      60.3      |
| [MBPP][mbpp]               | 3-shot |      35.2     |      63.2     |      73.0      |      74.4      |
| [HumanEval][humaneval]     | 0-shot |      41.5     |      71.3     |      85.4      |      87.8      |
| [Natural2Code][nat2code]   | 0-shot |      56.0     |      70.3     |      80.7      |      84.5      |
| [GSM8K][gsm8k]             | 0-shot |      62.8     |      89.2     |      94.4      |      95.9      |

| Benchmark                      | n-shot         | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu]                   | 5-shot         |      59.6     |      74.5      |      78.6      |
| [MMLU][mmlu] (Pro COT)         | 5-shot         |      29.2     |      45.3      |      52.2      |
| [AGIEval][agieval]             | 3-5-shot       |      42.1     |      57.4      |      66.2      |
| [MATH][math]                   | 4-shot         |      24.2     |      43.3      |      50.0      |
| [GSM8K][gsm8k]                 | 8-shot         |      38.4     |      71.0      |      82.6      |
| [GPQA][gpqa]                   | 5-shot         |      15.0     |      25.4      |      24.3      |
| [MBPP][mbpp]                   | 3-shot         |      46.0     |      60.4      |      65.6      |
| [HumanEval][humaneval]         | 0-shot         |      36.0     |      45.7      |      48.8      |