Sweelol-ai/finetuned-pruned-gemma3-270m-dolly

@@ -1,24 +1,14 @@
 ---
 license: apache-2.0
 tags:
 - sweelol-ai
 - text-generation
-- gemma-3-270m
 - distillation
 - pruning
 - lora
 - prompt-tuning
-- code
-datasets:
-- databricks/databricks-dolly-15k
-language:
-- en
-metrics:
-- accuracy
-base_model:
-- google/gemma-3-270m
-pipeline_tag: text-generation
-library_name: transformers
 ---
 # {model_name}
@@ -35,4 +25,320 @@ This model is part of the **Sweelol AI Hub** collection, resulting from experime
 *   **Fine-Tuning Method:** {method_description}
 *   **Purpose:** {purpose}
-This is a placeholder README. A detailed model card with full results and usage instructions will be added shortly.

 ---
 license: apache-2.0
 tags:
 - sweelol-ai
 - text-generation
+- gemma
 - distillation
 - pruning
 - lora
 - prompt-tuning
 ---
 # {model_name}
 *   **Fine-Tuning Method:** {method_description}
 *   **Purpose:** {purpose}
+This is a placeholder README. A detailed model card with full results and usage instructions will be added shortly.
+## Evaluation Results
+This table compares the performance of this **Finetuned-Pruned** model against the original, un-tuned `google/gemma-3-270m` base model.
+| Benchmark Task | Sweelol Finetuned-Pruned | Baseline (Gemma-3-270m) | Change |
+| :--- | :--- | :--- | :--- |
+| **Average MMLU (5 tasks)** | 25.18% | 24.88% | **+0.30%** |
+| HellaSwag (Common Sense) | 29.50% | 43.50% | -14.00% |
+| *MMLU Sub-task Breakdown:* | | | |
+| MMLU - Formal Logic | **28.57%** | 25.40% | **+3.17%** |
+| MMLU - High School Computer Science | **25.00%** | 24.00% | **+1.00%** |
+| MMLU - Professional Law | 25.00% | 27.00% | -2.00% |
+| MMLU - Abstract Algebra | 22.00% | 22.00% | 0.00% |
+| MMLU - High School Mathematics | 21.00% | 26.00% | -5.00% |
+#### Summary of Findings
+Relative to the base model, fine-tuning the pruned model produced a small gain in average MMLU (+0.30 percentage points), driven mainly by Formal Logic (+3.17), while High School Mathematics fell by 5.00 points. Like the pruned-only model, it also lost substantial ground on common-sense reasoning (HellaSwag, -14.00 points).
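The **Change** column above is the absolute difference in accuracy, in percentage points, and the baseline's MMLU average is the unweighted mean of its five sub-task scores. A minimal sketch of that arithmetic, using scores copied from the table (the variable and function names are illustrative and not part of any evaluation tooling):

```python
# Accuracy scores in percent, copied from the table above.
# All names below are illustrative only.
baseline = {
    "formal_logic": 25.40,
    "high_school_computer_science": 24.00,
    "professional_law": 27.00,
    "abstract_algebra": 22.00,
    "high_school_mathematics": 26.00,
}
finetuned_pruned = {
    "formal_logic": 28.57,
    "high_school_computer_science": 25.00,
    "professional_law": 25.00,
    "abstract_algebra": 22.00,
    "high_school_mathematics": 21.00,
}


def mean_score(scores):
    """Unweighted mean of per-task accuracies, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


def point_changes(model, reference):
    """Per-task difference (model - reference) in percentage points."""
    return {task: round(model[task] - reference[task], 2) for task in reference}


baseline_avg = mean_score(baseline)
changes = point_changes(finetuned_pruned, baseline)

print(f"Baseline MMLU average: {baseline_avg:.2f}%")
for task, delta in changes.items():
    print(f"{task}: {delta:+.2f} points")
```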
+## Evaluation
+### Testing Data & Metrics
+All models were evaluated with the `lm-evaluation-harness` on five subsets of **MMLU** (academic reasoning) and on **HellaSwag** (common-sense reasoning). The primary metric is zero-shot accuracy on a subset of up to 200 samples from each task's test split.
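One way to approximate this setup is through the `lm_eval` command-line interface of the `lm-evaluation-harness`. The sketch below only assembles the command and does not run it. The task names assume the harness's v0.4-style naming (`mmlu_<subject>`, `hellaswag`), and the repository ID is taken from this model's Hub path. The exact invocation used for the study is not documented here, so treat this as an approximation.

```python
import shlex

# Hub repository ID of this model (from the page path).
MODEL_ID = "Sweelol-ai/finetuned-pruned-gemma3-270m-dolly"

# Assumed lm-evaluation-harness task names for the six benchmarks.
TASKS = [
    "mmlu_formal_logic",
    "mmlu_high_school_computer_science",
    "mmlu_professional_law",
    "mmlu_abstract_algebra",
    "mmlu_high_school_mathematics",
    "hellaswag",
]


def build_eval_command(model_id=MODEL_ID, limit=200):
    """Return an argument list for a zero-shot lm_eval run.

    `limit` caps the number of samples evaluated per task.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(TASKS),
        "--num_fewshot", "0",
        "--limit", str(limit),
    ]


command = build_eval_command()
print(shlex.join(command))
```

With `lm-eval` installed, passing the list to `subprocess.run(command, check=True)` would launch the evaluation.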
+### Results
+This table summarizes the final benchmark scores for all models created in the **Sweelol AI Comparative Study**. All fine-tuned models were trained on a subset of the `databricks/databricks-dolly-15k` dataset.
+| Model | Technique | Average MMLU | HellaSwag | MMLU CompSci | MMLU Logic | MMLU Law | MMLU Math | MMLU Algebra |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| **Baseline** | *(Pre-trained)* | 24.88% | **43.50%** | 24.00% | 25.40% | 27.00% | **26.00%** | 22.00% |
+| **Pruned-Baseline** | Pruning | **26.17%** | 29.50% | **28.00%** | **29.37%** | 26.00% | 24.50% | **23.00%** |
+| **Prompt-Tune** | PEFT | 25.77% | 39.00% | 27.00% | **29.37%** | **27.50%** | 22.00% | **23.00%** |
+| **Finetuned-Pruned** | Pruning + FT | 25.18% | 29.50% | 25.00% | 28.57% | 25.00% | 21.00% | 22.00% |
+| **LoRA** | PEFT | 24.60% | 26.00% | 25.00% | 28.57% | 25.00% | 21.00% | 22.00% |
+| **KD-Pruned** | Distillation | 23.98% | 33.00% | 26.00% | 25.40% | 25.00% | 21.50% | 22.00% |
+| **Full-Finetune** | Full FT | 22.60% | 39.00% | 26.00% | 23.02% | 23.50% | 21.50% | 19.00% |
+#### Summary of Key Findings
+1.  **Pruning Helps Most on Logic:** The `Pruned-Baseline` model, with no fine-tuning, achieved the highest average MMLU score (26.17% vs. 24.88% for the `Baseline`). It also scored highest on Computer Science (28.00%) and tied with `Prompt-Tune` for the top Formal Logic score (29.37%). This suggests that pruning can preserve, and possibly sharpen, the model's pre-trained reasoning abilities, although the margins are small relative to the size of the evaluation subsets.
+2.  **Prompt Tuning is the Most Efficient Option:** The `Prompt-Tune` model had the second-highest average MMLU score (25.77%) and retained most of the baseline's common-sense performance (39.00% on HellaSwag). Given how little training it requires, it offers the best overall balance of results and cost in this study.
+3.  **The "Alignment Tax" Shows Up:** The `Full-Finetune` and `KD-Pruned` models, although trained on instruction data, scored below the baseline on average MMLU (22.60% and 23.98% vs. 24.88%). This is consistent with the "alignment tax," where teaching a model to be a helpful assistant can dilute its raw, academic reasoning capabilities.
+4.  **Common Sense is Fragile:** Most techniques reduced `HellaSwag` accuracy well below the `Baseline` score of 43.50%: `LoRA` fell to 26.00%, both pruned models to 29.50%, and `KD-Pruned` to 33.00%. `Prompt-Tune` and `Full-Finetune` held up best at 39.00%, but no technique matched the `Baseline`.
+These results offer a data-driven starting point for selecting an optimization technique for a given task.
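Because each score is measured on a few hundred samples at most, small differences between rows can fall within sampling noise. As a rough guide, the standard error of an accuracy `p` measured on `n` samples is `sqrt(p * (1 - p) / n)`. This assumes independent samples and a simple binomial model, which is an approximation and not something the study itself reports. The helper name below is illustrative.

```python
import math


def accuracy_standard_error(accuracy_pct, n_samples):
    """Binomial standard error of an accuracy, in percentage points.

    Assumes independent samples; this is an approximation.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_samples)


# A score near chance level (25%) on a 200-sample subset.
se_mmlu = accuracy_standard_error(25.0, 200)
# The baseline HellaSwag score (43.50%) on a 200-sample subset.
se_hellaswag = accuracy_standard_error(43.5, 200)

print(f"MMLU sub-task standard error: {se_mmlu:.2f} points")
print(f"HellaSwag standard error: {se_hellaswag:.2f} points")
```

Under these assumptions, a single-task score near 25% carries roughly 3 percentage points of standard error, so the 14-point HellaSwag gap is far more robust than one- or two-point differences on individual MMLU sub-tasks.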
+# Gemma 3 model card
+**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)
+**Resources and Technical Documentation**:
+* [Gemma 3 Technical Report][g3-tech-report]
+* [Responsible Generative AI Toolkit][rai-toolkit]
+* [Gemma on Kaggle][kaggle-gemma]
+* [Gemma on Vertex Model Garden][vertex-mg-gemma3]
+**Terms of Use**: [Terms][terms]
+**Authors**: Google DeepMind
+## Model Information
+Summary description and brief definition of inputs and outputs.
+### Description
+Gemma is a family of lightweight, state-of-the-art open models from Google,
+built from the same research and technology used to create the Gemini models.
+Gemma 3 models are multimodal, handling text and image input and generating text
+output, with open weights for both pre-trained variants and instruction-tuned
+variants. Gemma 3 has a large, 128K context window, multilingual support in over
+140 languages, and is available in more sizes than previous versions. Gemma 3
+models are well-suited for a variety of text generation and image understanding
+tasks, including question answering, summarization, and reasoning. Their
+relatively small size makes it possible to deploy them in environments with
+limited resources such as laptops, desktops or your own cloud infrastructure,
+democratizing access to state-of-the-art AI models and helping foster innovation
+for everyone.
+### Inputs and outputs
+-   **Input:**
+    -  Text string, such as a question, a prompt, or a document to be summarized
+    -  Images, normalized to 896 x 896 resolution and encoded to 256 tokens
+       each, for the 4B, 12B, and 27B sizes.
+    -  Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
+       32K tokens for the 1B and 270M sizes.
+-   **Output:**
+    -   Generated text in response to the input, such as an answer to a
+        question, analysis of image content, or a summary of a document
+    -   Total output context up to 128K tokens for the 4B, 12B, and 27B sizes,
+        and 32K tokens for the 1B and 270M sizes per request, subtracting the
+        request input tokens
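Since the output budget is the context window minus the tokens consumed by the input, the available generation length can be estimated with simple arithmetic. The sketch below assumes "K" means 1,024 tokens (the card does not say whether 32K is 32,000 or 32,768), and the table and function names are hypothetical.

```python
# Each image is encoded to 256 tokens (supported on 4B, 12B, and 27B only).
TOKENS_PER_IMAGE = 256

# Context windows per model size; "K" is ASSUMED to mean 1,024 tokens.
CONTEXT_WINDOW = {
    "270M": 32 * 1024,
    "1B": 32 * 1024,
    "4B": 128 * 1024,
    "12B": 128 * 1024,
    "27B": 128 * 1024,
}
IMAGE_CAPABLE_SIZES = {"4B", "12B", "27B"}


def max_output_tokens(size, text_tokens, num_images=0):
    """Tokens left for generation after the request's input is counted."""
    if num_images and size not in IMAGE_CAPABLE_SIZES:
        raise ValueError(f"The {size} size does not accept image input")
    used = text_tokens + num_images * TOKENS_PER_IMAGE
    remaining = CONTEXT_WINDOW[size] - used
    if remaining < 0:
        raise ValueError("Input exceeds the context window")
    return remaining


print(max_output_tokens("270M", text_tokens=1000))
print(max_output_tokens("4B", text_tokens=1000, num_images=2))
```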
+### Citation
+```none
+@article{gemma_2025,
+    title={Gemma 3},
+    url={https://arxiv.org/abs/2503.19786},
+    publisher={Google DeepMind},
+    author={Gemma Team},
+    year={2025}
+}
+```
+## Model Data
+Data used for model training and how the data was processed.
+### Training Dataset
+These models were trained on a dataset of text data that includes a wide variety
+of sources. The 27B model was trained with 14 trillion tokens, the 12B model was
+trained with 12 trillion tokens, the 4B model was trained with 4 trillion tokens,
+the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The
+knowledge cutoff date for the training data was August 2024. Here are the key
+components:
+-   Web Documents: A diverse collection of web text ensures the model is
+    exposed to a broad range of linguistic styles, topics, and vocabulary. The
+    training dataset includes content in over 140 languages.
+-   Code: Exposing the model to code helps it to learn the syntax and
+    patterns of programming languages, which improves its ability to generate
+    code and understand code-related questions.
+-   Mathematics: Training on mathematical text helps the model learn logical
+    reasoning, symbolic representation, and to address mathematical queries.
+-   Images: A wide range of images enables the model to perform image
+    analysis and visual data extraction tasks.
+The combination of these diverse data sources is crucial for training a powerful
+multimodal model that can handle a wide variety of different tasks and data
+formats.
+### Data Preprocessing
+Here are the key data cleaning and filtering methods applied to the training
+data:
+-   CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering
+    was applied at multiple stages in the data preparation process to ensure
+    the exclusion of harmful and illegal content.
+-   Sensitive Data Filtering: As part of making Gemma pre-trained models
+    safe and reliable, automated techniques were used to filter out certain
+    personal information and other sensitive data from training sets.
+-   Additional methods: Filtering based on content quality and safety in
+    line with [our policies][safety-policies].
+## Implementation Information
+Details about the model internals.
+### Hardware
+Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p,
+TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant
+computational power. TPUs, designed specifically for matrix operations common in
+machine learning, offer several advantages in this domain:
+-   Performance: TPUs are specifically designed to handle the massive
+    computations involved in training VLMs. They can speed up training
+    considerably compared to CPUs.
+-   Memory: TPUs often come with large amounts of high-bandwidth memory,
+    allowing for the handling of large models and batch sizes during training.
+    This can lead to better model quality.
+-   Scalability: TPU Pods (large clusters of TPUs) provide a scalable
+    solution for handling the growing complexity of large foundation models.
+    You can distribute training across multiple TPU devices for faster and more
+    efficient processing.
+-   Cost-effectiveness: In many scenarios, TPUs can provide a more
+    cost-effective solution for training large models compared to CPU-based
+    infrastructure, especially when considering the time and resources saved
+    due to faster training.
+These advantages are aligned with
+[Google's commitments to operate sustainably][sustainability].
+### Software
+Training was done using [JAX][jax] and [ML Pathways][ml-pathways].
+JAX allows researchers to take advantage of the latest generation of hardware,
+including TPUs, for faster and more efficient training of large models. ML
+Pathways is Google's latest effort to build artificially intelligent systems
+capable of generalizing across multiple tasks. This is especially suitable for
+foundation models, including large language models like these.
+Together, JAX and ML Pathways are used as described in the
+[paper about the Gemini family of models][gemini-2-paper]; *"the 'single
+controller' programming model of Jax and Pathways allows a single Python
+process to orchestrate the entire training run, dramatically simplifying the
+development workflow."*
+## Evaluation
+Model evaluation metrics and results.
+### Benchmark Results
+These models were evaluated against a large collection of different datasets and
+metrics to cover different aspects of text generation. Evaluation results marked
+with **IT** are for instruction-tuned models. Evaluation results marked with
+**PT** are for pre-trained models.
+#### Gemma 3 270M
+| **Benchmark**             |  **n-shot**   | **Gemma 3 PT 270M** |
+| :------------------------ | :-----------: | ------------------: |
+| [HellaSwag][hellaswag]    |    10-shot    |                40.9 |
+| [BoolQ][boolq]            |    0-shot     |                61.4 |
+| [PIQA][piqa]              |    0-shot     |                67.7 |
+| [TriviaQA][triviaqa]      |    5-shot     |                15.4 |
+| [ARC-c][arc]              |    25-shot    |                29.0 |
+| [ARC-e][arc]              |    0-shot     |                57.7 |
+| [WinoGrande][winogrande]  |    5-shot     |                52.0 |
+[hellaswag]: https://arxiv.org/abs/1905.07830
+[boolq]: https://arxiv.org/abs/1905.10044
+[piqa]: https://arxiv.org/abs/1911.11641
+[triviaqa]: https://arxiv.org/abs/1705.03551
+[arc]: https://arxiv.org/abs/1911.01547
+[winogrande]: https://arxiv.org/abs/1907.10641
+| **Benchmark**             |  **n-shot**   | **Gemma 3 IT 270M** |
+| :------------------------ | :-----------: | ------------------: |
+| [HellaSwag][hellaswag]    |    0-shot     |                37.7 |
+| [PIQA][piqa]              |    0-shot     |                66.2 |
+| [ARC-c][arc]              |    0-shot     |                28.2 |
+| [WinoGrande][winogrande]  |    0-shot     |                52.3 |
+| [BIG-Bench Hard][bbh]     |   few-shot    |                26.7 |
+| [IF Eval][ifeval]         |    0-shot     |                51.2 |
+[hellaswag]: https://arxiv.org/abs/1905.07830
+[piqa]: https://arxiv.org/abs/1911.11641
+[arc]: https://arxiv.org/abs/1911.01547
+[winogrande]: https://arxiv.org/abs/1907.10641
+[bbh]: https://paperswithcode.com/dataset/bbh
+[ifeval]: https://arxiv.org/abs/2311.07911
+#### Gemma 3 1B, 4B, 12B & 27B
+##### Reasoning and factuality
+| Benchmark                      | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
+|--------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
+| [GPQA][gpqa] Diamond           | 0-shot |      19.2     |      30.8     |      40.9      |      42.4      |
+| [SimpleQA][simpleqa]           | 0-shot |      2.2      |      4.0      |       6.3      |      10.0      |
+| [FACTS Grounding][facts-grdg]  |    -   |      36.4     |      70.1     |      75.8      |      74.9      |
+| [BIG-Bench Hard][bbh]          | 0-shot |      39.1     |      72.2     |      85.7      |      87.6      |
+| [BIG-Bench Extra Hard][bbeh]   | 0-shot |      7.2      |      11.0     |      16.3      |      19.3      |
+| [IFEval][ifeval]               | 0-shot |      80.2     |      90.2     |      88.9      |      90.4      |
+| Benchmark                      | n-shot   | Gemma 3 PT 1B  | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
+| ------------------------------ |----------|:--------------:|:-------------:|:--------------:|:--------------:|
+| [HellaSwag][hellaswag]         | 10-shot  |      62.3      |      77.2     |      84.2      |      85.6      |
+| [BoolQ][boolq]                 | 0-shot   |      63.2      |      72.3     |      78.8      |      82.4      |
+| [PIQA][piqa]                   | 0-shot   |      73.8      |      79.6     |      81.8      |      83.3      |
+| [SocialIQA][socialiqa]         | 0-shot   |      48.9      |      51.9     |      53.4      |      54.9      |
+| [TriviaQA][triviaqa]           | 5-shot   |      39.8      |      65.8     |      78.2      |      85.5      |
+| [Natural Questions][naturalq]  | 5-shot   |      9.48      |      20.0     |      31.4      |      36.1      |
+| [ARC-c][arc]                   | 25-shot  |      38.4      |      56.2     |      68.9      |      70.6      |
+| [ARC-e][arc]                   | 0-shot   |      73.0      |      82.4     |      88.3      |      89.0      |
+| [WinoGrande][winogrande]       | 5-shot   |      58.2      |      64.7     |      74.3      |      78.8      |
+| [BIG-Bench Hard][bbh]          | few-shot |      28.4      |      50.9     |      72.6      |      77.7      |
+| [DROP][drop]                   | 1-shot   |      42.4      |      60.1     |      72.2      |      77.2      |
+[gpqa]: https://arxiv.org/abs/2311.12022
+[simpleqa]: https://arxiv.org/abs/2411.04368
+[facts-grdg]: https://goo.gle/FACTS_paper
+[bbeh]: https://github.com/google-deepmind/bbeh
+[ifeval]: https://arxiv.org/abs/2311.07911
+[hellaswag]: https://arxiv.org/abs/1905.07830
+[boolq]: https://arxiv.org/abs/1905.10044
+[piqa]: https://arxiv.org/abs/1911.11641
+[socialiqa]: https://arxiv.org/abs/1904.09728
+[triviaqa]: https://arxiv.org/abs/1705.03551
+[naturalq]: https://github.com/google-research-datasets/natural-questions
+[arc]: https://arxiv.org/abs/1911.01547
+[winogrande]: https://arxiv.org/abs/1907.10641
+[bbh]: https://paperswithcode.com/dataset/bbh
+[drop]: https://arxiv.org/abs/1903.00161
+##### STEM and code
+| Benchmark                  | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
+|----------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
+| [MMLU][mmlu] (Pro)         | 0-shot |      14.7     |      43.6     |      60.6      |      67.5      |
+| [LiveCodeBench][lcb]       | 0-shot |      1.9      |      12.6     |      24.6      |      29.7      |
+| [Bird-SQL][bird-sql] (dev) |    -   |      6.4      |      36.3     |      47.9      |      54.4      |
+| [MATH][math]               | 0-shot |      48.0     |      75.6     |      83.8      |      89.0      |
+| HiddenMath                 | 0-shot |      15.8     |      43.0     |      54.5      |      60.3      |
+| [MBPP][mbpp]               | 3-shot |      35.2     |      63.2     |      73.0      |      74.4      |
+| [HumanEval][humaneval]     | 0-shot |      41.5     |      71.3     |      85.4      |      87.8      |
+| [Natural2Code][nat2code]   | 0-shot |      56.0     |      70.3     |      80.7      |      84.5      |
+| [GSM8K][gsm8k]             | 0-shot |      62.8     |      89.2     |      94.4      |      95.9      |
+| Benchmark                      | n-shot         | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
+| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
+| [MMLU][mmlu]                   | 5-shot         |      59.6     |      74.5      |      78.6      |
+| [MMLU][mmlu] (Pro COT)         | 5-shot         |      29.2     |      45.3      |      52.2      |
+| [AGIEval][agieval]             | 3-5-shot       |      42.1     |      57.4      |      66.2      |
+| [MATH][math]                   | 4-shot         |      24.2     |      43.3      |      50.0      |
+| [GSM8K][gsm8k]                 | 8-shot         |      38.4     |      71.0      |      82.6      |
+| [GPQA][gpqa]                   | 5-shot         |      15.0     |      25.4      |      24.3      |
+| [MBPP][mbpp]                   | 3-shot         |      46.0     |      60.4      |      65.6      |
+| [HumanEval][humaneval]         | 0-shot         |      36.0     |      45.7      |      48.8      |