Update metadata and improve model card for Ettin decoder model (#2)
Update metadata and improve model card for Ettin decoder model (cacef4824d9d72fecac984c396b457d1a50e31f9)
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED

@@ -1,39 +1,51 @@
---
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
datasets:
- jhu-clsp/ettin-pretraining-data
- jhu-clsp/ettin-extension-data
- jhu-clsp/ettin-decay-data
tags:
- ettin
- decoder
---

# Ettin: an Open Suite of Paired Encoders and Decoders

[License: MIT](https://opensource.org/licenses/MIT)
[Paper](https://arxiv.org/abs/2507.11412)
[Models](https://huggingface.co/jhu-clsp)
[Data](https://huggingface.co/datasets/jhu-clsp)
[GitHub](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

> 🎯 **TL;DR**: State-of-the-art paired encoder and decoder models (17M-1B params) trained identically for fair comparison with open data. Encoders beat ModernBERT. Decoders beat Llama 3.2/SmolLM2.

📄 [Paper](https://arxiv.org/abs/2507.11412) | 🤗 [Model Collection](https://huggingface.co/jhu-clsp) | 📊 [Training Data](https://huggingface.co/datasets/jhu-clsp)

This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

## Table of Contents
- [📊 Performance Highlights](#-performance-highlights)
- [🚀 Quick Start](#-quick-start)
- [Model Description](#model-description)
- [Training Data](#training-data)
- [🤗 Model Family](#-model-family)
- [Encoder Models](#encoder-models)
- [Decoder Models](#decoder-models)
- [Cross-Objective Models](#cross-objective-models)
- [Accessing Training Checkpoints](#accessing-training-checkpoints)
- [🔬 Research Applications](#-research-applications)
- [Training Details](#training-details)
- [Model Architecture](#model-architecture)
- [Usage Examples](#usage-examples)
- [Fine-tuning Examples](#fine-tuning-examples)
- [📈 Training and Evaluation](#-training-and-evaluation)
- [❓ FAQ](#-faq)
- [Citation](#citation)
- [License](#license)

## 📊 Performance Highlights

@@ -82,11 +94,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

1. **Identical training data** - Same high-quality mixture across all models
2. **Open training data** - All data is released, including the batch-level training order for each of the 250+ checkpoints
3. **Matched architectures** - Differing only in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
4. **Consistent training recipe** - Three-phase training with 2T tokens
5. **Multiple scales** - From 17M to 1B parameters

This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
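
In practice, a matched pair loads side by side with the standard Auto classes. A minimal sketch using the 150M pair (the encoder is a masked LM, the decoder a causal LM):

```python
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

# The paired checkpoints share data, tokenizer, and recipe; only the objective differs.
decoder = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")
encoder = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
```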

@@ -94,12 +106,12 @@

The training data is publicly available and split across different phases:

- **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
- **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
- **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
- **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step); see the loading sketch below
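
The batch-level order can be streamed with the `datasets` library. A minimal sketch - the split name and the use of streaming are assumptions, and the column names follow the bullet above:

```python
from datasets import load_dataset

# Stream the batch-level training order (columns: input_ids, step).
order = load_dataset("jhu-clsp/ettin-data-order", split="train", streaming=True)

for row in order.take(2):
    print(row["step"], len(row["input_ids"]))  # training step and tokens in that row
```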

## 🤗 Model Family

### Encoder Models

@@ -174,9 +186,9 @@
#### HuggingFace Format Checkpoints
Each model repository contains multiple tagged versions representing different training stages:

- **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
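
The tags actually published for a repository can be listed with `huggingface_hub`; a small sketch, using the 150M decoder as an example:

```python
from huggingface_hub import list_repo_refs

# List the checkpoint tags (step*, ext*, decay*) exposed by one model repository.
refs = list_repo_refs("jhu-clsp/ettin-decoder-150m")
print(sorted(tag.name for tag in refs.tags))
```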

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -209,19 +221,19 @@

Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

- **Identical Training Data**: Same 2T token mixture across all models
- **Matched Architectures**: Only attention patterns and objectives differ
- **Open Everything**: Training data, model weights, and batch-level training order
- **Multiple Scales**: Fair comparison from 17M to 1B parameters
- **250+ Checkpoints**: Complete training trajectory analysis

### Use Cases for Researchers

- **Architecture Studies**: Compare encoder vs decoder capabilities fairly
- **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering (see the sketch after this list)
- **Scaling Laws**: Study how architectural advantages change with scale
- **Transfer Learning**: Investigate cross-objective training effectiveness
- **Replication Studies**: First open replication of the ModernBERT training recipe
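
As a concrete illustration of the training-dynamics use case, the same text can be scored under several checkpoint revisions. A hedged sketch that reuses the example tags from the checkpoint section above (not every tag necessarily exists for every size):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jhu-clsp/ettin-decoder-150m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Paired encoders and decoders allow controlled comparisons.", return_tensors="pt")

# Compare causal-LM loss across training stages (pretraining -> extension -> decay).
for revision in ["step599525", "ext1000", "decay100"]:
    model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(revision, round(loss.item(), 3))
```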

### Reproducibility

@@ -238,14 +250,14 @@ All training artifacts are publicly available:
**Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

**Training Phases:**
- **Pre-training**: 1.7T tokens with a diverse data mixture
- **Mid-training**: 250B tokens of higher-quality filtered data, with context extension to 8K
- **Decay phase**: 100B tokens of premium data sources

**Key Features:**
- Context length: up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles
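
These values can be confirmed straight from each checkpoint's config. A quick sketch - the size suffixes are assumed from the 17M-1B lineup, and the attribute names follow standard `transformers` config conventions, so they may differ slightly for this architecture:

```python
from transformers import AutoConfig

# Print a few architecture fields for each decoder size (attribute names may vary).
for size in ["17m", "32m", "68m", "150m", "400m", "1b"]:
    cfg = AutoConfig.from_pretrained(f"jhu-clsp/ettin-decoder-{size}")
    print(size, cfg.vocab_size, cfg.max_position_embeddings, cfg.hidden_size, cfg.num_hidden_layers)
```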

## Model Architecture

@@ -256,8 +268,6 @@
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |

## Usage Examples

### Encoder: Masked Language Modeling

@@ -376,7 +386,7 @@ def main():
eval_dataset = dataset_dict["test"]

# 3. Define a loss function
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)  # Increase mini_batch_size if you have enough VRAM

run_name = f"{model_shortname}-DPR-{lr}"
# 4. (Optional) Specify training arguments

@@ -388,16 +398,16 @@ def main():
per_device_train_batch_size=512,
per_device_eval_batch_size=512,
warmup_ratio=0.05,
fp16=False,  # Set to False if GPU can't handle FP16
bf16=True,  # Set to True if GPU supports BF16
batch_sampler=BatchSamplers.NO_DUPLICATES,  # (Cached)MultipleNegativesRankingLoss benefits from no duplicates
learning_rate=lr,
# Optional tracking/debugging parameters:
save_strategy="steps",
save_steps=500,
save_total_limit=2,
logging_steps=500,
run_name=run_name,  # Used in `wandb`, `tensorboard`, `neptune`, etc. if installed
)

# 5. (Optional) Create an evaluator & evaluate the base model

@@ -432,6 +442,7 @@ def main():
if __name__ == "__main__":
    main()
```

</details>

@@ -487,8 +498,8 @@ def main():
output_dir=output_dir,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=batch_size,
fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
bf16=True,  # Set to True if you have a GPU that supports BF16
run_name=run_name,
logging_steps=10,
learning_rate=lr,

@@ -572,9 +583,9 @@ args = SparseEncoderTrainingArguments(
per_device_eval_batch_size=16,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
bf16=False,  # Set to True if you have a GPU that supports BF16
batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
# Optional tracking/debugging parameters:
eval_strategy="steps",
eval_steps=1000,

@@ -582,7 +593,7 @@ args = SparseEncoderTrainingArguments(
save_steps=1000,
save_total_limit=2,
logging_steps=200,
run_name=run_name,  # Will be used in W&B if `wandb` is installed
)

# 6. (Optional) Create an evaluator & evaluate the base model

@@ -644,7 +655,7 @@ def main():

train_batch_size = 64
num_epochs = 1
num_hard_negatives = 5  # How many hard negatives should be mined for each question-answer pair

# 1a. Load a model to finetune with 1b. (Optional) model card data
model = CrossEncoder(

@@ -671,13 +682,13 @@ def main():
hard_train_dataset = mine_hard_negatives(
    train_dataset,
    embedding_model,
    num_negatives=num_hard_negatives,  # How many negatives per question-answer pair
    margin=0,  # Similarity between query and negative samples should be x lower than query-positive similarity
    range_min=0,  # Skip the x most similar samples
    range_max=100,  # Consider only the x most similar samples
    sampling_strategy="top",  # Sample the top negatives from the range
    batch_size=4096,  # Use a batch size of 4096 for the embedding model
    output_format="labeled-pair",  # The output format is (query, passage, label), as required by BinaryCrossEntropyLoss
    use_faiss=True,
)
logging.info(hard_train_dataset)

@@ -703,8 +714,8 @@ def main():
hard_eval_dataset = mine_hard_negatives(
    eval_dataset,
    embedding_model,
    corpus=full_dataset["answer"],  # Use the full dataset as the corpus
    num_negatives=30,  # How many documents to rerank
    batch_size=4096,
    include_positives=True,
    output_format="n-tuple",

@@ -743,8 +754,8 @@ def main():
per_device_eval_batch_size=train_batch_size,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
bf16=True,  # Set to True if you have a GPU that supports BF16
dataloader_num_workers=4,
load_best_model_at_end=True,
metric_for_best_model="eval_gooaq-dev_ndcg@10",

@@ -756,7 +767,7 @@ def main():
save_total_limit=2,
logging_steps=200,
logging_first_step=True,
run_name=run_name,  # Will be used in W&B if `wandb` is installed
seed=12,
)

@@ -783,7 +794,8 @@ def main():
    model.push_to_hub(run_name)
except Exception:
    logging.error(
        f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
        f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
        f"and saving it using `model.push_to_hub('{run_name}')`."
    )

@@ -882,7 +894,7 @@ def main(script_args, training_args, model_args):
if config.architectures and any(arch in valid_image_text_architectures for arch in config.architectures):
    from transformers import AutoModelForImageTextToText

    model_kwargs.pop("use_cache", None)  # Image models do not support cache
    model = AutoModelForImageTextToText.from_pretrained(model_args.model_name_or_path, **model_kwargs)
else:
    model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)

@@ -942,6 +954,80 @@ if __name__ == "__main__":
```
</details>

## 📈 Training and Evaluation

### Pre-training
For details on model pre-training, data preparation, and training recipes:
- **📖 [Pre-training Guide](pretraining/README.md)** - Complete training setup, data mixture, and ModernBERT recipe adaptation

### Evaluation

#### Encoder Evaluation
- **📊 [Encoder on Generative Tasks](docs/encoder-generative-eval.md)** - Evaluating encoders on language modeling tasks using our lm-evaluation-harness fork
- **📈 [Encoder Retrieval Training](docs/retrieval.md)** - Fine-tuning on MS MARCO and evaluation on MTEB v2 English
- **🎯 [GLUE Evaluation](glue_evaluation/README.md)** - Comprehensive GLUE benchmark evaluation with fine-tuning scripts

#### Decoder Evaluation
- **🎯 [Decoder on Generative Tasks](docs/decoder-eval.md)** - Using the EleutherAI evaluation harness (commit `867413f8677f00f6a817262727cbb041bf36192a`) for comprehensive generative task evaluation

#### Bias Evaluation
- **⚖️ [Gender Bias Evaluation](bias_eval/README.md)** - Comprehensive gender bias testing using Winogender "gotcha" examples. Tests how well models handle counter-stereotypical pronouns in occupational contexts. Supports both encoder (MLM) and decoder (perplexity) evaluation methods; a minimal decoder-side sketch follows below.
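
A minimal sketch of the decoder-side (perplexity) method described above; the sentence pair is illustrative and not taken from Winogender:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jhu-clsp/ettin-decoder-150m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def nll(text: str) -> float:
    """Average causal-LM loss; lower means the model finds the sentence more likely."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs, labels=inputs["input_ids"]).loss.item()

# Compare a stereotypical vs. counter-stereotypical pronoun in the same context.
print(nll("The nurse said that she would check on the patient."))
print(nll("The nurse said that he would check on the patient."))
```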

### Quick Decoder Evaluation Example

```bash
# Clone the specific commit of lm-evaluation-harness
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 867413f8677f00f6a817262727cbb041bf36192a
pip install -e .

# Run evaluation on an Ettin decoder
lm_eval --model hf \
    --model_args pretrained=jhu-clsp/ettin-decoder-150m \
    --tasks hellaswag,arc_easy,arc_challenge,winogrande \
    --device cuda:0 \
    --batch_size 8
```

## ❓ FAQ

### Model Loading Issues

**Q: I'm getting an error that ModernBERT-decoder isn't found.**
**A:** Make sure you have the latest version of transformers installed:
```bash
# for the latest version until the official PyPI release:
pip install git+https://github.com/huggingface/transformers.git
```

**Q: Which model should I choose for my task?**
**A:**
- **Classification/Retrieval/Understanding**: Use encoder models
- **Text Generation/Chat/Completion**: Use decoder models
- **Research on cross-training**: Use cross-objective models
- **Size selection**: Start with 150M for experimentation, then scale up to 400M or 1B for production

**Q: How do I access training checkpoints?**
**A:** Each model has multiple git tags for different training stages. Use the `revision` parameter:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m", revision="step500000")
```

**Q: Can I continue training these models?**
**A:** Yes! We provide raw checkpoints in the [jhu-clsp/ettin-checkpoints](https://huggingface.co/datasets/jhu-clsp/ettin-checkpoints) dataset that can be loaded into training frameworks.
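
The raw checkpoints can be fetched locally with `huggingface_hub`. A sketch; the dataset is large, so you will likely want to restrict the download with `allow_patterns` once you know its layout:

```python
from huggingface_hub import snapshot_download

# Download the raw training checkpoints dataset (or a filtered subset) to the local cache.
local_path = snapshot_download(repo_id="jhu-clsp/ettin-checkpoints", repo_type="dataset")
print(local_path)
```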

**Q: What's the difference between cross-objective models and regular models?**
**A:** Cross-objective models started as one architecture (e.g., a decoder) and were continued with a different objective (e.g., MLM). They demonstrate the limitations of cross-training and generally underperform natively trained models.

**Q: How do I reproduce the paper results?**
**A:** See our evaluation guides:
- [Encoder Generative Eval](docs/encoder-generative-eval.md)
- [Retrieval Eval](docs/retrieval.md)
- [GLUE Eval](glue_evaluation/README.md)
- [Decoder Eval](docs/decoder-eval.md)
- [Pre-training](pretraining/README.md)

## Citation

If you use Ettin models in your research, please cite our work:

@@ -956,4 +1042,12 @@
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.11412},
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](https://github.com/jhu-clsp/ettin-encoder-vs-decoder/blob/main/LICENSE) file for details.

---

**Contact**: For questions about the models or research, please open an issue or contact the authors.
|