Update metadata and improve model card for Ettin decoder model (#2)
Update metadata and improve model card for Ettin decoder model (cacef4824d9d72fecac984c396b457d1a50e31f9)
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED

@@ -1,39 +1,51 @@
---
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
datasets:
- jhu-clsp/ettin-pretraining-data
- jhu-clsp/ettin-extension-data
- jhu-clsp/ettin-decay-data
tags:
- ettin
- decoder
---

# Ettin: an Open Suite of Paired Encoders and Decoders

[License: MIT](https://opensource.org/licenses/MIT)
[Paper](https://arxiv.org/abs/2507.11412)
[Models](https://huggingface.co/jhu-clsp)
[Data](https://huggingface.co/datasets/jhu-clsp)
[GitHub](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

> 🎯 **TL;DR**: State-of-the-art paired encoder and decoder models (17M-1B params) trained identically for fair comparison with open data. Encoders beat ModernBERT. Decoders beat Llama 3.2/SmolLM2.

📄 [Paper](https://arxiv.org/abs/2507.11412) | 🤗 [Model Collection](https://huggingface.co/jhu-clsp) | 📊 [Training Data](https://huggingface.co/datasets/jhu-clsp)

This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

## Table of Contents
- [📊 Performance Highlights](#-performance-highlights)
- [🚀 Quick Start](#-quick-start)
- [Model Description](#model-description)
- [Training Data](#training-data)
- [🤗 Model Family](#-model-family)
- [Encoder Models](#encoder-models)
- [Decoder Models](#decoder-models)
- [Cross-Objective Models](#cross-objective-models)
- [Accessing Training Checkpoints](#accessing-training-checkpoints)
- [🔬 Research Applications](#-research-applications)
- [Training Details](#training-details)
- [Model Architecture](#model-architecture)
- [Usage Examples](#usage-examples)
- [Fine-tuning Examples](#fine-tuning-examples)
- [📈 Training and Evaluation](#-training-and-evaluation)
- [❓ FAQ](#-faq)
- [Citation](#citation)
- [License](#license)

## 📊 Performance Highlights

@@ -82,11 +94,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

1. **Identical training data** - Same high-quality mixture across all models
2. **Open training data** - All data is released, including the batch-level training order for each of the 250+ checkpoints
3. **Matched architectures** - Differing only in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
4. **Consistent training recipe** - Three-phase training with 2T tokens
5. **Multiple scales** - From 17M to 1B parameters

This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
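
In practice, a matched pair loads side by side with the standard Auto classes. A minimal sketch using the 150M pair (the encoder is a masked LM, the decoder a causal LM):

```python
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

# The paired checkpoints share data, tokenizer, and recipe; only the objective differs.
decoder = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")
encoder = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
```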

@@ -94,12 +106,12 @@

The training data is publicly available and split across different phases:

- **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
- **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
- **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
- **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step); see the loading sketch below
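
The batch-level order can be streamed with the `datasets` library. A minimal sketch - the split name and the use of streaming are assumptions, and the column names follow the bullet above:

```python
from datasets import load_dataset

# Stream the batch-level training order (columns: input_ids, step).
order = load_dataset("jhu-clsp/ettin-data-order", split="train", streaming=True)

for row in order.take(2):
    print(row["step"], len(row["input_ids"]))  # training step and tokens in that row
```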

## 🤗 Model Family

### Encoder Models

@@ -174,9 +186,9 @@
#### HuggingFace Format Checkpoints
Each model repository contains multiple tagged versions representing different training stages:

- **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
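
The tags actually published for a repository can be listed with `huggingface_hub`; a small sketch, using the 150M decoder as an example:

```python
from huggingface_hub import list_repo_refs

# List the checkpoint tags (step*, ext*, decay*) exposed by one model repository.
refs = list_repo_refs("jhu-clsp/ettin-decoder-150m")
print(sorted(tag.name for tag in refs.tags))
```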

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -209,19 +221,19 @@

Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

- **Identical Training Data**: Same 2T token mixture across all models
- **Matched Architectures**: Only attention patterns and objectives differ
- **Open Everything**: Training data, model weights, and batch-level training order
- **Multiple Scales**: Fair comparison from 17M to 1B parameters
- **250+ Checkpoints**: Complete training trajectory analysis

### Use Cases for Researchers

- **Architecture Studies**: Compare encoder vs decoder capabilities fairly
- **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering (see the sketch after this list)
- **Scaling Laws**: Study how architectural advantages change with scale
- **Transfer Learning**: Investigate cross-objective training effectiveness
- **Replication Studies**: First open replication of the ModernBERT training recipe
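
As a concrete illustration of the training-dynamics use case, the same text can be scored under several checkpoint revisions. A hedged sketch that reuses the example tags from the checkpoint section above (not every tag necessarily exists for every size):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jhu-clsp/ettin-decoder-150m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Paired encoders and decoders allow controlled comparisons.", return_tensors="pt")

# Compare causal-LM loss across training stages (pretraining -> extension -> decay).
for revision in ["step599525", "ext1000", "decay100"]:
    model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(revision, round(loss.item(), 3))
```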

### Reproducibility

@@ -238,14 +250,14 @@ All training artifacts are publicly available:
**Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

**Training Phases:**
- **Pre-training**: 1.7T tokens with a diverse data mixture
- **Mid-training**: 250B tokens of higher-quality filtered data, with context extension to 8K
- **Decay phase**: 100B tokens of premium data sources

**Key Features:**
- Context length: up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles
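
These values can be confirmed straight from each checkpoint's config. A quick sketch - the size suffixes are assumed from the 17M-1B lineup, and the attribute names follow standard `transformers` config conventions, so they may differ slightly for this architecture:

```python
from transformers import AutoConfig

# Print a few architecture fields for each decoder size (attribute names may vary).
for size in ["17m", "32m", "68m", "150m", "400m", "1b"]:
    cfg = AutoConfig.from_pretrained(f"jhu-clsp/ettin-decoder-{size}")
    print(size, cfg.vocab_size, cfg.max_position_embeddings, cfg.hidden_size, cfg.num_hidden_layers)
```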

## Model Architecture

@@ -256,8 +268,6 @@
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |

## Usage Examples

### Encoder: Masked Language Modeling

@@ -376,7 +386,7 @@ def main():
eval_dataset = dataset_dict["test"]

# 3. Define a loss function
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)  # Increase mini_batch_size if you have enough VRAM

run_name = f"{model_shortname}-DPR-{lr}"
# 4. (Optional) Specify training arguments

@@ -388,16 +398,16 @@ def main():
per_device_train_batch_size=512,
per_device_eval_batch_size=512,
warmup_ratio=0.05,
fp16=False,  # Set to False if GPU can't handle FP16
bf16=True,  # Set to True if GPU supports BF16
batch_sampler=BatchSamplers.NO_DUPLICATES,  # (Cached)MultipleNegativesRankingLoss benefits from no duplicates
learning_rate=lr,
# Optional tracking/debugging parameters:
save_strategy="steps",
save_steps=500,
save_total_limit=2,
logging_steps=500,
run_name=run_name,  # Used in `wandb`, `tensorboard`, `neptune`, etc. if installed
)

# 5. (Optional) Create an evaluator & evaluate the base model

@@ -432,6 +442,7 @@ def main():
if __name__ == "__main__":
    main()
```

</details>

@@ -487,8 +498,8 @@ def main():
output_dir=output_dir,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=batch_size,
fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
bf16=True,  # Set to True if you have a GPU that supports BF16
run_name=run_name,
logging_steps=10,
learning_rate=lr,

@@ -572,9 +583,9 @@ args = SparseEncoderTrainingArguments(
per_device_eval_batch_size=16,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
bf16=False,  # Set to True if you have a GPU that supports BF16
batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
# Optional tracking/debugging parameters:
eval_strategy="steps",
eval_steps=1000,

@@ -582,7 +593,7 @@ args = SparseEncoderTrainingArguments(
save_steps=1000,
save_total_limit=2,
logging_steps=200,
run_name=run_name,  # Will be used in W&B if `wandb` is installed
)

# 6. (Optional) Create an evaluator & evaluate the base model

@@ -644,7 +655,7 @@ def main():

train_batch_size = 64
num_epochs = 1
num_hard_negatives = 5  # How many hard negatives should be mined for each question-answer pair

# 1a. Load a model to finetune with 1b. (Optional) model card data
model = CrossEncoder(

@@ -671,13 +682,13 @@ def main():
hard_train_dataset = mine_hard_negatives(
    train_dataset,
    embedding_model,
    num_negatives=num_hard_negatives,  # How many negatives per question-answer pair
    margin=0,  # Similarity between query and negative samples should be x lower than query-positive similarity
    range_min=0,  # Skip the x most similar samples
    range_max=100,  # Consider only the x most similar samples
    sampling_strategy="top",  # Sample the top negatives from the range
    batch_size=4096,  # Use a batch size of 4096 for the embedding model
    output_format="labeled-pair",  # The output format is (query, passage, label), as required by BinaryCrossEntropyLoss
    use_faiss=True,
)
logging.info(hard_train_dataset)

@@ -703,8 +714,8 @@ def main():
hard_eval_dataset = mine_hard_negatives(
    eval_dataset,
    embedding_model,
    corpus=full_dataset["answer"],  # Use the full dataset as the corpus
    num_negatives=30,  # How many documents to rerank
    batch_size=4096,
    include_positives=True,
    output_format="n-tuple",

@@ -743,8 +754,8 @@ def main():
per_device_eval_batch_size=train_batch_size,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
bf16=True,  # Set to True if you have a GPU that supports BF16
dataloader_num_workers=4,
load_best_model_at_end=True,
metric_for_best_model="eval_gooaq-dev_ndcg@10",

@@ -756,7 +767,7 @@ def main():
save_total_limit=2,
logging_steps=200,
logging_first_step=True,
run_name=run_name,  # Will be used in W&B if `wandb` is installed
seed=12,
)

@@ -783,7 +794,8 @@ def main():
    model.push_to_hub(run_name)
except Exception:
    logging.error(
        f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
        f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
        f"and saving it using `model.push_to_hub('{run_name}')`."
    )

@@ -882,7 +894,7 @@ def main(script_args, training_args, model_args):
if config.architectures and any(arch in valid_image_text_architectures for arch in config.architectures):
    from transformers import AutoModelForImageTextToText

    model_kwargs.pop("use_cache", None)  # Image models do not support cache
    model = AutoModelForImageTextToText.from_pretrained(model_args.model_name_or_path, **model_kwargs)
else:
    model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)

@@ -942,6 +954,80 @@ if __name__ == "__main__":
```
</details>

## 📈 Training and Evaluation

### Pre-training
For details on model pre-training, data preparation, and training recipes:
- **📖 [Pre-training Guide](pretraining/README.md)** - Complete training setup, data mixture, and ModernBERT recipe adaptation

### Evaluation

#### Encoder Evaluation
- **📊 [Encoder on Generative Tasks](docs/encoder-generative-eval.md)** - Evaluating encoders on language modeling tasks using our lm-evaluation-harness fork
- **📈 [Encoder Retrieval Training](docs/retrieval.md)** - Fine-tuning on MS MARCO and evaluation on MTEB v2 English
- **🎯 [GLUE Evaluation](glue_evaluation/README.md)** - Comprehensive GLUE benchmark evaluation with fine-tuning scripts

#### Decoder Evaluation
- **🎯 [Decoder on Generative Tasks](docs/decoder-eval.md)** - Using the EleutherAI evaluation harness (commit `867413f8677f00f6a817262727cbb041bf36192a`) for comprehensive generative task evaluation

#### Bias Evaluation
- **⚖️ [Gender Bias Evaluation](bias_eval/README.md)** - Comprehensive gender bias testing using Winogender "gotcha" examples. Tests how well models handle counter-stereotypical pronouns in occupational contexts. Supports both encoder (MLM) and decoder (perplexity) evaluation methods; a minimal decoder-side sketch follows below.
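
A minimal sketch of the decoder-side (perplexity) method described above; the sentence pair is illustrative and not taken from Winogender:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jhu-clsp/ettin-decoder-150m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def nll(text: str) -> float:
    """Average causal-LM loss; lower means the model finds the sentence more likely."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs, labels=inputs["input_ids"]).loss.item()

# Compare a stereotypical vs. counter-stereotypical pronoun in the same context.
print(nll("The nurse said that she would check on the patient."))
print(nll("The nurse said that he would check on the patient."))
```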

### Quick Decoder Evaluation Example

```bash
# Clone the specific commit of lm-evaluation-harness
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 867413f8677f00f6a817262727cbb041bf36192a
pip install -e .

# Run evaluation on an Ettin decoder
lm_eval --model hf \
    --model_args pretrained=jhu-clsp/ettin-decoder-150m \
    --tasks hellaswag,arc_easy,arc_challenge,winogrande \
    --device cuda:0 \
    --batch_size 8
```

## ❓ FAQ

### Model Loading Issues

**Q: I'm getting an error that ModernBERT-decoder isn't found.**
**A:** Make sure you have the latest version of transformers installed:
```bash
# for the latest version until the official PyPI release:
pip install git+https://github.com/huggingface/transformers.git
```

**Q: Which model should I choose for my task?**
**A:**
- **Classification/Retrieval/Understanding**: Use encoder models
- **Text Generation/Chat/Completion**: Use decoder models
- **Research on cross-training**: Use cross-objective models
- **Size selection**: Start with 150M for experimentation, then scale up to 400M or 1B for production

**Q: How do I access training checkpoints?**
**A:** Each model has multiple git tags for different training stages. Use the `revision` parameter:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m", revision="step500000")
```

**Q: Can I continue training these models?**
**A:** Yes! We provide raw checkpoints in the [jhu-clsp/ettin-checkpoints](https://huggingface.co/datasets/jhu-clsp/ettin-checkpoints) dataset that can be loaded into training frameworks.
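
The raw checkpoints can be fetched locally with `huggingface_hub`. A sketch; the dataset is large, so you will likely want to restrict the download with `allow_patterns` once you know its layout:

```python
from huggingface_hub import snapshot_download

# Download the raw training checkpoints dataset (or a filtered subset) to the local cache.
local_path = snapshot_download(repo_id="jhu-clsp/ettin-checkpoints", repo_type="dataset")
print(local_path)
```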

**Q: What's the difference between cross-objective models and regular models?**
**A:** Cross-objective models started as one architecture (e.g., a decoder) and were continued with a different objective (e.g., MLM). They demonstrate the limitations of cross-training and generally underperform natively trained models.

**Q: How do I reproduce the paper results?**
**A:** See our evaluation guides:
- [Encoder Generative Eval](docs/encoder-generative-eval.md)
- [Retrieval Eval](docs/retrieval.md)
- [GLUE Eval](glue_evaluation/README.md)
- [Decoder Eval](docs/decoder-eval.md)
- [Pre-training](pretraining/README.md)

## Citation

If you use Ettin models in your research, please cite our work:

@@ -956,4 +1042,12 @@
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.11412},
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](https://github.com/jhu-clsp/ettin-encoder-vs-decoder/blob/main/LICENSE) file for details.

---

**Contact**: For questions about the models or research, please open an issue or contact the authors.
|