Update README.md
README.md
CHANGED
@@ -13,8 +13,8 @@ tags:
 ---
 # ModernCamemBERT
 
-[ModernCamemBERT](
-We also re-use the old [CamemBERTav2](https://huggingface.co/almanach/camembertav2-base) tokenizer. The model was first trained with 1024 context length which was then increased to 8192 tokens later in the pretraining. More details about the training process can be found in the [ModernCamemBERT](
+[ModernCamemBERT](https://arxiv.org/abs/2504.08716) is a French language model pretrained on a large corpus of 1T tokens of high-quality French text. It is the French version of the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) model. ModernCamemBERT was trained with the Masked Language Modeling (MLM) objective, using a 30% mask rate, on 1T tokens across 48 H100 GPUs. The training dataset combines French [RedPajama-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) filtered with heuristic and semantic filters, French scientific documents from [HALvest](https://huggingface.co/datasets/almanach/HALvest), and the French Wikipedia. Semantic filtering was done by fine-tuning a BERT classifier on a document-quality dataset automatically labeled by Llama-3 70B.
+We also re-use the old [CamemBERTav2](https://huggingface.co/almanach/camembertav2-base) tokenizer. The model was first trained with a 1024-token context length, which was later increased to 8192 tokens during pretraining. More details about the training process can be found in the [ModernCamemBERT](https://arxiv.org/abs/2504.08716) paper.
 
 The goal of ModernCamemBERT was to run a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT’s primary advantage being faster training and inference speed. However, the newly proposed model still provides meaningful architectural improvements compared to earlier BERT- and RoBERTa-based models such as CamemBERT and CamemBERTv2. Additionally, we observe that high-quality pretraining data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation.
 
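Since the updated card describes an MLM-pretrained encoder, a short usage sketch may help readers. This is a minimal, hedged example assuming the Hugging Face `transformers` library; the checkpoint id `almanach/moderncamembert-base` is a placeholder assumption, not something stated in this diff, so substitute the actual repo id of the model card.

```python
# Minimal fill-mask sketch for an MLM-pretrained encoder such as ModernCamemBERT.
# NOTE: "almanach/moderncamembert-base" is an assumed placeholder id; replace it
# with the actual repo id of this model card.
from transformers import AutoTokenizer, pipeline

model_id = "almanach/moderncamembert-base"  # assumption, not confirmed by the card
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Query the mask token from the (reused CamemBERTaV2) tokenizer instead of
# hardcoding "<mask>" or "[MASK]".
fill_mask = pipeline("fill-mask", model=model_id, tokenizer=tokenizer)
text = f"Le camembert est un fromage {tokenizer.mask_token} de Normandie."

for pred in fill_mask(text, top_k=5):
    print(f"{pred['token_str']!r:>15}  score={pred['score']:.3f}")
```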
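The 30% mask rate mentioned in the card is double the 15% commonly used for BERT-style pretraining. As a rough illustration of that objective only (a sketch, not the authors' actual training pipeline), the standard `transformers` MLM collator exposes the rate as `mlm_probability`:

```python
# Illustration of the 30% masking objective described in the card
# (a sketch only, not the ModernCamemBERT training code).
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# The card says the CamemBERTaV2 tokenizer is reused.
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # select 30% of tokens for masking instead of the usual 15%
)

batch = collator([tokenizer("Paris est la capitale de la France.")])
print(batch["input_ids"])  # some token ids replaced by the mask token id
print(batch["labels"])     # -100 everywhere except at the masked positions
```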