---
library_name: transformers
pipeline_tag: fill-mask
tags: [gpt-bert, babylm, remote-code]
license: other
---

# jumelet/gptbert-pol-250steps-base

GPT-BERT-style BabyBabyLLM model for Polish (**pol**).

This repository includes both the *main* and *EMA* (exponential moving average) weight variants.

**Default variant exposed to generic loaders:** `ema`

## Variants Available

`ema`, `main`

## Files

- `model.safetensors` (alias of the default variant)
- `model_ema.safetensors`
- `pytorch_model.bin` (legacy PyTorch format)
- `pol-2gpu-250steps.bin` (raw training checkpoint)
- `pol-2gpu-250steps_ema.bin` (raw training checkpoint)

## Configuration

```json
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "intermediate_size": 2560,
  "max_position_embeddings": 512,
  "position_bucket_size": 32,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "vocab_size": 16384,
  "layer_norm_eps": 1e-05,
  "force_causal_mask": true,
  "classifier_dropout": 0.1,
  "classifier_layer_norm_eps": 1e-05,
  "num_labels": 2
}
```

Tokenizer file: `tokenizer_pol_vs16384.json`

## Quick Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = 'jumelet/gptbert-pol-250steps-base'
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
out = model(**tok('Hello world', return_tensors='pt'))
```

A fuller masked-token prediction sketch is included at the end of this card.

### Forced Causal Attention

Causal attention is enforced during inference by applying a triangular future mask inside the remote code. This prevents the hybrid GPT-BERT layers from attending to future tokens even when a bidirectional mask is provided. A generic illustration of such a mask is given at the end of this card.

### Sequence Classification

`GPTBertForSequenceClassification` mirrors the original GLUE classifier head for downstream fine-tuning.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = 'jumelet/gptbert-pol-250steps-base'
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, trust_remote_code=True)
outputs = model(**tok('This movie was great!', return_tensors='pt'))
print(outputs.logits)
```

## Notes

- Converted on 2025-10-07T01:14:48.240581+00:00
- Weights are the exact trained parameters; no new layers were initialized.
- Requires `trust_remote_code=True` due to the custom architecture.
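
## Fill-Mask Prediction (sketch)

Since the card's pipeline tag is `fill-mask`, the snippet below extends the Quick Usage example to predict a masked token. It is a minimal sketch, assuming the shipped tokenizer defines a mask token (check `tok.mask_token` first); the Polish example sentence is purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = 'jumelet/gptbert-pol-250steps-base'
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Assumption: the tokenizer exposes a mask token; the sentence is a hypothetical example.
text = f"Warszawa to {tok.mask_token} Polski."
inputs = tok(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Take the most likely token at the masked position(s).
mask_positions = (inputs['input_ids'] == tok.mask_token_id).nonzero(as_tuple=True)
top_ids = logits[mask_positions].argmax(dim=-1)
print(tok.decode(top_ids))
```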
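
## Causal Mask Illustration

The triangular future mask mentioned under "Forced Causal Attention" can be visualised with the generic snippet below. This is an illustration of the masking pattern only, not the repository's remote code.

```python
import torch

seq_len = 5

# True marks future positions that a token is not allowed to attend to.
future_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Applied to attention scores before softmax: future positions are set to -inf,
# which yields a lower-triangular (causal) attention pattern.
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(future_mask, float('-inf'))
attn = scores.softmax(dim=-1)
print(attn)
```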