License: MIT

Model Card for boltuix/bert-medium

The boltuix/bert-medium model is a versatile BERT variant designed for natural language processing tasks requiring a strong balance of accuracy and computational efficiency. Pretrained on English text using masked language modeling (MLM) and next sentence prediction (NSP) objectives, it is optimized for fine-tuning on a wide range of NLP tasks, including sequence classification, token classification, and question answering. With a size of ~160 MB, it serves as a robust general-purpose model for applications needing reliable performance with moderate resource requirements.

Model Details

Model Description

The boltuix/bert-medium model is a PyTorch-based transformer model derived from TensorFlow checkpoints in the Google BERT repository. It builds on research from On the Importance of Pre-training Compact Models (arXiv) and Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics (arXiv). Ported to Hugging Face, this uncased model (~160 MB) is engineered for general-purpose NLP applications, such as sentiment analysis, named entity recognition, and natural language inference, making it ideal for researchers and developers seeking a balance between performance and resource efficiency.

  • Developed by: BoltUIX
  • Funded by: BoltUIX Research Fund
  • Shared by: Hugging Face
  • Model type: Transformer (BERT)
  • Language(s) (NLP): English (en)
  • License: MIT
  • Finetuned from model: google-bert/bert-base-uncased

Model Sources

Model Variants

BoltUIX offers a range of BERT-based models tailored to different performance and resource requirements. The boltuix/bert-medium model is a strong general-purpose option, ideal for applications needing reliable accuracy with moderate resource usage. Below is a summary of available models:

Tier     | Model ID              | Size    | Notes
-------- | --------------------- | ------- | ------------------------------------------------------
Micro    | boltuix/bert-micro    | ~15 MB  | Smallest, blazing-fast, moderate accuracy
Mini     | boltuix/bert-mini     | ~17 MB  | Ultra-compact, fast, slightly better accuracy
Tinyplus | boltuix/bert-tinyplus | ~20 MB  | Slightly bigger, better capacity
Small    | boltuix/bert-small    | ~45 MB  | Good compact/accuracy balance
Mid      | boltuix/bert-mid      | ~50 MB  | Well-rounded mid-tier performance
Medium   | boltuix/bert-medium   | ~160 MB | Strong general-purpose model
Large    | boltuix/bert-large    | ~365 MB | Top performer below full-BERT
Pro      | boltuix/bert-pro      | ~420 MB | Use only if max accuracy is mandatory
Mobile   | boltuix/bert-mobile   | ~140 MB | Mobile-optimized; quantize to ~25 MB with no major loss

For more details on each variant, visit the BoltUIX Model Hub.

Uses

Direct Use

The model can be used directly for masked language modeling or next sentence prediction tasks, such as predicting missing words in sentences or determining sentence coherence, delivering reliable accuracy in these core tasks.
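Next sentence prediction can be sketched with the Transformers library as follows. This is a minimal sketch that assumes the published checkpoint ships NSP head weights; if it does not, the head would be randomly initialized and the scores would not be meaningful. The helper names `is_next_from_logits` and `nsp_is_next` are illustrative and not part of the model repository.

```python
def is_next_from_logits(logits):
    """Interpret NSP logits: index 0 = 'B follows A', index 1 = 'B is random'."""
    return logits[0] > logits[1]


def nsp_is_next(sentence_a, sentence_b, model_id="boltuix/bert-medium"):
    """Return True if the model scores sentence_b as the continuation of sentence_a."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from transformers import BertForNextSentencePrediction, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    # Assumption: the checkpoint includes the NSP classification head.
    model = BertForNextSentencePrediction.from_pretrained(model_id)
    model.eval()

    # Passing two texts builds "[CLS] A [SEP] B [SEP]" with segment ids.
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return is_next_from_logits(logits.tolist())
```

Calling `nsp_is_next("The cat sat on the mat.", "It purred quietly.")` downloads the model on first use and returns a boolean.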

Downstream Use

The model is designed for fine-tuning on a variety of downstream NLP tasks, including:

  • Sequence classification (e.g., sentiment analysis, intent detection)
  • Token classification (e.g., named entity recognition, part-of-speech tagging)
  • Question answering (e.g., extractive QA, reading comprehension)
  • Natural language inference (e.g., MNLI, RTE)

It is recommended for researchers, developers, and enterprises seeking a general-purpose NLP model with strong performance and moderate resource requirements.
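As an illustration of fine-tuning for sequence classification, the sketch below runs a single gradient step on a toy batch. The two-class label set, the learning rate of 2e-5, and the function names are assumptions chosen for the example, not settings documented for this model; a real run would iterate over a dataset and evaluate on held-out data.

```python
ID2LABEL = {0: "negative", 1: "positive"}  # hypothetical label set


def predicted_label(logits_row, id2label=ID2LABEL):
    """Map one row of classifier logits to its label via argmax."""
    best = max(range(len(logits_row)), key=lambda i: logits_row[i])
    return id2label[best]


def fine_tune_one_step(texts, labels, model_id="boltuix/bert-medium", lr=2e-5):
    """Run one gradient step on a toy batch; returns (loss, model, tokenizer)."""
    # Imported lazily so predicted_label works without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # A fresh, randomly initialized classification head sits on the encoder.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=len(ID2LABEL)
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # lr is an assumption

    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=128, return_tensors="pt"
    )
    model.train()
    outputs = model(**batch, labels=torch.tensor(labels))  # loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item(), model, tokenizer
```

For example, `fine_tune_one_step(["great movie", "terrible plot"], [1, 0])` performs one update; `predicted_label` then turns a row of output logits into a readable label.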

Out-of-Scope Use

The model is not suitable for:

  • Text generation tasks (use generative models like GPT-3 instead).
  • Non-English language tasks without significant fine-tuning.
  • Ultra-low-latency or highly resource-constrained environments (use boltuix/bert-micro or boltuix/bert-mini instead).

Bias, Risks, and Limitations

The model may inherit biases from its training data (BookCorpus and English Wikipedia), potentially reinforcing stereotypes, such as gender or occupational biases. For example:

from transformers import pipeline
unmasker = pipeline('fill-mask', model='boltuix/bert-medium')
unmasker("The man worked as a [MASK].")

Output:

[
  {'sequence': '[CLS] the man worked as a engineer. [SEP]', 'token_str': 'engineer'},
  {'sequence': '[CLS] the man worked as a doctor. [SEP]', 'token_str': 'doctor'},
  ...
]

unmasker("The woman worked as a [MASK].")

Output:

[
  {'sequence': '[CLS] the woman worked as a teacher. [SEP]', 'token_str': 'teacher'},
  {'sequence': '[CLS] the woman worked as a nurse. [SEP]', 'token_str': 'nurse'},
  ...
]

These biases may propagate to downstream tasks. Due to its size (~160 MB), the model may still require optimization for deployment on resource-constrained devices.

Recommendations

Users should:

  • Conduct bias audits tailored to their application.
  • Fine-tune with diverse, representative datasets to reduce bias.
  • Apply model compression techniques (e.g., quantization, pruning) for resource-constrained deployments.
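One way to approach the compression recommendation is PyTorch dynamic quantization, sketched below. It assumes `torch.quantization.quantize_dynamic` is available in the installed PyTorch release (newer releases may prefer other entry points). The size helper uses the roughly 41.4M parameters reported for the checkpoint and gives only an upper bound on the savings, since dynamic quantization converts the Linear layers while the embeddings stay in float32.

```python
def estimated_size_mb(num_params, bytes_per_param):
    """Rough size of the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


def quantize_linear_layers(model_id="boltuix/bert-medium"):
    """Load the encoder and apply dynamic int8 quantization to its Linear layers."""
    # Imported lazily so the size helper works without the heavy dependencies.
    import torch
    from transformers import BertModel

    model = BertModel.from_pretrained(model_id)
    model.eval()
    # Only nn.Linear weights are converted; embeddings remain float32.
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )


print(estimated_size_mb(41_400_000, 4))  # float32 weights
print(estimated_size_mb(41_400_000, 1))  # if every weight were int8
```

Accuracy after quantization should be re-checked on the target task, since the effect varies by task.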

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import pipeline, BertTokenizer, BertModel

# Masked Language Modeling
unmasker = pipeline('fill-mask', model='boltuix/bert-medium')
result = unmasker("Hello I'm a [MASK] model.")
print(result)

# Feature Extraction (PyTorch)
tokenizer = BertTokenizer.from_pretrained('boltuix/bert-medium')
model = BertModel.from_pretrained('boltuix/bert-medium')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

Training Details

Training Data

The model was pretrained on:

  • BookCorpus: 11,038 unpublished books, providing diverse narrative text.
  • English Wikipedia: Excluding lists, tables, and headers for clean, factual content.

See the BoltUIX Dataset Card for more details.

Training Procedure

Preprocessing

  • Texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000.
  • Inputs are formatted as: [CLS] Sentence A [SEP] Sentence B [SEP].
  • 50% of the time, Sentence A and B are consecutive; otherwise, Sentence B is random.
  • Masking:
    • 15% of tokens are masked.
    • 80% of masked tokens are replaced with [MASK].
    • 10% are replaced with a random token.
    • 10% are left unchanged.
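The preprocessing rules above can be sketched in plain Python. This is an illustrative simplification, not the original data pipeline: it works on pre-split word lists rather than WordPiece ids, the toy vocabulary and corpus are made up, and the names `build_pair` and `mask_tokens` are hypothetical.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"


def build_pair(sent_a, next_sent, corpus, rng):
    """Return (tokens, is_next): 50% true next sentence, 50% random sentence."""
    if rng.random() < 0.5:
        sent_b, is_next = next_sent, True
    else:
        sent_b, is_next = rng.choice(corpus), False
    return [CLS, *sent_a, SEP, *sent_b, SEP], is_next


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Apply the 15% selection and 80/10/10 replacement rule; returns (inputs, labels)."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in (CLS, SEP) or rng.random() >= mask_prob:
            continue  # special tokens are never selected for prediction
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens, is_next = build_pair(
    ["the", "cat", "sat"], ["on", "the", "mat"], [["the", "dog", "ran"]], rng
)
inputs, labels = mask_tokens(tokens, vocab, rng)
print(tokens, is_next)
print(inputs, labels)
```

Positions whose label is `None` do not contribute to the MLM loss; only the selected 15% are predicted.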

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Optimizer: Adam (learning rate 1e-4, β1=0.9, β2=0.999, weight decay 0.01)
  • Batch size: 256
  • Steps: 1 million
  • Sequence length: 128 tokens (90% of steps), 512 tokens (10% of steps)
  • Warmup: 10,000 steps, followed by linear learning rate decay
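The learning rate and sequence length schedules implied by these hyperparameters can be sketched as below. Two details are assumptions rather than statements from this card: that the rate decays linearly to zero at the final step, and that the 128-token phase covers the first 90% of steps with the 512-token phase last (as in the original BERT recipe).

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 1_000_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


def sequence_length(step):
    """128 tokens for the first 90% of steps, 512 tokens for the last 10%."""
    return 128 if step < TOTAL_STEPS * 9 // 10 else 512


for step in (0, 5_000, 10_000, 500_000, 950_000, 1_000_000):
    print(step, learning_rate(step), sequence_length(step))
```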

Speeds, Sizes, Times

  • Training time: Approximately 200 hours
  • Checkpoint size: ~160 MB
  • Throughput: ~100 sentences/second on TPU infrastructure

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluated on the GLUE benchmark, including tasks like MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, and RTE.

Factors

  • Subpopulations: General English text, academic, and professional domains
  • Domains: News, books, Wikipedia, scientific articles

Metrics

  • Accuracy: For classification tasks (e.g., MNLI, SST-2)
  • F1 Score: For tasks like QQP, MRPC
  • Pearson/Spearman Correlation: For STS-B

Results

GLUE test results (fine-tuned):

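Task  | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average
----- | ----------- | ---- | ---- | ----- | ---- | ----- | ---- | ---- | -------
Score | 84.2/83.1   | 71.8 | 90.2 | 93.0  | 52.5 | 85.4  | 88.3 | 66.8 | 79.4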
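As a quick arithmetic check on the Average column, the unweighted mean of the nine listed scores can be recomputed as below. Treating MNLI-m and MNLI-mm as two separate entries is an assumption about how the average was formed; under it the mean comes to about 79.5, within rounding distance of the reported 79.4.

```python
# Scores copied from the GLUE results table above.
scores = {
    "MNLI-m": 84.2, "MNLI-mm": 83.1, "QQP": 71.8, "QNLI": 90.2, "SST-2": 93.0,
    "CoLA": 52.5, "STS-B": 85.4, "MRPC": 88.3, "RTE": 66.8,
}

# Unweighted mean over all nine entries.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")
```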

Summary

The model delivers strong performance across GLUE tasks, with notable results in SST-2 and QNLI. It outperforms smaller BERT variants in tasks like RTE and CoLA, making it a reliable general-purpose model.

Model Examination

The model’s attention mechanisms were analyzed to ensure balanced contextual understanding, with no significant overfitting observed during pretraining. Ablation studies validated the training configuration for general-purpose performance.

Environmental Impact

Carbon emissions estimated using the Machine Learning Impact calculator from Lacoste et al. (2019).

  • Hardware Type: 4 cloud TPUs (16 TPU chips)
  • Hours used: 200 hours
  • Cloud Provider: Google Cloud
  • Compute Region: us-central1
  • Carbon Emitted: ~140 kg CO2eq (estimated based on TPU energy consumption and regional grid carbon intensity)

Technical Specifications

Model Architecture and Objective

  • Architecture: BERT (transformer-based, bidirectional)
  • Objective: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
  • Layers: 8
  • Hidden Size: 512
  • Attention Heads: 8
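The listed dimensions can be related to the checkpoint size with a rough parameter count. The sketch below assumes values this card does not state: a 30,522-entry WordPiece vocabulary (the standard uncased BERT vocabulary), 512 position embeddings, 2 segment types, a feed-forward size of 4 × hidden, and a pooler layer. Under those assumptions the encoder has about 41.4M parameters, or roughly 165 MB in float32, in line with the ~160 MB figure.

```python
def bert_param_count(layers=8, hidden=512, vocab=30_522, max_pos=512,
                     type_vocab=2, intermediate=None, pooler=True):
    """Count weights and biases of a BERT encoder (embeddings, layers, pooler)."""
    intermediate = intermediate or 4 * hidden  # assumed feed-forward width
    layer_norm = 2 * hidden  # scale and bias

    # Token, position, and segment embeddings plus their LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Q, K, V, and output projections; the head count does not change the total.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    per_layer = attention + feed_forward

    pooler_params = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * per_layer + pooler_params


total = bert_param_count()
print(f"{total:,} parameters, ~{total * 4 / 1e6:.0f} MB in float32")
```

The embeddings account for well over a third of the total at this size, which is why shrinking the encoder alone gives diminishing returns on checkpoint size.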

Compute Infrastructure

Hardware

  • 4 cloud TPUs in Pod configuration (16 TPU chips total)

Software

  • PyTorch
  • Transformers library (Hugging Face)

Citation

BibTeX:

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805}
}

APA: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805. http://arxiv.org/abs/1810.04805

Glossary

  • MLM: Masked Language Modeling, where 15% of tokens are masked for prediction.
  • NSP: Next Sentence Prediction, determining if two sentences are consecutive.
  • WordPiece: Tokenization method splitting words into subword units.

More Information

Model Card Authors

  • Hugging Face team
  • BoltUIX contributors

Model Card Contact

For questions, please contact [email protected] or open an issue on the model repository.

Model size: 41.4M parameters (Safetensors, F32)