Model Card for boltuix/bert-large
The boltuix/bert-large
model is a high-performance BERT variant designed for natural language processing tasks requiring excellent accuracy with balanced resource demands. Pretrained on English text using masked language modeling (MLM) and next sentence prediction (NSP) objectives, it is optimized for fine-tuning on complex NLP tasks such as sequence classification, token classification, and question answering. With a size of ~365 MB, it offers robust performance for applications needing high accuracy without the maximum computational overhead of boltuix/bert-pro
.
Model Details
Model Description
The boltuix/bert-large
model is a PyTorch-based transformer model derived from TensorFlow checkpoints in the Google BERT repository. It builds on research from On the Importance of Pre-training Compact Models (arXiv) and Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics (arXiv). Ported to Hugging Face, this uncased model (~365 MB) is engineered for applications requiring high accuracy, such as natural language inference, sentiment analysis, and question answering, making it ideal for enterprise and research applications where performance and efficiency are both priorities.
- Developed by: BoltUIX
- Funded by: BoltUIX Research Fund
- Shared by: Hugging Face
- Model type: Transformer (BERT)
- Language(s) (NLP): English (
en
) - License: MIT
- Finetuned from model: google-bert/bert-base-uncased
Model Sources
- Repository: Hugging Face Model Hub
- Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Demo: Hugging Face Spaces Demo
Model Variants
BoltUIX offers a range of BERT-based models tailored to different performance and resource requirements. The boltuix/bert-large
model is a top performer just below the maximum-accuracy boltuix/bert-pro
, ideal for applications needing high accuracy with moderate resource usage. Below is a summary of available models:
Tier | Model ID | Size (MB) | Notes |
---|---|---|---|
Micro | boltuix/bert-micro | ~15 MB | Smallest, blazing-fast, moderate accuracy |
Mini | boltuix/bert-mini | ~17 MB | Ultra-compact, fast, slightly better accuracy |
Tinyplus | boltuix/bert-tinyplus | ~20 MB | Slightly bigger, better capacity |
Small | boltuix/bert-small | ~45 MB | Good compact/accuracy balance |
Mid | boltuix/bert-mid | ~50 MB | Well-rounded mid-tier performance |
Medium | boltuix/bert-medium | ~160 MB | Strong general-purpose model |
Large | boltuix/bert-large | ~365 MB | Top performer below full-BERT |
Pro | boltuix/bert-pro | ~420 MB | Use only if max accuracy is mandatory |
Mobile | boltuix/bert-mobile | ~140 MB | Mobile-optimized; quantize to ~25 MB with no major loss |
For more details on each variant, visit the BoltUIX Model Hub.
Uses
Direct Use
The model can be used directly for masked language modeling or next sentence prediction tasks, such as predicting missing words in sentences or determining sentence coherence, delivering strong accuracy in these core tasks.
Downstream Use
The model is designed for fine-tuning on high-stakes downstream NLP tasks, including:
- Sequence classification (e.g., sentiment analysis, intent detection)
- Token classification (e.g., named entity recognition, part-of-speech tagging)
- Question answering (e.g., extractive QA, reading comprehension)
- Natural language inference (e.g., MNLI, RTE) It is recommended for researchers, data scientists, and enterprises requiring high-performance NLP solutions with manageable resource requirements.
Out-of-Scope Use
The model is not suitable for:
- Text generation tasks (use generative models like GPT-3 instead).
- Non-English language tasks without significant fine-tuning.
- Ultra-low-latency or highly resource-constrained environments (use
boltuix/bert-micro
orboltuix/bert-mini
instead).
Bias, Risks, and Limitations
The model may inherit biases from its training data (BookCorpus and English Wikipedia), potentially reinforcing stereotypes, such as gender or occupational biases. For example:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='boltuix/bert-large')
unmasker("The man worked as a [MASK].")
Output:
[
{'sequence': '[CLS] the man worked as a engineer. [SEP]', 'token_str': 'engineer'},
{'sequence': '[CLS] the man worked as a doctor. [SEP]', 'token_str': 'doctor'},
...
]
unmasker("The woman worked as a [MASK].")
Output:
[
{'sequence': '[CLS] the woman worked as a teacher. [SEP]', 'token_str': 'teacher'},
{'sequence': '[CLS] the woman worked as a nurse. [SEP]', 'token_str': 'nurse'},
...
]
These biases may propagate to downstream tasks. Due to its size (~365 MB), the model requires notable computational resources, making it less suitable for edge devices without optimization.
Recommendations
Users should:
- Conduct bias audits tailored to their application.
- Fine-tune with diverse, representative datasets to reduce bias.
- Apply model compression techniques (e.g., quantization, pruning) for resource-constrained deployments.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import pipeline, BertTokenizer, BertModel
# Masked Language Modeling
unmasker = pipeline('fill-mask', model='boltuix/bert-large')
result = unmasker("Hello I'm a [MASK] model.")
print(result)
# Feature Extraction (PyTorch)
tokenizer = BertTokenizer.from_pretrained('boltuix/bert-large')
model = BertModel.from_pretrained('boltuix/bert-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
Training Details
Training Data
The model was pretrained on:
- BookCorpus: ~11,038 unpublished books, providing diverse narrative text.
- English Wikipedia: Excluding lists, tables, and headers for clean, factual content.
See the BoltUIX Dataset Card for more details.
Training Procedure
Preprocessing
- Texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000.
- Inputs are formatted as:
[CLS] Sentence A [SEP] Sentence B [SEP]
. - 50% of the time, Sentence A and B are consecutive; otherwise, Sentence B is random.
- Masking:
- 15% of tokens are masked.
- 80% of masked tokens are replaced with
[MASK]
. - 10% are replaced with a random token.
- 10% are left unchanged.
Training Hyperparameters
- Training regime: fp16 mixed precision
- Optimizer: Adam (learning rate 1e-4, β1=0.9, β2=0.999, weight decay 0.01)
- Batch size: 512
- Steps: 1.2 million
- Sequence length: 128 tokens (85% of steps), 512 tokens (15% of steps)
- Warmup: 12,000 steps with linear learning rate decay
Speeds, Sizes, Times
- Training time: Approximately 300 hours
- Checkpoint size: ~365 MB
- Throughput: ~85 sentences/second on TPU infrastructure
Evaluation
Testing Data, Factors & Metrics
Testing Data
Evaluated on the GLUE benchmark, including tasks like MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, and RTE.
Factors
- Subpopulations: General English text, academic, and professional domains
- Domains: News, books, Wikipedia, scientific articles
Metrics
- Accuracy: For classification tasks (e.g., MNLI, SST-2)
- F1 Score: For tasks like QQP, MRPC
- Pearson/Spearman Correlation: For STS-B
Results
GLUE test results (fine-tuned):
Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
---|---|---|---|---|---|---|---|---|---|
Score | 85.8/84.7 | 72.5 | 91.8 | 94.2 | 54.8 | 86.9 | 89.7 | 68.2 | 80.9 |
Summary
The model performs exceptionally across GLUE tasks, with strong results in SST-2, QNLI, and MRPC. It offers improved performance over smaller BERT variants in complex tasks like RTE and CoLA, making it a top performer just below boltuix/bert-pro
.
Model Examination
The model’s attention mechanisms were analyzed to ensure robust contextual understanding, with minimal overfitting observed during pretraining. Ablation studies confirmed the effectiveness of the training configuration for high performance.
Environmental Impact
Carbon emissions estimated using the Machine Learning Impact calculator from Lacoste et al. (2019).
- Hardware Type: 6 cloud TPUs (24 TPU chips)
- Hours used: 300 hours
- Cloud Provider: Google Cloud
- Compute Region: us-central1
- Carbon Emitted: ~200 kg CO2eq (estimated based on TPU energy consumption and regional grid carbon intensity)
Technical Specifications
Model Architecture and Objective
- Architecture: BERT (transformer-based, bidirectional)
- Objective: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
- Layers: 12
- Hidden Size: 768
- Attention Heads: 12
Compute Infrastructure
Hardware
- 6 cloud TPUs in Pod configuration (24 TPU chips total)
Software
- PyTorch
- Transformers library (Hugging Face)
Citation
BibTeX:
@article{DBLP:journals/corr/abs-1810-04805,
author = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
journal = {CoRR},
volume = {abs/1810.04805},
year = {2018},
url = {http://arxiv.org/abs/1810.04805},
archivePrefix = {arXiv},
eprint = {1810.04805}
}
APA: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805. http://arxiv.org/abs/1810.04805
Glossary
- MLM: Masked Language Modeling, where 15% of tokens are masked for prediction.
- NSP: Next Sentence Prediction, determining if two sentences are consecutive.
- WordPiece: Tokenization method splitting words into subword units.
More Information
- See the Hugging Face documentation for advanced usage details.
- Contact: [email protected]
Model Card Authors
- Hugging Face team
- BoltUIX contributors
Model Card Contact
For questions, please contact [email protected] or open an issue on the model repository.
- Downloads last month
- 0