CliSciBERT 🌿📚

CliSciBERT is a domain-adapted version of SciBERT, further pretrained on a curated corpus of peer-reviewed research papers in the climate change domain. It is designed to enhance performance on climate-focused scientific NLP tasks by adapting the general scientific knowledge of SciBERT to the specialized subdomain of climate research.

🔍 Overview

  • Base Model: SciBERT (BERT-base architecture, scientific vocab)
  • Pretraining Method: Continued pretraining (domain adaptation) using Masked Language Modeling (MLM)
  • Corpus: Scientific papers focused on climate change and environmental science
  • Tokenizer: SciBERT tokenizer (unchanged)
  • Language: English
  • Domain: Climate change research
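To illustrate the masked language modeling objective named above, here is a minimal sketch of BERT-style token masking. The 15% selection rate and the 80/10/10 replacement split follow the original BERT recipe and are assumptions here; this card does not state the settings used for CliSciBERT. The function `mask_tokens` and the toy vocabulary are hypothetical names used only for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) for one sequence.

    labels[i] holds the original token at positions the model must
    predict, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

# Toy example; the sentence and vocabulary are made up.
tokens = "the radiative forcing of carbon dioxide is positive".split()
toy_vocab = ["ocean", "ice", "aerosol"]
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.3)
print(corrupted)
print(labels)
```

During continued pretraining the loss is computed only at positions where `labels` is not `None`, which is how the model adapts its predictions to climate text without any labeled data.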

📊 Performance

Evaluated on ClimaBench, a benchmark for climate-focused NLP tasks:

| Metric | Value |
|---|---|
| Macro F1 (avg) | 60.50 |
| Tasks won | 0/7 |
| Avg. Std Dev | 0.01772 |

Note: CliSciBERT builds on SciBERT's scientific grounding, and its domain specialization is intended to improve relevance for climate-related NLP tasks; in the results above it won none of the seven ClimaBench tasks.
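For readers unfamiliar with the headline metric, the following is a minimal sketch of how macro F1 is usually computed for a single classification task: F1 is calculated per class and then averaged with equal weight. How the scores were aggregated across ClimaBench tasks and runs is not described in this card, so the sketch covers one task only. The label names are hypothetical.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

# Hypothetical gold labels and predictions for a two-class task.
gold = ["claim", "claim", "no_claim", "no_claim"]
pred = ["claim", "no_claim", "no_claim", "no_claim"]
print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

Because every class counts equally, macro F1 penalizes poor performance on rare classes more than accuracy does.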

Climate performance model card:

| CliSciBERT | |
|---|---|
| 1. Model publicly available? | Yes |
| 2. Time to train final model | 463h |
| 3. Time for all experiments | 1,226h ~ 51 days |
| 4. Power of GPU and CPU | 0.250 kW + 0.013 kW |
| 5. Location for computations | Croatia |
| 6. Energy mix at location | 224.71 gCO2eq/kWh |
| 7. CO2eq for final model | 28 kg CO2 |
| 8. CO2eq for all experiments | 74 kg CO2 |
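The emission figures can be roughly reproduced from the other rows as time × power × carbon intensity. The sketch below assumes exactly that product with no overhead factor; it yields about 27 kg and 72 kg, slightly below the reported 28 kg and 74 kg, so the authors may have applied rounding or an overhead that this card does not state. The function name `co2eq_kg` is hypothetical.

```python
def co2eq_kg(hours, power_kw, intensity_g_per_kwh):
    """Estimate emissions in kg CO2eq from runtime, power draw and grid mix."""
    energy_kwh = hours * power_kw
    return energy_kwh * intensity_g_per_kwh / 1000.0  # grams -> kilograms

POWER_KW = 0.250 + 0.013          # GPU + CPU, from the table
INTENSITY_G_PER_KWH = 224.71      # energy mix at the compute location

final_kg = co2eq_kg(463, POWER_KW, INTENSITY_G_PER_KWH)
all_kg = co2eq_kg(1226, POWER_KW, INTENSITY_G_PER_KWH)

print(f"final model:     {final_kg:.2f} kg CO2eq")
print(f"all experiments: {all_kg:.2f} kg CO2eq")
```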

🧪 Intended Uses

Use for:

  • Scientific text classification and relation extraction in climate change literature
  • Domain-specific document tagging or summarization
  • Supporting knowledge graph population for climate research

Not recommended for:

  • Non-climate or general news content
  • Non-English corpora
  • Highly informal or colloquial text
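For the text classification use listed above, a common starting point is to load the checkpoint with a sequence classification head and fine-tune it on labeled data. The following is a minimal sketch under assumptions: the repository id is taken to be `P0L3/cliscibert_scivocab_uncased`, and the label set and helper names (`make_label_maps`, `load_classifier`) are hypothetical. The classification head is newly initialized, so the model needs fine-tuning before its predictions are meaningful.

```python
def make_label_maps(labels):
    """Build the id<->label dictionaries expected by the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id

def load_classifier(labels, model_name="P0L3/cliscibert_scivocab_uncased"):
    """Load tokenizer and encoder with an untrained classification head."""
    # Imported lazily so make_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = make_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model

# Hypothetical label set for a climate document classifier.
LABELS = ["mitigation", "adaptation", "other"]
print(make_label_maps(LABELS))
```

Calling `load_classifier(LABELS)` downloads the weights; the returned pair can then be passed to a standard fine-tuning loop.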

Example:

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/cliscibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Move model to GPU if available
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10*">")
for p in predictions:
    print(f"{p['sequence']} - {p['score']:.4f}")

Output:

The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth. - 0.3911
the increase in greenhouse gas ... affected the radiative balance of the earth. - 0.2640
the increase in greenhouse gas ... affected the radiation balance of the earth. - 0.1233
the increase in greenhouse gas ... affected the carbon balance of the earth. - 0.0589
the increase in greenhouse gas ... affected the ecological balance of the earth. - 0.0332

⚠️ Limitations

  • Retains SciBERT's limitations outside the scientific domain
  • May inherit biases from climate change literature
  • No tokenizer retraining: tokenization is optimized for general science, not climate-specific vocabulary
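One way to gauge the tokenizer limitation is to count how many subword pieces each climate term is split into; terms missing from the vocabulary break into several pieces. The sketch below is an illustration under assumptions: `subword_counts` and `toy_tokenize` are hypothetical helpers, the toy tokenizer is only a stand-in so the example runs without downloads, and the repository id in `load_tokenize` is assumed to be `P0L3/cliscibert_scivocab_uncased`.

```python
def subword_counts(words, tokenize):
    """Map each word to the number of subword pieces it is split into."""
    return {word: len(tokenize(word)) for word in words}

def toy_tokenize(word, vocab=("climate", "carbon"), piece_len=4):
    """Stand-in tokenizer: known words stay whole, others are chunked."""
    if word in vocab:
        return [word]
    chunks = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]

def load_tokenize(model_name="P0L3/cliscibert_scivocab_uncased"):
    """Return the real tokenizer's tokenize method (downloads files)."""
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(model_name).tokenize

terms = ["carbon", "albedo", "permafrost"]
print(subword_counts(terms, toy_tokenize))
```

Replacing `toy_tokenize` with `load_tokenize()` gives counts for the actual SciBERT vocabulary; consistently high counts on key climate terms would indicate the fragmentation this limitation describes.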

🧾 Citation

If you use this model, please cite:

@article{poleksic_etal_2025,
  title={Climate Research Domain BERTs: Pretraining, Adaptation, and Evaluation},
  author={Poleksić, Andrija  and
      Martinčić-Ipšić, Sanda},
  journal={PREPRINT (Version 1)},
  year={2025},
  doi={https://doi.org/10.21203/rs.3.rs-6644722/v1}
}