|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- ibm-research/Climate-Change-NER |
|
|
metrics: |
|
|
- f1 |
|
|
base_model: |
|
|
- allenai/specter2_base |
|
|
tags: |
|
|
- climate-change |
|
|
- ner |
|
|
--- |
|
|
|
|
|
# Climate NER Model |
|
|
|
|
|
This repository contains a fine-tuned Named Entity Recognition (NER) model specialized for climate change-related entities. The model was trained on the Climate Change NER dataset, which consists of 534 manually annotated abstracts from climate-related academic papers. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is fine-tuned to recognize 13 climate-related entity types: |
|
|
- climate-assets |
|
|
- climate-datasets |
|
|
- climate-greenhouse-gases |
|
|
- climate-hazards |
|
|
- climate-impacts |
|
|
- climate-mitigations |
|
|
- climate-models |
|
|
- climate-nature |
|
|
- climate-observations |
|
|
- climate-organisms |
|
|
- climate-organizations |
|
|
- climate-problem-origins |
|
|
- climate-properties |
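
Under the IOB tagging scheme used for training (described below), these 13 types expand into 27 token-level labels. Here is a minimal sketch of that expansion; the ordering is illustrative, so check the model's `config.json` for the actual `id2label` mapping:

```python
# Sketch of the label space implied by 13 entity types under IOB:
# one "O" tag plus a "B-" and an "I-" tag per type (27 labels in total).
ENTITY_TYPES = [
    "climate-assets", "climate-datasets", "climate-greenhouse-gases",
    "climate-hazards", "climate-impacts", "climate-mitigations",
    "climate-models", "climate-nature", "climate-observations",
    "climate-organisms", "climate-organizations",
    "climate-problem-origins", "climate-properties",
]
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
print(len(labels))  # 27
```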
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on the [Climate Change NER dataset](https://huggingface.co/datasets/ibm-research/Climate-Change-NER), which contains 534 abstracts sourced from the Semantic Scholar Academic Graph. The abstracts were manually annotated with climate-related entities using the IOB (Inside-Outside-Beginning) tagging scheme.
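
To inspect the data yourself, you can load it with the `datasets` library. This is a sketch using the dataset ID from this card's metadata; the exact column schema is best discovered from a sample record:

```python
from datasets import load_dataset

# Dataset ID taken from this card's metadata.
ds = load_dataset("ibm-research/Climate-Change-NER")
print(ds)              # expected splits: train / validation / test
print(ds["train"][0])  # print one record to see the token/tag schema
```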
|
|
|
|
|
**Dataset Statistics:** |
|
|
- Train set: 382 instances |
|
|
- Validation set: 77 instances |
|
|
- Test set: 75 instances |
|
|
|
|
|
## Model Performance |
|
|
|
|
|
We evaluated three different models on the Climate Change NER test set: |
|
|
|
|
|
| Model | Precision | Recall | F1 Score | |
|
|
|-------|-----------|--------|----------| |
|
|
| **specter2_base** | 0.57 | 0.61 | 0.57 | |
|
|
| ModernBERT | 0.45 | 0.42 | 0.41 |
|
|
| BERT-base | 0.53 | 0.57 | 0.52 | |
|
|
|
|
|
*We report micro-averaged metrics because the entity classes are highly imbalanced in the test set.*
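
For reference, span-level micro-averaged metrics of this kind can be computed with `seqeval`; the label sequences below are illustrative, not actual model output:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Illustrative gold and predicted IOB sequences (not real evaluation data).
y_true = [["B-climate-models", "I-climate-models", "O", "B-climate-hazards"]]
y_pred = [["B-climate-models", "I-climate-models", "O", "O"]]

# seqeval scores at the entity-span level; average="micro" pools all spans.
print(precision_score(y_true, y_pred, average="micro"))
print(recall_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="micro"))
```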
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from ipymarkup import show_span_box_markup |
|
|
from transformers import pipeline |
|
|
ner = pipeline(
    "ner",
    model="nicolauduran45/specter-climate-change-NER",
    tokenizer="nicolauduran45/specter-climate-change-NER",
    aggregation_strategy="simple",
    device=0,
)
|
|
|
|
|
text = 'multi-centennial variability of open ocean deep convection in the Atlantic sector of the Southern Ocean impacts the strength of the Atlantic Meridional Overturning Circulation (AMOC) in the Kiel Climate Model. The northward extent of Antarctic Bottom Water (AABW) strongly depends on the state of Weddell Sea deep convection.' |
|
|
|
|
|
entities = ner(text)
|
|
spans = [(s['start'], s['end'], s['entity_group']) for s in entities]
|
|
show_span_box_markup(text, spans) |
|
|
``` |
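
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys, so predictions can also be inspected without `ipymarkup`. Note that `device=0` assumes a GPU; pass `device=-1` (or omit the argument) to run on CPU.

```python
# Plain-text inspection of the aggregated predictions.
for ent in entities:
    print(f"{ent['entity_group']:<30} {ent['score']:.2f}  {ent['word']}")
```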
|
|
|
|
|
To improve the aggregation of words split into sub-word tokens, we recommend using the following function:
|
|
|
|
|
```python |
|
|
def predict_with_proper_aggregation(text):
    # Get the raw predictions
    raw_entities = ner(text)

    # Aggregate subword pieces into complete entities
    aggregated_entities = []
    current_entity = None

    for entity in raw_entities:
        # Check if this is a continuation token (starts with ##)
        is_continuation = entity["word"].startswith("##")

        if is_continuation and current_entity:
            # Update the current entity by removing ## and appending
            current_entity["word"] += entity["word"][2:]
            current_entity["end"] = entity["end"]

            # If entity types differ, keep the type with higher confidence
            # (compared before the score is updated below)
            if (entity["entity_group"] != current_entity["entity_group"]
                    and entity["score"] > current_entity["score"]):
                current_entity["entity_group"] = entity["entity_group"]

            # Keep the minimum sub-word score as a conservative estimate
            current_entity["score"] = min(current_entity["score"], entity["score"])
        else:
            # If we have a previous entity, add it to results
            if current_entity:
                aggregated_entities.append(current_entity)

            # Start a new entity
            current_entity = entity.copy()

    # Don't forget the last entity
    if current_entity:
        aggregated_entities.append(current_entity)

    # Further aggregation: detect split entities that might not use ##
    # notation but should be merged based on adjacent positions
    i = 0
    while i < len(aggregated_entities) - 1:
        current = aggregated_entities[i]
        next_entity = aggregated_entities[i + 1]

        # Check if entities are adjacent and should be merged
        if (current["end"] == next_entity["start"]
                and current["entity_group"] == next_entity["entity_group"]):
            # Merge entities, averaging their scores
            current["word"] += next_entity["word"]
            current["end"] = next_entity["end"]
            current["score"] = (current["score"] + next_entity["score"]) / 2
            # Remove the next entity as it's now merged
            aggregated_entities.pop(i + 1)
        else:
            i += 1

    return aggregated_entities
|
|
|
|
|
entities = predict_with_proper_aggregation(text)
|
|
spans = [(s['start'], s['end'], s['entity_group']) for s in entities]
|
|
show_span_box_markup(text, spans) |
|
|
``` |
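
This post-processing assumes a WordPiece-style tokenizer that marks sub-word continuations with `##`, as BERT-derived tokenizers do. Keeping the minimum sub-word score is a conservative confidence estimate for merged pieces, while adjacent same-type spans are merged with an averaged score.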