Climate NER Model

This repository contains a fine-tuned Named Entity Recognition (NER) model specialized for climate change-related entities. The model was trained on the Climate Change NER dataset, which consists of 534 manually annotated abstracts from climate-related academic papers.

Model Description

This model is fine-tuned to recognize 13 climate-related entity types:

  • climate-assets
  • climate-datasets
  • climate-greenhouse-gases
  • climate-hazards
  • climate-impacts
  • climate-mitigations
  • climate-models
  • climate-nature
  • climate-observations
  • climate-organisms
  • climate-organizations
  • climate-problem-origins
  • climate-properties
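
Under the IOB tagging scheme used for training (described below), each of these 13 types expands into a B- and an I- label, plus a single O label for non-entity tokens, giving 27 labels in total. A quick sketch:

```python
ENTITY_TYPES = [
    "climate-assets", "climate-datasets", "climate-greenhouse-gases",
    "climate-hazards", "climate-impacts", "climate-mitigations",
    "climate-models", "climate-nature", "climate-observations",
    "climate-organisms", "climate-organizations",
    "climate-problem-origins", "climate-properties",
]

# Each type gets a B- (beginning) and I- (inside) tag under IOB,
# plus one shared O (outside) tag for tokens that are not entities.
labels = ["O"] + [f"{prefix}{t}" for t in ENTITY_TYPES for prefix in ("B-", "I-")]
print(len(labels))  # → 27
```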

Training Data

The model was trained on the Climate Change NER dataset, which contains 534 abstracts sourced from the Semantic Scholar Academic Graph. The abstracts were manually annotated with climate-related entities using the IOB (Inside-Outside-Beginning) tagging scheme.
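
As an illustration of the IOB scheme, multi-token entities start with a B- tag and continue with I- tags, while all other tokens are tagged O. The token/tag pairs below are invented for demonstration and are not taken from the dataset:

```python
# Hypothetical IOB annotation of a short phrase (illustrative only).
tokens = ["Rising", "carbon", "dioxide", "drives", "ocean", "acidification", "."]
tags   = ["O", "B-climate-greenhouse-gases", "I-climate-greenhouse-gases",
          "O", "B-climate-impacts", "I-climate-impacts", "O"]

# Token and tag sequences must stay aligned one-to-one.
assert len(tokens) == len(tags)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```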

Dataset Statistics:

  • Train set: 382 instances
  • Validation set: 77 instances
  • Test set: 75 instances
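
The three splits account for all 534 abstracts, a roughly 72/14/14 partition:

```python
# Split sizes as reported above; a quick consistency check and proportions.
splits = {"train": 382, "validation": 77, "test": 75}
total = sum(splits.values())
assert total == 534  # matches the dataset size stated above

for name, n in splits.items():
    print(f"{name}: {n} instances ({n / total:.0%})")
```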

Model Performance

We evaluated three different models on the Climate Change NER test set:

Model           Precision   Recall   F1 Score
specter2_base     0.57       0.61      0.57
modernBERT        0.45       0.42      0.41
BERT-base         0.53       0.57      0.52

We report micro-averaged metrics, since the entity classes are highly imbalanced in the test set.
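
To see why micro-averaging is the safer choice under class imbalance, here is a toy comparison in plain Python. The per-class counts are invented for illustration and do not come from the dataset:

```python
from collections import Counter

# Toy true-positive / false-positive / false-negative counts per class:
# one frequent, well-predicted class and one rare, poorly-predicted class.
counts = {
    "climate-hazards":  {"tp": 90, "fp": 10, "fn": 10},  # frequent class
    "climate-datasets": {"tp": 1,  "fp": 4,  "fn": 4},   # rare class
}

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro: average of per-class F1 — the rare class weighs as much as the frequent one.
macro = sum(f1(**c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute F1 — dominated by frequent classes.
totals = Counter()
for c in counts.values():
    totals.update(c)
micro = f1(totals["tp"], totals["fp"], totals["fn"])

print(f"macro F1 = {macro:.2f}, micro F1 = {micro:.2f}")  # → macro F1 = 0.55, micro F1 = 0.87
```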

Usage

from ipymarkup import show_span_box_markup
from transformers import pipeline
ner = pipeline(
    "ner",
    model="nicolauduran45/specter-climate-change-NER",
    tokenizer="nicolauduran45/specter-climate-change-NER",
    aggregation_strategy="simple",
    device=0,
)

text = 'multi-centennial variability of open ocean deep convection in the Atlantic sector of the Southern Ocean impacts the strength of the Atlantic Meridional Overturning Circulation (AMOC) in the Kiel Climate Model. The northward extent of Antarctic Bottom Water (AABW) strongly depends on the state of Weddell Sea deep convection.'

entities = ner(text)
spans = [(s['start'], s['end'], s['entity_group']) for s in entities]
show_span_box_markup(text, spans)

To improve the aggregation of words that the tokenizer splits into subword pieces, we recommend using the following function:

def predict_with_proper_aggregation(text):
    # Get the raw predictions
    raw_entities = ner(text)
    
    # Aggregate subword pieces into complete entities
    aggregated_entities = []
    current_entity = None
    
    for entity in raw_entities:
        # Check if this is a continuation token (starts with ##)
        is_continuation = entity["word"].startswith("##")
        
        if is_continuation and current_entity:
            # Update the current entity by removing ## and appending
            current_entity["word"] += entity["word"][2:]
            current_entity["end"] = entity["end"]
            
            # If entity types differ, keep the type with the higher confidence
            # (compare before the running score is updated below)
            if (entity["entity_group"] != current_entity["entity_group"]
                    and entity["score"] > current_entity["score"]):
                current_entity["entity_group"] = entity["entity_group"]

            # Keep the minimum (most pessimistic) score across the merged pieces
            current_entity["score"] = min(current_entity["score"], entity["score"])
        else:
            # If we have a previous entity, add it to results
            if current_entity:
                aggregated_entities.append(current_entity)
            
            # Start a new entity
            current_entity = entity.copy()
    
    # Don't forget the last entity
    if current_entity:
        aggregated_entities.append(current_entity)
    
    # Further aggregation: detect split entities that might not use ## notation
    # but should be merged based on adjacent positions
    i = 0
    while i < len(aggregated_entities) - 1:
        current = aggregated_entities[i]
        next_entity = aggregated_entities[i + 1]
        
        # Check if entities are adjacent and should be merged
        if (current["end"] == next_entity["start"] and 
            current["entity_group"] == next_entity["entity_group"]):
            # Merge entities
            current["word"] += next_entity["word"]
            current["end"] = next_entity["end"]
            current["score"] = (current["score"] + next_entity["score"]) / 2
            # Remove the next entity as it's now merged
            aggregated_entities.pop(i + 1)
        else:
            i += 1
    
    return aggregated_entities

entities = predict_with_proper_aggregation(text)
spans = [(s['start'], s['end'], s['entity_group']) for s in entities]
show_span_box_markup(text, spans)
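
To sanity-check the subword-merging step without loading the model, you can run the same core logic on hand-made, pipeline-style output. The entities below are fabricated for illustration:

```python
# Fabricated pipeline-style output: "AMOC" split into "AM" + "##OC".
raw_entities = [
    {"word": "AM",      "entity_group": "climate-models",  "score": 0.98, "start": 0, "end": 2},
    {"word": "##OC",    "entity_group": "climate-models",  "score": 0.95, "start": 2, "end": 4},
    {"word": "warming", "entity_group": "climate-impacts", "score": 0.90, "start": 5, "end": 12},
]

merged = []
for ent in raw_entities:
    if ent["word"].startswith("##") and merged:
        prev = merged[-1]
        prev["word"] += ent["word"][2:]                    # strip the ## marker and append
        prev["end"] = ent["end"]
        prev["score"] = min(prev["score"], ent["score"])   # keep the pessimistic score
    else:
        merged.append(dict(ent))

print([(e["word"], e["entity_group"]) for e in merged])
# → [('AMOC', 'climate-models'), ('warming', 'climate-impacts')]
```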