|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- ibm-research/Climate-Change-NER |
|
|
metrics: |
|
|
- f1 |
|
|
base_model: |
|
|
- allenai/specter2_base |
|
|
tags: |
|
|
- climate-change |
|
|
- ner |
|
|
--- |
|
|
|
|
|
# Climate NER Model |
|
|
|
|
|
This repository contains a fine-tuned Named Entity Recognition (NER) model specialized for climate change-related entities. The model was trained on the Climate Change NER dataset, which consists of 534 manually annotated abstracts from climate-related academic papers. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is fine-tuned to recognize 13 climate-related entity types: |
|
|
- climate-assets |
|
|
- climate-datasets |
|
|
- climate-greenhouse-gases |
|
|
- climate-hazards |
|
|
- climate-impacts |
|
|
- climate-mitigations |
|
|
- climate-models |
|
|
- climate-nature |
|
|
- climate-observations |
|
|
- climate-organisms |
|
|
- climate-organizations |
|
|
- climate-problem-origins |
|
|
- climate-properties |
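
Under the IOB tagging scheme used for training (described below), these 13 types expand into 27 token-level labels. Here is a minimal sketch of that expansion; the ordering is illustrative, so check the model's `config.json` for the actual `id2label` mapping:

```python
# Sketch of the label space implied by 13 entity types under IOB:
# one "O" tag plus a "B-" and an "I-" tag per type (27 labels in total).
ENTITY_TYPES = [
    "climate-assets", "climate-datasets", "climate-greenhouse-gases",
    "climate-hazards", "climate-impacts", "climate-mitigations",
    "climate-models", "climate-nature", "climate-observations",
    "climate-organisms", "climate-organizations",
    "climate-problem-origins", "climate-properties",
]
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
print(len(labels))  # 27
```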
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on the [Climate Change NER dataset](https://huggingface.co/datasets/ibm-research/Climate-Change-NER), which contains 534 abstracts sourced from the Semantic Scholar Academic Graph. The abstracts were manually annotated with climate-related entities using the IOB (Inside-Outside-Beginning) tagging scheme.
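
To inspect the data yourself, you can load it with the `datasets` library. This is a sketch using the dataset ID from this card's metadata; the exact column schema is best discovered from a sample record:

```python
from datasets import load_dataset

# Dataset ID taken from this card's metadata.
ds = load_dataset("ibm-research/Climate-Change-NER")
print(ds)              # expected splits: train / validation / test
print(ds["train"][0])  # print one record to see the token/tag schema
```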
|
|
|
|
|
**Dataset Statistics:** |
|
|
- Train set: 382 instances |
|
|
- Validation set: 77 instances |
|
|
- Test set: 75 instances |
|
|
|
|
|
## Model Performance |
|
|
|
|
|
We evaluated three different models on the Climate Change NER test set: |
|
|
|
|
|
| Model | Precision | Recall | F1 Score | |
|
|
|-------|-----------|--------|----------| |
|
|
| **specter2_base** | 0.57 | 0.61 | 0.57 | |
|
|
| ModernBERT | 0.45 | 0.42 | 0.41 |
|
|
| BERT-base | 0.53 | 0.57 | 0.52 | |
|
|
|
|
|
*We report micro-averaged metrics because the entity classes are highly imbalanced in the test set.*
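
For reference, span-level micro-averaged metrics of this kind can be computed with `seqeval`; the label sequences below are illustrative, not actual model output:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Illustrative gold and predicted IOB sequences (not real evaluation data).
y_true = [["B-climate-models", "I-climate-models", "O", "B-climate-hazards"]]
y_pred = [["B-climate-models", "I-climate-models", "O", "O"]]

# seqeval scores at the entity-span level; average="micro" pools all spans.
print(precision_score(y_true, y_pred, average="micro"))
print(recall_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="micro"))
```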
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from ipymarkup import show_span_box_markup |
|
|
from transformers import pipeline |
|
|
ner = pipeline(
    "ner",
    model="nicolauduran45/specter-climate-change-NER",
    tokenizer="nicolauduran45/specter-climate-change-NER",
    aggregation_strategy="simple",
    device=0,
)
|
|
|
|
|
text = 'multi-centennial variability of open ocean deep convection in the Atlantic sector of the Southern Ocean impacts the strength of the Atlantic Meridional Overturning Circulation (AMOC) in the Kiel Climate Model. The northward extent of Antarctic Bottom Water (AABW) strongly depends on the state of Weddell Sea deep convection.' |
|
|
|
|
|
entities = ner(text)
|
|
spans = [(s['start'], s['end'], s['entity_group']) for s in entities]
|
|
show_span_box_markup(text, spans) |
|
|
``` |
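
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys, so predictions can also be inspected without `ipymarkup`. Note that `device=0` assumes a GPU; pass `device=-1` (or omit the argument) to run on CPU.

```python
# Plain-text inspection of the aggregated predictions.
for ent in entities:
    print(f"{ent['entity_group']:<30} {ent['score']:.2f}  {ent['word']}")
```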
|
|
|
|
|
To improve the aggregation of words split into sub-word tokens, we recommend using the following function:
|
|
|
|
|
```python |
|
|
def predict_with_proper_aggregation(text):
    # Get the raw predictions
    raw_entities = ner(text)

    # Aggregate subword pieces into complete entities
    aggregated_entities = []
    current_entity = None

    for entity in raw_entities:
        # Check if this is a continuation token (starts with ##)
        is_continuation = entity["word"].startswith("##")

        if is_continuation and current_entity:
            # Update the current entity by removing ## and appending
            current_entity["word"] += entity["word"][2:]
            current_entity["end"] = entity["end"]

            # If entity types differ, keep the type with higher confidence
            # (compared before the score is updated below)
            if (entity["entity_group"] != current_entity["entity_group"]
                    and entity["score"] > current_entity["score"]):
                current_entity["entity_group"] = entity["entity_group"]

            # Keep the minimum sub-word score as a conservative estimate
            current_entity["score"] = min(current_entity["score"], entity["score"])
        else:
            # If we have a previous entity, add it to results
            if current_entity:
                aggregated_entities.append(current_entity)

            # Start a new entity
            current_entity = entity.copy()

    # Don't forget the last entity
    if current_entity:
        aggregated_entities.append(current_entity)

    # Further aggregation: detect split entities that might not use ##
    # notation but should be merged based on adjacent positions
    i = 0
    while i < len(aggregated_entities) - 1:
        current = aggregated_entities[i]
        next_entity = aggregated_entities[i + 1]

        # Check if entities are adjacent and should be merged
        if (current["end"] == next_entity["start"]
                and current["entity_group"] == next_entity["entity_group"]):
            # Merge entities, averaging their scores
            current["word"] += next_entity["word"]
            current["end"] = next_entity["end"]
            current["score"] = (current["score"] + next_entity["score"]) / 2
            # Remove the next entity as it's now merged
            aggregated_entities.pop(i + 1)
        else:
            i += 1

    return aggregated_entities
|
|
|
|
|
entities = predict_with_proper_aggregation(text)
|
|
spans = [(s['start'], s['end'], s['entity_group']) for s in entities]
|
|
show_span_box_markup(text, spans) |
|
|
``` |
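
This post-processing assumes a WordPiece-style tokenizer that marks sub-word continuations with `##`, as BERT-derived tokenizers do. Keeping the minimum sub-word score is a conservative confidence estimate for merged pieces, while adjacent same-type spans are merged with an averaged score.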