SDG SciBERT Classifier (`sdg-scibert-zo_up`)

This repository contains a fine-tuned version of allenai/scibert_scivocab_cased for classifying scientific text into Sustainable Development Goal (SDG) categories.

Fine-tuned using the 🤗 transformers Trainer API
Uses standard AutoModelForSequenceClassification
Published with full label mappings, inference scripts, and CLI tool

🧪 Quick Inference (Python)

You can use the model directly with the Hugging Face pipeline:

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="simon-clmtd/sdg-scibert-zo_up",
    tokenizer="simon-clmtd/sdg-scibert-zo_up",
    truncation=True,
    padding=True,
    max_length=512,
    top_k=None,  # return scores for every label (replaces the deprecated return_all_scores=True)
    device=0,  # GPU 0; use -1 for CPU
)

text = "Ensure access to affordable, reliable, sustainable and modern energy for all"
print(classifier(text))
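
Depending on the transformers version and the arguments used, the scores for a single text may come back either as a flat list of label/score dicts or as that list wrapped in one outer list. The following is a minimal sketch of a helper that accepts both shapes and ranks the labels. The name `rank_labels` is hypothetical (it is not part of this repository), and the sample scores are invented for illustration.

```python
def rank_labels(result):
    """Return label/score dicts sorted by descending score.

    Accepts either a flat list of dicts or the same list wrapped in
    one outer list.
    """
    # Unwrap one level of nesting if present.
    if result and isinstance(result[0], list):
        result = result[0]
    return sorted(result, key=lambda item: item["score"], reverse=True)


# Illustrative values only, not real model output.
sample = [[
    {"label": "1", "score": 0.02},
    {"label": "7", "score": 0.91},
    {"label": "13", "score": 0.07},
]]
best = rank_labels(sample)[0]
print(best["label"], best["score"])
```

In practice, `rank_labels(classifier(text))` would replace the hard-coded sample.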

🖥️ CLI Tool: `sdg-predict`

🔧 Installation (local)

Clone the repo and install as a Python package:

git clone https://huggingface.co/simon-clmtd/sdg-scibert-zo_up
cd sdg-scibert-zo_up
pip install -e .

This installs a command-line tool called `sdg-predict`.

📥 Input format

The CLI tool accepts a .jsonl file (one JSON object per line). You must specify the key containing the text to classify with the --key option.

Example input file (input.jsonl):

{"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"}
{"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"}
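
If the texts already live in Python, a file in this format can be produced with the standard library. This is a small sketch; `write_jsonl` is a hypothetical helper name, not something shipped with this repository.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII characters readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"},
    {"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"},
]
write_jsonl(records, "input.jsonl")
```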

▶️ Example usage

Top-1 prediction:

sdg-predict input.jsonl --key text --top1 --output preds.jsonl

Full label distribution:

sdg-predict input.jsonl --key text --output preds_all.jsonl

Custom batch size:

sdg-predict input.jsonl --key text --batch_size 16
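
The batch size presumably controls how many texts are passed to the model at once. The sketch below illustrates that general pattern only; it is not the code in inference.py. The names `batched` and `predict_in_batches` are hypothetical, and `classify` stands for any callable that maps a list of texts to a list of predictions, such as the pipeline object from the Quick Inference section.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def predict_in_batches(texts, classify, batch_size=16):
    """Run `classify` on fixed-size batches and return a flat list of predictions."""
    predictions = []
    for batch in batched(texts, batch_size):
        predictions.extend(classify(batch))
    return predictions


# Stand-in classifier for illustration; a real run would pass the pipeline object.
def dummy_classify(texts):
    return [{"label": "7", "score": 1.0} for _ in texts]


texts = [f"text {i}" for i in range(5)]
results = predict_in_batches(texts, dummy_classify, batch_size=2)
print(len(results))
```

Smaller batches reduce peak memory use at the cost of more model calls.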

📤 Output format

Each output line is the original input object with an added prediction key (pretty-printed below for readability):

With --top1:

{
  "id": 1,
  "text": "...",
  "prediction": {
    "label": "7", 
    "score": 0.9124
  }
}

Without --top1:

{
  "id": 1,
  "text": "...",
  "prediction": [
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    ...
    {"label": "7", "score": 0.9124}
  ]
}
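
Both output shapes can be post-processed with the same code. The sketch below counts the top label per record, assuming only the structure shown above. `top_label` and `count_labels` are hypothetical helper names, and the two sample lines use made-up scores.

```python
import json
from collections import Counter


def top_label(prediction):
    """Return the best label from either output shape.

    With --top1 the prediction is a single dict; without it, a list of
    dicts, from which the highest-scoring entry is taken.
    """
    if isinstance(prediction, dict):
        return prediction["label"]
    return max(prediction, key=lambda item: item["score"])["label"]


def count_labels(lines):
    """Count top labels over an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            counts[top_label(json.loads(line)["prediction"])] += 1
    return counts


# Illustrative lines; for a real file, pass an open file handle instead.
lines = [
    '{"id": 1, "text": "...", "prediction": {"label": "7", "score": 0.9124}}',
    '{"id": 2, "text": "...", "prediction": [{"label": "1", "score": 0.2}, {"label": "13", "score": 0.8}]}',
]
print(count_labels(lines))
```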

📦 Repository Contents

modeling.py: Optional class wrapper if extending the base model.
inference.py: Reusable batch inference logic for Python scripts.
cli_predict.py: CLI tool using the inference logic.
requirements.txt: Runtime dependencies.
setup.py: Installation and entry point for the CLI.

🔍 Citation

If you use this model, please cite the original SciBERT paper and, where relevant, attribute this fine-tuning setup.

👤 Author

Simon Clematide
Computational Linguistics, UZH
simon-clematide.net (if applicable)
