---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---
# Sentence Dating Model

## Model Description
The Sentence Dating Model is a RoBERTa-large transformer fine-tuned to predict the decade in which a given sentence was written. Trained on historical text data, it classifies sentences into decades from 1700 to 2021, making it useful for historical linguistics, text dating, and semantic change studies.
## Training Details

### Base Model

- Model: `roberta-large`
- Fine-tuned for: sentence classification into time periods (1700-2021)
### Dataset

The model is trained on a dataset derived from historical text corpora, including examples extracted from the Oxford English Dictionary (OED). The dataset includes:

- Texts: sentences extracted from historical documents.
- Labels: time periods, grouped by decade.
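
The exact label encoding is not spelled out here; judging from the usage example below, labels appear to index decades starting at 1700 (label 0 for 1700-1709, label 1 for 1710-1719, and so on). A minimal sketch of that assumed binning:

```python
# Assumed decade binning, inferred from the usage example below;
# not an official mapping from the model authors.
FIRST_DECADE = 1700

def year_to_label(year: int) -> int:
    """Map a publication year to its decade label index."""
    return (year - FIRST_DECADE) // 10

def label_to_decade(label: int) -> int:
    """Map a label index back to the first year of its decade."""
    return FIRST_DECADE + label * 10

assert year_to_label(1715) == 1
assert label_to_decade(year_to_label(2021)) == 2020
```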
### Fine-tuning Process

- Tokenizer: `AutoTokenizer.from_pretrained("roberta-large")`
- Loss function: cross-entropy loss
- Optimizer: AdamW
- Batch size: 32
- Learning rate: 1e-6
- Epochs: 1
- Evaluation strategy: steps (every 10% of the training data)
- Metric: weighted F1-score
- Split: 90% training, 10% validation
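
For orientation, the hyperparameters above might map onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch under assumptions, not the authors' training script: `output_dir` is hypothetical, and a float `eval_steps` (interpreted as a fraction of total steps in recent `transformers` releases) is one way to realize "every 10%". AdamW and cross-entropy loss are already the `Trainer` defaults for sequence classification.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import TrainingArguments

def compute_metrics(eval_pred):
    """Weighted F1, the evaluation metric listed above."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

training_args = TrainingArguments(
    output_dir="sentence-dating",  # hypothetical path
    per_device_train_batch_size=32,
    learning_rate=1e-6,
    num_train_epochs=1,
    eval_strategy="steps",         # "evaluation_strategy" in older releases
    eval_steps=0.1,                # evaluate every 10% of total training steps
)
```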
## Usage

### Example
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/text-dating")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/text-dating")

# Example text
text = "He put the phone back in the cradle and turned toward the kitchen."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Predict the decade label and map it back to a starting year
with torch.no_grad():
    outputs = model(**inputs)
predicted_label = torch.argmax(outputs.logits, dim=1).item()
print(f"Predicted decade: {1700 + predicted_label * 10}")
```
## Limitations

- The model may have difficulty distinguishing closely related time periods (e.g., 1950s vs. 1960s).
- Biases may exist due to the composition of the training dataset.
- Performance is lower on short, contextually ambiguous sentences.
## Citation

If you use this model, please cite:
```bibtex
@article{10.1162/tacl_a_00761,
    author  = {Cassotti, Pierluigi and Tahmasebi, Nina},
    title   = {Sense-specific Historical Word Usage Generation},
    journal = {Transactions of the Association for Computational Linguistics},
    volume  = {13},
    pages   = {690-708},
    year    = {2025},
    month   = {07},
    issn    = {2307-387X},
    doi     = {10.1162/tacl_a_00761},
    url     = {https://doi.org/10.1162/tacl\_a\_00761},
    eprint  = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00761/2535111/tacl\_a\_00761.pdf},
    abstract = {Large-scale sense-annotated corpora are important for a range of tasks but are hard to come by. Dictionaries that record and describe the vocabulary of a language often offer a small set of real-world example sentences for each sense of a word. However, on their own, these sentences are too few to be used as diachronic sense-annotated corpora. We propose a targeted strategy for training and evaluating generative models producing historically and semantically accurate word usages given any word, sense definition, and year triple. Our results demonstrate that fine-tuned models can generate usages with the same properties as real-world example sentences from a reference dictionary. Thus the generated usages will be suitable for training and testing computational models where large-scale sense-annotated corpora are needed but currently unavailable.},
}
```
## License

MIT License