Sense-specific Historical Word Usage Generation
(Built with Meta Llama 3)
For the version with a PoS tag, see Janus (PoS).
Janus is a fine-tuned Llama 3 8B model designed to generate historically and semantically accurate word usages. Given a word, its sense definition, and a year, it produces example sentences that reflect linguistic usage from the specified period. The model is particularly useful for semantic change detection, historical NLP, and linguistic research.
Janus operates on sequences of the following form, where the prompt ends at <|s|> and the model generates the usage sentence up to <|end|>:
<year><|t|><lemma><|t|><definition><|s|><historical usage sentence><|end|>
The Janus (PoS) variant additionally encodes a part-of-speech tag:
<year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|><historical usage sentence><|end|>
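The prompt is simply the portion of this sequence up to and including <|s|>. As a minimal sketch (build_prompt is a hypothetical helper, not part of the model release), it can be assembled from a (year, lemma, definition) triple like this:

def build_prompt(year, lemma, definition, pos=None):
    # Hypothetical helper: joins the fields with Janus's special separator tokens.
    prompt = f"{year}<|t|>{lemma}<|t|>{definition}"
    if pos is not None:
        # Only the Janus (PoS) variant expects a part-of-speech tag.
        prompt += f"<|p|>{pos}<|p|>"
    return prompt + "<|s|>"

print(build_prompt(1800, "awful", "Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’."))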
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChangeIsKey/llama3-janus"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# The prompt ends with <|s|>; the model generates the historical usage sentence.
input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.<|s|>"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sampling must be enabled for temperature and top_p to take effect.
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
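Decoding the full output repeats the prompt. To keep only the generated usage sentence, one option (a sketch reusing the tokenizer, inputs, and output objects from the example above) is to decode just the newly generated tokens:

# Decode only the tokens produced after the prompt.
prompt_length = inputs["input_ids"].shape[1]
usage_sentence = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
print(usage_sentence)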
For more examples, see the Historical Word Usage Generation GitHub repository.
If you use Janus, please cite:
@article{10.1162/tacl_a_00761,
author = {Cassotti, Pierluigi and Tahmasebi, Nina},
title = {Sense-specific Historical Word Usage Generation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {13},
pages = {690-708},
year = {2025},
month = {07},
abstract = {Large-scale sense-annotated corpora are important for a range of tasks but are hard to come by. Dictionaries that record and describe the vocabulary of a language often offer a small set of real-world example sentences for each sense of a word. However, on their own, these sentences are too few to be used as diachronic sense-annotated corpora. We propose a targeted strategy for training and evaluating generative models producing historically and semantically accurate word usages given any word, sense definition, and year triple. Our results demonstrate that fine-tuned models can generate usages with the same properties as real-world example sentences from a reference dictionary. Thus the generated usages will be suitable for training and testing computational models where large-scale sense-annotated corpora are needed but currently unavailable.},
issn = {2307-387X},
doi = {10.1162/tacl_a_00761},
url = {https://doi.org/10.1162/tacl\_a\_00761},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00761/2535111/tacl\_a\_00761.pdf},
}
Base model: meta-llama/Meta-Llama-3-8B