
Janus

(Built with Meta Llama 3)

For the version that additionally conditions on part-of-speech tags, see Janus (PoS).

Model Details

Model Description

Janus is a fine-tuned Llama 3 8B model designed to generate historically and semantically accurate word usages. Given a word, a sense definition, and a year, it produces example sentences that reflect how the word was used in that sense during the specified period. The model is particularly useful for semantic change detection, historical NLP, and linguistic research.

Intended Use

  • Semantic Change Detection: Investigating how word meanings evolve over time.
  • Historical Text Processing: Enhancing the understanding and modeling of historical texts.
  • Corpus Expansion: Generating sense-annotated corpora for linguistic studies.

Training Data

  • Dataset: Extracted from the Oxford English Dictionary (OED)
  • Size: Over 1.2 million sense-annotated historical usages
  • Time Span: 1700 - 2020
  • Data Format:
    <year><|t|><lemma><|t|><definition><|s|><historical usage sentence><|end|>
    
  • Janus (PoS) Format:
    <year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|><historical usage sentence><|end|>
    
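
As an illustration, both formats can be assembled with a small helper (the build_prompt name is ours; the marker tokens <|t|>, <|p|>, <|s|>, and <|end|> are taken from the formats above). At inference time the model is prompted with everything up to and including <|s|> and generates the usage sentence, terminated by <|end|>:

def build_prompt(year, lemma, definition, pos=None):
    """Assemble a Janus prompt from a (year, lemma, definition[, PoS]) triple."""
    if pos is None:
        return f"{year}<|t|>{lemma}<|t|>{definition}<|s|>"
    # Janus (PoS) variant: the PoS tag is wrapped in <|p|> markers
    return f"{year}<|t|>{lemma}<|t|>{definition}<|p|>{pos}<|p|><|s|>"

prompt = build_prompt(1800, "awful", "Used to emphasize something unpleasant or negative.")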

Training Procedure

  • Base Model: meta-llama/Meta-Llama-3-8B
  • Optimization: QLoRA (Quantized Low-Rank Adaptation)
  • Batch Size: 4
  • Learning Rate: 2e-4
  • Epochs: 1
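
For reference, below is a minimal sketch of a comparable QLoRA setup using transformers, peft, and bitsandbytes. Only the base model, batch size, learning rate, and epoch count come from this card; the 4-bit settings, LoRA rank/alpha/dropout, and target modules are illustrative assumptions, not the authors' exact configuration.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters; rank, alpha, dropout, and target modules are assumed values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# Hyperparameters stated on this card
training_args = TrainingArguments(
    output_dir="janus-qlora",
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    num_train_epochs=1,
)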

Model Performance

  • Temporal Accuracy: Root mean squared error (RMSE) of ~52.7 years, close to the error measured on OED ground-truth usages
  • Semantic Accuracy: Rated comparable to held-out OED test data in human evaluations
  • Context Variability: Low lexical repetition, preserving natural linguistic diversity
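
For concreteness, a temporal-accuracy RMSE of this kind can be computed as below, assuming each generated usage has been assigned a year by a separate dating model (the values here are made up for illustration):

import math

# Hypothetical pairs: the year each usage was requested for, and the year
# a dating model assigned to the generated sentence (illustrative values).
target_years = [1750, 1820, 1900, 1980]
estimated_years = [1801, 1790, 1935, 2002]

squared_errors = [(t - e) ** 2 for t, e in zip(target_years, estimated_years)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(f"Temporal RMSE: {rmse:.1f} years")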

Usage Example

Generating Historical Usages

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChangeIsKey/llama3-janus"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Prompt format: <year><|t|><lemma><|t|><definition><|s|>
input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.<|s|>"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature and top_p to take effect
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
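
If the markers are registered as special tokens, skip_special_tokens=True strips them from the decoded string. One way to isolate just the generated usage sentence (an assumption about how the markers decode, not something stated on this card) is to keep the special tokens and split on the markers:

# Decode with the markers preserved, then extract the text between the
# final <|s|> and the closing <|end|> (assumes both markers survive decoding)
raw = tokenizer.decode(output[0], skip_special_tokens=False)
usage = raw.split("<|s|>")[-1].split("<|end|>")[0].strip()
print(usage)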

For more examples, see the Historical Word Usage Generation GitHub repository.

Limitations & Ethical Considerations

  • Historical Bias: The model may reflect biases present in historical texts.
  • Time Granularity: The temporal resolution is approximate (~50 years RMSE).
  • Modern Influence: Despite fine-tuning, the model may still generate modern phrases in older contexts.
  • Not Trained for Fairness: The model has not been explicitly trained to be fair or unbiased. It may produce sensitive, outdated, or culturally inappropriate content.

Citation

If you use Janus, please cite:

@article{10.1162/tacl_a_00761,
    author = {Cassotti, Pierluigi and Tahmasebi, Nina},
    title = {Sense-specific Historical Word Usage Generation},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {13},
    pages = {690--708},
    year = {2025},
    month = {07},
    abstract = {Large-scale sense-annotated corpora are important for a range of tasks but are hard to come by. Dictionaries that record and describe the vocabulary of a language often offer a small set of real-world example sentences for each sense of a word. However, on their own, these sentences are too few to be used as diachronic sense-annotated corpora. We propose a targeted strategy for training and evaluating generative models producing historically and semantically accurate word usages given any word, sense definition, and year triple. Our results demonstrate that fine-tuned models can generate usages with the same properties as real-world example sentences from a reference dictionary. Thus the generated usages will be suitable for training and testing computational models where large-scale sense-annotated corpora are needed but currently unavailable.},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00761},
    url = {https://doi.org/10.1162/tacl_a_00761},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00761/2535111/tacl_a_00761.pdf},
}