---
license: mit
datasets:
- ELiRF/dacsa
- projecte-aina/CATalog
language:
- ca
- en
base_model:
- openai-community/gpt2
- openai-community/gpt2-medium
pipeline_tag: text-generation
---
# GPT-2 Medium Catalan-English Model
The model is still being trained, and I will be making updates. Please do not expect great results just yet. 😀
## Model Overview
This model uses the GPT-2 Medium architecture and was trained from scratch, meaning it does not inherit any weights from existing models. It was trained on Catalan data, specifically the ELiRF/dacsa and projecte-aina/CATalog datasets.
## License and Usage
This model is free to use under the MIT license. However, proper credit must be given when using it in research, applications, or any derived work.
## Tokenizer
The model uses a 52,000-token vocabulary based on the GPT-2 tokenizer configuration, trained specifically to handle Catalan. The tokenizer is also available on its own as `Marxx01/gpt2-catalan-tokenizer`.
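To illustrate what training such a tokenizer involves, here is a minimal sketch using the `tokenizers` library. The toy corpus below is purely illustrative: the real tokenizer was trained on the full datasets listed above, and a corpus this small cannot actually produce 52,000 merges.

```python
from tokenizers import ByteLevelBPETokenizer

# Tiny illustrative corpus; the actual tokenizer was trained on the
# full ELiRF/dacsa and projecte-aina/CATalog datasets.
corpus = [
    "El president de la Generalitat va dir que tot anava bé.",
    "Bon dia, com estàs?",
    "La llengua catalana és una llengua romànica.",
]

# GPT-2 uses a byte-level BPE tokenizer; vocab_size=52000 matches the
# vocabulary size stated in this model card.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=52_000,
    min_frequency=1,
    special_tokens=["<|endoftext|>"],
)

# Byte-level BPE round-trips text exactly.
enc = tokenizer.encode("Bon dia")
print(enc.tokens)
print(tokenizer.decode(enc.ids))
```

A tokenizer trained this way can be saved with `tokenizer.save_model(...)` and then loaded through `transformers` like any GPT-2 tokenizer.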
## How to Use
To use this model for text generation, load it with the `transformers` library as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its tokenizer from the Hugging Face Hub.
model_name = "Marxx01/test_gpt2_catalan"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Catalan prompt: "The president of the Generalitat said "
text = "El president de la Generalitat va dir "
inputs = tokenizer(text, return_tensors="pt")

# max_new_tokens and the sampling settings are illustrative; tune as needed.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```