---
license: mit
datasets:
  - ELiRF/dacsa
  - projecte-aina/CATalog
language:
  - ca
  - en
base_model:
  - openai-community/gpt2
  - openai-community/gpt2-medium
pipeline_tag: text-generation
---

# GPT-2 Medium Catalan-English Model

The model is still being trained, and I will be making updates. Please do not expect great results just yet. 😀

## Model Overview

This model uses the GPT-2 Medium architecture and was trained from scratch, meaning it inherits no weights from existing models. It was trained on Catalan datasets, specifically ELiRF/dacsa and projecte-aina/CATalog.
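Since no pretrained weights are reused, training starts from a freshly initialized model. A minimal sketch of that setup, assuming GPT-2 Medium dimensions and the 52,000-token vocabulary described below (the actual training script is not part of this card):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# GPT-2 Medium dimensions with a 52,000-token vocabulary
config = GPT2Config(
    vocab_size=52_000,
    n_positions=1024,
    n_embd=1024,   # hidden size of the Medium variant
    n_layer=24,    # 24 transformer blocks
    n_head=16,     # 16 attention heads
)

# Instantiating from a config gives randomly initialized weights,
# i.e. a model trained "from scratch" rather than fine-tuned
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,} parameters")
```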

## License and Usage

This model is free to use under the MIT license. However, proper credit must be given when using it in research, applications, or any derived work.

## Tokenizer

The model uses a 52,000-token vocabulary built with the GPT-2 tokenizer configuration and trained specifically to handle Catalan. The tokenizer is also available on its own as "Marxx01/gpt2-catalan-tokenizer".

## How to Use

To use this model for text generation, you can load it with the transformers library as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its matching tokenizer from the Hugging Face Hub
model_name = "Marxx01/test_gpt2_catalan"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode a Catalan prompt and generate a continuation
text = "El president de la generalitat va dir "
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
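Called with no arguments, `generate` uses greedy decoding and a short default length. The standard `transformers` sampling parameters (`max_new_tokens`, `do_sample`, `temperature`, `top_p`) apply here as with any causal LM. A self-contained sketch of the call shape, using a tiny randomly initialized GPT-2 so it runs without downloading weights (swap in the model and tokenizer loaded above for real output):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model: a stand-in for the real checkpoint in this sketch
tiny = GPT2LMHeadModel(GPT2Config(vocab_size=256, n_embd=64, n_layer=2, n_head=2))

input_ids = torch.tensor([[10, 20, 30]])  # stand-in for tokenizer output
outputs = tiny.generate(
    input_ids,
    max_new_tokens=20,  # length of the continuation
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.8,    # <1.0 sharpens the distribution
    top_p=0.95,         # nucleus sampling
    pad_token_id=0,     # silences the missing-pad-token warning
)
print(outputs.shape)  # torch.Size([1, 23]): 3 prompt tokens + 20 generated
```

Higher `temperature` and `top_p` give more varied continuations; lowering them makes output more deterministic.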