---
license: mit
datasets:
- ELiRF/dacsa
- projecte-aina/CATalog
language:
- ca
- en
base_model:
- openai-community/gpt2
- openai-community/gpt2-medium
pipeline_tag: text-generation
---

# GPT-2 Medium Catalan-English Model

The model is still being trained, and I will be making updates. Please do not expect great results just yet. 😀

## Model Overview

This model is a GPT-2 Medium architecture trained **from scratch**, meaning it does not inherit any weights from existing models. It has been trained on **Catalan** datasets, specifically **ELiRF/dacsa** and **projecte-aina/CATalog**.

## License and Usage

This model is **free to use** under the MIT license. However, proper credit must be given when using it in research, applications, or any derived work.

## Tokenizer

The model uses a **52,000-token vocabulary** based on the GPT-2 tokenizer configuration, trained specifically to handle Catalan. The tokenizer is also available separately at "Marxx01/gpt2-catalan-tokenizer".

## How to Use

To use this model for text generation, load it with the `transformers` library as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Marxx01/test_gpt2_catalan"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "El president de la generalitat va dir "
inputs = tokenizer(text, return_tensors="pt")

# max_new_tokens caps the length of the continuation; sampling adds variety
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
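For reference, a byte-level BPE tokenizer with a 52,000-token vocabulary like the one described above can be trained with the Hugging Face `tokenizers` library. This is only a minimal sketch, not the actual training setup: the inline corpus and file name are placeholders, while the real tokenizer was trained on the Catalan datasets listed in the metadata.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder corpus; the real tokenizer was trained on ELiRF/dacsa
# and projecte-aina/CATalog, not on these two sentences.
with open("demo_corpus.txt", "w", encoding="utf-8") as f:
    f.write("El president de la Generalitat va parlar amb els diputats.\n")
    f.write("La llengua catalana es parla a Catalunya i altres territoris.\n")

# Byte-level BPE, the same tokenization scheme GPT-2 uses
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["demo_corpus.txt"],
    vocab_size=52_000,                 # target size from the model card
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

ids = tokenizer.encode("El president va parlar.").ids
print(len(ids))  # number of tokens produced for the sample sentence
```

On a real corpus the trainer fills the vocabulary up to the requested 52,000 entries; on this tiny placeholder it simply stops early once no more merges are possible.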