---
license: mit
datasets:
- ELiRF/dacsa
- projecte-aina/CATalog
language:
- ca
- en
base_model:
- openai-community/gpt2
- openai-community/gpt2-medium
pipeline_tag: text-generation
---

# GPT-2 Medium Catalan-English Model

The model is still being trained, and I will be making updates. Please do not expect great results just yet. 😀

## Model Overview

This model is a GPT-2 Medium architecture trained **from scratch**, meaning it does not inherit any weights from existing models. It has been trained on **Catalan** datasets, specifically **ELiRF/dacsa** and **projecte-aina/CATalog**.

## License and Usage

This model is **free to use** under the MIT license. However, proper credit must be given when using it in research, applications, or any derived work.

## Tokenizer

The model uses a **52,000-token vocabulary** based on the GPT-2 tokenizer configuration, trained specifically to handle Catalan. The tokenizer is also available separately at "Marxx01/gpt2-catalan-tokenizer".

## How to Use

To use this model for text generation, load it with the `transformers` library as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Marxx01/test_gpt2_catalan"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "El president de la generalitat va dir "
inputs = tokenizer(text, return_tensors="pt")

# max_new_tokens caps the length of the continuation; sampling adds variety
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
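For reference, a byte-level BPE tokenizer with a 52,000-token vocabulary like the one described above can be trained with the Hugging Face `tokenizers` library. This is only a minimal sketch, not the actual training setup: the inline corpus and file name are placeholders, while the real tokenizer was trained on the Catalan datasets listed in the metadata.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder corpus; the real tokenizer was trained on ELiRF/dacsa
# and projecte-aina/CATalog, not on these two sentences.
with open("demo_corpus.txt", "w", encoding="utf-8") as f:
    f.write("El president de la Generalitat va parlar amb els diputats.\n")
    f.write("La llengua catalana es parla a Catalunya i altres territoris.\n")

# Byte-level BPE, the same tokenization scheme GPT-2 uses
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["demo_corpus.txt"],
    vocab_size=52_000,                 # target size from the model card
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

ids = tokenizer.encode("El president va parlar.").ids
print(len(ids))  # number of tokens produced for the sample sentence
```

On a real corpus the trainer fills the vocabulary up to the requested 52,000 entries; on this tiny placeholder it simply stops early once no more merges are possible.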