---
language: es
license: mit
tags:
- generated_from_trainer
base_model: flax-community/spanish-t5-small
model-index:
- name: poem-gen-spanish-t5-small
  results: []
---

# poem-gen-spanish-t5-small

This model is a fine-tuned version of [flax-community/spanish-t5-small](https://huggingface.co/flax-community/spanish-t5-small) on the [Spanish Poetry Dataset](https://www.kaggle.com/andreamorgar/spanish-poetry-dataset/version/1).

The model was created during the [First Spanish Hackathon](https://somosnlp.org/hackathon) organized by [Somos NLP](https://somosnlp.org/).

The participating team was composed of:

- 🇨🇺 [Alberto Carmona Barthelemy](https://huggingface.co/milyiyo)
- 🇨🇴 [Jorge Henao](https://huggingface.co/jorge-henao)
- 🇪🇸 [Andrea Morales Garzón](https://huggingface.co/andreamorgar)
- 🇮🇳 [Drishti Sharma](https://huggingface.co/DrishtiSharma)

It achieves the following results on the evaluation set:
- Loss: 2.8707
- Perplexity: 17.65
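
The reported perplexity follows directly from the validation loss: exp(2.8707) ≈ 17.65.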


## Model description

The model was trained to generate Spanish poems conditioned on parameters such as style, sentiment, words to include, and a starting phrase.

Example input:

```
poema:
  estilo: Pablo Neruda &&
  sentimiento: positivo &&
  palabras: cielo, luna, mar &&
  texto: Todos fueron a verle pasar
```

### How to use

You can use this model directly for text-to-text generation:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'hackathon-pln-es/poem-gen-spanish-t5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Build the conditioning prompt: style (author), sentiment, words to include and starting phrase
author, sentiment, word, start_text = 'Pablo Neruda', 'positivo', 'cielo', 'Todos fueron a la plaza'
input_text = f"""poema: estilo: {author} && sentimiento: {sentiment} && palabras: {word} && texto: {start_text} """
inputs = tokenizer(input_text, return_tensors="pt")

# Sample a continuation; adjust max_length and the sampling parameters to taste
outputs = model.generate(inputs["input_ids"],
                         do_sample=True,
                         max_length=30,
                         repetition_penalty=20.0,
                         top_k=50,
                         top_p=0.92)
detok_outputs = [tokenizer.decode(x, skip_special_tokens=True) for x in outputs]
res = detok_outputs[0]
print(res)
```
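
Alternatively, a minimal sketch using the `text2text-generation` pipeline (the prompt format is the same; the generation settings here are illustrative):

```python
from transformers import pipeline

generator = pipeline("text2text-generation",
                     model="hackathon-pln-es/poem-gen-spanish-t5-small")

prompt = ("poema: estilo: Pablo Neruda && sentimiento: positivo && "
          "palabras: cielo && texto: Todos fueron a la plaza ")

# Generation kwargs are forwarded to model.generate()
result = generator(prompt, do_sample=True, max_length=30, top_k=50, top_p=0.92)
print(result[0]["generated_text"])
```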

## Training and evaluation data

The original [dataset](https://www.kaggle.com/andreamorgar/spanish-poetry-dataset/version/1) has the columns `author`, `content` and `title`.
For each poem, we generate new examples as follows (see the sketch below):
- content: *line_i*, generated: *line_i+1*
- content: *concatenate(line_i, line_i+1)*, generated: *line_i+2*
- content: *concatenate(line_i, line_i+1, line_i+2)*, generated: *line_i+3*

The resulting dataset has the columns `author`, `content`, `title` and `generated`.
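
A minimal sketch of this expansion, assuming each poem's `content` is split into lines and contexts are joined with newlines (the function name and separator are illustrative, not the original preprocessing script):

```python
def expand_poem(lines, max_context=3):
    """Build (content, generated) pairs from consecutive poem lines."""
    examples = []
    for i in range(len(lines) - 1):
        # Contexts of 1, 2 and 3 consecutive lines, each predicting the next line
        for n in range(1, max_context + 1):
            if i + n >= len(lines):
                break
            examples.append({
                "content": "\n".join(lines[i:i + n]),
                "generated": lines[i + n],
            })
    return examples
```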

For each example, we compute the sentiment of the `generated` column and extract its nouns. For sentiment we used the model `mrm8488/electricidad-small-finetuned-restaurant-sentiment-analysis`, and for noun extraction we used spaCy.
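
A sketch of that annotation step (the sentiment model name comes from this card; the spaCy model `es_core_news_sm` and the helper function are assumptions for illustration):

```python
import spacy
from transformers import pipeline

sentiment_clf = pipeline(
    "text-classification",
    model="mrm8488/electricidad-small-finetuned-restaurant-sentiment-analysis",
)
nlp = spacy.load("es_core_news_sm")  # assumed Spanish spaCy pipeline

def annotate(generated_text):
    """Attach a sentiment label and the list of nouns to a generated line."""
    label = sentiment_clf(generated_text)[0]["label"]
    nouns = [tok.text for tok in nlp(generated_text) if tok.pos_ == "NOUN"]
    return {"sentiment": label, "nouns": nouns}
```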
 

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 2e-05
- train_batch_size: 6
- eval_batch_size: 6
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
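
A minimal sketch mapping these values to `Seq2SeqTrainingArguments` (argument names follow the Transformers version listed below; `output_dir` is a placeholder and all other settings keep their defaults):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="poem-gen-spanish-t5-small",  # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=6,
)
```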

### Training results

| Training Loss | Epoch | Step   | Validation Loss |
|:-------------:|:-----:|:------:|:---------------:|
| 2.7082        | 0.73  | 30000  | 2.8878          |
| 2.6251        | 1.46  | 60000  | 2.8940          |
| 2.5796        | 2.19  | 90000  | 2.8853          |
| 2.5556        | 2.93  | 120000 | 2.8749          |
| 2.527         | 3.66  | 150000 | 2.8850          |
| 2.5024        | 4.39  | 180000 | 2.8760          |
| 2.4887        | 5.12  | 210000 | 2.8749          |
| 2.4808        | 5.85  | 240000 | 2.8707          |


### Framework versions

- Transformers 4.17.0
- Pytorch 1.10.0+cu111
- Datasets 2.0.0
- Tokenizers 0.11.6