Update README.md
README.md
CHANGED
@@ -1,176 +1,293 @@
|
|
1 |
---
|
2 |
library_name: transformers
|
3 |
-
|
4 |
---
|
5 |
|
6 |
# Model Card for Model ID
|
7 |
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
|
12 |
## Model Details
|
13 |
|
14 |
### Model Description
|
15 |
|
16 |
-
|
17 |
-
|
18 |
-
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
|
19 |
|
20 |
-
- **
|
21 |
-
- **
|
22 |
-
- **
|
23 |
-
- **
|
24 |
-
- **Language(s) (NLP):** [More Information Needed]
|
25 |
-
- **License:** [More Information Needed]
|
26 |
-
- **Finetuned from model [optional]:** [More Information Needed]
|
27 |
|
28 |
-
### Model Sources
|
29 |
|
30 |
-
|
31 |
-
|
32 |
-
- **
|
33 |
-
- **
|
34 |
-
- **Demo [optional]:** [More Information Needed]
|
35 |
|
36 |
## Uses
|
37 |
|
38 |
-
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
39 |
-
|
40 |
### Direct Use
|
41 |
|
42 |
-
|
43 |
|
44 |
-
|
|
|
45 |
|
46 |
-
|
|
|
|
|
|
|
47 |
|
48 |
-
|
49 |
|
50 |
-
|
51 |
|
52 |
### Out-of-Scope Use
|
53 |
|
54 |
-
|
|
|
|
|
|
|
|
|
55 |
|
56 |
-
[More Information Needed]
|
57 |
|
58 |
## Bias, Risks, and Limitations
|
59 |
|
60 |
-
|
|
|
61 |
|
62 |
-
|
63 |
|
64 |
-
### Recommendations
|
65 |
|
66 |
-
|
67 |
|
68 |
-
|
|
|
|
|
|
|
69 |
|
70 |
## How to Get Started with the Model
|
71 |
|
72 |
-
|
73 |
|
74 |
-
|
75 |
|
76 |
## Training Details
|
77 |
|
78 |
### Training Data
|
79 |
|
80 |
-
|
81 |
-
|
82 |
-
[More Information Needed]
|
83 |
|
84 |
### Training Procedure
|
85 |
|
86 |
-
|
87 |
|
88 |
-
|
89 |
|
90 |
-
|
91 |
|
|
|
92 |
|
93 |
-
|
|
|
|
|
|
|
94 |
|
95 |
-
|
|
|
|
|
|
|
96 |
|
97 |
-
|
|
|
|
|
98 |
|
99 |
-
|
100 |
|
101 |
-
|
102 |
|
103 |
## Evaluation
|
104 |
|
105 |
-
|
106 |
-
|
107 |
-
|
108 |
-
|
109 |
-
|
110 |
-
|
111 |
-
|
112 |
-
|
113 |
-
|
114 |
-
|
115 |
-
|
116 |
-
|
117 |
-
|
118 |
-
|
119 |
-
|
120 |
-
|
121 |
-
|
122 |
-
|
123 |
-
|
124 |
-
|
125 |
-
|
126 |
|
127 |
### Results
|
128 |
|
129 |
-
|
130 |
-
|
131 |
-
|
132 |
-
|
133 |
-
|
134 |
-
|
135 |
-
## Model Examination [optional]
|
136 |
|
137 |
-
<!-- Relevant interpretability work for the model goes here -->
|
138 |
-
|
139 |
-
[More Information Needed]
|
140 |
|
141 |
## Environmental Impact
|
142 |
|
143 |
-
|
144 |
-
|
145 |
-
|
146 |
-
|
147 |
-
- **
|
148 |
-
- **Hours used:** [More Information Needed]
|
149 |
-
- **Cloud Provider:** [More Information Needed]
|
150 |
-
- **Compute Region:** [More Information Needed]
|
151 |
-
- **Carbon Emitted:** [More Information Needed]
|
152 |
-
|
153 |
-
## Technical Specifications [optional]
|
154 |
-
|
155 |
-
### Model Architecture and Objective
|
156 |
-
|
157 |
-
[More Information Needed]
|
158 |
-
|
159 |
-
### Compute Infrastructure
|
160 |
-
|
161 |
-
[More Information Needed]
|
162 |
-
|
163 |
-
#### Hardware
|
164 |
-
|
165 |
-
[More Information Needed]
|
166 |
-
|
167 |
-
#### Software
|
168 |
-
|
169 |
-
[More Information Needed]
|
170 |
-
|
171 |
-
## Citation [optional]
|
172 |
|
173 |
-
|
174 |
|
175 |
**BibTeX:**
|
176 |
|
@@ -178,22 +295,4 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
|
|
178 |
|
179 |
**APA:**
|
180 |
|
181 |
-
[More Information Needed]
|
182 |
-
|
183 |
-
## Glossary [optional]
|
184 |
-
|
185 |
-
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
|
186 |
-
|
187 |
-
[More Information Needed]
|
188 |
-
|
189 |
-
## More Information [optional]
|
190 |
-
|
191 |
-
[More Information Needed]
|
192 |
-
|
193 |
-
## Model Card Authors [optional]
|
194 |
-
|
195 |
-
[More Information Needed]
|
196 |
-
|
197 |
-
## Model Card Contact
|
198 |
-
|
199 |
[More Information Needed]
|
|
|
1 |
---
library_name: transformers
license: apache-2.0
base_model: google/pegasus-xsum
datasets:
- eilamc14/wikilarge-clean
language:
- en
tags:
- pegasus
- text-simplification
- WikiLarge
model-index:
- name: pegasus-xsum-text-simplification
  results:
  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: ASSET
      type: facebook/asset
      url: https://huggingface.co/datasets/facebook/asset
      split: test
    metrics:
    - type: SARI
      value: 33.80
    - type: FKGL
      value: 9.23
    - type: BERTScore
      value: 87.54
    - type: LENS
      value: 62.46
    - type: Identical ratio
      value: 0.29
    - type: Identical ratio (ci)
      value: 0.29

  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: MEDEASI
      type: cbasu/Med-EASi
      url: https://huggingface.co/datasets/cbasu/Med-EASi
      split: test
    metrics:
    - type: SARI
      value: 32.68
    - type: FKGL
      value: 10.98
    - type: BERTScore
      value: 45.14
    - type: LENS
      value: 50.55
    - type: Identical ratio
      value: 0.30
    - type: Identical ratio (ci)
      value: 0.30

  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: OneStopEnglish
      type: OneStopEnglish
      url: https://github.com/nishkalavallabhi/OneStopEnglishCorpus
      split: advanced→elementary
    metrics:
    - type: SARI
      value: 37.07
    - type: FKGL
      value: 8.66
    - type: BERTScore
      value: 77.77
    - type: LENS
      value: 60.97
    - type: Identical ratio
      value: 0.40
    - type: Identical ratio (ci)
      value: 0.40
---
|
82 |
|
83 |
# Model Card for Model ID
|
84 |
|
85 |
+
This is one of the models fine-tuned for text simplification as part of the [Simplify This](https://github.com/eilamc14/Simplify-This) project.
|
|
|
|
|
86 |
|
87 |
## Model Details
|
88 |
|
89 |
### Model Description
|
90 |
|
91 |
+
Fine-tuned **sequence-to-sequence (encoder–decoder) Transformer** for **English text simplification**.
|
92 |
+
Trained on the dataset **`eilamc14/wikilarge-clean`** (cleaned WikiLarge-style pairs).
|
|
|
93 |
|
94 |
+
- **Model type:** Seq2Seq Transformer (encoder–decoder)
|
95 |
+
- **Language (NLP):** English
|
96 |
+
- **License:** `apache-2.0`
|
97 |
+
- **Finetuned from model:** `google/pegasus-xsum`
|
|
|
|
|
|
|
98 |
|
99 |
+
### Model Sources
|
100 |
|
101 |
+
- **Repository (code):** https://github.com/eilamc14/Simplify-This
|
102 |
+
- **Dataset:** https://huggingface.co/datasets/eilamc14/wikilarge-clean
|
103 |
+
- **Paper [optional]:** —
|
104 |
+
- **Demo [optional]:** —
|
|
|
105 |
|
106 |
## Uses
|
107 |
|
|
|
|
|
108 |
### Direct Use
|
109 |
|
110 |
+
The model is intended for **English text simplification**.
|
111 |
|
112 |
+
- **Input format:** `Simplify: <complex sentence>`
|
113 |
+
- **Output:** `<simplified sentence>`
|
114 |
|
115 |
+
**Typical uses**
|
116 |
+
- Research on automatic text simplification
|
117 |
+
- Benchmarking against other simplification systems
|
118 |
+
- Demos/prototypes that require simpler English rewrites
|
119 |
|
120 |
+
### Downstream Use
|
121 |
|
122 |
+
This repository already contains a **fine-tuned** model specialized for text simplification.
|
123 |
+
|
124 |
+
Further fine-tuning is **optional** and mainly relevant when:
|
125 |
+
- Adapting to a markedly different domain (e.g., medical/legal/news)
|
126 |
+
- Addressing specific failure modes (e.g., over/under-simplification, factual drops)
|
127 |
+
- Distilling/quantizing for deployment constraints
|
128 |
+
|
129 |
+
When fine-tuning further, keep the same input convention: `Simplify: <...>`.
|
130 |
|
131 |
### Out-of-Scope Use
|
132 |
|
133 |
+
Not intended for:
|
134 |
+
- Tasks unrelated to simplification (dialogue, translation, etc.)
|
135 |
+
- Production use without additional safety filtering (no toxicity/bias mitigation)
|
136 |
+
- Languages other than English
|
137 |
+
- High-stakes settings (legal/medical advice, safety-critical decisions)
|
138 |
|
|
|
139 |
|
140 |
## Bias, Risks, and Limitations
|
141 |
|
142 |
+
The model was trained on **Wikipedia and Simple English Wikipedia** alignments (via WikiLarge).
|
143 |
+
As a result, it inherits the characteristics and limitations of this data:
|
144 |
|
145 |
+
- **Domain bias:** Simplifications may reflect encyclopedic style; performance may degrade on informal, technical, or domain-specific text (e.g., medical/legal/news).
|
146 |
+
- **Content bias:** Wikipedia content itself contains biases in coverage, cultural perspective, and phrasing. Simplified outputs may reflect or amplify these.
|
147 |
+
- **Simplification quality:** The model may:
|
148 |
+
- Over-simplify (drop important details)
|
149 |
+
- Under-simplify (retain complex phrasing)
|
150 |
+
- Produce ungrammatical or awkward rephrasings
|
151 |
+
- **Language limitation:** Only suitable for English. Applying to other languages is unsupported.
|
152 |
+
- **Safety limitation:** The model has not been aligned to avoid toxic, biased, or harmful content. If the input text contains such content, the output may reproduce or modify it without safeguards.
|
153 |
|
|
|
154 |
|
155 |
+
### Recommendations
|
156 |
|
157 |
+
- **Evaluation required:** Always evaluate the model in the target domain before deployment. Benchmark simplification quality (e.g., with SARI, FKGL, BERTScore, LENS, human evaluation).
|
158 |
+
- **Human oversight:** Use human-in-the-loop review for applications where meaning preservation is critical (education, accessibility tools, etc.).
|
159 |
+
- **Attribution:** Preserve source attribution where required (Wikipedia → CC BY-SA).
|
160 |
+
- **Not for high-stakes use:** Avoid legal, medical, or safety-critical applications without extensive validation and domain adaptation.
|
161 |
|
162 |
## How to Get Started with the Model
|
163 |
|
164 |
+
Load the model and tokenizer directly from the Hugging Face Hub:
|
165 |
|
166 |
+
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed Hub repo ID for this Pegasus checkpoint (taken from the card's model-index name)
model_id = "eilamc14/pegasus-xsum-text-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Example input: the model expects the "Simplify: " prefix used in training
PREFIX = "Simplify: "
text = "The committee deemed the proposal unnecessarily complicated."

# Tokenize and generate
inputs = tokenizer(PREFIX + text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
182 |
|
183 |
## Training Details
|
184 |
|
185 |
### Training Data
|
186 |
|
187 |
+
The model was trained on the [WikiLarge-clean](https://huggingface.co/datasets/eilamc14/wikilarge-clean) dataset.
|
|
|
|
|
188 |
|
189 |
### Training Procedure
|
190 |
|
191 |
+
- **Hardware:** NVIDIA L4 GPU on Google Colab
|
192 |
+
- **Objective:** Standard sequence-to-sequence cross-entropy loss
|
193 |
+
- **Training type:** Full fine-tuning of all parameters (no LoRA/PEFT used)
|
194 |
+
- **Batching:** Dynamic padding with Hugging Face `Trainer` / PyTorch DataLoader
|
195 |
+
- **Evaluation:** Monitored on the `validation` split using SARI and `identical_ratio`
|
196 |
+
- **Stopping criteria:** Early stopping callback based on validation performance
|
197 |
+
|
198 |
+
#### Preprocessing
|
199 |
|
200 |
+
The dataset was preprocessed by prefixing each source sentence with **"Simplify: "** and tokenizing both the source (inputs) and target (labels).
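
The preprocessing step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the `preprocess` helper, the `source`/`target` column names, and the `max_length` value are hypothetical, and `tokenizer` stands for any Hugging Face tokenizer that accepts the `text_target` argument for labels.

```python
PREFIX = "Simplify: "


def preprocess(batch, tokenizer, max_length=128):
    """Prefix each source sentence, then tokenize sources and targets.

    `batch` is assumed to be a dict with "source" and "target" lists
    (hypothetical column names; adjust to the dataset's real schema).
    """
    inputs = [PREFIX + sentence for sentence in batch["source"]]
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True)
    # Tokenize the simplified sentences as labels
    labels = tokenizer(
        text_target=batch["target"], max_length=max_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With 🤗 Datasets, a function like this would typically be applied with `dataset.map(preprocess, batched=True, fn_kwargs={"tokenizer": tokenizer})`.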
|
201 |
|
202 |
+
#### Memory & Checkpointing
|
203 |
|
204 |
+
To reduce VRAM during training, gradient checkpointing was enabled and the KV cache was disabled:
|
205 |
|
206 |
+
```python
model.config.use_cache = False  # required when using gradient checkpointing
model.gradient_checkpointing_enable()  # saves memory at the cost of extra compute
```
|
210 |
|
211 |
+
**Notes**
|
212 |
+
- Disabling `use_cache` avoids warnings/conflicts with gradient checkpointing and reduces memory usage in the forward pass.
|
213 |
+
- Gradient checkpointing lowers **GPU memory** use at the cost of slower **training speed** (activations are recomputed during the backward pass).
|
214 |
+
- For **inference/evaluation**, re-enable the cache for faster generation:
|
215 |
|
216 |
+
```python
model.config.use_cache = True
```
|
219 |
|
220 |
+
#### Training Hyperparameters
|
221 |
|
222 |
+
The models were trained with Hugging Face `Seq2SeqTrainingArguments`.
|
223 |
+
Hyperparameters were tuned slightly across models and runs, and full logs (batch size, steps, exact LR schedule) were not preserved.
|
224 |
+
Below are the **typical defaults** used:
|
225 |
+
|
226 |
+
- **Epochs:** 5
|
227 |
+
- **Evaluation strategy:** every 300 steps
|
228 |
+
- **Save strategy:** every 300 steps (keep best model, `eval_loss` as criterion)
|
229 |
+
- **Learning rate:** ~3e-5
|
230 |
+
- **Batch size:** ~8–64, depending on model size
|
231 |
+
- **Optimizer:** `adamw_torch_fused`
|
232 |
+
- **Precision:** bf16
|
233 |
+
- **Generation config (during eval):** `max_length=128`, `num_beams=4`, `predict_with_generate=True`
|
234 |
+
- **Other settings:**
|
235 |
+
- Weight decay: 0.01
|
236 |
+
- Label smoothing: 0.1
|
237 |
+
- Warmup ratio: 0.1
|
238 |
+
- Max grad norm: 0.5
|
239 |
+
- Dataloader workers: 8 (L4 GPU)
|
240 |
+
|
241 |
+
> Because hyperparameters were adjusted between runs and not all were logged, exact reproduction may differ slightly.
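
As a rough guide, the defaults listed above map onto `Seq2SeqTrainingArguments` keyword names as sketched here. This is an assumed mapping, not the recorded configuration: `load_best_model_at_end`, `metric_for_best_model`, and `greater_is_better` are one plausible reading of "keep best model, `eval_loss` as criterion", and the batch size is left out because the card only gives a range.

```python
# Assumed mapping of the card's typical defaults onto keyword arguments
# accepted by transformers.Seq2SeqTrainingArguments.
training_kwargs = dict(
    num_train_epochs=5,
    eval_steps=300,                 # evaluate every 300 steps
    save_strategy="steps",
    save_steps=300,                 # checkpoint every 300 steps
    learning_rate=3e-5,
    optim="adamw_torch_fused",
    bf16=True,                      # needs hardware with bf16 support
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    warmup_ratio=0.1,
    max_grad_norm=0.5,
    dataloader_num_workers=8,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    # Assumption: "keep best model" implemented via these three options
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```

These would be passed as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`, together with a step-based evaluation strategy; that argument is named `eval_strategy` in recent `transformers` releases and `evaluation_strategy` in older ones.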
|
242 |
|
243 |
## Evaluation
|
244 |
|
245 |
+
### Testing Data
|
246 |
+
|
247 |
+
- [**ASSET**](https://huggingface.co/datasets/facebook/asset) (test subset)
|
248 |
+
- [**MEDEASI**](https://huggingface.co/datasets/cbasu/Med-EASi) (test subset)
|
249 |
+
- [**OneStopEnglish**](https://github.com/nishkalavallabhi/OneStopEnglishCorpus) (advanced → elementary)
|
250 |
+
|
251 |
+
### Metrics
|
252 |
+
|
253 |
+
- **Identical ratio:** share of outputs identical to the source after both are normalized with basic, language-agnostic steps (strip, NFKC, collapse whitespace)
- **Identical ratio (ci):** case-insensitive variant of the identical ratio
- **SARI:** main simplification metric (higher is better)
- **FKGL:** readability grade level (lower is simpler)
- **BERTScore (F1):** semantic similarity (higher is better)
- **LENS:** composite simplification quality score (higher is better)
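
The identical-ratio metrics are simple enough to sketch directly. The code below is a minimal reading of the description above (strip, NFKC, collapse whitespace), not the project's evaluation script; the function names and the use of `casefold()` for the case-insensitive variant are assumptions.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Strip, apply Unicode NFKC normalization, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text.strip())
    return re.sub(r"\s+", " ", text).strip()


def identical_ratio(sources, outputs, case_insensitive=False):
    """Share of outputs equal to their source after normalization."""
    if len(sources) != len(outputs):
        raise ValueError("sources and outputs must have the same length")
    if not sources:
        return 0.0
    same = 0
    for src, out in zip(sources, outputs):
        a, b = normalize(src), normalize(out)
        if case_insensitive:
            a, b = a.casefold(), b.casefold()
        same += a == b
    return same / len(sources)
```

A high identical ratio means the model often copies its input unchanged, which helps put SARI and BERTScore values in context.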
|
259 |
+
|
260 |
+
### Generation Arguments
|
261 |
+
|
262 |
+
```python
gen_args = dict(
    max_new_tokens=64,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True,
    do_sample=False,
)
```
|
272 |
|
273 |
### Results
|
274 |
|
275 |
+
| Dataset            | Identical ratio | Identical ratio (ci) | SARI  | FKGL  | BERTScore | LENS  |
|--------------------|----------------:|---------------------:|------:|------:|----------:|------:|
| **ASSET**          | 0.29            | 0.29                 | 33.80 | 9.23  | 87.54     | 62.46 |
| **MEDEASI**        | 0.30            | 0.30                 | 32.68 | 10.98 | 45.14     | 50.55 |
| **OneStopEnglish** | 0.40            | 0.40                 | 37.07 | 8.66  | 77.77     | 60.97 |
|
|
|
|
|
280 |
|
|
|
|
|
|
|
281 |
|
282 |
## Environmental Impact
|
283 |
|
284 |
+
- **Hardware Type:** Single NVIDIA L4 GPU (Google Colab)
|
285 |
+
- **Hours used:** Approx. 5–10
|
286 |
+
- **Cloud Provider:** Google Cloud (via Colab)
|
287 |
+
- **Compute Region:** Unknown (Google Colab dynamic allocation)
|
288 |
+
- **Carbon Emitted:** Estimated to be very low (< a few kg CO₂eq), since training was limited to a single GPU for a small number of hours.
|
289 |
|
290 |
+
## Citation
|
291 |
|
292 |
**BibTeX:**
|
293 |
|
|
|
295 |
|
296 |
**APA:**
|
297 |
|
298 |
[More Information Needed]
|