Update README.md
---
datasets:
- bakhitovd/data_science_arxiv
metrics:
- rouge
---
# Fine-tuned Longformer for Summarization of Machine Learning Articles

## Model Details
- GitHub: https://github.com/Bakhitovd/MS_in_Data_Science_Capstone
- Model name: bakhitovd/led-base-16384-data-science
- Model type: Longformer (allenai/led-base-16384)
- Model description: This Longformer model has been fine-tuned on a focused subset of the arXiv part of the scientific papers dataset, specifically targeting articles about machine learning. It aims to generate accurate and consistent summaries of machine learning research papers.
## Intended Use
This model is intended for text summarization, specifically for summarizing machine learning research papers.
## How to Use
~~~
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("bakhitovd/led-base-16384-data-science")
model = LEDForConditionalGeneration.from_pretrained("bakhitovd/led-base-16384-data-science")
~~~
## Use the model for summarization
~~~
article = "... long document ..."

# Tokenize the full article (LED accepts inputs of up to 16384 tokens)
inputs_dict = tokenizer(article, padding="max_length", max_length=16384, return_tensors="pt", truncation=True)

# This example assumes a CUDA device is available; move the model and inputs to it
model = model.to("cuda")
input_ids = inputs_dict.input_ids.to("cuda")
attention_mask = inputs_dict.attention_mask.to("cuda")

# Global attention on the first token is the usual choice for LED summarization
global_attention_mask = torch.zeros_like(attention_mask)
global_attention_mask[:, 0] = 1

predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=512)

# generate() returns a batch of sequences; decode the first one
summary = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
print(summary)
~~~
## Training Data
Dataset name: bakhitovd/data_science_arxiv
This dataset is a subset of the 'Scientific papers' dataset, containing the articles that are semantically and structurally closest to articles describing machine learning. The subset was obtained by applying K-means clustering to embeddings generated by SciBERT.
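
The subset can be loaded directly from the Hugging Face Hub with the `datasets` library. A minimal sketch, shown only to inspect the available splits and columns:

~~~
from datasets import load_dataset

# Load the machine-learning-focused arXiv subset from the Hugging Face Hub
dataset = load_dataset("bakhitovd/data_science_arxiv")

# Inspect the available splits and columns before use
print(dataset)
~~~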
## Evaluation Results
The model was evaluated using ROUGE metrics and showed improved performance over the baseline models.
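
ROUGE scores of this kind can be computed with the Hugging Face `evaluate` library. The snippet below is a minimal illustrative sketch with toy strings, not the exact evaluation setup used for this model:

~~~
import evaluate

# Compute ROUGE between generated summaries and reference abstracts
rouge = evaluate.load("rouge")

predictions = ["the model produces summaries of machine learning papers"]    # generated summaries
references = ["the model generates summaries of machine learning articles"]  # reference abstracts
print(rouge.compute(predictions=predictions, references=references))
~~~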
