# GupShup

Code and data for *GupShup: Summarizing Open-Domain Code-Switched Conversations*, published at EMNLP 2021.

Paper: [https://aclanthology.org/2021.emnlp-main.499.pdf](https://aclanthology.org/2021.emnlp-main.499.pdf)

GitHub: [https://github.com/midas-research/gupshup](https://github.com/midas-research/gupshup)

### Dataset

Please request the GupShup data using [this Google form](https://docs.google.com/forms/d/1zvUk7WcldVF3RCoHdWzQPzPprtSJClrnHoIOYbzaJEI/edit?ts=61381ec0).

The dataset covers two tasks: `Hinglish Dialogues to English Summarization` (h2e) and `English Dialogues to English Summarization` (e2e). For each task, the dialogue (conversation) files have the `.source` extension (e.g., `train.source`), while the summary files have the `.target` extension (e.g., `train.target`). Pass the `.source` file to the `input_path` argument and the `.target` file to the `reference_path` argument of the scripts.
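
Since the `.source` and `.target` files are line-aligned, the pairing can be sketched in plain Python. The two-line sample below is invented for illustration and is not real GupShup data:

```python
from pathlib import Path

# Hypothetical miniature stand-ins for train.source / train.target.
Path("train.source").write_text("Alex: kal milte hain!\n")
Path("train.target").write_text("Alex plans to meet tomorrow.\n")

# Line i of the .source file pairs with line i of the .target file.
dialogues = Path("train.source").read_text().splitlines()
summaries = Path("train.target").read_text().splitlines()
pairs = list(zip(dialogues, summaries))
print(len(pairs))  # 1
```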

## Models

All model weights are available on the Hugging Face model hub. You can either download the weights locally and pass that path to the `model_name` argument of the scripts, or pass one of the aliases listed below directly to `model_name`, in which case the scripts download the weights automatically.

Model aliases follow the pattern `gupshup_TASK_MODEL`, where `TASK` is `h2e` or `e2e` and `MODEL` is one of the architectures listed below (`mbart`, `pegasus`, etc.).

**1. Hinglish Dialogues to English Summary (h2e)**

| Model | Hugging Face alias |
|---------|-------------------------------------------------------------------------------|
| mBART | [midas/gupshup_h2e_mbart](https://huggingface.co/midas/gupshup_h2e_mbart) |
| PEGASUS | [midas/gupshup_h2e_pegasus](https://huggingface.co/midas/gupshup_h2e_pegasus) |
| T5 MTL | [midas/gupshup_h2e_t5_mtl](https://huggingface.co/midas/gupshup_h2e_t5_mtl) |
| T5 | [midas/gupshup_h2e_t5](https://huggingface.co/midas/gupshup_h2e_t5) |
| BART | [midas/gupshup_h2e_bart](https://huggingface.co/midas/gupshup_h2e_bart) |
| GPT-2 | [midas/gupshup_h2e_gpt](https://huggingface.co/midas/gupshup_h2e_gpt) |

**2. English Dialogues to English Summary (e2e)**

| Model | Hugging Face alias |
|---------|-------------------------------------------------------------------------------|
| mBART | [midas/gupshup_e2e_mbart](https://huggingface.co/midas/gupshup_e2e_mbart) |
| PEGASUS | [midas/gupshup_e2e_pegasus](https://huggingface.co/midas/gupshup_e2e_pegasus) |
| T5 MTL | [midas/gupshup_e2e_t5_mtl](https://huggingface.co/midas/gupshup_e2e_t5_mtl) |
| T5 | [midas/gupshup_e2e_t5](https://huggingface.co/midas/gupshup_e2e_t5) |
| BART | [midas/gupshup_e2e_bart](https://huggingface.co/midas/gupshup_e2e_bart) |
| GPT-2 | [midas/gupshup_e2e_gpt](https://huggingface.co/midas/gupshup_e2e_gpt) |
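
The `gupshup_TASK_MODEL` alias pattern can be expressed as a small helper. Note that `gupshup_alias` is a hypothetical name used here for illustration, not a function from this repo:

```python
def gupshup_alias(task: str, model: str) -> str:
    """Build a Hugging Face hub alias following the gupshup_TASK_MODEL pattern."""
    assert task in {"h2e", "e2e"}, "task must be h2e or e2e"
    assert model in {"mbart", "pegasus", "t5_mtl", "t5", "bart", "gpt"}
    return f"midas/gupshup_{task}_{model}"

print(gupshup_alias("h2e", "mbart"))  # midas/gupshup_h2e_mbart
```

Any of the resulting strings can be passed to the `model_name` argument below.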

## Inference

### Using the command line

1. Clone this repo, create a [Python virtual environment](https://docs.python.org/3/library/venv.html), and install the required packages:
```
git clone https://github.com/midas-research/gupshup.git
cd gupshup
pip install -r requirements.txt
```

2. The `run_eval.py` script takes the following arguments:
* **model_name**: Path to locally downloaded weights, or one of the Hugging Face aliases listed above.
* **input_path**: Path to the file containing the conversations to be summarized.
* **save_path**: Path to the file where the generated summaries will be saved.
* **reference_path**: Path to the file containing the reference summaries, used to calculate metrics.
* **score_path**: Path to the file where the scores will be saved.
* **bs**: Batch size.
* **device**: CUDA device(s) to use.

Make sure you have downloaded the GupShup dataset using the Google form above, and provide the correct paths to these files in the `input_path` and `reference_path` arguments. Alternatively, simply place `test.source` and `test.target` in the `data/h2e/` (Hinglish to English) or `data/e2e/` (English to English) folder. For example, to generate English summaries from Hinglish dialogues using the mBART model, run the following command:

```
python run_eval.py \
    --model_name midas/gupshup_h2e_mbart \
    --input_path data/h2e/test.source \
    --save_path generated_summary.txt \
    --reference_path data/h2e/test.target \
    --score_path scores.txt \
    --bs 8
```

As another example, to generate English summaries from English dialogues using the PEGASUS model:

```
python run_eval.py \
    --model_name midas/gupshup_e2e_pegasus \
    --input_path data/e2e/test.source \
    --save_path generated_summary.txt \
    --reference_path data/e2e/test.target \
    --score_path scores.txt \
    --bs 8
```
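
The scores saved to `score_path` come from comparing the generated summaries against the references. As an illustration of the kind of overlap metric involved, here is a from-scratch ROUGE-1 F1 sketch; it is not the script's actual implementation:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated summary and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 3))  # 0.667
```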
| 82 |
+
|
| 83 |
+
|
| 84 |
+
Please create an issue if you are facing any difficulties in replicating the results.
|

### References

Please cite [[1]](https://arxiv.org/abs/1910.04073) if you found the resources in this repository useful.

[1] Mehnaz, Laiba, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, and Rajiv Shah. [*GupShup: Summarizing Open-Domain Code-Switched Conversations*](https://aclanthology.org/2021.emnlp-main.499.pdf). In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6177-6192.

```
@inproceedings{mehnaz2021gupshup,
  title={GupShup: Summarizing Open-Domain Code-Switched Conversations},
  author={Mehnaz, Laiba and Mahata, Debanjan and Gosangi, Rakesh and Gunturi, Uma Sushmitha and Jain, Riya and Gupta, Gauri and Kumar, Amardeep and Lee, Isabelle G and Acharya, Anish and Shah, Rajiv},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={6177--6192},
  year={2021}
}
```