---
license: apache-2.0
base_model: google/flan-t5-large
tags:
- generated_from_trainer
model-index:
- name: Prompting-NLP-Paper-to-QA-Generation-abstract-only
results: []
widget:
- text: >-
Generate Question, Answer pair correspond to the following research paper.
[Abstract] The dominant sequence transduction models are based on complex
recurrent or convolutional neural networks in an encoder-decoder
configuration. The best performing models also connect the encoder and
decoder through an attention mechanism. We propose a new simple network
architecture, the Transformer, based solely on attention mechanisms,
dispensing with recurrence and convolutions entirely. Experiments on two
machine translation tasks show these models to be superior in quality
while being more parallelizable and requiring significantly less time to
train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German
translation task, improving over the existing best results, including
ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation
task, our model establishes a new single-model state-of-the-art BLEU score
of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the
training costs of the best models from the literature. We show that the
Transformer generalizes well to other tasks by applying it successfully to
English constituency parsing both with large and limited training data.
[Introduction] Recurrent neural networks, long short-term memory [13] and
gated recurrent [7] neural networks in particular, have been firmly
established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation
[35, 2, 5]. Numerous efforts have since continued to push the boundaries
of recurrent language models and encoder-decoder architectures [38, 24,
15]. Recurrent models typically factor computation along the symbol
positions of the input and output sequences. Aligning the positions to
steps in computation time, they generate a sequence of hidden states ht,
as a function of the previous hidden state ht−1 and the input for position
t. This inherently sequential nature precludes parallelization within
training examples, which becomes critical at longer sequence lengths, as
memory constraints limit batching across examples. Recent work has
achieved significant improvements in computational efficiency through
factorization tricks [21] and conditional computation [32], while also
improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains. Attention
mechanisms have become an integral part of compelling sequence modeling
and transduction models in various tasks, allowing modeling of
dependencies without regard to their distance in the input or output
sequences [2, 19]. In all but a few cases [27], however, such attention
mechanisms are used in conjunction with a recurrent network. In this work
we propose the Transformer, a model architecture eschewing recurrence and
instead relying entirely on an attention mechanism to draw global
dependencies between input and output. The Transformer allows for
significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on
eight P100 GPUs.
Question, Answer:
example_title: Attention Is All You Need
- text: >-
Generate Question, Answer pair correspond to the following research paper.
[Abstract] In this work, we explore prompt tuning, a simple yet effective
mechanism for learning soft prompts to condition frozen language models to
perform specific downstream tasks. Unlike the discrete text prompts used
by GPT-3, soft prompts are learned through backpropagation and can be
tuned to incorporate signal from any number of labeled examples. Our
end-to-end learned approach outperforms GPT-3's few-shot learning by a
large margin. More remarkably, through ablations on model size using T5,
we show that prompt tuning becomes more competitive with scale: as models
exceed billions of parameters, our method closes the gap and matches the
strong performance of model tuning (where all model weights are tuned).
This finding is especially relevant in that large models are costly to
share and serve, and the ability to reuse one frozen model for multiple
downstream tasks can ease this burden. Our method can be seen as a
simplification of the recently proposed prefix tuning of Li and Liang
(2021), and we provide a comparison to this and other similar approaches.
Finally, we show that conditioning a frozen model with soft prompts
confers benefits in robustness to domain transfer, as compared to full
model tuning. [Introduction] With the wide success of pre-trained large
language models, a range of techniques has arisen to adapt these
general-purpose models to downstream tasks. ELMo (Peters et al., 2018)
proposed freezing the pre-trained model and learning a task-specific
weighting of its per-layer representations. However, since GPT (Radford et
al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation
technique has been model tuning (or fine-tuning), where all model
parameters are tuned during adaptation, as proposed by Howard and Ruder
(2018).More recently, Brown et al. (2020) showed that prompt design (or
priming) is surprisingly effective at modulating a frozen GPT-3 model’s
behavior through text prompts. Prompts are typically composed of a task
description and/or several canonical examples. This return to freezing
pre-trained models is appealing, especially as model size continues to
increase. Rather than requiring a separate copy of the model for each
downstream task, a single generalist model can simultaneously serve many
different tasks. Unfortunately, prompt-based adaptation has several key
drawbacks. Task description is error-prone and requires human involvement,
and the effectiveness of a prompt is limited by how much conditioning text
can fit into the model’s input. As a result, downstream task quality still
lags far behind that of tuned models. For instance, GPT-3 175B fewshot
performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
efforts to automate prompt design have been recently proposed. Shin et al.
(2020) propose a search algorithm over the discrete space of words, guided
by the downstream application training data. While this technique
outperforms manual prompt design, there is still a gap relative to model
tuning. Li and Liang (2021) propose prefix tuning and show strong results
on generative tasks. This method freezes the model parameters and
backpropagates the error during tuning to prefix activations prepended to
each layer in the encoder stack, including the input layer. Hambardzumyan
et al. (2021) simplify this recipe by restricting the trainable parameters
to the input and output subnetworks of a masked language model, and show
reasonable results on classifications tasks. In this paper, we propose
prompt tuning as a further simplification for adapting language models. We
freeze the entire pre-trained model and only allow an additional k tunable
tokens per downstream task to be prepended to the input text. This soft
prompt is trained end-to-end and can condense the signal from a full
labeled dataset, allowing our method to outperform few-shot prompts and
close the quality gap with model tuning (Figure 1). At the same time,
since a single pre-trained model is recycled for all downstream tasks, we
retain the efficient serving benefits of frozen models (Figure 2). While
we developed our method concurrently with Li and Liang (2021) and
Hambardzumyan et al. (2021), we are the first to show that prompt tuning
alone (with no intermediate-layer prefixes or task-specific output layers)
is sufficient to be competitive with model tuning. Through detailed
experiments in sections 2–3, we demonstrate that language model capacity
is a key ingredient for these approaches to succeed. As Figure 1 shows,
prompt tuning becomes more competitive with scale. We compare with similar
approaches in Section 4. Explicitly separating task-specific parameters
from the generalist parameters needed for general language-understanding
has a range of additional benefits. We show in Section 5 that by capturing
the task definition in the prompt while keeping the generalist parameters
fixed, we are able to achieve better resilience to domain shifts. In
Section 6, we show that prompt ensembling, learning multiple prompts for
the same task, can boost quality and is more efficient than classic model
ensembling. Finally, in Section 7, we investigate the interpretability of
our learned soft prompts. In sum, our key contributions are: 1. Proposing
prompt tuning and showing its competitiveness with model tuning in the
regime of large language models. 2. Ablating many design choices, and
showing quality and robustness improve with scale. 3. Showing prompt
tuning outperforms model tuning on domain shift problems. 4. Proposing
prompt ensembling and showing its effectiveness.
Question, Answer:
example_title: '2104.08691'
---

# Prompting-NLP-Paper-to-QA-Generation-abstract-only
This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on an unknown dataset. It achieves the following results on the evaluation set:
- Loss: 0.4504
## Model description
More information needed
## Intended uses & limitations
More information needed
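That said, the widget examples above suggest the intended use: given the instruction `Generate Question, Answer pair correspond to the following research paper.` followed by the paper's `[Abstract]` (and optionally `[Introduction]`) text and the suffix `Question, Answer:`, the model generates a question-answer pair about the paper. The snippet below is a minimal inference sketch, not an official usage example: the repository id is a placeholder and the generation settings are assumptions.

```python
# Minimal inference sketch for this model. The repository id is a placeholder;
# replace it with the actual Hub path of this checkpoint. Generation settings
# (max_new_tokens, truncation length) are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "<username>/Prompting-NLP-Paper-to-QA-Generation-abstract-only"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

abstract = "The dominant sequence transduction models are based on complex recurrent ..."
prompt = (
    "Generate Question, Answer pair correspond to the following research paper. "
    f"[Abstract] {abstract} "
    "Question, Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```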
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (a hedged `Seq2SeqTrainingArguments` sketch follows the list):
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 184
- num_epochs: 10
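
Since the training script itself is not documented here, the list above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch consistent with the `generated_from_trainer` tag; `output_dir` and the evaluation strategy are assumptions, and the Adam betas/epsilon listed above match the Trainer defaults, so they are not set explicitly.

```python
# Hedged reconstruction of the hyperparameters listed above as
# Seq2SeqTrainingArguments. output_dir and evaluation_strategy are assumptions;
# everything else mirrors the list.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="Prompting-NLP-Paper-to-QA-Generation-abstract-only",  # assumed
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,  # effective train batch size of 16
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=184,
    num_train_epochs=10,
    evaluation_strategy="epoch",  # assumed; per-epoch validation losses are reported below
)
```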
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log        | 0.99  | 46   | 34.6109         |
| 29.7732       | 1.99  | 92   | 16.5236         |
| 29.7732       | 2.98  | 138  | 4.6887          |
| 7.9911        | 3.97  | 184  | 0.5679          |
| 7.9911        | 4.97  | 230  | 0.4795          |
| 0.6152        | 5.96  | 276  | 0.4577         |
| 0.6152        | 6.95  | 322  | 0.4523         |
| 0.4811        | 7.95  | 368  | 0.4509         |
| 0.4811        | 8.94  | 414  | 0.4505         |
| 0.4721        | 9.93  | 460  | 0.4504         |
### Framework versions
- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0