MT-LLaMA Model Card

Model details

Model type: MT-LLaMA is an open-source multi-task model trained by fine-tuning LLaMA on the large collection of tasks in P3 (i.e., T0 Train). The training datasets, grouped by task, are listed below:

  • Multi-choice QA: CommonsenseQA, Cosmos QA, DREAM, QuAIL, QuaRTz, QASC, QuaRel, SciQ, Social IQA, Wiki Hop, WiQA
  • Extractive QA: Adversarial QA, DuoRC, Quoref, ROPES
  • Closed-Book QA: Hotpot QA, Wiki QA
  • Sentiment Classification: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp
  • Topic Classification: AG News, DBPedia, TREC
  • Structure-to-Text Generation: Common Gen, Wiki Bio
  • Text Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum
  • Paraphrase Identification: MRPC, PAWS, QQP

Organizations developing the model: The MT-LLaMA team with members from Alibaba Damo Academy and the Chinese University of Hong Kong.

Intended use

You can try the code in our GitHub repo.

Zero-shot Evaluation

We primarily follow the BigScience T0 protocol to assess how well MT-LLaMA generalizes to (1) unseen datasets (i.e., new datasets from tasks seen during training) and (2) unseen tasks.
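For tasks with a fixed set of answer options, T0-style evaluation is commonly done by rank classification: each candidate answer is scored as a continuation of the prompt, and the highest-scoring candidate is taken as the prediction. The sketch below illustrates the idea only. Whether MT-LLaMA's evaluation uses exactly this procedure is an assumption, and `rank_classify`, `log_likelihood`, and `toy_log_likelihood` are hypothetical names; in practice the scorer would query the model.

```python
from typing import Callable, Sequence


def rank_classify(
    prompt: str,
    choices: Sequence[str],
    log_likelihood: Callable[[str, str], float],
) -> str:
    """Return the choice that the scorer rates as the most likely continuation."""
    return max(choices, key=lambda choice: log_likelihood(prompt, choice))


def toy_log_likelihood(prompt: str, continuation: str) -> float:
    """Stand-in scorer for illustration; a real one would call the model."""
    scores = {"Yes": -0.4, "No": -1.3}  # made-up log-probabilities
    return scores[continuation]


prediction = rank_classify(
    "A charming film. Based on this review, would the user recommend this product? No or Yes?",
    ["Yes", "No"],
    toy_log_likelihood,
)
print(prediction)
```

Accuracy is then the fraction of examples for which the prediction equals the gold option.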

Prompt Format

Extractive QA:

  1. XQuAD, TyDiQA, MLQA, SQuAD
     Input: Answer the question according to the context. Question: ${question}. Context: ${context}. Answer:
     Output: ${Answer}
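The `${...}` placeholders in these templates match the syntax of Python's `string.Template`, so a prompt can be filled in with the standard library alone. The following is a minimal sketch, not the repo's actual preprocessing code; the function name `build_extractive_qa_prompt` and the sample question and context are made up for illustration.

```python
from string import Template

# Template text copied from the Extractive QA prompt above.
EXTRACTIVE_QA = Template(
    "Answer the question according to the context. "
    "Question: ${question}. Context: ${context}. Answer:"
)


def build_extractive_qa_prompt(question: str, context: str) -> str:
    """Fill the Extractive QA template with a single example."""
    return EXTRACTIVE_QA.substitute(question=question, context=context)


prompt = build_extractive_qa_prompt(
    question="Where is the Eiffel Tower",
    context="The Eiffel Tower is a landmark in Paris",
)
print(prompt)
```

`substitute` raises `KeyError` if a placeholder is left unfilled, which helps catch malformed examples early.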
    

Sentiment:

  1. SST-2
    Input: ${sentence} Based on this review, would the user recommend this product? No or Yes?
    Output: Yes / No
    

Multiple-Choice QA:

  1. OpenbookQA
    Input: ${question} Which is the correct answer? - (A) ${choiceA} - (B) ${choiceB} - (C) ${choiceC} - (D) ${choiceD}
    Output: ${choiceA} / ${choiceB} / ${choiceC} / ${choiceD}
    

Sentence Completion:

  1. COPA
    Input: ${premise} {% if question == "cause" %} This happened because... {% else %} As a consequence... {% endif %} Help me pick the more plausible option: - ${text1} - ${text2}
    Output: ${text1} / ${text2}
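The COPA template branches on the `question` field, which takes the value "cause" or something else (presumably "effect"). As a rough sketch of that conditional in plain Python, assuming the connector phrase is the only part that changes; `build_copa_prompt` is a hypothetical helper, not a function from the repo:

```python
def build_copa_prompt(premise: str, question: str, text1: str, text2: str) -> str:
    """Build a COPA prompt, choosing the connector from the question type."""
    if question == "cause":
        connector = "This happened because..."
    else:
        connector = "As a consequence..."
    return (
        f"{premise} {connector} "
        f"Help me pick the more plausible option: - {text1} - {text2}"
    )


example = build_copa_prompt(
    premise="The ground is wet.",
    question="cause",
    text1="It rained.",
    text2="It was sunny.",
)
print(example)
```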
    

Coreference Resolution:

  1. Winogrande:
    Input: ${sentence} In the previous sentence, does _ refer to ${option1} or ${option2}?
    Output: ${option1} / ${option2}
    

Word Sense Disambiguation:

  1. WiC
    Input: Does the word "${word}" have the same meaning in these two sentences? Yes, No? ${sentence1} ${sentence2}
    Output: Yes / No
    

Natural Language Inference:

  1. MNLI:
    Input: ${premise} Question: Does this imply that ${hypothesis}? Please response with 'Yes', 'No', or 'Maybe'.
    Output: Yes / No / Maybe
    
  2. RTE
    Input: Given ${premise} Is it guaranteed true that "${hypothesis}"? Yes or no?
    Output: Yes / no
    

Results on Unseen Datasets

Model        XQuAD-en (F1/EM)  TyDiQA-en (F1/EM)  MLQA-en (F1/EM)  SQuAD (F1/EM)  SST-2 (Acc.)  OpenbookQA (Acc.)
LLaMA-7b     9.5 / 2.0         14.3 / 2.6         13.4 / 3.3       29.4 / 11.5    50.5          32.4
MT-LLaMA-7b  42.3 / 31.1       38.9 / 26.9        45.4 / 31.5      85.9 / 77.6    92.6          38.2
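The F1 and EM columns are the usual extractive QA metrics: exact match after answer normalization, and token-level F1 between the predicted and reference answers. The sketch below follows the widely used SQuAD-style definitions; it is an assumption that the numbers above were computed this way, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Paris.", "paris"))
print(token_f1("the Eiffel Tower", "Eiffel Tower in Paris"))
```

When a question has several reference answers, the usual convention is to take the maximum score over the references and then average over all questions.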

Results on Unseen Tasks

Model        COPA (Acc.)  Winogrande (Acc.)  WiC (Acc.)  MNLI (Acc.)  RTE (Acc.)
LLaMA-7b     56.0         49.3               51.7        30.2         52.7
MT-LLaMA-7b  88.0         54.9               52.2        49.6         79.1

Acknowledgement

  • Our training code is largely borrowed from FastChat.
  • We are also grateful for the efforts of LLaMA (from FAIR) and T0 (from BigScience), which serve as the foundation of our work.

If you find this resource useful, please cite the repo as follows:

@software{damonlpsg2023mtllama,
  author = {Xu, Weiwen and Li, Xin and Bing, Lidong},
  title = {Multi-task Instruction-tuned LLaMA},
  year = 2023,
  url = {https://github.com/DAMO-NLP-SG/MT-LLaMA}
}