arxiv:2004.03090

Interview: A Large-Scale Open-Source Corpus of Media Dialog

Published on Apr 7, 2020

Authors:

Abstract

A new large-scale media dialog dataset, 'Interview,' based on news interview transcripts, enhances zero-shot performance on spoken dialog tasks and improves role-specific responses in interview-style conversations.

AI-generated summary

Existing conversational datasets consist either of written proxies for dialog or small-scale transcriptions of natural speech. We introduce 'Interview': a large-scale (105K conversations) media dialog dataset collected from news interview transcripts. Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance on existing spoken dialog datasets, demonstrating its usefulness in modeling real-world conversations. 'Interview' contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems. In fact, experiments on two dialog tasks show that leveraging such labels improves performance over strong speaker-agnostic baselines, and enabling models to generate more specific and inquisitive responses in interview-style conversations.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2004.03090 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2004.03090 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2004.03090 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.