Authors:
(1) Jianzhu Yao, The CoAI group, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology;
(2) Ziqi Liu, The CoAI group, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology;
(3) Jian Guan, The CoAI group, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology;
(4) Minlie Huang, The CoAI group, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology.
We aim to measure a model's ability to understand and generate dialogue in a story. To this end, we design a dialogue generation task, Masked Dialogue Generation, and a dialogue understanding task, Dialogue Speaker Recognition. We describe the task definitions, targets, dataset construction, and statistics below.
Dataset Construction. We construct the DialGen dataset from DIALSTORY under the following constraints:
• We randomly mask 30% of the dialogue turns in each story.
• We do not mask dialogue turns within the first 50 tokens, which provide sufficient background information for the story.
• We do not mask dialogue turns within the last 30 tokens, which provide ending information for the story.
• We ensure that each input story (i.e., with masked dialogue turns) mentions at least five characters.
Table 2 shows the detailed statistics.
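The masking constraints above can be sketched as a small selection routine. This is an illustrative reconstruction, not the authors' code: the segment representation, the whitespace token count, and the `[MASK]` placeholder are all assumptions.

```python
import random

def mask_dialogue_turns(segments, mask_ratio=0.3, head_tokens=50,
                        tail_tokens=30, seed=0):
    """Mask a fraction of dialogue turns outside the protected head/tail.

    segments: list of (text, is_dialogue) tuples in story order.
    Token positions are approximated by whitespace splitting (an assumption).
    """
    total = sum(len(text.split()) for text, _ in segments)
    # Collect dialogue turns that lie entirely outside the protected
    # first `head_tokens` and last `tail_tokens` of the story.
    offset, eligible = 0, []
    for i, (text, is_dialogue) in enumerate(segments):
        start, end = offset, offset + len(text.split())
        if is_dialogue and start >= head_tokens and end <= total - tail_tokens:
            eligible.append(i)
        offset = end
    n_dialogue = sum(1 for _, is_dialogue in segments if is_dialogue)
    # Mask ~30% of all dialogue turns, capped by how many are eligible.
    k = min(len(eligible), round(mask_ratio * n_dialogue))
    chosen = set(random.Random(seed).sample(eligible, k))
    return [("[MASK]" if i in chosen else text)
            for i, (text, _) in enumerate(segments)]
```

For example, with twelve 10-token segments and dialogue turns at positions 2, 5, 6, 7, and 10, only positions 5-7 fall outside the protected regions, so masks are drawn from those.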
Dataset Construction. We randomly sampled 20k stories from DIALSTORY and automatically annotated the speaker of each dialogue turn for training, and resorted to manual annotation for validation and testing. For manual annotation, we first asked one annotator to label the characters in a story and the speaker of each dialogue turn. We then asked two other annotators to check the correctness of the annotations, e.g., whether all mentioned characters were labeled and whether each dialogue speaker was correct. We required the first annotator to re-annotate the examples that the two other annotators disagreed on, and repeated this process until all annotators agreed on every example. We also sampled 100 stories from the training set for manual annotation to investigate the accuracy of the automatic annotation, which we discuss in Section 6.2. Table 2 shows the detailed statistics.
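The iterative re-annotation protocol can be sketched as a fixed-point loop. This is a hypothetical illustration of the workflow only: `annotate` stands in for the first annotator and each entry of `checkers` for one of the verifying annotators.

```python
def iterative_annotation(stories, annotate, checkers, max_rounds=10):
    """Re-annotate each story until every checker accepts its labels.

    annotate(story) -> labels; each checker(story, labels) -> bool.
    Both are placeholders for human annotators in the paper's protocol.
    """
    labels = {s: annotate(s) for s in stories}
    for _ in range(max_rounds):
        # Examples the checking annotators do not agree on.
        disputed = [s for s in stories
                    if not all(check(s, labels[s]) for check in checkers)]
        if not disputed:
            break  # all annotators agree on every example
        for s in disputed:  # the first annotator redoes disputed examples
            labels[s] = annotate(s)
    return labels
```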
This paper is available on arxiv under CC 4.0 DEED license.