Tutorial: NAACL-22: New Frontiers of Information Extraction

Instructors

Muhao Chen, Lifu Huang, Manling Li, Ben Zhou, Heng Ji and Dan Roth.

Date and Time

Sunday, July 10, 2022.

Goal of Tutorial:

This tutorial targets researchers and practitioners who are interested in AI and ML technologies for structural information extraction (IE) from unstructured textual sources. In particular, it will provide the audience with a systematic introduction to recent advances in IE by addressing several important research questions: (i) How can we develop a robust IE system from small amounts of noisy training data while ensuring the reliability of its predictions? (ii) How can we foster the generalizability of IE by enhancing the system’s cross-lingual, cross-domain, cross-task and cross-modal transferability? (iii) How can we support the extraction of structural information with extremely fine-grained and diverse labels? (iv) How can we further improve IE by leveraging indirect supervision from other NLP tasks, such as NLI, QA or summarization, and from pre-trained language models? (v) How can we acquire knowledge to guide inference in IE systems? We will discuss several lines of frontier research that tackle these challenges, and will conclude the tutorial by outlining directions for further investigation.

Introduction

Information extraction (IE) is the process of automatically extracting structural information from unstructured or semi-structured data. It provides essential support for natural language understanding by recognizing and resolving the concepts, entities and events described in text, and by inferring the relations among them. In various application domains, IE automates the costly acquisition of the domain-specific knowledge representations that form the backbone of knowledge-driven AI systems. For example, automated knowledge base construction has relied on technologies for entity-centric IE. Extraction of events and event chains assists machines with narrative prediction and summarization tasks. Medical IE also benefits important but expensive clinical tasks such as drug discovery and repurposing. Despite its importance, frontier research in IE still faces several key challenges. The first challenge is that the dominant existing methods, which rely on language modeling representations, cannot sufficiently capture the essential knowledge and structures required for IE tasks. The second challenge is the development of extraction models for fine-grained information with less supervision, given that obtaining structural annotations on unlabeled data has been very costly. The third challenge is to extend the reliability and generalizability of IE systems to real-world scenarios, where data sources often contain incorrect, invalid or unrecognizable inputs, as well as inputs with unseen labels and a mixture of modalities. By tackling these critical challenges, recent literature is leading to transformative advances in the principles and methodologies of IE system development. We believe it is necessary to present a timely tutorial that comprehensively summarizes the new frontiers in IE research and points out the emerging challenges that deserve further investigation.

In this tutorial, we will systematically review several lines of frontier research on developing robust, reliable and adaptive learning systems for extracting rich structured information. Beyond introducing robust learning and inference methods for unsupervised denoising, constraint capture and novelty detection, we will discuss recent approaches that leverage indirect supervision from natural language inference and generation tasks to improve IE. We will also review recent minimally supervised methods for training IE models with distant supervision from linguistic patterns, corpus statistics or language modeling objectives. In addition, we will illustrate how a model trained on a closed domain can be reliably adapted to produce extractions from data sources in different domains, languages and modalities, and how global knowledge (e.g., event schemas) can be acquired to guide extraction over a highly diverse open label space. Participants will learn about recent trends and emerging challenges in this topic, representative tools and learning resources for obtaining ready-to-use models, and how related technologies benefit end-user NLP applications.

Tutorial Outline

Introduction [20 min]
handout

We will define the main research problem and motivate the topic by presenting several real-world NLP and knowledge-driven AI applications of IE technologies, as well as several key challenges that are at the core of frontier research in this area.

Indirect and Minimal Supervision for IE [35 min]
handout

We will introduce effective approaches that use indirect supervision for IE, that is, approaches that use supervision signals from related tasks to make up for the limited quantity and coverage of IE-specific training data. Popular indirect supervision sources include question answering and reading comprehension, natural language inference and generation. We will also cover structural texts (e.g., Wikipedia) as indirect sources. With the breakthrough of large-scale pre-trained language models, methodologies have been proposed that explore the language modeling objective as indirect supervision for IE. To this end, we will cover methods including direct probing and, more recently, pre-training with distant signals.
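To make the idea of NLI as indirect supervision concrete, here is a minimal sketch that recasts relation extraction as entailment: each candidate relation is verbalized as a hypothesis, and the relation whose hypothesis is best supported by the sentence is selected. The relation names, the templates, the threshold and the function names are all hypothetical and chosen for illustration. The scorer is a toy token-overlap stand-in so that the example runs without any model; in practice it would be replaced by the entailment probability of a trained NLI model.

```python
from typing import Callable, Dict, Tuple

# Hypothetical relation templates (illustrative only, not from any benchmark).
TEMPLATES: Dict[str, str] = {
    "born_in": "{head} was born in {tail}.",
    "works_for": "{head} works for {tail}.",
    "founded": "{head} founded {tail}.",
}


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: fraction of hypothesis tokens found in the premise."""
    def tokens(text: str) -> set:
        return set(text.lower().replace(".", " ").replace(",", " ").split())

    premise_tokens, hypothesis_tokens = tokens(premise), tokens(hypothesis)
    if not hypothesis_tokens:
        return 0.0
    return len(premise_tokens & hypothesis_tokens) / len(hypothesis_tokens)


def extract_relation(
    sentence: str,
    head: str,
    tail: str,
    scorer: Callable[[str, str], float] = toy_entailment_score,
    threshold: float = 0.9,  # assumed cut-off; below it we predict "no_relation"
) -> Tuple[str, Dict[str, float]]:
    """Score one verbalized hypothesis per relation and return the best-supported one."""
    scores = {
        relation: scorer(sentence, template.format(head=head, tail=tail))
        for relation, template in TEMPLATES.items()
    }
    best = max(scores, key=scores.get)
    label = best if scores[best] >= threshold else "no_relation"
    return label, scores


label, scores = extract_relation(
    "Marie Curie was born in Warsaw in 1867.", "Marie Curie", "Warsaw"
)
print(label, scores)
```

Because the relation inventory lives entirely in the templates, adding a new relation type under this formulation only requires writing a new template rather than annotating new IE training data.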

Robust Learning and Inference for IE [35 min]
handout

We will introduce methodologies that enhance the robustness of learning systems for IE in both their learning and inference phases. These methodologies include self-supervised denoising techniques for training noise-robust IE models based on co-regularized knowledge distillation, label re-weighting and label smoothing. We will also discuss unsupervised techniques for out-of-distribution (OOD) detection, prediction with abstention and novelty class detection, which help the IE model identify invalid inputs or inputs with semantic shifts during its inference phase. In addition, to demonstrate how models can ensure the global consistency of the extraction, we will cover constraint learning methods that automatically capture logical constraints among relations, and techniques to enforce the constraints in inference.
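As a small illustration of enforcing logical constraints at inference time, the sketch below decodes temporal relations among three events A, B and C. Choosing each relation independently can yield an inconsistent result, so the decoder instead searches for the highest-scoring joint assignment that satisfies transitivity. The constraint is hand-written here, whereas the methods covered in the tutorial learn such constraints automatically; the probabilities, label set and function names are made up for illustration, and real systems typically use ILP or similar solvers rather than brute-force enumeration.

```python
import itertools
import math
from typing import Dict, Optional

LABELS = ("before", "after")


def is_consistent(ab: str, bc: str, ac: str) -> bool:
    """Transitivity: A<B and B<C imply A<C; A>B and B>C imply A>C."""
    if ab == "before" and bc == "before" and ac != "before":
        return False
    if ab == "after" and bc == "after" and ac != "after":
        return False
    return True


def constrained_decode(probs: Dict[str, Dict[str, float]]) -> Optional[Dict[str, str]]:
    """Return the consistent joint assignment with the highest total log-probability.

    Assumes every probability is strictly positive (math.log(0) would raise).
    """
    best, best_score = None, -math.inf
    for ab, bc, ac in itertools.product(LABELS, repeat=3):
        if not is_consistent(ab, bc, ac):
            continue  # skip assignments that violate the constraint
        score = (
            math.log(probs["AB"][ab])
            + math.log(probs["BC"][bc])
            + math.log(probs["AC"][ac])
        )
        if score > best_score:
            best, best_score = {"AB": ab, "BC": bc, "AC": ac}, score
    return best


# Made-up local classifier outputs for the three event pairs.
local_probs = {
    "AB": {"before": 0.9, "after": 0.1},
    "BC": {"before": 0.8, "after": 0.2},
    "AC": {"before": 0.45, "after": 0.55},
}

# Independent decoding picks the most likely label per pair, ignoring constraints.
independent = {pair: max(p, key=p.get) for pair, p in local_probs.items()}
consistent = constrained_decode(local_probs)
print("independent:", independent)
print("constrained:", consistent)
```

In this toy case the independent prediction places A before B and B before C yet A after C; the constrained decoder flips the weakly supported A-C decision to restore global consistency.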

Knowledge-guided IE [15 min]
handout

Global knowledge representations induced from large-scale corpora can guide inference about the complicated connections between knowledge elements and help fix extraction errors. We will introduce cross-task and cross-instance statistical constraint knowledge, commonsense knowledge, and global event schema knowledge that help jointly extract entities, relations, and events.

Transferability of IE Systems [35 min]
handout

One important challenge of developing IE systems lies in the limited coverage of predefined schemas (e.g., predefined types of entities, relations or events) and the heavy reliance on human annotations. When moving to new types, domains or languages, we have to start from scratch by creating annotations and re-training the extraction models. In this part of the tutorial, we will cover recent advances in improving the transferability of IE, including (1) cross-lingual transfer by leveraging adversarial training, language-invariant representations and resources, pre-trained multilingual language models and data projection; (2) cross-type transfer, including zero-shot and few-shot IE by learning prototypes, reading definitions and answering questions; and (3) transfer across different benchmark datasets. Finally, we will discuss progress on lifelong learning for IE, which enables knowledge transfer across incrementally updated models.
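The prototype-based route to few-shot, cross-type transfer can be sketched as follows: each new type is represented by the mean embedding of its few labeled support mentions, and a query mention is assigned to the type with the nearest prototype. This is a minimal sketch under strong assumptions. The two-dimensional vectors stand in for mention embeddings that would normally come from a pre-trained encoder, and the type names, numbers and function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def build_prototypes(support: List[Tuple[str, Vector]]) -> Dict[str, List[float]]:
    """Average the support embeddings of each type to obtain one prototype per type."""
    grouped: Dict[str, List[Vector]] = defaultdict(list)
    for label, vector in support:
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query: Vector, prototypes: Dict[str, List[float]]) -> str:
    """Assign the query to the type whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Toy support set: two labeled mentions per previously unseen type (made-up embeddings).
support_set = [
    ("PERSON", (0.9, 0.1)),
    ("PERSON", (0.8, 0.2)),
    ("LOCATION", (0.1, 0.9)),
    ("LOCATION", (0.2, 0.7)),
]

prototypes = build_prototypes(support_set)
prediction = classify((0.7, 0.3), prototypes)
print(prototypes, prediction)
```

Since no parameters are updated when a type is added, supporting a new type in this setup only requires a handful of labeled mentions to form its prototype.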

Cross-modal IE [20 min]
handout

Cross-modal IE aims to extract structured knowledge from multiple modalities, including unstructured and semi-structured text, images, videos and tables. We will start with visual event and argument extraction from images and videos. To extract multimedia events, the key challenge is to identify cross-modal coreference and linking, and to represent both textual and visual knowledge in a common semantic space. We will also introduce information extraction from semi-structured and tabular data.

Future Research Directions [30 min]
handout

IE is a key component in supporting knowledge acquisition and it impacts a wide spectrum of knowledge-driven AI applications. We will conclude the tutorial by presenting further challenges and potential research topics in identifying trustworthiness of extracted content, IE with quantitative reasoning, cross-document IE, modeling of label semantics, and challenges for acquiring implicit but essential information from corpora that potentially involve reporting bias.

Resources:

  • Tutorial syllabus
  • Tutorial slides