MLNews

Mirror: A Ubiquitous Model for Information Extraction and Related Tasks

Experiencing information loss and complications during the information extraction process? Mirror is here: a universal framework for a wide range of information extraction tasks.

Tong Zhu, Junfei Ren, Zijian Yu, Mengsong Wu, and Guoliang Zhang, researchers in Artificial Intelligence at the Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, China, presented this framework.

Mirror is a unified framework for performing tasks related to Information Extraction (IE). It reformulates existing IE tasks as a multi-span cyclic graph extraction problem and introduces an algorithm that extracts all the spans in a single cycle. This graph structure is versatile and supports both simple and complex tasks under the umbrella of information extraction.

A corpus of 57 datasets was manually collected for model pre-training, and experiments were conducted on 30 datasets across 8 downstream tasks. The results show that this model is highly versatile and outperforms previous models.

Mirror information extraction cycle

Existing Research

The field of multi-task IE has shown progress recently, aiming to perform multiple IE tasks with a single model. Information extraction tasks involve different graph structures. Flat, nested, and discontinuous NER (Named Entity Recognition) tasks are formulated as a graph with tail-to-head and next-neighboring connections. DyGIE++ treats NER, RE, and EE tasks as span graphs and applies iterative propagation to enhance the spans' contextual representations. Besides graph-based multi-task IE, generative language models are also used: in schema-guided IE systems, schemas are provided as input to extract the required information.
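To make the graph formulation concrete, here is a minimal sketch of how a discontinuous mention can be encoded as a span graph with next-neighboring connections between its fragments. The function name and the dictionary layout are illustrative assumptions, not the paper's actual data structures.

```python
# Hypothetical sketch: encoding a discontinuous mention as a span graph.
# A mention split into several fragments is represented by its fragment
# spans (nodes) plus "next-neighboring" edges linking consecutive fragments.

def discontinuous_mention_graph(fragments):
    """fragments: ordered list of (start, end) token spans of one mention.
    Returns the fragment nodes and the edges connecting each fragment
    to the next one."""
    edges = [(fragments[i], fragments[i + 1]) for i in range(len(fragments) - 1)]
    return {"nodes": fragments, "edges": edges}

# A mention realized as three separate token spans in the sentence:
g = discontinuous_mention_graph([(0, 1), (4, 5), (8, 9)])
print(g["edges"])  # -> [((0, 1), (4, 5)), ((4, 5), (8, 9))]
```

A flat mention is simply the special case with one fragment and no edges.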

Although all the above-mentioned methods use generative language models, they cannot predict exact positions, which causes ambiguity during evaluation. In addition, large generative language models usually require substantial computing resources and are slow to train. Existing models also cannot handle complex IE tasks such as n-ary information extraction and multi-span discontinuous NER.

What is Mirror?

Information Extraction is an important field of Natural Language Processing (NLP) that extracts structured information from unstructured text. Information Extraction includes Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE).

Each category of IE task is typically performed with a task-specific model and data structure, making it impossible to share information among tasks. Generative pre-trained language models (PLMs) can exploit the common features shared by different tasks and generate structured information directly. However, generative models require substantial resources and are slow to train on large-scale datasets. Extractive PLMs, which are faster to train and run inference with, can also be applied. USM considers IE tasks as a triplet prediction problem via semantic matching. Despite these enhancements, it covers only a limited range of triplet-based tasks and does not support n-ary extraction or multi-span problems.

A novel framework, Mirror, was introduced to overcome the challenges mentioned above. It can handle complex multi-span extraction, n-ary extraction, Machine Reading Comprehension (MRC) tasks, and even classification tasks, which prior models were unable to perform.

Mirror Architecture

The IE task is formulated as a unified multi-slot tuple extraction problem, and each tuple is transformed into a multi-span cyclic graph. This structure is scalable and flexible, so the framework can be applied to Machine Reading Comprehension (MRC), complex IE tasks, and classification tasks. Mirror takes schemas as input and therefore naturally benefits few-shot and zero-shot tasks. The model is applicable across different tasks and datasets.
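The core idea of the multi-span cyclic graph can be sketched as follows: the slots of one extracted tuple are text spans, consecutive slots are linked, and the last slot links back to the first, closing the cycle. The function below is a simplified illustration under these assumptions, not Mirror's actual decoding code.

```python
# Hypothetical sketch: representing an n-ary tuple as a multi-span cyclic graph.
# Each slot is a (start, end) token span; edges run slot0 -> slot1 -> ... -> slot0.

def tuple_to_cyclic_graph(slots):
    """Turn an ordered list of (start, end) spans into the directed edges
    of a cycle covering all slots."""
    n = len(slots)
    return [(slots[i], slots[(i + 1) % n]) for i in range(n)]

# Example: a 3-ary tuple with three slot spans.
spans = [(0, 2), (5, 6), (9, 12)]
print(tuple_to_cyclic_graph(spans))
# -> [((0, 2), (5, 6)), ((5, 6), (9, 12)), ((9, 12), (0, 2))]
```

Because the cycle closes on itself, a decoder that follows the edges can recover the whole tuple starting from any slot, which is what makes the structure flexible enough to cover binary, n-ary, and multi-span cases.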

Experiments were performed extensively on 30 datasets from 8 tasks, including NER, RE, EE, Aspect-Based Sentiment Analysis (ABSA), multi-span discontinuous NER, n-ary hyper RE, MRC, and classification. Zero-shot and few-shot proficiency was enhanced by manually collecting 57 datasets across 5 tasks. The results show that the model achieves outstanding performance under zero-shot and few-shot settings.

Mirror’s contributions are as follows:

  • A schema-guided, unified multi-slot extraction paradigm was presented that can address complicated information extraction (IE), machine reading comprehension, and classification tasks.
  • The model is a unified non-autoregressive framework that converts multiple tasks into a multi-span cyclic graph.
  • Extensive experiments were conducted, showing that the model achieves competitive results under few-shot and zero-shot settings.

Accessibility

The paper is available on arXiv and the code is available on GitHub.

Nuts and Bolts of Mirror

A unified data interface was introduced for the model input, divided into three parts: instruction, schema labels, and text.

Unified Data Interface

The instruction consists of a leading token and a natural language sentence. The token marks the instruction part, while the sentence tells the model what it should do. The schema labels specify the tasks for schema-guided extraction. This part has special label tokens ([LM], [LR], and [LC]) and the corresponding label texts: [LM] denotes the label of mentions (or event types), [LR] denotes the label of relations (or argument roles), and [LC] denotes the label of classes. The text part is where the model extracts information from; it contains a special token ([TL], [TP], or [B]) followed by the human-written sentence.
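The three-part interface can be pictured as a single flat token sequence. The sketch below assembles such a sequence using the label tokens named above; the "[I]" instruction token, the function name, and the use of [TL] as the text marker are assumptions for illustration, not Mirror's exact preprocessing.

```python
# Hypothetical sketch of assembling the unified input sequence:
# instruction part, then schema label parts, then the text part.

def build_input(instruction, mention_labels, relation_labels, class_labels, text):
    parts = ["[I]", instruction]          # leading instruction token + sentence
    for lab in mention_labels:
        parts += ["[LM]", lab]            # mention / event-type labels
    for lab in relation_labels:
        parts += ["[LR]", lab]            # relation / argument-role labels
    for lab in class_labels:
        parts += ["[LC]", lab]            # classification labels
    parts += ["[TL]", text]               # marker for the text part, then the text
    return " ".join(parts)

example = build_input(
    "Please extract all entities and relations.",
    mention_labels=["person", "location"],
    relation_labels=["born in"],
    class_labels=[],
    text="Ada Lovelace was born in London.",
)
print(example)
```

Because the schema labels sit in the input itself, swapping in a new schema (for example, for a zero-shot task) only changes this sequence, not the model.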

Evaluation

Mirror’s performance was observed on 13 IE benchmarks. The comparisons among different models are shown in the table below.


Compared with the baseline models, Mirror surpasses them on some datasets in the NER (ACE04), RE (ACE05, NYT), and EE (CASIE-Trigger) tasks. Compared to the extraction-based USM, Mirror shows competitive results on various tasks, while lagging behind on RE (CoNLL04), NER (ACE05), and EE (CASIE-Arg). Compared to generation-based methods, Mirror outperforms TANL across all datasets and surpasses UIE on most datasets.

Conclusion

Mirror is a schema-guided information extraction framework that turns Information Extraction (IE) tasks into a unified multi-slot tuple extraction task and proposes the multi-span cyclic graph. Thanks to its flexible design, the model is capable of n-ary and multi-span extraction and also supports complex information extraction. The experimental results show that the model is versatile and achieves performance competitive with state-of-the-art systems.

Due to the base model, DeBERTa, the maximum sequence length cannot exceed 512, which limits effectiveness on tasks involving long, document-level information extraction. While there are many RE and NER datasets, there are few large-scale event extraction collections with high diversity in schemas and domains, which leads to limited performance on event-related information extraction tasks.
