MLNews

LooGLE: Can Long-Context LLMs Perceive Extended Phrases?

Let's find out what's new in the world of LooGLE. Are long-context language models capable of grasping extended phrases and long sentences?

Researchers Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang from the National Key Laboratory of General Artificial Intelligence, BIGAI, and the Institute for Artificial Intelligence, Peking University, presented this latest research.

LooGLE stands for Long Context Generic Language Evaluation, a benchmark for evaluating LLMs on extended contexts. Rather than measuring context window size alone, it focuses on whether models genuinely understand longer contexts.

To check the feasibility of LooGLE, LLMs that handle long inputs were selected as baselines and a detailed evaluation was conducted. The results show that base models with larger context windows perform better, but they still need improvement for true long-context understanding.

LooGLE Abstract

What has been done

Extensive research has gone into methods for extending the context window of LLMs, for example recurrent memory, external memory, and selective attention. Better architectures also improve transformer performance; a transformer is considered efficient when it uses less memory on longer texts. Fine-tuning approaches for long documents have been explored as well, but they increase cost and require a tiresome process of collecting fine-tuning data for long texts.

ZeroSCROLLS, L-Eval, and LongBench are three recent benchmarks that test LLMs' ability to understand long contexts. ZeroSCROLLS automatically processes datasets from various sources into a single input format averaging 10K words. L-Eval ensures quality with a smaller dataset; in addition, it optimizes the evaluation process to generate more accurate results.

LongBench is a diverse benchmark providing multi-task datasets with varying lengths, distributions, patterns, domains, and languages. However, its input texts contain only a few thousand words and mostly probe short-term information. All of the benchmarks mentioned above require extra resources.

What is LooGLE?

The prior benchmarks suffer from challenges such as outdated documents and average text lengths of only a few thousand words, so there was a great need for high-quality benchmarks with longer texts and more difficult tasks to enable consistent evaluation. It is also crucial to emphasize that current benchmarks mainly consist of short dependency tasks.

These tasks involve LLMs retrieving answers from a single sentence or paragraph. However, they do not thoroughly assess LLMs' capability to gather information from paragraphs across an entire document and then condense it into a comprehensive answer. This broader evaluation, involving what the LooGLE team terms long dependency tasks, is essential for a more complete understanding of LLMs' proficiency.

Example of LooGLE

To overcome the failings of existing datasets, LooGLE (Long Context Generic Language Evaluation) was introduced to evaluate LLMs' long-context abilities. This benchmark is distinguished from its baselines in the following ways:

  • Extra-long documents: The dataset contains 776 extremely long, up-to-date documents with an average length of 19.3K words, many exceeding 100K words. It provides more than 6,448 test instances without distribution bias for a more contextual assessment.
  • Manually designed long and short dependency tasks: 7 major tasks were proposed to evaluate LLMs' ability to perceive both long and short dependency content. 5 types of long dependency tasks were designed, including 1,101 manually created long dependency question-and-answer (QA) instances, which required considerable extra effort and cost.
  • Latest documents: The benchmark consists of texts published after 2022, so existing LLMs are unlikely to have been pre-trained on them; this stresses in-context learning ability rather than remembrance and memorization.
  • Cross-domain generic data: The data is extracted from popular open sources such as arXiv papers, Wikipedia, and movie and TV scripts, covering categories including entertainment, politics, arts, academia, history, and sports.

A detailed evaluation of 8 representative LLMs was conducted on LooGLE. The results showed that base models with larger context windows perform better; however, all models show a drastic decline on long dependency tasks.

Accessibility

The paper is available on arXiv and the code on GitHub; the dataset used for this benchmark is also open source.

Technicalities of LooGLE

LooGLE contains two main task types: short dependency and long dependency tasks. For short dependencies, short QA was generated from Wikipedia articles and cloze tasks from scripts; for long dependencies, summarization of arXiv papers and manually designed QA were used.

The dataset consists of 3 sources: Wikipedia articles, scientific papers, and movie and TV scripts, comprising various categories and topics. All documents were published after 2022 and exceed 10K words in length.

Long dependency tasks consist of summarization and long dependency QA, the latter further categorized into 4 subtasks: computation, multiple information retrieval, timeline reordering, and comprehension and reasoning. Short dependency tasks consist of question answering and cloze.
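The task taxonomy above can be sketched as a small data structure. This is only an illustrative layout; the field names are our own and are not the official dataset keys.

```python
# Hypothetical sketch of LooGLE's task taxonomy as described above.
# Key names are illustrative, not the official dataset field names.
TASKS = {
    "short_dependency": ["question_answering", "cloze"],
    "long_dependency": {
        "summarization": [],
        "qa_subtasks": [
            "computation",
            "multiple_information_retrieval",
            "timeline_reorder",
            "comprehension_and_reasoning",
        ],
    },
}
```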

Graphic representation

Evaluation

On short dependency tasks, LooGLE shows that model performance decreases as segment length increases, which makes the tasks harder for LLMs. Among long dependency tasks, summarization is handled fairly well by commercial models, and a longer context window helps on long-context tasks. Still, both commercial and open-source models show low performance, and this poor showing on long dependency tasks suggests that LLMs need to be revised for true long dependency understanding.

OpenAI has introduced different models such as GPT-4-32k, GPT-4-8k, and GPT-3.5-turbo-16k; version 0613 models were used by default for evaluation. It was observed that performance increases when the required information is placed at the beginning or at the end of the input context, while models do not perform well when they have to access information from the middle of the text. Therefore, input documents were truncated to a specific size by keeping the head and tail of each document.
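The head-and-tail truncation strategy described above can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' actual preprocessing code, and the word-based splitting is an assumption.

```python
def truncate_middle(text: str, max_words: int) -> str:
    """Keep the head and tail of a document and drop the middle,
    so the input fits a model's context budget (a sketch of the
    truncation strategy described above)."""
    words = text.split()
    if len(words) <= max_words:
        return text  # already short enough, nothing to drop
    half = max_words // 2
    head = words[:half]
    tail = words[len(words) - (max_words - half):]
    # Mark the removed middle with an ellipsis token.
    return " ".join(head) + " ... " + " ".join(tail)
```

For example, truncating a 100-word document to 10 words keeps the first 5 and last 5 words, joined by an ellipsis.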

Comparison with baseline

Automatic evaluation metrics were adopted and classified into two groups. BLEU, ROUGE, METEOR, and BERTScore are used for generative tasks such as QA and summarization of the input context, whereas Exact Match and Partial Match are used for cloze evaluation. To assess the benchmark more comprehensively, GPT-4-8k was also utilized as an LLM evaluator, and manual evaluation was performed to check whether LLM predictions match the ground truth.
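As a rough illustration of the Exact Match and Partial Match metrics mentioned above, here is a minimal sketch with simple whitespace and case normalization. The normalization details are our assumption, not the benchmark's official scorer.

```python
def _norm(s: str) -> str:
    # Simple normalization: lowercase and collapse whitespace.
    return " ".join(s.lower().split())

def exact_match(pred: str, gold: str) -> bool:
    """Exact Match: normalized prediction equals the normalized gold answer."""
    return _norm(pred) == _norm(gold)

def partial_match(pred: str, gold: str) -> bool:
    """Partial Match: one normalized string is contained in the other."""
    p, g = _norm(pred), _norm(gold)
    return g in p or p in g
```

For instance, a prediction of "the answer is Paris" fails Exact Match against the gold answer "Paris" but passes Partial Match.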

Conclusion

A novel benchmark, LooGLE, was introduced for assessing long-context comprehension by large language models. LooGLE addresses deficiencies in previous datasets by offering comparatively long text passages and the latest documents, published after 2022. The dataset spans various categories with tasks of diverse contextual dependencies.

The extensive evaluation highlights the limitations of existing LLMs on long texts. A drastic divergence was observed between open-source and commercial models, underscoring the challenge of long dependency tasks. The results show that the newly proposed dataset is valuable for evaluating long-text understanding.
