Large Language Models Understand Layout (2024)

Weiming Li, Manni Duan, Dong An, Yan Shao (corresponding author; email: shaoyan@cmhi.chinamobile.com)
Zhejiang Lab, Hangzhou, China; China Mobile, Hangzhou Research and Development Center, China

Abstract

Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond their text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perception and reasoning, while a drastic performance drop is observed when the spatial markers are excluded from the original data. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the coding data used for pre-training and is further enhanced at the instruction-tuning stage. In addition, layout understanding can be strengthened by integrating low-cost, automatically generated data produced through a novel text game. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.


1 Introduction

In recent years, large language models (LLMs) have emerged as a dominant force in the global artificial intelligence field, sparking extensive discussions among researchers about their potential and limitations [2, 24]. Although LLMs are primarily designed for natural language processing (NLP) tasks, some studies demonstrate their additional abilities. For instance, they are employed to generate executable code and even achieve remarkable performance in Google coding interviews [6].

Beyond text understanding capability, we find that LLMs are capable of processing text layouts that are denoted by spatial markers. As shown in Figure 1, we conceptualize newline-separated plain text as a "visual" two-dimensional canvas, since text editors and browsers intuitively present text in two dimensions. Three identical questions with distinct answers are arranged in different orientations, interspersed with space markers (denoted as layout). We ask ChatGPT for the answers in various orientations. Remarkably, ChatGPT provides accurate responses, and some other open-source LLMs also produce reasonable results. For comparison, we exclude the space markers from the original data (denoted as strip), resulting in a substantial decline in performance.

(a) Layout

Here are three names mentioned in the context:

What is your name?                    What is your name?
I'm James.                            I'm Oliver.

What is your name?
I'm Emma.

Question: What is the name mentioned in the top-left corner?
Answer: The name mentioned in the top-left corner is "James".

(b) Strip

Here are three names mentioned in the context:

What is your name? What is your name? I'm
James. I'm Oliver. What is your name? I'm
Emma.

Question: What is the name mentioned in the top-left corner?
Answer: The name mentioned in the top-left corner is not specified in the given context.

This study initiates a comprehensive examination of LLMs’ proficiency in understanding text layout, aiming to unravel insights into their performance and implications across various datasets and fine-tuning methodologies.

First, we build a dataset called TextLayoutQA to evaluate LLMs' text layout understanding capability. Through experiments with the GPT-3.5, Baichuan2 [45], Llama2 [39] and ChatGLM3 [48] models, we uncover that the incorporation of text layout information substantially enhances model performance, resulting in an 8–25% gain compared to text without layout.

Furthermore, we explore the effects of pre-training and instruction-tuning stages on LLMs’ comprehension of text layout. We illustrate that although LLMs initially demonstrate a basic understanding during pre-training, their proficiency is further enhanced during the instruction-tuning stage.

Moreover, we explore the essential role of training data in shaping LLMs’ understanding of text layout, emphasizing the necessity of datasets enriched with layout information, such as code and table data. Through instruction-tuning, we reveal the varying impacts of different types of datasets on LLMs’ performance, providing detailed insights into their contributions and constraints.

Our findings not only illuminate the intrinsic capabilities of LLMs in comprehending text layout, but also carry profound implications for their broader applications. By unraveling the intricacies of LLMs' interaction with text layout information, we pave the way for leveraging this capability in tasks ranging from visual question answering (VQA) [26] to document analysis and beyond. Our code and datasets are available on GitHub: https://github.com/liweim/TextLayoutLLM

The contribution of this paper can be summarized as:

1. To the best of our knowledge, we are the first to systematically analyze the text layout understanding capability of LLMs.

2. We introduce TextLayoutQA, a dataset designed to assess the text layout understanding capability of LLMs.

3. The origin of the text layout understanding capability is thoroughly investigated via instruction-tuning.

4. We propose a low-cost data generation method, based on a novel text game, that significantly enhances the text layout understanding capability.

5. We show that the text layout understanding capability can be applied to text-rich VQA problems, yielding good performance improvements.

2 Related Work

Text layout and language

Preprocessing for text layout is essential before conducting any NLP on the textual content of intricate documents. Hurst and Nasukawa [13] present a general approach that integrates language modeling with knowledge of the spatial arrangement of text. By taking into account both linguistic features and layout characteristics, this method accurately identifies the boundaries and structures of text blocks. However, it relies on rules and is limited to simple cases. Furthermore, it overlooks the relationship between text blocks, which is crucial for document comprehension.

LLMs’ capability in text layout understanding

In recent years, LLMs, represented by the GPT series, have demonstrated strong text comprehension abilities. In this domain, we are aware of some research efforts dedicated to exploring the performance of LLMs in spatial reasoning [47, 4], as well as the application of LLMs in graph analysis, understanding, and visualization [2, 5, 9, 35, 41, 50, 51].

One survey investigates planning and logical reasoning applications in spatial environments, finding ChatGPT adept at tracking objects and inferring spatial relationships. Its experiments span various tasks related to physical understanding, including optical and shadow projection, spatial viewpoint reasoning, predicting the impact of actions on objects, one-dimensional object sorting, two-dimensional box placement queries, simulated robot exploration such as navigating an apartment and searching for a ball in a room, and simulated robot task completion such as setting a table for a meal. However, the effectiveness of these efforts is limited since all "spatial" or "visual" features are conveyed in high-level terms. For instance, navigating the apartment involves users informing ChatGPT about the room's location and available doors, followed by describing the new room and door choices after the model makes a selection. In general, ChatGPT demonstrates a certain level of spatial understanding capability, although not in a "geometric" sense.

Bayani [2] explores ChatGPT’s performance in visual tasks involving ASCII art [28] input, which is an image drawn using ASCII characters in plain text. They find that ChatGPT shows high-level performance in tasks evaluating visual and spatial capabilities, though there is still room for improvement. The study includes tasks like ASCII art recognition, for instance evaluating ChatGPT’s ability to handle rotations, scaling, noise, and translation of box plots; ASCII art part identification, such as asking the model about the identity of specific parts of ASCII art images, like heads or tails; and ASCII art generation tasks, for example generating identical copies of ASCII art, removing noise from ASCII art and proportionally enlarging ASCII art.

LLMs’ applications in text-rich VQA

VQA is a task that involves answering questions about images, spanning various formats such as receipts, web pages, tables, documents, or even natural images containing textual information. This task fundamentally requires models to understand multiple modalities. Previous work predominantly focuses on multimodal pre-trained models [18, 12, 38, 11, 21], aiming to leverage all modal information. However, Liu et al. [25] point out that LLMs struggle with text-rich VQA scenarios. The main reasons are the short token length, usually 256, of the textual input passed from the visual encoder to the text encoder in multimodal LLMs, resulting in a significant loss of textual information, and the low resolution of the image encoder, typically 224×224, which compresses and loses a considerable amount of textual information in text-rich images. Under these circumstances, recent work starts exploring the performance of purely textual LLMs in answering questions using only serialized text from images [10], leveraging the high accuracy of optical character recognition (OCR) models in recognizing long text.

3 Layout Understanding Based on Spatial Markers

Text layout refers to the arrangement and presentation of text within a visual space or canvas. It involves the spatial organization of characters, words, and paragraphs to create a visually coherent and aesthetically pleasing display. It is a form of text representation and does not specifically refer to particular data types like codes or tables. Generally, text layout encompasses factors such as newlines, indentation, alignment, font size, and spacing between characters and lines. In this study, we focus on the layout of plain texts, which means the texts do not have formats or font styles. We encode the spatial organization of the texts by using spatial markers such as space and newline, forming a plain text that can be directly input to the LLMs.

In general NLP, text layout is often not explicitly considered because most traditional NLP tasks, such as sentiment analysis, named entity recognition, text classification, and machine translation, focus on understanding the content and meaning of the text rather than its visual presentation. Text layout becomes more relevant when dealing with tasks that involve visual or spatial understanding, such as OCR, document understanding, and certain computer vision tasks. In these cases, the physical placement of text on a page or within an image becomes crucial for accurate interpretation.

However, as research progresses and interdisciplinary approaches become more common, there is an increasing recognition of the importance of text layout understanding, even in traditional NLP. For example, understanding the structure and layout of documents can aid in tasks like information extraction or summarization. As the field evolves, there will be more integration of text layout considerations into a broader range of NLP applications.

The existing datasets that incorporate text layout elements are often related to transcribed documents and tables [27, 26], but they are not specifically designed to evaluate text layout understanding capability. Under these circumstances, we introduce a generated dataset called TextLayoutQA specifically for this purpose.

Subsequently, we investigate whether the text layout understanding capability emerges during the pre-training stage or the instruction-tuning stage. Our hypothesis is that training corpora containing consecutive spaces, including programming code, tables, and HTML or YAML-formatted data, may contribute to the text layout understanding capability of LLMs. To validate this hypothesis, we construct an instruction-tuning dataset that does not include data with consecutive spaces, called the instruction-basic dataset. We perform instruction-tuning on LLMs of different types and sizes, exploiting catastrophic forgetting to induce LLMs to "forget" the text layout understanding capability. Subsequently, we add training corpora containing consecutive spaces, such as code and table corpora, to the instruction-basic dataset to observe whether the text layout understanding capability recovers.

Furthermore, inspired by the well-known word search puzzle, we devise a novel text game aimed at enhancing text layout comprehension through gameplay. Starting pre-training from scratch with different training corpora would be the most direct validation method; however, due to the substantial computational resources required, it is beyond the scope of our team's capacity and could be considered future work.

Finally, we apply the text layout understanding capabilities of LLMs to text-rich VQA tasks. We introduce a method named textLayoutParser that converts the original texts from VQA datasets to texts with layout. We observe that various LLMs yield better results on text with layout than on those without, highlighting the practical benefits and effectiveness of our research in real-world applications.

4 Experiments

4.1 Datasets

In this section, we describe all the datasets we use in the experiments. These include three public datasets: XfundQA, FetaQA [27], DocVQA [26], along with a generated dataset for text layout understanding evaluation, named TextLayoutQA. Additionally, we propose various instruction-tuning datasets including instruction-basic, instruction-code, instruction-table, instruction-generated and instruction-test.

XfundQA

A form QA dataset generated from the XFUND [44] dataset, which is a multilingual form understanding benchmark dataset covering 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese) with manually annotated forms. Each language includes 199 forms, with 149 forms in the training set and 50 forms in the test set. The dataset involves two sub-tasks: semantic entity recognition and relation extraction. As our primary focus is on QA, we make the following modifications to the Chinese test set of XFUND:

1. Change the key-value relations to the QA format: "What is the value of the key '{key}'?"

2. Remove invalid QA pairs, including those with empty or invalid values and nested key-key-value relations.

3. Rewrite answers with multiple options to the selected one, such as changing "✓A ☐B" to "A".

This modified dataset is named XfundQA. Since LLMs’ outputs are usually long, we use recall as the evaluation metric, considering a prediction correct if the ground truth appears completely in the LLMs’ output.
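As a concrete illustration, the recall-style scoring described above can be sketched as follows in Python; the function name and the exact string handling are our own assumptions rather than the released evaluation code.

def xfund_recall(predictions, ground_truths):
    """Count a prediction as correct if the ground-truth answer string
    appears verbatim in the (typically longer) LLM output."""
    hits = sum(1 for pred, gt in zip(predictions, ground_truths) if gt.strip() in pred)
    return hits / len(ground_truths)

# Example: the short reference answer is contained in the verbose model output.
print(xfund_recall(["The value of the key is 2024-01-01."], ["2024-01-01"]))  # 1.0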

FetaQA

A table QA dataset consisting of free-form questions that require deep reasoning and understanding; most questions are based on discontinuous blocks of information in the table. We conduct evaluations on the test set containing 2,003 samples. Consistent with the dataset's conventions, we use ROUGE-L [22] and BLEU-4 [29] as the evaluation metrics.

DocVQA

A document QA dataset consisting of printed and typed text as well as scanned documents with various layouts, some of which also include handwritten data. Evaluations are performed on the test set containing 5,188 samples. Following the conventional evaluation protocol, we use the average normalized Levenshtein similarity (ANLS) [26] as the evaluation metric. Since LLMs' outputs are relatively long, the same LLM is used to rephrase the original output answers into shorter ones so that they align with the references.
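For reference, a minimal sketch of the per-question ANLS computation is shown below, assuming the usual 0.5 threshold from the DocVQA evaluation; the helper names are ours.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(prediction: str, references: list, tau: float = 0.5) -> float:
    # Per-question ANLS: best normalized similarity over the references,
    # zeroed out when the normalized distance reaches the threshold tau.
    best = 0.0
    for ref in references:
        p, r = prediction.strip().lower(), ref.strip().lower()
        nl = levenshtein(p, r) / max(len(p), len(r), 1)
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best  # the dataset-level score averages this value over all questions

print(anls("jo spach", ["Jo Spach"]))  # 1.0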

TextLayoutQA

A layout QA dataset generated specifically for testing the layout understanding capability of LLMs. This dataset revolves around enumerating items from various shopping lists arranged in random orientations within the text canvas. As illustrated in Figure 2a, several shopping lists are randomly positioned in four orientations (top-left, top-right, bottom-left, bottom-right) on a newline-separated plain-text canvas filled with space markers. Each shopping list is assigned a name (A, B, C, or D) and comprises different products. Both the name and the items within the same shopping list are first-letter aligned. For comparison, a version without layout information is constructed for each sample, which involves replacing consecutive space and newline markers with a single space marker; a minimum of two consecutive space markers is maintained between any two shopping lists. Figure 2b illustrates the "without layout" version corresponding to Figure 2a. The paired samples, with and without layout, share the same set of three questions, as shown in Figure 2c.

The TextLayoutQA dataset comprises a total of 300 sample pairs, encompassing 900 questions. All questions require the output in list format. F-score is employed to evaluate LLMs’ performance. The evaluation process is as follows: first, lists are extracted from the output using regular expressions. Subsequently, the F-score is calculated with each element in the list as a token. If the output does not contain a list, the F-score is calculated with words as tokens, disregarding characters besides words.
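This evaluation procedure can be sketched as follows; the regular expression and normalization details are illustrative assumptions rather than the exact released implementation.

import re

def extract_list(output: str):
    """Pull the first Python-style list out of the model output; fall back to bare words."""
    m = re.search(r"\[(.*?)\]", output, re.S)
    if m:
        return [x.strip().strip("'\"").lower() for x in m.group(1).split(",") if x.strip()]
    return re.findall(r"[a-z]+", output.lower())

def f_score(pred_items, gold_items):
    # Treat each list element as a token and compare the predicted and reference sets.
    pred, gold = set(pred_items), set(gold_items)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

output = 'The answer is ["lenses", "footwear"].'
print(f_score(extract_list(output), ["lenses", "footwear", "movies"]))  # 0.8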

(a) Layout

Here are 4 shopping lists (A, B, C, D) with different products:

A                  B
footwear           lenses
movies
walkers
jet skis

C                  D
fortified wines    animal clothes
                   bulbs

(b) Strip

Here are 4 shopping lists (A, B, C, D) with different products:

(c) QA set

Question: What products do shopping list B contain?
Answer: ["lenses"]
Question: "What products do shopping list B and A contain?"
Answer: ["lenses", "footwear", "movies", "walkers", "jet skis"]
Question: What products do shopping list in the bottom-right corner contain?"
Answer: ["animal clothes", "bulbs"]

Instruction-basic dataset

An instruction-tuning dataset designed to diminish the text layout understanding capability of LLMs. Specifically, we randomly select 100k bilingual (English and Chinese) instances from publicly available instruction-tuning datasets [37, 14, 46, 16, 7, 49, 8, 43, 19, 31], deliberately excluding consecutive spaces (three or more spaces or two or more tabs), to form the instruction-basic dataset. The distribution of each sub-dataset in the instruction-basic dataset is shown in Table 1.
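The filtering rule above can be sketched as follows; the sample field names ("instruction", "output") are hypothetical and not taken from the cited datasets.

import re

def has_consecutive_whitespace(text: str) -> bool:
    # Three or more spaces, or two or more tabs, count as layout-bearing whitespace.
    return bool(re.search(r" {3,}|\t{2,}", text))

def filter_basic(samples):
    # Keep only samples whose prompt and response are free of such runs,
    # so that tuning on them tends to erode layout understanding.
    return [s for s in samples
            if not has_consecutive_whitespace(s["instruction"] + s["output"])]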

Table 1. Distribution of sub-datasets in the instruction-basic dataset.

Dataset            Num     Ratio/%
MOSS               56,195  56.19
belle              20,881  20.88
firefly            8,929   8.92
CSL                3,289   3.28
hh-rlhf            2,234   2.23
COI                2,104   2.10
HC3                1,577   1.57
Chain-of-Thought   1,200   1.20
prosocial-dialog   963     0.96
alpacaGPT4         851     0.85
gpt4tools          555     0.55
GPTeacher          431     0.431
alpaca             414     0.414
webGPT             173     0.173
dolly              128     0.128
Auto-CoT           59      0.059
GAOKAO             17      0.017
Table 2. Distribution of sub-datasets in the instruction-code dataset.

Dataset     Num     Ratio/%
GPT4all     65,773  65.77
CodeAlpaca  18,911  18.91
COIG        11,048  11.04
GPTeacher   4,268   4.26

Instruction-code dataset

An instruction-tuning dataset designed to verify the influence of code corpora on the text layout understanding capability of LLMs. We randomly sample 100k bilingual (English and Chinese) data from diverse public code-related instruction-tuning datasets [1, 3, 49]. The distribution of each sub-dataset in the instruction-code dataset is shown in Table 2. To preserve the other capabilities of LLMs, these code data are combined with the data from the instruction-basic dataset, resulting in a 200k instruction-code dataset.

Instruction-table dataset

An instruction-tuning dataset designed to verify the influence of table corpora on the text layout understanding capability of LLMs. We randomly sample tables from the public table QA dataset WikiTableQuestions [30]. We introduce text layout by aligning the first characters of all elements in each column of the table using consecutive space markers, with a minimum of two consecutive space markers maintained between any two columns of elements. Instead of using the dataset's original QA pairs, we formulate questions that ask for the value of each cell in the table, generating 100k new QA pairs. An example is depicted in Figure 3, and a code sketch follows the example. These QA instances are combined with the data from the instruction-basic dataset, forming a 200k instruction-table dataset.

Given a table:

Year       Title         Role
2009-2013  We Speak NYC  Jorge / Fredy
2014-2019  Broad City    Jaime Castro
2015-2016  Alternatino   Arturo
2017       No Activity   Pedro
2019       Alternatino   Arturo

Question: What is the Role of Year 2009-2013?
Answer: Jorge / Fredy
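Below is a minimal sketch of the column alignment used to build such examples, assuming the table is given as a list of rows with the header first; the function name and gap size are our own choices.

def align_table(rows, gap=2):
    """Left-align each column on its first character, padding with spaces so that
    at least `gap` spaces separate adjacent columns (header row included)."""
    widths = [max(len(str(row[j])) for row in rows) for j in range(len(rows[0]))]
    return "\n".join(
        (" " * gap).join(str(cell).ljust(widths[j]) for j, cell in enumerate(row)).rstrip()
        for row in rows
    )

rows = [["Year", "Title", "Role"],
        ["2009-2013", "We Speak NYC", "Jorge / Fredy"],
        ["2017", "No Activity", "Pedro"]]
print(align_table(rows))
# QA pairs then ask for each cell, e.g. "What is the Role of Year 2009-2013?" -> "Jorge / Fredy"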

Instruction-generated dataset

An instruction-tuning dataset designed to improve the text layout understanding capability of LLMs. Specifically, we propose a novel text game to generate data automatically, akin to the renowned word search puzzle (Figure 4), which challenges players to find hidden words within a grid of letters. These puzzles typically feature a rectangular or square grid filled with random letters, accompanied by a list of words to be found. The words can be oriented in various directions: horizontally, vertically, diagonally, and even backward.

(Figure 4: an example word search puzzle)

The sentence search puzzle is a game that involves a grid of words, where players are tasked with finding meaningful sentences hidden within the grid. The challenge lies in locating continuous words that make up meaningful sentences horizontally and vertically. The unused spaces in the grid are usually filled with random words to add complexity to the puzzle. Note: answer in the form of a list, for example: ['a', 'b']. If you do not know the answer, reply with the empty list []. Here is a toy example:

(toy example grid containing "good morning" horizontally and "get some food" vertically)

First, search horizontally and find "good morning".

Then, search vertically and find "get some food".

So all the sentences hidden in this puzzle are: ["good morning", "get some food"].

Let’s solve the following sentence search puzzle step by step:

(puzzle grid containing "i believe you" horizontally and "what are you doing" vertically)

Answer:

First, search horizontally and find "i believe you".

Then, search vertically and find "what are you doing".

So all the sentences hidden in this puzzle are: ["i believe you", "what are you doing"].

Acknowledging the scarcity of single letters in training corpora, we adapt the word search puzzle into a new game named the sentence search puzzle. This game is designed to identify hidden sentences within a grid of words, with each word separated by consecutive spaces and each row maintaining a consistent length. The first letters of all words in each column are aligned, and a minimum of two consecutive space markers is maintained between any two columns of elements. The sentences can be oriented in two directions: horizontally and vertically. An illustrative example is provided in Figure 5.

To reduce the difficulty of the game, we include intermediate solving steps in the provided example. We require the model to output a list of sentences. During the evaluation step, similar to TextLayoutQA, we employ regular expressions to extract the list from the output.

We randomly generate 100k sentence search games. These instances are combined with the data from the instruction-basic, forming a 200k instruction-generated dataset.
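For illustration, a rough sketch of how such a sentence search puzzle could be generated automatically is given below. It ignores collisions between placed sentences and uses a toy filler vocabulary, so it should be read as a simplified stand-in for the actual generator.

import random

def make_puzzle(sentences, rows=6, cols=6,
                vocab=("tree", "blue", "run", "cat", "idea", "stone")):
    """Place each sentence's words consecutively in a random row (horizontal) or
    column (vertical) of a rows x cols word grid; fill the rest with filler words."""
    grid = [[random.choice(vocab) for _ in range(cols)] for _ in range(rows)]
    for sent in sentences:
        words = sent.split()
        if random.random() < 0.5:  # horizontal placement
            r = random.randrange(rows)
            c = random.randrange(cols - len(words) + 1)
            for k, w in enumerate(words):
                grid[r][c + k] = w
        else:                      # vertical placement
            c = random.randrange(cols)
            r = random.randrange(rows - len(words) + 1)
            for k, w in enumerate(words):
                grid[r + k][c] = w
    # Render with the first letters of each column aligned and >= 2 spaces between columns.
    widths = [max(len(grid[r][c]) for r in range(rows)) for c in range(cols)]
    return "\n".join("  ".join(grid[r][c].ljust(widths[c]) for c in range(cols)).rstrip()
                     for r in range(rows))

print(make_puzzle(["good morning", "get some food"]))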

Instruction-test

A testing dataset designed to assess the efficacy of instruction-tuning, consisting of four segments of data, as illustrated below, with a total of 800 samples. The evaluation metrics adopted for all instruction-tuning datasets are ROUGE-L and recall.

  • Basic: it includes five task types: generation, answering, classification, rewriting, and mathematics, obtained by sampling 100 samples for each task type from the public dataset, natural-instructions [42], which is a benchmark of 1,616 diverse NLP tasks and expert-written instructions.

  • Code: we randomly select 100 samples from the public LeetCode dataset oa_leet10k (https://huggingface.co/datasets/cognitivecomputations/oa_leet10k).

  • Table: we construct 100 samples using the same method as building the instruction-table dataset, sourced from the public table QA dataset, FeTaQA [27].

  • Generate: we generate 100 samples using the same method as building the instruction-generated dataset.

4.2 LLMs

In our experiments, we select several LLMs for evaluating TextLayoutQA, including GLM3, Llama2, Baichuan2, and GPT-3.5-Turbo. These models cover various sizes and types of open-source and proprietary LLMs that are currently popular.

  • GLM3: the latest generation of the open-source GLM series, available in a single parameter size of 6B. The chat version, ChatGLM3-6B, exhibits strong performance among pre-trained models under 10B, with features such as fluent dialogue and a low deployment threshold.

  • Llama2: the latest open-source LLMs released by Meta, Llama2 surpasses various open-source models on benchmarks and performs comparably to or better than GPT-4 on many test sets. Its open-source license permits commercial utilization, and it comes in a range of parameter sizes — 7B, 13B, and 70B. In our evaluation, we focus on the 7B and 13B parameter sizes.

  • Baichuan2: a new generation open-source LLM introduced by Baichuan Intelligence containing 7B and 13B parameters. Baichuan2 achieves optimal results among models of the same size on various general and domain benchmarks in Chinese, English, and multilingual contexts. Our evaluation encompasses both parameter sizes.

  • GPT-3.5-Turbo: launched by OpenAI, GPT-3.5-Turbo is an advanced language model derived from GPT-3, which has 175B parameters.

4.3 Experimental Result and Analysis

4.3.1 General Evaluation

In this section, we first evaluate the text layout understanding capability of LLMs, then analyze how the tokenizers of LLMs encode consecutive punctuation, and finally compare the performance of different spatial markers as an ablation study.

Table 3 shows the evaluation results of different LLMs on TextLayoutQA with (layout) and without (strip) text layout information. Compared to the strip setting, various LLMs achieve a performance improvement of 8–25% with text layout information, indicating the models' ability to understand text alignment, layout, and orientation.

Table 3. Evaluation results (F-score) of LLMs on TextLayoutQA with (layout) and without (strip) text layout.

LLMs           Strip  Layout  Difference
ChatGLM3-6B    33.52  49.52   +16.00
Llama2-7B      47.47  58.80   +11.33
Llama2-13B     53.45  61.93   +8.48
Baichuan2-7B   47.08  60.82   +13.74
Baichuan2-13B  47.00  68.69   +21.69
GPT-3.5-Turbo  61.80  87.77   +25.97

ChatGLM3, Llama2, Baichuan2, and GPT-3.5-Turbo all use byte pair encoding (BPE) [34] from SentencePiece [20] as their tokenizer. We find that the tokenizers of these LLMs encode different lengths of consecutive spaces with distinct tokens. Table 4 lists the maximum length of consecutive punctuation that can be encoded as a single token by the different LLMs' tokenizers. GPT-3.5-Turbo has dedicated tokens for most kinds of consecutive punctuation, while ChatGLM3, Llama2, and Baichuan2 only cover relatively limited runs of consecutive punctuation. A commonality among them is that they all support encoding relatively long runs of consecutive spaces, indicating that consecutive spaces occupy a certain proportion of the training corpora. Given that some programming languages like Python and certain file formats like YAML are sensitive to indentation, we assume that these corpora help LLMs grasp how consecutive spaces align texts, thereby acquiring text layout understanding capability.

Table 4. Maximum length of consecutive punctuation that can be encoded as a single token by different LLMs' tokenizers.

LLMs           Space  Tab  Newline  Exclamation  Comma  Full-stop
ChatGLM3       15     1    1        2            1      4
Llama2         15     1    1        2            1      4
Baichuan2      32     1    1        1            2      3
GPT-3.5-Turbo  81     20   14       5            4      9
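The behaviour summarized in Table 4 can be probed directly. The sketch below uses the tiktoken library for GPT-3.5-Turbo; open-source models can be probed analogously through their Hugging Face tokenizers, and the exact counts depend on the tokenizer version, so the probe is illustrative rather than a reproduction of the table.

import tiktoken

def max_single_token_run(char: str, limit: int = 200) -> int:
    """Longest run of a repeated character that the GPT-3.5-Turbo tokenizer maps to one token."""
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    best = 0
    for n in range(1, limit + 1):
        if len(enc.encode(char * n)) == 1:
            best = n
    return best

for ch in [" ", "\t", "\n", "!", ",", "."]:
    print(repr(ch), max_single_token_run(ch))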

In TextLayoutQA, we use space and newline as spatial markers. For the ablation study, we investigate three other characters as spatial markers: tab, caron (an accent mark), and the vanilla character "a". Notably, newline is still used to separate text lines. Table 5 shows the performance of various LLMs on the TextLayoutQA dataset when these characters are deployed as spatial markers. Generally, spaces consistently yield optimal performance for the majority of LLMs. The character "a" generally exhibits the poorest performance across various LLMs due to its lack of spatial semantics; LLMs do not generalize to interpret consecutive "a" characters as a spatial marker. Notably, although the caron marker is rare in corpora, it still outperforms strip for most LLMs.

Table 5. Performance (F-score) of LLMs on TextLayoutQA with different characters as spatial markers.

LLMs           Strip  Space  Tab    Caron  a
ChatGLM3-6B    33.52  49.52  49.72  37.62  16.78
Llama2-7B      47.47  58.80  51.46  35.96  37.84
Llama2-13B     53.45  61.93  46.00  43.39  29.33
Baichuan2-7B   47.08  60.82  60.30  50.48  28.49
Baichuan2-13B  47.00  68.69  66.27  56.11  44.76
GPT-3.5-Turbo  61.80  87.77  87.36  72.86  45.79

4.3.2 The Origin of Layout Understanding Capability

In this section, we first delve into the formation stage of text layout understanding capability and subsequently examine the training corpora used for LLMs. Finally, we explore the type of training corpora that fosters the text layout understanding capability.

Due to the lack of ability to strictly follow instructions, the outputs of the base models are difficult to align with the references of the QA tasks. We therefore employ perplexity as the metric to ensure a fair comparison of layout understanding capability between base and chat models. Perplexity is a widely used metric for assessing language models; lower perplexity indicates better modeling performance. By comparing the perplexity of different LLMs on the TextLayoutQA dataset with and without text layout, we obtain the results presented in Table 6. Notably, all the base models exhibit a lower perplexity on text with layout compared to text without layout, suggesting that the base models inherently acquire some level of text layout understanding during the pre-training stage. After instruction-tuning, all the chat models show a larger perplexity reduction than their base counterparts when moving from text without layout to text with layout, indicating that instruction-tuning further enhances the text layout understanding capability. It should be noted that, to mitigate the influence of context length on perplexity, newline markers are used for padding at the beginning of the text without layout, with the padding length being the difference between the tokenized lengths of the text with and without layout.

Table 6. Perplexity of base and chat models on TextLayoutQA with (layout) and without (strip) text layout.

LLMs           Type  Strip  Layout  Difference
ChatGLM3-6B    Base  6.87   4.98    -1.89
               Chat  5.58   3.56    -2.02
Llama2-7B      Base  2.33   1.85    -0.48
               Chat  2.95   2.26    -0.69
Llama2-13B     Base  2.15   1.81    -0.34
               Chat  3.06   2.27    -0.79
Baichuan2-7B   Base  1.90   1.40    -0.50
               Chat  3.09   1.53    -1.56
Baichuan2-13B  Base  1.89   1.33    -0.56
               Chat  3.03   1.35    -1.68
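For reference, a minimal sketch of a perplexity computation of this kind is shown below; the commented-out model name is a placeholder, and the newline padding described above is omitted for brevity.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Token-level perplexity of `text` under a causal LM (lower is better)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over shifted tokens
    return torch.exp(loss).item()

# Hypothetical usage; `layout_text` / `strip_text` are a TextLayoutQA pair.
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# print(perplexity(lm, tok, layout_text), perplexity(lm, tok, strip_text))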

Table 7 presents the training corpora utilized by various LLMs during the pre-training stage. Notably, the training corpora for GLM3 and Llama2 are not explicitly published, so related information about GLM-130B and Llama is considered instead. We find that GLM, Llama, and GPT-3 all use datasets such as CommonCrawl, Wikipedia, and Books (Pile includes CommonCrawl, Wikipedia, and Books) in their pre-training. CommonCrawl is a large-scale, unstructured, multilingual web dataset containing over 8 years of web crawler data. Additionally, GLM and Llama utilize code-related sources like GitHub and StackExchange. We do find examples with various text layouts sourced from GitHub and StackExchange within the Pile dataset; specific examples are provided in the Appendix.

Table 7. Training corpora used by various LLMs during pre-training.

LLMs       Training corpora
GLM        Pile, Chinese WudaoCorpora, Chinese corpora (including online forums, encyclopedia, and QA) crawled from the web
Llama      CommonCrawl [36], C4 [33], GitHub, Wikipedia, Books, ArXiv, StackExchange
Baichuan2  General internet webpages, books, research papers, codebases, and more
GPT-3      CommonCrawl, WebText [32], Books1, Books2, Wikipedia

We perform instruction-tuning on the instruction-basic, instruction-code, instruction-table and instruction-generated datasets using the Firefly [46] tuning framework. Each dataset is partitioned into training and validation sets with a ratio of 98:2. The training sets undergo 5 epochs with early stopping. As expected, adding different instruction types improves the corresponding ability. For further information, please refer to the Appendix.

Table 8 presents the performance of LLMs on TextLayoutQA following various types of instruction-tuning. Due to catastrophic forgetting, the text layout capability decreases after instruction-basic tuning compared to the chat model, except for the Llama2 series. This is because the Llama2 chat models tend to produce long responses and sometimes fail to follow the required output format. However, a significant recovery in text layout capability is observed after instruction-code tuning, underscoring the crucial role of code-related data in enhancing text layout understanding. In contrast, the model fine-tuned on instruction-table experiences a decrease in text layout capability, indicating that table-related data do not contribute to it. It is noteworthy that performance improves following instruction-generated tuning; for some smaller models such as Llama2-7B and Baichuan2-7B, instruction-generated tuning even yields the best result among all instruction-tuning datasets. Considering that generating data is significantly more convenient and cost-effective than collecting it, this lays a promising path for preparing LLM training corpora.

Table 8. Performance (F-score) of LLMs on TextLayoutQA after different types of instruction-tuning.

LLM            Origin  Basic  Code   Table  Generate
ChatGLM3-6B    49.52   45.36  63.36  28.98  59.52
Llama2-7B      58.80   64.37  66.10  61.77  66.59
Llama2-13B     61.93   72.31  73.69  66.71  66.26
Baichuan2-7B   60.82   59.93  60.77  53.79  63.78
Baichuan2-13B  68.69   66.14  72.06  65.93  69.15

4.3.3 Applications

In this section, we illustrate the utilization of LLMs' text layout understanding capability in the text-rich VQA domain. We introduce a method named textLayoutParser designed to parse texts with diverse layouts from documents, including plain texts, forms, tables, images, and their combinations. The method places text on a two-dimensional character canvas according to the text's coordinates; the detailed implementation is available in the Appendix. We evaluate the zero-shot performance on the test sets of three datasets: XfundQA, DocVQA, and FeTaQA. The prompts utilized for each dataset are provided in the Appendix.

XfundQA

We use the OCR output provided by the dataset and construct corpora with text layout using textLayoutParser. As a comparison, we replace consecutive spaces and newlines with a single space marker, forming corpora without text layout. The evaluation results of different LLMs on XfundQA with and without text layout are presented in Table 11. Notably, corpora with text layout lead to performance improvements ranging from 1.96% to 9.55% compared to corpora without text layout.

Table 9. Evaluation results of LLMs on the FeTaQA test set with (layout) and without (strip) text layout.

               ROUGE-L                       BLEU-4
LLMs           Strip  Layout  Difference     Strip  Layout  Difference
ChatGLM3-6B    28.79  31.28   +2.49          10.84  11.08   +0.24
Llama2-7B      19.71  24.03   +4.32          5.82   7.67    +1.85
Llama2-13B     27.07  30.49   +3.42          9.14   10.91   +1.77
Baichuan2-7B   32.26  34.26   +2.00          12.15  13.57   +1.42
Baichuan2-13B  34.46  39.15   +4.69          14.04  16.51   +2.47
GPT-3.5-Turbo  39.05  39.76   +0.71          16.21  16.63   +0.42

Table 10. Evaluation results of different table encoding methods on the FeTaQA test set.

               ROUGE-L                          BLEU-4
LLMs           Array  Linear  Triplet  Ours     Array  Linear  Triplet  Ours
ChatGLM3-6B    28.79  31.01   31.25    31.28    10.84  10.85   11.05    11.08
Llama2-7B      19.71  23.69   22.84    24.03    5.82   7.63    6.98     7.67
Llama2-13B     27.07  28.80   26.40    30.49    9.14   9.92    9.22     10.91
Baichuan2-7B   32.26  31.87   31.03    34.26    12.15  12.21   11.55    13.57
Baichuan2-13B  34.46  40.08   32.57    39.15    14.04  16.94   12.53    16.51
GPT-3.5-Turbo  39.05  35.21   36.88    39.76    16.21  14.15   14.96    16.63

Table 11. Evaluation results of LLMs on XfundQA with (layout) and without (strip) text layout.

LLMs           Strip  Layout  Difference
ChatGLM3-6B    60.13  66.18   +6.05
Llama2-7B      57.41  66.96   +9.55
Llama2-13B     58.92  66.60   +7.68
Baichuan2-7B   64.70  66.66   +1.96
Baichuan2-13B  67.38  73.27   +5.89
GPT-3.5-Turbo  76.67  77.50   +3.03

DocVQA

We use the OCR output provided by the dataset and construct corpora with text layout using textLayoutParser. For comparison, consecutive spaces and newlines are replaced with a single space marker, forming corpora without text layout. Table 12 shows the evaluation results of different LLMs on the DocVQA test set with and without text layout. Compared to corpora without text layout, different LLMs achieve performance improvements of 2.67% to 4.27% on corpora with text layout.

Table 12. Evaluation results of LLMs on the DocVQA test set with (layout) and without (strip) text layout.

LLMs           Strip  Layout  Difference
ChatGLM3-6B    44.60  48.30   +3.70
Llama2-7B      38.50  41.81   +3.31
Llama2-13B     41.33  44.42   +3.09
Baichuan2-7B   33.50  36.17   +2.67
Baichuan2-13B  38.75  41.80   +3.05
GPT-3.5-Turbo  62.68  66.95   +4.27

FeTaQA

The FeTaQA dataset provides tables in array format; we convert the array table data into string format to serve as corpora without text layout. Additionally, corpora refactored by textLayoutParser are used as corpora with text layout. Table 9 presents the evaluation results of different LLMs on the FeTaQA test set with and without text layout. Notably, various LLMs showcase performance enhancements ranging from 0.71% to 4.69% (ROUGE-L) and 0.24% to 2.47% (BLEU-4) on corpora with text layout, compared to those without.

Different text layout encoding methods are tailored to specific cases. For instance, in the context of table QA, common table encoding techniques include employing identifiers to distinguish headers and rows (referred to as Linear) [23, 15] and representing each element as a col-row-value triplet to create a list (referred to as Triplet) [40]. Apart from our proposed method, we explore several other text layout encoding techniques for an ablation study. Examples of the different table encoding methods can be found in the Appendix. Table 10 provides a performance assessment of various table encoding methods on the FeTaQA test set. Our proposed method outperforms the others for ChatGLM3-6B, Llama2-7B, and GPT-3.5-Turbo. Conversely, for Baichuan2-13B, the Linear encoding method demonstrates superior results.
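For illustration, simplified versions of the Linear and Triplet table encodings might look as follows; the exact identifiers and formats used by the cited baselines [23, 15, 40] may differ.

def linear_encode(header, rows):
    # Identifier-based linearization: mark the header and each row explicitly.
    parts = ["[HEAD] " + " | ".join(header)]
    parts += ["[ROW {}] ".format(i) + " | ".join(map(str, row)) for i, row in enumerate(rows, 1)]
    return " ".join(parts)

def triplet_encode(header, rows):
    # One (column, row, value) triplet per cell.
    return [(header[j], i, str(cell))
            for i, row in enumerate(rows, 1) for j, cell in enumerate(row)]

header = ["Year", "Title", "Role"]
rows = [["2017", "No Activity", "Pedro"]]
print(linear_encode(header, rows))   # [HEAD] Year | Title | Role [ROW 1] 2017 | No Activity | Pedro
print(triplet_encode(header, rows))  # [('Year', 1, '2017'), ('Title', 1, 'No Activity'), ('Role', 1, 'Pedro')]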

5 Conclusion

This study extensively investigates the potential of LLMs in text layout understanding by constructing the TextLayoutQA dataset for in-depth research. Experiments utilizing various LLMs demonstrate that, compared to text without layout, the performance of LLMs on datasets with text layout improves by 8–25%, confirming their potential in text alignment, layout, and orientation understanding. Additional experiments show that during the pre-training phase, the base models already possess preliminary text layout understanding capabilities, which are further enhanced during instruction-tuning. Through ablation experiments with diverse instruction-tuning datasets, we find that training data is crucial for LLMs to acquire text layout understanding, particularly datasets containing text layouts such as code. In addition, text layout understanding can be enhanced by low-cost, automatically generated data produced via a novel text game. Finally, leveraging the text layout understanding capabilities of LLMs, we propose an approach named textLayoutParser to address text-rich VQA problems, achieving decent performance improvements on the XfundQA, FetaQA, and DocVQA datasets.

In summary, our research unveils capabilities in LLMs that have been underexplored, demonstrating their potential to enhance the performance of text-rich VQA problems, expanding the application scenarios of language-centric LLMs, and providing new perspectives for subsequent LLM corpora preparation.

6 Acknowledgments

This research was supported by "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2024C01020).

References

  • Anand et al. [2023] Y. Anand, Z. Nussbaum, B. Duderstadt, B. Schmidt, and A. Mulyar. GPT4All: Training an assistant-style chatbot with large scale data distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all, 2023.
  • Bayani [2023] D. Bayani. Testing the depth of ChatGPT's comprehension via cross-modal tasks based on ASCII-art: GPT-3.5's abilities in regard to recognizing and generating ASCII-art are not totally lacking, 2023.
  • Chaudhary [2023] S. Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca, 2023.
  • Chen et al. [2023] L. Chen, L. Wang, H. Dong, Y. Du, J. Yan, F. Yang, S. Li, P. Zhao, S. Qin, S. Rajmohan, et al. Introspective tips: Large language model for in-context decision making, 2023.
  • Di Bartolomeo et al. [2023] S. Di Bartolomeo, G. Severi, V. Schetinger, and C. Dunne. Ask and you shall receive (a graph drawing): Testing ChatGPT's potential to apply graph layout algorithms, 2023.
  • [6] E. Dreibelbis. ChatGPT passes Google coding interview for level 3 engineer with $183k salary. https://www.pcmag.com/news/chatgpt-passes-google-coding-interviewfor-level-3-engineer-with-183k-salary.
  • Ganguli et al. [2022] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
  • Guo et al. [2023a] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023a.
  • Guo et al. [2023b] J. Guo, L. Du, and H. Liu. GPT4Graph: Can large language models understand graph structured data? An empirical evaluation and benchmarking, 2023b.
  • Hegde et al. [2023] N. Hegde, S. Paul, G. Madan, and G. Aggarwal. Analyzing the efficacy of an LLM-only approach for image-based document question answering, 2023.
  • Hu et al. [2020] R. Hu, A. Singh, T. Darrell, and M. Rohrbach. Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9992–10002, 2020.
  • Huang et al. [2022] Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.
  • Hurst and Nasukawa [2000] M. Hurst and T. Nasukawa. Layout and language: Integrating spatial and linguistic knowledge for layout understanding tasks. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics, 2000.
  • Ji et al. [2023] Y. Ji, Y. Gong, Y. Deng, Y. Peng, Q. Niu, B. Ma, and X. Li. Towards better instruction following language models for Chinese: Investigating the impact of training data and evaluation, 2023.
  • Jiang et al. [2022] Z. Jiang, Y. Mao, P. He, G. Neubig, and W. Chen. OmniTab: Pretraining with natural and synthetic data for few-shot table-based question answering, 2022.
  • JosephusCheung [2021] JosephusCheung. Guanaco: Generative universal assistant for natural-language adaptive context-aware omnilingual outputs. https://guanaco-model.github.io/, 2021.
  • Joublin et al. [2023] F. Joublin, A. Ceravola, J. Deigmoeller, M. Gienger, M. Franzius, and J. Eggert. A glimpse in ChatGPT capabilities and its impact for AI research, 2023.
  • Kim et al. [2022a] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park. OCR-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022a.
  • Kim et al. [2022b] H. Kim, Y. Yu, L. Jiang, X. Lu, D. Khashabi, G. Kim, Y. Choi, and M. Sap. ProsocialDialog: A prosocial backbone for conversational agents. In EMNLP, 2022b.
  • Kudo and Richardson [2018] T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018.
  • Lee et al. [2023] K. Lee, M. Joshi, I. R. Turc, H. Hu, F. Liu, J. M. Eisenschlos, U. Khandelwal, P. Shaw, M.-W. Chang, and K. Toutanova. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
  • Lin [2004] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
  • Liu et al. [2021] Q. Liu, B. Chen, J. Guo, M. Ziyadi, Z. Lin, W. Chen, and J.-G. Lou. TAPEX: Table pre-training via learning a neural SQL executor, 2021.
  • Liu et al. [2023a] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, et al. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology, page 100017, 2023a.
  • Liu et al. [2023b] Y. Liu, Z. Li, H. Li, W. Yu, M. Huang, D. Peng, M. Liu, M. Chen, C. Li, L. Jin, et al. On the hidden mystery of OCR in large multimodal models, 2023b.
  • Mathew et al. [2021] M. Mathew, D. Karatzas, and C. Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.
  • Nan et al. [2022] L. Nan, C. Hsieh, Z. Mao, X. V. Lin, N. Verma, R. Zhang, W. Kryściński, H. Schoelkopf, R. Kong, X. Tang, et al. FeTaQA: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35–49, 2022.
  • [28] K. O'Riordan. ASCII art. https://www.britannica.com/topic/ASCIIart.
  • Papineni et al. [2002] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • Pasupat and Liang [2015] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables, 2015.
  • Peng et al. [2023] B. Peng, C. Li, P. He, M. Galley, and J. Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.
  • Radford et al. [2019] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Raffel et al. [2020] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Sennrich et al. [2015] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units, 2015.
  • Shi et al. [2023] Y. Shi, H. Ma, W. Zhong, G. Mai, X. Li, T. Liu, and J. Huang. ChatGraph: Interpretable text classification by converting ChatGPT knowledge to graphs, 2023.
  • Smith et al. [2013] J. R. Smith, H. Saint-Amand, M. Plamada, P. Koehn, C. Callison-Burch, and A. Lopez. Dirt cheap web-scale parallel text from the Common Crawl. Association for Computational Linguistics, 2013.
  • Sun et al. [2023] T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan, X. Liu, Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng, Z. Zhou, R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu, Z. Yin, X. Huang, and X. Qiu. MOSS: Training conversational language models from synthetic data, 2023.
  • Tang et al. [2023] Z. Tang, Z. Yang, G. Wang, Y. Fang, Y. Liu, C. Zhu, M. Zeng, C. Zhang, and M. Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
  • Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023.
  • Vakulenko and Savenkov [2017] S. Vakulenko and V. Savenkov. TableQA: Question answering on tabular data, 2017.
  • Wang et al. [2023] H. Wang, S. Feng, T. He, Z. Tan, X. Han, and Y. Tsvetkov. Can language models solve graph problems in natural language?, 2023.
  • Wang et al. [2022] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks, 2022.
  • [43] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  • Xu et al. [2021] Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, and F. Wei. LayoutXLM: Multimodal pre-training for multilingual visually-rich document understanding, 2021.
  • Yang et al. [2023] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, et al. Baichuan 2: Open large-scale language models, 2023.
  • Yang [2023] J. Yang. Firefly. https://github.com/yangjianxin1/Firefly, 2023.
  • Ye et al. [2023] Y. Ye, H. You, and J. Du. Improved trust in human-robot collaboration with ChatGPT. IEEE Access, 2023.
  • Zeng et al. [2022] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. GLM-130B: An open bilingual pre-trained model, 2022.
  • Zhang et al. [2023] G. Zhang, Y. Shi, R. Liu, R. Yuan, Y. Li, S. Dong, Y. Shu, Z. Li, Z. Wang, C. Lin, W. Huang, and J. Fu. Chinese Open Instruction Generalist: A preliminary release, 2023.
  • Zhang [2023] J. Zhang. Graph-ToolFormer: To empower LLMs with graph reasoning ability via prompt augmented by ChatGPT, 2023.
  • Zhu et al. [2023] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, and N. Zhang. LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities, 2023.

Appendix A Examples of LLMs Text Layout Understanding Capability

During our early exploration of GPT-3.5-Turbo's QA ability, we discovered its remarkable capability to comprehend text alignment, layout, and orientation. Figure 6 shows some examples from this exploration.

Example 1

Tom                     Jean

          Thomas

Lee                     David

Question: What is the text in the center?
Answer: The text in the center is "Thomas".

Example 2

Here are two bboxes:

-------
|     |
|     |        ------
|     |        |    |
|     |        |    |
-------        ------

Question: Which bbox is larger, left or right?
Hint: the bbox with more whitespace inside is larger.
Answer: Based on the given information, the left bbox is larger.

Example 3

Now we define the mathematical symbols with their visual representation using a 5x5 matrix, which is made of "0" and "1". For example:
The visual representation of 0 is:
11111
10001
10001
10001
11111
The visual representation of 1 is:
00100
00100
00100
00100
00100
The visual representation of 2 is:
11111
00001
11111
10000
11111

Question: What is the visual representation of 3?
Answer: The visual representation of 3 is:
11111
00001
11111
00001
11111
Question: What is the visual representation of x?
Answer: The visual representation of x is:
10001
01010
00100
01010
10001

Appendix B Corpora with Layout Information on Github and StackExchange

By searching for data within Pile that potentially contains text layout information, we discover considerable relevant data from sources like GitHub and StackExchange. Figure 7 shows some examples.

Example 1: from Github

<html id=\"top\">

<head>

<meta charset=\"utf-8\">

<title>The Crosswalk Project</title>

<link rel=\"shortcut icon\" href=\"/assets/favicon.ico\" type=\"image/x-icon\" />

<link rel=\"icon\" href=\"/assets/favicon.ico\" type=\"image/x-icon\" />

<script>

WebFontConfig = {

custom: {

families: [’Clear Sans’],

urls: [’/css/fonts.css’]

},

google: {

families: [’Source Code Pro:n4,n6’]

},

timeout: 2000

};

</script>

</head>

</html>

Example 2: from Github

/*
 *  Summary:
 *    Selectors for feature type kCursiveConnectionType
 */
enum {
  kUnconnectedSelector = 0,
  kPartiallyConnectedSelector = 1,
  kCursiveSelector = 2
};

Example 3: from StackExchange

** LOGGED HOURS **    ** SICK HOURS **      ** RESULT TABLE **
+--------+-------+    +--------+-------+    +--------+-------+-------+
|Name    | Hours |    |Name    | Hours |    |Name    |Hours  |Sick   |
+--------+-------+    +--------+-------+    +--------+-------+-------+
|David   |47     |    |David   |9      |    |David   |47     |9      |
+--------+-------+    +--------+-------+    +--------+-------+-------+
                      |David   |9      |    |David   |9      |0      |
                      +--------+-------+    +--------+-------+-------+

Example 4: from StackExchange

Switch flooding when bonding interfaces in Linux

+----+-----+

| Switch 1 | (layer2/3)

+----+-----+

|

+----+-----+

| Switch 2 |

+----+-----+

|

+----------+----------+

+-------------------------+ Switch 3 +-------------------------+

| +----+-----------+----+ |

| | | |

| | | |

| eth0 (B0:B0:B0:B0:B0:B0) | | eth4 (B4:B4:B4:B4:B4:B4) |

| +----+-----------+----+ |

| | Host B | |

| +----+-----------+----+ |

| eth1 (B1:B1:B1:B1:B1:B1) | | eth5 (B5:B5:B5:B5:B5:B5) |

| | | |

| | | |

+------------------------------+ +------------------------------+

Table 13. Test results (ROUGE-L and recall) on the instruction-test subsets after tuning LLMs with different types of instructions.

                         ROUGE-L                              Recall
LLM            Tuning    Code   Table  Generate  Others       Code   Table  Generate  Others
ChatGLM3-6B    Origin    26.65  20.53  8.14      28.49        67.78  30.96  24.51     38.12
               Base      9.25   25.65  5.39      33.88        19.39  29.25  11.04     37.15
               Code      20.67  17.38  7.46      30.71        36.24  20.65  16.84     36.04
               Table     2.13   32.00  5.87      21.49        3.72   32.00  6.05      25.37
               Generate  4.79   15.05  64.33     33.92        7.14   15.98  67.16     38.31
Llama2-7B      Origin    19.28  24.34  0.07      17.41        66.92  40.44  0.11      38.17
               Base      10.59  25.63  9.83      32.25        26.69  45.40  27.04     39.52
               Code      31.23  40.50  8.06      27.61        68.58  46.24  18.31     35.36
               Table     23.32  62.72  5.73      32.91        54.87  63.17  17.62     37.96
               Generate  12.53  30.74  74.98     29.26        31.13  37.94  85.97     35.58
Llama2-13B     Origin    21.27  18.97  0.39      17.68        68.79  47.94  0.44      38.97
               Base      7.47   47.07  7.83      35.61        15.11  53.50  16.22     39.69
               Code      30.27  43.65  5.18      32.79        68.09  53.55  16.26     39.81
               Table     12.41  61.83  8.29      32.25        25.87  62.07  11.84     38.75
               Generate  17.53  43.23  77.05     34.13        32.70  46.67  80.18     39.93
Baichuan2-7B   Origin    27.34  18.05  3.79      22.72        70.00  47.98  5.90      41.59
               Base      27.81  18.12  2.60      22.34        70.62  47.48  5.53      40.95
               Code      26.83  17.30  4.32      22.39        69.97  47.98  9.49      40.74
               Table     27.83  61.29  10.19     25.31        55.01  61.67  12.30     37.64
               Generate  12.60  11.40  77.64     26.65        31.69  36.56  81.90     39.49
Baichuan2-13B  Origin    29.80  13.92  4.22      26.22        67.04  46.87  10.15     41.38
               Base      8.68   17.98  9.89      25.49        21.40  43.46  17.49     36.91
               Code      20.58  17.96  8.04      27.08        51.82  47.94  21.68     36.64
               Table     10.72  63.76  13.73     29.03        26.97  64.24  17.93     41.76
               Generate  15.91  19.52  75.77     24.08        37.57  41.45  80.28     35.48

Appendix C Test results of using different types of instructions to tune LLMs

The test results of tuning LLMs with different types of instructions are presented in Table 13. Given that LLMs often produce long responses that do not align with the ground truth of the table subset, recall yields more reasonable results than ROUGE-L. It can be observed that, compared to the chat model, the code capability significantly decreases after tuning on instruction-basic, but it substantially recovers after tuning on instruction-code, except for Baichuan2-7B. In contrast to the model tuned on instruction-basic, table capability gains a considerable improvement after tuning on instruction-table, while layout capability obtains a remarkable improvement following instruction-generated tuning.

Appendix D Method of textLayoutParser

The implementation of textLayoutParser includes four steps: text parsing, determination of unit character size and coordinate conversion, filling text into the character matrix, and conversion of character matrix to plain text.

Text Parsing

Utilize appropriate parsing methods based on different file formats to obtain the text content and the corresponding positional coordinates. For example, OCR can be used to extract text and coordinates from images, while the PyMuPDF Python library can be employed to parse PDF files. For table data, we generate bounding boxes (bboxes) for each element in the table, including headers and cells, based on coordinates and text length. The generation process is as follows: each character is treated as a unit character, with an assumed spacing of 2 between adjacent elements in the same row and 1 between adjacent elements in the same column. The maximum text length over all elements in the j-th column is denoted as l. If the bbox of the element in the i-th row and (j-1)-th column is [x1, i, x2, i+1], then the bbox of the element in the i-th row and j-th column (V_ij for short) is [x2+2, i, x2+2+l, i+1].

Determination of Unit Character Size and Coordinate Conversion

Determine a unit character size by analyzing the sizes of all text characters, filtering out characters smaller than this unit size. The other text coordinates are then converted using this unit character size. Consider a text t with length n and bbox coordinates (x1, y1, x2, y2). The approximate character width and height can be calculated as (x2 - x1)/n and y2 - y1, respectively. Let the unit character's width be x0 and its height be y0. The coordinates of t after conversion become (x1/x0, y1/y0, x2/x0, y2/y0), rounded to the nearest integer.

Filling Text into the Character Matrix

Using the converted coordinates, insert the text into a character matrix. Initialize a matrix whose elements are spaces, with the number of rows and columns set to the maximum y-value and x-value of the converted text coordinates. Then, place each text into the corresponding indices of the matrix from left to right to preserve text continuity. For example, if the converted text coordinate is $(10, 10, 20, 20)$ and the text length is 5, the characters of the text are placed one by one into the matrix indices from $(10, 10)$ to $(15, 10)$.
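A minimal sketch of the filling step, assuming each span already carries converted integer coordinates (text, x1, y1, x2, y2) and fits on a single row; the function name fill_matrix is ours.

def fill_matrix(spans):
    """Place each span into a 2-D character matrix at row y1, columns x1, x1+1, ..."""
    n_rows = max(y2 for _, _, _, _, y2 in spans)
    n_cols = max(max(x2, x1 + len(t)) for t, x1, _, x2, _ in spans)
    matrix = [[" "] * n_cols for _ in range(n_rows)]
    for text, x1, y1, x2, y2 in spans:
        for k, ch in enumerate(text):      # left to right, one cell per character
            matrix[y1][x1 + k] = ch
    return matrix

spans = [("Im James.", 0, 0, 9, 1), ("Im Oliver.", 20, 0, 30, 1)]
for row in fill_matrix(spans):
    print("".join(row))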

Conversion of Character Matrix to Plain Text

Convert the character matrix into plain text for the LLMs. This involves joining the characters of each row into one line of text and then combining all lines with a newline character as the separator. To reduce the redundancy of dense spaces and newline markers, we remove the first column of every run of at least three consecutive columns filled entirely with spaces, replace rows filled entirely with spaces with a newline character, and collapse runs of three or more consecutive newline markers into two.
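A minimal sketch of this conversion follows, with the redundancy rules read as: for every run of at least three all-space columns, drop the first column of the run; an all-space row becomes an empty line; and runs of three or more newlines are collapsed to two. This reading and the function name matrix_to_text are ours.

import re

def matrix_to_text(matrix):
    n_rows, n_cols = len(matrix), len(matrix[0])
    blank_col = [all(matrix[r][c] == " " for r in range(n_rows)) for c in range(n_cols)]
    keep = []
    for c in range(n_cols):
        # drop a column only if it starts a run of >= 3 consecutive blank columns
        starts_long_run = (blank_col[c]
                           and (c == 0 or not blank_col[c - 1])
                           and c + 2 < n_cols
                           and blank_col[c + 1] and blank_col[c + 2])
        keep.append(not starts_long_run)
    lines = ["".join(ch for ch, k in zip(row, keep) if k).rstrip() for row in matrix]
    text = "\n".join(lines)                  # an all-space row becomes an empty line
    return re.sub(r"\n{3,}", "\n\n", text)   # cap consecutive newlines at two

demo = [list("A        B    "), list("apple    fish ")]
print(matrix_to_text(demo))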

Appendix E Prompt Designs for Different Datasets

Figure 8 illustrates the prompt designs for the different datasets. (a) displays one-shot prompting for TextLayoutQA. (b)∼(d) display zero-shot prompting for XfundQA, DocVQA, and FeTaQA, respectively. (e) illustrates the 3-shot prompting used for rephrasing answers in the DocVQA dataset. The instructions remain consistent across all LLMs except for the Llama2 series, as depicted in (f). Regarding LLM parameter settings, we use a temperature of 0.1, a maximum output length of 512, a top-p of 0.85, and a repetition penalty of 1.05.
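As an illustration only, these decoding parameters map onto the Hugging Face transformers generation API roughly as follows for the open-source models; the checkpoint name and prompt below are placeholders, not the exact setup used in the experiments, and GPT-3.5 takes analogous parameters through its API.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is the name mentioned in the top-left corner?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,           # temperature of 0.1
    top_p=0.85,                # nucleus sampling threshold of 0.85
    max_new_tokens=512,        # maximum output length of 512
    repetition_penalty=1.05,   # repetition penalty of 1.05
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))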

(a) An example of one-shot prompting for TextLayoutQA dataset

Given some shopping lists with different products, you are supposed to enumerate the products of specific lists and answer questions in the form of a list, for example: ['a', 'b'], reply with the list only! If you don't know the answer, reply with the empty list [].

For example:
Here are 2 shopping lists (A, B) with different products:

A       B

apple   fish

banana  chair

car

Question: What products do shopping list B contain?
Answer: ['fish', 'chair']

Now answer the question below:
{context}

Question: {question}
Answer:

(b) An example of zero-shot prompting for XfundQA dataset

The following is a form composed of key-value pairs: "{context}". Please answer according to the given form.
Note: The value usually appears near the key. Think carefully and answer with a few words.
Question: What is the value of the key "{question}"?
Answer:

(c) An example of zero-shot prompting for DocVQA dataset

Given the context:
{context}
Use few words to answer the question: {question}
Answer:

(d) An example of zero-shot prompting for FeTaQA dataset

Given a table:
{context}
Answer questions about the table.
Note: think step by step.
Question: {question}
Answer:

(e) An example of 3-shot prompting for rephrasing answers in DocVQA dataset

Given the question and answer pair, rephrase the answer to provide the most straightforward response to the question with few words in English.

Example 1:
Question: What is the name of the person in the CC field?
Answer: The name of the person in the CC field is Jo Spach.
Rephrased answer: Jo Spach

Example 2:
Question: What is the given document about?
Answer: The given document appears to be a summary of an evaluation survey conducted by Telmark in a particular monthly region in 2014. The survey aimed to evaluate the effectiveness of Telmark’s promotional programs in the region. The document provides information on various aspects of the survey, including the number of stores that received promotional materials, the percentage of stores that placed the materials in a visible location, and the number of stores that participated in the promotion. Additionally, the document includes information on the wholesale accounts sold by Telmark in the region and the percentage of accounts that refused the promotion.
Rephrased answer: region monthly telmark program evaluation survey

Example 3:
Question: What is the % of Employees in 2012 based on graph ’Distribution of Value-Added’?
Answer: Based on the graph ’Distribution of Value-Added’, it can be observed that the percentage of employees in 2012 is around 80%.
Rephrased answer: 80%

Now rephrase the answer based on the QA pair:
Question: {question}
Answer: {answer}
Rephrased answer:

(f) Prompt template for Llama2

<s>[INST] <<SYS>>

{system prompt}

<</SYS>>

{instruction} [/INST]

Appendix F Examples of different table encoding methods

Figure 9 shows examples of different table encoding methods. The widely used table encoding methods include: arranging data in array format (Array), using unique identifiers to distinguish between headers and rows (Linear), and formatting each element as a column-row-value triplet to form a list (Triplet). An illustrative sketch that generates each of these encodings appears after the examples below.

(a) Array

[['Year', 'Title', 'Role', 'Channel'],

['2015', 'Kuch Toh Hai Tere Mere Darmiyaan', 'Sanjana Kapoor', 'Star Plus'],

['2016', 'Kuch Rang Pyar Ke Aise Bhi', 'Khushi', 'Sony TV'],

['2016', 'Gangaa', 'Aashi Jhaa', '&TV']]

(b) Linear

[HEAD] Year | Title | Role | Channel

[ROW] 1 2015 | Kuch Toh Hai Tere Mere Darmiyaan | Sanjana Kapoor | Star Plus

[ROW] 2 2016 | Kuch Rang Pyar Ke Aise Bhi | Khushi | Sony TV

[ROW] 3 2016 | Gangaa | Aashi Jhaa | &TV

(c) Triplet

Row1 | Year | 2015

Row1 | Title | Kuch Toh Hai Tere Mere Darmiyaan

Row1 | Role | Sanjana Kapoor

Row1 | Channel | Star Plus

Row2 | Year | 2016

Row2 | Title | Kuch Rang Pyar Ke Aise Bhi

Row2 | Role | Khushi

Row2 | Channel | Sony TV

Row3 | Year | 2016

Row3 | Title | Gangaa

Row3 | Role | Aashi Jhaa

Row3 | Channel | &TV

(d) Ours

Year  Title                             Role            Channel

2015  Kuch Toh Hai Tere Mere Darmiyaan  Sanjana Kapoor  Star Plus

2016  Kuch Rang Pyar Ke Aise Bhi        Khushi          Sony TV

2016  Gangaa                            Aashi Jhaa      &TV
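The sketch below reproduces the four encodings from a header row plus data rows. It is an illustrative reconstruction (the function names are ours); the layout-style encoding pads each column to its widest cell, in the spirit of the layout conversion described earlier.

def encode_array(header, rows):
    return repr([header] + rows)

def encode_linear(header, rows):
    lines = ["[HEAD] " + " | ".join(header)]
    lines += [f"[ROW] {i} " + " | ".join(row) for i, row in enumerate(rows, 1)]
    return "\n".join(lines)

def encode_triplet(header, rows):
    return "\n".join(f"Row{i} | {col} | {val}"
                     for i, row in enumerate(rows, 1)
                     for col, val in zip(header, row))

def encode_layout(header, rows, gap=2):
    table = [header] + rows
    widths = [max(len(r[j]) for r in table) for j in range(len(header))]
    pad = " " * gap
    return "\n".join(pad.join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
                     for row in table)

header = ["Year", "Title", "Role", "Channel"]
rows = [["2015", "Kuch Toh Hai Tere Mere Darmiyaan", "Sanjana Kapoor", "Star Plus"],
        ["2016", "Kuch Rang Pyar Ke Aise Bhi", "Khushi", "Sony TV"],
        ["2016", "Gangaa", "Aashi Jhaa", "&TV"]]
print(encode_layout(header, rows))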
