More detailed thoughts about data extraction

This gist contains some ideas about using LLMs to extract data from papers (specifically related to biology, aging research and the like).

Just to quickly expand a bit on what I was trying to say when our meeting was cut off:

I think the LLM data extraction can be viewed as a problem tractable at 3 different layers:
1. purely text based, e.g. use `pdftotext` to turn a PDF into a text document, then use LLMs to summarize, extract, tag, ... papers in order to have machine readable data.
2. using LLMs for full PDF parsing, where the idea is to produce a machine readable version directly, _including_ extraction of plots, tables, schematics and the like. A lot more challenging, but still somewhat doable. Data extraction from plots in particular is a difficult problem, where LLMs on their own are not going to be enough. In any case, the idea is to make all the data written in the paper usable programmatically.
3. beyond just PDF parsing to get machine readable data, _also_ make sure that the paper is cohesive, correct minor ambiguities / errors, fill in missing details in legends of plots and the like, potentially raise a "human input needed" flag and so on. This is certainly the most difficult problem, as it relies on being able to do 1 and 2 correctly already.

The reason I raised the distinction is that by framing the problem this way, we can make incremental progress towards 3. Each step forms the basis for the next: if we implement 2, for example, its output becomes the starting point for fixing up errors in 3.

For a few more thoughts, see <<this page>>

Step 1

As mentioned in the short message above, the first step is to run `pdftotext` on each PDF. `pdftotext` is a command line tool that is part of Poppler. It is pretty good (but not perfect!) at converting PDFs into plain text files with most of the important layout retained.

For example, take this extract of this paper I’m currently reading:

(Screenshot: an excerpt of the paper as rendered in the PDF.)

If we run pdftotext like so:

```
pdftotext -layout path/to/paper.pdf
```

we get a text file that contains content looking like this:

(Screenshot: the same excerpt as plain text produced by pdftotext, with the layout largely preserved.)

The resulting data can easily be fed into an LLM for further processing (potentially in parts, depending on the length of the paper).
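As a minimal sketch of what that could look like (assuming Python, Poppler's `pdftotext` on the PATH, and an arbitrary placeholder chunk size of 8000 characters):

```python
import subprocess
from pathlib import Path

def pdf_to_text(pdf_path: str) -> str:
    """Convert a PDF to plain text using Poppler's pdftotext, keeping the layout."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(["pdftotext", "-layout", pdf_path, str(txt_path)], check=True)
    return txt_path.read_text()

def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Naively split the text into pieces small enough for one LLM request."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

text = pdf_to_text("path/to/paper.pdf")
chunks = chunk_text(text)  # each chunk can be sent to the LLM separately
```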

What to tell the LLM to do?

The next, somewhat tricky, question is what to tell the LLM to do with that text file. I'd propose to start with a fixed set of paper types:

  • describing an experimental setup and reporting experimental results
  • proposals
  • theoretical / phenomenological models
  • computer simulation work
  • review papers

For each of these types, we then design the output structure we'd like to generate.

In all cases, start by categorizing the paper, writing a summary, etc.

Different ideas for types of data to fill:

  • key findings
  • experimental setup
  • experimental results

It would probably make the most sense to use the ‘structured output’ feature that most LLM providers support nowadays (e.g. OpenAI’s docs about it here), where the model fills in a JSON schema and is constrained to return only valid JSON. That guarantees the output is actually machine readable without needing another LLM pass to interpret it.
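A rough sketch of how that could look with the OpenAI Python SDK's structured-output support. The `PaperSummary` schema, its field names, the model name and the `paper_text` variable (assumed to hold the step 1 text) are placeholders for illustration, not a fixed design:

```python
from enum import Enum
from pydantic import BaseModel
from openai import OpenAI

class PaperType(str, Enum):
    experimental = "experimental"
    proposal = "proposal"
    theoretical_model = "theoretical_model"
    simulation = "simulation"
    review = "review"

class PaperSummary(BaseModel):
    paper_type: PaperType
    summary: str
    key_findings: list[str]
    experimental_setup: str
    experimental_results: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract structured data from the given paper text."},
        {"role": "user", "content": paper_text},  # text produced in step 1
    ],
    response_format=PaperSummary,  # model output is constrained to this schema
)
paper_data = completion.choices[0].message.parsed  # a validated PaperSummary instance
```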

Part of the difficulty will be making sure we cover a useful range of paper types and of the output structures we want to produce.

Step 2

In step 2 we would then either partially reuse the result from step 1 as the baseline and only use the PDFs to extract non-text data (schematics, plots, etc.), or start from scratch.

Depending on the paper length, it will likely be necessary to split the PDF into multiple chunks and have the LLM process the pieces individually. Asking an LLM to produce a complete output for a full paper in one pass is likely to be too error-prone.
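For the splitting itself, a sketch using the `pypdf` library (one possible choice; the pages-per-chunk value is arbitrary) could look like this:

```python
from pypdf import PdfReader, PdfWriter

def split_pdf(pdf_path: str, pages_per_chunk: int = 5) -> list[str]:
    """Split a PDF into smaller PDFs of `pages_per_chunk` pages each and return their paths."""
    reader = PdfReader(pdf_path)
    chunk_paths = []
    for start in range(0, len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for page in reader.pages[start:start + pages_per_chunk]:
            writer.add_page(page)
        chunk_path = f"{pdf_path}.chunk_{start // pages_per_chunk}.pdf"
        with open(chunk_path, "wb") as f:
            writer.write(f)
        chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk could then be handed to a (multimodal) model for extraction and the partial results merged afterwards.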

Some agentic workflow might help here.

In any case, the basic idea is to get the model to produce structured output (whether that is JSON or some other, possibly better, format is not so important; JSON is mainly useful because model providers offer machinery to guarantee valid JSON).

Step 3

Potentially the easiest way to get to step 3 would be an agentic workflow in which agents cross-check the raw data (text or PDF) against the output generated in step 2. Those agents could then fix or highlight problems in the data. Given that one would already be working with machine readable data, one could use a version control system (e.g. git) to track the changes and introduce programmatic checks, e.g. flag for human input not just if the model says so, but also if the diff is larger than a certain threshold, …
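As a sketch of the kind of programmatic check meant here (the 20% threshold and file names are placeholders), one could compare the extracted output before and after the cross-checking agents ran and flag large changes for human review:

```python
import difflib

def needs_human_review(original: str, revised: str, threshold: float = 0.2) -> bool:
    """Flag for human input if the relative size of the diff exceeds `threshold`."""
    matcher = difflib.SequenceMatcher(None, original.splitlines(), revised.splitlines())
    changed_fraction = 1.0 - matcher.ratio()  # fraction of lines that differ
    return changed_fraction > threshold

# e.g. compare two versions of the extracted JSON tracked in git
if needs_human_review(open("paper_v1.json").read(), open("paper_v2.json").read()):
    print("Diff exceeds threshold, flagging for human review")
```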
