@colinricardo
Created April 9, 2025 21:14
o3output

got it. i'll design an approach to extract structured tabular data (with notes) from lossy entryText arrays like these, where spacing and line breaks can vary. i'll use regex, rule-based parsing, and heuristics around patterns like dates and titles to reconstruct columns.

will include fallback logic for edge cases (e.g. merged rows, dangling notes, partial entries) and provide a clear JSON schema for the output.

sit tight, i'll be back with the full outline and rationale.

Plan: Parsing Land Registry “Schedule of Notices of Leases” into Structured Data

Overview of the Problem

Land Registry title registers include a Schedule of Notices of Leases presented in tabular form. This schedule has four columns of information for each lease entry:

  1. Registration date and plan reference – e.g. the date the lease was registered and how it’s marked on the title plan (like “edged red” or “tinted blue”).
  2. Property description – the address or description of the leased property (e.g. a flat or unit address).
  3. Date of lease and term – the lease’s execution date and its term/duration (e.g. “99 years from 1.1.2000”).
  4. Lessee’s title – the title number of the leasehold estate (if registered).

Some lease entries also have notes that span across all columns (typically prefixed with “NOTE:” in the text). The Land Registry API provides this schedule as an array of text lines (entryText), but without clear column separators. Our task is to reconstruct the table structure with high accuracy and output each entry as structured JSON fields, including properly associated notes.

Challenges in the Raw entryText Format

Parsing the entryText array is non-trivial due to several issues:

  • No explicit delimiters: Columns are separated only by spaces, and the spacing is inconsistent. If a column has long text (e.g. a long address), it may run right up against the next column’s text with minimal spacing. Conversely, short entries might have extra spaces. This ambiguity makes it hard to know where one field ends and the next begins.
  • Multi-line entries: A single lease entry often spans multiple lines in the entryText. For example, the property description or the lease term might break onto a new line. The API just lists all lines sequentially, so we must determine which lines belong together as one entry.
  • Loss of table structure: The original PDF table had fixed columns, but the API output is “lossy.” It doesn’t preserve clear column boundaries or even the row boundaries beyond sequential order. We have to infer structure from content.
  • Notes handling: Notes are printed as separate lines (usually starting with "NOTE:") that apply to the preceding entry. These note lines have no column structure (they span all columns), so they need special handling to extract them and attach them to the correct lease entry.
  • Various content patterns: The text in each column can vary (addresses, dates, title numbers, etc.). We must account for patterns like dates in different formats (with or without leading zeros), terms that may include phrases like “less XXXX days,” and property descriptions that could be complex (floor levels in parentheses, addresses with commas, etc.).
  • Potential missing data: Some leases in the schedule might not have a lessee’s title (e.g. if the lease is not separately registered). These entries will have an empty last column, which we need to represent as null or empty in JSON without mis-aligning other fields.
  • Scale and performance: The solution needs to handle hundreds of thousands of schedules quickly. This means we need a deterministic approach (no heavy ML or external API calls) that reliably parses each entry in linear time.

Methodology and Approach

To tackle these challenges, the solution will use a series of deterministic parsing steps, combining heuristic rules and regular expressions to identify each component. The process can be broken down as follows:

1. Segmentation of Entries

First, iterate through the entryText lines and group them by individual lease entry. Each entry may consist of multiple lines. We can detect the start of a new entry by looking for patterns that only occur at the beginning of an entry, such as a registration date (and possibly an entry number):

  • Detect start of entry: Typically, a new entry begins with a date in the format dd.mm.yyyy (e.g. “28.01.2009”). In some cases, the PDF might show an entry number at the far left (e.g. “1”, “2”, “3”… as seen in the screenshots), which could appear in the text. If an entry number is present, it would be at the very start of the line, followed by the date. We handle this by allowing an optional entry number and looking for a date pattern following it.
  • Regex for date: Use a regex to match a date at the line start, for example: ^(\d+\s+)?(?P<reg_date>\d{1,2}\.\d{1,2}\.\d{4}). This pattern says: optionally some digits and whitespace (entry number) then a date of the form dd.mm.yyyy (day, month, year). This robustly catches cases like “1 28.01.2009” or just “28.01.2009”.
  • Group lines until next date: Once a start of entry is found, that line (and possibly subsequent lines) belong to the current entry until the next start-of-entry line is encountered. All lines that do not begin with a date (or entry number + date) are treated as continuations of the current entry. This way, a multi-line property description or term will be grouped correctly.
  • Include note lines: If a line starts with "NOTE:", we will treat it as part of the current entry (a special part, to be extracted as a note). It does not indicate a new entry; it’s additional info for the preceding entry. We group note lines with the entry they belong to.

By the end of this step, we will have each scheduleEntry grouped into a collection of lines. For example, an entry might be grouped as:

["13.11.1996        Retail Warehouse, The             25.07.1996    SY664660        ",
 "1 in yellow       Causeway and River Park          25 years from ",
 "                  Avenue, Staines                  25.3.1995    ",
 "NOTE: The Lease comprises also other land"]

(This corresponds to Entry 1 from the example, grouped into one entry’s lines.)
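A minimal sketch of this grouping step (the helper and pattern names are our own, not part of the API; it assumes entry-start lines begin flush left with an optional entry number and a dd.mm.yyyy date, while continuations and "NOTE:" lines do not):

```python
import re

# A new entry starts at a line beginning with an optional entry number
# followed by a dd.mm.yyyy date; "NOTE:" lines and indented continuation
# lines are kept with the current entry.
ENTRY_START = re.compile(r"^(?:\d+\s+)?\d{1,2}\.\d{1,2}\.\d{4}")

def group_entries(lines):
    entries, current = [], []
    for line in lines:
        if ENTRY_START.match(line):
            if current:
                entries.append(current)
            current = [line]
        elif current:           # continuation or NOTE line for the open entry
            current.append(line)
    if current:
        entries.append(current)
    return entries
```

Note that a continuation line such as "1 in yellow …" does not trigger a new entry, because the digits are not followed by a full date.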

2. Identifying and Extracting Note Lines

Within each grouped entry, we next separate the notes from the main row data:

  • Detect notes: Any line (or part of a line) beginning with the prefix "NOTE:" is considered a note. In practice, notes usually occupy a whole line by themselves in the entryText array. For instance, "NOTE: The Lease comprises also other land" would be one line. We will pull such lines out of the main text and store them in a notes list for that entry.
  • Multi-line notes: If a note is extremely long, it may wrap onto an additional line (which would likely not start with "NOTE:"). In that case we treat the consecutive lines as one note: the first line carries the "NOTE:" prefix and the following line is a continuation. Such cases are rare; typically each note is a single line in the schedule. We handle it by continuing to gather lines after a "NOTE:" line until a clear break (a new entry start or another "NOTE:") appears.
  • Clean note text: When outputting, we will remove the "NOTE:" label (and any numbering like "NOTE 1:") from the note content, since the JSON will just contain the note text. For example, "NOTE: See entry in the Charges Register..." becomes "See entry in the Charges Register..." in the output. Each note will be an item in an array, so if multiple notes are present, they’re kept separate.

After this, the entry’s main lines will exclude the note lines. We have isolated the notes for later output and simplified the remaining text to just the table columns content.
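The note separation described above can be sketched as follows (the function name and the exact prefix regex are assumptions; it assumes note lines follow the main row lines within a grouped entry):

```python
import re

# Matches "NOTE:" and variants like "NOTE 1:" at the start of a line.
NOTE_PREFIX = re.compile(r"^NOTE(?:\s*\d+)?:\s*")

def split_notes(entry_lines):
    main, notes = [], []
    for line in entry_lines:
        stripped = line.strip()
        m = NOTE_PREFIX.match(stripped)
        if m:
            notes.append(stripped[m.end():])   # new note, prefix removed
        elif notes:
            notes[-1] += " " + stripped        # wrapped continuation of last note
        else:
            main.append(line)
    return main, notes
```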

3. Reconstructing the Columns

Now comes the core challenge: taking the lines of an entry (minus notes) and splitting them into the four intended columns. We use a combination of position-based alignment and content-based heuristics:

  • Merge lines for parsing: It is helpful to reassemble the entry’s lines into one continuous string, because the content of each column may be split across lines. Note that a naive sequential join would interleave fragments from different columns (line 1 holds the first fragment of every column, line 2 the second fragment, and so on), so the merge should collect each column’s fragments in reading order, for example by using the spacing/alignment of the original lines to assign each fragment to its column before joining with single spaces. Done that way, the three lines of the example entry (excluding the note) become one string:

    "13.11.1996 1 in yellow Retail Warehouse, The Causeway and River Park Avenue, Staines 25.07.1996 25 years from 25.3.1995 SY664660"
    

    This allows us to search across the whole entry’s text. (We still retain the original break positions for more refined alignment if needed, but a full string is convenient for pattern matching.)

  • Find the Registration Date: Using regex on the entry string, find the first date occurrence. This will be the registration date (by design of the schedule, each entry starts with the registration date). Capture this as registration_date.

    • Example: find 13.11.1996 at the start.
    • Save registration_date = "13.11.1996".
  • Remove Entry Number (if any): If our regex included an entry number (like "1 " before the date), we should strip that out from the text to avoid confusion. The entry number is not needed in the output. In our example, “1 ” (if present) would be removed, leaving the date as the first token.

  • Find the Lessee’s Title: Next, identify the lessee’s title number, which is usually the last column. Title numbers in England/Wales are distinctive: typically one or two (sometimes three) uppercase letters followed by a sequence of digits (e.g. SY664660, EGL568130). We can use a regex for this, such as \b[A-Z]{1,3}\d+\b near the end of the entry text.

    • We should pick the last occurrence of this pattern in the entry, assuming the property description or other fields won’t contain a similar pattern. (Addresses rarely have a pattern of letters followed immediately by digits with no spaces.)
    • If a title number is found, capture it as lessee_title. Remove it (and any trailing whitespace) from the entry string once captured, since we’ve extracted that column.
    • If no title pattern is found, it might mean the lease has no registered title (common for short leases). In that case, we’ll set lessee_title = null (or empty string) in the output, and proceed with parsing the rest by treating the entry text as ending at the lease term.
  • Find the Lease Date: After removing the lessee title from the text, search for the next date occurrence (which should now be the only date left aside from the registration date we already removed). This second date corresponds to the date of the lease itself (the start of column 3 content). Capture this as lease_date.

    • For example, find 25.07.1996 as the lease date.
    • We expect exactly one such additional date per entry in well-formed input. If more than one remains (most commonly because the term itself contains a start date, as in “from 25.3.1995”), we must distinguish them; we handle that in the next step. For the lease execution date specifically, we take the first date that appears in the column 3 content, since in practice the execution date appears before the term’s start date. In the concatenated string “25.07.1996 25 years from 25.3.1995”, the first date is the execution date and the second is part of the term.
    • So we locate the execution date by finding a full dd.mm.yyyy that is not part of a “years from” phrase. One heuristic is formatting: the execution date is usually written in full (dd.mm.yyyy), whereas the term’s start date sometimes omits leading zeros (1.1.2009 rather than 01.01.2009), though this is not guaranteed. A more robust cue is the word “from” (usually in the phrase “years from”): the date after “from” is typically the term commencement date, not the execution date. So we initially capture the first date as lease_date and then double-check whether the text immediately following it reads “years from [date]”.
  • Extract the Lease Term: Once we have the lease_date, the rest of that column (after the lease date) constitutes the lease term. For example, in the text “… 25.07.1996 25 years from 25.3.1995 …”, the portion “25 years from 25.3.1995” is the term. We will:

    • Locate the phrase starting right after the lease_date we found. This typically begins with a number (the number of years) or occasionally a phrase like “Term” or “Expiring on”. Common patterns include "years from" as seen in most cases. We capture everything from the end of the lease_date up until the start of the lessee title (which we already removed) or the end of entry text.
    • Trim this term string and store it as lease_term. In our example, lease_term = "25 years from 25.3.1995".
    • Note: We do not attempt to split the term further (e.g. into number of years and start date) since the output schema just needs the term as one string. But we ensure we’ve got the whole term even if it originally was split across lines (in the example, “25 years from” was on one line and “25.3.1995” on the next – our grouping and joining handles that).
    • If the term includes additional info in parentheses (like “(less 7 days)” or similar, occasionally present), it will be included in this string.
  • Determine the Plan Reference vs Property Description: At this stage, we have identified three of the four main pieces (reg date, lease date, lease term, and lessee title). What remains to parse is the text of column 1 (plan reference) and column 2 (property description). Initially, after removing the known pieces, we’ll have a chunk of text that includes both the plan reference and the property description. We need to split these two. Key heuristics:

    • The plan reference is usually a short phrase that describes how the lease area is drawn on the plan. It often contains keywords like: “edged”, “tinted”, “numbered”, “in”, and a color (common colors: red, blue, brown, green, yellow). It may also include a reference number or letter (e.g. “numbered 4 in blue”) and sometimes the phrase “(part of)” if only part of the property is leased.
    • The property description is typically an address or location description, often containing commas and names (e.g. “Flat 2, 10 High Street, London”). It usually does not contain those plan reference keywords or color names.
    • Strategy to split: We can search within the remaining text for the boundary between plan ref and property. Possible approaches:
      • If the plan reference portion ends with the specific phrase “(part of)”, we can confidently split right after that. For example: “Edged and numbered 4 in blue (part of) [split] Flat 2308…”. Everything up to and including “(part of)” is plan_ref, and everything after is property_description.
      • If there is no "(part of)", look for the last occurrence of a color word (red, blue, yellow, green, brown, pink etc.) or plan-related keyword in the text. Typically, the color word will be part of the plan ref. We assume the plan ref extends through the color word (and any trailing descriptors). The property description likely begins right after that.
      • Another clue: property descriptions often contain a comma followed by a space (addresses have comma-separated elements), whereas plan refs rarely contain commas. For example, in “… in blue Flat 1602, Landmark West Tower…”, there is no comma before "Flat" (the start of the property description), but “Flat 1602, Landmark West Tower” has a comma after the number. If the first comma in the remaining chunk appears to be part of an address (followed by a space and a capitalized word), that comma likely lies inside the property description, so the split between plan ref and property falls just before the word that carries the comma.
      • We can also maintain the alignment info: in the original lines, column 1 and column 2 had separate designated widths. If the text extraction preserved some spacing, we could deduce the split by the largest gap in the first line of the entry. For instance, in the first line of an entry, after the registration date, there might be several spaces before the property description starts. We could calculate the index where property description likely begins by measuring these spaces. However, because spacing is not reliable in all cases, this would be a secondary heuristic.
    • Using these rules, extract plan_ref and property_description:
      • plan_ref: e.g. "1 in yellow" or "Edged and numbered 4 in blue (part of)".
      • property_description: e.g. "Retail Warehouse, The Causeway and River Park Avenue, Staines" or "Flat 2308 Landmark West Tower (twenty third floor flat)".
    • Trim any stray spaces. Ensure we recombine parts that were split across lines (our grouping already ensures the pieces are contiguous in the joined string).
  • Quality check: At this point, we have all fields for the entry. We should validate that the extraction makes sense: we expect to have found two dates (reg_date and lease_date), a non-empty property description, and (usually) a plan ref and lessee title (the latter might be empty for unregistered leases). If, for example, we found only one date or none, or if no property description text was identified, then the entry might be malformed. We will implement a fallback or flagging mechanism (see Fallback logic below) in such cases rather than trust possibly incorrect parsing.
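The extraction order above can be sketched roughly as follows, operating on the flattened, column-ordered entry string from the “Merge lines for parsing” step (the names, colour list, and exact regexes are illustrative assumptions, not a fixed specification):

```python
import re

DATE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")
TITLE = re.compile(r"\b[A-Z]{1,3}\d+\b")
COLOUR = re.compile(r"\b(?:red|blue|brown|green|yellow|pink)\b", re.IGNORECASE)

def parse_entry(text):
    text = re.sub(r"\s+", " ", text).strip()
    # strip a leading entry number such as "1 28.01.2009"
    text = re.sub(r"^\d+\s+(?=\d{1,2}\.\d{1,2}\.\d{4})", "", text)

    reg = DATE.search(text)                    # first date = registration date

    titles = TITLE.findall(text)               # last letters+digits token = title
    lessee_title = titles[-1] if titles else None
    if lessee_title:
        text = text[: text.rfind(lessee_title)].rstrip()

    dates = list(DATE.finditer(text))          # second date = lease execution date
    lease = dates[1] if len(dates) > 1 else None
    lease_date = lease.group() if lease else None
    lease_term = text[lease.end():].strip() if lease else None

    middle = text[reg.end(): lease.start() if lease else len(text)].strip()
    if "(part of)" in middle:                  # plan ref ends after "(part of)"
        cut = middle.index("(part of)") + len("(part of)")
    else:                                      # ...or after the last colour word
        colours = list(COLOUR.finditer(middle))
        cut = colours[-1].end() if colours else 0
    return {
        "registration_date": reg.group(),
        "plan_ref": middle[:cut].strip() or None,
        "property_description": middle[cut:].strip(),
        "lease_date": lease_date,
        "lease_term": lease_term,
        "lessee_title": lessee_title,
    }
```

This sketch deliberately omits the quality checks and positional verification discussed elsewhere in the plan; it is the happy path only.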

4. Handling Complex and Edge Cases

The heuristics above cover the common structure, but we also design for complexities:

  • Multiple lines per column: If any column text was especially long (for instance, a very long property description or a note within the description), the grouping and joining approach ensures we still capture it. The regex and splits are applied on the full entry text, so multi-line content is naturally handled. We just need to ensure we don’t accidentally insert or remove spaces that alter the content. Using single spaces for line joins usually suffices, but for safety, we might want to respect column alignment if given (e.g. we know how many spaces preceded a line in the raw text – if it was clearly indentation for alignment, those could be trimmed). The key is that all meaningful words remain in correct order.
  • Embedded commas and punctuation: Property descriptions can contain commas, and lease term can contain parentheses. Our splitting logic mostly keys off known patterns (dates, keywords) rather than punctuation, so it should not be thrown off by commas in addresses. We will take care not to split on every comma — only use comma as a hint in distinguishing plan ref vs address when other cues are absent.
  • Lease term variations: Most terms say “X years from [date]”. But occasionally, you might see “Term: 99 years from 24.6.2010” or “99 years (less 10 days) from 29.9.1980” or even “999 years from 1.1.2010 (and as for the rent, ... etc)”. Our approach of capturing everything after the lease date until the title should grab these in full. If there are additional dates mentioned in the term (like the 29.9.1980 above), those will simply be part of the lease_term string. We are not attempting to separately parse that second date; it stays as part of the term description.
  • No Lessee’s Title: If no title number was found, we will output lessee_title: null (or "") and still output the other fields. In this case, our detection of the lease term runs to the end of the entry text. We should confirm that in such a scenario the end of the text really is the end of the lease term, not a note (notes were already stripped) or anything else. This is why thorough grouping in step 1 is important.
  • Missing or extra columns: In very rare cases, an entry might be incomplete or have an unusual note inline. We incorporate a confidence flagging mechanism: for example, if we don’t find two dates in an entry where we expect them, or we find a date but no “years from” following it (and no title number where one is expected), we mark that entry with a warning. This could be an additional field like "confidence": "low" or a boolean "parsed_ok": false in the output. This alerts downstream processes or users that the entry may need manual review.
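One possible shape for the flagging mechanism (the field names follow the schema in this plan; the rule set shown is an illustrative minimum, not exhaustive):

```python
import re

# Full dd.mm.yyyy date, anchored to the whole field.
FULL_DATE = re.compile(r"^\d{1,2}\.\d{1,2}\.\d{4}$")

def parsed_ok(entry):
    # lessee_title is deliberately not checked: it is legitimately absent
    # for unregistered (e.g. short) leases.
    return bool(
        entry.get("registration_date")
        and FULL_DATE.match(entry["registration_date"])
        and entry.get("lease_date")
        and FULL_DATE.match(entry["lease_date"])
        and entry.get("property_description")
        and entry.get("lease_term")
    )
```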

5. Fallback Alignment Logic (if needed)

If the content-based parsing fails or is ambiguous for certain entries, an alternative is to use the column positions from the text. The Land Registry schedule usually has fixed column widths. We can determine these by analyzing the positions of known elements: for example, across all entries, the registration dates might all appear within the first 15 characters of the line, the lease dates all appear around character 60-70, etc. The JSON example in the screenshot shows that text was aligned in columns (with spaces padding the fields to fixed widths). A fallback approach:

  • Compute the median index where the second date (lease date) occurs across entries, and similarly for where the title number starts. These indices can serve as cut-off points for columns.
  • For instance, suppose we determine: Column 1 spans characters 0-20, Column 2 spans 21 to the mid-50s, Column 3 spans 55-80, and Column 4 starts at 81. (Exact numbers would come from observing spacing patterns.) We could then slice each entry line or combined entry string at these fixed indices.
  • This approach must be used carefully, since if the text has shifted due to a very long property description not fitting in its allotted space, the alignment might break. But in well-formatted cases, it provides a sanity check.
  • We prefer the content-based method because it adapts to each entry’s actual content, but this positional method can be used to verify or correct obvious mistakes (for example, if our regex parsing finds a title number, but the position of that number in the line is far from where we expect column 4, we might realize it’s not a title number but part of some address by coincidence).

By combining these strategies, we ensure both accuracy (through pattern recognition) and robustness (through position-based verification when needed).
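A sketch of the positional estimate (the function name is ours; the same computation could be repeated for the title-number column):

```python
import re
from statistics import median

DATE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")

def estimate_lease_date_column(first_lines):
    """Median start index of the second date across each entry's first line."""
    starts = []
    for line in first_lines:
        dates = list(DATE.finditer(line))
        if len(dates) >= 2:
            starts.append(dates[1].start())
    return int(median(starts)) if starts else None
```

An entry whose parsed title number sits far from the estimated column boundary can then be flagged for the positional cross-check described above.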

6. Implementation Details for Efficiency

  • We will implement this parsing in a language like Python (or Node, etc.), using compiled regular expressions for efficiency. The approach is deterministic and runs in linear time relative to the length of the input. For each entry, matching regex and splitting strings are very fast (mere microseconds for each, which is suitable even for hundreds of thousands of entries).
  • No external API/LLM calls: The logic uses only local computation (string ops, regex, maybe standard NLP libraries if needed for things like date normalization, though not strictly necessary). This ensures the solution can scale horizontally (processing many documents in parallel, for example) without dependency on rate-limited services.
  • We will unit test the parsing on known examples (like those from the screenshots) and edge cases to ensure the heuristics cover them. We’ll also include logging or debugging modes to output an entry and how it was parsed (especially if flagged low confidence) to facilitate quick adjustments in case new patterns are observed in the data.

Output Schema and Structure

Each parsed lease entry will be output as a JSON object with the following fields:

  • registration_date (string): The registration date of the lease, exactly as it appears (e.g. "13.11.1996").
  • plan_ref (string): The plan reference text associated with that date (e.g. "1 in yellow", "Edged and numbered 4 in blue (part of)"). This describes how the leased property is identified on the plan.
  • property_description (string): The address or description of the leased property. All parts of the address are concatenated into a single line, with commas as appropriate (e.g. "Retail Warehouse, The Causeway and River Park Avenue, Staines" or "Flat 2308 Landmark West Tower (twenty third floor flat)").
  • lease_date (string): The date the lease was executed, as given in the schedule (e.g. "25.07.1996").
  • lease_term (string): The term of the lease. This typically includes the length and the start date of that term, and exactly mirrors the wording in the register (e.g. "25 years from 25.3.1995" or "999 years from 1.1.2009").
  • lessee_title (string or null): The title number of the lessee’s interest (e.g. "SY664660", "EGL568130"). If the lease does not have a registered title, this field can be null or an empty string to indicate none.
  • notes (array of strings): Any notes associated with the entry. Each note will be a separate string in the array (without the "NOTE:" prefix). If an entry has no notes, this can be an empty array or omitted (notes: []). The notes preserve the exact text (aside from removed prefix), for example: ["The Lease comprises also other land"].

All entries from a schedule can be output as a JSON array of these objects, or another suitable structured format. For instance, the final output could be an array under a key like "leases" or similar, but the exact wrapping is flexible. Here we focus on the structure of each entry object as above.
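For concreteness, the schema could be pinned down in Python as a dataclass (an illustrative choice; any equivalent JSON Schema or type definition works):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LeaseEntry:
    registration_date: str
    plan_ref: str
    property_description: str
    lease_date: str
    lease_term: str
    lessee_title: Optional[str]  # None when the lease has no registered title
    notes: List[str] = field(default_factory=list)
```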

Example Output

Using the examples from the provided screenshots, below is a sample of how the parsed output would look. We demonstrate two entries:

  • Entry 1 (from the first screenshot, which included a note about “other land”), and
  • Entry 93 (from the second screenshot, which included a note referencing the Charges Register).

They are shown here as elements in a JSON array for clarity:

[
  {
    "registration_date": "13.11.1996",
    "plan_ref": "1 in yellow",
    "property_description": "Retail Warehouse, The Causeway and River Park Avenue, Staines",
    "lease_date": "25.07.1996",
    "lease_term": "25 years from 25.3.1995",
    "lessee_title": "SY664660",
    "notes": [
      "The Lease comprises also other land"
    ]
  },
  {
    "registration_date": "22.02.2010",
    "plan_ref": "Edged and numbered 4 in blue (part of)",
    "property_description": "Flat 2308 Landmark West Tower (twenty third floor flat)",
    "lease_date": "03.02.2010",
    "lease_term": "999 years from 1.1.2009",
    "lessee_title": "EGL568130",
    "notes": [
      "See entry in the Charges Register relating to a Deed of Rectification dated 26 January 2018"
    ]
  }
]

Explanation of the examples: In the first entry, we successfully split the fields: the plan reference "1 in yellow" is separate from the property address, and the note is correctly attached in the notes array. In the second entry, the plan reference includes “(part of)”, and the note (which began with "NOTE:" in the text) is captured without the prefix in the output. Both entries have the dates and terms appropriately extracted.

This structured output meets the requirements: it preserves all information from the original schedule in a machine-readable format, enabling easy access to each field. The approach described ensures a high accuracy of parsing and is designed to operate efficiently at scale.
