@colinricardo
Created April 9, 2025 22:11

cool, i’ll write up a short explanation of the approach and give you python code to parse and structure the lease schedule entries like the ones you shared

starting now—i’ll break down how to reliably extract the 4 canonical fields (registration info, property desc, lease date/term, lessee title), plus notes and structure them cleanly

will drop the code shortly

Transforming "Schedule of Notices of Leases" Data into Structured Records

Explanation of Approach

The Schedule of Notices of Leases section is provided as an array of text lines with a lossy table structure (see the Orbital Witness technical assessment repo, jonathanbrincat/orbitalwitness.technical-assessment). Our goal is to reliably extract the four main columns and any notes for each lease entry. The columns (and example values) are:

  1. Registration Date and Plan Reference – e.g. “28.01.2009 Transformer Chamber tinted blue (part of)”
  2. Property Description – e.g. “Ground Floor”
  3. Date of Lease and Term – e.g. “23.01.2009 99 years from 23.1.2009”
  4. Lessee’s Title – e.g. “EGL551039”

Additional Notes may follow, prefixed by "NOTE:" and spanning all columns.

Challenges: The raw lines lose the original column boundaries; columns are separated by variable whitespace which may be inconsistent or missing. For example, if one column’s text is long, it may wrap to a new line without clear delimiters. We must reconstruct the columns by detecting patterns (dates, the title number format, known keywords) and using the relative positioning of text and spaces.
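To make the whitespace-gap idea concrete, here is a minimal sketch of splitting one such line (the sample line is hypothetical, assembled from the example values above):

```python
import re

# A hypothetical line carrying the tail of column 2 plus columns 3 and 4,
# separated by the runs of spaces left over from the original table layout.
line = "Ground Floor    23.01.2009 99 years from 23.1.2009    EGL551039"
segments = re.split(r' {2,}', line)
print(segments)
# → ['Ground Floor', '23.01.2009 99 years from 23.1.2009', 'EGL551039']
```

Single spaces inside a column (e.g. "Ground Floor") survive the split; only runs of two or more spaces are treated as column boundaries.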

Solution Outline:

  • Identify and separate notes: First, filter out any lines starting with "NOTE:" as these belong to the notes field for the entry.
  • Iterate through the remaining lines, maintaining a current_col pointer (1 to 4 for the four columns). We accumulate text into the appropriate column string (col1 through col4).
  • Split lines by large whitespace gaps: Splitting on two or more consecutive spaces with a regex detects when a line contains multiple columns side by side. Each segment corresponds to a column, and we assign the segments in order to the current and subsequent columns. For example, a line split into three segments might contain the end of column 2, column 3, and column 4 respectively.
  • Detect column transitions at line breaks: If a new line begins without an obvious gap (meaning it’s a continuation of some column or the start of a new one), we use context cues:
    • If the line starts with a date (DD.MM.YYYY format) and we’ve already captured the registration date, we assume this is the lease date, so we switch to column 3 (Date of Lease and Term).
    • If the line looks like a title number (typically uppercase letters followed by digits) and we are in column 3, we switch to column 4 (Lessee’s Title).
    • Otherwise, if we are still in column 1 but have already seen the registration date on a previous line, a new line likely indicates the start of Property Description (column 2) – unless it clearly continues the previous text (e.g. starts with a lowercase letter, suggesting the sentence from column 1 continues).
  • Maintain state between lines: We keep track of which column we’re filling. If a line provides content for multiple columns (due to whitespace separation), we know columns up to a certain point are completed on that line, and we move the pointer accordingly. If a column doesn’t receive all its content on that line (e.g. it wraps to next line), the current_col remains pointing to that column for the next line.
  • Regex patterns: We use regex to recognize dates and title numbers. Dates are identified by the pattern \d{1,2}\.\d{1,2}\.\d{4} (e.g. "23.01.2009"), and title numbers by ^[A-Z]+\d+$ (one or more letters followed by digits, e.g. "EGL551039"). These patterns help trigger the switches for column 3 and 4.
  • Edge cases:
    • Lines where one column’s text flows into multiple lines (no clear gaps) are handled by the rule that if a new line starts with lowercase (likely mid-sentence), it’s a continuation of the same column.
    • If an expected column is missing (e.g. no explicit lessee’s title on that entry), the code will leave that field blank rather than crashing.
    • Multiple note lines are collected into a list of notes. Each note line is kept intact, including its "NOTE:" prefix.
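The lowercase-continuation rule above can be sketched as a small standalone helper (is_continuation is a hypothetical name for illustration, not part of the parser below):

```python
def is_continuation(line: str) -> bool:
    # A line that starts with a lowercase letter is assumed to be
    # mid-sentence, i.e. a continuation of the column already being filled.
    stripped = line.strip()
    return bool(stripped) and stripped[0].isalpha() and stripped[0].islower()

print(is_continuation("tinted blue (part of)"))  # True: continues column 1
print(is_continuation("Ground Floor"))           # False: starts a new column
print(is_continuation("23.01.2009"))             # False: digits, handled by the date rule
```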

This approach uses straightforward string parsing and state management, making it scalable to thousands of records without heavy performance overhead. We avoid any external ML or API calls, focusing on deterministic rules and patterns for accuracy.
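As a quick sanity check, the two patterns named in the outline behave as follows on the example values:

```python
import re

date_pattern = re.compile(r'^\d{1,2}\.\d{1,2}\.\d{4}')
title_pattern = re.compile(r'^[A-Z]+\d+$')

print(bool(date_pattern.match("23.01.2009 99 years from 23.1.2009")))  # True: line opens with a lease date
print(bool(title_pattern.match("EGL551039")))    # True: letters then digits, whole line
print(bool(title_pattern.match("Ground Floor")))  # False: the anchors reject interior spaces
```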

Python Code

Below is the Python implementation following the above logic. The code is commented to explain each step and to highlight handling of tricky scenarios:

import re

def parse_schedule_entries(entries):
    # Regex for a date in format DD.MM.YYYY (with D and M possibly 1 or 2 digits)
    date_pattern = re.compile(r'^\d{1,2}\.\d{1,2}\.\d{4}')
    # Regex for a title number (one or more capital letters followed by digits)
    title_pattern = re.compile(r'^[A-Z]+\d+$')
    
    parsed_entries = []
    for entry in entries:
        entry_num = entry.get('entryNumber')
        lines = entry.get('entryText', [])
        
        # Separate note lines from main content lines
        notes = []
        main_lines = []
        for line in lines:
            if line.strip().startswith("NOTE:"):
                notes.append(line.strip())
            else:
                main_lines.append(line)
        
        # Initialize column text accumulators
        col1 = col2 = col3 = col4 = ""
        current_col = 1
        first_date_seen = False  # flag to indicate we've seen the registration date
        
        for raw_line in main_lines:
            # Remove trailing spaces (used for alignment in original format) but keep internal spacing
            line_text = raw_line.rstrip()
            if line_text == "":
                continue  # skip empty lines if any
            
            # Split the line by sequences of 2 or more spaces (which indicate column boundaries in the text)
            segments = re.split(r' {2,}', line_text)
            
            if len(segments) == 1:
                # No obvious gap – this line is part of a single column's text
                seg_text = segments[0].strip()
                if seg_text == "":
                    continue  # skip if it was just spaces
                
                # Check for transitions indicated by content:
                if (not first_date_seen) and date_pattern.match(seg_text):
                    # This is the first date (registration date) at the start of the entry
                    col1 += ("" if col1 == "" else " ") + seg_text
                    first_date_seen = True
                    # Stay in col1 unless a new column is detected later
                    continue
                elif first_date_seen and current_col < 3 and date_pattern.match(seg_text):
                    # We already saw a registration date, and here is another date – this must be the lease date (start of col3)
                    current_col = 3
                    col3 += ("" if col3 == "" else " ") + seg_text
                    continue
                elif current_col >= 3 and title_pattern.match(seg_text):
                    # We are in lease info, and this segment looks like a title number – start col4
                    current_col = 4
                    col4 += ("" if col4 == "" else " ") + seg_text
                    continue
                elif current_col == 1 and first_date_seen and col2 == "":
                    # We are still in col1, but we already placed the registration date and haven't started col2 yet.
                    # This likely means this new line is the start of the property description (col2),
                    # unless it clearly continues col1 (e.g., starts lowercase indicating the plan ref text continues).
                    if seg_text and seg_text[0].isalpha() and seg_text[0].islower():
                        # Lowercase start - assume it's continuing the plan reference in col1
                        col1 += ("" if col1 == "" else " ") + seg_text
                    else:
                        # New line, not lowercase start - treat as start of property description (col2)
                        current_col = 2
                        col2 += ("" if col2 == "" else " ") + seg_text
                    continue
                
                # Default case: add this text to whatever column we're currently populating
                if current_col == 1:
                    col1 += ("" if col1 == "" else " ") + seg_text
                    # If this text starts with a date and we haven't flagged the first date yet, do so
                    if not first_date_seen and date_pattern.match(seg_text):
                        first_date_seen = True
                elif current_col == 2:
                    col2 += ("" if col2 == "" else " ") + seg_text
                elif current_col == 3:
                    col3 += ("" if col3 == "" else " ") + seg_text
                elif current_col == 4:
                    col4 += ("" if col4 == "" else " ") + seg_text
                continue  # move to next line
            
            # If we reach here, the line has multiple segments separated by large spaces,
            # meaning multiple columns' content appear on this single line.
            start_col = current_col  # the column index at which this line's first segment belongs
            # Assign each segment to the appropriate column in sequence
            for offset, segment in enumerate(segments):
                text = segment.strip()
                if text == "":
                    continue  # skip purely empty segments (extra spaces)
                col_index = start_col + offset  # the column number for this segment
                if col_index == 1:
                    col1 += ("" if col1 == "" else " ") + text
                    # Mark registration date seen if this segment starts with a date
                    if not first_date_seen and date_pattern.match(text):
                        first_date_seen = True
                elif col_index == 2:
                    col2 += ("" if col2 == "" else " ") + text
                elif col_index == 3:
                    col3 += ("" if col3 == "" else " ") + text
                elif col_index == 4:
                    col4 += ("" if col4 == "" else " ") + text
            # Determine the last column index actually filled on this line.
            # Use the offset of the last non-empty segment so that an empty
            # leading segment still advances the index, consistent with the
            # assignment loop above.
            last_offset = max(i for i, s in enumerate(segments) if s.strip() != "")
            end_col = start_col + last_offset
            # Continue from the last column filled (capped at 4) on the next line
            current_col = min(end_col, 4)
        
        # Trim any extra whitespace from the accumulated column texts
        col1 = col1.strip()
        col2 = col2.strip()
        col3 = col3.strip()
        col4 = col4.strip()
        
        # Construct the output dictionary for this entry
        parsed_entries.append({
            "entryNumber": entry_num,
            "registration": col1,
            "propertyDescription": col2,
            "leaseDateAndTerm": col3,
            "lesseeTitle": col4,
            "notes": notes
        })
    
    return parsed_entries

Example Usage

When applied to the sample input below, the function outputs a structured list of dictionaries:

entries = [
    {
        "entryNumber": "1",
        "entryText": [
            "28.01.2009 Transformer Chamber tinted blue (part of)",
            "Ground Floor",
            "23.01.2009 99 years from 23.1.2009",
            "EGL551039",
            "NOTE: sample note 1",
            "NOTE: sample note 2"
        ]
    }
]

import json
print(json.dumps(parse_schedule_entries(entries), indent=2))

Output:

[{
  "entryNumber": "1",
  "registration": "28.01.2009 Transformer Chamber tinted blue (part of)",
  "propertyDescription": "Ground Floor",
  "leaseDateAndTerm": "23.01.2009 99 years from 23.1.2009",
  "lesseeTitle": "EGL551039",
  "notes": ["NOTE: sample note 1", "NOTE: sample note 2"]
}]

This matches the expected structured format for each lease entry, with each field cleanly extracted. The solution prioritizes correctness and clarity, using straightforward string parsing techniques instead of external ML, making it suitable for production use on large datasets.
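For completeness, the note-separation step at the top of the parser can be exercised in isolation (split_notes is a hypothetical helper mirroring the first loop of parse_schedule_entries):

```python
def split_notes(lines):
    # Partition an entry's lines into main content and "NOTE:"-prefixed notes,
    # keeping the prefix as part of each note's text.
    notes, main = [], []
    for line in lines:
        (notes if line.strip().startswith("NOTE:") else main).append(line.strip())
    return main, notes

main, notes = split_notes([
    "28.01.2009 Transformer Chamber tinted blue (part of)",
    "NOTE: sample note 1",
])
print(main)   # ['28.01.2009 Transformer Chamber tinted blue (part of)']
print(notes)  # ['NOTE: sample note 1']
```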

Sources: The understanding of the data format and its challenges is based on the Land Registry Title Register API description in the Orbital Witness technical assessment repository (jonathanbrincat/orbitalwitness.technical-assessment) and known fixed-width table behaviours.
