Skip to content

Instantly share code, notes, and snippets.

@joshlk
Last active February 3, 2023 14:02
Show Gist options
  • Save joshlk/92d0e25d847a490b86369ec1c31a39c5 to your computer and use it in GitHub Desktop.
Save joshlk/92d0e25d847a490b86369ec1c31a39c5 to your computer and use it in GitHub Desktop.
Parsing and splitting UK postcodes

Parsing and splitting UK postcodes

You can use this regex to split the postcodes into its constituent parts:area (capture groups a1, a2 ... a5), district (d1...), sector (s1...) and unit (u1...). Adapted from wikipedia:

^(?:(?P<a1>[Gg][Ii][Rr])(?P<d1>) (?P<s1>0)(?P<u1>[Aa]{2}))|(?:(?:(?:(?P<a2>[A-Za-z])(?P<d2>[0-9]{1,2}))|(?:(?:(?P<a3>[A-Za-z][A-Ha-hJ-Yj-y])(?P<d3>[0-9]{1,2}))|(?:(?:(?P<a4>[A-Za-z])(?P<d4>[0-9][A-Za-z]))|(?:(?P<a5>[A-Za-z][A-Ha-hJ-Yj-y])(?P<d5>[0-9]?[A-Za-z]))))) (?P<s2>[0-9])(?P<u2>[A-Za-z]{2}))$

Demonstrated in this Regex101.

Postgres and SQL

The following SQL will extract the separate parts from the postcode using the above Regex (without named capture groups as postgres doesn't support it):

SELECT
    -- 3. Concat parts to create full string
    *
    ,"postcode_area"
    ,CONCAT("postcode_area", "postcode_district_part") AS "postcode_district"
    ,CONCAT("postcode_area", "postcode_district_part", ' ', "postcode_sector_part") AS "postcode_sector"
FROM (
    SELECT
	-- 2. Combine parts from different capture groups
        *
        ,COALESCE(pc_arr[1], pc_arr[5], pc_arr[7], pc_arr[9], pc_arr[11]) AS "postcode_area"
        ,COALESCE(pc_arr[2], pc_arr[6], pc_arr[8], pc_arr[10], pc_arr[12]) AS "postcode_district_part"
        ,COALESCE(pc_arr[3], pc_arr[13]) AS "postcode_sector_part"
        ,COALESCE(pc_arr[4], pc_arr[14]) AS "postcode_unit_part"
    FROM (
	-- 1. Extract postcode into constituent parts
        SELECT *, regexp_matches(
            postcode
            ,'^(?:([Gg][Ii][Rr])() (0)([Aa]{2}))|(?:(?:(?:([A-Za-z])([0-9]{1,2}))|(?:(?:([A-Za-z][A-Ha-hJ-Yj-y])([0-9]{1,2}))|(?:(?:([A-Za-z])([0-9][A-Za-z]))|(?:([A-Za-z][A-Ha-hJ-Yj-y])([0-9]?[A-Za-z]))))) ([0-9])([A-Za-z]{2}))$'
            ,'i'
        ) AS pc_arr
        FROM foo
    ) AS A
) AS B

Demostrated in this fiddle.

Pandas and Python

How to extract the separate parts of a postcode using Pandas and the above Regex:

import pandas as pd
import re

pc = pd.Series(['GIR 0AA', 'M1 1AE','B33 8TH','CR2 6XH','DN55 1PT','W1A 0AX','EC1A 1BB'], name='postcode')

regex = "^(?:(?P<a1>[Gg][Ii][Rr])(?P<d1>) (?P<s1>0)(?P<u1>[Aa]{2}))|(?:(?:(?:(?P<a2>[A-Za-z])(?P<d2>[0-9]{1,2}))|(?:(?:(?P<a3>[A-Za-z][A-Ha-hJ-Yj-y])(?P<d3>[0-9]{1,2}))|(?:(?:(?P<a4>[A-Za-z])(?P<d4>[0-9][A-Za-z]))|(?:(?P<a5>[A-Za-z][A-Ha-hJ-Yj-y])(?P<d5>[0-9]?[A-Za-z]))))) (?P<s2>[0-9])(?P<u2>[A-Za-z]{2}))$"
pc_parts = pc.str.extract(regex, re.IGNORECASE, expand=True)

pc_parsed = pc.reset_index()
pc_parsed['postcode_area'] = pd.concat([pc_parts[['a%s'%i]] for i in range(1,6)], axis=1).bfill(axis=1).iloc[:,0]
pc_parsed['postcode_district_part'] = pd.concat([pc_parts[['d%s'%i]] for i in range(1,6)], axis=1).bfill(axis=1).iloc[:,0]
pc_parsed['postcode_sector_part'] = pd.concat([pc_parts[['s%s'%i]] for i in range(1,3)], axis=1).bfill(axis=1).iloc[:,0]
pc_parsed['postcode_district'] = pc_parsed['postcode_area'] + pc_parsed['postcode_district_part']
pc_parsed['postcode_sector'] = pc_parsed['postcode_area'] + pc_parsed['postcode_district_part'] + ' ' + pc_parsed['postcode_sector_part']
pc_parsed = pc_parsed[['postcode','postcode_area','postcode_district','postcode_sector']]
print(pc_parsed)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment