You can use this regex to split the postcodes into its constituent parts:area (capture groups a1
, a2
... a5
), district (d1
...), sector (s1
...) and unit (u1
...). Adapted from wikipedia:
^(?:(?P<a1>[Gg][Ii][Rr])(?P<d1>) (?P<s1>0)(?P<u1>[Aa]{2}))|(?:(?:(?:(?P<a2>[A-Za-z])(?P<d2>[0-9]{1,2}))|(?:(?:(?P<a3>[A-Za-z][A-Ha-hJ-Yj-y])(?P<d3>[0-9]{1,2}))|(?:(?:(?P<a4>[A-Za-z])(?P<d4>[0-9][A-Za-z]))|(?:(?P<a5>[A-Za-z][A-Ha-hJ-Yj-y])(?P<d5>[0-9]?[A-Za-z]))))) (?P<s2>[0-9])(?P<u2>[A-Za-z]{2}))$
Demonstrated in this Regex101.
The following SQL will extract the separate parts from the postcode using the above Regex (without named capture groups as postgres doesn't support it):
SELECT
-- 3. Concat parts to create full string
*
,"postcode_area"
,CONCAT("postcode_area", "postcode_district_part") AS "postcode_district"
,CONCAT("postcode_area", "postcode_district_part", ' ', "postcode_sector_part") AS "postcode_sector"
FROM (
SELECT
-- 2. Combine parts from different capture groups
*
,COALESCE(pc_arr[1], pc_arr[5], pc_arr[7], pc_arr[9], pc_arr[11]) AS "postcode_area"
,COALESCE(pc_arr[2], pc_arr[6], pc_arr[8], pc_arr[10], pc_arr[12]) AS "postcode_district_part"
,COALESCE(pc_arr[3], pc_arr[13]) AS "postcode_sector_part"
,COALESCE(pc_arr[4], pc_arr[14]) AS "postcode_unit_part"
FROM (
-- 1. Extract postcode into constituent parts
SELECT *, regexp_matches(
postcode
,'^(?:([Gg][Ii][Rr])() (0)([Aa]{2}))|(?:(?:(?:([A-Za-z])([0-9]{1,2}))|(?:(?:([A-Za-z][A-Ha-hJ-Yj-y])([0-9]{1,2}))|(?:(?:([A-Za-z])([0-9][A-Za-z]))|(?:([A-Za-z][A-Ha-hJ-Yj-y])([0-9]?[A-Za-z]))))) ([0-9])([A-Za-z]{2}))$'
,'i'
) AS pc_arr
FROM foo
) AS A
) AS B
Demostrated in this fiddle.
How to extract the separate parts of a postcode using Pandas and the above Regex:
import pandas as pd
import re
pc = pd.Series(['GIR 0AA', 'M1 1AE','B33 8TH','CR2 6XH','DN55 1PT','W1A 0AX','EC1A 1BB'], name='postcode')
regex = "^(?:(?P<a1>[Gg][Ii][Rr])(?P<d1>) (?P<s1>0)(?P<u1>[Aa]{2}))|(?:(?:(?:(?P<a2>[A-Za-z])(?P<d2>[0-9]{1,2}))|(?:(?:(?P<a3>[A-Za-z][A-Ha-hJ-Yj-y])(?P<d3>[0-9]{1,2}))|(?:(?:(?P<a4>[A-Za-z])(?P<d4>[0-9][A-Za-z]))|(?:(?P<a5>[A-Za-z][A-Ha-hJ-Yj-y])(?P<d5>[0-9]?[A-Za-z]))))) (?P<s2>[0-9])(?P<u2>[A-Za-z]{2}))$"
pc_parts = pc.str.extract(regex, re.IGNORECASE, expand=True)
pc_parsed = pc.reset_index()
pc_parsed['postcode_area'] = pd.concat([pc_parts[['a%s'%i]] for i in range(1,6)], axis=1).bfill(axis=1).iloc[:,0]
pc_parsed['postcode_district_part'] = pd.concat([pc_parts[['d%s'%i]] for i in range(1,6)], axis=1).bfill(axis=1).iloc[:,0]
pc_parsed['postcode_sector_part'] = pd.concat([pc_parts[['s%s'%i]] for i in range(1,3)], axis=1).bfill(axis=1).iloc[:,0]
pc_parsed['postcode_district'] = pc_parsed['postcode_area'] + pc_parsed['postcode_district_part']
pc_parsed['postcode_sector'] = pc_parsed['postcode_area'] + pc_parsed['postcode_district_part'] + ' ' + pc_parsed['postcode_sector_part']
pc_parsed = pc_parsed[['postcode','postcode_area','postcode_district','postcode_sector']]
print(pc_parsed)