@ncalm
Last active August 24, 2025 17:45
You are a data cleaning assistant. I will give you a list of messy addresses. Standardize them into a clean table with the following columns: street address, city, state/province, postal code, country.
Rules:
- Use appropriate capitalization.
- Always spell out the full name of a state or province. Do not use abbreviations.
- Remove non-city locality descriptors (e.g., downtown, midtown, metro area, greater, borough of, city of). Do not place them in the city or street address.
- If a city token includes directional prefixes/suffixes (SE, NW, North, South), discard those and return only the clean city name; do not attach the markers to the street address.
- If the city is missing and a postal code is available, infer the city from the postal code.
- If the state/province is missing but the street address and city are available, infer the state/province.
- If a state/province is present but the country is missing, infer the country from it (e.g., Ontario → Canada; TX/CO/AL → United States).
- If the postal code is missing and street address + city + state/province are present, infer the postal code using country-appropriate formats (US: ##### or #####-####; Canada: A1A 1A1, uppercase).
- Only return the postal code if it is an exact match for the street address, city, and state/province. If no exact match is available, leave the postal code blank.
={"1111 flornce strt, london, ontario";"123 Cypress Ave, Texas 77031";"1216 Pearl St, downtown boulder, united states"}
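
Several of the rules above are mechanical enough to spot-check in code once the assistant returns its table. The following is a minimal Python sketch of such a check, not part of the prompt itself. The postal code patterns and the descriptor list come straight from the rules; the function names and the small `COUNTRY_BY_REGION` lookup are hypothetical and deliberately incomplete.

```python
import re

# Postal code formats as stated in the rules (US: ##### or #####-####; Canada: A1A 1A1, uppercase).
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")
CA_POSTAL = re.compile(r"^[A-Z]\d[A-Z] \d[A-Z]\d$")

# Hypothetical, incomplete lookup covering only the examples named in the rules.
COUNTRY_BY_REGION = {
    "Ontario": "Canada",
    "Texas": "United States",
    "Colorado": "United States",
    "Alabama": "United States",
}

# Non-city locality descriptors listed in the rules.
DESCRIPTORS = re.compile(
    r"\b(downtown|midtown|metro area|greater|borough of|city of)\b",
    re.IGNORECASE,
)


def postal_code_ok(code: str, country: str) -> bool:
    """Return True if the code is blank or matches the country's format."""
    if code == "":
        return True  # blank is allowed when no exact match is available
    if country == "United States":
        return bool(US_ZIP.match(code))
    if country == "Canada":
        return bool(CA_POSTAL.match(code))
    return False  # assumption: countries outside the rules are flagged for review


def clean_city(raw: str) -> str:
    """Drop locality descriptors, collapse whitespace, and title-case the rest."""
    stripped = DESCRIPTORS.sub(" ", raw)
    return " ".join(stripped.split()).title()


def infer_country(region: str) -> str:
    """Look up the country for a spelled-out state or province; blank if unknown."""
    return COUNTRY_BY_REGION.get(region, "")
```

A check like this only verifies format and obvious rule violations. It cannot confirm that an inferred postal code is an exact match for the address, which still needs a lookup against an authoritative source.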