@cutterkom
Last active April 14, 2022 10:26
spaCy German address patterns
# Goal: Fetch German street addresses
# Works on:
# Müllerstr. 26
# Müllerstr 26
# Müllerstraße
# Müllerstraße 26
# Müllerplatz
# Müllerstraße 26a
# Müller Straße 26
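# Two versions (both assign `patterns`, so use one or the other):
# 1) one combined pattern with optional tokens
# 2) multiple patterns, one per address format
# Quirk: if the house number is at the end of a sentence, spaCy keeps the period
# attached to the number token (e.g. "26."), so that punctuation has to be matched as well.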
# "(ss|ß)" accepts both spellings; a character class "[ssß]" would only match a single character
street_labels = r".*(platz|[Ss]tra(ss|ß)e|str)$"
patterns = [
    {"label": "ADR",
     "pattern": [
         {"TEXT": {"REGEX": street_labels}},
         # optional punctuation after the abbreviation: Müllerstr. 26 or Müllerstr 26
         {"IS_PUNCT": True, "OP": "?"},
         # the house number can have several formats: 2, 26, 266, 2a, 22a, 222a;
         # the last six shapes catch house numbers at the end of a sentence (e.g. "26.").
         # There might be a better solution out there...
         {"SHAPE": {"IN": ["d", "dd", "ddd", "dddx", "ddx", "dx", "d.", "dd.", "ddd.", "dx.", "ddx.", "dddx."]}, "OP": "?"}
     ]},
    # if the street name has two parts: Müller Straße
    {"label": "ADR",
     "pattern": [
         {"SHAPE": "Xxxxx", "OP": "?"},
         {"TEXT": "Straße"},
         {"IS_PUNCT": True, "OP": "?"},
         {"SHAPE": {"IN": ["d", "dd", "ddd", "dddx", "ddx", "dx", "d.", "dd.", "ddd."]}, "OP": "?"}
     ]}
]
# Version 2: multiple patterns, one per address format (this reassigns `patterns`)
patterns = [
# Müllerstraße 26. (with punct at the end!)
{"label": "ADR", "pattern": [{"TEXT": {"REGEX": street_labels}}, {"SHAPE": {"IN": ["d.", "dd.", "ddd."]}}]},
# Müllerstr. 26
{"label": "ADR", "pattern": [{"TEXT": {"REGEX": street_labels}}, {"IS_PUNCT": True, "OP": "?"}, {"IS_DIGIT": True}]},
# Müllerstraße 10a
{"label": "ADR", "pattern": [{"TEXT": {"REGEX": street_labels}}, {"IS_PUNCT": True, "OP": "?"}, {"SHAPE": {"IN": ["dddx", "ddx", "dx"]}}]},
# Müller Straße 23
{"label": "ADR", "pattern": [{"SHAPE": "Xxxxx"}, {"TEXT": {"REGEX": street_labels}}, {"IS_DIGIT": True, "OP": "?"}]},
# Müllerstraße 26
{"label": "ADR", "pattern": [{"TEXT": {"REGEX": street_labels}}, {"IS_DIGIT": True}]},
# Müllerstraße or Odeonsplatz
{"label": "ADR", "pattern": [{"TEXT": {"REGEX": street_labels}}]},
]
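
A note on the `street_labels` suffix regex: a character class such as `[ssß]` matches exactly one character, so it accepts "Straße" but not "Strasse", while the alternation `(ss|ß)` accepts both spellings. The sketch below checks this with Python's standard `re` module. It assumes spaCy's `REGEX` predicate searches the token text the way `re.search` does; the helper name `matches_street` and the sample tokens are made up for illustration.

```python
import re

# Suffix regex with a character class: "[ssß]" consumes a single "s" or "ß".
CHAR_CLASS = r".*(platz|[Ss]tra[ssß]e|str)$"
# Variant with an alternation: "(ss|ß)" accepts either "ss" or "ß".
ALTERNATION = r".*(platz|[Ss]tra(ss|ß)e|str)$"


def matches_street(regex: str, token_text: str) -> bool:
    """Hypothetical helper: True if the token text matches the suffix regex."""
    return re.search(regex, token_text) is not None


# Sample tokens (illustrative only, not taken from real data).
samples = ["Müllerstraße", "Müllerstrasse", "Müllerstr", "Odeonsplatz", "Bahnhof"]

for token_text in samples:
    # Print the token and whether each regex variant accepts it.
    print(token_text,
          matches_street(CHAR_CLASS, token_text),
          matches_street(ALTERNATION, token_text))
```

If the input may use the "ss" spelling (standard in Swiss German, or wherever "ß" is unavailable), the alternation variant is the safer choice.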