Skip to content

Instantly share code, notes, and snippets.

@phillipoertel
Last active October 9, 2024 16:43
Show Gist options
  • Save phillipoertel/04678da7064ac80602c8e7b9456cb107 to your computer and use it in GitHub Desktop.
Save phillipoertel/04678da7064ac80602c8e7b9456cb107 to your computer and use it in GitHub Desktop.
Fixing city names
#
# 1. define replacements
#
replacements = {
'germany' => {
# this has the following structure:
# 'correct city name' => ['variant to fix 1', 'variant to fix 2', ...]
# the variants DO NOT repeat the correct city name, neither downcased nor capitalized.
'Berlin' => ['Берлин', 'Berlin, Stadt', 'Berlin / Biesdorf', 'Berlin-Prenzlauer Berg', 'Berlin-Wedding'],
'Frankfurt' => ['Frankfurt am Main', 'Frankfurt am Main-Süd', 'Франкфурт-на-Майне'],
'Hamburg' => ['Hamburg-Barmbek', 'Hamburg-Eilbek', 'Hamburg-Ottensen', 'Hamburg - Wandsbek', 'Hamburger', 'Hamburg-Barmbek'],
'Cologne' => ['Köln', 'Koln', 'Koeln', 'Cologne, Urban District'],
'Munich' => %w[München Munchen],
'Stuttgart' => ['Stuttgart-Mitte', 'Stuttgart / Feuerbach', 'Stuttgart-Stuttgart-West']
},
'france' => {
'Paris' => ['Paris 20th arrondissement']
}
}
#
# 2. generate the replacements
#
queries = []
replacements.each do |country, mapping|
mapping.each do |city, variants|
default_conditions = "AND country='#{country}' AND status=0";
# first do the lowercase buth other wise correct version of the city name ("berlin", "frankfurt", hamburg", ...)
queries << "UPDATE listings SET city='#{city}' WHERE city='#{city.downcase}' #{default_conditions};";
# then all variants. by downcasing both the city and variant, we catch all capitalization variations.
variants.each do |variant|
queries << "UPDATE listings SET city='#{city}' WHERE LOWER(city)='#{variant.downcase}' #{default_conditions};";
end
end
#
# 3. run each query separately somehow ...
# ideally through a psql console. That has little overhead and the connection won't be killed by AWS.
# you can put short sleeps inbetween queries in order not to overload the system
#
# psql -h <host> -U <username> -d <database> -c "<SQL query>"
#
queries.each do |query|
cmd = %(psql -h <host> -U <username> -d <database> -c "#{query}")
# run like this: `#{cmd}`
# then:
# `sleep 1`
puts cmd
end
# now check if there's any garbage left.
replacements.each do |city, variants|
"SELECT DISTINCT city FROM listings WHERE country='#{country}' AND status=0 AND city ILIKE '#{city}%'"
# if the above is too slow, do a COUNT(*) with the ILIKE first and then a SELECT city .... LIMIT 100.
end
# If you found new variations, update the replacements hash in 1. (REMOVING what's already fixed) and do another round.
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment