
@chalg
Last active April 24, 2017 23:11

OpenStreetMap Data Case Study

Map Area

Adelaide, Australia

This map covers where I live, so it will be interesting to see what the database querying reveals.

Problems encountered with the map

I created several small samples of the data and worked with the smallest one most of the time, until I had my auditing procedures working correctly. Using the code snippet provided in Project Details, I changed the k size and created samples called sampleK10.osm, sampleK25.osm, sampleK35.osm and sampleK100.osm. Once I had sampleK100.osm working correctly, I progressively moved up in size. The problems I came across are listed below.

Street Names

  • Abbreviated street names, for example Peacock Rd and Davenport TCE
  • Typographical error in a street name: Jerningham Strert
  • Incorrect concatenation in a street name: Edwin Street23
  • Incorrect value containing the Spanish word for airport: aeropuerto Melbourne

Street names were audited and corrected by adapting the code from the Case Study lesson, where the mapping dictionary and expected list were updated to include local conventions such as Terrace, Mall, Arcade and Parade. In addition to the corrections to street names mentioned above, abbreviations such as St, St., Ave, Rd. and Rd were expanded.
# List of expected street types, extended to include local conventions
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Terrace", "Highway", "Way", "Mall",  "Arcade",
            "Esplanade", "Parade", "Crescent", "East", "Grove", "Link", "Mews", "North", "Plaza",
            "South", "Walk"]
# Dictionary of issues related to local mapping and required updates
mapping = { "St": "Street",
            "St.": "Street",
            "Ave": "Avenue",
            "Rd.": "Road",
            "Rd": "Road",
            "Street23": "Street",
            "Strert": "Street",
            "TCE": "Terrace"
            }
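The adapted helper itself is not shown above; a minimal sketch of what the update_name function (referenced later in UpdateAdelaideOSM.py) might look like, using the mapping dictionary above, is below. The exact project code may differ.

```python
import re

# Matches the last word of a street name (e.g. "Rd" in "Peacock Rd"),
# including an optional trailing full stop (e.g. "Rd.")
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

def update_name(name, mapping):
    """Replace an abbreviated or misspelled street type with its
    corrected form from the mapping dictionary."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name
```

For example, update_name("Davenport TCE", mapping) returns "Davenport Terrace", and names whose last word is not in the mapping pass through unchanged.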

All Street Names were corrected except for the aeropuerto Melbourne for which I was unable to determine a corrected value as Melbourne airport is some 725 kilometres away. This is related to a car rental agency called Thrifty.

Post Codes

There was a consistent issue with post codes whereby a set of four values carried a leading state abbreviation (SA) and a space: ['SA 5075', 'SA 5052', 'SA 5118', 'SA 5016']. The AuditPostCode function compiles sets of correct and incorrect post codes and appends them to lists for analysis. This allowed me to write a regular expression to detect inconsistent post codes and then perform a substitution via the straightforward dctBetterPostCode mapping dictionary below.

# Dictionary of issues related to local post codes and required updates
dctBetterPostCode = { "SA 5075": "5075",
                       "SA 5052": "5052",
                       "SA 5118": "5118",
                       "SA 5016": "5016" }
# Function to correct incorrect post code formats via a regular expression; used later in shape_element
def UpdatePostCode(postcode, dctBetterPostCode):
    """Clear incorrect post code formats.
        Args:
            postcode (str): the original post code
            dctBetterPostCode (dict): maps old to corrected post codes
        Returns:
            BetterPostCode (str): the corrected post code
    """
    # This regex captures any non-digit prefix ahead of the standard 4 digits
    regPostCode = re.compile(r'\D{1,}\d{4}', re.IGNORECASE)
    p = regPostCode.search(postcode)
    BetterPostCode = postcode
    if p:
        if p.group() in dctBetterPostCode:
            print("Before:", postcode)
            # Replace the incorrect post code with the better post code
            BetterPostCode = postcode.replace(p.group(), dctBetterPostCode[p.group()])
            print("After:", BetterPostCode)
        else:
            print("Passed")

    return BetterPostCode

# Output the correct and incorrect post codes to understand how to deal with the incorrect ones
print(AuditPostCode(OSMFILE))

Both the update_name and UpdatePostCode functions are called within the shape_element function to correct the issues discovered during auditing and to apply those changes to the resulting CSV files. Both functions have been added to the Helper Functions section of the main code (UpdateAdelaideOSM.py).
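As a rough illustration of that wiring (the function name clean_tag and the simplified helper bodies below are mine, not the project's), the dispatch inside shape_element might look like:

```python
# Simplified stand-ins for the real helpers in UpdateAdelaideOSM.py
def update_name(name, mapping):
    last = name.split()[-1]
    return name[:-len(last)] + mapping.get(last, last)

def UpdatePostCode(postcode, dctBetterPostCode):
    return dctBetterPostCode.get(postcode, postcode)

def clean_tag(key, value, mapping, dctBetterPostCode):
    """Apply the street name and post code corrections while shaping a tag."""
    if key == "addr:street":
        return update_name(value, mapping)
    if key == "addr:postcode":
        return UpdatePostCode(value, dctBetterPostCode)
    return value
```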

Data Import into SQLite3 Database

I found that every second row in the exported CSV files was blank; I understand this is a common issue on Windows. By searching forums, I found some Python code to adapt, using the csv module, to remove the blank rows.

# Remove blank lines from a CSV (a blank line between every record is a common issue on Windows)
import csv

in_fnam = '2ways_tags.csv'
out_fnam = 'ways_tags.csv'

with open(in_fnam, newline='') as infile, open(out_fnam, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        if any(row):
            writer.writerow(row)

I also encountered a datatype mismatch during import and realised it was caused by the header row in the CSV files. I deleted the headers manually and the imports then completed successfully.

Data Overview and Additional Ideas

Below are some basic statistics about the dataset.

File Sizes

  • Adelaide_australia.osm - 220 MB
  • AdelaideOSM.db - 116 MB
  • nodes.csv - 83.7 MB
  • nodes_tags.csv - 2.71 MB
  • ways.csv - 7.78 MB
  • ways_tags.csv - 10.7 MB
  • ways_nodes.csv - 23.9 MB

Number of nodes

SELECT COUNT(*) FROM nodes;

1064496

Number of ways

SELECT COUNT(*) FROM ways;

137857

Number of unique users

SELECT COUNT(DISTINCT(e.uid))          
FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;

749

Top ten shop types

SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags 
WHERE nodes_tags.key='shop'
GROUP BY nodes_tags.value
ORDER BY num DESC LIMIT 10;
Value Count
supermarket 142
hairdresser 77
survey 74
yes 70
clothes 61
bakery 51
convenience 47
alcohol 45
car_repair 35
bicycle 32

The above highlights a potential improvement: a value of “yes” provides little analytical value and is inconsistent with the other values, so it would be beneficial to replace it with a proper shop type.

Number of railway level crossings

SELECT nodes_tags.value, COUNT(*) as count
FROM nodes_tags
WHERE nodes_tags.value = 'level_crossing'
GROUP BY nodes_tags.value;
Value Count
level_crossing 377

The number of level crossings in metropolitan areas normally decreases with population density, because they cause traffic snarls and are more dangerous. They are, however, expensive to redevelop, so governments usually do not act until there is significant public pressure.

Top 20 appearing amenities

SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='amenity'
GROUP BY value
ORDER BY num DESC
LIMIT 20;
Value Count
bench 1823
fast_food 283
restaurant 270
toilets 247
cafe 213
drinking_water 193
parking 180
post_box 145
pub 125
survey 105
bbq 89
telephone 89
fuel 84
bank 83
waste_basket 81
place_of_worship 80
shelter 79
pharmacy 73
post_office 60
atm 52

Average number of levels in buildings

SELECT AVG(value) as ave
FROM ways_tags
WHERE key='levels' AND type='building'  AND value!='ground floor';

2.838
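This works even though tag values are stored as text: SQLite coerces numeric-looking strings inside AVG, while non-numeric strings coerce to 0, which is why the 'ground floor' rows must be excluded. A quick in-memory demonstration (schema reduced to the relevant columns):

```python
import sqlite3

# Build a tiny ways_tags table with two numeric levels and one text value
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ways_tags (key TEXT, value TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO ways_tags VALUES ('levels', ?, 'building')",
    [("2",), ("4",), ("ground floor",)],
)

avg, = conn.execute(
    "SELECT AVG(value) FROM ways_tags "
    "WHERE key='levels' AND type='building' AND value!='ground floor'"
).fetchone()
print(avg)  # 3.0
```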

Top ten cuisines

SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags 
    JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='restaurant') c
    ON nodes_tags.id=c.id
WHERE nodes_tags.key='cuisine'
GROUP BY nodes_tags.value
ORDER BY num DESC LIMIT 10;
Value Count
italian 27
chinese 22
indian 22
pizza 15
thai 15
survey 13
vegetarian 10
vietnamese 9
asian 8
burger 8

Top 10 contributing users

SELECT u.user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) u
GROUP BY u.user
ORDER BY num DESC
LIMIT 10;
User Count
CloCkWeRX 341972
o'cholio 148984
marquisite 60375
Didz 50773
Josh_G 46592
nickbarker 41948
hryciuk 40527
Malco! 33063
Qwertii 30584
KNAPPO 27328

Top 10 sources

SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='source'
GROUP BY value
ORDER BY num DESC
LIMIT 10;
Value Count
PGS 806
survey 712
nearmap 276
Bing 257
Yahoo 241
yahoo 171
Survey 36
bing 30
local_knowledge 19

Other ideas about the dataset

  • User statistics:
    • The top ten users account for 68% of all contributions (see the Top 10 contributing users query)
    • The top user (CloCkWeRX) accounts for 28% of all contributions
    • 721 of 749 users (96%) each contribute less than 1% of the total
    • Most updates are therefore performed by only a few users

Source:

sqlite> .output All_Users.csv
sqlite> SELECT u.user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) u
GROUP BY u.user
ORDER BY num DESC;
sqlite> .output stdout
  • Gamification - integration with popular games like Pokemon Go may prove to be a worthwhile consideration.
  • Mentions - pop-ups on websites or mobile devices suggesting local amenities when using maps, with a mention of the highest contributor on click-through.
  • Webpage - the OpenStreetMap homepage is looking dated.
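The contribution percentages in the User statistics bullets above can be sanity-checked from the counts already reported; a quick calculation in Python (figures copied from the earlier queries):

```python
total = 1064496 + 137857  # nodes + ways
top_ten = [341972, 148984, 60375, 50773, 46592,
           41948, 40527, 33063, 30584, 27328]

print(round(100 * sum(top_ten) / total))  # 68
print(round(100 * top_ten[0] / total))    # 28
```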

Benefits of the above ideas

  • Gamification - it may be beneficial to provide incentives to diversify the user base to increase the probability that local area data is captured accurately, including new content.
  • Mentions – provide incentives for more users to get involved.
  • Webpage - new contributors are not inclined to trust a site that looks unmaintained or old. If the site could be rejuvenated, this could provide impetus for more contribution from users.

Anticipated problems

  • Gamification – will require a significant amount of development work and associated costs.
  • Mentions – will require updates to website functionality and the introduction or improvement of smartphone applications.
  • Many users go by nicknames, so only close friends will know who owns which alias.
  • Webpage – requires redevelopment and functionality upgrades. The home page supports three views, which increases the amount of work required for any change. Changes of this sort often spark debate about whether to do a total redesign or an incremental upgrade.

Conclusion

This dataset is cleaner than I expected, although there are some issues, which I have pointed out, such as a value of “yes” for shops in the nodes_tags table. It should also be noted that the tag value “survey” appeared under the keys amenity, shop and cuisine in the above queries, where one would expect a more relevant value. Other inconsistencies mainly revolve around abbreviations and a couple of typographical errors. It would be worthwhile contributing the information I have audited and cleaned back to the map, as it improves the overall quality of the dataset, and reviewing tag values like “yes” and “survey” to see whether better values can be supplied.

The last query above highlights that standardising capitalisation would improve analysis on the source key: there are duplicate rows for differently cased Yahoo, Bing and Survey values, which makes aggregation more difficult. My map area takes in a coastline, which may explain the large number of Prototype Global Shoreline (PGS) source values. In addition to these granular improvements, I suggested some overarching improvements, along with their pros and cons, in the Other ideas about the dataset section.
