This map covers the area where I live, so it will be interesting to see what the database querying reveals.
Problems encountered with the map
I created several small samples of the data and worked with the smallest one most of the time, until my auditing procedures were working correctly. Using the code snippet provided in Project Details, I changed the k size and created samples called sampleK10.osm, sampleK25.osm, sampleK35.osm and sampleK100.osm. Once I had sampleK100.osm working correctly, I progressively moved up in size. The problems I came across are described in the sections below.
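Before turning to those problems, here is a minimal sketch of the sampling step, assuming the usual take-every-k-th-top-level-element pattern from the project notes; the file names and the value of k are illustrative rather than the exact code I ran.
import xml.etree.ElementTree as ET

OSM_FILE = "adelaide_australia.osm"   # illustrative input file
SAMPLE_FILE = "sampleK10.osm"         # illustrative output file
k = 10                                # keep every k-th top-level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield each top-level element, clearing the tree as we go to limit memory use."""
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

with open(SAMPLE_FILE, 'wb') as output:
    output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write(b'<osm>\n')
    # Write every k-th top-level element to the sample file
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))
    output.write(b'</osm>\n')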
Street Names
- Abbreviated street names, for example: Peacock Rd, Davenport TCE
- A typographical error in a street name: Jerningham Strert
- A street name with a number incorrectly concatenated onto it: Edwin Street23
- An incorrect value containing a Spanish word: aeropuerto Melbourne
Street names were audited and corrected by adapting the code from the Case Study lesson, with the expected list and mapping dictionary updated to include local conventions such as Terrace, Mall, Arcade and Parade. In addition to the corrections mentioned above, abbreviations such as St, St., Ave, Rd. and Rd were expanded.
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
"Trail", "Parkway", "Commons", "Terrace", "Highway", "Way", "Mall", "Arcade",
"Esplanade", "Parade", "Crescent", "East", "Grove", "Link", "Mews", "North", "Plaza",
"South", "Walk"]
# Dictionary of issues related to local mapping and required updates
mapping = { "St": "Street",
"St.": "Street",
"Ave": "Avenue",
"Rd.": "Road",
"Rd": "Road",
"Street23": "Street",
"Strert": "Street",
"TCE": "Terrace"
}
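For reference, a minimal sketch of how update_name might apply this mapping, following the case-study pattern of matching the final word of the street name; the actual function in UpdateAdelaideOSM.py may differ in detail.
import re

# Matches the final word of a street name (e.g. "Rd", "TCE", "Strert", "Street23")
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

def update_name(name, mapping):
    """Replace the trailing street type with its corrected form, if it appears in the mapping."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name

# Example: update_name("Davenport TCE", mapping) returns "Davenport Terrace"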
All Street Names were corrected except for the aeropuerto Melbourne for which I was unable to determine a corrected value as Melbourne airport is some 725 kilometres away. This is related to a car rental agency called Thrifty.
There was a consistent issue with post codes whereby a leading state abbreviation (SA) and a space preceded four of them: ['SA 5075', 'SA 5052', 'SA 5118', 'SA 5016']. The AuditPostCode function compiles the correct and incorrect post codes into lists for analysis. From this I built a regular expression to detect the inconsistent post codes and perform a substitution via the dctBetterPostCode dictionary below, a straightforward mapping of incorrect to correct post codes.
# Dictionary of issues related to local post codes and required updates
dctBetterPostCode = { "SA 5075": "5075",
                      "SA 5052": "5052",
                      "SA 5118": "5118",
                      "SA 5016": "5016" }
# Function to correct incorrect postal code formats via regular expression, used later during shaping
import re

def UpdatePostCode(postcode, dctBetterPostCode):
    """This function is used to clear incorrect postcode formats.
    Args:
        postcode (str): the original post code
        dctBetterPostCode (dict): provides mapping from old to corrected post code
    Returns:
        BetterPostCode (str): the corrected post code
    """
    # This regex matches a non-digit prefix (e.g. "SA ") followed by the standard 4 digits
    regPostCode = re.compile(r'\D{1,}\d{4}', re.IGNORECASE)
    p = regPostCode.search(postcode)
    BetterPostCode = postcode
    if p:
        if p.group() in dctBetterPostCode:
            print "Before"
            print postcode
            # Replace the incorrect postcode with the better postcode
            BetterPostCode = re.sub(p.group(), dctBetterPostCode[p.group()], postcode)
            print "After"
            print BetterPostCode
        else:
            print "Passed"
    return BetterPostCode
# Output the correct and incorrect postal codes to get an understanding of how to deal with the incorrect ones
print AuditPostCode(OSMFILE)
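For reference, AuditPostCode might look roughly like the sketch below; it assumes the post codes live under the addr:postcode tag key, and the real function in UpdateAdelaideOSM.py may differ.
import re
import xml.etree.ElementTree as ET

valid_postcode_re = re.compile(r'^\d{4}$')  # South Australian post codes are four digits

def AuditPostCode(osmfile):
    """Return (correct, incorrect) lists of post code values found in the file."""
    correct, incorrect = [], []
    for _, elem in ET.iterparse(osmfile, events=('end',)):
        if elem.tag in ('node', 'way'):
            for tag in elem.iter('tag'):
                if tag.attrib.get('k') == 'addr:postcode':
                    value = tag.attrib['v']
                    if valid_postcode_re.match(value):
                        correct.append(value)
                    else:
                        incorrect.append(value)
            elem.clear()
    return correct, incorrect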
The update_name and UpdatePostCode functions are called within the shape_element function to correct the issues discovered during auditing, so the changes flow through to the resulting CSV files. Both functions have been added to the Helper Functions section of the main code (UpdateAdelaideOSM.py).
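A rough sketch of where those calls could sit within the tag handling of shape_element; the helper name shape_tag and the dictionary layout here are illustrative, not the exact code in UpdateAdelaideOSM.py.
def shape_tag(element_id, tag, default_tag_type='regular'):
    """Shape a single <tag> child into a dictionary, applying the cleaning functions."""
    key = tag.attrib['k']
    value = tag.attrib['v']
    if key == 'addr:street':
        value = update_name(value, mapping)                  # expand abbreviations, fix typos
    elif key == 'addr:postcode':
        value = UpdatePostCode(value, dctBetterPostCode)     # strip the leading "SA "
    return {'id': element_id, 'key': key, 'value': value, 'type': default_tag_type}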
I also found that every second row of the exported CSV files was blank. I understand this is a common issue when writing CSVs on Windows: the csv writer adds its own \r\n line ending, which is doubled when the output file is opened in text mode. By searching forums, I found some Python code to adapt, using the csv module, to remove the blank rows.
# Get rid of blank lines in the CSV (a blank line between every record, common issue on Windows)
import csv

in_fnam = '2ways_tags.csv'
out_fnam = 'ways_tags.csv'
input = open(in_fnam, 'rb')
output = open(out_fnam, 'wb')   # binary mode so the writer's \r\n endings are not doubled
writer = csv.writer(output)
for row in csv.reader(input):
    if any(row):                # keep only rows that contain data
        writer.writerow(row)
input.close()
output.close()
I also encountered a datatype mismatch issue when importing the CSV files and realised it was caused by the header row being loaded as data. I deleted the header rows manually and the imports then completed successfully.
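As an alternative to deleting the headers by hand, the header row could be skipped programmatically while loading each CSV into SQLite. A minimal sketch, assuming the nodes_tags table and the column names from the project schema:
import csv
import sqlite3

conn = sqlite3.connect('AdelaideOSM.db')
cur = conn.cursor()

with open('nodes_tags.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row so it is not inserted as data
    cur.executemany(
        "INSERT INTO nodes_tags (id, key, value, type) VALUES (?, ?, ?, ?);",
        (row for row in reader if any(row))   # also drops any blank rows
    )

conn.commit()
conn.close()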
Below are some basic statistics about the dataset.
- Adelaide_australia.osm - 220 MB
- AdelaideOSM.db - 116 MB
- nodes.csv - 83.7 MB
- nodes_tags.csv - 2.71 MB
- ways.csv - 7.78 MB
- ways_tags.csv - 10.7 MB
- ways_nodes.csv - 23.9 MB
SELECT COUNT(*) FROM nodes;
1064496
SELECT COUNT(*) FROM ways;
137857
SELECT COUNT(DISTINCT(e.uid))
FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;
749
SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags
WHERE nodes_tags.key='shop'
GROUP BY nodes_tags.value
ORDER BY num DESC LIMIT 10;
Value | Count |
---|---|
supermarket | 142 |
hairdresser | 77 |
survey | 74 |
yes | 70 |
clothes | 61 |
bakery | 51 |
convenience | 47 |
alcohol | 45 |
car_repair | 35 |
bicycle | 32 |
The results above highlight a potential improvement: a value of "yes" provides little analytical value and is inconsistent with the other values, so it would be beneficial to update these entries with a proper shop type.
SELECT nodes_tags.value, COUNT(*) as count
FROM nodes_tags
WHERE nodes_tags.value = 'level_crossing'
GROUP BY nodes_tags.value;
Value | Count |
---|---|
level_crossing | 377 |
The number of level crossings in metropolitan areas normally decreases as population density increases, because they cause traffic snarls and are more dangerous. They are, however, expensive to redevelop, so governments usually do not act until there is significant public pressure to do so.
SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='amenity'
GROUP BY value
ORDER BY num DESC
LIMIT 20;
Value | Count |
---|---|
bench | 1823 |
fast_food | 283 |
restaurant | 270 |
toilets | 247 |
cafe | 213 |
drinking_water | 193 |
parking | 180 |
post_box | 145 |
pub | 125 |
survey | 105 |
bbq | 89 |
telephone | 89 |
fuel | 84 |
bank | 83 |
waste_basket | 81 |
place_of_worship | 80 |
shelter | 79 |
pharmacy | 73 |
post_office | 60 |
atm | 52 |
SELECT AVG(value) as ave
FROM ways_tags
WHERE key='levels' AND type='building' AND value!='ground floor';
2.838
SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags
JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='restaurant') c
ON nodes_tags.id=c.id
WHERE nodes_tags.key='cuisine'
GROUP BY nodes_tags.value
ORDER BY num DESC LIMIT 10;
Value | Count |
---|---|
italian | 27 |
chinese | 22 |
indian | 22 |
pizza | 15 |
thai | 15 |
survey | 13 |
vegetarian | 10 |
vietnamese | 9 |
asian | 8 |
burger | 8 |
SELECT u.user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) u
GROUP BY u.user
ORDER BY num DESC
LIMIT 10;
User | Count |
---|---|
CloCkWeRX | 341972 |
o'cholio | 148984 |
marquisite | 60375 |
Didz | 50773 |
Josh_G | 46592 |
nickbarker | 41948 |
hryciuk | 40527 |
Malco! | 33063 |
Qwertii | 30584 |
KNAPPO | 27328 |
SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='source'
GROUP BY value
ORDER BY num DESC
LIMIT 10;
Value | Count |
---|---|
PGS | 806 |
survey | 712 |
nearmap | 276 |
Bing | 257 |
Yahoo | 241 |
yahoo | 171 |
Survey | 36 |
bing | 30 |
local_knowledge | 19 |
- User statistics:
- The top ten users account for 68% of the total contributions (see the top 10 contributing users query above)
- The top user (CloCkWeRX) accounts for 28% of the total contributions
- 721 of the 749 users (96%) each contribute less than 1% of the total
- The results show that most updates are performed by only a few users.
Source:
sqlite> .output All_Users.csv
sqlite> SELECT u.user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) u
GROUP BY u.user
ORDER BY num DESC;
sqlite> .output stdout
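A minimal sketch of how those percentages could be derived from the exported file, assuming All_Users.csv was written with .mode csv set (one user,count pair per row, sorted by count descending, no header):
import csv

with open('All_Users.csv', 'rb') as f:
    counts = [int(row[1]) for row in csv.reader(f) if any(row)]

total = float(sum(counts))
top10_share = 100 * sum(counts[:10]) / total            # rows are already sorted by count DESC
top_user_share = 100 * counts[0] / total
small_contributors = sum(1 for c in counts if c / total < 0.01)

print("Top 10 users: %.0f%% of contributions" % top10_share)
print("Top user: %.0f%% of contributions" % top_user_share)
print("%d of %d users each contribute less than 1%%" % (small_contributors, len(counts)))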
Other ideas about the dataset
- Gamification - integration with popular games like Pokemon Go may prove to be a worthwhile consideration.
- Mentions - pop-ups on websites or mobile devices suggesting local amenities when using maps, with a mention of the highest contributor if you click through.
- Webpage - the OpenStreetMap homepage is looking dated.
Benefits:
- Gamification - providing incentives to diversify the user base would increase the probability that local area data, including new content, is captured accurately.
- Mentions - provides incentives for more users to get involved.
- Webpage - new contributors are not inclined to trust a site that looks unmaintained or old. If the site were rejuvenated, this could provide impetus for more contributions from users.
Anticipated problems:
- Gamification - would require a significant amount of development work and associated costs.
- Mentions - would require updates to website functionality and the introduction of, or improvements to, smart phone applications. In addition, many users contribute under nicknames, so only close friends would know who owns which alias.
- Webpage - requires redevelopment and functionality upgrades. The home page supports three views, which increases the amount of work required for changes. Changes of this sort often result in debate about whether to do a total redesign or an incremental upgrade.
This dataset is cleaner than I expected, although there are some issues, which I have pointed out, such as a value of "yes" for shops in the nodes_tags table. It should also be noted that the node tag value "survey" appeared under the shop, amenity and cuisine keys in the queries above, where one would expect a more meaningful value. Other inconsistencies mainly revolve around abbreviations and a couple of typographical errors. It would be worthwhile applying the updates I have audited and cleaned, as they improve the overall quality of the dataset, and reviewing node tag values like "yes" and "survey" to see whether better values can be supplied. The last query above also highlights that standardising capitalisation on the source key would improve analysis, as the duplicated lower-case yahoo and bing values make aggregation more difficult. My map area takes in a coastline, which may explain the large number of Prototype Global Shoreline (PGS) source entries. In addition to these more granular improvements, I have suggested some overarching improvements, along with their pros and cons, in the Other ideas about the dataset section.