This map covers the area where I live, so it will be interesting to see what the database querying reveals.
Problems encountered with the map
I created several small samples of the data and worked with the smallest one most of the time, until my auditing procedures were working correctly. Using the code snippet provided in Project Details, I changed the k size and created samples called sampleK10.osm, sampleK25.osm, sampleK35.osm and sampleK100.osm. Once I had sampleK100.osm working correctly, I progressively moved up in size. The problems I came across are described in the sections below.
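Before turning to those problems, here is a minimal sketch of the sampling step, assuming the usual take-every-k-th-top-level-element pattern from the project notes; the file names and the value of k are illustrative rather than the exact code I ran.
import xml.etree.ElementTree as ET

OSM_FILE = "adelaide_australia.osm"   # illustrative input file
SAMPLE_FILE = "sampleK10.osm"         # illustrative output file
k = 10                                # keep every k-th top-level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield each top-level element, clearing the tree as we go to limit memory use."""
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

with open(SAMPLE_FILE, 'wb') as output:
    output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write(b'<osm>\n')
    # Write every k-th top-level element to the sample file
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))
    output.write(b'</osm>\n')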
Street Names
- Abbreviated street names, for example: Peacock Rd, Davenport TCE
- A typographical error in a street name: Jerningham Strert
- A street name with a number incorrectly concatenated onto it: Edwin Street23
- An incorrect value containing a Spanish word: aeropuerto Melbourne
Street names were audited and corrected by adapting the code from the Case Study lesson, with the expected list and mapping dictionary updated to include local conventions such as Terrace, Mall, Arcade and Parade. In addition to the corrections mentioned above, abbreviations such as St, St., Ave, Rd. and Rd were expanded.
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
"Trail", "Parkway", "Commons", "Terrace", "Highway", "Way", "Mall", "Arcade",
"Esplanade", "Parade", "Crescent", "East", "Grove", "Link", "Mews", "North", "Plaza",
"South", "Walk"]
# Dictionary of issues related to local mapping and required updates
mapping = { "St": "Street",
"St.": "Street",
"Ave": "Avenue",
"Rd.": "Road",
"Rd": "Road",
"Street23": "Street",
"Strert": "Street",
"TCE": "Terrace"
}
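For reference, a minimal sketch of how update_name might apply this mapping, following the case-study pattern of matching the final word of the street name; the actual function in UpdateAdelaideOSM.py may differ in detail.
import re

# Matches the final word of a street name (e.g. "Rd", "TCE", "Strert", "Street23")
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

def update_name(name, mapping):
    """Replace the trailing street type with its corrected form, if it appears in the mapping."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name

# Example: update_name("Davenport TCE", mapping) returns "Davenport Terrace"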
All Street Names were corrected except for the aeropuerto Melbourne for which I was unable to determine a corrected value as Melbourne airport is some 725 kilometres away. This is related to a car rental agency called Thrifty.
There was a consistent issue with post codes whereby a leading state abbreviation (SA) and a space preceded four of them: ['SA 5075', 'SA 5052', 'SA 5118', 'SA 5016']. The AuditPostCode function compiles the correct and incorrect post codes into lists for analysis. From this I built a regular expression to detect the inconsistent post codes and perform a substitution via the dctBetterPostCode dictionary below, a straightforward mapping of incorrect to correct post codes.
# Dictionary of issues related to local post codes and required updates
dctBetterPostCode = { "SA 5075": "5075",
                      "SA 5052": "5052",
                      "SA 5118": "5118",
                      "SA 5016": "5016" }
# Function to correct incorrect postal code formats via regular expression, used later during shaping
import re

def UpdatePostCode(postcode, dctBetterPostCode):
    """This function is used to clear incorrect postcode formats.
    Args:
        postcode (str): the original post code
        dctBetterPostCode (dict): provides mapping from old to corrected post code
    Returns:
        BetterPostCode (str): the corrected post code
    """
    # This regex matches a non-digit prefix (e.g. "SA ") followed by the standard 4 digits
    regPostCode = re.compile(r'\D{1,}\d{4}', re.IGNORECASE)
    p = regPostCode.search(postcode)
    BetterPostCode = postcode
    if p:
        if p.group() in dctBetterPostCode:
            print "Before"
            print postcode
            # Replace the incorrect postcode with the better postcode
            BetterPostCode = re.sub(p.group(), dctBetterPostCode[p.group()], postcode)
            print "After"
            print BetterPostCode
        else:
            print "Passed"
    return BetterPostCode
# Output the correct and incorrect postal codes to get an understanding of how to deal with the incorrect ones
print AuditPostCode(OSMFILE)
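For reference, AuditPostCode might look roughly like the sketch below; it assumes the post codes live under the addr:postcode tag key, and the real function in UpdateAdelaideOSM.py may differ.
import re
import xml.etree.ElementTree as ET

valid_postcode_re = re.compile(r'^\d{4}$')  # South Australian post codes are four digits

def AuditPostCode(osmfile):
    """Return (correct, incorrect) lists of post code values found in the file."""
    correct, incorrect = [], []
    for _, elem in ET.iterparse(osmfile, events=('end',)):
        if elem.tag in ('node', 'way'):
            for tag in elem.iter('tag'):
                if tag.attrib.get('k') == 'addr:postcode':
                    value = tag.attrib['v']
                    if valid_postcode_re.match(value):
                        correct.append(value)
                    else:
                        incorrect.append(value)
            elem.clear()
    return correct, incorrect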
The update_name and UpdatePostCode functions are called within the shape_element function to correct the issues discovered during auditing, so the changes flow through to the resulting CSV files. Both functions have been added to the Helper Functions section of the main code (UpdateAdelaideOSM.py).
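A rough sketch of where those calls could sit within the tag handling of shape_element; the helper name shape_tag and the dictionary layout here are illustrative, not the exact code in UpdateAdelaideOSM.py.
def shape_tag(element_id, tag, default_tag_type='regular'):
    """Shape a single <tag> child into a dictionary, applying the cleaning functions."""
    key = tag.attrib['k']
    value = tag.attrib['v']
    if key == 'addr:street':
        value = update_name(value, mapping)                  # expand abbreviations, fix typos
    elif key == 'addr:postcode':
        value = UpdatePostCode(value, dctBetterPostCode)     # strip the leading "SA "
    return {'id': element_id, 'key': key, 'value': value, 'type': default_tag_type}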
I also found that every second row of the exported CSV files was blank. I understand this is a common issue when writing CSVs on Windows: the csv writer adds its own \r\n line ending, which is doubled when the output file is opened in text mode. By searching forums, I found some Python code to adapt, using the csv module, to remove the blank rows.
# Get rid of blank lines in the CSV (a blank line between every record, common issue on Windows)
import csv

in_fnam = '2ways_tags.csv'
out_fnam = 'ways_tags.csv'
input = open(in_fnam, 'rb')
output = open(out_fnam, 'wb')   # binary mode so the writer's \r\n endings are not doubled
writer = csv.writer(output)
for row in csv.reader(input):
    if any(row):                # keep only rows that contain data
        writer.writerow(row)
input.close()
output.close()
I also encountered a datatype mismatch issue when importing the CSV files and realised it was caused by the header row being loaded as data. I deleted the header rows manually and the imports then completed successfully.
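As an alternative to deleting the headers by hand, the header row could be skipped programmatically while loading each CSV into SQLite. A minimal sketch, assuming the nodes_tags table and the column names from the project schema:
import csv
import sqlite3

conn = sqlite3.connect('AdelaideOSM.db')
cur = conn.cursor()

with open('nodes_tags.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row so it is not inserted as data
    cur.executemany(
        "INSERT INTO nodes_tags (id, key, value, type) VALUES (?, ?, ?, ?);",
        (row for row in reader if any(row))   # also drops any blank rows
    )

conn.commit()
conn.close()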
Below are some basic statistics about the dataset.
- Adelaide_australia.osm - 220 MB
- AdelaideOSM.db - 116 MB
- nodes.csv - 83.7 MB
- nodes_tags.csv - 2.71 MB
- ways.csv - 7.78 MB
- ways_tags.csv - 10.7 MB
- ways_nodes.csv - 23.9 MB
SELECT COUNT(*) FROM nodes;
1064496
SELECT COUNT(*) FROM ways;
137857
SELECT COUNT(DISTINCT(e.uid))
FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;
749
SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags
WHERE nodes_tags.key='shop'
GROUP BY nodes_tags.value
ORDER BY num DESC LIMIT 10;
Value | Count |
---|---|
supermarket | 142 |
hairdresser | 77 |
survey | 74 |
yes | 70 |
clothes | 61 |
bakery | 51 |
convenience | 47 |
alcohol | 45 |
car_repair | 35 |
bicycle | 32 |
The results above highlight a potential improvement: a value of "yes" provides little analytical value and is inconsistent with the other values, so it would be beneficial to update these entries with a proper shop type.
SELECT nodes_tags.value, COUNT(*) as count
FROM nodes_tags
WHERE nodes_tags.value = 'level_crossing'
GROUP BY nodes_tags.value;
Value | Count |
---|---|
level_crossing | 377 |
The number of level crossings in metropolitan areas normally decreases as population density increases, because they cause traffic snarls and are more dangerous. They are, however, expensive to redevelop, so governments usually do not act until there is significant public pressure to do so.
SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='amenity'
GROUP BY value
ORDER BY num DESC
LIMIT 20;
Value | Count |
---|---|
bench | 1823 |
fast_food | 283 |
restaurant | 270 |
toilets | 247 |
cafe | 213 |
drinking_water | 193 |
parking | 180 |
post_box | 145 |
pub | 125 |
survey | 105 |
bbq | 89 |
telephone | 89 |
fuel | 84 |
bank | 83 |
waste_basket | 81 |
place_of_worship | 80 |
shelter | 79 |
pharmacy | 73 |
post_office | 60 |
atm | 52 |
SELECT AVG(value) as ave
FROM ways_tags
WHERE key='levels' AND type='building' AND value!='ground floor';
2.838
SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags
JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='restaurant') c
ON nodes_tags.id=c.id
WHERE nodes_tags.key='cuisine'
GROUP BY nodes_tags.value
ORDER BY num DESC LIMIT 10;
Value | Count |
---|---|
italian | 27 |
chinese | 22 |
indian | 22 |
pizza | 15 |
thai | 15 |
survey | 13 |
vegetarian | 10 |
vietnamese | 9 |
asian | 8 |
burger | 8 |
SELECT u.user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) u
GROUP BY u.user
ORDER BY num DESC
LIMIT 10;
User | Count |
---|---|
CloCkWeRX | 341972 |
o'cholio | 148984 |
marquisite | 60375 |
Didz | 50773 |
Josh_G | 46592 |
nickbarker | 41948 |
hryciuk | 40527 |
Malco! | 33063 |
Qwertii | 30584 |
KNAPPO | 27328 |
SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='source'
GROUP BY value
ORDER BY num DESC
LIMIT 10;
Value | Count |
---|---|
PGS | 806 |
survey | 712 |
nearmap | 276 |
Bing | 257 |
Yahoo | 241 |
yahoo | 171 |
Survey | 36 |
bing | 30 |
local_knowledge | 19 |
- User statistics:
- The top ten users account for 68% of the total contributions (see the top 10 contributing users query above)
- The top user (CloCkWeRX) accounts for 28% of the total contributions
- 721 of the 749 users (96%) each contribute less than 1% of the total
- The results show that most updates are performed by only a few users.
Source:
sqlite> .output All_Users.csv
sqlite> SELECT u.user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) u
GROUP BY u.user
ORDER BY num DESC;
sqlite> .output stdout
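A minimal sketch of how those percentages could be derived from the exported file, assuming All_Users.csv was written with .mode csv set (one user,count pair per row, sorted by count descending, no header):
import csv

with open('All_Users.csv', 'rb') as f:
    counts = [int(row[1]) for row in csv.reader(f) if any(row)]

total = float(sum(counts))
top10_share = 100 * sum(counts[:10]) / total            # rows are already sorted by count DESC
top_user_share = 100 * counts[0] / total
small_contributors = sum(1 for c in counts if c / total < 0.01)

print("Top 10 users: %.0f%% of contributions" % top10_share)
print("Top user: %.0f%% of contributions" % top_user_share)
print("%d of %d users each contribute less than 1%%" % (small_contributors, len(counts)))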
Other ideas about the dataset
- Gamification - integration with popular games like Pokemon Go may prove to be a worthwhile consideration.
- Mentions - pop-ups on websites or mobile devices suggesting local amenities when using maps, with a mention of the highest contributor if you click through.
- Webpage - the OpenStreetMap homepage is looking dated.
Benefits:
- Gamification - providing incentives to diversify the user base would increase the probability that local area data, including new content, is captured accurately.
- Mentions - provides incentives for more users to get involved.
- Webpage - new contributors are not inclined to trust a site that looks unmaintained or old. If the site were rejuvenated, this could provide impetus for more contributions from users.
Anticipated problems:
- Gamification - would require a significant amount of development work and associated costs.
- Mentions - would require updates to website functionality and the introduction of, or improvements to, smart phone applications. In addition, many users contribute under nicknames, so only close friends would know who owns which alias.
- Webpage - requires redevelopment and functionality upgrades. The home page supports three views, which increases the amount of work required for changes. Changes of this sort often result in debate about whether to do a total redesign or an incremental upgrade.
This dataset is cleaner than I expected, although there are some issues, which I have pointed out, such as a value of "yes" for shops in the nodes_tags table. It should also be noted that the node tag value "survey" appeared under the shop, amenity and cuisine keys in the queries above, where one would expect a more meaningful value. Other inconsistencies mainly revolve around abbreviations and a couple of typographical errors. It would be worthwhile applying the updates I have audited and cleaned, as they improve the overall quality of the dataset, and reviewing node tag values like "yes" and "survey" to see whether better values can be supplied. The last query above also highlights that standardising capitalisation on the source key would improve analysis, as the duplicated lower-case yahoo and bing values make aggregation more difficult. My map area takes in a coastline, which may explain the large number of Prototype Global Shoreline (PGS) source entries. In addition to these more granular improvements, I have suggested some overarching improvements, along with their pros and cons, in the Other ideas about the dataset section.