San Francisco, CA, United States
I chose this location because I recently moved to San Francisco, and I was interested in discovering more about the area in OpenStreetMap.
All Python cleaning methods use the following iterator to parse the XML file. It streams the tree through memory, clearing the root after each element is yielded to keep memory usage low.
import xml.etree.cElementTree as ET

def get_element(osm_file, tags=('node', 'way', 'relation')):
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()
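For context, here is a minimal usage sketch of this generator; the filename is the one listed in the statistics section below, and counting elements is just an illustrative use:

from collections import defaultdict

# Count how many elements of each tag the generator yields
counts = defaultdict(int)
for elem in get_element('san-francisco_california.osm'):
    counts[elem.tag] += 1
print(dict(counts))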
Below are five examples of problems found in the area mentioned above, along with the cleaning method used and the results.
- Problem: Some tag keys contained problematic characters such as (=+/&<>;'"?%#$@,. \t\r\n[]). For example:
sfgov.org:OFFICE_TYP sfgov.org:OBJECTID sfgov.org:OFFICE_TYP DOT # DOT # sfgov.org:OBJECTID sfgov.org:OFFICE_TYP
- Cleaning Code:
import re

PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

for element in get_element(osm_file):
    for child in element:
        if child.tag == 'tag':  #for Tags Only
            #Gets information from tags
            child_tag_key = child.get('k')
            #Identify problematic tags
            key_problems = PROBLEMCHARS.search(child_tag_key)
            #Skips any tags with problems
            if key_problems is not None:
                continue
- Description of Method: In this example, I use re.compile to compile the list of problematic characters into a regular-expression object, then use its .search method to look for them in the key string, which is extracted with the .get method. The continue statement skips any tag whose key contains a problematic character, so those tags are excluded from the final product.
- Results:
name:ru z_order wikidata wikipedia population short_name is_in:country is_in:continent census:population is_in:country_code
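To illustrate the filter in isolation, a small standalone sketch (the sample keys are taken from the problem examples above):

import re

PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

for key in ['sfgov.org:OBJECTID', 'DOT #', 'wikidata']:
    # .search returns None when no problematic character is present
    print(key + ' -> skipped: ' + str(PROBLEMCHARS.search(key) is not None))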
- Problem: In the previous example, we can see that some tag keys contain metadata describing the key (e.g. census:population).
- Cleaning Code:
Continuation of last code
#Skips any tags with problems
if key_problems is not None:
    continue

#Get the colon position in the string
colon_pos = child.get('k').find(':')
if colon_pos > 0:  #for Tags with colon only
    #Splitting keys inside tags
    key_tag = child.attrib['k'][colon_pos + 1:]
    type_tag = child.attrib['k'][0:colon_pos]
else:
    key_tag = child.attrib['k']
    type_tag = 'regular'
- Description of Method: Here I again use the .get method to retrieve the key, combined with the .find method, which returns the colon's index if found and -1 if not. An if statement ensures that the cleaning only happens when a colon is present. Although .get would give the same result, this time I used .attrib combined with index slicing to read the attribute value. The code finds the colon position in the k attribute and extracts the characters after it ([colon_pos + 1:]) and before it ([0:colon_pos]), separating the metadata (the type) from the string it describes (the key). Keys with no colon get 'regular' as a placeholder type.
- Results:
Before: name:zh | After: Type:name, key: zh
Before: z_order | After: Type:regular, key: z_order
Before: wikidata | After: Type:regular, key: wikidata
Before: wikipedia | After: Type:regular, key: wikipedia
Before: population | After: Type:regular, key: population
Before: short_name | After: Type:regular, key: short_name
Before: is_in:country | After: Type:is_in, key: country
Before: is_in:continent | After: Type:is_in, key: continent
Before: census:population | After: Type:census, key: population
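The same slicing logic can be expressed as a small standalone function (split_key is a hypothetical helper name, not part of the cleaning script):

def split_key(full_key):
    # Split 'type:key' metadata on the first colon; default type is 'regular'
    colon_pos = full_key.find(':')
    if colon_pos > 0:
        return full_key[0:colon_pos], full_key[colon_pos + 1:]
    return 'regular', full_key

print(split_key('census:population'))  # ('census', 'population')
print(split_key('wikidata'))           # ('regular', 'wikidata')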
- Problem: After loading the data, I ran a query to check for other values that could have been split. The same could have been achieved in Python, but I did not notice the issue until I ran the SQL query; both methods are demonstrated below.
- Cleaning Code:
- Initial Query to Identify:

#This query uses pandas to query SQLite data
import sqlite3
import pandas as pd

connection = "SF_OpenMapProject"
db = sqlite3.connect(connection)

non_splitted_query = '''
SELECT nodeid, type, key, value
FROM node_tags
WHERE value LIKE '%:%'
AND key IN ('wikipedia') --change to identify trend by key
'''
print pd.read_sql_query(non_splitted_query, db)
- Grouping Keys to Help Identify the Best Cleaning Code:
Group_byKey = '''
SELECT key, COUNT(*) AS keycount
FROM ({subquery})
GROUP BY key
ORDER BY COUNT(*) DESC;
'''
print pd.read_sql_query(Group_byKey.format(subquery=non_splitted_query), db)

>results
                 key  keycount
0      opening_hours      1047
1             source       287
2          wikipedia       188
3   collection_times        37
4               note        30
5               name        24
6        description        10
7    wpt_description         9
8        conditional         4
9            highway         2
10     service_times         2
11      traffic_sign         2
12           amenity         1
13          backward         1
14                en         1
15        happy_hour         1
16       happy_hours         1
17         housename         1
18             image         1
19       information         1
20               ref         1
21         route_ref         1
- Python Code to Clean the wikipedia Language Acronym:
Modification of Code in Example 2
#Gets the value of the element
value_string = child.attrib['v']
if colon_pos > 0:  #for Tags with colon only
    #Splitting keys inside tags
    key_tag = child.attrib['k'][colon_pos + 1:]
    type_tag = child.attrib['k'][0:colon_pos]
else:
    #Set to '' to not add wiki language to key tag
    language_part = ''
    #Ensuring that the wiki language stays with wiki
    if child.attrib['k'] == 'wikipedia':
        #Only makes changes if a colon is found
        colon_pos = child.get('v').find(':')
        language_part = '_in_' + child.attrib['v'][0:colon_pos]
        value_string = child.attrib['v'][colon_pos + 1:]
    #Applying additions if needed and setting regular types
    key_tag = child.attrib['k'] + language_part
    type_tag = 'regular'
- Description of Method: In this example, to clean the key tags I again used the .attrib method to get the k attribute, combined with an if statement to identify only the applicable wikipedia tags. I then used the .get().find() combination to read the v attribute and locate the colon index, and separated the strings with index slicing as explained in the previous example. Concatenating the pieces produces a key such as wikipedia_in_en. For the SQL part, I used the sqlite3 module's .connect method to open a connection. Rather than the connection's .cursor and .fetchall methods, I queried the data with pandas' .read_sql_query, which reads the SQL result into a DataFrame and lets me view it in Jupyter Notebook as a table with headers, using fewer method calls. To construct the SQL query, I looked once more for values that could be split on a colon, using a WHERE clause with the LIKE operator and wildcard characters so the colon could appear at any position in the string, and the IN operator to match only the wikipedia key tags. To confirm the trend that the wikipedia key's language acronym could be separated with confidence, I wrote another query that groups the key tags with GROUP BY, counts the instances with the COUNT function, and sorts them from highest to lowest with ORDER BY and the DESC argument. To print this query, I used the .format method to insert non_splitted_query as a subquery.
- Results:
Before: Key: wikipedia, Value: en:San Francisco
After: Key: wikipedia_in_en, Value: San Francisco
Before: Key: wikipedia, Value: en:Hayward (Amtrak station)
After: Key: wikipedia_in_en, Value: Hayward (Amtrak station)
Before: Key: wikipedia, Value: en:Fitzgerald Marine Reserve
After: Key: wikipedia_in_en, Value: Fitzgerald Marine Reserve
Before: Key: wikipedia, Value: en:Crocker-Amazon, San Francisco
After: Key: wikipedia_in_en, Value: Crocker-Amazon, San Francisco
Before: Key: wikipedia, Value: en:Presidio Terrace
After: Key: wikipedia_in_en, Value: Presidio Terrace
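The wikipedia-specific branch can be summarized as a standalone sketch (split_wikipedia is a hypothetical helper name; it assumes values formatted like 'en:San Francisco'):

def split_wikipedia(key, value):
    # Move the language acronym from the value into the key
    colon_pos = value.find(':')
    if key == 'wikipedia' and colon_pos > 0:
        return key + '_in_' + value[0:colon_pos], value[colon_pos + 1:]
    return key, value

print(split_wikipedia('wikipedia', 'en:San Francisco'))
# ('wikipedia_in_en', 'San Francisco')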
- Problem: Two fields used the same string to describe the data: both used name. However, the type field used it to describe the name of a language, while the key field used it to describe the name of a place. To avoid confusion in the future, all name strings describing a language will be changed to language.
- Cleaning Code:
- Initial Query to Identify:
type_name_desc = '''
SELECT nodeid, type, key, value
FROM node_tags
WHERE LOWER(type) = 'name'
LIMIT 16 --limiting to save space
'''
print pd.read_sql_query(type_name_desc, db)

>results
       nodeid  type key               value
0    26819236  name  de       San Francisco
1    26819236  name  eu       San Francisco
2    26819236  name  fa       سان فرانسیسکو
3    26819236  name  hi     सैन फ्रांसिस्को
4    26819236  name  ja      サンフランシスコ
5    26819236  name  kn   ಸಾನ್ ಫ್ರಾನ್ಸಿಸ್ಕೊ
6    26819236  name  ko          샌프란시스코
7    26819236  name  pt       São Francisco
8    26819236  name  ru       Сан-Франциско
9    26819236  name  ta  ஸான் ப்²ரான்ஸிஸ்கொ
10   26819236  name  zh               旧金山
11  100571506  name  ja            バークレー
12  100571506  name  zh                伯克利
13  140982569  name  zh              列治文区
14  140982615  name  zh              马里纳区
15  140982670  name  zh           海特-黑什伯里区
- Python Code to Change type name to language:
Modification of Code in Example 3
if colon_pos > 0:  #for Tags with colon only
    #Splitting keys inside tags
    key_tag = child.attrib['k'][colon_pos + 1:]
    type_tag = child.attrib['k'][0:colon_pos]
else:
    #Set to '' to not add wiki language to key tag
    language_part = ''
    #Ensuring that the wiki language stays with wiki
    if child.attrib['k'] == 'wikipedia':
        #Only makes changes if a colon is found
        colon_pos = child.get('v').find(':')
        language_part = '_in_' + child.attrib['v'][0:colon_pos]
        value_string = child.attrib['v'][colon_pos + 1:]
    #Applying additions if needed and setting regular types
    key_tag = child.attrib['k'] + language_part
    type_tag = 'regular'
#Changes type name to language only when found
if type_tag == 'name':
    type_tag = 'language'
- Description of Method: In this example, to clean the type tags, I used a simple if statement to find any type_tag matching the string name. The type_tag value was assigned using the same .attrib and index-slicing combination as before. Type tags matching name were changed to language by reassigning the variable to the new string value.
- Results:
Before: Type: name, Key and Value: de San Francisco
After: Type: language, Key and Value: de San Francisco
Before: Type: name, Key and Value: eu San Francisco
After: Type: language, Key and Value: eu San Francisco
Before: Type: name, Key and Value: fa سان فرانسیسکو
After: Type: language, Key and Value: fa سان فرانسیسکو
Before: Type: name, Key and Value: hi सैन फ्रांसिस्को
After: Type: language, Key and Value: hi सैन फ्रांसिस्को
Before: Type: name, Key and Value: ja サンフランシスコ
After: Type: language, Key and Value: ja サンフランシスコ
- Problem: The metadata describing the source of the housenumber is not consistent: some tags present the key as source:housenumber and others as housenumber:source. I am going to change both to housenumber_source, which gives us a consistent key tag to query later if needed.
- Cleaning Code:
- Initial Query to Identify:
address_keys = '''
SELECT DISTINCT key
FROM node_tags
WHERE type = 'addr';
'''
print pd.read_sql_query(address_keys, db)

>result
                   key
0          housenumber
1               street
2                 city
3                state
4             postcode
5              country
6                 unit
7   source:housenumber
8                 full
9            housename
10                pier
11       interpolation
12              county
13  housenumber:source
14            province
15                door
16               floor
17               place
18               suite
19               flats
- Python Code to Clean Address Metadata:
Modification of Code in Example 4
#Changes to metadata when found
if type_tag == 'name':
    type_tag = 'language'
elif type_tag == 'addr':
    if key_tag.find(':') > 0:
        key_tag_array = key_tag.split(':')
        if key_tag_array[0] == 'source':
            sep = '_'
            key_tag = sep.join([key_tag_array[1], key_tag_array[0]])
        else:
            key_tag = key_tag.replace(':', '_')
        print key_tag
- Description of Method: The methods used here are .split, .join, and .replace. I used .split to separate the key tag into a list so I could later swap the order of the two values. Where the key tag starts with source, the pieces are joined back together in reverse order using .join, producing the consistent housenumber_source key tag. Where the key tag ends with source, the order is already correct, so I used .replace to simply swap the colon for an underscore.
- Results:
Before: source:housenumber | After: housenumber_source
Before: housenumber:source | After: housenumber_source
Before: housenumber:source | After: housenumber_source
Before: source:housenumber | After: housenumber_source
Before: street:source | After: street_source
Before: source:housenumber | After: housenumber_source
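For clarity, the reordering logic as a standalone sketch (normalize_addr_key is a hypothetical helper name):

def normalize_addr_key(key_tag):
    # Reorder 'source:x' to 'x_source'; otherwise just swap ':' for '_'
    if key_tag.find(':') > 0:
        parts = key_tag.split(':')
        if parts[0] == 'source':
            return '_'.join([parts[1], parts[0]])
        return key_tag.replace(':', '_')
    return key_tag

print(normalize_addr_key('source:housenumber'))  # housenumber_source
print(normalize_addr_key('housenumber:source'))  # housenumber_source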
This section contains basic statistics about the dataset, the SQL queries used to gather them, and some additional ideas about the data in context.
from datetime import datetime as dt

def timing_print(s):
    print '____________________'
    print "Start Time: " + str(s)
    print "Timing: " + str(dt.now() - s)
    print "Connection: " + connection
    print '____________________'
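A typical call wraps a query between dt.now() and timing_print; a minimal sketch, assuming db and connection are already defined as in the cleaning section:

s = dt.now()
print pd.read_sql_query('SELECT COUNT(*) AS node_count FROM nodes;', db)
timing_print(s)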
san-francisco_california.osm ......... 1,380,743 KB
SF_OpenMapProject.db ................. 752,287 KB
nodes.csv ............................ 544,083 KB
node_tags.csv ........................ 9,542 KB
ways.csv ............................. 49,453 KB
way_tags.csv ......................... 58,868 KB
way_nodes.csv ........................ 185,116 KB
table_names = ('nodes', 'node_tags', 'ways', 'way_tags', 'way_nodes')
number_nodes_query = '''
SELECT COUNT(*) AS {table}_count
FROM {table};
'''
for table in table_names:
    print pd.read_sql_query(number_nodes_query.format(table=table), db)
-----------------
nodes_count
0 6641035
-----------------
node_tags_count
0 260460
-----------------
ways_count
0 827729
-----------------
way_tags_count
0 1784424
-----------------
way_nodes_count
0 7885078
____________________
Start Time: 2017-11-23 11:23:14.859000
Timing: 0:00:01.275000
Connection: SF_OpenMapProject
____________________
tables_user = ('nodes', 'ways')
number_uniq_user = '''
SELECT COUNT(DISTINCT uid) AS unique_{name}_count
FROM ({table})
'''
subquery = '''
SELECT uid
FROM nodes
UNION ALL
SELECT uid
FROM ways
'''
s = dt.now()
for table in tables_user:
    print pd.read_sql_query(number_uniq_user.format(table=table, name=table), db)
print pd.read_sql_query(number_uniq_user.format(table=subquery, name='total'), db)
timing_print(s)
---------------------
unique_nodes_count
0 2596
---------------------
unique_ways_count
0 1830
---------------------
unique_total_count
0 2909
____________________
Start Time: 2017-11-23 11:30:29.738000
Timing: 0:00:04.547000
Connection: SF_OpenMapProject
____________________
number_uniq_user = '''
SELECT user, COUNT(*) AS contrib_count
FROM (SELECT uid, user FROM nodes UNION ALL SELECT uid, user FROM ways)
GROUP BY uid
ORDER BY COUNT(*) DESC
LIMIT 10;
'''
s = dt.now()
print pd.read_sql_query(number_uniq_user, db)
timing_print(s)
-----------------------------
user contrib_count
0 andygol 1495792
1 ediyes 887534
2 Luis36995 679457
3 dannykath 546016
4 RichRico 415649
5 Rub21 383321
6 calfarome 190862
7 oldtopos 165467
8 KindredCoda 151423
9 karitotp 139643
____________________
Start Time: 2017-11-23 15:50:20.792000
Timing: 0:00:05.117000
Connection: SF_OpenMapProject
____________________
bytype_count = '''
SELECT type, COUNT(*) AS TypeCount
FROM node_tags
GROUP BY type
HAVING COUNT(*) > 100
ORDER BY COUNT(*) DESC
'''
s = dt.now()
print pd.read_sql_query(bytype_count, db)
timing_print(s)
------------------------------
type TypeCount
0 regular 154699
1 addr 83549
2 gnis 9594
3 species 2923
4 redwood_city_ca 2836
5 seamark 2673
6 language 392
7 survey 374
8 payment 339
9 noaa 297
10 fire_hydrant 277
11 wetap 275
12 source 271
13 traffic_signals 258
14 contact 206
15 toilets 197
16 service 117
17 building 112
18 disused 101
____________________
Start Time: 2017-11-23 16:03:22.740000
Timing: 0:00:00.135000
Connection: SF_OpenMapProject
____________________
validate_total_count = '''
SELECT COUNT(*) AS way_regular_total
FROM way_tags
WHERE type = 'regular' and LOWER(key) <> 'fixme'
'''
validate_fixme_count = '''
SELECT
--w1.wayid, --commented after crosschecking with source
COUNT(w2.wayid) AS way_regular_fixme
FROM
(SELECT wayid FROM way_tags w1 WHERE type = 'regular' and LOWER(key) <> 'fixme') w1
LEFT JOIN
(SELECT DISTINCT wayid FROM way_tags WHERE LOWER(key) = 'fixme') w2
ON w1.wayid = w2.wayid
--WHERE w1.wayid = 156124086 --commented after crosschecking
--GROUP BY w1.wayid --commented after crosschecking with source
'''
fixme_tags = '''
SELECT source_node, type, COUNT(fixid) AS fixme_count, COUNT(*) AS total_count
FROM
(
SELECT * FROM (
(SELECT 'node' AS source_node, nodeid as id, type, key, value FROM node_tags WHERE LOWER(key) <> 'fixme') node_nofix
LEFT JOIN
(SELECT nodeid as fixid FROM node_tags WHERE LOWER(key) = 'fixme') node_fix
ON node_nofix.id = node_fix.fixid)
UNION
SELECT * FROM (
(SELECT 'way' AS source_node, wayid as id, type, key, value FROM way_tags WHERE LOWER(key) <> 'fixme') way_nofix
LEFT JOIN
(SELECT wayid as fixid FROM way_tags WHERE LOWER(key) = 'fixme') AS way_fix
ON way_nofix.id = way_fix.fixid)
)
GROUP BY source_node, type
HAVING COUNT(fixid) > 0
ORDER BY COUNT(fixid) DESC;
'''
print pd.read_sql_query(validate_total_count, db)
print pd.read_sql_query(validate_fixme_count, db)
print pd.read_sql_query(fixme_tags, db)
-------------------
way_regular_total
0 1286707
--------------------
way_regular_fixme
0 641
---------------------------------------------------------
source_node type fixme_count total_count
0 way regular 641 1286707
1 way tiger 256 257884
2 node regular 122 154625
3 node addr 91 83549
4 way addr 30 157411
5 way source 16 4897
6 node caltrans 11 28
7 node language 11 392
8 way hgv 11 3096
9 way gnis 8 4994
10 way building 6 3679
11 way Tiger 5 342
12 way cycleway 5 1873
13 node gnis 4 9594
14 node is_in 3 21
15 way destination 2 1336
16 way turn 2 9754
17 node census 1 40
18 way oneway 1 34
19 way park 1 42
20 way social_facility 1 19
21 way toilets 1 72
____________________
Start Time: 2017-11-23 18:19:17.046000
Timing: 0:00:11.530000
Connection: SF_OpenMapProject
____________________
- Verifying the integrity of the addresses entered.
- I noticed inconsistencies in data entry, such as entering the housenumber as the housename, including the street number in both the housename field and the street field, and leaving important parts of the address format (such as state and country) empty while entering the postcode.
- Example: SQL Script:
important_addr_parts = '''
SELECT
    id,
    lat || ', ' || lon AS googlemap_checker,
    housenumber,
    housename,
    COALESCE(housenumber, housename) AS number,
    street,
    unit,
    state,
    country,
    postcode
FROM nodes main
LEFT JOIN (SELECT nodeid, value AS housenumber FROM node_tags WHERE key = 'housenumber') AS hnum
    ON hnum.nodeid = main.id
LEFT JOIN (SELECT nodeid, value AS housename FROM node_tags
           WHERE key = 'housename' and (cast(value as integer) > 0) = 1) AS hnam
    ON hnam.nodeid = main.id
LEFT JOIN (SELECT nodeid, value AS street FROM node_tags WHERE key = 'street') AS str
    ON str.nodeid = main.id
LEFT JOIN (SELECT nodeid, value AS unit FROM node_tags WHERE key = 'unit') AS un
    ON un.nodeid = main.id
LEFT JOIN (SELECT nodeid, value AS state FROM node_tags WHERE key = 'state') AS sta
    ON sta.nodeid = main.id
LEFT JOIN (SELECT nodeid, value AS country FROM node_tags WHERE key = 'country') AS coun
    ON coun.nodeid = main.id
LEFT JOIN (SELECT nodeid, value AS postcode FROM node_tags WHERE key = 'postcode') AS post
    ON post.nodeid = main.id
WHERE housenumber IS NULL and housename IS NOT NULL
--LIMIT 10
'''
print pd.read_sql_query(important_addr_parts, db)
- Results:
            id         googlemap_checker housenumber          housename \
0    358772135  37.8822704, -122.1889555        None                 25
1    359099330  37.7993396, -122.4017555        None                243
2   1126948245  37.7720318, -122.4318431        None                588
3   1374302174  37.8750833, -122.2935778        None               1601
4   1514485184  37.5237877, -122.2740926        None                570
5   2322611878  37.4856225, -122.1468804        None                  1
6   2452206221  37.5672122, -122.3245315        None                 30
7   2452206222  37.5671059, -122.3243973        None                 34
8   2452235804  37.5667107, -122.3234299        None                139
9   2657361795  37.8850851, -122.0512286        None  1400 Creekside Dr
10  4073823484  37.8376837, -122.2628589        None               5179

               number            street  unit state country postcode
0                  25        Orinda Way  None  None    None     None
1                 243    Vallejo Street  None  None    None     None
2                 588     Haight Street  None  None    None     None
3                1601  San Pablo Avenue  None  None    None     None
4                 570       Masonic Way  None  None    None     None
5                   1              None  None  None    None     None
6                  30    South B Street  None  None    None     None
7                  34    South B Street  None  None    None     None
8                 139    South B Street  None  None    None     None
9   1400 Creekside Dr   Creekside Drive  None  None    None    94596
10               5179              None  None  None    None     None
- Suggestion, Benefits, & Concerns:
- Suggestion: The latter issue (empty state and country fields) can be fixed quickly, since we can fill in the blanks for this particular area: the extract covers only the US and only CA, so replacing the empty values with that logic is fairly easy for this example.
- Benefit: The benefit of having all the important parts is that consumers of the map will know the exact address for which they are looking. So we can add the empty values using logic such as placing CA and USA in the state and country fields.
- Concern: On a larger scale, it would take a lot of resources to treat a symptom that can be eliminated or minimized at the root level.
- Also, a coalesce where the housenumber is in the housename field (e.g. COALESCE(housenumber, housename)) can take care of the housenumber issue. The mixture of the address number with the street name (e.g. 1400 Creekside Dr) can be fixed while parsing the data, by cleaning the non-numeric characters when the string starts with a number (a sketch of this fix follows this list).
- Benefit: The benefit of having the correct address is the ease of identifying the location without having to use the latitude and longitude.
- Concern: This will require a trend analysis to ensure that we are correcting the problem instead of creating more problems. For example, the user instead of writing the number at the beginning could write it at the end (e.g. Creekside Dr 1400). If the code is only fixing non-numeric characters if the string starts with a number, this value will remain the same.
- Another solution would be to gamify data entry from users to incentivize them to enter at least the most important parts of the address.
- Benefit: The benefit is that fewer resources will be needed to maintain the completeness of addresses, since the problem is addressed at the root, preventing the problems listed above.
- Concern: This will require a heavy contribution from developers to create an open-source gamification algorithm that incentivizes users to keep addresses complete at the root of the problem. We would also have to run A/B testing to see which incentives minimize the number of problems, which will require more resources. But in the end, the results will be better, in my opinion.
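As a sketch of the parsing fix mentioned above (split_housename is a hypothetical helper; it only handles a leading number, which is exactly the limitation raised in the concern):

import re

LEADING_NUMBER = re.compile(r'^(\d+)\s+(.+)$')

def split_housename(housename):
    # Peel a leading house number off values like '1400 Creekside Dr'
    match = LEADING_NUMBER.match(housename)
    if match:
        return match.group(1), match.group(2)
    return None, housename

print(split_housename('1400 Creekside Dr'))  # ('1400', 'Creekside Dr')
print(split_housename('Creekside Dr 1400')) # (None, 'Creekside Dr 1400'), left unchanged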
- Identifying issues reported by users. These entries point to a specific issue found by a user. We can identify these nodes and address those concerns. However, we have no categories for these issues, so presenting this data visually will be a challenge. We need to group these concerns by category to see which categories have the most issues.
- Examples and Results: Count of names with fixme tags
sqlite> SELECT COUNT(*) AS namefixme_count
   ...> FROM node_tags as main
   ...> JOIN (SELECT nodeid FROM node_tags WHERE LOWER(key) = 'fixme') AS fixme
   ...> ON main.nodeid = fixme.nodeid
   ...> WHERE LOWER(key) = 'name';
namefixme_count
21
Creating Temp Table
sqlite> CREATE TEMP TABLE fixme AS SELECT *
   ...> FROM
   ...> (
   ...> SELECT
   ...> id,
   ...> substr(name.value, 1, 25) AS PlaceName,
   ...> CASE
   ...> WHEN LOWER(name.value) LIKE '%caltran%' THEN 'Caltrans'
   ...> WHEN LOWER(name.value) LIKE '%trail%' THEN 'Trail'
   ...> WHEN LOWER(name.value) LIKE '%association%' THEN 'Association'
   ...> WHEN LOWER(name.value) LIKE '%center%' THEN 'Center'
   ...> WHEN LOWER(name.value) LIKE '%entrance%' THEN 'Entrance'
   ...> ELSE COALESCE(place.value, amen.value, high.value, misc.value, tour.value, waytag.value)
   ...> END AS PlaceType,
   ...> lat || ', ' || lon AS googlemap_checker,
   ...> COALESCE(addr.value, waytag.value) AS Address,
   ...> fixme.value AS FixMessage
   ...> FROM nodes n
   ...> JOIN --inner join to keep only those w/ names
   ...> (SELECT nodeid, type, key, value FROM node_tags WHERE key = 'name') name
   ...> ON n.id = name.nodeid
   ...> LEFT JOIN --left to include those w/o places
   ...> (SELECT nodeid, value FROM node_tags WHERE key = 'place') place
   ...> ON n.id = place.nodeid
   ...> LEFT JOIN --left to include those w/o amenities
   ...> (SELECT nodeid, value FROM node_tags WHERE key = 'amenity') amen
   ...> ON n.id = amen.nodeid
   ...> LEFT JOIN --left to include those w/o highway
   ...> (SELECT nodeid, value FROM node_tags WHERE key = 'highway') high
   ...> ON n.id = high.nodeid
   ...> LEFT JOIN --left to include those w/o shop or entrance
   ...> (SELECT nodeid,
   ...> CASE WHEN LOWER(key) = 'shop' THEN 'Shop' ELSE value END AS value FROM node_tags
   ...> WHERE (LOWER(key) = 'shop' and value = 'yes') OR (LOWER(key) = 'entrance' and value = 'yes')) misc
   ...> ON n.id = misc.nodeid
   ...> LEFT JOIN --left to include those w/o address
   ...> (SELECT nodeid, group_concat(value) AS value FROM node_tags WHERE type = 'addr' GROUP BY nodeid) addr
   ...> ON n.id = addr.nodeid
   ...> LEFT JOIN --left to include those w/o tourism
   ...> (SELECT nodeid, value FROM node_tags WHERE LOWER(key) = 'tourism') tour
   ...> ON n.id = tour.nodeid
   ...> JOIN --inner join to force only those w/ potential issues
   ...> (SELECT nodeid, value FROM node_tags WHERE LOWER(key) = 'fixme') fixme
   ...> ON n.id = fixme.nodeid
   ...> LEFT JOIN
   ...> (SELECT DISTINCT node_id, key,
   ...> CASE WHEN LOWER(key) = 'building' THEN 'Building' ELSE value END AS Value FROM way_nodes wn
   ...> JOIN way_tags wt ON wn.wayid = wt.wayid WHERE LOWER(key) = 'waterway' OR LOWER(key) = 'building') waytag
   ...> --adds more tag descriptions but slows down the query
   ...> ON n.id = waytag.node_id
   ...> );
sqlite> .tables
node_tags   nodes       temp.fixme  way_nodes   way_tags    ways
Validating names with fixme tags after complex query
sqlite> SELECT
   ...> COUNT(*) AS ValidCount
   ...> FROM fixme;
ValidCount
21
Count of Fixme nodes by Place Type
sqlite> .mode column
sqlite> SELECT
   ...> PlaceType,
   ...> COUNT(*) AS FixmeCount
   ...> FROM fixme
   ...> GROUP BY PlaceType
   ...> ORDER BY COUNT(*) DESC;
PlaceType   FixmeCount
----------  ----------
Caltrans    4
Trail       2
bus_stop    2
Associatio  1
Center      1
Entrance    1
Shop        1
cafe        1
city        1
fast_food   1
hotel       1
ice_cream   1
library     1
place_of_w  1
restaurant  1
stream      1
List of Place Names, Place Type and Location
sqlite> SELECT
   ...> id AS nodeid,
   ...> PlaceName,
   ...> PlaceType,
   ...> googlemap_checker
   ...> FROM fixme;
nodeid      PlaceName                                 PlaceType        googlemap_checker
----------  ----------------------------------------  ---------------  ------------------------
26819236    San Francisco                             city             37.7792808, -122.4192363
282600552   First Baptist Church (in Covenant with F  place_of_worshi  37.8673104, -122.2614279
368170033   Peninsula Hospital Health Sciences Libra  library          37.5920286, -122.3831222
816289382   Market St & 17th St                       bus_stop         37.7624107, -122.4359472
816300001   Market St & Castro St                     bus_stop         37.7628952, -122.4350231
1110102182  Caltrans South San Francisco Maintenance  Caltrans         37.6540039, -122.4070986
1110125147  Caltrans 55th Street Maintenance Station  Caltrans         37.8399997, -122.2622393
1184118036  Caltrans Westborough Maintenance Station  Caltrans         37.6437518, -122.4537848
1210822824  Caltrans Ettie Street Maintenance Statio  Caltrans         37.8260525, -122.2893159
1901868592  Vertigo                                   hotel            37.7884016, -122.4158739
1924019363  Blush Frozen Yogurt                       restaurant       37.7774125, -122.392276
2217250747  Bayview trail                             Trail            37.562389, -122.096965
2217251865  Bayview trail                             Trail            37.562486, -122.097385
2356902019  Mental Health Association of Alameda Cou  Association      37.8441993, -122.2776973
2525079629  Strawberry Creek                          stream           37.8676787, -122.2833163
2639227318  Gelateria Naia                            ice_cream        37.7998759, -122.4093442
2765693662  First Floor Entrance                      Entrance         37.7237531, -122.4767131
3069400840  El Faro                                   fast_food        37.7698259, -122.449508
3228601037  Integral Counseling Center                Center           37.7423814, -122.4267344
3474438154  Jump Start Coffee and Whole Foods         Shop             37.7521445, -122.4230494
4339548347  CAW Pacific                               cafe             37.7947938, -122.4225248
Manual cross-checking with Google using latitude and longitude
Creating Table
sqlite> CREATE TEMP TABLE PlaceTypeValid (
   ...> id INT PRIMARY KEY,
   ...> valid INT,
   ...> googlefinding VARCHAR(75));
sqlite> .tables
node_tags            nodes                temp.PlaceTypeValid
temp.fixme           way_nodes            way_tags
ways
Inserting Findings
sqlite> INSERT INTO PlaceTypeValid
   ...> VALUES
   ...> ('4339548347', 1, 'It is a contemporary pâtisserie'),
   ...> ('3474438154', 1, 'Deli, Sandwich, Coffee and Tea, Bagels shop'),
   ...> ('3228601037', 1, 'It is a nonprofit counseling center'),
   ...> ('3069400840', 1, 'Mexican Cuisine'),
   ...> ('2765693662', 0, 'Undetermined, more info needed'),
   ...> ('2639227318', 1, 'Ice cream shop selling gelato'),
   ...> ('2525079629', 1, 'Waterway stream in Berkeley'),
   ...> ('2356902019', 1, 'A mental health association'),
   ...> ('2217251865', 1, 'A trail in the coyote hills regional park'),
   ...> ('2217250747', 1, 'A trail in the coyote hills regional park'),
   ...> ('1924019363', 0, 'Frozen Yogurt not really representing a restaurant'),
   ...> ('1901868592', 1, 'A hotel in the city'),
   ...> ('1210822824', 1, 'A toll Place by the Interstate 80'),
   ...> ('1184118036', 0, 'Undetermined, more data needed'),
   ...> ('1110125147', 0, 'Undetermined, more data needed'),
   ...> ('1110102182', 0, 'Undetermined, more data needed'),
   ...> ('816300001', 0, 'Undetermined, more data needed'),
   ...> ('816289382', 0, 'Undetermined, more data needed'),
   ...> ('368170033', 0, 'Undetermined, more data needed'),
   ...> ('282600552', 1, 'A church'),
   ...> ('26819236', 1, 'San Francisco the City');
Places and Findings
sqlite> SELECT main.id, placename, placetype, valid, googlefinding
   ...> FROM fixme AS main
   ...> JOIN placetypevalid AS ptv
   ...> ON main.id = ptv.id;
id          PlaceName                                 PlaceType        valid       googlefinding
----------  ----------------------------------------  ---------------  ----------  ----------------------
26819236    San Francisco                             city             1           San Francisco the City
282600552   First Baptist Church (in Covenant with F  place_of_worshi  1           A church
368170033   Peninsula Hospital Health Sciences Libra  library          0           Undetermined, more dat
816289382   Market St & 17th St                       bus_stop         0           Undetermined, more dat
816300001   Market St & Castro St                     bus_stop         0           Undetermined, more dat
1110102182  Caltrans South San Francisco Maintenance  Caltrans         0           Undetermined, more dat
1110125147  Caltrans 55th Street Maintenance Station  Caltrans         0           Undetermined, more dat
1184118036  Caltrans Westborough Maintenance Station  Caltrans         0           Undetermined, more dat
1210822824  Caltrans Ettie Street Maintenance Statio  Caltrans         1           A toll Place by the In
1901868592  Vertigo                                   hotel            1           A hotel in the city
1924019363  Blush Frozen Yogurt                       restaurant       0           Frozen Yogurt not real
2217250747  Bayview trail                             Trail            1           A trail in the coyote
2217251865  Bayview trail                             Trail            1           A trail in the coyote
2356902019  Mental Health Association of Alameda Cou  Association      1           A mental health associ
2525079629  Strawberry Creek                          stream           1           Waterway stream in Ber
2639227318  Gelateria Naia                            ice_cream        1           Ice cream shop selling
2765693662  First Floor Entrance                      Entrance         0           Undetermined, more inf
3069400840  El Faro                                   fast_food        1           Mexican Cuisine
3228601037  Integral Counseling Center                Center           1           It is a nonprofit coun
3474438154  Jump Start Coffee and Whole Foods         Shop             1           Deli, Sandwich, Coffee
4339548347  CAW Pacific                               cafe             1           It is a contemporary p
- Suggestion, Benefits, & Concerns:
- Suggestion: We can start categorizing the issues as demonstrated above. In this example, I created a temporary table, but it would be more useful to create a view that we can load into a visualization tool. The challenge then is to find the categories that matter most to us and validate them; we can take samples and validate them as the process scales up.
- Benefit: This approach will allow us to categorize the issues. In this example, there are only 21 named places with some sort of issue indicated by a user. If we extend the database or incentivize our users to provide more feedback, we will scale up the insights we need to find root-cause solutions, and categorizing these issues will show which categories are the most prominent.
- Concern: There are a couple of concerns about this approach. Every time there is user interaction with data, there is room for error: are the users pointing to a real issue? Do we have the right process to filter out the noise of misunderstandings, where a user thinks there is a problem when there is none? An example grouping the most frequent user concerns is listed below.*
Most Concerns by Rank*
sqlite> WITH IssueTable AS
   ...> (
   ...> SELECT
   ...> value,
   ...> COUNT(*) AS IssueCount
   ...> FROM node_tags nt
   ...> LEFT JOIN nodes n
   ...> ON nt.nodeid = n.id
   ...> WHERE LOWER(key) = 'fixme'
   ...> GROUP BY value
   ...> ORDER BY COUNT(*) DESC
   ...> )
   ...> SELECT
   ...> *,
   ...> (
   ...> SELECT
   ...> COUNT(*) + 1
   ...> FROM
   ...> (
   ...> SELECT DISTINCT IssueCount
   ...> FROM IssueTable i1
   ...> WHERE main.IssueCount < i1.IssueCount
   ...> )
   ...> ) AS IssueRank
   ...> FROM IssueTable main;
value                                                         IssueCount  IssueRank
------------------------------------------------------------  ----------  ----------
continue                                                      10          1
Connect to a building, or use the door tag if this is inside  3           2
address number not visible, number inferred                   3           2
Connect to a building.                                        2           3
There appears to be a social trail in the vicinity. Please c  2           3
Where is this?                                                2           3
address not actually visible;assuming based on numbers surro  2           3
is no exit?                                                   2           3
map to line                                                   2           3
unsigned, so unsure of number                                 2           3
yes                                                           2           3
Check position.                                               1           4
Continue track                                                1           4
Discover what this facility is.                               1           4
Does Muni stop here?                                          1           4
Exact position                                                1           4
I couldn't see this during a drive-by survey on 2016-06-15;   1           4
Location needs some improvement                               1           4
Needs Ground Survey. Road closed to through traffic near her  1           4
Needs Ground survey                                           1           4
North Field What?                                             1           4
Please change address for 3580 Terrace way to the new locati  1           4
This is an alternative address for Maru Sushi, differing onl  1           4
This yard needs additional work and cleanup                   1           4
What is this? Should be deleted?                              1           4
When zooming out, Oakland (a nearby city) label covers over   1           4
Where's exit 46B to Ygnacio Valley Road?                      1           4
address info for upstairs residences                          1           4
address not visible; presumed logical address                 1           4
address not visible;presumed                                  1           4
address number not visible;assumed 1192                       1           4
address number not visible;assumed address                    1           4
check address number;none was visible, so assumed 41          1           4
check name                                                    1           4
confirm address - sign hard to read from across the street    1           4
confirm address, since last 2 digits of number worn off       1           4
couldn't read address number clearly so not sure              1           4
does this still exist?                                        1           4
ground survey needed for Market St & Castro St for location   1           4
ground survey needed to confirm name                          1           4
is there really a library here?                               1           4
more precise location within building                         1           4
needs an approval on the ground                               1           4
no visible address;making obvious assumption                  1           4
not 100% sure this address was this building and not the nex  1           4
not sure/can't tell where this goes                           1           4
resurvey; yelp says permanently closed                        1           4
some towers in this area being relocated                      1           4
survey                                                        1           4
where does the 45 mph speed limit begin?                      1           4
where is the area located for this?                           1           4
where is this area at?                                        1           4
where is this now?                                            1           4
It is evident that the San Francisco area needs work, and not only the SF area: the process of data gathering itself needs to be improved. Users need to be motivated to avoid errors at the root of the problem: at data entry. Since this is an open-source project where people from all over the world enter data into the map, it will be a challenge to avoid most of the problems. But with data cleansing methods and gamification to improve data entry, we can create a cleaner map for everybody. It will also be important to have visualizations so we can keep track of the progress we have made. These could be in line with the following questions:
- How many entries have we cleaned?
- How many new entries have errors?
- Historically, are the errors going down or up?
These are just examples of KPIs that we can follow to make sure we are on track to achieve our goals. More KPIs will emerge as investigations continue and new problems are uncovered.