OpenStreetMap Data Case Study - Python & SQLite

A wrangling exercise over an area of the world from https://www.openstreetmap.org, using data munging techniques such as assessing the quality of the data for validity, accuracy, completeness, consistency, and uniformity.

Map Area

San Francisco, CA, United States

I chose this area because I recently moved to San Francisco, and I was interested in discovering more about the area in OpenStreetMap.

Parsing Code

All Python cleaning methods use the following generator to parse the XML file. It iterates over the tree incrementally and clears the root after each yielded element to keep memory usage low.

import xml.etree.cElementTree as ET

def get_element(osm_file, tags=('node', 'way', 'relation')):
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()
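
For example, a minimal sketch (assuming the extract is saved locally as san-francisco_california.osm, the file listed in the file sizes section below) that counts the top-level elements with this generator:

osm_file = 'san-francisco_california.osm'
counts = {}
for element in get_element(osm_file):
    #Tally each yielded tag name (node, way, or relation)
    counts[element.tag] = counts.get(element.tag, 0) + 1
print counts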

Problems & Cleaning

Here are five examples of problems found in the area mentioned above, along with the cleaning method and the results for each.

Example 1: Problematic Characters

  • Problem: Some tags had problematic characters such as (=+/&<>;'"?%#$@,. \t\r\n[])
    sfgov.org:OFFICE_TYP
    sfgov.org:OBJECTID
    sfgov.org:OFFICE_TYP
    DOT #
    DOT #
    sfgov.org:OBJECTID
    sfgov.org:OFFICE_TYP
    
  • Cleaning Code:
    import re
    PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
    for element in get_element(osm_file):
        for child in element:
            if child.tag == 'tag': #for Tags Only

                #Gets information from tags
                child_tag_key = child.get('k')

                #Identifies problematic tags
                key_problems = PROBLEMCHARS.search(child_tag_key)

                #Skips any tags with problems
                if key_problems is not None:
                    continue
    
  • Description of Method: In this example, I use .compile from the regular expression module (re) to compile the list of problematic characters into a pattern object. Then I use the .search method to look for them in the key string, which is extracted from each tag with the .get method. The continue statement skips tags whose keys contain problematic characters, preventing them from reaching the final product. A quick illustration follows the results below.
  • Results:
    name:ru
    z_order
    wikidata
    wikipedia
    population
    short_name
    is_in:country
    is_in:continent
    census:population
    is_in:country_code
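
A quick illustration of the pattern above, applied to a few of the keys listed in the problem (a sketch, not part of the original script):

    import re
    PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
    #Sample keys taken from the problem list above
    for key in ['sfgov.org:OBJECTID', 'DOT #', 'census:population']:
        print key, '->', 'skipped' if PROBLEMCHARS.search(key) else 'kept'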
    

Example 2: Description of Key Separated by Colon

  • Problem: In our previous example, we can see that some tag keys contain metadata describing the key (e.g. census:population).

  • Cleaning Code:

    Continuation of last code

            #Skips any tags with problems
            if key_problems is not None:
                continue

            #Gets the colon position in the string
            colon_pos = child.get('k').find(':')

            if colon_pos > 0: #for Tags with colon only

                #Splitting keys inside tags
                key_tag = child.attrib['k'][colon_pos + 1:]
                type_tag = child.attrib['k'][0:colon_pos]

            else:
                key_tag = child.attrib['k']
                type_tag = 'regular'
  • Description of Method: In this example, I again use the .get method to retrieve the key, combined with the .find method, which returns the index of the colon if found and -1 if not. An if statement ensures the cleaning only happens when a colon is present. Although the .get method would return the same value, this time I used .attrib combined with index slicing to read the attribute: the characters after the colon ([colon_pos + 1:]) become the key, and the characters before it ([0:colon_pos]) become the type. This separates the metadata (the type) from the string it describes. Any key without a colon gets the regular type as a placeholder. A condensed helper version of this logic appears after the results.

  • Results:

   Before: name:zh | After: Type:name, key: zh
   Before: z_order | After: Type:regular, key: z_order
   Before: wikidata | After: Type:regular, key: wikidata
   Before: wikipedia | After: Type:regular, key: wikipedia
   Before: population | After: Type:regular, key: population
   Before: short_name | After: Type:regular, key: short_name
   Before: is_in:country | After: Type:is_in, key: country
   Before: is_in:continent | After: Type:is_in, key: continent
   Before: census:population | After: Type:census, key: population
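
The splitting logic condenses into a small helper (a sketch, not part of the original script):

    def split_key(k):
        #'census:population' -> ('census', 'population');
        #keys without a colon get the 'regular' placeholder type
        colon_pos = k.find(':')
        if colon_pos > 0:
            return k[0:colon_pos], k[colon_pos + 1:]
        return 'regular', k

    print split_key('census:population') #('census', 'population')
    print split_key('wikidata')          #('regular', 'wikidata')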

Example 3: More Splitting Necessary in the Values

  • Problem: After loading the data, I ran a query to check for other values that could have been split. Although the same could have been achieved using Python, I did not notice the issue until I ran the SQL query; both methods are demonstrated below.
  • Cleaning Code:
    • Initial Query to Identify:

       #This query uses pandas to read SQLite data into a DataFrame
       import sqlite3
       import pandas as pd

       connection = "OpenMapProject"
       db = sqlite3.connect(connection)

       non_splitted_query = '''
       SELECT nodeid, type, key, value
       FROM node_tags
       WHERE value LIKE '%:%'
           AND key IN ('wikipedia') --change to identify trend by key;
       '''
       print pd.read_sql_query(non_splitted_query, db)
      
    • Grouping Keys to Help Identify the Best Cleaning Code:

       Group_byKey = '''
       SELECT key, COUNT(*) AS keycount
       FROM ({subquery})
       GROUP BY key
       ORDER BY COUNT(*) DESC;
       '''
       print pd.read_sql_query(Group_byKey.format(subquery=non_splitted_query), db)
       >results
       			key  keycount
       0      opening_hours      1047
       1             source       287
       2          wikipedia       188
       3   collection_times        37
       4               note        30
       5               name        24
       6        description        10
       7    wpt_description         9
       8        conditional         4
       9            highway         2
       10     service_times         2
       11      traffic_sign         2
       12           amenity         1
       13          backward         1
       14                en         1
       15        happy_hour         1
       16       happy_hours         1
       17         housename         1
       18             image         1
       19       information         1
       20               ref         1
       21         route_ref         1
      
    • Python Code to Clean wikipedia Language Acronym:

      Modification of Code in Example 2

           #Gets the value of the element
           value_string = child.attrib['v']

           if colon_pos > 0: #for Tags with colon only

               #Splitting keys inside tags
               key_tag = child.attrib['k'][colon_pos + 1:]
               type_tag = child.attrib['k'][0:colon_pos]

           else:

               #Set to '' to not add the wiki language to the key tag
               language_part = ''

               #Ensuring that the wiki language stays with wiki
               if child.attrib['k'] == 'wikipedia':

                   #Only makes changes if a colon is found in the value
                   colon_pos = child.get('v').find(':')
                   if colon_pos > 0:
                       language_part = '_in_' + child.attrib['v'][0:colon_pos]
                       value_string = child.attrib['v'][colon_pos + 1:]

               #Applying additions if needed and setting regular types
               key_tag = child.attrib['k'] + language_part
               type_tag = 'regular'
      
  • Description of Method: In this example, to clean the key tags I again used the .attrib method to check whether the k attribute is wikipedia, combined with an if statement so only the applicable tags are changed. I then used the .get().find() combination to read the v attribute and locate the colon, and separated the strings with index slicing as explained in the previous example. The pieces were concatenated to create a key such as wikipedia_in_en. For the SQL part, I used the sqlite3 module's .connect method to open a connection to the database file. Instead of the connection's .cursor and .fetchall methods, I queried the data with pandas' .read_sql_query, which reads the SQL query into a DataFrame and displays it in tabular format with headers in Jupyter Notebook, using fewer steps (a cursor-based sketch follows the results). To construct the SQL query, I looked once more for values that could be split by a colon, using a WHERE clause with the LIKE operator and wildcard characters to find a colon at any position in the string, and the IN operator to match only the wikipedia key tags. To confirm that the wikipedia key's language acronym could be separated with confidence, I wrote another query that groups the key tags with GROUP BY, counts the instances with the COUNT function, and sorts them from highest to lowest with ORDER BY ... DESC. To print this query, I used the .format method to plug non_splitted_query in as a subquery.
  • Results:
    Before: Key: wikipedia, Value: en:San Francisco
    After: Key: wikipedia_in_en, Value: San Francisco
    Before: Key: wikipedia, Value: en:Hayward (Amtrak station)
    After: Key: wikipedia_in_en, Value: Hayward (Amtrak station)
    Before: Key: wikipedia, Value: en:Fitzgerald Marine Reserve
    After: Key: wikipedia_in_en, Value: Fitzgerald Marine Reserve
    Before: Key: wikipedia, Value: en:Crocker-Amazon, San Francisco
    After: Key: wikipedia_in_en, Value: Crocker-Amazon, San Francisco
    Before: Key: wikipedia, Value: en:Presidio Terrace
    After: Key: wikipedia_in_en, Value: Presidio Terrace
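
For reference, the cursor-based alternative mentioned above would look roughly like this (a sketch using the same connection and query):

    cursor = db.cursor()
    cursor.execute(non_splitted_query)
    #fetchall returns a list of plain tuples instead of a DataFrame
    for row in cursor.fetchall():
        print row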
    

Example 4: Changing Type name to Language

  • Problem: There were two fields using the same string, name, to describe the data. However, the type field used it to describe the name of a language, while the key field used it to describe the name of a place. To avoid any confusion in the future, all name strings describing a language will be changed to language.
  • Cleaning Code:
    • Initial Query to Identify:

       type_name_desc = '''
       SELECT
           nodeid,
           type,
           key,
           value
       FROM node_tags
       WHERE LOWER(type) = 'name'
       LIMIT 16 --limiting to save space
       '''
       print pd.read_sql_query(type_name_desc, db)
       >results
              nodeid  type key               value
       0    26819236  name  de       San Francisco
       1    26819236  name  eu       San Francisco
       2    26819236  name  fa       سان فرانسیسکو
       3    26819236  name  hi     सैन फ्रांसिस्को
       4    26819236  name  ja            サンフランシスコ
       5    26819236  name  kn   ಸಾನ್ ಫ್ರಾನ್ಸಿಸ್ಕೊ
       6    26819236  name  ko              샌프란시스코
       7    26819236  name  pt       São Francisco
       8    26819236  name  ru       Сан-Франциско
       9    26819236  name  ta  ஸான் ப்²ரான்ஸிஸ்கொ
       10   26819236  name  zh                 旧金山
       11  100571506  name  ja               バークレー
       12  100571506  name  zh                 伯克利
       13  140982569  name  zh                列治文区
       14  140982615  name  zh                马里纳区
       15  140982670  name  zh            海特-黑什伯里区
      
    • Python Code to Change type name to language

      Modification of Code in Example 3

           if colon_pos > 0: #for Tags with colon only

               #Splitting keys inside tags
               key_tag = child.attrib['k'][colon_pos + 1:]
               type_tag = child.attrib['k'][0:colon_pos]

           else:

               #Set to '' to not add the wiki language to the key tag
               language_part = ''

               #Ensuring that the wiki language stays with wiki
               if child.attrib['k'] == 'wikipedia':

                   #Only makes changes if a colon is found in the value
                   colon_pos = child.get('v').find(':')
                   if colon_pos > 0:
                       language_part = '_in_' + child.attrib['v'][0:colon_pos]
                       value_string = child.attrib['v'][colon_pos + 1:]

               #Applying additions if needed and setting regular types
               key_tag = child.attrib['k'] + language_part
               type_tag = 'regular'

           #Changes type name to language only when found
           if type_tag == 'name':
               type_tag = 'language'
      
  • Description of Method: In this example, to clean the type tags, I used a single if statement to find any type_tag matching the string name. The type_tag value was assigned using the same .attrib and index-slicing combination as before. Type tags matching name were changed to language by reassigning the variable to the new string value.
  • Results
    Before: Type: name, Key and Value: de San Francisco
    After: Type: language, Key and Value: de San Francisco
    Before: Type: name, Key and Value: eu San Francisco
    After: Type: language, Key and Value: eu San Francisco
    Before: Type: name, Key and Value: fa سان فرانسیسکو
    After: Type: language, Key and Value: fa سان فرانسیسکو
    Before: Type: name, Key and Value: hi सैन फ्रांसिस्को
    After: Type: language, Key and Value: hi सैन फ्रांसिस्को
    Before: Type: name, Key and Value: ja サンフランシスコ
    After: Type: language, Key and Value: ja サンフランシスコ
    

Example 5: Cleaning Metadata of Address

  • Problem: The metadata describing the source of the housenumber is not consistent: some tags present the key as source:housenumber and others as housenumber:source. I am going to change both to housenumber_source, which will give us a consistent key tag to query later if needed.
  • Cleaning Code:
    • Initial Query to Identify:

       address_keys = '''
       SELECT DISTINCT key
       FROM node_tags
       WHERE type = 'addr';
       '''
       print pd.read_sql_query(address_keys, db)
       >result
       			   key
       0          housenumber
       1               street
       2                 city
       3                state
       4             postcode
       5              country
       6                 unit
       7   source:housenumber
       8                 full
       9            housename
       10                pier
       11       interpolation
       12              county
       13  housenumber:source
       14            province
       15                door
       16               floor
       17               place
       18               suite
       19               flats
      
    • Python Code to Clean Address Metadata

      Modification of Code in Example 4

           #Changes to metadata when found
           if type_tag == 'name':
               type_tag = 'language'
           elif type_tag == 'addr':
               if key_tag.find(':') > 0:
                   key_tag_array = key_tag.split(':')
                   if key_tag_array[0] == 'source':
                       sep = '_'
                       key_tag = sep.join([key_tag_array[1], key_tag_array[0]])
                   else:
                       key_tag = key_tag.replace(':', '_')
                   print key_tag
      
  • Description of Method: The methods I used here are .split, .join, and .replace. I used .split to separate the key tag into a list of two values so I could change their order. Where the key tag starts with source, the values are joined back together in reverse order with .join, producing the consistent housenumber_source key tag. Where the key tag ends with source, the order is already ideal, so I use .replace to simply swap the colon for an underscore. A condensed helper version appears after the results.
  • Results:
    Before: source:housenumber
    After: housenumber_source
    Before: housenumber:source
    After: housenumber_source
    Before: housenumber:source
    After: housenumber_source
    Before: source:housenumber
    After: housenumber_source
    Before: street:source
    After: street_source
    Before: source:housenumber
    After: housenumber_source	
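
The reordering logic condenses into a small helper (a sketch, not part of the original script):

    def normalize_source_key(key_tag):
        #'source:housenumber' and 'housenumber:source' both become 'housenumber_source'
        parts = key_tag.split(':')
        if parts[0] == 'source':
            return '_'.join([parts[1], parts[0]])
        return key_tag.replace(':', '_')

    print normalize_source_key('source:housenumber') #housenumber_source
    print normalize_source_key('housenumber:source') #housenumber_source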
    

Data Overview Using Jupyter Notebook SQLite3 Connection

This section contains basic statistics about the dataset, the SQL queries used to gather them, and some additional ideas about the data in context.

Timing of query function

from datetime import datetime as dt

def timing_print(s):
    print '____________________'
    print "Start Time: " + str(s)
    print "Timing: " + str(dt.now() - s)
    print "Connection: " + connection
    print '____________________'
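
It is used throughout this section in the following pattern (a sketch; the start time must be captured before the query runs):

s = dt.now()
#... run the query with pd.read_sql_query(...) here ...
timing_print(s)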

File sizes

san-francisco_california.osm ......... 1,380,743 KB
SF_OpenMapProject.db ................. 752,287 KB
nodes.csv ............................ 544,083 KB
node_tags.csv ........................ 9,542 KB
ways.csv ............................. 49,453 KB
way_tags.csv ......................... 58,868 KB
way_nodes.csv ........................ 185,116 KB

Number of records in each table

table_names = ('nodes', 'node_tags', 'ways', 'way_tags', 'way_nodes')
number_nodes_query = '''
SELECT COUNT(*) AS {table}_count
FROM {table};
'''
for table in table_names:
    print pd.read_sql_query(number_nodes_query.format(table=table), db)
  
-----------------
   nodes_count
0      6641035
-----------------
   node_tags_count
0           260460
-----------------
   ways_count
0      827729
-----------------
   way_tags_count
0         1784424
-----------------
   way_nodes_count
0          7885078
____________________
Start Time: 2017-11-23 11:23:14.859000
Timing: 0:00:01.275000
Connection: SF_OpenMapProject
____________________

Number of unique users

tables_user = ('nodes', 'ways')
number_uniq_user = '''
SELECT COUNT(DISTINCT uid) AS unique_{name}_count
FROM ({table})
'''
subquery = '''
SELECT uid
FROM nodes
UNION ALL
SELECT uid
FROM ways
'''
s = dt.now()
for table in tables_user:
    print pd.read_sql_query(number_uniq_user.format(table=table, name=table), db)
print pd.read_sql_query(number_uniq_user.format(table=subquery, name='total'), db)
timing_print(s)

---------------------
   unique_nodes_count
0                2596
---------------------
   unique_ways_count
0               1830
---------------------
   unique_total_count
0                2909
____________________
Start Time: 2017-11-23 11:30:29.738000
Timing: 0:00:04.547000
Connection: SF_OpenMapProject
____________________
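
Note that the total (2909) is not the sum of the two counts (2596 + 1830) because many users edit both nodes and ways; a tiny sketch with hypothetical uids shows why the UNION ALL subquery is needed:

node_uids = [1, 2, 3]  #hypothetical
way_uids = [2, 3, 4]   #hypothetical
print len(set(node_uids)), len(set(way_uids)), len(set(node_uids + way_uids))
#prints: 3 3 4, not 3 + 3 = 6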

Top 10 contributing users

number_uniq_user = '''
SELECT user, COUNT(*) AS contrib_count
FROM (SELECT uid, user FROM nodes UNION ALL SELECT uid, user FROM ways)
GROUP BY uid
ORDER BY COUNT(*) DESC
LIMIT 10;
'''
s = dt.now()
print pd.read_sql_query(number_uniq_user, db)
timing_print(s)

-----------------------------
          user  contrib_count
0      andygol        1495792
1       ediyes         887534
2    Luis36995         679457
3    dannykath         546016
4     RichRico         415649
5        Rub21         383321
6    calfarome         190862
7     oldtopos         165467
8  KindredCoda         151423
9     karitotp         139643
____________________
Start Time: 2017-11-23 15:50:20.792000
Timing: 0:00:05.117000
Connection: SF_OpenMapProject
____________________

Top type entries with over 100 occurrences

bytype_count = '''
SELECT type, COUNT(*) AS TypeCount
FROM node_tags
GROUP BY type
HAVING COUNT(*) > 100
ORDER BY COUNT(*) DESC
'''
s = dt.now()
print pd.read_sql_query(bytype_count, db)
timing_print(s)

------------------------------
              type  TypeCount
0           regular     154699
1              addr      83549
2              gnis       9594
3           species       2923
4   redwood_city_ca       2836
5           seamark       2673
6          language        392
7            survey        374
8           payment        339
9              noaa        297
10     fire_hydrant        277
11            wetap        275
12           source        271
13  traffic_signals        258
14          contact        206
15          toilets        197
16          service        117
17         building        112
18          disused        101
____________________
Start Time: 2017-11-23 16:03:22.740000
Timing: 0:00:00.135000
Connection: SF_OpenMapProject
____________________

Number of tags that need special attention, by source and type, compared with the total count of tags

validate_total_count = '''
SELECT COUNT(*) AS way_regular_total
FROM way_tags
WHERE type = 'regular' and LOWER(key) <> 'fixme'
'''
validate_fixme_count = '''
SELECT 
    --w1.wayid, --commented after crosschecking with source
    COUNT(w2.wayid) AS way_regular_fixme
FROM
(SELECT wayid FROM way_tags w1 WHERE type = 'regular' and LOWER(key) <> 'fixme') w1
LEFT JOIN 
(SELECT DISTINCT wayid FROM way_tags WHERE LOWER(key) = 'fixme') w2
ON w1.wayid = w2.wayid
--WHERE w1.wayid = 156124086 --commented after crosschecking
--GROUP BY w1.wayid --commented after crosschecking with source
'''
fixme_tags = '''
SELECT source_node, type, COUNT(fixid) AS fixme_count, COUNT(*) AS total_count
FROM
(
SELECT * FROM (
(SELECT 'node' AS source_node, nodeid as id, type, key, value FROM node_tags WHERE LOWER(key) <> 'fixme') node_nofix
LEFT JOIN
(SELECT nodeid as fixid FROM node_tags WHERE LOWER(key) = 'fixme') node_fix
ON node_nofix.id = node_fix.fixid)
UNION
SELECT * FROM (
(SELECT 'way' AS source_node, wayid as id, type, key, value FROM way_tags WHERE LOWER(key) <> 'fixme') way_nofix
LEFT JOIN
(SELECT wayid as fixid FROM way_tags WHERE LOWER(key) = 'fixme') AS way_fix
ON way_nofix.id = way_fix.fixid)
)
GROUP BY source_node, type
HAVING COUNT(fixid) > 0
ORDER BY COUNT(fixid) DESC;
'''
print pd.read_sql_query(validate_total_count, db)
print pd.read_sql_query(validate_fixme_count, db)
print pd.read_sql_query(fixme_tags, db)

-------------------
  way_regular_total
0            1286707
--------------------
   way_regular_fixme
0                641
---------------------------------------------------------
   source_node             type  fixme_count  total_count
0          way          regular          641      1286707
1          way            tiger          256       257884
2         node          regular          122       154625
3         node             addr           91        83549
4          way             addr           30       157411
5          way           source           16         4897
6         node         caltrans           11           28
7         node         language           11          392
8          way              hgv           11         3096
9          way             gnis            8         4994
10         way         building            6         3679
11         way            Tiger            5          342
12         way         cycleway            5         1873
13        node             gnis            4         9594
14        node            is_in            3           21
15         way      destination            2         1336
16         way             turn            2         9754
17        node           census            1           40
18         way           oneway            1           34
19         way             park            1           42
20         way  social_facility            1           19
21         way          toilets            1           72
____________________
Start Time: 2017-11-23 18:19:17.046000
Timing: 0:00:11.530000
Connection: SF_OpenMapProject
____________________
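
The key detail in these queries is that COUNT(fixid) counts only the rows where the LEFT JOIN found a match (non-NULL), while COUNT(*) counts every row. A rough Python analogue with hypothetical ids:

tag_ids = [1, 2, 3, 4]   #hypothetical tag rows
fixme_ids = set([2, 4])  #hypothetical rows flagged fixme
joined = [(i, i if i in fixme_ids else None) for i in tag_ids]
print len([f for i, f in joined if f is not None])  #like COUNT(fixid): 2
print len(joined)                                   #like COUNT(*): 4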

Additional Ideas

Idea 1

  • Verifying the integrity of the addresses entered.

    • I noticed inconsistencies in data entry, such as entering the housenumber as the housename, including the street number in both the housename field and the street field, and leaving important parts of the address empty, such as state and country, while entering the postcode.
  • Example: SQL Script

     important_addr_parts = '''
     SELECT id, lat || ', ' || lon AS googlemap_checker, housenumber, housename, COALESCE(housenumber, housename) AS number, street, unit, state, country, postcode
     FROM nodes main
     LEFT JOIN
     (SELECT nodeid, value AS housenumber FROM node_tags WHERE key = 'housenumber') AS hnum
     ON hnum.nodeid = main.id
     LEFT JOIN
     (SELECT nodeid, value AS housename FROM node_tags WHERE key = 'housename' and (cast(value as integer) > 0) = 1) AS hnam
     ON hnam.nodeid = main.id
     LEFT JOIN
     (SELECT nodeid, value AS street FROM node_tags WHERE key = 'street') AS str
     ON str.nodeid = main.id
     LEFT JOIN
     (SELECT nodeid, value AS unit FROM node_tags WHERE key = 'unit') AS un
     ON un.nodeid = main.id
     LEFT JOIN
     (SELECT nodeid, value AS state FROM node_tags WHERE key = 'state') AS sta
     ON sta.nodeid = main.id
     LEFT JOIN
     (SELECT nodeid, value AS country FROM node_tags WHERE key = 'country') AS coun
     ON coun.nodeid = main.id
     LEFT JOIN
     (SELECT nodeid, value AS postcode FROM node_tags WHERE key = 'postcode') AS post
     ON post.nodeid = main.id
     WHERE housenumber IS NULL and housename IS NOT NULL
     --LIMIT 10
     '''
     print pd.read_sql_query(important_addr_parts, db)
    
  • Results:

     	    id         googlemap_checker housenumber          housename  \
     0    358772135  37.8822704, -122.1889555        None                 25   
     1    359099330  37.7993396, -122.4017555        None                243   
     2   1126948245  37.7720318, -122.4318431        None                588   
     3   1374302174  37.8750833, -122.2935778        None               1601   
     4   1514485184  37.5237877, -122.2740926        None                570   
     5   2322611878  37.4856225, -122.1468804        None                  1   
     6   2452206221  37.5672122, -122.3245315        None                 30   
     7   2452206222  37.5671059, -122.3243973        None                 34   
     8   2452235804  37.5667107, -122.3234299        None                139   
     9   2657361795  37.8850851, -122.0512286        None  1400 Creekside Dr   
     10  4073823484  37.8376837, -122.2628589        None               5179   
    
     	       number            street  unit state country postcode  
     0                  25        Orinda Way  None  None    None     None  
     1                 243    Vallejo Street  None  None    None     None  
     2                 588     Haight Street  None  None    None     None  
     3                1601  San Pablo Avenue  None  None    None     None  
     4                 570       Masonic Way  None  None    None     None  
     5                   1              None  None  None    None     None  
     6                  30    South B Street  None  None    None     None  
     7                  34    South B Street  None  None    None     None  
     8                 139    South B Street  None  None    None     None  
     9   1400 Creekside Dr   Creekside Drive  None  None    None    94596  
     10               5179              None  None  None    None     None    
    
  • Suggestion, Benefits, & Concerns:

    • The latter can be fixed quickly since we can fill in the blanks for this particular area: it covers only the US and only CA, so replacing empty values with this logic is fairly easy in this example.
      • Benefit: The benefit of having all the important parts is that consumers of the map will know the exact address they are looking for, which can be achieved by filling the empty state and country fields with CA and USA.
      • Concern: On a larger scale, it would take a lot of resources to treat a symptom that can be eliminated or minimized at the root level.
    • Also, a coalesce when the housenumber is in the housename field (e.g. COALESCE(housenumber, housename)) can take care of the housenumber issue. The mixture of the address number with the street name (e.g. 1400 Creekside Dr) can be fixed while parsing the data by stripping the non-numeric characters when the string starts with a number (see the sketch after this list).
      • Benefit: The benefit of having the correct address is the ease of identifying the location without having to use the latitude and longitude.
      • Concern: This will require a trend analysis to ensure that we are correcting the problem instead of creating more problems. For example, a user could write the number at the end instead of the beginning (e.g. Creekside Dr 1400). If the code only strips non-numeric characters when the string starts with a number, this value will remain unchanged.
    • Another solution would be to gamify data entry to incentivize users to enter at least the most important parts of the address.
      • Benefit: Fewer resources will be needed to maintain the completeness of addresses since the problem will be addressed at the root, preventing the problems listed above.
      • Concern: This will require a heavy contribution from developers to create an open-source gamification algorithm that incentivizes users to keep addresses complete at the root of the problem. We will also have to run A/B testing to see which incentives minimize the number of problems, which will require more resources. But in the end, the results will be better, in my opinion.
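
A sketch of the parsing fix described above, splitting a leading house number out of a mixed street value (the helper name is mine):

    import re

    def split_house_number(value):
        #'1400 Creekside Dr' -> ('1400', 'Creekside Dr');
        #values that do not start with a number are left unchanged
        match = re.match(r'^(\d+)\s+(.+)$', value)
        if match:
            return match.group(1), match.group(2)
        return None, value

    print split_house_number('1400 Creekside Dr') #('1400', 'Creekside Dr')
    print split_house_number('Creekside Dr 1400') #(None, 'Creekside Dr 1400')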

Idea 2 (Using the SQLite Command Line)

  • Identify issues reported by users. These fixme entries describe a specific issue found by a user. We can identify these nodes and address those concerns. However, we have no categories for these issues, so presenting this data visually will be a challenge. We need to group these concerns into categories to see which categories have the most issues.

  • Examples and Results: Count of names with fixme tags

     sqlite> SELECT COUNT(*) AS namefixme_count
        ...> FROM node_tags as main
        ...> JOIN (SELECT nodeid FROM node_tags WHERE LOWER(key) = 'fixme') AS fixme
        ...> ON main.nodeid = fixme.nodeid
        ...> WHERE LOWER(key) = 'name';
        
     namefixme_count
     21
    

    Creating Temp Table

     sqlite> CREATE TEMP TABLE fixme AS SELECT *
        ...> FROM
        ...> (
        ...> SELECT
        ...>     id,
        ...>     substr(name.value, 1, 25) AS PlaceName,
        ...>     n.lat || ', ' || n.lon AS googlemap_checker,
        ...>     CASE
        ...>         WHEN LOWER(name.value) LIKE '%caltran%'THEN 'Caltrans'
        ...>         WHEN LOWER(name.value) LIKE '%trail%' THEN 'Trail'
        ...>         WHEN LOWER(name.value) LIKE '%association%' THEN 'Association'
        ...>         WHEN LOWER(name.value) LIKE '%center%' THEN 'Center'
        ...>         WHEN LOWER(name.value) LIKE '%entrance%' THEN 'Entrance'
        ...>         ELSE COALESCE(place.value, amen.value, high.value, misc.value, tour.value, waytag.value)
        ...>     END AS PlaceType,
        ...>     COALESCE(addr.value, waytag.value) AS Address,
        ...>     fixme.value AS FixMessage
        ...> FROM nodes n
        ...> JOIN --not left, to keep only named nodes
        ...> (SELECT nodeid, type, key, value FROM node_tags WHERE key = 'name') name
        ...> ON n.id = name.nodeid
        ...> LEFT JOIN --left to include those w/o places
        ...> (SELECT nodeid, value FROM node_tags WHERE key = 'place') place
        ...> ON n.id = place.nodeid
        ...> LEFT JOIN --left to include those w/o amenities
        ...> (SELECT nodeid, value FROM node_tags WHERE key = 'amenity') amen
        ...> ON n.id = amen.nodeid
        ...> LEFT JOIN --left to include those w/o highway
        ...> (SELECT nodeid, value FROM node_tags WHERE key = 'highway') high
        ...> ON n.id = high.nodeid
        ...> LEFT JOIN --left to include those w/o shop or entrance info
        ...> (SELECT nodeid,
        ...> CASE WHEN LOWER(key) = 'shop' THEN 'Shop' ELSE value END AS value FROM node_tags
        ...> WHERE (LOWER(key) = 'shop' and value = 'yes') OR (LOWER(key) = 'entrance' and value = 'yes')) misc
        ...> ON n.id = misc.nodeid
        ...> LEFT JOIN --left to include those w/o address
        ...> (SELECT nodeid, group_concat(value) AS value FROM node_tags WHERE type = 'addr' GROUP BY nodeid) addr
        ...> ON n.id = addr.nodeid
        ...> LEFT JOIN --left to include those w/o tourism
        ...> (SELECT nodeid, value FROM node_tags WHERE LOWER(key) = 'tourism') tour
        ...> ON n.id = tour.nodeid
        ...> JOIN --not left, to force only those w/ potential issues
        ...> (SELECT nodeid, value FROM node_tags WHERE LOWER(key) = 'fixme') fixme
        ...> ON n.id = fixme.nodeid
        ...> LEFT JOIN
        ...> (SELECT DISTINCT node_id, key,
        ...> CASE WHEN LOWER(key) = 'building' THEN 'Building' ELSE value END AS Value FROM way_nodes wn
        ...> JOIN way_tags wt ON wn.wayid = wt.wayid WHERE LOWER(key) = 'waterway' OR LOWER(key) = 'building') waytag
        ...> --adds more tag descriptions but slows down the query
        ...> ON n.id = waytag.node_id
        ...> );
     sqlite> .tables
     node_tags   nodes       temp.fixme  way_nodes   way_tags    ways
    

    Validating names with fixme tags after complex query

     sqlite> SELECT
        ...>     COUNT(*) AS ValidCount
        ...> FROM fixme;
        
     ValidCount
     21
    

    Count of Fixme nodes by Place Type

     sqlite> .mode column
     sqlite> SELECT
        ...>     PlaceType,
        ...>     COUNT(*) AS FixmeCount
        ...> FROM fixme
        ...> GROUP BY PlaceType
        ...> ORDER BY COUNT(*) DESC;
     PlaceType   FixmeCount
     ----------  ----------
     Caltrans    4
     Trail       2
     bus_stop    2
     Associatio  1
     Center      1
     Entrance    1
     Shop        1
     cafe        1
     city        1
     fast_food   1
     hotel       1
     ice_cream   1
     library     1
     place_of_w  1
     restaurant  1
     stream      1
    

    List of Place Names, Place Types, and Coordinates

     sqlite> SELECT
        ...>     id AS nodeid,
        ...>     PlaceName,
        ...>     PlaceType,
        ...>     googlemap_checker
        ...> FROM fixme;
     nodeid      PlaceName                                 PlaceType        googlemap_checker
     ----------  ----------------------------------------  ---------------  ------------------------
     26819236    San Francisco                             city             37.7792808, -122.4192363
     282600552   First Baptist Church (in Covenant with F  place_of_worshi  37.8673104, -122.2614279
     368170033   Peninsula Hospital Health Sciences Libra  library          37.5920286, -122.3831222
     816289382   Market St & 17th St                       bus_stop         37.7624107, -122.4359472
     816300001   Market St & Castro St                     bus_stop         37.7628952, -122.4350231
     1110102182  Caltrans South San Francisco Maintenance  Caltrans         37.6540039, -122.4070986
     1110125147  Caltrans 55th Street Maintenance Station  Caltrans         37.8399997, -122.2622393
     1184118036  Caltrans Westborough Maintenance Station  Caltrans         37.6437518, -122.4537848
     1210822824  Caltrans Ettie Street Maintenance Statio  Caltrans         37.8260525, -122.2893159
     1901868592  Vertigo                                   hotel            37.7884016, -122.4158739
     1924019363  Blush Frozen Yogurt                       restaurant       37.7774125, -122.392276
     2217250747  Bayview trail                             Trail            37.562389, -122.096965
     2217251865  Bayview trail                             Trail            37.562486, -122.097385
     2356902019  Mental Health Association of Alameda Cou  Association      37.8441993, -122.2776973
     2525079629  Strawberry Creek                          stream           37.8676787, -122.2833163
     2639227318  Gelateria Naia                            ice_cream        37.7998759, -122.4093442
     2765693662  First Floor Entrance                      Entrance         37.7237531, -122.4767131
     3069400840  El Faro                                   fast_food        37.7698259, -122.449508
     3228601037  Integral Counseling Center                Center           37.7423814, -122.4267344
     3474438154  Jump Start Coffee and Whole Foods         Shop             37.7521445, -122.4230494
     4339548347  CAW Pacific                               cafe             37.7947938, -122.4225248
    

    Manual cross-checking with Google using latitude and longitude

    Creating Table

     sqlite> CREATE TEMP TABLE PlaceTypeValid (
        ...>     id INT PRIMARY KEY,
        ...>     valid INT,
        ...>     googlefinding VARCHAR(75));
     sqlite> .tables
     node_tags            temp.fixme           ways
     nodes                way_nodes
     temp.PlaceTypeValid  way_tags
    

    Inserting Findings

     sqlite> INSERT INTO PlaceTypeValid
        ...> VALUES
        ...> ('4339548347',1, 'It is a contemporary pâtisserie'),
        ...> ('3474438154',1 ,'Deli, Sandwich, Coffee and Tea, Bagels shop'),
        ...> ('3228601037',1 ,'It is a nonprofit counseling center'),
        ...> ('3069400840',1 ,'Mexican Cuisine'),
        ...> ('2765693662',0 ,'Undetermined, more info needed'),
        ...> ('2639227318',1 ,'Ice cream shop selling gelato'),
        ...> ('2525079629',1 ,'Waterway stream in Berkeley'),
        ...> ('2356902019',1 ,'A mental health association'),
        ...> ('2217251865',1 ,'A trail in the coyote hills regional park'),
        ...> ('2217250747',1 ,'A trail in the coyote hills regional park'),
        ...> ('1924019363',0 ,'Frozen Yogurt not really representing a restaurant'),
        ...> ('1901868592',1 ,'A hotel in the city'),
        ...> ('1210822824',1 ,'A toll Place by the Interstate 80'),
        ...> ('1184118036',0 ,'Undetermined, more data needed'),
        ...> ('1110125147',0 ,'Undetermined, more data needed'),
        ...> ('1110102182',0 ,'Undetermined, more data needed'),
        ...> ('816300001',0 ,'Undetermined, more data needed'),
        ...> ('816289382',0 ,'Undetermined, more data needed'),
        ...> ('368170033',0 ,'Undetermined, more data needed'),
        ...> ('282600552',1 ,'A church'),
        ...> ('26819236',1 ,'San Francisco the City');						
    

    Places and Findings

     sqlite> SELECT main.id, placename, placetype, valid, googlefinding
        ...> FROM fixme AS main
        ...> JOIN placetypevalid AS ptv
        ...> ON main.id = ptv.id;
     id          PlaceName                                 PlaceType        valid       googlefinding
     ----------  ----------------------------------------  ---------------  ----------  ----------------------
     26819236    San Francisco                             city             1           San Francisco the City
     282600552   First Baptist Church (in Covenant with F  place_of_worshi  1           A church
     368170033   Peninsula Hospital Health Sciences Libra  library          0           Undetermined, more dat
     816289382   Market St & 17th St                       bus_stop         0           Undetermined, more dat
     816300001   Market St & Castro St                     bus_stop         0           Undetermined, more dat
     1110102182  Caltrans South San Francisco Maintenance  Caltrans         0           Undetermined, more dat
     1110125147  Caltrans 55th Street Maintenance Station  Caltrans         0           Undetermined, more dat
     1184118036  Caltrans Westborough Maintenance Station  Caltrans         0           Undetermined, more dat
     1210822824  Caltrans Ettie Street Maintenance Statio  Caltrans         1           A toll Place by the In
     1901868592  Vertigo                                   hotel            1           A hotel in the city
     1924019363  Blush Frozen Yogurt                       restaurant       0           Frozen Yogurt not real
     2217250747  Bayview trail                             Trail            1           A trail in the coyote
     2217251865  Bayview trail                             Trail            1           A trail in the coyote
     2356902019  Mental Health Association of Alameda Cou  Association      1           A mental health associ
     2525079629  Strawberry Creek                          stream           1           Waterway stream in Ber
     2639227318  Gelateria Naia                            ice_cream        1           Ice cream shop selling
     2765693662  First Floor Entrance                      Entrance         0           Undetermined, more inf
     3069400840  El Faro                                   fast_food        1           Mexican Cuisine
     3228601037  Integral Counseling Center                Center           1           It is a nonprofit coun
     3474438154  Jump Start Coffee and Whole Foods         Shop             1           Deli, Sandwich, Coffee
     4339548347  CAW Pacific                               cafe             1           It is a contemporary p
    
  • Suggestion, Benefits, & Concerns:

    • Suggestion: We can start categorizing the issues as demonstrated above. In this example, I created a temporary table, but it would be useful to create a view that we can load into a visualization tool. The challenge is then to find the categories that matter most to us and validate them. We can take samples and validate them as the process scales up.
    • Benefit: This approach will allow us to categorize the issues. In this example, there are only 21 named places with some sort of issue indicated by a user. If we extend the database or incentivize our users to provide more feedback, we will scale up the insights we need to find root-cause solutions. If we categorize these issues, we will see which categories are the most prominent.
    • Concern: There are a couple of concerns about this approach. Every time there is user interaction with data, there is room for error. Are the users pointing to a real issue? Do we have the right process to filter out the noise of misunderstandings, where a user thinks there is a problem when there is none? An example grouping the most frequent user concerns is listed below.

    Most Frequent Concerns by Rank

     sqlite> WITH IssueTable AS
        ...> (
        ...> SELECT
        ...>     value,
        ...>     COUNT(*) AS IssueCount
        ...> FROM node_tags nt
        ...> LEFT JOIN nodes n
        ...> ON nt.nodeid = n.id
        ...> WHERE LOWER(key) = 'fixme'
        ...> GROUP BY value
        ...> ORDER BY COUNT(*) DESC
        ...> )
        ...> SELECT
        ...>     *,
        ...>     (
        ...>     SELECT
        ...>         COUNT(*) + 1
        ...>     FROM
        ...>         (
        ...>         SELECT DISTINCT IssueCount
        ...>         FROM IssueTable i1
        ...>         WHERE main.IssueCount < i1.IssueCount
        ...>         )
        ...>     ) AS IssueRank
        ...> FROM IssueTable main;
     value                                                         IssueCount  IssueRank
     ------------------------------------------------------------  ----------  ----------
     continue                                                      10          1
     Connect to a building, or use the door tag if this is inside  3           2
     address number not visible, number inferred                   3           2
     Connect to a building.                                        2           3
     There appears to be a social trail in the vicinity. Please c  2           3
     Where is this?                                                2           3
     address not actually visible;assuming based on numbers surro  2           3
     is no exit?                                                   2           3
     map to line                                                   2           3
     unsigned, so unsure of number                                 2           3
     yes                                                           2           3
     Check position.                                               1           4
     Continue track                                                1           4
     Discover what this facility is.                               1           4
     Does Muni stop here?                                          1           4
     Exact position                                                1           4
     I couldn't see this during a drive-by survey on 2016-06-15;   1           4
     Location needs some improvement                               1           4
     Needs Ground Survey. Road closed to through traffic near her  1           4
     Needs Ground survey                                           1           4
     North Field What?                                             1           4
     Please change address for 3580 Terrace way to the new locati  1           4
     This is an alternative address for Maru Sushi, differing onl  1           4
     This yard needs additional work and cleanup                   1           4
     What is this? Should be deleted?                              1           4
     When zooming out, Oakland (a nearby city) label covers over   1           4
     Where's exit 46B to Ygnacio Valley Road?                      1           4
     address info for upstairs residences                          1           4
     address not visible; presumed logical address                 1           4
     address not visible;presumed                                  1           4
     address number not visible;assumed 1192                       1           4
     address number not visible;assumed address                    1           4
     check address number;none was visible, so assumed 41          1           4
     check name                                                    1           4
     confirm address - sign hard to read from across the street    1           4
     confirm address, since last 2 digits of number worn off       1           4
     couldn't read address number clearly so not sure              1           4
     does this still exist?                                        1           4
     ground survey needed for Market St & Castro St for location   1           4
     ground survey needed to confirm name                          1           4
     is there really a library here?                               1           4
     more precise location within building                         1           4
     needs an approval on the ground                               1           4
     no visible address;making obvious assumption                  1           4
     not 100% sure this address was this building and not the nex  1           4
     not sure/can't tell where this goes                           1           4
     resurvey; yelp says permanently closed                        1           4
     some towers in this area being relocated                      1           4
     survey                                                        1           4
     where does the 45 mph speed limit begin?                      1           4
     where is the area located for this?                           1           4
     where is this area at?                                        1           4
     where is this now?                                            1           4
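
     The IssueRank column above is effectively a dense rank: each value's rank is one plus the number of distinct counts larger than its own, which is what the correlated subquery computes. A small Python equivalent with hypothetical counts:

      counts = [10, 3, 3, 2, 1]  #hypothetical IssueCount values
      for c in counts:
          #rank = 1 + number of distinct counts strictly greater than c
          rank = 1 + len(set(x for x in counts if x > c))
          print c, rank
      #prints: 10 1, 3 2, 3 2, 2 3, 1 4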
    

Conclusion

It is evident that the San Francisco area needs work, and not only the SF area: the process of data gathering itself needs to be improved. Users need to be motivated to avoid errors at the root of the problem, which is data entry. Since this is an open-source project where people from all over the world enter data into the map, it will be a challenge to avoid most of the problems. But with data cleansing methods and gamification to improve data entry, we can create a cleaner map for everybody. It will also be important to have visualizations so we can keep track of the progress we have made. These could be in line with the following questions:

  • How many entries have we cleaned?
  • How many new entries have errors?
  • Historically, are the errors going down or up?

These are just examples of KPIs we can follow to make sure we are on track to achieve our goals. More KPIs will emerge as investigations continue and new problems are uncovered.
