<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[James Kao]]></title><description><![CDATA[ML Hacker]]></description><link>https://jameskao.me/</link><generator>Ghost 0.11</generator><lastBuildDate>Sun, 12 Apr 2026 15:25:23 GMT</lastBuildDate><atom:link href="https://jameskao.me/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Data Wrangling with OpenStreetMap and MongoDB]]></title><description><![CDATA[<p>OpenStreetMap is a community built free editable map of the world, inspired by the success of Wikipedia where crowdsourced data is open and free from proprietary restricted use. We see some examples of its use by Craigslist and Foursquare, as an open source alternative to Google Maps.</p>

<p><a href="http://www.openstreetmap.org">http://www.openstreetmap.</a></p>]]></description><link>https://jameskao.me/data-wrangling-with-openstreetmap-and-mongodb/</link><guid isPermaLink="false">808a05de-cb4d-4e86-9f17-4ed02bbb18ee</guid><dc:creator><![CDATA[James Kao]]></dc:creator><pubDate>Tue, 06 Dec 2016 04:42:31 GMT</pubDate><content:encoded><![CDATA[<p>OpenStreetMap is a community built free editable map of the world, inspired by the success of Wikipedia where crowdsourced data is open and free from proprietary restricted use. We see some examples of its use by Craigslist and Foursquare, as an open source alternative to Google Maps.</p>

<p><a href="http://www.openstreetmap.org">http://www.openstreetmap.org</a></p>

<p>Users can map things such as polylines of roads, draw polygons of buildings or areas of interest, or insert nodes for landmarks. These map elements can be further tagged with details such as street addresses or amenity type. Map data is stored in an XML format. More details about the OSM XML can be found here:</p>

<p><a href="http://wiki.openstreetmap.org/wiki/OSM_XML">http://wiki.openstreetmap.org/wiki/OSM_XML</a></p>

<p>Some highlights of the OSM XML format relevent to this project are: <br>
- OSM XML is list of instances of data primatives (nodes, ways, and relations) found within a given bounds
- nodes represent dimensionless points on the map
- ways contain node references to form either a polyline or polygon on the map
- nodes and ways both contain children tag elements that represent key value pairs of descriptive information about a given node or way</p>

<p>As with any user generated content, there is likely going to be dirty data. In this project I'll attempt to do some auditing, cleaning, and data summarizing tasks with Python and MongoDB.</p>

<h2 id="chosenmaparea">Chosen Map Area</h2>

<p>For this project, I chose to ~50MB from the Cupertino, West San Jose Area. I grew up in Cupertino and lived through the tech sprawl of Apple and the Asian/Indian gentrification of the area. I figured that my familiarity with the area and intrinsic interest in my hometown makes it a good candidate for analysis.</p>

<pre><code class="language-python">from IPython.display import HTML  
HTML('&lt;iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.openstreetmap.org/export/embed.html?bbox=-122.1165%2C37.2571%2C-121.9060%2C37.3636&amp;amp;layer=mapnik"&gt;&lt;/iframe&gt;&lt;br/&gt;&lt;small&gt;&lt;a href="http://www.openstreetmap.org/#map=12/37.3105/-122.0135" target="_blank"&gt;View Larger Map&lt;/a&gt;&lt;/small&gt;')  
</code></pre>

<p><iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.openstreetmap.org/export/embed.html?bbox=-122.1165%2C37.2571%2C-121.9060%2C37.3636&amp;layer=mapnik"></iframe><br><small><a href="http://www.openstreetmap.org/#map=12/37.3105/-122.0135" target="_blank">View Larger Map</a></small></p>

<p>I used the Overpass API to download the OpenStreetMap XML for the corresponding bounding box:</p>

<p><a href="http://overpass-api.de/api/map?bbox=-122.1165,37.2571,-121.9060,37.3636">http://overpass-api.de/api/map?bbox=-122.1165,37.2571,-121.9060,37.3636</a></p>

<pre><code class="language-python">import requests

url = 'http://overpass-api.de/api/map?bbox=-122.1165%2C37.2571%2C-121.9060%2C37.3636'  
filename = 'cupertino_california.osm'  
</code></pre>

<p>Python's Requests library is pretty awesome for downloading this dataset, but it unfortunately keeps all the data in memory by default. Since we're using a much larger dataset, we overcome this limitation with this modified procedure from this stackoverflow post:</p>

<p><a href="http://stackoverflow.com/a/16696317">http://stackoverflow.com/a/16696317</a></p>

<pre><code class="language-python">def download_file(url, local_filename):  
    # stream = True allows downloading of large files; prevents loading entire file into memory
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()

download_file(url, filename)  
</code></pre>

<h2 id="auditingthedata">Auditing the Data</h2>

<p>With the OSM XML file downloaded, lets parse through it with ElementTree and count the number of unique element types. Iterative parsing is utilized since the XML is too large to process in memory.</p>

<pre><code class="language-python">import xml.etree.ElementTree as ET  
import pprint

tags = {}

for event, elem in ET.iterparse(filename):  
    if elem.tag in tags: tags[elem.tag] += 1
    else:                tags[elem.tag] = 1

pprint.pprint(tags)  
</code></pre>

<pre><code>{'bounds': 1,
 'member': 6644,
 'meta': 1,
 'nd': 255022,
 'node': 214642,
 'note': 1,
 'osm': 1,
 'relation': 313,
 'tag': 165782,
 'way': 28404}
</code></pre>

<p>Here I have built three regular expressions: <code>lower</code>, <code>lower_colon</code>, and <code>problemchars</code>. <br>
- <code>lower</code>: matches strings containing lower case characters
- <code>lower_colon</code>: matches strings containing lower case characters and a single colon within the string
- <code>problemchars</code>: matches characters that cannot be used within keys in MongoDB
Here is a sample of OSM XML:  </p>

<pre><code>&lt;node id="266587529" lat="37.3625767" lon="-122.0251570" version="4" timestamp="2015-03-30T03:17:30Z" changeset="29840833" uid="2793982" user="Dhruv Matani"&gt;  
    &lt;tag k="addr:city" v="Sunnyvale"/&gt;
    &lt;tag k="addr:housenumber" v="725"/&gt;
    &lt;tag k="addr:postcode" v="94086"/&gt;
    &lt;tag k="addr:state" v="California"/&gt;
    &lt;tag k="addr:street" v="South Fair Oaks Avenue"/&gt;
    &lt;tag k="amenity" v="restaurant"/&gt;
    &lt;tag k="cuisine" v="indian"/&gt;
    &lt;tag k="name" v="Arka"/&gt;
    &lt;tag k="opening_hours" v="10am - 2:30pm and 5:00pm - 10:00pm"/&gt;
    &lt;tag k="takeaway" v="yes"/&gt;
&lt;/node&gt;  
</code></pre>

<p>Within the node element there are ten <code>tag</code> children. The key for half of these children begin with <code>addr:</code>. Later in this notebook I will use the <code>lower_colon</code> regex to help find these keys so I can build a single <code>address</code> document within a larger json document.</p>

<pre><code class="language-python">import re

lower = re.compile(r'^([a-z]|_)*$')  
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')  
problemchars = re.compile(r'[=\+/&amp;&lt;&gt;;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):  
    if element.tag == "tag":
        for tag in element.iter('tag'):
            k = tag.get('k')
            if lower.search(k):
                keys['lower'] += 1
            elif lower_colon.search(k):
                keys['lower_colon'] += 1
            elif problemchars.search(k):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1

    return keys

def process_map(filename):  
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}

    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

keys = process_map(filename)  
pprint.pprint(keys)  
</code></pre>

<pre><code>{'lower': 78267, 'lower_colon': 83553, 'other': 3962, 'problemchars': 0}
</code></pre>

<p>Now lets redefine <code>process_map</code> to build a set of unique userid's found within the XML. I will then output the length of this set, representing the number of unique users making edits in the chosen map area.</p>

<pre><code class="language-python">def process_map(filename):  
    users = set()
    for _, element in ET.iterparse(filename):
        for e in element:
            if 'uid' in e.attrib:
                users.add(e.attrib['uid'])

    return users

users = process_map(filename)  
len(users)  
</code></pre>

<pre><code>534
</code></pre>

<h1 id="problemswiththedata">Problems with the Data</h1>

<p><strong>Street Names</strong></p>

<p>The majority of this project will be devoted to auditing and cleaning street names seen within the OSM XML. Street types used by users in the process of mapping are quite often abbreviated. I will attempt to find these abbreviations and replace them with their full text form. The plan of action is as follows: <br>
- Build a regex to match the last token in a string (with an optional '.') as this is typically where you would find the street type in an address
- Build a list of expected street types that do not need to be cleaned
- Parse through the XML looking for tag elements with <code>k="addr:street"</code> attributes
- Perform a search using the regex on the value of the v attribute of these elements (the street name string)
- Build a dictionary with keys that are matches to the regex (street types) and a set of street names where the particular key was found as the value. This will allow us to determine what needs to be cleaned.
- Build a second dictionary that contains a map from an offending street type to a clean street type
- Build a second regex that will match these offending street types anywhere in a string
- Build a function that will return a clean string using the mapping dictionary and this second regex</p>

<p>The first step is to build a regex to match the last token in a string optionally ending with a period. I will also build a list of street types I expect to see in a clean street name.</p>

<pre><code class="language-python">from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected_street_types = ["Avenue", "Boulevard", "Commons", "Court", "Drive", "Lane", "Parkway",  
                         "Place", "Road", "Square", "Street", "Trail"]
</code></pre>

<p>The <code>audit_street_type</code> function will take in the dictionary of street types we are building, a string to audit, a regex to match against that string, and the list of expected street types.</p>

<p>The function will search the string for the regex. If there is a match and the match is not in our list of expected street types, add the match as a key to the dictionary and add the string to the set.</p>

<pre><code class="language-python">def audit_street_type(street_types, street_name, regex, expected_street_types):  
    m = regex.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected_street_types:
            street_types[street_type].add(street_name)
</code></pre>

<p>The function <code>is_street_name</code> determines if an element contains an attribute <code>k="addr:street"</code>. Lets use <code>is_street_name</code> as the <code>tag_filter</code> when I call the <code>audit</code> function to audit street names.</p>

<pre><code class="language-python">def is_street_name(elem):  
    return (elem.attrib['k'] == "addr:street")
</code></pre>

<p>Now I will define an <code>audit</code> function to do the parsing and auditing of the street names.</p>

<p>I have defined this function so that it not only audits <code>tag</code> elements where <code>k="addr:street"</code>, but whichever <code>tag</code> elements match the <code>tag_filter</code> function. The audit function also takes in a regex and the list of expected matches.</p>

<pre><code class="language-python">def audit(osmfile, regex):  
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)

    # iteratively parse the mapping xml
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        # iterate 'tag' tags within 'node' and 'way' tags
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'], regex, expected_street_types)

    return street_types
</code></pre>

<p>Now lets pretty print the output of <code>audit</code></p>

<pre><code class="language-python">street_types = audit(filename, street_type_re)

pprint.pprint(dict(street_types))  
</code></pre>

<pre><code>{'Alameda': set(['The Alameda']),
 'Ave': set(['Afton Ave',
             'Blake Ave',
             'Cabrillo Ave',
             'N Blaney Ave',
             'Saratoga Ave',
             'The Alameda Ave']),
 'Bascom': set(['S. Bascom']),
 'Bellomy': set(['Bellomy']),
 'Blvd': set(['De Anza Blvd', 'Stevens Creek Blvd']),
 'Circle': set(['Bobolink Circle',
                'Calabazas Creek Circle',
                'Continental Circle',
                'Winchester Circle']),
 'Dr': set(['Linwood Dr']),
 'East': set(['Vanderbilt Court East']),
 'Escuela': set(['Camina Escuela']),
 'Franklin': set(['Franklin']),
 'Ln': set(['Weyburn Ln']),
 'Loop': set(['Infinite Loop']),
 'Presada': set(['Paseo Presada']),
 'Rd': set(['Bollinger Rd', 'Homestead Rd', 'Saratoga Los Gatos Rd']),
 'Real': set(['E El Camino Real', 'East El Camino Real', 'El Camino Real']),
 'Row': set(['Santana Row']),
 'St': set(['Monroe St']),
 'Terrace': set(['Avon Terrace',
                 'Avoset Terrace',
                 'Devona Terrace',
                 'Hobart Terrace',
                 'Hogarth Terrace',
                 'Lautrec Terrace',
                 'Lessing Terrace',
                 'Manet Terrace',
                 'Oak Point Terrace',
                 'Panache Terrace',
                 'Pennyroyal Terrace',
                 'Pine Pass Terrace',
                 'Pistachio Terrace',
                 'Pumpkin Terrace',
                 'Pyracantha Terrace',
                 'Reston Terrace',
                 'Riorden Terrace',
                 'Springfield Terrace',
                 'Wilmington Terrace',
                 'Windsor Terrace',
                 'Wright Terrace',
                 'Yellowstone Terrace']),
 'Way': set(['Allison Way',
             'Anaconda Way',
             'Barnsley Way',
             'Belfry Way',
             'Belleville Way',
             'Bellingham Way',
             'Berwick Way',
             'Big Basin Way',
             'Blanchard Way',
             'Bonneville Way',
             'Brahms Way',
             'Carlisle Way',
             'Cheshire Way',
             "Coeur D'Alene Way",
             'Colinton Way',
             'Connemara Way',
             'Dartshire Way',
             'Devonshire Way',
             'Dorset Way',
             'Dublin Way',
             'Duncardine Way',
             'Dunholme Way',
             'Dunnock Way',
             'Durshire Way',
             'Edmonds Way',
             'Enderby Way',
             'Fife Way',
             'Firebird Way',
             'Flamingo Way',
             'Flicker Way',
             'Flin Way',
             'Golden Way',
             'Harney Way',
             'Humewick Way',
             'Kingfisher Way',
             'Lennox Way',
             'Locksunart Way',
             'Longfellow Way',
             'Mallard Way',
             'Miette Way',
             'Mitty Way',
             'Nandina Way',
             'Nelson Way',
             'Prince Edward Way',
             'Pyrus Way',
             'Radcliff Way',
             'Revelstoke Way',
             'Tangerine Way',
             'Tartarian Way',
             'Ward Way',
             'Zinfandel Way']),
 'West': set(['Vanderbilt Court West']),
 'Winchester': set(['Winchester'])}
</code></pre>

<p>Now I have a list of some abbreviated street types (as well as locations without street types). This is by no means a comprehensive list of all of the abbreviated street types used within the XML as all of these matches occur only as the last token at the end of a street name, but it is a very good first swipe at the problem.</p>

<p>To replace these abbreviated street types, I will define an update function that takes a string to update, a mapping dictionary, and a regex to search.</p>

<pre><code class="language-python">def update_name(name, mapping, regex):  
    m = regex.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping:
            name = re.sub(regex, mapping[street_type], name)

    return name
</code></pre>

<p>Using the results of <code>audit</code>, I will build a dictionary to map abbreviations to their full, clean representations.</p>

<pre><code class="language-python">street_type_mapping = {'Ave'  : 'Avenue',  
                       'Blvd' : 'Boulevard',
                       'Dr'   : 'Drive',
                       'Ln'   : 'Lane',
                       'Pkwy' : 'Parkway',
                       'Rd'   : 'Road',
                       'St'   : 'Street'}
</code></pre>

<p>I now want to replace the keys of the map anywhere in the string. I'll build a new regex to do so.</p>

<pre><code class="language-python"># The pipe will cause the regex to search for any of the keys, lazily matching the first it finds
street_type_re  = re.compile(r'\b\S+\.?$', re.IGNORECASE)  
</code></pre>

<p>To see how this works, I will traverse the <code>street_types</code> dictionary from above</p>

<pre><code class="language-python">for street_type, ways in street_types.iteritems():  
    for name in ways:
        better_name = update_name(name, street_type_mapping, street_type_re)
        print name, "=&gt;", better_name
</code></pre>

<pre><code>El Camino Real =&gt; El Camino Real
E El Camino Real =&gt; E El Camino Real
East El Camino Real =&gt; East El Camino Real
S. Bascom =&gt; S. Bascom
Bellomy =&gt; Bellomy
Winchester =&gt; Winchester
Weyburn Ln =&gt; Weyburn Lane
Linwood Dr =&gt; Linwood Drive
Franklin =&gt; Franklin
Monroe St =&gt; Monroe Street
Bollinger Rd =&gt; Bollinger Road
Saratoga Los Gatos Rd =&gt; Saratoga Los Gatos Road
Homestead Rd =&gt; Homestead Road
Vanderbilt Court East =&gt; Vanderbilt Court East
Riorden Terrace =&gt; Riorden Terrace
Yellowstone Terrace =&gt; Yellowstone Terrace
Springfield Terrace =&gt; Springfield Terrace
Oak Point Terrace =&gt; Oak Point Terrace
Windsor Terrace =&gt; Windsor Terrace
Lessing Terrace =&gt; Lessing Terrace
Avon Terrace =&gt; Avon Terrace
Hobart Terrace =&gt; Hobart Terrace
Wright Terrace =&gt; Wright Terrace
Hogarth Terrace =&gt; Hogarth Terrace
Manet Terrace =&gt; Manet Terrace
Pyracantha Terrace =&gt; Pyracantha Terrace
Pistachio Terrace =&gt; Pistachio Terrace
Wilmington Terrace =&gt; Wilmington Terrace
Avoset Terrace =&gt; Avoset Terrace
Lautrec Terrace =&gt; Lautrec Terrace
Devona Terrace =&gt; Devona Terrace
Pennyroyal Terrace =&gt; Pennyroyal Terrace
Panache Terrace =&gt; Panache Terrace
Pumpkin Terrace =&gt; Pumpkin Terrace
Reston Terrace =&gt; Reston Terrace
Pine Pass Terrace =&gt; Pine Pass Terrace
Firebird Way =&gt; Firebird Way
Dublin Way =&gt; Dublin Way
Flicker Way =&gt; Flicker Way
Anaconda Way =&gt; Anaconda Way
Tartarian Way =&gt; Tartarian Way
Barnsley Way =&gt; Barnsley Way
Tangerine Way =&gt; Tangerine Way
Blanchard Way =&gt; Blanchard Way
Fife Way =&gt; Fife Way
Flamingo Way =&gt; Flamingo Way
Edmonds Way =&gt; Edmonds Way
Locksunart Way =&gt; Locksunart Way
Revelstoke Way =&gt; Revelstoke Way
Enderby Way =&gt; Enderby Way
Cheshire Way =&gt; Cheshire Way
Colinton Way =&gt; Colinton Way
Dorset Way =&gt; Dorset Way
Berwick Way =&gt; Berwick Way
Radcliff Way =&gt; Radcliff Way
Brahms Way =&gt; Brahms Way
Dunholme Way =&gt; Dunholme Way
Durshire Way =&gt; Durshire Way
Longfellow Way =&gt; Longfellow Way
Nandina Way =&gt; Nandina Way
Dunnock Way =&gt; Dunnock Way
Carlisle Way =&gt; Carlisle Way
Mitty Way =&gt; Mitty Way
Harney Way =&gt; Harney Way
Devonshire Way =&gt; Devonshire Way
Belfry Way =&gt; Belfry Way
Prince Edward Way =&gt; Prince Edward Way
Pyrus Way =&gt; Pyrus Way
Golden Way =&gt; Golden Way
Ward Way =&gt; Ward Way
Kingfisher Way =&gt; Kingfisher Way
Connemara Way =&gt; Connemara Way
Allison Way =&gt; Allison Way
Flin Way =&gt; Flin Way
Nelson Way =&gt; Nelson Way
Bellingham Way =&gt; Bellingham Way
Mallard Way =&gt; Mallard Way
Humewick Way =&gt; Humewick Way
Big Basin Way =&gt; Big Basin Way
Coeur D'Alene Way =&gt; Coeur D'Alene Way
Belleville Way =&gt; Belleville Way
Duncardine Way =&gt; Duncardine Way
Bonneville Way =&gt; Bonneville Way
Miette Way =&gt; Miette Way
Zinfandel Way =&gt; Zinfandel Way
Lennox Way =&gt; Lennox Way
Dartshire Way =&gt; Dartshire Way
Vanderbilt Court West =&gt; Vanderbilt Court West
De Anza Blvd =&gt; De Anza Boulevard
Stevens Creek Blvd =&gt; Stevens Creek Boulevard
Blake Ave =&gt; Blake Avenue
The Alameda Ave =&gt; The Alameda Avenue
Saratoga Ave =&gt; Saratoga Avenue
Afton Ave =&gt; Afton Avenue
N Blaney Ave =&gt; N Blaney Avenue
Cabrillo Ave =&gt; Cabrillo Avenue
Winchester Circle =&gt; Winchester Circle
Calabazas Creek Circle =&gt; Calabazas Creek Circle
Continental Circle =&gt; Continental Circle
Bobolink Circle =&gt; Bobolink Circle
Santana Row =&gt; Santana Row
The Alameda =&gt; The Alameda
Paseo Presada =&gt; Paseo Presada
Infinite Loop =&gt; Infinite Loop
Camina Escuela =&gt; Camina Escuela
</code></pre>

<p>Looks like the abbreviated street types updated as expected.</p>

<p>Upon closer inspection, I see another problem: cardinal directions. North, South, East, and West appear to be universally abbreviated. Lets apply similar techniques to replace these abbreviated cardinal directions.</p>

<p>First, I will create a new regex matching the set of characters NSEW at the beginning of a string, followed by an optional period</p>

<pre><code class="language-python">street_type_pre = re.compile(r'^[NSEW]\b\.?', re.IGNORECASE)  
</code></pre>

<p>To audit, I can use the same function with this new regex</p>

<pre><code class="language-python">cardinal_directions = audit(filename, street_type_pre)

pprint.pprint(dict(cardinal_directions))  
</code></pre>

<pre><code>{'E': set(['E El Camino Real']),
 'N': set(['N Blaney Ave']),
 'S.': set(['S. Bascom'])}
</code></pre>

<p>Looks like we found E, N, S, W, and W. at beginning of the street names. Informative, but I can just create an exhaustive mapping for this issue</p>

<pre><code class="language-python">cardinal_mapping = {'E'  : 'East',  
                    'E.' : 'East',
                    'N'  : 'North',
                    'N.' : 'North',
                    'S'  : 'South',
                    'S.' : 'South',
                    'W'  : 'West',
                    'W.' : 'West'}
</code></pre>

<p>Finally, I will traverse the <code>cardinal_directions</code> dictionary and apply the updates for both street type and cardinal direction</p>

<pre><code class="language-python">for cardinal_direction, ways in cardinal_directions.iteritems():  
    if cardinal_direction in cardinal_mapping:
        for name in ways:
            better_name = update_name(name, street_type_mapping, street_type_re)
            best_name   = update_name(better_name, cardinal_mapping, street_type_pre)
            print name, "=&gt;", better_name, "=&gt;", best_name
</code></pre>

<pre><code>E El Camino Real =&gt; E El Camino Real =&gt; East El Camino Real
S. Bascom =&gt; S. Bascom =&gt; South Bascom
N Blaney Ave =&gt; N Blaney Avenue =&gt; North Blaney Avenue
</code></pre>

<h2 id="preparingformongodb">Preparing for MongoDB</h2>

<p>To load the XML data into MongoDB, I will have to transform the data into json documents structured like this:  </p>

<pre><code>{
    "id": "2406124091",
    "type: "node",
    "visible":"true",
    "created": {
                  "version":"2",
                  "changeset":"17206049",
                  "timestamp":"2013-08-03T16:43:42Z",
                  "user":"linuxUser16",
                  "uid":"1219059"
               },
    "pos": [41.9757030, -87.6921867],
    "address": {
                  "housenumber": "5157",
                  "postcode": "60625",
                  "street": "North Lincoln Ave"
               },
    "amenity": "restaurant",
    "cuisine": "mexican",
    "name": "La Cabana De Don Luis",
    "phone": "1 (773)-271-5176"
}
</code></pre>

<p>The transform will follow these rules: <br>
- Process only 2 types of top level tags: node and way
- All attributes of node and way should be turned into regular key/value pairs, except:
  - The following attributes should be added under a key <code>created: version, changeset, timestamp, user, uid</code>
  - Attributes for latitude and longitude should be added to a pos array, for use in geospacial indexing. Make sure the values inside pos array are floats and not strings.
- If second level <code>tag</code> "k" value contains problematic characters, it should be ignored
- If second level <code>tag</code> "k" value starts with "addr:", it should be added to a dictionary address
- If second level <code>tag</code> "k" value does not start with "addr:", but contains ":", you can process it same as any other tag.
- If there is a second ":" that separates the type/direction of a street, the tag should be ignored, for example:</p>

<pre><code>&lt;tag k="addr:housenumber" v="5158"/&gt;  
&lt;tag k="addr:street" v="North Lincoln Avenue"/&gt;  
&lt;tag k="addr:street:name" v="Lincoln"/&gt;  
&lt;tag k="addr:street:prefix" v="North"/&gt;  
&lt;tag k="addr:street:type" v="Avenue"/&gt;  
&lt;tag k="amenity" v="pharmacy"/&gt;  
</code></pre>

<p>should be turned into:  </p>

<pre><code>{
    "address": {
                   "housenumber": 5158,
                   "street": "North Lincoln Avenue"
               },
    "amenity": "pharmacy"
}
</code></pre>

<p>For "way" specifically:  </p>

<pre><code>&lt;nd ref="305896090"/&gt;  
&lt;nd ref="1719825889"/&gt;  
</code></pre>

<p>should be turned into:  </p>

<pre><code>{
    "node_refs": ["305896090", "1719825889"]
}
</code></pre>

<p>To do this transformation, lets define a function <code>shape_element</code> that processes an element. Within this function I will use the update function with the regexes and mapping dictionaries defined above to clean street addresses. Additionally, I will store timestamp as a Python <code>datetime</code> rather than as a string. The format of the timestamp can be found here:</p>

<p><a href="http://overpass-api.de/output_formats.html">http://overpass-api.de/output_formats.html</a></p>

<pre><code class="language-python">from datetime import datetime

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):  
    node = {}
    if element.tag == "node" or element.tag == "way" :
        node['type'] = element.tag

        # Parse attributes
        for attrib in element.attrib:

            # Data creation details
            if attrib in CREATED:
                if 'created' not in node:
                    node['created'] = {}
                if attrib == 'timestamp':
                    node['created'][attrib] = datetime.strptime(element.attrib[attrib], '%Y-%m-%dT%H:%M:%SZ')
                else:
                    node['created'][attrib] = element.get(attrib)

            # Parse location
            if attrib in ['lat', 'lon']:
                lat = float(element.attrib.get('lat'))
                lon = float(element.attrib.get('lon'))
                node['pos'] = [lat, lon]

            # Parse the rest of attributes
            else:
                node[attrib] = element.attrib.get(attrib)

        # Process tags
        for tag in element.iter('tag'):
            key   = tag.attrib['k']
            value = tag.attrib['v']
            if not problemchars.search(key):

                # Tags with single colon and beginning with addr
                if lower_colon.search(key) and key.find('addr') == 0:
                    if 'address' not in node:
                        node['address'] = {}
                    sub_attr = key.split(':')[1]
                    if is_street_name(tag):
                        # Do some cleaning
                        better_name = update_name(name, street_type_mapping, street_type_re)
                        best_name   = update_name(better_name, cardinal_mapping, street_type_pre)
                        node['address'][sub_attr] = best_name
                    else:
                        node['address'][sub_attr] = value

                # All other tags that don't begin with "addr"
                elif not key.find('addr') == 0:
                    if key not in node:
                        node[key] = value
                else:
                    node["tag:" + key] = value

        # Process nodes
        for nd in element.iter('nd'):
            if 'node_refs' not in node:
                node['node_refs'] = []
            node['node_refs'].append(nd.attrib['ref'])

        return node
    else:
        return None
</code></pre>

<p>Now parse the XML, shape the elements, and write to a json file.</p>

<p>We're using BSON for compatibility with the date aggregation operators. There is also a Timestamp type in MongoDB, but use of this type is explicitly discouraged by the <a href="http://docs.mongodb.org/manual/core/document/#timestamps">documentation</a>.</p>

<pre><code class="language-python">import json  
from bson import json_util

def process_map(file_in, pretty = False):  
    file_out = "{0}.json".format(file_in)
    with open(file_out, "wb") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                if pretty:
                    fo.write(json.dumps(el, indent=2, default=json_util.default)+"\n")
                else:
                    fo.write(json.dumps(el, default=json_util.default) + "\n")

process_map(filename)  
</code></pre>

<h2 id="overviewofthedata">Overview of the Data</h2>

<p>Lets look at the size of the files we worked with and generated.</p>

<pre><code class="language-python">import os  
print 'The downloaded file is {} MB'.format(os.path.getsize(filename)/1.0e6) # convert from bytes to megabytes  
</code></pre>

<pre><code>The downloaded file is 50.66996 MB
</code></pre>

<pre><code class="language-python">print 'The json file is {} MB'.format(os.path.getsize(filename + ".json")/1.0e6) # convert from bytes to megabytes  
</code></pre>

<pre><code>The json file is 83.383804 MB
</code></pre>

<p><strong>Plenty of Street Addresses</strong></p>

<p>Besides dirty data within the <code>addr:street</code> field, we're working with a sizeable amount of data on street addresses. Here I will count the total number of nodes and ways that contain a tag child with <code>k="addr:street"</code></p>

<pre><code class="language-python">osm_file = open(filename, "r")  
address_count = 0

for event, elem in ET.iterparse(osm_file, events=("start",)):  
    if elem.tag == "node" or elem.tag == "way":
        for tag in elem.iter("tag"):
            if is_street_name(tag):
                address_count += 1

address_count  
</code></pre>

<pre><code>8958
</code></pre>

<p>There are plenty of locations on the map that has their street addresses tagged. It looks like OpenStreetMap's community has collected a good amount of data for this area.</p>

<h2 id="workingwithmongodb">Working with MongoDB</h2>

<p>The first task is to execute mongod to run MongoDB. There are plenty of guides to do this. On OS X, if you have <code>mongodb</code> installed via homebrew, homebrew actually has a handy <code>brew services</code> command.</p>

<p>To start mongodb:</p>

<pre><code>brew services start mongodb
</code></pre>

<p>To stop mongodb if it's already running:</p>

<pre><code>brew services stop mongodb
</code></pre>

<p>Alternatively, if you have MongoDB installed and configured already we can run a subprocess for the duration of the python session:</p>

<pre><code class="language-python">import signal  
import subprocess

# The os.setsid() is passed in the argument preexec_fn so
# it's run after the fork() and before  exec() to run the shell.
pro = subprocess.Popen('mongod', preexec_fn = os.setsid)  
</code></pre>

<p>Next, connect to the database with <code>pymongo</code></p>

<pre><code class="language-python">from pymongo import MongoClient

db_name = 'openstreetmap'

# Connect to Mongo DB
client = MongoClient('localhost:27017')  
# Database 'openstreetmap' will be created if it does not exist.
db = client[db_name]  
</code></pre>

<p>Then just import the dataset with <code>mongoimport</code>.</p>

<pre><code class="language-python"># Build mongoimport command
collection = filename[:filename.find('.')]  
working_directory = '/Users/James/Dropbox/Projects/da/data-wrangling-with-openstreetmap-and-mongodb/'  
json_file = filename + '.json'

mongoimport_cmd = 'mongoimport -h 127.0.0.1:27017 ' + \  
                  '--db ' + db_name + \
                  ' --collection ' + collection + \
                  ' --file ' + working_directory + json_file

# Before importing, drop collection if it exists (i.e. a re-run)
if collection in db.collection_names():  
    print 'Dropping collection: ' + collection
    db[collection].drop()

# Execute the command
print 'Executing: ' + mongoimport_cmd  
subprocess.call(mongoimport_cmd.split())  
</code></pre>

<pre><code>Dropping collection: cupertino_california
Executing: mongoimport -h 127.0.0.1:27017 --db openstreetmap --collection cupertino_california --file /Users/James/Dropbox/Projects/da/data-wrangling-with-openstreetmap-and-mongodb/cupertino_california.osm.json





0
</code></pre>

<h2 id="investigatingthedata">Investigating the Data</h2>

<p>After importing, get the collection from the database.</p>

<pre><code class="language-python">cupertino_california = db[collection]  
</code></pre>

<p>Here's where the fun stuff starts. Now that we have a audited and cleaned up collection, we can query for a bunch of interesting statistics.</p>

<p><strong>Number of Documents</strong></p>

<pre><code class="language-python">cupertino_california.find().count()  
</code></pre>

<pre><code>243046
</code></pre>

<p><strong>Number of Unique Users</strong></p>

<pre><code class="language-python">len(cupertino_california.distinct('created.user'))  
</code></pre>

<pre><code>528
</code></pre>

<p><strong>Number of Nodes and Ways</strong></p>

<pre><code class="language-python">cupertino_california.aggregate({'$group': {'_id': '$type', \  
                                           'count': {'$sum' : 1}}})['result']
</code></pre>

<pre><code>[{u'_id': u'way', u'count': 28404}, {u'_id': u'node', u'count': 214642}]
</code></pre>

<p><strong>Top Three Contributors</strong></p>

<pre><code class="language-python">top_users = cupertino_california.aggregate([{'$group': {'_id': '$created.user', \  
                                                        'count': {'$sum' : 1}}}, \
                                            {'$sort': {'count' : -1}}, \
                                            {'$limit': 3}])['result']

pprint.pprint(top_users)  
print

for user in top_users:  
    pprint.pprint(cupertino_california.find({'created.user': user['_id']})[0])
</code></pre>

<pre><code>[{u'_id': u'n76', u'count': 66090},
 {u'_id': u'mk408', u'count': 37175},
 {u'_id': u'Bike Mapper', u'count': 27545}]

{u'_id': ObjectId('55e69dc45c014321a5c76759'),
 u'changeset': u'16866449',
 u'created': {u'changeset': u'16866449',
              u'timestamp': datetime.datetime(2013, 7, 7, 21, 29, 38),
              u'uid': u'318696',
              u'user': u'n76',
              u'version': u'21'},
 u'highway': u'traffic_signals',
 u'id': u'26027690',
 u'pos': [37.3531613, -122.0140663],
 u'timestamp': u'2013-07-07T21:29:38Z',
 u'type': u'node',
 u'uid': u'318696',
 u'user': u'n76',
 u'version': u'21'}
{u'_id': ObjectId('55e69dc45c014321a5c76765'),
 u'changeset': u'3923975',
 u'created': {u'changeset': u'3923975',
              u'timestamp': datetime.datetime(2010, 2, 20, 14, 28, 36),
              u'uid': u'201724',
              u'user': u'mk408',
              u'version': u'2'},
 u'id': u'26117855',
 u'pos': [37.3584066, -122.016459],
 u'timestamp': u'2010-02-20T14:28:36Z',
 u'type': u'node',
 u'uid': u'201724',
 u'user': u'mk408',
 u'version': u'2'}
{u'_id': ObjectId('55e69dc45c014321a5c76761'),
 u'changeset': u'31215083',
 u'created': {u'changeset': u'31215083',
              u'timestamp': datetime.datetime(2015, 5, 17, 0, 1, 56),
              u'uid': u'74705',
              u'user': u'Bike Mapper',
              u'version': u'25'},
 u'id': u'26029632',
 u'pos': [37.3523544, -122.0122361],
 u'timestamp': u'2015-05-17T00:01:56Z',
 u'type': u'node',
 u'uid': u'74705',
 u'user': u'Bike Mapper',
 u'version': u'25'}
</code></pre>

<p><strong>Three Most Referenced Nodes</strong></p>

<pre><code class="language-python">top_nodes = cupertino_california.aggregate([{'$unwind': '$node_refs'}, \  
                                            {'$group': {'_id': '$node_refs', \
                                                        'count': {'$sum': 1}}}, \
                                            {'$sort': {'count': -1}}, \
                                            {'$limit': 3}])['result']

pprint.pprint(top_nodes)  
print

for node in top_nodes:  
    pprint.pprint(cupertino_california.find({'id': node['_id']})[0])
</code></pre>

<pre><code>[{u'_id': u'282814553', u'count': 9},
 {u'_id': u'3567695709', u'count': 7},
 {u'_id': u'3678198975', u'count': 7}]

{u'_id': ObjectId('55e69dc45c014321a5c7b228'),
 u'changeset': u'32109704',
 u'created': {u'changeset': u'32109704',
              u'timestamp': datetime.datetime(2015, 6, 21, 5, 7, 49),
              u'uid': u'33757',
              u'user': u'Minh Nguyen',
              u'version': u'19'},
 u'highway': u'traffic_signals',
 u'id': u'282814553',
 u'pos': [37.3520588, -121.93721],
 u'timestamp': u'2015-06-21T05:07:49Z',
 u'type': u'node',
 u'uid': u'33757',
 u'user': u'Minh Nguyen',
 u'version': u'19'}
{u'_id': ObjectId('55e69dca5c014321a5ca7eed'),
 u'changeset': u'31682562',
 u'created': {u'changeset': u'31682562',
              u'timestamp': datetime.datetime(2015, 6, 3, 4, 56, 24),
              u'uid': u'33757',
              u'user': u'Minh Nguyen',
              u'version': u'1'},
 u'id': u'3567695709',
 u'pos': [37.3354655, -121.9078015],
 u'timestamp': u'2015-06-03T04:56:24Z',
 u'type': u'node',
 u'uid': u'33757',
 u'user': u'Minh Nguyen',
 u'version': u'1'}
{u'_id': ObjectId('55e69dca5c014321a5ca9881'),
 u'changeset': u'33058613',
 u'created': {u'changeset': u'33058613',
              u'timestamp': datetime.datetime(2015, 8, 2, 23, 59, 13),
              u'uid': u'33757',
              u'user': u'Minh Nguyen',
              u'version': u'1'},
 u'id': u'3678198975',
 u'pos': [37.327171, -121.9337872],
 u'timestamp': u'2015-08-02T23:59:13Z',
 u'type': u'node',
 u'uid': u'33757',
 u'user': u'Minh Nguyen',
 u'version': u'1'}
</code></pre>

<p><strong>Number of Documents with Street Addresses</strong></p>

<pre><code class="language-python">cupertino_california.find({'address.street': {'$exists': 1}}).count()  
</code></pre>

<pre><code>9095
</code></pre>

<p><strong>List of Zip Codes</strong></p>

<pre><code class="language-python">cupertino_california.aggregate([{'$match': {'address.postcode': {'$exists': 1}}}, \  
                                {'$group': {'_id': '$address.postcode', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}])['result']
</code></pre>

<pre><code>[{u'_id': u'94087', u'count': 226},
 {u'_id': u'95070', u'count': 225},
 {u'_id': u'95051', u'count': 127},
 {u'_id': u'95014', u'count': 106},
 {u'_id': u'95129', u'count': 86},
 {u'_id': u'95126', u'count': 45},
 {u'_id': u'95008', u'count': 41},
 {u'_id': u'95050', u'count': 28},
 {u'_id': u'95125', u'count': 13},
 {u'_id': u'94086', u'count': 12},
 {u'_id': u'95117', u'count': 9},
 {u'_id': u'95128', u'count': 8},
 {u'_id': u'94024', u'count': 5},
 {u'_id': u'95124', u'count': 4},
 {u'_id': u'94040', u'count': 3},
 {u'_id': u'95032', u'count': 3},
 {u'_id': u'94087-2248', u'count': 1},
 {u'_id': u'94087\u200e', u'count': 1},
 {u'_id': u'94088-3707', u'count': 1},
 {u'_id': u'95110', u'count': 1},
 {u'_id': u'95052', u'count': 1},
 {u'_id': u'CA 95014', u'count': 1},
 {u'_id': u'95914', u'count': 1},
 {u'_id': u'94022', u'count': 1},
 {u'_id': u'CA 94086', u'count': 1}]
</code></pre>

<p>It looks like have some invalid zip codes, with the state name or unicode characters included.</p>

<p>The zip codes with 4 digit postal codes included are still valid though, and we might consider removing these postal codes during the cleaning process.</p>

<p><strong>Cities with Most Records</strong></p>

<pre><code class="language-python">cupertino_california.aggregate([{'$match': {'address.city': {'$exists': 1}}}, \  
                                {'$group': {'_id': '$address.city', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}])['result']
</code></pre>

<pre><code>[{u'_id': u'Sunnyvale', u'count': 2476},
 {u'_id': u'Saratoga', u'count': 221},
 {u'_id': u'Santa Clara', u'count': 142},
 {u'_id': u'San Jose', u'count': 99},
 {u'_id': u'Cupertino', u'count': 59},
 {u'_id': u'Campbell', u'count': 37},
 {u'_id': u'San Jos\xe9', u'count': 9},
 {u'_id': u'Los Altos', u'count': 7},
 {u'_id': u'Campbelll', u'count': 3},
 {u'_id': u'Mountain View', u'count': 3},
 {u'_id': u'cupertino', u'count': 2},
 {u'_id': u'Santa clara', u'count': 1},
 {u'_id': u'santa clara', u'count': 1},
 {u'_id': u'campbell', u'count': 1},
 {u'_id': u'san jose', u'count': 1},
 {u'_id': u'Los Gatos', u'count': 1},
 {u'_id': u'South Mary Avenue', u'count': 1},
 {u'_id': u'sunnyvale', u'count': 1}]
</code></pre>

<p>Likewise, some cities capitalization and the accented-e gives way to more auditing and cleaning.</p>

<p>It's interesting to note how well Sunnyvale and Santa Clara have been documented, relative to the other cities despite having the area covering mostly Cupertino, Saratoga, West San Jose.</p>

<p><strong>Top 10 Amenities</strong></p>

<pre><code class="language-python">cupertino_california.aggregate([{'$match': {'amenity': {'$exists': 1}}}, \  
                                {'$group': {'_id': '$amenity', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}, \
                                {'$limit': 10}])['result']
</code></pre>

<pre><code>[{u'_id': u'parking', u'count': 437},
 {u'_id': u'restaurant', u'count': 279},
 {u'_id': u'school', u'count': 243},
 {u'_id': u'place_of_worship', u'count': 153},
 {u'_id': u'fast_food', u'count': 147},
 {u'_id': u'cafe', u'count': 85},
 {u'_id': u'fuel', u'count': 79},
 {u'_id': u'bicycle_parking', u'count': 72},
 {u'_id': u'bank', u'count': 66},
 {u'_id': u'bench', u'count': 60}]
</code></pre>

<p><em>* Top 10 Banks</em>*</p>

<p>It's a pain when there isn't a local branch of your bank closeby. Lets what banks have the most locations in this area to avoid this.</p>

<pre><code class="language-python">cupertino_california.aggregate([{'$match': {'amenity': 'bank'}}, \  
                                {'$group': {'_id': '$name', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}, \
                                {'$limit': 10}])['result']
</code></pre>

<pre><code>[{u'_id': u'Bank of America', u'count': 10},
 {u'_id': u'Chase', u'count': 8},
 {u'_id': None, u'count': 7},
 {u'_id': u'US Bank', u'count': 5},
 {u'_id': u'Citibank', u'count': 5},
 {u'_id': u'Wells Fargo', u'count': 5},
 {u'_id': u'First Tech Federal Credit Union', u'count': 2},
 {u'_id': u'Union Bank', u'count': 2},
 {u'_id': u'Bank of the West', u'count': 2},
 {u'_id': u'Chase Bank', u'count': 2}]
</code></pre>

<h2 id="otherideasaboutthedataset">Other Ideas About the Dataset</h2>

<p>From exploring the OpenStreetMap dataset, I found the data structure to be flexible enough to include a vast multitude of user generated quantitative and qualitative data beyond that of simply defining a virtual map. There's plenty of potential to extend OpenStreetMap to include user reviews of establishments, subjective areas of what classifies a good vs bad neighborhood, housing price data, school reviews, walkability/bikeability, quality of mass transit, and a bunch of other metrics that could form a solid foundation for robust recommender systems. These recommender systems could aid users in deciding where to live or what cool food joints to check out.</p>

<p>The data is far too incomplete to be able to implement such recommender systems as it stands now, but the OpenStreetMap project could really benefit from visualizing data on content generation within their maps. For example, a heat map layer could be overlayed on the map showing how frequently or how recently certain regions of the map have been updated. These map layers could help guide users towards areas of the map that need attention in order to help more fully complete the data set.</p>

<p>Next I will cover a couple of queries that are aligned with these ideas about the velocity and volume of content generation</p>

<p><strong>Amount of Nodes Elements Created by Day of Week</strong></p>

<p>I will use the <code>$dayOfWeek</code> operator to extract the day of week from the <code>created.timestamp</code> field, where 1 is Sunday and 7 is Saturday:</p>

<p><a href="http://docs.mongodb.org/manual/reference/operator/aggregation/dayOfWeek/">http://docs.mongodb.org/manual/reference/operator/aggregation/dayOfWeek/</a></p>

<pre><code class="language-python">cupertino_california.aggregate([{'$project': {'dayOfWeek': {'$dayOfWeek': '$created.timestamp'}}}, \  
                                {'$group': {'_id': '$dayOfWeek', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'_id': 1}}])['result']
</code></pre>

<pre><code>[{u'_id': 1, u'count': 44247},
 {u'_id': 2, u'count': 39205},
 {u'_id': 3, u'count': 39398},
 {u'_id': 4, u'count': 35923},
 {u'_id': 5, u'count': 33127},
 {u'_id': 6, u'count': 21174},
 {u'_id': 7, u'count': 29972}]
</code></pre>

<p>It seems like users were more active on in the beginning of the week.</p>

<p><strong>Age of Elements</strong></p>

<p>Lets see how old elements were created in the XML using the <code>created.timestamp</code> field and visualize this data by pushing the calculated values into a list.</p>

<pre><code class="language-python">ages = cupertino_california.aggregate([ \  
               {'$project': {'ageInMilliseconds': {'$subtract': [datetime.now(), '$created.timestamp']}}}, \
               {'$project': {'_id': 0, \
                             'ageInDays': {'$divide': ['$ageInMilliseconds', 1000*60*60*24]}}}, \
               {'$group'  : {'_id': 1, \
                             'ageInDays': {'$push': '$ageInDays'}}}, \
               {'$project': {'_id': 0, \
                             'ageInDays': 1}}])['result'][0]
</code></pre>

<p>Now I have a dictionary with an <code>ageInDays</code> key and a list of floats as the value. Next, I will create a pandas dataframe from this dictionary</p>

<pre><code class="language-python">from pandas import DataFrame

age_df = DataFrame.from_dict(ages)  
# age_df.index.name = 'element'
print age_df.head()  
</code></pre>

<pre><code>     ageInDays
0  1709.554143
1  1709.552893
2  1709.675798
3  1709.676018
4  1709.676331
</code></pre>

<p>Lets plot a histogram of this series with our best friend <code>ggplot</code>. The binwidth is set to 30 (about a month)</p>

<pre><code class="language-python">%matplotlib inline
from ggplot import *  
import warnings

# ggplot usage of pandas throws a future warning
warnings.filterwarnings('ignore')

print ggplot(aes(x='ageInDays'), data=age_df) + \  
             geom_histogram(binwidth=30, fill='#007ee5')
</code></pre>

<p><img src="https://raw.githubusercontent.com/bestkao/data-wrangling-with-openstreetmap-and-mongodb/master/output_97_0.png" alt="png"></p>

<pre><code>&lt;ggplot: (322176817)&gt;
</code></pre>

<p>Note the rise and fall of large spikes of activity occurring about every 400 days. I hypothesize that these are due to single users making many edits in this concentrated map area in a short period of time.</p>]]></content:encoded></item><item><title><![CDATA[Analyzing the NYC Subway Dataset]]></title><description><![CDATA[<p>Today we're investigating how subway ridership is affected by incidence of rain. I've taken a sample of MTA New York City Subway dataset from May, 2011 containing hourly entries and exits to turnstiles (UNIT) by day in the subway system. Joined with this dataset is weather data from Weather Underground</p>]]></description><link>https://jameskao.me/analyzing-the-nyc-subway-dataset/</link><guid isPermaLink="false">ffa991b6-ec6a-46e9-bd9e-13d5df78ce91</guid><dc:creator><![CDATA[James Kao]]></dc:creator><pubDate>Tue, 06 Dec 2016 04:41:12 GMT</pubDate><content:encoded><![CDATA[<p>Today we're investigating how subway ridership is affected by incidence of rain. I've taken a sample of MTA New York City Subway dataset from May, 2011 containing hourly entries and exits to turnstiles (UNIT) by day in the subway system. Joined with this dataset is weather data from Weather Underground containing features such as barometric pressure, wind speed, temperature, the total amount of precipitation, and the indicidence of rain, fog, and thunder.</p>

<p>The CSV is loaded into a Pandas dataframe named turnstile_master. Here's a preview of the first five rows:</p>

<pre><code>%matplotlib inline
import pandas as pd

turnstile_master = pd.read_csv('turnstile_data_master_with_weather.csv')
turnstile_master.head()
</code></pre>

<div>  
<table border="1" class="dataframe">  
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Unnamed: 0</th>
      <th>UNIT</th>
      <th>DATEn</th>
      <th>TIMEn</th>
      <th>Hour</th>
      <th>DESCn</th>
      <th>ENTRIESn_hourly</th>
      <th>EXITSn_hourly</th>
      <th>maxpressurei</th>
      <th>maxdewpti</th>
      <th>...</th>
      <th>meandewpti</th>
      <th>meanpressurei</th>
      <th>fog</th>
      <th>rain</th>
      <th>meanwindspdi</th>
      <th>mintempi</th>
      <th>meantempi</th>
      <th>maxtempi</th>
      <th>precipi</th>
      <th>thunder</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>R001</td>
      <td>2011-05-01</td>
      <td>01:00:00</td>
      <td>1</td>
      <td>REGULAR</td>
      <td>0</td>
      <td>0</td>
      <td>30.31</td>
      <td>42</td>
      <td>...</td>
      <td>39</td>
      <td>30.27</td>
      <td>0</td>
      <td>0</td>
      <td>5</td>
      <td>50</td>
      <td>60</td>
      <td>69</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>R001</td>
      <td>2011-05-01</td>
      <td>05:00:00</td>
      <td>5</td>
      <td>REGULAR</td>
      <td>217</td>
      <td>553</td>
      <td>30.31</td>
      <td>42</td>
      <td>...</td>
      <td>39</td>
      <td>30.27</td>
      <td>0</td>
      <td>0</td>
      <td>5</td>
      <td>50</td>
      <td>60</td>
      <td>69</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2</td>
      <td>R001</td>
      <td>2011-05-01</td>
      <td>09:00:00</td>
      <td>9</td>
      <td>REGULAR</td>
      <td>890</td>
      <td>1262</td>
      <td>30.31</td>
      <td>42</td>
      <td>...</td>
      <td>39</td>
      <td>30.27</td>
      <td>0</td>
      <td>0</td>
      <td>5</td>
      <td>50</td>
      <td>60</td>
      <td>69</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>3</td>
      <td>R001</td>
      <td>2011-05-01</td>
      <td>13:00:00</td>
      <td>13</td>
      <td>REGULAR</td>
      <td>2451</td>
      <td>3708</td>
      <td>30.31</td>
      <td>42</td>
      <td>...</td>
      <td>39</td>
      <td>30.27</td>
      <td>0</td>
      <td>0</td>
      <td>5</td>
      <td>50</td>
      <td>60</td>
      <td>69</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>4</td>
      <td>R001</td>
      <td>2011-05-01</td>
      <td>17:00:00</td>
      <td>17</td>
      <td>REGULAR</td>
      <td>4400</td>
      <td>2501</td>
      <td>30.31</td>
      <td>42</td>
      <td>...</td>
      <td>39</td>
      <td>30.27</td>
      <td>0</td>
      <td>0</td>
      <td>5</td>
      <td>50</td>
      <td>60</td>
      <td>69</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>  
<p>5 rows × 22 columns</p>  
</div>

<h1 id="section1statisticaltest">Section 1: Statistical Test</h1>

<h2 id="11methodology">1.1 Methodology</h2>

<p>I performed a Mann-Whitney U-test to find the U-statistic and p-value by comparing the number of turnstile entries with rain and without rain. A two-tail p-value was used with no assumptions made about the distributions of ridership on rainy and non-rainy days.</p>

<p>The p-value returned by scipy.stats.mannwhitneyu is one-tailed as noted here:</p>

<p><a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html">http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html</a></p>

<p>To perform the two-tailed test, the one-tailed p-value returned by scipy.stats.mannwhitneyu is multiplied by 2.</p>

<p>The p-critical value used in this test is 0.05 to test for significance (i.e. 5% chance of observing a result as least as extreme).</p>

<h2 id="12applicability">1.2 Applicability</h2>

<p>We're ultimately trying to determine whether subway ridership varies with weather. We start by split the hourly entries into two samples, entries with and without rain.</p>

<p>The Mann-Whitney U-Test tests the null hypothesis that the two samples being compared are derived from the same population. This null hypothesis allows us to test whether there is a statistically significant difference in distributions of ridership on rainy and non-rainy days (and thus, the hourly entries derived from the same population).</p>

<p>Exploratory data analysis (Section 3.1) indicates that the data is not normally distributed. The Mann-Whitney U-Test makes no assumptions about the normality of the data, validating that this test is appropriate.</p>

<h2 id="13results">1.3 Results</h2>

<p>Here's what we got from the Mann-Whitney U-Test:</p>

<pre><code>import numpy as np

with_rain = turnstile_master[turnstile_master.rain == 1]['ENTRIESn_hourly']
with_rain_mean = np.mean(with_rain)

without_rain = turnstile_master[turnstile_master.rain == 0]['ENTRIESn_hourly']
without_rain_mean = np.mean(without_rain)

print "Mean entries, with rain: {0}\nMean entries, without rain: {1}".format(with_rain_mean, without_rain_mean)

Mean entries, with rain: 1105.44637675
Mean entries, without rain: 1090.27878015



import scipy
import scipy.stats

[U, p] = scipy.stats.mannwhitneyu(with_rain, without_rain)

print "Mann-Whitney Test Statistic: {0}\np-Value: {1}".format(U, p)

Mann-Whitney Test Statistic: 1924409167.0
p-Value: 0.0249999127935
</code></pre>

<h2 id="14significance">1.4 Significance</h2>

<pre><code># Significance level defined in Section 1.1
alpha = 0.05

# two-tailed test
if (p * 2) &lt; alpha:
    print 'Reject the null hypothesis: {0} &lt; {1}'.format(p * 2, alpha)
else:
    print 'Fail to reject null hypothesis: {0} &lt; {1}'.format(p * 2, alpha)

Reject the null hypothesis: 0.049999825587 &lt; 0.05
</code></pre>

<p>The p-value barely didn't meet our two-tailed significance threshold of 5%, so we reject the null hypothesis of the Mann-Whitney U-Test. We conclude that the distribution of the entries is statistically different between rainy and non-rainy days.</p>

<h1 id="section2linearregression">Section 2: Linear Regression</h1>

<h2 id="21predictingridershipwithgradientdescent">2.1 Predicting Ridership with Gradient Descent</h2>

<p>Linear Regression with Gradient descent was performed using SciPy.</p>

<p>We want to choose a θ with an initial guess vector of 0 values so as to minimize J(θ). Gradient descent is used as a search algorithm that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ).</p>

<p>Refer to the Machine Learning course taught at Stanford by Andrew Ng for more information on gradient descent:</p>

<p><a href="http://cs229.stanford.edu/notes/cs229-notes1.pdf">http://cs229.stanford.edu/notes/cs229-notes1.pdf</a></p>

<h2 id="22featureselection">2.2 Feature Selection</h2>

<p>The following features are taken directly from the turnstile_master dataset: <br>
- rain
- precipitation
- hour
- mean temperature</p>

<pre><code># Select Features
features = turnstile_master[['rain', 'precipi', 'Hour', 'meantempi']]
</code></pre>

<p>The dummy variable (UNIT) taken directly from the turnstile_master dataset is used as categorical features for our linear model.</p>

<pre><code># Add UNIT to features using dummy variables
dummy_units = pd.get_dummies(turnstile_master['UNIT'], prefix='unit')
features = features.join(dummy_units)
</code></pre>

<h2 id="23featureselectionrationale">2.3 Feature Selection Rationale</h2>

<p>meantempi is chosen as a feature as temperature is a component of the weather that affects people’s decision making. A given temperature may affect how long and how much effort it takes to clothe in the morning. A person may simply choose to stay indoors due to discomfort with a given temperature.</p>

<p>Hour and weekday features were chosen as it is easily observed how ridership varies with time of day and day of week. A phenomenon supporting the Hour feature is the daily rush hours that mass transit systems and roadways exhibit and accommodate. An observation supporting the use of weekday is how the MTA train schedule varies between weekdays and weekends, with trains arriving with more infrequency on Saturday and Sunday.</p>

<h2 id="24modelresults">2.4 Model Results</h2>

<p>Lets start by normalizing the features in the dataset.</p>

<pre><code># Values
values = turnstile_master['ENTRIESn_hourly']
m = len(values)

# Normalize the features in the dataset.
mu = features.mean()
sigma = features.std()

if (sigma == 0).any():
    raise Exception("One or more features had the same value for all samples, and thus could " + \
                     "not be normalized. Please do not include features with only a single value " + \
                     "in your model.")
features = (features - mu) / sigma

# Add a column of 1s (y intercept)
features['ones'] = np.ones(m)
</code></pre>

<p>Prepare features and output value arrays, learning rate alpha, parameters theta, and num_iteration values for gradient descent.</p>

<p>We set the intitial coefficient vector theta to a column vector of zeros.</p>

<pre><code># Convert features and values to numpy arrays
features_array = np.array(features)
values_array = np.array(values)

# Set values for alpha, number of iterations.
alpha = 0.1
num_iterations = 75

# Initialize theta for gradient descent
theta_gradient_descent = np.zeros(len(features.columns))
</code></pre>

<p>Finally perform gradient descent, building a cost history over time. During this process, the cost function recomputes the theta value for a given number of iterations as a numerical approach to approximating the ideal coefficients to fit the linear regression model.</p>

<pre><code># Perform gradient descent given a data set with an arbitrary number of features.
cost_history = []

for i in range(num_iterations):
    predicted_value = np.dot(features_array, theta_gradient_descent)
    theta_gradient_descent = theta_gradient_descent + alpha/m * np.dot(values_array - predicted_value, features_array)

    # Compute the cost function given a set of features / values, and the values for our thetas.
    sum_of_square_errors = np.square(np.dot(features_array, theta_gradient_descent) - values_array).sum()
    cost = sum_of_square_errors / (2 * m)

    # Add cost to history
    cost_history.append(cost)

cost_history = pd.Series(cost_history)

# Calculate predictions
predictions = np.dot(features_array, theta_gradient_descent)
</code></pre>

<p>Let's visualize this.</p>

<pre><code>from ggplot import *

# Plot cost history for viewing.
cost_df = pd.DataFrame({
  'Cost_History': cost_history,
  'Iteration': range(len(cost_history))
})

ggplot(cost_df, aes('Iteration', 'Cost_History')) + \
  geom_point() + ggtitle('Cost History for alpha = %.3f' % alpha )
</code></pre>

<p><img src="https://raw.githubusercontent.com/bestkao/analyzing-the-nyc-subway-dataset/master/output_27_0.png" alt="png"></p>

<pre><code>&lt;ggplot: (347426297)&gt;
</code></pre>

<h2 id="25goodnessoffit">2.5 Goodness of Fit</h2>

<p>After making predictions, use the coefficient of determination (R^2) to see how the model performed:</p>

<pre><code># Compute the coefficient of determination (R^2) given
# a list of original data points and a list of predicted data points
sq_data_predictions_diff = np.sum((values - predictions)**2)
mean = np.mean(values)
sq_data_mean_diff = np.sum((values - mean)**2)

r_squared = 1 - sq_data_predictions_diff / sq_data_mean_diff

print "Calculated R^2 value: {0}".format(r_squared)

Calculated R^2 value: 0.458044314039
</code></pre>

<p>This R^2 value is is rather low, indicating that linear regression model doesn't fit with the data too well. Perhaps linear models aren't the best models for this dataset since the histogram of hourly entries shows a more normal distribution. Thus, I would expect the R2 value to be low. Due to the categorical nature of data, ('rain', 'no rain') a linear model inappropriate.</p>

<h1 id="section3visualization">Section 3. Visualization</h1>

<h2 id="31distributionofentriesn_hourly">3.1 Distribution of ENTRIESn_hourly</h2>

<pre><code>import matplotlib.pyplot as plt

plt.figure()

# your code here to plot a historgram for hourly entries when it is not raining
non_rainy_data = turnstile_master[turnstile_master['rain'] == 0]['ENTRIESn_hourly']
non_rainy_data.hist(range = [0, 6000], bins = 25, label='No Rain')

# your code here to plot a historgram for hourly entries when it is raining
rainy_data = turnstile_master[turnstile_master['rain'] == 1]['ENTRIESn_hourly']
rainy_data.hist(range = [0, 6000], bins = 20, label='Rain')

plt.title('Histogram of ENTRIESn_hourly')
plt.xlabel('ENTRIESn_hourly')
plt.ylabel('Frequency')
plt.legend()

plt.figure()




&lt;matplotlib.figure.Figure at 0x10c905590&gt;
</code></pre>

<p><img src="https://raw.githubusercontent.com/bestkao/analyzing-the-nyc-subway-dataset/master/output_32_1.png" alt="png"></p>

<pre><code>&lt;matplotlib.figure.Figure at 0x10c905590&gt;
</code></pre>

<p>The distribution of ENTRIESn_hourly appears to not be normally distributed and skewed to the right on both rainy and non-rainy days. The mode appears to be within the smallest bin for both distributions. There are far fewer observations on rainy days than non-rainy days.</p>

<h2 id="32averagehourlyridershipbyhourofday">3.2 Average Hourly Ridership by Hour of Day</h2>

<pre><code>import pandasql

df = turnstile_master[['Hour', 'ENTRIESn_hourly']]

q = """
    SELECT Hour AS hour,
           ENTRIESn_hourly AS hourlyentries
    FROM df
    GROUP BY hour
    """

#Execute SQL command against the pandas frame
rainy_days = pandasql.sqldf(q.lower(), locals())

print ggplot(rainy_days, aes('hour', 'hourlyentries')) + \
             geom_path(color='#cc2127') + \
             scale_x_continuous(name='Hour',
                                breaks=[0, 1, 2, 3, 4, 5,
                                        6, 7, 8, 9, 10, 11,
                                        12, 13, 14, 15, 16, 17,
                                        18, 19, 20, 21, 22, 23],
                                labels=['12AM', '1AM', '2AM', '3AM', '4AM', '5AM',
                                        '6AM', '7AM', '8AM', '9AM', '10AM', '11AM',
                                        '12PM', '1PM', '2PM', '3PM', '4PM', '5PM',
                                        '6PM', '7PM', '8PM', '9PM', '10PM', '11PM']) + \
             ggtitle('Average ENTRIESn_hourly by Hour') + \
             xlab('Hour') + \
             ylab('ENTRIESn_hourly') + \
             xlim(0, 23)
</code></pre>

<p><img src="https://raw.githubusercontent.com/bestkao/analyzing-the-nyc-subway-dataset/master/output_35_1.png" alt="png"></p>

<pre><code>&lt;ggplot: (281690245)&gt;
</code></pre>

<p>The above bar chart shows the average hourly ridership by hour of day. The line plot shows high ridership aligned with commute times to and from work, along with lunch hours and approaching midnight.</p>

<h2 id="33averagehourlyridershipbydayofweek">3.3 Average Hourly Ridership by Day of Week</h2>

<pre><code>df = turnstile_master[['DATEn', 'ENTRIESn_hourly']]

q = """
    SELECT cast(strftime('%w', DATEn) AS integer) AS weekday,
           sum(ENTRIESn_hourly)/count(*) AS hourlyentries
    FROM df
    GROUP BY cast(strftime('%w', DATEn) AS integer)
    """

#Execute SQL command against the pandas frame
rainy_days = pandasql.sqldf(q.lower(), locals())

print ggplot(rainy_days, aes('weekday', 'hourlyentries')) + \
        geom_bar(fill = '#007ee5', stat='bar') + \
        scale_x_continuous(name='Weekday',
                           breaks=[0, 1, 2, 3, 4, 5, 6],
                           labels=['Sunday', 'Monday', 'Tuesday', 'Wednesday',
                                   'Thursday', 'Friday', 'Saturday']) + \
        ggtitle('Average ENTRIESn_hourly by Weekday') + \
        ylab('ENTRIESn_hourly') + \
        xlim(-.5, 6.5)
</code></pre>

<p><img src="https://raw.githubusercontent.com/bestkao/analyzing-the-nyc-subway-dataset/master/output_38_0.png" alt="png"></p>

<pre><code>&lt;ggplot: (347427845)&gt;
</code></pre>

<p>The above bar chart shows the average hourly ridership by day of week. The sum of ENTRIESn_hourly by day of week was divided by the count of rows for a given day of week (as each row represents an hour’s worth of data).</p>

<p>The bar chart shows that a higher average hourly ridership on weekdays than weekends, with Saturday seeing significantly higher ridership than Sunday.</p>

<p>Monday's average hourly ridership is notably different than the rest of the weekdays, perhaps  due to a seasonal effect of Monday holidays. The given data set is a sample from May 2011 and there's at least one major holiday that falls on Monday in the month of May: Memorial Day.</p>

<h1 id="section4conclusion">Section 4: Conclusion</h1>

<h2 id="41domorepeopleridethesubwaywhenitisraining">4.1 Do More People Ride the Subway When it is Raining?</h2>

<p>Based on this analysis, I'm confident that more people ride the NYC subway when it is raining.</p>

<h2 id="42rationale">4.2 Rationale</h2>

<p>As seen in Section 1.3, the mean of ENTRIESn_hourly is greater for hours with rain than without (1,105 vs. 1,090).</p>

<p>Additionally, the Mann-Whitney U-Test indicates that the ENTRIESn<em>hourly sample with rain appears to be drawn from a different distribution (or population) than the ENTRIESn</em>hourly sample from hours without rain.</p>

<pre><code>from datetime import datetime

df = turnstile_master[['DATEn', 'ENTRIESn_hourly', 'rain']]
df.is_copy = False

# get the weekday as a string
f = lambda x: datetime.strptime(x, "%Y-%m-%d").strftime('%A')

df['weekday'] = df['DATEn'].apply(f)

print ggplot(aes(x='ENTRIESn_hourly', color='rain'), data = df) \
        + geom_density() \
        + xlim(0,2500) \
        + facet_wrap('weekday')
</code></pre>

<p><img src="https://raw.githubusercontent.com/bestkao/analyzing-the-nyc-subway-dataset/master/output_43_0.png" alt="png"></p>

<pre><code>&lt;ggplot: (345798425)&gt;
</code></pre>

<p>This faceted grid shows the density functions of ENTRIESn<em>hourly with and without rain for each day of the week. Currently legends do not work within ggplot facets. Red represents without rain, and blue represents with rain. It appears for every day, except Monday, that the distribution of ENTRIESn</em>hourly has a fatter right tail when there is rain.</p>

<h1 id="section5reflection">Section 5: Reflection</h1>

<h2 id="51shortcomingsofthedatasetandmethodsofanalysis">5.1 Shortcomings of the Data Set and Methods of Analysis</h2>

<p>We're only working with a single month of MTA data, which is subject to effects of seasonality, as the time of year may also affect ridership. Additionally, this month contains a Monday holiday, which is reflected in the visualizations in Section 3.2 and Section 4.</p>

<p>The biggest shortcoming with this dataset is the discrepency between the units of the two joined datasets. The MTA turnstile entry data is produced on an hourly basis, whereas the weather data is produced daily. When it rains at any point in a given day, every hour of that day will reflect that it rained. This prevents a truly granular analysis of how the weather can affect ridership within a day.</p>]]></content:encoded></item><item><title><![CDATA[Introducing James Kao]]></title><description><![CDATA[<p>Yesterday at Hackathon Hackers, we launched <strong>James Kao</strong> – a more <em>human</em> development tool. Phew! <a href="https://instagram.com/thebestkao">Here are the slides from the presentation</a>.</p>

<p>Although it looks pretty minimalist, we’ve been refining on James for several months. After many iterations, we have created James to lead teams, champion new projects, and seamlessly</p>]]></description><link>https://jameskao.me/introducing-james-kao/</link><guid isPermaLink="false">6755b40f-4aa6-42d0-b7b6-55cf7dd3e8f7</guid><dc:creator><![CDATA[James Kao]]></dc:creator><pubDate>Tue, 06 Dec 2016 04:26:48 GMT</pubDate><content:encoded><![CDATA[<p>Yesterday at Hackathon Hackers, we launched <strong>James Kao</strong> – a more <em>human</em> development tool. Phew! <a href="https://instagram.com/thebestkao">Here are the slides from the presentation</a>.</p>

<p>Although it looks pretty minimalist, we’ve been refining on James for several months. After many iterations, we have created James to lead teams, champion new projects, and seamlessly introduce new technologies.</p>

<p>Not everything can tell what your clients are going to want before they say so. James can, helping you design your software for the inevitable.</p>]]></content:encoded></item></channel></rss>