Data Wrangling with OpenStreetMap and MongoDB
OpenStreetMap is a community-built, freely editable map of the world, inspired by the success of Wikipedia, where crowdsourced data is open and free of proprietary restrictions. Craigslist and Foursquare, for example, use it as an open source alternative to Google Maps.
Users can map features such as roads as polylines, draw polygons for buildings or areas of interest, or insert nodes for landmarks. These map elements can be further tagged with details such as street addresses or amenity types. Map data is stored in an XML format. More details about OSM XML can be found here:
http://wiki.openstreetmap.org/wiki/OSM_XML
Some highlights of the OSM XML format relevant to this project are:
- OSM XML is a list of instances of data primitives (nodes, ways, and relations) found within a given bounds
- nodes represent dimensionless points on the map
- ways contain node references that form either a polyline or a polygon on the map
- nodes and ways both contain child tag elements representing key-value pairs of descriptive information about a given node or way
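To make the node/way relationship concrete, here is a minimal sketch (with made-up IDs and coordinates) of reconstructing a way's geometry by following its nd references back to node coordinates:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up OSM fragment: two nodes and a way referencing them
osm = '''<osm>
  <node id="1" lat="37.36" lon="-122.02"/>
  <node id="2" lat="37.37" lon="-122.03"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>'''

root = ET.fromstring(osm)
# Map node id -> (lat, lon)
coords = {n.get('id'): (float(n.get('lat')), float(n.get('lon')))
          for n in root.findall('node')}
# Follow the way's nd references to build its polyline
polyline = [coords[nd.get('ref')] for nd in root.find('way').findall('nd')]
print(polyline)  # [(37.36, -122.02), (37.37, -122.03)]
```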
As with any user-generated content, there is bound to be some dirty data. In this project I'll audit, clean, and summarize the data with Python and MongoDB.
Chosen Map Area
For this project, I chose a ~50MB extract of the Cupertino and West San Jose area. I grew up in Cupertino and lived through the tech sprawl of Apple and the Asian/Indian gentrification of the area. I figured that my familiarity with the area and intrinsic interest in my hometown make it a good candidate for analysis.
from IPython.display import HTML
HTML('<iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.openstreetmap.org/export/embed.html?bbox=-122.1165%2C37.2571%2C-121.9060%2C37.3636&layer=mapnik"></iframe><br/><small><a href="http://www.openstreetmap.org/#map=12/37.3105/-122.0135" target="_blank">View Larger Map</a></small>')
I used the Overpass API to download the OpenStreetMap XML for the corresponding bounding box:
http://overpass-api.de/api/map?bbox=-122.1165,37.2571,-121.9060,37.3636
import requests
url = 'http://overpass-api.de/api/map?bbox=-122.1165%2C37.2571%2C-121.9060%2C37.3636'
filename = 'cupertino_california.osm'
Python's Requests library is pretty awesome for downloading this dataset, but by default it keeps the entire response in memory. Since we're working with a much larger dataset, we overcome this limitation with a modified procedure from this Stack Overflow post:
http://stackoverflow.com/a/16696317
def download_file(url, local_filename):
    # stream=True allows downloading of large files; prevents loading the entire file into memory
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
                f.flush()

download_file(url, filename)
Auditing the Data
With the OSM XML file downloaded, let's parse through it with ElementTree and count the number of unique element types. Iterative parsing is used since the XML is too large to process in memory all at once.
import xml.etree.ElementTree as ET
import pprint

tags = {}
for event, elem in ET.iterparse(filename):
    if elem.tag in tags:
        tags[elem.tag] += 1
    else:
        tags[elem.tag] = 1
pprint.pprint(tags)
{'bounds': 1,
'member': 6644,
'meta': 1,
'nd': 255022,
'node': 214642,
'note': 1,
'osm': 1,
'relation': 313,
'tag': 165782,
'way': 28404}
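For even larger extracts, a common refinement (not used above) is to clear each element after processing it so memory stays flat throughout the parse; a sketch:

```python
import xml.etree.ElementTree as ET

def count_tags(filename):
    """Count element types while discarding processed elements to keep memory flat."""
    tags = {}
    for event, elem in ET.iterparse(filename):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
        elem.clear()  # drop the element's children now that we've counted it
    return tags
```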
Here I have built three regular expressions: lower, lower_colon, and problemchars.
- lower: matches strings containing only lowercase characters
- lower_colon: matches strings of lowercase characters with a single colon within the string
- problemchars: matches characters that cannot be used within keys in MongoDB
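To illustrate how such patterns classify tag keys, here is a quick sketch; the specific patterns below are my assumed equivalents of the descriptions above, not necessarily the exact ones used later:

```python
import re

# Assumed patterns matching the descriptions above
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

print(bool(lower.search('amenity')))            # True
print(bool(lower_colon.search('addr:street')))  # True
print(bool(problemchars.search('name 1')))      # True (contains a space)
```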
Here is a sample of OSM XML:
<node id="266587529" lat="37.3625767" lon="-122.0251570" version="4" timestamp="2015-03-30T03:17:30Z" changeset="29840833" uid="2793982" user="Dhruv Matani">
<tag k="addr:city" v="Sunnyvale"/>
<tag k="addr:housenumber" v="725"/>
<tag k="addr:postcode" v="94086"/>
<tag k="addr:state" v="California"/>
<tag k="addr:street" v="South Fair Oaks Avenue"/>
<tag k="amenity" v="restaurant"/>
<tag k="cuisine" v="indian"/>
<tag k="name" v="Arka"/>
<tag k="opening_hours" v="10am - 2:30pm and 5:00pm - 10:00pm"/>
<tag k="takeaway" v="yes"/>
</node>
Within the node element there are ten tag children. The keys for half of these children begin with addr:. Later in this notebook I will use the lower_colon regex to help find these keys so I can build a single address document within a larger JSON document.
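The shaping described above can be sketched as follows; shape_address is a hypothetical helper name for illustration, not the notebook's actual function:

```python
def shape_address(tags):
    """Fold addr:* keys into a nested 'address' sub-document (illustrative sketch)."""
    doc = {}
    for k, v in tags.items():
        if k.startswith('addr:') and k.count(':') == 1:
            # 'addr:city' -> address['city'], etc.
            doc.setdefault('address', {})[k.split(':', 1)[1]] = v
        else:
            doc[k] = v
    return doc

# A few tags from the sample node above
sample = {'addr:city': 'Sunnyvale', 'addr:housenumber': '725',
          'amenity': 'restaurant', 'name': 'Arka'}
print(shape_address(sample))
```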
import re
lower = re.compile(r'^([a-z]|_)*$')