Email the Author
You can use this page to email Shoaib Burq and Dr. Kashif Rasul about Apache Spark & Geodata.
About the Book
In this book I will quickly introduce you to the Apache Spark stack and then get into the meat of performing a full-featured geospatial analysis. Using OpenStreetMap data as our base, our end goal will be to find the most cultural city in Western Europe!
That's right! In this book we will develop our own Cultural Weight Algorithm (TM) :) and apply it to a set of major European cities. We will analyze the data with Apache Spark, and in the process work through the following phases of a Big Data project:
- Consuming: Retrieving raw data from REST APIs (OpenStreetMap).
- Preparing: Exploring the data and creating a schema for geospatial data.
- Summarizing: Querying data by location, performing spatial operations such as finding overlapping geospatial features, joining datasets by location (also known as spatial joins), and finally computing location-based summary statistics to answer our question about the cultural capital of Europe.
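To give a taste of the last phase, here is a minimal sketch of a spatial join in plain Python (no Spark yet): each point feature is assigned to the city whose bounding box contains it, and we then count features per city. The cities, bounding boxes, and feature coordinates below are made-up sample data, not the book's real dataset.

```python
# A minimal, plain-Python sketch of a spatial join: assign each point
# feature to the city whose bounding box contains it, then summarize.
# All coordinates below are illustrative sample data.

# Bounding boxes as (min_lon, min_lat, max_lon, max_lat)
cities = {
    "Berlin": (13.1, 52.3, 13.8, 52.7),
    "Paris": (2.2, 48.8, 2.5, 49.0),
}

# (lon, lat) of hypothetical cultural features, e.g. museums
features = [(13.4, 52.52), (2.35, 48.86), (13.38, 52.51), (0.0, 0.0)]

def contains(bbox, point):
    """True if the (lon, lat) point falls inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    lon, lat = point
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

# Spatial join + summary: count features falling inside each city
counts = {name: sum(contains(bbox, p) for p in features)
          for name, bbox in cities.items()}

print(counts)  # {'Berlin': 2, 'Paris': 1}
```

In the book we will express the same idea as distributed operations over Spark datasets rather than a Python dictionary comprehension, but the logic of "join by location, then aggregate" stays the same.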
That's the book in a nutshell. I hope you will join us on this journey of exploring one of the most exciting technology stacks to come out of the good folks at UC Berkeley.
Why Spark?
Spark has quickly overtaken Hadoop as the front runner among big data analysis technologies. There are a number of reasons for this, such as its developer-friendly interactive mode, its polyglot interface in Scala, Java, Python, and R, and the rich algorithmic libraries those language ecosystems offer.
Out of the box, Spark includes a powerful set of tools: the ability to write SQL queries, perform streaming analytics, run machine learning algorithms, and even tackle graph-parallel computations. But what really stands out is its usability.
With its interactive shells (in both Scala and Python), it makes prototyping big data applications a breeze.
Why PySpark?
PySpark provides Python API bindings for Spark. It enables full use of the Python ecosystem on every node of the Spark cluster via pickle serialization and, more importantly, gives access to Python's rich libraries for machine learning, such as scikit-learn, and for data processing, such as pandas.
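To see the mechanism PySpark relies on, here is a standalone sketch of pickle round-tripping a Python record. This is a simplification: Spark adds its own batching and framing on top, but serializing driver-side Python objects to bytes and deserializing them on worker processes is the core idea. The record below is made-up sample data.

```python
import pickle

# A Python object as it might exist on the Spark driver
# (illustrative sample data)
record = {"city": "Berlin", "amenity": "museum", "lon": 13.4, "lat": 52.52}

# Serialize to bytes -- conceptually, this is how PySpark ships Python
# data from the driver process out to the worker processes.
payload = pickle.dumps(record)

# ...the bytes travel to a worker process, which deserializes them:
restored = pickle.loads(payload)

assert restored == record
print(restored["amenity"])  # museum
```

Because pickle works with ordinary Python objects, the same pandas frames and scikit-learn models you use locally can flow through a Spark job with no translation layer.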
Throughout this book I am going to use a Docker container with the relevant libraries. Don't worry if you don't know Docker; I walk you through setting it up and running it too.
About the Authors
Shoaib Burq
I am a geospatial applications developer and have worked on projects ranging from building geocoders for Australian emergency-response agencies to underwater mapping for marine exploration. My last startup was a geospatial database-as-a-service with an easy-to-use API for developers building mobile and geolocation apps.
Dr. Kashif Rasul
I have a PhD in Mathematics from the Freie Universität Berlin, and in parallel I have been working as a software developer in the area of location-based services and geospatial web application development. Shoaib and I were among the first developers to use Ruby/Rails with the open source geospatial stack of the time, and we have extensive experience in this area. I have also worked on PostGIS, the geospatial extension to PostgreSQL, and built APIs on top of it for mobile applications doing geo-fencing and real-time triggers. We have also spoken in depth about developing geospatial web applications at conferences such as RailsConf and FOSS4G, and at local meetups.
You can follow me on GitHub (kashif) or Twitter (@krasul).