Life Is Study: edX - Introduction to Big Data with Apache Spark Review

CS100.1x Introduction to Big Data with Apache Spark is a 5-week introduction to distributed computing offered by UC Berkeley through the edX MOOC platform. It focuses on teaching students how to perform large-scale computation using Apache Spark. The assignments use PySpark, Spark’s Python API, so some familiarity with Python programming is necessary. You don’t need prior exposure to big data or distributed computing to take the course. Grades are based on four programming labs (80%), easy comprehension questions that allow unlimited attempts (12%), and setup of the course virtual machine used to complete the labs (8%).

Course lectures in Introduction to Big Data with Apache Spark are relatively brief and tend to stay at a high level, discussing general big data concepts rather than the details of Apache Spark. The instructor does a fine job in the few lectures the course offers, but there were not enough of them and they often felt disconnected from the assignments. The fifth week had no lectures.

The labs are the core of this course. While you can breeze through the weekly lectures in half an hour or less, each of the four labs is a lengthy reading and programming assignment packaged in an IPython notebook. Expect to spend 2 to 4 hours each on labs 1, 2 and 4, and 3 to 6 hours on lab 3. The labs start by teaching basic Apache Spark manipulations and move on to some text analysis and machine learning. Using the IPython notebook to deliver labs is a convenient way to intermingle text and instructions with code. On the other hand, each exercise tends to depend on code executed somewhere above it, so a mistake made in an earlier exercise can lead to some odd errors later on, and Spark’s error traces aren’t particularly helpful. The course does provide some basic tests for each exercise, but it is easy to arrive at solutions that pass the checks yet cause errors later on. The course forums on Piazza are a vital resource for troubleshooting and for clearing up ambiguous instructions; I imagine some of the snags will be resolved in future offerings. Despite the occasional hiccups, the labs do a good job of familiarizing students with Apache Spark’s Resilient Distributed Dataset objects and the various transformations and actions you can perform with them.
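If you haven't met the split between transformations and actions before, here is a rough sketch of the idea in plain Python. This is not Spark and not taken from the course labs: ToyRDD is a hypothetical stand-in I made up so the example runs without a Spark install, and only its method names (parallelize, map, filter, collect, count) mirror PySpark. In real PySpark, parallelize is a method on the SparkContext rather than on the RDD class. The point is just that transformations record work to be done, and nothing runs until an action asks for a result.

```python
# Toy stand-in for a Spark RDD in plain Python (hypothetical, not the real API).
# Only the method names mirror PySpark; everything else is an assumption.

class ToyRDD:
    def __init__(self, source):
        # source is a zero-argument callable that returns a fresh iterator
        self._source = source

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, func):
        return ToyRDD(lambda: (func(x) for x in self._source()))

    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._source() if predicate(x)))

    # Actions: run the recorded pipeline and return a plain Python value.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


calls = []

def square(x):
    calls.append(x)  # record each invocation to make the laziness visible
    return x * x

numbers = ToyRDD.parallelize(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # still 0: no action has been called yet
result = evens_squared.collect()  # the action triggers the computation

print(calls_before_action, result)
```

Calling a second action such as count() on evens_squared in this toy runs the whole pipeline again, which loosely resembles how an uncached RDD is recomputed for each action.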

Introduction to Big Data with Apache Spark is a great place to start learning about distributed computing if you know some Python. Although the lectures don’t add much technical depth to the course, they provide some big-picture background that will be useful for students who have little prior exposure to big data concepts. The labs give you adequate opportunity to get your hands dirty with Apache Spark and gain basic familiarity with the data manipulations it offers. UC Berkeley is offering a follow-up course, “Scalable Machine Learning”, that builds on the foundation laid in CS100.1x.

I give this course 4 out of 5 stars: Very Good.


Tuesday, June 30, 2015

