Wednesday, June 29, 2016

edX: Introduction to Apache Spark Review


Introduction to Apache Spark is the first course in a new 5-part Data Science and Engineering with Spark series offered by the University of California, Berkeley through edX. Intro to Spark is a short 3-week primer on Spark basics that introduces the Python interface for Spark and the DataFrame, a Spark data structure that facilitates distributed data analysis. This course is largely a reworked version of "Introduction to Big Data With Apache Spark", which was offered by the same professor through edX in the summer of 2015. The main differences between the two are that the previous course focused on an older Spark data object called the resilient distributed dataset (RDD) and had students complete assignments locally, while the new course uses DataFrames and has you complete the assignments online using the Databricks computing platform. You don't need any prior knowledge besides basic Python programming skills to take this course. Grading is based on a handful of comprehension quizzes and two labs.


The weekly lecture content in Intro to Spark consists of several 2-10 minute lectures, each followed by one or two comprehension questions. The video quality and instruction are good, but the total amount of video content in the course is slim: only weeks 1 and 2 have lecture videos and the total lecture length is about 90 minutes. At least half of the video content is recycled from the old course.


The vast majority of your course time will be spent working on the assignments. The first week walks you through the process of creating a free Databricks account and setting up an autograder notebook you'll use to submit the graded labs in weeks 2 and 3. The lab assignments are delivered as code notebooks that you import into Databricks, a format that will be familiar to anyone who has worked with IPython/Jupyter notebooks. The labs contain a lot of text and code, so it can take several hours to complete them, even though much of the time you'll just be reading along and running code provided in the notebook rather than writing code yourself. The code you have to write is generally limited to small, one-line operations on DataFrames. Despite the relative simplicity of the labs, you'll probably need to refer to Spark's documentation or peruse the course forums to get everything working, because Spark objects and syntax take some time to get used to. Opting to use Databricks instead of having students complete assignments locally is probably a good decision on the whole, since it keeps students on the same page, but I found Databricks to be slow. I had problems with clusters failing, and the assignment submission process was overly complex. I would have preferred to do the assignments locally.


Like Coursera, edX is breaking old courses into parts and repackaging them as multi-part series. I don't have a problem with breaking up old courses per se, but individual courses should be able to stand on their own. As a standalone course, Introduction to Apache Spark is too short, and too much of that short running time is spent on setup and on dealing with Databricks and the autograder. The time investment you make with this course will probably be worth it if you continue on with the other courses in the Data Science and Engineering with Spark series, but I wouldn't recommend taking this course in isolation.


I give Introduction to Apache Spark 3 out of 5 stars: Okay.
