Friday, February 14, 2014

Udacity Intro to Data Science Review



Udacity is kicking off 2014 with a new batch of courses focused on data science, the first of which, Intro to Data Science, launched earlier this month. I've been looking forward to this course for a while, and since all the materials were made available right away, I plowed through the whole thing in a few days.


Intro to Data Science is an intermediate-level course that assumes basic Python programming skills and knowledge of statistics. The course focuses on gathering, manipulating, analyzing and visualizing data using Python and various Python packages such as NumPy, SciPy and pandas. One of the best parts of this course is the exposure it gives to Python packages in the SciPy stack, although I wish more time were devoted to explaining what the various modules in the stack do, how to set them up at home and when to use them.


The first lesson is a fairly gentle introduction with an interesting homework project dealing with data from the Titanic disaster. Lesson 2 goes into more detail about gathering and cleaning data using pandas and an additional module that lets you make SQL queries to extract data from pandas data frames. Lesson 3 jumps into data analysis with a t-test and linear regression using gradient descent. Going from basic data manipulation to these topics was a bit jarring in terms of difficulty, and more time could have been spent explaining how the functions work. I came away without a good understanding of what gradient descent is really doing. Lesson 4 focuses on making visualizations using a module that attempts to port the functionality of the R language’s ggplot2 plotting package. Finally, lesson 5 introduces the concept of big data and MapReduce as a way to deal with large data sets. Each homework assignment after the first has students working with New York subway turnstile data, which lets them build familiarity with the data over the course. This was a very good decision, since students can focus on learning new concepts rather than spending time getting to know a new data set over and over again.
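
Since gradient descent was the part that felt least explained, here is a minimal sketch of the idea as I understand it. This is my own illustration, not code from the course: the function name, the toy data, the learning rate and the iteration count are all made up. It fits a straight line by repeatedly nudging the intercept and slope in the direction that reduces the mean squared error.

```python
# Minimal sketch of gradient descent for one-variable linear regression.
# The data, learning rate and iteration count are assumptions chosen for
# illustration; none of this comes from the course materials.

def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y ~ intercept + slope * x by minimizing mean squared error."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction errors under the current parameters.
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_intercept = 2.0 / n * sum(errors)
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Take a small step downhill along each parameter.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data lying exactly on y = 1 + 2x
intercept, slope = gradient_descent(xs, ys)
print(round(intercept, 3), round(slope, 3))
```

On this toy data the parameters should settle close to an intercept of 1 and a slope of 2. With a learning rate that is too large the updates overshoot and diverge, which is the kind of detail I would have liked the lesson to spend more time on.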


Intro to Data Science introduces some major topics in data science and does a pretty good job given the amount of content it offers, but its coverage of each topic is too brief. Hopefully the forthcoming Udacity courses, Exploratory Data Analysis and Data Wrangling with MongoDB, will build on the foundation provided by this course and give students a bit more depth.

I give this course 4 out of 5 stars: Very Good.
