Monday, July 21, 2014

Sabermetrics 101: Introduction to Baseball Analytics Review


Sabermetrics 101 is a beginner course in baseball analytics published by Boston University on the edX platform. The course is organized into 4 content tracks: a statistics track, a sabermetrics track, a tech track, and a baseball history track. The statistics track covers basic statistical concepts like mean, median, measures of spread, regression to the mean and correlation. The sabermetrics track introduces a variety of concepts and computed statistics in baseball analytics like on-base percentage, slugging, other hitting metrics and converting runs to wins. The tech track focuses on teaching SQL database queries using an interactive MySQL environment as well as R basics. Each of the course's 6 weeks of content starts with a brief overview of the material to be covered in each track.
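
For readers who haven't met the hitting metrics mentioned above, here is a minimal Python sketch of on-base percentage and slugging using their standard definitions. The function names and the sample stat line are my own invention for illustration; they are not taken from the course materials.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = at_bats + walks + hit_by_pitch + sac_flies
    return (hits + walks + hit_by_pitch) / denominator if denominator else 0.0


def slugging(singles, doubles, triples, home_runs, at_bats):
    """SLG = total bases / at-bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats if at_bats else 0.0


# Hypothetical season line, invented for illustration only.
obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5, at_bats=500, sac_flies=5)
slg = slugging(singles=100, doubles=30, triples=5, home_runs=15, at_bats=500)
print(f"OBP: {obp:.3f}  SLG: {slg:.3f}")
```

Both functions guard against a zero denominator so an empty stat line returns 0.0 instead of raising an error.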

SABR101x is a good intro to sabermetrics, but it suffers from several issues common to first-run MOOCs that hold it back from being a great course. The course has good instruction, and organizing the materials into different tracks lets people focus on their areas of interest. On the down side, information in the videos was sometimes hard to make out due to small text size and poor color choices with backgrounds and pens. The difficulty level also seemed a bit unpredictable: the statistics track was very basic while the tech track got into SQL and R at a rate that is probably a bit too fast for people with no background knowledge. In addition, tech exercises sometimes suffered from ambiguous wording and automated graders initially expected too much accuracy on rounded answers. Many of these kinks could be straightened out for a second offering of the course.

If you love baseball and have any interest in baseball analytics, you will probably enjoy this course. If you're mainly interested in analytics and picking up new technical skills, the SQL tech sections and SQL sandbox are the highlights of the course: you'll go from no SQL knowledge to being able to do basic queries and joins in the span of a couple of weeks.

I give this course 4 out of 5 stars: Very Good.

Friday, July 18, 2014

Johns Hopkins Coursera Data Science Specialization Track--Final Review


The Johns Hopkins Data Science Specialization track has been an interesting experience over the past 4 months, but not for the reasons you might expect. There was some good content here and there, but on the whole, the data science track was disappointing. There's only so much you can cover in a month. Most of the courses jump from topic to topic too quickly to develop any real depth of understanding. On the plus side, you will gain basic R proficiency if you complete the R programming, getting and cleaning data, reproducible research and exploratory data analysis courses. On the down side, the content is often dull, skimps on depth and has little instructor face time. Coursera, edX and Udacity all have offerings that cover similar topics that are more engaging and cover the material in greater depth.

The most interesting aspect of the data science track is pondering the true motivation behind the track and what it means for MOOCs in general. The MOOC educational paradigm is still relatively new and it's no secret that the big MOOC providers--Coursera, Udacity and edX--are trying to find the right formula to turn "free" education into a profitable--or at least sustainable--endeavor. Both of the for-profit platforms, Coursera and Udacity, seem to be moving toward paid "mini degrees." In the case of Coursera, mini degrees are specializations offered by universities like Johns Hopkins. Udacity recently pulled free certificates for its courses and is gearing up to launch paid "Nanodegrees" earned by completing courses in various focus areas. It remains to be seen whether these paid mini degrees will have any real value as job credentials, although Udacity claims its degrees will be recognized by several of its strategic partners like AT&T, Salesforce and Cloudera.

Given the push toward monetizing MOOCs, a cynic might question Johns Hopkins' motivation in offering the data science specialization. As I was taking the second wave of courses in the track, I noticed Johns Hopkins was rerunning the first month's courses. Then, when I took the 3rd wave, all of the courses in the first 2 months were rerun. This month, all 9 courses are running again. Creating 9 short, lackluster and often content-light courses that can be rerun every month at the cost of $50 each for those interested in verified certificates seems more like a business experiment than a genuine attempt at providing high quality educational content. I hope the big MOOC providers find a way to make money while continuing to provide top-notch content. It would be a shame if shallow month-long courses became the norm.

Saturday, July 12, 2014

Johns Hopkins Coursera Data Science Specialization Track--Part 3



The first run of the Johns Hopkins Data Science Specialization on Coursera is drawing to a close. The final 3 courses of the 9-course series are just wrapping up, so it’s time for another batch of reviews.



Regression Models

Regression Models is the 7th course in the Johns Hopkins data science specialization track on Coursera. This course is essentially identical to the statistical inference course in terms of structure, presentation and quality: the entire course consists of dull, information-packed slides with mediocre voice-overs. It seems like half of the course consists of slides with verbose math expressions in summation notation and the instructor telling you that you don't really need to understand them unless you are interested in the math behind the models. As with other courses in the track, there are no in-lecture quizzes or interactive exercises and there is no instructor face time.

Overall, this is a disappointing course that probably won’t keep your interest long enough for you to bother completing all the videos, much less the quizzes and the project.

A few weeks after this course launched, Johns Hopkins did release an interactive learning package for R called swirl that provides a series of exercises for this course and some of their other Coursera offerings. The exercises in swirl aren't the best around, but they do help you understand the material a bit better than the main lecture content.

I give this course 2 out of 5 stars: Bad.



Practical Machine Learning

Practical Machine Learning is the 8th course in the 9-part data science specialization. It introduces machine learning in R, including the basics of prediction, splitting data into training and testing sets, regression, trees, random forests and boosting, all in the span of 4 weeks. The course focuses on using the caret package in R to apply machine learning algorithms.
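
To give a sense of what splitting data into training and testing sets means, here is a rough Python sketch of the general idea. The course itself does this in R with caret; the function below is a hypothetical helper of my own, not caret's API, and the 70/30 split and seed are arbitrary choices.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data set: ten row identifiers standing in for real observations.
data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))
```

The model is fit only on the training rows, and the held-out test rows are used once at the end to estimate how well it predicts data it has not seen.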

Similar to other courses in the data science specialization, the course content is mainly static slides with voice-overs, but thankfully the slides are generally not overly cluttered and the voice-overs are of decent quality. The course has a lot of good information on how to use R to apply common machine learning techniques to data, but you aren't going to gain a deep understanding of how the machine learning methods work. "Practical" in this case means "learn how to use the tool, not how it works." I suspect students coming into this course with no prior knowledge of machine learning will find that the lectures jump from one topic to another too quickly as the course goes on. Taking a course that covers machine learning theory, like the 3-part machine learning series from Udacity, will give you a deeper understanding of the methods introduced in this course.

Practical Machine Learning does a pretty good job of introducing machine learning topics in a limited amount of time, but the coverage is too brief to gain a solid understanding of many of the methods presented. This course would have been much better if it were 8 weeks long and had at least 1 hour of solid lecture content per week with interactive exercises or homework. If you’re looking for an excellent practical machine learning course that spends enough time on each topic and has enough homework to really help students learn, check out MIT's Analytics Edge on edX.

I give this course 3 out of 5 stars: Satisfactory.



Developing Data Products

Developing Data Products is the final course in the 9-part data science specialization. The course introduces several tools you can use to put R code on the web, into slideshows and into R packages, including Shiny, rCharts, googleVis, Slidify and RStudio Presenter. Although the course is listed as 4 weeks, it only has 3 weeks of lecture content, with one week devoted to giving students time to work on the course project. Unlike previous courses in the data science specialization, this course is not taught by a single professor: each of the 3 professors involved in the specialization leads a few lectures.

This course provides a decent overview of some useful tools for integrating R with the web and in presentations, but it covers too many different tools in too short a time without any exercises to help students practice using the tools presented. You'll have to spend a lot of time on your own exploring the tools discussed to really learn how to use them. It's nice to be aware of the kinds of tools that are out there and have some basic information on each one to get started, but in keeping with the theme of the entire data science specialization, coverage is only skin deep.

I give this course 2 out of 5 stars: Bad.