Sunday, December 13, 2015
Coursera: Machine Learning: Regression Review
Machine Learning: Regression is the second course in the 6-part Machine Learning specialization offered by the University of Washington on Coursera. The 6-week course builds from simple linear regression with one input feature in the first week to ridge regression, the lasso and kernel regression. Week 3 also takes a detour to discuss important machine learning topics like the bias/variance trade-off, overfitting and validation to motivate ridge and lasso regression. Like the first course in the specialization, "Regression" uses GraphLab Create, a Python package that will only run on the 64-bit version of Python 2.7. You can technically use other tools like Scikit-learn or even R to complete the course, but using GraphLab will make things much easier because all the course materials are built around it. Knowledge of basic calculus (derivatives), linear algebra and Python is recommended. Grading is based upon weekly comprehension quizzes and programming assignments.
Each week of Machine Learning: Regression tackles a specific topic related to regression in significant depth. The lectures take adequate time to build your understanding and intuition about how the techniques work, and they go deep enough that you could implement the algorithms presented yourself. The presentation slides are high quality and available as .pdf downloads, although the text written by the lecturer isn't particularly neat. The lecturer isn't the best orator around, but she explains topics well, and the course takes plenty of time to cover important considerations and review key concepts at the end of each week. Overall, the pacing and organization of the course materials are excellent, and the presentation, while not perfect, is personable and clear.
Every lesson in "Regression" has at least one accompanying programming assignment that explores the topics covered in lecture. The assignments are contained in Jupyter (iPython) notebooks and come with all the explanatory text and support code you need to complete them. The labs walk you through implementing some key machine learning algorithms like simple linear regression, multiple linear regression with gradient descent, ridge regression, lasso with coordinate descent and k-nearest neighbors regression. The assignments are not particularly difficult, as much of the code is already written for you and most tasks you have to perform are spelled out in great detail, sometimes to the point where each line of code you have to write is noted in a text comment. Some may not appreciate this level of guidance, but it keeps the assignments moving along at a steady pace, puts the focus on understanding machine learning concepts rather than programming skills, and limits time wasted troubleshooting bugs.
Machine Learning: Regression is an excellent introduction to regression that covers several key machine learning algorithms while building understanding of fundamental machine learning concepts that extend beyond regression. If you have any interest in regression and have an environment that can run GraphLab, take this course.
I give Machine Learning: Regression 5 out of 5 stars: Excellent.
Saturday, December 12, 2015
Coursera: Data Visualization and Communication with Tableau Review
Data Visualization and Communication with Tableau is the third course in Duke University's "Excel to MySQL: Analytic Techniques for Business" specialization offered on Coursera. The 5-week course is essentially an introduction to Tableau (weeks 2 and 3) book-ended by lectures on considerations and best practices for communicating data insights in a business setting (weeks 1 and 4). The final week is devoted to a peer-reviewed assignment and has no new lecture content. The course provides you with a free temporary license for the desktop version of Tableau. You can get through this course without any background knowledge, although some knowledge of MS Excel will help you appreciate some of the comparisons it makes. Grading is based on 4 weekly quizzes and a peer-graded assignment.
Data Visualization has quality lectures that do a good job of introducing Tableau in the context of creating visualizations for business. The Tableau walkthroughs are easy to follow and give you an appreciation for how much easier it is to make nice visualizations in Tableau than in Excel. You use the same data sets for the entire course, one for walkthroughs and one for homework assignments, which provides a nice sense of consistency. Weeks 1 and 4 raise some useful considerations to keep in mind when preparing for and presenting a data analysis, but the Tableau sections in weeks 2 and 3 are the heart of the course. I would have preferred more content covering the ins and outs of Tableau instead of the 2 weeks spent on communication topics, but the mix is probably about right for business-oriented students.
I give Data Visualization and Communication with Tableau 4 out of 5 stars: very good.
Friday, December 11, 2015
edX: Data Science and Machine Learning Essentials Review
Data Science and Machine Learning Essentials is a 5-week introductory data science course offered by Microsoft through edX that focuses on teaching students how to use Microsoft's cloud-based machine learning platform, Azure ML. The course divides content into two tracks, an R track and a Python track, so you can complete the course with either language, but you'll need to know the basics of at least one of the two. Grading is based on 5 weekly reviews and a single 20-question exam.
The course title "Data Science and Machine Learning Essentials" is misleading because this course is not really about data science or machine learning per se. The first week attempts to cram an entire machine learning course or two worth of concepts into a handful of mediocre lectures, while the remainder of the course is all about Azure ML. Weeks 2-5 provide a nice overview of Azure ML, and the fact that the course has full lectures for both R and Python is a great feature that surely took a lot of extra time and effort to produce. The main lecturer's presentation skills aren't the best, but the videos are still easy to follow. Azure ML offers a lot of interesting functionality, like the ability to use Python and R scripts in the same project and to publish projects as web services, but some of the exercises were tedious and ran slowly.
If "Data Science and Machine Learning Essentials" were renamed "Intro to Azure ML" and only included the content in weeks 2-5, it would be a good course. Weeks 2-5 are definitely worth checking out if you are interested in Azure ML. As it stands now, however, the first week bombards students with far too many concepts explained too quickly to foster real understanding and sets the wrong expectations for the remainder of the course.
I give Data Science and Machine Learning Essentials 2.75 out of 5: mediocre.
Thursday, December 10, 2015
edX: DAT206x Excel for Data Analysis and Visualization Review
Excel for Data Analysis and Visualization is an intermediate level course offered by Microsoft through the edX platform that covers cutting-edge techniques for gathering, transforming and viewing data in Excel. The course focuses on getting students up to speed with new features and techniques offered in Excel 2016, such as the Excel data model, queries, DAX (a syntax for defining functions) and Power BI, an online productivity service that integrates with Excel. This course assumes you have some familiarity with MS Excel, particularly pivot tables and slicers. You can complete the course with Excel 2010 or 2013, but if you don't have Excel 2016 you'll have to download add-ins and you'll have to work slightly harder to complete the assignments. Grading is based on 7 weekly labs and 12 comprehension quizzes.
Weekly content in DAT206x consists of one to three short video lectures describing new Excel features followed by a comprehension quiz. The amount of video content per week is usually under 30 minutes, so you shouldn't need to commit more than an hour or two a week to complete the course. The lecture videos have adequate resolution to see cell values, and the lecturer's presentation is easy to follow. Weeks 1-7 have lab assignments that let you apply the techniques presented in lecture. You only get a couple of submissions for most lab and quiz questions, but most questions are not too difficult.
Excel for Data Analysis and Visualization is a succinct, informative course on new Excel features that is worth checking out for those interested in going beyond the basics. Basing the course on Excel 2016, which launched only a few months before the course debuted, may partly be a ploy to convince Excel users to upgrade, but I can't fault Microsoft for teaching with the latest version of their own product, and I completed the course with Excel 2010 without much difficulty.
I give Excel for Data Analysis and Visualization 4 out of 5 stars: very good.
Tuesday, December 1, 2015
Python for Data Analysis Index
* Edit Jan 2021: I recently completed a YouTube video covering topics in this post. See the playlist here:
https://www.youtube.com/playlist?list=PLiC1doDIe9rCYWmH9wIEYEXXaJ4KAi3jc
Section 1: Getting Started
Section 2: Data Structures
Section 3: Programming Constructs
Section 4: Data Exploration and Cleaning
Section 5: Basic Statistics
Section 6: Inferential Statistics
Section 7: Predictive Modeling
Python for Data Analysis Part 30: Random Forests
* Edit Jan 2021: I recently completed a YouTube video covering topics in this post
For the final lesson in this guide, we'll learn about random forest models. As we saw last time, decision trees are a conceptually simple predictive modeling technique, but when you start building deep trees, they become complicated and likely to overfit your training data. In addition, decision trees are constructed in a way such that branch splits are always made on variables that appear to be the most significant first, even if those splits do not lead to optimal outcomes as the tree grows. Random forests are an extension of decision trees that address these shortcomings.
A random forest model is a collection of decision tree models that are combined together to make predictions. When you make a random forest, you have to specify the number of decision trees you want to use to make the model. The random forest algorithm then takes random samples of observations from your training data and builds a decision tree model for each sample. The random samples are typically drawn with replacement, meaning the same observation can be drawn multiple times. The end result is a bunch of decision trees that are created with different groups of data records drawn from the original training data.
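The sampling step described above can be sketched in a few lines of NumPy. The array below is just a stand-in for a set of training observations, not part of the Titanic example:

```python
import numpy as np

np.random.seed(12)

data = np.arange(10)  # Stand-in for 10 training observations

# Draw a bootstrap sample: same size as the original data, with replacement
sample_indices = np.random.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[sample_indices]

# Some observations appear multiple times; others are left out entirely
print(bootstrap_sample)
```

On average, each such sample contains roughly 63% of the distinct original observations; the observations a tree never sees are its "out-of-bag" records, which come up again below.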
The decision trees in a random forest model are a little different than the standard decision trees we made last time. Instead of growing trees where every single explanatory variable can potentially be used to make a branch at any level in the tree, random forests limit the variables that can be used to make a split in the decision tree to some random subset of the explanatory variables. Limiting the splits in this fashion helps avoid the pitfall of always splitting on the same variables and helps random forests create a wider variety of trees to reduce overfitting.
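The feature-subsetting idea can be illustrated directly: at each candidate split, the tree only gets to consider a small random subset of the available features. This toy snippet draws one such subset; the subset size of 2 mirrors the max_features=2 setting used in the model later in this lesson:

```python
import numpy as np

np.random.seed(12)

feature_names = ["Sex", "Pclass", "SibSp", "Embarked", "Age", "Fare"]

# At a given split, consider only a random subset of 2 features
candidate_indices = np.random.choice(len(feature_names), size=2, replace=False)
candidate_features = [feature_names[i] for i in candidate_indices]

print(candidate_features)  # A different random pair is drawn at every split
```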
Random forests are an example of an ensemble model: a model composed of some combination of several different underlying models. Ensemble models often yield better results than single models because different models may detect different patterns in the data, and combining models tends to dull the tendency that complex single models have to overfit the data.
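For classification, the combination step is typically a majority vote across the trees. Here is a toy sketch with made-up 0/1 predictions from three trees for three passengers:

```python
import numpy as np

# Each row holds one tree's 0/1 predictions for three passengers (made-up values)
tree_predictions = np.array([[1, 0, 1],   # Tree 1
                             [1, 1, 0],   # Tree 2
                             [0, 0, 1]])  # Tree 3

# Count the votes for class 1 and predict the majority class for each passenger
votes_for_one = tree_predictions.sum(axis=0)
ensemble_prediction = (votes_for_one > tree_predictions.shape[0] / 2).astype(int)

print(ensemble_prediction)  # [1 0 1]
```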
Python's sklearn package offers a random forest model that works much like the decision tree model we used last time. Let's use it to train a random forest model on the Titanic training set:
In [1]:
import numpy as np
import pandas as pd
import os
In [2]:
# Load and prepare Titanic data
os.chdir('C:\\Users\\Greg\\Desktop\\Kaggle\\titanic') # Set working directory
titanic_train = pd.read_csv("titanic_train.csv") # Read the data
# Impute median Age for NA Age values
new_age_var = np.where(titanic_train["Age"].isnull(), # Logical check
                       28,                            # Value if check is true
                       titanic_train["Age"])          # Value if check is false
titanic_train["Age"] = new_age_var
In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
In [4]:
# Set the seed
np.random.seed(12)
# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()
# Convert some variables to numeric
titanic_train["Sex"] = label_encoder.fit_transform(titanic_train["Sex"])
titanic_train["Embarked"] = label_encoder.fit_transform(titanic_train["Embarked"])
# Initialize the model
rf_model = RandomForestClassifier(n_estimators=1000, # Number of trees
                                  max_features=2,    # Num features considered
                                  oob_score=True)    # Use OOB scoring*
features = ["Sex","Pclass","SibSp","Embarked","Age","Fare"]
# Train the model
rf_model.fit(X=titanic_train[features],
             y=titanic_train["Survived"])
print("OOB accuracy: ")
print(rf_model.oob_score_)
OOB accuracy: 0.81664791901
Since random forest models involve building trees from random subsets or "bags" of data, model performance can be estimated by making predictions on the out-of-bag (OOB) samples instead of using cross validation. You can use cross validation on random forests, but OOB validation already provides a good estimate of performance and building several random forest models to conduct K-fold cross validation with random forest models can be computationally expensive.
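As a quick sanity check that OOB scoring tracks cross validation, here is a self-contained sketch that uses one of scikit-learn's built-in data sets (breast cancer) instead of the Titanic file; the exact scores, and the cross_val_score import path, depend on your scikit-learn version:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# OOB estimate comes free with a single fitted forest
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=12)
rf.fit(X, y)

# A 5-fold cross validation estimate requires fitting 5 more forests
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=12),
                            X, y, cv=5)

print("OOB estimate:", rf.oob_score_)
print("5-fold CV estimate:", cv_scores.mean())
```

The two estimates usually land close together, which is why OOB scoring is often good enough on its own.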
The random forest classifier assigns an importance value to each feature used in training. Features with higher importance were more influential in creating the model, indicating a stronger association with the response variable. Let's check the feature importance for our random forest model:
In [5]:
for feature, imp in zip(features, rf_model.feature_importances_):
    print(feature, imp)
Sex 0.266812848384
Pclass 0.0892556347506
SibSp 0.0523628494934
Embarked 0.0320938468195
Age 0.2743081392
Fare 0.285166681353
Feature importance can help identify useful features and eliminate features that don't contribute much to the model.
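A common pattern is to sort the importances and keep only the top-ranked features. The sketch below uses synthetic data from make_classification (with made-up feature names) rather than the Titanic variables, so it runs on its own; with the Titanic model you would zip features with rf_model.feature_importances_ instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the Titanic features so the example is self-contained
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=12)
feature_names = ["f0", "f1", "f2", "f3", "f4", "f5"]

rf = RandomForestClassifier(n_estimators=200, random_state=12)
rf.fit(X, y)

# Rank features from most to least important and keep the top 3
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
top_features = [name for name, _ in ranked[:3]]

print(ranked)
print(top_features)
```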
As a final exercise, let's use the random forest model to make predictions on the Titanic test set and submit them to Kaggle to see how our actual generalization performance compares to the OOB estimate:
In [6]:
# Read and prepare test data
titanic_test = pd.read_csv("titanic_test.csv") # Read the data
# Impute median Age for NA Age values
new_age_var = np.where(titanic_test["Age"].isnull(),
                       28,
                       titanic_test["Age"])
titanic_test["Age"] = new_age_var
# Convert some variables to numeric
titanic_test["Sex"] = label_encoder.fit_transform(titanic_test["Sex"])
titanic_test["Embarked"] = label_encoder.fit_transform(titanic_test["Embarked"])
In [7]:
# Make test set predictions
test_preds = rf_model.predict(X=titanic_test[features])
# Create a submission for Kaggle
submission = pd.DataFrame({"PassengerId": titanic_test["PassengerId"],
                           "Survived": test_preds})
# Save submission to CSV
submission.to_csv("tutorial_randomForest_submission.csv",
                  index=False) # Do not save index values
Upon submission, the random forest model achieves an accuracy score of 0.75120, which is actually worse than the decision tree model and even the simple gender-based model. What gives? Is the model overfitting the training data? Did we choose bad variables and model parameters? Or perhaps our simplistic imputation of filling in missing age data using median ages is hurting our accuracy. Data analyses and predictive models often don't turn out how you expect, but even a "bad" result can give you more insight into your problem and help you improve your analysis or model in a future iteration.
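On the imputation point, one easy improvement to try in a future iteration is imputing the median age within each passenger class instead of one global median of 28. Here is a hedged sketch using a tiny made-up data frame (the real change would be applied to titanic_train and titanic_test):

```python
import numpy as np
import pandas as pd

# Tiny made-up passenger table, not the real Titanic file
df = pd.DataFrame({"Pclass": [1, 1, 2, 2, 3, 3],
                   "Age": [38.0, np.nan, 30.0, np.nan, 22.0, np.nan]})

# Fill each missing Age with the median Age of that passenger's class
df["Age"] = df.groupby("Pclass")["Age"].transform(lambda s: s.fillna(s.median()))

print(df["Age"].tolist())  # [38.0, 38.0, 30.0, 30.0, 22.0, 22.0]
```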
In this introduction to Python for data analysis series, we built up slowly from the most basic rudiments of the Python language to building predictive models that you can apply to real-world data. Although Python is a beginner-friendly programming language, it was not built specifically for data analysis, so we relied heavily upon libraries to extend base Python's functionality when doing data analysis. As a series focused on practical tools and geared toward beginners, we didn't always take the time to dig deep into the details of the language or the statistical and predictive models we covered. My hope is that some of the lessons in this guide piqued your interest and equipped you with the tools you need to dig deeper on your own.
If you're interested in learning more about Python, there are many ways to proceed. If you learn well with some structure, consider an online data science course that uses Python, like the Intro to Machine Learning course on Udacity, the Machine Learning specialization on Coursera, or one of the many other data science offerings on those sites or edX. If you like hands-on learning, try tackling some Kaggle competitions or finding a data set to analyze.
One of the hardest parts of learning a new skill is getting started. If any part of this guide helped you get started, it has served its purpose.
*Final Note: If you are interested in learning R, I have a 30-part introduction to R guide that covers most of the same topics as this Python guide and recreates many of the same examples in R.