Wednesday, August 31, 2016

Python vs R for Learning Data Science



Python and R are the two most popular programming languages for data analysis and machine learning. Almost every new online course about data science uses one or the other, and it is becoming increasingly rare to find scripts written in other languages on predictive modeling websites like Kaggle. Both languages are a great place to start, but each excels in different areas, so it is worth considering some of the pros and cons of each language before you choose one and jump in.


Setup

Setup essentially amounts to installing your programming language and any supporting software libraries you need for data science along with some sort of text editor or other development environment to write and run code. Setting up R is relatively simple: you install the appropriate version of R for your operating system and then you can install R Studio, an R editor that is essentially universal among R users. Adding new libraries to R is also very easy: you can install and then load new packages right from within an active R Studio session.

Python setup is much more involved than R setup. First of all, Python has two major versions: Python 2.7 and Python 3.X (Python 3.5.2 is currently the newest version). Python 3 is actively being developed, while Python 2.7 is an older version that many companies and software libraries still use because Python 3 is not backwards compatible with Python 2. From a learner's perspective the differences between Python 2.7 and Python 3 are minor, so it doesn't really matter which you choose in terms of learning the language, but certain libraries may only be available in one version. Most practitioners recommend that you start with the newest version of Python 3 as long as you don't want to use a library that specifically requires Python 2.7.

After installing Python, you'll need to install various data science libraries like numpy, pandas and scikit-learn as well as an editor program to work with your code. Package management in Python can be a bit of a nightmare, so it is easiest to avoid it as much as possible. The simplest option is to install Anaconda, a popular distribution of Python intended for data science that comes prepackaged with many of the most popular libraries you'll need as well as a code editor called Spyder. When you're getting started, you shouldn't need to do much other than install Anaconda. Just be aware that as you mature with the language, library updates and compatibility issues will likely be the source of many headaches. You may eventually find yourself running multiple versions of Python as well as using virtual machines just to be able to use all the libraries you want.
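As a quick sanity check after installation, a short script along these lines can report which Python version is running and whether the libraries mentioned above can be found. This is only a sketch: `check_environment` is a hypothetical helper name, and it assumes the usual import names for the three libraries (`sklearn` is the import name for scikit-learn).

```python
import importlib.util
import sys


def check_environment(packages=("numpy", "pandas", "sklearn")):
    """Report the running Python version and which packages can be found.

    Hypothetical helper for illustration; it looks each package up
    without actually importing it.
    """
    version = "{}.{}.{}".format(*sys.version_info[:3])
    available = {
        name: importlib.util.find_spec(name) is not None for name in packages
    }
    return version, available


if __name__ == "__main__":
    version, available = check_environment()
    print("Python", version)
    for name, found in available.items():
        print("{}: {}".format(name, "installed" if found else "missing"))
```

If a library shows up as missing under Anaconda, that usually points to running a different Python installation than the one Anaconda set up.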

Verdict: Big edge R



Learning Curve

Python and R are both good first programming languages that are much easier to learn than a low-level language like C. R emphasizes interactivity: there is an interactive shell in R Studio that lets you type in commands and get immediate feedback, as well as panes that summarize any data you load into the program and any plots you create. You can get a lot accomplished in R just by using built-in commands in the interactive shell, without actually writing any custom functions. Learning how to do basic data analysis tasks in R is probably a bit easier than it is in Python because R was built for statistics from the ground up. Many common statistical functions are available in R by default, so they don't require importing or learning any special libraries. On the other hand, Python is known for its clean, intuitive syntax, which can make it easier to read than R's relatively ugly code. Python is commonly taught as a first programming language in general programming courses, due in part to its nice syntax and how easy it is to learn. R's syntax might pose a bit of a stumbling block when you are first getting started, especially if you've never programmed in a different language.

Verdict: Slight edge Python



Exploratory Data Analysis

The first part of almost every data project involves loading and exploring a data set. Most of the tools you need to load data into R and explore it are built right into the base language. R has a data structure called the data frame that mirrors the sorts of tables you'd expect to see in a spreadsheet program like Excel, essentially allowing you to load spreadsheet-style data and other tabular data directly into R. You can then call various functions on the data frame or its individual columns to produce numerical summaries, locate outliers and find missing values.

Python offers similar data-reading functionality, as well as a data object that mirrors R's data frames, in the pandas library. Once you load pandas, you'll have access to the same sorts of exploratory data analysis tools that you have in R, and if you use the Spyder editor that comes with Anaconda, you'll have access to an interactive Python shell and a data explorer that mirror those available in R Studio. I personally prefer using R for exploratory analysis because I like R Studio better than Python editors, but it really comes down to personal preference.
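To make this kind of first-pass exploration concrete, here is a minimal sketch that uses only Python's standard library, so it runs without pandas installed. The tiny inline data set and the `summarize_column` helper are made up for illustration; in pandas, `DataFrame.describe()` and `DataFrame.isnull()` cover similar ground with far less code.

```python
import csv
import io
import statistics

# Made-up sample data standing in for a CSV file on disk.
RAW_CSV = """name,age,height_cm
Ann,34,170
Bob,,182
Cara,29,165
Dev,41,
"""


def summarize_column(rows, column):
    """Return basic numeric summaries and a missing-value count for a column."""
    # Empty strings are treated as missing values.
    values = [float(row[column]) for row in rows if row[column] != ""]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
for column in ("age", "height_cm"):
    print(column, summarize_column(rows, column))
```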

Verdict: Tie



Data Cleaning

Once you start exploring data, you'll probably find that it isn't as clean as you'd like. Perhaps it has missing values or contains text with odd punctuation that you need to remove. It may have extra variables that you don't need or columns that you want to combine or separate. Data cleaning, aka “data munging” or “data wrangling”, describes massaging your data into a form that you can use in a data analysis. It isn't glamorous, but it can be one of the most time-consuming parts of a data project, so it is important nonetheless.

Python and R both have all the tools necessary to perform data munging, such as functions for filling in missing values, joining data tables together and deleting unwanted data. Python is perhaps a bit easier to use when dealing with text data, especially if you have to write custom functions or regular expressions as a part of your cleaning. R is great for working with numeric data as well as categorical variables and dates. Again, the winner here comes down to personal preference.
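As a rough illustration of the text-cleaning and missing-value tasks mentioned above, here is a small sketch using Python's built-in `re` and `statistics` modules. The helper names `clean_text` and `fill_missing` are invented for this example, and filling with the median is just one of several reasonable choices.

```python
import re
import statistics


def clean_text(text):
    """Lowercase text, strip punctuation, and collapse repeated whitespace."""
    text = text.lower()
    # Drop anything that is not a word character or whitespace
    # (underscores survive because \w matches them).
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


print(clean_text("  Hello,   World!!  "))      # hello world
print(fill_missing([3, None, 5, 10, None]))    # [3, 5, 5, 10, 5]
```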

Verdict: Tie



Plotting

R has two main plotting packages, a base plotting library that is built into R and an add-on package called ggplot2 that gives you access to fully featured plotting capabilities. R's base plotting functions create quick and dirty charts that are ideal for exploratory analysis, while ggplot2 is preferable for making prettier and more complex plots. The ggplot2 package uses an intuitive syntax structure that makes it easy to use once you get the basic constructs down.

Python has many plotting libraries, probably more than is healthy for beginners. matplotlib is perhaps the most popular Python plotting library and the pandas package includes some basic plotting functionality built on top of it. matplotlib is a capable plotting package, but its syntax is generally more verbose and confusing than R's ggplot2. There is a port of ggplot2 for Python, but it doesn't work as well and does not offer all the same features as the R version.

Verdict: Slight edge R



Statistics

The R language excels at statistics, since it was built by statisticians with that purpose in mind. R makes it easy to check descriptive statistics and correlations, conduct statistical inference tests like t-tests and work with probability distributions. For basically any sort of statistical operation, R is your best bet for finding a function someone else has already created to carry it out on your data.

Python is a capable language for statistics, but it was created as a general programming language first, so its statistical functions aren't always as easy to find and use as those available in R. You can write functions yourself, but that takes time and is more prone to error than using established packages.
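To give a feel for what this looks like in practice, the sketch below uses Python's built-in `statistics` module for descriptive statistics and then hand-writes Welch's t statistic for two samples. The `welch_t_statistic` function and the sample numbers are made up for illustration; for real work an established package (for example SciPy's `scipy.stats.ttest_ind`) is the safer route, which is exactly the trade-off described above.

```python
import math
import statistics


def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance computes the sample variance (n - 1 denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.8, 4.5, 5.0, 4.4]

print("mean:", statistics.mean(group_a))
print("median:", statistics.median(group_a))
print("sample std dev:", statistics.stdev(group_a))
print("Welch t:", welch_t_statistic(group_a, group_b))
```

Note that this only produces the test statistic; turning it into a p-value requires the t distribution, which is not in the standard library.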

Verdict: Edge R



Programming

Depending on the nature of your data project, you might be able to get by only using functions provided by your programming language and its libraries, but as you advance in your learning you'll want to start writing some custom functions sooner or later. Python's clean syntax and origin as a general programming language make it much nicer than R for writing user-defined functions. Writing anything other than small snippets of custom R code can get ugly and slow. Python is also well suited for creating new, potentially large applications from scratch. As a general-purpose language, Python is also useful in many areas outside of data science, such as web programming, while R is pretty much only used for statistics and data science.

Verdict: Big edge Python



Machine Learning

Creating predictive models using machine learning techniques is the end goal of many data projects. R offers a wide array of machine learning models for classification and regression, from linear regression, which is built into base R, to more complicated algorithms like random forests and xgboost. The caret package in R provides a single interface to many different machine learning tools, making it easy for users to simply pass in a data set as well as a handful of parameters to do everything from logistic regression to making basic neural networks.

Python provides a suite of machine learning tools similar to those available in R in its scikit-learn library. Python has better packages for creating neural networks for deep learning, such as Theano and Keras, and since it is nicer as a general programming language, it is a better option if you are planning to create your own machine learning tools from scratch.
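As a small taste of what building a machine learning tool from scratch can look like, here is a sketch of simple linear regression fit by ordinary least squares in plain Python. The function names and toy data are invented for this example, and it handles a single predictor only; libraries like scikit-learn provide far more general and better tested implementations.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the sum of cross-deviations over the sum of squared x deviations.
    cross_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    squared_deviation = sum((x - mean_x) ** 2 for x in xs)
    slope = cross_deviation / squared_deviation
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predict y for a new x using the fitted line."""
    return intercept + slope * x


# Toy data generated from y = 1 + 2x, so the fit should recover those values.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

intercept, slope = fit_simple_linear_regression(xs, ys)
print("intercept:", intercept, "slope:", slope)
print("prediction for x = 6:", predict(intercept, slope, 6))
```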

Verdict: Edge Python



You'll notice I didn't include a section on language speed. Accurate speed comparisons are hard to make; Python is generally considered to be a faster language than R, but both are slow relative to low-level programming languages, and speed differences often depend more on implementation details than on the languages themselves.



Recommendations

So which language is better? As you probably guessed by now: it depends. If you're mainly interested in statistics, scientific research and self-contained data analyses that you can approach with pre-made tools, R is a good place to start. If you want to learn about programming in general, create your own tools or do deep learning, Python is better. That said, it is a good idea to learn the basics of both languages eventually so you can communicate more effectively with fellow data scientists, form your own opinions and choose the right tools for the tasks you have at hand.



Getting Started


There are a ton of resources available online to get started programming with Python and R. For one, you can check out my Introduction to Python for Data Analysis and Introduction to R, which cover each language from the very basics through performing common data analysis tasks and predictive modeling. If you're interested in taking university-style courses online, I recommend Udacity's Intro to computer science and MIT's Intro to computer science and programming using Python for learning Python, and MIT's The Analytics Edge for learning R.

