Python and R are the two most popular
programming languages for data analysis and machine learning. Almost every new online
course about data science uses one or the other and it is becoming
increasingly rare to find scripts written in other languages on
predictive modeling websites like Kaggle. Both languages are a great
place to start, but each excels in different areas, so it is worth
considering some of the pros and cons of each language before you
choose one and jump in.
Setup
Setup essentially amounts to installing
your programming language and any supporting software libraries you
need for data science along with some sort of text editor or other
development environment to write and run code. Setting up R is
relatively simple: you install the appropriate version of R for your
operating system and then you can install R Studio, an R editor that
is essentially universal among R users. Adding new libraries to R is
also very easy: you can install and then load new packages right from
within an active R Studio session.
Python setup is much more involved
than R setup. First of all, Python has two major versions: Python 2.7
and Python 3.X (Python 3.5.2 is the newest version at the time of writing).
Python 3 is actively being developed, while Python 2.7 is an older
version that many companies and software libraries still use because
Python 3 is not backwards compatible with Python 2. The differences
between Python 2.7 and Python 3 are minimal so it doesn't really
matter which you choose in terms of learning the language, but
certain libraries may only be available in one version. Most
practitioners recommend that you start with the newest version of
Python 3 as long as you don't want to use a library that specifically
requires Python 2.7.
After installing Python, you'll need to
install various data science libraries like numpy, pandas and
scikit-learn as well as an editor program to work with your code.
Package management in Python can be a bit of a nightmare so it is
easiest to avoid it as much as possible; the simplest option is to
install Anaconda, a popular distribution of Python intended for data
science that comes prepackaged with many of the most popular
libraries you'll need as well as a code editor called Spyder. When
you're getting started, you shouldn't need to do much other than
install Anaconda. Just be aware that as you mature with the
language, library updates and compatibility issues will likely be the
source of many headaches. You may eventually find yourself running
multiple versions of Python as well as using virtual machines just to
be able to use all the libraries you want to.
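If you want a quick way to confirm that an installation actually has the libraries mentioned above, something like the following can help. This is only a minimal sketch: the function name check_environment is my own invention, and the list of libraries is an assumption based on the ones named in this section (note that scikit-learn is imported under the name sklearn).

```python
import importlib.util
import sys

# Library names taken from the text above; extend the list as needed.
# scikit-learn is imported under the name "sklearn".
LIBRARIES = ["numpy", "pandas", "sklearn"]


def check_environment(libraries=LIBRARIES):
    """Return a dict mapping each library name to True if it can be imported."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in libraries
    }


if __name__ == "__main__":
    version = ".".join(str(part) for part in sys.version_info[:3])
    print("Python version:", version)
    for name, installed in check_environment().items():
        print(f"{name}: {'installed' if installed else 'missing'}")
```

Running the script prints the Python version followed by one line per library, which makes it easy to spot what a fresh setup is still missing.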
Verdict: Big edge R
Learning Curve
Python and R are both good first
programming languages that are much easier to learn than low-level
languages like C. R emphasizes interactivity: there is an interactive
shell in R Studio that lets you type in commands and get immediate
feedback as well as panes that summarize any data you load into the
program and any plots you create. You can get a lot accomplished with
R using only the interactive shell and built-in commands, without
actually writing any custom functions. Learning how to do basic data
analysis tasks in R is probably a bit easier than it is in Python
because R was built for statistics from the ground up: many common
statistical functions are available in R by default, so they don't
require importing or learning any special libraries. On the other
hand, Python is known for its clean, intuitive syntax which can make
it easier to read than R's relatively ugly code. Python is commonly
taught as a first programming language in general programming
courses, due in part to its nice syntax and how easy it is to learn.
R's syntax might pose a little bit of a stumbling block when you are
first getting started, especially if you've never programmed in a
different language.
Verdict: Slight edge Python
Exploratory Data Analysis
The first part of almost every data
project involves loading and exploring a data set. Most of the tools
you need to load data into R and explore it are built right into the
base language. R has a data structure called the data frame that
mirrors the sorts of tables you'd expect to see in a spreadsheet
program like Excel, essentially allowing you to load Excel sheets and
other tabular data directly into R. You can then call various
functions on the data frame or its individual columns to produce
numerical summaries, locate outliers and find missing values.
Python offers similar data-reading
functionality as well as a data object that mirrors R's data frames
in the pandas library. Once you load pandas, you'll have access to
the same sorts of exploratory data analysis tools that you have in R
and if you use the Spyder editor that comes with Anaconda, you'll
have access to an interactive Python shell and a data explorer that
mirror those available in R Studio. I personally prefer using R for
exploratory analysis because I like R Studio better than Python
editors but it really comes down to personal preference.
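To give a flavor of what exploring a data frame looks like in pandas, here is a small sketch. The data is made up for illustration and the column names are my own assumptions; in a real project you would load a file rather than typing values in by hand.

```python
import pandas as pd

# Made-up sample data; None marks a missing value.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dev"],
    "age": [34, None, 29, 41],
    "city": ["Austin", "Boston", None, "Denver"],
})

# Numerical summary (count, mean, std, min, quartiles, max) of numeric columns.
summary = df.describe()

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Column-level summaries skip missing values by default.
mean_age = df["age"].mean()

print(summary)
print(missing_per_column)
print("Mean age:", mean_age)
```

The same three steps (summarize, count missing values, inspect a column) cover a lot of the first pass over a new data set.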
Verdict: Tie
Data Cleaning
Once you start exploring data, you'll
probably find that it isn't as clean as you'd like. Perhaps it has
missing values or contains text with odd punctuation that you need to
remove. It may have extra variables that you don't need or columns
that you want to combine or separate. Data cleaning, aka “data
munging” or “data wrangling”, describes massaging your data into
a form that you can use in a data analysis. It isn't glamorous, but
it can be one of the most time-consuming parts of a data project, so
it is important nonetheless.
Python and R both have all the tools
necessary to perform data munging, such as functions for filling in
missing values, joining data tables together and deleting unwanted
data. Python is perhaps a bit easier to use when dealing with text
data, especially if you have to write custom functions or regular
expressions as a part of your cleaning. R is great for working with
numeric data as well as categorical variables and dates. Again, the
winner here comes down to personal preference.
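As an example of the kind of text cleaning described above, here is a rough sketch of a custom function that uses regular expressions from Python's standard library. The function name clean_text and the specific cleaning rules are my own assumptions; real data usually needs rules tailored to its quirks.

```python
import re


def clean_text(raw):
    """Lowercase text, drop punctuation, and collapse runs of whitespace."""
    text = raw.lower()
    # Remove anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse repeated whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_text("  Hello,   World!! "))
```

A function like this can then be applied to every value in a text column before analysis.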
Verdict: Tie
Plotting
R has two main plotting packages: a
base plotting library that is built into R and an add-on package
called ggplot2 that gives you access to fully featured plotting
capabilities. R's base plotting functions create quick and dirty
charts that are ideal for exploratory analysis, while ggplot2 is
preferable for making prettier and more complex plots. The ggplot2
package uses an intuitive syntax structure that makes it easy to use
once you get the basic constructs down.
Python has many plotting libraries,
probably more than is healthy for beginners. matplotlib is perhaps
the most popular Python plotting library and the pandas package
includes some basic plotting functionality built on top of it.
matplotlib is a capable plotting package, but its syntax is generally
more verbose and confusing than R's ggplot2. There is a port of
ggplot2 for Python, but it doesn't work as well and does not offer
all the same features as the R version.
Verdict: Slight edge R
Statistics
The R language excels at statistics,
since it was built by statisticians with that purpose in mind. R
makes it easy to compute descriptive statistics, check correlations,
conduct statistical inference tests like t-tests and work with
probability distributions. Basically for any
sort of statistical operation, R is your best bet for finding a
function someone else has already created to carry it out on your
data.
Python is a capable language for
statistics, but it was created as a general programming language
first, so its statistical functions aren't always as easy to find and
use as those available in R. You can write functions yourself, but
that takes time and is more prone to error than using established
packages.
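To illustrate what writing a statistical function yourself involves, here is a minimal sketch that computes Welch's t statistic for two samples using only Python's standard library. The function name is my own, and it stops at the test statistic: it does not compute degrees of freedom or a p-value, which is exactly the kind of detail an established package (for example scipy.stats.ttest_ind with equal_var=False) handles for you.

```python
import math
import statistics


def welch_t_statistic(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance computes the sample variance (n - 1 denominator).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


if __name__ == "__main__":
    print(welch_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

Even this short function has room for mistakes, such as using the population variance instead of the sample variance, which is why pre-built functions are usually the safer choice.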
Verdict: Edge R
Programming
Depending on the nature of your data
project, you might be able to get by only using functions provided by
your programming language and its libraries, but as you advance in
your learning you'll want to start writing some custom functions
sooner or later. Python's clean syntax and origin as a general
programming language make it much nicer than R for writing
user-defined functions. Writing anything other than small snippets of
custom R code can get ugly and slow. Python is also well suited for
creating new, potentially large applications from scratch. As a
general-purpose language, Python is also useful in many areas outside
of data science, such as web programming, while R is pretty much only
used for statistics and data science.
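For a sense of what a small user-defined function looks like in Python, here is a sketch that rescales a list of numbers to the range 0 to 1. The function name and the choice of min-max scaling are my own, picked only because it is a common data preparation step.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


if __name__ == "__main__":
    print(min_max_scale([2, 4, 6]))
```

The definition, the docstring and the edge case all fit in a handful of readable lines.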
Verdict: Big edge Python
Machine Learning
Creating predictive models using
machine learning techniques is the end goal of many data projects. R
offers a wide array of machine learning models for classification and
regression, from linear regression, which is built into base R, to
more complicated algorithms like random forests and xgboost. The
caret package in R provides a single interface to many different
machine learning tools, making it easy for users to simply pass in a
data set as well as a handful of parameters to do everything from
logistic regression to making basic neural networks.
Python provides a suite of machine
learning tools similar to those available in R in its scikit-learn
library. Python has better packages for creating neural networks for
deep learning, such as Theano and Keras, and since it is nicer as a
general programming language it is a better option if you are planning
to create your own machine learning tools from scratch.
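As a taste of what building a machine learning tool from scratch can look like, here is a minimal sketch of simple linear regression with one predictor, fit by ordinary least squares in plain Python. The function names fit_line and predict are my own, and this is nowhere near a replacement for scikit-learn: it has no input validation and handles only a single feature.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


if __name__ == "__main__":
    slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print(slope, intercept, predict(4, slope, intercept))
```

Once the fitting step returns its parameters, predictions for new data are a single function call.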
Verdict: Edge Python
You'll notice I didn't include a section on language speed. Accurate speed comparisons are hard to make; Python is generally considered to be a faster language than R, but both are slow relative to low-level programming languages, and speed differences often depend more on implementation details than on the languages themselves.
Recommendations
So which language is better? As you
probably guessed by now: it depends. If you're mainly interested in
statistics, scientific research and self-contained data analyses that you can approach with pre-made tools, R
is a good place to start. If you want to learn about programming
in general, create your own tools or do deep learning, Python is
better. That said, it is a good idea to learn the basics of both
languages eventually so you can communicate more effectively with fellow data scientists, form your own opinions and choose the
right tools for the tasks you have at hand.
Getting Started
There are a ton of resources available
online to get started programming with Python and R. For one, you can
check out my Introduction
to Python for Data Analysis and Introduction
to R, which cover each language from the very basics through
performing common data analysis tasks and predictive modeling. If
you're interested in taking university-style courses online, I
recommend Udacity's Intro
to computer science and MIT's Intro
to computer science and programming using Python
for learning Python and MIT's The
Analytics Edge for learning R.