* Edit Jan 2021: I recently completed a YouTube video covering topics in this post:

I recently completed an introductory guide to R programming aimed at teaching the basic tools necessary to use R for data analysis and predictive modeling. R is a great language for statistics and data analysis because the language was built with that goal in mind. Python is the only language to rival R's popularity for data analysis. Unlike R, Python is a general-purpose language that isn't designed for any particular task. It is a jack-of-all-trades language with clean syntax and a vibrant ecosystem of data science software libraries that extend its base functionality, making it an excellent first language to learn and a data science powerhouse.

This guide does not assume prior programming experience and focuses on using Python as a tool for data analysis. We won't spend much time digging into low-level details of the language or functionality that is not needed for data analysis. Since Python is a general-purpose language, however, it will take several lessons to build the basic proficiency necessary to start using it for data analysis. If you already have basic Python proficiency, you may want to skip ahead. After Part 1, this guide will not spend much time comparing Python and R, as it does not assume R knowledge. In my experience, it is easier to learn how to program in Python, but it is easier to get started with data analysis in R because all the tools you need are either baked in or one simple download away. If you're just getting into data science for the first time, you can't go wrong with either language, and it is a good idea to learn both eventually.

Perhaps the biggest downside of Python as a language is that it has two major branches, Python 2.7 and Python 3.X, which are not fully compatible with one another. This means code written for 2.7 generally doesn't work in Python 3 and vice versa, so certain software libraries might only be available for one or the other. As a result, managing add-on libraries and their dependencies in Python can be troublesome. The differences between Python 2.7 and 3.X won't really affect our learning of the language itself, however, as the basic syntax is mostly the same between both versions.
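To make the incompatibility a little more concrete, here is a minimal sketch of two well-known differences between the branches. It is written for Python 3, and the variable names are made up for illustration; the comments describe how Python 2.7 would behave differently.

```python
# Two well-known differences between Python 2.7 and Python 3.
# This sketch is written for Python 3.

# 1. In Python 3, print is a function, so the parentheses are required.
#    Python 2.7 also accepted the form: print "hello"
print("hello")

# 2. In Python 3, dividing two whole numbers with / keeps the decimal part.
#    In Python 2.7, 7 / 2 would give 3 because the remainder is dropped.
true_division = 7 / 2
floor_division = 7 // 2   # // drops the remainder in both versions

print(true_division)      # 3.5
print(floor_division)     # 3
```

Differences like these are small, which is why most of the code in this guide carries over between versions with little or no change.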

This guide will use Python version 3.4. All code presented should work with Python 3.4 or later. Most if not all of the code will also work on Python 2.7, so you can still follow along if you are using Python 2.7.
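If you are not sure which version you have, one way to check from inside Python is the standard sys module. This is a minimal sketch; the variable names (major, minor, meets_guide_version) are just illustrative choices.

```python
import sys

# sys.version_info describes the Python interpreter running this code.
major = sys.version_info.major
minor = sys.version_info.minor

print("Running Python {}.{}".format(major, minor))

# Tuples are compared item by item, so this checks for version 3.4 or later.
meets_guide_version = (major, minor) >= (3, 4)
print(meets_guide_version)
```

If the last line prints True, your installation matches the version range this guide targets.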

Since Python package management can be difficult, I do not recommend installing Python and its data analysis libraries individually. It is easiest to download the Anaconda Python distribution from Continuum Analytics. Anaconda bundles Python with dozens of popular data analysis libraries and it comes with a nice integrated development environment (a fancy code editor) called Spyder. Simply go to the Continuum Analytics download page, click the download link appropriate for your operating system and Python version and then run the installer to set up the Anaconda Python environment.
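Once the installer finishes, you may want to confirm that the bundled libraries are visible to Python. The sketch below is one way to do that on Python 3.4 or later. The three library names are examples I picked because they are common data analysis libraries, and is_installed is a helper defined here, not something Anaconda provides.

```python
import importlib.util

# Example library names chosen for this sketch; swap in any others you need.
libraries = ["numpy", "pandas", "matplotlib"]


def is_installed(name):
    """Return True if Python can locate the named library (without importing it)."""
    return importlib.util.find_spec(name) is not None


for name in libraries:
    status = "installed" if is_installed(name) else "NOT found"
    print(name, "->", status)
```

Any library reported as "NOT found" is not available in the Python environment you ran the check from.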

A Brief Intro to Spyder

After installing Anaconda, open the Continuum Analytics app Launcher and click the "launch" button next to the Spyder app, or simply find Spyder in your program list and launch it directly. Spyder is a code development tool geared toward data analysis. When you first open Spyder, you'll see an application window separated into several panes, each with one or more tabs. The arrangement of the panes and tabs is customizable: simply click and hold on the edge of a pane and drag it to a different part of the Spyder window to reorganize panes. Select a tab and click the window icon in the upper right corner of the pane to pop the tab out into its own pane that you can drag around or drop into an existing pane. When you open the editor for the first time, certain useful panes might be turned off. You can turn panes on and off under the "view -> panes" menu. My Spyder editor in this intro has the following panes turned on: Editor, Console, IPython Console, Variable Explorer, Object Inspector, File Explorer and History Log. I organized my editor into a 4-pane layout that mirrors R's popular RStudio code editor:

The upper left pane is a code editor that contains a tabbed list of code files. This is where you write code you want to save and run. To run code written in your code editor, highlight the code you want to run, hold shift and press enter. You can also click the green run button (looks like a play button) to run the entire code file.

The pane in the upper right consists of two tabs: the variable explorer tab and history tab. The history tab shows a list of commands you've run and the variable explorer shows a summary of the variables and data structures you've defined.

The pane in the bottom left corner is the interactive Python console. The console is where you enter Python code and view its output. When you run code from the code editor, the output appears in the console. You can also type code directly into the console and run it by pressing the enter key.

The pane in the bottom right consists of two tabs: the object inspector and the file explorer. The object inspector lets you view help information on objects in the console by typing the object's name into the search bar or by placing your cursor in front of the object in the console and pressing control + I. The file explorer lets you navigate your computer's file system.
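If you are not using Spyder, you can get similar help text from any Python console. The sketch below uses Python's built-in help() function and the __doc__ attribute; the name list_doc is just an illustrative variable.

```python
# help() prints the built-in documentation for an object,
# much like what the object inspector displays.
help(len)

# Documented objects also store their help text in the __doc__ attribute.
list_doc = list.__doc__
print(list_doc)
```

This works in any editor, so it is a handy fallback when no inspector pane is available.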

For demonstration purposes, I added some code to my editor and ran it:

The code in the editor window is:

In [1]:
# Let's make a list!
my_list = [1,2,3,4,5,6,7,8,9,10]
print(len(my_list))

10

*Note: Code in this guide consists of blocks of input labeled "In", with the corresponding output appearing below the input block (in this case, the number 10).

The first line of the code starts with a pound symbol "#". In Python, # defines a comment: a bit of text that the coder adds to explain something about the program that is not actually a part of the code that is executed.

The second line defines a new variable my_list.

Finally, the third line prints the length of the my_list variable.

Notice that upon running the file, the number 10 appears in the console, but no other output appears. Comments and variable definitions produce no output, so the only output we see is the result of the print statement: the length of my_list, which is 10.

Also note that the variable my_list has appeared in the variable explorer pane. The pane shows the variable's type, size and a summary of its values. You can double-click on a variable in the explorer window to get a more detailed view of the variable and even edit individual values it contains.
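The same details can be retrieved with code, which is useful if your editor has no variable explorer. This is a minimal sketch that recreates my_list; the names list_type and list_size are illustrative. It also changes one value, using the fact that Python counts positions starting from 0.

```python
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

list_type = type(my_list).__name__   # the kind of object: "list"
list_size = len(my_list)             # the number of items: 10

print(list_type)
print(list_size)
print(my_list)                       # the values the list contains

# Edit an individual value. Position 0 is the first item in the list.
my_list[0] = 100
print(my_list[0])                    # 100
```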

Finally, notice the search for "list" in the bottom right object inspector pane, which pulled up a short description of the list() function.

Spyder has a lot of similarities to R's popular RStudio code editor, which makes it a little bit easier to transition from one language to the other than it might be if you used a different editor. That said, this guide doesn't assume you are using any particular editor: you can use anything you like as long as you can run Python code.

Looking Ahead

Now that you have Python installed, you are ready to start learning Python for data analysis. We'll start slow, but by the end of this guide you'll have all the tools necessary to load, clean, explore, analyze and create predictive models from data.

Next Time: Python for Data Analysis Part 2: Python Arithmetic

*Note: See index for this 30-part guide here.

Life Is Study

Wednesday, October 21, 2015

Python for Data Analysis Part 1: Setup
