Thursday, November 5, 2015

Python for Data Analysis Part 10: Reading and Writing Data


* Edit Jan 2021: I recently completed a YouTube video covering topics in this post:




Reading data into pandas DataFrames is often the first step when conducting data analysis in Python. The pandas package comes equipped with several data reading and writing functions that let you read data directly from common file formats like comma separated values files (CSV) and Microsoft Excel files. This lesson will focus on reading and writing data from these common file formats, but Python has packages available to work with just about every data format you encounter.

Python Working Directory and File Paths

Before we can jump into reading and writing data, we need to learn a little bit about Python's working directory and file paths. When you launch Python, it starts in a default location in your computer's file system known as the working directory. You can check your current working directory by importing the os module and then using os.getcwd():
In [1]:
import os          

os.getcwd()
Out[1]:
'C:\\Users\\Greg'
The working directory acts as your starting point for accessing files on your computer from within Python. To load a data set from your hard drive, you either need to put the file in your working directory, change your working directory to the folder containing the data or supply the data file's file path to the data reading function.
You can change your working directory by supplying a new file path in quotes to the os.chdir() function:
In [2]:
os.chdir('C:\\Users\\Greg\\Desktop\\intro_python10')
        
os.getcwd()                     # Check the working directory again
Out[2]:
'C:\\Users\\Greg\\Desktop\\intro_python10'
You can list all of the objects in a directory by passing the file path to the os.listdir( ) function:
In [3]:
os.listdir('C:\\Users\\Greg\\Desktop\\intro_python10')
Out[3]:
['draft2015.csv', 'draft2015.tsv', 'draft2015.xlsx']
Notice my intro_python10 folder has 3 files named "draft2015" in different file formats. Let's load them into DataFrames. (Download the data files here: csv, tsv, xlsx)

Reading CSV and TSV Files

Data is commonly stored in simple flat text files consisting of values delimited(separated) by a special character like a comma (CSV) or tab (TSV).
You can read CSV files into a pandas DataFrame using the pandas function pd.read_csv():
In [4]:
import pandas as pd

draft1 = pd.read_csv('draft2015.csv')    # Supply the file name (path)

draft1.head(6)                           # Check the first 6 rows
Out[4]:
PlayerDraft_ExpressCBSCBS_2CBS_3BleacherReportSI
0Karl-Anthony Towns111111
1Jahlil Okafor222222
2Emmanuel Mudiay766676
3D'Angelo Russell334433
4Kristaps Porzingis653344
5Mario Hezonja478767
6 rows × 7 columns
To load a TSV file, use pd.read_table():
In [5]:
draft2 = pd.read_table('draft2015.tsv')  # Read a tsv into a DataFrame

draft2.head(6)                           # Check the first 6 rows
Out[5]:
PlayerDraft_ExpressCBSCBS_2CBS_3BleacherReportSI
0Karl-Anthony Towns111111
1Jahlil Okafor222222
2Emmanuel Mudiay766676
3D'Angelo Russell334433
4Kristaps Porzingis653344
5Mario Hezonja478767
6 rows × 7 columns
The read_table() function is a general file reading function that reads TSV files by default, but you can use to to read flat text files separated by any delimiting character by setting the "sep" argument to a different character. Read more about the options it offers here.

Reading Excel Files

Microsoft Excel is a ubiquitous enterprise spreadsheet program that stores data in its own format with the extension .xls or .xlsx. Although you can save Excel files as CSV from within Excel and then load it into Python with the functions we covered above, pandas is capable of loading data directly from Excel file formats.
To load data from an Excel file you need the "xlrd" module. This module comes with the Python Anaconda distribution. If you don't have it installed, you can get it by opening a command console and running "pip install xlrd" (without quotes.).
Load data from an Excel file to a DataFrame with pd.read_excel(), supplying the file path and the name of the worksheet you want to load:
In [6]:
draft3 = pd.read_excel('draft2015.xlsx',        # Path to Excel file
                       sheetname = 'draft2015') # Name of sheet to read from

draft3.head(6)                            # Check the first 6 rows
Out[6]:
PlayerDraft_ExpressCBSCBS_2CBS_3BleacherReportSI
0Karl-Anthony Towns111111
1Jahlil Okafor222222
2Emmanuel Mudiay766676
3D'Angelo Russell334433
4Kristaps Porzingis653344
5Mario Hezonja478767
6 rows × 7 columns

Reading Web Data

The Internet gives you access to more data than you could ever hope to analyze. Data analysis often begins with getting data from the web and loading it into Python. Websites that offer data for download usually let you download data as CSV, TSV or excel files. Perhaps the easiest way load web data, is to simply download data to your hard drive and then use the functions we discussed earlier to load it into a DataFrame.
Reading from the clipboard is another quick and dirty option for reading web data and other tabular data. To read data from the clipboard, highlight the data you want to copy and use the appropriate copy function on your keyboard (typically control+C) as if you were going to copy and paste the data. Next, use the pd.read_clipboard() function with the appropriate separator to load the data into a pandas DataFrame:
In [7]:
# Go to http://www.basketball-reference.com/leagues/NBA_2015_totals.html
# click the CSV button and then copy some data to the clipboard

BB_reference_data = pd.read_clipboard(sep=",")  # Read data from the clipboard

BB_reference_data.ix[:, 0:10].head(5)   # Check 5 rows (10 columns only)
Out[7]:
RkPlayerPosAgeTmGGSMPFGFGA
01Quincy AcyPF24NYK68221287152331
12Jordan AdamsSG20MEM3002483586
23Steven AdamsC21OKC70671771217399
34Jeff AdrienPF28MIN1702151944
45Arron AfflaloSG29TOT78722502375884
5 rows × 10 columns
Pandas also comes with a read_html() function to read data directly from web pages. To use read_html() you need the HTML5lib package. Install it by opening a command console and running "pip install HTML5lib" (without quotes.). Note that HTML can have all sorts of nested structures and formatting quirks, which makes parsing it to extract data troublesome. The read_html() function does its best to draw out tabular data in web pages, but the results aren't always perfect. Let's read html directly from basketball-reference.com to get the same data we loaded from the clipboard:
In [8]:
url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"

BB_data = pd.read_html(url)         # Read data from the specified url

BB_data[0].ix[:, 0:10].head(5)      # Check 5 rows (10 columns only)*
Out[8]:
RkPlayerPosAgeTmGGSMPFGFGA
01Quincy AcyPF24NYK6822NaT152331
12Jordan AdamsSG20MEM300NaT3586
23Steven AdamsC21OKC70671771-01-01217399
34Jeff AdrienPF28MIN170NaT1944
45Arron AfflaloSG29TOT7872NaT375884
5 rows × 10 columns
*Note: read_html() returns a list of DataFrames, regardless of the number of tables on the web page you read from. In the code above, BB_data[0] gets the first DataFrame from the list and then .ix[:, 0:10].head(5) slices the first 10 columns and checks the head.
Data comes in all sorts of formats other than the ones we've discussed here. The pandas library has several other data reading functions to work with data in other common formats, like json, SAS and stata files and SQL databases.

Writing Data

Each of the data reading functions in pandas has a corresponding writer function that lets you write data back to into the format it came from. Most of the time, however, you'll probably want to save your data in an easy-to-use format like CSV. Write a DataFrame to CSV in the working directory by passing the desired file name to the df.to_csv() function:
In [9]:
BB_reference_data.to_csv("bb_data.csv") 

os.listdir('C:\\Users\\Greg\\Desktop\\intro_python10')
Out[9]:
['bb_data.csv', 'draft2015.csv', 'draft2015.tsv', 'draft2015.xlsx']
Notice 'bb_data.csv' now exists in the folder.

Wrap Up

The pandas library makes it easy to read data into DataFrames and export it back into common data formats like CSV files.
Now that we know how to load data into Python we're almost ready to start doing data analysis, but before we do, we need to learn some basic Python programming constructs.

7 comments:

  1. Seems this link is ain't working anymore
    url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"

    ReplyDelete
    Replies
    1. The link still seems to work for me. Maybe the site was down?

      Delete
    2. Oh yeah it was down i suppose. Thanks

      Delete
  2. Where can I download files : draft2015.csv, draft2015.tsv and draft2015.xlsx

    ReplyDelete
    Replies
    1. I added links above where you can download the data files above.

      Delete
    2. Thank you very much. Your tutorials are excellent, easy to understand and follow.

      Delete
    3. Thanks. The main reason I wrote them was to make something very approachable because so many of the courses and tutorials online assume prior knowledge or are just not that easy to understand, so I'm glad you are finding them easy to follow so far.

      Delete

Note: Only a member of this blog may post a comment.