* Edit Jan 2021: I recently completed a YouTube video covering topics in this post:
Reading data into pandas DataFrames is often the first step when conducting data analysis in Python. The pandas package comes equipped with several data reading and writing functions that let you read data directly from common file formats like comma separated values files (CSV) and Microsoft Excel files. This lesson will focus on reading and writing data from these common file formats, but Python has packages available to work with just about every data format you encounter.
Python Working Directory and File Paths
Before we can jump into reading and writing data, we need to learn a little bit about Python's working directory and file paths. When you launch Python, it starts in a default location in your computer's file system known as the working directory. You can check your current working directory by importing the os module and then using os.getcwd():
In [1]:
import os
os.getcwd()
Out[1]:
The working directory acts as your starting point for accessing files on your computer from within Python. To load a data set from your hard drive, you either need to put the file in your working directory, change your working directory to the folder containing the data or supply the data file's file path to the data reading function.
You can change your working directory by supplying a new file path in quotes to the os.chdir() function:
In [2]:
os.chdir('C:\\Users\\Greg\\Desktop\\intro_python10')
os.getcwd() # Check the working directory again
Out[2]:
You can list all of the objects in a directory by passing the file path to the os.listdir( ) function:
In [3]:
os.listdir('C:\\Users\\Greg\\Desktop\\intro_python10')
Out[3]:
Reading CSV and TSV Files
Data is commonly stored in simple flat text files consisting of values delimited(separated) by a special character like a comma (CSV) or tab (TSV).
You can read CSV files into a pandas DataFrame using the pandas function pd.read_csv():
In [4]:
import pandas as pd
draft1 = pd.read_csv('draft2015.csv') # Supply the file name (path)
draft1.head(6) # Check the first 6 rows
Out[4]:
To load a TSV file, use pd.read_table():
In [5]:
draft2 = pd.read_table('draft2015.tsv') # Read a tsv into a DataFrame
draft2.head(6) # Check the first 6 rows
Out[5]:
The read_table() function is a general file reading function that reads TSV files by default, but you can use to to read flat text files separated by any delimiting character by setting the "sep" argument to a different character. Read more about the options it offers here.
Reading Excel Files
Microsoft Excel is a ubiquitous enterprise spreadsheet program that stores data in its own format with the extension .xls or .xlsx. Although you can save Excel files as CSV from within Excel and then load it into Python with the functions we covered above, pandas is capable of loading data directly from Excel file formats.
To load data from an Excel file you need the "xlrd" module. This module comes with the Python Anaconda distribution. If you don't have it installed, you can get it by opening a command console and running "pip install xlrd" (without quotes.).
Load data from an Excel file to a DataFrame with pd.read_excel(), supplying the file path and the name of the worksheet you want to load:
In [6]:
draft3 = pd.read_excel('draft2015.xlsx', # Path to Excel file
sheetname = 'draft2015') # Name of sheet to read from
draft3.head(6) # Check the first 6 rows
Out[6]:
Reading Web Data
The Internet gives you access to more data than you could ever hope to analyze. Data analysis often begins with getting data from the web and loading it into Python. Websites that offer data for download usually let you download data as CSV, TSV or excel files. Perhaps the easiest way load web data, is to simply download data to your hard drive and then use the functions we discussed earlier to load it into a DataFrame.
Reading from the clipboard is another quick and dirty option for reading web data and other tabular data. To read data from the clipboard, highlight the data you want to copy and use the appropriate copy function on your keyboard (typically control+C) as if you were going to copy and paste the data. Next, use the pd.read_clipboard() function with the appropriate separator to load the data into a pandas DataFrame:
In [7]:
# Go to http://www.basketball-reference.com/leagues/NBA_2015_totals.html
# click the CSV button and then copy some data to the clipboard
BB_reference_data = pd.read_clipboard(sep=",") # Read data from the clipboard
BB_reference_data.ix[:, 0:10].head(5) # Check 5 rows (10 columns only)
Out[7]:
Pandas also comes with a read_html() function to read data directly from web pages. To use read_html() you need the HTML5lib package. Install it by opening a command console and running "pip install HTML5lib" (without quotes.). Note that HTML can have all sorts of nested structures and formatting quirks, which makes parsing it to extract data troublesome. The read_html() function does its best to draw out tabular data in web pages, but the results aren't always perfect. Let's read html directly from basketball-reference.com to get the same data we loaded from the clipboard:
In [8]:
url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
BB_data = pd.read_html(url) # Read data from the specified url
BB_data[0].ix[:, 0:10].head(5) # Check 5 rows (10 columns only)*
Out[8]:
*Note: read_html() returns a list of DataFrames, regardless of the number of tables on the web page you read from. In the code above, BB_data[0] gets the first DataFrame from the list and then .ix[:, 0:10].head(5) slices the first 10 columns and checks the head.
Data comes in all sorts of formats other than the ones we've discussed here. The pandas library has several other data reading functions to work with data in other common formats, like json, SAS and stata files and SQL databases.
Writing Data
Each of the data reading functions in pandas has a corresponding writer function that lets you write data back to into the format it came from. Most of the time, however, you'll probably want to save your data in an easy-to-use format like CSV. Write a DataFrame to CSV in the working directory by passing the desired file name to the df.to_csv() function:
In [9]:
BB_reference_data.to_csv("bb_data.csv")
os.listdir('C:\\Users\\Greg\\Desktop\\intro_python10')
Out[9]:
Notice 'bb_data.csv' now exists in the folder.
Wrap Up
The pandas library makes it easy to read data into DataFrames and export it back into common data formats like CSV files.
Now that we know how to load data into Python we're almost ready to start doing data analysis, but before we do, we need to learn some basic Python programming constructs.
Seems this link is ain't working anymore
ReplyDeleteurl = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
The link still seems to work for me. Maybe the site was down?
DeleteOh yeah it was down i suppose. Thanks
DeleteWhere can I download files : draft2015.csv, draft2015.tsv and draft2015.xlsx
ReplyDeleteI added links above where you can download the data files above.
DeleteThank you very much. Your tutorials are excellent, easy to understand and follow.
DeleteThanks. The main reason I wrote them was to make something very approachable because so many of the courses and tutorials online assume prior knowledge or are just not that easy to understand, so I'm glad you are finding them easy to follow so far.
Delete