User Tools

Site Tools


chapter5

Chapter 5

ERRATUM

beginning of 5.3.1.

It is not the author's fault; due to the change of ticker symbol of Google on 2014.4.3

NG : for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
OK : for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOGL']: 

just before 5.5.1

namespace is missing.

NG: MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
OK: pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],

beginning of 5.6.2

NG: for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
OK: for stk in ['AAPL', 'GOOGL', 'MSFT', 'DELL']))

5.0 Import statements in this section

Type the following statements before starting

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import pandas.io.data as web
from numpy import nan as NA

5.1 Introduction to pandas Data Structures

5.1.1 "Series" object

Define a “Series” object (with default index)

obj = Series([4, 7, -5, 3])
obj   	
obj.values	# array([ 4,  7, -5,  3])
obj.index	# Int64Index([0, 1, 2, 3], dtype='int64')

Definition of a “Series” object with a given index

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
obj2.index  # Index([u'd', u'b', u'a', u'c'], dtype='object')

Handling of “Series” object

obj2['a']
obj2['d'] = 6
obj2[['c', 'a', 'd']]   # [3, -5, 4]
obj2['d'] = 6
obj2[['c', 'a', 'd']]   # [3, -5, 6]

Filtering, Calculation, and function import numpy as np

obj2                    # [6, 7, -5, 3]
obj2[obj2 > 0]          # [6, 7, 3]
obj2*2                  # [12, 14, -10, 6]
np.exp(obj2)            # exponential of each element in obj2, with base e

Checking existence of a certain index

'b' in obj2  # True
'e' in obj2	# False
6 in obj2 	# False: values cannot be tested by this statement 

Definition of a “Series” object using a python dictionary data

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3        # automatically sorted in the order of index data

Definition of a “Series” object with a dictionary and a given index

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)  

Missing values

pd.isnull(obj4)   # Class method: similar to "static" method in Java
pd.notnull(obj4)
obj4.isnull       # Instance method: refers to an existing object
obj3
obj4
obj3+obj4  # Note: missing data ('Utah') is counted as 'NaN'

Handling properties of a “Series” object

obj4.name = 'population'
obj4.index.name = 'state'
obj4
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan'] 
obj

5.1.2 "DataFrame" object

Definition of a “Dataframe” object

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
      'year': [2000, 2001, 2002, 2001, 2002],
      'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)	# column is automatically sorted in alphabetical order
frame     
frame = DataFrame(data, columns=['year', 'state', 'pop'])	# predetermined column order
frame
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                 index=['one', 'two', 'three', 'four', 'five'])
frame2			# missing column ('debt') is null
frame2.columns		# column list
frame2['state']		# extract a column
frame2.year		# another way to extract a column
frame2.ix['three']	# extract a row

Handling a “DataFrame” object

frame2['debt'] = 16.5		# [16.5, 16.5, 16.5, 16.5, 16.5]
frame2['debt'] = np.arange(5.)	# [0, 1, 2, 3, 4]
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val		# missing data is counted as 'NaN'
farme2
frame2['eastern'] = frame2.state == 'Ohio'	# create new column with preset values
del frame2['eastern']		# delete a column
frame2.columns

Other TIPS of “DataFrame”

# Definition of DataFrame by nested dictionary
pop = {'Nevada':{2001:2.4,2002:2.9},	# nested dictionary data
	'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
frame3 = DataFrame(pop)		# outer key -> column, innner key -> row
frame3				# index(row) sorted automatically
DataFrame(pop, index=[2003,2002,2001]) 	# index(row) in defined order 
#
frame3.T	# transposition
#
# Another definition of DataFrame from a dictionary 
pdata = {'Ohio':frame3['Ohio'][:-1],	# Select 'Ohio' column & discard last 1 row
	 'Nevada':frame3['Nevada'][:2]}	# Select 'Ohio' column & select rowNo < 2
DataFrame(pdata)

Handling property of “DataFrame” object

frame3.index.name = 'year'; frame3.columns.name = 'state'; 
frame3
frame3.values	# extract data as a 2nd-dimension ndarray
frame2.values	# suitable dtype is used

5.1.3 "Index" object

“Index” is a special object to carry metadata of axes.
The row and column labels of “Series” or “DataFrame” object is an “Index” object.

obj = Series (range(3), index=['a', 'b', 'c'])
idx = obj.index
idx
idx[1:]	# first element(index[0]) is ignored 

“Index” object is immutable

idx[1] = 'd'	# ERROR! "Index" object accepts no changes
idx = ['a', 'd', 'c']	# This operation is OK

(Better to skip these trivial things)

idx = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=idx)
obj2.index is idx
frame3
'Ohio' in frame3.columns	# True
2003 in frame3.index		#False

5.2 Essential functionality of pandas

5.2.1 Reindexing

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2		# re-ordered; value of 'e' is NULL

Setting default value for empty data

obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)	# re-ordered; value of 'e' is set as "0"
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')	# ffill : forwardfill (use preceding data)
obj3.reindex(range(6), method='bfill')	# bfill : backfill (use following data)

Reindexing for a “DataFrame” object

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
	columns=['Ohio', 'Texas', 'California'])
frame
frame2 = frame.reindex(['a', 'b', 'c', 'd'])	# values of row 'b' are NULL 
frame2
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)			# values of column 'Utah' are NA
frame.reindex(['a', 'b', 'c', 'd'], method='ffill', columns=states)
frame.ix[['a', 'b', 'c', 'd'], states]

5.2.2 Removing elements from axis

obj = Series(np.arange(5.), index=['a','b', 'c', 'd', 'e'])
new_obj = obj.drop(['d', 'c'])

Remove data from a “DataFrame” object

data = DataFrame(np.arange(16).reshape((4,4)), 
	index=['Ohio', 'Colorado', 'Utah', 'NewYork'],
	columns=['one', 'two', 'three', 'four'])
data		# 4 * 4 table
data.drop(['Colorado', 'Ohio'])
data.drop('two', axis=1)	# assign axis(row=0, column=1) to delete columns

5.2.3 Index reference, selection, filtering

Reference in “Series” object

obj = Series(np.arange(2, 6), index=['a', 'b', 'c', 'd'])
obj[1]
obj[[1,3]]
obj['b']
obj['b', 'a', 'd']
obj[obj<2]

Slicing

obj[2:4]		# End (row 4) is not included
obj['b':'d']		# End (row 'd') is included
obj['b':'c']=5
obj

Reference in DataFrame“ object

data = DataFrame(np.arange(16).reshape((4,4)), 
	index=['Ohio', 'Colorado', 'Utah', 'NewYork'],
	columns=['one', 'two', 'three', 'four'])
data
data['two']		# returns a column
data[:2]		# returns rows

Filtering by values

data[data['three']>5]	# returns rows
data < 5
data[data<5] = 0

Reference using “ix” field

data.ix['Colorado', ['two','three']]	# describe as "ix[column_name, row_name]"
data.ix[['Colorado','Utah'], [3,0,1]]	# 2 rows * 3 cols
data.ix[2]				# 1 row
data.ix[:'Utah', 'two']			# 3 rows * 1 col
data.ix[data.three > 5, :3]

5.2.4 Arithmetics and data formatt

s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'b', 'c', 'd'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'b', 'c', 'd', 'e'])
s1+s2		# undefined value is NA

Adding “DataFrame” objects

df1 = DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'),
	index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2
df1+df2			# undefined values are NA
5.2.4.1 Arithmetic method and data substitution
df1 = DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))
df1
df2
df1+df2				# undefined values are NA
df1.add(df2,fill_value=0)	# undefined values are regarded as 0
df1.reindex(columns=df2.columns, fill_value=0)
5.2.4.2 Arithmetics operation
arr = np.arange(12).reshape((3,4))
arr
arr[0]
arr - arr[0]	# BROADCAST: described in chapter 12

Calculation between “DataFrame” and “Series” objects

frame = DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),
	index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]	# returns row 1
frame
series
frame - series	# Broadcast for each row
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2
series3 = frame['d']
frame.sub(series3, axis=0)	# Broadcast for each column 

5.2.5 Function

Definition of a function

frame = DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),
	index=['Utah', 'Ohio', 'Texas', 'Oregon'])
f = lambda x: x.max() - x.min()
frame.apply(f)		# applied to each column
frame.apply(f, axis=1) 	# applied to each row	
def f(x):		# Definition of a function: another way
	return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Function for elements

format = lambda x: '%.2f' % x
frame.applymap(format)	# applied to all elements
frame['e'].map(format)	# applied to column e

5.2.6 Sort

obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()	# sort by index
frame = DataFrame(np.arange(8).reshape((2,4)), index=['three', 'one'],
	columns=['d', 'a', 'b', 'c'])
frame.sort_index()
frame.sort_index(axis=1)
frame.sort_index(axis=1, ascending=False)
obj=Series([4, 7, -3, 2])
obj.order()
obj=Series([4, np.nan, 7, np.nan, -3, 2]) # NaN is ordered at the end 
obj.order()

Sorting in “DataFrame”

frame = DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})
frame
frame.sort_index(by='b')	# sorted by value of column b
frame.sort_index(by=['a', 'b'])	# sorted by value of column a and b

Rank

obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()
obj.rank(method='first')	# handling of tied values changed
obj.rank(ascending=False, method='first')	# descending order
frame = DataFrame({'b':[4.3, 7, -3, 2], 'a':[0, 1, 0, 1],
	'c':[-2, 5, 8, -2.5]})
frame
frame.rank(axis=1)	# rank in each row

5.2.7 Duplicated index

obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
obj.index.is_unique
obj['a']	# returns a "Series"
obj['c']	# returns a scalar
df = DataFrame(np.arange(12).reshape((4, 3)), index=['a', 'a', 'b', 'b'])
df .ix['b']

5.3 Summarizing and Computing Descriptive Statistics

Missing values are eliminated from statistical analysis

df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
	index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
df
df.sum()	# sum for each column
df.sum(axis=1)	# sum for each row
df.mean(axis=1, skipna=False)	# do not skip NaN
df.idxmax()	# returns the index of the maximam calue for each column 
df.cumsum()	# cumulative sum for each column
df.describe()	# statistical summary
obj = Series(['a','a','b','c'])
obj.describe()	# statistical summary for non-numeric data

5.3.1 Correlation and covariance

Get data from Yahoo! finance

import pandas.io.data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOGL']:
	all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')
price = DataFrame({tic: data['Adj Close']
	 for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
	for tic, data in all_data.iteritems()})

Statistical analyses

returns = price.pct_change()
returns.tail()
returns.MSFT.corr(returns.IBM)
returns.MSFT.cov(returns.IBM)
returns.corr()
returns.cov()
returns.corrwith(returns.IBM)	# correlation with a specific column
returns.corrwith(volume)	# correlation with another DataFrame

5.3.2 Unique value, frequency, affiliation

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
obj.value_counts()	# sorted by frequency 
pd.value_counts(obj.values, sort=False)	
mask = obj.isin(['b','c' ])
mask
obj[mask]

5.4 Handling missing data

string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
string_data.isnull()
string_data[0] = None
string_data.isnull()

5.4.1 Eliminating missing data

from numpy import nan as NA
data = Series ([1, NA, 3.5, NA, 7])
data.dropna()
data[data.notnull()] # same result	

Missing data in “DataFrame” object

data = DataFrame([[1.,6.5,3.], [1.,NA, NA],
	[NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()	# row with no NaN remains
data
cleaned
data.dropna(how='all') # eliminated only row 2 (all NaN)
data[4] = NA
data
data.dropna(axis=1, how='all')	# similar operation for columns

Setting threshold on the number of valid data

df = DataFrame(np.arange(21).reshape((7,3)))
df.ix[:4,1]=NA; df.ix[:2,2]=NA	# setting NaN in some cells
df
df.dropna(thresh=2)	# returns rows with at least 2 valid data

5.4.2 Filling missing data

df.fillna(0)
df.fillna({1:0.5, 3:-1})	# set individual filling value for each column 
_ = df.fillna(0, inplace=True)	# directly change original data
df
df = DataFrame(np.arange(18).reshape((6,3)))
df.ix[2:,1]=NA; df.ix[4:,2]=NA

Other filling types

df.fillna(method='ffill')	# 'ffill' = forward filling
df.fillna(method='ffill', limit=2)	
data = Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())	# filling with mean value

5.5 Hierarchical indexing

data = Series(np.arange(10),
	index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
	[1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data
data.index	# returns a "MultiIndex" object
data['b']
data['b':'c']
data.ix[['b','d']]
data[:,2]	# select rows using secondary index

Stacking/unstacking a “DataFrame” (described in chapter 7)

data.unstack()	# pivot the inner-most index to a new column 
data.unstack().stack()	# inverse of unstack

Hierarchical indexing in a “DataFrame”

frame = DataFrame(np.arange(12).reshape((4,3)),
	index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
	columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
frame
frame.index.names=['key1', 'key2']
frame.columns.names=['state', 'color']
frame['Ohio']

Creating a “MultiIndex” object

pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
	names=['state', 'color'])	# "pd" is missing in textbook

5.5.1 Ordering and sorting of hierarchy

frame.swaplevel('key1', 'key2')
frame.sortlevel(1)
frame.swaplevel(0,1).sortlevel(0)

5.5.2 Statistical summary for each hierarchy

frame.sum(level='key2')
frame.sum(level='color', axis=1)

5.5.3 Using a column of a "DataFrame" as an index

frame = DataFrame({'a':range(7), 'b':range(7,0,-1),
	'c':['one', 'one', 'one', 'two', 'two', 'two', 'two'],
	'd':[0, 1, 2, 0, 1, 2, 3]}) 
frame
frame2 = frame.set_index(['c', 'd'])
frame2
frame.set_index(['c', 'd'], drop=False)	
frame2.reset_index()	# turns indexes into columns and gives a new index

5.6 Other pandas topics

5.6.1 Integer indexing

ser = Series(np.arange(3.))
ser
ser[-1]	# returns error: label '-1' does not exist
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
ser2
ser2[-1]	# returns the 1st element from the end 
ser.ix[:1]	# '1' is regarded as row label 
ser2.ix[:1]	# '1' is regarded as row number

Forced reference using data position

ser3 = Series(np.arange(3.), index=[-5, 1, 3])
ser3.iget_value(2)
frame = DataFrame(np.arange(6).reshape((3,2)), index=[2, 0, 1])
frame.irow(0)

5.6.2 Panel

import pandas.io.data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012'))
	for stk in ['AAPL', 'GOOGL', 'MSFT', 'DELL']))
pdata
pdata = pdata.swapaxes('items', 'minor')
pdata['Adj Close']
pdata.ix[:, '6/1/2012', :]	# all rows, date='6/1/2012', all columns
pdata.ix['Adj Close', '5/22/2012':, :]
stacked = pdata.ix[:, '5/30/2012':, :].to_frame()
stacked
stacked.to_panel()
chapter5.txt · Last modified: 2014-06-26 09:48 by miho_sera