Data100学习笔记- Pandas

Data Frames, Series, and Indices

-Data frame: 2D tabular data ( a collection of series that all share the same index)
-Series: 1D data/ columnar data
-Index: A sequence of row labels (不一定是数字，不需要unique)
Column name通常是unique的（数字1 和 string “1” 被认为是不同的；相同的column名称会被自动修改成 name name1 name2）

Basic operations

读取 elections = pd.read_csv("elections.csv")

创建

new_elections =  pd.DataFrame({
    result:["win","lose"], 
    colunName:["Acontent","Bcontent"]
})

截取开头或结尾的内容，行数可以自定义
elections.head()
eletions.tail()
//可以将index改成任意column
elections_year_index = pd.read_csv("elections.csv", index_col = "Year") //返回的是copy
elections_party_index = elections.set_index("Party") //效果相同

Random Sample

elctions.sample(20)

The [] Operator

Indexing with []

Column name argument to [ ] 产生 Series
List Argument to [ ] 产生 Data Frame
[ ：] row slicing. [inclusive : exclusive]

//Series
elections_year_index["Candidate"] 
//Data Frame
elections_year_index[ ["Candidate", "Party"] ]
elections_year_index[ ["Candidate"] ]
elections_year_index["Candidate"].to_frame()
//get 第二到第四行
elections_year_index[2:5]
//get first row
elections_year_index[0:1]
//error if there is no column named numeric 0
elections_year_index[0]

Boolean Array Selection and Querying

//假设一共6行，需要第2 3 5行(获得胜利的)
elections[ [False, True, True, False, True, False] ]
//需要获得胜利的
elections[ elections["Result"] == "win" ]
elections[ (elections["Result"] == "win") &(elections["someOthercolumn"] < someothercondition) ] //可以组合

isin() 和 query()

elections['Party'].isin(["Republican", "Democratic"])  //返回list of true or false
elections[elections_year_index['Party'].isin(["Republican", "Democratic"])] //返回符合要求的row
//query 快捷输入筛选条件
elections_year_index.query("Result == 'win' and Year < 2000") //返回符合要求的row

Not：~ (e.g A不等于B： ~(A==B) )

Index with loc and iloc

Loc: Access values by labels(names), index
iloc: Access value by position
(因为index不一定unique，所以index不等于position， position更像传统意义上的数组index)

Loc (inclusive on both end)

elections.loc[ [index_name, iname2, iname 3], ["column name1", "cname2"]]
elections_year_index.loc[[1980, 1984], ['Candidate','Party']] //返回数据可能不止两条 可能有多条index为1980/1984的row
//slice here is inclusive on both end
elections.loc[0:4, 'Candidate':'Year'] //返回data frame
elections.loc[0:4, 'Candidate'] //返回series
elections.loc[0:4, ['Candidate']] //返回data frame
elections.loc[0, 'Candidate':'Year']  //series
elections.loc[[0], 'Candidate':'Year']  //data frame
//loc支持boolean arrays，所以也可以进行row 和column的筛选
elections.loc[[True, False, False, True], 'Candidate':'%']

elections.loc[['Party']] 
'If we omit the column argument altogether, the default behavior is to retrieve all columns.'

iloc

elections.iloc[0:3, 0:3] //返回top3 rows， exclusive on right end

Some Useful Function

max(A)
np.mean(A)
DF.head()
DF.size //all entries, row*column
DF.shape //(row, column)
DF.describe()
DF.sort_values('columnName', ascending=False) //默认升序ascending = true
DF.query()
Series.value_counts()
Series.unique() //  returns all unique values as a numpy array

Series.str

babynames["Name"].str.startswith('J').head(10)
Series.contains('as').sample(5) //随机5行
Series.str.split('a').to_frame().head(5)
Series.str.split(',').str[1]
Series.str.len()

GroupBy

DF.groupby("name").agg(f)
DF.groupby(" ").size()
DF.groupby(" ").filter()
DF.groupby(["Year", "Sex").agg(f)

Pivot Tables

babynames_pivot = babynames.pivot_table(
    index='Year', # the rows (turned into index)
    columns='Sex', # the column values
    values='Count', # the field(s) to processed in each group
    aggfunc=np.max, # group operation
)

Join

elections.merge(presidents, 
            how = "inner",
            left_on = "Candidate", right_on = "President")
pd.merge(leftDF, rigthDF, how = "left", on = "id")

Left join / Right join / Inner join / Outer join

Read data from zip file

join_demo_filename = "lec6_join_demo_data.zip"
my_zip = zipfile.ZipFile(join_demo_filename, 'r')  //打开
list_names = [f.filename for f in my_zip.filelist]
list_names
my_zip.extractall() //unzip all

with my_zip.open("elections.csv") as f:  //read directly from zip file
    elections = pd.read_csv(f)