Data Frames, Series, and Indices
-Data frame: 2D tabular data ( a collection of series that all share the same index)
-Series: 1D data/ columnar data
-Index: A sequence of row labels (不一定是数字,不需要unique)
Column name通常是unique的(数字1 和 string “1” 被认为是不同的;相同的column名称会被自动修改成 name name1 name2)
Basic operations
读取 elections = pd.read_csv("elections.csv")
创建
new_elections = pd.DataFrame({
result:["win","lose"],
colunName:["Acontent","Bcontent"]
})
截取开头或结尾的内容,行数可以自定义
elections.head()
eletions.tail()
//可以将index改成任意column
elections_year_index = pd.read_csv("elections.csv", index_col = "Year") //返回的是copy
elections_party_index = elections.set_index("Party") //效果相同
Random Sample
elctions.sample(20)
The [] Operator
Indexing with []
Column name argument to [ ] 产生 Series
List Argument to [ ] 产生 Data Frame
[ :] row slicing. [inclusive : exclusive]
//Series
elections_year_index["Candidate"]
//Data Frame
elections_year_index[ ["Candidate", "Party"] ]
elections_year_index[ ["Candidate"] ]
elections_year_index["Candidate"].to_frame()
//get 第二到第四行
elections_year_index[2:5]
//get first row
elections_year_index[0:1]
//error if there is no column named numeric 0
elections_year_index[0]
Boolean Array Selection and Querying
//假设一共6行,需要第2 3 5行(获得胜利的)
elections[ [False, True, True, False, True, False] ]
//需要获得胜利的
elections[ elections["Result"] == "win" ]
elections[ (elections["Result"] == "win") &(elections["someOthercolumn"] < someothercondition) ] //可以组合
isin() 和 query()
elections['Party'].isin(["Republican", "Democratic"]) //返回list of true or false
elections[elections_year_index['Party'].isin(["Republican", "Democratic"])] //返回符合要求的row
//query 快捷输入筛选条件
elections_year_index.query("Result == 'win' and Year < 2000") //返回符合要求的row
Not:~ (e.g A不等于B: ~(A==B) )
Index with loc and iloc
Loc: Access values by labels(names), index
iloc: Access value by position
(因为index不一定unique,所以index不等于position, position更像传统意义上的数组index)
Loc (inclusive on both end)
elections.loc[ [index_name, iname2, iname 3], ["column name1", "cname2"]]
elections_year_index.loc[[1980, 1984], ['Candidate','Party']] //返回数据可能不止两条 可能有多条index为1980/1984的row
//slice here is inclusive on both end
elections.loc[0:4, 'Candidate':'Year'] //返回data frame
elections.loc[0:4, 'Candidate'] //返回series
elections.loc[0:4, ['Candidate']] //返回data frame
elections.loc[0, 'Candidate':'Year'] //series
elections.loc[[0], 'Candidate':'Year'] //data frame
//loc支持boolean arrays,所以也可以进行row 和column的筛选
elections.loc[[True, False, False, True], 'Candidate':'%']
elections.loc[['Party']]
'If we omit the column argument altogether, the default behavior is to retrieve all columns.'
iloc
elections.iloc[0:3, 0:3] //返回top3 rows, exclusive on right end
Some Useful Function
max(A)
np.mean(A)
DF.head()
DF.size //all entries, row*column
DF.shape //(row, column)
DF.describe()
DF.sort_values('columnName', ascending=False) //默认升序ascending = true
DF.query()
Series.value_counts()
Series.unique() // returns all unique values as a numpy array
Series.str
babynames["Name"].str.startswith('J').head(10)
Series.contains('as').sample(5) //随机5行
Series.str.split('a').to_frame().head(5)
Series.str.split(',').str[1]
Series.str.len()
GroupBy
DF.groupby("name").agg(f)
DF.groupby(" ").size()
DF.groupby(" ").filter()
DF.groupby(["Year", "Sex").agg(f)
Pivot Tables
babynames_pivot = babynames.pivot_table(
index='Year', # the rows (turned into index)
columns='Sex', # the column values
values='Count', # the field(s) to processed in each group
aggfunc=np.max, # group operation
)
Join
elections.merge(presidents,
how = "inner",
left_on = "Candidate", right_on = "President")
pd.merge(leftDF, rigthDF, how = "left", on = "id")
Left join / Right join / Inner join / Outer join
Read data from zip file
join_demo_filename = "lec6_join_demo_data.zip"
my_zip = zipfile.ZipFile(join_demo_filename, 'r') //打开
list_names = [f.filename for f in my_zip.filelist]
list_names
my_zip.extractall() //unzip all
with my_zip.open("elections.csv") as f: //read directly from zip file
elections = pd.read_csv(f)