1. Introduction to Pandas
One of the biggest advantages that pandas has over NumPy is the ability to store mixed data types in rows and columns. Many tabular datasets contain a range of data types, and pandas dataframes handle them effortlessly, whereas a NumPy array holds a single data type. Pandas dataframes also handle missing values gracefully, using a special value, NaN, to represent them. A common complaint with NumPy is that it has no general-purpose marker for missing values, so people end up having to find and replace those values manually.
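The contrast described above can be sketched in a few lines. This is only an illustration: the small `foods` table and its column names are made up for the example (two values are borrowed from the dataset used later, and the missing entry is invented).

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype, so mixing text and numbers
# coerces every element to a string.
mixed_array = np.array(["BUTTER", 717, 15.87])
print(mixed_array.dtype.kind)  # "U" means a Unicode string dtype

# A dataframe keeps a separate dtype per column, and a None in a
# numeric column is stored as NaN. (Hypothetical sample data.)
foods = pd.DataFrame({
    "name": ["BUTTER WITH SALT", "BUTTER OIL ANHYDROUS"],
    "energy_kcal": [717, 876],
    "water_g": [15.87, None],
})
print(foods.dtypes)
print(foods["water_g"].isnull())
```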
2. Reading a CSV file: pandas.read_csv()
input
import pandas
food_info = pandas.read_csv("food_info.csv")
print(type(food_info))
output
<class 'pandas.core.frame.DataFrame'>
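If food_info.csv is not at hand, the same call can be tried on an in-memory file. The sketch below assumes a four-column excerpt of the data; the `csv_text` and `sample` names are introduced here for illustration only.

```python
import io
import pandas

# Stand-in for food_info.csv: three rows and four columns taken from
# the food_info.head(3) output shown in these notes.
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1002,BUTTER WHIPPED WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
"""

# read_csv accepts a file path or any file-like object.
sample = pandas.read_csv(io.StringIO(csv_text))
print(type(sample))
print(sample.shape)
```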
3. Displaying the first few rows of a dataframe: XXXX.head()
- To select the first 5 rows of a dataframe, use the dataframe method head(). Pass an integer, as in head(3), to select a different number of rows.
input
print(food_info.head(3)) # First 3 rows
output
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
0 1001 BUTTER WITH SALT 15.87 717 0.85
1 1002 BUTTER WHIPPED WITH SALT 15.87 717 0.85
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
0 81.11 2.11 0.06 0.0 0.06
1 81.11 2.11 0.06 0.0 0.06
2 99.48 0.00 0.00 0.0 0.00
... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU \
0 ... 2499.0 684.0 2.32 1.5 60.0
1 ... 2499.0 684.0 2.32 1.5 60.0
2 ... 3069.0 840.0 2.80 1.8 73.0
Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
0 7.0 51.368 21.021 3.043 215.0
1 7.0 50.489 23.426 3.012 219.0
2 8.6 61.924 28.732 3.694 256.0
[3 rows x 36 columns]
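A quick way to see the default and the explicit row count side by side is a small throwaway frame. The `numbers` dataframe below is hypothetical and exists only for this sketch.

```python
import pandas as pd

# Hypothetical 10-row frame, used only to show head() behaviour.
numbers = pd.DataFrame({"n": range(10)})

first_five = numbers.head()    # no argument: the first 5 rows
first_three = numbers.head(3)  # explicit count: the first 3 rows
print(len(first_five), len(first_three))
```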
4. Displaying the dimensions (rows and columns): XXXX.shape
input
dimensions = food_info.shape
print(dimensions)
output
(8618, 36)
input
num_rows = dimensions[0] # The number of rows, 8618.
print(num_rows)
num_cols = dimensions[1] # The number of columns, 36.
print(num_cols)
output
8618
36
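Because shape is an ordinary tuple, it can also be unpacked in one step instead of indexed twice. A minimal sketch, using a made-up `grid` dataframe:

```python
import pandas as pd

# Hypothetical 3x2 frame for illustration.
grid = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so tuple unpacking works.
num_rows, num_cols = grid.shape
print(num_rows, num_cols)
```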
5. Selecting a row from a dataframe: XXXX.loc[N]
input
num_rows = food_info.shape[0] # Number of rows
last_rows = food_info.loc[num_rows-5:num_rows-1] # Select the last 5 rows; loc slices include both endpoints
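The different forms of loc can be compared on a small frame. This sketch assumes the default integer index (0, 1, 2, ...), as in food_info; the three-row `foods` dataframe is an excerpt built for illustration, and tail() is shown as one possible shortcut for the last rows.

```python
import pandas as pd

# Three rows and two columns from the head(3) output, default index.
foods = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003],
    "Energ_Kcal": [717, 717, 876],
})

first_row = foods.loc[0]    # a single label returns that row as a Series
two_rows = foods.loc[1:2]   # a label slice includes both endpoints
picked = foods.loc[[0, 2]]  # a list of labels returns a DataFrame
last_two = foods.tail(2)    # shortcut for the last n rows

print(two_rows.equals(last_two))
```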
6. Checking data types: XXXX.dtypes
- object - for representing string values.
- int - for representing integer values.
- float - for representing float values.
- datetime - for representing time values.
- bool - for representing Boolean values.
input
print(food_info.dtypes)
output
NDB_No int64
Shrt_Desc object
Water_(g) float64
Energ_Kcal int64
Protein_(g) float64
Lipid_Tot_(g) float64
Ash_(g) float64
Carbohydrt_(g) float64
Fiber_TD_(g) float64
Sugar_Tot_(g) float64
Calcium_(mg) float64
Iron_(mg) float64
Magnesium_(mg) float64
Phosphorus_(mg) float64
Potassium_(mg) float64
Sodium_(mg) float64
Zinc_(mg) float64
Copper_(mg) float64
Manganese_(mg) float64
Selenium_(mcg) float64
Vit_C_(mg) float64
Thiamin_(mg) float64
Riboflavin_(mg) float64
Niacin_(mg) float64
Vit_B6_(mg) float64
Vit_B12_(mcg) float64
Vit_A_IU float64
Vit_A_RAE float64
Vit_E_(mg) float64
Vit_D_mcg float64
Vit_D_IU float64
Vit_K_(mcg) float64
FA_Sat_(g) float64
FA_Mono_(g) float64
FA_Poly_(g) float64
Cholestrl_(mg) float64
dtype: object
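The result of dtypes is itself a Series indexed by column name, so a single column's type can be looked up by label. The sketch below uses a hypothetical three-column excerpt and also shows select_dtypes as one way to keep only the numeric columns.

```python
import pandas as pd

# Small excerpt of the dataset, built by hand for illustration.
foods = pd.DataFrame({
    "NDB_No": [1001, 1002],
    "Shrt_Desc": ["BUTTER WITH SALT", "BUTTER WHIPPED WITH SALT"],
    "Water_(g)": [15.87, 15.87],
})

column_types = foods.dtypes            # Series: column name -> dtype
print(column_types["Water_(g)"])

# Keep only the integer and float columns.
numeric = foods.select_dtypes(include="number")
print(numeric.columns.tolist())
```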
7. Selecting a column: XXXX['XXXX']
input
saturated_fat = food_info['FA_Sat_(g)']
cholesterol = food_info['Cholestrl_(mg)']
columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]
selenium_thiamin = food_info[['Selenium_(mcg)', 'Thiamin_(mg)']]
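One detail worth checking is the return type: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one entry. A short sketch with made-up values:

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset.
foods = pd.DataFrame({
    "Zinc_(mg)": [1.0, 2.0],
    "Copper_(mg)": [3.0, 4.0],
})

zinc = foods["Zinc_(mg)"]          # string key -> Series
zinc_frame = foods[["Zinc_(mg)"]]  # list key -> one-column DataFrame
print(type(zinc))
print(type(zinc_frame))
```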
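8. Selecting specific columns
- XXXX.tolist(): in this example, converts the column index to a list
- XXXX.endswith('YY'): returns True if the string ends with "YY"
- XXXX.startswith('YY'): returns True if the string starts with "YY"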
input
col_names = food_info.columns.tolist()
gram_columns = []
for c in col_names:
    if c.endswith('(g)'):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df)
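The same filtering can be written more compactly with list comprehensions. This is an alternative sketch, not a replacement for the loop above; the empty `foods` dataframe only borrows a few column names from the dataset.

```python
import pandas as pd

# Empty frame with a handful of the dataset's column names.
foods = pd.DataFrame(
    columns=["NDB_No", "Water_(g)", "Protein_(g)", "Zinc_(mg)", "Vit_A_IU"]
)

# Column names are plain strings, so the str methods apply directly.
gram_columns = [c for c in foods.columns if c.endswith("(g)")]
vitamin_columns = [c for c in foods.columns if c.startswith("Vit_")]

gram_df = foods[gram_columns]
print(gram_columns)
print(vitamin_columns)
```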