numpy & pandas


numpy

creating arrays

The core data structure in NumPy is the ndarray object, which stands for N-dimensional array.
An array is a collection of values, similar to a list.

N-dimensional refers to the number of indices needed to select individual values from the object.

A 1-dimensional array is often referred to as a vector while a 2-dimensional array is often referred to as a matrix.

We can directly construct arrays from lists using the numpy.array() function. To construct a vector, we need to pass in a single list (with no nesting):

vector = np.array([5, 10, 15, 20])
The numpy.array() function also accepts a list of lists, which we use to create a matrix:

import numpy as np
#single list
vector = np.array([1,2,3])

#a list of lists
matrix = np.array([[1,2,3],[2,3,4]])

array shape

It's often useful to know how many elements an array contains. We can use the ndarray.shape property to find out how many elements the array has along each dimension.

  • For vectors, the shape property contains a tuple with 1 element.
  • For matrices, the shape property contains a tuple with 2 elements.
import numpy as np
vector = np.array([1, 2, 3, 4])
print(vector.shape)
# output (4,)
matrix = np.array([[5, 10, 15], [20, 25, 30]])
print(matrix.shape)
# output (2, 3)
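
The shape tuple gives the length along each dimension, so the total number of elements is the product of its entries. As a small sketch using the same two arrays, the ndim and size attributes report the number of dimensions and the total element count directly:

```python
import numpy as np

vector = np.array([1, 2, 3, 4])
matrix = np.array([[5, 10, 15], [20, 25, 30]])

# shape: length along each dimension; ndim: number of dimensions;
# size: total number of elements (the product of the shape entries).
print(vector.shape, vector.ndim, vector.size)  # (4,) 1 4
print(matrix.shape, matrix.ndim, matrix.size)  # (2, 3) 2 6
```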


reading numpy & pandas

numpy

Instructions

  • When reading in world_alcohol.csv using numpy.genfromtxt():
    • Use the "U75" data type
    • Skip the first line in the dataset
    • Use the comma delimiter.
  • Assign the result to world_alcohol.
  • Use the print() function to display world_alcohol.
import numpy as np
world_alcohol = np.genfromtxt('world_alcohol.csv', dtype='U75', delimiter=',', skip_header=1)
print(world_alcohol)
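
If world_alcohol.csv is not at hand, the same call can be tried on a few in-memory rows. This is only a sketch: the header names and the three sample rows below are made up for illustration and are not taken from the dataset. The one assumption carried over from these notes is the five-column layout (year, region, country, beverage type, value).

```python
import io
import numpy as np

# Hypothetical rows standing in for world_alcohol.csv (not real data).
sample_csv = io.StringIO(
    "Year,WHO region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
    "1985,Africa,Algeria,Wine,\n"
)

# dtype="U75" reads every field as a string of up to 75 characters;
# skip_header=1 drops the header line.
sample = np.genfromtxt(sample_csv, dtype="U75", delimiter=",", skip_header=1)

print(sample.shape)        # (3, 5)
print(sample[0, 2])        # Viet Nam
print(sample[2, 4] == "")  # the missing value is read as an empty string
```

Because every field is read as a string, numeric columns have to be converted later with astype().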

pandas

To read a CSV file into a dataframe, we use the pandas.read_csv() function and pass in the file name as a string.

Instructions

  • Import the pandas library.
  • Use the pandas.read_csv() function to read the file "food_info.csv" into a dataframe named food_info.
  • Use the type() and print() functions to display the type of food_info to confirm that it's a dataframe object.
import pandas
food_info = pandas.read_csv('food_info.csv')
print(type(food_info))
#output <class 'pandas.core.frame.DataFrame'>
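
The same idea can be tried without the file on disk. The sketch below feeds pandas.read_csv() an in-memory buffer; the column names come from these notes, but the second row is a made-up placeholder rather than real food_info.csv data.

```python
import io
import pandas as pd

# Hypothetical rows standing in for food_info.csv.
sample_csv = io.StringIO(
    "NDB_No,Shrt_Desc,Water_(g)\n"
    "1001,BUTTER WITH SALT,15.87\n"
    "9999,PLACEHOLDER FOOD,50.0\n"
)

sample_food = pd.read_csv(sample_csv)

print(type(sample_food))             # <class 'pandas.core.frame.DataFrame'>
print(sample_food.columns.tolist())  # ['NDB_No', 'Shrt_Desc', 'Water_(g)']
print(sample_food.shape)             # (2, 3)
```

Unlike numpy.genfromtxt() with a single string dtype, read_csv() infers a separate type for each column.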

indexing numpy & pandas

index numpy arrays

Here's how we would index a NumPy vector:

vector = np.array([5, 10, 15, 20])
print(vector[0])
# output 5

Matrix:
The first index specifies which row the data comes from, and the second index specifies which column the data comes from.

import numpy as np

matrix = np.array([
                    [5, 10, 15], 
                    [20, 25, 30]
                ])

print(matrix[1,2])
#output 30

Instructions

  • Assign the country in the third row to third_country. Country is the third column.
import numpy as np
matrix = world_alcohol
third_country = matrix[2,2]

indexing pandas

The Series object is a core data structure that pandas uses to represent rows and columns. A Series is a labelled collection of values similar to the NumPy vector.

The main advantage of Series objects is the ability to utilize non-integer labels. NumPy arrays can only utilize integer labels for indexing.

When you read a file into a dataframe, pandas uses the values in the first row (also known as the header) for the column labels and the row number for the row labels. Collectively, the labels are referred to as the index. Dataframes contain both a row index and a column index.

                 column label   column label   ...    (column index)
row label 0
row label 1
...
(row index)

When you select a row from a dataframe, instead of just returning the values in that row as a list, pandas returns a Series object that contains the column labels as well as the corresponding values:

NDB_No    Shrt_Desc           Water_(g)    ...
1001      BUTTER WITH SALT    15.87        ...

The Series object representing the first row looks like:

NDB_No       1001
Shrt_Desc    BUTTER WITH SALT
Water_(g)    15.87
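
To see this in practice, here is a minimal sketch that builds a one-row dataframe from the values shown above and selects its first row with .loc[] (covered later in these notes). The dataframe is constructed by hand purely for illustration; in the notes it would come from food_info.csv.

```python
import pandas as pd

# Hand-built one-row dataframe using the column names and values shown above.
sample_food = pd.DataFrame({
    "NDB_No": [1001],
    "Shrt_Desc": ["BUTTER WITH SALT"],
    "Water_(g)": [15.87],
})

first_row = sample_food.loc[0]

print(type(first_row).__name__)  # Series
print(first_row.index.tolist())  # ['NDB_No', 'Shrt_Desc', 'Water_(g)']
print(first_row["Shrt_Desc"])    # BUTTER WITH SALT
```

The column labels of the dataframe become the index of the Series, which is why a value can be looked up by name rather than by position.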



slicing numpy

numpy slicing arrays

vector

We can select a subset of a vector with a slice, just as we would with a list:

vector = np.array([5,10,15,20])
vector[0:3]
# output [5,10,15]
# The slice runs from the first index up to, but not including, the second index.

matrix

matrix = np.array([
                    [5,10,15],
                    [20,25,30],
                    [35,40,45]
])

matrix[:,1]
# output [10,25,40]
  • Select all of the rows, but only the column with index 1.
  • The colon (:) specifies that the entirety of a single dimension should be selected.

Instructions

  • Assign the whole third column from world_alcohol to the variable countries.
countries = world_alcohol[:,2]

slicing one dimension

matrix = np.array([
                [5,10,15],
                [20,25,30],
                [35,40,45]
])
matrix[:,0:2]
# output
# [[5,10],
#  [20,25],
#  [35,40]]

matrix[1:3,1]
# output [25,40]

Instructions

  • Assign all the rows and the first 2 columns of world_alcohol to first_two_columns.
  • Assign the first 10 rows and the first column of world_alcohol to first_ten_years.
  • Assign the first 10 rows and all of the columns of world_alcohol to first_ten_rows.
first_two_columns = world_alcohol[:,0:2]
first_ten_years = world_alcohol[0:10,0]
first_ten_rows = world_alcohol[0:10,:]

slicing arrays (both dimensions)

matrix = np.array([
        [5,10,15],
        [20,25,30],
        [35,40,45]
])

matrix[1:3,0:2]
# output
# [[20,25],
#  [35,40]]

Instructions

  • Assign the first 20 rows of the columns at index 1 and 2 of world_alcohol to first_twenty_regions.
first_twenty_regions = world_alcohol[0:20,1:3]


pandas

selecting a row (pandas)

We use bracket notation to access elements in a NumPy array or a standard list.

To select rows in a dataframe, however, we need to use the pandas loc[] indexer.

The loc[] indexer allows you to select rows by row label. Recall that when you read a file into a dataframe, pandas uses the row number (or position) as each row's label. Pandas uses zero-indexing, so the first row is at index 0, the second row at index 1, and so on.

# Series object representing the row at index 0.
food_info.loc[0]

# Series object representing the seventh row.
food_info.loc[6]

selecting multiple rows

Pass in either a slice of row labels or a list of row labels, and pandas will return a dataframe.

Note that unlike slicing lists in Python, a slice of a dataframe using .loc[] will include both the start and the end row.

# DataFrame containing the rows at index 3, 4, 5, and 6 returned.
food_info.loc[3:6]

# DataFrame containing the rows at index 2, 5, and 10 returned. Either of the following work.
# Method 1
two_five_ten = [2,5,10] 
food_info.loc[two_five_ten]

# Method 2
food_info.loc[[2,5,10]]
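
The inclusive end point is easy to miss, so here is a short sketch that puts a list slice and a .loc[] slice side by side. The seven-row dataframe is a made-up stand-in for food_info.

```python
import pandas as pd

# Hypothetical seven-row dataframe with the default 0..6 row labels.
sample = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70]})
values = sample["value"].tolist()

list_slice = values[3:6]      # end position 6 is excluded
loc_slice = sample.loc[3:6]   # end label 6 is included

print(list_slice)                 # [40, 50, 60]
print(loc_slice.index.tolist())   # [3, 4, 5, 6]
print(len(list_slice), len(loc_slice))  # 3 4
```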

Instructions

  • Select the last 5 rows of food_info and assign to the variable last_rows.
num_rows = food_info.shape[0]
last_rows = food_info.loc[num_rows-5:num_rows-1]
print(last_rows)


selecting a column

To access a single column, use bracket notation and pass in the column name as a string:

# Series object representing the "NDB_No" column.
ndb_col = food_info["NDB_No"]

# You can instead access a column by passing in a string variable.
col_name = "NDB_No"
ndb_col = food_info[col_name]

selecting multiple columns by name

To select multiple columns, pass in a list of strings representing the column names, and pandas will return a dataframe containing only the values in those columns.

When selecting multiple columns, the order of the columns in the returned dataframe matches the order of the column names in the list of strings that you passed in. This allows you to easily explore specific columns that may not be positioned next to each other in the dataframe.

columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]

# Skipping the assignment.
zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]
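
The ordering behaviour can be checked on a small example. In this sketch the dataframe and its column names "A", "B", and "C" are hypothetical; the point is only that the result follows the order of the list passed in, not the order in the original dataframe.

```python
import pandas as pd

# Hypothetical dataframe whose columns are stored in the order A, B, C.
sample = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Ask for C first, then A; B is left out.
subset = sample[["C", "A"]]

print(sample.columns.tolist())  # ['A', 'B', 'C']
print(subset.columns.tolist())  # ['C', 'A']
```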


Computation With NumPy

Learn how to select elements in arrays and perform computations with NumPy.

array comparisons

One of the most powerful aspects of the NumPy module is the ability to make comparisons across an entire array. These comparisons result in Boolean values.

import numpy as np

vector = np.array([5, 10, 15, 20])
vector == 10
# output: a Boolean vector
# [False, True, False, False]

matrix = np.array([
                    [5, 10, 15], 
                    [20, 25, 30],
                    [35, 40, 45]
                 ])
matrix == 25
# output: a Boolean matrix
# [
#    [False, False, False], 
#    [False, True,  False],
#    [False, False, False]
# ]

Instructions

The variable world_alcohol already contains the data set we're working with.

  • Extract the third column in world_alcohol, and compare it to the string Canada. Assign the result to countries_canada.
  • Extract the first column in world_alcohol, and compare it to the string 1984. Assign the result to years_1984.
countries_canada= (world_alcohol[:,2]=='Canada')
years_1984 = (world_alcohol[:,0]=='1984')


selecting elements

Comparisons give us the power to select elements in arrays using Boolean vectors. This allows us to conditionally select certain elements in vectors, or certain rows in matrices.


import numpy as np
vector = np.array([5, 10, 15, 20])
equal_to_ten = (vector == 10)
print(vector[equal_to_ten])
# output [10]


matrix = np.array([
                [5, 10, 15], 
                [20, 25, 30],
                [35, 40, 45]
             ])
second_column_25 = (matrix[:,1] == 25)
print(matrix[second_column_25,:])
# output [[20 25 30]]

Instructions

  • Compare the third column of world_alcohol to the string Algeria.
  • Assign the result to country_is_algeria.
  • Select only the rows in world_alcohol where country_is_algeria is True.
  • Assign the result to country_algeria.
country_is_algeria =(world_alcohol[:,2]=='Algeria')
country_algeria = world_alcohol[country_is_algeria,:]


Comparisons with Multiple Conditions

We can join multiple conditions with an ampersand (&), which requires both conditions to be True. It's critical to put each condition in parentheses.

We can also use the pipe symbol (|) to specify that either one condition or the other should be True.
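
A small sketch of both operators on the vector used elsewhere in these notes. The parentheses matter because & and | bind more tightly than comparison operators such as == and >.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

# Both conditions must hold (element-wise "and").
both = (vector > 5) & (vector < 20)

# At least one condition must hold (element-wise "or").
either = (vector == 5) | (vector == 20)

print(both.tolist())     # [False, True, True, False]
print(either.tolist())   # [True, False, False, True]
print(vector[both])      # [10 15]
print(vector[either])    # [ 5 20]
```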

Instructions

  • Perform a comparison with multiple conditions, and join the conditions with &.
    • Compare the first column of world_alcohol to the string 1986.
    • Compare the third column of world_alcohol to the string Algeria.
    • Enclose each condition in parentheses, and join the conditions with &.
    • Assign the result to is_algeria_and_1986.
  • Use is_algeria_and_1986 to select rows from world_alcohol.
  • Assign the rows that is_algeria_and_1986 selects to rows_with_algeria_and_1986.
is_algeria_and_1986 = (world_alcohol[:,0] == "1986") & (world_alcohol[:,2] == "Algeria")

rows_with_algeria_and_1986 = world_alcohol[is_algeria_and_1986,:]


Replacing Values

We can also use comparisons to replace values:

import numpy as np
vector = np.array([5, 10, 15, 20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
vector[equal_to_ten_or_five] = 50
print(vector)
# output [50 50 15 20]


matrix = np.array([
            [5, 10, 15], 
            [20, 25, 30],
            [35, 40, 45]
         ])
second_column_25 = matrix[:,1] == 25
matrix[second_column_25, 1] = 10
print(matrix)
# output
# [
#    [5, 10, 15], 
#    [20, 10, 30],
#    [35, 40, 45]
# ]

Instructions

  • Replace all instances of the string 1986 in the first column of world_alcohol with the string 2014.
  • Replace all instances of the string Wine in the fourth column of world_alcohol with the string Grog.
world_alcohol[(world_alcohol[:,0]=='1986'),0]='2014'
#world_alcohol[(world_alcohol[:,3]=='Wine'),3] ='Grog'
world_alcohol[:,3][world_alcohol[:,3] == 'Wine'] = 'Grog'


Replacing Empty Strings

Instructions

  • Compare all the items in the fifth column of world_alcohol with an empty string ''. Assign the result to is_value_empty.
  • Select all the values in the fifth column of world_alcohol where is_value_empty is True, and replace them with the string 0.

is_value_empty =world_alcohol[:,4] == ''
world_alcohol[is_value_empty,4] = '0'

#world_alcohol[:,4][world_alcohol[:,4]==''] = '0'


Converting Data Types

We can convert the data type of an array with the astype() method.

Instructions

  • Extract the fifth column from world_alcohol, and assign it to the variable alcohol_consumption.
  • Use the astype() method to convert alcohol_consumption to the float data type.
alcohol_consumption= world_alcohol[:,4]
alcohol_consumption = alcohol_consumption.astype(float)
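
As a rough illustration of what astype() does, the sketch below converts a small array of made-up string values. It also shows why empty strings are replaced before conversion: an empty string cannot be parsed as a float.

```python
import numpy as np

# Hypothetical string values, as they might look after reading a CSV as text.
strings = np.array(["1.5", "2", "0"])
floats = strings.astype(float)  # returns a new array; strings is unchanged

print(strings.dtype.kind)  # U (Unicode strings)
print(floats.dtype)        # float64
print(floats.sum())        # 3.5

# Replace the empty string first, then convert.
with_gap = np.array(["1.5", "", "2"])
with_gap[with_gap == ""] = "0"
converted = with_gap.astype(float)
print(converted.sum())     # 3.5
```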



Computing with NumPy

import numpy as np
vector = np.array([5, 10, 15, 20])
vector.sum()
# output 50

matrix = np.array([
                [5, 10, 15], 
                [20, 25, 30],
                [35, 40, 45]
             ])
matrix.sum(axis=1) # axis=1 sums across each row; axis=0 sums down each column
# output [30, 75, 120]
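
For comparison, here is a short sketch of the same matrix summed along each axis and over the whole array, with mean() alongside:

```python
import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45],
])

row_sums = matrix.sum(axis=1)     # one value per row
column_sums = matrix.sum(axis=0)  # one value per column
total = matrix.sum()              # no axis: sum of every element
average = matrix.mean()           # mean of every element

print(row_sums)     # [ 30  75 120]
print(column_sums)  # [60 75 90]
print(total)        # 225
print(average)      # 25.0
```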

Instructions

  • Use the sum() method to calculate the sum of the values in alcohol_consumption. Assign the result to total_alcohol.
  • Use the mean() method to calculate the average of the values in alcohol_consumption. Assign the result to average_alcohol.
total_alcohol = alcohol_consumption.sum()
average_alcohol =alcohol_consumption.mean()


Total Annual Alcohol Consumption

Instructions

  • Create a matrix called canada_1986 that only contains the rows in world_alcohol where the first column is the string 1986 and the third column is the string Canada.
  • Extract the fifth column of canada_1986, replace any empty strings ('') with the string 0, and convert the column to the float data type. Assign the result to canada_alcohol.
  • Compute the sum of canada_alcohol. Assign the result to total_canadian_drinking.
is_canada_1986 = (world_alcohol[:,2] == "Canada") & (world_alcohol[:,0] == '1986')

canada_1986 = world_alcohol[is_canada_1986,:]

canada_alcohol = canada_1986[:,4]

empty_strings = canada_alcohol == ''

canada_alcohol[empty_strings] = "0"

canada_alcohol = canada_alcohol.astype(float)

total_canadian_drinking = canada_alcohol.sum()


Calculating Consumption for Each Country

  • Create an empty dictionary called totals.
  • Select only the rows in world_alcohol that match a given year. Assign the result to year.
  • Loop through a list of countries. For each country:
    • Select only the rows from year that match the given country.
    • Assign the result to country_consumption.
    • Extract the fifth column from country_consumption.
    • Replace any empty string values in the column with the string 0.
    • Convert the column to the float data type.
    • Find the sum of the column.
    • Add the sum to the totals dictionary, with the country name as the key.
  • After the code executes, you'll have a dictionary containing all of the country names as keys, with the associated alcohol consumption totals as the values.

Instructions

  • We've assigned the list of all countries to the variable countries.
  • Find the total consumption for each country in countries for the year 1989.
    • Refer to the steps outlined above for help.
  • When you're finished, totals should contain all of the country names as keys, with the corresponding alcohol consumption totals for 1989 as values.
totals = {}
is_year = world_alcohol[:,0] == "1989"
year = world_alcohol[is_year,:]

for country in countries:
    is_country = year[:,2] == country
    
    country_consumption = year[is_country,:]
    
    alcohol_column = country_consumption[:,4]
    
    is_empty = alcohol_column == ''
    
    alcohol_column[is_empty] = "0"
    
    alcohol_column = alcohol_column.astype(float)
    
    totals[country] = alcohol_column.sum()


Finding the Country that Drinks the Most

Now that we've computed total alcohol consumption for each country in 1989, we can loop through the totals dictionary to find the country with the highest value.

The process we've outlined below will help you find the key with the highest value in a dictionary:

  • Create a variable called highest_value that will keep track of the highest value. Set its value to 0.

  • Create a variable called highest_key that will keep track of the key associated with the highest value. Set its value to None.

  • Loop through each key in the dictionary.

    • If the value associated with the key is greater than highest_value, assign the value to highest_value, and assign the key to highest_key.

After the code runs, highest_key will be the key associated with the highest value in the dictionary.
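
The steps above can be sketched on a small dictionary. The values in sample_totals are made up for illustration. The built-in max() with key=sample_totals.get is not part of the process described here; it appears only as a cross-check.

```python
# Hypothetical totals; real values would come from the world_alcohol data.
sample_totals = {"Algeria": 0.5, "Canada": 8.5, "Uruguay": 6.0}

highest_value = 0
highest_key = None

# Keep track of the largest value seen so far and the key it belongs to.
for country in sample_totals:
    if sample_totals[country] > highest_value:
        highest_value = sample_totals[country]
        highest_key = country

print(highest_key, highest_value)  # Canada 8.5

# Cross-check with the built-in max(), comparing keys by their values.
print(max(sample_totals, key=sample_totals.get))  # Canada
```

Starting highest_value at 0 works here because consumption totals are never negative; for data that can be negative, a different starting value would be needed.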

Instructions

  • Find the country with the highest total alcohol consumption.
  • To do this, you'll need to find the key associated with the highest value in the totals dictionary.
  • Follow the process outlined above to find the highest value in totals.
  • When you're finished, highest_value will contain the highest total alcohol consumption, and highest_key will contain the country that had the highest per capita alcohol consumption in 1989.
highest_value = 0
highest_key = None
for country in totals:
    if totals[country]> highest_value:
        highest_value = totals[country]
        highest_key = country


NumPy Strengths and Weaknesses

You should now have a good foundation in NumPy, and in handling issues with your data. NumPy is much easier to work with than lists of lists, because:

  • It's easy to perform computations on data.
  • Data indexing and slicing is faster and easier.
  • We can convert data types quickly.

Overall, NumPy makes working with data in Python much more efficient. It's widely used for this reason, especially for machine learning.

You may have noticed some limitations with NumPy as you worked through the past two missions, though. For example:

  • All of the items in an array must have the same data type. For many datasets, this can make arrays cumbersome to work with.
  • Columns and rows must be referred to by number, which gets confusing when you go back and forth from column name to column number.
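
The first limitation is easy to demonstrate. In this sketch the mixed values are made up; when numbers and strings are placed in one array, NumPy stores all of them as strings so that the array keeps a single data type.

```python
import numpy as np

# A year, a country name, and a measurement in one array (hypothetical values).
mixed = np.array([1986, "Canada", 2.5])

print(mixed.dtype.kind)    # U: every item is stored as a Unicode string
print(mixed.tolist())      # ['1986', 'Canada', '2.5']
print(mixed[0] == "1986")  # True: the integer became the string '1986'
```

This is why the notes read world_alcohol.csv entirely as strings and convert the numeric column afterwards.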

In the next few missions, we'll learn about the Pandas library, one of the most popular data analysis libraries. Pandas builds on NumPy, but does a better job addressing the limitations of NumPy.


