Introduction to Data Science in Python: Study Notes

These are the author's study notes from the Coursera course Introduction to Data Science in Python, provided for reference only.


1. 50 Years of Data Science

    (1) Data Exploration and Preparation 

    (2) Data Representation and Transformation

    (3) Computing with Data

    (4) Data Modeling

    (5) Data Visualization and Presentation

    (6) Science about Data Science


2. Functions

def add_numbers(x,  y,  z = None, flag = False):

    if (flag):

        print('Flag is true!')

    if z is None:

        return x + y

    else:

        return x + y + z

print(add_numbers(1, 2, flag=True))


Assign function add_numbers to a variable a, then call it through a:

a = add_numbers

a(1, 2, flag=True)
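As a small illustration of the point above, the sketch below checks that the variable and the original name refer to the same function object. The names same_object, two_args, and three_args are only illustrative and do not come from the course.

```python
def add_numbers(x, y, z=None, flag=False):
    """Add two or three numbers; optionally announce the flag."""
    if flag:
        print('Flag is true!')
    if z is None:
        return x + y
    return x + y + z

a = add_numbers                      # no parentheses: bind the function object itself
same_object = a is add_numbers       # both names point at one function
two_args = a(1, 2)                   # z defaults to None, so only x + y
three_args = a(1, 2, 3, flag=True)   # also prints 'Flag is true!'

print(same_object, two_args, three_args)
```

Writing a = add_numbers() with parentheses would instead call the function and fail, because x and y are required.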


3. Checking data types

type('This is a string')

-> str

type(None)

-> NoneType


4. Tuple

Tuples are an immutable data structure (cannot be altered).

x = (1, 'a', 2, 'b')

type(x)

->tuple


5. List

Lists are a mutable data structure.

x = [1, 'a', 2, 'b']

type(x)

->list
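To make the difference between mutable and immutable concrete, here is a minimal sketch that tries the same item assignment on a list and on a tuple. The variable names are illustrative only.

```python
point = (1, 'a', 2, 'b')     # tuple
items = [1, 'a', 2, 'b']     # list

items[0] = 99                # lists support item assignment

try:
    point[0] = 99            # tuples do not; this raises TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(items)
print(point, tuple_changed)
```

The tuple is left exactly as it was created, while the list now starts with 99.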


6. Append

Use append to add an object to the end of a list.

x.append(3.3)

print(x)

->[1, 'a', 2, 'b', 3.3]


7. Loop through each item in the list

for item in x:

    print(item)

->1

    a

    2

    b

    3.3


8. Using the indexing operator to loop through each item in the list

i = 0

while i != len(x):

    print(x[i])

    i = i + 1

->1

    a

    2

    b

    3.3


9. Basic list operations

(1) Use + to concatenate lists

[1, 2] + [3, 4]

-> [1, 2, 3, 4]

(2)Use * to repeat lists

[1]*3

->[1, 1, 1]

(3) Use the in operator to check if something is inside a list

1 in [1, 2, 3]

->True


10. Basic string operations

(1) Use bracket notation to slice a string.

x = 'This is a string'

print(x[0])

->T

print(x[0:1])

->T

print(x[0:2])

->Th

print(x[-1])  # the last element

->g

print(x[-4:-2])  # start from the 4th element from the end and stop before the 2nd element from the end

->ri

x[:3]  # This is a slice from the beginning of the string, stopping before index 3 (the 4th element).

->Thi

x[3:] # this is a slice starting from the 4th element of the string and going all the way to the end.

-> s is a string

(2) The +, * and in operators from lists also work on strings

firstname = 'Christopher'

lastname = 'Brooks'

print(firstname + ' ' + lastname)

->Christopher Brooks

print(firstname*3)

->ChristopherChristopherChristopher

print('Chris' in firstname)

->True

(3) Split returns a list of all the words in a string, or a list split on a specific character.

firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0] 

lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1] 

print(firstname)

->Christopher

print(lastname)

->Brooks
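The two behaviours of split described above can be compared side by side. This is a minimal sketch; the csv_line string is made-up sample data, not taken from the course files.

```python
full_name = 'Christopher Arthur Hansen Brooks'

parts = full_name.split()            # no argument: split on runs of whitespace
first, last = parts[0], parts[-1]

csv_line = 'audi,a4,1.8,1999'        # hypothetical sample line
fields = csv_line.split(',')         # split on a specific character

print(parts)
print(first, last)
print(fields)
```

Note that every element of fields is still a string, so numeric values need an explicit conversion such as float(fields[2]).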

(4) Make sure you convert objects to strings before concatenating.

'Chris' + 2

->TypeError

'Chris' + str(2)

->Chris2


11. Dictionary

(1) Dictionaries associate keys with values

x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}

x['Christopher Brooks']

->brooksch@umich.edu

x['Kevyn Collins-Thompson'] = None

x['Kevyn Collins-Thompson']

->(no output, because the value is None)

(2) Iterate over all of the keys:

for name in x:

    print(x[name])

->brooksch@umich.edu

    billg@microsoft.com

    None

(3) Iterate over all of the values:

for email in x.values():

    print(email)

->brooksch@umich.edu

    billg@microsoft.com

    None

(4) Iterate over all of the items in the dictionary:

for name, email in x.items():

    print(name)

    print(email)

->Christopher Brooks

    brooksch@umich.edu

    Bill Gates

    billg@microsoft.com

    Kevyn Collins-Thompson

    None

(5) Unpack a sequence into different variables:

x = ('Christopher', 'Brooks', 'brooksch@umich.edu')

fname, lname, email = x

fname

->Christopher

lname

->Brooks

(6) Make sure the number of values you are unpacking matches the number of variables being assigned.

x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')

fname, lname, email = x

->ValueError: too many values to unpack (expected 3)
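The mismatch above can be caught or avoided. The sketch below first captures the exception, then uses a starred target, which is standard Python 3 syntax but goes slightly beyond what the notes cover. The names record, unpack_error, and rest are illustrative.

```python
record = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')

try:
    fname, lname, email = record     # 4 values, only 3 names
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

# A starred target collects whatever is left over into a list.
fname, lname, *rest = record

print(unpack_error)
print(fname, lname, rest)
```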


12. More on Strings

(1) Simple Samples

print('Chris' + 2)

->TypeError

print('Chris' + str(2))

->Chris2

(2) Python has a built-in method for convenient string formatting.

sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris' }

sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'

print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))

->Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
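The same statement can also be written with named placeholders and a format specification, which keeps the template readable and fixes the number of decimal places. This is a sketch of one alternative, not the form used in the course; the names template and total are introduced here.

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}

# Named fields are filled from keyword arguments; :.2f rounds to 2 decimals.
template = ('{person} bought {num_items} item(s) at a price of '
            '{price:.2f} each for a total of {total:.2f}')

total = sales_record['num_items'] * sales_record['price']
statement = template.format(total=total, **sales_record)

print(statement)
```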


13. Reading and Writing CSV files

(1) Import csv and read the file

import csv

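# %precision is an IPython magic (it only works in IPython/Jupyter); here it displays floats with 2 decimal places

%precision 2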

with open('mpg.csv') as csvfile:

    mpg = list(csv.DictReader(csvfile)) # convert csvfile into a list whose elements are dictionaries

mpg[:3]

->

[OrderedDict([('', '1'),

              ('manufacturer', 'audi'),

              ('model', 'a4'),

              ('displ', '1.8'),

              ('year', '1999'),

              ('cyl', '4'),

              ('trans', 'auto(l5)'),

              ('drv', 'f'),

              ('cty', '18'),

              ('hwy', '29'),

              ('fl', 'p'),

              ('class', 'compact')]),

OrderedDict([('', '2'),

              ('manufacturer', 'audi'),

              ('model', 'a4'),

              ('displ', '1.8'),

              ('year', '1999'),

              ('cyl', '4'),

              ('trans', 'manual(m5)'),

              ('drv', 'f'),

              ('cty', '21'),

              ('hwy', '29'),

              ('fl', 'p'),

              ('class', 'compact')]),

OrderedDict([('', '3'),

              ('manufacturer', 'audi'),

              ('model', 'a4'),

              ('displ', '2'),

              ('year', '2008'),

              ('cyl', '4'),

              ('trans', 'manual(m6)'),

              ('drv', 'f'),

              ('cty', '20'),

              ('hwy', '31'),

              ('fl', 'p'),

              ('class', 'compact')])]

(2) Check the length of the list

len(mpg)

->234

(3)keys gives us the column names of our csv

mpg[0].keys()

->odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

(4) Find the average hwy fuel economy across all cars. All values in the dictionaries are strings, so we need to convert them to float.

sum(float(d['hwy']) for d in mpg) / len(mpg)

->23.44

(5) Use set to return the unique values for the number of cylinders the cars in our dataset have.

cylinders = set(d['cyl'] for d in mpg)

cylinders

->{'4', '5', '6', '8'}

(6) Group the cars by number of cylinders and find the average cty mpg for each group.

CtyMpgByCyl = []

for c in cylinders:

    summpg = 0

    cyltypecount = 0

    for d in mpg:

        if d['cyl'] == c:

            summpg += float(d['cty'])

            cyltypecount += 1

    CtyMpgByCyl.append((c, summpg / cyltypecount))

CtyMpgByCyl.sort(key = lambda x: x[0])

CtyMpgByCyl

->[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]

(7) Use set to return the unique values for the class types in our dataset

vehicleclass = set(d['class'] for d in mpg)

vehicleclass

->{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

(8) Find the average hwy mpg for each class of vehicle in our dataset.

HwyMpgByClass = []

for t in vehicleclass:

    summpg = 0

    vclasscount = 0

    for d in mpg:

        if d['class'] == t:

            summpg += float(d['hwy'])

            vclasscount += 1

    HwyMpgByClass.append((t, summpg / vclasscount))

HwyMpgByClass.sort(key = lambda x: x[1])

HwyMpgByClass

->

[('pickup', 16.88),

('suv', 18.13),

('minivan', 22.36),

('2seater', 24.80),

('midsize', 27.29),

('subcompact', 28.14),

('compact', 28.30)]
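Steps (6) and (8) follow the same group-then-average pattern, and each rescans the whole list once per group. As a sketch of a single-pass alternative, the helper below accumulates totals and counts with collections.defaultdict. The function average_by and the five sample rows are hypothetical; they only mimic the shape of csv.DictReader output and are not the real mpg.csv data.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like csv.DictReader output (all values are strings).
rows = [
    {'class': 'compact', 'hwy': '29'},
    {'class': 'compact', 'hwy': '31'},
    {'class': 'suv', 'hwy': '18'},
    {'class': 'suv', 'hwy': '20'},
    {'class': 'pickup', 'hwy': '17'},
]

def average_by(rows, group_key, value_key):
    """Return (group, mean) tuples sorted by mean, using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += float(row[value_key])
        counts[row[group_key]] += 1
    averages = [(group, totals[group] / counts[group]) for group in totals]
    return sorted(averages, key=lambda pair: pair[1])

result = average_by(rows, 'class', 'hwy')
print(result)
```

With the real data, average_by(mpg, 'class', 'hwy') should reproduce the list from step (8), assuming mpg was loaded as shown in step (1).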


14. Dates and Times

(1) Import the datetime and time modules

import datetime as dt

import time as tm

(2) Time returns the current time in seconds since the Epoch

tm.time()

->1583932727.90

(3) Convert the timestamp to datetime

dtnow = dt.datetime.fromtimestamp(tm.time())

dtnow

->

datetime.datetime(2020, 3, 11, 13, 18, 56, 990293)

(4) Handy datetime attributes: get year, month, day, etc. from a datetime

dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second

->(2020, 3, 11, 13, 18, 56)

(5) Timedelta is a duration expressing the difference between two dates.

delta = dt.timedelta(days = 100)

delta

->datetime.timedelta(100)

(6) date.today returns the current local date

today = dt.date.today()

today

->datetime.date(2020, 3, 11)

(7) the date 100 days ago

today - delta

->datetime.date(2019, 12, 2)

(8) compare dates

today > today - delta

-> True
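Because date.today() changes every day, the arithmetic is easier to verify with a fixed date. The sketch below assumes the date shown in the output above (2020-03-11); the names start, earlier, and gap are illustrative.

```python
import datetime as dt

start = dt.date(2020, 3, 11)         # fixed date instead of dt.date.today()
delta = dt.timedelta(days=100)

earlier = start - delta              # date minus timedelta gives a date
gap = start - earlier                # date minus date gives a timedelta

print(earlier)
print(gap.days)
print(start > earlier)
```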


15. Objects and map()

(1) an example of a class in python:

class Person:

    department = 'School of Information'

    def set_name(self, new_name):

        self.name = new_name

    def set_location(self, new_location):

        self.location = new_location


person = Person()

person.set_name('Christopher Brooks')

person.set_location('Ann Arbor, MI, USA')

print('{} live in {} and work in the department {}'.format(person.name, person.location, person.department))

(2) mapping the min function between two lists

store1 = [10.00, 11.00, 12.34, 2.34]

store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)

cheapest

-><map at 0x7f74034a8860>

(3) iterate through the map object to see the values

for item in cheapest:

    print(item)

->

9.0

11.0

12.34

2.01
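The map object printed above is a lazy iterator, so it can only be consumed once. A minimal sketch of that behaviour, with first_pass and second_pass as illustrative names:

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)  # nothing is computed yet

first_pass = list(cheapest)          # consumes the iterator
second_pass = list(cheapest)         # already exhausted, so this is empty

print(first_pass)
print(second_pass)
```

If the values are needed more than once, store the list rather than the map object.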


16. Lambda and List Comprehensions

(1) an example of lambda that takes in three parameters and adds the first two

my_function = lambda a, b, c: a+b

my_function(1, 2, 3)

->3

(2) iterate from 0 to 999 and return the even numbers.

my_list = []

for number in range(0, 1000):

    if number % 2 == 0:

        my_list.append(number)

my_list

->[0, 2, 4,...]

(3) Now the same thing but with list comprehension

my_list = [number for number in range(0, 1000) if number % 2 == 0]

my_list

->[0, 2, 4,...]
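To confirm that the loop and the comprehension build the same list, the sketch below runs both and compares them. The third variant, using the step argument of range, is an extra shown only for comparison.

```python
loop_result = []
for number in range(0, 1000):
    if number % 2 == 0:
        loop_result.append(number)

comprehension_result = [number for number in range(0, 1000) if number % 2 == 0]

# For this particular filter, range can also do the stepping itself.
stepped_result = list(range(0, 1000, 2))

print(loop_result == comprehension_result == stepped_result)
print(len(comprehension_result), comprehension_result[:3], comprehension_result[-1])
```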


17. Numpy

(1) import package

import numpy as np


18. Creating arrays (from tuples and lists)

(1) create a list and convert it to a numpy array

mylist = [1, 2, 3]

x = np.array(mylist)

x

->array([1, 2, 3])

(2) just pass in a list directly

y = np.array([4, 5, 6])

y

->array([4, 5, 6])

(3) pass in a list of lists to create a multidimensional array

m = np.array([[7, 8, 9], [10, 11, 12]])

m

->

array([[ 7, 8, 9],

      [10, 11, 12]])

(4) use the shape attribute to find the dimensions of the array

m.shape

->(2, 3)

(5) arange returns evenly spaced values within a given interval

n = np.arange(0, 30, 2)

n

->array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

(6) reshape returns an array with the same data with a new shape

n = n.reshape(3, 5)

n

->

array([[ 0, 2, 4, 6, 8],

      [10, 12, 14, 16, 18],

      [20, 22, 24, 26, 28]])

(7) linspace returns evenly spaced numbers over a specified interval

o = np.linspace(0, 4, 9)

o

->array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

(8) resize changes the shape and size of the array in-place

o.resize(3, 3)

o

->

array([[ 0. , 0.5, 1. ],

      [ 1.5,  2. ,  2.5],

      [ 3. ,  3.5,  4. ]])

(9) ones returns a new array of given shape and type, filled with ones

np.ones((3, 2))

->

array([[ 1., 1.],

      [ 1.,  1.],

      [ 1.,  1.]])

(10) zeros returns a new array of given shape and type, filled with zeros

np.zeros((2,3))

->

array([[ 0., 0., 0.],

      [ 0.,  0.,  0.]])

(11) eye returns a 2D array with ones on the diagonal and zeros elsewhere

np.eye(3)

->

array([[ 1., 0., 0.],

      [ 0.,  1.,  0.],

      [ 0.,  0.,  1.]])

(12) diag extracts a diagonal or constructs a diagonal array

np.diag(y)

->

array([[4, 0, 0],

      [0, 5, 0],

      [0, 0, 6]])

(13)creating an array using repeating list

np.array([1, 2, 3]*3)

->array([1, 2, 3, 1, 2, 3, 1, 2, 3])

(14) repeat elements of an array using repeat

np.repeat([1, 2, 3], 3)

->array([1, 1, 1, 2, 2, 2, 3, 3, 3])

(15) combine arrays

p = np.ones([2, 3], int)

p

->

array([[1, 1, 1],

      [1, 1, 1]])

(16) use vstack to stack arrays in sequence vertically (row wise).

np.vstack([p, 2*p])

->

array([[1, 1, 1],

      [1, 1, 1],

      [2, 2, 2],

      [2, 2, 2]])

(17) use hstack to stack arrays in sequence horizontally (column wise).

np.hstack([p, 2*p])

->

array([[1, 1, 1, 2, 2, 2],

      [1, 1, 1, 2, 2, 2]])
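Two pairs of functions in this section are easy to mix up: arange versus linspace, and reshape versus resize. The sketch below contrasts them under the same arguments used above; the variable names stepped, spaced, and grid are illustrative.

```python
import numpy as np

stepped = np.arange(0, 30, 2)        # fixed step of 2; the stop value 30 is excluded
spaced = np.linspace(0, 4, 9)        # fixed count of 9 points; both endpoints included

grid = stepped.reshape(3, 5)         # returns a reshaped array; stepped keeps its shape
spaced.resize(3, 3)                  # changes spaced itself and returns None

print(stepped.shape, grid.shape, spaced.shape)
print(stepped[-1], spaced[-1, -1])
```

In short, reshape needs its result assigned to a name, while resize does not.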


19. Operations

(1) element wise + - * /

print(x+y)

print(x-y)

->

[5 7 9]

[-3 -3 -3]

print(x*y)

print(x/y)

->

[ 4 10 18]

[ 0.25  0.4  0.5 ]

print(x**2)

->[1 4 9]

(2) Dot Product

x.dot(y) # x1y1+x2y2+x3y3

->32
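The comment x1y1+x2y2+x3y3 can be checked directly. This sketch recreates x and y from section 18 and computes the dot product three ways; elementwise_sum and manual are illustrative names.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

dot_method = x.dot(y)                                  # built-in dot product
elementwise_sum = (x * y).sum()                        # multiply element wise, then sum
manual = sum(int(a) * int(b) for a, b in zip(x, y))    # 1*4 + 2*5 + 3*6

print(dot_method, elementwise_sum, manual)
```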

(3) build a 2D array from y and y**2; len returns the number of rows

z = np.array([y, y**2])

print(z)

print(len(z)) #number of rows of array

->

[[ 4 5 6]

[16 25 36]]

2

(4) transpose array

z

->

[[ 4 5 6]

[16 25 36]]

z.T

->

array([[ 4, 16],

      [ 5, 25],

      [ 6, 36]])

(5) use .dtype to see the data type of the elements in the array

z.dtype

->dtype('int64')

(6) use .astype to cast to a specific type 

z = z.astype('f')

z.dtype

->dtype('float32')

(7) math functions 

a = np.array([-4, -2, 1, 3, 5])

a.sum()

->3

a.max()

->5

a.min()

->-4

a.mean()

->0.6

a.std()

->3.2619012860600183

a.argmax()

->4

a.argmin()

->0

(8) indexing / slicing

s = np.arange(13)**2

s

->array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144])

(9)use bracket notation to get the value at a specific index

s[0], s[4], s[-1]

->(0, 16, 144)

(10) use : to indicate a range: array[start:stop]

s[1:5]

->array([ 1, 4, 9, 16])

(11) use negatives to count from the back

s[-4:]

->array([ 81, 100, 121, 144])

(12) A second : can be used to indicate the step size: array[start:stop:stepsize]

Here we start at the 5th element from the end and count backwards by 2 until the beginning of the array is reached.

s[-5::-2]

->array([64, 36, 16, 4, 0])

(13) look at the multidimensional array

r = np.arange(36)

r.resize((6,6))

r

->

array([[ 0, 1, 2, 3, 4, 5],

      [ 6,  7,  8,  9, 10, 11],

      [12, 13, 14, 15, 16, 17],

      [18, 19, 20, 21, 22, 23],

      [24, 25, 26, 27, 28, 29],

      [30, 31, 32, 33, 34, 35]])

(14) use bracket notation to index a single element: array[row, column]

r[2, 2]

->14

(15) use : to select a range of rows or columns

r[3, 3:6]

->array([21, 22, 23])

(16) select all the rows before row 2 (rows 0 and 1) and all the columns except the last one.

r[:2, :-1]

->

array([[ 0, 1, 2, 3, 4],

      [ 6,  7,  8,  9, 10]])

(17) a slice of last row, only every other element

r[-1, ::2]

->array([30, 32, 34])

(18) perform conditional indexing.

r[r > 30]

->array([31, 32, 33, 34, 35])

(19) assigning all values in the array that are greater than 30 to the value of 30

r[r > 30] = 30

r

->

array([[ 0, 1, 2, 3, 4, 5],

      [ 6,  7,  8,  9, 10, 11],

      [12, 13, 14, 15, 16, 17],

      [18, 19, 20, 21, 22, 23],

      [24, 25, 26, 27, 28, 29],

      [30, 30, 30, 30, 30, 30]])

(20) copy and modify arrays: a slice is a view of the original array, not a copy

r2 = r[:3, :3]

r2

->

array([[ 0, 1, 2],

      [ 6,  7,  8],

      [12, 13, 14]])

(21) set this slice's values to zero ([:] selects the entire array)

r2[:] = 0

r2

->

array([[0, 0, 0],

      [0, 0, 0],

      [0, 0, 0]])

(22) r has also been changed

r

->

array([[ 0, 0, 0, 3, 4, 5],

      [ 0,  0,  0,  9, 10, 11],

      [ 0,  0,  0, 15, 16, 17],

      [18, 19, 20, 21, 22, 23],

      [24, 25, 26, 27, 28, 29],

      [30, 30, 30, 30, 30, 30]])

(23) to avoid this, use .copy()

r_copy = r.copy()

r_copy

->

array([[ 0, 0, 0, 3, 4, 5],

      [ 0,  0,  0,  9, 10, 11],

      [ 0,  0,  0, 15, 16, 17],

      [18, 19, 20, 21, 22, 23],

      [24, 25, 26, 27, 28, 29],

      [30, 30, 30, 30, 30, 30]])

(24) now when r_copy is modified, r will not be changed

r_copy[:] =10

print(r_copy, '\n')

print(r)

->

[[10 10 10 10 10 10]

[10 10 10 10 10 10]

[10 10 10 10 10 10]

[10 10 10 10 10 10]

[10 10 10 10 10 10]

[10 10 10 10 10 10]]


[[ 0  0  0  3  4  5]

[ 0  0  0  9 10 11]

[ 0  0  0 15 16 17]

[18 19 20 21 22 23]

[24 25 26 27 28 29]

[30 30 30 30 30 30]]
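Steps (20) to (24) can be condensed into one check. The sketch below uses np.shares_memory to show that a slice shares its data with the original array while .copy() does not. The names view and independent are illustrative, and r is rebuilt here so the example runs on its own.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]                     # slicing: same underlying data as r
independent = r[:3, :3].copy()       # .copy(): separate data

print(np.shares_memory(r, view))
print(np.shares_memory(r, independent))

view[:] = 0                          # also zeroes the top-left block of r
independent[:] = -1                  # leaves r untouched

print(r[:3, :4])
```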

(25) create a new 4*3 array of random numbers 0-9

test = np.random.randint(0, 10, (4,3))

test

->

array([[1, 8, 2],

      [6, 1, 5],

      [7, 8, 0],

      [7, 6, 2]])

(26) iterate by row

for row in test:

    print(row)

->

[1 8 2] 

[6 1 5]

[7 8 0]

[7 6 2]

(27) iterate by index

for i in range(len(test)):

    print(test[i])

->

[1 8 2]

[6 1 5]

[7 8 0]

[7 6 2]

(28) iterate by row and index

for i, row in enumerate(test):

        print('row', i, 'is', row)

->

row 0 is [1 8 2]

row 1 is [6 1 5]

row 2 is [7 8 0]

row 3 is [7 6 2]

(29) use zip to iterate over multiple iterables

test2 = test**2

test2

->

array([[ 1, 64, 4],

      [36,  1, 25],

      [49, 64,  0],

      [49, 36,  4]])


for i, j in zip(test, test2):

        print(i, '+', j, '=', i+j)

->

[1 8 2] + [ 1 64 4] = [ 2 72 6]

[6 1 5] + [36  1 25] = [42  2 30]

[7 8 0] + [49 64  0] = [56 72  0]

[7 6 2] + [49 36  4] = [56 42  6]
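zip is not specific to NumPy arrays; it works on any iterables and stops at the shortest one. A minimal sketch with plain lists, where the car names and mileage figures are made-up sample values:

```python
names = ['audi', 'ford', 'honda']    # hypothetical sample data
city = [18, 15, 24]
highway = [29, 21]                   # deliberately one element short

pairs = list(zip(names, city))
triples = list(zip(names, city, highway))   # stops after 2 items

print(pairs)
print(triples)
```

The silent truncation in triples is worth remembering when the inputs may differ in length.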
