The Data Analysis Process

1. Load the data

Read the data from the following three files and print the first row of each:

enrollments.csv
daily_engagement.csv
project_submissions.csv

import unicodecsv

def read_csv(filename):
    with open(filename,'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)

enrollments = read_csv('enrollments.csv')
daily_engagement = read_csv('daily_engagement.csv')
project_submissions = read_csv('project_submissions.csv')

print enrollments[0]
print daily_engagement[0]
print project_submissions[0]
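
The notes use Python 2 with the third-party unicodecsv package. As a rough sketch of the same idea on Python 3, the standard csv module can do the job. The sample text and the helper name read_csv_text below are made up for illustration; they are not part of the course material.

```python
import csv
import io

def read_csv_text(text):
    """Parse CSV text into a list of dicts, one dict per data row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

# Hypothetical two-row sample standing in for enrollments.csv
sample = (
    "account_key,status,join_date\n"
    "448,canceled,2014-11-10\n"
    "700,current,2015-01-02\n"
)

rows = read_csv_text(sample)
first_row = rows[0]
print(first_row["account_key"], first_row["join_date"])
```

For a real file, the same DictReader can be wrapped around open(filename, newline='') instead of io.StringIO. Every value comes back as a string, which is why the next step converts the types.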

2. Fix the data types

A function that converts a string to a datetime

from datetime import datetime as dt
def parse_date(date):
    if date =='':
        return None
    else:
        return dt.strptime(date,'%Y-%m-%d')

A function that converts a string to an integer

def parse_maybe_int(i):
    if i =='':
        return None
    else:
        return int(i)

Convert the data types in enrollments

for enrollment in enrollments:
    enrollment['join_date'] = parse_date(enrollment['join_date'])
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
    enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])
    enrollment['is_udacity'] = enrollment['is_udacity'] =='True'
    enrollment['is_canceled'] = enrollment['is_canceled'] =='True'

Convert the data types in daily_engagement

for engagement_record in daily_engagement:
    engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])
    engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))
    engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
    engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))
    engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))

Convert the data types in project_submissions

for project_submission in project_submissions:
    project_submission['creation_date'] = parse_date(project_submission['creation_date'])
    project_submission['completion_date'] = parse_date(project_submission['completion_date'])
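
To see what these conversions do to a single record, here is a small sketch on a made-up row (the values are invented, and the code is written for Python 3). It also shows why the engagement counts go through float first, presumably because they are stored as strings such as '2.0'.

```python
from datetime import datetime

def parse_date(date):
    """Return a datetime for a 'YYYY-MM-DD' string, or None for an empty string."""
    if date == '':
        return None
    return datetime.strptime(date, '%Y-%m-%d')

def parse_maybe_int(value):
    """Return an int, or None for an empty string."""
    if value == '':
        return None
    return int(value)

# Made-up raw row: every value is a string straight out of the CSV reader
raw = {'join_date': '2014-11-10', 'cancel_date': '', 'days_to_cancel': '',
       'is_canceled': 'False', 'num_courses_visited': '2.0'}

clean = {
    'join_date': parse_date(raw['join_date']),
    'cancel_date': parse_date(raw['cancel_date']),
    'days_to_cancel': parse_maybe_int(raw['days_to_cancel']),
    'is_canceled': raw['is_canceled'] == 'True',
    # int('2.0') raises ValueError, so convert to float first
    'num_courses_visited': int(float(raw['num_courses_visited'])),
}
print(clean)
```

Empty strings become None rather than 0, so a missing cancel date stays distinguishable from a real value.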

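3. Find the total number of rows and the number of unique students in each CSV

For each of the three files you loaded, find the total number of rows and the number of unique students.

Total number of rows in the three files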

enrollment_num_rows = len(enrollments)             #1640
engagement_num_rows = len(daily_engagement)    #136240
submission_num_rows = len(project_submissions)   #3642

Number of unique students in enrollments

unique_enrolled_students = set()  
for enrollment in enrollments:
    unique_enrolled_students.add(enrollment['account_key'])
enrollment_num_unique_students=len(unique_enrolled_students)

Number of unique students in daily_engagement

unique_engaged_students = set()  
for engagement_record in daily_engagement:
    unique_engaged_students.add(engagement_record['acct'])
engagement_num_unique_students =len(unique_engaged_students)

Number of unique students in project_submissions

unique_submission_students = set()
for project_submission in project_submissions:
    unique_submission_students.add(project_submission['account_key'])
submission_num_unique_students = len(unique_submission_students)

4. Problems in the data

  1. There are more unique students in the enrollment table than in the engagement table.
  2. The column is named account_key in two tables and acct in the third.
    Fix: rename the acct column to account_key in the daily_engagement table.

for engagement_record in daily_engagement:
    engagement_record['account_key'] = engagement_record['acct']
    del engagement_record['acct']

5. Write a function that finds the unique students in each of the three CSV files

def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students
unique_enrolled_students = get_unique_students(enrollments)
unique_engaged_students = get_unique_students(daily_engagement)
unique_project_submissions =get_unique_students(project_submissions)
print len(unique_enrolled_students)             #1302
print len(unique_engaged_students)              #1237
print len(unique_project_submissions)           #743
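
The same function can also be written as a set comprehension. This is only an equivalent sketch on made-up rows, not a change to the approach above.

```python
def get_unique_students(data):
    """Return the set of account keys that appear in a list of row dicts."""
    return {data_point['account_key'] for data_point in data}

# Made-up rows: account '448' appears twice, as a student with two enrollments would
sample_rows = [
    {'account_key': '448'},
    {'account_key': '448'},
    {'account_key': '700'},
]

unique_students = get_unique_students(sample_rows)
print(len(sample_rows), len(unique_students))
```

The row count and the unique count differ whenever one student has several rows, which is exactly the gap between the row totals in step 3 and the counts printed here.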

6. Missing engagement records

Investigate the first problem:
why are students missing from daily_engagement?
1. Identify surprising data points
-- any enrollment record with no corresponding engagement data
2. Print out one or a few surprising data points

for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_engaged_students:
        print enrollment
        break

Conclusion: in the record that is printed, join_date equals cancel_date and days_to_cancel is 0.

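7. Checking for more problem records

Investigating data problems
1. Identify surprising data points
2. Print out one or a few surprising data points
3. Fix any problems you find
-- more investigation may be necessary
-- or there might not be a problem

Above, we found that some students canceled within a day of enrolling. That is not really a problem, and it explains why the engagement table has no records for them. In later analysis we may want to exclude such students, or at least keep them in mind so that they do not cause edge-case bugs in the code.
Next, count the enrollment records whose student never appears in the engagement table even though the student did not cancel on the day of joining.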

num_problem_students = 0
for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_engaged_students and enrollment['days_to_cancel'] != 0:
        num_problem_students +=1
        print enrollment
print num_problem_students             #3

Printing these records shows that all three belong to Udacity test accounts, which are not guaranteed to appear in the daily_engagement table. That resolves the discrepancy.

8. Excluding Udacity test accounts

Find the test accounts in enrollments

udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])
len(udacity_test_accounts)                #6

Write a function that removes all data belonging to test accounts

def remove_udacity_accounts(data):
    non_udacity_accounts = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_accounts.append(data_point)
    return non_udacity_accounts

Call the function on all three tables and check how many records remain in each

non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagement = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)

print len(non_udacity_enrollments) #1622
print len(non_udacity_engagement)  #135656
print len(non_udacity_submissions) #3634

9. Refining the question

Only look at engagement from the first week, and exclude students who canceled within a week.
Create a dictionary of students who either:

  • haven't canceled yet (days_to_cancel is None)
  • stayed enrolled more than 7 days (days_to_cancel > 7)

key: account_key, value: enrollment date
name: paid_students

paid_students = {}
for enrollment in non_udacity_enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] >7:
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        paid_students[account_key] = enrollment_date
len(paid_students)                 #995

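Because the same student can enroll more than once, the enrollment_date saved above is an arbitrary one of that student's enrollment dates. We should keep the most recent enrollment date instead, so the code changes as follows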

paid_students = {}
for enrollment in non_udacity_enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] >7:
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        
        if account_key not in paid_students or enrollment_date>paid_students[account_key]:
            paid_students[account_key] = enrollment_date
len(paid_students)
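
A tiny worked example makes the effect of the extra check visible. The three enrollment records below are invented for illustration, and the code is written for Python 3.

```python
from datetime import datetime

# Made-up enrollments: student 'A' enrolled twice, student 'B' canceled after 3 days
sample_enrollments = [
    {'account_key': 'A', 'join_date': datetime(2015, 3, 1),
     'is_canceled': False, 'days_to_cancel': None},
    {'account_key': 'A', 'join_date': datetime(2015, 1, 1),
     'is_canceled': True, 'days_to_cancel': 30},
    {'account_key': 'B', 'join_date': datetime(2015, 2, 1),
     'is_canceled': True, 'days_to_cancel': 3},
]

paid_students_sample = {}
for enrollment in sample_enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        # Keep only the latest join date seen so far for this account
        if (account_key not in paid_students_sample
                or enrollment_date > paid_students_sample[account_key]):
            paid_students_sample[account_key] = enrollment_date

print(paid_students_sample)
```

Without the date comparison, the January enrollment would overwrite the March one simply because it comes later in the list. Student 'B' is left out because the cancellation happened within 7 days.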

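10. Getting data from the first week

Find the engagement records of students in paid_students whose utc_date is within one week of the enrollment date,
and store them in the list paid_engagement_in_first_week.

A function that checks whether two dates are less than one week apart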

def within_one_week(join_date,engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days <7

Remove students who canceled during the free trial

def remove_free_trial_cancels(data):
    new_data = []
    for data_point in data:
        if data_point['account_key'] in paid_students:
            new_data.append(data_point)
    return new_data

Call the function on the three datasets that exclude Udacity test accounts to get the paid enrollments, paid engagement, and paid submissions

paid_enrollments = remove_free_trial_cancels(non_udacity_enrollments)
paid_engagement = remove_free_trial_cancels(non_udacity_engagement)
paid_submissions = remove_free_trial_cancels(non_udacity_submissions)

print len(paid_enrollments)        #1293
print len(paid_engagement)       #134549
print len(paid_submissions)       #3618

Get the paid engagement records from the first week

paid_engagement_in_first_week = []
for engagement_record in paid_engagement:
    account_key = engagement_record['account_key']
    join_date = paid_students[account_key]
    engagement_record_date = engagement_record['utc_date']
    
    if within_one_week(join_date,engagement_record_date):
        paid_engagement_in_first_week.append(engagement_record)
len(paid_engagement_in_first_week)               #21580

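11. Exploring student engagement

Explore the average time students spent in the classroom during their first week.
1. Group the engagement records so that each group holds all the records for one student

from collections import defaultdict  # looking up a missing key in a defaultdict(list) yields an empty list

engagement_by_account = defaultdict(list)
for engagement_record in paid_engagement_in_first_week:
    account_key = engagement_record['account_key']
    engagement_by_account[account_key].append(engagement_record)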

2. Add up the minutes for each student

total_minutes_by_account = {}
for account_key,engagement_for_student in engagement_by_account.items():
    total_minutes = 0
    for engagement_record in engagement_for_student:
        total_minutes += engagement_record['total_minutes_visited']
    total_minutes_by_account[account_key]=total_minutes

3. Compute the mean of these totals

import numpy as np

total_minutes = total_minutes_by_account.values()
print 'mean:',np.mean(total_minutes)   #647.590173826
print 'standard deviation:',np.std(total_minutes)  #1129.27121042
print 'Maximum:',np.max(total_minutes)  #10568.1008673
print 'Minimum:',np.min(total_minutes)  #0.0

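12. Debugging data analysis code

The maximum is greater than the total number of minutes in a week (10,080).
The data for the student with the most minutes looks wrong, so first find that student.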

student_with_max_minutes = None
max_minutes = 0

for student,total_minutes in total_minutes_by_account.items():
    if total_minutes > max_minutes:
        student_with_max_minutes = student
        max_minutes = total_minutes
max_minutes

Print every engagement record for this student

for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] == student_with_max_minutes:
        print engagement_record

This prints more than 7 records, and some of them fall outside the first week.
So the problem is in within_one_week: it never checks that engagement_date is on or after join_date.

def within_one_week(join_date,engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days <7 and time_delta.days >= 0

After fixing within_one_week and rerunning the analysis, the maximum is about 3564 minutes, which is plausible.
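
The bug is easy to reproduce with two dates. In this sketch the dates are invented and the name within_one_week_buggy is used only to keep both versions side by side (Python 3).

```python
from datetime import datetime

def within_one_week_buggy(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7

def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return 0 <= time_delta.days < 7

join_date = datetime(2015, 3, 1)          # most recent enrollment
earlier_visit = datetime(2015, 1, 15)     # engagement from a previous enrollment
first_week_visit = datetime(2015, 3, 4)
later_visit = datetime(2015, 3, 8)        # exactly 7 days after joining

# A negative day count is also "less than 7", so the buggy version accepts it
print(within_one_week_buggy(join_date, earlier_visit))
print(within_one_week(join_date, earlier_visit))
```

Because paid_students keeps the most recent join date, every engagement record from an earlier enrollment has a negative time delta, and the original check let all of them through.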

13. Lessons completed in the first week

total_lessons_by_account = {}
for account_key,engagement_for_student in engagement_by_account.items():
    total_lessons = 0
    for engagement_record in engagement_for_student:
        total_lessons += engagement_record['lessons_completed']
    total_lessons_by_account[account_key]=total_lessons

total_lessons = total_lessons_by_account.values()
print 'mean:',np.mean(total_lessons)
print 'standard deviation:',np.std(total_lessons)
print 'Maximum:',np.max(total_lessons)
print 'Minimum:',np.min(total_lessons)

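Rather than copying the loop for each metric, the same steps can be factored into functions.
1. A function that groups records by account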

from collections import defaultdict
def group_data(data,key_name):
    grouped_data = defaultdict(list)

    for data_point in data:
        key = data_point[key_name]
        grouped_data[key].append(data_point)
    return grouped_data
engagement_by_account = group_data(paid_engagement_in_first_week,'account_key')

2. A function that sums the values of one field for each account

def sum_grouped_items(grouped_data,field_name):
    summed_data = {}
    for key,data_points in grouped_data.items():
        total = 0
        for data_point in data_points:
            total += data_point[field_name]
        summed_data[key]=total
    return summed_data
total_minutes_by_account=sum_grouped_items(engagement_by_account,'total_minutes_visited')

3. A function that prints summary statistics

def describe_data(data):
    print 'mean:',np.mean(data)
    print 'standard deviation:',np.std(data)
    print 'Maximum:',np.max(data)
    print 'Minimum:',np.min(data)
describe_data(total_minutes_by_account.values())

Lessons completed in the first week

total_lessons_by_account=sum_grouped_items(engagement_by_account,'lessons_completed')
describe_data(total_lessons_by_account.values())
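
To check that the two helpers fit together, here is a compact run on three made-up records. The functions mirror the ones above, with sum_grouped_items condensed into a dict comprehension; the data is invented (Python 3).

```python
from collections import defaultdict

def group_data(data, key_name):
    """Group row dicts into lists keyed by the value of key_name."""
    grouped_data = defaultdict(list)
    for data_point in data:
        grouped_data[data_point[key_name]].append(data_point)
    return grouped_data

def sum_grouped_items(grouped_data, field_name):
    """Sum one numeric field over each group."""
    return {key: sum(point[field_name] for point in points)
            for key, points in grouped_data.items()}

# Made-up first-week engagement records for two students
records = [
    {'account_key': 'A', 'total_minutes_visited': 30.0, 'lessons_completed': 1},
    {'account_key': 'A', 'total_minutes_visited': 45.5, 'lessons_completed': 0},
    {'account_key': 'B', 'total_minutes_visited': 0.0, 'lessons_completed': 0},
]

by_account = group_data(records, 'account_key')
minutes = sum_grouped_items(by_account, 'total_minutes_visited')
lessons = sum_grouped_items(by_account, 'lessons_completed')
print(minutes, lessons)
```

Switching metrics is now a matter of changing one field name, which is what the lessons_completed call above relies on.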

14. Number of days each student visited the classroom

Create a has_visited field in the data

for engagement_record in paid_engagement:
    if engagement_record['num_courses_visited']>0:
        engagement_record['has_visited'] =1
    else:
        engagement_record['has_visited']=0

Compute the total number of days each student visited

days_visited_by_account = sum_grouped_items(engagement_by_account,'has_visited')
describe_data(days_visited_by_account.values())

15. Splitting out passing students

Create the set of students who passed the project

subway_lessons_key = ['746169184','3176718735']
pass_project_students = set()
for project_submission in paid_submissions:
    lesson_key = project_submission['lesson_key']
    assigned_rating = project_submission['assigned_rating']
    if lesson_key in subway_lessons_key and (assigned_rating =='PASSED' or assigned_rating =='DISTINCTION'):
            pass_project_students.add(project_submission['account_key'])
len(pass_project_students)  #647

Split the engagement records into those of students who passed the project and those of students who did not

passing_engagement = []
non_passing_engagement = []

for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] in pass_project_students:
        passing_engagement.append(engagement_record)
    else:
        non_passing_engagement.append(engagement_record)
         
print len(passing_engagement)   #4527
print len(non_passing_engagement)  #2392

16. Comparing the two groups of students

Metrics:
total_minutes_visited
lessons_completed
has_visited
Group the records of both sets of students by account_key

passing_engagement_by_account = group_data(passing_engagement,'account_key')

non_passing_engagement_by_account = group_data(non_passing_engagement,'account_key')

Comparison on total_minutes_visited

passing_minutes = sum_grouped_items(passing_engagement_by_account,'total_minutes_visited')
non_passing_minutes = sum_grouped_items(non_passing_engagement_by_account,'total_minutes_visited')

print "passing engagement"
describe_data(passing_minutes.values())

print "non passing engagement"
describe_data(non_passing_minutes.values())

Comparison on lessons_completed

passing_lessons = sum_grouped_items(passing_engagement_by_account,'lessons_completed')
non_passing_lessons = sum_grouped_items(non_passing_engagement_by_account,'lessons_completed')
print "passing engagement"
describe_data(passing_lessons.values())

print "non passing engagement"
describe_data(non_passing_lessons.values())

Comparison on has_visited

passing_visits = sum_grouped_items(passing_engagement_by_account,'has_visited')
non_passing_visits = sum_grouped_items(non_passing_engagement_by_account,'has_visited')
print "passing engagement"
describe_data(passing_visits.values())

print "non passing engagement"
describe_data(non_passing_visits.values())

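17. Making histograms

Visualizing the data

You already know the mean, standard deviation, maximum, and minimum of each metric, but there is more to say about each one. Are more values close to the minimum or to the maximum? What is the median? And so on.

Visualizing the data with a histogram is more useful here than printing more statistics.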
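
If you do want a couple of extra numbers alongside the histogram, NumPy can report the median and quartiles directly. The minutes below are invented to mimic a skewed distribution; this is a side sketch, not part of the course solution.

```python
import numpy as np

# Made-up weekly minutes: most students are low, one is very high
minutes = [0.0, 0.0, 12.5, 40.0, 95.0, 300.0, 1200.0]

mean_minutes = np.mean(minutes)
median_minutes = np.median(minutes)
lower_quartile, upper_quartile = np.percentile(minutes, [25, 75])

print('mean:', mean_minutes)
print('median:', median_minutes)
print('quartiles:', lower_quartile, upper_quartile)
```

When the mean sits far above the median, as it does for this sample, a few large values are pulling it up, which is the kind of shape a histogram shows at a glance.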

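Making a histogram in Python

To make a histogram in Python, you can use the matplotlib library, which ships with Anaconda. The following code makes a histogram from an example list of data points called data.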

data = [1, 2, 1, 3, 3, 1, 4, 2]

%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(data)

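The line %matplotlib inline is specific to the IPython notebook and makes plots appear in the notebook rather than in a new window. If you are not using an IPython notebook, leave this line out and add plt.show() at the bottom instead, so that the plot appears in a new window.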

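Making histograms of the student data

We used three metrics when comparing students who passed the subway project with those who did not. Now make a histogram of each metric for each group, which means six histograms in total. Do the histograms for passing and non-passing students look very different?

describe_data() now prints the summary statistics and also draws a histogram:

%pylab inline
from matplotlib import pyplot as plt

def describe_data(data):
    print 'mean:',np.mean(data)
    print 'standard deviation:',np.std(data)
    print 'Maximum:',np.max(data)
    print 'Minimum:',np.min(data)
    plt.hist(data)
    plt.show()  # finish this figure so that consecutive calls do not draw on the same axes

The six histograms:

describe_data(passing_minutes.values())
describe_data(non_passing_minutes.values())
describe_data(passing_lessons.values())
describe_data(non_passing_lessons.values())
describe_data(passing_visits.values())
describe_data(non_passing_visits.values())

Adjusting the number of bins

To change the number of bins in each histogram, pass the bins argument to the hist function. The matplotlib documentation for hist describes the function and its parameters.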
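
To see what the bin count changes without drawing anything, the counts can be inspected with numpy.histogram, which takes a bins argument with the same meaning. This sketch reuses the example list from above.

```python
import numpy as np

data = [1, 2, 1, 3, 3, 1, 4, 2]

# Four equal-width bins between the smallest and largest value
counts, edges = np.histogram(data, bins=4)
print(counts)
print(edges)

# The default of 10 bins spreads the same eight points much more thinly
default_counts, default_edges = np.histogram(data)
print(default_counts)
```

With only four distinct values, four bins give one bar per value, while the default leaves several bins empty. For the days-visited metric, which ranges from 0 to 7, the same reasoning suggests 8 bins.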

18. Are your results just noise?

Tentative conclusion:
students who pass the subway project spend more minutes in the classroom during their first week.
But is this a true difference, or is it due to noise in the data?
You can check this using statistics.
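
The notes do not say which statistical check to use. One simple option, sketched here as an assumption rather than as the course's method, is a permutation test: shuffle the group labels many times and count how often a difference in means at least as large as the observed one shows up by chance. The minutes are made up, and the function name permutation_test is hypothetical.

```python
import random

def mean(values):
    return sum(values) / float(len(values))

def permutation_test(group_a, group_b, num_rounds=10000, seed=0):
    """Estimate how often shuffled labels give a mean difference
    at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    hits = 0
    for _ in range(num_rounds):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:size_a]) - mean(pooled[size_a:]))
        if diff >= observed:
            hits += 1
    return hits / float(num_rounds)

# Made-up first-week minutes for a handful of students in each group
passing_sample = [400, 350, 520, 610, 280, 450]
non_passing_sample = [120, 90, 200, 60, 310, 150]

p_value = permutation_test(passing_sample, non_passing_sample)
print('estimated p-value:', p_value)
```

A small estimate means the gap is unlikely to come from random labelling alone. It says nothing yet about why the gap exists, which is the topic of the next section.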

19. Correlation does not imply causation

Correlation does not imply causation.
Correlation:
students who pass the first project are more likely to visit the classroom multiple times in their first week.
Causation:
does visiting the classroom multiple times cause students to pass their project?

Third factors that could cause both visiting the classroom and passing projects:
-- level of interest
-- background knowledge

Or this correlation could be because of causation.
We just don't know.

To find out, run an A/B test.

20. Predicting based on many features

Which students are likely to pass their first project?
You could take a first pass using heuristics, but getting a really good prediction this way could be difficult:
-- lots of different pieces of information to look at
-- these features can interact
Machine learning can make predictions automatically.

21. Communication

What findings are most interesting?
-- difference in total minutes
-- difference in days visited

How will you present them?
-- report average minutes
-- show histograms (polish any visualizations)

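22. Improving plots and sharing findings

Adding labels and titles

In matplotlib, you can add axis labels with plt.xlabel("Label for x axis") and plt.ylabel("Label for y axis"). A histogram usually only needs an x-axis label, but other kinds of plots may need a y-axis label as well. You can also add a title with plt.title("Title of plot").

Making plots look nicer with seaborn

You can use the seaborn library to restyle matplotlib plots automatically. It is not included in Anaconda by default, but Anaconda's package manager makes it easy to add new libraries. The package manager is called conda. To use it, open the Command Prompt (on a PC) or a terminal (on Mac or Linux) and type the command conda install seaborn

If you installed Python some other way than Anaconda, your package manager may be different. The most common ones are pip and easy_install, which you can use with the commands pip install seaborn and easy_install seaborn respectively.

Once seaborn is installed, you can import it anywhere in your code with import seaborn as sns. Plots you create after that are restyled automatically (with seaborn 0.8 or later you also need to call sns.set(), as noted at the end of this section). Give it a try!

The seaborn package also includes extra functions for making complex plots that would be hard to draw in matplotlib. We won't cover them in this course, but if you want to know which functions seaborn offers, you can consult its documentation.

Passing extra arguments to plots

You will also frequently pass extra arguments to a plot to adjust how it looks. The documentation page for the hist function lists the available arguments. One common argument is bins, which sets the number of bins the histogram uses. For example, plt.hist(data, bins=20) makes sure the histogram has 20 bins.

Improving one of your plots

Use these techniques to improve at least one of the plots you made earlier.

Sharing your findings

Finally, decide which finding from this lesson you most want to communicate to others.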

Solution code

import seaborn as sns
sns.set(color_codes=True)

plt.hist(non_passing_visits.values(), bins=8)
plt.xlabel('Number of days')
plt.title('Distribution of classroom visits in the first week ' + 
          'for students who do not pass the subway project')
plt.show()  # start a new figure for the second histogram

plt.hist(passing_visits.values(), bins=8)
plt.xlabel('Number of days')
plt.title('Distribution of classroom visits in the first week ' + 
          'for students who pass the subway project')
plt.show()

Note:

If you are using seaborn 0.8 or later, importing seaborn no longer changes the plot style by itself. You also need to call the seaborn.set function to enable it (see the seaborn documentation). For example:

sns.set(color_codes=True)

23. Data analysis and related terms

Data science
-- similar to data analysis
-- more focused on building systems
-- may require more experience

Data engineering
-- more focused on data wrangling
-- involves data storage and processing

Big data
-- fuzzy term for 'a lot' of data
-- data analysts, scientists, and engineers all work with big data
