1. Reading the data
Read the data from the following three files and print the first row of each:
enrollments.csv
daily_engagement.csv
project_submissions.csv
import unicodecsv

def read_csv(filename):
    # Open in binary mode; unicodecsv takes care of decoding
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)

# File names use underscores, matching the list above
enrollments = read_csv('enrollments.csv')
daily_engagement = read_csv('daily_engagement.csv')
project_submissions = read_csv('project_submissions.csv')
print enrollments[0]
print daily_engagement[0]
print project_submissions[0]
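To see what a DictReader hands back without needing the course files, here is a minimal sketch that uses the standard-library csv module on an in-memory string. The sample row is made up for illustration, and the sketch is written for Python 3, whereas the notes use unicodecsv under Python 2.

```python
import csv
import io

# Hypothetical two-line CSV standing in for enrollments.csv
sample = io.StringIO(
    "account_key,join_date,days_to_cancel,is_canceled\n"
    "448,2014-11-10,65,True\n"
)

rows = list(csv.DictReader(sample))
first = rows[0]

# Every value comes back as a string, even numbers and booleans
print(first['account_key'])
print(type(first['days_to_cancel']))
```

Because every field arrives as a string, dates, counts and flags all need an explicit conversion before they can be compared or summed.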
2. Fixing the data types
A function that converts a string to a date:
from datetime import datetime as dt

def parse_date(date):
    if date == '':
        return None
    else:
        return dt.strptime(date, '%Y-%m-%d')

A function that converts a string to an integer:
def parse_maybe_int(i):
    if i == '':
        return None
    else:
        return int(i)

Convert the data types in enrollments:
for enrollment in enrollments:
    enrollment['join_date'] = parse_date(enrollment['join_date'])
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
    enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])
    enrollment['is_udacity'] = enrollment['is_udacity'] == 'True'
    enrollment['is_canceled'] = enrollment['is_canceled'] == 'True'

Convert the data types in daily_engagement:
for engagement_record in daily_engagement:
    engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])
    engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))
    engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
    engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))
    engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))

Convert the data types in project_submissions:
for project_submission in project_submissions:
    project_submission['creation_date'] = parse_date(project_submission['creation_date'])
    project_submission['completion_date'] = parse_date(project_submission['completion_date'])
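The two parsing helpers can be tried on a single hand-written row. This is a small sketch: the row below is hypothetical, and the '2.0' string assumes that the engagement counts are stored as decimal text, which is what the int(float(...)) calls in the notes suggest.

```python
from datetime import datetime

def parse_date(date):
    # An empty string means the value is missing
    if date == '':
        return None
    return datetime.strptime(date, '%Y-%m-%d')

def parse_maybe_int(i):
    if i == '':
        return None
    return int(i)

# Hypothetical raw row, as it might come out of the CSV reader
row = {'join_date': '2014-11-10', 'cancel_date': '',
       'days_to_cancel': '', 'is_canceled': 'False'}

row['join_date'] = parse_date(row['join_date'])
row['cancel_date'] = parse_date(row['cancel_date'])
row['days_to_cancel'] = parse_maybe_int(row['days_to_cancel'])
row['is_canceled'] = row['is_canceled'] == 'True'
print(row)

# int('2.0') raises ValueError, so go through float() first
visits = int(float('2.0'))
print(visits)
```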
3. Counting the total rows and unique students in each CSV
For each of the three files you loaded, find the total number of rows and the number of unique students.
Total rows in the three files:
enrollment_num_rows = len(enrollments)  # 1640
engagement_num_rows = len(daily_engagement)  # 136240
submission_num_rows = len(project_submissions)  # 3642

Unique students in enrollments:
unique_enrolled_students = set()
for enrollment in enrollments:
    unique_enrolled_students.add(enrollment['account_key'])
enrollment_num_unique_students = len(unique_enrolled_students)

Unique students in daily_engagement (this table names the column acct):
unique_engaged_students = set()
for engagement_record in daily_engagement:
    unique_engaged_students.add(engagement_record['acct'])
engagement_num_unique_students = len(unique_engaged_students)

Unique students in project_submissions:
unique_submission_students = set()
for project_submission in project_submissions:
    unique_submission_students.add(project_submission['account_key'])
submission_num_unique_students = len(unique_submission_students)
4. Problems in the data
- more unique students in the enrollment table than in the engagement table
- the column is named account_key in two tables and acct in the third
Fix: rename the acct column to account_key in the daily_engagement table
for engagement_record in daily_engagement:
    engagement_record['account_key'] = engagement_record['acct']
    del engagement_record['acct']
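The same rename can be done in one step with dict.pop, which removes a key and returns its value. A minimal sketch on made-up records:

```python
# Hypothetical engagement records that still use the old column name
records = [
    {'acct': '0', 'total_minutes_visited': 11.7},
    {'acct': '1', 'total_minutes_visited': 0.0},
]

for record in records:
    # pop() removes 'acct' and returns its value in a single call
    record['account_key'] = record.pop('acct')

print(records[0])
```

Note that either version raises KeyError if it is run twice on the same data, because the acct key no longer exists after the first pass.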
5. A function to find the unique students in each of the three CSV files
def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students

unique_enrolled_students = get_unique_students(enrollments)
unique_engaged_students = get_unique_students(daily_engagement)
unique_project_submissions = get_unique_students(project_submissions)
print len(unique_enrolled_students)  # 1302
print len(unique_engaged_students)  # 1237
print len(unique_project_submissions)  # 743
6. Missing engagement records
Investigate the first problem: why are students missing from daily_engagement?
1. Identify surprising data points
-- any enrollment record with no corresponding engagement data
2. Print out one or a few surprising data points
for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_engaged_students:
        print enrollment
        break
Conclusion: in the printed record, join_date equals cancel_date and days_to_cancel is 0.
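Since both groups of students are already stored as sets, the full list of enrolled students without engagement data can also be obtained with a set difference. A sketch with made-up account keys standing in for the real sets:

```python
# Hypothetical account keys standing in for the real sets
unique_enrolled_students = {'1', '2', '3', '4'}
unique_engaged_students = {'1', '3'}

# Students who enrolled but never appear in the engagement table
missing_students = unique_enrolled_students - unique_engaged_students
print(sorted(missing_students))
```

This gives the affected account keys only; looking at the full enrollment rows, as the loop above does, is still needed to see why they are missing.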
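7. Checking for more problem records
Investigating data problems:
1. Identify surprising data points
2. Print out one or a few surprising data points
3. Fix any problems you find
-- more investigation may be necessary
-- or there might not be a problem
Above, we found that some students cancelled within a day of enrolling. This is not really a problem; it explains why those students have no rows in the engagement table. In later analysis we may want to exclude such students, or at least keep them in mind to avoid edge-case bugs in the code.
Find students in the enrollment table who did not cancel within a day, yet do not appear in the engagement table:
num_problem_students = 0
for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_engaged_students and enrollment['days_to_cancel'] != 0:
        num_problem_students += 1
        print enrollment
print num_problem_students  # 3
Printing these anomalous records shows that all three are Udacity test accounts, which are not guaranteed to appear in the daily_engagement table. That resolves the question.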
8. Excluding Udacity test accounts
Find the test accounts in enrollments:
udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])
len(udacity_test_accounts)  # 6

A function that removes all data belonging to the test accounts:
def remove_udacity_accounts(data):
    non_udacity_accounts = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_accounts.append(data_point)
    return non_udacity_accounts

Call the function on the three tables and see how many records remain in each:
non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagement = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)
print len(non_udacity_enrollments)  # 1622
print len(non_udacity_engagement)  # 135656
print len(non_udacity_submissions)  # 3634
9. Refining the question
Only look at engagement from the first week, and exclude students who cancel within a week.
Create a dictionary of students who either:
- haven't canceled yet (days_to_cancel is None)
- stayed enrolled more than 7 days (days_to_cancel > 7)
key: account_key, value: enrollment date
name: paid_students
paid_students = {}
for enrollment in non_udacity_enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        paid_students[account_key] = enrollment_date
len(paid_students)  # 995
Because the same student can enroll more than once, the enrollment_date stored above is an arbitrary one of that student's enrollment dates. We should store the most recent enrollment date instead, so the code changes as follows:
paid_students = {}
for enrollment in non_udacity_enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        # Keep only the most recent enrollment date per student
        if account_key not in paid_students or enrollment_date > paid_students[account_key]:
            paid_students[account_key] = enrollment_date
len(paid_students)
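The effect of keeping the most recent date is easiest to see on a tiny data set. The sketch below uses three hypothetical enrollment rows: student 'a' enrolled twice, and student 'b' cancelled after three days.

```python
from datetime import datetime

# Hypothetical enrollments, not the course data
enrollments = [
    {'account_key': 'a', 'join_date': datetime(2015, 1, 5),
     'is_canceled': True, 'days_to_cancel': 30},
    {'account_key': 'a', 'join_date': datetime(2015, 3, 1),
     'is_canceled': False, 'days_to_cancel': None},
    {'account_key': 'b', 'join_date': datetime(2015, 2, 1),
     'is_canceled': True, 'days_to_cancel': 3},
]

paid_students = {}
for enrollment in enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
        key = enrollment['account_key']
        date = enrollment['join_date']
        # Keep only the most recent qualifying join date per student
        if key not in paid_students or date > paid_students[key]:
            paid_students[key] = date

print(paid_students)
```

The order of the two conditions matters under Python 3: comparing None with 7 raises TypeError there, and the is_canceled check short-circuits before that comparison is reached for students who have not cancelled.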
10. Getting first-week data
Find the records of students in paid_students whose engagement date (utc_date) is less than a week after enrollment_date, and store them in the list paid_engagement_in_first_week.
A function that checks whether two dates are less than a week apart:
def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7

Remove students who cancelled during the free trial:
def remove_free_trial_cancels(data):
    new_data = []
    for data_point in data:
        if data_point['account_key'] in paid_students:
            new_data.append(data_point)
    return new_data

Call the function on the three non-Udacity data sets to get the paid enrollments, paid engagement and paid submissions:
paid_enrollments = remove_free_trial_cancels(non_udacity_enrollments)
paid_engagement = remove_free_trial_cancels(non_udacity_engagement)
paid_submissions = remove_free_trial_cancels(non_udacity_submissions)
print len(paid_enrollments)  # 1293
print len(paid_engagement)  # 134549
print len(paid_submissions)  # 3618

Get the paid engagement from the first week:
paid_engagement_in_first_week = []
for engagement_record in paid_engagement:
    account_key = engagement_record['account_key']
    join_date = paid_students[account_key]
    engagement_record_date = engagement_record['utc_date']
    if within_one_week(join_date, engagement_record_date):
        paid_engagement_in_first_week.append(engagement_record)
len(paid_engagement_in_first_week)  # 21580
11. Exploring student engagement
Explore the average time students spent in the classroom during their first week.
1. Group the engagement records so that each group holds all records for one student
from collections import defaultdict

# With defaultdict(list), looking up a missing key yields an empty list
engagement_by_account = defaultdict(list)
for engagement_record in paid_engagement_in_first_week:
    account_key = engagement_record['account_key']
    engagement_by_account[account_key].append(engagement_record)

2. Add up each student's engagement time
total_minutes_by_account = {}
for account_key, engagement_for_student in engagement_by_account.items():
    total_minutes = 0
    for engagement_record in engagement_for_student:
        total_minutes += engagement_record['total_minutes_visited']
    total_minutes_by_account[account_key] = total_minutes

3. Compute the mean of those totals
import numpy as np
total_minutes = total_minutes_by_account.values()
print 'mean:', np.mean(total_minutes)  # 647.590173826
print 'standard deviation:', np.std(total_minutes)  # 1129.27121042
print 'Maximum:', np.max(total_minutes)  # 10568.1008673
print 'Minimum:', np.min(total_minutes)  # 0.0
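For readers without NumPy, the same four summary numbers can be sketched with the standard library (Python 3). One detail worth knowing: np.std computes the population standard deviation by default, so the matching standard-library function is statistics.pstdev, not statistics.stdev. The numbers below are made up.

```python
import statistics

# Hypothetical total minutes per account
total_minutes = [0.0, 30.0, 60.0, 90.0]

mean = statistics.mean(total_minutes)
# pstdev is the population standard deviation, matching np.std's default (ddof=0)
std = statistics.pstdev(total_minutes)

print(mean, std, max(total_minutes), min(total_minutes))
```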
12. Debugging data analysis code
The maximum is larger than the total number of minutes in a week (10,080).
The data for the student with the most minutes is anomalous, so first find that student:
student_with_max_minutes = None
max_minutes = 0
for student, total_minutes in total_minutes_by_account.items():
    if total_minutes > max_minutes:
        student_with_max_minutes = student
        max_minutes = total_minutes
max_minutes

Print every engagement record for this student:
for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] == student_with_max_minutes:
        print engagement_record

More than 7 records come back, and their dates do not fall within a single week.
So the within_one_week function is at fault: engagement_date must not be earlier than join_date.
def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7 and time_delta.days >= 0

After fixing within_one_week and rerunning the steps that depend on it, the maximum is 3564, which is plausible.
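The bug is easy to reproduce in isolation. In this sketch the dates are hypothetical: a student whose most recent enrollment is 1 March 2015 also has engagement from an earlier enrollment in January. The name within_one_week_buggy is used here only to keep both versions side by side.

```python
from datetime import datetime

def within_one_week_buggy(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7

def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return 0 <= time_delta.days < 7

join = datetime(2015, 3, 1)        # most recent enrollment
earlier = datetime(2015, 1, 10)    # engagement from a previous enrollment
day_three = datetime(2015, 3, 4)

# A negative difference is also "less than 7", so the old check lets it through
print(within_one_week_buggy(join, earlier))
print(within_one_week(join, earlier))
print(within_one_week(join, day_three))
```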
13. Lessons completed in the first week
total_lessons_by_account = {}
for account_key, engagement_for_student in engagement_by_account.items():
    total_lessons = 0
    for engagement_record in engagement_for_student:
        # The field is lessons_completed; there is no total_lessons_visited field
        total_lessons += engagement_record['lessons_completed']
    total_lessons_by_account[account_key] = total_lessons
total_lessons = total_lessons_by_account.values()
print 'mean:', np.mean(total_lessons)
print 'standard deviation:', np.std(total_lessons)
print 'Maximum:', np.max(total_lessons)
print 'Minimum:', np.min(total_lessons)

This repeats the earlier code, so solve the problem with functions instead.
1. A function that groups records by account
from collections import defaultdict

def group_data(data, key_name):
    grouped_data = defaultdict(list)
    for data_point in data:
        key = data_point[key_name]
        grouped_data[key].append(data_point)
    return grouped_data

engagement_by_account = group_data(paid_engagement_in_first_week, 'account_key')

2. A function that sums one field over each account's records
def sum_grouped_items(grouped_data, field_name):
    summed_data = {}
    for key, data_points in grouped_data.items():
        total = 0
        for data_point in data_points:
            total += data_point[field_name]
        summed_data[key] = total
    return summed_data

total_minutes_by_account = sum_grouped_items(engagement_by_account, 'total_minutes_visited')

3. A function that prints the summary statistics
def describe_data(data):
    print 'mean:', np.mean(data)
    print 'standard deviation:', np.std(data)
    print 'Maximum:', np.max(data)
    print 'Minimum:', np.min(data)

describe_data(total_minutes_by_account.values())

Lessons completed in the first week:
total_lessons_by_account = sum_grouped_items(engagement_by_account, 'lessons_completed')
describe_data(total_lessons_by_account.values())
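The grouping and summing helpers can be checked on a handful of made-up records before running them on the full data. In this sketch sum_grouped_items is written as a dict comprehension, which is a stylistic variation on the loop in the notes rather than a different method.

```python
from collections import defaultdict

def group_data(data, key_name):
    grouped_data = defaultdict(list)
    for data_point in data:
        grouped_data[data_point[key_name]].append(data_point)
    return grouped_data

def sum_grouped_items(grouped_data, field_name):
    # One total per key, summing the chosen field over that key's records
    return {key: sum(point[field_name] for point in points)
            for key, points in grouped_data.items()}

# Hypothetical first-week records
records = [
    {'account_key': 'a', 'total_minutes_visited': 10.0, 'lessons_completed': 1},
    {'account_key': 'a', 'total_minutes_visited': 25.5, 'lessons_completed': 0},
    {'account_key': 'b', 'total_minutes_visited': 0.0, 'lessons_completed': 0},
]

by_account = group_data(records, 'account_key')
minutes = sum_grouped_items(by_account, 'total_minutes_visited')
lessons = sum_grouped_items(by_account, 'lessons_completed')
print(minutes)
print(lessons)
```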
14. Total days each student visited the classroom
Create a has_visited field in the data:
for engagement_record in paid_engagement:
    if engagement_record['num_courses_visited'] > 0:
        engagement_record['has_visited'] = 1
    else:
        engagement_record['has_visited'] = 0

Compute each student's total days visited:
days_visited_by_account = sum_grouped_items(engagement_by_account, 'has_visited')
describe_data(days_visited_by_account.values())
15. Splitting out the passing students
Create the set of students who passed the project:
subway_lessons_key = ['746169184', '3176718735']
pass_project_students = set()
for project_submission in paid_submissions:
    lesson_key = project_submission['lesson_key']
    assigned_rating = project_submission['assigned_rating']
    if lesson_key in subway_lessons_key and (assigned_rating == 'PASSED' or assigned_rating == 'DISTINCTION'):
        pass_project_students.add(project_submission['account_key'])
len(pass_project_students)  # 647

Split the engagement records into students who passed the project and students who did not:
passing_engagement = []
non_passing_engagement = []
for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] in pass_project_students:
        passing_engagement.append(engagement_record)
    else:
        non_passing_engagement.append(engagement_record)
print len(passing_engagement)  # 4527
print len(non_passing_engagement)  # 2392
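The same split can be sketched compactly with a set comprehension and two list comprehensions (Python 3). The lesson keys are the two used in the notes; the submissions and engagement rows are invented, including the 'INCOMPLETE' rating and the lesson key '999'.

```python
subway_lessons_key = ['746169184', '3176718735']

# Hypothetical submissions and engagement records
submissions = [
    {'account_key': 'a', 'lesson_key': '746169184', 'assigned_rating': 'PASSED'},
    {'account_key': 'b', 'lesson_key': '746169184', 'assigned_rating': 'INCOMPLETE'},
    {'account_key': 'c', 'lesson_key': '999', 'assigned_rating': 'PASSED'},
]
engagement = [{'account_key': k} for k in ['a', 'a', 'b', 'c']]

passing_ratings = ('PASSED', 'DISTINCTION')
pass_project_students = {
    s['account_key'] for s in submissions
    if s['lesson_key'] in subway_lessons_key
    and s['assigned_rating'] in passing_ratings
}

passing = [r for r in engagement if r['account_key'] in pass_project_students]
non_passing = [r for r in engagement if r['account_key'] not in pass_project_students]
print(len(passing), len(non_passing))
```

Student 'c' passed a different lesson, so that student lands in the non-passing group for this project.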
16. Comparing the two groups
Metrics:
total_minutes_visited
lessons_completed
has_visited
Group the records of both groups by account_key:
passing_engagement_by_account = group_data(passing_engagement, 'account_key')
non_passing_engagement_by_account = group_data(non_passing_engagement, 'account_key')

Comparing total_minutes_visited:
passing_minutes = sum_grouped_items(passing_engagement_by_account, 'total_minutes_visited')
non_passing_minutes = sum_grouped_items(non_passing_engagement_by_account, 'total_minutes_visited')
print "passing engagement"
describe_data(passing_minutes.values())
print "non passing engagement"
describe_data(non_passing_minutes.values())

Comparing lessons_completed:
passing_lessons = sum_grouped_items(passing_engagement_by_account, 'lessons_completed')
non_passing_lessons = sum_grouped_items(non_passing_engagement_by_account, 'lessons_completed')
print "passing engagement"
describe_data(passing_lessons.values())
print "non passing engagement"
describe_data(non_passing_lessons.values())

Comparing has_visited:
passing_visits = sum_grouped_items(passing_engagement_by_account, 'has_visited')
non_passing_visits = sum_grouped_items(non_passing_engagement_by_account, 'has_visited')
print "passing engagement"
describe_data(passing_visits.values())
print "non passing engagement"
describe_data(non_passing_visits.values())
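17. Creating histograms
Visualizing the data
Although you know the mean, standard deviation, maximum and minimum of each metric, there is more about each metric worth knowing. Are more values close to the minimum or to the maximum? What is the median? And so on.
Visualizing the data with histograms is more informative here than printing more statistics.
Creating histograms in Python
To create a histogram in Python, you can use the matplotlib library that ships with Anaconda. The following code creates a histogram from an example list of data points called data.
data = [1, 2, 1, 3, 3, 1, 4, 2]
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(data)
The line %matplotlib inline is specific to IPython notebooks; it makes plots appear in the notebook rather than in a new window. If you are not using an IPython notebook, leave this line out and add plt.show() at the bottom instead, so that the plot appears in a new window.
Creating histograms of student data
We used three metrics when studying students who did and did not pass the subway project. Now create a histogram of each metric, which means 6 histograms in total. Do the histograms for passing and non-passing students look very different?
describe_data() now does two things: it prints the summary statistics and draws a histogram.
%pylab inline
from matplotlib import pyplot as plt

def describe_data(data):
    print 'mean:', np.mean(data)
    print 'standard deviation:', np.std(data)
    print 'Maximum:', np.max(data)
    print 'Minimum:', np.min(data)
    plt.hist(data)

The six histograms:
# Run each call in its own notebook cell; in a single cell the histograms are drawn on the same axes
describe_data(passing_minutes.values())
describe_data(non_passing_minutes.values())
describe_data(passing_lessons.values())
describe_data(non_passing_lessons.values())
describe_data(passing_visits.values())
describe_data(non_passing_visits.values())
Adjusting the number of bins
To change the number of bins in each histogram, pass the bins argument to the hist function. The matplotlib documentation for hist describes the function and its arguments.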
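To make the effect of the bins argument concrete, here is a rough pure-Python sketch of equal-width binning, the kind of counting a histogram performs. It is an illustration only, not matplotlib's implementation; count_bins is a hypothetical helper, and it assumes the data contains at least two distinct values.

```python
def count_bins(data, bins):
    """Count values falling into `bins` equal-width intervals."""
    low, high = min(data), max(data)
    # Assumes high > low; identical values would give a zero width
    width = (high - low) / float(bins)
    counts = [0] * bins
    for value in data:
        # The maximum value belongs to the last bin rather than a new one
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts

data = [1, 2, 1, 3, 3, 1, 4, 2]
print(count_bins(data, 3))
print(count_bins(data, 4))
```

With 3 bins the values 3 and 4 share the last bin, while 4 bins give each distinct value its own bar, so the choice of bins can change the apparent shape of the same data.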
18. Are your results just noise?
Tentative conclusion:
students who pass the subway project spend more minutes in the classroom during their first week
But is this a true difference, or is it due to noise in the data?
You can check this using statistics.
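One common way to check is a two-sample test. The notes do not say which test the course has in mind, so the following is only a sketch of one option, Welch's t statistic, computed with the Python 3 standard library. The welch_t helper is hypothetical and the numbers are invented, not the course data.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical first-week minutes, NOT the course data
passing_minutes = [400.0, 520.0, 380.0, 610.0, 450.0]
non_passing_minutes = [120.0, 200.0, 90.0, 310.0, 150.0]

t = welch_t(passing_minutes, non_passing_minutes)
print(t)
```

A large absolute value of t suggests the gap between the means is unlikely to be noise. Turning it into a p-value requires the t distribution, for example through scipy.stats.ttest_ind with equal_var=False if SciPy is available.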
19. Correlation does not imply causation
Correlation:
students who pass the first project are more likely to visit the classroom multiple times in their first week
Causation:
does visiting the classroom multiple times cause students to pass their project?
Third factors that could cause both visiting the classroom and passing projects:
-- level of interest
-- background knowledge
Or this correlation could be because of causation; we just don't know.
To find out, run an A/B test.
20. Making predictions based on many features
Which students are likely to pass their first project?
You could take a first pass using heuristics, but getting a really good prediction this way could be difficult:
-- lots of different pieces of information to look at
-- these features can interact
Machine learning can make predictions automatically.
21. Communication
What findings are most interesting?
-- difference in total minutes
-- difference in days visited
How will you present them?
-- report average minutes
-- show histograms (polish any visualizations)
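22. Improving plots and sharing findings
Adding labels and titles
In matplotlib, you can add axis labels with plt.xlabel("Label for x axis") and plt.ylabel("Label for y axis"). For histograms you usually only need an x-axis label, but other kinds of plots may need a y-axis label as well. You can also add a title with plt.title("Title of plot").
Making plots look nicer with seaborn
You can use the seaborn library to improve the appearance of matplotlib plots automatically. It may not be included in your Anaconda installation, but Anaconda's package manager makes it easy to add new libraries. To use this package manager, which is called conda, open the Command Prompt (on a PC) or a terminal (on Mac or Linux) and type the command conda install seaborn.
If you installed Python some other way than Anaconda, your package manager may differ. The most common ones are pip and easy_install, which you can use with the commands pip install seaborn or easy_install seaborn respectively.
Once seaborn is installed, you can import it anywhere in your code with import seaborn as sns. Plots you create after that are styled automatically (for seaborn 0.8 and later, see the note at the end of this section). Try it!
The seaborn package also includes extra functions for creating complex plots that would be hard to draw in matplotlib. This course does not cover them, but the seaborn documentation lists the functions that are available.
Adding extra arguments to a plot
You will also often pass extra arguments to a plot to adjust its appearance. The available arguments are listed on the documentation page for the hist function. One common argument is bins, which sets the number of bins the histogram uses. For example, plt.hist(data, bins=20) makes sure the histogram has 20 bins.
Improving one of your plots
Use these methods to improve at least one of the plots you drew earlier.
Sharing your findings
Finally, decide what you most want to communicate to others about what you learned in this lesson.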
Solution code
import seaborn as sns
sns.set(color_codes=True)

plt.hist(non_passing_visits.values(), bins=8)
plt.xlabel('Number of days')
plt.title('Distribution of classroom visits in the first week ' +
          'for students who do not pass the subway project')
plt.show()  # finish the first figure so the second histogram gets its own axes

plt.hist(passing_visits.values(), bins=8)
plt.xlabel('Number of days')
plt.title('Distribution of classroom visits in the first week ' +
          'for students who pass the subway project')
plt.show()

Note:
With seaborn 0.8 or later, importing seaborn no longer changes the plot style on its own; you also need to call the seaborn.set function to enable it (see the seaborn documentation). For example:
sns.set(color_codes=True)
23. Data analysis and related terms
Data science
-- similar to data analysis
-- more focused on building systems
-- may require more experience
Data engineering
-- more focused on data wrangling
-- involves data storage and processing
Big data
-- fuzzy term for 'a lot' of data
-- data analysts, scientists, and engineers all work with big data