Machine Learning in a Week


Getting into machine learning (ML) can seem like an unachievable task from the outside.

And it definitely can be, if you attack it from the wrong end.

However, after dedicating one week to learning the basics of the subject, I found it to be much more accessible than I anticipated.

This article is intended to give others who are interested in getting into ML a roadmap for getting started, drawing on the experiences I had in my intro week.


Background

Before my machine learning week, I had been reading about the subject for a while, and had gone through half of Andrew Ng’s course on Coursera and a few other theoretical courses. So I had a tiny bit of conceptual understanding of ML, though I was completely unable to turn any of that knowledge into code. This is what I wanted to change.

I wanted to be able to solve problems with ML by the end of the week, even though this meant skipping a lot of fundamentals and going for a top-down approach instead of a bottom-up one.

After asking for advice on Hacker News, I came to the conclusion that Python’s Scikit Learn module was the best starting point. This module gives you a wealth of algorithms to choose from, reducing the actual machine learning to a few lines of code.
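To give a feel for what "a few lines of code" can look like, here is a minimal sketch using Scikit Learn (the `scikit-learn` package). The dataset (the built-in iris sample) and the choice of a k-nearest-neighbours classifier are assumptions made for illustration; they are not taken from the author's week.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The actual machine learning: create a model, fit it, score it.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Most Scikit Learn estimators follow the same `fit` / `score` pattern, so swapping in a different algorithm usually means changing only the line that creates `model`.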


Monday: Learning some practicalities

I started off the week by looking for video tutorials that involved Scikit Learn. I finally landed on Sentdex’s tutorial on how to use ML for investing in stocks, which gave me the necessary knowledge to move on to the next step.

The good thing about the Sentdex tutorial is that the instructor takes you through all the steps of gathering the data. As you go along, you realize that fetching and cleaning up the data can be much more time-consuming than doing the actual machine learning. So the ability to write scripts that scrape data from files or crawl the web is an essential skill for aspiring machine learning geeks.

I re-watched several of the videos later on to help me when I got stuck on a problem, so I’d recommend you do the same.

However, if you already know how to scrape data from websites, this tutorial might not be the perfect fit, as a lot of the videos revolve around data fetching. In that case, Udacity’s Intro to Machine Learning might be a better place to start.


Tuesday: Applying it to a real problem

On Tuesday I wanted to see if I could use what I had learned to solve an actual problem. As another developer in my coding cooperative was working on the Bank of England’s data visualization competition, I teamed up with him to check out the datasets the bank had released. The most interesting data was their household survey, an annual survey the bank performs on a few thousand households about money-related subjects.

The problem we decided to solve was the following:

Given a person’s education level, age and income, can the computer predict their gender?

I played around with the dataset, spent a few hours cleaning up the data, and used the Scikit Learn map to find a suitable algorithm for the problem.

We ended up with a success rate of around 63%, which isn’t impressive at all. But the machine did at least manage to guess a little better than flipping a coin, which would have given a success rate of 50%.
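Comparing a success rate against a naive baseline, as above, takes only a few lines of plain Python. The sketch below is not the author's code, and the labels are made up for illustration rather than taken from the survey data. It also computes a second baseline alongside the coin flip: always guessing the most common label, which can score above 50% when the classes are unbalanced.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always guessing the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels for illustration only; this is NOT the survey data.
y_true = ["f", "m", "f", "f", "m", "f", "m", "f"]
y_pred = ["f", "m", "m", "f", "m", "f", "f", "f"]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline(y_true)
print(f"model: {model_acc:.3f}, majority baseline: {baseline_acc:.3f}")
```

With these toy labels the model gets 6 of 8 right (0.750), while always guessing "f" gets 5 of 8 (0.625), so the model only beats the baseline by a modest margin.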

Seeing results is like fuel for your motivation, so I’d recommend doing this yourself once you have a basic grasp of how to use Scikit Learn.

It’s a pivotal moment when you realize that you can start using ML to solve real-life problems.


Wednesday: From the ground up

After playing around with various Scikit Learn modules, I decided to try to write a linear regression algorithm from the ground up.

I wanted to do this because I felt (and still feel) that I really don’t understand what’s happening under the hood.

Luckily, the Coursera course goes into detail on how a few of the algorithms work, which came in very useful at this point. More specifically, it describes the underlying concepts of linear regression with gradient descent.
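As a rough idea of what such a from-scratch exercise involves, here is a minimal sketch of single-variable linear regression fitted with batch gradient descent, in plain Python. It is not the author's implementation; the function name, learning rate, epoch count and toy data are all assumptions chosen for illustration.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent.

    Minimizes the cost J(w, b) = (1 / 2n) * sum((w * x + b - y) ** 2).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Take a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear_regression(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The learning rate matters: too large a value makes the updates diverge, while too small a value needs many more epochs to get close to the best fit.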

Writing an algorithm from scratch has definitely been the most effective learning technique, as it forces you to understand the steps that are going on ‘under the hood’. I strongly recommend you do this at some point.

I plan to write my own implementations of more complex algorithms as I go along, but I prefer doing this after I’ve played around with the respective algorithms in Scikit Learn.


Thursday: Start competing

On Thursday, I started doing Kaggle’s introductory tutorials. Kaggle is a platform for machine learning competitions, where you can submit solutions to problems released by companies or organizations.

I recommend trying out Kaggle once you have a little bit of theoretical and practical understanding of machine learning. You’ll need this in order to start using Kaggle; otherwise, it will be more frustrating than rewarding.

The Bag of Words tutorial guides you through every step you need to take in order to enter a submission to a competition, and also gives you a brief and exciting introduction to natural language processing (NLP). I ended the tutorial with a much higher interest in NLP than I had when I started it.
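For readers who have not met the term, a bag-of-words model turns each text into a vector of word counts over a shared vocabulary, ignoring word order. The sketch below illustrates the idea in plain Python; it does not reproduce the tutorial's own code, and the helper names and review snippets are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one word-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct word seen, in alphabetical order.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


# Made-up review snippets, for illustration only.
reviews = ["A great movie, great cast", "A dull movie"]
vocabulary, vectors = bag_of_words(reviews)
print(vocabulary)  # ['a', 'cast', 'dull', 'great', 'movie']
print(vectors)     # [[1, 1, 0, 2, 1], [1, 0, 1, 0, 1]]
```

Vectors like these can then be fed to a classifier in the same way as any other numeric features.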


Friday: Back to school

On Friday, I continued working on the Kaggle tutorials, and also started Udacity’s Intro to Machine Learning. I’m currently halfway through, and find it quite enjoyable.

It’s a lot easier than the Coursera course, as it doesn’t go into depth on the algorithms. But it’s also more practical, as it teaches you Scikit Learn, which is a whole lot easier to apply to the real world than writing algorithms from the ground up in Octave, as you do in the Coursera course.


The road ahead

Doing this for a week hasn’t just been great fun; it has also made me more aware of how useful machine learning is in society. The more I learn about it, the more areas I see where it can be used to solve problems.

If you’re interested in getting into machine learning, I strongly recommend setting aside a few days or evenings and simply diving into it.

Choose a top-down approach if you’re not ready for the heavy stuff, and get into problem solving as quickly as possible.

Good luck!
