Marios Michailidis: How to Start in Data Science

Many people have asked me how to improve at, or even how to get started with, data science (possibly prompted by my Kaggle experience), and they tell me that getting started seems chaotic. I come from an economics (and accounting) background and had limited exposure to coding/programming until midway through my Masters 6 years ago (apart from a very basic introduction to programming in high school and Excel operations/functions, which is the closest I can think of). This meant I had to go through an exploration phase of picking up different skills to fill the role.

This post explains what worked for me. It focuses on the technical bits. Qualitative skills also help a lot in data science, such as formulating data science problems, articulating results and making analyses actionable, but they are outside the scope of this post, and some of them come (inevitably) with experience in corporate environments. This is just my personal opinion (and experience) of a good pipeline, so consider other options/ideas too. The pipeline assumes you have no programming or direct machine learning/data science experience. Some (high school level) maths is required though.

Learn the Basics of a Programming Language

The first thing you need is to learn Python or R (I prefer python over R and I am not the best person to advise on the latter, as I rarely use it). There is an easy-to-read free ebook here for python that explains the basic elements like loops and if statements, variables and lists. There are good resources on the main python page and in other python books. There is an interactive tutorial too. I personally used only the first ebook and it was enough to move on. Occasionally I went back to different sources to learn specific concepts. Once you have the basic concepts, however, stackoverflow is your best friend.
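A minimal sketch of those basic elements (variables, lists, a loop and an if statement). The prices and the budget are made up purely for illustration and are not taken from the ebook:

```python
# Variables hold single values; a list holds several values in order.
prices = [12.5, 7.0, 30.0, 4.25]
budget = 10.0

cheap = []      # prices that fit the budget
expensive = []  # prices that do not

# A for loop visits each element; the if/else statement picks a branch.
for price in prices:
    if price <= budget:
        cheap.append(price)
    else:
        expensive.append(price)

total = sum(prices)

print("cheap:", cheap)
print("expensive:", expensive)
print("total:", total)
```

If you can read this and predict what it prints before running it, you probably know enough of the language to move on to the next step.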

Set Up Data Science Packages

Once you learn the basics of the programming language of your choice, you need to install the packages that give you access to different data science functions. You can do this manually by finding the packages you need one by one, or you can get a distribution that installs everything you need (including the programming language), like Anaconda for python. You can see the packages as add-ons to the programming language: the programming language is the DVD player and now you need the actual DVDs! Anaconda has almost everything and comes with a nice IDE that helps you write python code, but for those who want to do this manually (which I think is better in the long run), the ones I most commonly use are:

numpy for statistics and data formats
scipy for optimization and sparse data formats/operations
pandas for data manipulation
matplotlib for graphs and plots
sklearn for machine learning in general
h2o for both machine learning algorithms and data manipulation operations
keras (with theano or tensorflow) for deep learning (predictive modelling)
Xgboost and lightgbm, two of the most successful predictive algorithms for a vast variety of predictive problems out there
The list may seem daunting for someone that starts from absolute zero, but you don't need to worry too much as there is a big overlap between the packages and once you learn one, you have also learnt a bit from the others without realising it.
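If you install the packages manually, a quick way to see which ones are already available is to ask python whether it can find each of them. The sketch below uses only the standard library; the helper name `check_packages` is made up for this example, and the names in `PACKAGES` are the import names (scikit-learn, for instance, is imported as `sklearn`):

```python
import importlib.util

# Import names of the packages listed above.
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "sklearn",
            "h2o", "keras", "xgboost", "lightgbm"]


def check_packages(names):
    """Return {name: True/False} depending on whether python can find each package."""
    # find_spec locates a top-level package without importing it.
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(PACKAGES)
for name, installed in status.items():
    print(f"{name:<12} {'installed' if installed else 'missing'}")
```

Anything reported as missing can then be installed with your package manager of choice.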

Learn the Basics of Linear Algebra, Statistical Modelling and Machine Learning

Linear Algebra

For linear algebra, many good resources can be found here. I also liked this one. Consider this too. The ones I used when I started were the slides from UCL. However, I already had some knowledge of linear algebra, so consider the other links too, plus this. You only need one, so make your pick.

Statistics, Statistical modelling

Some basic knowledge of descriptive statistics (like mean, standard deviation, median), distributions (like normal, poisson), probability theory (like combining probabilities, conditional probabilities), hypothesis testing (like t-test or ANOVA, correlations) and linear modelling (like regression) will be really useful to fully master data science.

To be honest, I learnt the basics solely from this book and it was more than enough. However, I did so in SPSS, which is proprietary software (and was my first intro to predictive modelling), so it may be outside the scope of this post given the open source mentality taking over the data science world. There is a version of the same book in R though. Decent alternatives in python are the introduction using scipy, the statsmodels course and the intro to stats repository. You should pick one of these.

For diving deeper into probability (in python), consider this. You may have noticed that some links come from Github. Github is a platform for sharing code, and getting familiar with it is a must, as many open source tools are first available there (and then through other channels). Consider this an intro to github.
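To make a few of these terms concrete, here is a small sketch that computes descriptive statistics, a Pearson correlation and a simple linear regression using only python's standard library. The data (hours studied and exam scores for six students) is invented for illustration:

```python
import statistics

# Hypothetical data: hours studied and exam score for six students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
score = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0]

# Descriptive statistics.
mean_h = statistics.mean(hours)
mean_s = statistics.mean(score)
median_s = statistics.median(score)
std_s = statistics.stdev(score)  # sample standard deviation (divides by n - 1)

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_h) ** 2 for x in hours)
syy = sum((y - mean_s) ** 2 for y in score)
sxy = sum((x - mean_h) * (y - mean_s) for x, y in zip(hours, score))

# Pearson correlation: how strongly the two variables move together (-1 to 1).
r = sxy / (sxx * syy) ** 0.5

# Simple linear regression (ordinary least squares): score = intercept + slope * hours.
slope = sxy / sxx
intercept = mean_s - slope * mean_h

print(f"mean={mean_s:.2f} median={median_s:.2f} std={std_s:.2f}")
print(f"correlation={r:.3f}")
print(f"score = {intercept:.2f} + {slope:.2f} * hours")
```

In practice you would use numpy, scipy or statsmodels for this, but writing the formulas out once makes it clearer what those libraries compute.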

Machine learning

The best intro is probably Andrew Ng's free online Coursera module (Coursera is a good source for learning in general). It teaches you some basics in a nicely structured way. The only downside is that it does so in Octave (last time I checked), which is not a commonly used programming language. Another good course is this one. I have learnt a lot from the University of Utah slides for machine learning (MSc). They explain many of the algorithms in a graphical and easy-to-follow way, and I've found them very clear. General theory of machine learning can also be found in this book.

Looking at the h2o documentation, you can get a good idea of the basic operations commonly combined in data science, specifically data manipulation, modelling, cross-validation, hyper-parameter optimization, productionizing and more. Additionally, I have learnt lots from the sklearn documentation (mentioned before in the python packages), particularly by following the structure and examples from supervised and unsupervised learning. The documentation provides a road map of what you need to learn, with many references and clear (and small) examples. This scikit-learn book can help you go through the documentation of sklearn.
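To show how modelling, cross-validation and hyper-parameter optimization fit together, here is a minimal sketch in plain python. It does not use the h2o or sklearn APIs; the helper names (`fit_ridge`, `cross_val_mse`), the synthetic data and the grid of candidate values are all assumptions made for this example. The model is a one-feature ridge regression without intercept, and its penalty `lam` is the hyper-parameter being tuned:

```python
import random


def fit_ridge(xs, ys, lam):
    """Slope of a no-intercept 1-D ridge regression: sum(xy) / (sum(xx) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(xs, ys, slope):
    """Mean squared error of the predictions slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(xs, ys, lam, k=5):
    """Average held-out MSE over k folds (fold f holds out indices i with i % k == f)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        valid = [i for i in range(len(xs)) if i % k == fold]
        slope = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse([xs[i] for i in valid], [ys[i] for i in valid], slope))
    return sum(scores) / k


# Synthetic data: y is roughly 3 * x plus noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]

# Hyper-parameter optimization as a plain grid search over the penalty.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: cross_val_mse(xs, ys, lam) for lam in grid}
best_lam = min(cv_scores, key=cv_scores.get)

for lam in grid:
    print(f"lam={lam:<6} cv_mse={cv_scores[lam]:.4f}")
print("best lam:", best_lam)
```

The key idea is that every candidate value is scored only on data the model did not see during fitting. Libraries such as sklearn and h2o wrap the same loop behind ready-made utilities.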

Hardware

You don’t need anything very expensive to begin with. Any modern laptop/PC with at least 4 GB of RAM will do. For deep learning, however, you will need a graphics card with CUDA support, such as the GeForce series from NVIDIA. Another free option is Kaggle kernels, which let you run anything you want on their servers, with most of the packages you will ever need already installed for you (apart from CUDA for now).

Kaggle

Kaggle proclaims itself the home of data science and, in my experience, it lives up to its name. Some may disagree with this statement, but I have picked up more than 50% of my skills from this platform, hence I heavily recommend it. Kaggle is where you will put all this accumulated knowledge into practice. It started as a predictive modelling competition platform, where companies would host data and ask competitors to predict certain objectives as well as possible (like future sales or whether someone will default on his/her loan), but nowadays Kaggle is a lot more.

In combination with the kernels, which help you do data analysis even without resources of your own, you can participate in “Get Started“ competitions, study their tutorials and see how predictive modelling is done in practice. This one is a good example. Go back to some finished Kaggle competitions, search for benchmarks and see how well you could have done if the competition were still active. Get familiar with the many different problems, performance metrics, techniques and tools people are posting/using. A 4-year-old competition I liked is this one, and specifically this benchmark from it.

On Kaggle’s official blog page you can find many tutorials and information about competition-winning solutions. I will (shamelessly) recommend a specific blog post in which I am featured. If you go to the 'How Did you Get Better at Kaggle' section, you will find some more Kaggle-related material and tutorials that helped me get better.

Generally, bleed in the battlefield - there is no better way to learn than to put yourself in a situation where you actually have to solve the problem at hand. Therefore you should tackle as many Kaggle competitions as you can. I think this is a good start, and you don't need to do them all: one or 2 should be enough to help you gain some confidence. And don't be a stranger, ask away. Kaggle has more than 1 million members right now and people in the forums will surely assist you.

Analyticsvidhya and other platforms

Analyticsvidhya is a comprehensive platform that covers many of the educational aspects of data science. It has many data science tutorials and a great, big community, it hosts predictive modelling competitions and hackathons, and it keeps you up-to-date with what is new in data science.

There are other platforms where you can test your skills and learn, like Crowdanalytics, numerai and more.

Diving Deeper - Stacking

Optionally, and for predictive modelling, if you want to make really powerful models you will (sooner or later) hear the term ensembling, which refers to combining many machine learning models to form a more powerful model. There is an excellent guide written by a friend of mine (and experienced data scientist) that includes code and tricks on how people do it (in Kaggle). For stacking (which is a very common form of ensembling), you can find details in my github repo. More details about stacking and meta modelling can be found here.
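As a rough idea of what stacking means in code, here is a simplified sketch in plain python. It is not the code from the guide or the github repo mentioned above: the two base models, the synthetic data and all function names are assumptions made for illustration. Each base model produces out-of-fold predictions (every point is predicted by a model that never saw it), and a very small meta-model then learns a single blend weight `w` on top of those predictions:

```python
import math
import random


def fit_linear(xs, ys):
    """Base model 1: least-squares line. Returns a predict function."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=3):
    """Base model 2: k-nearest-neighbour regressor. Returns a predict function."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def out_of_fold(fit, xs, ys, n_folds=5):
    """Predict each point with a model fitted on the other folds only."""
    preds = [0.0] * len(xs)
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), n_folds):  # the held-out indices of this fold
            preds[i] = model(xs[i])
    return preds


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Synthetic data: a trend plus a wiggle plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 3) for _ in range(60)]
ys = [x + math.sin(4 * x) + rng.gauss(0, 0.2) for x in xs]

# Level 1: out-of-fold predictions from each base model.
oof_lin = out_of_fold(fit_linear, xs, ys)
oof_knn = out_of_fold(fit_knn, xs, ys)

# Level 2 (meta-model): prediction = w * linear + (1 - w) * knn,
# with w chosen by least squares on the out-of-fold predictions.
diff = [a - b for a, b in zip(oof_lin, oof_knn)]
w = sum(d * (y - b) for d, y, b in zip(diff, ys, oof_knn)) / sum(d * d for d in diff)
stacked = [w * a + (1 - w) * b for a, b in zip(oof_lin, oof_knn)]

print(f"linear mse : {mse(oof_lin, ys):.4f}")
print(f"knn mse    : {mse(oof_knn, ys):.4f}")
print(f"stacked mse: {mse(stacked, ys):.4f} (w={w:.3f})")
```

The stacked error here is measured on the same out-of-fold predictions the weight was fitted on, so it is optimistic; an honest estimate needs a further held-out set. Real stacking setups usually use stronger base models and a full meta-model instead of a single weight.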

Stay Updated

Keep reading the blogs/news on the platforms mentioned above and in other channels to stay updated. Learn new tools and techniques, as data science is a rapidly changing field and new stuff comes out every day. The options mentioned in the first thread here are good.
