Marios Michailidis：How to start in Data Science

Many people have asked me how to improve at, or even how to start with, data science (possibly prompted by my Kaggle experience), and many have told me that getting started seems chaotic. I come from an economics (and accounting) background and had limited exposure to coding/programming until midway through my Masters six years ago (apart from a very basic introduction to programming in high school and Excel operations/functions, which is the closest I can think of), so I had to go through an exploration phase of picking up different skills to fill the role.

This post is meant to explain what worked for me. It focuses on the technical bits. Qualitative attributes such as formulating data science problems, articulating results and making actionable analyses help a lot in data science, but they are outside the scope of this post, and some of them come (inevitably) with experience in corporate environments. This is just my personal opinion (and experience) of a good pipeline, so consider other options/ideas too. The pipeline assumes you have no programming or direct machine learning/data science experience. Some (high school level) maths is required though.

Learn the Basics of a Programming Language

The first thing you need is to learn Python or R (I prefer Python over R, and I am not the best person to advise on the latter as I rarely use it). There is an easy-to-read free ebook here for Python that explains the basic elements like loops and if statements, variables and lists. There are good resources on the main Python page and in other Python books. There is an interactive tutorial too. I personally used only the first ebook and it was enough to move on. Occasionally I went back to different sources to learn specific concepts. Once you have grasped the basics, however, Stack Overflow is your best friend.
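To make those basic elements concrete, here is a minimal sketch that uses variables, a list, a loop and an if statement together. The numbers and variable names are made up for illustration and do not come from the ebook.

```python
# Variables hold values; a list holds an ordered collection of values.
course_hours = [2, 5, 3, 8, 1]   # made-up numbers, for illustration only
long_sessions = []               # will collect sessions of 3 hours or more
total = 0

# A for loop visits each element of the list in turn.
for hours in course_hours:
    total += hours
    # An if statement runs its body only when the condition is true.
    if hours >= 3:
        long_sessions.append(hours)

average = total / len(course_hours)
print("total hours:", total)
print("average hours:", average)
print("long sessions:", long_sessions)
```

If you can read this snippet and predict what it prints, you probably know enough to move on to the packages.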

Set Up Data Science Packages

Once you have learnt the basics of the programming language of your choice, you need to install the packages that will let you use different data science functions. You could do this manually by finding the packages you need one by one, or you could get a distribution that installs everything you need (including the programming language), like Anaconda for Python. You can see the packages as add-ons to the programming language. The programming language is the DVD player and now you need the actual DVDs! Anaconda has almost everything and comes with a nice IDE that helps you write Python code, but for those who want to do this manually (which is better in the long run, I think), the ones I most commonly use are:

numpy for statistics and data formats.
scipy for optimization and sparse data formats/operations.
pandas for data manipulation.
matplotlib for graphs and plots.
sklearn for machine learning in general.
h2o for both machine learning algorithms and data manipulation operations.
keras (with theano or tensorflow) for deep learning (predictive modelling).
xgboost and lightgbm for two of the most successful predictive algorithms for a vast variety of predictive problems out there.
The list may seem daunting for someone that starts from absolute zero, but you don't need to worry too much as there is a big overlap between the packages and once you learn one, you have also learnt a bit from the others without realising it.
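If you install the packages manually, a quick way to see which ones are already available is to ask Python whether it can locate each module. The sketch below uses only the standard library. The names in the list are the import names of the packages above (scikit-learn is imported as sklearn), and the helper functions `is_installed` and `report` are hypothetical names chosen for this example.

```python
import importlib.util

# Import names of the packages listed above.
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "sklearn",
            "h2o", "keras", "xgboost", "lightgbm"]

def is_installed(module_name):
    """Return True if Python can locate the module, without importing it."""
    return importlib.util.find_spec(module_name) is not None

def report(packages):
    """Map each package name to True (found) or False (missing)."""
    return {name: is_installed(name) for name in packages}

for name, found in report(PACKAGES).items():
    print(f"{name:<12} {'installed' if found else 'missing'}")
```

Anything reported as missing can then be installed with your package manager of choice.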

Learn the Basics of Linear Algebra, Statistical Modelling and Machine Learning

Linear Algebra

For linear algebra, many good resources can be found here. I also liked this one. Consider this too. The resource I used when I started was the set of slides from UCL. However, I already had some knowledge of linear algebra, so consider the other links too, plus this. You only need one, so make your pick.
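As a small taste of why linear algebra matters here, the prediction of a linear model is just a matrix-vector product: one dot product per row of data. The sketch below writes this out in plain Python so every step is visible. The data, the weights and the function names are made up for illustration. In practice you would use numpy for this.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector: one dot product per row."""
    return [dot(row, vector) for row in matrix]

# Each row is one observation with two features (made-up values).
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
weights = [0.5, -1.0]   # hypothetical model coefficients

predictions = matvec(X, weights)
print(predictions)
```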

Statistics, Statistical modelling

Some basic knowledge of descriptive statistics (like mean, standard deviation, median), distributions (like normal, Poisson), probability theory (like combining probabilities, conditional probabilities), hypothesis testing (like t-test or ANOVA, correlations) and linear modelling (like regression) will be really useful to fully master data science.
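A few of these ideas can be tried out with nothing but Python's standard library. The sketch below computes some descriptive statistics on a made-up sample and then works out one conditional probability by counting the outcomes of a fair die. The sample values and the die example are illustrative assumptions, not taken from any of the resources mentioned.

```python
import statistics
from fractions import Fraction

# Made-up sample, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(sample)        # arithmetic average
median = statistics.median(sample)    # middle value of the sorted data
pop_sd = statistics.pstdev(sample)    # population standard deviation

print("mean:", mean, "median:", median, "population sd:", pop_sd)

# Conditional probability with one fair six-sided die:
# P(even | greater than 3) = P(even and greater than 3) / P(greater than 3)
outcomes = range(1, 7)
greater_than_3 = [o for o in outcomes if o > 3]
even_and_greater = [o for o in greater_than_3 if o % 2 == 0]

# All outcomes are equally likely, so the ratio of counts is the probability.
p_even_given_gt3 = Fraction(len(even_and_greater), len(greater_than_3))
print("P(even | > 3) =", p_even_given_gt3)
```

Note that `statistics.stdev` gives the sample standard deviation instead, which divides by n - 1 rather than n.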

To be honest, I learnt the basics solely from this book and it was more than enough. However, I did so in SPSS, which is proprietary software (and my first intro to predictive modelling), so it may be outside the scope of this post given the open-source mentality taking over the data science world. There is a version of the same book in R though. Decent alternatives in Python are the introduction using scipy, the statsmodels course and the intro to stats repository. You should pick one of these.

To dive deeper into probability (in Python), consider this. You may have noticed that some links come from GitHub. GitHub is a platform for sharing code, and getting familiar with it is a must, as many open-source tools are first available there (and then through other channels). Consider this an intro to GitHub.

Machine learning

The best intro is probably Andrew Ng's free online Coursera module (Coursera is a good source for learning in general). It teaches you some basics in a nicely structured way. The only downside is that it does so in Octave (last time I checked), which is not a commonly used programming language. Another good course is this one. I have learnt a lot from the University of Utah slides for machine learning (MSc). They explain many of the algorithms in a graphical and easy-to-follow way and I've found them very clear. The general theory of machine learning can also be found in this book.

Looking at the h2o documentation, you can get a good idea of the basic operations commonly combined in data science, specifically data manipulation, modelling, cross-validation, hyper-parameter optimization, productionizing and more. Additionally, I have learnt lots from the sklearn documentation (mentioned before in the Python packages), particularly by following the structure and examples from supervised and unsupervised learning. The documentation provides a road map of what you need to learn, with many references and clear (and small) examples. This scikit-learn book can help you go through the documentation of sklearn.
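Cross-validation, one of the operations mentioned above, is easier to understand once you see how the rows are split. The following is a minimal sketch of k-fold splitting in plain Python, under the assumption that rows are taken in order without shuffling. The function name `k_fold_indices` is made up for this example. It is not the h2o or sklearn implementation, both of which offer more options.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Rows are split in order, without shuffling, to keep the sketch short.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra row each.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each row lands in the validation set exactly once, so a model trained k times is always scored on rows it has not seen during that round.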

Hardware

You don't need something very expensive to begin with. Any modern laptop/PC with at least 4 GB of RAM will do. However, for deep learning you will need a graphics card with CUDA support, such as the GeForce series from NVIDIA. Another, free, option is Kaggle kernels, which allow you to run anything you want on their servers while having most of the packages you will ever need already installed for you (apart from CUDA for now).

Kaggle

Kaggle proclaims itself the home of data science, and in my experience it lives up to its name. Some may disagree with this statement, but I have picked up more than 50% of my skills from this platform, hence I heavily recommend it. Kaggle is where you will put all this accumulated knowledge into practice. It started as a predictive modelling competition platform, where companies would host data and ask competitors to best predict certain objectives (like future sales or whether someone will default on a loan), but nowadays Kaggle is a lot more.

In combination with the kernels, which help you do data analysis even without your own resources, you could participate in "Get Started" competitions, study their tutorials and see how predictive modelling is done in practice. This one is a good example. Go back to some finished Kaggle competitions, search for benchmarks and see how well you could have done if the competition were still active. Get familiar with the many different problems, performance metrics, techniques and tools people are posting/using. A 4-year-old competition I liked is this one, and specifically this benchmark from it.

On Kaggle's official blog page you can find many tutorials and information about competition-winning solutions. I will (shamelessly) recommend a specific blog post in which I am featured. If you go to the 'How Did you Get Better at Kaggle' section, you will find some more Kaggle-related material and tutorials that helped me get better.

Generally, bleed in the battlefield: there is no better way to learn than to put yourself in a situation where you actually have to solve the problem at hand. Therefore you should tackle as many Kaggle competitions as you can. I think this is a good start, and you don't need to do them all; one or two should be good enough to help you gain some confidence. And don't be a stranger, ask away. Kaggle has more than 1 million members right now and people in the forums will surely assist you.

Analytics Vidhya and other platforms

Analytics Vidhya is a comprehensive platform that covers many of the educational aspects of data science. It has many data science tutorials and a great, big community, it hosts predictive modelling competitions and hackathons, and it keeps you up to date with what is new in data science.

There are other platforms where you can test your skills and learn, like Crowdanalytics, numerai and more.

Diving Deeper - Stacking

Optionally, and for predictive modelling, if you want to make really powerful models you will (sooner or later) hear the term ensembling, which refers to combining many machine learning models to form a more powerful model. There is an excellent guide written by a friend of mine (and experienced data scientist) that includes code and tricks on how people do it (on Kaggle). For stacking (which is a very common form of ensembling) you can find details in my GitHub repo. More details about stacking and meta modelling can be found here.
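To give a feel for why combining models can help, here is a toy sketch in plain Python. It compares two hypothetical base models, their simple average, and a blend whose weight is fitted by least squares, which can be seen as a one-parameter meta-model in the spirit of stacking. All numbers and function names are made up. In real stacking the meta-model is normally fitted on out-of-fold predictions rather than on the same rows it is scored on, and this is not the method from the guide or the repo mentioned above.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def blend(pred_a, pred_b, w):
    """Weighted blend: w * model A + (1 - w) * model B."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

def fit_blend_weight(y_true, pred_a, pred_b):
    """Least-squares weight w that minimises the squared error of the blend."""
    num = sum((a - b) * (t - b) for t, a, b in zip(y_true, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    # If both models predict exactly the same values, any weight works.
    return num / den if den else 0.5

# Made-up targets and predictions from two hypothetical base models.
y = [3.0, 5.0, 7.0, 9.0]
model_a = [3.5, 4.5, 7.5, 8.5]
model_b = [2.5, 5.5, 7.5, 8.0]

average = blend(model_a, model_b, 0.5)
w = fit_blend_weight(y, model_a, model_b)
fitted = blend(model_a, model_b, w)

print("model A error:     ", mse(y, model_a))
print("model B error:     ", mse(y, model_b))
print("simple average:    ", mse(y, average))
print("fitted blend error:", mse(y, fitted), "with weight", w)
```

On this toy data the average beats both base models because their errors partly cancel out, and the fitted weight improves on the average a little further.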

Stay Updated

Keep reading the blogs/news on the platforms mentioned above and in other channels to stay updated. Learn new tools and techniques, as data science is a rapidly changing field and new stuff comes out every day nowadays. The options mentioned in the first thread here are good.
