Data Science in a nutshell

I intend to keep this blog post as a glossary for some of the most important concepts in data science. All of these concepts are immense topics on their own. I won’t be diving deep into each one; instead I will give a basic summary of what each one is. All explanations are my own, with some inspiration from Wikipedia and An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, one of the best data science books of all time!

What are Data Science, Machine Learning, AI and Deep Learning?

All these terms seem to mean the same thing and it is quite easy to mix them up, but they are all different from each other.

Data Science is a field in which knowledge is extracted from data using statistics, mathematics and advanced analysis techniques. The fundamental difference between a data scientist and a data analyst is the way the analysis is done and what it is used for. A data scientist uses statistical analysis to test a hypothesis or predict future trends. A data analyst would slice and dice data to look for KPIs and trends without using any machine learning techniques.

Machine Learning is the process of developing algorithms that learn from data in order to predict future trends or find patterns in that data.

Artificial Intelligence is a broad term for machines performing tasks that would normally need human intelligence. In the simplest terms, once we train an algorithm to learn about something and test it to see if it’s learnt enough, we then use the algorithm to make predictions on its own, sometimes ones that we as humans cannot easily make. This is AI.

Deep Learning is a specific type of machine learning where the algorithm learns through neural networks with many layers, sometimes as complicated as human thoughts. (Yes, we humans are the most complex machines on earth!)

Can machine learning steal our jobs?

There is a widespread claim that machine learning will soon replace humans and we may all become obsolete, or worse, we might be terminated by robots. This is not the case (again, in my opinion) but I do love a good science fiction story! When computers were invented, we worried they would soon replace humans, but we learnt to work with them. I think we as humans can adapt to change in order to survive.

Here are some key definitions in machine learning:

Supervised Learning 

This is a type of problem where we are given a data set with known outcomes to learn from, and we try to predict that outcome (or some variable) for new data. For example, we might learn from millions of patients’ blood test results and predict whether a new patient is likely to have a disease. There are 2 types of supervised learning methods:

  1. Regression: Here we try to predict a continuous variable (most likely a number) from a data set. An example is where we learn about the different characteristics of a house such as the house type, the location, whether it has a swimming pool, the number of rooms, etc. Once we learn all this we try to predict the house price which is a numerical variable.
  2. Classification: This is used to predict a class variable. In the same house example, say we want to predict whether the house would sell in the next 3 months or not. The variable we want to predict is a Yes / No answer. This is a classification problem. However, in many cases a classification problem may have more than 2 answers. For example, we may want to know if a house would sell within the next 6 months, 1 year or 2 years. This is known as a multi-class problem.
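
To make the difference concrete, here is a minimal sketch in plain Python. The house sizes, prices and "sold" labels are made up for illustration, and the two helpers (a one-feature least squares line and a 1-nearest-neighbour classifier) are just simple stand-ins for a real regression and classification model.

```python
# Toy data, invented for illustration: one feature (size in square metres) per house.
sizes = [50.0, 70.0, 90.0, 110.0, 130.0]
prices = [150.0, 200.0, 250.0, 300.0, 350.0]  # continuous target (in thousands)
sold_in_3_months = [1, 1, 1, 0, 0]            # class target: 1 = Yes, 0 = No


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_price(size, slope, intercept):
    """Regression: the prediction is a number."""
    return slope * size + intercept


def predict_sold(size, xs, labels):
    """Classification: copy the label of the closest known house (1-nearest neighbour)."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - size))
    return labels[nearest]


slope, intercept = fit_line(sizes, prices)
print(predict_price(105.0, slope, intercept))         # 287.5
print(predict_sold(105.0, sizes, sold_in_3_months))   # 0
```

The same input (a 105 square metre house) gives a number in the regression case and a class label in the classification case.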

Unsupervised learning

This is the type of problem where we have no known outcome to predict and instead just look for patterns in the data. A popular method of this type is clustering. Say we have a data set of many customers and their purchase patterns. We may want to put them into different buckets based on their behaviour. This is a clustering problem.
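
As a rough illustration of the customer buckets idea, here is a small k-means sketch in plain Python. k-means is only one of many clustering algorithms, and the customer numbers (visits per month, average spend) and the choice of starting centroids are assumptions made up for this example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, move each
    centroid to the mean of its assigned points, and repeat until stable."""
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # nothing moved, so we have converged
        centroids = new_centroids
    return assignments, centroids


# Invented customers: (visits per month, average spend per visit).
customers = [(1, 20), (2, 25), (1, 30), (8, 200), (9, 220), (10, 210)]

# Start from two of the customers as initial centroids (an arbitrary choice).
buckets, centres = kmeans(customers, [customers[0], customers[3]])
print(buckets)  # [0, 0, 0, 1, 1, 1]
print(centres)
```

Notice that we never told the algorithm which customer belongs where; the two buckets come out of the data alone. In practice the features would usually be scaled first, since here the spend column dominates the distance.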

Steps to create a machine learning model

  1. Brainstorming: This is the first phase, in which we determine what we want the model to do. Do we want a predictive model (supervised techniques)? Are we looking to find clusters in the data (unsupervised techniques)? Are we looking to forecast time series data? These are the questions we need to answer.
  2. Data Cleansing: The unavoidable phase! No data set is perfect. We need to look out for outliers, handle missing data, convert categories into dummy variables, etc.
  3. Data Visualisation: This is where we get an understanding of the data through graphs and tables. Do we see a relationship between the age of a customer and the way they purchase? Which has the biggest influence on a house price: the size of the house or its location? Do we see many diabetic patients who also have a diabetic family member? These are the types of questions we need to answer with the visuals.
  4. Data Splitting: Divide your data into training and test sets.
  5. Modelling: The actual fitting of the model to the training data, so that it learns how to classify or predict a numeric output. Get an idea of how much the model has learnt by testing it on new, unseen test data, and measure the accuracy of your predictions.
  6. Post modelling: Not all models work the first time. We need to fine-tune the parameters of the model, tweak the data set, extract features that work and eliminate features that do not contribute. Trial and test until perfection.
  7. Predictions: Once you have accomplished all of the above, the model is ready to predict. It’s not easy getting here!
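
The splitting, modelling and prediction steps above can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the data is synthetic (a single made-up blood marker value and a disease label), and the "model" is a deliberately simple threshold, not something you would use on real patients.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step 4: shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Step 5 (fit): the midpoint between the mean feature value of each class."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(threshold, x):
    """Step 7: classify a new value using the learnt threshold."""
    return 1 if x > threshold else 0


def accuracy(threshold, rows):
    """Step 5 (evaluate): the share of rows whose label is predicted correctly."""
    correct = sum(predict(threshold, x) == y for x, y in rows)
    return correct / len(rows)


# Synthetic rows: (marker value, has_disease), with 10 rows in each class.
rows = [(i * 0.5, 1 if i * 0.5 > 5.0 else 0) for i in range(1, 21)]

train, test = train_test_split(rows)
threshold = fit_threshold(train)
print("learnt threshold:", threshold)
print("training accuracy:", accuracy(threshold, train))
print("test accuracy:", accuracy(threshold, test))
```

The important habit is that the threshold is learnt from the training rows only, and the test rows are used just once at the end to check how well it generalises.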

Bias and variance are both sources of error when fitting a model. High bias leads to underfitting and high variance leads to overfitting. Bias is the error caused by a model that is too simple to capture an underlying relation in the data. Variance is the error of a model that is too sensitive to small fluctuations in the training data. A model with high variance can misinterpret the random noise in a dataset as a true relation, whereas a model with high bias can miss important relations in the data. Typically, reducing bias increases variance and vice versa; this is known as the bias-variance trade-off.
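
One way to see both kinds of error is with a k-nearest-neighbour regression on synthetic data, sketched below in plain Python. The sine curve, the noise level and the values of k are all assumptions chosen for illustration. With k = 1 the model memorises the training data (high variance), and with k equal to the size of the training set it predicts the same average everywhere (high bias).

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, rows, k):
    """Average squared gap between the k-NN prediction and the true target."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in rows) / len(rows)


rng = random.Random(42)


def sample(n):
    """Noisy observations of a sine curve (synthetic data for illustration)."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]


train, test = sample(40), sample(40)

for k in (1, 5, len(train)):
    train_error = mean_squared_error(train, train, k)
    test_error = mean_squared_error(train, test, k)
    print(f"k={k:2d}  training error={train_error:.3f}  test error={test_error:.3f}")
```

With k = 1 the training error is zero because every training point is its own nearest neighbour, yet the test error is not, which is the signature of overfitting. With k = 40 the model ignores x entirely, so it underfits on both sets. A value of k somewhere in between typically gives the best test error.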

Of course, all of the above only scratches the surface of data science. There is a lot more(!) but hey, this blog post is data science in a nutshell.