If you have skimmed through the high tech job listings lately, you’ve surely noticed the demand for data scientists. Although there is no consensus on the exact job definition, many companies want data experts and are willing to pay a high price for the right candidates.
Let’s start with a working definition: A data scientist is someone who knows how to extract insights from data. The insights could take the form of descriptive analytics for past events, predictions of future events, or persuasive analytics – a plan that can persuade people to exhibit a desired behavior such as buying your product. To derive these insights, a data scientist needs a variety of skills:
- Data wrangling: knowing how to cleanse data and organize it so that it can be analyzed;
- Big data engineering: knowing how to store massive quantities of data and query it efficiently;
- Statistics: knowing how to find patterns in data;
- Machine Learning: knowing which algorithm to use in order to solve a given prediction problem and knowing how to tune that algorithm to give the best possible prediction;
- Data visualization: knowing how to present insights in a way that is clear to non-technical users.
No wonder good data scientists are hard to find! The profession requires a wide variety of skills, as well as the intuition and experience to know how to put insights together.
To illustrate, let’s take a simple example. Suppose you run an online retail site and have collected a number of data records about online shoppers, including features such as time spent on site, pages viewed, spending history, abandoned shopping carts and customer location. You would like to organize this data into clusters, so you can better understand what features matter most and what features are interrelated. Once the data has been clustered, you hope to analyze these clusters in order to define various classes of customers and create a marketing program tailored to each class.
This is a classic unsupervised machine learning problem requiring clustering of unlabeled data. But, there are a number of clustering algorithms you need to choose from:
K-means clustering is the most well-known clustering algorithm, involving the following steps:
- Select a number of classes to use (K), e.g., based on visual examination.
- Select K random cluster centers (centroids) in the unlabeled data.
Repeat the following steps for a set number of iterations or until the centroids don’t change much between iterations:
- Step 1: Each centroid “claims” as its own the data points closest to it.
- Step 2: Claimed data points move their centroid to be located at the average of the points in its cluster.
K-Means is popular because it trains quickly and it works well with a lot of data. But, it also has some serious drawbacks:
- You need to know in advance (or guess) how many clusters there will be in your data.
- K-means assumes clusters are circular, but in the real world most clusters are elliptical, i.e., have a normal distribution with a bulge around the median value.
- The algorithm does not converge; i.e., it can give different results depending on where the random cluster centers are initially placed.
- K-means assumes each data point belongs to only one cluster. Sometimes a data point cannot be clearly identified with a single cluster – all we can determine is the probability that it belongs to one of several clusters.
Mean Shift Clustering
The mean shift clustering algorithm creates clusters by maximizing the density of each cluster’s region (a circle or ellipse), by maximizing the number of data points within the radius of each cluster’s centroid. The iterative algorithm is performed with many sliding windows until all points lie within a window. Mean Shift has the following advantages over K-Means:
- Mean Shift determines the number of clusters, while K-Means has to be told.
- Mean Shift can handle arbitrary cluster shapes, while K-Means can only handle circles.
- Mean Shift is fairly robust and ends with convergence, while K-Means can give different results depending on initialization.
- Mean Shift is insensitive to outlier values, while K-Means is highly sensitive.
Mean Shift is far slower than K-Means because it is quadratic rather than linear. A number of performance improvements have been made to Mean Shift; e.g., efficient neighbor lookup and binned seeding (placing the initial centroids in the most densely populated areas), but it cannot be used when the number of data points is very large.
Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
This algorithm assumes each data point has been generated by some mixture of Gaussians (normal distributions). It is a soft clustering algorithm because a data point can be in more than one cluster with a probability assigned to each cluster. The algorithm uses an iterative algorithm called Expectation-Maximization (EM), which performs maximum likelihood estimation in probabilistic models.
Like K-Means, EM requires that you know in advance how many clusters there will be in your data, and it can give different results depending on where the random cluster centers are placed. But, unlike K-Means, it can handle Gaussian clusters, and it can perform soft, probabilistic cluster assignments for a given data point.
Agglomerative Hierarchical Clustering
This algorithm works bottom up, gradually combining more and more nearby points into unified clusters, until all the points reside in a single cluster. The user then decides the cut point, i.e., how many clusters should be defined, based on distance between data points. The algorithm is easy to understand, requires no prior knowledge of the number of clusters, and produces convergent results. Unfortunately, it is far slower than K-Means and cannot be run on very large data sets.
I’ve barely touched on the various options, and there are many more algorithms out there. A good data scientist would probably put all of these options together into a “recipe,” e.g.:
- If the clusters look circular:
Start with K Means
- Cheapest to run – linear performance
- Requires you to make a good guess about the number of clusters
If there is a reasonable number of data points, then also run hierarchical clustering
- Reproducible results, no need to guess number of clusters
- If the clusters look elliptical:
Use Mean Shift with a Gaussian kernel
- Improve quadratic performance via binned seeding, efficient neighbor lookup
- If a data point can belong to more than one cluster:
- Try EM Clustering using Gaussian Mixture Models
Every good data scientist has a guarded cookbook full of recipes, one for each kind of problem he/she can reasonably expect to encounter: feature engineering, regression, clustering, binary classification, multi-class classification, market basket analysis, etc. Therein lies a glimmer of hope: If we can learn the recipes of the very best data scientists, we can create an “Instant Data Scientist,” a system that knows:
- The right questions to ask about the problem and the data.
- The right families of algorithms that should be tried.
- How to select the relevant features to be analyzed and ignore features that are irrelevant or redundant.
- How to cleanse the data so it is suited for the selected algorithms, e.g., eliminate records with missing or outlier values.
- How to tune the algorithm so it provides the best possible answer.
- When to stop; i.e., deciding that the result is good enough to solve the problem, or, alternatively, deciding that no algorithm will give a good answer; e.g., due to insufficient data.
Such systems are being built today, using the tools of data science to make data science more accessible. While none of these “Instant Data Science” products are ready for prime time, they indicate a clear direction: the use of data science to automate the solution of data science problems. It will take a few years, but the ultimate success of data science could well be to make data science available to all.
About the author: Moshe Kranc is Chief Technology Officer at Ness Digital Engineering, a provider of digital transformation and software engineering services. Moshe has worked in the high tech industry for over 30 years in the United States and Israel, and has extensive experience in leading adoption of bleeding edge technologies. He previously headed the Big Data Centre of Excellence at Barclays’ Israel Development Centre (IDEC), and was part of the Emmy award-winning team that designed the scrambling system for DIRECTV. Moshe holds six patents in areas related to pay television, computer security and text mining. He has led R&D teams at companies such as Zoomix (purchased by Microsoft) and NDS (purchased by Cisco). He is a graduate of Brandeis University and earned graduate degrees from both the University of California at Berkeley and Boston University.