Authored by:
Annie Liu, Consultant, TruQua Enterprises
JS Irick, Lead Developer and Principal Consultant, TruQua Enterprises
Daniel Settanni, Senior Cloud Architect, TruQua Enterprises
As industry leaders ramp up their investments in Machine Learning, there is a growing need to communicate effectively with Data Scientists. Without a true understanding of both the technology and the business factors involved in a Machine Learning scenario, it is impossible to create long-term solutions.
In Part 1 of this 2-part blog series, we will work through the first of two Machine Learning examples and describe the communication and collaboration necessary to successfully leverage Machine Learning for business scenarios.
Machine Learning algorithms are very good at predicting outcomes for many different types of scenarios by analyzing existing data and learning how it relates to the known outcomes (what you’re trying to predict). Two of the most common types of machine learning algorithms are classification and regression.
With classification, the predicted values are fixed, meaning there are a limited number of outcomes, such as determining whether a customer will make a purchase or not. Regression, on the other hand, makes continuous numerical predictions, such as estimating the lifetime value of a customer. In each case, it is critical that the Data Scientist understands both the inputs (the source of the individual factors and how they are created) and the business event you are trying to categorize or predict.
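To make the distinction concrete, here is a minimal sketch in Python with scikit-learn; the customer data and column names are invented purely for illustration and are not from any real project.

```python
# Minimal sketch contrasting classification and regression with scikit-learn.
# The column names and values are illustrative assumptions, not real data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

customers = pd.DataFrame({
    "visits_last_month": [3, 0, 7, 1, 12, 5],
    "avg_basket_value":  [42.0, 0.0, 55.5, 10.0, 80.2, 33.3],
    "made_purchase":     [1, 0, 1, 0, 1, 1],                  # fixed outcome -> classification
    "lifetime_value":    [1200, 50, 2100, 300, 4800, 900],    # continuous value -> regression
})

features = customers[["visits_last_month", "avg_basket_value"]]

# Classification: predict a discrete outcome (will the customer purchase?)
clf = RandomForestClassifier(random_state=0).fit(features, customers["made_purchase"])

# Regression: predict a continuous number (customer lifetime value)
reg = RandomForestRegressor(random_state=0).fit(features, customers["lifetime_value"])

print(clf.predict(features.head(2)))   # discrete class labels
print(reg.predict(features.head(2)))   # continuous estimates
```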
First, let’s look at an example that demonstrates how to use Machine Learning to perform categorization. In this case, we are trying to better predict Employee Turnover. So, the goal of the machine learning algorithm is to categorize current employees as “Likely to Leave” or “Unlikely to Leave”. The categorization will be based on factors we have about each employee.
However, our goal is slightly different. Our business requirement is to identify the employees likely to leave so that actions can be taken to retain the employees. Before we continue, it is important to understand the cost of both a false positive and a false negative with regards to your business.
False Positive: An employee that is not going to leave is flagged as likely to leave.
False Negative: An employee leaves despite no indication from the machine learning algorithm.
In this case, False Negatives are costlier than False Positives. The algorithm with the best fit (overall performance) may not be the most effective for your business if it does not appropriately weigh the cost of the outcomes.
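One way to encode that asymmetry is to weight the "leaver" class more heavily during training and to score candidate models on a business cost rather than raw accuracy. The sketch below assumes a scikit-learn workflow, synthetic data, and an illustrative 5-to-1 cost ratio; all of these are assumptions for the example, not figures from the original analysis.

```python
# Sketch: treating a missed leaver (false negative) as costlier than a false alarm.
# The data is synthetic and the 5x cost ratio is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)  # 1 = leaves
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# class_weight pushes the model to avoid missing the "leaves" class.
model = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=0).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()

# Score candidate models on business cost, not raw accuracy: here a false
# negative (a leaver we failed to flag) costs five times a false positive.
business_cost = 5 * fn + 1 * fp
print(f"false negatives={fn}, false positives={fp}, weighted cost={business_cost}")
```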
Machine learning algorithms need to be developed and trained on historical data. So, for each historical employee, we have the features we believe are related to whether an employee stays or leaves, as well as the outcome itself: whether that employee remained at the company.
When undertaking a Machine Learning project, it's critical to work with a partner who will take the time to understand the various features that can be used within the model. If the Data Scientist does not understand the inputs into the model, the project is likely to end up with models that perform well in testing but poorly in production. This is called "overfitting".
This communication with the Data Scientist can also lead to the inclusion of valuable external data that was initially missing from the model.
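For context, a simplified and entirely hypothetical version of such an employee dataset might look like the sketch below; the real input file's columns may differ, but the items that follow refer to features of this general shape.

```python
# Hypothetical illustration of the kind of historical employee data described above.
# Column names and values are assumptions for the sketch, not the project's real schema.
import pandas as pd

employees = pd.DataFrame({
    "satisfaction_level":    [0.38, 0.80, 0.11, 0.72],  # self-reported, 0-1 scale
    "number_of_projects":    [2, 5, 7, 3],
    "average_monthly_hours": [157, 262, 272, 223],
    "tenure_years":          [3, 6, 4, 5],
    "sales":                 ["sales", "technical", "support", "sales"],  # actually the job role
    "salary":                ["low", "medium", "medium", "high"],         # not normalized by role
    "left_company":          [1, 0, 1, 0],   # the outcome we want to predict
})
print(employees.head())
```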
There are three important items to note here:
1. Satisfaction level is self-reported and people are notoriously poor self-reporters.
2. The job role column is labeled “sales” in the input dataset. While descriptive column names are nice, they are no replacement for a good data dictionary.
3. Salary is a simple “High/Medium/Low” value, but is not normalized for job role.
Once we have reviewed the factors, as well as the business event we are trying to model, we need to better understand how they relate to each other. This means analyzing the relationships between the factors and the outcome, as well as between the individual factors themselves. Here we see a chart describing the correlations between our various factors and whether the employee stayed with the company.
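As a sketch of how such an analysis might be produced, the snippet below computes a correlation matrix with pandas, continuing with the hypothetical employees table from earlier; a real project would of course run this on the full dataset.

```python
# Sketch: examining correlations between factors and with the outcome.
# Continues from the hypothetical `employees` DataFrame sketched earlier.
corr = employees.corr(numeric_only=True)

# How strongly does each numeric factor move with attrition?
print(corr["left_company"].sort_values(ascending=False))

# The full matrix also shows relationships between the factors themselves,
# e.g. number of projects vs. average monthly hours.
print(corr.round(2))
```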
When looking at the relationships, we start to understand the correlations between our data. This step should reveal a number of data relationships which make intuitive sense, and may show some surprising results.
1. Number of current projects and number of hours worked are related. [Intuitive]
2. Employees with a longer tenure are less likely to leave. [Intuitive]
3. There is a slight negative relationship between satisfaction and retention. [Surprising]
When looking at the relationships between data, we can also find pairs of factors that are highly correlated with each other. This can help determine which factors to combine or remove.
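A common way to surface those candidates is to scan the correlation matrix for pairs above a chosen threshold. The sketch below continues from the matrix computed earlier; the 0.8 cut-off is an arbitrary illustrative value, not a recommendation.

```python
# Sketch: flag pairs of factors whose absolute correlation exceeds a threshold.
# Continues from the `corr` matrix above; the 0.8 cut-off is an assumption.
import numpy as np
import pandas as pd

threshold = 0.8
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # upper triangle, excluding diagonal
upper = corr.where(mask)

candidate_pairs = [
    (row, col, round(upper.loc[row, col], 2))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and abs(upper.loc[row, col]) > threshold
]
print(candidate_pairs)  # pairs of factors to consider combining or removing
```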
Additionally, it is necessary to look at the numerical data to determine if we should change certain values to ranges/buckets. For example, look at the relationship between monthly hours and employee retention.
Note the monthly hours for employees that were not retained. This should make intuitive sense, as the only thing worse than working too much is working too little. Rather than use monthly hours as a raw value, our model would be better served by defining categories for monthly hours.
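A sketch of that bucketing with pandas, again using the hypothetical employees table and illustrative bin edges, might look like this:

```python
# Sketch: replace raw monthly hours with categorical buckets.
# The bin edges and labels are illustrative assumptions, not tuned values.
import pandas as pd

bins = [0, 150, 250, float("inf")]
labels = ["under_worked", "typical", "over_worked"]

employees["monthly_hours_bucket"] = pd.cut(
    employees["average_monthly_hours"], bins=bins, labels=labels
)
print(employees[["average_monthly_hours", "monthly_hours_bucket"]])
```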
Once the data set has been analyzed, model development can begin. This is generally an iterative process, going through a number of different model types, as well as re-examining the initial data set.
While this iterative process is being performed, it is important to look at the output of the models, not just their fit. This is where the definition of your business goal, as well as communication with an experienced Data Scientist, is critical. For example, a fraud detection algorithm that never detects fraud is over 99% accurate. Fit is not enough.
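The fraud example can be made concrete in a few lines; the 1-in-1,000 fraud rate below is an assumed figure chosen only to show how accuracy alone can mislead.

```python
# Sketch: why fit/accuracy alone is misleading on imbalanced outcomes.
# One transaction in 1,000 is fraudulent (illustrative ratio).
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 1 + [0] * 999)   # 1 = fraud
y_pred = np.zeros_like(y_true)           # a "model" that never predicts fraud

print(accuracy_score(y_true, y_pred))    # 0.999 -- looks excellent
print(recall_score(y_true, y_pred))      # 0.0   -- catches no fraud at all
```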
For our employee retention example, we tested three popular machine learning algorithms. Below you can see the fit of each of the three models; more importantly, you can see the output for a subset of the testing data.
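The article does not name the three algorithms, so the sketch below compares three commonly used classifiers on synthetic data purely to illustrate the kind of side-by-side review described here: overall fit alongside recall on leavers and a sample of individual predictions.

```python
# Sketch: comparing several candidate classifiers and inspecting predictions,
# not just overall fit. The three algorithms and the synthetic data are
# illustrative assumptions; the article does not name the models it tested.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.76], random_state=0)  # 1 = leaves
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest":       RandomForestClassifier(random_state=0),
    "gradient_boosting":   GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    # Report fit (accuracy) alongside recall on leavers, which reflects the
    # business cost of false negatives discussed above.
    print(f"{name}: accuracy={accuracy_score(y_te, preds):.3f}, "
          f"leaver recall={recall_score(y_te, preds):.3f}")
    print("  sample predictions vs. actual:", list(zip(preds[:5], y_te[:5])))
```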
We have taken an abbreviated look at how a data scientist might approach this scenario, but in the real world this is only part of the solution. There are still questions surrounding how the model is served, how it is consumed within the business process, and how to devise a strategy for retraining the model with updated data.
If you have questions, we have the answers. TruQua’s team of consultants and data scientists merge theory and practice to help customers gain deeper insights into their data for more informed decision making. For more information or to schedule a complimentary workshop that identifies what Machine Learning scenarios make sense for your business, contact us today at info@truqua.com.