Predicting Employee Turnover using Machine Learning

by Gabriele Volpi – Head of Digital Technology, Kilpatrick Digital

Predicting employee turnover is a key priority for any HR Manager. Turnover rate represents a major challenge for today’s businesses, particularly when the labor market is competitive and specific abilities are in high demand.

Turnover rate is the percentage of employees who leave a company and are replaced by new workers, and it reflects the loss of talent from the workforce over time. It covers any kind of employee departure: resignations, dismissals, terminations, retirements or relocations.

If and when an employee leaves, the cost isn’t negligible: according to studies such as those by the Society for Human Resource Management (SHRM), every time a company replaces a salaried employee, it costs 6 to 9 months’ salary on average. Moreover, finding substitutes can require months of time and effort on the part of human resources directors and the recruiting team. An organization must spend a lot of time and money to search for the best replacements through advertising, recruiting companies, screening, interviewing and hiring. Once the right candidate is found, it may take weeks to months for the new staff member to be fully onboarded and working at full capacity. Consequently, measuring employee turnover can be useful to recruiters who want to explore the reasons for turnover or estimate the cost-to-hire for budget purposes.

Turnover prediction has been a research topic since the beginning of the 20th century, with different studies approaching the problem through different methodologies, as you can read in the article “One hundred years of employee turnover theory and research”. But with the development of Machine Learning techniques, made possible by the advances in computational power, this kind of task can now be carried out effectively as a supervised classification problem.

Like every Data Science project, the analysis and prediction of employee turnover is divided into multiple steps, each one fundamental to the entire lifecycle of the project. In particular, it is possible to identify six phases:

  • Business Understanding and Data Understanding, where the first step is understanding your client’s needs and goals, since those goals will become the objectives of the project. Within this phase, we collect information through interviews and technical literature. The data-understanding part also includes four tasks: gathering data, describing data, exploring data and verifying data quality;
  • Data Acquisition, the delicate phase where data is acquired from all available data sources;
  • Data Preparation, the stage where data is normalized in order to create a single dataset and to correct missing values and errors;
  • Modeling, the step where the best model is selected by testing performance on the dataset;
  • Evaluation and Visualization, where the results obtained by the best model are evaluated and visualized in the most effective way;
  • Deployment, where the developed framework is installed on the customer’s IT systems.


The first phase could be seen as the most important one, because it allows us to understand what is realistically obtainable from the entire data-analysis process.

In fact, even the most sophisticated and evolved model cannot predict anything if there is not the right amount of the right type of data. 

Indeed, real-life companies often lack well-prepared, accessible data sources, so it is hard to generalize the results obtainable from a predictive model, as these depend on data quality.

In addition, HR data is often noisy, inconsistent and contains missing information, a problem that is exacerbated by the small proportion of employee turnover that typically exists within a given set of HR data. 

Several remedial operations can be adopted to deal with these problems, such as data anonymization and data regularization (see the Data Preparation section), but it is important to understand well from the beginning what the real achievement of the project can be.

Obviously, the more data one has access to, the better, so it is natural that the best results are achieved with large companies that have many employees, a good Data Quality framework and – particularly important – a sufficiently long turnover time series.


Once the data situation is clear, the second relevant phase is data acquisition. In an ideal, joyful world, all data could be imported from one source, all ordered, normalized and organized with unique primary keys in a single database. Unfortunately, dealing with an average company’s data sources is like discovering that Santa Claus doesn’t exist…

In order to deal with this chaotic situation, a cost-benefit analysis of every data source is needed: will the advantages obtained from the analysis be useful enough to justify the effort required to reach the source? It is important to remember that the usefulness of the features varies depending on the models used for the analysis.

Moreover, some data could be useful for a big company with a large redundancy of instances, but in small companies the number of employees may not be sufficient for that feature to be significant.


After all data has been acquired, the next big task is data preprocessing. This phase covers every action needed to obtain a clean, standardized dataset ready to be used for training models, such as missing-value imputation, data-type conversion and feature scaling.

In the database of a real company, missing data (or irremediable typos) are extremely common. To handle them, missing entries are generally replaced with default values based on the data type: for numerical data types, they are typically replaced by the median value, while for categorical data types they can be replaced by the mode, and so on.
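A minimal sketch of this imputation strategy with pandas, using a small hypothetical HR dataset (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical HR dataset with missing entries
df = pd.DataFrame({
    "age": [34, 41, np.nan, 29, 50],
    "department": ["Sales", None, "IT", "Sales", "Sales"],
})

# Numerical column: replace missing entries with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: replace missing entries with the mode
df["department"] = df["department"].fillna(df["department"].mode()[0])

print(df)
```

After these two lines the dataset contains no missing values: the missing age becomes 37.5 (the median of the observed ages) and the missing department becomes “Sales” (the most frequent value).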

For some models, converting categorical features to numerical ones could be essential.
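One common way to do this conversion is one-hot encoding, where each category becomes its own 0/1 column. A sketch with pandas, again on a hypothetical department column:

```python
import pandas as pd

df = pd.DataFrame({"department": ["Sales", "IT", "HR", "IT"]})

# One-hot encoding: each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["department"])

print(sorted(encoded.columns))
# ['department_HR', 'department_IT', 'department_Sales']
```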

Furthermore, in the case of a large number of features, it could be appropriate to reduce their dimensionality (with methods like PCA or similar).
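A sketch of dimensionality reduction with scikit-learn’s PCA, on synthetic data standing in for an employee dataset (the shapes and the 95% variance threshold are illustrative choices, not prescriptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical employee dataset: 200 rows, 10 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
```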

Last but not least, large differences in feature scales generally hinder the optimization stage of these algorithms. For example, the age of employees could range between 24 and 65, while salaries could vary from 20k to 1M euros. It is recommended to normalize all features – to the 0-1 range – or to standardize them.
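Min-max normalization is the simplest way to bring such features onto the same 0-1 scale; a sketch in NumPy using the age and salary ranges from the example above:

```python
import numpy as np

# Hypothetical feature columns with very different scales
age = np.array([24.0, 38.0, 51.0, 65.0])        # years
salary = np.array([20e3, 45e3, 120e3, 1e6])     # euros

def min_max(x):
    """Rescale a feature linearly to the 0-1 range."""
    return (x - x.min()) / (x.max() - x.min())

age_scaled = min_max(age)
salary_scaled = min_max(salary)
```

Both columns now span exactly 0 to 1, so neither dominates the optimization simply because of its units.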


In the modeling phase, the best algorithm for predicting employee turnover is found.

As with every supervised machine learning task, in order to train the predictive model, the dataset is split into a Training Dataset, on which the model is trained and its parameters fine-tuned to best fit the target variable, and a Test Dataset, on which the performance of the trained model is assessed.
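The split can be done in one line with scikit-learn; here on random stand-in data (100 hypothetical employees, 4 features), with `stratify` preserving the proportion of leavers in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (100 employees, 4 features) and binary target
rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = rng.integers(0, 2, size=100)

# Hold out 20% as the Test Dataset; stratify preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```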

Usually there are many algorithms that can be tried (such as Decision Trees, Random Forests, XGBoost, KNNs, SVMs, Neural Networks…), and which one performs best often depends on the conditions (company size, amount of historical data, number of features, …) – and it is not at all obvious that it will be the most sophisticated one!
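A sketch of this model bake-off with scikit-learn, on a synthetic stand-in for an HR dataset (the sample sizes and the roughly 10%/90% class balance are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an HR dataset: 500 employees, 8 features,
# roughly 10% positives (leavers)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Train each candidate and score it on the same held-out test set
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
          for name, model in models.items()}

print(scores)
```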

Even the performance evaluation of a model must be done carefully: typically, people use Accuracy as a metric, calculated as the number of correctly predicted cases divided by the total number of predictions. But this metric alone can be misleading! If, for example, the positive (employee leaves) and negative (employee stays) classes are highly imbalanced, say 5% positive and 95% negative, a dummy classifier that always predicts the negative class would reach 95% accuracy even though it predicts nothing useful! In order to evaluate predictive power correctly, other metrics must be taken into account (such as Precision, Recall, the ROC curve and so on).
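The dummy-classifier example is easy to reproduce numerically with scikit-learn’s metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 5% positives (employee leaves) and 95% negatives (employee stays)
y_true = np.array([1] * 5 + [0] * 95)

# A dummy classifier that always predicts the negative class
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.95, despite predicting nothing useful
print(recall_score(y_true, y_pred))    # 0.0: not a single leaver is detected
```

The 95% accuracy hides a recall of zero on the class that actually matters, which is exactly why accuracy alone is not enough on imbalanced HR data.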


Once the best model has been chosen and trained and predictions have been made, it is time to draw conclusions. The best way is through graphical visualization, which is also essential for interacting with the customer. The principal languages used for modeling (like R, Python, …) have many packages that offer a wide variety of useful and beautiful graphs, and, depending on the architecture used, there are also many frameworks and apps that work very well.


The last important phase of the project is the interface with customers and with their IT systems. The project team should keep in mind, from the very beginning, that all the work and results produced must be easy to deploy on the customer’s system, if they don’t want to live in a deep nightmare.

Works cited: