For many organisations, Machine Learning is technological gold dust. Sprinkle a little onto your data and watch it magically transform your business in ways you could never imagine. Complex calculations spring into motion, seemingly predicting the future with an accuracy Nostradamus could only dream of. Despite the promises, many organisations fail to derive any meaningful benefit from this emerging technology.
As with other IT initiatives, this often happens when there are no clear business goals, or when technical projects are not aligned with corporate strategy. Success requires contributions from both business and technical staff; a purely IT-driven approach often results in a solution looking for a problem. Of course, there are many success stories touching millions of people. Invisible forces that make our personal and professional lives a little easier: medical diagnoses, credit fraud detection, shopping recommendations, and virtual assistants, to name a few.
Problems occur when people believe that spending sufficient time on a problem guarantees success. Look for long enough and meaningful insights will magically appear. But too much exploration can consume significant effort. Figure 1 compares random and focused exploration. Exploring within the context of a clear goal will generally yield faster results.
"In many cases, the change-management challenges of incorporating AI into employee processes and decision making far outweigh the technical challenges."
- A Harvard Business Review article, based on a survey of 3,000 executives. https://shrtm.nu/bvd1
This is not to say that exploration is valueless. History books record many discoveries that owe their existence to people who dared to stray from the beaten track. Scientists who risked their reputations, and sometimes even their lives, to challenge established doctrines. If your goal is a disruptive innovation, experimentation will likely play a more significant role in your story. If time is short, a meticulously planned initiative is more likely to pay off. If time is on your side, you may choose to explore a little more. The balance between exploration and exploitation is worth considering before you embark on your data science journey.
Putting it all together
A guiding process will help you to be consistent and efficient. Figure 2 defines a workflow that should point you in the right direction. The key stages are grouped by colour and relate to:
- Defining clear and achievable objectives.
- Finding the right data.
- Preparing the data and generating predictive models.
- Leveraging information and monitoring outcomes.
Aligning machine learning to business objectives
Mapping key decisions to business goals is a good way to start if your goals are unclear. In this way, you can assign relative values and priorities for each of these decisions.
Figure 3 shows a way to map value against effort for core business decisions. The top left quadrant contains objectives that are expected to return high value for relatively little effort. Look here for quick wins that may get early buy-in from key stakeholders. The top right quadrant shows high-value items that require more effort to complete. These are typically far-reaching, strategic objectives that have the greatest impact on your business. The bottom right quadrant shows high-effort items that return little value. Think hard before approving work in this area. And finally, the bottom left quadrant shows low-effort, low-value items. There may be objectives here that provide benefits to discrete business units or functions.
"If you can't measure it, you can't improve it."
- Peter Drucker. Austrian-born American management consultant, educator, and author.
In Figure 3, the colour of each decision/goal relates to the likelihood of success. This is a useful dimension to explore. You may find, for example, that high effort, high-value items carry greater risks in terms of implementation.
For each objective, there should be a set of success criteria: measurements that allow you to determine how well you have achieved your aims. As Peter Drucker once said, “If you can’t measure it, you can’t improve it”. At least not consciously, or in a consistent and quantifiable way.
The diagram does not show work priorities, and this is an important consideration. Unless, of course, you are able to do everything at once. One option is to use the size of each bubble to represent its relative priority.
The value–effort diagram is a great way to provide ‘at a glance’ information to help you select the best data objectives for your business. Experienced data scientists will be able to advise on how well machine learning is suited to any particular objective.
Finding the right data
Existing data is an essential ingredient for machine learning problems. We think of data in terms of four attributes:
Accessible: It must be possible to locate and extract data whenever necessary, whether via database queries, API calls, flat-file processing or some other mechanism. This must be possible without compromising the integrity or performance of your business systems. There is also GDPR to consider — ensuring that personal details are processed fairly.
Appropriate: Data must be appropriate for the objective you are addressing. This typically relates to what each column of data represents. Too many columns can significantly increase processing times. There are techniques that can help here. For example, it is possible to show the relative contribution that each column of data makes to any particular model. In this way, it may be possible to remove some columns without significantly impacting the results. A model may fail to provide meaningful predictions if key data is missing (or not available). This is much harder to address, and domain experts can really help.
Sufficient: There must be sufficient data (number of rows) to train and test machine learning algorithms. Whilst there are some techniques for handling insufficient data, there is really no substitute for access to plenty of historical data. Equally, some algorithms perform better than others on small datasets. Where possible, test the effectiveness of prediction models as more data becomes available. This may help you determine the minimum amount of viable data.
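One way to test effectiveness as more data arrives is a simple learning curve: train on progressively larger slices of history and track accuracy on a fixed held-out set. The sketch below uses synthetic data and a toy threshold "model" purely for illustration — every name and number here is an assumption.

```python
# Learning-curve sketch: does accuracy still improve as training data grows?
import random

random.seed(42)

# Synthetic history: label is 1 when x > 0.5, with roughly 10% label noise.
examples = []
for _ in range(200):
    x = random.random()
    label = int(x > 0.5)
    if random.random() < 0.1:              # flip ~10% of labels
        label = 1 - label
    examples.append((x, label))

def train_threshold(train):
    """'Train' by picking the cut-off that best separates the labels."""
    best_t, best_acc = 0.5, 0.0
    for t in (i / 20 for i in range(1, 20)):
        acc = sum((x > t) == bool(y) for x, y in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

holdout = examples[150:]                   # fixed test set
for n in (10, 50, 150):                    # growing training sets
    t = train_threshold(examples[:n])
    acc = sum((x > t) == bool(y) for x, y in holdout) / len(holdout)
    print(f"n={n:3d}  holdout accuracy={acc:.2f}")
```

When the curve flattens, extra rows are buying little — a rough signal that you have found the minimum amount of viable data for that model.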
Accurate: Data must accurately represent the things it is measuring. Be mindful of data columns that are misused or incorrectly labelled. In many cases, people store unrelated data in an existing database column rather than make the effort to extend the schema.
If the data you have does not give acceptable results then your options are limited:
Wait until sufficient data is available.
Estimate the effort to capture missing data.
Constrain the domain you are investigating.
Move on to a different objective.
Work harder on feature extraction.
The cost and effort to collect data can be high. For example, installing remote sensors on equipment or significantly changing a business process or application workflow. Equally, manual administration can be difficult to enforce. Burdening people with repetitive data entry can often create more problems than it solves. If you do not have enough data then it’s time to seriously consider whether to continue on your current path.
The processing part
Preparing the data
By this point, you should have translated business problems into prediction problems and identified the right data for generating predictive models. There should be a plan that defines what happens and when. Your data will be accessible, appropriate, sufficient and accurate. This does not mean it is ready for modelling — it must first be prepared, and that preparation can take significant effort. Some common operations are:
Aggregating data that comes from several different places.
Ensuring data formats are consistent, for example, £123.00, 123.00, £123.
Converting nominal data to a numeric form.
Estimating/approximating missing values. It is often better to fill in missing data than delete entire rows. Another option here is to provide default values where data is missing.
Discarding data that does not materially contribute to the predicted outcomes.
Splitting a single field into two or more different fields. For example, a residential or commercial address.
Normalising data — a fancy word for ensuring the values of one field do not overpower the values in another.
Categorising numeric data to form discrete values. For example, a column of data may contain a value between 0 and 10 to represent the size of a risk. You may decide to change this to Low (0–3), Medium (4–6) and High (7–10).
These are just some of the ways in which data can be treated before the modelling process begins.
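Several of the operations above can be sketched in a few lines. The records, field names and thresholds below are invented for illustration: the snippet makes currency formats consistent, imputes a missing value instead of dropping the row, normalises one field so it cannot overpower another, and bins a 0–10 risk score into discrete categories.

```python
# Hypothetical data-preparation sketch covering format consistency,
# imputation, normalisation and binning on invented records.
from statistics import mean

records = [
    {"amount": "£123.00", "risk": 2, "visits": 4},
    {"amount": "123.00",  "risk": 8, "visits": None},   # missing value
    {"amount": "£98",     "risk": 5, "visits": 9},
]

# 1. Make formats consistent: strip the currency symbol, parse to float.
for r in records:
    r["amount"] = float(r["amount"].lstrip("£"))

# 2. Impute missing values with the column mean rather than deleting rows.
known = [r["visits"] for r in records if r["visits"] is not None]
for r in records:
    if r["visits"] is None:
        r["visits"] = mean(known)

# 3. Normalise amounts to 0–1 so one field does not overpower another.
lo = min(r["amount"] for r in records)
hi = max(r["amount"] for r in records)
for r in records:
    r["amount_norm"] = (r["amount"] - lo) / (hi - lo)

# 4. Bin the 0–10 risk score into discrete categories.
def risk_band(score):
    return "Low" if score <= 3 else "Medium" if score <= 6 else "High"

for r in records:
    r["risk_band"] = risk_band(r["risk"])
```

In practice these steps are usually handled by a data-preparation library rather than hand-written loops, but the logic is the same.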
Generating predictive models
This is the part where science and art overlap. Where there are no wrong questions and no absolutely right answers. The possible combinations of algorithms, datasets, and configuration can be truly staggering. Experience and intuition both play a part in selecting and fine-tuning algorithms until the results seem acceptable. This is where creating a talented team can really pay off.
The idea is to define a reasonable target before starting. Then, spend the least possible effort to reach that target. Or to establish that the target is beyond reach. If it seems that you cannot satisfy your goal then the choices are simple. Either update your expectations or move on to a different objective. If you have quantifiable goals then your decision-making process will be easier. This is the importance of defining clear and achievable objectives as discussed earlier.
Be mindful of diminishing returns. It is easy to reach a point where considerable further effort results in little or no improvement. Knowing when to stop is as important as knowing when to start.
A talented team focused on teamwork
A good data science team does not consist solely of data scientists. Neither does it work in isolation, disengaged from the business people it is trying to help. Domain expertise is essential. Data scientists are smart people but it’s highly unlikely they will know your business better than you do. Subject matter experts will know what data fits which problem.
Ideally, the composition of your team should reflect the nature of your challenge. In broader terms, you will likely be focused on one or more of:
- Providing meaningful information that leads to actionable insights for your business.
- Visualising data patterns and relationships using statistical analysis.
- Creating models that predict outcomes based on historical data.
There are many different job titles and no definitive reference for what each covers. What’s important is creating the right balance of experience and expertise. A simple view of what’s needed is an easy way to ensure you have all of your bases covered.
More than anything else, working together and sharing ideas is the best catalyst for delivering value. After all, nobody is better than everybody.
Predicting the future
In summary, predicting future outcomes can deliver value to your business. Sadly, data scientists do not have crystal balls in their toolkit — there are no cast-iron guarantees of success. Having clear goals, a structured plan and the right team will definitely improve your chances of success.
Over time, predictive modelling will become faster and cheaper. As this happens, the decisions you make will become more valuable. The people who make these decisions will behave differently. They will need new skills to fully exploit emerging technologies.
Better decisions accelerate change. In turn, this change requires new ways of thinking, and you will need to adapt your data models accordingly — effectively building a dynamic environment where continual improvement is mandatory rather than optional. This all requires strong executive leadership and support. Buy-in from your key stakeholders is essential in justifying investment. Whatever your current position, AI and digital disruption have arrived. Creating a strategy to embrace them may be the easiest decision you have to make.