Human In The Loop (HITL) is traditionally defined as "a model that requires human interaction". In a Machine Learning context, it implies a hybrid computation model in which a human can intervene to overrule decisions taken by a machine when those decisions are less likely to be correct, as judged by the machine's confidence in each decision it makes.
The point of the HITL model is to leverage the cost reduction benefits of automation – through algorithms, statistical models, AI, you name it – while mitigating potential errors by having people take over tricky situations. This matters more the higher the cost of a mistake.
Most HITL systems will assume high confidence algorithm-based decisions need no human intervention. And it makes sense: the higher the confidence, the more redundant a human decision on top of the algorithm's becomes.
However, removing human oversight over certain kinds of algorithmic decisions – the high confidence ones – can generate hidden risk as the conditions of the environment the algorithm operates in change.
Because the parameters of most environments are dynamic, uncertainty creeps in over time. Keeping a pulse on the algorithm's decisions across the confidence spectrum therefore becomes necessary to ensure everything is functioning as expected.
Typically, there are two distinct reasons why we want to use the Human in the Loop model in the context of a predictive system:
1. Reduce the mistake rate of the system, by letting a human take over low confidence decisions.
2. Measure performance – the overall mistake rate – of the system, by letting a human evaluate algorithmic decisions across the entire confidence spectrum.
When implementing Human in the Loop as a fallback mechanism to reduce the error rate of our automated decisions, we will need to pick one or more thresholds to determine which decisions should be routed to a human.
Depending on the type of predictive model we are using, we will have either a confidence score or a probability indicator. This determines the thresholding model to implement (a small sketch follows the list below):
- Confidence. One threshold to separate decisions we're comfortable with automating from those we wouldn't. 90% might be acceptable (9 out of 10 correct decisions would be the expectation), whereas 20% will likely not be tolerated.
- Probability. Two thresholds to separate mid probability predictions versus high or low probability predictions. In a binary class prediction problem, 90% probability means very likely a positive case, 10% means very likely a negative case and 50% means we're not sure at all.
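To make the two thresholding models concrete, here is a minimal routing sketch in Python. The threshold values (90% and 10%) are taken from the examples above; the function names and the "automate"/"human_review" labels are illustrative assumptions, not part of any particular library.

```python
# Minimal sketch of the two thresholding models described above.
# Threshold values mirror the examples in the text; names are illustrative.
CONFIDENCE_THRESHOLD = 0.90          # single threshold for confidence scores
PROB_LOW, PROB_HIGH = 0.10, 0.90     # two thresholds for binary probabilities

def route_by_confidence(confidence: float) -> str:
    """Automate high-confidence decisions, send the rest to a human."""
    return "automate" if confidence >= CONFIDENCE_THRESHOLD else "human_review"

def route_by_probability(probability: float) -> str:
    """Automate clear positives and negatives, send the uncertain middle to a human."""
    if probability >= PROB_HIGH:
        return "automate_positive"
    if probability <= PROB_LOW:
        return "automate_negative"
    return "human_review"
```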
Regardless of these technicalities, selecting thresholds comes down to two directly opposed pressures: routing more decisions to humans lowers the mistake rate but increases review work, while routing fewer decisions to humans does the opposite. Your cost per mistake and your cost per human review will guide how you trade the two off. Some tasks are more costly for a human to review than others, and the same goes for the cost of an error.
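A back-of-the-envelope calculation can make this trade-off tangible. The cost figures, review rates and error rates below are purely illustrative assumptions; plug in your own numbers.

```python
# Illustrative threshold comparison under assumed costs (not recommendations).
COST_PER_MISTAKE = 50.0   # e.g. the cost of one wrongly automated decision
COST_PER_REVIEW = 1.0     # e.g. the cost of one human review

def expected_cost_per_decision(review_rate: float, automated_error_rate: float) -> float:
    """Reviewed decisions cost a review; automated decisions cost a mistake with some probability."""
    automated_rate = 1.0 - review_rate
    return review_rate * COST_PER_REVIEW + automated_rate * automated_error_rate * COST_PER_MISTAKE

# A looser threshold sends 5% of decisions to humans but lets 2% errors through;
# a stricter one sends 20% to humans and lets only 0.5% through.
loose = expected_cost_per_decision(review_rate=0.05, automated_error_rate=0.02)
strict = expected_cost_per_decision(review_rate=0.20, automated_error_rate=0.005)
print(f"loose: ${loose:.2f}/decision, strict: ${strict:.2f}/decision")
```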
A common problem with predictive systems is that their performance decays over time – accuracy goes down, or error rates go up. This decay happens because the conditions of the environment we're operating in are constantly changing.
This is why having an ongoing understanding of the actual performance of your predictive systems matters. This is usually done by sampling decisions taken by your systems and putting them through a Human in the Loop system that can give you ground truth answers to compare against.
Your goal should be to uniformly sample across the confidence – or probability – spectrum to gain a good understanding of the accuracy of those scores across the board.
The most common sampling techniques for this use case are uniform random sampling and stratified sampling across confidence buckets.
To obtain a meaningful measurement – one with enough statistical significance to be actionable – you will likely need 100-500 measured samples each time you want to compute your performance metric at different confidence levels. You probably want to compute this metric at least once a week, or once a day if your environment is highly dynamic or mistakes are particularly costly.
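As a sketch of what this could look like in practice, the snippet below stratifies decisions into confidence buckets, samples a fixed number per bucket for human review, and computes accuracy per bucket against the human labels. The bucket edges, the sample size of 100 and the decision dictionary shape are all illustrative assumptions.

```python
# Sketch of stratified sampling across confidence buckets for measurement.
# Bucket edges, sample sizes and data shapes are illustrative assumptions.
import random
from collections import defaultdict

BUCKETS = [(0.0, 0.5), (0.5, 0.7), (0.7, 0.9), (0.9, 1.01)]
SAMPLES_PER_BUCKET = 100  # within the 100-500 range discussed above

def stratified_sample(decisions):
    """decisions: iterable of dicts with 'id', 'confidence' and 'prediction' keys."""
    by_bucket = defaultdict(list)
    for d in decisions:
        for low, high in BUCKETS:
            if low <= d["confidence"] < high:
                by_bucket[(low, high)].append(d)
                break
    return {
        bucket: random.sample(items, min(SAMPLES_PER_BUCKET, len(items)))
        for bucket, items in by_bucket.items()
    }

def accuracy_per_bucket(sampled, human_labels):
    """human_labels maps decision id -> ground-truth label produced by the HITL queue."""
    report = {}
    for bucket, items in sampled.items():
        labelled = [d for d in items if d["id"] in human_labels]
        if labelled:
            correct = sum(d["prediction"] == human_labels[d["id"]] for d in labelled)
            report[bucket] = correct / len(labelled)
    return report
```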
Once you have this granular performance metric, you can even use it as a way to dynamically tweak your confidence threshold. This way, you can automatically guarantee your overall performance doesn't cross a certain mistake rate guardrail.
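One way to express such a guardrail is a small adjustment rule: raise the automation threshold when the measured error rate of automated decisions breaches the target, and relax it slowly when there is comfortable headroom. The target rate, step size and bounds below are assumptions for illustration, not a prescribed policy.

```python
# Sketch of a guardrail-driven threshold adjustment; numbers are illustrative.
TARGET_ERROR_RATE = 0.05
STEP = 0.01

def adjust_threshold(current_threshold: float, measured_error_above_threshold: float) -> float:
    """Tighten automation when the guardrail is breached, relax it slowly when well below it."""
    if measured_error_above_threshold > TARGET_ERROR_RATE:
        return min(0.99, current_threshold + STEP)
    if measured_error_above_threshold < TARGET_ERROR_RATE / 2:
        return max(0.50, current_threshold - STEP)
    return current_threshold
```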
Systematizing measurement has the – sometimes underrated – side benefit of allowing you to easily test new predictive models, rules and algorithms by measuring their effectiveness.
The two Human in the Loop use cases described above partially overlap: the information generated from the low confidence decisions handled by our HITL system can also be used for measurement.
As a result, in order to measure performance, we don't necessarily need to sample across the entire confidence spectrum. Instead, we only need to sample above the confidence threshold.
Those are the fully automated decisions we would otherwise have no visibility into. This way we can make a more efficient use of our human capital.
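A sketch of how the measurement set could then be assembled: reuse the labels the HITL queue already produces for below-threshold decisions, and only sample fresh reviews from the automated side. The `sample_fn` parameter stands in for whatever sampling strategy you use (for example, the stratified sampling above) and is a hypothetical helper, not a real API.

```python
# Sketch: reuse existing HITL labels below the threshold, sample only above it.
def measurement_set(decisions, threshold, sample_fn):
    below = [d for d in decisions if d["confidence"] < threshold]   # already human-reviewed
    above = [d for d in decisions if d["confidence"] >= threshold]  # automated, sample for review
    return below + sample_fn(above)
```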
An optional but very powerful part of the Human in the Loop cycle is feeding human decisions back into the predictive system. After all, our goal with these models, rules and algorithms is to emulate human judgement in a faster and cheaper way. Therefore, we should make sure we are using this information to improve the performance of our predictive systems.
We can close the loop in two different ways: by feeding human decisions back as labelled training data to retrain our models, or by using them to refine the rules and heuristics that drive our decisions.
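For the retraining path, a minimal sketch might look like the following, assuming a scikit-learn-style model exposing `fit`. The function name, data shapes and retrain cadence are assumptions for illustration.

```python
# Sketch of folding human-reviewed decisions back into the training data.
# Assumes a scikit-learn-style model with a fit(X, y) method.
def retrain_with_human_labels(model, X_train, y_train, human_reviewed):
    """human_reviewed: list of (features, human_label) pairs from the HITL queue."""
    X_extra = [features for features, _ in human_reviewed]
    y_extra = [label for _, label in human_reviewed]
    model.fit(list(X_train) + X_extra, list(y_train) + y_extra)
    return model
```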
Regardless, these improvements should be detectable in the metrics developed as part of the HITL performance measurement system.
A common dilemma companies face when building a Human in the Loop system is whether to engineer their own internal tooling – effectively a private Mechanical Turk – or adopt an existing platform.
We would advise those companies to consider setting up these manual workflows in the Human Lambdas platform as a cheaper, faster and more reliable way to run their Human in the Loop flows.
Previously published at https://www.humanlambdas.com/post/an-introduction-to-human-in-the-loop