In the real world, we often have very poor inputs when automating a process. For example, the process we want to automate may be completely manual today, and the people performing it make a lot of mistakes. Moreover, errors are corrected with a long delay, so the current data contains a lot of misleading answers.
However, if we have a lot of data, quantity inevitably turns into quality (hello, emergentism!).
In this article, I will share a real case study of how we automated the transaction categorization process in bookkeeping. As a result, we managed to save the company $20,000+ in direct operational costs, while also reducing the average turnaround time from 24 to 2 hours and increasing the average quality of categorization from ~80% to 95%+.
Along the way, I'll touch on many practical aspects of how you can automate almost any repetitive manual process.
You don't need to build rocket-science models for that - you just need to choose the right architecture: an ML model that directly makes the final decision, a set of additional algorithms to process the data, and autonomous agents to collect the information (yes, we'll be using ChatGPT!).
The article contains a certain amount of technical detail to make it interesting for senior-level ML specialists, but I will also try to immerse the reader in the context of the business problem and explain the logic behind certain decisions in simple language. So even if you don't build ML models yourself, the knowledge you gain will probably give you many useful ideas for your project.
Let's go!
Accounting categorization is a simple enough process to explain even to a 5-year-old child. You have some business transactions and you need to categorize them from an accounting perspective.
For small companies, there's not a lot of nuance. You spend money, you record it as an expense. You get the money - put it in revenue. But the bigger the business, the more difficult it is to categorize its operations.
Besides, many non-standard document formats appear, transaction descriptions become more and more exotic, and some transactions are hardly described at all.
The question arises: how do bookkeepers do their job under these conditions?
We interviewed many of the agents doing categorization in our Bookkeeping Factory and watched dozens of hours of recordings of their work to understand exactly how they make decisions. It turned out that they often base the decision on a single keyword, while also looking at secondary factors like the counterparty or the transaction amount.
It quickly became clear that we didn't need neural networks (at the beginning of the project we were specifically considering an FCNN, as it is often used for similar tasks) and that we could solve the problem with a simpler model: we would very rarely categorize full-fledged sentences, and the keywords themselves can conveniently be turned into separate near-categorical features.
Important: use the simplest solution that works. Don't over-engineer the architecture at the start, and leave more complex solutions for later. Very often you miss coverage/quality not because of the model, but because of a misunderstanding of the business context and, as a consequence, missing important features.
With this in mind, Gradient Boosting proved to be the best-suited algorithm to train on our data. Here are several reasons why gradient-boosting algorithms excel in our scenario:
Handling Mixed Data Types: Gradient Boosting algorithms can naturally handle mixed data types, including both scalar and categorical features, without requiring extensive preprocessing. Many other algorithms might struggle with categorical features, which often need encoding or transformation.
Effective Handling of Imbalanced Data: Imbalanced class distributions are common in multi-class classification. Gradient Boosting algorithms allow you to assign different weights to different classes, which can help address class imbalance issues.
Automatic Feature Interaction Detection: Gradient Boosting algorithms can automatically detect and utilize feature interactions, which is essential for complex datasets with a mix of scalar and categorical features.
High Predictive Accuracy: Gradient Boosting algorithms are known for their high predictive accuracy, making them well-suited for multi-class classification tasks where you want to maximize classification performance (they are among the most frequent winners of Kaggle multi-class competitions).
Robust to Noisy Data: Gradient Boosting models are generally robust to noisy or incomplete data, which is often encountered when dealing with mixed feature types.
We were also using 2 different libraries for gradient boosting in parallel - XGBoost and CatBoost. XGBoost is the gold standard in the industry and needs no introduction. As for CatBoost, it is the only implementation of the Gradient Boosting algorithm that works with text features without any additional preprocessing. Also, CatBoost is specifically designed to handle categorical features seamlessly without the need for one-hot encoding or label encoding.
In our case, CatBoost has shown better results.
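To make the point about mixed data types concrete, here is a minimal sketch of training CatBoost directly on text, categorical, and numeric columns. The column names, file name, and hyperparameters are illustrative assumptions, not our production setup.

```python
# Minimal sketch: CatBoost trained on mixed feature types without manual encoding.
# Column names, file name, and hyperparameters are illustrative.
import pandas as pd
from catboost import CatBoostClassifier, Pool

df = pd.read_csv("line_items.csv")           # hypothetical export of categorized line items

text_features = ["description"]              # free-text line item description
cat_features = ["counterparty", "currency"]  # categorical columns, no one-hot encoding needed
num_features = ["amount"]

X = df[text_features + cat_features + num_features]
y = df["account_code"]

train_pool = Pool(X, label=y, cat_features=cat_features, text_features=text_features)

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    loss_function="MultiClass",
    verbose=100,
)
model.fit(train_pool)
```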
The first thing we did was to build a playground to quickly test all the new features. We couldn't predict ahead of time all the features and approaches we would eventually come up with and which of them would show the best results. However, we realized that we wanted to make experiments as fast and easy as possible when testing new hypotheses.
Here's what we wanted from this playground:
1-command run dataset preparation (download the actual dataset from an analytical database, split it into train/validation/test, and prepare it for use)
Convenient configuration of experiments: when you work on a project for a long time, you accumulate a lot of new features and hyperparameters, and their best combination may change over time. You want to tune many parameters really fast, so it's better to create a config file for that (see the sketch a bit below).
Splitting the code into layers: data aggregation, preprocessing (e.g., text correction), enrichment (adding important features), model training, and calculation and visualization of the experiment results.
The layers themselves are also divided into separate domain-specific services. For example, in our case the additional categorization of transaction counterparties is important, so we use autonomous agents built on LLMs (specifically ChatGPT) for it and run it as a separate counterparty information retrieval service.
Without such a playground, new experiments would be many times slower, so I advise you to build this environment from the beginning.
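To illustrate the "convenient configuration of experiments" point, here is a rough sketch of what such a config might look like. The field names, feature list, and values are made up for the example; they are not our actual config.

```python
# Illustrative experiment config: one editable object drives dataset preparation,
# the feature set, and model hyperparameters. Names and values are made up.
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    dataset_snapshot: str = "2023-09-01"     # which dataset export to download
    test_size: float = 0.1                   # share of data held out for testing
    features: list = field(default_factory=lambda: [
        "description", "counterparty", "amount",
        "keyword_prob_75", "keyword_prob_95",  # threshold features described later
    ])
    class_weighting: str = "balanced"        # how to weight rare account codes
    model_params: dict = field(default_factory=lambda: {
        "iterations": 2000,
        "learning_rate": 0.05,
        "depth": 8,
    })

config = ExperimentConfig(test_size=0.2)     # tweak fields to launch a new experiment
```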
When it comes to training, the implementation was based on the Azure cloud, since at that moment we had a grant from Microsoft, but a similar approach can be used with any cloud platform.
The training process consists of the following steps:
Run the needed instance*. First of all, check the number of cores/threads available on the instance, since the Gradient Boosting algorithm parallelizes quite well;
Copy the project to a remote server;
Install the dependencies;
Launch the experiment.
*Also pay attention to the instance type, because some of them have limited working time under full load, which can increase the training time. A good approach is to build a Docker image and copy that instead. We could also give an overview of instance types, but that's a bit of a different topic.
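The four steps above can be wrapped in a small launcher. This is only a sketch under obvious assumptions: the host name, paths, and commands are placeholders, and in practice copying a prebuilt Docker image would replace the copy/install steps.

```python
# Sketch of a training launcher: copy the project to the remote instance,
# install dependencies, and start the experiment. Host, paths, and commands
# are placeholders; a Docker image would replace the copy/install steps.
import subprocess

HOST = "azureuser@training-instance"   # hypothetical VM address
REMOTE_DIR = "~/categorizer"

def run(cmd: list) -> None:
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["scp", "-r", ".", f"{HOST}:{REMOTE_DIR}"])                                      # copy project
run(["ssh", HOST, f"cd {REMOTE_DIR} && pip install -r requirements.txt"])            # dependencies
run(["ssh", HOST, f"cd {REMOTE_DIR} && nohup python train.py > train.log 2>&1 &"])   # launch
```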
By the time we started working on the model, our dataset consisted of 3,000,000 manually categorized line items (a line item is a single line from a document; invoices, for example, often consist of multiple line items). Of these, only about half had been reviewed by accountants: this is due to a peculiarity of how our bookkeeping factory works. To provide real-time bookkeeping, we immediately send all documents to our Kuala Lumpur-based bookkeeper department for categorization. They can escalate complex cases to accountants, but do most of the categorization themselves.
That said, the accountants review all categorizations before submitting the accounting period, as they are responsible to the government for it. On average, a bookkeeper-level categorization contains about 20% errors of varying severity. It's important to note that most of these errors are not critical from a financial perspective. For example, bookkeepers often confuse revenue from core and non-core sales. From a tax perspective, this doesn't affect the bottom line, but from a potential audit perspective, the two shouldn't be confused.
These nuances created additional challenges in training, but we were able to work around them.
Apart from feature engineering, the following actions significantly contributed to model performance:
Two test datasets: the final accuracy can only be calculated on submitted data (which can take up to several months to arrive), while the model's coverage should be estimated on the most recent data, so we used two different datasets. For accuracy, we used the last 10% of records that had been submitted; for coverage, we used the most recent 10% of the data.
Balancing class weights: since the classes are not balanced, we adjust the weights of the rare account codes for training, calculated as n_samples / (n_classes * np.bincount(y)). CatBoost provides a dedicated class_weights parameter for passing an array of weights (see the sketch after this list).
Adjusting class weights based on error likelihood: we also added a per-record weight based on the frequency of human error when assigning an account code (this is relevant when training on not fully validated data).
Feature with thresholds: it’s also possible to let the model decide on the thresholds of the features, where it’s applicable
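Here is a minimal sketch of the class-weight calculation from the "balancing class weights" item, using the formula above together with CatBoost's class_weights parameter. The tiny label array is only for illustration.

```python
# Sketch: per-class weights for imbalanced account codes, computed as
# n_samples / (n_classes * np.bincount(y)); the label array is illustrative.
import numpy as np
from catboost import CatBoostClassifier

def balanced_class_weights(y: np.ndarray) -> list:
    counts = np.bincount(y)                  # number of samples per class
    n_samples, n_classes = len(y), len(counts)
    return (n_samples / (n_classes * counts)).tolist()

y_train = np.array([0, 0, 0, 0, 1, 1, 2])    # toy integer-encoded account codes
weights = balanced_class_weights(y_train)    # rare classes receive larger weights

model = CatBoostClassifier(
    loss_function="MultiClass",
    class_weights=weights,                   # CatBoost's parameter for class weights
)
# Per-record weights (e.g. down-weighting labels likely to contain human error)
# can additionally be passed via the sample_weight argument of fit().
```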
The most contributing features are often born at the intersection of domain and technical knowledge.
In our case, the “strongest” feature that added the top 30% of coverage turned out to be the probability of keyword categorization in different dimensions.
How it works.
The shadowing sessions made it clear that bookkeepers very often look at how similar cases were categorized in the past, and they determine similarity by matching keywords.
We decided to formalize this process. For each categorization case, we record its keywords. Then, for each keyword, based on our dataset, we calculate the probability that a particular ledger account will be selected. Since a line item often contains more than one keyword, we need to assign weights to them somehow. This can be done either as a simple weighted average - i.e., giving the same weight to each keyword encountered in the line item - or in a way similar to tf-idf - i.e., giving more weight to rarer keywords.
Important: in our practice, the second (tf-idf-like) approach works better, but we ended up adding both approaches as two separate features - one computed the first way and the other the second. Both features essentially pick values from the same set (more on this a bit below).
This approach, as a rule, does not hurt accuracy, but adds an extra percentage of coverage, which is always nice. Here's why it works: with proper training, the algorithm realizes that one of the features correlates more strongly with the correct answer and therefore gives it a higher feature importance. When the two features conflict, the one with the higher importance is prioritized.
So the next time you have several hypotheses for how to calculate a particular feature, consider adding all of them as separate features. Then measure each feature's error contribution; if it is negative (the feature reduces the error), feel free to keep them all.
Next, we extract the keywords from the new line item that we want to categorize and combine the per-keyword probabilities of selecting each bookkeeping account. You'll end up with something like this:
Account A - 80%
Account B - 15%
The rest - 5%.
In the feature itself, we write only the class that passes the threshold specified in the config. You can add several such features with different threshold levels, for example, 75%, 95%, and 99%. If some account has a 99% probability, all such features will have the same value - that account. If an account has a probability of 76%, the first feature will contain the account and the remaining two will be null. Why does this work? See the "important" note above.
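To make the keyword feature concrete, here is a simplified sketch: it builds per-keyword account distributions from historical data, combines them for a new line item (uniform or idf-like weighting, as discussed above), and turns the result into threshold features. The toy data, the smoothing-free math, and the feature names are assumptions for illustration only.

```python
# Simplified sketch of the keyword-probability features. Real preprocessing
# (tokenization, keyword extraction, smoothing) is more involved.
from collections import Counter, defaultdict
import math

# Historical data: (keywords found in a line item, chosen ledger account)
history = [
    (["hosting", "aws"], "IT expenses"),
    (["aws", "invoice"], "IT expenses"),
    (["lunch", "invoice"], "Meals"),
]

# 1. Per-keyword distribution over accounts, plus keyword document frequency.
keyword_account_counts = defaultdict(Counter)
keyword_doc_freq = Counter()
for keywords, account in history:
    for kw in set(keywords):
        keyword_account_counts[kw][account] += 1
        keyword_doc_freq[kw] += 1

def account_probs(keywords, idf_weighted=False):
    """Combine per-keyword account distributions for a new line item."""
    n_docs = len(history)
    scores, total_weight = Counter(), 0.0
    for kw in keywords:
        counts = keyword_account_counts.get(kw)
        if not counts:
            continue
        # idf-like weighting gives rarer keywords more influence
        weight = math.log(n_docs / keyword_doc_freq[kw]) + 1.0 if idf_weighted else 1.0
        kw_total = sum(counts.values())
        for account, c in counts.items():
            scores[account] += weight * c / kw_total
        total_weight += weight
    return {a: s / total_weight for a, s in scores.items()} if total_weight else {}

def threshold_feature(probs, threshold):
    """Return the top account only if its probability passes the threshold."""
    if not probs:
        return None
    account, p = max(probs.items(), key=lambda kv: kv[1])
    return account if p >= threshold else None

probs = account_probs(["aws", "invoice"], idf_weighted=True)
features = {f"keyword_prob_{int(t * 100)}": threshold_feature(probs, t)
            for t in (0.75, 0.95, 0.99)}
print(probs, features)   # only the 75% feature is filled for this toy example
```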
LLMs are a great technology for many textual tasks. But categorizing short word combinations, often with no unambiguous meaning, is not the most appropriate task for them. Nevertheless, we can use them to search for and summarize additional information and add the results of that work as separate features.
In our case, we want to understand whether the counterparty of a transaction is the same counterparty that appeared in a previous transaction. If yes, it greatly increases the probability of categorization under the same account. If it is a new counterparty, we want to get additional useful information about it.
To do this, we first want to make sure that the difference in names is not a typo or an OCR error, so we compute the distance between counterparty names with the Damerau–Levenshtein algorithm. If it does not exceed a threshold, we run our autonomous agent on ChatGPT to crawl local company registers (e.g., for Singapore it is the ACRA register) with the name of that counterparty. Thanks to this, we can make sure the company name is spelled correctly and also find out the counterparty's main field of activity.
We do a similar thing when we encounter obscure words - we Google the word and ask ChatGPT to draw a conclusion about what it is. For example, "M1133" may turn out to be the name of a specific phone model, or it may remain an incomprehensible set of letters, in which case we have to ask our client what it means.
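A condensed sketch of this enrichment step is below. It assumes the jellyfish library for the Damerau–Levenshtein distance and the OpenAI Python client for the agent call; the threshold, model name, and prompt are placeholders, and the real agent actually crawls the register rather than relying on a single prompt.

```python
# Sketch of the counterparty enrichment step. jellyfish provides the
# Damerau-Levenshtein distance; threshold, model name, and prompt are
# placeholders, and the real agent crawls the ACRA register with tools.
import jellyfish
from openai import OpenAI

known_counterparties = ["ACME PTE LTD", "GLOBEX TRADING PTE LTD"]  # toy data
MAX_TYPO_DISTANCE = 2            # illustrative threshold for OCR typos

def closest_known(name: str):
    """Return (edit distance, name) for the closest known counterparty."""
    return min(
        (jellyfish.damerau_levenshtein_distance(name.upper(), known), known)
        for known in known_counterparties
    )

def lookup_counterparty(name: str) -> str:
    """Ask the LLM agent to verify the name and summarize the company."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",     # assumed model name
        messages=[{
            "role": "user",
            "content": f"Check the Singapore ACRA register for '{name}'. "
                       "Return the correctly spelled name and its main field of activity.",
        }],
    )
    return response.choices[0].message.content

distance, candidate = closest_known("ACME PTE LDT")   # looks like an OCR typo
if distance <= MAX_TYPO_DISTANCE:
    info = lookup_counterparty(candidate)             # confirm spelling and activity
```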
Important: use each technology for the tasks it was created for and shows the best results at. In our case, we are learning to predict the probability of a categorization based on a statistical distribution over many factors whose weights we do not know, and there is no formalized list of rules for that. In such circumstances, we are better served by simpler models that build a statistical distribution of values from the input data. This is something gradient boosting does very well - and something the Transformer architecture, on which all current LLMs are built, can easily get wrong.
One of the nice bonuses of ML model building turned out to be the following observation.
Since gradient boosting returns class probabilities, we can treat them as the model's confidence in its answer. An obvious application of the confidence level is to define a threshold above which we use the model's response. If the threshold is not reached (the model is not confident enough), we send the item out for human classification.
The less obvious thing is that very low confidence can correlate with really complex cases that require additional human verification. That was exactly our situation!
Cases where the most likely class has a very low probability tend to have a higher error rate in the human process as well. If such cases go through the standard flow, where they are categorized by bookkeepers, they are much more likely to end up with an incorrect answer.
From a product perspective, it's more beneficial to send these cases straight to accountants or clients. That's why we set up a second, lower threshold: when it is breached, the document is sent to a separate flow.
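A sketch of this two-threshold routing is below. The threshold values and route names are illustrative; predict_proba and classes_ are standard for a CatBoost (or scikit-learn style) classifier.

```python
# Sketch of the two-threshold routing: auto-accept confident predictions,
# escalate very uncertain ones, and send the rest through the standard
# bookkeeper flow. Threshold values and route names are illustrative.
UPPER_THRESHOLD = 0.90   # above this: accept the model's answer automatically
LOWER_THRESHOLD = 0.30   # below this: likely a complex case, escalate

def route_line_item(model, features):
    probs = model.predict_proba([features])[0]
    confidence = probs.max()
    predicted_account = model.classes_[probs.argmax()]

    if confidence >= UPPER_THRESHOLD:
        return "auto", predicted_account        # categorized by the model
    if confidence <= LOWER_THRESHOLD:
        return "escalate", None                 # straight to accountants / clients
    return "bookkeeper", predicted_account      # standard human categorization
```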
Although this approach adds local costs, it contributes to the global optimization of the process: fewer data errors -> more trust in real-time accounting (higher NPS) -> less time for accountants to check (lower operating costs).
In addition to the ML model itself, which makes the final categorization decision, we built 2 important supporting components for it: a set of algorithms that enrich transactions with additional features (such as the keyword probability features described above), and autonomous LLM agents that collect missing information (such as the counterparty lookups).
Without these two components, the result of the model would be many times worse. So we can't say that our ML model magically started to be smarter than humans. It's just that we added additional features that allow the model to make a more informed decision, and we did quality training.
Important: in real life, the existing business process we are working on is often far from ideal. To avoid being blocked by product development, consider gathering important information on the backend side. By cutting off the visual interface that people need but ML models don't, you'll save a significant amount of resources. And with LLM technologies, you can implement many routine information gathering and summarization processes much faster, especially if they don't require high accuracy.
The problem we solved is both unique and standard at the same time.
On the one hand, it is unique because every domain and business process has its nuances that need to be analyzed, comprehended, and taken into account when building an ML model. In this article, I have detailed a few such nuances in the context of building a bookkeeping classifier. But, of course, there are many more.
On the other hand, the approaches outlined in this article can be widely reused to automate many other business processes, and not necessarily those where you need to classify something explicitly. After all, in essence, any business process is an ensemble of classifiers: you choose one of the possible solutions at stage 1, then choose one of the possible solutions at stage 2, and so on.
By combining LLM technologies with simpler ML models and a deep understanding of the business process, one can achieve amazing results. In our case, we were able to automate 90%+ of the operations and achieve a much better quality of responses (98%+) than the people whose work we automated. In addition, we prevented a lot of errors in the manual process and saved labor by sending anomalies to a separate stage.
Good luck with your ML experiments!)
by Peter Potapov,
special thanks to Andrey Mescheryakov