Chilled Geek's Blog

Always curious, always learning

Categorising Short Text Descriptions: Machine Learning or Not?

Posted at — Feb 22, 2020

Disclaimer:

Keeping track of pocket money

When I was younger, I liked to keep track of how I spent my pocket money. I used to keep receipts, register the amount and category of each transaction on a spreadsheet, and every so often generate a pie chart (see graph below) to see the breakdown of where my pocket money went and review which expense categories I should cut back on if I needed to save up for bigger purchases.

Sample Monthly Expenditure Breakdown

Things got easier when I moved from retaining receipts to getting digital statements (e.g. csv), which included information such as the amount and description of each transaction. Categorising transactions manually was boring, so I wrote a python script to automate this process, which did the following:

- Read in the csv statement of new transactions
- Converted each transaction description into a bag of words representation
- Trained a classifier (random forest) on previously categorised transactions
- Predicted a category for each new transaction and wrote the results out for review
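In code, a minimal sketch of that kind of script (not my original one; the file and column names here are just illustrative) could look something like this:

```python
# Sketch only: categorise new transactions with a bag of words model and a
# random forest trained on previously reviewed transactions. The CSV file
# names and column names ("description", "category") are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

labelled = pd.read_csv("labelled_transactions.csv")  # description, category
new = pd.read_csv("new_transactions.csv")            # description only

# Bag of words features feeding a random forest classifier
model = make_pipeline(CountVectorizer(), RandomForestClassifier(n_estimators=100))
model.fit(labelled["description"], labelled["category"])

# Predict a category for each new transaction and save the batch for review
new["predicted_category"] = model.predict(new["description"])
new.to_csv("categorised_transactions.csv", index=False)
```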

Then I’d go and analyse the batch of transactions. Instead of checking every transaction, I mostly focused on the ones with the highest amounts or the most frequently occurring descriptions, made sure their assigned categories were correct (amending those that weren’t), and added them to the pile of training data to improve the predictions on the next batch of new transactions. That way, with much less effort, even though not every transaction was categorised correctly, the ballpark figure for each category was good enough to give me a feel for what I wanted to know about my expenses.

All of this was done several years ago, before financial institutions started offering analysis on account transactions and spending habits. So I don’t use my script as often anymore. That said, I always like to think of how to do things differently in hindsight, which helps me to learn and improve continuously.

Non-ML solution?

There are many ways to address this challenge, which is essentially a multiclass classification problem for short text descriptions. If you treat it as an ML problem, you could throw all the data science tools at it: better ways of processing and representing the data than the bag of words model, or an algorithm other than random forest with better hyperparameter optimisation.

But taking a step back, was ML the best and only way (a question that is probably not easy for data scientists to say no to)? I started thinking about this after playing around with elasticsearch: I came across its fuzzy query function and thought that this function alone might do the trick.

The fuzzy match function in elasticsearch uses the Levenshtein distance (the number of single-character edits needed to turn one string into another) to find the closest matching entry. For example, given a new entry “HONEST BURGER”, it takes fewer edits to match an “EAT OUT” entry like “BURGER KING” than a “PUBLIC TRANSPORT” entry like “GATWICK EXPRESS”. So we would expect this method to have some degree of predictive power, possibly comparable to the performance of ML methods.
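To make the idea of “number of edits” concrete, here is a minimal Levenshtein distance implementation (for intuition only; elasticsearch has its own optimised implementation under the hood):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # delete ca
                            curr[j - 1] + 1,    # insert cb
                            prev[j - 1] + cost  # substitute (or match)
                            ))
        prev = curr
    return prev[-1]

# The EAT OUT entry is the closer match (smaller distance)
print(levenshtein("HONEST BURGER", "BURGER KING"))
print(levenshtein("HONEST BURGER", "GATWICK EXPRESS"))
```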

I wrote a simple elasticsearch python client (here is the github repo) to see how feasible this non-ML solution was. The strategy was roughly:

- Index the descriptions and categories of previously labelled transactions into elasticsearch
- For each new transaction description, run a fuzzy query against that index
- Assign the category of the closest matching (top-scoring) entry to the new transaction
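In code, the core of that strategy might look something like this (a sketch only, assuming the official elasticsearch Python client (8.x), a local cluster, and illustrative index/field names rather than the actual ones from my repo; a match query with fuzziness is used here so that multi-word descriptions are fuzzy-matched term by term):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Index previously categorised transactions
for doc in [
    {"description": "BURGER KING", "category": "EAT OUT"},
    {"description": "GATWICK EXPRESS", "category": "PUBLIC TRANSPORT"},
]:
    es.index(index="transactions", document=doc)
es.indices.refresh(index="transactions")

# 2. For a new description, fuzzy-match against the index and take the
#    category of the closest (top-scoring) hit
def predict_category(description: str) -> str:
    resp = es.search(
        index="transactions",
        query={"match": {"description": {"query": description, "fuzziness": "AUTO"}}},
        size=1,
    )
    return resp["hits"]["hits"][0]["_source"]["category"]

print(predict_category("HONEST BURGER"))  # expected: EAT OUT
```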

Test drive

To put this non-ML solution to the test, I did some rough analysis comparing it against a few ML methods: a neural network, XGBoost, and random forest. My sample jupyter notebook on github contains the exploration code, which uses a sample dataset (an anonymised sample of my bank transaction descriptions with annotated categories) to produce the results shown below.

Before I go on to the results, here are a few assumptions and caveats that are worth mentioning:

Category occurrence distribution in training and test set (randomly split)

Results

The graph below shows the performance of models trained (using 80% of the data) when applied to both the training and test set. Here are some points worth highlighting:

Snapshot comparison of ML model and non-ML (elasticsearch) performance

To compare the predictions more fairly, KFold cross validation (CV) with n = 10 was applied, and the accuracy and balanced accuracy results are shown in the graph below. What we can see is:

KFold CV prediction performance on the training and test data (train/test accuracy and balanced accuracy)
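For reference, a 10-fold CV comparison along these lines can be set up with scikit-learn roughly as follows (a sketch only; the actual results come from the jupyter notebook mentioned above, and the file and column names here are illustrative):

```python
# Sketch of the 10-fold CV comparison, reporting accuracy and balanced
# accuracy on training and test folds. XGBoost's sklearn wrapper
# (XGBClassifier) could be added to the models dict in the same way.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labelled_transactions.csv")  # hypothetical description/category columns

models = {
    "random_forest": RandomForestClassifier(n_estimators=100),
    "neural_network": MLPClassifier(max_iter=500),
}

cv = KFold(n_splits=10, shuffle=True, random_state=42)
for name, clf in models.items():
    pipeline = make_pipeline(CountVectorizer(), clf)
    scores = cross_validate(
        pipeline, df["description"], df["category"], cv=cv,
        scoring=["accuracy", "balanced_accuracy"], return_train_score=True,
    )
    print(f"{name}: "
          f"train acc {scores['train_accuracy'].mean():.3f}, "
          f"test acc {scores['test_accuracy'].mean():.3f}, "
          f"balanced test acc {scores['test_balanced_accuracy'].mean():.3f}")
```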

Thoughts

While one can argue that the ML methods could be made far superior with more sophisticated feature generation, hyperparameter optimisation, or more layers in the neural networks (one counterargument being that you can also fine-tune the fuzzy matching!), what I like about the non-ML elasticsearch approach is:

While ML methods are amazing and solve a lot of problems, it is sometimes easy to overlook simpler solutions that can address the problem at hand just as elegantly. Also, if you sell a product containing the “machine learning” or “artificial intelligence” buzzwords, it will almost certainly sell better than something described as “look up the closest match to your records”!

I’ve really only scratched the surface of what elasticsearch can do, and I am already loving it. Elasticsearch also has ML capabilities, which I will be exploring further in the future!