Chilled Geek's Blog

Always curious, always learning

A Little Code Optimisation Goes a Long Way

Posted at — Mar 19, 2022

Even with prototype code, tiny optimisations can be beneficial

Prototype code can easily cause serious headaches or nightmares. It can create friction between different teams and quickly accumulate technical debt. This does not have to be the case if, at every stage, a little attention is paid to code optimisation and simplification.

A model prototype needs to be built quickly for a client. Then the bomb gets thrown over to be put into production. Somewhere during production, new bombs get thrown over, the cycle continues, and we all live happily ever after (not).

We are always rushed to develop code quickly to meet deadlines, which often means sloppy/convoluted code, translating to technical debt (hence the term “bomb”, or, as some might put it, “snowball”). While it is easy to think “it’s just a little debt here” or “we can sort this out later”, the terrifying thing is that every little bit counts and adds up. A debt is a debt, and if it isn’t paid, the “interest” or penalty is huge; these little things can easily cause friction and drown teams when they go unnoticed and accumulate until it is too late.

While it is common to consider optimising code only when nearer to (or even after) production, it never hurts if, at the early stages of prototype building, code is optimised (i.e. made to run more efficiently or use fewer resources) and simplified. Just to be clear, I am not saying that prototype models need to be fully optimised. What I am advocating is that anyone who develops code should:

  1. Keep the code as simple as possible (KISS)
  2. Know how to use their tools and packages efficiently

To elaborate a bit on the second point: there are very inefficient ways to use a package (e.g. pandas), but minor tweaks can significantly improve your code’s performance.

Below is a fictional scenario with some exaggeratedly dumb code to illustrate this.

Fictional scenario

Say we have some sales data of 20 million items, with “columns” selling_prices, costs_per_unit, units_sold_counts, e.g.:

[Image: code generating the three example lists]
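As a rough sketch, such data could be generated with random values, purely for illustration:

import random

n_items = 20_000_000

# Hypothetical random data, just to have something to work with
selling_prices = [random.randint(40, 150) for _ in range(n_items)]
costs_per_unit = [random.randint(10, 40) for _ in range(n_items)]
units_sold_counts = [random.randint(50, 10_000) for _ in range(n_items)]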

Which gives us something like:

selling_prices = [142, 52, 94, 72, 47, 86, ...] 
costs_per_unit = [21, 14, 26, 18, 27, 38, ...] 
units_sold_counts = [9090, 70, 300, 930, 204, 650, ...]

There might be better ways than lists to store/process data at this size, but for argument’s sake, let’s stick with the 3 separate lists shown above as the starting point. We then want to compute the total_profit, i.e. sum up (selling_price - cost_per_unit) x units_sold for each item.

Exploration

For demonstration purposes, let’s do something exaggeratedly dumb (i.e. rushed and sloppy code):

  1. Create an empty pandas.DataFrame()
  2. Insert the columns of information one by one
  3. Compute the total_profit with df.apply + a lambda function and summing the results
  4. Job done! (really?)

[Image: code for the slow, column-by-column pandas approach]
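A sketch of what steps 1–3 might look like in code (assuming the three lists from above are already defined):

import pandas as pd

# 1. Create an empty DataFrame
df = pd.DataFrame()

# 2. Insert the columns of information one by one
df["selling_prices"] = selling_prices
df["costs_per_unit"] = costs_per_unit
df["units_sold_counts"] = units_sold_counts

# 3. Row-by-row apply with a lambda function, then sum the results
total_profit = df.apply(
    lambda row: (row["selling_prices"] - row["costs_per_unit"])
    * row["units_sold_counts"],
    axis=1,
).sum()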

Optimise/simplify

If you are using pandas, there is a simple change that optimises what we want to achieve:

  1. Creating the full pandas.DataFrame in one step
  2. Using broadcasting operations to compute the required output
  3. (Asserting the output just to check the result is consistent with the original total_profit)

[Image: code for the one-step DataFrame + broadcasting approach]
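A sketch of the same steps with this simple change applied:

import pandas as pd

# 1. Create the full DataFrame in one step
df = pd.DataFrame({
    "selling_prices": selling_prices,
    "costs_per_unit": costs_per_unit,
    "units_sold_counts": units_sold_counts,
})

# 2. Broadcasting (vectorised) operations instead of a row-wise apply
total_profit_pandas = (
    (df["selling_prices"] - df["costs_per_unit"]) * df["units_sold_counts"]
).sum()

# 3. Check the result is consistent with the original total_profit
assert total_profit_pandas == total_profit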

To further simplify (and optimise) this, we can really just compute the output directly without the need for pandas.

[Image: code computing the result directly, without pandas]
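Something along these lines, using nothing but built-ins:

# Compute the profit directly from the three lists, no DataFrame needed
total_profit_plain = sum(
    (price - cost) * units
    for price, cost, units in zip(selling_prices, costs_per_unit, units_sold_counts)
)

assert total_profit_plain == total_profit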

At this kind of scale (20 million rows), using numpy could speed things up further while still keeping the code simple, like this:

[Image: code computing the result with numpy]
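A sketch of what this might look like:

import numpy as np

# Convert the lists to arrays and use vectorised arithmetic
prices = np.array(selling_prices)
costs = np.array(costs_per_unit)
units = np.array(units_sold_counts)

total_profit_numpy = ((prices - costs) * units).sum()

assert total_profit_numpy == total_profit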

Timeit!

Now let’s time how long each method takes to run on our 20-million-item dataset with IPython/Jupyter notebook’s %timeit magic command.
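Wrapping each of the snippets above in a function (the function names below are just hypothetical wrappers) and timing them in a notebook cell might look something like this:

# In an IPython/Jupyter notebook cell:
%timeit profit_pandas_apply(selling_prices, costs_per_unit, units_sold_counts)
%timeit profit_pandas_broadcast(selling_prices, costs_per_unit, units_sold_counts)
%timeit profit_plain_python(selling_prices, costs_per_unit, units_sold_counts)
%timeit profit_numpy(selling_prices, costs_per_unit, units_sold_counts)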

Here are the results:

[Image: %timeit results for the four approaches]

Thoughts

The results are pretty clear. A simple tweak in how you use pandas for calculations already makes it more than 10 times faster! Imagine having to use pandas again and again for various exploratory data analysis (EDA) projects; wouldn’t you want this automatic speed-up just from using the package more efficiently?

And if you didn’t really need pandas, the extra tweak of stripping it out (i.e. computing the results directly) gives you even better performance (more than 50 times faster), simply because you do not need to create the whole dataframe (a big overhead, even when refactored!) along the way. This method is still slower than the one using numpy, but the performance is not too far off. This is the often overlooked and under-appreciated beauty of KISS: by stripping away things you don’t need, the code is naturally optimised to some extent, without the use of any “specialised” speed-up libraries.

These are tiny details, but every little detail counts! The better you know how to use your tools efficiently, and the more you adopt a KISS attitude, the more naturally you will optimise and simplify your code. This goes a long way: teams accumulate less technical debt over time, and code generally runs faster and better!

“That is all!” (quote from a team member whenever he feels like he’s said something sensible)

Acknowledgements

Many thanks to Yaqub Alwan, Gal Lellouche, and Saed Hussain for the insightful discussions and for proofreading this article.
