Prototype code often has the potential to cause serious headaches or nightmares. It can create friction between teams and easily accumulate technical debt. But it does not have to be this way, if a little attention is paid to code optimisation and simplification at every stage.
A model prototype needs to be built quickly for a client. Then the bomb gets thrown over the wall to be put into production. Somewhere during productionisation, new bombs get thrown over, the cycle continues, and we all live happily ever after (not).
We are always rushed to develop code quickly to meet deadlines, which often means sloppy or convoluted code that translates into technical debt (hence the term “bomb”; some might use the term “snowball”). While it is easy to think “it’s just a little debt here” or “we can sort this out later”, the terrifying thing is that every little bit counts and adds up. A debt is a debt, and if it isn’t paid, the “interest” or penalty is huge. This can easily cause friction and drown teams when all of these little things go unnoticed and accumulate until it is too late.
While it is common to consider optimising code only nearer to (or even after) production, it never hurts to optimise (i.e. make the code run more efficiently or use fewer resources) and simplify it in the early stages of prototype building. Just to be clear, I am not saying that prototype models need to be fully optimised. What I am advocating for is that anyone who develops code should:

- keep the code as simple as possible (KISS)
- understand how to use their tools (e.g. packages) efficiently
To elaborate a bit on the second point: there are very inefficient ways to use a package (e.g. pandas), but with minor tweaks you can significantly improve your code.
Below is a fictional scenario with some extremely dumb code to illustrate this.
Say we have some sales data of 20 million items, with “columns” selling_prices, costs_per_unit and units_sold_counts, e.g.:
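(The snippet that generates these lists did not survive in this copy of the article; a minimal sketch is given below, where the use of random and the value ranges are assumptions, not the original code.)

```python
import random

def generate_sales_data(num_items, seed=42):
    """Generate random sales data; the ranges here are illustrative assumptions."""
    rng = random.Random(seed)
    selling_prices = [rng.randint(40, 150) for _ in range(num_items)]
    costs_per_unit = [rng.randint(10, 40) for _ in range(num_items)]
    units_sold_counts = [rng.randint(50, 10_000) for _ in range(num_items)]
    return selling_prices, costs_per_unit, units_sold_counts

# The article works with 20 million items; a smaller number is used here
# so the sketch runs quickly
selling_prices, costs_per_unit, units_sold_counts = generate_sales_data(1_000)
```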
Which gives us something like:
selling_prices = [142, 52, 94, 72, 47, 86, ...]
costs_per_unit = [21, 14, 26, 18, 27, 38, ...]
units_sold_counts = [9090, 70, 300, 930, 204, 650, ...]
There might be better ways than a list to store/process data at this size, but for argument’s sake, let’s stick with 3 separate lists as the starting point, shown above. We then want to compute the total_profit, i.e. summing up (selling_price - cost_per_unit) x units_sold for each item.
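On the six sample values shown above, for instance, the quantity we are after works out as:

```python
# The six sample values from above
selling_prices = [142, 52, 94, 72, 47, 86]
costs_per_unit = [21, 14, 26, 18, 27, 38]
units_sold_counts = [9090, 70, 300, 930, 204, 650]

# (selling_price - cost_per_unit) * units_sold, summed over all items
total_profit = sum(
    (s - c) * u
    for s, c, u in zip(selling_prices, costs_per_unit, units_sold_counts)
)
print(total_profit)  # 1208450
```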
For demonstration purposes, let’s do something exaggeratedly dumb (i.e. rushed and sloppy code):

- Create a pandas.DataFrame() and fill it with the data from the lists
- Compute total_profit with df.apply + a lambda function and summing the results

If you are using pandas
, there is a simple change to optimise what we want to achieve, by:

- Creating the pandas.DataFrame in one step (passing in all the data at once)
- Computing total_profit with vectorised column operations (instead of df.apply)

To further simplify (and optimise) this, we can really just compute the output directly, without the need for pandas.
At this kind of scale (20 million rows), maybe using numpy could speed things up, while still keeping the code simple.
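The original code listings did not survive in this copy of the article; the sketch below shows all four variants on a small sample (the column names and the exact shape of the “dumb” version are assumptions).

```python
import pandas as pd
import numpy as np

selling_prices = [142, 52, 94, 72, 47, 86]
costs_per_unit = [21, 14, 26, 18, 27, 38]
units_sold_counts = [9090, 70, 300, 930, 204, 650]

# 1) Exaggeratedly dumb: build the DataFrame column by column, then do a
#    row-wise apply with a lambda and sum the results
def profit_apply():
    df = pd.DataFrame()
    df["selling_price"] = selling_prices
    df["cost_per_unit"] = costs_per_unit
    df["units_sold"] = units_sold_counts
    return df.apply(
        lambda row: (row["selling_price"] - row["cost_per_unit"]) * row["units_sold"],
        axis=1,
    ).sum()

# 2) Improved pandas: create the DataFrame in one step and use
#    vectorised column arithmetic instead of apply
def profit_pandas():
    df = pd.DataFrame({
        "selling_price": selling_prices,
        "cost_per_unit": costs_per_unit,
        "units_sold": units_sold_counts,
    })
    return ((df["selling_price"] - df["cost_per_unit"]) * df["units_sold"]).sum()

# 3) No pandas: compute the output directly from the lists
def profit_direct():
    return sum(
        (s - c) * u
        for s, c, u in zip(selling_prices, costs_per_unit, units_sold_counts)
    )

# 4) numpy: vectorised arithmetic on arrays
def profit_numpy():
    s = np.array(selling_prices)
    c = np.array(costs_per_unit)
    u = np.array(units_sold_counts)
    return int(((s - c) * u).sum())

# All four variants agree on the sample
assert profit_apply() == profit_pandas() == profit_direct() == profit_numpy()
```

On the full 20-million-element lists, each variant can then be timed in a notebook with, e.g., %timeit profit_apply().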
Now let’s time how long each method takes to run on our 20-million-row dataset with IPython/Jupyter notebook’s %timeit magic command. Here are the results:

- pandas (improved): 17.7 seconds
- without pandas (computing directly): 4.48 seconds
- numpy: 3.82 seconds

The results are pretty clear. With a simple tweak in how you use pandas
for calculations, you can already make it more than 10 times faster! Imagine having to use pandas again and again for various exploratory data analysis (EDA) projects: wouldn’t you want this automatic speed-up just by using the package more efficiently?
And if you didn’t really need to use pandas, one extra tweak, stripping it out (i.e. computing the results directly), gives you even better performance (> 50 times faster), simply because you do not need to create the whole dataframe (a big overhead, even if refactored!) during the process. This method is still slower than the one using numpy, but the performance is not too far off. This is the often overlooked and under-appreciated beauty of KISS: by stripping away things you don’t need, the code is naturally optimised to some extent, without the use of any “specialised” speed-up libraries.
These are tiny details, but every little bit counts! The better you know how to use your tools efficiently, and the more of a KISS attitude you adopt, the more naturally you can optimise and simplify your code. This goes a long way: teams accumulate less technical debt over time, and code generally runs faster and better!
“That is all!” (quote from a team member whenever he feels like he’s said something sensible)
Many thanks to Yaqub Alwan, Gal Lellouche, and Saed Hussain for the insightful discussions and for proofreading this article.
Notes: