An Optimisation Tale
Optimised code is something every programmer wants to write. We want our code to run as fast as possible using as few resources as possible. Unfortunately optimisation can often seem quite daunting. From needing to deal with parallelism, to knowing how to use profiling tools, to potentially even dropping down to assembly and hand crafting each instruction. Thankfully, most optimisation needn't be this complicated. Indeed, much like debugging, a lot of it can be solved with a few well placed log statements and a bit of thought about the problem you're trying to solve.
Recently I've been working through a simple problem, first implementing a naive solution, then proceeding to optimise it. I thought I'd share how I approached it in the hope that it may help others, or at the very least give an interesting read. Thankfully the problem can be talked about in abstract terms using images. There will be a bit of pseudo code where needed, but for the most part this is intended to show how I optimised the algorithm without having to think purely in terms of code.
The Problem
I'm writing an app that displays a map. One user (the editor) can edit the map while other users (the players) view it on a different screen. One component of this is having a "fog of war", which obscures part of the map for the players, letting them only see the parts of the map the editing user wants them to see.
Implementing this was quite easy. I created a black image the same size as the map and displayed that on top of the map. The editor can then remove parts of the fog, making that area transparent and allowing the map to show through. They can also add parts of the fog back, making that area black and obscuring the map again.
Unfortunately, while it obscures details of the map, it does not obscure the map's size. As the idea of the app is that the players are meant to be exploring, this gives them a lot of information about where they can and cannot go. To solve this I want to be able to focus the players' view purely on those parts of the map which are shown. This requires knowing the bounding rect of all the clear pixels in the fog image.
A Naive Solution
It is often said that premature optimisation is the root of all evil. This can sound a bit extreme but it does give some useful advice. In most cases you want to first implement the simplest solution possible, even if you know it's inefficient. It is far better to have a slow solution that works than a fast solution that doesn't. It also means you can then write unit tests and performance tests against the output, which helps make sure that your optimisations don't harm performance or introduce bugs.
So for my first solution I did the simplest thing I could: scan every pixel. Every time fog was added or removed, I scanned the mask image from the top-left pixel to the bottom-right pixel, making note of the minimum and maximum x and y co-ordinates at which I found a transparent pixel. I then used this to create the bounding rect.
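As a rough sketch, the core of that naive scan looks something like this (pixelIsTransparentAt, imageWidth and imageHeight are stand-ins for however you read the mask image, not real API):

NSInteger minX = NSIntegerMax, minY = NSIntegerMax;
NSInteger maxX = -1, maxY = -1;
for (NSInteger y = 0; y < imageHeight; y++) {
    for (NSInteger x = 0; x < imageWidth; x++) {
        //pixelIsTransparentAt() stands in for reading the mask's alpha at (x, y)
        if (pixelIsTransparentAt(x, y)) {
            minX = MIN(minX, x);
            minY = MIN(minY, y);
            maxX = MAX(maxX, x);
            maxY = MAX(maxY, y);
        }
    }
}
//If maxX is still -1 no transparent pixels were found; otherwise the
//bounding rect runs from (minX, minY) to (maxX, maxY).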
Write Tests, Get Numbers
After creating my simple solution I created a set of automated tests. One set tested the logic, the second tested the performance. The logic tests were to make sure I didn't introduce bugs. Optimisation usually involves a lot of refactoring, so it's important to have logic tests to make sure you don't break anything. The performance tests on the other hand were to provide data.
The first step in any optimisation is to get numbers. This can be as simple as getting a time stamp before your code runs and comparing it to the time after it finishes, or as complex as running a full suite of profiling tools on your code. Numbers are important as they show you if you're improving your code. Using performance tests for this has the extra benefit of giving you an automated suite you can run to make sure you don't accidentally regress performance.
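At its most basic, that timestamp approach looks something like this (recalculateBoundingRect is just an illustrative stand-in for whatever code you're measuring):

CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();
[fogOfWar recalculateBoundingRect]; //the code being measured (hypothetical method)
CFAbsoluteTime elapsed = CFAbsoluteTimeGetCurrent() - start;
NSLog(@"Recalculating the bounding rect took %.3f seconds", elapsed);

XCTest's measure blocks give you the same idea with repeated runs and averaging built in, which is what a performance test suite ends up using anyway.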
My performance suite was fairly simple, just checking various test cases on images of 100 x 100px, and then again on images of 1000 x 1000px. For this post I'm only going to focus on a single test, which goes through a complex problem, removing and then adding fog. The gif below shows the steps.
After each step I re-calculated the bounding rect of the transparent pixels. The expected behaviour is shown in the following gif.
Running this test with a 1000 x 1000px image through my naive solution gave a time of 6.2 seconds. Now bear in mind that this test has 12 remove/add fog operations running one after another. A user doing this same pattern would likely have 1-2 seconds between invoking each operation, so for a 1000 x 1000px map this algorithm is already viable, as each step only takes around 0.5 seconds.
Complexity
I want to temporarily segue into some computer science. You may have heard of something called Big O Notation. This is used to signify how the computing resources an algorithm needs (be that time, memory, etc) grow as the problem space grows.
On the faster end of the scale, you have O(1) or constant time. For example, a function to check if an integer is odd or even will likely be O(1). No matter how big a number you throw at it, it will always take the same amount of time to run.
On the other end you have something like O(N!) or factorial time. An example of this would be trying to solve the Travelling Salesman problem via brute force. The N in this notation refers to the size of the data set. So if a problem with 1 node took 1 second to solve, a problem with 2 nodes would take 2! seconds to solve, a problem with 3 nodes 3! seconds and so on. Initially this isn't too bad. 2! is 2 seconds, 3! is 6 seconds. But the time rapidly increases. By the time we reach 10 nodes we could take up to 3,628,800 seconds, or 42 days, to solve.
It is important to note that this does not tell you exactly how long the problem may take to run, it merely specifies an upper bound. For example, searching for a value in an array is an O(N) problem. This means that the worst case increases linearly (so 1 second for an array of 1 item, 2 seconds for 2 items, 10 seconds for 10 items, etc). However, the worst case is that the value we're looking for is the last item. If the value we're looking for was always guaranteed to be the first item then our algorithm would actually run in constant time.
So what is the complexity of my problem? Well, images are basically just arrays of pixels. Assuming that checking a pixel is pretty much a constant time operation, my algorithm becomes O(N) complexity. What does this mean for my naive solution? Well, 1000 x 1000px is a fairly small image. It's easily possible that a user could put in a 10,000 x 10,000px image. Suddenly that 6 seconds becomes 600 seconds (or 10 minutes). At this point I could no longer write this off as my test case being extreme, as this implies that each step will take 50 seconds.
Don't Do Needless Work
With optimisation the temptation is always to ask "how can I do this work faster?". However, a far better question to ask is "how do I do less work?". The answer to this can often be far simpler. For example, if you do a lot of network requests, rather than asking how you can speed up the network connection, maybe the answer is to create a local cache so you don't need to talk to the network as often.
In my case the expensive operation wasn't talking to the network but scanning the image. So I started to look for times I could avoid scanning the image. If you look again at the gif showing the expected behaviour of the bounding rect, you'll notice that for a lot of steps the bounding rect doesn't actually change as the modified area is entirely inside of it. I realised that if the modified rect is entirely inside the bounding rect, I could avoid re-scanning the image.
So I added rules that were basically:
if (modifiedRect.minY >= boundingRect.minY &&
    modifiedRect.minX >= boundingRect.minX &&
    modifiedRect.maxY <= boundingRect.maxY &&
    modifiedRect.maxX <= boundingRect.maxX) {
    //Skip Scan
}
Now this worked fine for removing fog, but introduced a bug when adding fog back in. The problem was around the edges of the bounding rect. Removing fog at these edges makes no difference to the bounds: there were already transparent pixels there, I'm just adding more. However, by adding fog at the edge I may be removing the last transparent pixels, causing the bounding rect to need to be re-calculated. As such I needed a variant of the above rule for adding fog. Rather than using >=/<=, I used a straight >/<. As long as the modified rect was completely inside, but not touching the edge of, the bounding rect I could skip the scan.
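In other words, the add-fog version of the rule is the same check with strict comparisons (the same pseudo code style as above):

if (modifiedRect.minY > boundingRect.minY &&
    modifiedRect.minX > boundingRect.minX &&
    modifiedRect.maxY < boundingRect.maxY &&
    modifiedRect.maxX < boundingRect.maxX) {
    //Skip Scan - the fog was added strictly inside the bounds,
    //so the edge pixels that define them are untouched
}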
This shows an important point when considering whether you can optimise. You need to be 100% sure that your optimisation won't impact the result.
There was one more case I found where I could avoid scanning. It doesn't actually impact the test case I'm showing, but it's worth mentioning anyway. If, when adding fog, the modified rect is equal to or entirely encloses the bounding rect, then effectively I'm removing all the transparent pixels. As such I can just reset the bounding rect to zero.
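As a minimal sketch, assuming the rects are CGRects (CGRectContainsRect also returns true when the two rects are equal):

//Adding fog over the whole visible area: nothing transparent is left,
//so there's no need to scan, just clear the bounds.
if (CGRectContainsRect(modifiedRect, boundingRect)) {
    boundingRect = CGRectZero;
}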
Running the test again after these changes gave a time of 3.62 seconds. Just this simple change of avoiding needless work gave me a 1.7x speed up.
Do Different Work
I did have two options when I was looking at how to implement this algorithm. The other option would have been to keep track of the transparent areas separately, so I could perform some simple maths on them to find out the bounds. Unfortunately, when I started to consider adding fog back in, the maths became a fair bit more complicated, especially as I was now potentially dealing with transparent areas that weren't simple rectangles. Image scanning, while slower, did provide a much simpler initial algorithm.
Optimisation almost always leads to more complex code, which in turn becomes harder to maintain. You always have to consider the trade off between your code running faster and your code being cleaner. Unless you're working in extremely performance sensitive areas, it is often better to favour easy to understand code over fast to execute code if it won't greatly affect your users. Indeed, you make this trade off every day by using high level languages and frameworks, sacrificing some potential computational performance in favour of code that is easier to understand, develop, and debug.
While I initially went down the image scanning route, I did revisit the rect maths option. As stated earlier, it is actually a very simple algorithm if the bounding rect is only ever going to grow, arguably even simpler than the image scanning. Because of this I opted to go for separate approaches to handling removing and adding fog. By doing this I could choose the simplest algorithm for each case. My algorithm for removing fog outside the bounding rect now became the following:
boundingRect.minY = MIN(boundingRect.minY, modifiedRect.minY)
boundingRect.minX = MIN(boundingRect.minX, modifiedRect.minX)
boundingRect.maxY = MAX(boundingRect.maxY, modifiedRect.maxY)
boundingRect.maxX = MAX(boundingRect.maxX, modifiedRect.maxX)
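If the bounding rect is stored as a CGRect, the same thing can be done in a single call with CGRectUnion(boundingRect, modifiedRect), though the four MIN/MAX lines arguably make what's happening more obvious.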
I had replaced the whole image scanning with 4 simple bits of maths. Running the test now only took 1.6 seconds, a 3.9x speed up over the initial algorithm. However, this actually underestimates the significance. Due to how my app works, a user will spend more time removing fog than adding it back in, so optimising this code path has a far greater impact on how users will actually use the app than this test implies. This is always worth considering when you choose what to optimise. A small change in a feature used frequently can have more of an impact than a large change in a rarely used feature.
More Efficient Scanning
At this point I had eliminated the need for scanning when removing fog, but I still needed it for adding fog. Thankfully my decision to treat removing and adding fog as two separate problems gave me the opportunity to further refine my scanning.
When a user adds fog, the bounding rect can either stay the same, or it can shrink. Therefore it is safe to assume that all pixels outside of the bounding rect will be black and so don't need to be scanned. This allowed me to reduce the search area from the whole image, down to just the bounding rect. It takes a quarter of the time to search a 500 x 500px bounding rect than it does to search a 1000 x 1000px image.
Running the test with this change gave a time of 0.96 seconds. I was quite happy to be down below 1 second, but I was well aware that I was using a fairly fast machine. What is 1 second for me may be several seconds for others, so I went and looked further.
My naive algorithm was scanning from the top-left to the bottom-right. This required scanning the entire search area as, even when I'd found the bottom-right most transparent pixel, I still had to scan the rest of the image to be sure. That represented a lot of wasteful scanning.
To fix this I opted to change my scanning to work from the outside in. Doing this allowed me to find the left-most, right-most, top-most, and bottom-most transparent pixels and then stop, not needing to scan the inside of transparent areas.
The gif below shows how the algorithm works. First I scan each column of pixels in the bounding rect from left to right until I hit a transparent pixel, at which point I shift the left edge of the bounding box. I then repeat this for the right side (going right to left), the top side and the bottom side (which in this example exits immediately as the first pixel found is transparent).
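The left-edge pass looks roughly like this; the other three edges are the same loop with the directions swapped (boundsMinX/boundsMaxX/boundsMinY/boundsMaxY and pixelIsTransparentAt are the same illustrative stand-ins as before):

//Left edge pass: scan columns left to right within the current bounds.
//The first column containing a transparent pixel is the new left edge.
NSInteger newMinX = boundsMinX;
BOOL foundTransparentPixel = NO;
for (NSInteger x = boundsMinX; x <= boundsMaxX && !foundTransparentPixel; x++) {
    for (NSInteger y = boundsMinY; y <= boundsMaxY; y++) {
        if (pixelIsTransparentAt(x, y)) {
            newMinX = x;
            foundTransparentPixel = YES;
            break; //no need to scan the rest of this column, or any further right
        }
    }
}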
After implementing this approach I ran the test again and got a time of 0.2 seconds, or a 31x speed up over my initial algorithm.
Go Big or Go Home
At this point I was pretty happy. 0.2 seconds is plenty fast enough considering the test is an extreme case, running multiple operations in one go. However, I was well aware that 1000 x 1000px is not a particularly big image, so I ran my test again, this time with a 10,000 x 10,000px image. With this bigger image the test took 20.2 seconds to run. Given I have an algorithm with linear complexity, it makes sense that giving it 100 times the data means it takes 100 times longer to calculate.
This does show that when doing any sort of performance tests you should always be going beyond what an average user will use. By testing on extreme cases you can more accurately see the impact of your optimisations. And if you can get an extreme case running quickly, then the common use cases will take care of themselves.
Memory = Time
When optimising an algorithm you shouldn't just be measuring the time it takes, but the amount of memory it consumes. It's obvious that if you require more memory than is available, your app may well crash (or be forced to terminate). However, the amount of memory you use can also affect the time it takes for your algorithm to run. RAM is pretty fast, but it still takes longer to iterate over a larger chunk of memory than a smaller one. And if the OS starts putting some of your RAM into virtual memory then you can end up slowing it down even further while you wait for disk access.
As I'd just bumped up the size of the image I decided to take a look at memory usage. I had an idea of where any memory issues might be, so rather than using Instruments I just used the memory tool inside of Xcode's debug navigator. Running the larger 10k image test I found memory peaked at 8.23GB. That is a LOT of memory.
The first thing I opted to look at was the image itself. I was creating this using a plain NSImage. The code for this is pretty simple, but it doesn't give you many customisation options. As such the image created was RGBA. Each component of this requires a byte, meaning we need 4 bytes per pixel. So a 10k RGBA image would take up around 381MB. Now that's fine if you need that. However, we just care about the transparency of pixels. If we threw away the rest of the data our image suddenly shrinks to 95MB, still a lot of data but a lot more manageable.
So I opted to drop down to CoreGraphics so I could create an image just with the alpha layer. Running the test again I found my peak memory use reduced to 3.31GB. Still a lot, but nearly 2.5x less than what I was using before. I then checked the time the test took and found that it had dropped to 13.7 seconds. This is 45x faster than the original algorithm (remember, we're using an image 100x bigger now, so the 6.2 seconds for the original algorithm would actually be 620 seconds with this image).
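Creating that alpha-only bitmap is the sort of thing CGBitmapContextCreate can do directly; a sketch of the kind of call I mean (width and height being the image's pixel dimensions):

//One byte per pixel, no colour components: just the alpha channel we care about.
CGContextRef maskContext = CGBitmapContextCreate(NULL,
                                                 width,
                                                 height,
                                                 8,      //bits per component
                                                 width,  //bytes per row
                                                 NULL,   //no colour space needed for alpha-only
                                                 (CGBitmapInfo)kCGImageAlphaOnly);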
@autoreleasepool and Loops
I was still worried about that 3.31GB. How come the memory use was so high when the image itself should be so much smaller? My app was built with ARC so leaks shouldn't be an issue, especially as I had no circular references. However, this doesn't mean that memory wasn't sticking around longer than it needed to.
Every time I scanned a pixel I was creating an NSColor to compare against. These colours should in theory be auto-released. However, by default the auto-release pool only drains at the end of each run loop. This means that my scan would have to finish before any of these colours could be released. In cases like this, in a pre-ARC world it would have been common to create a new auto-release pool for this loop and to drain it periodically. With ARC we have the @autoreleasepool {} statement, which does the same thing. I added this to my code like so:
for (NSInteger x = 0; x < maxX; x++) {
    @autoreleasepool {
        //Any objects autoreleased while checking this column (such as the
        //NSColor created for each pixel) are released once the column is done.
        for (NSInteger y = 0; y < maxY; y++) {
            //Check pixel at (x, y)
        }
    }
}
This meant that after checking each column I released all the temporary objects created during that check. Running my test again brought the peak memory usage down to 2.78GB. Still a lot, but a significant improvement.
Tests Are Not Realistic
It's at this point I'd like to remind you that, if you run your own performance tests, they may not reflect real-world performance. I was still struggling to figure out where that 2.78GB was coming from, when I remembered that I was running 12 add/remove fog operations back to back. This meant that they all ran during a single run loop pass, so any temporary objects wouldn't be released, objects such as the temporary image objects used when generating the mask image or searching.
Much like the time spent processing, my memory use was being exaggerated by my tests. Now this isn't necessarily a bad thing. If your tests are using a lot more memory and taking longer than the algorithm would in real world use, it means you're testing against a worst case scenario. To fix this I wrapped both the whole image search algorithm and the mask image generation algorithm in their own @autoreleasepool blocks which brought the peak memory use down from the 8.23GB I had started with to just 643MB.
Optimise For Your Actual Use Case, Not a Theoretical One
This is the last step on this optimisation journey. While working through this implementation I realised something important. In the real world these fog of war areas are going to be rather large; you're easily talking upwards of 50px in size in most cases. The chance of a fog of war element being a few pixels tall or wide is rare, and even the cases where it is are more likely to be mistakes rather than intentional decisions on the part of the user.
This information gave me an idea: why work at the highest resolution from the start? I could rule out a lot of pixels quite quickly if I scanned at a lower resolution, at least initially. I modified the scanning algorithm to allow me to pass in a step value. This let me initially scan every 10th column or row for transparent pixels, skipping 90% of the pixels as I went. When I did find a transparent pixel I then ran the scan over the previous set of columns/rows, stepping over each pixel, to find the exact bound.
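Roughly, the coarse pass for the left edge then looks like this, where columnContainsTransparentPixel is a made-up helper wrapping the inner pixel loop from earlier and the bounds names are the same stand-ins as before:

//Coarse pass for the left edge: check every 10th column first.
NSInteger step = 10;
NSInteger newMinX = boundsMinX;
for (NSInteger x = boundsMinX; x <= boundsMaxX; x += step) {
    if (columnContainsTransparentPixel(x)) {
        //A coarse hit: the true edge lies in this column or in the columns
        //skipped just before it, so re-check those at full resolution.
        for (NSInteger fineX = MAX(boundsMinX, x - step + 1); fineX <= x; fineX++) {
            if (columnContainsTransparentPixel(fineX)) {
                newMinX = fineX;
                break;
            }
        }
        break;
    }
}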
This shows that it's important to remember what your use case is. If it was likely that users would intentionally draw 1px fog of war lines then I would have to resort to checking every pixel. But due to what the app is designed for I'm able to make some compromises and gain some more performance. This is breaking the rule I stated earlier, as I'm no longer 100% sure I'll get the same result, but sometimes you'll find that the different result doesn't matter for your case.
Running the test a final time took just 5.2 seconds, representing a 119x speed increase over my initial algorithm. To put this in more perspective, the initial algorithm took a whole second longer with 100 times less data.
Hopefully this has been an interesting read, and maybe even given you more to think about when you need to optimise your own code. It doesn't always have to be about profiling tools and delving into machine code. A lot of the time you don't even need to look at your code to figure out how to speed it up.