Debugging Patterns for Resource Leaks
In software engineering, a design pattern is a guideline for solving commonly seen problems in software design. While design patterns for software development are widely used, there is little meaningful material on patterns specifically for software debugging. A few are mentioned on the wiki page, but as you can see, those patterns only provide a high-level overview of how one might reproduce bugs in a controlled environment, rather than how to narrow down to the root cause.
Even if you haven’t read the famous Gang of Four book or the other popular design-pattern texts cover to cover, you’re probably in a position to see that debugging patterns are poorly served in comparison.
In practice, however, the story is vastly different. Some large companies, especially those in enterprise software, have separate teams that work specifically on debugging products that have shipped. These go by names like continuous product development (CPD), escalations, or sustaining engineering. Once the product ships, the development team essentially hands over ownership to this team, which fronts any issues seen by customers.
This post will be helpful for CPD engineers who may not always have access to the environment where a bug is being seen. Having spent some time in a CPD team myself, I will articulate a pattern for solving a particularly nasty variety of bugs, called resource leaks, under what can be challenging conditions.
The Drudgerous Art of Debugging
Irrespective of the quality of work put in during the design and development phases of software engineering, bugs are inevitable.
During the testing phase, Quality Assurance (QA) engineers find and file bugs and the developers have to solve these bugs before the product can ship. It is the time spent debugging that usually holds up a product release, so there is immense pressure on developers to fix these bugs ASAP.
The development team usually has the luxury of a QA setup where the bug is reproducible at this stage, which makes the process slightly less painful. With such a setup, the developer can change the product’s code to add diagnostic information or other aids for finding the root cause of the defect. Patches whose changes solely serve the purpose of diagnosing a defect are called “debug-patches”.
Debug-patches are allowed liberties, such as side effects that alter the normal, acceptable behavior of the product. In internal/in-house environments, putting out a debug-patch and having the QA engineer try it out where the issue is seen is a common occurrence. Since such patches run only on systems meant for QA use, they can change the product’s behavior freely; even if the changes have adverse side effects on the QA setup, it can always be recreated. Needless to say, no customer (especially an enterprise customer) will ever agree to run builds that are not tested and sanctioned by the QA team and that come with warnings.
Despite the importance and inevitability of debugging, some developers don’t look forward to this phase of the development process as much as design and development. The reasons may be many: the debugging phase probably offers the least scope for learning anything new, and besides, it means the QA team has found mistakes in the work so far, which can feel embarrassing.
In my humble opinion, debugging never receives as much academic attention as the other phases of software development because it is never explicitly taught in schools, which may be one of the main reasons it is seen as so unglamorous.
The burden of bugs doesn’t end when the product ships to the customer. Some bugs slip by the QA team and find themselves in a released product being used by a customer who has paid for the software. There are many reasons a bug seen by a customer may have been missed by the QA team, for example:
1. There are idiosyncrasies in the customer’s environment that cause the issue to appear only at the customer’s site and not in-house. For example, certain patterns of network traffic, or hardware features/failures seen by the customer, may cause the code to behave in a particular way.
2. Third-party software is interfering with the operation of our product.
3. The use-case was completely missed during the planning phase.
Sometimes, reasons 1 and 2 may be so severe that a bug is never reproducible in-house, even knowing the nature of the problem, the environmental details, and the customer’s use-case. These bugs are routed back to the development team, who don’t have insight into the environment where the problem occurred, either.
Going after bugs that are unreproducible in-house but are only seen in a customer’s environment is very challenging. Unlike an in-house environment, the developer cannot install a debug-patch that might have harmful side effects in a customer environment. Instead, they have to rely on logs and other diagnostic information already collected in the product’s support bundle to go after the bug.
Bugs found by the customer are commonly called escalations. Solving bugs at this stage is more critical than before, because now it is the customer who has to wait. If a timely resolution is not provided, they may move on to your product’s competitor, which will affect the future of the product and ultimately the company itself. Also, any time spent on escalations is time away from working on the current or future release.
No effort should be spared to prevent bugs from becoming escalations, but once we are faced with them, delivering a quick resolution is paramount to the reputation of the product and the company. This is why teams tasked with going after escalations must be well armed with tools and techniques for solving bugs.
The Dreaded Resource Leaks
Of all the kinds of bugs out there, few strike fear into the hearts of software engineers like resource leaks. Memory and socket leaks are the most commonly seen, but the technique I describe can be applied broadly to any kind of leak. Typically, a resource leak is defined as an erroneous condition in which a program holds on to more resources than it actually needs.
From a code perspective, resources are allocated/acquired with one call (e.g., malloc or open) and released back to the system with another (free or close). A leak arises when an acquired resource is not released after it has served its purpose.
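As a minimal illustration (sketched in Python for concreteness; the same shape applies to malloc/free or socket/close pairs in any language), here is a handler with an early return that skips the release:

```python
import os

def leaky_handler(path):
    fd = os.open(path, os.O_RDONLY)  # acquire: one file descriptor
    data = os.read(fd, 16)
    if not data:
        # BUG: this early return skips os.close(fd), so the
        # descriptor leaks every time an empty file is processed.
        return None
    os.close(fd)                     # release: balances the acquire
    return data
```

Every call on an empty file leaks one descriptor; repeat it often enough and os.open eventually fails with EMFILE, starving the whole process.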
Besides freezing up the offending process, a resource leak can bring the whole system to a standstill if not detected and remediated early. The consequences are dire. For example, consider a process that is leaking sockets, slowly consuming all available sockets without closing them. At some point, the system will run out of sockets, and other processes won’t be able to get one when needed; very soon, nobody may be able to access the system over SSH.
These bugs can easily slip through the QA cycle because they take time to manifest, even under a robust endurance-testing schedule. There are tools that are very effective against them, but they may not be usable in all environments; not many customers, for example, will be willing to install a tool like Valgrind (a very powerful memory-leak detection tool) in their production environment.
I have spent considerable amounts of time debugging this class of issues, especially memory leaks. To help others with resolving resource leaks, I will describe a language/platform-agnostic, generalized pattern of steps to follow.
Patterns for Debugging Resource Leaks
A process that is not leaking resources has, at the code level, released every resource it acquired. So, if a process is leaking a resource, it must have executed more acquire calls than release calls. To find the cause of the leak and fix it, we have to figure out which acquire calls lack a corresponding release call, and then find out why those release calls are not happening.
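The imbalance can be made concrete by wrapping the acquire and release functions with a shared counter. A sketch in Python, where the acquire/release functions are hypothetical stand-ins:

```python
import functools

def audited(counter, delta):
    """Wrap an acquire (+1) or release (-1) function so every call
    updates a shared count of outstanding resources."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            counter[0] += delta
            return fn(*args, **kwargs)
        return inner
    return wrap

outstanding = [0]                                      # acquires minus releases
acquire = audited(outstanding, +1)(lambda: object())   # stand-in acquire
release = audited(outstanding, -1)(lambda res: None)   # stand-in release
```

If `outstanding[0]` keeps growing while the workload is steady, some codepath is acquiring without releasing.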
The steps below are to be followed in the order in which they are listed. They are language/platform agnostic and should work for all kinds of resources.
Each step answers a question about the leak/bug, which helps answer the subsequent question, and ultimately helps us in figuring out the problematic codepath.
Each question/step drives toward the solution from the bottom of the software stack (the actual acquire or release function) to the top (the user action or use-case that triggered the acquisition that created the leak). The pattern can be used even when the environment where the leak is seen belongs to a customer who may not agree to installing tools that could help the developer.
1. Where is the problem?
The first step in debugging a resource leak is to identify the layer where the problem exists. The aim here is to figure out if the bug is in the process itself or in the platform.
In terms of code, the question to ask is: was the release function invoked? For example, if free() was indeed called on a particular chunk of memory but the memory is still held, the problem is at the OS level and not in the application.
The following are some ways to figure out if we are missing calls to close/free:
A. If the leak is reproducible in a setup available within the company, we can log/count every acquire and release call. Otherwise, we have to rely on the log bundle and look for log messages around the acquire and release functions as an indication. For example, suppose foo logs “ABCD” just before acquiring variable a, and bar logs “WXYZ” just before releasing it. If we see ABCD in the logs but not WXYZ, variable a is most likely being leaked because release_function was never called.
B. Use leak-detection tools that can be applied in various environments, for example Valgrind or WinDbg. Unfortunately, this applies only to internal environments. Even here, some of the more powerful tools slow processes down considerably and may render the process under inspection useless; no customer would ever allow a tool like that to be installed on a production setup.
C. Look for indications that the operating system may provide. For example, some OSes expose information about running processes that includes the number of open sockets and their states, which will tell you whether close was called.
D. Develop your own tool/technique for counting/auditing the acquire and release functions. The best way to go about this might be to build the tool into the product, with an option to run the product in a debug mode that emits the information you need. Some platforms offer constructs for this, such as LD_PRELOAD and __malloc_hook, which let you wrap the allocation and release functions and inject code that tracks what is needed.
E. If all else fails, you will have to put out a debug-patch, even to the customer. Ideally the patch would only add extra logging, but it should still go through basic QA validation to ensure that it will not have adverse side effects in the customer’s environment.
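The foo/bar example from part A can be written as pseudo code (the names follow the references used later in this post):

```
function foo():
    log("ABCD")
    variable a = acquire_function()   // e.g., malloc/open
    // ... variable a is used and, later, handed to bar() ...

function bar(variable a):
    log("WXYZ")
    release_function(variable a)      // e.g., free/close
```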
2. Which resource is leaking?
The next step is to figure out which resource instance is causing the leak. In terms of code, the question to ask is: is the leaked resource object X or object Y? If you used part A, B, D, or E to answer question 1, the answer to this question follows directly.
For example, 1A directly tells you that the problem is in the application and not the platform, and that variable a is leaking. However, if you used part C to answer question 1, or if you somehow already know that there is no bug in the platform/OS, the answer is not clear, and you will have to use the other steps to find it.
3. Where is the resource used?
Given that we know object X is leaking, it may be used in a number of workflows/use-cases in the code. For example, in the pseudo code snippet in 1A, foo could be called from multiple places, so we need to figure out which of these invocations leaves the object behind, and which use-case of the product triggers it. At the code level, we need the call trace of the code that led to the acquisition of this particular object.
If there are only a few places where the leaky object is used, or if the call trace is not very deep, you will know right away where to look. Otherwise, you need to understand the role that the leaky instance of the resource plays in the product itself. Sometimes the best way to understand its role is good old code inspection; understanding the code will put you in a very good position to answer this question.
If that proves impractical within the time available, build a call tree by correlating logs and code, and go from there. Returning to our example in 1A, we know foo calls the acquire function, and we need to see every place foo is called, right from the entry point of the process (e.g., the main function in C/C++/Java).
The branch in foo’s call tree that doesn't have a corresponding branch in bar’s is the culprit.
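One way to mechanize this comparison is to pair the acquire and release log lines by resource id and report the call sites whose acquires were never matched. A sketch in Python, assuming a hypothetical log format of `ACQUIRE <id> at <site>` and `RELEASE <id>`:

```python
from collections import Counter

def unmatched_acquires(log_lines):
    """Return a count of acquire call sites whose resources
    were never released, keyed by call site."""
    site_by_id = {}
    for line in log_lines:
        parts = line.split()
        if parts[0] == "ACQUIRE":
            site_by_id[parts[1]] = parts[3]   # remember where it was acquired
        elif parts[0] == "RELEASE":
            site_by_id.pop(parts[1], None)    # matched, so forget it
    return Counter(site_by_id.values())
```

The call sites that remain are the branches of foo’s call tree with no counterpart in bar’s.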
4. Why is the resource leaking?
Once we know the context of the leaky resource instance, the final step is to see why it is not being released. Either the codepath that is supposed to release the resource is never invoked at all, or it terminates partway through, before reaching the function that releases the resource.
For example, in 1A, we need to find out why the codepath that calls bar is not invoked, or, if we know bar is called, why release_function(variable a) is not. This is done by correlating what is seen in the logs with the code.
Once the leak has been identified, the fix depends on its nature and cause and is beyond the scope of this pattern. I will say, though, that by following this process you will learn many good lessons. At the very least, you will learn the location and nature of the log messages, which will help you debug the next issue faster. These lessons are invaluable and should be harnessed to implement features that improve the diagnosability of the product, which will help anyone who has to deal with these issues later.
Final Word of Advice
I wouldn’t be surprised if the steps above seem easier said than done, because they are. Like I said, debugging is a drudgery—it requires patience, grit and hard work.
It is said that “the best punch is the one you don't have to throw” (or something like that). Likewise, the best way to solve a bug is to avoid it in the first place or catch it in a QA setup before it ships.
Investing in unit, integration, and pre-check-in tests is a good way to catch bugs early. A robust endurance test alone won’t catch every kind of leak in-house. In addition, think of ways to build tools that monitor the resource usage of the various processes in your product, for example by periodically collecting resource-utilization samples. If usage grows monotonically, we know there is a problem.
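The monotonic-growth check itself is simple. A sketch in Python, where `samples` are periodic readings of some resource (resident memory, open descriptors, etc.):

```python
def looks_like_leak(samples, min_samples=5):
    """Flag a possible leak when resource usage rises across
    every consecutive pair of samples.  Too few samples is
    treated as inconclusive rather than as a leak."""
    if len(samples) < min_samples:
        return False
    return all(later > earlier
               for earlier, later in zip(samples, samples[1:]))
```

Steady growth over a long window is only a signal, not proof: caches and pools also grow. But a check like this is cheap enough to run in production and points you at which process to audit.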
But, like I said, escalations are unavoidable, so developers must engineer features for product diagnosability and maintainability when the product itself is being designed. At the very least, strategically-placed meaningful log messages are a must if developers know that they might be faced with a situation of having to go after bugs in environments that they may not have access to.
At a higher level, some process or arrangement with the customer for delivering debug-patches may also come in handy. To catch leaks in a customer’s environment, developing your own tools and making them part of the product is ideal.
When all else fails, as it eventually will, I hope the pattern described here comes in handy. Resource leaks are not easy to go after, and they are especially nasty when they cannot be seen in-house.
I won’t say that you must follow my pattern to a T; like any pattern, this one is a suggestion, and it is up to you to adapt it to your situation and context. I will tell you, though, that the questions in the pattern, and their order, are most unlikely to change if you want a bottom-up approach to debugging a resource leak.
I’ve successfully used the sequence of steps described on more than one occasion and I hope you will too.