I have been thinking about this for quite some time. An event that occurred last week prompted me to finally write about it. It is about the problem-investigation approach that my former group leader taught me early in my working life. The approach goes like this:
Sometimes we just don’t know exactly how to handle a problem. There may be many unknown factors, or the behavior of the case may be unpredictable. We may not want to just choose an approach and put all our effort in that direction. It’s possible that the approach will lead us to the solution, but it’s also possible that we waste all that effort just to find out in the end that we have chosen the wrong path.
With many unknown factors or unpredictable behaviors, it’s very likely that we will end up in the second case. For this kind of problem, it may be better to try gathering information from different possible directions. We may experiment a bit here and there to see whether any really convincing evidence shows up. Who knows, we may pick up the scent of the root cause in an area that wasn’t in our original focus at all.
There is no special technique in this approach at all. It’s just a reminder to keep our eyes and minds open to other possible directions. I myself didn’t pay much attention to the method when my leader explained it to me at the time. You may wonder who in their right mind would jump onto a path and just walk along it without knowing where it ends. Well, I have done that a couple of times. There are some factors that can trick us into sticking to one path and ignoring all other possibilities. Let me tell you some of the stories that made me realize how useful the approach can be.
My first memory leak investigation
If I recall correctly, it was my first experience in handling a reported software issue. Our support team hadn’t been formed yet, so all issues had to be handled by developers. This story is not so much about the unpredictable behavior of the case but more about my lack of experience in gathering the necessary information. A production team raised an issue about a possible memory leak in my product. I managed to get the heap configuration of the problematic server, the number of users and the characteristics of their activities. I set out to reproduce the memory leak in my development environment immediately.
I have to admit I don’t know what happened to my mind at the time. Maybe it was because the idea of a memory leak was so cool and exciting, or because this was the first time I had a chance to apply what I had studied to a real-world problem. I was so convinced that there was a memory leak in my code base. I wrote a small program to simulate users’ activities, investigated the GC log and looked through the code base to find a possible leak. I couldn’t reproduce the problem but I still kept trying. The recorded effort spent on this issue kept growing and growing. At some point, the production team seemed to have lost interest in the case. I finally asked them about the status on their side. They replied that the root cause had been found. Apparently, the problem was in their module, a plug-in configured into my product to perform customized entitlement.
The product had been in production for quite a long time before the memory leak issue was reported. If I had stepped back and asked myself what change might have triggered this memory leak, I might have turned my attention in a different direction. When I failed to reproduce the leak within a couple of days, it might have occurred to me that the root cause could be something out of my control. But I was so caught up in the idea of finding a memory leak in my code base that I ignored other possibilities.
Strange performance drop – story 1
There was a strange performance problem when my team was just about to release a new version of our product. We ran performance tests for this new release on the fastest machine available to our team. The new release turned out not to perform as well as the previous release. The strange thing was that the difference in performance figures wasn’t stable. For example, one test round reported a final average response time of 25 ms, but other rounds reported 30 – 35 ms (the response time of the previous version was 18 ms). I tried to measure the time spent in each module and found no part of the system that looked like a bottleneck. It seemed like the overall system had just gotten slower.
Our platform migration from a 32-bit system to 64-bit turned out to be quite misleading for this problem. I saw many resources claiming that when a system is ported to 64-bit, it may get slower. I tried tuning OS-dependent parameters and also investigated whether the large heap size in 64-bit Java could cause a performance problem. Nothing helped me identify the root cause.
One day while I was searching for a technique to better monitor CPU usage, I stumbled across a blog post about CPU frequency scaling. I read it and learned that some processors are able to lower their frequency to save power and generate less heat. It just happened that my performance testing machine was an HP DL585 with a dynamic power management feature, which was a kind of frequency scaling mechanism. I queried the current mode of my system and found that it was set to ondemand by default:
$ more /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
Then I changed it to performance and reran my test:
$ echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
(repeat the above step for every CPU core)
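Repeating that step by hand gets tedious on a machine with many cores, so here is a minimal sketch of how it could be scripted. The function names set_governor and show_governors are my own hypothetical helpers, not part of any tool, and the optional directory argument is only there so the loop can be tried against a scratch directory instead of the real /sys tree. Writing to the real files requires root.

```shell
#!/bin/sh
# Hypothetical helper: write one governor name to every CPU found under a
# sysfs-style root. The root defaults to the real sysfs location, where
# writing requires root privileges.
set_governor() {
    governor=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip the unexpanded pattern when no CPU exposes cpufreq.
        [ -e "$f" ] || continue
        echo "$governor" > "$f"
    done
}

# Hypothetical helper: print the governor of each CPU so the change can be
# checked before rerunning a performance test.
show_governors() {
    cpu_root=${1:-/sys/devices/system/cpu}
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue
        echo "$f: $(cat "$f")"
    done
}
```

With the functions loaded in a root shell, running set_governor performance followed by show_governors should list every core with the new mode. The setting is not persistent, so it has to be applied again after a reboot.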
And everything just got back to normal again. Now the new release performed as well as the previous one. It was about the CPU frequency mode all along. It was another time that the root cause lay somewhere totally outside our focus.
Strange performance drop – story 2
This story happened last Friday and inspired me to write this post. We planned to support Tomcat as another target platform. Our QA team ran a rough performance test of our product on Tomcat. Tomcat was quite mature and widely used, so we didn’t expect any kind of problem at all. But like all other performance tests in our project, the unexpected happened. We found that Tomcat on Solaris x86 made our product run much slower than Sun Java System Web Server (SJSWS) on the same machine. The strange thing was that our product on Tomcat running on Linux was slightly faster than our product on SJSWS on the same Linux machine.
Our product contained both Java code and a native C module. My colleague performed some measurements and pointed out that it had something to do with the native side. I tried tuning the Tomcat server and playing with various Tomcat connectors, but nothing helped.
In our team meeting last Wednesday, we agreed that we needed to do some code profiling on the native side. We didn’t know yet how we could run a C profiler in the Tomcat environment, but we thought it was the way to go. After the meeting, I still kept playing with the testing environment.
I roughly scanned through the startup script of SJSWS looking for some customized tuning specific to the Solaris platform. I discovered that the script contained a section to load the libumem library. The comment above the section said that this was for performance reasons.
# Preload libumem to improve performance on Solaris 10
LIBUMEM_32=/usr/lib/libumem.so
if [ -f "${LIBUMEM_32}" ] ; then
    if [ `uname -r | sed s/\\\.//` -ge 510 ] ; then
        LD_PRELOAD_32="${LIBUMEM_32} ${LD_PRELOAD_32}"; export LD_PRELOAD_32
    fi
fi
I copied the section into the Tomcat startup script, then started the server for the performance test again. Want to guess the result? That’s right; I got figures as good as the ones on SJSWS. That just saved us from a C profiling task that might have taken us some time to figure out.
# 64-bit counterpart of the libumem preload section
LIBUMEM_64=/usr/lib/64/libumem.so
if [ -f "${LIBUMEM_64}" ] ; then
    if [ `uname -r | sed s/\\\.//` -ge 510 ] ; then
        LD_PRELOAD_64="${LIBUMEM_64} ${LD_PRELOAD_64}"; export LD_PRELOAD_64
    fi
fi
The approach is a balancing act. If you have enough information, or your experience suggests a certain path, then it’s reasonable to choose it. You may just want to remind yourself that if the path doesn’t seem to lead to a solution, the answer might lie elsewhere.