Imagine you and your friend are trying to use a high-tech microwave for the first time. Your friend pushes some buttons, but nothing seems to get the machine going. He then turns to you and asks if you know what is wrong with this thing. All you can do is say “I don’t know” and start helping him push some more buttons.
I get this feeling every time a colleague asks me if I know how to fix an issue in our software system without any interest in gathering information beforehand. The system is our own software running in our own projects. Isn’t it a lot faster and more reliable to investigate the logs than to ask me?
It’s true that investigating logs is not as fun as writing code, but we definitely can’t avoid it. Reading logs and interpreting metrics are the heart and soul of maintaining live services in production. Here are some suggestions on developing the skills to read logs, so we can know what is going on in the system and solve issues.
1. Know how to get the logs
This seems very obvious, but many times people don’t read logs simply because they don’t know where they are or how to turn logging on. Here are some suggestions for finding that out. Note that this assumes you have access to the target machine and are able to restart the services.
- Read the documentation. Yes, the boring thing we don’t want to spend much time on. Open the documentation page, press CTRL + F and look for a ‘logging’ section. A lot of the time it tells you exactly what you want to know.
- Know your software stack. For example, if it is ELK, chances are the configurations are under /etc/something and the log files are under /var/log/something.
- If you know the installation directory, you can try finding all files whose names contain ‘log’ ( find . | grep -i log ). The log4j.properties should be somewhere in there (if it is a Java system). Or grep for all files containing the word ‘log’ ( grep -r log /opt/something/config ).
- Grep for the target process running on the machine. You should see the command that was used to start it. The process might have been started with configuration or logging options that point to a non-standard location.
- List all files opened by the process ( lsof -p process-id ) and grep for just the config or log files.
Once you know which file to change, what to change in it is usually easy enough to guess. The sketch below collects the commands above in one place.
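A rough version, where the /opt/something paths and the process name are placeholders for whatever your installation actually uses:

# Look for anything with 'log' in its name under the installation directory
find /opt/something -iname '*log*'

# Or search the config directory for logging-related settings
grep -ri log /opt/something/config

# Inspect the command line the process was started with
ps aux | grep -i myservice

# List every file the process has open, keeping only configs and logs
lsof -p process-id | grep -Ei 'conf|log'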
2. Know how to interpret the logs and infer the system’s behaviour
This is the hardest part. It requires a lot of experience with the system you are working on, and this knowledge will set you apart from others. Imagine being able to monitor the logs and tell that something is wrong with specific machines in the cluster: node-5 and node-2 are out of sync and must be restarted, because the system cannot recover from this kind of error scenario on its own. Here are some tips to build up this knowledge.
- Know the key words for the important concepts in the system. You must know which word to search for in the logs for a particular activity you are interested in. For example, in a Kafka system there are key words such as:
- ISR (in-sync replicas) – Put (overly) simply, the set of machines in the cluster that have nearly the same data for a particular partition. If you see a lot of logs saying replicas are leaving and joining the ISR all the time, your cluster may not be in good health.
- Controller – The special Kafka broker that coordinates partition state changes in the cluster. If you start seeing logs racing by in controller.log without a good reason (you are not restarting a node), you may want to check the health of the whole cluster.
- Consumer Rebalance – The mechanism by which a consumer group re-assigns partition ownership. You should know that during a rebalance, consumers are not consuming messages. So if you see excessive rebalancing logs, something is wrong with the consumer group and your consumers rarely get a chance to consume any messages. See the first sketch after this list for a quick way to check these from the command line.
- Know which situations should produce which logs. The same log appearing in different situations can have a different level of severity. One example is the ISR logs described above. Let’s say a broker node has been shut down for one day of maintenance. Once it is started and joins the cluster again, it will begin mass replication of one day’s worth of data. There will be a lot of ISR joining and leaving logs at this time, which is fine (though it could cause problems if automatic leader rebalancing is enabled). Just wait for the replication to catch up.
- Search through the source code. This doesn’t mean reading the whole codebase; a simple search for the log message is enough to build an informed guess.
- If the log comes from a low-level module handling the transport layer, the problem is more likely a network outage or corrupted data.
- Browsing the call hierarchy may reveal that the log comes from a callback registered to handle a lost ZooKeeper session, so we can then look for more logs suggesting a session timeout in the connectivity. Guessing from class names, method names and code comments may seem like a hopeless thing to do, but it has saved my life many times. The second sketch below shows what this kind of search can look like.
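For the Kafka key words above, the stock command-line tools can confirm what the logs are hinting at. A rough sketch, assuming a recent enough Kafka that the tools accept --bootstrap-server (older releases use --zookeeper), with a broker at localhost:9092, ZooKeeper at localhost:2181 and a consumer group called my-group, all of which are placeholders:

# Partitions whose ISR is currently smaller than the full replica set
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Which broker currently holds the controller role
zookeeper-shell.sh localhost:2181 get /controller

# Partition assignment and lag for a consumer group; if this output keeps
# changing, it lines up with the rebalance logs described above
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group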
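For the source-code search itself, a plain grep over a checkout of the matching version is usually enough to locate where a message is logged. A sketch, where the path and the quoted message are placeholders; strip the variable parts (IDs, hostnames, numbers) out of the message before searching:

# Find where the static part of the message is emitted
grep -rn "some static part of the log message" ~/src/kafka

# Show a few lines of context around each hit to see the class and the condition
grep -rn -C 3 "some static part of the log message" ~/src/kafka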
3. Know how to prove correlation
Let’s say the system started having problems at 11.00, and there are 5 lines of low-level connection errors at 10.50. Does that mean these errors are the root cause?
I have seen many times that people jump to a conclusion just because they found the first thing that could support their first assumption. It takes some reasoning skill to gather evidence supporting an assumption until you are confident enough to reach a conclusion.
In the example above, we might need to ask more questions.
- What exactly is the problematic behaviour of the machine right now? If it is server-side underperformance, how could 5 bad connections from the client side cause it? Are we saying that 5 errors in handling client requests could cause a severe problem on the server side?
- Do the logs on other server nodes show this same error? If yes, are those nodes showing the same problem as this one?
- Has this kind of error appeared in the system before? If it has been happening from time to time without causing any problems, then this time it may be just a coincidence. A quick count of historical occurrences, as in the sketch after this list, helps answer this.
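One way to do that count is to tally how often the suspect message appears and compare it against the incident window. A sketch, assuming rotated logs under /var/log/something, lines that start with a YYYY-MM-DD HH:MM:SS timestamp, and a placeholder error message:

# How many times the error appears in each (possibly gzipped) rotated log file
zgrep -c "Connection reset by peer" /var/log/something/server.log*

# Bucket today's occurrences by hour to see whether 10.50 really stands out
grep "Connection reset by peer" /var/log/something/server.log | cut -c1-13 | sort | uniq -c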
My observation is that engineers who are very good at troubleshooting and system maintenance are very good at contradicting themselves. They form an assumption and then ask, “wait, if this is true, then why doesn’t that also happen?” They are ready to drop any assumption if they find a reasonable counter-argument.
4. Know how to search the bug database
Sometimes a message in the logs comes from a known issue that doesn’t cause a severe problem and can be ignored. Sometimes it is a known bug that is already fixed in a newer version. Simply searching for the message in the bug database can give us a lot of information.
An excellent additional benefit is the discussion in the issue thread. You can learn a lot from the preconditions, reproduction steps and system behaviour discussed there. A lot of the time you get information from the developers who actually own that module, which is very valuable.
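For projects that track issues in JIRA, such as Kafka at https://issues.apache.org/jira, this search can even be scripted through the public REST API. A minimal sketch, assuming anonymous search access is allowed and using a placeholder log message:

# Full-text search of the KAFKA project for a log message, newest issues first
curl -s --get 'https://issues.apache.org/jira/rest/api/2/search' \
  --data-urlencode 'jql=project = KAFKA AND text ~ "placeholder log message" ORDER BY created DESC' \
  --data-urlencode 'maxResults=5' \
  --data-urlencode 'fields=key,summary,status'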