Triaging Firefox Nightly Bugs - It's fun!

You may have noticed that typing about:crashes in Firefox brings up all the crashes that have occurred in your browser. Have you ever wondered how Firefox engineers look into these crash reports and analyze them? In this post, I'll talk about the crash triage process that I've been involved in for the past few months, and hopefully you'll find it interesting.

Nightly crash triage is done by a group of people whose responsibility is to file bugs for Nightly crashes when they are valid. It is a small group of engineers, so we do a weekly rotation, which respects everyone's time and also ensures that we have coverage on each day.

As you are probably aware, two Nightly builds are released every day, and both of them need to be triaged. However, we don't triage crashes on the day right after a release, because it usually takes time for users to update their Nightly build, and it also takes time for crashes to occur.

Crash reports come with a lot of information. Pretty much everything related to the crash is included, such as the crash reason, crash address and crash stack. Note that no personally identifiable information is included in the crash reports, so it's impossible to correlate crash reports to individual users. Most of the time, we don't need to check all the fields to tell what's going on.

Bugs We File

There are some approaches that we try to follow to avoid filing bugs that are not actionable, while also making sure we don't miss any potential bugs.

Common Crashes

OOM

Out-of-Memory crashes are common and usually not actionable. However, we should still check whether they are valid.

Generally, there are two things we want to look at:

The above image shows a bizarre crash report I noticed one day: the allocation size was huge, so I filed bug 167475 for it.

Bad Memory Bits

Some crashes are just caused by bad hardware, such as bad memory bits. These are actually quite common in crash reports.

The way to identify them is to expand the crash report to show other existing crashes that have the same signature. If every crash has a different crash reason and crash address, then it is very likely that this is a bad memory bit crash.

Here's an example of a bad memory bits crash. If we expand the report to view other crashes, we can see they all have different crash addresses. In this particular example, some addresses are very close to 0, which looks like a null pointer crash, while others are far away from 0.

Here's the screenshot of the crash addresses.

In addition to the above, the call stack of this crash shows that it is in the garbage collector code. The garbage collector sweeps a lot of memory, so users who have bad memory bits are very likely to crash there.
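The signature-level check described above can be sketched in a few lines of Python. This is only an illustration, not a tool the triage group uses: the CrashReport type, the looks_like_bad_memory helper and the min_reports threshold are all hypothetical, and "every crash has a different crash reason" is relaxed here to "the reasons are not all the same", since the set of possible crash reasons is small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashReport:
    """Hypothetical, simplified view of one crash report."""
    reason: str   # crash reason string as shown in the report
    address: int  # crash address parsed into an integer


def looks_like_bad_memory(reports: list[CrashReport], min_reports: int = 3) -> bool:
    """Heuristic for reports that share one signature.

    Returns True when every report has a distinct crash address and the
    crash reasons are not all identical. min_reports is an arbitrary
    threshold (an assumption) to avoid judging from too few reports.
    """
    if len(reports) < min_reports:
        return False
    addresses = {r.address for r in reports}
    reasons = {r.reason for r in reports}
    return len(addresses) == len(reports) and len(reasons) > 1
```

A result of True is only a hint. The call stack still needs a look, as in the garbage collector example above.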

Shutdown Hang

Shutdown hangs are also quite common in crash reports, and they drove me nuts in the beginning. The signature usually looks similar to [@ IPCError-browser | ShutDownKill ]. However, the fact is that they are not real crashes. They are reports of slow shutdowns. The reported call stacks don't indicate where a crash happened; in fact, they are snapshots of a content process that was told to shut down but didn't manage to do so within the 20-second limit.

There are two possible causes here.

  1. The content process was slow to respond
    • We don't want to file bugs for this reason because there isn't much we can do other than improve the speed of shutdown
  2. The content process was deadlocked while trying to shut down
    • This is something we care about, and it is actionable

There are a couple of steps that I follow to investigate these bugs.

  1. Expand the date range to include more reports by clicking the More Reports button and increasing the date range to at least a month.
  2. Check the ipc shutdown state table under the aggregation tab. The value reported by a content process while it's shutting down can be either RecvShutdown or SendFinishShutdown.
    • If the table has multiple values, then it's likely a slow shutdown.
  3. Check the call stack to see if it contains the mozilla::dom::ContentChild::RecvShutdown() method. The occurrence of this method indicates that the content process had received the shutdown IPC message and had started shutting down. If you see this method, it's likely that the process was just slow to shut down; however, it is possible that the process still got stuck after it had received the message.
  4. If the content process didn't receive the shutdown message, then it's a clear sign of a potential deadlock. You should look at the call stack and see if you can find an issue there.

Bug 1658429 is an example of a potential deadlock: it appears that the content process was waiting for the parent process, but the parent process had already started to shut down.
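To make the decision steps above easier to follow, here is a rough Python sketch of them. It is not an actual triage tool: the Verdict enum and the classify_shutdown_hang function are hypothetical names, the inputs are assumed to be copied by hand from the crash report page, and detecting "didn't receive the shutdown message" by the absence of the RecvShutdown frame is my simplification of steps 3 and 4. The checks run in the same order as the steps.

```python
from enum import Enum

# Frame name taken from step 3; matching it by substring is an assumption.
RECV_SHUTDOWN_FRAME = "mozilla::dom::ContentChild::RecvShutdown"


class Verdict(Enum):
    LIKELY_SLOW_SHUTDOWN = "likely slow shutdown"
    SLOW_OR_STUCK_AFTER_RECV = "probably slow, but inspect the stack"
    POTENTIAL_DEADLOCK = "potential deadlock"


def classify_shutdown_hang(shutdown_states: set[str], stack_frames: list[str]) -> Verdict:
    """Apply steps 2 to 4 to one shutdown hang report.

    shutdown_states: distinct values seen in the ipc shutdown state table.
    stack_frames: function names from the reported call stack.
    """
    # Step 2: several different states across reports suggest slowness.
    if len(shutdown_states) > 1:
        return Verdict.LIKELY_SLOW_SHUTDOWN
    # Step 3: the process received the shutdown message and started
    # shutting down, but it may still have become stuck afterwards.
    if any(RECV_SHUTDOWN_FRAME in frame for frame in stack_frames):
        return Verdict.SLOW_OR_STUCK_AFTER_RECV
    # Step 4: the shutdown message never arrived.
    return Verdict.POTENTIAL_DEADLOCK
```

Only the POTENTIAL_DEADLOCK result points at something worth filing, and even then the call stack has to be read by a human to find the actual issue.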

The Nightly triage rotation taught me a lot about bug triaging and crash report analysis, and I've found this knowledge extremely useful. It not only helped me do my work better, but also expanded my knowledge of systems programming, which is something I've always wanted to know more about.

Hope you find this article helpful!