Triaging Firefox Nightly Bugs - It's fun!

You may have noticed that typing about:crashes in Firefox brings up all the crashes that have occurred in your browser. Have you ever wondered how Firefox engineers look into these crash reports and analyze them? In this post, I’ll talk about the crash triage process that I’ve been involved in for the past few months. Hopefully, you’ll find it interesting.

Nightly crash triage is done by a group of people whose responsibility is to file bugs for valid Nightly crashes. It is a small group of engineers, so we do a weekly rotation, which respects everyone’s time and also ensures that we have coverage every day.

As you are probably aware, two Nightly builds are released every day, and both of them need to be triaged. However, we don’t triage crashes the day right after a release, because it usually takes time for users to update their Nightly build and for crashes to occur.

Crash reports come with a lot of information. Pretty much everything related to the crash is included, such as the crash reason, crash address, and crash stack. Note that no personally identifiable information is included in the crash reports, so it’s impossible to correlate crash reports with individual users. Most of the time, we don’t need to check all the fields to tell what’s going on.

Bugs We File

There are some guidelines that we try to follow to avoid filing bugs that are not actionable, while making sure we don’t miss any potential bugs.

  • File bugs when more than one installation hits the same crash
    • Everyone has different machines and setups. We try to file bugs that can be reproduced on at least two different machines. Otherwise, the crash is very likely reproducible only on that particular machine and is not actionable for engineers. However, we still file a bug if we think it is valid, even if it only comes from one installation.
  • Don’t file IPC shutdown crashes if you can’t figure out what’s going on
    • I’ll talk about IPC shutdown crashes in detail later. TL;DR: IPC shutdown crashes are not real crashes. They are rarely actionable, so we only file them if we have a reason to.
  • Don’t file bugs that don’t have symbols
    • Bugs usually are not actionable without symbols.
  • For crashes in third-party libraries, file a bug if crashes with the same pattern come from multiple installations
    • There isn’t a lot we can do if the crash occurs in third-party code. However, it is still useful to file bugs for these crashes, so that we can get the attention of the third-party library’s developers. For instance, keychain-pkcs11’s developers jumped into bug 1668593 to help with the investigation.
  • File crashes with valid crash reasons such as MOZ_CRASH or MOZ_DIAGNOSTIC_ASSERT
    • These are usually real crashes, so we should file them regardless of the number of crashes.
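
To make the guidelines above concrete, here is a minimal sketch of how they could be written down as a decision function. It is only an illustration: the `CrashGroup` type, its field names, and the order in which the rules are checked are my own assumptions, not part of any real triage tooling.

```python
from dataclasses import dataclass

# Crash reasons that the guidelines treat as "valid" regardless of volume.
VALID_REASON_PREFIXES = ("MOZ_CRASH", "MOZ_DIAGNOSTIC_ASSERT")


@dataclass
class CrashGroup:
    """Hypothetical summary of all reports that share one signature."""
    installations: int                 # distinct installations that hit the crash
    has_symbols: bool                  # stack frames resolve to function names
    is_ipc_shutdown: bool              # e.g. a ShutDownKill-style signature
    crash_reason: str = ""             # e.g. "MOZ_CRASH(...)"
    triager_has_reason: bool = False   # triager judges the crash worth filing


def should_file(group: CrashGroup) -> bool:
    """Return True if the guidelines suggest filing a bug (assumed rule order)."""
    # Without symbols a bug is usually not actionable.
    if not group.has_symbols:
        return False
    # IPC shutdown crashes are only filed when we have a reason to.
    if group.is_ipc_shutdown:
        return group.triager_has_reason
    # Explicit crash reasons are filed regardless of the number of crashes.
    if group.crash_reason.startswith(VALID_REASON_PREFIXES):
        return True
    # Everything else, including third-party code, needs more than one
    # installation unless the triager thinks a single report is valid.
    return group.installations > 1 or group.triager_has_reason
```

For example, `should_file(CrashGroup(installations=1, has_symbols=True, is_ipc_shutdown=False, crash_reason="MOZ_CRASH(oops)"))` returns `True`, while the same group with an empty crash reason returns `False`. In practice these are judgment calls rather than hard rules, so treat the function as a checklist, not an oracle.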

Common Crashes

OOM

Out-of-memory crashes are common and usually not actionable. However, we should still check that they are valid.

Generally, there are two things we want to look at:

  • Available Memory
    • This means checking the Available Page File on Windows. It’s a lot harder on Linux and macOS because of the OOM killer, which could kill the Firefox process because of memory allocations made by other programs.
  • OOM allocation Size
    • If it’s very large then something might be wrong.

[[./resources/oom.png|OOM Image]]

The image above shows a bizarre crash report I noticed one day: the allocation size was huge, so I filed bug 167475 for it.
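
As a rough sketch of the two OOM checks listed above, the function below sorts a report into one of three buckets. The function name, the parameter names, and the 1 GiB cutoff for a "very large" allocation are all assumptions I made for illustration; pick whatever threshold matches your own experience.

```python
def classify_oom(available_page_file: int, allocation_size: int,
                 huge_allocation: int = 1 << 30) -> str:
    """Classify an OOM report. All sizes are in bytes.

    The 1 GiB default for `huge_allocation` is an arbitrary assumption.
    """
    # A very large request hints that something might be wrong with the size.
    if allocation_size >= huge_allocation:
        return "suspicious: very large allocation"
    # The system really did not have room for the request.
    if available_page_file < allocation_size:
        return "plausible: system was out of memory"
    # Memory looked available, so the report deserves a closer look.
    return "unclear: memory was available, look closer"
```

A 4 KiB allocation failing with only 1 KiB of page file left falls into the "plausible" bucket, while a multi-gigabyte allocation is flagged as suspicious no matter how much memory was available.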

Bad Memory Bits

Some crashes are simply caused by bad hardware, such as bad memory bits. These are actually quite common in crash reports.

The way to identify them is to expand the crash report to show other existing crashes that have the same signature. If every crash has a different crash reason and crash address, then it is very likely a bad memory bit crash.

Here’s an example of a bad memory bits crash. If we expand the report to view other crashes, we can see they all have different crash addresses. In this particular example, some addresses are very close to 0, which looks like a null pointer crash, while others are far away from 0.

[[./resources/bad_memory_crash.png|bad_memory_crash.png]]

In addition, the call stack of this crash shows that it is in the garbage collector code. The garbage collector sweeps a lot of memory, so users who have bad memory bits are very likely to crash there.
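
The "same signature, scattered addresses" heuristic can be sketched in a few lines. This is a loose reading of the rule described above, not an established algorithm: it requires every crash address to be distinct, but only asks for more than one distinct crash reason, because real reports draw from a small set of reasons. The function name and the tuple layout are hypothetical.

```python
def looks_like_bad_memory(reports: list[tuple[str, str]]) -> bool:
    """Guess whether crashes sharing one signature point at bad memory bits.

    `reports` holds one (crash_reason, crash_address) pair per crash report.
    """
    # A single report tells us nothing about how addresses are spread.
    if len(reports) < 2:
        return False
    reasons = {reason for reason, _ in reports}
    addresses = {address for _, address in reports}
    # Every report crashed at a different address, and the reasons vary too.
    return len(addresses) == len(reports) and len(reasons) > 1
```

Reports that all crash at the same address with the same reason fail this check, which is what we want: that pattern looks like a real, reproducible bug rather than flaky hardware.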

Shutdown Hang

Shutdown hangs are also quite common in crash reports, and they drove me nuts in the beginning. The signature usually looks similar to [@ IPCError-browser | ShutDownKill ]. However, they are not real crashes; they are reports of slow shutdowns. The reported call stacks don’t indicate where a crash happened. Instead, they are snapshots of a content process that was told to shut down but didn’t manage to do so within a 20-second limit.

There are two possible causes here.

  1. The content process was slow to respond - We don’t want to file bugs for this, because there isn’t much we can do other than improving the speed of shutdown.
  2. The content process was deadlocked while trying to shut down - This is something we care about, and it is actionable.

There are a couple of steps that I follow to investigate these bugs.

  1. Include more reports by clicking the More Reports button and expanding the date range to at least a month.
  2. Check the ipc shutdown state table under the aggregation tab. The value reported by a content process while it’s shutting down could be either RecvShutdown or SendFinishShutdown. If the table has multiple values, then it’s likely a slow shutdown.
  3. Check the call stack to see if it contains the mozilla::dom::ContentChild::RecvShutdown() method. The presence of this method indicates that the content process had received the shutdown IPC message and had started shutting down. If you see this method, it’s likely that the process was just slow to shut down; however, it is still possible that the process got stuck after it received the message.
  4. If the content process didn’t receive the shutdown message, that’s a clear sign of a potential deadlock. You should look at the call stack and see if you can find an issue there.
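
The investigation steps above boil down to a small decision procedure, sketched below. The function name and its inputs are hypothetical (you would read the shutdown states and stack frames off the crash report by hand), and checking the call stack before the state table is my own ordering assumption.

```python
RECV_SHUTDOWN_FRAME = "mozilla::dom::ContentChild::RecvShutdown"


def classify_shutdown_hang(shutdown_states: set[str],
                           stack_frames: list[str]) -> str:
    """Rough verdict for a ShutDownKill-style report.

    `shutdown_states` are the distinct values seen in the ipc shutdown state
    table; `stack_frames` are the function names in the reported call stack.
    """
    received = any(RECV_SHUTDOWN_FRAME in frame for frame in stack_frames)
    # The process never handled the shutdown message: look for a deadlock.
    if not received:
        return "potential deadlock"
    # Several different states across reports suggest plain slowness.
    if len(shutdown_states) > 1:
        return "likely slow shutdown"
    # The message arrived, but the process may still have got stuck later.
    return "probably slow shutdown, check the stack anyway"
```

A report whose stack lacks the `RecvShutdown` frame is the one worth spending time on; the other two verdicts are mostly a reminder not to file a bug without a closer look.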

Joining the Nightly Triage rotation taught me a lot about bug triaging and crash report analysis, and I have found that knowledge extremely useful. It has not only helped me do my work better, but also expanded my understanding of systems programming, which is something I’ve always wanted to know more about.

Hope you find this article helpful!