You may have noticed that typing
about:crashes in Firefox brings up all the crashes that
have occurred in your browser. Have you ever wondered how Firefox engineers
look into these crash reports and analyze them? In this post, I’ll talk
about the crash triage process that I’ve been involved in for the past few months,
and hopefully you’ll find it interesting.
Nightly crash triage is handled by a group of people who are responsible for
filing bugs for crashes that are valid. It is a small group of engineers, so we do a
weekly rotation, which respects everyone’s time and also ensures that we have
coverage every day.
As you are probably aware, two
Nightly builds get released
every day, and both of them need to be triaged. However, we don’t triage crashes
the day right after a release because it usually takes time for users to update to the new
Nightly build, and it also takes time for crashes to occur.
Crash reports come with a lot of information. Pretty much everything related to the crash is included, such as crash reasons, crash addresses, and crash stacks. Note that no personally identifiable information is included in the crash reports, so it’s impossible to correlate crash reports with individual users. Most of the time, we don’t need to check all the fields to tell what’s going on.
Bugs We File
There are some guidelines that we try to follow to avoid filing bugs that are not actionable while making sure we don’t miss any potential bugs.
- File bugs when more than one installation hits the same crash
- Everyone has different machines and setups. We try to file bugs for crashes that can be reproduced on at least two different machines; otherwise, the crash is very likely reproducible only on that particular machine and not actionable for engineers. However, we still file a bug if we think the crash is valid, even if it only comes from one installation.
- Don’t file IPC shutdown crashes if you can’t figure out what’s going on.
- I’ll talk about IPC shutdown crashes in detail later. TL;DR: IPC shutdown crashes are not real crashes. They are rarely actionable, so we only file them if we have a reason to.
- Don’t file bugs that don’t have symbols
- Bugs are usually not actionable without symbols.
- For crashes in third-party libraries, file if crashes with the same pattern come
from multiple installations
- There isn’t a lot that we can do if the crash occurs in third-party
code. However, it is still useful to file bugs for these crashes so that
we can get the attention of the third-party library’s developers. For instance,
keychain-pkcs11’s developers jumped into bug 1668593 to help with the investigation.
- Valid crash reasons such as
- These are usually real crashes, so we should file them regardless of the number of crashes.
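As a rough sketch, the guidelines above could be written as a small Python function. Everything here is hypothetical and only for illustration: the `CrashGroup` fields, the `triager_has_lead` flag, and the order in which the rules are checked are assumptions, not part of any real triage tooling.

```python
from dataclasses import dataclass


@dataclass
class CrashGroup:
    """Hypothetical summary of all reports that share one signature."""
    installations: int             # distinct installations hitting the crash
    has_symbols: bool              # the stack is symbolicated
    is_ipc_shutdown: bool          # the signature is an IPC shutdown report
    in_third_party_code: bool      # the crash is inside a third-party library
    has_valid_crash_reason: bool   # the report carries a valid crash reason
    triager_has_lead: bool = False  # the triager has a concrete reason to file


def should_file(group: CrashGroup) -> bool:
    """Apply the filing guidelines in an assumed order of precedence."""
    if not group.has_symbols:
        return False  # bugs without symbols are usually not actionable
    if group.is_ipc_shutdown:
        return group.triager_has_lead  # only file when there is a reason
    if group.has_valid_crash_reason:
        return True  # file regardless of the number of crashes
    if group.in_third_party_code:
        return group.installations > 1  # same pattern from several installations
    # Default rule: more than one installation, unless the triager thinks it is valid.
    return group.installations > 1 or group.triager_has_lead
```

In practice these are judgment calls rather than hard rules, so a function like this is better read as a checklist than as an automatic filter.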
Out-of-Memory crashes are common and usually not actionable. However, we should still check whether they are valid.
Generally, there are two things we want to look at:
- Available Memory
- This means checking the
Available Page File on Windows. It’s a lot harder on Linux and macOS because of the OOM killer, which may kill the Firefox process in response to memory allocations made by other programs.
- OOM allocation Size
- If it’s very large, then something might be wrong.
The image above shows a bizarre crash report I noticed one day: the allocation size was huge, so I filed bug 167475 for it.
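The two checks can be sketched in a few lines of Python. The function name, the parameter names, and the 1 GiB cut-off are assumptions made for this example, and treating "the page file had room" as a warning sign is my own reading of the available-memory check rather than an established rule.

```python
def review_oom_report(allocation_size: int,
                      available_page_file: int,
                      huge_allocation: int = 1 << 30) -> list:
    """Return notes on why an OOM report may deserve a closer look.

    Both sizes are in bytes. `huge_allocation` is an assumed threshold.
    """
    notes = []
    if allocation_size >= huge_allocation:
        # A single very large allocation suggests something is wrong.
        notes.append("allocation size is unusually large")
    if available_page_file > allocation_size:
        # Low memory alone does not explain the failure.
        notes.append("page file had room for the allocation")
    return notes


# A small allocation on a machine that was almost out of memory: nothing unusual.
print(review_oom_report(64 * 1024, 1024))
```

An empty list means the report looks like an ordinary, non-actionable OOM crash under these assumptions.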
Bad Memory Bits
Some crashes are just caused by bad hardware, such as bad memory bits. This is actually quite common in crash reports.
The way to identify them is to expand the crash report to show other existing crashes that have the same signature. If every crash has a different crash reason and crash address, then it is very likely a bad memory bit crash.
Here’s an example of a bad memory bits crash. If we expand the report to view other crashes, we can see that they all have different crash addresses. In this particular example, some addresses are very close to 0, which looks like a null pointer crash, while other addresses are far from 0.
In addition, the call stack of this crash shows that it is in the garbage collector code. The garbage collector sweeps lots of memory, so users who have bad memory bits are very likely to crash there.
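The identification rule can be expressed as a tiny check over the reports that share a signature. This is a minimal sketch of a literal reading of the rule; the function name and the sample values below are made up for illustration, and a real check would probably be more tolerant than requiring every value to be unique.

```python
def looks_like_bad_memory(reports) -> bool:
    """`reports` is a list of (crash_reason, crash_address) pairs for one signature.

    Returns True when every report has a different reason and a different address.
    """
    if len(reports) < 2:
        return False  # nothing to compare against
    reasons = {reason for reason, _ in reports}
    addresses = {address for _, address in reports}
    return len(reasons) == len(reports) and len(addresses) == len(reports)


# Illustrative sample: addresses near 0 mixed with addresses far from 0.
sample = [
    ("reason_a", 0x8),
    ("reason_b", 0x7FFD3A2C1000),
    ("reason_c", 0x1F00AA10),
]
print(looks_like_bad_memory(sample))
```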
Shutdown hangs are also quite common in crash reports, and they drove me nuts
in the beginning. The signature usually looks similar to
[@ IPCError-browser | ShutDownKill ]. However, they are not real crashes.
They are reports of slow shutdowns. The reported call stacks don’t indicate
where the process crashed; in fact, they are snapshots of a content process that was told
to shut down but didn’t manage to do so within the 20-second limit.
There are two possible causes here.
- The content process was slow to respond. We don’t want to file bugs for this because there isn’t much we can do other than improve the speed of shutdown.
- The content process was deadlocked while trying to shut down. This is something we care about, and it is actionable.
There are a couple of steps that I follow to investigate these bugs.
- Expand the date range to include more reports by clicking the
More Reports button, increasing the range to at least a month.
- Check the
ipc shutdown state table under the
aggregation tab. The value could be either
SendFinishShutdown by a content process while it’s shutting down.
- If the table has multiple values, then it’s likely a slow shutdown.
- Check the call stack to see if it contains the
mozilla::dom::ContentChild::RecvShutdown() method. The occurrence of this method indicates that the content process had received the shutdown IPC message and had started shutting down. If you see this method, it’s likely that the process was just slow to shut down; however, it is possible that the process still got stuck after it received the message.
- If the content process didn’t receive the shutdown message, then it’s a clear sign of a potential deadlock. You should look at the call stack and see if you can find an issue there.
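The steps above can be summarized as a short decision function. This is only a sketch: the function name, its inputs, and the order in which the checks are applied are assumptions for illustration. The one identifier taken from the steps themselves is the `mozilla::dom::ContentChild::RecvShutdown` frame name.

```python
RECV_SHUTDOWN_FRAME = "mozilla::dom::ContentChild::RecvShutdown"


def triage_shutdown_kill(shutdown_states, stack_frames) -> str:
    """Classify a shutdown report.

    `shutdown_states`: values seen in the ipc shutdown state table across reports.
    `stack_frames`: frame names from the call stack of the report being inspected.
    """
    if len(set(shutdown_states)) > 1:
        # Multiple states across reports point at a slow shutdown.
        return "likely slow shutdown"
    if any(RECV_SHUTDOWN_FRAME in frame for frame in stack_frames):
        # The shutdown message arrived, but the process may still have got stuck.
        return "probably slow shutdown, may still be stuck"
    # The shutdown message was never received.
    return "potential deadlock, inspect the call stack"
```

Only the last outcome is a clear candidate for filing a bug; the second one still needs a look at the stack before deciding.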
The Nightly triage rotation has taught me a lot
about bug triaging and crash report analysis, and I’ve found it extremely
useful. The knowledge I gained not only helped me do my work better, but
also expanded my knowledge of systems programming, which is something
that I have always wanted to know more about.
I hope you find this article helpful!