You may have noticed that typing
about:crashes in Firefox brings up all the crashes that
have occurred in your browser. Have you ever wondered how Firefox engineers
look into these crash reports and analyze them? In this post, I'll talk
about the crash triage process that I've been involved in for the past few months.
Hopefully, you will find it interesting.
The crash triage team is a group of people responsible for filing bugs for
valid crashes. It is a small group of engineers, so we work on a
weekly rotation, which respects everyone's time and also ensures that we have
coverage every day.
As you are probably aware, two
Nightly builds are released
every day, and both of them need to be triaged. However, we don't triage crashes
the day right after a release, because it usually takes time for users to update to the new
Nightly build, and it also takes time for crashes to occur.
Crash reports come with a lot of information. Pretty much everything related to the crash is included, such as the crash reason, the crash address and the crash stack. Note that no personally identifiable information is included in the crash reports, so it's impossible to correlate crash reports with individual users. Most of the time, we don't need to check all the fields to tell what's going on.
We try to follow a few approaches to avoid filing bugs that are not actionable while also making sure we don't miss any potential bugs.
keychain-pkcs11's developers jumped into bug 1668593 to help with the investigation.
Out-of-memory (OOM) crashes are common and usually not actionable. However, we should still check whether they are valid.
Generally, there are two things we want to look at:
Available Page File on Windows. It's a lot harder on Linux and macOS due to the existence of the OOM killer, which could kill the Firefox process because of memory allocations made by other programs.
The above image shows a bizarre crash report that I noticed one day because the allocation size was huge, so I filed bug 167475 for it.
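As a rough illustration of the check described above, the sketch below flags an OOM report as worth a closer look when the requested allocation is implausibly large, or when the page file apparently still had room for it. The function name, the parameter names and the 1 GiB threshold are assumptions made for this example. They do not come from the crash reporter or from any triage tooling, and real reports need human judgment.

```python
GIB = 1024 ** 3


def oom_needs_a_closer_look(allocation_size, available_page_file=None,
                            huge_allocation=GIB):
    """Return True when an OOM crash report looks unusual enough to investigate.

    allocation_size: bytes requested by the failed allocation.
    available_page_file: bytes of page file left (Windows only), or None
        when the report does not carry that value.
    huge_allocation: assumed threshold above which one allocation is suspicious.
    """
    if allocation_size >= huge_allocation:
        # A single enormous request hints at a bug rather than a busy machine.
        return True
    if available_page_file is not None and available_page_file > allocation_size:
        # The page file apparently had room for the request, so a plain
        # "the machine ran out of memory" explanation does not fit well.
        return True
    return False
```

With these assumed thresholds, a 4 GiB request is always flagged, while a 4 KiB request is flagged only if the reported page file still had more than 4 KiB available.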
Some crashes are simply caused by bad hardware, such as bad memory bits. This is actually quite common in crash reports.
The way to identify them is to expand the crash report to show other existing crashes that have the same signature. If every crash has a different crash reason and crash address, then it is very likely a bad memory bit crash.
Here's an example of a bad memory bit crash. If we expand the report to view other crashes, we can see that they all have different crash addresses. In this particular example, some addresses are very close to 0, which looks like a null pointer crash, while other addresses are far away from 0. Here's the screenshot of the crash addresses.
In addition, the call stack of this crash shows that it is in the garbage collector code. The garbage collector sweeps a lot of memory, so users who have bad memory bits are very likely to crash there.
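The heuristic above can be sketched in a few lines. This is only an illustration: the function name and the input shape (a list of reason and address pairs collected by hand from reports sharing one signature) are hypothetical. It also loosens "every crash has a different reason" to "the reasons are not all the same", because there are only a handful of distinct crash reasons to begin with.

```python
def looks_like_bad_memory(crashes):
    """Guess whether crashes sharing one signature point at bad memory bits.

    crashes: list of (crash_reason, crash_address) tuples, one per report.
    Returns True when every address is different and the reasons vary.
    """
    if len(crashes) < 2:
        # One report alone cannot show any variation.
        return False
    reasons = {reason for reason, _ in crashes}
    addresses = {address for _, address in crashes}
    every_address_differs = len(addresses) == len(crashes)
    reasons_vary = len(reasons) > 1
    return every_address_differs and reasons_vary


# Made-up sample data: some addresses near 0, others far away from it.
sample = [
    ("EXCEPTION_ACCESS_VIOLATION_READ", 0x8),
    ("EXCEPTION_ACCESS_VIOLATION_WRITE", 0x7FFD1234),
    ("EXCEPTION_ILLEGAL_INSTRUCTION", 0x1F00A0),
]
bad_memory_suspected = looks_like_bad_memory(sample)
```

A positive result is only a hint. Checking the call stack, as described above, is still needed before deciding not to file a bug.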
Shutdown hangs are also quite common in crash reports, and they drove me nuts
in the beginning. The signature usually looks similar to
[@ IPCError-browser | ShutDownKill ]. However, they are not real crashes.
They are reports of slow shutdowns. The reported call stacks don't indicate
where a crash happened. Instead, they are snapshots of a content process that was told
to shut down but didn't manage to do so within a 20-second limit.
There are two possible causes here.
There are a couple of steps that I follow to investigate these bugs.
Use the More Reports button to increase the date range to at least a month.
Check the ipc shutdown state table under the
aggregation tab. The value could be either
SendFinishShutdown by a content process while it's shutting down.
mozilla::dom::ContentChild::RecvShutdown() method. The occurrence of this method indicates that the content process had received the shutdown IPC message and had started shutting down. If you see this method, it's likely that the process was just slow to shut down. However, it is still possible that the process got stuck after it received the message.
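To make the last check concrete, here is a minimal sketch that looks for the RecvShutdown frame in a reported stack. The function name, the returned strings and the idea of passing the stack as a list of frame strings are assumptions for illustration. In practice this check is done by reading the report on the crash stats site.

```python
SHUTDOWN_KILL_MARKER = "IPCError-browser | ShutDownKill"
RECV_SHUTDOWN_FRAME = "mozilla::dom::ContentChild::RecvShutdown"


def describe_shutdown_report(signature, stack_frames):
    """Give a first impression of a report based on its signature and stack.

    signature: the crash signature string.
    stack_frames: list of frame names from the content process snapshot.
    """
    if SHUTDOWN_KILL_MARKER not in signature:
        return "not a shutdown hang"
    if any(RECV_SHUTDOWN_FRAME in frame for frame in stack_frames):
        # The content process saw the shutdown IPC message, so it was
        # most likely slow, although it may still have become stuck later.
        return "shutdown message received"
    return "no RecvShutdown frame in the stack"
```

A "shutdown message received" result matches the likely-slow case described above. The other result means the stack needs a closer look.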
The Nightly triage rotation taught me a lot
about bug triaging and crash report analysis, and I have found that knowledge extremely
useful. It has not only helped me do my work better but
also expanded my knowledge of systems programming, which is something
that I have always wanted to know more about.
I hope you find this article helpful!