Bug 128928 - Improve GDB backtrace generation for GTK/EFL
Summary: Improve GDB backtrace generation for GTK/EFL
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Linux
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 160927
Blocks:
  Show dependency treegraph
 
Reported: 2014-02-17 12:17 PST by Carlos Alberto Lopez Perez
Modified: 2016-08-16 18:10 PDT (History)
8 users (show)

See Also:


Attachments
patch (21.07 KB, patch)
2016-07-14 08:36 PDT, Philippe Normand
no flags Details | Formatted Diff | Diff
patch (21.03 KB, patch)
2016-07-14 08:37 PDT, Philippe Normand
no flags Details | Formatted Diff | Diff
patch (17.49 KB, patch)
2016-07-22 00:36 PDT, Philippe Normand
no flags Details | Formatted Diff | Diff
patch (17.39 KB, patch)
2016-07-22 00:50 PDT, Philippe Normand
cgarcia: review+
cgarcia: commit-queue-
Details | Formatted Diff | Diff
patch for landing (18.11 KB, patch)
2016-07-22 03:14 PDT, Philippe Normand
no flags Details | Formatted Diff | Diff
patch for landing (18.08 KB, patch)
2016-07-22 03:38 PDT, Philippe Normand
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Carlos Alberto Lopez Perez 2014-02-17 12:17:09 PST
To current GDB backtrace generation (see https://bugs.webkit.org/show_bug.cgi?id=119338 )
uses the Linux kernel sysctl core_pattern variable to generate the core dumps with a specific name.

  echo "/path/to/cores/core-pid_%p-_-process_%e" > /proc/sys/kernel/core_pattern

The problem is that the executable filename (variable "%e") used in core_pattern
has a limit of 16 characters. See comm variable defined on task_struct at include/linux/sched.h

This causes most of the times that cores generated have a filename on disk
different than the one expected:

Ex: 
core-pid_25409-_-process_WebKitTestRunne
core-pid_25411-_-process_WebKitWebProces
core-pid_25461-_-process_testwebinspecto

                              ^ notice the cut

This causes that the script that launches gdb when a test crashes is unable to find the core.
See for example our current GTK Linux 64-bit Release bot:
http://build.webkit.org/results/GTK%20Linux%2064-bit%20Release/r164108%20%2844874%29/fast/dom/Geolocation/not-enough-arguments-crash-log.txt

In order to fix this the variable "%E" could be used (introduced on https://git.kernel.org/linus/57cc083ad ),
but a more interesting approach is to set core_pattern to pipe to a program
(introduced on https://git.kernel.org/linus/d025c9d ).

This approach of piping the core to a program have some extra benefits:

This program can have much more information about the crashed process since
the /proc/$pid structures are still available when this helper program is executed.
So it can extract easily more information about the crashed program.

Also the tests can set a WEBKIT_CURRENT_TEST="name_test" environment variable
that this program can check (/proc/$pid/environ) in order to detect without
doubt which test was the one that crashed.
This avoid the problem that we currently have when the crashing process
is a child and we don't know its pid. What the current code does is to
just pick the most recent core matching the executable filename. This is
not deterministic in a context where the tests are ran in parallel.
Comment 1 Philippe Normand 2014-02-17 21:48:05 PST
Nice investigation Carlos!

So I suppose this broke since we moved the debug bot to WK2.
Comment 2 Carlos Alberto Lopez Perez 2014-02-18 03:51:37 PST
> So I suppose this broke since we moved the debug bot to WK2.

Yes, this affects mainly WK2, because with WK1 the process that crashes is most of the times "DumpRenderTree" which fits the 15 char limit.

I'm going to start working on a patch to implement this.
Comment 3 Philippe Normand 2014-11-10 05:14:01 PST
I plan to upload a non-intrusive patch in Bug 138564 but this bug should be also fixed in the mid-term :)
Comment 4 Michael Catanzaro 2015-12-22 11:15:21 PST
Nice investigation indeed.

The right thing to do is to get cores from coredumpctl so we don't have to mess around with changing the core_pattern. I'd be interested in working on this eventually (for some large value of eventually).
Comment 5 Carlos Alberto Lopez Perez 2015-12-23 05:07:02 PST
(In reply to comment #4)
> Nice investigation indeed.
> 
> The right thing to do is to get cores from coredumpctl so we don't have to
> mess around with changing the core_pattern. I'd be interested in working on
> this eventually (for some large value of eventually).

I guess that would require the users to run systemd. This can be problematic on the bots (they run on unprivileged containers, and systemd don't plays nice there)


Also I'm not sure if it will solve the main problem, that is:

 * Due to WebKit multiprocess model, our tooling for the tests can only know the pid of WebKitTestRunner, but not the pid of any of the childs (WebKitWebProcess, WebKitNetworkProcess, etc).

So, when you get a coredump for WebKitWebProcess (for example) with pid NNNN you can't tell which test caused the crash.

The only reasonable way of solving this that I see is setting an environment variable (that the childs will inherit) with the test name.

When it crashes, the program invoked via the pipe in core_pattern will have access to /proc/$pid/environ and can read this.

With coredumpctl, such program will be systemd-coredump.

So...  the main question.... is coredumpctl able to get and print the environ variables that a program had defined when it crashed for a given coredump? If is not, I don't think it will be of any help here.
Comment 6 Michael Catanzaro 2015-12-23 06:08:20 PST
coredumpctl is not useful for bots -- anything that makes coredumps display in the web interface would be a great solution there -- but for humans running tests manually, we should try to figure out a solution that doesn't require editing core_pattern to disable it.

(In reply to comment #5)
> I guess that would require the users to run systemd. This can be problematic
> on the bots (they run on unprivileged containers, and systemd don't plays
> nice there)

Huh, I didn't realize that. I thought all Linux except Gentoo was using systemd nowadays (certainly it's installed on the bots I used yesterday), and that it was designed to handle unprivileged containers nicely -- that's the whole point of machinectl, right? but I've hardly used containers myself.

> So...  the main question.... is coredumpctl able to get and print the
> environ variables that a program had defined when it crashed for a given
> coredump? If is not, I don't think it will be of any help here.

Hm, I don't think so. You can match on an awful lot, anything listed in systemd.journal-fields(7), but not environment variables. It'd be easy to add though, since it has access to /proc/$pid/environ as you say, we'd just need to make it save the environment as a journal field. This is how the abrt coredumpctl integration was written -- abrt needed more info on the coredumps, so they modified coredumpctl to save it in journal fields. But that is a lot more work....
Comment 7 Carlos Alberto Lopez Perez 2015-12-23 06:42:51 PST
(In reply to comment #6)
> coredumpctl is not useful for bots -- anything that makes coredumps display
> in the web interface would be a great solution there -- but for humans
> running tests manually, we should try to figure out a solution that doesn't
> require editing core_pattern to disable it.
> 

Humans running tests can already match a coredump with a test name. Is a matter of running only one test at a time and picking the last coredump. 

The main use case here is for the bots. The bots runs thousands of tests in parallel, and they need a way to match the coredump with the test name. Otherwise is not possible to show the backtrace/crash log on the results page that the bot published.

For example: https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20%28Tests%29/r189464%20%2810899%29/results.html

Notice the empty crash logs on that html page published by the bot? That's what we are trying to fix here.

> (In reply to comment #5)
> > I guess that would require the users to run systemd. This can be problematic
> > on the bots (they run on unprivileged containers, and systemd don't plays
> > nice there)
> 
> Huh, I didn't realize that. I thought all Linux except Gentoo was using
> systemd nowadays (certainly it's installed on the bots I used yesterday),
> and that it was designed to handle unprivileged containers nicely -- that's
> the whole point of machinectl, right? but I've hardly used containers myself.
> 

Well, Ubuntu is still using upstart (the switch to systemd still didn't happened.. I guess that by 16.04 maybe).

And with unprivileged containers I mean LXC namespace based containers, and systemd is not running correctly inside there.

I think i read that in the latest releases of systemd this was somehow fixed, but I didn't tested it. So I remain cautious about the possibility of using something like coredumpctl inside a LXC namespace based container until I test it and see it working.

> > So...  the main question.... is coredumpctl able to get and print the
> > environ variables that a program had defined when it crashed for a given
> > coredump? If is not, I don't think it will be of any help here.
> 
> Hm, I don't think so. You can match on an awful lot, anything listed in
> systemd.journal-fields(7), but not environment variables. It'd be easy to
> add though, since it has access to /proc/$pid/environ as you say, we'd just
> need to make it save the environment as a journal field. This is how the
> abrt coredumpctl integration was written -- abrt needed more info on the
> coredumps, so they modified coredumpctl to save it in journal fields. But
> that is a lot more work....

If you are interested on adding this to systemd please go ahead.. 

But take into account that because of the release cycles of the distributions, the time it will take to reach the systemd used by all our developers (debian/ubuntu/fedora/etc) will be measured in years.
Comment 8 Carlos Alberto Lopez Perez 2015-12-23 06:45:50 PST
(In reply to comment #7)
> Well, Ubuntu is still using upstart (the switch to systemd still didn't
> happened.. I guess that by 16.04 maybe).

Correction: they switched to systemd on 15.04 according to this wiki https://wiki.ubuntu.com/SystemdForUpstartUsers#System_Init_Daemon
Comment 9 Philippe Normand 2016-07-07 10:51:11 PDT
(In reply to comment #5)
> (In reply to comment #4)
> > Nice investigation indeed.
> > 
> > The right thing to do is to get cores from coredumpctl so we don't have to
> > mess around with changing the core_pattern. I'd be interested in working on
> > this eventually (for some large value of eventually).
> 
> I guess that would require the users to run systemd. This can be problematic
> on the bots (they run on unprivileged containers, and systemd don't plays
> nice there)
> 
> 
> Also I'm not sure if it will solve the main problem, that is:
> 
>  * Due to WebKit multiprocess model, our tooling for the tests can only know
> the pid of WebKitTestRunner, but not the pid of any of the childs
> (WebKitWebProcess, WebKitNetworkProcess, etc).

This can be solved by implementing private C APIs for WKContext and WKPage, like on Mac. I started a patch...

Also changing the crash log analisys as advised in the initial description of the bug.
Comment 10 Philippe Normand 2016-07-14 08:36:11 PDT
Created attachment 283646 [details]
patch
Comment 11 Philippe Normand 2016-07-14 08:37:43 PDT
Created attachment 283647 [details]
patch
Comment 12 Carlos Garcia Campos 2016-07-16 04:11:23 PDT
Comment on attachment 283647 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=283647&action=review

> Source/WebKit2/UIProcess/API/C/gtk/WKContextPrivateGtk.cpp:41
> +pid_t WKContextGetNetworkProcessIdentifier(WKContextRef contextRef)
> +{
> +    return toImpl(contextRef)->networkProcessIdentifier();
> +}
> +
> +pid_t WKContextGetDatabaseProcessIdentifier(WKContextRef contextRef)
> +{
> +    return toImpl(contextRef)->databaseProcessIdentifier();

Why is this specific to GTK+? Why don't we add this to WKContext.cpp and WKContextPrivate.h?

> Source/WebKit2/UIProcess/API/C/gtk/WKPagePrivateGtk.cpp:37
> +pid_t WKPageGetProcessIdentifier(WKPageRef pageRef)
> +{
> +    return toImpl(pageRef)->processIdentifier();
> +}

Ditto, this could be added to WKPage.cpp and WKPagePrivate.h, since this is exposing cross-platform API

> Tools/Scripts/process-linux-coredump:26
> +    full_path = os.path.join(destination_directory, 'core-pid_%d.dump'  % int(pid))

Why do you need to convert the pid to int? Why not just use %s instead? Or the script could receive the path already built no?

> Tools/Scripts/webkitpy/port/linux_get_crash_log_unittest.py:49
>          core_pattern = os.path.join(core_directory, "core-pid_%p-_-process_%E")
> +        core_pattern = "|%s %%p /path/to/coredumps" % process_coredump_script_path

Shouldn't you remove the previous core_pattern assignation?

> Tools/WebKitTestRunner/TestController.cpp:1487
>      pid_t pid = WKContextGetNetworkProcessIdentifier(m_context.get());

Ah, mac already has an impl of this. In that case it should be moved to the cross-platform files, I think, instead of duplicating it in GTK specific files.
Comment 13 Philippe Normand 2016-07-22 00:04:27 PDT
Comment on attachment 283647 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=283647&action=review

>> Tools/Scripts/process-linux-coredump:26
>> +    full_path = os.path.join(destination_directory, 'core-pid_%d.dump'  % int(pid))
> 
> Why do you need to convert the pid to int? Why not just use %s instead? Or the script could receive the path already built no?

No strong reason for the conversion. The script cannot receive the full path already built because the script invocation is hardcoded in the core_pattern.

>> Tools/Scripts/webkitpy/port/linux_get_crash_log_unittest.py:49
>> +        core_pattern = "|%s %%p /path/to/coredumps" % process_coredump_script_path
> 
> Shouldn't you remove the previous core_pattern assignation?

oops

>> Tools/WebKitTestRunner/TestController.cpp:1487
>>      pid_t pid = WKContextGetNetworkProcessIdentifier(m_context.get());
> 
> Ah, mac already has an impl of this. In that case it should be moved to the cross-platform files, I think, instead of duplicating it in GTK specific files.

Ok I can try this.
Comment 14 Philippe Normand 2016-07-22 00:36:06 PDT
Created attachment 284313 [details]
patch
Comment 15 Philippe Normand 2016-07-22 00:41:39 PDT
(In reply to comment #13)
> Comment on attachment 283647 [details]
> patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=283647&action=review
> 
> >> Tools/Scripts/process-linux-coredump:26
> >> +    full_path = os.path.join(destination_directory, 'core-pid_%d.dump'  % int(pid))
> > 
> > Why do you need to convert the pid to int? Why not just use %s instead? Or the script could receive the path already built no?
> 
> No strong reason for the conversion. The script cannot receive the full path
> already built because the script invocation is hardcoded in the core_pattern.
> 

Ah sorry this can be improved indeed. I'll update the patch.
Comment 16 Philippe Normand 2016-07-22 00:50:27 PDT
Created attachment 284314 [details]
patch
Comment 17 Carlos Garcia Campos 2016-07-22 01:14:54 PDT
Comment on attachment 284314 [details]
patch

Patch looks good to me, you just need to fix mac builds (update the header includes), so please submit a new patch and wait for EWS before landing
Comment 18 Philippe Normand 2016-07-22 03:14:16 PDT
Created attachment 284321 [details]
patch for landing
Comment 19 Philippe Normand 2016-07-22 03:38:27 PDT
Created attachment 284322 [details]
patch for landing
Comment 20 Philippe Normand 2016-07-24 23:09:02 PDT
Committed r203674: <http://trac.webkit.org/changeset/203674>
Comment 21 Michael Catanzaro 2016-07-25 07:25:30 PDT
It broke a thing:
                                                                                                  [692/1454] webkitpy.port.linux_get_crash_log_unittest.GDBCrashLogGeneratorTest.test_generate_crash_log failed:
  Traceback (most recent call last):
    File "/home/slave/webkitgtk/gtk-linux-64-release-tests/build/Tools/Scripts/webkitpy/port/linux_get_crash_log_unittest.py", line 67, in test_generate_crash_log
      self.assertMultiLineEqual(log, mock_empty_crash_log)
  AssertionError: 'crash log for DumpRenderTree (pid 28529):\n\nCoredump core-pid_28529.dump not f [truncated]... != 'crash log for DumpRenderTree (pid 28529):\n\nCoredump core-pid_28529.dump not f [truncated]...
  Diff is 822 characters long. Set self.maxDiff to None to see it.

Could you investigate please?
Comment 22 Philippe Normand 2016-07-25 09:32:47 PDT
Thanks, follow-up in r203685