Bug 128928

Summary: Improve GDB backtrace generation for GTK/EFL
Product: WebKit Reporter: Carlos Alberto Lopez Perez <clopez>
Component: Tools / TestsAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: abrhm, cgarcia, commit-queue, glenn, mcatanzaro, ossy, pnormand, sergio
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Linux   
Bug Depends on: 160927    
Bug Blocks:    
Attachments:
Description Flags
patch
none
patch
none
patch
none
patch
cgarcia: review+, cgarcia: commit-queue-
patch for landing
none
patch for landing none

Carlos Alberto Lopez Perez
Reported 2014-02-17 12:17:09 PST
To current GDB backtrace generation (see https://bugs.webkit.org/show_bug.cgi?id=119338 ) uses the Linux kernel sysctl core_pattern variable to generate the core dumps with a specific name. echo "/path/to/cores/core-pid_%p-_-process_%e" > /proc/sys/kernel/core_pattern The problem is that the executable filename (variable "%e") used in core_pattern has a limit of 16 characters. See comm variable defined on task_struct at include/linux/sched.h This causes most of the times that cores generated have a filename on disk different than the one expected: Ex: core-pid_25409-_-process_WebKitTestRunne core-pid_25411-_-process_WebKitWebProces core-pid_25461-_-process_testwebinspecto ^ notice the cut This causes that the script that launches gdb when a test crashes is unable to find the core. See for example our current GTK Linux 64-bit Release bot: http://build.webkit.org/results/GTK%20Linux%2064-bit%20Release/r164108%20%2844874%29/fast/dom/Geolocation/not-enough-arguments-crash-log.txt In order to fix this the variable "%E" could be used (introduced on https://git.kernel.org/linus/57cc083ad ), but a more interesting approach is to set core_pattern to pipe to a program (introduced on https://git.kernel.org/linus/d025c9d ). This approach of piping the core to a program have some extra benefits: This program can have much more information about the crashed process since the /proc/$pid structures are still available when this helper program is executed. So it can extract easily more information about the crashed program. Also the tests can set a WEBKIT_CURRENT_TEST="name_test" environment variable that this program can check (/proc/$pid/environ) in order to detect without doubt which test was the one that crashed. This avoid the problem that we currently have when the crashing process is a child and we don't know its pid. What the current code does is to just pick the most recent core matching the executable filename. This is not deterministic in a context where the tests are ran in parallel.
Attachments
patch (21.07 KB, patch)
2016-07-14 08:36 PDT, Philippe Normand
no flags
patch (21.03 KB, patch)
2016-07-14 08:37 PDT, Philippe Normand
no flags
patch (17.49 KB, patch)
2016-07-22 00:36 PDT, Philippe Normand
no flags
patch (17.39 KB, patch)
2016-07-22 00:50 PDT, Philippe Normand
cgarcia: review+
cgarcia: commit-queue-
patch for landing (18.11 KB, patch)
2016-07-22 03:14 PDT, Philippe Normand
no flags
patch for landing (18.08 KB, patch)
2016-07-22 03:38 PDT, Philippe Normand
no flags
Philippe Normand
Comment 1 2014-02-17 21:48:05 PST
Nice investigation Carlos! So I suppose this broke since we moved the debug bot to WK2.
Carlos Alberto Lopez Perez
Comment 2 2014-02-18 03:51:37 PST
> So I suppose this broke since we moved the debug bot to WK2. Yes, this affects mainly WK2, because with WK1 the process that crashes is most of the times "DumpRenderTree" which fits the 15 char limit. I'm going to start working on a patch to implement this.
Philippe Normand
Comment 3 2014-11-10 05:14:01 PST
I plan to upload a non-intrusive patch in Bug 138564 but this bug should be also fixed in the mid-term :)
Michael Catanzaro
Comment 4 2015-12-22 11:15:21 PST
Nice investigation indeed. The right thing to do is to get cores from coredumpctl so we don't have to mess around with changing the core_pattern. I'd be interested in working on this eventually (for some large value of eventually).
Carlos Alberto Lopez Perez
Comment 5 2015-12-23 05:07:02 PST
(In reply to comment #4) > Nice investigation indeed. > > The right thing to do is to get cores from coredumpctl so we don't have to > mess around with changing the core_pattern. I'd be interested in working on > this eventually (for some large value of eventually). I guess that would require the users to run systemd. This can be problematic on the bots (they run on unprivileged containers, and systemd don't plays nice there) Also I'm not sure if it will solve the main problem, that is: * Due to WebKit multiprocess model, our tooling for the tests can only know the pid of WebKitTestRunner, but not the pid of any of the childs (WebKitWebProcess, WebKitNetworkProcess, etc). So, when you get a coredump for WebKitWebProcess (for example) with pid NNNN you can't tell which test caused the crash. The only reasonable way of solving this that I see is setting an environment variable (that the childs will inherit) with the test name. When it crashes, the program invoked via the pipe in core_pattern will have access to /proc/$pid/environ and can read this. With coredumpctl, such program will be systemd-coredump. So... the main question.... is coredumpctl able to get and print the environ variables that a program had defined when it crashed for a given coredump? If is not, I don't think it will be of any help here.
Michael Catanzaro
Comment 6 2015-12-23 06:08:20 PST
coredumpctl is not useful for bots -- anything that makes coredumps display in the web interface would be a great solution there -- but for humans running tests manually, we should try to figure out a solution that doesn't require editing core_pattern to disable it. (In reply to comment #5) > I guess that would require the users to run systemd. This can be problematic > on the bots (they run on unprivileged containers, and systemd don't plays > nice there) Huh, I didn't realize that. I thought all Linux except Gentoo was using systemd nowadays (certainly it's installed on the bots I used yesterday), and that it was designed to handle unprivileged containers nicely -- that's the whole point of machinectl, right? but I've hardly used containers myself. > So... the main question.... is coredumpctl able to get and print the > environ variables that a program had defined when it crashed for a given > coredump? If is not, I don't think it will be of any help here. Hm, I don't think so. You can match on an awful lot, anything listed in systemd.journal-fields(7), but not environment variables. It'd be easy to add though, since it has access to /proc/$pid/environ as you say, we'd just need to make it save the environment as a journal field. This is how the abrt coredumpctl integration was written -- abrt needed more info on the coredumps, so they modified coredumpctl to save it in journal fields. But that is a lot more work....
Carlos Alberto Lopez Perez
Comment 7 2015-12-23 06:42:51 PST
(In reply to comment #6) > coredumpctl is not useful for bots -- anything that makes coredumps display > in the web interface would be a great solution there -- but for humans > running tests manually, we should try to figure out a solution that doesn't > require editing core_pattern to disable it. > Humans running tests can already match a coredump with a test name. Is a matter of running only one test at a time and picking the last coredump. The main use case here is for the bots. The bots runs thousands of tests in parallel, and they need a way to match the coredump with the test name. Otherwise is not possible to show the backtrace/crash log on the results page that the bot published. For example: https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20%28Tests%29/r189464%20%2810899%29/results.html Notice the empty crash logs on that html page published by the bot? That's what we are trying to fix here. > (In reply to comment #5) > > I guess that would require the users to run systemd. This can be problematic > > on the bots (they run on unprivileged containers, and systemd don't plays > > nice there) > > Huh, I didn't realize that. I thought all Linux except Gentoo was using > systemd nowadays (certainly it's installed on the bots I used yesterday), > and that it was designed to handle unprivileged containers nicely -- that's > the whole point of machinectl, right? but I've hardly used containers myself. > Well, Ubuntu is still using upstart (the switch to systemd still didn't happened.. I guess that by 16.04 maybe). And with unprivileged containers I mean LXC namespace based containers, and systemd is not running correctly inside there. I think i read that in the latest releases of systemd this was somehow fixed, but I didn't tested it. So I remain cautious about the possibility of using something like coredumpctl inside a LXC namespace based container until I test it and see it working. > > So... the main question.... is coredumpctl able to get and print the > > environ variables that a program had defined when it crashed for a given > > coredump? If is not, I don't think it will be of any help here. > > Hm, I don't think so. You can match on an awful lot, anything listed in > systemd.journal-fields(7), but not environment variables. It'd be easy to > add though, since it has access to /proc/$pid/environ as you say, we'd just > need to make it save the environment as a journal field. This is how the > abrt coredumpctl integration was written -- abrt needed more info on the > coredumps, so they modified coredumpctl to save it in journal fields. But > that is a lot more work.... If you are interested on adding this to systemd please go ahead.. But take into account that because of the release cycles of the distributions, the time it will take to reach the systemd used by all our developers (debian/ubuntu/fedora/etc) will be measured in years.
Carlos Alberto Lopez Perez
Comment 8 2015-12-23 06:45:50 PST
(In reply to comment #7) > Well, Ubuntu is still using upstart (the switch to systemd still didn't > happened.. I guess that by 16.04 maybe). Correction: they switched to systemd on 15.04 according to this wiki https://wiki.ubuntu.com/SystemdForUpstartUsers#System_Init_Daemon
Philippe Normand
Comment 9 2016-07-07 10:51:11 PDT
(In reply to comment #5) > (In reply to comment #4) > > Nice investigation indeed. > > > > The right thing to do is to get cores from coredumpctl so we don't have to > > mess around with changing the core_pattern. I'd be interested in working on > > this eventually (for some large value of eventually). > > I guess that would require the users to run systemd. This can be problematic > on the bots (they run on unprivileged containers, and systemd don't plays > nice there) > > > Also I'm not sure if it will solve the main problem, that is: > > * Due to WebKit multiprocess model, our tooling for the tests can only know > the pid of WebKitTestRunner, but not the pid of any of the childs > (WebKitWebProcess, WebKitNetworkProcess, etc). This can be solved by implementing private C APIs for WKContext and WKPage, like on Mac. I started a patch... Also changing the crash log analisys as advised in the initial description of the bug.
Philippe Normand
Comment 10 2016-07-14 08:36:11 PDT
Philippe Normand
Comment 11 2016-07-14 08:37:43 PDT
Carlos Garcia Campos
Comment 12 2016-07-16 04:11:23 PDT
Comment on attachment 283647 [details] patch View in context: https://bugs.webkit.org/attachment.cgi?id=283647&action=review > Source/WebKit2/UIProcess/API/C/gtk/WKContextPrivateGtk.cpp:41 > +pid_t WKContextGetNetworkProcessIdentifier(WKContextRef contextRef) > +{ > + return toImpl(contextRef)->networkProcessIdentifier(); > +} > + > +pid_t WKContextGetDatabaseProcessIdentifier(WKContextRef contextRef) > +{ > + return toImpl(contextRef)->databaseProcessIdentifier(); Why is this specific to GTK+? Why don't we add this to WKContext.cpp and WKContextPrivate.h? > Source/WebKit2/UIProcess/API/C/gtk/WKPagePrivateGtk.cpp:37 > +pid_t WKPageGetProcessIdentifier(WKPageRef pageRef) > +{ > + return toImpl(pageRef)->processIdentifier(); > +} Ditto, this could be added to WKPage.cpp and WKPagePrivate.h, since this is exposing cross-platform API > Tools/Scripts/process-linux-coredump:26 > + full_path = os.path.join(destination_directory, 'core-pid_%d.dump' % int(pid)) Why do you need to convert the pid to int? Why not just use %s instead? Or the script could receive the path already built no? > Tools/Scripts/webkitpy/port/linux_get_crash_log_unittest.py:49 > core_pattern = os.path.join(core_directory, "core-pid_%p-_-process_%E") > + core_pattern = "|%s %%p /path/to/coredumps" % process_coredump_script_path Shouldn't you remove the previous core_pattern assignation? > Tools/WebKitTestRunner/TestController.cpp:1487 > pid_t pid = WKContextGetNetworkProcessIdentifier(m_context.get()); Ah, mac already has an impl of this. In that case it should be moved to the cross-platform files, I think, instead of duplicating it in GTK specific files.
Philippe Normand
Comment 13 2016-07-22 00:04:27 PDT
Comment on attachment 283647 [details] patch View in context: https://bugs.webkit.org/attachment.cgi?id=283647&action=review >> Tools/Scripts/process-linux-coredump:26 >> + full_path = os.path.join(destination_directory, 'core-pid_%d.dump' % int(pid)) > > Why do you need to convert the pid to int? Why not just use %s instead? Or the script could receive the path already built no? No strong reason for the conversion. The script cannot receive the full path already built because the script invocation is hardcoded in the core_pattern. >> Tools/Scripts/webkitpy/port/linux_get_crash_log_unittest.py:49 >> + core_pattern = "|%s %%p /path/to/coredumps" % process_coredump_script_path > > Shouldn't you remove the previous core_pattern assignation? oops >> Tools/WebKitTestRunner/TestController.cpp:1487 >> pid_t pid = WKContextGetNetworkProcessIdentifier(m_context.get()); > > Ah, mac already has an impl of this. In that case it should be moved to the cross-platform files, I think, instead of duplicating it in GTK specific files. Ok I can try this.
Philippe Normand
Comment 14 2016-07-22 00:36:06 PDT
Philippe Normand
Comment 15 2016-07-22 00:41:39 PDT
(In reply to comment #13) > Comment on attachment 283647 [details] > patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=283647&action=review > > >> Tools/Scripts/process-linux-coredump:26 > >> + full_path = os.path.join(destination_directory, 'core-pid_%d.dump' % int(pid)) > > > > Why do you need to convert the pid to int? Why not just use %s instead? Or the script could receive the path already built no? > > No strong reason for the conversion. The script cannot receive the full path > already built because the script invocation is hardcoded in the core_pattern. > Ah sorry this can be improved indeed. I'll update the patch.
Philippe Normand
Comment 16 2016-07-22 00:50:27 PDT
Carlos Garcia Campos
Comment 17 2016-07-22 01:14:54 PDT
Comment on attachment 284314 [details] patch Patch looks good to me, you just need to fix mac builds (update the header includes), so please submit a new patch and wait for EWS before landing
Philippe Normand
Comment 18 2016-07-22 03:14:16 PDT
Created attachment 284321 [details] patch for landing
Philippe Normand
Comment 19 2016-07-22 03:38:27 PDT
Created attachment 284322 [details] patch for landing
Philippe Normand
Comment 20 2016-07-24 23:09:02 PDT
Michael Catanzaro
Comment 21 2016-07-25 07:25:30 PDT
It broke a thing: [692/1454] webkitpy.port.linux_get_crash_log_unittest.GDBCrashLogGeneratorTest.test_generate_crash_log failed: Traceback (most recent call last): File "/home/slave/webkitgtk/gtk-linux-64-release-tests/build/Tools/Scripts/webkitpy/port/linux_get_crash_log_unittest.py", line 67, in test_generate_crash_log self.assertMultiLineEqual(log, mock_empty_crash_log) AssertionError: 'crash log for DumpRenderTree (pid 28529):\n\nCoredump core-pid_28529.dump not f [truncated]... != 'crash log for DumpRenderTree (pid 28529):\n\nCoredump core-pid_28529.dump not f [truncated]... Diff is 822 characters long. Set self.maxDiff to None to see it. Could you investigate please?
Philippe Normand
Comment 22 2016-07-25 09:32:47 PDT
Thanks, follow-up in r203685
Note You need to log in before you can comment on or make changes to this bug.