RESOLVED FIXED Bug 215829
[ iOS Release ] imported/w3c/web-platform-tests/* is flaky failing 200+ tests
https://bugs.webkit.org/show_bug.cgi?id=215829
Summary [ iOS Release ] imported/w3c/web-platform-tests/* is flaky failing 200+ tests
Hector Lopez
Reported 2020-08-25 16:45:05 PDT
imported/w3c/web-platform-tests/* A large group of these tests are flaky failing as a group on Apple iOS 13 Simulator Release WK2(Tests) and Apple iPadOS 13 Simulator Release WK2(Tests) bots. The group number is different every time but is consistently around 200 tests. No failures to APIs accompany these failures. Occurrence is happening on average 1 in 20 builds. Revisions may build successfully on one set of testers but fail on the other group. Almost in every instance the group failures will not happen again on immediate build with no indication that revision was used to clear failures. One commonality that all builds on each tester is this warning: (view as text) warning: (x86_64) /Volumes/Data/slave/ios-simulator-13-release/build/WebKitBuild/ImageDiff.build/Release/ImageDiff.build/Objects-normal/x86_64/ImageDiff.o unable to open object file: No such file or directory warning: (x86_64) /Volumes/Data/slave/ios-simulator-13-release/build/WebKitBuild/ImageDiff.build/Release/ImageDiff.build/Objects-normal/x86_64/PlatformImage.o unable to open object file: No such file or directory warning: (x86_64) /Volumes/Data/slave/ios-simulator-13-release/build/WebKitBuild/ImageDiff.build/Release/ImageDiff.build/Objects-normal/x86_64/PlatformImageCG.o unable to open object file: No such file or directory warning: no debug symbols in executable (-arch x86_64) Access to Apple iOS 13 Simulator Release WK2(Tests): https://build.webkit.org/builders/Apple%20iOS%2013%20Simulator%20Release%20WK2%20(Tests)?numbuilds=50 Access to Apple iPadOS 13 Simulator Release WK2(Tests) https://build.webkit.org/builders/Apple%20iPadOS%2013%20Simulator%20Release%20WK2%20(Tests)?numbuilds=50 Log for latest occurrence on iPadOS 13 Simulator Release WK2(Tests) please see attached. For build 4417 at r266107 with failures gone on next build at r266112 but no revision seems to be cause of fix and happens again on different revision.
Attachments
Stdio for iPadOS 13 Simulator Release WK2(Tests) build 4427 at r266107 (7.00 MB, text/html)
2020-08-25 16:57 PDT, Hector Lopez
no flags
Warnings for iPadOS 13 Simulator Release WK2(Tests) build 4427 at r266107 (1.07 KB, text/html)
2020-08-25 16:59 PDT, Hector Lopez
no flags
Patch (3.23 KB, patch)
2020-09-09 15:05 PDT, Sam Sneddon [:gsnedders]
no flags
Patch for landing (5.41 KB, patch)
2020-09-10 00:39 PDT, youenn fablet
no flags
Patch (1.73 KB, patch)
2020-09-16 11:00 PDT, Sam Sneddon [:gsnedders]
no flags
Patch (2.08 KB, patch)
2020-09-18 07:35 PDT, Sam Sneddon [:gsnedders]
youennf: review+
youennf: commit-queue-
Patch (2.60 KB, patch)
2020-09-21 10:56 PDT, Sam Sneddon [:gsnedders]
no flags
Radar WebKit Bug Importer
Comment 1 2020-08-25 16:45:43 PDT
Hector Lopez
Comment 2 2020-08-25 16:57:04 PDT
Created attachment 407250 [details] Stdio for iPadOS 13 Simulator Release WK2(Tests) build 4427 at r266107
Hector Lopez
Comment 3 2020-08-25 16:59:46 PDT
Created attachment 407251 [details] Warnings for iPadOS 13 Simulator Release WK2(Tests) build 4427 at r266107
Ryan Haddad
Comment 5 2020-08-26 17:25:15 PDT
Ryan Haddad
Comment 6 2020-09-09 13:09:15 PDT
Sam Sneddon [:gsnedders]
Comment 7 2020-09-09 14:34:00 PDT
(In reply to Ryan Haddad from comment #4) > Here is a link to the wptwk process log from the most recent failure: > https://build.webkit.org/results/ > Apple%20iPadOS%2013%20Simulator%20Release%20WK2%20(Tests)/r266190%20(4445)/ > wptwk_process_log.out.txt > > https://build.webkit.org/builders/ > Apple%20iPadOS%2013%20Simulator%20Release%20WK2%20%28Tests%29/builds/4445 This doesn't seem to give any obvious clue, it just starts returning 404 for files that should definitely exist, and then relatively quickly goes back to operating normally. Let's try adding some extra debugging info there. :shrug:
Sam Sneddon [:gsnedders]
Comment 8 2020-09-09 15:05:26 PDT
EWS Watchlist
Comment 9 2020-09-09 15:06:36 PDT
This patch modifies the imported WPT tests. Please ensure that any changes on the tests (not coming from a WPT import) are exported to WPT. Please see https://trac.webkit.org/wiki/WPTExportProcess
Sam Sneddon [:gsnedders]
Comment 10 2020-09-09 16:00:02 PDT
Comment on attachment 408370 [details] Patch oh no this broke things
Jonathan Bedard
Comment 11 2020-09-09 16:15:02 PDT
(In reply to Sam Sneddon [:gsnedders] from comment #10) > Comment on attachment 408370 [details] > Patch > > oh no this broke things That's pretty surprising, but the test it apparently broke does seem pretty green on trunk infrastructure.
youenn fablet
Comment 12 2020-09-10 00:32:14 PDT
Breakage seems fine, you might just have to rebase the corresponding expected.txt file for now.
youenn fablet
Comment 13 2020-09-10 00:39:08 PDT
Created attachment 408418 [details] Patch for landing
EWS
Comment 14 2020-09-10 01:09:23 PDT
Committed r266816: <https://trac.webkit.org/changeset/266816> All reviewed patches have been landed. Closing bug and clearing flags on attachment 408418 [details].
Sam Sneddon [:gsnedders]
Comment 15 2020-09-10 06:42:48 PDT
This hopefully gives us enough to debug this, but doesn't fix it.
youenn fablet
Comment 16 2020-09-11 13:19:54 PDT
Error message is: {\"error\": {\"message\": \"Too many open files\", \"code\": 404}}"
Jonathan Bedard
Comment 17 2020-09-11 13:29:19 PDT
(In reply to youenn fablet from comment #16) > Error message is: > {\"error\": {\"message\": \"Too many open files\", \"code\": 404}}" Maybe we haven't bumped thee file limit, or maybe we need to run less simulators.
Alexey Proskuryakov
Comment 18 2020-09-11 13:49:27 PDT
Ideally, we'd collect ps and lsof output in state.
Aakash Jain
Comment 19 2020-09-15 11:09:20 PDT
(In reply to youenn fablet from comment #16) > Error message is: > {\"error\": {\"message\": \"Too many open files\", \"code\": 404}}" Where exactly does this error message appear? Any link?
Sam Sneddon [:gsnedders]
Comment 20 2020-09-15 11:24:28 PDT
(In reply to Aakash Jain from comment #19) > (In reply to youenn fablet from comment #16) > > Error message is: > > {\"error\": {\"message\": \"Too many open files\", \"code\": 404}}" > Where exactly does this error message appear? Any link? Look at many of the actual results in https://build.webkit.org/results/Apple%20iOS%2013%20Simulator%20Release%20WK2%20(Tests)/r267004%20(6734)/results.html for example
Aakash Jain
Comment 21 2020-09-15 13:49:51 PDT
(In reply to Jonathan Bedard from comment #17) > Maybe we haven't bumped thee file limit, or maybe we need to run less simulators. Yeah, we haven't bumped the limits on these bots. Process limit seems to be good enough by default (11136 limit on ews122 - ews126). However, file limit is default: ews122:LaunchAgents buildbot$ launchctl limit maxfiles maxfiles 256 unlimited Maybe we should simply increase the file limit to 100,000 by using the method in https://trac.webkit.org/wiki/IncreasingKernelLimits#no1
Sam Sneddon [:gsnedders]
Comment 22 2020-09-16 11:00:16 PDT
youenn fablet
Comment 23 2020-09-16 11:09:18 PDT
Typically, in parallel to uploading a patch doing changes in LayoutTests/imported/w3c/web-platform-tests, we generate a PR (for instance using export-w3c-tests). Then we get a review, merge the PR in WPT and commit the change here. Do you plan to do so? I guess we could also first land this change and verify it is a full fix.
Sam Sneddon [:gsnedders]
Comment 24 2020-09-17 05:27:00 PDT
Was planning on verifying this works first before submitting upstream.
Aakash Jain
Comment 25 2020-09-17 12:47:23 PDT
Comment on attachment 408936 [details] Patch Breaks layout-tests e.g.: https://ews-build.webkit.org/#/builders/30/builds/18734 From /Volumes/Data/worker/macOS-Mojave-Release-WK1-Tests-EWS/build/layout-test-results/wptwk_process_log.out.txt on ews101: WARNING:web-platform-tests:Status of subprocess "http on port 8801": failed. Exit with non-zero status: 1 INFO:web-platform-tests:Status of subprocess "https on port 9443": running self.run() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run self._target(*self._args, **self._kwargs) File "/Volumes/Data/worker/macOS-Mojave-Release-WK1-Tests-EWS/build/LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py", line 422, in create_daemon resource.setrlimit(resource.RLIMIT_NOFILE, (nofile_hard, nofile_hard)) ValueError: current limit exceeds maximum limit Process https on port 9443: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run self._target(*self._args, **self._kwargs) File "/Volumes/Data/worker/macOS-Mojave-Release-WK1-Tests-EWS/build/LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py", line 422, in create_daemon resource.setrlimit(resource.RLIMIT_NOFILE, (nofile_hard, nofile_hard)) ValueError: current limit exceeds maximum limit
Sam Sneddon [:gsnedders]
Comment 26 2020-09-18 07:35:04 PDT
Created attachment 409127 [details] Patch This seems to work in a VM without erroring everywhere, at least
youenn fablet
Comment 27 2020-09-18 07:46:27 PDT
Comment on attachment 409127 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=409127&action=review > LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:429 > + resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard)) Should we do it here or in Tools/Scripts/webkitpy/layout_tests/servers/web_platform_test_server.py?
Aakash Jain
Comment 28 2020-09-18 08:05:23 PDT
Comment on attachment 409127 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=409127&action=review > LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:427 > + new_soft = min(soft + 1024, maxfilesperproc, hard) Is increasing soft limit by 1024 enough? It seems pretty less. on ews101 these are the numbers: ews101:~ buildbot$ sysctl -n kern.maxfilesperproc 98304 >>> resource.getrlimit(resource.RLIMIT_NOFILE) (256, 9223372036854775807) so increasing soft limit from 256 to 1280 would be enough? Previous patch was setting this limit to 9223372036854775807, this one is setting to 1280.
Sam Sneddon [:gsnedders]
Comment 29 2020-09-18 09:08:25 PDT
(In reply to youenn fablet from comment #27) > Comment on attachment 409127 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=409127&action=review > > > LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:429 > > + resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard)) > > Should we do it here or in > Tools/Scripts/webkitpy/layout_tests/servers/web_platform_test_server.py? People have occasionally run into this upstream too, so it makes sense to fix it in the upstream code. (In reply to Aakash Jain from comment #28) > so increasing soft limit from 256 to 1280 would be enough? Previous patch > was setting this limit to 9223372036854775807, this one is setting to 1280. I think so. A quick look on Linux box I had to hand showed an initial soft limit of 1024.
Darin Adler
Comment 30 2020-09-19 14:59:28 PDT
Comment on attachment 409127 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=409127&action=review >> LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:427 >> + new_soft = min(soft + 1024, maxfilesperproc, hard) > > Is increasing soft limit by 1024 enough? It seems pretty less. > > on ews101 these are the numbers: > > ews101:~ buildbot$ sysctl -n kern.maxfilesperproc > 98304 After a tiny bit of investigation I am wondering the same thing. On my desktop machine I typed the command and got this: % sysctl -n kern.maxfilesperproc 49152 I don’t think the intent here is to bump from 49152 to 50176; unlikely to be valuable. So there must be something else going on, on the servers where this is helpful. Sam, can you clarify?
youenn fablet
Comment 31 2020-09-21 06:34:13 PDT
Comment on attachment 409127 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=409127&action=review > LayoutTests/imported/w3c/ChangeLog:9 > + (ServerProc.create_daemon): As per our direct discussion, please add more comments explaining the default value on various platforms, and the value you are selecting here. > LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:421 > + if sys.platform == "darwin": Maybe we should verify that for any given system, we have a large enough value.
youenn fablet
Comment 32 2020-09-21 06:35:58 PDT
Really great to see that fixed. I am surprised Chrome or Firefox did not have that issue. Maybe they fixed in a different way though.
youenn fablet
Comment 33 2020-09-21 06:41:30 PDT
Also, even 256+1024 might be a bit small as we more and more run WTR in parallel on the same server. Can we increase it a bit further than that?
Sam Sneddon [:gsnedders]
Comment 34 2020-09-21 06:44:03 PDT
(In reply to Darin Adler from comment #30) > Comment on attachment 409127 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=409127&action=review > > >> LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:427 > >> + new_soft = min(soft + 1024, maxfilesperproc, hard) > > > > Is increasing soft limit by 1024 enough? It seems pretty less. > > > > on ews101 these are the numbers: > > > > ews101:~ buildbot$ sysctl -n kern.maxfilesperproc > > 98304 > > After a tiny bit of investigation I am wondering the same thing. On my > desktop machine I typed the command and got this: > > % sysctl -n kern.maxfilesperproc > 49152 > > I don’t think the intent here is to bump from 49152 to 50176; unlikely to be > valuable. So there must be something else going on, on the servers where > this is helpful. > > Sam, can you clarify? There's three limits in play here: * the processes' current limit (this starts at 256 on macOS and 1024 on the first Linux shell I had to hand), * the kernel limit for how many one process can have (this appears to be 49152 locally, this is kern.maxfilesperproc), and * the kernel limit for how many the system can have (which appears to be 98304 locally, this is kern.maxfiles). What we're doing is bumping the first of those, not the other two. 256 to 1280 is a pretty big jump; we probably don't want to jump all the way to kern.maxfilesperproc (as that then means the one process can get pretty close to kern.maxfiles, and it shouldn't be anywhere near it). Youenn pointed out directly that as we increase parallelism we might hit issues, though Servo was running pretty wide and didn't have any issues, so we'll see. (In reply to youenn fablet from comment #31) > Comment on attachment 409127 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=409127&action=review > > > LayoutTests/imported/w3c/ChangeLog:9 > > + (ServerProc.create_daemon): > > As per our direct discussion, please add more comments explaining the > default value on various platforms, and the value you are selecting here. > > > LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:421 > > + if sys.platform == "darwin": > > Maybe we should verify that for any given system, we have a large enough > value. Getting the system limits is a per-platform endeavour, however, and when it hasn't been observed elsewhere I don't think it's worthwhile trying that immediately? (In reply to youenn fablet from comment #32) > Really great to see that fixed. I am surprised Chrome or Firefox did not > have that issue. Maybe they fixed in a different way though. I don't think they run wptserve with more than one client hitting it at a time on macOS, rather just sharding across multiple machines each with their own wptserve instance. (In reply to youenn fablet from comment #33) > Also, even 256+1024 might be a bit small as we more and more run WTR in > parallel on the same server. Can we increase it a bit further than that? Certainly. I was mostly trying to avoid making a gigantic jump in one go and seeing whether this suffices.
Sam Sneddon [:gsnedders]
Comment 35 2020-09-21 10:56:09 PDT
youenn fablet
Comment 36 2020-09-21 12:05:59 PDT
Comment on attachment 409283 [details] Patch We cq it before landing the patch in WPT upstream, to validate this will be working as expected. @Sam, can you make the upstream PR right way so that we can land it in a couple of days? If these changes are your last commit, you should be able to do something like: Tools/Scripts/export-w3c-test-changes -g HEAD~1..HEAD -c
EWS
Comment 37 2020-09-21 12:06:36 PDT
ChangeLog entry in LayoutTests/imported/w3c/ChangeLog contains OOPS!.
youenn fablet
Comment 38 2020-09-21 12:07:55 PDT
Comment on attachment 409283 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=409283&action=review > LayoutTests/imported/w3c/ChangeLog:6 > + Reviewed by NOBODY (OOPS!). Ah right, you need to change it to something like: Reviewed by Youenn Fablet.
Aakash Jain
Comment 39 2020-09-21 12:11:43 PDT
Comment on attachment 409283 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=409283&action=review > LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:428 > + # 2048 is somewhat arbitrary, but gives us some headroom for wptrunner --parallel Why such a small number (2048)? Why not increase it to something like 100000? Note that in some parts of our infrastructure we do set this file limit to 100000 on various bots (using method in https://trac.webkit.org/wiki/IncreasingKernelLimits#no1) and it has been working fine for years. Also, were you able to reproduce the issue, and verify that by increasing the limit, the problem went away?
EWS
Comment 40 2020-09-21 12:31:40 PDT
Committed r267354: <https://trac.webkit.org/changeset/267354> All reviewed patches have been landed. Closing bug and clearing flags on attachment 409283 [details].
Sam Sneddon [:gsnedders]
Comment 41 2020-09-24 07:59:50 PDT
Note You need to log in before you can comment on or make changes to this bug.