Bug 215829

Summary: [ iOS Release ] imported/w3c/web-platform-tests/* is flaky failing 200+ tests
Product: WebKit Reporter: Hector Lopez <hector_i_lopez>
Component: New Bugs    Assignee: Sam Sneddon [:gsnedders] <gsnedders>
Status: RESOLVED FIXED    
Severity: Normal CC: aakash_jain, ap, clopez, darin, ews-watchlist, gsnedders, jbedard, webkit-bot-watchers-bugzilla, webkit-bug-importer, youennf
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=216536
https://bugs.webkit.org/show_bug.cgi?id=216823
Attachments (Description | Flags):
Stdio for iPadOS 13 Simulator Release WK2(Tests) build 4427 at r266107 | none
Warnings for iPadOS 13 Simulator Release WK2(Tests) build 4427 at r266107 | none
Patch | none
Patch for landing | none
Patch | none
Patch | youennf: review+, youennf: commit-queue-
Patch | none

Description Hector Lopez 2020-08-25 16:45:05 PDT
imported/w3c/web-platform-tests/*

A large group of these tests is flakily failing as a group on the Apple iOS 13 Simulator Release WK2(Tests) and Apple iPadOS 13 Simulator Release WK2(Tests) bots. The number of tests in the group differs each time but is consistently around 200. No API failures accompany these failures. On average, the failures occur in 1 in 20 builds. A revision may build successfully on one set of testers but fail on the other group. In almost every instance the group failures do not recur on the immediately following build, with no indication that any particular revision cleared them.

One commonality across all builds on each tester is this warning:
warning: (x86_64) /Volumes/Data/slave/ios-simulator-13-release/build/WebKitBuild/ImageDiff.build/Release/ImageDiff.build/Objects-normal/x86_64/ImageDiff.o unable to open object file: No such file or directory
warning: (x86_64) /Volumes/Data/slave/ios-simulator-13-release/build/WebKitBuild/ImageDiff.build/Release/ImageDiff.build/Objects-normal/x86_64/PlatformImage.o unable to open object file: No such file or directory
warning: (x86_64) /Volumes/Data/slave/ios-simulator-13-release/build/WebKitBuild/ImageDiff.build/Release/ImageDiff.build/Objects-normal/x86_64/PlatformImageCG.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)

Access to Apple iOS 13 Simulator Release WK2(Tests):
https://build.webkit.org/builders/Apple%20iOS%2013%20Simulator%20Release%20WK2%20(Tests)?numbuilds=50

Access to Apple iPadOS 13 Simulator Release WK2(Tests):
https://build.webkit.org/builders/Apple%20iPadOS%2013%20Simulator%20Release%20WK2%20(Tests)?numbuilds=50

For the log of the latest occurrence on iPadOS 13 Simulator Release WK2(Tests), please see the attachment for build 4427 at r266107. The failures were gone on the next build at r266112, but no revision appears to be the cause of the fix, and the failures recur on different revisions.
Comment 1 Radar WebKit Bug Importer 2020-08-25 16:45:43 PDT
<rdar://problem/67765880>
Comment 2 Hector Lopez 2020-08-25 16:57:04 PDT
Created attachment 407250 [details]
Stdio for iPadOS 13 Simulator Release WK2(Tests) build 4427 at r266107
Comment 3 Hector Lopez 2020-08-25 16:59:46 PDT
Created attachment 407251 [details]
Warnings for iPadOS 13 Simulator Release WK2(Tests) build 4427 at r266107
Comment 5 Ryan Haddad 2020-08-26 17:25:15 PDT
Here is one for a test run that had no failures, for comparison:
https://build.webkit.org/results/Apple%20iOS%2013%20Simulator%20Release%20WK2%20(Tests)/r266190%20(6397)/wptwk_process_log.out.txt
Comment 6 Ryan Haddad 2020-09-09 13:09:15 PDT
Looks like this has popped up on EWS, too: https://ews-build.s3-us-west-2.amazonaws.com/iOS-13-Simulator-WK2-Tests-EWS/r408353-25299/results.html
Comment 7 Sam Sneddon [:gsnedders] 2020-09-09 14:34:00 PDT
(In reply to Ryan Haddad from comment #4)
> Here is a link to the wptwk process log from the most recent failure:
> https://build.webkit.org/results/
> Apple%20iPadOS%2013%20Simulator%20Release%20WK2%20(Tests)/r266190%20(4445)/
> wptwk_process_log.out.txt
> 
> https://build.webkit.org/builders/
> Apple%20iPadOS%2013%20Simulator%20Release%20WK2%20%28Tests%29/builds/4445

This doesn't seem to give any obvious clue: it just starts returning 404 for files that should definitely exist, and then relatively quickly goes back to operating normally.

Let's try adding some extra debugging info there. :shrug:
Comment 8 Sam Sneddon [:gsnedders] 2020-09-09 15:05:26 PDT
Created attachment 408370 [details]
Patch
Comment 9 EWS Watchlist 2020-09-09 15:06:36 PDT
This patch modifies the imported WPT tests. Please ensure that any changes on the tests (not coming from a WPT import) are exported to WPT. Please see https://trac.webkit.org/wiki/WPTExportProcess
Comment 10 Sam Sneddon [:gsnedders] 2020-09-09 16:00:02 PDT
Comment on attachment 408370 [details]
Patch

oh no this broke things
Comment 11 Jonathan Bedard 2020-09-09 16:15:02 PDT
(In reply to Sam Sneddon [:gsnedders] from comment #10)
> Comment on attachment 408370 [details]
> Patch
> 
> oh no this broke things

That's pretty surprising, but the test it apparently broke does seem pretty green on trunk infrastructure.
Comment 12 youenn fablet 2020-09-10 00:32:14 PDT
The breakage seems fine; you might just have to rebase the corresponding -expected.txt file for now.
Comment 13 youenn fablet 2020-09-10 00:39:08 PDT
Created attachment 408418 [details]
Patch for landing
Comment 14 EWS 2020-09-10 01:09:23 PDT
Committed r266816: <https://trac.webkit.org/changeset/266816>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 408418 [details].
Comment 15 Sam Sneddon [:gsnedders] 2020-09-10 06:42:48 PDT
This hopefully gives us enough to debug this, but doesn't fix it.
Comment 16 youenn fablet 2020-09-11 13:19:54 PDT
Error message is:
{\"error\": {\"message\": \"Too many open files\", \"code\": 404}}"
Comment 17 Jonathan Bedard 2020-09-11 13:29:19 PDT
(In reply to youenn fablet from comment #16)
> Error message is:
> {\"error\": {\"message\": \"Too many open files\", \"code\": 404}}"

Maybe we haven't bumped the file limit, or maybe we need to run fewer simulators.
Comment 18 Alexey Proskuryakov 2020-09-11 13:49:27 PDT
Ideally, we'd collect ps and lsof output in that state.
Comment 19 Aakash Jain 2020-09-15 11:09:20 PDT
(In reply to youenn fablet from comment #16)
> Error message is:
> {\"error\": {\"message\": \"Too many open files\", \"code\": 404}}"
Where exactly does this error message appear? Any link?
Comment 20 Sam Sneddon [:gsnedders] 2020-09-15 11:24:28 PDT
(In reply to Aakash Jain from comment #19)
> (In reply to youenn fablet from comment #16)
> > Error message is:
> > {\"error\": {\"message\": \"Too many open files\", \"code\": 404}}"
> Where exactly does this error message appear? Any link?

Look at many of the actual results in https://build.webkit.org/results/Apple%20iOS%2013%20Simulator%20Release%20WK2%20(Tests)/r267004%20(6734)/results.html for example
Comment 21 Aakash Jain 2020-09-15 13:49:51 PDT
(In reply to Jonathan Bedard from comment #17)
> Maybe we haven't bumped thee file limit, or maybe we need to run less simulators.
Yeah, we haven't bumped the limits on these bots. Process limit seems to be good enough by default (11136 limit on ews122 - ews126).

However, the file limit is at its default:

ews122:LaunchAgents buildbot$ launchctl limit maxfiles
	maxfiles    256            unlimited      

Maybe we should simply increase the file limit to 100,000 by using the method in https://trac.webkit.org/wiki/IncreasingKernelLimits#no1
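For reference, the limits discussed above can be inspected from Python's stdlib `resource` module. This is a cross-platform sketch (not from the patch); `kern.maxfilesperproc` is macOS-specific, so that query is guarded:

```python
import resource
import subprocess
import sys

def describe_nofile_limits():
    """Return (soft, hard) RLIMIT_NOFILE plus the macOS per-process kernel cap, if any."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    maxfilesperproc = None
    if sys.platform == "darwin":
        # kern.maxfilesperproc is the kernel's cap on open files per process (macOS only).
        out = subprocess.check_output(["sysctl", "-n", "kern.maxfilesperproc"])
        maxfilesperproc = int(out)
    return soft, hard, maxfilesperproc

soft, hard, cap = describe_nofile_limits()
print("soft=%r hard=%r maxfilesperproc=%r" % (soft, hard, cap))
```

On a stock macOS shell this would report a soft limit of 256, matching the `launchctl limit maxfiles` output above.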
Comment 22 Sam Sneddon [:gsnedders] 2020-09-16 11:00:16 PDT
Created attachment 408936 [details]
Patch
Comment 23 youenn fablet 2020-09-16 11:09:18 PDT
Typically, in parallel with uploading a patch that changes LayoutTests/imported/w3c/web-platform-tests, we generate a PR (for instance using export-w3c-tests).
Then we get a review, merge the PR in WPT, and commit the change here.

Do you plan to do so?
I guess we could also first land this change and verify it is a full fix.
Comment 24 Sam Sneddon [:gsnedders] 2020-09-17 05:27:00 PDT
Was planning on verifying this works first before submitting upstream.
Comment 25 Aakash Jain 2020-09-17 12:47:23 PDT
Comment on attachment 408936 [details]
Patch

Breaks layout-tests

e.g.: https://ews-build.webkit.org/#/builders/30/builds/18734

From /Volumes/Data/worker/macOS-Mojave-Release-WK1-Tests-EWS/build/layout-test-results/wptwk_process_log.out.txt on ews101:

WARNING:web-platform-tests:Status of subprocess "http on port 8801": failed. Exit with non-zero status: 1
INFO:web-platform-tests:Status of subprocess "https on port 9443": running
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run 
    self._target(*self._args, **self._kwargs)
  File "/Volumes/Data/worker/macOS-Mojave-Release-WK1-Tests-EWS/build/LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py", line 422, in create_daemon
    resource.setrlimit(resource.RLIMIT_NOFILE, (nofile_hard, nofile_hard))
ValueError: current limit exceeds maximum limit
Process https on port 9443:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run 
    self._target(*self._args, **self._kwargs)
  File "/Volumes/Data/worker/macOS-Mojave-Release-WK1-Tests-EWS/build/LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py", line 422, in create_daemon
    resource.setrlimit(resource.RLIMIT_NOFILE, (nofile_hard, nofile_hard))
ValueError: current limit exceeds maximum limit
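The failure above comes from passing the reported hard limit straight back to `setrlimit`: on macOS the hard limit reports as unlimited while the kernel actually caps the soft limit at `kern.maxfilesperproc`, so requesting `(hard, hard)` raises `ValueError`. A minimal sketch of a clamped raise that avoids this (the function name and headroom are illustrative, not the patch's actual code):

```python
import resource

def raise_nofile_soft_limit(headroom=1024):
    """Raise RLIMIT_NOFILE's soft limit by `headroom`, clamped to the hard limit.

    Passing the hard limit verbatim can raise ValueError on macOS, where the
    reported hard limit (unlimited) exceeds the kernel's real per-process cap.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = soft + headroom
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)  # never ask for more than the hard limit
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except ValueError:
        # A kernel cap below new_soft can still reject the request; keep the old limit.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```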
Comment 26 Sam Sneddon [:gsnedders] 2020-09-18 07:35:04 PDT
Created attachment 409127 [details]
Patch

This seems to work in a VM without erroring everywhere, at least
Comment 27 youenn fablet 2020-09-18 07:46:27 PDT
Comment on attachment 409127 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=409127&action=review

> LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:429
> +                resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

Should we do it here or in Tools/Scripts/webkitpy/layout_tests/servers/web_platform_test_server.py?
Comment 28 Aakash Jain 2020-09-18 08:05:23 PDT
Comment on attachment 409127 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=409127&action=review

> LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:427
> +            new_soft = min(soft + 1024, maxfilesperproc, hard)

Is increasing the soft limit by 1024 enough? It seems pretty low.

on ews101 these are the numbers:

ews101:~ buildbot$ sysctl -n kern.maxfilesperproc
98304
>>> resource.getrlimit(resource.RLIMIT_NOFILE)
(256, 9223372036854775807)

so increasing the soft limit from 256 to 1280 would be enough? The previous patch set this limit to 9223372036854775807; this one sets it to 1280.
Comment 29 Sam Sneddon [:gsnedders] 2020-09-18 09:08:25 PDT
(In reply to youenn fablet from comment #27)
> Comment on attachment 409127 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=409127&action=review
> 
> > LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:429
> > +                resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
> 
> Should we do it here or in
> Tools/Scripts/webkitpy/layout_tests/servers/web_platform_test_server.py?

People have occasionally run into this upstream too, so it makes sense to fix it in the upstream code.

(In reply to Aakash Jain from comment #28)
> so increasing soft limit from 256 to 1280 would be enough? Previous patch
> was setting this limit to 9223372036854775807, this one is setting to 1280.

I think so. A quick look on a Linux box I had to hand showed an initial soft limit of 1024.
Comment 30 Darin Adler 2020-09-19 14:59:28 PDT
Comment on attachment 409127 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=409127&action=review

>> LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:427
>> +            new_soft = min(soft + 1024, maxfilesperproc, hard)
> 
> Is increasing soft limit by 1024 enough? It seems pretty less.
> 
> on ews101 these are the numbers:
> 
> ews101:~ buildbot$ sysctl -n kern.maxfilesperproc
> 98304

After a tiny bit of investigation I am wondering the same thing. On my desktop machine I typed the command and got this:

% sysctl -n kern.maxfilesperproc
49152

I don’t think the intent here is to bump from 49152 to 50176; unlikely to be valuable. So there must be something else going on, on the servers where this is helpful.

Sam, can you clarify?
Comment 31 youenn fablet 2020-09-21 06:34:13 PDT
Comment on attachment 409127 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=409127&action=review

> LayoutTests/imported/w3c/ChangeLog:9
> +        (ServerProc.create_daemon):

As per our direct discussion, please add more comments explaining the default value on various platforms, and the value you are selecting here.

> LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:421
> +        if sys.platform == "darwin":

Maybe we should verify that for any given system, we have a large enough value.
Comment 32 youenn fablet 2020-09-21 06:35:58 PDT
Really great to see that fixed. I am surprised Chrome or Firefox did not have that issue. Maybe they fixed it in a different way, though.
Comment 33 youenn fablet 2020-09-21 06:41:30 PDT
Also, even 256+1024 might be a bit small as we increasingly run WTR in parallel on the same server. Can we increase it a bit further than that?
Comment 34 Sam Sneddon [:gsnedders] 2020-09-21 06:44:03 PDT
(In reply to Darin Adler from comment #30)
> Comment on attachment 409127 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=409127&action=review
> 
> >> LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:427
> >> +            new_soft = min(soft + 1024, maxfilesperproc, hard)
> > 
> > Is increasing soft limit by 1024 enough? It seems pretty less.
> > 
> > on ews101 these are the numbers:
> > 
> > ews101:~ buildbot$ sysctl -n kern.maxfilesperproc
> > 98304
> 
> After a tiny bit of investigation I am wondering the same thing. On my
> desktop machine I typed the command and got this:
> 
> % sysctl -n kern.maxfilesperproc
> 49152
> 
> I don’t think the intent here is to bump from 49152 to 50176; unlikely to be
> valuable. So there must be something else going on, on the servers where
> this is helpful.
> 
> Sam, can you clarify?

There are three limits in play here:

 * the process's current limit (this starts at 256 on macOS and 1024 on the first Linux shell I had to hand),
 * the kernel limit on how many open files one process can have (this appears to be 49152 locally; this is kern.maxfilesperproc), and
 * the kernel limit on how many open files the system can have (which appears to be 98304 locally; this is kern.maxfiles).

What we're doing is bumping the first of those, not the other two. 256 to 1280 is a pretty big jump; we probably don't want to jump all the way to kern.maxfilesperproc, as that would let one process get pretty close to kern.maxfiles, and it shouldn't be anywhere near it.

Youenn pointed out directly that as we increase parallelism we might hit issues, though Servo was running pretty wide and didn't have any issues, so we'll see.

(In reply to youenn fablet from comment #31)
> Comment on attachment 409127 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=409127&action=review
> 
> > LayoutTests/imported/w3c/ChangeLog:9
> > +        (ServerProc.create_daemon):
> 
> As per our direct discussion, please add more comments explaining the
> default value on various platforms, and the value you are selecting here.
> 
> > LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:421
> > +        if sys.platform == "darwin":
> 
> Maybe we should verify that for any given system, we have a large enough
> value.

Getting the system limits is a per-platform endeavour, however, and when it hasn't been observed elsewhere I don't think it's worthwhile trying that immediately?

(In reply to youenn fablet from comment #32)
> Really great to see that fixed. I am surprised Chrome or Firefox did not
> have that issue. Maybe they fixed in a different way though.

I don't think they run wptserve with more than one client hitting it at a time on macOS; rather, they shard across multiple machines, each with its own wptserve instance.

(In reply to youenn fablet from comment #33)
> Also, even 256+1024 might be a bit small as we more and more run WTR in
> parallel on the same server. Can we increase it a bit further than that?

Certainly. I was mostly trying to avoid making a gigantic jump in one go and seeing whether this suffices.
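Putting the three limits together, the clamping logic under discussion can be sketched as a pure function (reconstructed from the review comments; the function name is illustrative, and the landed patch settled on a 2048 headroom):

```python
import resource

def compute_new_soft_limit(soft, hard, maxfilesperproc=None, headroom=2048):
    """Pick a raised soft limit, clamped by the hard limit and the macOS kernel cap."""
    candidates = [soft + headroom]
    if maxfilesperproc is not None:
        # kern.maxfilesperproc: the kernel's real per-process cap on macOS.
        candidates.append(maxfilesperproc)
    if hard != resource.RLIM_INFINITY:
        candidates.append(hard)
    return min(candidates)
```

For example, with the EWS numbers quoted above (soft 256, hard unlimited, kern.maxfilesperproc 98304), this yields 256 + 2048 = 2304 rather than jumping anywhere near the kernel caps.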
Comment 35 Sam Sneddon [:gsnedders] 2020-09-21 10:56:09 PDT
Created attachment 409283 [details]
Patch
Comment 36 youenn fablet 2020-09-21 12:05:59 PDT
Comment on attachment 409283 [details]
Patch

We can cq it before landing the patch in WPT upstream, to validate that this works as expected.
@Sam, can you make the upstream PR right away so that we can land it in a couple of days?
If these changes are your last commit, you should be able to do something like:
Tools/Scripts/export-w3c-test-changes -g HEAD~1..HEAD -c
Comment 37 EWS 2020-09-21 12:06:36 PDT
ChangeLog entry in LayoutTests/imported/w3c/ChangeLog contains OOPS!.
Comment 38 youenn fablet 2020-09-21 12:07:55 PDT
Comment on attachment 409283 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=409283&action=review

> LayoutTests/imported/w3c/ChangeLog:6
> +        Reviewed by NOBODY (OOPS!).

Ah right, you need to change it to something like:
Reviewed by Youenn Fablet.
Comment 39 Aakash Jain 2020-09-21 12:11:43 PDT
Comment on attachment 409283 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=409283&action=review

> LayoutTests/imported/w3c/web-platform-tests/tools/serve/serve.py:428
> +            # 2048 is somewhat arbitrary, but gives us some headroom for wptrunner --parallel

Why such a small number (2048)? Why not increase it to something like 100000?
Note that in some parts of our infrastructure we do set this file limit to 100000 on various bots (using method in https://trac.webkit.org/wiki/IncreasingKernelLimits#no1) and it has been working fine for years.

Also, were you able to reproduce the issue, and verify that by increasing the limit, the problem went away?
Comment 40 EWS 2020-09-21 12:31:40 PDT
Committed r267354: <https://trac.webkit.org/changeset/267354>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 409283 [details].
Comment 41 Sam Sneddon [:gsnedders] 2020-09-24 07:59:50 PDT
Upstreamed in https://github.com/web-platform-tests/wpt/pull/25773