Bug 218994

Summary: [ews] Add timeout to network requests
Product: WebKit Reporter: Aakash Jain <aakash_jain>
Component: Tools / TestsAssignee: Aakash Jain <aakash_jain>
Severity: Normal CC: aakash_jain, ap, dewei_zhu, jbedard, lingho, ryanhaddad, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=214355
Description Flags
Patch none

Description Aakash Jain 2020-11-16 11:35:22 PST
Sometimes network requests can get stuck or take very long time. It can introduce delays on EWS. Few days back (on nov 13), we noticed slowness on EWS when bugs.webkit.org requests were taking extremely long or getting stuck. 

We should add an explicit timeout to network requests, so that the network request is timely terminated.

Also, as per https://requests.readthedocs.io/en/master/user/quickstart/#timeouts:
"Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely"
Comment 1 Aakash Jain 2020-11-16 11:38:08 PST
Created attachment 414261 [details]
Comment 2 Aakash Jain 2020-11-16 11:44:15 PST
Note that the network requests failures in all these calls are gracefully handled.

For e.g.: fetch_data_from_url_with_authentication() is used in get_patch_json/get_bug_json which are used in ValidatePatch step. In case of network request failure, validate-patch step will print a log about network request failure, mark the validate-patch step as warning, and continue with the subsequent build-steps.

For load_contributors_from_trac() network request failure, it falls back to reading the contributors.json file from disk.

For get_patch_status() network request failure in CheckPatchStatusOnEWSQueues step, commit-queue will continue processing the patch, although it might be inefficient, as it might run the layout-tests even if not strictly necessary.
Comment 3 lingho@apple.com 2020-11-16 11:57:23 PST
I think the timeout value to bugs.webkit.org should probably be set to a much lower value, like 2 seconds. Similarly for trac and ews-build connection, if you have't seen previous timeout, that should probably set to 2 seconds to make it useful.
Comment 4 Jonathan Bedard 2020-11-16 12:36:05 PST
Comment on attachment 414261 [details]

View in context: https://bugs.webkit.org/attachment.cgi?id=414261&action=review

> Tools/CISupport/ews-build/steps.py:402
> +            response = requests.get(url, timeout=15, params={'Bugzilla_api_key': self.get_bugzilla_api_key()})

I would second Ling's opinion that this timeout should be lower (I would say 5 seconds)

This is the time to wait for any bytes, not the entire response: https://requests.readthedocs.io/en/master/user/quickstart/
Comment 5 Alexey Proskuryakov 2020-11-16 12:42:53 PST
I'm not necessarily disagreeing, but can you explain why the timeout would be short? My instinct was to suggest making it at least 1 minute.

Failing these requests could put the system in a bad state.
Comment 6 Jonathan Bedard 2020-11-16 12:54:41 PST
(In reply to Alexey Proskuryakov from comment #5)
> I'm not necessarily disagreeing, but can you explain why the timeout would
> be short? My instinct was to suggest making it at least 1 minute.
> Failing these requests could put the system in a bad state.

My thinking is that if we can't receive any response (as in, not a single byte) from the server in 5 seconds, I wouldn't expect things to be different after 1 minute.

That being said, I'm not sure that this change would even fix the original problem, was our issue that requests where not connecting to the server? The timeout option for requests doesn't apply to how long it takes to download the response, only to the time it takes for the server to start the response.
Comment 7 lingho@apple.com 2020-11-16 14:06:36 PST
I might have misunderstood the reason the timeouts are to be addded. I thought these timeouts are acceptable and will be handled accordingly.

If the end goal is to not have a failed request, I guess there shouldn't even be a timeout specified. If a request eventually gets a timeout (not sure what's the default set up Python but I think the default Linux socket timeout is 60s), retry again and again until it eventually succeeded.

As of why ews-build was affected as a whole by unavailable connection on bugs.webkit.org last week, the cause might need to be figured out and fixed differently.
Comment 8 Aakash Jain 2020-11-18 13:44:30 PST
Created attachment 414481 [details]
Comment 9 Aakash Jain 2020-11-18 14:02:30 PST
Increased the timeout to 60s. It's not very clear why we saw performance issue on EWS on last Friday, but there were clearly many network requests which were stuck, taking extremely long etc. e.g.: in https://ews-build.webkit.org/#/builders/46/builds/2265 step #17 (validate-patch) took 2m:42s. Having 60s timeout is still much better than infinite timeout.

Also, for most of these network requests, failure only make EWS build slightly inefficient (e.g.: validate-patch step being skipped causing EWS to process obsolete/r- patches). For load_contributors_from_trac(), it will read local copy from disk (which might be slightly old)
Comment 10 Jonathan Bedard 2020-11-18 14:20:03 PST
Comment on attachment 414481 [details]

I think nit-picking the value this should be is not a good use of our time, so let's land!
Comment 11 EWS 2020-11-18 14:50:58 PST
Committed r269992: <https://trac.webkit.org/changeset/269992>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 414481 [details].
Comment 12 Radar WebKit Bug Importer 2020-11-18 14:51:20 PST