Bug 37104 - WebKit Apache configs only listen on IPv4 addresses, causing random timeouts
Summary: WebKit Apache configs only listen on IPv4 addresses, causing random timeouts
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC OS X 10.5
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks: 33357 33742 36774 36941 37045
  Show dependency treegraph
 
Reported: 2010-04-05 12:09 PDT by Eric Seidel (no email)
Modified: 2010-04-08 00:55 PDT (History)
5 users (show)

See Also:


Attachments
Reduced http/tests/xmlhttprequest/cross-origin-authorization.html which shows hang when run repeatedly (1.87 KB, text/html)
2010-04-07 17:06 PDT, Eric Seidel (no email)
no flags Details
An even better reduction (689 bytes, text/plain)
2010-04-07 17:38 PDT, Eric Seidel (no email)
no flags Details
Patch (6.80 KB, patch)
2010-04-07 21:53 PDT, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch (1.52 KB, patch)
2010-04-07 22:59 PDT, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch (1.41 KB, patch)
2010-04-07 23:58 PDT, Eric Seidel (no email)
abarth: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Seidel (no email) 2010-04-05 12:09:58 PDT
XMLHttpRequest Exception 101 causing flakiness on numerous XHR tests

This appears to mostly be a problem for Snow Leopard and may be a problem with the underlying network layer.

This is an attempt to unify a bunch of likely duplicate bugs into one bug.

http/tests/xmlhttprequest/cross-origin-authorization.html:
https://bugs.webkit.org/show_bug.cgi?id=37045
http/tests/xmlhttprequest/access-control-basic-allow.html:
https://bugs.webkit.org/show_bug.cgi?id=36941
http/tests/xmlhttprequest/failed-auth.html:
https://bugs.webkit.org/show_bug.cgi?id=36774
http/tests/xmlhttprequest/cross-origin-cookie-storage.html:
https://bugs.webkit.org/show_bug.cgi?id=33742
http/tests/xmlhttprequest/cross-origin-no-authorization.html:
https://bugs.webkit.org/show_bug.cgi?id=33357

We can probably just dupe all of those against this bug.

Alexey:  Is there some sort of extra information we could collect from the bots to help us move forward in understanding this bug?  Given that this looks like an intermittent failure, I doubt we'll get a reproducible case, but it might be possible to catch this in the debugger running one of these tests enough times locally.
Comment 1 Eric Seidel (no email) 2010-04-05 12:12:28 PDT
Two more failures likely caused by the same root bug:
http/tests/xmlhttprequest/logout.html
https://bugs.webkit.org/show_bug.cgi?id=33301
http/tests/xmlhttprequest/frame-load-cancelled-abort.html
https://bugs.webkit.org/show_bug.cgi?id=33462
Comment 2 Alexey Proskuryakov 2010-04-05 12:19:37 PDT
This sounds like something that needs to be reproducible locally to investigate.
Comment 3 Adam Barth 2010-04-06 17:27:11 PDT
Looks like Exception 101 is just XMLHttpRequestException::NETWORK_ERR.
Comment 4 Eric Seidel (no email) 2010-04-07 15:28:14 PDT
In bug 37235 I've fixed it so that we can actually run:
run-webkit-tests --iterations 100 http/tests/xmlhttprequest
without expected failures.
Comment 5 Eric Seidel (no email) 2010-04-07 15:40:56 PDT
Running:
run-webkit-tests --iterations 1000 http/tests/xmlhttprequest/cross-origin-authorization.html
on SL seems to hang every 10 tests or so for a couple seconds.  Strange.
Comment 6 Eric Seidel (no email) 2010-04-07 15:49:53 PDT
http/tests/xmlhttprequest/cross-origin-no-authorization.html exhibits the same strange "pause every 10 or so" behavior.  Watching the apache access_log, the hang point seems random.  The last request entry in the log before the hang is not consistent.

Watching "top" I only see DumpRenderTree ever taking time, never httpd.
Comment 7 Eric Seidel (no email) 2010-04-07 17:06:59 PDT
Created attachment 52808 [details]
Reduced http/tests/xmlhttprequest/cross-origin-authorization.html which shows hang when run repeatedly

I used:
yes LayoutTests/http/tests/xmlhttprequest/cross-origin-authorization.html | head -n 1000 | xargs WebKitBuild/Debug/DumpRenderTree | grep Time
to run this test repeatedly, and noticed that every 20 or so runs it would hang for 2 seconds.

I'm still investigating to find out what part is hanging.
Comment 8 Eric Seidel (no email) 2010-04-07 17:07:37 PDT
I'm investigating the hanging because I suspect it's related to the XHR exceptions that this bug is actually about.
Comment 9 Eric Seidel (no email) 2010-04-07 17:17:09 PDT
(In reply to comment #7)
> I used:
> yes LayoutTests/http/tests/xmlhttprequest/cross-origin-authorization.html |
> head -n 1000 | xargs WebKitBuild/Debug/DumpRenderTree | grep Time
> to run this test repeatedly, and noticed that every 20 or so runs it would hang
> for 2 seconds.

This original testing was on a Debug build on a 4-core Mac Pro running Snow Leopard.  I was also able to reproduce similar symptoms on a Release build on a 2-core Mac Book Pro running Leopard.
Comment 10 Eric Seidel (no email) 2010-04-07 17:23:28 PDT
I updated the test to call GCController.collect() at the end of each test (after printing the time taken), and that did not make the hang go away.  Hopefully that rules out the idea of GC causing this hang.
Comment 11 Eric Seidel (no email) 2010-04-07 17:33:00 PDT
Adam and I have now reduced the test case to:

<pre id="console"></pre>
<script>
var startTime = new Date;
layoutTestController.dumpAsText();
function log(message)
{
    document.getElementById('console').appendChild(document.createTextNode(message + '\n'));
}
var req = new XMLHttpRequest;
req.open("GET", "http://localhost:8000/xxxxx", false);
req.send();
log("Time: " + (new Date - startTime));
</script>

See the instructions for running it under DRT above.

If you change this to use same-origin (127.0.0.1), we don't hang.  But the test as is will hang for about 1 second every 50 requests or so.
Comment 12 Eric Seidel (no email) 2010-04-07 17:38:19 PDT
Created attachment 52811 [details]
An even better reduction

The 124th XHR seems to most commonly fail!
Comment 13 Eric Seidel (no email) 2010-04-07 18:10:14 PDT
Firefox reproduced the failure as well.  Suggesting I may be confused.

Chrome throws an error:
    (111) Connection refused
Comment 14 Eric Seidel (no email) 2010-04-07 18:32:03 PDT
while curl "http://localhost:8000/xxxxxx" -s -w "Time: %{time_total}\n" | grep Time ; do; done;
reproduces the hang.
Suggesting that this is not WebKit, but rather some lower-level system service (like lookupd?) or apache itself.
Comment 15 Eric Seidel (no email) 2010-04-07 18:37:36 PDT
More data:

while [ true ]; do; curl "http://localhost:8000/xxxxxx" -s -w "Time: %{time_total}\n" | grep Time | grep -v "0.0"; done; 
reproduces.  However, using "127.0.0.1" instead of "localhost":

while [ true ]; do; curl "http://127.0.0.1:8000/xxxxxx" -s -w "Time: %{time_total}\n" | grep Time | grep -v "0.0"; done;

does not reproduce.

This still faults Apache or DNS resolution (or reverse lookup) on the machine (which used to be done by lookupd, but I don't know how things are done in Leopard/SnowLeopard).
Comment 16 Adam Barth 2010-04-07 19:19:10 PDT
Reduced test case:
run this command in a loop (with no server listening on port 9999):

echo | nc 127.0.0.1 9999

Sometimes it will hang for 1+ seconds.  Other examples:

dig localhost  <-- no hang
ping -c 1 127.0.0.1  <-- does hang (often for 11 seconds) 
curl -s http://localhost:9999/xxx  <-- does hang

If you try talking to a remote server (like webblaze.org), you don't get any hangs.  My read is that this is a bug in Mac OS X listening sockets.
Comment 17 Alexey Proskuryakov 2010-04-07 20:18:15 PDT
Are you always testing on 10.5.8?
Comment 18 Eric Seidel (no email) 2010-04-07 21:38:29 PDT
The issue (thanks to Rob Braun and Maciej) is that our Apache is only listening on IPv4 loopback (127.0.0.1) and not IPv6 loopback (::1).  When CFNetwork (or any other network layer) resolves "localhost" to an address it gets back both 127.0.0.1 and ::1 in random order. When the connection fails, it resolves again, and tries again.

If the network layer tries again, it may or may not get the IPv4 address first on the second resolve.

If we're failing often enough on enough requests, the kernel will start to throttle back connection reset responses (which was what was causing the 1s or 2s pauses in several of our reduced test cases).

We suspect this can lead to both the random "network error" exceptions as well as the timeouts.

The fix is to make our various apache configs listen on both the IPv4 loopback addresses as well as the IPv6 addresses.

Unfortunately during this process I also discovered that we have some (half-broken) support for changing HTTP port numbers in our run-webkit-tests and run-webkit-httpd scripts.  I did not make any attempt to fix those, I just added a FIXME since the fix for this bug makes that half-broken support more broken.
Comment 19 Eric Seidel (no email) 2010-04-07 21:53:57 PDT
Created attachment 52824 [details]
Patch
Comment 20 Adam Barth 2010-04-07 21:58:48 PDT
Comment on attachment 52824 [details]
Patch

I bet websockets have the same bug.
Comment 21 Adam Barth 2010-04-07 22:00:40 PDT
Comment on attachment 52824 [details]
Patch

Clearing flags on attachment: 52824

Committed r57249: <http://trac.webkit.org/changeset/57249>
Comment 22 Adam Barth 2010-04-07 22:00:54 PDT
All reviewed patches have been landed.  Closing bug.
Comment 23 Alexey Proskuryakov 2010-04-07 22:01:55 PDT
This is great news!

1. What will this do to Apache 1 on Tiger? Or to Apache on machines without IPv6 support?

2. Does this work with "run-webkit-httpd -a"?

3. Does run-webkit-websocketserver need to be fixed?

4. Did you verify that this is the actual cause of problems by disabling IPv6 on your machine?
Comment 24 Alexey Proskuryakov 2010-04-07 22:02:16 PDT
Guys, you are too fast...
Comment 25 Adam Barth 2010-04-07 22:09:19 PDT
> Guys, you are too fast...

Sorry.  We've been working on this bug for hours and talked about the fix over dinner.

> 1. What will this do to Apache 1 on Tiger? Or to Apache on machines without
> IPv6 support?

I think apache will fail to bind to the interface but otherwise be ok.  We'll find out for sure by what the bot does.

> 2. Does this work with "run-webkit-httpd -a"?

I don't know what that option does, but we can look into it.

> 3. Does run-webkit-websocketserver need to be fixed?

Yes.

> 4. Did you verify that this is the actual cause of problems by disabling IPv6
> on your machine?

We ran lots of experiments, but we didn't run that particular experiment.  The configuration is definitely wrong.  We're unsure of how much flakiness this will fix (hopefully a lot).  We'll find out.
Comment 26 Adam Barth 2010-04-07 22:14:35 PDT
> > 2. Does this work with "run-webkit-httpd -a"?
> 
> I don't know what that option does, but we can look into it.

That's part of the runtime configuration of Apache listening.  Eric says that all broken.
Comment 27 Adam Barth 2010-04-07 22:17:47 PDT
> 2. Does this work with "run-webkit-httpd -a"?

The short answer is yes.  These directives are additive.
Comment 28 Alexey Proskuryakov 2010-04-07 22:20:57 PDT
> Sorry.

Fast isn't bad :)

> > 2. Does this work with "run-webkit-httpd -a"?
> I don't know what that option does, but we can look into it.

It binds to all interfaces - useful when you want to run tests over network from another machine for some reason.

> The configuration is definitely wrong.

Agreed.
Comment 29 Alexey Proskuryakov 2010-04-07 22:22:10 PDT
> Unfortunately during this process I also discovered that we have some
> (half-broken) support for changing HTTP port numbers

It's completely broken, because tests rely on hardcoded port numbers (I initially hoped that test authors would be careful to not depend on those). This support should be just removed.
Comment 30 Eric Seidel (no email) 2010-04-07 22:57:43 PDT
Tiger needs more fixeration.  Reopening.
Comment 31 Eric Seidel (no email) 2010-04-07 22:59:43 PDT
Created attachment 52826 [details]
Patch
Comment 32 WebKit Review Bot 2010-04-07 23:03:36 PDT
http://trac.webkit.org/changeset/57249 might have broken Tiger Intel Release
Comment 33 Eric Seidel (no email) 2010-04-07 23:03:55 PDT
Committed r57250: <http://trac.webkit.org/changeset/57250>
Comment 34 Eric Seidel (no email) 2010-04-07 23:58:53 PDT
Created attachment 52842 [details]
Patch
Comment 35 Eric Seidel (no email) 2010-04-08 00:55:20 PDT
Committed r57260: <http://trac.webkit.org/changeset/57260>