Bug 63531

Summary: Turn the PreloadScanner implementation into a thread-based solution
Product: WebKit
Reporter: Zoltan Horvath <zoltan>
Component: WebCore Misc.
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED INVALID
Severity: Normal
Priority: P2
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
CC: abarth, aestes, ap, dglazkov, eric, fpizlo, gustavo.noronha, gustavo, kbalazs, koivisto, loki, psolanki, simonjam, tonyg, webkit.review.bot, xan.lopez, zherczeg
Bug Depends on: 65263
Bug Blocks: 106127
Attachments (description, flags):
  patch for comments (webkit.review.bot: commit-queue-)
  disable preload scanner (webkit.review.bot: commit-queue-)
  Archive of layout-test-results from ec2-cr-linux-03 (none)
  proposed patch (none)
  new patch (kbalazs: review-)

Description Zoltan Horvath 2011-06-28 06:54:51 PDT
Created attachment 98909 [details]
patch for comments

I implemented the PreloadScanner to use a worker thread.

My observations...

Some "performance" results:

# Measurement with html-parser performance test (trunk/PerformanceTests/Parser/html-parser.html) runs:

with PreloadScanner: 1774 ms (+/-5.8%)
without PreloadScanner: 23.7 ms (+/-0.55%)
with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

So I think the html-parser test is not suitable for benchmarking the HTML parser itself.

# Measurement with our Methanol benchmark (23 locally mirrored sites) + 1 site that contains 3 iframes

with PreloadScanner: 19208 ms (+/-2.61%)
without PreloadScanner: 18335 ms (+/-3.74%)
with my ThreadedPreloadScanner: 17423 ms (+/-2.65%)

I had issues with AtomicString and QualifiedName comparison on the worker thread, so I turned some variables into simple WTF::String.

I haven't modified the preload scanner layout tests yet, since my preload scanner works in a slightly different way.

I couldn't find any published performance results for the existing preload scanner. Has anybody measured it?

I'm looking forward to your comments on the patch!

Thanks in advance,

Z
Comment 1 Adam Barth 2011-06-28 10:38:35 PDT
Very cool that you're working on this problem.  I've added some folks to the CC who have some experience benchmarking the preload scanner (and parser experience in general).
Comment 2 Adam Barth 2011-06-28 10:42:28 PDT
Comment on attachment 98909 [details]
patch for comments

View in context: https://bugs.webkit.org/attachment.cgi?id=98909&action=review

I need to look at this in more detail, but it looks very interesting.

> Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:59
> +static String bodyTagString;
> +static String charsetAttrString;
> +static String hrefAttrString;
> +static String imgTagString;
> +static String inputTagString;
> +static String linkTagString;
> +
> +static String mediaAttrString;
> +static String relAttrString;
> +static String scriptTagString;
> +static String srcAttrString;
> +static String styleTagString;
> +static String typeAttrString;

These look like static initializers, which we aren't allowed to use.
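For reference, the usual workaround is a lazily-initialized, leaked function-local static; a minimal sketch of that pattern, assuming WTF::String as in the patch (the name below is illustrative, not from the patch):

static const String& imgTagString()
{
    // Constructed on first use and intentionally leaked, so nothing has to
    // run at library load time.
    static String* tag = new String("img");
    return *tag;
}

Even then, sharing such strings with a worker thread would need thread-safe copies (e.g. String::isolatedCopy()), which is a separate concern from the initializer itself.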
Comment 3 Tony Gentilcore 2011-06-28 10:47:31 PDT
Really excited to see you working on this.

I've used web-page-replay for benchmarking preload scanner changes in the past. Methanol seems similar, but wpr allows for network simulation so that bandwidth contention and RTT can be modeled. Instructions here:
http://code.google.com/p/web-page-replay/wiki/ChromeBenchmark

I haven't looked at the patch yet, but can you describe briefly what you hope the benefit will be? Is this just offloading the CPU time spent parsing or does it actually allow us to discover subresources earlier?
Comment 4 Tony Gentilcore 2011-06-28 11:22:18 PDT
> Is this just offloading the CPU time spent parsing or does it actually allow us to discover subresources earlier?

BTW, I ask because there are different ways this could evolve. If the concern is the CPU we spend running the preload scanner, an alternative to consider is reusing the tokens the scanner generates instead of discarding them (https://bugs.webkit.org/show_bug.cgi?id=57376).

The theory behind the preload scanner is that the parsing CPU time is trivial compared to the time spent waiting for subresources to download. So we spend extra CPU in order to start subresource downloads as early as possible so they can download while we are blocked on other things.

So the real optimization target is running the scanner as early as possible. One way we can get gaps in the download waterfall today is that the UI thread is busy for a while when the scanner would otherwise be running. For instance:

<script>
  workForOneSecond();
</script>
<img src=foo.gif>

In that case we'd parse sequentially and execute the script without ever running the preload scanner, so foo.gif was not discovered as early as possible. A potential solution would be to always run the preload scanner first thing in insert()/append(), then pass the tokens to the real parser for it to actually run the scripts and build the tree. (A rough sketch follows below.) Obviously we'd only want to do that if we could reuse tokens, and there'd probably be some limit on how far we'd let it run before doing real parsing.
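Something like this rough, untested sketch (the member and method names are assumptions based on the parser code of the time, not code from any patch here):

void HTMLDocumentParser::append(const SegmentedString& source)
{
    // Let the preload scanner see the new input first, so subresource loads
    // can be requested even if the real parse later blocks on a script.
    m_preloadScanner->appendToEnd(source);
    m_preloadScanner->scan();

    // Then hand the same input to the tree-building parser as usual.
    m_input.appendToEnd(source);
    pumpTokenizerIfPossible(AllowYield);
}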

Anyway, just throwing this out for your consideration.
Comment 5 Adam Barth 2011-06-28 14:17:24 PDT
Another way this can improve performance is the following situation:

<script src="external-script-that-comes-back-quickly.js"></script>
... lots of stuff to scan ...

In this case, we'll be blocked inside the preload scanner when the script comes back from the network. The time between that point and when the preload scanner yields is wasted and adds to overall latency.
Comment 6 Alexey Proskuryakov 2011-06-28 16:04:49 PDT
Using a dedicated thread just for this seems extremely wasteful. On Mac, tasks should probably just go to libdispatch, and other platforms would likely need some kind of thread pool.
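For example, handing a scan task to libdispatch on Mac could look roughly like this (a sketch only; ScanTask is a hypothetical type standing in for whatever unit of preload-scanning work gets handed off):

#include <dispatch/dispatch.h>

struct ScanTask {
    void scan(); // hypothetical: runs the preload scan for one chunk of input
};

static void performScanTask(void* context)
{
    // Run the scan off the main thread, then free the task.
    ScanTask* task = static_cast<ScanTask*>(context);
    task->scan();
    delete task;
}

void scheduleScanTask(ScanTask* task)
{
    // Borrow a worker from the system-managed global queue instead of
    // keeping a dedicated thread alive per parser.
    dispatch_async_f(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                     task, performScanTask);
}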
Comment 7 Antti Koivisto 2011-06-29 03:23:13 PDT
> # Measurement with html-parser performance test (trunk/PerformanceTests/Parser/html-parser.html) runs:
> 
> with PreloadScanner: 1774 ms (+/-5.8%)
> without PreloadScanner: 23.7 ms (+/-0.55%)
> with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

In the worst case the preload scanner will end up tokenizing the document text twice. Considering the triviality of this parsing compared to the real parser (and the lack of tree building), the cost should always be a fraction of the time spent in real parsing.

Either these results indicate a serious bug in preload scanning or you are mismeasuring.
Comment 8 Antti Koivisto 2011-06-29 03:24:43 PDT
s/twice/second time
Comment 9 Zoltan Horvath 2011-06-29 06:02:05 PDT
(In reply to comment #2) Adam
> ...
> > +static String typeAttrString;
> 
> These look like static initializers, which we aren't allowed to use.

Yep, this is just a hack to make these strings comparable on the thread, although we should find a better way to do this. I didn't want to initialize the entire QualifiedName 'array' on the thread.


(In reply to comment #3) Tony
> Really excited to see you working on this.

I like this topic! ;)

> I've used web-page-replay for benchmarking preload scanner changes in the past. Methanol seems similar, but wpr allows for network simulation so that bandwidth contention and RTT can be modeled. Instructions here:
> http://code.google.com/p/web-page-replay/wiki/ChromeBenchmark

I'm going to set up wpr and make measurements soon!

> I haven't looked at the patch yet, but can you describe briefly what you hope the benefit will be. Is this just offloading the CPU time spent parsing or does it actually allow us to discover subresources earlier?

One benefit is decreasing CPU load on the main thread. The other is saving the time the preload scanner spends on the main thread; ideally only the resource request would cost anything on the main thread. Does it make sense, and is it possible, to modify resource downloading to support starting downloads from worker threads?

(In reply to comment #4) Tony
> > Is this just offloading the CPU time spent parsing or does it actually allow us to discover subresources earlier?
> 
> BTW, I ask because there are different ways this could evolve. If the concern is the CPU we spend running the preload scanner, an alternative to consider is reusing the tokens the scanner generates instead of discarding them (https://bugs.webkit.org/show_bug.cgi?id=57376).

Unfortunately, with the threaded approach, reusing the tokens doesn't seem to be trivial.

> A potential solution would be to always run the preload scanner first thing in insert()/append(), then pass the tokens to the real parser for it to actually run the scripts and build the tree. 

I just pulled this sentence out to respond to it: my solution does the same, but it doesn't reuse tokens.


(In reply to comment #5) Adam
> Another way this can improve performance is the following situation:
> 
> <script src="external-script-that-comes-back-quickly.js"></script>
> ... lots of stuff to scan ...
> 
> In this case, we'll be blocked inside the preload scanner when the script comes back from the network.  The time between that point in time and when the preload scanner yields will be wasted and add to overall latency.

In the case of the html-parser test, html5.html looks like the same situation, so this could be the reason for the long runtime; as I see it, the html-parser test is really testing the preload scanner instead of the HTML5 parser.

(In reply to comment #7) Antti
> In the worst case the preload scanner will end up tokenizing the document text twice. Considering the triviality of this parsing compared to the real parser (and the lack of tree building), the cost should always be a fraction of the time spent in real parsing.
> 
> Either these results indicate a serious bug in preload scanning or you are mismeasuring.

I will upload a patch here that disables the preload scanner; could you make a quick measurement to confirm or refute my results?

(In reply to comment #6) Alexey 
> Using a dedicated thread just for this seems extremely wasteful. On Mac, tasks should probably just go to libdispatch, and other platforms would likely need some kind of thread pool.

We would be happy to use a generic thread pool as well. We landed the Parallel Jobs framework in trunk, which also supports libdispatch; what do you think about extending it into a more generic implementation that supports long- and short-lived threads, and using that here?

Thanks for the comments, guys!
Comment 10 Tony Gentilcore 2011-06-29 06:48:56 PDT
> One benefit is decreasing CPU load on the main thread. The other is saving the time the preload scanner spends on the main thread; ideally only the resource request would cost anything on the main thread. Does it make sense, and is it possible, to modify resource downloading to support starting downloads from worker threads?

So if tokenization time on the CPU is the main target here, it seems like reusing tokens would be strictly superior (as it would work for single-core as well as multicore devices). It doesn't seem like too much work to hack up for comparison. Or perhaps I'm missing another benefit of threading here?
Comment 11 Balazs Kelemen 2011-06-29 06:57:42 PDT
(In reply to comment #10)
> > One benefit is decreasing CPU load on the main thread. The other is saving the time the preload scanner spends on the main thread; ideally only the resource request would cost anything on the main thread. Does it make sense, and is it possible, to modify resource downloading to support starting downloads from worker threads?
> 
> So if tokenization time on the CPU is the main target here, it seems like reusing tokens would be strictly superior (as it would work for single-core as well as multicore devices). It doesn't seem like too much work to hack up for comparison. Or perhaps I'm missing another benefit of threading here?

Let's look at the html-parser benchmark results again:
with PreloadScanner: 1774 ms (+/-5.8%)
without PreloadScanner: 23.7 ms (+/-0.55%)
with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

From these numbers it seems like preloading is even slower than the real parsing. Based on this result, I guess that the most time-consuming part of preloading is not the tokenization but starting the requests. Do you think that is possible? Do you think it can be true on all platforms (with different networking stacks)? If starting the requests is really that costly, it would be great to support starting them from a worker thread and to run the preload scanner on a worker thread.
Comment 12 Tony Gentilcore 2011-06-29 07:30:24 PDT
> Based on this result, I guess that the most time-consuming part of preloading is not the tokenization but starting the requests. Do you think that is possible?

I didn't consider that starting a request would be expensive. It would be very interesting to demonstrate that this is the case. If it does turn out to be the case, then maybe it would be better for the embedder to handle any threading when initiating requests? Chromium, for instance, does I/O on a separate thread and network requests in a separate process.

However, like Antti, I'm somewhat skeptical of the measurement. Why would initiating the requests be expensive when done from the preload scanner but not when done from the parser? I would really expect that tokenization is the only work being duplicated here.

Again, I'm not opposed to this change, just want to make sure I understand it completely.
Comment 13 Balazs Kelemen 2011-06-29 08:04:15 PDT
> However, like Antti, I'm somewhat skeptical of the measurement. Why would initiating the requests be expensive when done from the preload scanner but not when done from the parser? I would really expect that tokenization is the only work being duplicated here.

I asked Zoltan the same thing :) The answer is: the parser does not initiate requests. Those resources are requested on demand when something in WebCore tries to access them.
Comment 14 WebKit Review Bot 2011-06-29 08:07:58 PDT
Attachment 98909 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/WebCore.pro', u'Source/WebC..." exit_code: 1

Source/WebCore/html/parser/HTMLDocumentParser.h:35:  Alphabetical sorting problem.  [build/include_order] [4]
Total errors found: 1 in 13 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 15 WebKit Review Bot 2011-06-29 08:16:36 PDT
Comment on attachment 98909 [details]
patch for comments

Attachment 98909 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/8966034
Comment 16 WebKit Review Bot 2011-06-29 08:22:59 PDT
Comment on attachment 98909 [details]
patch for comments

Attachment 98909 [details] did not pass cr-mac-ews (chromium):
Output: http://queues.webkit.org/results/8960196
Comment 17 Collabora GTK+ EWS bot 2011-06-29 08:30:59 PDT
Comment on attachment 98909 [details]
patch for comments

Attachment 98909 [details] did not pass gtk-ews (gtk):
Output: http://queues.webkit.org/results/8965064
Comment 18 WebKit Review Bot 2011-06-29 08:55:47 PDT
Comment on attachment 98909 [details]
patch for comments

Attachment 98909 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/8906018
Comment 19 Antti Koivisto 2011-06-29 11:20:52 PDT
This would need much more convincing testing. Sampler profiles showing the preload scanner taking substantial CPU time would be a good start.

Anyway, I think the approach taken here takes us in the wrong direction. Instead we should:

- Make both the main parser and the preload scanner use the same token stream.
- Move the token generation to a thread (if it is a win). This will help much more than just moving the preload scanning.
Comment 20 Zoltan Horvath 2011-06-29 23:54:09 PDT
Created attachment 99245 [details]
disable preload scanner 

This patch disables the preload scanner.
Can somebody verify my results, please?
Comment 21 WebKit Review Bot 2011-06-30 00:24:44 PDT
Comment on attachment 99245 [details]
disable preload scanner 

Attachment 99245 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/8963390

New failing tests:
fast/preloader/input.html
fast/preloader/noscript.html
fast/preloader/image.html
fast/preloader/script.html
fast/preloader/scan-body-from-head-import.html
fast/preloader/style.html
http/tests/loading/preload-append-scan.php
fast/preloader/link.html
fast/preloader/scan-body-from-head-script.html
Comment 22 WebKit Review Bot 2011-06-30 00:24:50 PDT
Created attachment 99249 [details]
Archive of layout-test-results from ec2-cr-linux-03

The attached test failures were seen while running run-webkit-tests on the chromium-ews.
Bot: ec2-cr-linux-03  Port: Chromium  Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Comment 23 Gabor Loki 2011-06-30 02:50:53 PDT
> Can somebody verify my results please?

I tried your patches on GTK, and I can confirm the difference between trunk and the disabled PreloadScanner on PerformanceTests/Parser/html-parser.html.
Trunk:
  avg 2801.2
  median 2801
  stdev 9.57914401186244
  min 2791
  max 2828

PreloadScanner is disabled
  avg 37.6
  median 37.5
  stdev 0.8602325267042628
  min 36
  max 40

I can also confirm the same with the threaded PreloadScanner, but we should take care of the jobs dispatched to the worker thread. The current approach does not track the lifetime of the dispatched tasks; we should cancel pending actions on the worker thread when they are no longer needed (for example, when the parser has finished its work).

PreloadScanner is in a dedicated thread
  avg 227.05
  median 174
  stdev 143.21503936388805
  min 158
  max 692
Comment 24 Gyuyoung Kim 2011-06-30 05:19:46 PDT
Comment on attachment 98909 [details]
patch for comments

Attachment 98909 [details] did not pass efl-ews (efl):
Output: http://queues.webkit.org/results/8959630
Comment 25 Balazs Kelemen 2011-07-04 07:42:17 PDT
> - Move the token generation to a thread (if it is a win). This will help much more than just moving the preload scanning.

I don't expect that it would be a win. What would the main thread do while waiting for the tokens? The reason the preload scanner is a friendly topic for multithreading is that preloading is independent of the rest of the parsing process and can run in parallel while the main thread is waiting for resources.
Comment 26 Balazs Kelemen 2011-07-12 11:14:03 PDT
I started to investigate reusing the tokens in https://bugs.webkit.org/show_bug.cgi?id=64369. Actually, I think that running the preload scanner in parallel is more valuable than reusing the tokens. However, we should know how much performance we can gain by reusing the tokens.
Comment 27 Balazs Kelemen 2011-07-26 07:24:39 PDT
Created attachment 101999 [details]
proposed patch

Similar design to the last patch, with the following improvements:
 * lots of refactoring
 * add a way to cancel pending tasks
 * schedule preloading on the main thread asynchronously through the event system (via WTF::callOnMainThread),
   in order to initiate the load as soon as possible and minimize the extra work the parser needs to do
   for preload scanning (a sketch follows below)

It's still not fully ready since it's missing rebaselines for the preload scanning tests.
Benchmark results are on their way; be patient. Feedback is greatly welcome.
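A minimal sketch of that main-thread hand-off, assuming a worker-side scanner that posts each discovered resource back with WTF::callOnMainThread (the PreloadRequest type and the didFindPreload()/fire() names are made up for illustration and are not the patch's actual code):

#include <wtf/MainThread.h>
#include <wtf/OwnPtr.h>
#include <wtf/PassOwnPtr.h>

class PreloadRequest {
public:
    void fire(); // hypothetical: issues the actual resource request
};

static void firePreloadOnMainThread(void* context)
{
    // Reclaim ownership on the main thread, the only place loads can be started.
    OwnPtr<PreloadRequest> request = adoptPtr(static_cast<PreloadRequest*>(context));
    request->fire();
}

void HTMLPreloadScannerThread::didFindPreload(PassOwnPtr<PreloadRequest> request)
{
    // callOnMainThread takes a plain function pointer and a void* context,
    // so ownership is handed over with leakPtr() and reclaimed above.
    callOnMainThread(firePreloadOnMainThread, request.leakPtr());
}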
Comment 28 WebKit Review Bot 2011-07-26 07:27:47 PDT
Attachment 101999 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/html/parser/CSSPreloadScanner.cpp:202:  One line control clauses should not use braces.  [whitespace/braces] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:6:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:9:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:12:  "HTMLPreloadScanner.h" already included at Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:5  [build/include] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:13:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:26:  More than one command on the same line  [whitespace/newline] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:91:  Missing space before ( in while(  [whitespace/parens] [5]
Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:104:  One line control clauses should not use braces.  [whitespace/braces] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.h:8:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.h:11:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/html/parser/HTMLPreloadScannerThread.h:33:  The parameter name "task" adds no information, so it should be removed.  [readability/parameter_name] [5]
Total errors found: 11 in 16 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 29 WebKit Review Bot 2011-07-26 07:29:43 PDT
Comment on attachment 101999 [details]
proposed patch

Attachment 101999 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/9250377
Comment 30 Gyuyoung Kim 2011-07-26 07:30:23 PDT
Comment on attachment 101999 [details]
proposed patch

Attachment 101999 [details] did not pass efl-ews (efl):
Output: http://queues.webkit.org/results/9249450
Comment 31 WebKit Review Bot 2011-07-26 07:44:20 PDT
Comment on attachment 101999 [details]
proposed patch

Attachment 101999 [details] did not pass cr-mac-ews (chromium):
Output: http://queues.webkit.org/results/9245463
Comment 32 Gustavo Noronha (kov) 2011-07-26 08:54:28 PDT
Comment on attachment 101999 [details]
proposed patch

Attachment 101999 [details] did not pass gtk-ews (gtk):
Output: http://queues.webkit.org/results/9244477
Comment 33 WebKit Review Bot 2011-07-26 12:25:08 PDT
Comment on attachment 101999 [details]
proposed patch

Attachment 101999 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/9245524
Comment 34 Balazs Kelemen 2011-07-27 09:30:12 PDT
The patch needs a solution for https://bugs.webkit.org/show_bug.cgi?id=65263 so that it does not trigger an assertion.
Comment 35 Balazs Kelemen 2011-08-01 08:46:50 PDT
Sorry for the slow progress. Unfortunately, the benchmark results are not as good as expected, so I'm still tweaking it. The html-parser benchmark (in PerformanceTests/Parser) is encouraging, but page load time tests like Methanol currently show a small regression. I hope I can make it faster.
Comment 36 Antti Koivisto 2011-08-05 14:19:09 PDT
It is unlikely that threading something that takes very little CPU time is going to make page loading faster, especially since the actual triggering of the loads still has to happen on the WebKit thread. We are definitely not going to accept a patch that introduces more threading (and all the associated complexity) without showing useful performance gains. Your efforts would probably be better spent elsewhere.
Comment 37 Balazs Kelemen 2011-08-08 02:34:47 PDT
(In reply to comment #36)
> It is unlikely that threading something that takes very little CPU time is going to make page loading faster, especially since the actual triggering of the loads still has to happen on the WebKit thread. We are definitely not going to accept a patch that introduces more threading (and all the associated complexity) without showing useful performance gains. Your efforts would probably be better spent elsewhere.

Your arguments are reasonable, but I still haven't given up :)
Yes, it seems like the load time of typical web pages won't really be faster.
However, doing the scan in parallel has the following advantages:
 * It eliminates the worst case, which is blocking in the preload scanner when the resource we are waiting for has already arrived. This is the case with the html-parser benchmark, and that's why the patch speeds it up. Actually, this is the case with every page that refers to a script or stylesheet near the beginning and has a lot of text.
 * It helps kick off loads a bit earlier, because we can give each chunk to the scanner as soon as it arrives. Unfortunately, this benefit cannot be shown by page load time tests.

I can support the first point with the current numbers (with a newer local version of the patch).

html-parser
Ref
    avg 1786.85
    median 1781
    stdev 25.16997218909866
    min 1758
    max 1844
Parallel
    avg 1015.85
    median 1013
    stdev 6.174746958378132
    min 1010
    max 1026

Methanol with some popular site
ref      1157.39
parallel 1145.5

Methanol with some popular site + PerformanceTests/Parser/resources/html5.html
ref      40070
parallel 36824

The sites I used with Methanol were the local copies of http://www.blogger.com/, http://www.conduit.com/, http://www.facebook.com/, http://www.qq.com/, http://twitter.com/, http://www.google.com/, http://www.microsoft.com/, http://www.ebay.com/ and http://www.wikipedia.org/.

My further plan is to do a web-page-replay based benchmark on Mac that would show the behaviour with (simulated) network latency.
Comment 38 Balazs Kelemen 2011-08-09 04:39:26 PDT
Created attachment 103346 [details]
new patch

Unfortunately, I could not set up a better benchmark (web-page-replay did not work for me on either Linux or Mac).
If anybody could help in evaluating the performance I would really appreciate it.
Comment 39 WebKit Review Bot 2011-08-09 04:44:06 PDT
Comment on attachment 103346 [details]
new patch

Attachment 103346 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/9333366
Comment 40 WebKit Review Bot 2011-08-09 04:58:21 PDT
Comment on attachment 103346 [details]
new patch

Attachment 103346 [details] did not pass cr-mac-ews (chromium):
Output: http://queues.webkit.org/results/9336225
Comment 41 Balazs Kelemen 2011-10-10 01:58:04 PDT
Comment on attachment 103346 [details]
new patch

Outdated patch.
Comment 42 Balazs Kelemen 2011-10-10 02:00:49 PDT
The benefit doesn't seem to be valuable enough.
Comment 43 Filip Pizlo 2013-01-10 00:53:14 PST
(In reply to comment #42)
> The benefit doesn't seem to be valuable enough.

It would be great if you could summarize your findings.  It's likely that we will want to explore these sorts of optimizations throughout the system.  Probably a lot of the exploration we will do will dead-end, like yours did, simply because this is a tough problem.

So I'm curious about the following:

- To your best estimate, what was the expected speed-up?  The results you posted are for example 1157.39 (ref) versus 1145.5 (your concurrent version), which seems like a speed-up.

- As far as you know, what were the bottlenecks?

I know it's been a long time since you worked on this, but I think that this is a pretty neat case study of taking some part of the code and attempting to make it concurrent.  It would be cool to understand more precisely which parts of your optimization worked, and which parts didn't.
Comment 44 Filip Pizlo 2013-01-10 00:57:50 PST
Comment on attachment 103346 [details]
new patch

View in context: https://bugs.webkit.org/attachment.cgi?id=103346&action=review

> Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:58
> +    while (OwnPtr<HTMLPreloadScannerThread::ScanTask> task = m_tasks.waitForMessage()) {
> +        Mutex& scanTasksMutex = task->preloadScanner->scanTasksMutex();
> +        if (!scanTasksMutex.tryLock())
> +            continue;
> +        scan(task.get());
> +        scanTasksMutex.unlock();
> +    }

This seems like a really bad idea.  This looks like it will remove a task from the queue, but then if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.

> Source/WebCore/html/parser/HTMLPreloadScannerThread.h:36
> +    MessageQueue<ScanTask> m_tasks;

I took a peek into MessageQueue, and it appears not to be particularly optimal.  It pretty much requires everything happening in lockstep: to add a task, you lock a lock, add things, signal, and unlock.  When you do this and the worker thread was asleep, you pay the price of it being awoken.  That's expensive.  The locking itself is probably expensive, but that's not nearly as bad.
Comment 45 Balazs Kelemen 2013-01-10 14:31:49 PST
(In reply to comment #43)
> (In reply to comment #42)
> > The benefit doesn't seem to be valuable enough.
> 
> It would be great if you could summarize your findings.  It's likely that we will want to explore these sorts of optimizations throughout the system.  Probably a lot of the exploration we will do will dead-end, like yours did, simply because this is a tough problem.

I'm glad that you are interested in this work.

> 
> So I'm curious about the following:
> 
> - To your best estimate, what was the expected speed-up?  The results you posted are for example 1157.39 (ref) versus 1145.5 (your concurrent version), which seems like a speed-up.

It's a very tiny win, and as I remember I could not reproduce it every time, so we can say it is within the margin of measurement error. Generally, page load time did not improve from this work. The case where it was very beneficial is a page like the parser benchmark (PerformanceTests/Parser/html-parser.html), where the preload scanner is a time waster. I believe this worst-case scenario is really a bug in the preload scanner and should be fixed anyway; without threading, I can imagine yielding to the event loop after the scanner has processed some amount of text (see the sketch below).
The other advantage could be that we can kick off loading resources earlier, but it's hard to measure.
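A sketch of that non-threaded yielding idea (untested; the token budget, the timer member, and scanWithYield() itself are hypothetical additions to the scanner, not anything from a patch here):

void HTMLPreloadScanner::scanWithYield()
{
    // Process at most a fixed budget of tokens per pass, then reschedule the
    // remainder with a zero-delay timer so the event loop can run in between.
    static const int maxTokensPerPass = 512; // assumed budget
    int processed = 0;
    while (m_tokenizer->nextToken(m_source, m_token)) {
        processToken();
        m_token.clear();
        if (++processed >= maxTokensPerPass) {
            m_continueScanTimer.startOneShot(0); // hypothetical Timer member
            return;
        }
    }
}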

> 
> - As far as you know, what were the bottlenecks?
> 
> I know it's been a long time since you worked on this, but I think that this is a pretty neat case study of taking some part of the code and attempting to make it concurrent.  It would be cool to understand more precisely which parts of your optimization worked, and which parts didn't.

A serious problem is that we can only start loads on the main thread. If the main parser has been stopped because of JS execution, we cannot load the resources until it finishes. Having a loader on the worker thread could help, but I guess it would need a lot of work.
Comment 46 Balazs Kelemen 2013-01-10 14:51:07 PST
(In reply to comment #44)
> (From update of attachment 103346 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=103346&action=review
> 
> > Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:58
> > +    while (OwnPtr<HTMLPreloadScannerThread::ScanTask> task = m_tasks.waitForMessage()) {
> > +        Mutex& scanTasksMutex = task->preloadScanner->scanTasksMutex();
> > +        if (!scanTasksMutex.tryLock())
> > +            continue;
> > +        scan(task.get());
> > +        scanTasksMutex.unlock();
> > +    }
> 
> This seems like a really bad idea.  This looks like it will remove a task from the queue, but then if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.

This condition can only be true if the parser is stopped or it is under destruction, so we don't want to process this task anyway.

> 
> > Source/WebCore/html/parser/HTMLPreloadScannerThread.h:36
> > +    MessageQueue<ScanTask> m_tasks;
> 
> I took a peek into MessageQueue, and it appears not to be particularly optimal.  It pretty much requires everything happening in lockstep: to add a task, you lock a lock, add things, signal, and unlock.  When you do this and the worker thread was asleep, you pay the price of it being awoken.  That's expensive.  The locking itself is probably expensive, but that's not nearly as bad.

I'm not sure how we could avoid that. Do you mean we should make sure that the thread never falls asleep, i.e. that it always has something to do? This seems tricky; I don't have any idea how to achieve that.
Comment 47 Filip Pizlo 2013-01-10 15:19:58 PST
(In reply to comment #46)
> (In reply to comment #44)
> > (From update of attachment 103346 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=103346&action=review
> > 
> > > Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:58
> > > +    while (OwnPtr<HTMLPreloadScannerThread::ScanTask> task = m_tasks.waitForMessage()) {
> > > +        Mutex& scanTasksMutex = task->preloadScanner->scanTasksMutex();
> > > +        if (!scanTasksMutex.tryLock())
> > > +            continue;
> > > +        scan(task.get());
> > > +        scanTasksMutex.unlock();
> > > +    }
> > 
> > This seems like a really bad idea.  This looks like it will remove a task from the queue, but then if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.
> 
> This condition can only be true if the parser is stopped or it is under destruction, so we don't want to process this task anyway.

Ah.  I buy it.

> 
> > 
> > > Source/WebCore/html/parser/HTMLPreloadScannerThread.h:36
> > > +    MessageQueue<ScanTask> m_tasks;
> > 
> > I took a peek into MessageQueue, and it appears not to be particularly optimal.  It pretty much requires everything happening in lockstep: to add a task, you lock a lock, add things, signal, and unlock.  When you do this and the worker thread was asleep, you pay the price of it being awoken.  That's expensive.  The locking itself is probably expensive, but that's not nearly as bad.
> 
> I'm not sure how we could avoid that. Do you mean we should make sure that the thread never falls asleep, i.e. that it always has something to do? This seems tricky; I don't have any idea how to achieve that.

It is tricky, and I'm also not sure how to achieve it.  This is basically always the problem in trying to get a speed-up from making something concurrent: you need to get enough concurrency to keep your threads always fed with work to do.

If we are sure that we cannot achieve this, then we should think twice about using concurrency as a performance optimization.  (Though it may still be a valid optimization for reducing worst-case latencies if we believe that the task we have made concurrent risks stalling, say because of I/O or because of an algorithmic pathology.  That doesn't appear to be the case here since you're just moving non-I/O linear-time work off the main thread.)