RESOLVED INVALID 63531
Turn PreloadScanner implementation to a thread based solution
https://bugs.webkit.org/show_bug.cgi?id=63531
Summary Turn PreloadScanner implementation to a thread based solution
Zoltan Horvath
Reported 2011-06-28 06:54:51 PDT
Created attachment 98909 [details] patch for comments

I implemented PreloadScanner to use a worker thread. My observations and some "performance" results:

# Measurement with the html-parser performance test (trunk/PerformanceTests/Parser/html-parser.html):

with PreloadScanner: 1774 ms (+/-5.8%)
without PreloadScanner: 23.7 ms (+/-0.55%)
with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

So I think the html-parser test is not suitable for HTML parser benchmarking.

# Measurement with our Methanol benchmark (23 locally mirrored sites + one site that contains 3 iframes):

with PreloadScanner: 19208 ms (+/-2.61%)
without PreloadScanner: 18335 ms (+/-3.74%)
with my ThreadedPreloadScanner: 17423 ms (+/-2.65%)

I had issues with AtomicString and QualifiedName comparison on the worker thread, so I turned some variables into plain WTF::String. I haven't modified the preload scanner layout tests yet, since my preload scanner works in a slightly different way. I didn't find anything about the simple preload scanner's performance results... Has anybody measured anything?

I'm waiting for your comments on the patch! Thanks in advance, Z
Attachments
patch for comments (35.60 KB, patch)
2011-06-28 06:54 PDT, Zoltan Horvath
webkit.review.bot: commit-queue-
disable preload scanner (4.55 KB, patch)
2011-06-29 23:54 PDT, Zoltan Horvath
webkit.review.bot: commit-queue-
Archive of layout-test-results from ec2-cr-linux-03 (1.39 MB, application/zip)
2011-06-30 00:24 PDT, WebKit Review Bot
no flags
proposed patch (52.18 KB, patch)
2011-07-26 07:24 PDT, Balazs Kelemen
no flags
new patch (61.96 KB, patch)
2011-08-09 04:39 PDT, Balazs Kelemen
kbalazs: review-
Adam Barth
Comment 1 2011-06-28 10:38:35 PDT
Very cool that you're working on this problem. I've added some folks to the CC who have some experience benchmarking the preload scanner (and parser experience in general).
Adam Barth
Comment 2 2011-06-28 10:42:28 PDT
Comment on attachment 98909 [details] patch for comments

View in context: https://bugs.webkit.org/attachment.cgi?id=98909&action=review

I need to look at this in more detail, but it looks very interesting.

> Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:59
> +static String bodyTagString;
> +static String charsetAttrString;
> +static String hrefAttrString;
> +static String imgTagString;
> +static String inputTagString;
> +static String linkTagString;
> +
> +static String mediaAttrString;
> +static String relAttrString;
> +static String scriptTagString;
> +static String srcAttrString;
> +static String styleTagString;
> +static String typeAttrString;

These look like static initializers, which we aren't allowed to have.
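As background to this review comment: the pattern typically used to avoid such global static initializers is lazy, function-local construction, which WebKit wraps in its DEFINE_STATIC_LOCAL macro. A minimal standalone sketch of the idea, using std::string in place of WTF::String (the function names below are illustrative, not from the patch):

```cpp
#include <cassert>
#include <string>

// Hypothetical replacement for a global `static String srcAttrString;`:
// a function-local static is constructed on first use, so no constructor
// runs at library load time.
const std::string& srcAttrString()
{
    static const std::string attr("src");
    return attr;
}

const std::string& hrefAttrString()
{
    static const std::string attr("href");
    return attr;
}
```

Note that before C++11, initialization of function-local statics was not guaranteed to be thread-safe, so a table shared with a worker thread, as in this bug, would still need explicit one-time initialization.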
Tony Gentilcore
Comment 3 2011-06-28 10:47:31 PDT
Really excited to see you working on this.

I've used web-page-replay for benchmarking preload scanner changes in the past. Methanol seems similar, but wpr allows for network simulation, so bandwidth contention and RTT can be modeled. Instructions here: http://code.google.com/p/web-page-replay/wiki/ChromeBenchmark

I haven't looked at the patch yet, but can you briefly describe what you hope the benefit will be? Is this just offloading the CPU time spent parsing, or does it actually allow us to discover subresources earlier?
Tony Gentilcore
Comment 4 2011-06-28 11:22:18 PDT
> Is this just offloading the CPU time spent parsing or does it actually allow us to discover subresources earlier?

BTW, I ask because there are different ways this could evolve. If the concern is the CPU we spend running the preload scanner, an alternative to consider is reusing the tokens the scanner generates instead of discarding them (https://bugs.webkit.org/show_bug.cgi?id=57376).

The theory behind the preload scanner is that the parsing CPU time is trivial compared to the time spent waiting for subresources to download. So we spend extra CPU in order to start subresource downloads as early as possible, so they can download while we are blocked on other things. The real optimization target is therefore running the scanner as early as possible. One way we can get gaps in the download waterfall today is that the UI thread is busy for a while while the scanner is running. For instance:

<script>
workForOneSecond();
</script>
<img src=foo.gif>

In that case we'd parse sequentially and execute the script without ever running the preload scanner, so foo.gif was not discovered as early as possible. A potential solution would be to always run the preload scanner first thing in insert()/append(), then pass the tokens to the real parser for it to actually run the scripts and build the tree. Obviously we'd only want to do that if we could reuse tokens, and there'd probably be some limit on how far we'd let it run before doing real parsing. Anyway, just throwing this out for your consideration.
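To make this scenario concrete, here is a toy scan-ahead pass (purely illustrative; it is a string search, not WebKit's tokenizer-based HTMLPreloadScanner) showing how subresource URLs in pending markup can be discovered before a blocking script has run:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy preload scan: collect src="..." URLs from raw markup without
// building a tree or executing scripts. The real scanner reuses the
// HTML tokenizer and also handles charset/media/rel attributes.
std::vector<std::string> scanForSrcUrls(const std::string& markup)
{
    std::vector<std::string> urls;
    std::string::size_type pos = 0;
    while ((pos = markup.find("src=", pos)) != std::string::npos) {
        pos += 4; // skip past "src="
        char quote = 0;
        if (pos < markup.size() && (markup[pos] == '"' || markup[pos] == '\''))
            quote = markup[pos++];
        // A quoted value ends at the matching quote; an unquoted one at
        // whitespace or '>'.
        std::string::size_type end = quote
            ? markup.find(quote, pos)
            : markup.find_first_of(" \t\n>", pos);
        if (end == std::string::npos)
            break;
        urls.push_back(markup.substr(pos, end - pos));
        pos = end;
    }
    return urls;
}
```

For the example above, such a pass would report foo.gif even while workForOneSecond() is still executing on the main thread, which is exactly the gap in the waterfall being described.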
Adam Barth
Comment 5 2011-06-28 14:17:24 PDT
Another way this can improve performance is the following situation:

<script src="external-script-that-comes-back-quickly.js"></script>
... lots of stuff to scan ...

In this case, we'll be blocked inside the preload scanner when the script comes back from the network. The time between that point and when the preload scanner yields will be wasted and add to overall latency.
Alexey Proskuryakov
Comment 6 2011-06-28 16:04:49 PDT
Using a dedicated thread just for this seems extremely wasteful. On Mac, tasks should probably just go to libdispatch, and other platforms would likely need some kind of thread pool.
Antti Koivisto
Comment 7 2011-06-29 03:23:13 PDT
> # Measurement with html-parser performance test (trunk/PerformanceTests/Parser/html-parser.html) runs:
>
> with PreloadScanner: 1774 ms (+/-5.8%)
> without PreloadScanner: 23.7 ms (+/-0.55%)
> with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

In the worst case the preload scanner will end up tokenizing the document text twice. Considering the triviality of this parsing compared to the real parser (and the lack of tree building), the cost should always be a fraction of the time spent in real parsing. Either these results indicate a serious bug in preload scanning or you are mismeasuring.
Antti Koivisto
Comment 8 2011-06-29 03:24:43 PDT
s/twice/second time
Zoltan Horvath
Comment 9 2011-06-29 06:02:05 PDT
(In reply to comment #2) Adam

> > +static String typeAttrString;
>
> This looks like static initializers, we aren't allowed.

Yep, this is just a hack to make these strings comparable on the thread; we should find a better way to do this. I didn't want to initialize the entire QualifiedName 'array' on the thread.

(In reply to comment #3) Tony

> Really excited to see you working on this.

I like this topic! ;)

> I've used web-page-replay for benchmarking preload scanner changes in the past. Methanol seems similar, but wpr allows for network simulation so that bandwidth contention and RTT can be modeled. Instructions here:
> http://code.google.com/p/web-page-replay/wiki/ChromeBenchmark

I'm going to set up wpr and make measurements soon!

> I haven't looked at the patch yet, but can you describe briefly what you hope the benefit will be. Is this just offloading the CPU time spent parsing or does it actually allow us to discover subresources earlier?

One thing is decreasing CPU load on the main thread. On the other side, we can save the time the preload scanner spends on the main thread; ideally only the resource request will cost anything on the main thread. Does it make sense, and is it possible, to modify resource downloading to support downloading from worker threads?

(In reply to comment #4) Tony

> BTW, I ask because there are different ways this could evolve. If the concern is the CPU we spend running the preload scanner, an alternative to consider is reusing the tokens the scanner generates instead of discarding them (https://bugs.webkit.org/show_bug.cgi?id=57376).

Unfortunately, with the threaded approach, reusing the tokens doesn't seem to be trivial.

> A potential solution would be to always run the preload scanner first thing in insert()/append(), then pass the tokens to the real parser for it to actually run the scripts and build the tree.

You took this sentence right out of my mouth... My solution does the same, but it doesn't reuse tokens.

(In reply to comment #5) Adam

> Another way this can improve performance is the following situation:
>
> <script src="external-script-that-comes-back-quickly.js"></script>
> ... lots of stuff to scan ...
>
> In this case, we'll be blocked inside the preload scanner when the script comes back from the network. The time between that point in time and when the preload scanner yields will be wasted and add to overall latency.

The html5.html file used by the html-parser test looks much the same, so this can be the reason for the long runtime. As I see it, the html-parser test is testing the preload scanner instead of the HTML5 parser.

(In reply to comment #7) Antti

> In the worst case preload scanner will end up tokenizing the document text twice. Considering the triviality of the parsing compared the real parser (and the lack of tree building), the cost should always be fraction of the time spend in real parsing.
>
> Either these results indicate serious bug in preload scanning or you are mismeasuring.

I will upload a patch here that disables the preload scanner; could you make a quick measurement to confirm or refute my results?

(In reply to comment #6) Alexey

> Using a dedicated thread just for this seems extremely wasteful. On Mac, tasks should probably just go to libdispatch, and other platforms would likely need some kind of thread pool.

We would be happy with using a generic thread pool as well. We pushed the Parallel Jobs Framework into trunk, which also supports libdispatch; what do you think about modifying it into a more generic implementation that supports long- and short-lived threads, and using that?

Thanks for the comments, guys!
Tony Gentilcore
Comment 10 2011-06-29 06:48:56 PDT
> One thing is decrease CPU load on the main thread. On the other side we can save the time that preloadscanner spends on the main thread. Ideally only the resourcerequest will coast on main thread. Does it make sense and is it possible to modify resource downloading to support download from worker threads?

So if tokenization time on the CPU is the main target here, it seems like reusing tokens would be strictly superior (as it would work for single-core as well as multicore devices). It doesn't seem like too much work to hack up for comparison. Or perhaps I'm missing another benefit of threading here?
Balazs Kelemen
Comment 11 2011-06-29 06:57:42 PDT
(In reply to comment #10)

> So if tokenization time on the cpu is the main target here, it seems like reusing tokens would be strictly superior (as it would work for single core as well as multicore devices). It doesn't seem like too much work to hack up for comparison. Or perhaps I'm missing another benefit of threading here?

Let's look at the html-parser benchmark results again:

with PreloadScanner: 1774 ms (+/-5.8%)
without PreloadScanner: 23.7 ms (+/-0.55%)
with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

From these numbers it seems like preloading is even slower than the real parsing. Based on this result, I guess the most time-consuming part of preloading is not the tokenization but starting the requests. Do you think that is possible? Do you think it can be true on all platforms (with different networking stacks)? If starting the requests really is that costly, it would be great to support starting them from a worker thread and make the preload scanner run on a worker thread.
Tony Gentilcore
Comment 12 2011-06-29 07:30:24 PDT
> According to this result I guess that the most time consuming part of the preload is not the tokenization but starting the requests. Do you think it is possible?

I didn't consider that starting a request would be expensive. It would be very interesting to demonstrate that this is the case. If it does turn out to be the case, then maybe it would be better for the embedder to handle any threading when initiating requests? Chromium, for instance, does IO on a separate thread and makes network requests in a separate process.

However, like Antti, I'm somewhat skeptical of the measurement. Why would initiating the requests be expensive when done from the preload scanner but not when done from the parser? I would really expect that tokenization is the only work being duplicated here. Again, I'm not opposed to this change; I just want to make sure I understand it completely.
Balazs Kelemen
Comment 13 2011-06-29 08:04:15 PDT
> However, like Antti, I'm somewhat skeptical of the measurement. Why would it be the case that initiating the requests are expensive when done from the preload scanner but not when done from the parser? I would really expect that tokenization is the only work being duplicated here.

I asked Zoltan the same thing :) The answer is: the parser does not initiate requests. Those resources will be requested on demand when somebody in WebCore tries to access them.
WebKit Review Bot
Comment 14 2011-06-29 08:07:58 PDT
Attachment 98909 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/WebCore.pro', u'Source/WebC..." exit_code: 1 Source/WebCore/html/parser/HTMLDocumentParser.h:35: Alphabetical sorting problem. [build/include_order] [4] Total errors found: 1 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Review Bot
Comment 15 2011-06-29 08:16:36 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/8966034
WebKit Review Bot
Comment 16 2011-06-29 08:22:59 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass cr-mac-ews (chromium): Output: http://queues.webkit.org/results/8960196
Collabora GTK+ EWS bot
Comment 17 2011-06-29 08:30:59 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass gtk-ews (gtk): Output: http://queues.webkit.org/results/8965064
WebKit Review Bot
Comment 18 2011-06-29 08:55:47 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/8906018
Antti Koivisto
Comment 19 2011-06-29 11:20:52 PDT
This would need much more convincing testing. Sampler profiles showing the preload scanner taking substantial CPU time would be a good start.

Anyway, I think the approach taken here takes us in the wrong direction. Instead we should:

- Make both the main parser and preload scanning use the same token stream.
- Move the token generation to a thread (if it is a win). This will help much more than just moving the preload scanning.
Zoltan Horvath
Comment 20 2011-06-29 23:54:09 PDT
Created attachment 99245 [details] disable preload scanner

This patch disables the preload scanner. Can somebody verify my results, please?
WebKit Review Bot
Comment 21 2011-06-30 00:24:44 PDT
Comment on attachment 99245 [details] disable preload scanner Attachment 99245 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/8963390 New failing tests: fast/preloader/input.html fast/preloader/noscript.html fast/preloader/image.html fast/preloader/script.html fast/preloader/scan-body-from-head-import.html fast/preloader/style.html http/tests/loading/preload-append-scan.php fast/preloader/link.html fast/preloader/scan-body-from-head-script.html
WebKit Review Bot
Comment 22 2011-06-30 00:24:50 PDT
Created attachment 99249 [details] Archive of layout-test-results from ec2-cr-linux-03 The attached test failures were seen while running run-webkit-tests on the chromium-ews. Bot: ec2-cr-linux-03 Port: Chromium Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Gabor Loki
Comment 23 2011-06-30 02:50:53 PDT
> Can somebody verify my results please?

I tried your patches on Gtk, and I can confirm the difference between trunk and the disabled preload scanner on PerformanceTests/Parser/html-parser.html.

Trunk:
avg 2801.2 median 2801 stdev 9.57914401186244 min 2791 max 2828

PreloadScanner disabled:
avg 37.6 median 37.5 stdev 0.8602325267042628 min 36 max 40

I can also confirm the same for the threaded preload scanner, but we should take care of the jobs which are dedicated to the worker thread. The current approach does not care about the lifetime of dedicated tasks. We should cancel the pending actions on the worker thread if there is no longer any need for them (for example, when the parser has finished its work).

PreloadScanner in a dedicated thread:
avg 227.05 median 174 stdev 143.21503936388805 min 158 max 692
Gyuyoung Kim
Comment 24 2011-06-30 05:19:46 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass efl-ews (efl): Output: http://queues.webkit.org/results/8959630
Balazs Kelemen
Comment 25 2011-07-04 07:42:17 PDT
> - Move the token generation to a thread (if it is a win). This will help much more than just moving the preload scanning.

I don't expect that it would be a win. What would the main thread do while waiting for the tokens? The reason the preload scanner is a friendly target for multithreading is that preloading is independent of the rest of the parsing process, so it can run in parallel while the main thread is waiting for resources.
Balazs Kelemen
Comment 26 2011-07-12 11:14:03 PDT
I started to investigate reusing the tokens in https://bugs.webkit.org/show_bug.cgi?id=64369. Actually, I think that running the preload scanner in parallel is more valuable than reusing the tokens. However, we should know how much performance we can gain by reusing the tokens.
Balazs Kelemen
Comment 27 2011-07-26 07:24:39 PDT
Created attachment 101999 [details] proposed patch

Similar design to the last patch, with the following improvements:

* lots of refactoring
* add a way to cancel pending tasks
* schedule preloading on the main thread asynchronously through the event system (via WTF::callOnMainThread) in order to initiate the load as soon as possible and minimize the extra work the parser needs to do for preload scanning

It's still not fully ready, since it's missing rebaselines for the preload scanning tests. Benchmark results are on their way; be patient. Feedback is greatly welcomed.
WebKit Review Bot
Comment 28 2011-07-26 07:27:47 PDT
Attachment 101999 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/CSSPreloadScanner.cpp:202: One line control clauses should not use braces. [whitespace/braces] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:6: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:9: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:12: "HTMLPreloadScanner.h" already included at Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:5 [build/include] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:13: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:26: More than one command on the same line [whitespace/newline] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:91: Missing space before ( in while( [whitespace/parens] [5] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:104: One line control clauses should not use braces. [whitespace/braces] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.h:8: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.h:11: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.h:33: The parameter name "task" adds no information, so it should be removed. [readability/parameter_name] [5] Total errors found: 11 in 16 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Review Bot
Comment 29 2011-07-26 07:29:43 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/9250377
Gyuyoung Kim
Comment 30 2011-07-26 07:30:23 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass efl-ews (efl): Output: http://queues.webkit.org/results/9249450
WebKit Review Bot
Comment 31 2011-07-26 07:44:20 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass cr-mac-ews (chromium): Output: http://queues.webkit.org/results/9245463
Gustavo Noronha (kov)
Comment 32 2011-07-26 08:54:28 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass gtk-ews (gtk): Output: http://queues.webkit.org/results/9244477
WebKit Review Bot
Comment 33 2011-07-26 12:25:08 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/9245524
Balazs Kelemen
Comment 34 2011-07-27 09:30:12 PDT
The patch needs a solution for https://bugs.webkit.org/show_bug.cgi?id=65263 to not trigger an assertion.
Balazs Kelemen
Comment 35 2011-08-01 08:46:50 PDT
Sorry for the slow progress. Unfortunately, the benchmark results are not as good as expected, so I'm still tweaking it. The html-parser benchmark (in PerformanceTests/Parser) is encouraging, but page load time tests like Methanol are currently showing a small regression. I hope I can make it faster.
Antti Koivisto
Comment 36 2011-08-05 14:19:09 PDT
It is unlikely that threading something that takes very little CPU time is going to make page loading faster, especially since the actual triggering of the loads still has to happen on the WebKit thread. We are definitely not going to accept a patch that introduces more threading (and all the associated complexity) without showing useful performance gains. Your efforts would probably be better spent elsewhere.
Balazs Kelemen
Comment 37 2011-08-08 02:34:47 PDT
(In reply to comment #36)

> It is unlikely that threading something that takes very little CPU time is going to make page loading faster, especially since the actually triggering of the loads still has to happen in the webkit thread. We definitely are not going to accept a patch that introduces more threading (and all the associated complexity) without showing useful performance gains. Your efforts would probably be better spent elsewhere.

Your arguments are reasonable, but I still haven't given up :) Yes, it seems like the load time of typical web pages won't really be faster. However, doing the scan in parallel has the following advantages:

* It eliminates the worst case, which is blocking in the preload scanner when the resource we are waiting for has already arrived. This is the case with the html-parser benchmark, and that's why the patch speeds it up. Actually, this is the case with every page that refers to some script or CSS at the beginning and has a lot of text.
* It helps kick off the load a bit earlier, because we can give each chunk to the scanner as soon as it has arrived. Unfortunately this benefit cannot be shown by page load time tests.

I can support the first point with the current numbers (with a newer local version of the patch).

html-parser:

Ref: avg 1786.85 median 1781 stdev 25.16997218909866 min 1758 max 1844
Parallel: avg 1015.85 median 1013 stdev 6.174746958378132 min 1010 max 1026

Methanol with some popular sites:

ref 1157.39
parallel 1145.5

Methanol with some popular sites + PerformanceTests/Parser/resources/html5.html:

ref 40070
parallel 36824

The sites I used with Methanol were local copies of http://www.blogger.com/, http://www.conduit.com/, http://www.facebook.com/, http://www.qq.com/, http://twitter.com/, http://www.google.com/, http://www.microsoft.com/, http://www.ebay.com/ and http://www.wikipedia.org/.

My further plan is to run a web-page-replay based benchmark on Mac that would show the behaviour with (simulated) network latency.
Balazs Kelemen
Comment 38 2011-08-09 04:39:26 PDT
Created attachment 103346 [details] new patch

Unfortunately I could not set up a better benchmark (web-page-replay did not work for me on either Linux or Mac). If anybody could help in evaluating the performance, I would really appreciate it.
WebKit Review Bot
Comment 39 2011-08-09 04:44:06 PDT
Comment on attachment 103346 [details] new patch Attachment 103346 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/9333366
WebKit Review Bot
Comment 40 2011-08-09 04:58:21 PDT
Comment on attachment 103346 [details] new patch Attachment 103346 [details] did not pass cr-mac-ews (chromium): Output: http://queues.webkit.org/results/9336225
Balazs Kelemen
Comment 41 2011-10-10 01:58:04 PDT
Comment on attachment 103346 [details] new patch Outdated patch.
Balazs Kelemen
Comment 42 2011-10-10 02:00:49 PDT
The benefit doesn't seem to be valuable enough.
Filip Pizlo
Comment 43 2013-01-10 00:53:14 PST
(In reply to comment #42)

> The benefit doesn't seems to be valuable enough.

It would be great if you could summarize your findings. It's likely that we will want to explore these sorts of optimizations throughout the system. Probably a lot of the exploration we do will dead-end, like yours did, simply because this is a tough problem.

So I'm curious about the following:

- To your best estimate, what was the expected speed-up? The results you posted are for example 1157.39 (ref) versus 1145.5 (your concurrent version), which seems like a speed-up.
- As far as you know, what were the bottlenecks?

I know it's been a long time since you worked on this, but I think this is a pretty neat case study of taking some part of the code and attempting to make it concurrent. It would be cool to understand more precisely which parts of your optimization worked, and which parts didn't.
Filip Pizlo
Comment 44 2013-01-10 00:57:50 PST
Comment on attachment 103346 [details] new patch

View in context: https://bugs.webkit.org/attachment.cgi?id=103346&action=review

> Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:58
> +    while (OwnPtr<HTMLPreloadScannerThread::ScanTask> task = m_tasks.waitForMessage()) {
> +        Mutex& scanTasksMutex = task->preloadScanner->scanTasksMutex();
> +        if (!scanTasksMutex.tryLock())
> +            continue;
> +        scan(task.get());
> +        scanTasksMutex.unlock();
> +    }

This seems like a really bad idea. It looks like it will remove a task from the queue, but then, if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.

> Source/WebCore/html/parser/HTMLPreloadScannerThread.h:36
> +    MessageQueue<ScanTask> m_tasks;

I took a peek into MessageQueue, and it appears not to be particularly optimal. It pretty much requires everything to happen in lockstep: to add a task, you lock a lock, add things, signal, and unlock. When you do this and the worker thread was asleep, you pay the price of it being awoken. That's expensive. The locking itself is probably expensive, but that's not nearly as bad.
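The lockstep pattern being described can be sketched with standard primitives (a generic illustration using std::mutex and std::condition_variable, not WTF::MessageQueue itself): every append takes the lock and signals, and a sleeping consumer pays a wakeup per message.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal producer/consumer queue in the style discussed above:
// append locks, pushes, and signals; a sleeping worker thread is
// woken once per message, which is the cost Filip points out.
template <typename T>
class SimpleMessageQueue {
public:
    void append(T message)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex); // producer takes the lock
            m_queue.push_back(std::move(message));
        }
        m_condition.notify_one(); // wakes the worker if it was asleep
    }

    T waitForMessage()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Predicate form guards against spurious wakeups.
        m_condition.wait(lock, [this] { return !m_queue.empty(); });
        T message = std::move(m_queue.front());
        m_queue.pop_front();
        return message;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    std::deque<T> m_queue;
};
```

One common mitigation is batching: the producer appends several chunks before signaling, and the consumer drains the whole queue under a single lock acquisition, so a sleeping worker pays one wakeup per burst rather than one per message.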
Balazs Kelemen
Comment 45 2013-01-10 14:31:49 PST
(In reply to comment #43)

> It would be great if you could summarize your findings. It's likely that we will want to explore these sorts of optimizations throughout the system. Probably a lot of the exploration we will do will dead-end, like yours did, simply because this is a tough problem.

I'm glad that you are interested in this work.

> - To your best estimate, what was the expected speed-up? The results you posted are for example 1157.39 (ref) versus 1145.5 (your concurrent version), which seems like a speed-up.

It's a very tiny win, and as I remember I could not reproduce it every time, so we can say it is within the margin of measurement error. Generally, page load time did not improve with this work. The case where it was very beneficial is a page like the parser benchmark (PerformanceTests/Parser/html-parser.html), where the preload scanner is a time waster. I believe this worst-case scenario is really a bug in the preload scanner, and it should be fixed anyway; without threading, I could imagine yielding to the event loop after some amount of text has been processed. The other advantage could be that we can kick off loading resources earlier, but that is hard to measure.

> - As far as you know, what were the bottlenecks?
>
> I know it's been a long time since you worked on this, but I think that this is a pretty neat case study of taking some part of the code and attempting to make it concurrent. It would be cool to understand more precisely which parts of your optimization worked, and which parts didn't.

A serious problem is that we can only start loads on the main thread. If the main parser has been stopped because of JS execution, we cannot load the resources until it finishes. If we could have a loader on the thread it would help, but I guess that needs a lot of work.
Balazs Kelemen
Comment 46 2013-01-10 14:51:07 PST
(In reply to comment #44)

> > Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:58
> > +    while (OwnPtr<HTMLPreloadScannerThread::ScanTask> task = m_tasks.waitForMessage()) {
> > +        Mutex& scanTasksMutex = task->preloadScanner->scanTasksMutex();
> > +        if (!scanTasksMutex.tryLock())
> > +            continue;
> > +        scan(task.get());
> > +        scanTasksMutex.unlock();
> > +    }
>
> This seems like a really bad idea. This looks like it will remove a task from the queue, but then if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.

This condition can only be true if the parser is stopped or it is under destruction, so we don't want to process the task anyway.

> > Source/WebCore/html/parser/HTMLPreloadScannerThread.h:36
> > +    MessageQueue<ScanTask> m_tasks;
>
> I took a peek into MessageQueue, and it appears not to be particularly optimal. It pretty much requires everything happening in lockstep: to add a task, you lock a lock, add things, signal, and unlock. When you do this and the worker thread was asleep, you pay the price of it being awoken. That's expensive. The locking itself is probably expensive, but that's not nearly as bad.

I'm not sure how we could avoid that. Do you mean we should make sure that the thread never falls asleep, i.e. that it always has something to do? That seems tricky; I don't have any idea how to achieve it.
Filip Pizlo
Comment 47 2013-01-10 15:19:58 PST
(In reply to comment #46)

> > This seems like a really bad idea. This looks like it will remove a task from the queue, but then if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.
>
> This condition can only be true if the parser is stopped or it is under destruction, so we don't want to process this task anyway.

Ah. I buy it.

> > I took a peek into MessageQueue, and it appears not to be particularly optimal. It pretty much requires everything happening in lockstep: to add a task, you lock a lock, add things, signal, and unlock. When you do this and the worker thread was asleep, you pay the price of it being awoken. That's expensive. The locking itself is probably expensive, but that's not nearly as bad.
>
> I'm not sure how could we avoid that. Do you mean we should make sure that the thread will never fall asleep, i.e. it always have something to do? This seems tricky, I don't have any idea how to achieve that.

It is tricky, and I'm also not sure how to achieve it. This is basically always the problem in trying to get a speed-up from making something concurrent: you need enough concurrency to keep your threads constantly fed with work. If we are sure that we cannot achieve this, then we should think twice about using concurrency as a performance optimization.

(Though it may still be a valid optimization for reducing worst-case latencies, if we believe that the task we have made concurrent risks stalling, say because of I/O or because of an algorithmic pathology. That doesn't appear to be the case here, since you're just moving non-I/O linear-time work off the main thread.)