RESOLVED INVALID 63531
Turn PreloadScanner implementation to a thread based solution
https://bugs.webkit.org/show_bug.cgi?id=63531
Summary Turn PreloadScanner implementation to a thread based solution
Zoltan Horvath
Reported 2011-06-28 06:54:51 PDT
Created attachment 98909 [details] patch for comments

I implemented PreloadScanner to use a worker thread. My observations and some "performance" results:

# Measurement with the html-parser performance test (trunk/PerformanceTests/Parser/html-parser.html):

with PreloadScanner: 1774 ms (+/-5.8%)
without PreloadScanner: 23.7 ms (+/-0.55%)
with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

So I think the html-parser test is not suitable for HTML parser benchmarking.

# Measurement with our Methanol benchmark (23 locally mirrored sites + one site that contains 3 iframes):

with PreloadScanner: 19208 ms (+/-2.61%)
without PreloadScanner: 18335 ms (+/-3.74%)
with my ThreadedPreloadScanner: 17423 ms (+/-2.65%)

I had issues with AtomicString and QualifiedName comparison on the worker thread, so I turned some variables into plain WTF::String. I haven't modified the preload scanner layout tests yet, since my preload scanner works in a slightly different way. I didn't find anything about the simple preload scanner's performance results... Has anybody measured anything?

I'm waiting for your comments on the patch! Thanks in advance, Z
Attachments
patch for comments (35.60 KB, patch)
2011-06-28 06:54 PDT, Zoltan Horvath
webkit.review.bot: commit-queue-
disable preload scanner (4.55 KB, patch)
2011-06-29 23:54 PDT, Zoltan Horvath
webkit.review.bot: commit-queue-
Archive of layout-test-results from ec2-cr-linux-03 (1.39 MB, application/zip)
2011-06-30 00:24 PDT, WebKit Review Bot
no flags
proposed patch (52.18 KB, patch)
2011-07-26 07:24 PDT, Balazs Kelemen
no flags
new patch (61.96 KB, patch)
2011-08-09 04:39 PDT, Balazs Kelemen
kbalazs: review-
Adam Barth
Comment 1 2011-06-28 10:38:35 PDT
Very cool that you're working on this problem. I've added some folks to the CC who have some experience benchmarking the preload scanner (and parser experience in general).
Adam Barth
Comment 2 2011-06-28 10:42:28 PDT
Comment on attachment 98909 [details] patch for comments

View in context: https://bugs.webkit.org/attachment.cgi?id=98909&action=review

I need to look at this in more detail, but it looks very interesting.

> Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:59
> +static String bodyTagString;
> +static String charsetAttrString;
> +static String hrefAttrString;
> +static String imgTagString;
> +static String inputTagString;
> +static String linkTagString;
> +
> +static String mediaAttrString;
> +static String relAttrString;
> +static String scriptTagString;
> +static String srcAttrString;
> +static String styleTagString;
> +static String typeAttrString;

These look like static initializers, which we aren't allowed to have.
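As background to this review comment: the pattern typically used to avoid such global static initializers is lazy, function-local construction, which WebKit wraps in its DEFINE_STATIC_LOCAL macro. A minimal standalone sketch of the idea, using std::string in place of WTF::String (the function names below are illustrative, not from the patch):

```cpp
#include <cassert>
#include <string>

// Hypothetical replacement for a global `static String srcAttrString;`:
// a function-local static is constructed on first use, so no constructor
// runs at library load time.
const std::string& srcAttrString()
{
    static const std::string attr("src");
    return attr;
}

const std::string& hrefAttrString()
{
    static const std::string attr("href");
    return attr;
}
```

Note that before C++11, initialization of function-local statics was not guaranteed to be thread-safe, so a table shared with a worker thread, as in this bug, would still need explicit one-time initialization.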
Tony Gentilcore
Comment 3 2011-06-28 10:47:31 PDT
Really excited to see you working on this.

I've used web-page-replay for benchmarking preload scanner changes in the past. Methanol seems similar, but wpr allows for network simulation, so bandwidth contention and RTT can be modeled. Instructions here: http://code.google.com/p/web-page-replay/wiki/ChromeBenchmark

I haven't looked at the patch yet, but can you briefly describe what you hope the benefit will be? Is this just offloading the CPU time spent parsing, or does it actually allow us to discover subresources earlier?
Tony Gentilcore
Comment 4 2011-06-28 11:22:18 PDT
> Is this just offloading the CPU time spent parsing or does it actually allow us to discover subresources earlier?

BTW, I ask because there are different ways this could evolve. If the concern is the CPU we spend running the preload scanner, an alternative to consider is reusing the tokens the scanner generates instead of discarding them (https://bugs.webkit.org/show_bug.cgi?id=57376).

The theory behind the preload scanner is that the parsing CPU time is trivial compared to the time spent waiting for subresources to download. So we spend extra CPU in order to start subresource downloads as early as possible, so they can download while we are blocked on other things. The real optimization target is therefore running the scanner as early as possible. One way we can get gaps in the download waterfall today is that the UI thread is busy for a while while the scanner is running. For instance:

<script>
workForOneSecond();
</script>
<img src=foo.gif>

In that case we'd parse sequentially and execute the script without ever running the preload scanner, so foo.gif was not discovered as early as possible. A potential solution would be to always run the preload scanner first thing in insert()/append(), then pass the tokens to the real parser for it to actually run the scripts and build the tree. Obviously we'd only want to do that if we could reuse tokens, and there'd probably be some limit on how far we'd let it run before doing real parsing. Anyway, just throwing this out for your consideration.
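To make this scenario concrete, here is a toy scan-ahead pass (purely illustrative; it is a string search, not WebKit's tokenizer-based HTMLPreloadScanner) showing how subresource URLs in pending markup can be discovered before a blocking script has run:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy preload scan: collect src="..." URLs from raw markup without
// building a tree or executing scripts. The real scanner reuses the
// HTML tokenizer and also handles charset/media/rel attributes.
std::vector<std::string> scanForSrcUrls(const std::string& markup)
{
    std::vector<std::string> urls;
    std::string::size_type pos = 0;
    while ((pos = markup.find("src=", pos)) != std::string::npos) {
        pos += 4; // skip past "src="
        char quote = 0;
        if (pos < markup.size() && (markup[pos] == '"' || markup[pos] == '\''))
            quote = markup[pos++];
        // A quoted value ends at the matching quote; an unquoted one at
        // whitespace or '>'.
        std::string::size_type end = quote
            ? markup.find(quote, pos)
            : markup.find_first_of(" \t\n>", pos);
        if (end == std::string::npos)
            break;
        urls.push_back(markup.substr(pos, end - pos));
        pos = end;
    }
    return urls;
}
```

For the example above, such a pass would report foo.gif even while workForOneSecond() is still executing on the main thread, which is exactly the gap in the waterfall being described.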
Adam Barth
Comment 5 2011-06-28 14:17:24 PDT
Another way this can improve performance is the following situation:

<script src="external-script-that-comes-back-quickly.js"></script>
... lots of stuff to scan ...

In this case, we'll be blocked inside the preload scanner when the script comes back from the network. The time between that point and when the preload scanner yields will be wasted and add to overall latency.
Alexey Proskuryakov
Comment 6 2011-06-28 16:04:49 PDT
Using a dedicated thread just for this seems extremely wasteful. On Mac, tasks should probably just go to libdispatch, and other platforms would likely need some kind of thread pool.
Antti Koivisto
Comment 7 2011-06-29 03:23:13 PDT
> # Measurement with html-parser performance test (trunk/PerformanceTests/Parser/html-parser.html) runs:
>
> with PreloadScanner: 1774 ms (+/-5.8%)
> without PreloadScanner: 23.7 ms (+/-0.55%)
> with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

In the worst case the preload scanner will end up tokenizing the document text twice. Considering the triviality of this parsing compared to the real parser (and the lack of tree building), the cost should always be a fraction of the time spent in real parsing. Either these results indicate a serious bug in preload scanning or you are mismeasuring.
Antti Koivisto
Comment 8 2011-06-29 03:24:43 PDT
s/twice/second time
Zoltan Horvath
Comment 9 2011-06-29 06:02:05 PDT
(In reply to comment #2) Adam

> > +static String typeAttrString;
>
> This looks like static initializers, we aren't allowed.

Yep, this is just a hack to make these strings comparable on the thread; we should find a better way to do this. I didn't want to initialize the entire QualifiedName 'array' on the thread.

(In reply to comment #3) Tony

> Really excited to see you working on this.

I like this topic! ;)

> I've used web-page-replay for benchmarking preload scanner changes in the past. Methanol seems similar, but wpr allows for network simulation so that bandwidth contention and RTT can be modeled. Instructions here:
> http://code.google.com/p/web-page-replay/wiki/ChromeBenchmark

I'm going to set up wpr and make measurements soon!

> I haven't looked at the patch yet, but can you describe briefly what you hope the benefit will be. Is this just offloading the CPU time spent parsing or does it actually allow us to discover subresources earlier?

One thing is decreasing CPU load on the main thread. On the other side, we can save the time the preload scanner spends on the main thread; ideally only the resource request will cost anything on the main thread. Does it make sense, and is it possible, to modify resource downloading to support downloading from worker threads?

(In reply to comment #4) Tony

> BTW, I ask because there are different ways this could evolve. If the concern is the CPU we spend running the preload scanner, an alternative to consider is reusing the tokens the scanner generates instead of discarding them (https://bugs.webkit.org/show_bug.cgi?id=57376).

Unfortunately, with the threaded approach, reusing the tokens doesn't seem to be trivial.

> A potential solution would be to always run the preload scanner first thing in insert()/append(), then pass the tokens to the real parser for it to actually run the scripts and build the tree.

You took this sentence right out of my mouth... My solution does the same, but it doesn't reuse tokens.

(In reply to comment #5) Adam

> Another way this can improve performance is the following situation:
>
> <script src="external-script-that-comes-back-quickly.js"></script>
> ... lots of stuff to scan ...
>
> In this case, we'll be blocked inside the preload scanner when the script comes back from the network. The time between that point in time and when the preload scanner yields will be wasted and add to overall latency.

The html5.html file used by the html-parser test looks much the same, so this can be the reason for the long runtime. As I see it, the html-parser test is testing the preload scanner instead of the HTML5 parser.

(In reply to comment #7) Antti

> In the worst case preload scanner will end up tokenizing the document text twice. Considering the triviality of the parsing compared the real parser (and the lack of tree building), the cost should always be fraction of the time spend in real parsing.
>
> Either these results indicate serious bug in preload scanning or you are mismeasuring.

I will upload a patch here that disables the preload scanner; could you make a quick measurement to confirm or refute my results?

(In reply to comment #6) Alexey

> Using a dedicated thread just for this seems extremely wasteful. On Mac, tasks should probably just go to libdispatch, and other platforms would likely need some kind of thread pool.

We would be happy with using a generic thread pool as well. We pushed the Parallel Jobs Framework into trunk, which also supports libdispatch; what do you think about modifying it into a more generic implementation that supports long- and short-lived threads, and using that?

Thanks for the comments, guys!
Tony Gentilcore
Comment 10 2011-06-29 06:48:56 PDT
> One thing is decrease CPU load on the main thread. On the other side we can save the time that preloadscanner spends on the main thread. Ideally only the resourcerequest will coast on main thread. Does it make sense and is it possible to modify resource downloading to support download from worker threads?

So if tokenization time on the CPU is the main target here, it seems like reusing tokens would be strictly superior (as it would work for single-core as well as multicore devices). It doesn't seem like too much work to hack up for comparison. Or perhaps I'm missing another benefit of threading here?
Balazs Kelemen
Comment 11 2011-06-29 06:57:42 PDT
(In reply to comment #10)

> So if tokenization time on the cpu is the main target here, it seems like reusing tokens would be strictly superior (as it would work for single core as well as multicore devices). It doesn't seem like too much work to hack up for comparison. Or perhaps I'm missing another benefit of threading here?

Let's look at the html-parser benchmark results again:

with PreloadScanner: 1774 ms (+/-5.8%)
without PreloadScanner: 23.7 ms (+/-0.55%)
with my ThreadedPreloadScanner: 77.15 ms (+/-1.5%)

From these numbers it seems like preloading is even slower than the real parsing. Based on this result, I guess the most time-consuming part of preloading is not the tokenization but starting the requests. Do you think that is possible? Do you think it can be true on all platforms (with different networking stacks)? If starting the requests really is that costly, it would be great to support starting them from a worker thread and make the preload scanner run on a worker thread.
Tony Gentilcore
Comment 12 2011-06-29 07:30:24 PDT
> According to this result I guess that the most time consuming part of the preload is not the tokenization but starting the requests. Do you think it is possible?

I didn't consider that starting a request would be expensive. It would be very interesting to demonstrate that this is the case. If it does turn out to be the case, then maybe it would be better for the embedder to handle any threading when initiating requests? Chromium, for instance, does IO on a separate thread and makes network requests in a separate process.

However, like Antti, I'm somewhat skeptical of the measurement. Why would initiating the requests be expensive when done from the preload scanner but not when done from the parser? I would really expect that tokenization is the only work being duplicated here. Again, I'm not opposed to this change; I just want to make sure I understand it completely.
Balazs Kelemen
Comment 13 2011-06-29 08:04:15 PDT
> However, like Antti, I'm somewhat skeptical of the measurement. Why would it be the case that initiating the requests are expensive when done from the preload scanner but not when done from the parser? I would really expect that tokenization is the only work being duplicated here.

I asked Zoltan the same thing :) The answer is: the parser does not initiate requests. Those resources will be requested on demand when somebody in WebCore tries to access them.
WebKit Review Bot
Comment 14 2011-06-29 08:07:58 PDT
Attachment 98909 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/WebCore.pro', u'Source/WebC..." exit_code: 1 Source/WebCore/html/parser/HTMLDocumentParser.h:35: Alphabetical sorting problem. [build/include_order] [4] Total errors found: 1 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Review Bot
Comment 15 2011-06-29 08:16:36 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/8966034
WebKit Review Bot
Comment 16 2011-06-29 08:22:59 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass cr-mac-ews (chromium): Output: http://queues.webkit.org/results/8960196
Collabora GTK+ EWS bot
Comment 17 2011-06-29 08:30:59 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass gtk-ews (gtk): Output: http://queues.webkit.org/results/8965064
WebKit Review Bot
Comment 18 2011-06-29 08:55:47 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/8906018
Antti Koivisto
Comment 19 2011-06-29 11:20:52 PDT
This would need much more convincing testing. Sampler profiles showing the preload scanner taking substantial CPU time would be a good start.

Anyway, I think the approach taken here takes us in the wrong direction. Instead we should:

- Make both the main parser and preload scanning use the same token stream.
- Move the token generation to a thread (if it is a win). This will help much more than just moving the preload scanning.
Zoltan Horvath
Comment 20 2011-06-29 23:54:09 PDT
Created attachment 99245 [details] disable preload scanner

This patch disables the preload scanner. Can somebody verify my results, please?
WebKit Review Bot
Comment 21 2011-06-30 00:24:44 PDT
Comment on attachment 99245 [details] disable preload scanner Attachment 99245 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/8963390 New failing tests: fast/preloader/input.html fast/preloader/noscript.html fast/preloader/image.html fast/preloader/script.html fast/preloader/scan-body-from-head-import.html fast/preloader/style.html http/tests/loading/preload-append-scan.php fast/preloader/link.html fast/preloader/scan-body-from-head-script.html
WebKit Review Bot
Comment 22 2011-06-30 00:24:50 PDT
Created attachment 99249 [details] Archive of layout-test-results from ec2-cr-linux-03 The attached test failures were seen while running run-webkit-tests on the chromium-ews. Bot: ec2-cr-linux-03 Port: Chromium Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Gabor Loki
Comment 23 2011-06-30 02:50:53 PDT
> Can somebody verify my results please?

I tried your patches on Gtk, and I can confirm the difference between trunk and the disabled preload scanner on PerformanceTests/Parser/html-parser.html.

Trunk:
avg 2801.2 median 2801 stdev 9.57914401186244 min 2791 max 2828

PreloadScanner disabled:
avg 37.6 median 37.5 stdev 0.8602325267042628 min 36 max 40

I can also confirm the same for the threaded preload scanner, but we should take care of the jobs which are dedicated to the worker thread. The current approach does not care about the lifetime of dedicated tasks. We should cancel the pending actions on the worker thread if there is no longer any need for them (for example, when the parser has finished its work).

PreloadScanner in a dedicated thread:
avg 227.05 median 174 stdev 143.21503936388805 min 158 max 692
Gyuyoung Kim
Comment 24 2011-06-30 05:19:46 PDT
Comment on attachment 98909 [details] patch for comments Attachment 98909 [details] did not pass efl-ews (efl): Output: http://queues.webkit.org/results/8959630
Balazs Kelemen
Comment 25 2011-07-04 07:42:17 PDT
> - Move the token generation to a thread (if it is a win). This will help much more than just moving the preload scanning.

I don't expect that it would be a win. What would the main thread do while waiting for the tokens? The reason the preload scanner is a friendly target for multithreading is that preloading is independent of the rest of the parsing process, so it can run in parallel while the main thread is waiting for resources.
Balazs Kelemen
Comment 26 2011-07-12 11:14:03 PDT
I started to investigate reusing the tokens in https://bugs.webkit.org/show_bug.cgi?id=64369. Actually, I think that running the preload scanner in parallel is more valuable than reusing the tokens. However, we should know how much performance we can gain by reusing the tokens.
Balazs Kelemen
Comment 27 2011-07-26 07:24:39 PDT
Created attachment 101999 [details] proposed patch

Similar design to the last patch, with the following improvements:

* lots of refactoring
* add a way to cancel pending tasks
* schedule preloading on the main thread asynchronously through the event system (via WTF::callOnMainThread) in order to initiate the load as soon as possible and minimize the extra work the parser needs to do for preload scanning

It's still not fully ready, since it's missing rebaselines for the preload scanning tests. Benchmark results are on their way; be patient. Feedback is greatly welcomed.
WebKit Review Bot
Comment 28 2011-07-26 07:27:47 PDT
Attachment 101999 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/CSSPreloadScanner.cpp:202: One line control clauses should not use braces. [whitespace/braces] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:6: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:9: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:12: "HTMLPreloadScanner.h" already included at Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:5 [build/include] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:13: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:26: More than one command on the same line [whitespace/newline] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:91: Missing space before ( in while( [whitespace/parens] [5] Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:104: One line control clauses should not use braces. [whitespace/braces] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.h:8: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.h:11: Alphabetical sorting problem. [build/include_order] [4] Source/WebCore/html/parser/HTMLPreloadScannerThread.h:33: The parameter name "task" adds no information, so it should be removed. [readability/parameter_name] [5] Total errors found: 11 in 16 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Review Bot
Comment 29 2011-07-26 07:29:43 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/9250377
Gyuyoung Kim
Comment 30 2011-07-26 07:30:23 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass efl-ews (efl): Output: http://queues.webkit.org/results/9249450
WebKit Review Bot
Comment 31 2011-07-26 07:44:20 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass cr-mac-ews (chromium): Output: http://queues.webkit.org/results/9245463
Gustavo Noronha (kov)
Comment 32 2011-07-26 08:54:28 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass gtk-ews (gtk): Output: http://queues.webkit.org/results/9244477
WebKit Review Bot
Comment 33 2011-07-26 12:25:08 PDT
Comment on attachment 101999 [details] proposed patch Attachment 101999 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/9245524
Balazs Kelemen
Comment 34 2011-07-27 09:30:12 PDT
The patch needs a solution for https://bugs.webkit.org/show_bug.cgi?id=65263 to not trigger an assertion.
Balazs Kelemen
Comment 35 2011-08-01 08:46:50 PDT
Sorry for the slow progress. Unfortunately, the benchmark results are not as good as expected, so I'm still tweaking it. The html-parser benchmark (in PerformanceTests/Parser) is encouraging, but page load time tests like Methanol are currently showing a small regression. I hope I can make it faster.
Antti Koivisto
Comment 36 2011-08-05 14:19:09 PDT
It is unlikely that threading something that takes very little CPU time is going to make page loading faster, especially since the actual triggering of the loads still has to happen on the WebKit thread. We are definitely not going to accept a patch that introduces more threading (and all the associated complexity) without showing useful performance gains. Your efforts would probably be better spent elsewhere.
Balazs Kelemen
Comment 37 2011-08-08 02:34:47 PDT
(In reply to comment #36)

> It is unlikely that threading something that takes very little CPU time is going to make page loading faster, especially since the actually triggering of the loads still has to happen in the webkit thread. We definitely are not going to accept a patch that introduces more threading (and all the associated complexity) without showing useful performance gains. Your efforts would probably be better spent elsewhere.

Your arguments are reasonable, but I still haven't given up :) Yes, it seems like the load time of typical web pages won't really be faster. However, doing the scan in parallel has the following advantages:

* It eliminates the worst case, which is blocking in the preload scanner when the resource we are waiting for has already arrived. This is the case with the html-parser benchmark, and that's why the patch speeds it up. Actually, this is the case with every page that refers to some script or CSS at the beginning and has a lot of text.
* It helps kick off the load a bit earlier, because we can give each chunk to the scanner as soon as it has arrived. Unfortunately this benefit cannot be shown by page load time tests.

I can support the first point with the current numbers (with a newer local version of the patch).

html-parser:

Ref: avg 1786.85 median 1781 stdev 25.16997218909866 min 1758 max 1844
Parallel: avg 1015.85 median 1013 stdev 6.174746958378132 min 1010 max 1026

Methanol with some popular sites:

ref 1157.39
parallel 1145.5

Methanol with some popular sites + PerformanceTests/Parser/resources/html5.html:

ref 40070
parallel 36824

The sites I used with Methanol were local copies of http://www.blogger.com/, http://www.conduit.com/, http://www.facebook.com/, http://www.qq.com/, http://twitter.com/, http://www.google.com/, http://www.microsoft.com/, http://www.ebay.com/ and http://www.wikipedia.org/.

My further plan is to run a web-page-replay based benchmark on Mac that would show the behaviour with (simulated) network latency.
Balazs Kelemen
Comment 38 2011-08-09 04:39:26 PDT
Created attachment 103346 [details] new patch

Unfortunately I could not set up a better benchmark (web-page-replay did not work for me on either Linux or Mac). If anybody could help in evaluating the performance, I would really appreciate it.
WebKit Review Bot
Comment 39 2011-08-09 04:44:06 PDT
Comment on attachment 103346 [details] new patch Attachment 103346 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/9333366
WebKit Review Bot
Comment 40 2011-08-09 04:58:21 PDT
Comment on attachment 103346 [details] new patch Attachment 103346 [details] did not pass cr-mac-ews (chromium): Output: http://queues.webkit.org/results/9336225
Balazs Kelemen
Comment 41 2011-10-10 01:58:04 PDT
Comment on attachment 103346 [details] new patch Outdated patch.
Balazs Kelemen
Comment 42 2011-10-10 02:00:49 PDT
The benefit doesn't seem to be valuable enough.
Filip Pizlo
Comment 43 2013-01-10 00:53:14 PST
(In reply to comment #42)

> The benefit doesn't seems to be valuable enough.

It would be great if you could summarize your findings. It's likely that we will want to explore these sorts of optimizations throughout the system. Probably a lot of the exploration we do will dead-end, like yours did, simply because this is a tough problem.

So I'm curious about the following:

- To your best estimate, what was the expected speed-up? The results you posted are for example 1157.39 (ref) versus 1145.5 (your concurrent version), which seems like a speed-up.
- As far as you know, what were the bottlenecks?

I know it's been a long time since you worked on this, but I think this is a pretty neat case study of taking some part of the code and attempting to make it concurrent. It would be cool to understand more precisely which parts of your optimization worked, and which parts didn't.
Filip Pizlo
Comment 44 2013-01-10 00:57:50 PST
Comment on attachment 103346 [details] new patch

View in context: https://bugs.webkit.org/attachment.cgi?id=103346&action=review

> Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:58
> +    while (OwnPtr<HTMLPreloadScannerThread::ScanTask> task = m_tasks.waitForMessage()) {
> +        Mutex& scanTasksMutex = task->preloadScanner->scanTasksMutex();
> +        if (!scanTasksMutex.tryLock())
> +            continue;
> +        scan(task.get());
> +        scanTasksMutex.unlock();
> +    }

This seems like a really bad idea. It looks like it will remove a task from the queue, but then, if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.

> Source/WebCore/html/parser/HTMLPreloadScannerThread.h:36
> +    MessageQueue<ScanTask> m_tasks;

I took a peek into MessageQueue, and it appears not to be particularly optimal. It pretty much requires everything to happen in lockstep: to add a task, you lock a lock, add things, signal, and unlock. When you do this and the worker thread was asleep, you pay the price of it being awoken. That's expensive. The locking itself is probably expensive, but that's not nearly as bad.
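The lockstep pattern being described can be sketched with standard primitives (a generic illustration using std::mutex and std::condition_variable, not WTF::MessageQueue itself): every append takes the lock and signals, and a sleeping consumer pays a wakeup per message.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal producer/consumer queue in the style discussed above:
// append locks, pushes, and signals; a sleeping worker thread is
// woken once per message, which is the cost Filip points out.
template <typename T>
class SimpleMessageQueue {
public:
    void append(T message)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex); // producer takes the lock
            m_queue.push_back(std::move(message));
        }
        m_condition.notify_one(); // wakes the worker if it was asleep
    }

    T waitForMessage()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Predicate form guards against spurious wakeups.
        m_condition.wait(lock, [this] { return !m_queue.empty(); });
        T message = std::move(m_queue.front());
        m_queue.pop_front();
        return message;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    std::deque<T> m_queue;
};
```

One common mitigation is batching: the producer appends several chunks before signaling, and the consumer drains the whole queue under a single lock acquisition, so a sleeping worker pays one wakeup per burst rather than one per message.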
Balazs Kelemen
Comment 45 2013-01-10 14:31:49 PST
(In reply to comment #43)

> It would be great if you could summarize your findings. It's likely that we will want to explore these sorts of optimizations throughout the system. Probably a lot of the exploration we will do will dead-end, like yours did, simply because this is a tough problem.

I'm glad that you are interested in this work.

> - To your best estimate, what was the expected speed-up? The results you posted are for example 1157.39 (ref) versus 1145.5 (your concurrent version), which seems like a speed-up.

It's a very tiny win, and as I remember I could not reproduce it every time, so we can say it is within the margin of measurement error. Generally, page load time did not improve with this work. The case where it was very beneficial is a page like the parser benchmark (PerformanceTests/Parser/html-parser.html), where the preload scanner is a time waster. I believe this worst-case scenario is really a bug in the preload scanner, and it should be fixed anyway; without threading, I could imagine yielding to the event loop after some amount of text has been processed. The other advantage could be that we can kick off loading resources earlier, but that is hard to measure.

> - As far as you know, what were the bottlenecks?
>
> I know it's been a long time since you worked on this, but I think that this is a pretty neat case study of taking some part of the code and attempting to make it concurrent. It would be cool to understand more precisely which parts of your optimization worked, and which parts didn't.

A serious problem is that we can only start loads on the main thread. If the main parser has been stopped because of JS execution, we cannot load the resources until it finishes. If we could have a loader on the thread it would help, but I guess that needs a lot of work.
Balazs Kelemen
Comment 46 2013-01-10 14:51:07 PST
(In reply to comment #44)

> > Source/WebCore/html/parser/HTMLPreloadScannerThread.cpp:58
> > +    while (OwnPtr<HTMLPreloadScannerThread::ScanTask> task = m_tasks.waitForMessage()) {
> > +        Mutex& scanTasksMutex = task->preloadScanner->scanTasksMutex();
> > +        if (!scanTasksMutex.tryLock())
> > +            continue;
> > +        scan(task.get());
> > +        scanTasksMutex.unlock();
> > +    }
>
> This seems like a really bad idea. This looks like it will remove a task from the queue, but then if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.

This condition can only be true if the parser is stopped or it is under destruction, so we don't want to process the task anyway.

> > Source/WebCore/html/parser/HTMLPreloadScannerThread.h:36
> > +    MessageQueue<ScanTask> m_tasks;
>
> I took a peek into MessageQueue, and it appears not to be particularly optimal. It pretty much requires everything happening in lockstep: to add a task, you lock a lock, add things, signal, and unlock. When you do this and the worker thread was asleep, you pay the price of it being awoken. That's expensive. The locking itself is probably expensive, but that's not nearly as bad.

I'm not sure how we could avoid that. Do you mean we should make sure that the thread never falls asleep, i.e. that it always has something to do? That seems tricky; I don't have any idea how to achieve it.
Filip Pizlo
Comment 47 2013-01-10 15:19:58 PST
(In reply to comment #46)

> > This seems like a really bad idea. This looks like it will remove a task from the queue, but then if it cannot acquire the scanTasksMutex, it will just drop the task on the floor.
>
> This condition can only be true if the parser is stopped or it is under destruction, so we don't want to process this task anyway.

Ah. I buy it.

> > I took a peek into MessageQueue, and it appears not to be particularly optimal. It pretty much requires everything happening in lockstep: to add a task, you lock a lock, add things, signal, and unlock. When you do this and the worker thread was asleep, you pay the price of it being awoken. That's expensive. The locking itself is probably expensive, but that's not nearly as bad.
>
> I'm not sure how could we avoid that. Do you mean we should make sure that the thread will never fall asleep, i.e. it always have something to do? This seems tricky, I don't have any idea how to achieve that.

It is tricky, and I'm also not sure how to achieve it. This is basically always the problem in trying to get a speed-up from making something concurrent: you need enough concurrency to keep your threads constantly fed with work. If we are sure that we cannot achieve this, then we should think twice about using concurrency as a performance optimization.

(Though it may still be a valid optimization for reducing worst-case latencies, if we believe that the task we have made concurrent risks stalling, say because of I/O or because of an algorithmic pathology. That doesn't appear to be the case here, since you're just moving non-I/O linear-time work off the main thread.)