RESOLVED WONTFIX 79880
[WebSocket]Optimize the masking/unmasking operation by 64-bit data type
https://bugs.webkit.org/show_bug.cgi?id=79880
Summary [WebSocket]Optimize the masking/unmasking operation by 64-bit data type
Li Yin
Reported 2012-02-28 23:36:22 PST
[WebSocket]Optimize the masking/umasking operation by SSE2
Attachments
Patch (3.79 KB, patch)
2012-02-28 23:41 PST, Li Yin
no flags
Patch (3.47 KB, patch)
2012-03-05 20:24 PST, Li Yin
no flags
Patch (5.16 KB, patch)
2012-03-06 00:37 PST, Li Yin
no flags
Patch (3.58 KB, patch)
2012-03-06 19:16 PST, Li Yin
no flags
Li Yin
Comment 1 2012-02-28 23:41:31 PST
Li Yin
Comment 2 2012-02-28 23:46:41 PST
Using the SSE2 instruction to do the xor operation, the performance has a great promotion, when the data size reaches 1KB or 1MB.
Li Yin
Comment 3 2012-02-28 23:51:17 PST
If the masking operation takes a long time, it will block the UI, so the optimization can decrease the waiting time of web UI.
Li Yin
Comment 4 2012-03-01 18:29:26 PST
Any comments about this patch?
Li Yin
Comment 5 2012-03-04 22:19:34 PST
Any comments about this patch?
Eric Seidel (no email)
Comment 6 2012-03-04 22:37:41 PST
Really? WebSockets, needs SSE optimization? I can't believe that this is that performance sensitive. Please include a benchmark which demonstrates that this code is hot enough that we need SSE here.
Eric Seidel (no email)
Comment 7 2012-03-04 22:38:00 PST
Comment on attachment 129401 [details] Patch r- awaiting benchmark to justify this need.
Li Yin
Comment 8 2012-03-04 23:16:59 PST
JavaScript code from benchmark. var NUM = 10*1024*1024; var data = ""; var i=0; var wsocket = new WebSocket("ws://127.0.0.1:8880/websocket/tests/hixie76/echo"); for(i=0;i<NUM;i++){ data += "A"; } wsocket.onopen = function(){ var start = (new Date()).getTime(); wsocket.send(data); var end = (new Date()).getTime(); console.log("Consumed time:"+ (end - start) + "ms"); };
Li Yin
Comment 9 2012-03-04 23:21:10 PST
The total efficiency of send function is 1.5 times faster when the data size reaches 10MB, and about 1.3 times when the size is 1MB.
Li Yin
Comment 10 2012-03-04 23:27:13 PST
Although the total performance can't be improved so much as the single masking/unmasking operation, the patch indeed decreases the waiting time of web ui to some extent.
Yuta Kitamura
Comment 11 2012-03-05 10:20:37 PST
I'm also surprised to see this patch, but it's somewhat understandable that the frame masking code is a hot path. I'm not an expert of SSE2 operations so I'm not qualified to review the patch, though. A few thoughts: * Current code masks frames for each byte (char). Have you tried to mask frames for each machine word (32-bit or 64-bit)? I believe that would improve the performance, even without SSE. * It may be a good idea to add some WebSocket tests under PerformanceTests/.
Li Yin
Comment 12 2012-03-05 20:24:30 PST
Li Yin
Comment 13 2012-03-05 20:30:45 PST
(In reply to comment #11) > I'm also surprised to see this patch, but it's somewhat understandable that the frame masking code is a hot path. I'm not an expert of SSE2 operations so I'm not qualified to review the patch, though. > > A few thoughts: > * Current code masks frames for each byte (char). Have you tried to mask frames for each machine word (32-bit or 64-bit)? I believe that would improve the performance, even without SSE. Yeah, the performance can be improved using 64-bit data, almost 7~8 times faster than the original method.
Li Yin
Comment 14 2012-03-06 00:37:47 PST
Li Yin
Comment 15 2012-03-06 00:44:16 PST
Add the benchmark under PerformanceTests/, please check it, thanks.
Zoltan Herczeg
Comment 16 2012-03-06 02:24:55 PST
Comment on attachment 130317 [details] Patch In general this looks well, and probably works well no other machines like ARM with NEON. What I dont like is &base[offset] forms, please use base + offset instead of them. View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review > Source/WebCore/ChangeLog:8 > + No new tests. (OOPS!) Remove this line. And please describe why is a win here as well. And add write some perf results. > Source/WebCore/Modules/websockets/WebSocketChannel.cpp:113 > +static void maskFrameDataByUint64(char *src, size_t dataLength, const char *mask) Please don't abbreaviate. Use source instead of src. Or buffer since it is also a destination... And add an "inline" keyword. Otherwise this is probably not a win. > Source/WebCore/Modules/websockets/WebSocketChannel.cpp:119 > + if (dataLength > 0) { WebKit prefers early return. So use: if (datalength == 0) return; > Source/WebCore/Modules/websockets/WebSocketChannel.cpp:123 > + maskAlignedArray[i] = mask[index % maskingKeyWidthInBytes]; Is maskingKeyWidthInBytes is <= 8 and power of 2 all the time? We should add a COMPILE_ASSERT for it! > Source/WebCore/Modules/websockets/WebSocketChannel.cpp:133 > + for (size_t i = 0; i < dataLength % 8; i++) > + pSrc[i] ^= maskAlignedArray[i]; Just thinking about these fors... dataLength &= ~0x7; while (dataLength-- > 0) *pSrc++ ^= *maskAlignedArray++; I think compiler like this better.
Yuta Kitamura
Comment 17 2012-03-06 09:59:49 PST
Comment on attachment 130317 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review > PerformanceTests/websocket/sendDataByWebSocket.html:4 > +</head> Performance test should use PerformanceTests/resources/runners.js so performance metrics can be trackable. Also, you probably need to modify run-perf-tests to start pywebsocket server before running WebSocket tests, so there are a lot more to do before adding any test. I recommend splitting performance tests from this patch and opening a new bug about WebSocket performance tests.
Zoltan Herczeg
Comment 18 2012-03-06 12:23:34 PST
You still need to convice reviewers that this patch is useful and your algorithm gives us perf boost. Otherwise the increased code complexity (maintenance cost) does not worth it. We worked a lot of such algorithm before and unfortunately many of them are turned out to be not good enough. Many times the memory reads and writes are the bottleneck for such simple algorithms so an optimized version version gives only a minor speedup, but makes the code difficult to maintain. So please tell us your measurements, and also mention them in the changelog.
Li Yin
Comment 19 2012-03-06 19:16:47 PST
Li Yin
Comment 20 2012-03-06 19:20:30 PST
(In reply to comment #18) > You still need to convice reviewers that this patch is useful and your algorithm gives us perf boost. Otherwise the increased code complexity (maintenance cost) does not worth it. We worked a lot of such algorithm before and unfortunately many of them are turned out to be not good enough. Many times the memory reads and writes are the bottleneck for such simple algorithms so an optimized version version gives only a minor speedup, but makes the code difficult to maintain. So please tell us your measurements, and also mention them in the changelog. Thanks for your review and good advice, I have updated the patch already, and add the test result in the ChangeLog, please have a check again.
Li Yin
Comment 21 2012-03-06 20:12:53 PST
(In reply to comment #17) > (From update of attachment 130317 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review > > > PerformanceTests/websocket/sendDataByWebSocket.html:4 > > +</head> > > Performance test should use PerformanceTests/resources/runners.js so performance metrics can be trackable. Also, you probably need to modify run-perf-tests to start pywebsocket server before running WebSocket tests, so there are a lot more to do before adding any test. > > I recommend splitting performance tests from this patch and opening a new bug about WebSocket performance tests. File a new bug to add the WebSocket performance tests. Bug 80476 - [WebSocket]Add the performance test case to track the efficiency of sending data.
Li Yin
Comment 22 2012-09-01 21:34:49 PDT
It is not very very hot spot compared with bulk data in JS parsing. So mark it to be WONTFIX.
Note You need to log in before you can comment on or make changes to this bug.