Bug 79880

Summary: [WebSocket]Optimize the masking/unmasking operation by 64-bit data type
Product: WebKit Reporter: Li Yin <li.yin>
Component: WebCore Misc.Assignee: Nobody <webkit-unassigned>
Status: RESOLVED WONTFIX    
Severity: Normal CC: abarth, ap, barraclough, eric, ggaren, li.yin, tkent, webkit.review.bot, yutak, zherczeg
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch none

Description Li Yin 2012-02-28 23:36:22 PST
[WebSocket]Optimize the masking/umasking operation by SSE2
Comment 1 Li Yin 2012-02-28 23:41:31 PST
Created attachment 129401 [details]
Patch
Comment 2 Li Yin 2012-02-28 23:46:41 PST
Using the SSE2 instruction to do the xor operation, the performance has a great promotion, when the data size reaches 1KB or 1MB.
Comment 3 Li Yin 2012-02-28 23:51:17 PST
If the masking operation takes a long time, it will block the UI, so the optimization can decrease the waiting time of web UI.
Comment 4 Li Yin 2012-03-01 18:29:26 PST
Any comments about this patch?
Comment 5 Li Yin 2012-03-04 22:19:34 PST
Any comments about this patch?
Comment 6 Eric Seidel (no email) 2012-03-04 22:37:41 PST
Really?  WebSockets, needs SSE optimization?  I can't believe that this is that performance sensitive.

Please include a benchmark which demonstrates that this code is hot enough that we need SSE here.
Comment 7 Eric Seidel (no email) 2012-03-04 22:38:00 PST
Comment on attachment 129401 [details]
Patch

r- awaiting benchmark to justify this need.
Comment 8 Li Yin 2012-03-04 23:16:59 PST
JavaScript code from benchmark.

var NUM = 10*1024*1024;
var data = "";
var i=0;
var wsocket = new WebSocket("ws://127.0.0.1:8880/websocket/tests/hixie76/echo");
for(i=0;i<NUM;i++){
   data += "A";
}
wsocket.onopen = function(){
   var start = (new Date()).getTime();
   wsocket.send(data);
   var end = (new Date()).getTime();
   console.log("Consumed time:"+ (end - start) + "ms");
};
Comment 9 Li Yin 2012-03-04 23:21:10 PST
The total efficiency of send function is 1.5 times faster when the data size reaches 10MB, and about 1.3 times when the size is 1MB.
Comment 10 Li Yin 2012-03-04 23:27:13 PST
Although the total performance can't be improved so much as the single masking/unmasking operation, the patch indeed decreases the waiting time of web ui to some extent.
Comment 11 Yuta Kitamura 2012-03-05 10:20:37 PST
I'm also surprised to see this patch, but it's somewhat understandable that the frame masking code is a hot path. I'm not an expert of SSE2 operations so I'm not qualified to review the patch, though.

A few thoughts:
* Current code masks frames for each byte (char). Have you tried to mask frames for each machine word (32-bit or 64-bit)? I believe that would improve the performance, even without SSE.

* It may be a good idea to add some WebSocket tests under PerformanceTests/.
Comment 12 Li Yin 2012-03-05 20:24:30 PST
Created attachment 130276 [details]
Patch
Comment 13 Li Yin 2012-03-05 20:30:45 PST
(In reply to comment #11)
> I'm also surprised to see this patch, but it's somewhat understandable that the frame masking code is a hot path. I'm not an expert of SSE2 operations so I'm not qualified to review the patch, though.
> 
> A few thoughts:
> * Current code masks frames for each byte (char). Have you tried to mask frames for each machine word (32-bit or 64-bit)? I believe that would improve the performance, even without SSE.

Yeah, the performance can be improved using 64-bit data, almost 7~8 times faster than the original method.
Comment 14 Li Yin 2012-03-06 00:37:47 PST
Created attachment 130317 [details]
Patch
Comment 15 Li Yin 2012-03-06 00:44:16 PST
Add the benchmark under PerformanceTests/, please check it, thanks.
Comment 16 Zoltan Herczeg 2012-03-06 02:24:55 PST
Comment on attachment 130317 [details]
Patch

In general this looks well, and probably works well no other machines like ARM with NEON.

What I dont like is &base[offset] forms, please use base + offset instead of them.

View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review

> Source/WebCore/ChangeLog:8
> +        No new tests. (OOPS!)

Remove this line. And please describe why is a win here as well. And add write some perf results.

> Source/WebCore/Modules/websockets/WebSocketChannel.cpp:113
> +static void maskFrameDataByUint64(char *src, size_t dataLength, const char *mask)

Please don't abbreaviate. Use source instead of src. Or buffer since it is also a destination...
And add an "inline" keyword. Otherwise this is probably not a win.

> Source/WebCore/Modules/websockets/WebSocketChannel.cpp:119
> +    if (dataLength > 0) {

WebKit prefers early return. So use:

if (datalength == 0)
    return;

> Source/WebCore/Modules/websockets/WebSocketChannel.cpp:123
> +            maskAlignedArray[i] = mask[index % maskingKeyWidthInBytes];

Is maskingKeyWidthInBytes is <= 8 and power of 2 all the time? We should add a COMPILE_ASSERT for it!

> Source/WebCore/Modules/websockets/WebSocketChannel.cpp:133
> +        for (size_t i = 0; i < dataLength % 8; i++)
> +             pSrc[i] ^= maskAlignedArray[i];

Just thinking about these fors...

dataLength &= ~0x7;
while (dataLength-- > 0)
    *pSrc++ ^= *maskAlignedArray++;

I think compiler like this better.
Comment 17 Yuta Kitamura 2012-03-06 09:59:49 PST
Comment on attachment 130317 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review

> PerformanceTests/websocket/sendDataByWebSocket.html:4
> +</head>

Performance test should use PerformanceTests/resources/runners.js so performance metrics can be trackable. Also, you probably need to modify run-perf-tests to start pywebsocket server before running WebSocket tests, so there are a lot more to do before adding any test.

I recommend splitting performance tests from this patch and opening a new bug about WebSocket performance tests.
Comment 18 Zoltan Herczeg 2012-03-06 12:23:34 PST
You still need to convice reviewers that this patch is useful and your algorithm gives us perf boost. Otherwise the increased code complexity (maintenance cost) does not worth it. We worked a lot of such algorithm before and unfortunately many of them are turned out to be not good enough. Many times the memory reads and writes are the bottleneck for such simple algorithms so an optimized version version gives only a minor speedup, but makes the code difficult to maintain. So please tell us your measurements, and also mention them in the changelog.
Comment 19 Li Yin 2012-03-06 19:16:47 PST
Created attachment 130519 [details]
Patch
Comment 20 Li Yin 2012-03-06 19:20:30 PST
(In reply to comment #18)
> You still need to convice reviewers that this patch is useful and your algorithm gives us perf boost. Otherwise the increased code complexity (maintenance cost) does not worth it. We worked a lot of such algorithm before and unfortunately many of them are turned out to be not good enough. Many times the memory reads and writes are the bottleneck for such simple algorithms so an optimized version version gives only a minor speedup, but makes the code difficult to maintain. So please tell us your measurements, and also mention them in the changelog.

Thanks for your review and good advice, I have updated the patch already, and add the test result in the ChangeLog, please have a check again.
Comment 21 Li Yin 2012-03-06 20:12:53 PST
(In reply to comment #17)
> (From update of attachment 130317 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review
> 
> > PerformanceTests/websocket/sendDataByWebSocket.html:4
> > +</head>
> 
> Performance test should use PerformanceTests/resources/runners.js so performance metrics can be trackable. Also, you probably need to modify run-perf-tests to start pywebsocket server before running WebSocket tests, so there are a lot more to do before adding any test.
> 
> I recommend splitting performance tests from this patch and opening a new bug about WebSocket performance tests.

File a new bug to add the WebSocket performance tests.
Bug 80476 - [WebSocket]Add the performance test case to track the efficiency of sending data.
Comment 22 Li Yin 2012-09-01 21:34:49 PDT
It is not very very hot spot compared with bulk data in JS parsing.
So mark it to be WONTFIX.