[WebSocket]Optimize the masking/umasking operation by SSE2
Created attachment 129401 [details] Patch
Using the SSE2 instruction to do the xor operation, the performance has a great promotion, when the data size reaches 1KB or 1MB.
If the masking operation takes a long time, it will block the UI, so the optimization can decrease the waiting time of web UI.
Any comments about this patch?
Really? WebSockets, needs SSE optimization? I can't believe that this is that performance sensitive. Please include a benchmark which demonstrates that this code is hot enough that we need SSE here.
Comment on attachment 129401 [details] Patch r- awaiting benchmark to justify this need.
JavaScript code from benchmark. var NUM = 10*1024*1024; var data = ""; var i=0; var wsocket = new WebSocket("ws://127.0.0.1:8880/websocket/tests/hixie76/echo"); for(i=0;i<NUM;i++){ data += "A"; } wsocket.onopen = function(){ var start = (new Date()).getTime(); wsocket.send(data); var end = (new Date()).getTime(); console.log("Consumed time:"+ (end - start) + "ms"); };
The total efficiency of send function is 1.5 times faster when the data size reaches 10MB, and about 1.3 times when the size is 1MB.
Although the total performance can't be improved so much as the single masking/unmasking operation, the patch indeed decreases the waiting time of web ui to some extent.
I'm also surprised to see this patch, but it's somewhat understandable that the frame masking code is a hot path. I'm not an expert of SSE2 operations so I'm not qualified to review the patch, though. A few thoughts: * Current code masks frames for each byte (char). Have you tried to mask frames for each machine word (32-bit or 64-bit)? I believe that would improve the performance, even without SSE. * It may be a good idea to add some WebSocket tests under PerformanceTests/.
Created attachment 130276 [details] Patch
(In reply to comment #11) > I'm also surprised to see this patch, but it's somewhat understandable that the frame masking code is a hot path. I'm not an expert of SSE2 operations so I'm not qualified to review the patch, though. > > A few thoughts: > * Current code masks frames for each byte (char). Have you tried to mask frames for each machine word (32-bit or 64-bit)? I believe that would improve the performance, even without SSE. Yeah, the performance can be improved using 64-bit data, almost 7~8 times faster than the original method.
Created attachment 130317 [details] Patch
Add the benchmark under PerformanceTests/, please check it, thanks.
Comment on attachment 130317 [details] Patch In general this looks well, and probably works well no other machines like ARM with NEON. What I dont like is &base[offset] forms, please use base + offset instead of them. View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review > Source/WebCore/ChangeLog:8 > + No new tests. (OOPS!) Remove this line. And please describe why is a win here as well. And add write some perf results. > Source/WebCore/Modules/websockets/WebSocketChannel.cpp:113 > +static void maskFrameDataByUint64(char *src, size_t dataLength, const char *mask) Please don't abbreaviate. Use source instead of src. Or buffer since it is also a destination... And add an "inline" keyword. Otherwise this is probably not a win. > Source/WebCore/Modules/websockets/WebSocketChannel.cpp:119 > + if (dataLength > 0) { WebKit prefers early return. So use: if (datalength == 0) return; > Source/WebCore/Modules/websockets/WebSocketChannel.cpp:123 > + maskAlignedArray[i] = mask[index % maskingKeyWidthInBytes]; Is maskingKeyWidthInBytes is <= 8 and power of 2 all the time? We should add a COMPILE_ASSERT for it! > Source/WebCore/Modules/websockets/WebSocketChannel.cpp:133 > + for (size_t i = 0; i < dataLength % 8; i++) > + pSrc[i] ^= maskAlignedArray[i]; Just thinking about these fors... dataLength &= ~0x7; while (dataLength-- > 0) *pSrc++ ^= *maskAlignedArray++; I think compiler like this better.
Comment on attachment 130317 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review > PerformanceTests/websocket/sendDataByWebSocket.html:4 > +</head> Performance test should use PerformanceTests/resources/runners.js so performance metrics can be trackable. Also, you probably need to modify run-perf-tests to start pywebsocket server before running WebSocket tests, so there are a lot more to do before adding any test. I recommend splitting performance tests from this patch and opening a new bug about WebSocket performance tests.
You still need to convice reviewers that this patch is useful and your algorithm gives us perf boost. Otherwise the increased code complexity (maintenance cost) does not worth it. We worked a lot of such algorithm before and unfortunately many of them are turned out to be not good enough. Many times the memory reads and writes are the bottleneck for such simple algorithms so an optimized version version gives only a minor speedup, but makes the code difficult to maintain. So please tell us your measurements, and also mention them in the changelog.
Created attachment 130519 [details] Patch
(In reply to comment #18) > You still need to convice reviewers that this patch is useful and your algorithm gives us perf boost. Otherwise the increased code complexity (maintenance cost) does not worth it. We worked a lot of such algorithm before and unfortunately many of them are turned out to be not good enough. Many times the memory reads and writes are the bottleneck for such simple algorithms so an optimized version version gives only a minor speedup, but makes the code difficult to maintain. So please tell us your measurements, and also mention them in the changelog. Thanks for your review and good advice, I have updated the patch already, and add the test result in the ChangeLog, please have a check again.
(In reply to comment #17) > (From update of attachment 130317 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=130317&action=review > > > PerformanceTests/websocket/sendDataByWebSocket.html:4 > > +</head> > > Performance test should use PerformanceTests/resources/runners.js so performance metrics can be trackable. Also, you probably need to modify run-perf-tests to start pywebsocket server before running WebSocket tests, so there are a lot more to do before adding any test. > > I recommend splitting performance tests from this patch and opening a new bug about WebSocket performance tests. File a new bug to add the WebSocket performance tests. Bug 80476 - [WebSocket]Add the performance test case to track the efficiency of sending data.
It is not very very hot spot compared with bulk data in JS parsing. So mark it to be WONTFIX.