I haven't tested this with WebKit trunk so far, only with https://github.com/annulen/webkit and WebKit2GTK 2.16.3 - but since the code in trunk looks the same, I'm guessing this applies too. After updating my packages on Archlinux, I noticed a lot of segfaults when navigating to pages, or e.g. when submitting a comment on Reddit. I was able to minimize it to this example which crashes when run in jsc: s = 'xxxxxxxxxxxxxxAxxxxxxxxxxxxxxxxxxxxA–'; s.replace(/A/g, 'b') (note the non-ASCII – at the end) Stack (with the original failure on reddit, with QtWebKit): #0 0x00007fffed413b65 in WTF::StringImpl::copyChars<char16_t>(char16_t*, char16_t const*, unsigned int) (destination=0x7ffefa9e0250 u"<p ", source=0x7fff1baf0256 u"<p class=\"parent\"><a name=\"dircqkb\"></a></p><div class=\"midcol likes\" ><div class=\"arrow upmod login-required access-required\" data-event-action=\"upvote\" role=\"button\""..., numCharacters=20) at /tmp/makepkg/qt5-webkit-ng-debug/src/webkit-qtwebkit-5.212.0-alpha/Source/WTF/wtf/text/StringImpl.h:634 #1 0x00007fffed420926 in JSC::jsSpliceSubstringsWithSeparators (separatorCount=251, separators=<optimized out>, rangeCount=<optimized out>, substringRanges=<optimized out>, source=..., sourceVal=<optimized out>, exec=0x7fffffffc6f0, this=<optimized out>) at /tmp/makepkg/qt5-webkit-ng-debug/src/webkit-qtwebkit-5.212.0-alpha/Source/JavaScriptCore/runtime/StringPrototype.cpp:431 #2 0x00007fffed420926 in JSC::replaceUsingRegExpSearch (replaceValue=..., replacementString=..., callType=<optimized out>, callData=..., searchValue=..., string=<optimized out>, exec=<optimized out>) at /tmp/makepkg/qt5-webkit-ng-debug/src/webkit-qtwebkit-5.212.0-alpha/Source/JavaScriptCore/runtime/StringPrototype.cpp:666 #3 0x00007fffed420926 in JSC::replaceUsingRegExpSearch (replaceValue=..., searchValue=..., string=<optimized out>, exec=<optimized out>) at /tmp/makepkg/qt5-webkit-ng-debug/src/webkit-qtwebkit-5.212.0-alpha/Source/JavaScriptCore/runtime/StringPrototype.cpp:708 #4 0x00007fffed420926 in JSC::replace(JSC::ExecState*, JSC::JSString*, JSC::JSValue, JSC::JSValue) (replaceValue=..., searchValue=..., string=<optimized out>, exec=<optimized out>) at /tmp/makepkg/qt5-webkit-ng-debug/src/webkit-qtwebkit-5.212.0-alpha/Source/JavaScriptCore/runtime/StringPrototype.cpp:829 #5 0x00007fffed420926 in JSC::replace(JSC::ExecState*, JSC::JSValue, JSC::JSValue, JSC::JSValue) (replaceValue=..., searchValue=..., thisValue=..., exec=<optimized out>) at /tmp/makepkg/qt5-webkit-ng-debug/src/webkit-qtwebkit-5.212.0-alpha/Source/JavaScriptCore/runtime/StringPrototype.cpp:841 #6 0x00007fffed420926 in JSC::stringProtoFuncReplace(JSC::ExecState*) (exec=<optimized out>) at /tmp/makepkg/qt5-webkit-ng-debug/src/webkit-qtwebkit-5.212.0-alpha/Source/JavaScriptCore/runtime/StringPrototype.cpp:846 #7 0x00007fff1bfff0c8 in () #8 0x00007fffffffc7b0 in () #9 0x00007fffed5c930d in llint_entry () at /usr/lib/libQt5WebKit.so.5 Stack (with jsc and WebKit2GTK, release build): #0 0x00007ffff7ac2c95 in void WTF::StringImpl::copyChars<char16_t>(char16_t*, char16_t const*, unsigned int) () at /usr/lib/libjavascriptcoregtk-4.0.so.18 #1 0x00007ffff7ac1d5e in JSC::stringProtoFuncReplaceUsingRegExp(JSC::ExecState*) () at /usr/lib/libjavascriptcoregtk-4.0.so.18 #2 0x00007fffb21ff028 in () #3 0x00007fffffffcfc0 in () #4 0x00007ffff77a4e42 in () at /usr/lib/libjavascriptcoregtk-4.0.so.18 #5 0x0000000000000000 in () So I looked into the implementation of WTF::StringImpl::copyChars: template <typename T> static void copyChars(T* destination, const T* source, unsigned numCharacters) { if (numCharacters == 1) { *destination = *source; return; } if (numCharacters <= s_copyCharsInlineCutOff) { unsigned i = 0; #if (CPU(X86) || CPU(X86_64)) const unsigned charsPerInt = sizeof(uint32_t) / sizeof(T); if (numCharacters > charsPerInt) { unsigned stopCount = numCharacters & ~(charsPerInt - 1); const uint32_t* srcCharacters = reinterpret_cast<const uint32_t*>(source); // aliasing uint32_t* destCharacters = reinterpret_cast<uint32_t*>(destination); // aliasing for (unsigned j = 0; i < stopCount; i += charsPerInt, ++j) destCharacters[j] = srcCharacters[j]; // CRASH } #endif for (; i < numCharacters; ++i) destination[i] = source[i]; } else memcpy(destination, source, numCharacters * sizeof(T)); } Removing the optimized #if block made things run fine. With some help from people in the #gcc IRC channel, I figured out that WebKit violates strict aliasing rules with the casts marked above (as T is an uint16_t, incompatible with uint32_t): https://www.cocoawithlove.com/2008/04/using-pointers-to-recast-in-c-is-bad.html http://dbp-consulting.com/tutorials/StrictAliasing.html WebKit should be built with -fno-strict-aliasing though, and at least with QtWebKit I verified this was the case. From what I understand, that should make it possible to do this kind of thing, yet with GCC 7.1.1 I get a crash there. I haven't looked at the generated assembler yet (and not sure I'll get to it anytime soon), but this might be a GCC 7 bug? But even if it's fixed there, I suppose WebKit should work around it in some way. I wonder whether the optimized inline copyChars is really needed nowadays (it was added in 2011, https://github.com/annulen/webkit/commit/0cb990be04a30870abb2a01e7f9fe5eea08b25a9 ) - I'd expect an inline memcpy() to generate equal if not better code nowadays, no? Downstream QtWebKit bug where I investigated this: https://github.com/annulen/webkit/issues/562
I've run JSC benchmarks on Linux with GCC 7.1, performance results were neutral after change of copyChars to do just memcpy
So then this essentially boils down to whether it should be replaced by a simple memcpy() only for GCC >= 7, or for other compilers as well? (Unless there's another solution to this I'm missing - right now the only option I see is to get rid of the "optimized" part)
(In reply to Konstantin Tokarev from comment #1) > I've run JSC benchmarks on Linux with GCC 7.1, performance results were > neutral after change of copyChars to do just memcpy I like to use memcpy if it does not hurt performance. Just out of curiosity, I wonder if we can use bitwise_cast here instead of reinterpret_cast. Maybe, JF knows much about strict-aliasing, type punning and memcpy.
(In reply to Yusuke Suzuki from comment #3) > (In reply to Konstantin Tokarev from comment #1) > > I've run JSC benchmarks on Linux with GCC 7.1, performance results were > > neutral after change of copyChars to do just memcpy > > I like to use memcpy if it does not hurt performance. I've checked just one platform/compiler combination so far. I guess that initially this code path might have been added because of deficiency in old version of memcpy in macOS' libc, or gcc 4.2. Could you recommend any specific benchmark that is signficantly affected by performance of copyChars? > > Just out of curiosity, I wonder if we can use bitwise_cast here instead of > reinterpret_cast. > > Maybe, JF knows much about strict-aliasing, type punning and memcpy.
(In reply to Yusuke Suzuki from comment #3) > (In reply to Konstantin Tokarev from comment #1) > > I've run JSC benchmarks on Linux with GCC 7.1, performance results were > > neutral after change of copyChars to do just memcpy > > I like to use memcpy if it does not hurt performance. > > Just out of curiosity, I wonder if we can use bitwise_cast here instead of > reinterpret_cast. > > Maybe, JF knows much about strict-aliasing, type punning and memcpy. bitwise_cast wouldn't help here, the pointers would still alias. Making the pointer type char* would fix the problem, because char* is magical in C++ and ca alias everything. memcpy sounds good if per isn't hurt. Can source ever be the same as destination though? I want to make sure we don't need memmove instead.
If it's a char*, this would be equal to the "naive" bytewise copying below the #ifdef though, right? If it's more than 20 chars (> s_copyCharsInlineCutOff), memcpy is already used, so if anything used copyChars with the same source/destination, that'd have to be only with strings guaranteed to be shorter, which seems quite a stretch I think?
(In reply to Florian Bruhin from comment #6) > If it's a char*, this would be equal to the "naive" bytewise copying below > the #ifdef though, right? Correct. > If it's more than 20 chars (> s_copyCharsInlineCutOff), memcpy is already > used, so if anything used copyChars with the same source/destination, that'd > have to be only with strings guaranteed to be shorter, which seems quite a > stretch I think? I'm not familiar with this part of the code, so it doesn't seem like a stretch to me :)
For the record, optimization that avoids memcpy was initially added by Darin in https://bugs.webkit.org/show_bug.cgi?id=20671 (2008)
(In reply to Konstantin Tokarev from comment #8) > For the record, optimization that avoids memcpy was initially added by Darin > in https://bugs.webkit.org/show_bug.cgi?id=20671 (2008) OK, that note is very helpful because it clarifies that this optimization is driven by SunSpider.
(In reply to Yusuke Suzuki from comment #9) > (In reply to Konstantin Tokarev from comment #8) > > For the record, optimization that avoids memcpy was initially added by Darin > > in https://bugs.webkit.org/show_bug.cgi?id=20671 (2008) > > OK, that note is very helpful because it clarifies that this optimization is > driven by SunSpider. I've just tested it on my Linux box. VMs tested: "baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/string-copy-tot/Release/bin/jsc "patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/string-copy/Release/bin/jsc Collected 100 samples per benchmark/VM, with 100 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. baseline patched string-base64 7.7544+-0.1761 7.6138+-0.2071 might be 1.0185x faster string-fasta 10.5429+-0.2746 ? 10.7500+-0.2669 ? might be 1.0196x slower string-tagcloud 14.8588+-0.2828 14.8039+-0.3039 string-unpack-code 36.1769+-0.4251 35.3397+-0.5398 might be 1.0237x faster string-validate-input 8.5182+-0.2206 8.3514+-0.2179 might be 1.0200x faster <arithmetic> 15.5702+-0.1239 15.3718+-0.1452 might be 1.0129x faster It seems there is no explicit difference right now. The compiler is GCC 5.4.0 I would like to see the difference in macOS's clang.
It’s great news if memcpy is now fast enough. I’d love to see my old awkward optimization removed if it no longer does any good. In fact, I’m not sure we need a copyChars function at all. We could use memcpy directly as we once did, or even better ICU has u_memcpy if we want something that prevents the need to do "* sizeof(UChar)" at each call site. As long as u_memcpy has great performance too; presumably it expands into memcpy as an inline.
Ping?
FWIW Archlinux now added a patch which just uses memcpy to its qt5-webkit package: https://git.archlinux.org/svntogit/packages.git/tree/trunk/qt5-webkit-gcc7.patch?h=packages/qt5-webkit Meanwhile, a Fedora packager told me they didn't notice any issues with GCC 7.1.1... no idea why. :-/
Created attachment 314599 [details] Patch
Comment on attachment 314599 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=314599&action=review > Source/WTF/wtf/text/StringImpl.h:631 > } We still keep StringImpl::copyChars() because it is very useful to copy characters based on the number of chars. If we use memcpy directly, we need to consider sizeof(UChar) and sizeof(LChar) carefully. And StringImpl::copyChars(UChar*, const LChar*, unsigned) override is also useful.
Comment on attachment 314599 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=314599&action=review > Source/WTF/wtf/text/StringImpl.h:-631 > - if (numCharacters <= s_copyCharsInlineCutOff) { I think we should also remove s_copyCharsInlineCutOff definition
Comment on attachment 314599 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=314599&action=review >> Source/WTF/wtf/text/StringImpl.h:-631 >> - if (numCharacters <= s_copyCharsInlineCutOff) { > > I think we should also remove s_copyCharsInlineCutOff definition Nice catch! I'll update the patch.
Created attachment 314600 [details] Patch
Did anyone run numbers on MacOS?
(In reply to JF Bastien from comment #19) > Did anyone run numbers on MacOS? I don't measure it yet. If nobody runs it until tomorrow's evening (JST), possibly I can run it ;) (my macOS box is placed at my office).
Ok I can give it a try later today.
On my iMac: before after 3d-cube 3.8573+-0.0502 3.7993+-0.0415 might be 1.0153x faster 3d-morph 4.2964+-0.0214 4.2933+-0.0245 3d-raytrace 3.9897+-0.0476 3.9794+-0.0589 access-binary-trees 1.8148+-0.0242 1.8113+-0.0155 access-fannkuch 4.5038+-0.0384 4.5000+-0.0476 access-nbody 2.2008+-0.0196 ? 2.2017+-0.0167 ? access-nsieve 1.9470+-0.0433 ? 1.9663+-0.0509 ? bitops-3bit-bits-in-byte 1.0725+-0.0117 1.0721+-0.0118 bitops-bits-in-byte 2.1698+-0.0342 2.1535+-0.0250 bitops-bitwise-and 1.8784+-0.0138 ? 1.8907+-0.0219 ? bitops-nsieve-bits 2.7890+-0.0119 ? 2.8228+-0.0349 ? might be 1.0121x slower controlflow-recursive 2.0770+-0.0207 2.0688+-0.0231 crypto-aes 3.5230+-0.0352 ? 3.5641+-0.0508 ? might be 1.0117x slower crypto-md5 2.1610+-0.0176 2.1448+-0.0139 crypto-sha1 2.2058+-0.0256 ? 2.2185+-0.0236 ? date-format-tofte 6.0776+-0.0427 ? 6.1917+-0.0917 ? might be 1.0188x slower date-format-xparb 4.8437+-0.0344 ^ 4.7681+-0.0408 ^ definitely 1.0159x faster math-cordic 2.4417+-0.0326 2.4224+-0.0170 math-partial-sums 3.3532+-0.0203 ? 3.3763+-0.0432 ? math-spectral-norm 1.6996+-0.0198 1.6885+-0.0164 regexp-dna 5.9056+-0.0450 5.8670+-0.0320 string-base64 3.6116+-0.0421 3.5795+-0.0302 string-fasta 5.0965+-0.0400 5.0413+-0.0540 might be 1.0110x faster string-tagcloud 7.0740+-0.0728 ? 7.0958+-0.0883 ? string-unpack-code 16.5066+-0.1147 ? 16.6307+-0.1750 ? string-validate-input 4.1960+-0.0615 4.1216+-0.0161 might be 1.0181x faster <arithmetic> 3.8959+-0.0079 3.8950+-0.0104 might be 1.0002x faster
Comment on attachment 314600 [details] Patch r=me
Comment on attachment 314600 [details] Patch Thanks! OK, let's go.
Comment on attachment 314600 [details] Patch Clearing flags on attachment: 314600 Committed r219182: <http://trac.webkit.org/changeset/219182>
All reviewed patches have been landed. Closing bug.
(In reply to Yusuke Suzuki from comment #15) > We still keep StringImpl::copyChars() because it is very useful to copy > characters based on the number of chars. > If we use memcpy directly, we need to consider sizeof(UChar) and > sizeof(LChar) carefully. But why not use u_memcpy directly? > And StringImpl::copyChars(UChar*, const LChar*, unsigned) override is also > useful. That makes sense.
(In reply to Darin Adler from comment #27) > (In reply to Yusuke Suzuki from comment #15) > > We still keep StringImpl::copyChars() because it is very useful to copy > > characters based on the number of chars. > > If we use memcpy directly, we need to consider sizeof(UChar) and > > sizeof(LChar) carefully. > > But why not use u_memcpy directly? The problem is that memcpy is not designed for LChar. It is designed for arbitrary bytes. Thus it's signature is `void* memcpy(void*, const void*, size_t)`. We cannot catch the case using memcpy for UChar as a compile error. I think offering copyChars is safer, its overloading and templates can select appropriate memcpy based implementation automatically.