Bug 93742 - HTML Parser should produce 8bit substrings for inline style and script elements
Summary: HTML Parser should produce 8bit substrings for inline style and script elements
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Michael Saboff
URL:
Keywords: InRadar
Depends on: 94295
Blocks:
  Show dependency treegraph
 
Reported: 2012-08-10 14:12 PDT by Michael Saboff
Modified: 2012-08-17 13:33 PDT (History)
16 users (show)

See Also:


Attachments
Patch for review (13.87 KB, patch)
2012-08-10 16:03 PDT, Michael Saboff
gyuyoung.kim: commit-queue-
Details | Formatted Diff | Diff
Updated patch without default template parameter (14.30 KB, patch)
2012-08-10 17:12 PDT, Michael Saboff
abarth: review-
Details | Formatted Diff | Diff
Performance Comparison of Patch including char loop vs Asm loop (22.53 KB, application/pdf)
2012-08-13 13:58 PDT, Michael Saboff
no flags Details
Updated Patch using OS(DARWIN) and intrinsics (14.56 KB, patch)
2012-08-15 11:13 PDT, Michael Saboff
benjamin: review-
buildbot: commit-queue-
Details | Formatted Diff | Diff
Trivial bench (4.87 KB, text/plain)
2012-08-15 23:49 PDT, Benjamin Poulain
no flags Details
Patch updated with suggested changes including using PACKUSWB (14.29 KB, patch)
2012-08-16 11:50 PDT, Michael Saboff
benjamin: review+
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
Patch with Reviewer's Changes and Speculative Chromium Linux Build Fix (14.53 KB, patch)
2012-08-16 14:24 PDT, Michael Saboff
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Saboff 2012-08-10 14:12:54 PDT
The HTML parser produces 16 bit strings for all elements.  The JavaScript parser can accept 8-bit strings and work on the CSS parser is under way.

This change should not create 8-bit strings for actual text as it would slow down the render code which expects 16-bit strings.
Comment 1 Michael Saboff 2012-08-10 14:13:24 PDT
<rdar://problem/12067032>
Comment 2 Michael Saboff 2012-08-10 16:03:20 PDT
Created attachment 157827 [details]
Patch for review
Comment 3 Gyuyoung Kim 2012-08-10 16:39:00 PDT
Comment on attachment 157827 [details]
Patch for review

Attachment 157827 [details] did not pass efl-ews (efl):
Output: http://queues.webkit.org/results/13477347
Comment 4 Build Bot 2012-08-10 17:01:13 PDT
Comment on attachment 157827 [details]
Patch for review

Attachment 157827 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13475383
Comment 5 Michael Saboff 2012-08-10 17:12:50 PDT
Created attachment 157840 [details]
Updated patch without default template parameter
Comment 6 Eric Seidel (no email) 2012-08-11 02:38:11 PDT
Do we have any perf numbers for this work?
Comment 7 Adam Barth 2012-08-11 14:48:26 PDT
Comment on attachment 157840 [details]
Updated patch without default template parameter

View in context: https://bugs.webkit.org/attachment.cgi?id=157840&action=review

> Source/WTF/wtf/text/ASCIIFastPath.h:106
> +#if CPU(X86_64) && COMPILER(GCC) && PLATFORM(MAC)

Do you mean OS(DARWIN)?
Comment 8 Michael Saboff 2012-08-12 11:17:03 PDT
(In reply to comment #6)
> Do we have any perf numbers for this work?

Here they are as measured across three runs.
Basically the CSS tests are slightly slower (we make 8 bit strings and then up convert to 16 bit before parsing). Most of the HTML tests improve, except tiny-inner.  The work is a prerequisite for the CSS parsing work I'm doing and I expect the combination of this change and the 8 bit CSS parser work will be a progression.

                        Baseline   With 8 bit 
                         r124894   inline elems
css-parser-yui            367.40    353.73  runs/s
css-parser-yui            365.62    364.73  runs/s
css-parser-yui            366.07    366.65  runs/s
          Mean            366.36    361.70  -1.27%
html-parser              1862.65   1830.80  ms
html-parser              1847.40   1816.15  ms
html-parser              1848.10   1828.20  ms
          Mean           1852.72   1825.05   1.49%
html5-full-render       21604.50  21405.50  ms
html5-full-render       21702.50  21459.50  ms
html5-full-render       21532.50  21286.50  ms
          Mean          21613.17  21383.83   1.06%
innerHTML-setter          300.55    296.74  runs/s
innerHTML-setter          289.15    303.01  runs/s
innerHTML-setter          301.40    297.82  runs/s
          Mean            297.03    299.19   0.73%
query-selector-deep       389.34    385.73  runs/s
query-selector-deep       397.66    395.53  runs/s
query-selector-deep       397.74    396.02  runs/s
          Mean            394.92    392.42  -0.63%
query-selector-first     2084.10   2015.10  runs/s
query-selector-first     2092.48   2036.31  runs/s
query-selector-first     2045.50   2006.66  runs/s
          Mean           2074.03   2019.35  -2.64%
query-selector-last       385.78    385.75  runs/s
query-selector-last       386.03    379.85  runs/s
query-selector-last       385.45    385.52  runs/s
          Mean            385.75    383.71  -0.53%
simple-url                 49.85     48.64  runs/s
simple-url                 47.66     48.65  runs/s
simple-url                 47.40     48.42  runs/s
          Mean             48.30     48.57   0.55%
textarea-parsing           56.77     55.45  runs/s
textarea-parsing           56.77     56.05  runs/s
textarea-parsing           55.68     56.06  runs/s
          Mean             56.40     55.85  -0.98%
tiny-innerHTML              5.53      5.29  runs/s
tiny-innerHTML              5.53      5.17  runs/s
tiny-innerHTML              5.40      4.99  runs/s
          Mean              5.48      5.15  -6.11%
url-parser                111.89    112.25  runs/s
url-parser                112.15    113.04  runs/s
url-parser                110.36    111.82  runs/s
          Mean            111.47    112.37   0.81%
xml-parser                  7.55      7.59  runs/s
xml-parser                  7.62      7.60  runs/s
xml-parser                  7.48      7.46  runs/s
          Mean              7.55      7.55   0.04%
Comment 9 Michael Saboff 2012-08-12 11:21:39 PDT
(In reply to comment #7)
> (From update of attachment 157840 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=157840&action=review
> 
> > Source/WTF/wtf/text/ASCIIFastPath.h:106
> > +#if CPU(X86_64) && COMPILER(GCC) && PLATFORM(MAC)
> 
> Do you mean OS(DARWIN)?

Actually, if someone on the chrome side could test this on Linux, the last term could be changed to (PLATFORM(MAC) || OS(LINUX)).
Comment 10 Eric Seidel (no email) 2012-08-12 11:32:44 PDT
Do we understand what the perf regressions come from?  Presumably from converting 8 bit strings back to 16 bit.  Just curious if you've looked at the sample for the tests which got slower (in case avoiding such is an easy fix).
Comment 11 Eric Seidel (no email) 2012-08-12 11:35:30 PDT
(In reply to comment #9)
> (In reply to comment #7)
> > (From update of attachment 157840 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=157840&action=review
> > 
> > > Source/WTF/wtf/text/ASCIIFastPath.h:106
> > > +#if CPU(X86_64) && COMPILER(GCC) && PLATFORM(MAC)
> > 
> > Do you mean OS(DARWIN)?
> 
> Actually, if someone on the chrome side could test this on Linux, the last term could be changed to (PLATFORM(MAC) || OS(LINUX)).

My understanding is that PLATFORM(MAC) means only the Apple Mac port.  If you wan to include Chromium Mac, and Qt on Mac, Gtk on Mac, etc. you'll need to use OS(DARWIN).  But perhaps you intend only the Apple mac port?
Comment 12 Benjamin Poulain 2012-08-12 15:45:21 PDT
I can look at this tomorrow if you want.

> Actually, if someone on the chrome side could test this on Linux, the last term could be changed to (PLATFORM(MAC) || OS(LINUX)).

I think it shouldn't be changed.

SSSE3 is not a mandatory extension of x86_64 (only SSE and SSE2 are). I think this patch is safe for Mac because the supported hardware start with the Intel Core Architecture.

On Linux, WebKit runs on plenty of CPUs without SSSE3.
To support this in a cross platform way, one should use the CPU extensions detection at runtime.
Comment 13 Adam Barth 2012-08-13 10:18:05 PDT
> I think it shouldn't be changed.

It sounds like it should be OS(DARWIN) rather than PLAFORM(MAC).  PLATFORM(MAC) is just the apple-mac port, and excludes Qt, Chromium, and other ports that run on Mac OS X.

Can you also post benchmark numbers for how the non-ASM code path performs?  If the ASM is only a small win, it might not be worth the maintenance burden.  If the non-ASM path is a regression for Windows and Linux ports of WebKit, that's also a concern.
Comment 14 Adam Barth 2012-08-13 10:18:29 PDT
Comment on attachment 157840 [details]
Updated patch without default template parameter

r- for incorrect ifdef.
Comment 15 Michael Saboff 2012-08-13 11:58:29 PDT
(In reply to comment #10)
> Do we understand what the perf regressions come from?  Presumably from converting 8 bit strings back to 16 bit.  Just curious if you've looked at the sample for the tests which got slower (in case avoiding such is an easy fix).

From using instruments, 1.6% of css-parser-yui is spent up converting the string back to 16 bits.  The rest of the tests with regressions don't not show such an obvious source of the slowdown.
The most curious slowdown is tiny-innerHTML.
Comment 16 Benjamin Poulain 2012-08-13 12:33:57 PDT
(In reply to comment #13)
> It sounds like it should be OS(DARWIN) rather than PLAFORM(MAC).  PLATFORM(MAC) is just the apple-mac port, and excludes Qt, Chromium, and other ports that run on Mac OS X.

I completely disagree. Read my previous comment.

You cannot blindly assume SSSE3 is available on x86_64.
Comment 17 Adam Barth 2012-08-13 12:42:14 PDT
I think there's some miscommunication.  Let's be concrete.

1) Can you describe a system that runs chromium-mac or qt-mac port that doesn't have SSSE3?
2) Can you describe a system that uses WebKit with OS(DARWIN) that doesn't have SSSE3?
Comment 18 Benjamin Poulain 2012-08-13 12:58:54 PDT
> 1) Can you describe a system that runs chromium-mac or qt-mac port that doesn't have SSSE3?

All the AMD (and most of the Atom) hackintosh.

I think it is ok for OS(MAC) to not support that, but pushing this on all ports seems unfair.
Comment 19 Adam Barth 2012-08-13 13:25:04 PDT
(In reply to comment #18)
> > 1) Can you describe a system that runs chromium-mac or qt-mac port that doesn't have SSSE3?
> 
> All the AMD (and most of the Atom) hackintosh.

I don't think chromium-mac has any desire to run on non-Apple Mac OS X hardware.  We can ask the qt-mac folks for there opinion on that topic.

> I think it is ok for OS(MAC) to not support that, but pushing this on all ports seems unfair.

I wasn't aware that there was an OS(MAC).  Do you mean PLATFORM(MAC)?
Comment 20 Michael Saboff 2012-08-13 13:58:49 PDT
Created attachment 158097 [details]
Performance Comparison of Patch including char loop vs Asm loop

The attached data shows that the character at a time loop version is a progression for the HTML tests except tiny-innrHTML.
The inline asm version improves on most tests. For example html-parser's performance goes from +.75% to +1.5% with the inline asm.
Comment 21 Adam Barth 2012-08-13 14:01:49 PDT
Cool.  Thanks.
Comment 22 Michael Saboff 2012-08-14 14:42:33 PDT
(In reply to comment #19)
> (In reply to comment #18)
> > > 1) Can you describe a system that runs chromium-mac or qt-mac port that doesn't have SSSE3?
> > 
> > All the AMD (and most of the Atom) hackintosh.
> 
> I don't think chromium-mac has any desire to run on non-Apple Mac OS X hardware.  We can ask the qt-mac folks for there opinion on that topic.
> 
> > I think it is ok for OS(MAC) to not support that, but pushing this on all ports seems unfair.
> 
> I wasn't aware that there was an OS(MAC).  Do you mean PLATFORM(MAC)?

I think the choice is between OS(DARWIN) and PLATFORM(MAC) (with the CPU(X86_64) && COMPILER(GCC) qualifiers).  Who do we need to contact in qt land to rule out OS(DARWIN)?

What other holdups are there for this patch?
Comment 23 Benjamin Poulain 2012-08-14 14:47:22 PDT
> I think the choice is between OS(DARWIN) and PLATFORM(MAC) (with the CPU(X86_64) && COMPILER(GCC) qualifiers).  Who do we need to contact in qt land to rule out OS(DARWIN)?
> 
> What other holdups are there for this patch?

Actually...
I would much prefer intrinsics over inline assembly. I would also remove CPU(X86_64) && COMPILER(GCC) if you use intrinsics.
Comment 24 Eric Seidel (no email) 2012-08-14 14:49:22 PDT
I would not hold up this patch based on theoretical concerns for QtMac. :)  Hackintosh is not a supported platform of the WebKit project.  We can safely assume that OS(DARWIN) means "Apple hardware" IMO.
Comment 25 Csaba Osztrogonác 2012-08-15 01:38:39 PDT
I cc-ed Qt Mac guys. The question is simple: Do we have any x86_64 hardware
without SSE3 support? ( I don't think so, but everything is possible. :) )
Comment 26 Michael Saboff 2012-08-15 11:13:23 PDT
Created attachment 158598 [details]
Updated Patch using OS(DARWIN) and intrinsics

The performance of this version appears to be identical to the prior inline asm version.

Also made a minor alignment fix, relaxing the alignment that can use the SSE loop to 16 byte, instead of 16 word.
Comment 27 Build Bot 2012-08-15 11:49:04 PDT
Comment on attachment 158598 [details]
Updated Patch using OS(DARWIN) and intrinsics

Attachment 158598 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13501729
Comment 28 Benjamin Poulain 2012-08-15 13:55:05 PDT
Comment on attachment 158598 [details]
Updated Patch using OS(DARWIN) and intrinsics

View in context: https://bugs.webkit.org/attachment.cgi?id=158598&action=review

Quick comments, I'll review later.

> Source/WTF/wtf/text/ASCIIFastPath.h:25
> +#if OS(DARWIN)

&& CPU(X86_64) || CPU(X86)

Or defined(__SSSE3__)

> Source/WTF/wtf/text/ASCIIFastPath.h:106
> +template <uintptr_t mask>
> +inline bool isAlignedTo(const void* pointer)
> +{
> +    return !(reinterpret_cast<uintptr_t>(pointer) & mask);
> +}

Maybe move this to wtf/Alignment.h?

> Source/WTF/wtf/text/ASCIIFastPath.h:110
> +#if OS(DARWIN)

Ditto for the #define.

> Source/WTF/wtf/text/ASCIIFastPath.h:121
> +    const UChar* alignedEnd = reinterpret_cast<const UChar*>((reinterpret_cast<const uintptr_t>(source + length))&~sourceLoadMask);

Any way to get rid of the reinterpret_cast?
Coding style: spaces around the "&" operator.

> Source/WTF/wtf/text/ASCIIFastPath.h:126
> +    for (unsigned i = 0; i < length;) {
> +        if (isAlignedTo<memoryAccessMask>(&source[i])) {

This is a little unusually construction.

Usually, we have a prefix aligning the input, the simd code, the a suffix finishing the input.

Nothing wrong with this code, just for info.

> Source/WTF/wtf/text/ASCIIFastPath.h:130
> +                    ASSERT(!(source[i+checkIndex]&0xff00));

Coding style: space for the "&" operator.

> Source/WTF/wtf/text/ASCIIFastPath.h:146
> +        ASSERT(!(source[i]&0xff00));

!(source[i] & 0xff00)

> Source/WTF/wtf/text/ASCIIFastPath.h:152
> +        ASSERT(!(*source&0xff00));

Space.
Comment 29 Benjamin Poulain 2012-08-15 15:34:35 PDT
Comment on attachment 158598 [details]
Updated Patch using OS(DARWIN) and intrinsics

View in context: https://bugs.webkit.org/attachment.cgi?id=158598&action=review

> Source/WTF/wtf/text/ASCIIFastPath.h:114
> +    // See Intel documentation for how the shuffle control mask works.

I don't think that comment adds much :)

> Source/WTF/wtf/text/ASCIIFastPath.h:120
> +    const __m128i xmmShuf1 = {0x0e0c0a0806040200ll, 0x8080808080808080ll};
> +    const __m128i xmmShuf2 = {0x8080808080808080ll, 0x0e0c0a0806040200ll};

I think mm_setr_pi16 would be clearer for the masks.

> Source/WTF/wtf/text/ASCIIFastPath.h:137
> +                __m128i first8LChars = _mm_shuffle_epi8(first8UChars, xmmShuf1);
> +                __m128i second8LChars = _mm_shuffle_epi8(second8UChars, xmmShuf2);
> +                __m128i all16LChars = _mm_or_si128(first8LChars, second8LChars);

Could mm_packus_epi16 do the trick?

I clear the flag while you try regular packing.
Comment 30 Benjamin Poulain 2012-08-15 15:58:54 PDT
Comment on attachment 158598 [details]
Updated Patch using OS(DARWIN) and intrinsics

View in context: https://bugs.webkit.org/attachment.cgi?id=158598&action=review

> Source/WTF/wtf/text/WTFString.cpp:775
> +    if (!source || !length)

!length should be enough.

> Source/WTF/wtf/text/WTFString.cpp:779
> +    if (length > numeric_limits<unsigned>::max())
> +        CRASH();

StringImpl does that for you, you should not have to do it.
Comment 31 Benjamin Poulain 2012-08-15 23:45:50 PDT
Comment on attachment 158598 [details]
Updated Patch using OS(DARWIN) and intrinsics

View in context: https://bugs.webkit.org/attachment.cgi?id=158598&action=review

>> Source/WTF/wtf/text/ASCIIFastPath.h:121
>> +    const UChar* alignedEnd = reinterpret_cast<const UChar*>((reinterpret_cast<const uintptr_t>(source + length))&~sourceLoadMask);
> 
> Any way to get rid of the reinterpret_cast?
> Coding style: spaces around the "&" operator.

If length so small that you do not pass the next 16bytes alignment, alignedEnd is invalid (< source).
Comment 32 Benjamin Poulain 2012-08-15 23:49:34 PDT
Created attachment 158723 [details]
Trivial bench

Here is a trivial benchmark of the 3 implementations.

On my Mac Pro, the PACKUSWB gives me better results than PSHUFB.
Both are a lot faster than the trivial implementation.
Comment 33 Michael Saboff 2012-08-16 11:50:01 PDT
Created attachment 158867 [details]
Patch updated with suggested changes including using PACKUSWB
Comment 34 Benjamin Poulain 2012-08-16 12:40:20 PDT
Comment on attachment 158867 [details]
Patch updated with suggested changes including using PACKUSWB

View in context: https://bugs.webkit.org/attachment.cgi?id=158867&action=review

I have a few comments but everything seems good to me.

I wonder if this patch will also help with the cascading StringImp::getData16SlowCase() we get later in the execution.

> Source/WTF/wtf/Alignment.h:60
> -
> +    

Whilespaces.

> Source/WTF/wtf/text/ASCIIFastPath.h:25
> +#if OS(DARWIN) 

Because of iOS, I think you also need "&& (CPU(X86) || CPU(X86_64))".

> Source/WTF/wtf/text/ASCIIFastPath.h:104
> +#if OS(DARWIN)

Ditto for iOS.

> Source/WTF/wtf/text/ASCIIFastPath.h:108
> +    unsigned i = 0;

This should probably be size_t since length is also size_t.

> Source/WTF/wtf/text/ASCIIFastPath.h:131
> +    for (; i < length; ++i) {
> +        ASSERT(!(source[i] & 0xff00));
> +        destination[i] = source[i];
> +    }

If you want this assertion, you should also have it in the prefix code.

Maybe you should just add an ALWAYS_INLINE function doing the loop, and use it in the prefix, postfix and !OS(darwin) version.

> Source/WTF/wtf/text/WTFString.h:104
> +template<bool isSpecialCharacter(UChar), typename CharType>
> +bool isAllSpecialCharacters(const CharType*, size_t);

More and more we tend to use CharacterType instead of CharType.

> Source/WTF/wtf/text/WTFString.h:573
> -template<bool isSpecialCharacter(UChar)> inline bool isAllSpecialCharacters(const UChar* characters, size_t length)
> +template<bool isSpecialCharacter(UChar), typename CharType>
> +inline bool isAllSpecialCharacters(const CharType* characters, size_t length)

CharType -> CharacterType would be nice.

> Source/WebCore/ChangeLog:13
> +        Currently all data associated with a token is stored and processed as UChars.
> +        Added code to determine that the contents of token data is all 8 bit by keeping
> +        the logical OR value of all prior characters.  Also added a flag that the parser
> +        can set to indicate when the token data is converted to a String that we want
> +        to make an 8 bit string if possible.  Enabled this handling for script, style,
> +        iframe, noembed, noframes, noscript and xmp tags.

Double space after each period.

> Source/WebCore/xml/parser/MarkupTokenBase.h:307
> +    void setConvertTo8Bit()

setConvertTo8Bit() -> setConvertTo8BitIfPossible()?

The method name setConvertTo8Bit() seems to imply the token is always gonna be converted to 8bits.
Comment 35 WebKit Review Bot 2012-08-16 12:54:42 PDT
Comment on attachment 158867 [details]
Patch updated with suggested changes including using PACKUSWB

Attachment 158867 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/13508724
Comment 36 Michael Saboff 2012-08-16 14:24:10 PDT
Created attachment 158903 [details]
Patch with Reviewer's Changes and Speculative Chromium Linux Build Fix
Comment 37 Michael Saboff 2012-08-16 19:54:02 PDT
Committed r125846: <http://trac.webkit.org/changeset/125846>
Comment 38 Eric Seidel (no email) 2012-08-16 19:59:20 PDT
Yay!
Comment 39 Kent Tamura 2012-08-16 22:26:12 PDT
(In reply to comment #37)
> Committed r125846: <http://trac.webkit.org/changeset/125846>

It broke Chromium-mac build, and I landed a fix: http://trac.webkit.org/changeset/125855
Comment 40 WebKit Review Bot 2012-08-16 22:27:41 PDT
Re-opened since this is blocked by 94295
Comment 41 Michael Saboff 2012-08-17 08:52:24 PDT
(In reply to comment #39)
> (In reply to comment #37)
> > Committed r125846: <http://trac.webkit.org/changeset/125846>
> 
> It broke Chromium-mac build, and I landed a fix: http://trac.webkit.org/changeset/125855

Thanks for fixing this.  Looks like I left two includes in WTFString.cpp from when I was debugging.
Comment 42 Tobias Netzel 2012-08-17 13:13:35 PDT
Mozilla also has SIMD code for string conversions, for example this one:
https://hg.mozilla.org/mozilla-central/file/aecd4db48e8e/xpcom/string/src/nsUTF8UtilsSSE2.cpp

They do the same thing using the same intrinsics but seem to have found out that unrolling the loop to process 32 UChars in one loop iteration improves performance. I know that it does speed up things on PowerPC (TenFourFox has an Altivec version of that code).

Other parts where mozilla uses SIMD (SSE in this case) code for string processing:
https://hg.mozilla.org/mozilla-central/file/aecd4db48e8e/content/base/src/nsTextFragmentSSE2.cpp : Find out if there are any 16 bit characters in a UChar buffer.
https://hg.mozilla.org/mozilla-central/file/aecd4db48e8e/intl/uconv/src/nsUTF8ToUnicodeSSE2.cpp : Copy from an LChar buffer to a UChar buffer (somewhat different to the one in nsUTF8UtilsSSE2)

Maybe that can inspire some more SIMD code in WebKit?
Comment 43 Michael Saboff 2012-08-17 13:33:32 PDT
(In reply to comment #42)
> Mozilla also has SIMD code for string conversions, for example this one:
> https://hg.mozilla.org/mozilla-central/file/aecd4db48e8e/xpcom/string/src/nsUTF8UtilsSSE2.cpp
> 
> They do the same thing using the same intrinsics but seem to have found out that unrolling the loop to process 32 UChars in one loop iteration improves performance. I know that it does speed up things on PowerPC (TenFourFox has an Altivec version of that code).
> 
> Other parts where mozilla uses SIMD (SSE in this case) code for string processing:
> https://hg.mozilla.org/mozilla-central/file/aecd4db48e8e/content/base/src/nsTextFragmentSSE2.cpp : Find out if there are any 16 bit characters in a UChar buffer.
> https://hg.mozilla.org/mozilla-central/file/aecd4db48e8e/intl/uconv/src/nsUTF8ToUnicodeSSE2.cpp : Copy from an LChar buffer to a UChar buffer (somewhat different to the one in nsUTF8UtilsSSE2)
> 
> Maybe that can inspire some more SIMD code in WebKit?

Tobias, thanks for the info.  After working on this patch, I'm already thinking of how we can use SIMD more.