Bug 45594 - Add AtomicString::fromUTF8
Summary: Add AtomicString::fromUTF8
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Patrick R. Gansterer
URL:
Keywords:
Depends on: 49581
Blocks: 43085
  Show dependency treegraph
 
Reported: 2010-09-11 10:35 PDT by Patrick R. Gansterer
Modified: 2010-12-02 14:51 PST (History)
12 users (show)

See Also:


Attachments
Patch (2.93 KB, patch)
2010-09-11 10:40 PDT, Patrick R. Gansterer
no flags Details | Formatted Diff | Diff
Patch (2.96 KB, patch)
2010-09-12 05:06 PDT, Patrick R. Gansterer
no flags Details | Formatted Diff | Diff
Patch (14.10 KB, patch)
2010-10-09 11:36 PDT, Patrick R. Gansterer
darin: review-
Details | Formatted Diff | Diff
Patch (18.73 KB, patch)
2010-11-13 05:44 PST, Patrick R. Gansterer
darin: review+
darin: commit-queue-
Details | Formatted Diff | Diff
Patch (14.41 KB, patch)
2010-12-01 06:35 PST, Patrick R. Gansterer
no flags Details | Formatted Diff | Diff
Patch (14.41 KB, patch)
2010-12-01 07:22 PST, Patrick R. Gansterer
no flags Details | Formatted Diff | Diff
Patch (15.63 KB, patch)
2010-12-01 08:07 PST, Patrick R. Gansterer
no flags Details | Formatted Diff | Diff
Patch (15.63 KB, patch)
2010-12-01 13:15 PST, Patrick R. Gansterer
darin: review+
commit-queue: commit-queue-
Details | Formatted Diff | Diff
Patch (15.66 KB, patch)
2010-12-02 13:56 PST, Patrick R. Gansterer
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Patrick R. Gansterer 2010-09-11 10:35:43 PDT
see patch
Comment 1 Patrick R. Gansterer 2010-09-11 10:40:29 PDT
Created attachment 67302 [details]
Patch

I'll use this function for performance improvements of XMLParser (bug 43085).
Comment 2 Early Warning System Bot 2010-09-11 10:58:38 PDT
Attachment 67302 [details] did not build on qt:
Build output: http://queues.webkit.org/results/3897416
Comment 3 WebKit Review Bot 2010-09-11 11:37:24 PDT
Attachment 67302 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/3914415
Comment 4 WebKit Review Bot 2010-09-11 13:04:16 PDT
Attachment 67302 [details] did not build on gtk:
Build output: http://queues.webkit.org/results/3946401
Comment 5 Patrick R. Gansterer 2010-09-12 05:06:43 PDT
Created attachment 67334 [details]
Patch
Comment 6 Adam Barth 2010-09-12 12:17:55 PDT
Comment on attachment 67334 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67334&action=prettypatch

I don't get it.  Is this more efficient than making a string and then converting that to an atomic string?  Maybe it saves you a memcpy in the case where you have multibyte characters?
Comment 7 Patrick R. Gansterer 2010-09-12 12:30:37 PDT
(In reply to comment #6)
> (From update of attachment 67334 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=67334&action=prettypatch
> 
> I don't get it.  Is this more efficient than making a string and then converting that to an atomic string?  Maybe it saves you a memcpy in the case where you have multibyte characters?
XMLParser will uses the version witout a explicit length most of the time. Ther we can calcualte the hash, utf8length and utf16 length in one step:
+AtomicString AtomicString::fromUTF8(const char* characters)
+{
+    if (!characters)
+        return AtomicString();
+
+    HashAndUTF8Characters buffer;
+    buffer.characters = characters;
+    buffer.string = 0;
+
+    if (!calculateHashAndLength(characters, buffer.hash, buffer.lengthUTF8, buffer.lengthUTF16))
+        return AtomicString(); // characters contains invalid UTF-8 characters.
+
+    pair<HashSet<StringImpl*>::iterator, bool> addResult = stringTable().add<HashAndUTF8Characters, HashAndUTF8CharactersTranslator>(buffer);
+
+    // If the string is newly-translated, then we need to adopt it.
+    // The boolean in the pair tells us if that is so.
+    AtomicString atomicString;
+    atomicString.m_string = addResult.second ? adoptRef(*addResult.first) : *addResult.first;
+    return atomicString;
+}

This will avoid unnecessary operations, but requires additonal changes to the stringHash and UTF8 functions.
I'd like to commit this in a first step to start using the "new API" in XMLParser.
Comment 8 Adam Barth 2010-09-12 12:36:00 PDT
> XMLParser will uses the version witout a explicit length most of the time. Ther we can calcualte the hash, utf8length and utf16 length in one step:

The code you quote below isn't in the patch...  If that's what we're aiming for, why not include that now?

> This will avoid unnecessary operations, but requires additonal changes to the stringHash and UTF8 functions.
> I'd like to commit this in a first step to start using the "new API" in XMLParser.

I guess I still don't understand why any of this is specific to AtomicString.  What we saving by putting these methods on AtomicString and not just on String?
Comment 9 Patrick R. Gansterer 2010-09-12 12:43:54 PDT
(In reply to comment #8)
> > XMLParser will uses the version witout a explicit length most of the time. Ther we can calcualte the hash, utf8length and utf16 length in one step:
> 
> The code you quote below isn't in the patch...  If that's what we're aiming for, why not include that now?
It requires many other changes and I want to be able to create a patch for the XMLParser.

> > This will avoid unnecessary operations, but requires additonal changes to the stringHash and UTF8 functions.
> > I'd like to commit this in a first step to start using the "new API" in XMLParser.
> 
> I guess I still don't understand why any of this is specific to AtomicString.  What we saving by putting these methods on AtomicString and not just on String?
HashAndUTF8CharactersTranslator is a new class which can operate directly with the utf8 characters. We don't need to convert it to utf16 for lookup in the HashTable (this will be the real performance win).
Comment 10 Adam Barth 2010-09-12 12:49:24 PDT
Ah, I see.  That sounds cool.  I don't see why we should take this partial patch.  It doesn't seem to provide value on its own.  Instead, we should do the full patch.  Of course, you're encouraged to break it up into small pieces, but this patch seems to start in the middle of the dependency chain instead of at the bottom.
Comment 11 Patrick R. Gansterer 2010-09-12 12:56:50 PDT
Comment on attachment 67334 [details]
Patch

(In reply to comment #10)
> Ah, I see.  That sounds cool.  I don't see why we should take this partial patch.
Ok, I'll start with an other patch.
Comment 12 Adam Barth 2010-09-12 12:59:38 PDT
> Ok, I'll start with an other patch.

Thanks Patrick.
Comment 13 Patrick R. Gansterer 2010-10-09 11:36:32 PDT
Created attachment 70357 [details]
Patch
Comment 14 WebKit Review Bot 2010-10-09 11:42:28 PDT
Attachment 70357 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style']" exit_code: 1
JavaScriptCore/wtf/unicode/UTF8.cpp:326:  A case label should not be indented, but line up with its switch statement.  [whitespace/indent] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:326:  More than one command on the same line  [whitespace/newline] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:326:  More than one command on the same line in if  [whitespace/parens] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:327:  More than one command on the same line  [whitespace/newline] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:327:  More than one command on the same line in if  [whitespace/parens] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:328:  More than one command on the same line  [whitespace/newline] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:328:  More than one command on the same line in if  [whitespace/parens] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:329:  More than one command on the same line  [whitespace/newline] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:329:  More than one command on the same line in if  [whitespace/parens] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:330:  More than one command on the same line  [whitespace/newline] [4]
JavaScriptCore/wtf/unicode/UTF8.cpp:330:  More than one command on the same line in if  [whitespace/parens] [4]
Total errors found: 11 in 9 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 15 WebKit Review Bot 2010-10-16 03:18:01 PDT
Attachment 70357 [details] did not build on win:
Build output: http://queues.webkit.org/results/4456055
Comment 16 Patrick R. Gansterer 2010-10-28 13:25:58 PDT
ping?
Comment 17 Adam Barth 2010-10-28 13:59:57 PDT
Comment on attachment 70357 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=70357&action=review

We'll need some unicode experts to look at this patch.

> JavaScriptCore/wtf/text/AtomicString.cpp:232
> +static inline bool equal(StringImpl* string, const char* characters, unsigned length)
> +{
> +    if (string->length() != length)
> +        return false;
> +
> +    const UChar* stringCharacters = string->characters();
> +    for (unsigned i = 0; i != length; ++i) {
> +        if (*stringCharacters++ != *characters++)
> +            return false;
> +    }
> +
> +    return true;
> +}

I don't understand how you're comparing UChars with chars.  They're in different encodings...

> JavaScriptCore/wtf/text/AtomicString.cpp:255
> +        if (buffer.rawLength == buffer.utf16Length) // buffer contains only ASCII characters.
> +            return WTF::equal(string, buffer.characters, buffer.rawLength);

I see what you're trying to do, but WTF::equal has way too general of a name.  Also, this check isn't correct.  Recall that some UTF16 characters are multibyte characters.

> JavaScriptCore/wtf/text/AtomicString.cpp:391
> +    // Try converting into the buffer.
> +    if (convertUTF8ToUTF16(&characters, characters + length, &buffer, bufferEnd) != conversionOK)
> +        return AtomicString();
> +
> +    // stringBuffer is full (the input must have been all ascii), use the string.
> +    if (buffer == bufferEnd)
> +        return AtomicString(stringBuffer.get());

This seems really sketchy.  Why not just figure out what the correct length is?
Comment 18 Patrick R. Gansterer 2010-10-28 14:27:55 PDT
(In reply to comment #17)
> (From update of attachment 70357 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=70357&action=review
> I don't understand how you're comparing UChars with chars.  They're in different encodings...
Is ok, if we give it a better name? (any good ideas?)

> I see what you're trying to do, but WTF::equal has way too general of a name.  Also, this check isn't correct.  Recall that some UTF16 characters are multibyte characters.
This check is only done if there are only ASCII characters. AFAIK the first byte of a UTF16 multibyte sequence will never match a ASCII character, so the true/false test should be ok IMHO.

> 
> > JavaScriptCore/wtf/text/AtomicString.cpp:391
> > +    // Try converting into the buffer.
> > +    if (convertUTF8ToUTF16(&characters, characters + length, &buffer, bufferEnd) != conversionOK)
> > +        return AtomicString();
> > +
> > +    // stringBuffer is full (the input must have been all ascii), use the string.
> > +    if (buffer == bufferEnd)
> > +        return AtomicString(stringBuffer.get());
> 
> This seems really sketchy.  Why not just figure out what the correct length is?
What do you mean exactly? Which length? This is the nearly the same alogrithm as for String::fromUtf8.
Comment 19 Patrick R. Gansterer 2010-10-30 13:27:15 PDT
Comment on attachment 70357 [details]
Patch

Setting back to r? as discussed on IRC with abarth.

@gbarra Can you have look at the this patch? svn blame shows you as author for most of the code in AtomicString.
Comment 20 Darin Adler 2010-10-30 15:23:28 PDT
Comment on attachment 70357 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=70357&action=review

Basic idea here is quite good, but we should improve the execution at least a bit before landing.

One thing we could do to cut down the size of the patch a bit is to add a non-optimized AtomicString::fromUTF8 first along with changes to adopt it, and then add the second patch which adds the optimized code path.

> JavaScriptCore/wtf/text/AtomicString.cpp:217
> +    mutable RefPtr<StringImpl> string;

It’s pretty strange to have a mutable member in a struct with no member functions. It might be better to use a class. I think this idiom deserves at least a comment. Using a mutable structure so we can allocate the string during the lookup process and then reuse it when putting the string into the hash table is cleverness. Cleverness is a bad thing without comments to explain to those who follow.

Quite a bit of additional complexity is here with the intention of speeding up cases that I don’t think are common enough to deserve it. Storing a StringImpl in the structure optimizes the case where we compare and discover the string is a new string, by removing one pass of UTF-8-to-16 conversion, but at the cost of extra string allocations where we compare and discover the string is not new, so I think it’s not a good tradeoff.

With no comments about rationale, this all is a bit mysterious.

> JavaScriptCore/wtf/text/AtomicString.cpp:220
> +static inline bool equal(StringImpl* string, const char* characters, unsigned length)

Since this function is only legal for ASCII characters, then the name needs to express that limitation, and I’d like to see an assertion as well. I’m not sure a fast path for all-ASCII is needed, though.

> JavaScriptCore/wtf/text/AtomicString.cpp:234
> +static inline void addStringImplIntoBuffer(const HashAndUTF8Characters& buffer)

The function name here is pretty confusing. This seems like a perfect candidate for a member function.

The more-typical idiom for this would be to use a private member for the string, and then have a public accessor function that does this work of creating the string. Then we avoid having this awkwardly-named function, and we also have an interface that's harder to use incorrectly.

> JavaScriptCore/wtf/text/AtomicString.cpp:258
> +        addStringImplIntoBuffer(buffer);
> +        return WTF::equal(string, buffer.string->characters(), buffer.string->length());

I think we easily can write code to compare a UTF-8 string with a UTF-16 string without allocating memory. This would come up in the relatively common case of looking for a non-ASCII string that is already in the table. It would be faster not to allocate memory in that case.

> JavaScriptCore/wtf/text/AtomicString.cpp:377
> +        return AtomicString(StringImpl::empty());

I believe it’s faster to return emptyAtom here. Other call sites in the same file should do the same if they don’t already. Unless it’s not faster!

> JavaScriptCore/wtf/text/AtomicString.cpp:383
> +    // We'll use a StringImpl as a buffer; if the source string only contains ascii this should be
> +    // the right length, if there are any multi-byte sequences this buffer will be too large.
> +    UChar* buffer;
> +    RefPtr<StringImpl> stringBuffer = StringImpl::createUninitialized(length, buffer);
> +    UChar* bufferEnd = buffer + length;

This is not the right approach. We should write code that can look up the string in the hash table without allocating memory at all. There’s no reason we need a buffer to convert UTF-8 to UTF-16 until we have proven the string is not already in the table. I don’t think we need a special case for ASCII either; it will come out fast automatically.

In this patch, we reserve the optimization of not allocating the string for null-terminated strings. That’s not the tradeoff we usually make in WebKit.

> JavaScriptCore/wtf/text/AtomicString.cpp:399
> +AtomicString AtomicString::fromUTF8(const char* characters)

This should call the other fromUTF8 function after first calling strlen rather than having separate logic.

> JavaScriptCore/wtf/text/AtomicString.h:112
> +    static AtomicString fromUTF8(const char*, size_t);
> +    static AtomicString fromUTF8(const char*);

I think these functions and the ones already in the String class deserve a comment indicating what that the function returns a null string if the characters are not valid UTF-8.

> JavaScriptCore/wtf/unicode/UTF8.h:73
> +    bool calculateHashAndLength(const char* rawData, unsigned& hash, unsigned& rawLength, unsigned& utf16Length);

This function is unclear. It has an argument name “rawData” and nothing in the function name or argument name make it clear that it’s expected to be UTF-8 data. I think you are using “raw” here to mean “UTF-8”, which is not a usual term of art. I also find the boolean result unclear. Without a comment I have no idea what its significance is. Maybe true if the function fails, or false if it fails?

It’s not right to have this function work on null-terminated strings. Our normal approach is that all functions work on strings with an explicit separate length argument, except for a few convenience functions. Those can use strlen and then call through to the general case. Since this function works only on null-terminated strings, we can’t use it at all in functions that have strings with an explicit length: the function would run off the end of the buffer looking for a null character and also would misinterpret null characters as end-of-string indicators.

A single function that computes a hash for a UTF-8 string and also returns the UTF-16 length as a side effect is OK. But we should add special cases only if we can prove with profiling they are beneficial. More branches mean the code is harder to test. Knowing the length in advance can be useful so we can allocate a perfectly-sized buffer without scanning the string again. It also serves the dual purpose of being an all-ASCII flag, but it’s not clear we need such a flag. The specifici motivation and value in having oblique functions with multiple purposes like this one means we need at least a brief mention in a comment of the benefit of doing these two things at once.

The location of the function in this header isn’t perfect either, and creates a bit of a modularity problem within WTF. The function computes a hash, but the UTF8.h file is a lower level Unicode building block and this level knows nothing about the hashing algorithm. The code that knows about both UTF-8 and the hash functions probably belongs in StringHashFunctions.h rather than UTF8.h.

I’d like us to add a function that compares a UTF-16 string to a UTF-8 one so we can find non-ASCII strings that are already present in the AtomicString table without allocating memory.
Comment 21 Patrick R. Gansterer 2010-10-30 16:29:20 PDT
Comment on attachment 70357 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=70357&action=review

Many thanks for so much comments!

>> JavaScriptCore/wtf/text/AtomicString.cpp:217
>> +    mutable RefPtr<StringImpl> string;
> 
> It’s pretty strange to have a mutable member in a struct with no member functions. It might be better to use a class. I think this idiom deserves at least a comment. Using a mutable structure so we can allocate the string during the lookup process and then reuse it when putting the string into the hash table is cleverness. Cleverness is a bad thing without comments to explain to those who follow.
> 
> Quite a bit of additional complexity is here with the intention of speeding up cases that I don’t think are common enough to deserve it. Storing a StringImpl in the structure optimizes the case where we compare and discover the string is a new string, by removing one pass of UTF-8-to-16 conversion, but at the cost of extra string allocations where we compare and discover the string is not new, so I think it’s not a good tradeoff.
> 
> With no comments about rationale, this all is a bit mysterious.

If we add a UTF16/UTF8 equal function we won't need it anymore. :-)

>> JavaScriptCore/wtf/text/AtomicString.cpp:220
>> +static inline bool equal(StringImpl* string, const char* characters, unsigned length)
> 
> Since this function is only legal for ASCII characters, then the name needs to express that limitation, and I’d like to see an assertion as well. I’m not sure a fast path for all-ASCII is needed, though.

It has the same name as http://trac.webkit.org/browser/trunk/JavaScriptCore/wtf/text/StringImpl.cpp?rev=69414#L915. Maybe we should change the name of other function too and add a ASSERT(containsOnlyASCII()).

>> JavaScriptCore/wtf/text/AtomicString.cpp:234
>> +static inline void addStringImplIntoBuffer(const HashAndUTF8Characters& buffer)
> 
> The function name here is pretty confusing. This seems like a perfect candidate for a member function.
> 
> The more-typical idiom for this would be to use a private member for the string, and then have a public accessor function that does this work of creating the string. Then we avoid having this awkwardly-named function, and we also have an interface that's harder to use incorrectly.

Will be removed too in favor of UTF16/UTF8 equal.

>> JavaScriptCore/wtf/text/AtomicString.cpp:377
>> +        return AtomicString(StringImpl::empty());
> 
> I believe it’s faster to return emptyAtom here. Other call sites in the same file should do the same if they don’t already. Unless it’s not faster!

I wanted to use it, but didn't found it. I was looking for a member function like StringImpl::empty().

>> JavaScriptCore/wtf/text/AtomicString.cpp:383
>> +    UChar* bufferEnd = buffer + length;
> 
> This is not the right approach. We should write code that can look up the string in the hash table without allocating memory at all. There’s no reason we need a buffer to convert UTF-8 to UTF-16 until we have proven the string is not already in the table. I don’t think we need a special case for ASCII either; it will come out fast automatically.
> 
> In this patch, we reserve the optimization of not allocating the string for null-terminated strings. That’s not the tradeoff we usually make in WebKit.

If we add a equal function which compares UTF16 with UTF8 this won't make sense anymore.

>> JavaScriptCore/wtf/text/AtomicString.cpp:399
>> +AtomicString AtomicString::fromUTF8(const char* characters)
> 
> This should call the other fromUTF8 function after first calling strlen rather than having separate logic.

I don't like the idea of calling strlen, when we iterate over the string too for calculating the hash afterwards. But it's ok to do it with strlen for now and get rid of it in a second patch.

>> JavaScriptCore/wtf/text/AtomicString.h:112
>> +    static AtomicString fromUTF8(const char*);
> 
> I think these functions and the ones already in the String class deserve a comment indicating what that the function returns a null string if the characters are not valid UTF-8.

I'll add comments in WTFString.h too.

>> JavaScriptCore/wtf/unicode/UTF8.h:73
>> +    bool calculateHashAndLength(const char* rawData, unsigned& hash, unsigned& rawLength, unsigned& utf16Length);
> 
> This function is unclear. It has an argument name “rawData” and nothing in the function name or argument name make it clear that it’s expected to be UTF-8 data. I think you are using “raw” here to mean “UTF-8”, which is not a usual term of art. I also find the boolean result unclear. Without a comment I have no idea what its significance is. Maybe true if the function fails, or false if it fails?
> 
> It’s not right to have this function work on null-terminated strings. Our normal approach is that all functions work on strings with an explicit separate length argument, except for a few convenience functions. Those can use strlen and then call through to the general case. Since this function works only on null-terminated strings, we can’t use it at all in functions that have strings with an explicit length: the function would run off the end of the buffer looking for a null character and also would misinterpret null characters as end-of-string indicators.
> 
> A single function that computes a hash for a UTF-8 string and also returns the UTF-16 length as a side effect is OK. But we should add special cases only if we can prove with profiling they are beneficial. More branches mean the code is harder to test. Knowing the length in advance can be useful so we can allocate a perfectly-sized buffer without scanning the string again. It also serves the dual purpose of being an all-ASCII flag, but it’s not clear we need such a flag. The specifici motivation and value in having oblique functions with multiple purposes like this one means we need at least a brief mention in a comment of the benefit of doing these two things at once.
> 
> The location of the function in this header isn’t perfect either, and creates a bit of a modularity problem within WTF. The function computes a hash, but the UTF8.h file is a lower level Unicode building block and this level knows nothing about the hashing algorithm. The code that knows about both UTF-8 and the hash functions probably belongs in StringHashFunctions.h rather than UTF8.h.
> 
> I’d like us to add a function that compares a UTF-16 string to a UTF-8 one so we can find non-ASCII strings that are already present in the AtomicString table without allocating memory.

In one of my first versions I had a utf8Length too, so a needed rawLength, but this doesn't apply any more. ;-)

If we do the inital size check via strlen in a first patch, it makes no sense to work with null-terminated strings.

IMHO the all-ASCII flag makes sense to avoid checking for UTF8-sequences later, but this might need some performance measurements to be sure.

I'm not happy with the location too! I think that StringHash(Functions).h would be a better place, but the function uses to much (inline) stuff of the current UTF8.cpp (e.g. offsetsFromUTF8). 

We have the same problem for the UTF16/UTF8 equal function too. Our equal function all take a StringImpl*, but UTF8.cpp does not know anything about it, and StringImpl does not know anything about UTF8 sequences. I don't know what the best/clean solution might be.
Comment 22 Patrick R. Gansterer 2010-11-13 05:44:55 PST
Created attachment 73819 [details]
Patch
Comment 23 Darin Adler 2010-11-15 12:11:33 PST
Comment on attachment 73819 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=73819&action=review

Looks good. Please consider some of my suggestions for improvement. For now, I set commit-queue- so you could think over if you want to make these changes before you land.

> JavaScriptCore/wtf/text/AtomicString.cpp:233
> +        UChar *target;

The * is positioned incorrectly here. It should be next to the UChar.

> JavaScriptCore/wtf/text/AtomicString.cpp:234
> +        location = StringImpl::createUninitialized(buffer.utf16Length, target).releaseRef();

Please use the new name, leakRef, not the old name, releaseRef, in new code.

> JavaScriptCore/wtf/text/AtomicString.h:112
> +    // the input data contains invalid UTF-8 characters.

Comment should say “invalid UTF-8 sequences” not “invalid UTF-8 characters”.

> JavaScriptCore/wtf/text/WTFString.h:313
> +    // the input data contains invalid UTF-8 characters.

Same comment.

> JavaScriptCore/wtf/unicode/UTF8.cpp:238
> +    UChar32 ch = 0;

I’d much prefer “character” for this local.

> JavaScriptCore/wtf/unicode/UTF8.cpp:243
> +        case 6: ch += static_cast<unsigned char>(*sequence++); ch <<= 6; // remember, illegal UTF-8
> +        case 5: ch += static_cast<unsigned char>(*sequence++); ch <<= 6; // remember, illegal UTF-8

Who are we saying “remember” to in these comments? I know you just moved them, but I think these comments are unhelpful and unclear. We should remove the comments.

> JavaScriptCore/wtf/unicode/UTF8.cpp:262
> +        if (source + utf8SequenceLength > sourceEnd) {

This should be written like this:

    if (sourceEnd - source < utf8SequenceLength)

It’s not great to compute an address past the end of the buffer, and doing it this way guarantees we don’t try to do that.

> JavaScriptCore/wtf/unicode/UTF8.cpp:313
> +unsigned calculateHashFromUTF8(const char* data, const char* dataEnd, unsigned& utf16Length)

This function needs a policy for handling lengths that do not fit in unsigned. Even if that policy is simply to call CRASH when encountering a giant length. Not doing so can lead to security vulnerabilities. Alternatively, the type of the out argument could be changed to a size_t, but then the callers would need to check if they narrowed it to an unsigned.

> JavaScriptCore/wtf/unicode/UTF8.cpp:322
> +        if (!(*data & 0x80)) {

It would be better to use isASCII from ASCIICType.h for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:337
> +        ASSERT(ch >= 0x80);

It would be better to use isASCII from ASCIICType.h for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:339
> +        if (ch <= 0xFFFF) {

It would be better to use U_IS_BMP for this. This is a macro from ICU that we may already implement for non-ICU platforms, but if not we can do so.

> JavaScriptCore/wtf/unicode/UTF8.cpp:341
> +            if (ch >= 0xD800 && ch <= 0xDFFF)

It would be better to use U_IS_SURROGATE for this. Another ICU macro.

> JavaScriptCore/wtf/unicode/UTF8.cpp:348
> +        } else if (ch <= 0x10FFFF) {
> +            ch -= 0x0010000UL;
> +            stringHasher.addCharacters(static_cast<UChar>((ch >> 10) + 0xD800),
> +                                       static_cast<UChar>((ch & 0x03FF) + 0xDC00));

It would be better to use U_IS_SUPPLEMENTARY, U16_LEAD, and U16_TRAIL for this. More ICU macros.

> JavaScriptCore/wtf/unicode/UTF8.cpp:360
> +        if (!(*b & 0x80)) {

It would be better to use isASCII from ASCIICType.h for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:361
> +            if (*a++ != static_cast<UChar>(*b++))

The static_cast is unfortunate. Can we use a local variable to avoid the typecast?

> JavaScriptCore/wtf/unicode/UTF8.cpp:375
> +        ASSERT(ch >= 0x80);

It would be better to use isASCII from ASCIICType.h for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:377
> +        if (ch <= 0xFFFF) {

It would be better to use U_IS_BMP for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:379
> +            if (ch >= 0xD800 && ch <= 0xDFFF)

It would be better to use U_IS_SURROGATE for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:381
> +            if (*a++ != static_cast<UChar>(ch))

I don’t think we need the static_cast here. Please remove it.

> JavaScriptCore/wtf/unicode/UTF8.cpp:389
> +        } else if (ch <= 0x10FFFF) {
> +            ch -= 0x0010000UL;
> +
> +            if (*a++ != static_cast<UChar>((ch >> 10) + 0xD800))
> +                return false;
> +            if (*a++ != static_cast<UChar>((ch & 0x03FF) + 0xDC00))
> +                return false;

It would be better to use U_IS_SUPPLEMENTARY, U16_LEAD, and U16_TRAIL for this.

> JavaScriptCore/wtf/unicode/UTF8.h:73
> +    unsigned calculateHashFromUTF8(const char* data, const char* dataEnd, unsigned& utf16Length);

Still unfortunate to have this here in this file. At the very least this needs a comment to explain what kind of hash we are talking about.

> JavaScriptCore/wtf/unicode/UTF8.h:75
> +    bool equalUTF8(const UChar* a, const UChar* aEnd, const char* b, const char* bEnd);

This name should mention UTF-16, not just UTF-8, since the function compares UTF-16 to UTF-8.
Comment 24 Build Bot 2010-11-15 22:04:36 PST
Attachment 73819 [details] did not build on win:
Build output: http://queues.webkit.org/results/5989100
Comment 25 Patrick R. Gansterer 2010-11-15 23:59:07 PST
I created bug 45981 for some cleanup in UTF8.cpp.
IMHO the UTF8 changes are independent and should go in before this one.
Comment 26 Patrick R. Gansterer 2010-11-16 00:02:34 PST
(In reply to comment #25)
> I created bug 45981 for some cleanup in UTF8.cpp.
bug 49581 instead of bug 45981
Comment 27 Patrick R. Gansterer 2010-12-01 06:35:27 PST
Created attachment 75269 [details]
Patch
Comment 28 WebKit Review Bot 2010-12-01 06:38:00 PST
Attachment 75269 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--diff-files', u'JavaScriptCore/ChangeLog', u'JavaScriptCore/JavaScriptCore.exp', u'JavaScriptCore/wtf/text/AtomicString.cpp', u'JavaScriptCore/wtf/text/AtomicString.h', u'JavaScriptCore/wtf/text/StringImpl.h', u'JavaScriptCore/wtf/text/WTFString.h', u'JavaScriptCore/wtf/unicode/UTF8.cpp', u'JavaScriptCore/wtf/unicode/UTF8.h', u'WebCore/ChangeLog', u'WebCore/dom/XMLDocumentParserLibxml2.cpp']" exit_code: 1
JavaScriptCore/wtf/text/AtomicString.cpp:233:  Declaration has space between type name and * in UChar *target  [whitespace/declaration] [3]
Total errors found: 1 in 10 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 29 Patrick R. Gansterer 2010-12-01 07:22:02 PST
Created attachment 75271 [details]
Patch
Comment 30 Build Bot 2010-12-01 07:51:08 PST
Attachment 75269 [details] did not build on win:
Build output: http://queues.webkit.org/results/6646008
Comment 31 Patrick R. Gansterer 2010-12-01 08:07:01 PST
Created attachment 75274 [details]
Patch

Try to make Win-EWS happy...
Comment 32 Build Bot 2010-12-01 09:04:19 PST
Attachment 75271 [details] did not build on win:
Build output: http://queues.webkit.org/results/6617008
Comment 33 Patrick R. Gansterer 2010-12-01 13:15:20 PST
Created attachment 75312 [details]
Patch

Fixed style issue
Comment 34 Patrick R. Gansterer 2010-12-02 13:15:23 PST
@darin: ping
Comment 35 WebKit Commit Bot 2010-12-02 13:21:41 PST
Comment on attachment 75312 [details]
Patch

Rejecting patch 75312 from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-sf', 'apply-attachment', '--force-clean', '--non-interactive', 75312]" exit_code: 2
Last 500 characters of output:
ing file JavaScriptCore/wtf/text/AtomicString.h
patching file JavaScriptCore/wtf/text/StringImpl.h
patching file JavaScriptCore/wtf/text/WTFString.h
patching file JavaScriptCore/wtf/unicode/UTF8.cpp
patching file JavaScriptCore/wtf/unicode/UTF8.h
patching file WebCore/ChangeLog
Hunk #1 succeeded at 1 with fuzz 3.
patching file WebCore/dom/XMLDocumentParserLibxml2.cpp

Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Darin Adler', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/6807015
Comment 36 Patrick R. Gansterer 2010-12-02 13:56:16 PST
Created attachment 75414 [details]
Patch
Comment 37 WebKit Commit Bot 2010-12-02 14:45:54 PST
Comment on attachment 75414 [details]
Patch

Clearing flags on attachment: 75414

Committed r73201: <http://trac.webkit.org/changeset/73201>
Comment 38 WebKit Commit Bot 2010-12-02 14:46:04 PST
All reviewed patches have been landed.  Closing bug.
Comment 39 WebKit Review Bot 2010-12-02 14:51:43 PST
http://trac.webkit.org/changeset/73201 might have broken Qt Windows 32-bit Debug