WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
Bug 45594
Add AtomicString::fromUTF8
https://bugs.webkit.org/show_bug.cgi?id=45594
Summary
Add AtomicString::fromUTF8
Patrick R. Gansterer
Reported
2010-09-11 10:35:43 PDT
see patch
Attachments
Patch
(2.93 KB, patch)
2010-09-11 10:40 PDT
,
Patrick R. Gansterer
no flags
Details
Formatted Diff
Diff
Patch
(2.96 KB, patch)
2010-09-12 05:06 PDT
,
Patrick R. Gansterer
no flags
Details
Formatted Diff
Diff
Patch
(14.10 KB, patch)
2010-10-09 11:36 PDT
,
Patrick R. Gansterer
darin
: review-
Details
Formatted Diff
Diff
Patch
(18.73 KB, patch)
2010-11-13 05:44 PST
,
Patrick R. Gansterer
darin
: review+
darin
: commit-queue-
Details
Formatted Diff
Diff
Patch
(14.41 KB, patch)
2010-12-01 06:35 PST
,
Patrick R. Gansterer
no flags
Details
Formatted Diff
Diff
Patch
(14.41 KB, patch)
2010-12-01 07:22 PST
,
Patrick R. Gansterer
no flags
Details
Formatted Diff
Diff
Patch
(15.63 KB, patch)
2010-12-01 08:07 PST
,
Patrick R. Gansterer
no flags
Details
Formatted Diff
Diff
Patch
(15.63 KB, patch)
2010-12-01 13:15 PST
,
Patrick R. Gansterer
darin
: review+
commit-queue
: commit-queue-
Details
Formatted Diff
Diff
Patch
(15.66 KB, patch)
2010-12-02 13:56 PST
,
Patrick R. Gansterer
no flags
Details
Formatted Diff
Diff
Show Obsolete
(8)
View All
Add attachment
proposed patch, testcase, etc.
Patrick R. Gansterer
Comment 1
2010-09-11 10:40:29 PDT
Created
attachment 67302
[details]
Patch I'll use this function for performance improvements of XMLParser (
bug 43085
).
Early Warning System Bot
Comment 2
2010-09-11 10:58:38 PDT
Attachment 67302
[details]
did not build on qt: Build output:
http://queues.webkit.org/results/3897416
WebKit Review Bot
Comment 3
2010-09-11 11:37:24 PDT
Attachment 67302
[details]
did not build on chromium: Build output:
http://queues.webkit.org/results/3914415
WebKit Review Bot
Comment 4
2010-09-11 13:04:16 PDT
Attachment 67302
[details]
did not build on gtk: Build output:
http://queues.webkit.org/results/3946401
Patrick R. Gansterer
Comment 5
2010-09-12 05:06:43 PDT
Created
attachment 67334
[details]
Patch
Adam Barth
Comment 6
2010-09-12 12:17:55 PDT
Comment on
attachment 67334
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=67334&action=prettypatch
I don't get it. Is this more efficient than making a string and then converting that to an atomic string? Maybe it saves you a memcpy in the case where you have multibyte characters?
Patrick R. Gansterer
Comment 7
2010-09-12 12:30:37 PDT
(In reply to
comment #6
)
> (From update of
attachment 67334
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=67334&action=prettypatch
> > I don't get it. Is this more efficient than making a string and then converting that to an atomic string? Maybe it saves you a memcpy in the case where you have multibyte characters?
XMLParser will uses the version witout a explicit length most of the time. Ther we can calcualte the hash, utf8length and utf16 length in one step: +AtomicString AtomicString::fromUTF8(const char* characters) +{ + if (!characters) + return AtomicString(); + + HashAndUTF8Characters buffer; + buffer.characters = characters; + buffer.string = 0; + + if (!calculateHashAndLength(characters, buffer.hash, buffer.lengthUTF8, buffer.lengthUTF16)) + return AtomicString(); // characters contains invalid UTF-8 characters. + + pair<HashSet<StringImpl*>::iterator, bool> addResult = stringTable().add<HashAndUTF8Characters, HashAndUTF8CharactersTranslator>(buffer); + + // If the string is newly-translated, then we need to adopt it. + // The boolean in the pair tells us if that is so. + AtomicString atomicString; + atomicString.m_string = addResult.second ? adoptRef(*addResult.first) : *addResult.first; + return atomicString; +} This will avoid unnecessary operations, but requires additonal changes to the stringHash and UTF8 functions. I'd like to commit this in a first step to start using the "new API" in XMLParser.
Adam Barth
Comment 8
2010-09-12 12:36:00 PDT
> XMLParser will uses the version witout a explicit length most of the time. Ther we can calcualte the hash, utf8length and utf16 length in one step:
The code you quote below isn't in the patch... If that's what we're aiming for, why not include that now?
> This will avoid unnecessary operations, but requires additonal changes to the stringHash and UTF8 functions. > I'd like to commit this in a first step to start using the "new API" in XMLParser.
I guess I still don't understand why any of this is specific to AtomicString. What we saving by putting these methods on AtomicString and not just on String?
Patrick R. Gansterer
Comment 9
2010-09-12 12:43:54 PDT
(In reply to
comment #8
)
> > XMLParser will uses the version witout a explicit length most of the time. Ther we can calcualte the hash, utf8length and utf16 length in one step: > > The code you quote below isn't in the patch... If that's what we're aiming for, why not include that now?
It requires many other changes and I want to be able to create a patch for the XMLParser.
> > This will avoid unnecessary operations, but requires additonal changes to the stringHash and UTF8 functions. > > I'd like to commit this in a first step to start using the "new API" in XMLParser. > > I guess I still don't understand why any of this is specific to AtomicString. What we saving by putting these methods on AtomicString and not just on String?
HashAndUTF8CharactersTranslator is a new class which can operate directly with the utf8 characters. We don't need to convert it to utf16 for lookup in the HashTable (this will be the real performance win).
Adam Barth
Comment 10
2010-09-12 12:49:24 PDT
Ah, I see. That sounds cool. I don't see why we should take this partial patch. It doesn't seem to provide value on its own. Instead, we should do the full patch. Of course, you're encouraged to break it up into small pieces, but this patch seems to start in the middle of the dependency chain instead of at the bottom.
Patrick R. Gansterer
Comment 11
2010-09-12 12:56:50 PDT
Comment on
attachment 67334
[details]
Patch (In reply to
comment #10
)
> Ah, I see. That sounds cool. I don't see why we should take this partial patch.
Ok, I'll start with an other patch.
Adam Barth
Comment 12
2010-09-12 12:59:38 PDT
> Ok, I'll start with an other patch.
Thanks Patrick.
Patrick R. Gansterer
Comment 13
2010-10-09 11:36:32 PDT
Created
attachment 70357
[details]
Patch
WebKit Review Bot
Comment 14
2010-10-09 11:42:28 PDT
Attachment 70357
[details]
did not pass style-queue: Failed to run "['WebKitTools/Scripts/check-webkit-style']" exit_code: 1 JavaScriptCore/wtf/unicode/UTF8.cpp:326: A case label should not be indented, but line up with its switch statement. [whitespace/indent] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:326: More than one command on the same line [whitespace/newline] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:326: More than one command on the same line in if [whitespace/parens] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:327: More than one command on the same line [whitespace/newline] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:327: More than one command on the same line in if [whitespace/parens] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:328: More than one command on the same line [whitespace/newline] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:328: More than one command on the same line in if [whitespace/parens] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:329: More than one command on the same line [whitespace/newline] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:329: More than one command on the same line in if [whitespace/parens] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:330: More than one command on the same line [whitespace/newline] [4] JavaScriptCore/wtf/unicode/UTF8.cpp:330: More than one command on the same line in if [whitespace/parens] [4] Total errors found: 11 in 9 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Review Bot
Comment 15
2010-10-16 03:18:01 PDT
Attachment 70357
[details]
did not build on win: Build output:
http://queues.webkit.org/results/4456055
Patrick R. Gansterer
Comment 16
2010-10-28 13:25:58 PDT
ping?
Adam Barth
Comment 17
2010-10-28 13:59:57 PDT
Comment on
attachment 70357
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=70357&action=review
We'll need some unicode experts to look at this patch.
> JavaScriptCore/wtf/text/AtomicString.cpp:232 > +static inline bool equal(StringImpl* string, const char* characters, unsigned length) > +{ > + if (string->length() != length) > + return false; > + > + const UChar* stringCharacters = string->characters(); > + for (unsigned i = 0; i != length; ++i) { > + if (*stringCharacters++ != *characters++) > + return false; > + } > + > + return true; > +}
I don't understand how you're comparing UChars with chars. They're in different encodings...
> JavaScriptCore/wtf/text/AtomicString.cpp:255 > + if (buffer.rawLength == buffer.utf16Length) // buffer contains only ASCII characters. > + return WTF::equal(string, buffer.characters, buffer.rawLength);
I see what you're trying to do, but WTF::equal has way too general of a name. Also, this check isn't correct. Recall that some UTF16 characters are multibyte characters.
> JavaScriptCore/wtf/text/AtomicString.cpp:391 > + // Try converting into the buffer. > + if (convertUTF8ToUTF16(&characters, characters + length, &buffer, bufferEnd) != conversionOK) > + return AtomicString(); > + > + // stringBuffer is full (the input must have been all ascii), use the string. > + if (buffer == bufferEnd) > + return AtomicString(stringBuffer.get());
This seems really sketchy. Why not just figure out what the correct length is?
Patrick R. Gansterer
Comment 18
2010-10-28 14:27:55 PDT
(In reply to
comment #17
)
> (From update of
attachment 70357
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=70357&action=review
> I don't understand how you're comparing UChars with chars. They're in different encodings...
Is ok, if we give it a better name? (any good ideas?)
> I see what you're trying to do, but WTF::equal has way too general of a name. Also, this check isn't correct. Recall that some UTF16 characters are multibyte characters.
This check is only done if there are only ASCII characters. AFAIK the first byte of a UTF16 multibyte sequence will never match a ASCII character, so the true/false test should be ok IMHO.
> > > JavaScriptCore/wtf/text/AtomicString.cpp:391 > > + // Try converting into the buffer. > > + if (convertUTF8ToUTF16(&characters, characters + length, &buffer, bufferEnd) != conversionOK) > > + return AtomicString(); > > + > > + // stringBuffer is full (the input must have been all ascii), use the string. > > + if (buffer == bufferEnd) > > + return AtomicString(stringBuffer.get()); > > This seems really sketchy. Why not just figure out what the correct length is?
What do you mean exactly? Which length? This is the nearly the same alogrithm as for String::fromUtf8.
Patrick R. Gansterer
Comment 19
2010-10-30 13:27:15 PDT
Comment on
attachment 70357
[details]
Patch Setting back to r? as discussed on IRC with abarth. @gbarra Can you have look at the this patch? svn blame shows you as author for most of the code in AtomicString.
Darin Adler
Comment 20
2010-10-30 15:23:28 PDT
Comment on
attachment 70357
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=70357&action=review
Basic idea here is quite good, but we should improve the execution at least a bit before landing. One thing we could do to cut down the size of the patch a bit is to add a non-optimized AtomicString::fromUTF8 first along with changes to adopt it, and then add the second patch which adds the optimized code path.
> JavaScriptCore/wtf/text/AtomicString.cpp:217 > + mutable RefPtr<StringImpl> string;
It’s pretty strange to have a mutable member in a struct with no member functions. It might be better to use a class. I think this idiom deserves at least a comment. Using a mutable structure so we can allocate the string during the lookup process and then reuse it when putting the string into the hash table is cleverness. Cleverness is a bad thing without comments to explain to those who follow. Quite a bit of additional complexity is here with the intention of speeding up cases that I don’t think are common enough to deserve it. Storing a StringImpl in the structure optimizes the case where we compare and discover the string is a new string, by removing one pass of UTF-8-to-16 conversion, but at the cost of extra string allocations where we compare and discover the string is not new, so I think it’s not a good tradeoff. With no comments about rationale, this all is a bit mysterious.
> JavaScriptCore/wtf/text/AtomicString.cpp:220 > +static inline bool equal(StringImpl* string, const char* characters, unsigned length)
Since this function is only legal for ASCII characters, then the name needs to express that limitation, and I’d like to see an assertion as well. I’m not sure a fast path for all-ASCII is needed, though.
> JavaScriptCore/wtf/text/AtomicString.cpp:234 > +static inline void addStringImplIntoBuffer(const HashAndUTF8Characters& buffer)
The function name here is pretty confusing. This seems like a perfect candidate for a member function. The more-typical idiom for this would be to use a private member for the string, and then have a public accessor function that does this work of creating the string. Then we avoid having this awkwardly-named function, and we also have an interface that's harder to use incorrectly.
> JavaScriptCore/wtf/text/AtomicString.cpp:258 > + addStringImplIntoBuffer(buffer); > + return WTF::equal(string, buffer.string->characters(), buffer.string->length());
I think we easily can write code to compare a UTF-8 string with a UTF-16 string without allocating memory. This would come up in the relatively common case of looking for a non-ASCII string that is already in the table. It would be faster not to allocate memory in that case.
> JavaScriptCore/wtf/text/AtomicString.cpp:377 > + return AtomicString(StringImpl::empty());
I believe it’s faster to return emptyAtom here. Other call sites in the same file should do the same if they don’t already. Unless it’s not faster!
> JavaScriptCore/wtf/text/AtomicString.cpp:383 > + // We'll use a StringImpl as a buffer; if the source string only contains ascii this should be > + // the right length, if there are any multi-byte sequences this buffer will be too large. > + UChar* buffer; > + RefPtr<StringImpl> stringBuffer = StringImpl::createUninitialized(length, buffer); > + UChar* bufferEnd = buffer + length;
This is not the right approach. We should write code that can look up the string in the hash table without allocating memory at all. There’s no reason we need a buffer to convert UTF-8 to UTF-16 until we have proven the string is not already in the table. I don’t think we need a special case for ASCII either; it will come out fast automatically. In this patch, we reserve the optimization of not allocating the string for null-terminated strings. That’s not the tradeoff we usually make in WebKit.
> JavaScriptCore/wtf/text/AtomicString.cpp:399 > +AtomicString AtomicString::fromUTF8(const char* characters)
This should call the other fromUTF8 function after first calling strlen rather than having separate logic.
> JavaScriptCore/wtf/text/AtomicString.h:112 > + static AtomicString fromUTF8(const char*, size_t); > + static AtomicString fromUTF8(const char*);
I think these functions and the ones already in the String class deserve a comment indicating what that the function returns a null string if the characters are not valid UTF-8.
> JavaScriptCore/wtf/unicode/UTF8.h:73 > + bool calculateHashAndLength(const char* rawData, unsigned& hash, unsigned& rawLength, unsigned& utf16Length);
This function is unclear. It has an argument name “rawData” and nothing in the function name or argument name make it clear that it’s expected to be UTF-8 data. I think you are using “raw” here to mean “UTF-8”, which is not a usual term of art. I also find the boolean result unclear. Without a comment I have no idea what its significance is. Maybe true if the function fails, or false if it fails? It’s not right to have this function work on null-terminated strings. Our normal approach is that all functions work on strings with an explicit separate length argument, except for a few convenience functions. Those can use strlen and then call through to the general case. Since this function works only on null-terminated strings, we can’t use it at all in functions that have strings with an explicit length: the function would run off the end of the buffer looking for a null character and also would misinterpret null characters as end-of-string indicators. A single function that computes a hash for a UTF-8 string and also returns the UTF-16 length as a side effect is OK. But we should add special cases only if we can prove with profiling they are beneficial. More branches mean the code is harder to test. Knowing the length in advance can be useful so we can allocate a perfectly-sized buffer without scanning the string again. It also serves the dual purpose of being an all-ASCII flag, but it’s not clear we need such a flag. The specifici motivation and value in having oblique functions with multiple purposes like this one means we need at least a brief mention in a comment of the benefit of doing these two things at once. The location of the function in this header isn’t perfect either, and creates a bit of a modularity problem within WTF. The function computes a hash, but the UTF8.h file is a lower level Unicode building block and this level knows nothing about the hashing algorithm. The code that knows about both UTF-8 and the hash functions probably belongs in StringHashFunctions.h rather than UTF8.h. I’d like us to add a function that compares a UTF-16 string to a UTF-8 one so we can find non-ASCII strings that are already present in the AtomicString table without allocating memory.
Patrick R. Gansterer
Comment 21
2010-10-30 16:29:20 PDT
Comment on
attachment 70357
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=70357&action=review
Many thanks for so much comments!
>> JavaScriptCore/wtf/text/AtomicString.cpp:217 >> + mutable RefPtr<StringImpl> string; > > It’s pretty strange to have a mutable member in a struct with no member functions. It might be better to use a class. I think this idiom deserves at least a comment. Using a mutable structure so we can allocate the string during the lookup process and then reuse it when putting the string into the hash table is cleverness. Cleverness is a bad thing without comments to explain to those who follow. > > Quite a bit of additional complexity is here with the intention of speeding up cases that I don’t think are common enough to deserve it. Storing a StringImpl in the structure optimizes the case where we compare and discover the string is a new string, by removing one pass of UTF-8-to-16 conversion, but at the cost of extra string allocations where we compare and discover the string is not new, so I think it’s not a good tradeoff. > > With no comments about rationale, this all is a bit mysterious.
If we add a UTF16/UTF8 equal function we won't need it anymore. :-)
>> JavaScriptCore/wtf/text/AtomicString.cpp:220 >> +static inline bool equal(StringImpl* string, const char* characters, unsigned length) > > Since this function is only legal for ASCII characters, then the name needs to express that limitation, and I’d like to see an assertion as well. I’m not sure a fast path for all-ASCII is needed, though.
It has the same name as
http://trac.webkit.org/browser/trunk/JavaScriptCore/wtf/text/StringImpl.cpp?rev=69414#L915
. Maybe we should change the name of other function too and add a ASSERT(containsOnlyASCII()).
>> JavaScriptCore/wtf/text/AtomicString.cpp:234 >> +static inline void addStringImplIntoBuffer(const HashAndUTF8Characters& buffer) > > The function name here is pretty confusing. This seems like a perfect candidate for a member function. > > The more-typical idiom for this would be to use a private member for the string, and then have a public accessor function that does this work of creating the string. Then we avoid having this awkwardly-named function, and we also have an interface that's harder to use incorrectly.
Will be removed too in favor of UTF16/UTF8 equal.
>> JavaScriptCore/wtf/text/AtomicString.cpp:377 >> + return AtomicString(StringImpl::empty()); > > I believe it’s faster to return emptyAtom here. Other call sites in the same file should do the same if they don’t already. Unless it’s not faster!
I wanted to use it, but didn't found it. I was looking for a member function like StringImpl::empty().
>> JavaScriptCore/wtf/text/AtomicString.cpp:383 >> + UChar* bufferEnd = buffer + length; > > This is not the right approach. We should write code that can look up the string in the hash table without allocating memory at all. There’s no reason we need a buffer to convert UTF-8 to UTF-16 until we have proven the string is not already in the table. I don’t think we need a special case for ASCII either; it will come out fast automatically. > > In this patch, we reserve the optimization of not allocating the string for null-terminated strings. That’s not the tradeoff we usually make in WebKit.
If we add a equal function which compares UTF16 with UTF8 this won't make sense anymore.
>> JavaScriptCore/wtf/text/AtomicString.cpp:399 >> +AtomicString AtomicString::fromUTF8(const char* characters) > > This should call the other fromUTF8 function after first calling strlen rather than having separate logic.
I don't like the idea of calling strlen, when we iterate over the string too for calculating the hash afterwards. But it's ok to do it with strlen for now and get rid of it in a second patch.
>> JavaScriptCore/wtf/text/AtomicString.h:112 >> + static AtomicString fromUTF8(const char*); > > I think these functions and the ones already in the String class deserve a comment indicating what that the function returns a null string if the characters are not valid UTF-8.
I'll add comments in WTFString.h too.
>> JavaScriptCore/wtf/unicode/UTF8.h:73 >> + bool calculateHashAndLength(const char* rawData, unsigned& hash, unsigned& rawLength, unsigned& utf16Length); > > This function is unclear. It has an argument name “rawData” and nothing in the function name or argument name make it clear that it’s expected to be UTF-8 data. I think you are using “raw” here to mean “UTF-8”, which is not a usual term of art. I also find the boolean result unclear. Without a comment I have no idea what its significance is. Maybe true if the function fails, or false if it fails? > > It’s not right to have this function work on null-terminated strings. Our normal approach is that all functions work on strings with an explicit separate length argument, except for a few convenience functions. Those can use strlen and then call through to the general case. Since this function works only on null-terminated strings, we can’t use it at all in functions that have strings with an explicit length: the function would run off the end of the buffer looking for a null character and also would misinterpret null characters as end-of-string indicators. > > A single function that computes a hash for a UTF-8 string and also returns the UTF-16 length as a side effect is OK. But we should add special cases only if we can prove with profiling they are beneficial. More branches mean the code is harder to test. Knowing the length in advance can be useful so we can allocate a perfectly-sized buffer without scanning the string again. It also serves the dual purpose of being an all-ASCII flag, but it’s not clear we need such a flag. The specifici motivation and value in having oblique functions with multiple purposes like this one means we need at least a brief mention in a comment of the benefit of doing these two things at once. > > The location of the function in this header isn’t perfect either, and creates a bit of a modularity problem within WTF. The function computes a hash, but the UTF8.h file is a lower level Unicode building block and this level knows nothing about the hashing algorithm. The code that knows about both UTF-8 and the hash functions probably belongs in StringHashFunctions.h rather than UTF8.h. > > I’d like us to add a function that compares a UTF-16 string to a UTF-8 one so we can find non-ASCII strings that are already present in the AtomicString table without allocating memory.
In one of my first versions I had a utf8Length too, so a needed rawLength, but this doesn't apply any more. ;-) If we do the inital size check via strlen in a first patch, it makes no sense to work with null-terminated strings. IMHO the all-ASCII flag makes sense to avoid checking for UTF8-sequences later, but this might need some performance measurements to be sure. I'm not happy with the location too! I think that StringHash(Functions).h would be a better place, but the function uses to much (inline) stuff of the current UTF8.cpp (e.g. offsetsFromUTF8). We have the same problem for the UTF16/UTF8 equal function too. Our equal function all take a StringImpl*, but UTF8.cpp does not know anything about it, and StringImpl does not know anything about UTF8 sequences. I don't know what the best/clean solution might be.
Patrick R. Gansterer
Comment 22
2010-11-13 05:44:55 PST
Created
attachment 73819
[details]
Patch
Darin Adler
Comment 23
2010-11-15 12:11:33 PST
Comment on
attachment 73819
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=73819&action=review
Looks good. Please consider some of my suggestions for improvement. For now, I set commit-queue- so you could think over if you want to make these changes before you land.
> JavaScriptCore/wtf/text/AtomicString.cpp:233 > + UChar *target;
The * is positioned incorrectly here. It should be next to the UChar.
> JavaScriptCore/wtf/text/AtomicString.cpp:234 > + location = StringImpl::createUninitialized(buffer.utf16Length, target).releaseRef();
Please use the new name, leakRef, not the old name, releaseRef, in new code.
> JavaScriptCore/wtf/text/AtomicString.h:112 > + // the input data contains invalid UTF-8 characters.
Comment should say “invalid UTF-8 sequences” not “invalid UTF-8 characters”.
> JavaScriptCore/wtf/text/WTFString.h:313 > + // the input data contains invalid UTF-8 characters.
Same comment.
> JavaScriptCore/wtf/unicode/UTF8.cpp:238 > + UChar32 ch = 0;
I’d much prefer “character” for this local.
> JavaScriptCore/wtf/unicode/UTF8.cpp:243 > + case 6: ch += static_cast<unsigned char>(*sequence++); ch <<= 6; // remember, illegal UTF-8 > + case 5: ch += static_cast<unsigned char>(*sequence++); ch <<= 6; // remember, illegal UTF-8
Who are we saying “remember” to in these comments? I know you just moved them, but I think these comments are unhelpful and unclear. We should remove the comments.
> JavaScriptCore/wtf/unicode/UTF8.cpp:262 > + if (source + utf8SequenceLength > sourceEnd) {
This should be written like this: if (sourceEnd - source < utf8SequenceLength) It’s not great to compute an address past the end of the buffer, and doing it this way guarantees we don’t try to do that.
> JavaScriptCore/wtf/unicode/UTF8.cpp:313 > +unsigned calculateHashFromUTF8(const char* data, const char* dataEnd, unsigned& utf16Length)
This function needs a policy for handling lengths that do not fit in unsigned. Even if that policy is simply to call CRASH when encountering a giant length. Not doing so can lead to security vulnerabilities. Alternatively, the type of the out argument could be changed to a size_t, but then the callers would need to check if they narrowed it to an unsigned.
> JavaScriptCore/wtf/unicode/UTF8.cpp:322 > + if (!(*data & 0x80)) {
It would be better to use isASCII from ASCIICType.h for this.
> JavaScriptCore/wtf/unicode/UTF8.cpp:337 > + ASSERT(ch >= 0x80);
It would be better to use isASCII from ASCIICType.h for this.
> JavaScriptCore/wtf/unicode/UTF8.cpp:339 > + if (ch <= 0xFFFF) {
It would be better to use U_IS_BMP for this. This is a macro from ICU that we may already implement for non-ICU platforms, but if not we can do so.
> JavaScriptCore/wtf/unicode/UTF8.cpp:341 > + if (ch >= 0xD800 && ch <= 0xDFFF)
It would be better to use U_IS_SURROGATE for this. Another ICU macro.
> JavaScriptCore/wtf/unicode/UTF8.cpp:348 > + } else if (ch <= 0x10FFFF) { > + ch -= 0x0010000UL; > + stringHasher.addCharacters(static_cast<UChar>((ch >> 10) + 0xD800), > + static_cast<UChar>((ch & 0x03FF) + 0xDC00));
It would be better to use U_IS_SUPPLEMENTARY, U16_LEAD, and U16_TRAIL for this. More ICU macros.
> JavaScriptCore/wtf/unicode/UTF8.cpp:360 > + if (!(*b & 0x80)) {
It would be better to use isASCII from ASCIICType.h for this.
> JavaScriptCore/wtf/unicode/UTF8.cpp:361 > + if (*a++ != static_cast<UChar>(*b++))
The static_cast is unfortunate. Can we use a local variable to avoid the typecast?
> JavaScriptCore/wtf/unicode/UTF8.cpp:375 > + ASSERT(ch >= 0x80);
It would be better to use isASCII from ASCIICType.h for this.
> JavaScriptCore/wtf/unicode/UTF8.cpp:377 > + if (ch <= 0xFFFF) {
It would be better to use U_IS_BMP for this.
> JavaScriptCore/wtf/unicode/UTF8.cpp:379 > + if (ch >= 0xD800 && ch <= 0xDFFF)
It would be better to use U_IS_SURROGATE for this.
> JavaScriptCore/wtf/unicode/UTF8.cpp:381 > + if (*a++ != static_cast<UChar>(ch))
I don’t think we need the static_cast here. Please remove it.
> JavaScriptCore/wtf/unicode/UTF8.cpp:389 > + } else if (ch <= 0x10FFFF) { > + ch -= 0x0010000UL; > + > + if (*a++ != static_cast<UChar>((ch >> 10) + 0xD800)) > + return false; > + if (*a++ != static_cast<UChar>((ch & 0x03FF) + 0xDC00)) > + return false;
It would be better to use U_IS_SUPPLEMENTARY, U16_LEAD, and U16_TRAIL for this.
> JavaScriptCore/wtf/unicode/UTF8.h:73 > + unsigned calculateHashFromUTF8(const char* data, const char* dataEnd, unsigned& utf16Length);
Still unfortunate to have this here in this file. At the very least this needs a comment to explain what kind of hash we are talking about.
> JavaScriptCore/wtf/unicode/UTF8.h:75 > + bool equalUTF8(const UChar* a, const UChar* aEnd, const char* b, const char* bEnd);
This name should mention UTF-16, not just UTF-8, since the function compares UTF-16 to UTF-8.
Build Bot
Comment 24
2010-11-15 22:04:36 PST
Attachment 73819
[details]
did not build on win: Build output:
http://queues.webkit.org/results/5989100
Patrick R. Gansterer
Comment 25
2010-11-15 23:59:07 PST
I created
bug 45981
for some cleanup in UTF8.cpp. IMHO the UTF8 changes are independent and should go in before this one.
Patrick R. Gansterer
Comment 26
2010-11-16 00:02:34 PST
(In reply to
comment #25
)
> I created
bug 45981
for some cleanup in UTF8.cpp.
bug 49581
instead of
bug 45981
Patrick R. Gansterer
Comment 27
2010-12-01 06:35:27 PST
Created
attachment 75269
[details]
Patch
WebKit Review Bot
Comment 28
2010-12-01 06:38:00 PST
Attachment 75269
[details]
did not pass style-queue: Failed to run "['WebKitTools/Scripts/check-webkit-style', '--diff-files', u'JavaScriptCore/ChangeLog', u'JavaScriptCore/JavaScriptCore.exp', u'JavaScriptCore/wtf/text/AtomicString.cpp', u'JavaScriptCore/wtf/text/AtomicString.h', u'JavaScriptCore/wtf/text/StringImpl.h', u'JavaScriptCore/wtf/text/WTFString.h', u'JavaScriptCore/wtf/unicode/UTF8.cpp', u'JavaScriptCore/wtf/unicode/UTF8.h', u'WebCore/ChangeLog', u'WebCore/dom/XMLDocumentParserLibxml2.cpp']" exit_code: 1 JavaScriptCore/wtf/text/AtomicString.cpp:233: Declaration has space between type name and * in UChar *target [whitespace/declaration] [3] Total errors found: 1 in 10 files If any of these errors are false positives, please file a bug against check-webkit-style.
Patrick R. Gansterer
Comment 29
2010-12-01 07:22:02 PST
Created
attachment 75271
[details]
Patch
Build Bot
Comment 30
2010-12-01 07:51:08 PST
Attachment 75269
[details]
did not build on win: Build output:
http://queues.webkit.org/results/6646008
Patrick R. Gansterer
Comment 31
2010-12-01 08:07:01 PST
Created
attachment 75274
[details]
Patch Try to make Win-EWS happy...
Build Bot
Comment 32
2010-12-01 09:04:19 PST
Attachment 75271
[details]
did not build on win: Build output:
http://queues.webkit.org/results/6617008
Patrick R. Gansterer
Comment 33
2010-12-01 13:15:20 PST
Created
attachment 75312
[details]
Patch Fixed style issue
Patrick R. Gansterer
Comment 34
2010-12-02 13:15:23 PST
@darin: ping
WebKit Commit Bot
Comment 35
2010-12-02 13:21:41 PST
Comment on
attachment 75312
[details]
Patch Rejecting patch 75312 from commit-queue. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-sf', 'apply-attachment', '--force-clean', '--non-interactive', 75312]" exit_code: 2 Last 500 characters of output: ing file JavaScriptCore/wtf/text/AtomicString.h patching file JavaScriptCore/wtf/text/StringImpl.h patching file JavaScriptCore/wtf/text/WTFString.h patching file JavaScriptCore/wtf/unicode/UTF8.cpp patching file JavaScriptCore/wtf/unicode/UTF8.h patching file WebCore/ChangeLog Hunk #1 succeeded at 1 with fuzz 3. patching file WebCore/dom/XMLDocumentParserLibxml2.cpp Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Darin Adler', u'--force']" exit_code: 1 Full output:
http://queues.webkit.org/results/6807015
Patrick R. Gansterer
Comment 36
2010-12-02 13:56:16 PST
Created
attachment 75414
[details]
Patch
WebKit Commit Bot
Comment 37
2010-12-02 14:45:54 PST
Comment on
attachment 75414
[details]
Patch Clearing flags on attachment: 75414 Committed
r73201
: <
http://trac.webkit.org/changeset/73201
>
WebKit Commit Bot
Comment 38
2010-12-02 14:46:04 PST
All reviewed patches have been landed. Closing bug.
WebKit Review Bot
Comment 39
2010-12-02 14:51:43 PST
http://trac.webkit.org/changeset/73201
might have broken Qt Windows 32-bit Debug
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug