Bug 144257 - REGRESSION (r183373): ASSERT failed in wtf/SHA1.h
Summary: REGRESSION (r183373): ASSERT failed in wtf/SHA1.h
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords:
Depends on:
Blocks: 144263 142691
  Show dependency treegraph
 
Reported: 2015-04-27 03:11 PDT by Yusuke Suzuki
Modified: 2015-04-29 10:46 PDT (History)
11 users (show)

See Also:


Attachments
Patch (3.08 KB, patch)
2015-04-27 03:47 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (3.08 KB, patch)
2015-04-27 04:06 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (3.89 KB, patch)
2015-04-27 10:17 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (11.88 KB, patch)
2015-04-28 06:09 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (13.19 KB, patch)
2015-04-28 06:28 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (14.69 KB, patch)
2015-04-28 06:31 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (5.71 KB, patch)
2015-04-29 08:33 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Comment 1 Yusuke Suzuki 2015-04-27 03:20:23 PDT
OK. I've found the issue.
SHA1 is used to calculate CodeBlockHash.
To calculate hash value, we pass the source code UTF-8 string to SHA1::addBytes.
However, in SHA1::addBytes, there's assertion `input.length() == strlen(string)`.

In the template-literal-syntax.js, we perform `eval` with the script contains "\0". As the result, `strlen(string)` accidentally shortened by the contained "\0", and assertion fails.
Comment 2 Yusuke Suzuki 2015-04-27 03:47:52 PDT
Created attachment 251733 [details]
Patch
Comment 3 Csaba Osztrogonác 2015-04-27 04:03:04 PDT
Comment on attachment 251733 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251733&action=review

The fix looks good to me, but I'm not sure if the source code can 
contain null characters, so I would let JSC experts to review this fix.

> Source/JavaScriptCore/ChangeLog:11
> +        So when performing `strlen` on the source code's CString, it returnes the incorrect length.

typo: returnes -> returns

> Source/JavaScriptCore/ChangeLog:19
> +        It allow the given bytes to contain a null character.

typo: allow -> allows
Comment 4 Yusuke Suzuki 2015-04-27 04:06:25 PDT
Created attachment 251734 [details]
Patch
Comment 5 Darin Adler 2015-04-27 09:33:28 PDT
Comment on attachment 251734 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251734&action=review

> Source/JavaScriptCore/bytecode/CodeBlockHash.cpp:47
> -    sha1.addBytes(sourceCode.toUTF8());
> +    CString sourceCodeBytes = sourceCode.toUTF8();
> +    // Since the source code string can contain a null character,
> +    // pass the bytes and length explicitly instead of passing CString as an argument.
> +    sha1.addBytes(reinterpret_cast<const uint8_t*>(sourceCodeBytes.data()), sourceCodeBytes.length());

This should be fixed by removing the incorrect assertion from SHA1::addBytes, not by changing the code here at all. That function already has the correct algorithm, it just has a misguided assertion.

But at some point, this code should be made be better in other ways. We don’t need a single copy of all the data in UTF-8 just so we can pass it to the SHA1 algorithm; we just need a view of it in chunks that we can pass in. The best API for something like that is not something that allocated a UTF8 string, but rather a streaming decoder that can give us ranges of bytes in UTF-8 form. This enables two optimizations: 1) if the string is stored in an 8-bit buffer and is all ASCII then we can just compute the SHA1 in place without allocating any memory at all; 2) we don’t need to heap allocate a single object big enough to translate the entire thing to UTF-8. Instead we can use a fixed size buffer and not bother with heap allocations at all. An idiom I like to use for this is a “chunky iterator” which returns points to subsequent runs of UTF-8 bytes. The caller decides how big the runs are.

I think SourceCode and String need something like this. It makes no sense to copy all the source code into a large memory buffer just to compute an SHA1 of it. One way to do it is to make a function that takes a std::function that gets a pointer and a length to UTF-8 bytes that gets called repeatedly until the entire string is iterated.
Comment 6 Yusuke Suzuki 2015-04-27 10:08:37 PDT
Comment on attachment 251734 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251734&action=review

>> Source/JavaScriptCore/bytecode/CodeBlockHash.cpp:47
>> +    sha1.addBytes(reinterpret_cast<const uint8_t*>(sourceCodeBytes.data()), sourceCodeBytes.length());
> 
> This should be fixed by removing the incorrect assertion from SHA1::addBytes, not by changing the code here at all. That function already has the correct algorithm, it just has a misguided assertion.
> 
> But at some point, this code should be made be better in other ways. We don’t need a single copy of all the data in UTF-8 just so we can pass it to the SHA1 algorithm; we just need a view of it in chunks that we can pass in. The best API for something like that is not something that allocated a UTF8 string, but rather a streaming decoder that can give us ranges of bytes in UTF-8 form. This enables two optimizations: 1) if the string is stored in an 8-bit buffer and is all ASCII then we can just compute the SHA1 in place without allocating any memory at all; 2) we don’t need to heap allocate a single object big enough to translate the entire thing to UTF-8. Instead we can use a fixed size buffer and not bother with heap allocations at all. An idiom I like to use for this is a “chunky iterator” which returns points to subsequent runs of UTF-8 bytes. The caller decides how big the runs are.
> 
> I think SourceCode and String need something like this. It makes no sense to copy all the source code into a large memory buffer just to compute an SHA1 of it. One way to do it is to make a function that takes a std::function that gets a pointer and a length to UTF-8 bytes that gets called repeatedly until the entire string is iterated.

You're completely right.
We should stream the decoded UTF-8 bytes into SHA1 instead of the current "alloc, convert and use" implementation.
I've opened the new issue to implement it[1].

So in the meantime, to fix the debug builds, I'll upload the simple patch dropping the assertion in SHA1.

[1]: https://bugs.webkit.org/show_bug.cgi?id=144263
Comment 7 Yusuke Suzuki 2015-04-27 10:17:11 PDT
Created attachment 251754 [details]
Patch
Comment 8 Alexey Proskuryakov 2015-04-27 12:54:08 PDT
CString is not a data buffer, so a CString object with a null byte is simply invalid. I don't think that the assertion is incorrect, the original patch looks a lot better to me.

We've had enough hard to diagnose bugs caused by null bytes in CStrings in the past to make me want to disallow this in every way possible.
Comment 9 Ryosuke Niwa 2015-04-27 14:09:46 PDT
Committed r183421: <http://trac.webkit.org/changeset/183421>
Comment 10 Ryosuke Niwa 2015-04-27 14:10:33 PDT
(In reply to comment #9)
> Committed r183421: <http://trac.webkit.org/changeset/183421>

I've skipped three tests failing on Windows.
Comment 11 Darin Adler 2015-04-27 14:58:08 PDT
(In reply to comment #8)
> CString is not a data buffer, so a CString object with a null byte is simply
> invalid.

I don’t agree. That’s not the intent of the CString class, although perhaps it should be in the future.

> I don't think that the assertion is incorrect, the original patch
> looks a lot better to me.

We should use Vector<char> for these purposes if a CString can’t have a null byte.

> We've had enough hard to diagnose bugs caused by null bytes in CStrings in
> the past to make me want to disallow this in every way possible.

OK, but that is a statement about the future, not about the present.
Comment 12 Alexey Proskuryakov 2015-04-27 16:33:52 PDT
> OK, but that is a statement about the future, not about the present.

CString is documented like this:

// A container for a null-terminated char array supporting copy-on-write
// assignment.  The contained char array may be null.

A null terminated string with null bytes in the middle does not really make sense. And in fact, we allow CStrings whose length is bigger than string length, because null termination is the truth.
Comment 13 Benjamin Poulain 2015-04-27 16:46:32 PDT
Don't forget to unskip the tests when landing.
Comment 14 Yusuke Suzuki 2015-04-28 06:09:55 PDT
Created attachment 251842 [details]
Patch
Comment 15 Yusuke Suzuki 2015-04-28 06:10:36 PDT
(In reply to comment #12)
> > OK, but that is a statement about the future, not about the present.
> 
> CString is documented like this:
> 
> // A container for a null-terminated char array supporting copy-on-write
> // assignment.  The contained char array may be null.
> 
> A null terminated string with null bytes in the middle does not really make
> sense. And in fact, we allow CStrings whose length is bigger than string
> length, because null termination is the truth.

Thank you for your comment.

> And in fact, we allow CStrings whose length is bigger than string length, because null termination is the truth.

So, it seems that the current assertion `input.length() == strlen(string)` is problematic anyway.

OK, so I propose the following fix,

1. Dropping SHA1::addBytes(CString) interface.

SHA1 should feed byte streams (bytes). So the interface accepting CString is not good I think. And if CString should not contain a NULL character, we should insert assertions into CString instead of SHA1::addBytes(CString). Seeing the code, CodeBlockHash is the only user of this.

2. Add StringImpl::utf8BufferForRange -> Vector<char> function.

It returns Vector instead of CString.
And we implement utf8ForRange based on this function.

3. Use returned buffer vector in CodeBlockHash.

Use a buffer vector instead of CString because CString should not contain a null character.

And I've upload the revised patch. darin@ & ap@, could you review this once more?
Comment 16 Yusuke Suzuki 2015-04-28 06:11:56 PDT
I've get an TestWebKitAPI build error. Just dropping this test (SHA1::addBytes(CString)).
Comment 17 Yusuke Suzuki 2015-04-28 06:28:13 PDT
Created attachment 251843 [details]
Patch
Comment 18 Yusuke Suzuki 2015-04-28 06:30:10 PDT
And unskip template literal tests...
Comment 19 Yusuke Suzuki 2015-04-28 06:31:46 PDT
Created attachment 251844 [details]
Patch
Comment 20 Yusuke Suzuki 2015-04-28 06:33:12 PDT
(In reply to comment #13)
> Don't forget to unskip the tests when landing.

Thank you! Now win32 issue is also fixed by peavo@
http://trac.webkit.org/changeset/183465

So unskipping the tests in the updated patch.
Comment 21 Darin Adler 2015-04-28 08:24:22 PDT
(In reply to comment #12)
> > OK, but that is a statement about the future, not about the present.
> 
> CString is documented like this:
> 
> // A container for a null-terminated char array supporting copy-on-write
> // assignment.  The contained char array may be null.
> 
> A null terminated string with null bytes in the middle does not really make
> sense. And in fact, we allow CStrings whose length is bigger than string
> length, because null termination is the truth.

If this was actually CString’s contract, then CString::length function would have the strlen assertion, not SHA1::addBytes.

We allow CString with null characters in them, and they can be used at call sites that don’t care about null character termination at all and are just using CString as a holder for a string of 8-bit characters. If we didn’t allow this use, the String::UTF8 would need to fail when decoding a null character rather than successfully a CString with an embedded null character.

I am happy for us to make this change and forbid this usage, but despite the comment you cite, this is not the full truth about what CString is today.
Comment 22 Darin Adler 2015-04-28 08:28:52 PDT
I do see, however, that there are some CString functions that do not support the strings with embedded null characters the way I thought they did.

CString functions that work properly with embedded null characters:

    data
    mutableData
    length
    isNull
    isSafeToSendToAnotherThread
    buffer
    isHashTableDeletedValue
    == (when both sides are CString)

Functions that do not:

    hash
    <
    == (when one side is a CString and the other is a const char*)

It’s clear that we should either forbid the embedded null characters entirely, or make the functions all support them.

Alexey, lets figure out what we want to do about this and file a new bug about it.
Comment 23 Alexey Proskuryakov 2015-04-28 10:52:00 PDT
Thank you for the helpful analysis! I filed bug 144339 to track this.
Comment 24 Yusuke Suzuki 2015-04-29 08:17:41 PDT
Changing CString not to contain null character needs large changes and it is now handled in bug 144339.
So in this issue, I suggest the simplest solution; just dropping the assertion.
If it's not good because CString should not contain a null-character, I think it will be fixed in bug 144339.
Comment 25 Yusuke Suzuki 2015-04-29 08:33:40 PDT
Created attachment 251951 [details]
Patch
Comment 26 WebKit Commit Bot 2015-04-29 10:46:09 PDT
Comment on attachment 251951 [details]
Patch

Clearing flags on attachment: 251951

Committed r183559: <http://trac.webkit.org/changeset/183559>
Comment 27 WebKit Commit Bot 2015-04-29 10:46:14 PDT
All reviewed patches have been landed.  Closing bug.