Bug 211142 - U_STRING_NOT_TERMINATED_WARNING ICU must be handled when using the output buffer as a C string
Summary: U_STRING_NOT_TERMINATED_WARNING ICU must be handled when using the output buf...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Saam Barati
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-04-28 12:48 PDT by Saam Barati
Modified: 2020-04-29 00:15 PDT (History)
24 users (show)

See Also:


Attachments
patch (28.49 KB, patch)
2020-04-28 16:45 PDT, Saam Barati
no flags Details | Formatted Diff | Diff
patch for landing (28.31 KB, patch)
2020-04-28 21:26 PDT, Saam Barati
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Saam Barati 2020-04-28 12:48:17 PDT
We have code that checks if we got the U_BUFFER_OVERFLOW_ERROR, and we grow the output buffer and try again when we see such a "warning". We also detect "failure" using U_FAILURE, which doesn't consdier U_STRING_NOT_TERMINATED_WARNING as failure, which seems crazy at first glance, and is leading to bugs in at least some parts of JSC.

There are 118 uses of U_BUFFER_OVERFLOW_ERROR in WebKit
There are 211 uses of U_FAILURE in WebKit

I think we need to audit all of them
Comment 1 Saam Barati 2020-04-28 12:52:17 PDT
Some examples that appear troublesome:
```
String LocaleICU::decimalSymbol(UNumberFormatSymbol symbol)
{
    UErrorCode status = U_ZERO_ERROR;
    int32_t bufferLength = unum_getSymbol(m_numberFormat, symbol, 0, 0, &status);
    ASSERT(U_SUCCESS(status) || status == U_BUFFER_OVERFLOW_ERROR);
    if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
        return String();
    Vector<UChar> buffer(bufferLength);
    status = U_ZERO_ERROR;
    unum_getSymbol(m_numberFormat, symbol, buffer.data(), bufferLength, &status);
    if (U_FAILURE(status))
        return String();
    return String::adopt(WTFMove(buffer));
}
```

```
    int32_t parsedLength;
    auto intermediateLength = uloc_forLanguageTag(input.data(), intermediate.data(), intermediate.size(), &parsedLength, &status);
    if (status == U_BUFFER_OVERFLOW_ERROR) {
        intermediate.resize(intermediateLength + 1);
        status = U_ZERO_ERROR;
        uloc_forLanguageTag(input.data(), intermediate.data(), intermediateLength + 1, &parsedLength, &status);
    }
    if (U_FAILURE(status) || parsedLength != static_cast<int32_t>(input.length()))
        return String();
```
Comment 2 Darin Adler 2020-04-28 12:53:26 PDT
I am puzzled by this. What specifically goes wrong in what case?
Comment 3 Saam Barati 2020-04-28 12:53:53 PDT
(In reply to Saam Barati from comment #1)
> Some examples that appear troublesome:
> ```
> String LocaleICU::decimalSymbol(UNumberFormatSymbol symbol)
> {
>     UErrorCode status = U_ZERO_ERROR;
>     int32_t bufferLength = unum_getSymbol(m_numberFormat, symbol, 0, 0,
> &status);
>     ASSERT(U_SUCCESS(status) || status == U_BUFFER_OVERFLOW_ERROR);
>     if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
>         return String();
>     Vector<UChar> buffer(bufferLength);
>     status = U_ZERO_ERROR;
>     unum_getSymbol(m_numberFormat, symbol, buffer.data(), bufferLength,
> &status);
>     if (U_FAILURE(status))
>         return String();
>     return String::adopt(WTFMove(buffer));
> }
> ```
Actually this one might be ok, since we don't rely on the buffer being a valid C string

> 
> ```
>     int32_t parsedLength;
>     auto intermediateLength = uloc_forLanguageTag(input.data(),
> intermediate.data(), intermediate.size(), &parsedLength, &status);
>     if (status == U_BUFFER_OVERFLOW_ERROR) {
>         intermediate.resize(intermediateLength + 1);
>         status = U_ZERO_ERROR;
>         uloc_forLanguageTag(input.data(), intermediate.data(),
> intermediateLength + 1, &parsedLength, &status);
>     }
>     if (U_FAILURE(status) || parsedLength !=
> static_cast<int32_t>(input.length()))
>         return String();
> ```

This isn't ok, since we do rely on it being a C string
Comment 4 Darin Adler 2020-04-28 12:54:01 PDT
U_STRING_NOT_TERMINATED_WARNING is only relevant if we rely on a 0 character terminator, and we do not rely on that in the code above.
Comment 5 Darin Adler 2020-04-28 12:54:16 PDT
I’m not sure there is any problem here at all.
Comment 6 Saam Barati 2020-04-28 12:54:59 PDT
(In reply to Darin Adler from comment #2)
> I am puzzled by this. What specifically goes wrong in what case?

We will take the output of an ICU call and trust that it's a C string.

See:
```
    UErrorCode status = U_ZERO_ERROR;
    Vector<char, 32> intermediate(31);
    int32_t parsedLength;
    auto intermediateLength = uloc_forLanguageTag(input.data(), intermediate.data(), intermediate.size(), &parsedLength, &status);
    if (status == U_BUFFER_OVERFLOW_ERROR) {
        intermediate.resize(intermediateLength + 1);
        status = U_ZERO_ERROR;
        uloc_forLanguageTag(input.data(), intermediate.data(), intermediateLength + 1, &parsedLength, &status);
    }
    if (U_FAILURE(status) || parsedLength != static_cast<int32_t>(input.length()))
        return String();

    Vector<char, 32> result(32);
    auto resultLength = uloc_toLanguageTag(intermediate.data(), result.data(), result.size(), false, &status);
    if (status == U_BUFFER_OVERFLOW_ERROR) {
        result.grow(resultLength);
        status = U_ZERO_ERROR;
        uloc_toLanguageTag(intermediate.data(), result.data(), resultLength, false, &status);
    }
    if (U_FAILURE(status))
        return String();
```
Comment 7 Darin Adler 2020-04-28 12:57:35 PDT
(In reply to Saam Barati from comment #3)
> >     if (U_FAILURE(status) || parsedLength !=
> > static_cast<int32_t>(input.length()))
> >         return String();
> > ```
> 
> This isn't ok, since we do rely on it being a C string

I see now. That’s right! That case is wrong. I’d be surprised if there are a lot of other examples like this, but there may be a few in the new Intl code.
Comment 8 Darin Adler 2020-04-28 12:58:38 PDT
(In reply to Saam Barati from comment #6)
> We will take the output of an ICU call and trust that it's a C string.

I think it’s great to audit carefully. But I think it’s unlikely that we’ve made this mistake a lot in the past.
Comment 9 Saam Barati 2020-04-28 12:59:00 PDT
(In reply to Darin Adler from comment #7)
> (In reply to Saam Barati from comment #3)
> > >     if (U_FAILURE(status) || parsedLength !=
> > > static_cast<int32_t>(input.length()))
> > >         return String();
> > > ```
> > 
> > This isn't ok, since we do rely on it being a C string
> 
> I see now. That’s right! That case is wrong. I’d be surprised if there are a
> lot of other examples like this, but there may be a few in the new Intl code.

Yeah my initial reaction of this being terribly wrong was off-base.

However, I've spotted a handful in JSC's Intl code already, so there exist a non-trivial amount.

It's probably worth doing at least a basic audit of all uses of this.
Comment 10 Saam Barati 2020-04-28 12:59:29 PDT
(In reply to Darin Adler from comment #8)
> (In reply to Saam Barati from comment #6)
> > We will take the output of an ICU call and trust that it's a C string.
> 
> I think it’s great to audit carefully. But I think it’s unlikely that we’ve
> made this mistake a lot in the past.

Yeah it seems like we do the right thing in most places, maybe except JSC's Intl code.
Comment 11 Ross Kirsling 2020-04-28 13:14:38 PDT
I'll note that that uloc_forLanguageTag -> uloc_toLanguageTag case was particularly tricky to get working safely (hence the comment I added) so any ideas on how to make it less fragile would be great.

The other case I'm aware of where we pass the C string from one ICU API directly into another is here:
https://github.com/WebKit/webkit/blob/master/Source/JavaScriptCore/runtime/IntlDateTimeFormat.cpp#L115-L144

But this case is still less horrible because the second call takes an input length.
Comment 12 Darin Adler 2020-04-28 13:35:39 PDT
I want to quibble with the language here. Let’s stay logical: there’s no need for fuzziness here.

If we need a null-terminated string, we *must* treat U_STRING_NOT_TERMINATED_WARNING as a “buffer too small” case; unless there is some other way we are adding a null terminator. If we do *not* need a null-terminated string, then we can and should ignore that warning.

(In reply to Ross Kirsling from comment #11)
> But this case is still less horrible because the second call takes an input
> length.

Here’s my quibble. That case is not just “less horrible”. It is *correct*.

The other case is *incorrect*.
Comment 13 Saam Barati 2020-04-28 16:37:06 PDT
<rdar://problem/62530860>
Comment 14 Saam Barati 2020-04-28 16:45:48 PDT
Created attachment 397904 [details]
patch
Comment 15 Darin Adler 2020-04-28 17:04:00 PDT
Comment on attachment 397904 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397904&action=review

Honestly I think needsToGrowToProduceBuffer doesn’t make things easier to read. It’s mainly valuable as a record of the fact that you audited a given call site. On the other hand, needsToGrowToProduceCString is definitely valuable. But only in exactly one place. I think adding the header is overkill, but I won’t stand in your way if you still like it.

> Source/JavaScriptCore/runtime/IntlObject.cpp:334
> +    if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING || parsedLength != static_cast<int32_t>(input.length()))

How can U_STRING_NOT_TERMINATED_WARNING happen here? Seems like we can just leave this out or simply assert it doesn’t happen. We are using the length it returned, so there is no chance it will happen.

> Source/JavaScriptCore/runtime/IntlObject.cpp:346
> +#if ASSERT_ENABLED
> +    bool sawNullTerminator = false;
> +    for (char c : intermediate) {
> +        if (c == '\0') {
> +            sawNullTerminator = true;
> +            break;
> +        }
> +    }
> +    ASSERT(sawNullTerminator);
> +#endif

Alternate implementation that does the same thing:

    ASSERT(intermediate.contains('\0'));

> Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:78
> +        if (needsToGrowToProduceCString(status)) {

This doesn’t seem right. I don’t see code depending on scratchBuffer being null terminated.

> Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:79
>              scratchBuffer.grow(lengthNeeded + 1);

Seems like this should just be lengthNeeded, not + 1.
Comment 16 Mark Lam 2020-04-28 19:52:40 PDT
Comment on attachment 397904 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397904&action=review

> Source/JavaScriptCore/runtime/IntlObject.cpp:154
> -        return String(StringImpl::createStaticStringImpl(buffer.data(), length));
> +        return String(buffer.data(), length);

This is wrong.  convertICULocaleToBCP47LanguageTag() is only called from inside std::call_once guarded blocks i.e. it's meant to produce immortal singleton strings.  Before I changed it to be StringImpl::createStaticStringImpl(), we had UAFs due to a user of this string freeing it.
Comment 17 Mark Lam 2020-04-28 20:02:53 PDT
Comment on attachment 397904 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397904&action=review

>> Source/JavaScriptCore/runtime/IntlObject.cpp:154
>> +        return String(buffer.data(), length);
> 
> This is wrong.  convertICULocaleToBCP47LanguageTag() is only called from inside std::call_once guarded blocks i.e. it's meant to produce immortal singleton strings.  Before I changed it to be StringImpl::createStaticStringImpl(), we had UAFs due to a user of this string freeing it.

I talked to Saam offline, but for the record, there's another important property of StaticStringImpls i.e. that they are safe to be ref'ed and deref'ed from different threads without locks.  That's the most important reason this needs to be a StaticStringImpl because it is a singleton that can be used from different threads.
Comment 18 Mark Lam 2020-04-28 20:04:56 PDT
Comment on attachment 397904 [details]
patch

Restoring Darin's r+ since Saam said he will undo the StringImpl::createStaticStringImpl() change.
Comment 19 Saam Barati 2020-04-28 20:11:26 PDT
Comment on attachment 397904 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397904&action=review

>> Source/JavaScriptCore/runtime/IntlObject.cpp:154
>> +        return String(buffer.data(), length);
> 
> This is wrong.  convertICULocaleToBCP47LanguageTag() is only called from inside std::call_once guarded blocks i.e. it's meant to produce immortal singleton strings.  Before I changed it to be StringImpl::createStaticStringImpl(), we had UAFs due to a user of this string freeing it.

As Mark and I discussed on Slack, the actual issue here isn't being called from call_once, but the fact that the resulting string may ref/deref from more than one thread at once.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:334
>> +    if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING || parsedLength != static_cast<int32_t>(input.length()))
> 
> How can U_STRING_NOT_TERMINATED_WARNING happen here? Seems like we can just leave this out or simply assert it doesn’t happen. We are using the length it returned, so there is no chance it will happen.

My thinking here was just ICU failing for some reason, but this would be a weird way for it to fail.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:346
>> +#endif
> 
> Alternate implementation that does the same thing:
> 
>     ASSERT(intermediate.contains('\0'));

That'd be too easy!

>> Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:78
>> +        if (needsToGrowToProduceCString(status)) {
> 
> This doesn’t seem right. I don’t see code depending on scratchBuffer being null terminated.

Yeah it's not needed. I got confused with the AtomString::fromUTF8 version that doesn't take "length" that ends up calling strlen.
Comment 20 Saam Barati 2020-04-28 20:13:34 PDT
Comment on attachment 397904 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397904&action=review

>>> Source/JavaScriptCore/runtime/IntlObject.cpp:334
>>> +    if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING || parsedLength != static_cast<int32_t>(input.length()))
>> 
>> How can U_STRING_NOT_TERMINATED_WARNING happen here? Seems like we can just leave this out or simply assert it doesn’t happen. We are using the length it returned, so there is no chance it will happen.
> 
> My thinking here was just ICU failing for some reason, but this would be a weird way for it to fail.

I guess part of my reasoning. is that if for some reason ICU again hit this warning, that "status == U_STRING_NOT_TERMINATED_WARNING" is just as bad as U_FAILURE
Comment 21 Saam Barati 2020-04-28 20:39:57 PDT
(In reply to Darin Adler from comment #15)
> Comment on attachment 397904 [details]
> patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=397904&action=review
> 
> Honestly I think needsToGrowToProduceBuffer doesn’t make things easier to
> read. It’s mainly valuable as a record of the fact that you audited a given
> call site. On the other hand, needsToGrowToProduceCString is definitely
> valuable. But only in exactly one place. I think adding the header is
> overkill, but I won’t stand in your way if you still like it.
I'm going to keep "needsToGrowToProduceBuffer". The reason I like it is that if you end writing code to use it, it's (hopefully) a forcing function to consider if you should instead be using needsToGrowToProduceCString.

> 
> > Source/JavaScriptCore/runtime/IntlObject.cpp:334
> > +    if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING || parsedLength != static_cast<int32_t>(input.length()))
> 
> How can U_STRING_NOT_TERMINATED_WARNING happen here? Seems like we can just
> leave this out or simply assert it doesn’t happen. We are using the length
> it returned, so there is no chance it will happen.
> 
> > Source/JavaScriptCore/runtime/IntlObject.cpp:346
> > +#if ASSERT_ENABLED
> > +    bool sawNullTerminator = false;
> > +    for (char c : intermediate) {
> > +        if (c == '\0') {
> > +            sawNullTerminator = true;
> > +            break;
> > +        }
> > +    }
> > +    ASSERT(sawNullTerminator);
> > +#endif
> 
> Alternate implementation that does the same thing:
> 
>     ASSERT(intermediate.contains('\0'));
> 
> > Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:78
> > +        if (needsToGrowToProduceCString(status)) {
> 
> This doesn’t seem right. I don’t see code depending on scratchBuffer being
> null terminated.
> 
> > Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:79
> >              scratchBuffer.grow(lengthNeeded + 1);
> 
> Seems like this should just be lengthNeeded, not + 1.
Comment 22 Saam Barati 2020-04-28 20:48:48 PDT
Comment on attachment 397904 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397904&action=review

>> Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:79
>>              scratchBuffer.grow(lengthNeeded + 1);
> 
> Seems like this should just be lengthNeeded, not + 1.

I believe this is needed. But this is my first time reading ICU documentation, and it's not crystal clear exactly what the invariants are. My reading, and slight guessing, given the ambiguity, at what ICU means by "input buffer containing well-formed locale ID to be modified" is that it's null-terminated.
There is also this warning: "NOTE: Unlike almost every other ICU function which takes a buffer, this function will NOT truncate the output text, and will not update the buffer with unterminated text setting a status of U_STRING_NOT_TERMINATED_WARNING.". So I think we need to give it the capacity to write the terminating byte otherwise it won't write any output

I'm reading the documentation here:
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uloc_8h.html#a4ba76d26bf66ef7629a649f6dfc3cc93
Comment 23 Saam Barati 2020-04-28 21:22:37 PDT
(In reply to Saam Barati from comment #22)
> Comment on attachment 397904 [details]
> patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=397904&action=review
> 
> >> Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:79
> >>              scratchBuffer.grow(lengthNeeded + 1);
> > 
> > Seems like this should just be lengthNeeded, not + 1.
> 
> I believe this is needed. But this is my first time reading ICU
> documentation, and it's not crystal clear exactly what the invariants are.
> My reading, and slight guessing, given the ambiguity, at what ICU means by
> "input buffer containing well-formed locale ID to be modified" is that it's
> null-terminated.
> There is also this warning: "NOTE: Unlike almost every other ICU function
> which takes a buffer, this function will NOT truncate the output text, and
> will not update the buffer with unterminated text setting a status of
> U_STRING_NOT_TERMINATED_WARNING.". So I think we need to give it the
> capacity to write the terminating byte otherwise it won't write any output
> 
> I'm reading the documentation here:
> https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uloc_8h.
> html#a4ba76d26bf66ef7629a649f6dfc3cc93

I misread their "NOTE". As Ross pointed out on Slack, ICU is calling out that they won't set such a warning. I read this as having a comma between "text" and "setting".
Comment 24 Saam Barati 2020-04-28 21:26:07 PDT
Created attachment 397921 [details]
patch for landing
Comment 25 EWS 2020-04-29 00:15:23 PDT
Committed r260882: <https://trac.webkit.org/changeset/260882>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 397921 [details].