Bug 144160 - [ES6] Implement String.fromCodePoint
Summary: [ES6] Implement String.fromCodePoint
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-24 14:20 PDT by Yusuke Suzuki
Modified: 2015-04-25 15:02 PDT (History)
7 users (show)

See Also:


Attachments
Patch (6.72 KB, patch)
2015-04-24 15:29 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews103 for mac-mavericks (539.07 KB, application/zip)
2015-04-24 16:02 PDT, Build Bot
no flags Details
Archive of layout-test-results from ews107 for mac-mavericks-wk2 (647.44 KB, application/zip)
2015-04-24 16:03 PDT, Build Bot
no flags Details
Patch (6.54 KB, patch)
2015-04-25 06:36 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (6.48 KB, patch)
2015-04-25 06:49 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews102 for mac-mavericks (543.14 KB, application/zip)
2015-04-25 07:22 PDT, Build Bot
no flags Details
Archive of layout-test-results from ews104 for mac-mavericks-wk2 (589.31 KB, application/zip)
2015-04-25 07:45 PDT, Build Bot
no flags Details
Patch (10.66 KB, patch)
2015-04-25 07:49 PDT, Yusuke Suzuki
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yusuke Suzuki 2015-04-24 14:20:51 PDT
[ES6] Implement String.fromCodePoint
Comment 1 Yusuke Suzuki 2015-04-24 15:29:11 PDT
Created attachment 251579 [details]
Patch
Comment 2 Build Bot 2015-04-24 16:02:22 PDT
Comment on attachment 251579 [details]
Patch

Attachment 251579 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.appspot.com/results/5169665965817856

New failing tests:
js/Object-getOwnPropertyNames.html
Comment 3 Build Bot 2015-04-24 16:02:25 PDT
Created attachment 251581 [details]
Archive of layout-test-results from ews103 for mac-mavericks

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews103  Port: mac-mavericks  Platform: Mac OS X 10.9.5
Comment 4 Build Bot 2015-04-24 16:03:13 PDT
Comment on attachment 251579 [details]
Patch

Attachment 251579 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.appspot.com/results/5934236179628032

New failing tests:
js/Object-getOwnPropertyNames.html
Comment 5 Build Bot 2015-04-24 16:03:17 PDT
Created attachment 251582 [details]
Archive of layout-test-results from ews107 for mac-mavericks-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: ews107  Port: mac-mavericks-wk2  Platform: Mac OS X 10.9.5
Comment 6 Alexey Proskuryakov 2015-04-24 17:09:35 PDT
Comment on attachment 251579 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251579&action=review

> Source/JavaScriptCore/runtime/StringConstructor.cpp:100
> +    StringBuilder builder;
> +
> +    for (unsigned i = 0; i < length; ++i) {

Would it be more efficient to preallocate a 2*length size buffer than using StringBuilder?
Comment 7 Darin Adler 2015-04-24 18:08:59 PDT
Comment on attachment 251579 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251579&action=review

Need to update the Object.getOwnPropertyNames test.

>> Source/JavaScriptCore/runtime/StringConstructor.cpp:100
>> +    for (unsigned i = 0; i < length; ++i) {
> 
> Would it be more efficient to preallocate a 2*length size buffer than using StringBuilder?

Or since it’s highly likely it will be length, not 2*length, should just call:

    builder.reserveCapacity(length);

> Source/JavaScriptCore/runtime/StringConstructor.cpp:101
> +        double nextCP = exec->uncheckedArgument(i).toNumber(exec);

I suggest naming this codePointAsDouble rather than nextCP.

> Source/JavaScriptCore/runtime/StringConstructor.cpp:109
> +        if (!sameValue(exec, jsNumber(nextCP), jsNumber(jsNumber(nextCP).toInteger(exec))))
> +            return throwVMError(exec, createRangeError(exec, ASCIILiteral("Arguments contain a value that is out of range of code points")));
> +
> +        if (nextCP < 0 || nextCP > 0x10FFFF)
> +            return throwVMError(exec, createRangeError(exec, ASCIILiteral("Arguments contain a value that is out of range of code points")));

This conversion back to jsNumber is really strange. I think we can do this check without using the sameValue function and without all this conversion back and forth.

I think it’s probably simply this:

    UChar32 codePoint = codePointAsDouble;
    if (codePoint != codePointAsDouble || codePoint > 0x10FFFF)
        return throwVMError(exec, createRangeError(exec, ASCIILiteral("Arguments contain a value that is out of range of code points")));

> Source/JavaScriptCore/runtime/StringConstructor.cpp:111
> +        unsigned codePoint = static_cast<unsigned>(nextCP);

Should be UChar32 rather than unsigned.

> Source/JavaScriptCore/runtime/StringConstructor.cpp:119
> +        int encodeOffset = 0;
> +        UChar encodeBuffer[2] = { 0 };
> +        U16_APPEND_UNSAFE(encodeBuffer, encodeOffset, codePoint);
> +
> +        builder.append(encodeBuffer[0]);
> +        if (encodeOffset == 2)
> +            builder.append(encodeBuffer[1]);

Here is a much cleaner way to write it:

    if (U_IS_BMP(codePoint))
        builder.append(codePoint);
    else {
        builder.append(U16_LEAD(codePoint));
        builder.append(U16_TRAIL(codePoint));
    }
Comment 8 Yusuke Suzuki 2015-04-25 06:36:32 PDT
Created attachment 251632 [details]
Patch
Comment 9 Yusuke Suzuki 2015-04-25 06:49:49 PDT
Created attachment 251633 [details]
Patch
Comment 10 Yusuke Suzuki 2015-04-25 06:51:51 PDT
Comment on attachment 251579 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251579&action=review

>>> Source/JavaScriptCore/runtime/StringConstructor.cpp:100
>>> +    for (unsigned i = 0; i < length; ++i) {
>> 
>> Would it be more efficient to preallocate a 2*length size buffer than using StringBuilder?
> 
> Or since it’s highly likely it will be length, not 2*length, should just call:
> 
>     builder.reserveCapacity(length);

Looks nice!
If there's no code points represented by surrogate pairs, a length of a generated string becomes `length`.
OK, so `builder.reserveCapacity(length);` looks very nice.

>> Source/JavaScriptCore/runtime/StringConstructor.cpp:101
>> +        double nextCP = exec->uncheckedArgument(i).toNumber(exec);
> 
> I suggest naming this codePointAsDouble rather than nextCP.

Looks nice. Fixed.

>> Source/JavaScriptCore/runtime/StringConstructor.cpp:109
>> +            return throwVMError(exec, createRangeError(exec, ASCIILiteral("Arguments contain a value that is out of range of code points")));
> 
> This conversion back to jsNumber is really strange. I think we can do this check without using the sameValue function and without all this conversion back and forth.
> 
> I think it’s probably simply this:
> 
>     UChar32 codePoint = codePointAsDouble;
>     if (codePoint != codePointAsDouble || codePoint > 0x10FFFF)
>         return throwVMError(exec, createRangeError(exec, ASCIILiteral("Arguments contain a value that is out of range of code points")));

Looks nice. However, since `UChar32` is typedef-ed as `int32_t`, when `-1` comes, `codePoint != codePointAsDouble` becomes false.
So instead of UChar32, I've changed it to `uint32_t` in the revised patch.

>> Source/JavaScriptCore/runtime/StringConstructor.cpp:111
>> +        unsigned codePoint = static_cast<unsigned>(nextCP);
> 
> Should be UChar32 rather than unsigned.

Sounds very nice! UChar32 is nice for representing unicode code points.

>> Source/JavaScriptCore/runtime/StringConstructor.cpp:119
>> +            builder.append(encodeBuffer[1]);
> 
> Here is a much cleaner way to write it:
> 
>     if (U_IS_BMP(codePoint))
>         builder.append(codePoint);
>     else {
>         builder.append(U16_LEAD(codePoint));
>         builder.append(U16_TRAIL(codePoint));
>     }

Right. Your suggestion is cleaner :D
Comment 11 Build Bot 2015-04-25 07:22:20 PDT
Comment on attachment 251633 [details]
Patch

Attachment 251633 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.appspot.com/results/5669308101296128

New failing tests:
js/Object-getOwnPropertyNames.html
Comment 12 Build Bot 2015-04-25 07:22:24 PDT
Created attachment 251635 [details]
Archive of layout-test-results from ews102 for mac-mavericks

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews102  Port: mac-mavericks  Platform: Mac OS X 10.9.5
Comment 13 Build Bot 2015-04-25 07:45:28 PDT
Comment on attachment 251633 [details]
Patch

Attachment 251633 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.appspot.com/results/4974387224641536

New failing tests:
js/Object-getOwnPropertyNames.html
Comment 14 Build Bot 2015-04-25 07:45:32 PDT
Created attachment 251638 [details]
Archive of layout-test-results from ews104 for mac-mavericks-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: ews104  Port: mac-mavericks-wk2  Platform: Mac OS X 10.9.5
Comment 15 Yusuke Suzuki 2015-04-25 07:49:27 PDT
Created attachment 251639 [details]
Patch
Comment 16 Darin Adler 2015-04-25 09:54:55 PDT
Comment on attachment 251639 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251639&action=review

Looks great.

Correctness-wise this seems just right. Optimization-wise, maybe worth revisiting after landing.

- We might consider a fast path in the extremely likely case that an argument is already an Int32; converting to double and back is expensive and unnecessary. The JSValue::toInt32 and toUInt32 functions does something like this, and while they are doing a different operation, the same principle applies. This one is really easy, we’d just check isUInt32 before calling toNumber and refactor so we don’t do the floating point arithmetic at all in that case.

- We might consider a fast path in the extremely likely case that there is a single BMP argument that should return a single character string, like the special case we already have for fromCharCode.

- If fromCharCode or fromCodePoint ever gets really hot usage then we could also consider optimizing either in the JIT the way we do with some other built-in functions like the math functions.

It’s unfortunate that toUInt32 has the unwanted behavior of doing modulo rather than failing when the value cannot be converted into a number that is in the UInt32 range. Because aside from that, toUInt32 has desirable behavior such as avoiding the "to double and back" thing.

> Source/JavaScriptCore/tests/stress/string-from-code-points.js:51
> +    [ 0x10000,  "\uD800\uDC00" ],
> +    [ 0x10001,  "\uD800\uDC01" ],

Should test that when you pass in an unpaired surrogate such as 0xD800 or 0xDC00 it just gets passed through unmodified, since that seems to be the specified behavior. Or even a pair of surrogates as code points that then turn into a single code point in the resulting string!

> Source/JavaScriptCore/tests/stress/string-from-code-points.js:59
> +var invalidTests = [

Should specifically test a value that works with fromCharCode because of toUInt32 but not for this class. Such as 2^32 + 32, which produces a space if passed to fromCharCode, but should throw an exception if passed to fromCodePoint.

> Source/JavaScriptCore/tests/stress/string-from-code-points.js:103
> +    [ null, "\0" ],
> +    [ [], "\0" ],
> +    [ "0x41", "A" ],
> +    [ "", "\0" ],

Should probably add tests for boolean true and false also.
Comment 17 Darin Adler 2015-04-25 10:01:50 PDT
Comment on attachment 251639 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251639&action=review

>> Source/JavaScriptCore/tests/stress/string-from-code-points.js:51
>> +    [ 0x10001,  "\uD800\uDC01" ],
> 
> Should test that when you pass in an unpaired surrogate such as 0xD800 or 0xDC00 it just gets passed through unmodified, since that seems to be the specified behavior. Or even a pair of surrogates as code points that then turn into a single code point in the resulting string!

Besides unpaired surrogates, there are these values that are technically “non-characters” but that we intentionally handle rather than rejecting:

- the last two code points on each plane (U+__FFFE and U+__FFFF, 34 code points total in 17 planes); of these we test only 0x10FFFF, 0x10FFFE and 0xFFFF here but we could test some others
- U+FDD0-U+FDEF (32 code points)

We might want to test a few more of them, at least the true edge cases. If we had used, for example, the U_IS_UNICODE_CHAR function from the ICU library rather than comparing against 0x10FFFF explicitly we would have rejected those characters.
Comment 18 Yusuke Suzuki 2015-04-25 14:59:51 PDT
Comment on attachment 251639 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251639&action=review

Thank you for your review!
Your thought is correct. I think if we need more performance for this area, implementing a fast path in JIT layer (implementing intrinsic thunk code for JS function) is preferable.

>>> Source/JavaScriptCore/tests/stress/string-from-code-points.js:51
>>> +    [ 0x10001,  "\uD800\uDC01" ],
>> 
>> Should test that when you pass in an unpaired surrogate such as 0xD800 or 0xDC00 it just gets passed through unmodified, since that seems to be the specified behavior. Or even a pair of surrogates as code points that then turn into a single code point in the resulting string!
> 
> Besides unpaired surrogates, there are these values that are technically “non-characters” but that we intentionally handle rather than rejecting:
> 
> - the last two code points on each plane (U+__FFFE and U+__FFFF, 34 code points total in 17 planes); of these we test only 0x10FFFF, 0x10FFFE and 0xFFFF here but we could test some others
> - U+FDD0-U+FDEF (32 code points)
> 
> We might want to test a few more of them, at least the true edge cases. If we had used, for example, the U_IS_UNICODE_CHAR function from the ICU library rather than comparing against 0x10FFFF explicitly we would have rejected those characters.

Looks nice.
I've added
1. unpaired surrogates
2. all non-characters code points (17 * 2 + 32 = 66)

>> Source/JavaScriptCore/tests/stress/string-from-code-points.js:59
>> +var invalidTests = [
> 
> Should specifically test a value that works with fromCharCode because of toUInt32 but not for this class. Such as 2^32 + 32, which produces a space if passed to fromCharCode, but should throw an exception if passed to fromCodePoint.

Added it (0x100000000 + 32).

>> Source/JavaScriptCore/tests/stress/string-from-code-points.js:103
>> +    [ "", "\0" ],
> 
> Should probably add tests for boolean true and false also.

Sounds nice. Added.
Comment 19 Yusuke Suzuki 2015-04-25 15:02:07 PDT
Committed r183315: <http://trac.webkit.org/changeset/183315>