Bug 143934

Summary: Implement String.codePointAt()
Product: WebKit Reporter: Benjamin Poulain <benjamin>
Component: New BugsAssignee: Benjamin Poulain <benjamin>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, cmarcelo, commit-queue, darin, ggaren
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch darin: review+

Description Benjamin Poulain 2015-04-19 01:09:11 PDT
Implement String.codePointAt()
Comment 1 Benjamin Poulain 2015-04-19 01:13:21 PDT
Created attachment 251118 [details]
Patch
Comment 2 Anders Carlsson 2015-04-19 08:21:33 PDT
Comment on attachment 251118 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251118&action=review

> Source/WTF/wtf/text/StringImpl.h:657
> +    WTF_EXPORT_STRING_API UChar32 jsCodePointAt(unsigned);

If this is only used by JSC I don't think it needs to go into WTF.
Comment 3 Alexey Proskuryakov 2015-04-19 09:51:36 PDT
Comment on attachment 251118 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251118&action=review

> Source/WTF/wtf/text/StringImpl.cpp:371
> +    // -Let first be the code unit value of the element at index position in the String S.

Delete these comments now that the code is written?

> LayoutTests/js/script-tests/string-code-point-at.js:80
> +shouldThrow('"abc".codePointAt(Symbol("WebKit"))');

Nice test suite. Curious why Symbol is  so much of a special case.

> LayoutTests/js/script-tests/string-code-point-at.js:82
> +// The following are using special test functions because of limitations of WebKitTestRunner when handling strings with invalid codepoints.

Can you reword this comment to explain what exactly doesn't work?
Comment 4 Yusuke Suzuki 2015-04-19 10:36:56 PDT
Comment on attachment 251118 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251118&action=review

> LayoutTests/js/Object-getOwnPropertyNames-expected.txt:51
> +FAIL getSortedOwnPropertyNames(String.prototype) should be anchor,big,blink,bold,charAt,charCodeAt,concat,constructor,endsWith,fixed,fontcolor,fontsize,includes,indexOf,italics,lastIndexOf,length,link,localeCompare,match,repeat,replace,search,slice,small,split,startsWith,strike,sub,substr,substring,sup,toLocaleLowerCase,toLocaleUpperCase,toLowerCase,toString,toUpperCase,trim,trimLeft,trimRight,valueOf. Was anchor,big,blink,bold,charAt,charCodeAt,codePointAt,concat,constructor,endsWith,fixed,fontcolor,fontsize,includes,indexOf,italics,lastIndexOf,length,link,localeCompare,match,repeat,replace,search,slice,small,split,startsWith,strike,sub,substr,substring,sup,toLocaleLowerCase,toLocaleUpperCase,toLowerCase,toString,toUpperCase,trim,trimLeft,trimRight,valueOf.

test failed. adding "charCodeAt' to test solves it :D
Comment 5 Darin Adler 2015-04-19 11:23:46 PDT
Comment on attachment 251118 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251118&action=review

if

> Source/JavaScriptCore/runtime/StringPrototype.cpp:816
> +EncodedJSValue JSC_HOST_CALL stringProtoFuncCodePointAt(ExecState* exec)

Here’s the helper function I suggest implementing:

    static inline UChar32 codePointAt(const String& string, unsigned position)
    {
        ASSERT(position < string.length());
        if (string.is8Bit())
            return string[position];
        UChar32 character;
        U16_NEXT(string.characters16(), position, string.length(), character);
        return character;
    }

Or you could skip the is8Bit check and not bother with characters16 to make this cleaner:

    static inline UChar32 codePointAt(const String& string, unsigned position)
    {
        ASSERT(position < string.length());
        UChar32 character;
        U16_NEXT(string, position, string.length(), character);
        return character;
    }

I’m not sure how good the compiler will be at optimizing this second case, but it might do just fine.

This implements the same rule that the JavaScript specification calls for. It’s annoying how close this is to StringImpl::characterStartingAt but for some reason that function returns 0 when it’s not a valid surrogate pair.

>> Source/WTF/wtf/text/StringImpl.h:657
>> +    WTF_EXPORT_STRING_API UChar32 jsCodePointAt(unsigned);
> 
> If this is only used by JSC I don't think it needs to go into WTF.

I agree. We don’t need to put this in StringImpl.
Comment 6 Darin Adler 2015-04-19 11:28:52 PDT
Comment on attachment 251118 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251118&action=review

>> LayoutTests/js/script-tests/string-code-point-at.js:82
>> +// The following are using special test functions because of limitations of WebKitTestRunner when handling strings with invalid codepoints.
> 
> Can you reword this comment to explain what exactly doesn't work?

A good point. I’d like to know more.
Comment 7 Benjamin Poulain 2015-04-19 11:34:47 PDT
This reason for StringImpl + explicit implementation was the extra testing and having the spec text in code.

Looks like nobody appreciates it. I'll move back to JSC.
Comment 8 Benjamin Poulain 2015-04-19 12:03:14 PDT
Created attachment 251123 [details]
Patch
Comment 9 Darin Adler 2015-04-19 16:58:12 PDT
Comment on attachment 251123 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251123&action=review

> Source/JavaScriptCore/runtime/StringPrototype.cpp:813
>      return JSValue::encode(jsNaN());

Strange that charCodeAt returns NaN and codePointAt returns undefined in the analogous situation.

> Source/JavaScriptCore/runtime/StringPrototype.cpp:832
> +    String string = thisValue.toWTFString(exec);

I think the fact that we convert "this" to a string before converting the argument to an integer is detectable. We should write tests using valueOf to see if so. And if so, decide whether that’s the correct order or not.
Comment 10 Benjamin Poulain 2015-04-19 17:28:14 PDT
(In reply to comment #9)
> Comment on attachment 251123 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=251123&action=review
> 
> > Source/JavaScriptCore/runtime/StringPrototype.cpp:813
> >      return JSValue::encode(jsNaN());
> 
> Strange that charCodeAt returns NaN and codePointAt returns undefined in the
> analogous situation.

Agreed, that is just plain weird.

> > Source/JavaScriptCore/runtime/StringPrototype.cpp:832
> > +    String string = thisValue.toWTFString(exec);
> 
> I think the fact that we convert "this" to a string before converting the
> argument to an integer is detectable. We should write tests using valueOf to
> see if so. And if so, decide whether that’s the correct order or not.

Good point!

It is the correct order. It is defined by the spec.

I'll add coverage for that.
Comment 11 Darin Adler 2015-04-19 21:40:34 PDT
Oh, I thought of something else.

If converting “this” to a string throws an exception, then the argument should not be converted to an integer; the exception should stop the processing before the second conversion is done. To implement this correctly, we probably need to add one early return if we see an exception. I bet we have this wrong for charCodeAt and probably other functions as well. It’s one good argument for writing more of the runtime library in JavaScript instead of C++ long term because it’s easier to get exceptions right that way.
Comment 12 Benjamin Poulain 2015-04-19 21:55:04 PDT
(In reply to comment #11)
> Oh, I thought of something else.
> 
> If converting “this” to a string throws an exception, then the argument
> should not be converted to an integer; the exception should stop the
> processing before the second conversion is done. To implement this
> correctly, we probably need to add one early return if we see an exception.
> I bet we have this wrong for charCodeAt and probably other functions as
> well. It’s one good argument for writing more of the runtime library in
> JavaScript instead of C++ long term because it’s easier to get exceptions
> right that way.

I thought about that, but what would be the way of checking if "this" is coercible to object in JavaScript?
Comment 13 Benjamin Poulain 2015-04-19 22:14:51 PDT
Comment on attachment 251123 [details]
Patch

I'll post a new patch tomorrow. The last bug you uncovered shows this need more work.
Comment 14 Jordan Harband 2015-04-20 14:52:20 PDT
Comment on attachment 251123 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251123&action=review

Please also see es6-shim's tests here: https://github.com/paulmillr/es6-shim/blob/master/test/string.js#L429-L474

> LayoutTests/js/script-tests/string-code-point-at.js:8
> +shouldBe('String.prototype.codePointAt.length', '1');

It'd be good to also assert `shouldBeEqualToString('String.prototype.codePointAt.name', 'codePointAt')`

> LayoutTests/js/script-tests/string-code-point-at.js:69
> +shouldBe('"abc".codePointAt(4)', 'undefined');

Maybe also a `shouldBe('var str = "abc"; str.codePointAt(str.length)', 'undefined');`?
Comment 15 Benjamin Poulain 2015-04-20 16:00:54 PDT
Created attachment 251197 [details]
Patch
Comment 16 Benjamin Poulain 2015-04-20 16:02:07 PDT
Ok, the exception on toString() is not a problem since the generated string is null so the remaining code does nothing.

I added test coverage for the issues pointed out so far.
Comment 17 Benjamin Poulain 2015-04-20 23:52:56 PDT
Created attachment 251221 [details]
Patch
Comment 18 Darin Adler 2015-04-21 08:48:32 PDT
Comment on attachment 251221 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251221&action=review

> Source/JavaScriptCore/runtime/StringPrototype.cpp:818
> +    RELEASE_ASSERT(position < string.length());

This should not be RELEASE_ASSERT. The two call sites just below guarantee this invariant in a clear, obvious way. Maybe if this function was used in more places, but it seems like overkill given the way this is used.

> Source/JavaScriptCore/runtime/StringPrototype.cpp:834
> +    if (UNLIKELY(exec->hadException()))
> +        return JSValue::encode(jsUndefined());

You could put this after the isUInt32 check below, since we only need this check if we continue and call toInteger. Of course, you could also argue that putting it right here is best for clarity.
Comment 19 Benjamin Poulain 2015-04-22 14:04:18 PDT
(In reply to comment #18)
> Comment on attachment 251221 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=251221&action=review
> 
> > Source/JavaScriptCore/runtime/StringPrototype.cpp:818
> > +    RELEASE_ASSERT(position < string.length());
> 
> This should not be RELEASE_ASSERT. The two call sites just below guarantee
> this invariant in a clear, obvious way. Maybe if this function was used in
> more places, but it seems like overkill given the way this is used.

I think it is a good safety net.

Today the code will be eliminated by the compiler. If anyone ever modify a caller, this branch could protect the "string.characters8()[position]".
Comment 20 Benjamin Poulain 2015-04-22 15:26:33 PDT
Committed r183141: <http://trac.webkit.org/changeset/183141>