Bug 210119 - querySelector("#\u0000") should match an element with ID U+FFFD
Summary: querySelector("#\u0000") should match an element with ID U+FFFD
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: CSS (show other bugs)
Version: Safari 13
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Chris Dumez
URL: https://drafts.csswg.org/css-syntax/#...
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-04-07 09:24 PDT by Domenic Denicola
Modified: 2020-04-08 17:44 PDT (History)
19 users (show)

See Also:


Attachments
Patch (29.21 KB, patch)
2020-04-08 09:38 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (29.27 KB, patch)
2020-04-08 11:27 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (39.73 KB, patch)
2020-04-08 14:09 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (31.10 KB, patch)
2020-04-08 16:15 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Domenic Denicola 2020-04-07 09:24:53 PDT
Tested using the version of Safari deployed on wpt.fyi; I don't have a Mac available while working from home.

URL: http://wpt.live/dom/nodes/ParentNode-querySelector-escapes.html

Other browsers: Chrome fails, Firefox passes

Per spec, selectors are parsed with a preprocessing step that transforms U+0000 to U+FFFD. This does not seem to be implemented, at least in the case tested here.
Comment 1 Chris Dumez 2020-04-07 12:06:35 PDT
Looking at our code it looks like we indeed have no preprocessing stage and our CSSParser implementation tries to deal with this fact as a result. E.g.:


static bool isNewLine(UChar cc)
{
    // We check \r and \f here, since we have no preprocessing stage
    return (cc == '\r' || cc == '\n' || cc == '\f');
}
Comment 2 Chris Dumez 2020-04-07 13:29:42 PDT
CSSTokenizer::CSSTokenizer(const String& string)
    : m_input(string)
{
    // According to the spec, we should perform preprocessing here.
    // See: http://dev.w3.org/csswg/css-syntax/#input-preprocessing
    //
    // However, we can skip this step since:
    // * We're using HTML spaces (which accept \r and \f as a valid white space)
    // * Do not count white spaces
    // * CSSTokenizerInputStream::nextInputChar() replaces NULLs for replacement characters


Clearly it is not 100% working though.
Comment 3 Chris Dumez 2020-04-08 09:38:14 PDT
Created attachment 395820 [details]
Patch
Comment 4 Chris Dumez 2020-04-08 11:27:28 PDT
Created attachment 395844 [details]
Patch
Comment 5 Darin Adler 2020-04-08 12:09:44 PDT
Comment on attachment 395844 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395844&action=review

> Source/WebCore/css/parser/CSSTokenizer.cpp:44
> +static String& preprocessString(String& string)

Doesn’t seem like a great pattern to take a String& and also return it.

I suggest we take either String&& or const String& and return a String.

> Source/WebCore/css/parser/CSSTokenizer.cpp:48
> +    return string.replace('\0', 0xFFFD);

We have a named constant for this, replacementCharacter, in <wtf/unicode/CharacterNames.h>. Let's use that.

> Source/WebCore/css/parser/CSSTokenizer.cpp:208
> +    if (m_input.peek(0) == '!'
> +        && m_input.peek(1) == '-'
> +        && m_input.peek(2) == '-') {

We could make this a one-liner now, easier to read.

> Source/WebCore/css/parser/CSSTokenizer.cpp:227
> +    if (m_input.peek(0) == '-'
> +        && m_input.peek(1) == '>') {

We could make this a one-liner now, easier to read.

> Source/WebCore/css/parser/CSSTokenizer.cpp:330
> +    if (m_input.peek(0) == '+'
> +        && (isASCIIHexDigit(m_input.peek(1)) || m_input.peek(1) == '?')) {

We could make this a one-liner now, easier to read.

> Source/WebCore/css/parser/CSSTokenizer.h:49
> -    explicit CSSTokenizer(const String&);
> -    CSSTokenizer(const String&, CSSParserObserverWrapper&); // For the inspector
> +    explicit CSSTokenizer(String);
> +    CSSTokenizer(String, CSSParserObserverWrapper&); // For the inspector

We could have made this change without changing the argument type; it would simple cost a little reference count churn.

If we want to change the interface to avoid the churn, then I suggest a String&& argument instead of a String argument.

Or maybe you just like the simplicity of just String?

> Source/WebCore/css/parser/CSSTokenizerInputStream.h:50
>              return '\0';

Messy that we have kEndOfFileMarker but don’t use it much.

> Source/WebCore/css/parser/CSSTokenizerInputStream.h:56
> +    UChar peek(unsigned lookaheadOffset) const

I almost wish this was just operator[] instead a named function.
Comment 6 Chris Dumez 2020-04-08 12:25:02 PDT
Comment on attachment 395844 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395844&action=review

>> Source/WebCore/css/parser/CSSTokenizer.cpp:44
>> +static String& preprocessString(String& string)
> 
> Doesn’t seem like a great pattern to take a String& and also return it.
> 
> I suggest we take either String&& or const String& and return a String.

I was trying to reduce the amount of refactoring because the constructor uses the string both to initialize m_input (which is a CSSTokenizerInputStream) and then later on in its body. Taking a String& made this convenient. However, I agree this is not a great pattern so I will clean this up.

>> Source/WebCore/css/parser/CSSTokenizer.cpp:48
>> +    return string.replace('\0', 0xFFFD);
> 
> We have a named constant for this, replacementCharacter, in <wtf/unicode/CharacterNames.h>. Let's use that.

Ok.

>> Source/WebCore/css/parser/CSSTokenizer.cpp:208
>> +        && m_input.peek(2) == '-') {
> 
> We could make this a one-liner now, easier to read.

Will fix those.
Comment 7 Darin Adler 2020-04-08 12:34:32 PDT
Comment on attachment 395844 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395844&action=review

>>> Source/WebCore/css/parser/CSSTokenizer.cpp:44
>>> +static String& preprocessString(String& string)
>> 
>> Doesn’t seem like a great pattern to take a String& and also return it.
>> 
>> I suggest we take either String&& or const String& and return a String.
> 
> I was trying to reduce the amount of refactoring because the constructor uses the string both to initialize m_input (which is a CSSTokenizerInputStream) and then later on in its body. Taking a String& made this convenient. However, I agree this is not a great pattern so I will clean this up.

Sure, lets just use m_input instead of string inside the constructors.
Comment 8 Chris Dumez 2020-04-08 12:44:30 PDT
Comment on attachment 395844 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395844&action=review

>>>> Source/WebCore/css/parser/CSSTokenizer.cpp:44
>>>> +static String& preprocessString(String& string)
>>> 
>>> Doesn’t seem like a great pattern to take a String& and also return it.
>>> 
>>> I suggest we take either String&& or const String& and return a String.
>> 
>> I was trying to reduce the amount of refactoring because the constructor uses the string both to initialize m_input (which is a CSSTokenizerInputStream) and then later on in its body. Taking a String& made this convenient. However, I agree this is not a great pattern so I will clean this up.
> 
> Sure, lets just use m_input instead of string inside the constructors.

m_input is not a String though and does not really store the String either (just the StringImpl).
Comment 9 Chris Dumez 2020-04-08 12:46:44 PDT
Comment on attachment 395844 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395844&action=review

>>>>> Source/WebCore/css/parser/CSSTokenizer.cpp:44
>>>>> +static String& preprocessString(String& string)
>>>> 
>>>> Doesn’t seem like a great pattern to take a String& and also return it.
>>>> 
>>>> I suggest we take either String&& or const String& and return a String.
>>> 
>>> I was trying to reduce the amount of refactoring because the constructor uses the string both to initialize m_input (which is a CSSTokenizerInputStream) and then later on in its body. Taking a String& made this convenient. However, I agree this is not a great pattern so I will clean this up.
>> 
>> Sure, lets just use m_input instead of string inside the constructors.
> 
> m_input is not a String though and does not really store the String either (just the StringImpl).

The proposal I have is to make the constructors private and add a makeCSSTokenizer() free function that would construct one for you, preprocessing the string before passing it to the constructors.
Comment 10 Chris Dumez 2020-04-08 14:09:51 PDT
Created attachment 395867 [details]
Patch
Comment 11 Chris Dumez 2020-04-08 15:30:24 PDT
Comment on attachment 395867 [details]
Patch

Some inspector tests are crashing. I must have introduced a bug in this latest iteration.
Comment 12 Chris Dumez 2020-04-08 16:15:16 PDT
Created attachment 395881 [details]
Patch
Comment 13 EWS 2020-04-08 17:43:54 PDT
Committed r259773: <https://trac.webkit.org/changeset/259773>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 395881 [details].
Comment 14 Radar WebKit Bug Importer 2020-04-08 17:44:15 PDT
<rdar://problem/61488027>