Bug 212725 - Pasted JavaScript code from pasteboard is unicode-normalized and this changes meaning of code
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Inspector
Version: Safari Technology Preview
Hardware: Mac macOS 10.15
Importance: P2 Normal
Assignee: Wenson Hsieh
Keywords: InRadar
Reported: 2020-06-03 20:37 PDT by Huáng Jùnliàng
Modified: 2020-06-19 15:21 PDT
Description Huáng Jùnliàng 2020-06-03 20:37:15 PDT
Safari TP 107 throws a SyntaxError on the following snippet:

```js
// \u1f7c-\u1f7d
var range = "[ὼ-ώ]";
var regex = new RegExp(range);
```

However, it does not throw when the range above is written with ASCII-only escapes:

```js
var range = "[\u1f7c-\u1f7d]";
var regex = new RegExp(range);
```

Although `1f7c` may look arbitrary, the following snippet, which uses the neighboring range, does not throw:

```js
// \u1f7b-\u1f7c
var range="[ύ-ὼ]",regex=new RegExp(range);
```

I don't think this issue can be related to recent Unicode version updates because \u1f7b - \u1f7d have been available since Unicode 1.1.
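
Since the ASCII-escaped form above does not throw, one way to share such snippets safely is to escape non-ASCII characters before copying. The following is only a minimal sketch: `escapeNonAscii` is a hypothetical helper, not part of Babel or terser, and it assumes its input is the contents of a string literal.

```js
// Hypothetical helper: rewrite every non-ASCII UTF-16 code unit as a \uXXXX
// escape, so the text no longer contains characters that could be altered
// on copy and paste.
// Assumption: the input is the contents of a string literal.
function escapeNonAscii(text) {
  return text.replace(/[^\x00-\x7f]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const range = "[\u1f7c-\u1f7d]";
const escaped = escapeNonAscii(range);
console.log(escaped); // [\u1f7c-\u1f7d]
```

Because the pattern has no `u` flag, it works per code unit, so characters outside the BMP come out as two surrogate escapes.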


**Context**

I found this issue while debugging https://github.com/babel/website/issues/2254. JSC throws on Babel's ugly identifier name detection regex (https://github.com/babel/babel/blob/master/packages/babel-helper-validator-identifier/src/identifier.js) after it is minified by terser.

**Related version**
This issue is also reproducible on Safari Version 13.1.1 (15609.2.9.1.2).
Comment 1 Radar WebKit Bug Importer 2020-06-05 09:57:35 PDT
<rdar://problem/64033253>
Comment 2 Yusuke Suzuki 2020-06-09 16:55:44 PDT
(In reply to Huáng Jùnliàng from comment #0)
> Safari TP 107 throws on the following snippet
> 
> ```js
> // \u1f7c-\u1f7d
> var range = "[ὼ-ώ]";
> var regex = new RegExp(range);
> ```

In the above code, it looks like the range is not `\u1f7c-\u1f7d`.

[..."ὼ-ώ"].map((ch) => ch.charCodeAt(0).toString(16)) // => ["1f7c","2d","3ce"]

Since 0x3ce is smaller than 0x1f7c, this throws a SyntaxError.

I've confirmed that the above script throws a SyntaxError in Firefox and Chrome.
So it seems that this is a bug in the minifier.
Can you check the result with the other browsers?
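
The check above can be repeated without relying on what the clipboard delivers by writing both ranges with escapes. This is a small sketch; the names `toHex` and `compiles` are illustrative only.

```js
// Code units of a string as space-separated hex.
const toHex = (s) =>
  [...s].map((ch) => ch.charCodeAt(0).toString(16)).join(" ");

const intended = "\u1f7c-\u1f7d"; // the range the reporter meant
const pasted = "\u1f7c-\u03ce";   // the range the check above actually saw

console.log(toHex(intended)); // 1f7c 2d 1f7d
console.log(toHex(pasted));   // 1f7c 2d 3ce

// A character class range must be ascending, so only the second one throws.
const compiles = (body) => {
  try {
    new RegExp("[" + body + "]");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
};

console.log(compiles(intended)); // true
console.log(compiles(pasted));   // false
```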

Comment 3 Huáng Jùnliàng 2020-06-09 19:27:10 PDT
> In the above code, it looks like the range is not `\u1f7c-\u1f7d`.

> [..."ὼ-ώ"].map((ch) => ch.charCodeAt(0).toString(16)) // => ["1f7c","2d","3ce"]

I can confirm that if you copy and paste the code example from this webpage, it throws a SyntaxError in other browsers too, and `ώ` becomes U+03CE.

I can also confirm that the minifier is working properly. You can try the following example on https://try.terser.org:

```
"\u{1f7d}".codePointAt(0).toString(16);
```

Then copy the output code into the Chrome/Firefox/Safari console. Chrome and Firefox both return `1f7d`, but the Safari console returns `3ce`.

Note that U+1F7D has a singleton decomposition, which means

```
String.fromCodePoint(0x1f7d).normalize() === String.fromCodePoint(0x3ce)
```

and it is included in CompositionExclusions, which means U+1F7D should never exist in a normalized Unicode string.

Since ECMAScript does not require the source text to be normalized, I think this is a bug in the Safari console, which applies normalization to the pasted source code. I will file a new bug report about that.
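
To see which characters in a piece of source text would be altered by such normalization, each code point can be compared with its normalized form. This is a minimal sketch that assumes NFC is the form applied (the thread does not confirm which form the paste path uses); `findNfcUnstable` is a hypothetical name.

```js
// List the code points whose NFC form differs from the original.
function findNfcUnstable(source) {
  const hits = [];
  for (const ch of source) { // iterates by code point
    const nfc = ch.normalize("NFC");
    if (nfc !== ch) {
      hits.push({
        from: ch.codePointAt(0).toString(16),
        to: [...nfc].map((c) => c.codePointAt(0).toString(16)).join(" "),
      });
    }
  }
  return hits;
}

console.log(findNfcUnstable("[\u1f7c-\u1f7d]")); // one hit: 1f7d -> 3ce
console.log(findNfcUnstable("[\u1f7b-\u1f7c]")); // one hit: 1f7b -> 3cd
```

The second range also changes, but it stays ascending (0x3cd is below 0x1f7c), which is consistent with that snippet not throwing. A per-code-point check like this does not catch combining sequences that compose under NFC; comparing `source.normalize("NFC") !== source` covers the whole text.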

However, I think this bug also affects certain feature implementations, because a runtime error is thrown only in Safari in https://github.com/babel/website/issues/2254.

I will try to isolate that issue again and post an update; please leave this bug unresolved.
Comment 4 Yusuke Suzuki 2020-06-09 19:41:00 PDT
(In reply to Huáng Jùnliàng from comment #3)
> > In the above code, it looks like the range is not `\u1f7c-\u1f7d`.
> 
> > [..."ὼ-ώ"].map((ch) => ch.charCodeAt(0).toString(16)) // => ["1f7c","2d","3ce"]
> 
> I can confirm that if you copy & paste the code example from this webpage,
> it throws syntax error on other browsers too and `ώ` becomes U+03CE.

Thanks for your confirmation :)

> 
> I can also confirm that the minifier is working properly. You can try the
> following example on https://try.terser.org
> 
> ```
> "\u{1f7d}".codePointAt(0).toString(16);
> ```
> 
> And copy the output code to Chrome/Firefox/Safari console. Both Chrome and
> Firefox returns `1f7d` but Safari console returns `3ce`.

Interesting! This reproduces on my machine too.
One interesting thing:

1. I copied this from the terser result.
2. I created a secret gist from Safari.
3. I copied the text from the gist.
4. I pasted it into the Chrome and Firefox consoles.

Then I got the same normalized results. That sounds like the pasteboard is normalizing the content when text is pasted in WebKit?
I should talk to platform folks.

> 
> Note that U+1F7D has singleton decompositions, which means
> 
> ```
> String.fromCodePoint(0x1f7d).normalize() === String.fromCodePoint(0x3ce)
> ```
> 
> and it is included in CompositionExclusions, which means U+1F7D should never
> exist in a normalized Unicode string.
> 
> Since ECMAScript does not require the source text to be normalized, I think
> this is a bug of Safari console, which applies normalization on the pasted
> source code. I will file a new bug report about that.

Maybe this is not the console's bug, given that I saw this normalization even in a textarea.
Rather, it sounds like the pasteboard is doing the normalization. Could this come from UIKit...?

> However I think this bug also affects certain feature implementations,
> because runtime error is thrown on Safari exclusively in
> https://github.com/babel/website/issues/2254.
> 
> I will try to isolate that issue again and post an update, please leave it
> unresolved.

Sure! Thanks. This helps us a lot :D
Comment 5 Huáng Jùnliàng 2020-06-09 20:01:22 PDT
> That sounds like paste-board is normalizing the content when pasting a text in WebKit?
> I should talk to platform folks.

That looks reasonable. Since you are much more familiar with the internals than I am, can you file a radar on that? I have also prepared a gist at https://gist.github.com/JLHwung/64fed33e2dbb3da7a18566fab26f045f.
Comment 6 Yusuke Suzuki 2020-06-09 20:59:00 PDT
(In reply to Huáng Jùnliàng from comment #5)
> > That sounds like paste-board is normalizing the content when pasting a text in WebKit?
> > I should talk to platform folks.
> 
> That looks reasonable. Since you are much more familiar with the internals
> than I am, can you file a radar on that? I have also prepared a gist at
> https://gist.github.com/JLHwung/64fed33e2dbb3da7a18566fab26f045f.

OK, I've talked with Wenson and Alexey about it.
We have a code path that Unicode-normalizes text going from the pasteboard into a textarea. It exists to handle cases such as a decomposed-Unicode HFS+ filename being pasted directly into the textarea. We are still discussing which direction to take.
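
The trade-off can be illustrated with plain `String.prototype.normalize`. This is only a sketch of the effect, assuming NFC; it is not WebKit's code path, and the file name is made up.

```js
// A decomposed name such as HFS+ might produce:
// "e" followed by U+0301 COMBINING ACUTE ACCENT.
const decomposedName = "re\u0301sume\u0301.txt";
const composedName = decomposedName.normalize("NFC");

console.log(composedName === "r\u00e9sum\u00e9.txt");     // true
console.log(decomposedName.length, composedName.length); // 12 10

// The same normalization applied to source code rewrites U+1F7D to U+03CE.
const code = "\u1f7d";
console.log(code.normalize("NFC").codePointAt(0).toString(16)); // 3ce
```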

BTW, from this information, I guess that the Babel website console has a code path that copies and pastes the Babel code itself, correct?
Comment 7 Yusuke Suzuki 2020-06-18 21:48:17 PDT
After looking into the Babel console code, I suspect that the text is normalized when a Blob is created.

@Huáng I guess that the Babel REPL website creates a large Blob that includes all of the Babel source code, and that this Blob is created from user JS with something like `new Blob([codeText])`. Is that correct?
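
One hedged way to test this suspicion is to round-trip a string through a `Blob` and compare code units. The sketch assumes an environment that provides the standard `Blob` constructor and `Blob.prototype.text()`; the helper names are hypothetical, and this is not the Babel REPL's actual code.

```js
// Index of the first differing code unit, or -1 if the strings are identical.
function firstMismatch(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a.charCodeAt(i) !== b.charCodeAt(i)) return i;
  }
  return a.length === b.length ? -1 : n;
}

// Round-trip text through a Blob, as in `new Blob([codeText])`.
async function roundTripThroughBlob(codeText) {
  return new Blob([codeText]).text();
}

// Usage: logs -1 when the Blob preserved the text exactly.
if (typeof Blob !== "undefined") {
  roundTripThroughBlob("\u1f7d").then((out) => {
    console.log(firstMismatch("\u1f7d", out));
  });
}
```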
Comment 8 Yusuke Suzuki 2020-06-18 21:48:37 PDT
I'll keep this bug for pasteboard text encoding.
Comment 9 Yusuke Suzuki 2020-06-18 21:51:35 PDT
I think babel's repl issue is derived from https://bugs.webkit.org/show_bug.cgi?id=213254
Comment 10 Huáng Jùnliàng 2020-06-19 15:21:01 PDT
@Yusuke Thanks for attaching the See Also issue.

> babel repl website creates large Blob which includes all babel source code

Yes! You are right. I will reply on that issue as this issue should be focused on pasteboard.