WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
212393
UTF-8 encode strings of invalid URLs when converting WTF::URL to NSURL instead of truncating the UTF-16 encoding
https://bugs.webkit.org/show_bug.cgi?id=212393
Summary
UTF-8 encode strings of invalid URLs when converting WTF::URL to NSURL instea...
Alex Christensen
Reported
2020-05-26 17:34:33 PDT
UTF-8 encode strings of invalid URLs when converting WTF::URL to NSURL instead of truncating the UTF-16 encoding
Attachments
Patch
(6.56 KB, patch)
2020-05-26 17:41 PDT
,
Alex Christensen
no flags
Details
Formatted Diff
Diff
add more tests
(1.64 KB, patch)
2020-05-27 15:52 PDT
,
Alex Christensen
darin
: review+
achristensen
: commit-queue+
Details
Formatted Diff
Diff
View All
Add attachment
proposed patch, testcase, etc.
Alex Christensen
Comment 1
2020-05-26 17:41:50 PDT
Created
attachment 400289
[details]
Patch
EWS
Comment 2
2020-05-26 18:49:00 PDT
Committed
r262171
: <
https://trac.webkit.org/changeset/262171
> All reviewed patches have been landed. Closing bug and clearing flags on
attachment 400289
[details]
.
Radar WebKit Bug Importer
Comment 3
2020-05-26 18:49:14 PDT
<
rdar://problem/63655013
>
Darin Adler
Comment 4
2020-05-27 11:49:10 PDT
Comment on
attachment 400289
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=400289&action=review
> Source/WTF/wtf/cf/URLCF.cpp:59 > + RetainPtr<CFURLRef> cfURL; > + if (LIKELY(m_string.is8Bit())) > + cfURL = WTF::createCFURLFromBuffer(reinterpret_cast<const char*>(m_string.characters8()), m_string.length()); > + else { > + CString utf8 = m_string.utf8(); > + cfURL = WTF::createCFURLFromBuffer(utf8.data(), utf8.length()); > + }
On reflection, this code is really not quite right. If we have non-ASCII Latin-1 characters, they will great treated as Latin-1, not UTF-8. But only if the string happens to be stored 8-bit, which is an implementation detail. Handling of those characters should *not* be different based on whether the string is 8-bit or not. Sadly, I think we need the more expensive "isAllASCII" check. Test string would be "
http://\xb6
", both as an 8-bit string and a 16-bit string. Needs to give the same results in both cases.
> Source/WTF/wtf/cocoa/URLCocoa.mm:77 > if (LIKELY(m_string.is8Bit())) > cfURL = WTF::createCFURLFromBuffer(reinterpret_cast<const char*>(m_string.characters8()), m_string.length()); > else { > - // Slower path. > - WTF::URLCharBuffer buffer; > - copyToBuffer(buffer); > - cfURL = WTF::createCFURLFromBuffer(buffer.data(), buffer.size()); > + CString utf8 = m_string.utf8(); > + cfURL = WTF::createCFURLFromBuffer(utf8.data(), utf8.length()); > }
Same problem.
Alex Christensen
Comment 5
2020-05-27 15:51:54 PDT
"
http://\xb6
" would be canonicalized to "
http://xn--tba/
" then it would be ASCII as all valid URLs are. This change only applied to invalid URLs containing non-ASCII characters. If you'd like, we can add the test I'm about to upload to verify correct behavior, but I think they behave correctly.
Alex Christensen
Comment 6
2020-05-27 15:52:36 PDT
Created
attachment 400398
[details]
add more tests
Darin Adler
Comment 7
2020-05-27 18:24:21 PDT
Comment on
attachment 400398
[details]
add more tests I don’t understand why this works, since the code does not do the same thing with 8-bit and 16-bit strings.
Darin Adler
Comment 8
2020-05-27 18:25:55 PDT
In one case we pass a string with one byte, 0xB6, in it. In the other we pass a string with two bytes, 0xC2 0xB6, in it. Does WTF::createCFURLFromBuffer make the same URL in both cases? If so, how does it do that?
Alex Christensen
Comment 9
2020-05-27 21:55:22 PDT
Comment on
attachment 400398
[details]
add more tests We do pass one code point as input, U+00B6. The output of NSURL parsing that is the percent-encoded UTF-8 encoding of that code point, %C2%B6, which is the same in both cases. bugs.webkit.org's review tool is showing the UTF-8 encoding of my test case on line 216, but that's one character.
Darin Adler
Comment 10
2020-05-28 08:21:04 PDT
Sorry. I still don’t understand. WTF::createCFURLFromBuffer creates the same URL if you pass it: { 0xB6 } And if you pass it: { 0xC2, 0xB6 } Really?
Alex Christensen
Comment 11
2020-05-28 10:54:33 PDT
If you pass createCFURLFromBuffer { 0xB6 }, then the CFURLRef constructor will UTF-8 encode then percent encode that, resulting in %C2%B6. This happens if you call createCFURL() on a URL that has an 8-bit or a 16-bit String containing only the character U+00B6. This is what you see in the patch named "add more tests". The reason you think you see 2 characters in my code is because the code review tool UTF-8 encodes 0xB6 also. If you pass createCFURLFromBuffer { 0xC2, 0xB6 }, then it will UTF-8 encode then percent encode that, resulting in %C3%82%C2%B6
Darin Adler
Comment 12
2020-05-28 10:57:04 PDT
(In reply to Alex Christensen from
comment #11
)
> If you pass createCFURLFromBuffer { 0xC2, 0xB6 }, then it will UTF-8 encode > then percent encode that, resulting in %C3%82%C2%B6
OK, so then if I create a WTF::String containing the character U+00B6, but it’s a 16-bit string, then this code will run: CString utf8 = m_string.utf8(); cfURL = WTF::createCFURLFromBuffer(utf8.data(), utf8.length()); The value of 'utf8' will be { 0xC2, 0xB6 }, and it will call WTF::createCFURLFromBuffer and result in %C3%82%C2%B6. But that should not be handled any differently from an 8-bit string containing the character U+00B6. What’s wrong in the story above? Why doesn’t a test case show this problem?
Alex Christensen
Comment 13
2020-05-28 10:59:58 PDT
Ah, this is because createCFURLFromBuffer first tries to decode the string as UTF-8 if it can then as Latin1. We should pass createCFURLFromBuffer an encoding to use instead of having it try one then the other. I think I would have to write a fuzzer to come up with a test case that produced different results if there is one.
Darin Adler
Comment 14
2020-05-28 12:35:38 PDT
Got it! So to show the bug, what we need is a Latin-1 sequence that is also a valid UTF-8 sequence. The test case would be U+00C2, U+00B6. That would give different CFURL results if passed as an 8-bit string and a 16-bit string.
Alex Christensen
Comment 15
2020-05-28 17:46:21 PDT
Fixing in
https://bugs.webkit.org/show_bug.cgi?id=212486
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug