WebKit Bugzilla
RESOLVED INVALID
193339
Either StringView::UpconvertedCharacters::UpconvertedCharacters() or StringImpl::createCFString() is using the wrong encoding
https://bugs.webkit.org/show_bug.cgi?id=193339
Myles C. Maxfield
Reported
2019-01-10 14:34:34 PST
Strings in WebKit have two flavors: UTF-16, and UTF-16 with all the leading 0 bytes removed (if all the code points are <= 0xFF). StringImpl::createCFString() pretends the second one is kCFStringEncodingISOLatin1, which is not correct.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
https://www.unicode.org/charts/PDF/U0080.pdf
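A minimal sketch of what "UTF-16 with the leading 0 bytes removed" implies, assuming the 8-bit flavor is widened by plain zero extension. The helper `upconvert` is hypothetical and written for illustration; it is not WebKit's implementation.

```swift
// Hypothetical helper (not WebKit code): widen 8-bit code units to
// UTF-16 code units by zero extension, one byte per code unit.
func upconvert(_ narrow: [UInt8]) -> [UInt16] {
    return narrow.map { UInt16($0) }
}

// "café" stored one byte per character; 0xE9 is U+00E9 (é).
let narrow: [UInt8] = [0x63, 0x61, 0x66, 0xE9]
let wide = upconvert(narrow)

// Interpret the widened buffer as UTF-16.
let text = String(decoding: wide, as: UTF16.self)
print(text)
```

If zero extension is the right upconversion, the 8-bit flavor is an encoding in which byte N always means code point U+00NN; the question in this bug is whether that is the same thing as ISO-8859-1.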
Myles C. Maxfield
Comment 1
2019-01-10 14:36:05 PST
Looks like https://bugs.webkit.org/show_bug.cgi?id=90720 is the culprit.
Myles C. Maxfield
Comment 2
2019-01-10 14:40:36 PST
There seems to be some confusion over what encoding these strings have.
Myles C. Maxfield
Comment 3
2019-01-10 14:42:35 PST
Either StringBuilder::allocateBufferUpConvert() and StringView::UpconvertedCharacters::UpconvertedCharacters() are wrong, or StringImpl::createCFString() is wrong.
Myles C. Maxfield
Comment 4
2019-01-10 14:47:31 PST
StringImpl::copyCharacters() says it's not Latin1.
TextBreakIteratorICU says it uses Latin1 (in set8BitText()) but implements the functionality by calling StringImpl::copyCharacters(), which means TextBreakIteratorICU lies.
Myles C. Maxfield
Comment 5
2019-01-10 14:52:21 PST
bool equal(const LChar* a, const UChar* b, unsigned length) says it's not Latin1
Myles C. Maxfield
Comment 6
2019-01-10 14:57:11 PST
I thought it wasn't true that Latin1 is just the first 255 characters of Unicode, but I'm checking now.
Myles C. Maxfield
Comment 7
2019-01-10 16:20:03 PST
Looks like it is true. Sorry for the noise.

var error = U_ZERO_ERROR
var i = Int8(1)
for item in 1 ... 0xFF {
    let source = [Int8(i)]
    let targetCapacity = ucnv_convert_63("UTF-16", "ISO-8859-1", nil, 0, source, Int32(source.count), &error)
    assert(error.rawValue > U_ZERO_ERROR.rawValue)
    error = U_ZERO_ERROR
    var target = [Int8](repeating: 0, count: Int(targetCapacity))
    ucnv_convert_63("UTF-16", "ISO-8859-1", &target, Int32(target.count), source, Int32(source.count), &error)
    target.withUnsafeBytes() {(unsafeRawBufferPointer: UnsafeRawBufferPointer) in
        let unsafeBufferPointer = unsafeRawBufferPointer.bindMemory(to: UInt16.self)
        for j in 0 ..< unsafeBufferPointer.count {
            print("\(String(item, radix: 16)) Code Unit \(j) -> \(String(unsafeBufferPointer[j], radix: 16))")
        }
        assert(unsafeBufferPointer.count == 2)
        assert(unsafeBufferPointer[0] == 0xfeff)
        assert(unsafeBufferPointer[1] == item)
    }
    i = i.addingReportingOverflow(1).partialValue
}

...
Program ended with exit code: 0
Myles C. Maxfield
Comment 8
2019-01-10 16:31:42 PST
Just verified it backwards, too.
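The reverse check was not posted. One way it might look is sketched below, assuming Foundation's `.isoLatin1` string encoding stands in for the ICU converter used in comment 7. The function `latin1Mismatches` is a hypothetical name introduced here.

```swift
import Foundation

// Hypothetical check (not the code that was actually run): encode each code
// point U+0000...U+00FF as ISO-8859-1 and collect every value that does not
// come back as the single byte equal to the code point.
func latin1Mismatches() -> [UInt32] {
    var mismatches: [UInt32] = []
    for value in UInt32(0)...0xFF {
        guard let scalar = Unicode.Scalar(value) else {
            mismatches.append(value)
            continue
        }
        let string = String(Character(scalar))
        // data(using:) returns nil if the string cannot be encoded.
        let bytes = string.data(using: .isoLatin1).map { [UInt8]($0) }
        if bytes != [UInt8(value)] {
            mismatches.append(value)
        }
    }
    return mismatches
}

// An empty list means every code point mapped to the byte of the same value.
print(latin1Mismatches())
```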
Darin Adler
Comment 9
2019-01-14 06:35:46 PST
(In reply to Myles C. Maxfield from comment #6)
> I thought it wasn't true that Latin1 is just the first 255 characters of
> Unicode, but I'm checking now.
Here’s one reason you might be confused: When a website specifies Latin-1 as its character encoding, compatible web browsers treat the content of the website as windows-1252 instead, which is like Latin-1 but uses the bytes in the range 0x80-0x9F for a different set of characters, rather than for U+0080 through U+009F. You can see this in the WHATWG encoding specification, where the names for windows-1252 include strings like "l1", "latin1", and even "ascii". That encoding is what TextCodecLatin1.h/cpp implements. TextCodecLatin1.h/cpp could be renamed to avoid confusion with actual Latin-1.
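A small sketch of the difference, for illustration only. The table `windows1252Sample` is deliberately partial (five well-known windows-1252 assignments) and both decode functions are hypothetical names, not WebKit or TextCodecLatin1 code.

```swift
// Partial, illustrative table: a few windows-1252 assignments in 0x80-0x9F.
// This is not the full mapping.
let windows1252Sample: [UInt8: UInt16] = [
    0x80: 0x20AC, // euro sign
    0x91: 0x2018, // left single quotation mark
    0x92: 0x2019, // right single quotation mark
    0x93: 0x201C, // left double quotation mark
    0x94: 0x201D, // right double quotation mark
]

// Actual Latin-1 (ISO-8859-1): every byte maps to the code point of the same value.
func decodeLatin1(_ byte: UInt8) -> UInt16 {
    return UInt16(byte)
}

// windows-1252 agrees with Latin-1 outside 0x80-0x9F. Inside that range this
// sketch only knows the sample entries and returns nil for the rest.
func decodeWindows1252Sample(_ byte: UInt8) -> UInt16? {
    if !(0x80...0x9F).contains(byte) {
        return UInt16(byte)
    }
    return windows1252Sample[byte]
}

for byte in [UInt8(0x41), 0x80, 0x93, 0xE9] {
    let latin1 = String(decodeLatin1(byte), radix: 16)
    let cp1252 = decodeWindows1252Sample(byte).map { String($0, radix: 16) } ?? "not in sample"
    print("byte \(String(byte, radix: 16)): Latin-1 U+\(latin1), windows-1252 U+\(cp1252)")
}
```

The two decoders agree for 0x41 and 0xE9 and disagree for 0x80 and 0x93, which is the range where a "Latin-1" web page and an 8-bit WebKit string would be interpreted differently.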