Bug 193339
Summary: Either StringView::UpconvertedCharacters::UpconvertedCharacters() or StringImpl::createCFString() is using the wrong encoding
Product: WebKit
Component: Text
Status: RESOLVED INVALID
Severity: Normal
Priority: P2
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Reporter: Myles C. Maxfield <mmaxfield>
Assignee: Nobody <webkit-unassigned>
CC: ap, benjamin, darin, mmaxfield, rniwa
Myles C. Maxfield
Strings in WebKit have two flavors: UTF-16, and UTF-16 with all the leading 0 bytes removed (if all the code points are <= 0xFF).
StringImpl::createCFString() pretends the second one is kCFStringEncodingISOLatin1, which is not correct.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
https://www.unicode.org/charts/PDF/U0080.pdf
Myles C. Maxfield
Looks like https://bugs.webkit.org/show_bug.cgi?id=90720 is the culprit
Myles C. Maxfield
There seems to be some confusion over what encoding these strings have.
Myles C. Maxfield
Either StringBuilder::allocateBufferUpConvert() and StringView::UpconvertedCharacters::UpconvertedCharacters() are wrong, or StringImpl::createCFString() is wrong.
Myles C. Maxfield
StringImpl::copyCharacters() says it's not Latin1
TextBreakIteratorICU says it uses Latin1 (in set8BitText()) but implements the functionality by calling StringImpl::copyCharacters(), which means TextBreakIteratorICU lies
Myles C. Maxfield
bool equal(const LChar* a, const UChar* b, unsigned length) says it's not Latin1
Myles C. Maxfield
I thought it wasn't true that Latin1 is just the first 255 characters of Unicode, but I'm checking now.
Myles C. Maxfield
Looks like it is true. Sorry for the noise.
var error = U_ZERO_ERROR
var i = Int8(1)
for item in 1 ... 0xFF {
    // i holds the same bit pattern as item, as a signed byte.
    let source = [Int8(i)]
    // Preflight with a nil target to learn the required output size in bytes.
    let targetCapacity = ucnv_convert_63("UTF-16", "ISO-8859-1", nil, 0, source, Int32(source.count), &error)
    // Preflighting is expected to report an error (buffer overflow).
    assert(error.rawValue > U_ZERO_ERROR.rawValue)
    error = U_ZERO_ERROR
    var target = [Int8](repeating: 0, count: Int(targetCapacity))
    ucnv_convert_63("UTF-16", "ISO-8859-1", &target, Int32(target.count), source, Int32(source.count), &error)
    target.withUnsafeBytes() {(unsafeRawBufferPointer: UnsafeRawBufferPointer) in
        let unsafeBufferPointer = unsafeRawBufferPointer.bindMemory(to: UInt16.self)
        for j in 0 ..< unsafeBufferPointer.count {
            print("\(String(item, radix: 16)) Code Unit \(j) -> \(String(unsafeBufferPointer[j], radix: 16))")
        }
        // Expect a byte order mark followed by one code unit equal to the input byte.
        assert(unsafeBufferPointer.count == 2)
        assert(unsafeBufferPointer[0] == 0xfeff)
        assert(unsafeBufferPointer[1] == item)
    }
    // Advance to the next byte value, wrapping from 0x7F to 0x80.
    i = i.addingReportingOverflow(1).partialValue
}
...
Program ended with exit code: 0
Myles C. Maxfield
Just verified it backwards, too.
Darin Adler
(In reply to Myles C. Maxfield from comment #6)
> I thought it wasn't true that Latin1 is just the first 255 characters of
> Unicode, but I'm checking now.
Here’s one reason you might be confused:
When a website specifies Latin-1 as its character encoding, compatible web browsers treat the content of the website as windows-1252 instead, which is like Latin-1 but uses the bytes in the range 0x80-0x9F for 32 different characters, rather than for U+0080 through U+009F.
You can see this in the WHATWG encoding specification, where the names for windows-1252 include strings like "l1", "latin1", and even "ascii".
That encoding is what TextCodecLatin1.h/cpp implements. TextCodecLatin1.h/cpp could be renamed to avoid confusion with actual Latin-1.
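The difference between the two encodings can be sketched in C++ as follows. decodeLatin1(), decodeWindows1252(), and kWindows1252High are hypothetical names for illustration, not the TextCodecLatin1 implementation; the table values follow the windows-1252 index in the WHATWG encoding specification, which keeps the five bytes without a windows-1252 assignment (0x81, 0x8D, 0x8F, 0x90, 0x9D) mapped to the matching C1 control code points.

```cpp
#include <array>
#include <cstdint>

// Code points for bytes 0x80..0x9F under windows-1252 (WHATWG index).
constexpr std::array<char16_t, 32> kWindows1252High = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Actual Latin-1 (ISO-8859-1): every byte maps to the code point of the same value.
constexpr char16_t decodeLatin1(std::uint8_t byte)
{
    return byte;
}

// windows-1252: identical to Latin-1 except for the 0x80..0x9F range.
constexpr char16_t decodeWindows1252(std::uint8_t byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return kWindows1252High[byte - 0x80];
    return byte;
}
```

For example, the byte 0x80 decodes to U+0080 as actual Latin-1 but to U+20AC (the euro sign) as windows-1252, while 0xE9 decodes to U+00E9 under both.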