Bug 193339

Summary:	Either StringView::UpconvertedCharacters::UpconvertedCharacters() or StringImpl::createCFString() is using the wrong encoding
Product:	WebKit	Reporter:	Myles C. Maxfield <mmaxfield>
Component:	Text	Assignee:	Nobody <webkit-unassigned>
Status:	RESOLVED INVALID
Severity:	Normal	CC:	ap, benjamin, darin, mmaxfield, rniwa
Priority:	P2
Version:	WebKit Nightly Build
Hardware:	Unspecified
OS:	Unspecified

Myles C. Maxfield

Reported 2019-01-10 14:34:34 PST

Strings in WebKit have two flavors: UTF-16, and UTF-16 with all the leading 0 bytes removed (if all the code points are <= 0xFF). StringImpl::createCFString() pretends the second one is kCFStringEncodingISOLatin1, which is not correct.

https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
https://www.unicode.org/charts/PDF/U0080.pdf
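To make the "leading 0 bytes removed" flavor concrete, here is a minimal sketch of widening an 8-bit string back to UTF-16 by zero-extension. The `LChar` and `UChar` aliases and the `upconvert` helper are stand-ins written for illustration, not WebKit's actual implementation. Labeling the 8-bit buffer as ISO-8859-1 is only right if that encoding maps every byte 0xNN to U+00NN, which is the question this bug raises.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-ins for WebKit's character types (assumed for this sketch).
using LChar = std::uint8_t; // 8-bit code unit
using UChar = char16_t;     // UTF-16 code unit

// Widen an 8-bit string to UTF-16 by zero-extending each byte.
// No lookup table is involved: byte 0xNN always becomes code unit 0x00NN.
std::u16string upconvert(const LChar* source, std::size_t length)
{
    std::u16string result;
    result.reserve(length);
    for (std::size_t i = 0; i < length; ++i)
        result.push_back(static_cast<UChar>(source[i]));
    return result;
}
```

Under this scheme the byte 0x80 widens to U+0080, so any consumer that is handed the 8-bit buffer must agree on that mapping.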

Myles C. Maxfield

Comment 1 2019-01-10 14:36:05 PST

Looks like https://bugs.webkit.org/show_bug.cgi?id=90720 is the culprit

Myles C. Maxfield

Comment 2 2019-01-10 14:40:36 PST

There seems to be some confusion over what encoding these strings have.

Myles C. Maxfield

Comment 3 2019-01-10 14:42:35 PST

Either StringBuilder::allocateBufferUpConvert() and StringView::UpconvertedCharacters::UpconvertedCharacters() are wrong, or StringImpl::createCFString() is wrong.

Myles C. Maxfield

Comment 4 2019-01-10 14:47:31 PST

StringImpl::copyCharacters() says it's not Latin1. TextBreakIteratorICU says it uses Latin1 (in set8BitText()) but implements the functionality by calling StringImpl::copyCharacters(), which means TextBreakIteratorICU lies.

Myles C. Maxfield

Comment 5 2019-01-10 14:52:21 PST

bool equal(const LChar* a, const UChar* b, unsigned length) says it's not Latin1
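A mixed-width comparison of this shape presumably compares code units directly instead of running the 8-bit side through a decoder. The following is a sketch under that assumption; `equalMixed` is a hypothetical name and the body is not copied from WebKit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for WebKit's character types (assumed for this sketch).
using LChar = std::uint8_t; // 8-bit code unit
using UChar = char16_t;     // UTF-16 code unit

// Compare an 8-bit string with a 16-bit string, one code unit at a time.
// Each 8-bit unit is zero-extended before the comparison; nothing is remapped.
bool equalMixed(const LChar* a, const UChar* b, unsigned length)
{
    for (unsigned i = 0; i < length; ++i) {
        if (static_cast<UChar>(a[i]) != b[i])
            return false;
    }
    return true;
}
```

With this logic the byte 0x9F matches only U+009F, so the comparison is consistent with a Latin1 label exactly when Latin1 is the identity mapping onto U+0000 through U+00FF.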

Myles C. Maxfield

Comment 6 2019-01-10 14:57:11 PST

I thought it wasn't true that Latin1 is just the first 255 characters of Unicode, but I'm checking now.

Myles C. Maxfield

Comment 7 2019-01-10 16:20:03 PST

Looks like it is true. Sorry for the noise.

    var error = U_ZERO_ERROR
    var i = Int8(1)
    for item in 1 ... 0xFF {
        let source = [Int8(i)]
        let targetCapacity = ucnv_convert_63("UTF-16", "ISO-8859-1", nil, 0, source, Int32(source.count), &error)
        assert(error.rawValue > U_ZERO_ERROR.rawValue)
        error = U_ZERO_ERROR
        var target = [Int8](repeating: 0, count: Int(targetCapacity))
        ucnv_convert_63("UTF-16", "ISO-8859-1", &target, Int32(target.count), source, Int32(source.count), &error)
        target.withUnsafeBytes() {(unsafeRawBufferPointer: UnsafeRawBufferPointer) in
            let unsafeBufferPointer = unsafeRawBufferPointer.bindMemory(to: UInt16.self)
            for j in 0 ..< unsafeBufferPointer.count {
                print("\(String(item, radix: 16)) Code Unit \(j) -> \(String(unsafeBufferPointer[j], radix: 16))")
            }
            assert(unsafeBufferPointer.count == 2)
            assert(unsafeBufferPointer[0] == 0xfeff)
            assert(unsafeBufferPointer[1] == item)
        }
        i = i.addingReportingOverflow(1).partialValue
    }

    ...
    Program ended with exit code: 0

Myles C. Maxfield

Comment 8 2019-01-10 16:31:42 PST

Just verified it backwards, too.

Darin Adler

Comment 9 2019-01-14 06:35:46 PST

(In reply to Myles C. Maxfield from comment #6)
> I thought it wasn't true that Latin1 is just the first 255 characters of
> Unicode, but I'm checking now.

Here’s one reason you might be confused: When a website specifies Latin-1 as its character encoding, compatible web browsers treat the content of the website as windows-1252 instead, which is like Latin-1 but uses the bytes in the range 0x80-0x9F for 32 different characters, rather than for U+0080 through U+009F. You can see this in the WHATWG encoding specification, where the names for windows-1252 include strings like "l1", "latin1", and even "ascii". That encoding is what TextCodecLatin1.h/cpp implements. TextCodecLatin1.h/cpp could be renamed to avoid confusion with actual Latin-1.
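The difference between the two encodings can be sketched in a few lines. The table below follows the windows-1252 index in the WHATWG Encoding Standard; the function names are hypothetical and this is not the code in TextCodecLatin1.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Code points that windows-1252 assigns to bytes 0x80 through 0x9F, following
// the WHATWG Encoding Standard index. Five entries (0x81, 0x8D, 0x8F, 0x90,
// 0x9D) keep the same value as the byte; the rest are remapped.
constexpr char16_t windows1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Actual Latin-1 (ISO-8859-1): every byte 0xNN is U+00NN.
constexpr char16_t decodeLatin1(std::uint8_t byte)
{
    return byte;
}

// windows-1252: identical to Latin-1 outside the range 0x80-0x9F.
constexpr char16_t decodeWindows1252(std::uint8_t byte)
{
    return (byte >= 0x80 && byte <= 0x9F) ? windows1252High[byte - 0x80] : char16_t(byte);
}
```

For example, the byte 0x80 decodes to U+0080 as Latin-1 but to U+20AC (the euro sign) as windows-1252, while 0xE9 decodes to U+00E9 under both.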
