193339 – Either StringView::UpconvertedCharacters::UpconvertedCharacters() or StringImpl::createCFString() is using the wrong encoding

RESOLVED INVALID 193339

https://bugs.webkit.org/show_bug.cgi?id=193339

Myles C. Maxfield

Reported 2019-01-10 14:34:34 PST

Strings in WebKit have two flavors: UTF-16, and UTF-16 with all the leading 0 bytes removed (if all the code points are <= 0xFF). StringImpl::createCFString() pretends the second one is kCFStringEncodingISOLatin1, which is not correct.

https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
https://www.unicode.org/charts/PDF/U0080.pdf
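To make the question concrete, here is a minimal sketch of the two string flavors, written outside WebKit. The names `Latin1Char`, `Utf16Unit`, `upconvert`, `fitsIn8Bits` and `narrow` are hypothetical stand-ins chosen for illustration, not WebKit's actual API. Upconversion is assumed to be plain zero-extension of each 8-bit unit, so the report comes down to whether that zero-extension agrees with what an ISO-8859-1 decoder produces for every byte.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for WebKit's LChar / UChar; not the real typedefs.
using Latin1Char = std::uint8_t;
using Utf16Unit = char16_t;

// Widen each 8-bit code unit to 16 bits by zero-extension.
// Byte 0xE9 becomes code unit 0x00E9.
std::u16string upconvert(const std::vector<Latin1Char>& source)
{
    std::u16string result;
    result.reserve(source.size());
    for (Latin1Char c : source)
        result.push_back(static_cast<Utf16Unit>(c));
    return result;
}

// True when every code unit is <= 0xFF, i.e. the string could be stored 8-bit.
bool fitsIn8Bits(const std::u16string& s)
{
    for (Utf16Unit c : s) {
        if (c > 0xFF)
            return false;
    }
    return true;
}

// Drop the (all-zero) high byte of each code unit. Caller must check fitsIn8Bits().
std::vector<Latin1Char> narrow(const std::u16string& s)
{
    assert(fitsIn8Bits(s));
    std::vector<Latin1Char> result;
    result.reserve(s.size());
    for (Utf16Unit c : s)
        result.push_back(static_cast<Latin1Char>(c));
    return result;
}
```

If zero-extension and ISO-8859-1 decoding agree for all 256 byte values, then labelling the 8-bit flavor as Latin-1 when handing it to another API is consistent with this upconversion.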

Myles C. Maxfield

Comment 1 2019-01-10 14:36:05 PST

Looks like https://bugs.webkit.org/show_bug.cgi?id=90720 is the culprit

Myles C. Maxfield

Comment 2 2019-01-10 14:40:36 PST

There seems to be some confusion over what encoding these strings have.

Myles C. Maxfield

Comment 3 2019-01-10 14:42:35 PST

Either StringBuilder::allocateBufferUpConvert() and StringView::UpconvertedCharacters::UpconvertedCharacters() are wrong, or StringImpl::createCFString() is wrong.

Myles C. Maxfield

Comment 4 2019-01-10 14:47:31 PST

StringImpl::copyCharacters() says it's not Latin1.

TextBreakIteratorICU says it uses Latin1 (in set8BitText()) but implements the functionality by calling StringImpl::copyCharacters(), which means TextBreakIteratorICU lies.

Myles C. Maxfield

Comment 5 2019-01-10 14:52:21 PST

bool equal(const LChar* a, const UChar* b, unsigned length) says it's not Latin1
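For readers unfamiliar with this kind of mixed-width comparison, a rough sketch follows. It is a hypothetical reimplementation for illustration, not WebKit's source, and the name `equalAcrossWidths` and the type aliases are assumptions. It treats each 8-bit unit as the code point with the same numeric value, which is only a Latin-1 comparison if Latin-1 is the first 256 code points of Unicode.

```cpp
#include <cstdint>

using Latin1Char = std::uint8_t;  // hypothetical stand-in for LChar
using Utf16Unit = char16_t;       // hypothetical stand-in for UChar

// Compare an 8-bit string with a 16-bit string, unit by unit,
// by zero-extending each 8-bit unit before comparing.
bool equalAcrossWidths(const Latin1Char* a, const Utf16Unit* b, unsigned length)
{
    for (unsigned i = 0; i < length; ++i) {
        if (static_cast<Utf16Unit>(a[i]) != b[i])
            return false;
    }
    return true;
}
```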

Myles C. Maxfield

Comment 6 2019-01-10 14:57:11 PST

I thought it wasn't true that Latin1 is just the first 255 characters of Unicode, but I'm checking now.

Myles C. Maxfield

Comment 7 2019-01-10 16:20:03 PST

Looks like it is true. Sorry for the noise.

var error = U_ZERO_ERROR
var i = Int8(1)
for item in 1 ... 0xFF {
    let source = [Int8(i)]
    let targetCapacity = ucnv_convert_63("UTF-16", "ISO-8859-1", nil, 0, source, Int32(source.count), &error)
    assert(error.rawValue > U_ZERO_ERROR.rawValue)
    error = U_ZERO_ERROR
    var target = [Int8](repeating: 0, count: Int(targetCapacity))
    ucnv_convert_63("UTF-16", "ISO-8859-1", &target, Int32(target.count), source, Int32(source.count), &error)
    target.withUnsafeBytes() {(unsafeRawBufferPointer: UnsafeRawBufferPointer) in
        let unsafeBufferPointer = unsafeRawBufferPointer.bindMemory(to: UInt16.self)
        for j in 0 ..< unsafeBufferPointer.count {
            print("\(String(item, radix: 16)) Code Unit \(j) -> \(String(unsafeBufferPointer[j], radix: 16))")
        }
        assert(unsafeBufferPointer.count == 2)
        assert(unsafeBufferPointer[0] == 0xfeff)
        assert(unsafeBufferPointer[1] == item)
    }
    i = i.addingReportingOverflow(1).partialValue
}

...
Program ended with exit code: 0

Myles C. Maxfield

Comment 8 2019-01-10 16:31:42 PST

Just verified it backwards, too.

Darin Adler

Comment 9 2019-01-14 06:35:46 PST

(In reply to Myles C. Maxfield from comment #6)
> I thought it wasn't true the Latin1 is just the first 255 characters of
> Unicode, but I'm checking now.

Here’s one reason you might be confused: When a website specifies Latin-1 as its character encoding, compatible web browsers treat the content of the website as windows-1252 instead, which is like Latin-1 but uses the bytes in the range 0x80-0x9F for 32 different characters, rather than for U+0080 through U+009F. You can see this in the WHATWG Encoding specification, where the labels for windows-1252 include strings like "l1", "latin1", and even "ascii". That encoding is what TextCodecLatin1.h/cpp implements. TextCodecLatin1.h/cpp could be renamed to avoid confusion with actual Latin-1.
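The difference between the two encodings can be sketched as a pair of single-byte decoders. This is an illustrative sketch, not the code in TextCodecLatin1.cpp; the function names `decodeLatin1` and `decodeWindows1252` are assumptions. The table follows the windows-1252 index in the WHATWG Encoding specification, in which five of the bytes in the range (0x81, 0x8D, 0x8F, 0x90, 0x9D) still map to the C1 control with the same value.

```cpp
#include <cstdint>

// Actual Latin-1 (ISO-8859-1 as a Unicode subset): byte N decodes to U+00NN.
char16_t decodeLatin1(std::uint8_t byte)
{
    return static_cast<char16_t>(byte);
}

// windows-1252 as the web treats content labelled "latin1":
// identical to Latin-1 except for bytes 0x80 through 0x9F.
char16_t decodeWindows1252(std::uint8_t byte)
{
    static const char16_t table[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, // 0x80-0x87
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F, // 0x88-0x8F
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, // 0x90-0x97
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178, // 0x98-0x9F
    };
    if (byte >= 0x80 && byte <= 0x9F)
        return table[byte - 0x80];
    return static_cast<char16_t>(byte);
}
```

For example, byte 0x80 decodes to U+0080 under Latin-1 but to U+20AC (the euro sign) under windows-1252, while byte 0xE9 decodes to U+00E9 under both.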

Status RESOLVED

Resolution INVALID

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Text

Assignee Nobody

Reported 2019-01-10 14:34 PST

Modified 2019-01-14 06:35 PST

CC List 5 users
