Bug 156137

Summary: Document the native format of JSChar type
Product: WebKit Reporter: David Kilzer (:ddkilzer) <ddkilzer>
Component: JavaScriptCoreAssignee: David Kilzer (:ddkilzer) <ddkilzer>
Status: RESOLVED FIXED    
Severity: Normal CC: commit-queue, darin, ggaren, keith_miller, mark.lam, msaboff, saam
Priority: P2    
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch v1
none
Patch v2
none
Patch v3 none

David Kilzer (:ddkilzer)
Reported 2016-04-02 13:55:43 PDT
We should document that JSStringCreateWithCharacters() takes UTF16LE as its native format. At least one customer had trouble figuring this out: <https://twitter.com/jeremy_wyld/status/716053938148143104>
Attachments
Patch v1 (1.51 KB, patch)
2016-04-02 14:06 PDT, David Kilzer (:ddkilzer)
no flags
Patch v2 (1.24 KB, patch)
2016-04-03 10:45 PDT, David Kilzer (:ddkilzer)
no flags
Patch v3 (1.33 KB, patch)
2016-06-14 15:21 PDT, David Kilzer (:ddkilzer)
no flags
David Kilzer (:ddkilzer)
Comment 1 2016-04-02 14:06:13 PDT
Created attachment 275484 [details] Patch v1
David Kilzer (:ddkilzer)
Comment 2 2016-04-02 14:08:20 PDT
(In reply to comment #0) > We should document that JSStringCreateWithCharacters() takes UTF16LE as its > native format. > > At least one customer had trouble figuring this out: > <https://twitter.com/jeremy_wyld/status/716053938148143104> And I didn't know off the top of my head, either.
Darin Adler
Comment 3 2016-04-02 20:51:19 PDT
Comment on attachment 275484 [details] Patch v1 View in context: https://bugs.webkit.org/attachment.cgi?id=275484&action=review > Source/JavaScriptCore/API/JSStringRef.h:54 > +@abstract Creates a JavaScript string from a buffer of Unicode characters > + in UTF16LE format. If we want to document the use of UTF-16, I think the best place to add a comment would be in the definition of the JSChar type. That’s the approach the Cocoa platform takes with UniChar in the MacTypes.h header. This is the same semantic as CFStringCreateWithCharacters; the documentation for that function doesn’t specify UTF-16 explicitly. If we did have a reason to specify UTF-16 for this function, we would also want to specify it in JSStringGetLength and JSStringGetCharactersPtr as well. If we do make this type of change, we should say UTF-16, not UTF16LE. It’s little endian in practice on Apple platforms, sure, but that’s because each JSChar is in the CPU’s native byte ordering. Because Apple platforms run Intel and ARM in little endian mode it’s always little endian on those. UTF-16 is a good name for this behavior. It might make sense to specify UTF-16 LE if the argument type was an array of bytes rather than an array of JSChar. Back when Apple had PowerPC versions of this, the function would take UTF-16 BE, and that’s still the case any time WebKit is compiled for a big endian processor.
Darin Adler
Comment 4 2016-04-02 20:52:42 PDT
I just read the thread with Jeremy and see that he was confused. The answer is that this has the same semantic as CFStringCreateWithCharacters, UTF-16 with CPU’s native encoding.
David Kilzer (:ddkilzer)
Comment 5 2016-04-03 10:45:40 PDT
Created attachment 275504 [details] Patch v2
David Kilzer (:ddkilzer)
Comment 6 2016-04-03 10:46:04 PDT
(In reply to comment #4) > I just read the thread with Jeremy and see that he was confused. The answer > is that this has the same semantic as CFStringCreateWithCharacters, UTF-16 > with CPU’s native encoding. Ah, makes sense. Updated patch.
Darin Adler
Comment 7 2016-04-03 11:58:09 PDT
Comment on attachment 275504 [details] Patch v2 View in context: https://bugs.webkit.org/attachment.cgi?id=275504&action=review > Source/JavaScriptCore/API/JSStringRef.h:45 > +@abstract A Unicode character in UTF-16 format. Endianness depends on the > + underlying architecture. I don’t think we should write “Endianness depends on the underlying architecture”; that’s true of all scalar types in C all the time. Just as true of "unsigned short" or "int" or even "char*" as it is of JSChar. I understand that Jeremy was confused about this, but I don’t think writing this sentence is the best way to make that clear. Makes sense to mention UTF-16, but it’s inaccurate to call this a “character in UTF-16 format” since some UTF-16 code units are not characters. The comment should say something more like this: A UTF-16 code unit; one, or a sequence of two, can encode any Unicode character. Often loosely referred to as a "Unicode character" because only one code unit was needed in pre-1996 Unicode. If we really want to mention endianness we could add a third sentence: As with all scalar types, endianness depends on the underlying architecture. I’m not sure my long comments should be used. They are more accurate but may not be as easy to understand.
David Kilzer (:ddkilzer)
Comment 8 2016-06-14 15:21:57 PDT
Created attachment 281290 [details] Patch v3
David Kilzer (:ddkilzer)
Comment 9 2016-06-14 15:23:58 PDT
Comment on attachment 275504 [details] Patch v2 View in context: https://bugs.webkit.org/attachment.cgi?id=275504&action=review >> Source/JavaScriptCore/API/JSStringRef.h:45 >> + underlying architecture. > > I don’t think we should write “Endianness depends on the underlying architecture”; that’s true of all scalar types in C all the time. Just as true of "unsigned short" or "int" or even "char*" as it is of JSChar. I understand that Jeremy was confused about this, but I don’t think writing this sentence is the best way to make that clear. > > Makes sense to mention UTF-16, but it’s inaccurate to call this a “character in UTF-16 format” since some UTF-16 code units are not characters. The comment should say something more like this: > > A UTF-16 code unit; one, or a sequence of two, can encode any Unicode character. > Often loosely referred to as a "Unicode character" because only one code unit was needed in pre-1996 Unicode. > > If we really want to mention endianness we could add a third sentence: > > As with all scalar types, endianness depends on the underlying architecture. > > I’m not sure my long comments should be used. They are more accurate but may not be as easy to understand. I updated the documentation to use your example, but omitted the "Often loosely...." sentence.
David Kilzer (:ddkilzer)
Comment 10 2016-06-14 15:28:15 PDT
(In reply to comment #7) > I don’t think we should write “Endianness depends on the underlying > architecture”; that’s true of all scalar types in C all the time. Just as > true of "unsigned short" or "int" or even "char*" as it is of JSChar. I > understand that Jeremy was confused about this, but I don’t think writing > this sentence is the best way to make that clear. What I was thinking at the time was whether one would ever have a string in UTF-16 BE format in memory on a little-endian system, or a string in UTF-16 LE format in memory on a big-endian system, rather than always matching endianness. (That's because files saved on disk can be in either format, and the system that loads them has to deal with both.) After thinking about it, though, I realize that you'd be crazy to store UTF-16 in an endianness format that didn't match the underlying architecture, since you'd just create a lot of extra work for yourself if you did that. :)
WebKit Commit Bot
Comment 11 2016-06-14 15:54:24 PDT
Comment on attachment 281290 [details] Patch v3 Clearing flags on attachment: 281290 Committed r202069: <http://trac.webkit.org/changeset/202069>
WebKit Commit Bot
Comment 12 2016-06-14 15:54:29 PDT
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.