Add precise language tag parser implementing BCP47.
Created attachment 405783 [details] Patch WIP
Created attachment 405784 [details] Patch WIP
Created attachment 405815 [details] Patch
Comment on attachment 405815 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=405815&action=review Looks good overall. It sucks that ICU canonicalization isn't sufficient, but I think you're right that this is unavoidable if we want cross-engine consistency, so thanks for doing what I was too stubborn to do. ;) I think Darin will likely have comments regarding the string manipulation parts. > Source/JavaScriptCore/runtime/IntlObject.cpp:851 > + struct Code { > + LChar characters[8] { }; > + }; Is it necessary that this be wrapped in a struct? > Source/JavaScriptCore/runtime/IntlObject.cpp:875 > +static bool isUnicodeExtensionAttribute(StringView string) We could probably merge this with isUnicodeExtensionTypeComponent... > Source/JavaScriptCore/runtime/IntlObject.cpp:888 > +static bool isUnicodeExtensionTypeComponent(StringView string) ...but note that isUnicodeLocaleIdentifierType also exists, so this already feels a bit duplicative? > Source/JavaScriptCore/runtime/IntlObject.cpp:917 > +static bool isUnicodeTValue(StringView string) This appears to be just a TValue "component"? (And also is matching the two functions above...perhaps we should have an alphanum{i,j} helper function?) > Source/JavaScriptCore/runtime/IntlObject.cpp:1078 > + bool isTLangOrTField = false; Seems like this should be renamed to something like `found`... > Source/JavaScriptCore/runtime/IntlObject.cpp:1109 > + if (!isTLangOrTField) > return false; > - ++cursor; > + return true; ...and then you can just return it directly. > Source/JavaScriptCore/runtime/IntlObject.cpp:1125 > + if (!isUnicodeOtherExtensionValue(m_current)) > + return true; > + if (!next()) > + return true; Seems like in a case like this you could conjoin the conditions, but I understand if you prefer to keep parallelism with the rest of the logic. > Source/JavaScriptCore/runtime/IntlObject.cpp:1231 > + if (!parser.parseUnicodeLocaleId()) > + return false; > + if (!parser.isEOS()) > + return false; > + return true; (There's a part of me that really wants to write this as a single return statement, but I do see what you're aiming to convey with this code style. ;D)
Comment on attachment 405815 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=405815&action=review I wonder about the idioms here compared with the StringParsingBuffer idioms that Sam has been using. Have you looked at that? Parsers need extensive test suites, with cases that tickle every single branch. Do we have enough? > Source/JavaScriptCore/runtime/IntlObject.cpp:871 > + ASSERT(isASCIIAlphanumeric(singleton)); > + singleton = toASCIILower(singleton); > + // 0 - 9 => numeric > + // 10 - 35 => alpha > + if (isASCIIDigit(singleton)) > + return singleton - '0'; > + return (singleton - 'a') + 10; This could be a bit more like toASCIIHexDigit from ASCIICType.h. > Source/JavaScriptCore/runtime/IntlObject.cpp:880 > + if (length >= 3 && length <= 8) > + return string.isAllSpecialCharacters<isASCIIAlphanumeric>(); > + return false; Should do it all with &&: return length >= 3 && length<= 8 && string.isAllSpecialCharacters<isASCIIAlphanumeric>();
Generally looks fine. I have no objections.
Comment on attachment 405815 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=405815&action=review >> Source/JavaScriptCore/runtime/IntlObject.cpp:1231 >> + return true; > > (There's a part of me that really wants to write this as a single return statement, but I do see what you're aiming to convey with this code style. ;D) I feel the same way.
<rdar://problem/66720470>
Comment on attachment 405815 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=405815&action=review >> Source/JavaScriptCore/runtime/IntlObject.cpp:851 >> + }; > > Is it necessary that this be wrapped in a struct? This makes `bitwise_cast<VariantCode>()` easy by suppressing array to pointer decay. >> Source/JavaScriptCore/runtime/IntlObject.cpp:871 >> + return (singleton - 'a') + 10; > > This could be a bit more like toASCIIHexDigit from ASCIICType.h. Yeah, the difference is it is using G-Z too (so, it is not hex). >> Source/JavaScriptCore/runtime/IntlObject.cpp:875 >> +static bool isUnicodeExtensionAttribute(StringView string) > > We could probably merge this with isUnicodeExtensionTypeComponent... I think we should have different ones because they are different part while the patterns are the same by chance. >> Source/JavaScriptCore/runtime/IntlObject.cpp:880 >> + return false; > > Should do it all with &&: > > return length >= 3 && length<= 8 && string.isAllSpecialCharacters<isASCIIAlphanumeric>(); Changed >> Source/JavaScriptCore/runtime/IntlObject.cpp:888 >> +static bool isUnicodeExtensionTypeComponent(StringView string) > > ...but note that isUnicodeLocaleIdentifierType also exists, so this already feels a bit duplicative? Ditto. >> Source/JavaScriptCore/runtime/IntlObject.cpp:917 >> +static bool isUnicodeTValue(StringView string) > > This appears to be just a TValue "component"? > > (And also is matching the two functions above...perhaps we should have an alphanum{i,j} helper function?) Right, changed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:1109 >> + return true; > > ...and then you can just return it directly. Nice, fixed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:1125 >> + return true; > > Seems like in a case like this you could conjoin the conditions, but I understand if you prefer to keep parallelism with the rest of the logic. Yeah, these codes are aligned to the style used in parse function in this parser. >>> Source/JavaScriptCore/runtime/IntlObject.cpp:1231 >>> + return true; >> >> (There's a part of me that really wants to write this as a single return statement, but I do see what you're aiming to convey with this code style. ;D) > > I feel the same way. Because the other functions are using `break` / `continue` etc., I'm aligning these parse & isEOS() patterns with the other places not to introduce a bug. I think this makes code consistent and avoid introducing a bug.
Created attachment 407058 [details] Patch Updating
> I wonder about the idioms here compared with the StringParsingBuffer idioms that Sam has been using. Have you looked at that? Currently, this parser relies on split feature, so we cannot use StringParsingBuffer. We can revise whether we can use it later. > Parsers need extensive test suites, with cases that tickle every single branch. Do we have enough? I added more tests and ensured code coverage of the parser is 100% at least.
Committed r266039: <https://trac.webkit.org/changeset/266039>
Re-opened since this is blocked by bug 215756
Committed r266043: <https://trac.webkit.org/changeset/266043>