RESOLVED FIXED 233276
[WGSL] Implement enough of the lexer for the simplest shaders
https://bugs.webkit.org/show_bug.cgi?id=233276
Summary [WGSL] Implement enough of the lexer for the simplest shaders
Robin Morisset
Reported 2021-11-17 12:44:43 PST
It begins!
Attachments
WIP (52.44 KB, patch)
2021-11-17 12:45 PST, Robin Morisset
rmorisset: review-
rmorisset: commit-queue-
WIP (57.84 KB, patch)
2021-11-17 13:14 PST, Robin Morisset
no flags
Patch (82.17 KB, patch)
2021-11-29 14:49 PST, Robin Morisset
mmaxfield: review+
mmaxfield: commit-queue-
Patch (57.28 KB, patch)
2022-02-14 08:39 PST, Robin Morisset
no flags
Patch (57.91 KB, patch)
2022-02-14 08:53 PST, Robin Morisset
mmaxfield: review+
Patch (56.70 KB, patch)
2022-02-14 11:09 PST, Robin Morisset
no flags
Patch (59.37 KB, patch)
2022-02-14 12:40 PST, Robin Morisset
no flags
Robin Morisset
Comment 1 2021-11-17 12:45:28 PST
Created attachment 444554 [details] WIP Not ready for review.
Robin Morisset
Comment 2 2021-11-17 13:14:56 PST
Created attachment 444557 [details] WIP without forgetting Lexer.h
Radar WebKit Bug Importer
Comment 3 2021-11-24 12:45:19 PST
Robin Morisset
Comment 4 2021-11-29 14:49:27 PST
Myles C. Maxfield
Comment 5 2021-11-29 23:06:08 PST
Comment on attachment 445349 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=445349&action=review > Source/WebGPU/WGSL/AST/ASTNode.h:30 > +namespace WGSL { namespace AST { namespace WGSL::AST { > Source/WebGPU/WGSL/AST/ASTNode.h:39 > +// FIXME: all of these should probably be marked FAST_MALLOC or whatever it is to use bmalloc/libpas I did something similar in https://trac.webkit.org/changeset/286224/webkit > Source/WebGPU/WGSL/AST/Attribute.h:32 > +class Attribute : public ASTNode {}; Nit: Can classes have the } on a new line? Why are attributes represented as AST nodes? They're not recursive. Why not treat them as data? > Source/WebGPU/WGSL/AST/Attribute.h:36 > +class AttributeGroup : public Attribute { Nit: Usually descriptions go before the thing they're describing, so this should probably be GroupAttribute (and ditto for below). > Source/WebGPU/WGSL/AST/Attribute.h:38 > + unsigned group() const { return m_value; } Nit: usually we have an empty line before private: (or protected:) > Source/WebGPU/WGSL/AST/Attribute.h:52 > + enum Stage : uint8_t { enum class? > Source/WebGPU/WGSL/AST/Expression.h:32 > +class Expression : public ASTNode {}; This file is going to get quite big. Are you sure you want to have all nodes defined in a single file? > Source/WebGPU/WGSL/AST/GlobalDecl.h:50 > + // Each of the following may be null Nit: I don't think this comment is particularly helpful, because we have a class UniqueRef for non-nullable unique ptrs, so the type says that it can be null. The comment on the next line, though, is helpful. > Source/WebGPU/WGSL/AST/GlobalDecl.h:63 > + StringView m_name; It's cool that you're using StringViews and not Strings. > Source/WebGPU/WGSL/AST/GlobalDecl.h:66 > +class StructDecl : public GlobalDecl { Nit: Blank line between classes, please. > Source/WebGPU/WGSL/AST/GlobalDecl.h:85 > +}; Ditto. > Source/WebGPU/WGSL/AST/GlobalDecl.h:92 > + Type* maybeReturnType() { return m_returnType.get(); } Because this returns a pointer, I don't see much value in calling it "maybe". If it wasn't "maybe," it would return a reference. > Source/WebGPU/WGSL/AST/Statement.h:34 > +class CompoundStatement : public Statement { Why not "Block"? > Source/WebGPU/WGSL/AST/Statement.h:41 > +class ReturnStatement : public Statement { You're probably going to want the SPECIALIZE_TYPE_TRAITS stuff and virtual functions to be able to downcast these. > Source/WebGPU/WGSL/AST/Statement.h:43 > + Expression* maybeExpression() { return m_expression.get(); } Ditto about "maybe" > Source/WebGPU/WGSL/AST/Statement.h:50 > + Expression* maybeLhs() { return m_lhs.get(); } ... and other places too > Source/WebGPU/WGSL/AST/Statement.h:51 > + Expression* rhs() { return m_rhs.get(); } This should probably return a & > Source/WebGPU/WGSL/AST/Statement.h:55 > + std::unique_ptr<Expression> m_rhs; This should probably be a UniqueRef > Source/WebGPU/WGSL/AST/Type.h:44 > + enum Kind { Woah. I guess that's cool, since we don't have real templates/generics (yet). > Source/WebGPU/WGSL/AST/Type.h:68 > +class BoolType : public Type {}; > +class Float32Type : public Type {}; > +class Int32Type : public Type {}; > +class Uint32Type : public Type {}; Should these really be C++ types? User-defined types will be represented in C++ as instances of a C++ class, not as new C++ classes. Why not just have an instance of Type in-memory that is "the Float32 type"? > Source/WebGPU/WGSL/AST/VariableQualifier.h:40 > +enum class AccessMode : uint8_t { Other places you have the enum class inside the class its used in. How do you decide whether it should be in the class or outside the class? > Source/WebGPU/WGSL/Lexer.cpp:33 > +Token Lexer<T>::lex() Aren't there lexer generators? Did you consider using one of them? > Source/WebGPU/WGSL/Lexer.cpp:60 > + case '(': WebKit style is to put the cases at the same indentation as the switch. > Source/WebGPU/WGSL/Lexer.cpp:107 > + std::optional<uint64_t> postPeriod = parseDecimalInteger(); How does this work with "a.b"? > Source/WebGPU/WGSL/Lexer.cpp:111 > + literalValue /= pow(10, m_offset - offset); How confident are you that we're getting the right ULPs here? > Source/WebGPU/WGSL/Lexer.cpp:123 > + break; I don't understand how this works. surely "a - b" shouldn't return an invalid token? How does unary minus work? > Source/WebGPU/WGSL/Lexer.cpp:155 > + fractionalPart /= pow(10, m_offset - offset); Ditto about ULPs > Source/WebGPU/WGSL/Lexer.cpp:222 > + // FIXME: a trie would be more efficient here, look at JavaScriptCore/KeywordLookupGenerator.py for an example of code autogeneration that produces such a trie. Yep. Do we actually need a custom python script, or can we use some existing package? Maybe something in lex/yacc/bison/something else? > Source/WebGPU/WGSL/Lexer.cpp:224 > + // FIXME: I don't think that true/false/f32/u32/i32/bool need to be their own tokens, they could just be regular identifiers. Is it legal to say "let f32 = 7;"? If not, they should probably be their own tokens. > Source/WebGPU/WGSL/Lexer.cpp:270 > + if (m_current == '\n') { we have a name "newlineCharacter" for this in wtf/CharacterNames.h > Source/WebGPU/WGSL/Lexer.cpp:309 > + if (value.hasOverflowed()) > + return std::nullopt; Surely this should fail in a different way than the !isDecimal() check above? > Source/WebGPU/WGSL/Lexer.cpp:450 > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isWhiteSpace(UChar ch) > +{ > + // WGSL does not currently support any non-ASCII whitespace characters > + return Lexer<LChar>::isWhiteSpace(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isIdentifierStart(UChar ch) > +{ > + // WGSL does not currently support any non-ASCII characters in identifiers > + return Lexer<LChar>::isIdentifierStart(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isValidIdentifierCharacter(UChar ch) > +{ > + // WGSL does not currently support any non-ASCII characters in identifiers > + return Lexer<LChar>::isValidIdentifierCharacter(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isDecimal(UChar ch) > +{ > + return Lexer<LChar>::isDecimal(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isHexadecimal(UChar ch) > +{ > + return Lexer<LChar>::isHexadecimal(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE uint64_t Lexer<UChar>::readDecimal(UChar ch) > +{ > + return Lexer<LChar>::readDecimal(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE uint64_t Lexer<UChar>::readHexadecimal(UChar ch) > +{ > + return Lexer<LChar>::readHexadecimal(static_cast<LChar>(ch)); > +} This I think is more evidence that a StringView and a size_t will be sufficient. > Source/WebGPU/WGSL/Lexer.h:39 > + if constexpr (std::is_same<T, LChar>::value) { https://mobile.twitter.com/Litherum/status/1456047144519954436 I'm still not convinced that we actually need this template goop though. A StringView and a size_t should be sufficient. > Source/WebGPU/WGSL/Lexer.h:71 > + // Parse pattern (e|E)(\+|-)?[0-9]+f? if it is present, and return the exponent Perhaps these comments would be more useful around the definition, rather than the declaration. > Source/WebGPU/WGSL/Lexer.h:75 > + const T* m_code; Why did you choose to hold this in a pointer, rather than a StringView and a size_t? If you did it that way, this whole thing wouldn't need to be templated. > Source/WebGPU/WGSL/Parser.cpp:35 > + (void) lexer; UNUSED_VARIABLE(lexer); > Source/WebGPU/WGSL/Parser.cpp:37 > + return { { }, { } }; This is my favorite shader. 😜 > Source/WebGPU/WGSL/Parser.h:32 > +template<typename Lexer> Are you planning on having more than one kind of lexer? Why not just hardcode the one lexer that we've got? > Source/WebGPU/WGSL/SourceSpan.h:40 > + unsigned m_line; > + unsigned m_lineOffset; > + unsigned m_offset; > + unsigned m_length; Can we derive 2 of these 4 fields at reporting time, so we can store half the data in all the nodes? > Source/WebGPU/WGSL/SourceSpan.h:42 > + SourceSpan(unsigned line, unsigned lineOffset, unsigned offset, unsigned length) This is unnecessary. In the places where you could have written SourceSpan(1, 2, 3, 4) you can instead write SourceSpan { 1, 2, 3, 4 } (or sometimes even just { 1, 2, 3, 4 }). Using { } is preferred according to webkit style. > Source/WebGPU/WGSL/Token.h:51 > + // FIXME: add all the other keywords: file:///Users/rmorisset/MyGPUWeb/wgsl/wgsl.html#keyword-summary putting paths inside your local machine probably isn't the best thing > Source/WebGPU/WGSL/Token.h:76 > + union { Can we use Variant instead? > Source/WebGPU/WGSL/Token.h:102 > + ASSERT(type == TokenType::IntegerLiteral > + || type == TokenType::IntegerLiteralSigned > + || type == TokenType::IntegerLiteralUnsigned > + || type == TokenType::DecimalFloatLiteral > + || type == TokenType::HexFloatLiteral); I think there's a way to use the C++ type system to turn these into compile errors rather than runtime errors. Same for the other constructors. > Source/WebGPU/WGSL/Token.h:136 > + if (m_type == TokenType::Identifier) > + (&m_ident)->~StringView(); Using a variant would get rid of this > Source/WebGPU/WGSLUnitTests/WGSLLexerTests.mm:109 > + ++lineNumber; > + checkNextTokenIsIdentifier("a"); > + checkNextToken(WGSL::TokenType::Colon); > + checkNextTokenIsIdentifier("i32"); > + checkNextToken(WGSL::TokenType::Semicolon); Cool.
Robin Morisset
Comment 6 2022-02-14 08:39:54 PST
Robin Morisset
Comment 7 2022-02-14 08:42:01 PST
Thanks for the review. I've tried to fix the issues you found in the lexer, and separated the rest of the patch.
Robin Morisset
Comment 8 2022-02-14 08:53:49 PST
Created attachment 451909 [details] Patch Make true/false/i32/u32/f32/bool be keywords to match the spec. Also fix various nits found by the style linter.
Sam Weinig
Comment 9 2022-02-14 10:02:27 PST
Comment on attachment 451909 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=451909&action=review > Source/WebGPU/WGSLUnitTests/WGSLLexerTests.mm:31 > +@interface WGSLLexerTests : XCTestCase Seems weird to use XCTest here when we use GTest for the unit tests elsewhere in the code base. Given the cross platform history of WebKit, I don't think using a single platform testing tool likely makes long term sense.
Sam Weinig
Comment 10 2022-02-14 10:04:30 PST
How does one get text into the WGSL parser from the web (or rather, how will one)? If it is always from a file resource, perhaps we should consider making things UTF-8 only? If it might come from a DOMString / JavaScript string, that would be much less compelling.
Myles C. Maxfield
Comment 11 2022-02-14 10:33:34 PST
Comment on attachment 451909 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=451909&action=review > Source/WebGPU/ChangeLog:17 > + - space efficiency: we don't use an extra word of memory for the variant's tag https://github.com/akrzemi1/compact_optional can solve this. > Source/WebGPU/Configurations/WGSLUnitTests.xcconfig:24 > +EXECUTABLE_PREFIX = lib; I think I deleted this on purpose. https://trac.webkit.org/changeset/288608/webkit > Source/WebGPU/WGSL/Lexer.cpp:32 > +Token Lexer<T>::lex() Why not generate this? > Source/WebGPU/WGSL/Lexer.cpp:44 > + return makeToken(TokenType::ParenLeft); Can you add a Token(TokenType) implicit constructor to avoid all these makeToken() calls? > Source/WebGPU/WGSL/Lexer.cpp:197 > + if (isIdentifierStart(m_current)) { It might be worth doing the same kind of range check for 1-9 instead of having a bunch of case statements. > Source/WebGPU/WGSL/Lexer.cpp:372 > + case ' ': > + case '\t': > + case '\n': > + case '\v': > + case '\f': > + case '\r': We have more-friendly names for these in wtf/characternames.h > Source/WebGPU/WGSL/Lexer.cpp:435 > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isIdentifierStart(UChar ch) > +{ > + // WGSL does not currently support any non-ASCII characters in identifiers > + return Lexer<LChar>::isIdentifierStart(static_cast<LChar>(ch)); > +} Why specialize if the implementation is the same? > Source/WebGPU/WGSL/Lexer.h:39 > + if constexpr (std::is_same<T, LChar>::value) { I'd suggest either using template specialization or using `if constexpr` so the `else` block at the bottom can be a static_assert instead of a release_assert. > Source/WebGPU/WGSL/SourceSpan.h:33 > + unsigned m_line; > + unsigned m_lineOffset; > + unsigned m_offset; Why not have at least one of these fields be derived? That way, we don't have to track as much in the common case (the compile succeeds) > Source/WebGPU/WGSL/Token.cpp:31 > +String toString(TokenType type) Again, wish this could be generated. > Source/WebGPU/WGSL/Token.h:37 > + // - space efficiency: we don't use an extra word of memory for the variant's tag See changelog. > Source/WebGPU/WGSL/Token.h:38 > + // - ease of use: everywhere that we check for a given TokenType, we would have to first check that the Token is not nullopt, and then check the TokenType. I think this is slightly misstated - it's that users will already have to check the token type, so if we fold the optional into the token type, users get 2 checks for the price of one. Easier to use, and slightly more performant, and still just as type-safe as the alternative. >> Source/WebGPU/WGSLUnitTests/WGSLLexerTests.mm:31 >> +@interface WGSLLexerTests : XCTestCase > > Seems weird to use XCTest here when we use GTest for the unit tests elsewhere in the code base. Given the cross platform history of WebKit, I don't think using a single platform testing tool likely makes long term sense. The integration point of WebGPU is above this. Non-apple ports want to use Dawn or wgpu instead of WebGPU.framework, so they wouldn't ever run these tests. WebGPU.framework, and this compiler inside it, is already Metal-specific.
Myles C. Maxfield
Comment 12 2022-02-14 10:35:24 PST
(In reply to Sam Weinig from comment #10) > How does one get text into the WGSL parser from the web (or rather, how will > one)? If it is always from a file resource, perhaps we should consider > making things UTF-8 only? If it might come from a DOMString / JavaScript > string, that would be much less compelling. It's specified as a USVString passed to a JS method. https://gpuweb.github.io/gpuweb/#dom-gpudevice-createshadermodule
Robin Morisset
Comment 13 2022-02-14 11:08:36 PST
> > Source/WebGPU/ChangeLog:17 > > + - space efficiency: we don't use an extra word of memory for the variant's tag > > https://github.com/akrzemi1/compact_optional can solve this. Thanks, I'll mention it in the comment/Changelog as an alternative; I did not know of this library. > > Source/WebGPU/Configurations/WGSLUnitTests.xcconfig:24 > > +EXECUTABLE_PREFIX = lib; > > I think I deleted this on purpose. > https://trac.webkit.org/changeset/288608/webkit Thanks, fixed. > > Source/WebGPU/WGSL/Lexer.cpp:32 > > +Token Lexer<T>::lex() > > Why not generate this? I followed JSC's code. Using lex/yacc would probably have been easier, but would have made the build system even more of a nightmare, and would also have made it much harder to customize the error messages, or generally adding special customizations to the code later on. > > Source/WebGPU/WGSL/Lexer.cpp:44 > > + return makeToken(TokenType::ParenLeft); > > Can you add a Token(TokenType) implicit constructor to avoid all these > makeToken() calls? No: makeToken() also looks at variables inside the Lexer to produce accurate localization. > > Source/WebGPU/WGSL/Lexer.cpp:197 > > + if (isIdentifierStart(m_current)) { > > It might be worth doing the same kind of range check for 1-9 instead of > having a bunch of case statements. I am not sure I understand what you mean, or why it would be better. > > Source/WebGPU/WGSL/Lexer.cpp:372 > > + case ' ': > > + case '\t': > > + case '\n': > > + case '\v': > > + case '\f': > > + case '\r': > > We have more-friendly names for these in wtf/characternames.h I've only found 4 of them, I could not find any equivalent name for `\f` nor `\r`. Do you want me to replace the top 4 anyway? > > Source/WebGPU/WGSL/Lexer.cpp:435 > > +template <> > > +ALWAYS_INLINE bool Lexer<UChar>::isIdentifierStart(UChar ch) > > +{ > > + // WGSL does not currently support any non-ASCII characters in identifiers > > + return Lexer<LChar>::isIdentifierStart(static_cast<LChar>(ch)); > > +} > > Why specialize if the implementation is the same? No idea what I was thinking, fixed, thanks. > > > Source/WebGPU/WGSL/Lexer.h:39 > > + if constexpr (std::is_same<T, LChar>::value) { > > I'd suggest either using template specialization or using `if constexpr` so > the `else` block at the bottom can be a static_assert instead of a > release_assert. Done, thanks. > > Source/WebGPU/WGSL/SourceSpan.h:33 > > + unsigned m_line; > > + unsigned m_lineOffset; > > + unsigned m_offset; > > Why not have at least one of these fields be derived? That way, we don't > have to track as much in the common case (the compile succeeds) This sounds not-trivial to implement, putting it in a FIXME. > > > Source/WebGPU/WGSL/Token.h:38 > > + // - ease of use: everywhere that we check for a given TokenType, we would have to first check that the Token is not nullopt, and then check the TokenType. > > I think this is slightly misstated - it's that users will already have to > check the token type, so if we fold the optional into the token type, users > get 2 checks for the price of one. Easier to use, and slightly more > performant, and still just as type-safe as the alternative.
Robin Morisset
Comment 14 2022-02-14 11:09:15 PST
Myles C. Maxfield
Comment 15 2022-02-14 11:15:51 PST
(In reply to Robin Morisset from comment #13) > > > Source/WebGPU/WGSL/Lexer.cpp:32 > > > +Token Lexer<T>::lex() > > > > Why not generate this? > > I followed JSC's code. Using lex/yacc would probably have been easier, but > would have made the build system even more of a nightmare, and would also > have made it much harder to customize the error messages, or generally > adding special customizations to the code later on. It might be worth mentioning this in the ChangeLog. > > > Source/WebGPU/WGSL/Lexer.cpp:197 > > > + if (isIdentifierStart(m_current)) { > > > > It might be worth doing the same kind of range check for 1-9 instead of > > having a bunch of case statements. > > I am not sure I understand what you mean, or why it would be better. You don't have case 'a': case 'b': case 'c': ... but you do have case '1': case '2': case '3': I'm just suggesting you might want to use the same approach for both of these. > > > > Source/WebGPU/WGSL/Lexer.cpp:372 > > > + case ' ': > > > + case '\t': > > > + case '\n': > > > + case '\v': > > > + case '\f': > > > + case '\r': > > > > We have more-friendly names for these in wtf/characternames.h > > I've only found 4 of them, I could not find any equivalent name for `\f` nor > `\r`. > Do you want me to replace the top 4 anyway? I think you should add the human-readable names for the missing characters into characternames.h
Robin Morisset
Comment 16 2022-02-14 12:40:55 PST
EWS
Comment 17 2022-02-15 01:38:19 PST
Committed r289799 (247264@main): <https://commits.webkit.org/247264@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 451932 [details].
Darin Adler
Comment 18 2022-02-15 09:30:03 PST
Comment on attachment 451932 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=451932&action=review > Source/WebGPU/WGSL/Lexer.cpp:369 > + switch (ch) { > + case WTF::Unicode::space: > + case WTF::Unicode::tabCharacter: > + case WTF::Unicode::carriageReturn: > + case WTF::Unicode::newlineCharacter: > + case WTF::Unicode::verticalTabulation: > + case WTF::Unicode::formFeed: > + return true; > + default: > + return false; > + } We have carefully-written and optimized versions of some of these checks in ASCIICType.h, and they should be used in places like this. If there is a reason we must not use those, I would like to know it. Here it would be * isASCIISpace*, and I believe its implementation is more efficient than the idiom here so using it could slightly help performance. > Source/WebGPU/WGSL/Lexer.cpp:375 > + return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'); isASCIIAlpha > Source/WebGPU/WGSL/Lexer.cpp:381 > + return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '_'; isASCIIAlphanumeric > Source/WebGPU/WGSL/Lexer.cpp:387 > + return (ch >= '0' && ch <= '9'); isASCIIDigit > Source/WebGPU/WGSL/Lexer.cpp:393 > + return (ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F'); isASCIIHexDigit > Source/WebGPU/WGSL/Lexer.cpp:412 > + ASSERT(isHexadecimal(ch)); > + if (ch >= '0' && ch <= '9') > + return ch - '0'; > + if (ch >= 'a' && ch <= 'f') > + return ch - 'a'; > + ASSERT(ch >= 'A' && ch <= 'F'); > + return ch - 'A'; toASCIIHexValue
Robin Morisset
Comment 19 2022-02-17 08:53:40 PST
Thank you, I did not know about these functions in WTF. I made a trivial patch to use them: https://bugs.webkit.org/show_bug.cgi?id=236783.
Note You need to log in before you can comment on or make changes to this bug.