| Summary: | [WGSL] Implement enough of the lexer for the simplest shaders | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | WebKit | Reporter: | Robin Morisset <rmorisset> | ||||||||||||||||
| Component: | JavaScriptCore | Assignee: | Robin Morisset <rmorisset> | ||||||||||||||||
| Status: | RESOLVED FIXED | ||||||||||||||||||
| Severity: | Normal | CC: | benjamin, cdumez, cmarcelo, darin, ews-watchlist, mmaxfield, sam, webkit-bug-importer | ||||||||||||||||
| Priority: | P2 | Keywords: | InRadar | ||||||||||||||||
| Version: | WebKit Nightly Build | ||||||||||||||||||
| Hardware: | Unspecified | ||||||||||||||||||
| OS: | Unspecified | ||||||||||||||||||
| Bug Depends on: | |||||||||||||||||||
| Bug Blocks: | 236783, 237629 | ||||||||||||||||||
| Attachments: |
|
||||||||||||||||||
|
Description
Robin Morisset
2021-11-17 12:44:43 PST
Created attachment 444554 [details]
WIP
Not ready for review.
Created attachment 444557 [details]
WIP
without forgetting Lexer.h
Created attachment 445349 [details]
Patch
Comment on attachment 445349 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=445349&action=review > Source/WebGPU/WGSL/AST/ASTNode.h:30 > +namespace WGSL { namespace AST { namespace WGSL::AST { > Source/WebGPU/WGSL/AST/ASTNode.h:39 > +// FIXME: all of these should probably be marked FAST_MALLOC or whatever it is to use bmalloc/libpas I did something similar in https://trac.webkit.org/changeset/286224/webkit > Source/WebGPU/WGSL/AST/Attribute.h:32 > +class Attribute : public ASTNode {}; Nit: Can classes have the } on a new line? Why are attributes represented as AST nodes? They're not recursive. Why not treat them as data? > Source/WebGPU/WGSL/AST/Attribute.h:36 > +class AttributeGroup : public Attribute { Nit: Usually descriptions go before the thing they're describing, so this should probably be GroupAttribute (and ditto for below). > Source/WebGPU/WGSL/AST/Attribute.h:38 > + unsigned group() const { return m_value; } Nit: usually we have an empty line before private: (or protected:) > Source/WebGPU/WGSL/AST/Attribute.h:52 > + enum Stage : uint8_t { enum class? > Source/WebGPU/WGSL/AST/Expression.h:32 > +class Expression : public ASTNode {}; This file is going to get quite big. Are you sure you want to have all nodes defined in a single file? > Source/WebGPU/WGSL/AST/GlobalDecl.h:50 > + // Each of the following may be null Nit: I don't think this comment is particularly helpful, because we have a class UniqueRef for non-nullable unique ptrs, so the type says that it can be null. The comment on the next line, though, is helpful. > Source/WebGPU/WGSL/AST/GlobalDecl.h:63 > + StringView m_name; It's cool that you're using StringViews and not Strings. > Source/WebGPU/WGSL/AST/GlobalDecl.h:66 > +class StructDecl : public GlobalDecl { Nit: Blank line between classes, please. > Source/WebGPU/WGSL/AST/GlobalDecl.h:85 > +}; Ditto. > Source/WebGPU/WGSL/AST/GlobalDecl.h:92 > + Type* maybeReturnType() { return m_returnType.get(); } Because this returns a pointer, I don't see much value in calling it "maybe". If it wasn't "maybe," it would return a reference. > Source/WebGPU/WGSL/AST/Statement.h:34 > +class CompoundStatement : public Statement { Why not "Block"? > Source/WebGPU/WGSL/AST/Statement.h:41 > +class ReturnStatement : public Statement { You're probably going to want the SPECIALIZE_TYPE_TRAITS stuff and virtual functions to be able to downcast these. > Source/WebGPU/WGSL/AST/Statement.h:43 > + Expression* maybeExpression() { return m_expression.get(); } Ditto about "maybe" > Source/WebGPU/WGSL/AST/Statement.h:50 > + Expression* maybeLhs() { return m_lhs.get(); } ... and other places too > Source/WebGPU/WGSL/AST/Statement.h:51 > + Expression* rhs() { return m_rhs.get(); } This should probably return a & > Source/WebGPU/WGSL/AST/Statement.h:55 > + std::unique_ptr<Expression> m_rhs; This should probably be a UniqueRef > Source/WebGPU/WGSL/AST/Type.h:44 > + enum Kind { Woah. I guess that's cool, since we don't have real templates/generics (yet). > Source/WebGPU/WGSL/AST/Type.h:68 > +class BoolType : public Type {}; > +class Float32Type : public Type {}; > +class Int32Type : public Type {}; > +class Uint32Type : public Type {}; Should these really be C++ types? User-defined types will be represented in C++ as instances of a C++ class, not as new C++ classes. Why not just have an instance of Type in-memory that is "the Float32 type"? > Source/WebGPU/WGSL/AST/VariableQualifier.h:40 > +enum class AccessMode : uint8_t { Other places you have the enum class inside the class its used in. How do you decide whether it should be in the class or outside the class? > Source/WebGPU/WGSL/Lexer.cpp:33 > +Token Lexer<T>::lex() Aren't there lexer generators? Did you consider using one of them? > Source/WebGPU/WGSL/Lexer.cpp:60 > + case '(': WebKit style is to put the cases at the same indentation as the switch. > Source/WebGPU/WGSL/Lexer.cpp:107 > + std::optional<uint64_t> postPeriod = parseDecimalInteger(); How does this work with "a.b"? > Source/WebGPU/WGSL/Lexer.cpp:111 > + literalValue /= pow(10, m_offset - offset); How confident are you that we're getting the right ULPs here? > Source/WebGPU/WGSL/Lexer.cpp:123 > + break; I don't understand how this works. surely "a - b" shouldn't return an invalid token? How does unary minus work? > Source/WebGPU/WGSL/Lexer.cpp:155 > + fractionalPart /= pow(10, m_offset - offset); Ditto about ULPs > Source/WebGPU/WGSL/Lexer.cpp:222 > + // FIXME: a trie would be more efficient here, look at JavaScriptCore/KeywordLookupGenerator.py for an example of code autogeneration that produces such a trie. Yep. Do we actually need a custom python script, or can we use some existing package? Maybe something in lex/yacc/bison/something else? > Source/WebGPU/WGSL/Lexer.cpp:224 > + // FIXME: I don't think that true/false/f32/u32/i32/bool need to be their own tokens, they could just be regular identifiers. Is it legal to say "let f32 = 7;"? If not, they should probably be their own tokens. > Source/WebGPU/WGSL/Lexer.cpp:270 > + if (m_current == '\n') { we have a name "newlineCharacter" for this in wtf/CharacterNames.h > Source/WebGPU/WGSL/Lexer.cpp:309 > + if (value.hasOverflowed()) > + return std::nullopt; Surely this should fail in a different way than the !isDecimal() check above? > Source/WebGPU/WGSL/Lexer.cpp:450 > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isWhiteSpace(UChar ch) > +{ > + // WGSL does not currently support any non-ASCII whitespace characters > + return Lexer<LChar>::isWhiteSpace(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isIdentifierStart(UChar ch) > +{ > + // WGSL does not currently support any non-ASCII characters in identifiers > + return Lexer<LChar>::isIdentifierStart(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isValidIdentifierCharacter(UChar ch) > +{ > + // WGSL does not currently support any non-ASCII characters in identifiers > + return Lexer<LChar>::isValidIdentifierCharacter(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isDecimal(UChar ch) > +{ > + return Lexer<LChar>::isDecimal(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isHexadecimal(UChar ch) > +{ > + return Lexer<LChar>::isHexadecimal(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE uint64_t Lexer<UChar>::readDecimal(UChar ch) > +{ > + return Lexer<LChar>::readDecimal(static_cast<LChar>(ch)); > +} > + > +template <> > +ALWAYS_INLINE uint64_t Lexer<UChar>::readHexadecimal(UChar ch) > +{ > + return Lexer<LChar>::readHexadecimal(static_cast<LChar>(ch)); > +} This I think is more evidence that a StringView and a size_t will be sufficient. > Source/WebGPU/WGSL/Lexer.h:39 > + if constexpr (std::is_same<T, LChar>::value) { https://mobile.twitter.com/Litherum/status/1456047144519954436 I'm still not convinced that we actually need this template goop though. A StringView and a size_t should be sufficient. > Source/WebGPU/WGSL/Lexer.h:71 > + // Parse pattern (e|E)(\+|-)?[0-9]+f? if it is present, and return the exponent Perhaps these comments would be more useful around the definition, rather than the declaration. > Source/WebGPU/WGSL/Lexer.h:75 > + const T* m_code; Why did you choose to hold this in a pointer, rather than a StringView and a size_t? If you did it that way, this whole thing wouldn't need to be templated. > Source/WebGPU/WGSL/Parser.cpp:35 > + (void) lexer; UNUSED_VARIABLE(lexer); > Source/WebGPU/WGSL/Parser.cpp:37 > + return { { }, { } }; This is my favorite shader. 😜 > Source/WebGPU/WGSL/Parser.h:32 > +template<typename Lexer> Are you planning on having more than one kind of lexer? Why not just hardcode the one lexer that we've got? > Source/WebGPU/WGSL/SourceSpan.h:40 > + unsigned m_line; > + unsigned m_lineOffset; > + unsigned m_offset; > + unsigned m_length; Can we derive 2 of these 4 fields at reporting time, so we can store half the data in all the nodes? > Source/WebGPU/WGSL/SourceSpan.h:42 > + SourceSpan(unsigned line, unsigned lineOffset, unsigned offset, unsigned length) This is unnecessary. In the places where you could have written SourceSpan(1, 2, 3, 4) you can instead write SourceSpan { 1, 2, 3, 4 } (or sometimes even just { 1, 2, 3, 4 }). Using { } is preferred according to webkit style. > Source/WebGPU/WGSL/Token.h:51 > + // FIXME: add all the other keywords: file:///Users/rmorisset/MyGPUWeb/wgsl/wgsl.html#keyword-summary putting paths inside your local machine probably isn't the best thing > Source/WebGPU/WGSL/Token.h:76 > + union { Can we use Variant instead? > Source/WebGPU/WGSL/Token.h:102 > + ASSERT(type == TokenType::IntegerLiteral > + || type == TokenType::IntegerLiteralSigned > + || type == TokenType::IntegerLiteralUnsigned > + || type == TokenType::DecimalFloatLiteral > + || type == TokenType::HexFloatLiteral); I think there's a way to use the C++ type system to turn these into compile errors rather than runtime errors. Same for the other constructors. > Source/WebGPU/WGSL/Token.h:136 > + if (m_type == TokenType::Identifier) > + (&m_ident)->~StringView(); Using a variant would get rid of this > Source/WebGPU/WGSLUnitTests/WGSLLexerTests.mm:109 > + ++lineNumber; > + checkNextTokenIsIdentifier("a"); > + checkNextToken(WGSL::TokenType::Colon); > + checkNextTokenIsIdentifier("i32"); > + checkNextToken(WGSL::TokenType::Semicolon); Cool. Created attachment 451907 [details]
Patch
Thanks for the review. I've tried to fix the issues you found in the lexer, and separated the rest of the patch. Created attachment 451909 [details]
Patch
Make true/false/i32/u32/f32/bool be keywords to match the spec.
Also fix various nits found by the style linter.
Comment on attachment 451909 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=451909&action=review > Source/WebGPU/WGSLUnitTests/WGSLLexerTests.mm:31 > +@interface WGSLLexerTests : XCTestCase Seems weird to use XCTest here when we use GTest for the unit tests elsewhere in the code base. Given the cross platform history of WebKit, I don't think using a single platform testing tool likely makes long term sense. How does one get text into the WGSL parser from the web (or rather, how will one)? If it is always from a file resource, perhaps we should consider making things UTF-8 only? If it might come from a DOMString / JavaScript string, that would be much less compelling. Comment on attachment 451909 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=451909&action=review > Source/WebGPU/ChangeLog:17 > + - space efficiency: we don't use an extra word of memory for the variant's tag https://github.com/akrzemi1/compact_optional can solve this. > Source/WebGPU/Configurations/WGSLUnitTests.xcconfig:24 > +EXECUTABLE_PREFIX = lib; I think I deleted this on purpose. https://trac.webkit.org/changeset/288608/webkit > Source/WebGPU/WGSL/Lexer.cpp:32 > +Token Lexer<T>::lex() Why not generate this? > Source/WebGPU/WGSL/Lexer.cpp:44 > + return makeToken(TokenType::ParenLeft); Can you add a Token(TokenType) implicit constructor to avoid all these makeToken() calls? > Source/WebGPU/WGSL/Lexer.cpp:197 > + if (isIdentifierStart(m_current)) { It might be worth doing the same kind of range check for 1-9 instead of having a bunch of case statements. > Source/WebGPU/WGSL/Lexer.cpp:372 > + case ' ': > + case '\t': > + case '\n': > + case '\v': > + case '\f': > + case '\r': We have more-friendly names for these in wtf/characternames.h > Source/WebGPU/WGSL/Lexer.cpp:435 > +template <> > +ALWAYS_INLINE bool Lexer<UChar>::isIdentifierStart(UChar ch) > +{ > + // WGSL does not currently support any non-ASCII characters in identifiers > + return Lexer<LChar>::isIdentifierStart(static_cast<LChar>(ch)); > +} Why specialize if the implementation is the same? > Source/WebGPU/WGSL/Lexer.h:39 > + if constexpr (std::is_same<T, LChar>::value) { I'd suggest either using template specialization or using `if constexpr` so the `else` block at the bottom can be a static_assert instead of a release_assert. > Source/WebGPU/WGSL/SourceSpan.h:33 > + unsigned m_line; > + unsigned m_lineOffset; > + unsigned m_offset; Why not have at least one of these fields be derived? That way, we don't have to track as much in the common case (the compile succeeds) > Source/WebGPU/WGSL/Token.cpp:31 > +String toString(TokenType type) Again, wish this could be generated. > Source/WebGPU/WGSL/Token.h:37 > + // - space efficiency: we don't use an extra word of memory for the variant's tag See changelog. > Source/WebGPU/WGSL/Token.h:38 > + // - ease of use: everywhere that we check for a given TokenType, we would have to first check that the Token is not nullopt, and then check the TokenType. I think this is slightly misstated - it's that users will already have to check the token type, so if we fold the optional into the token type, users get 2 checks for the price of one. Easier to use, and slightly more performant, and still just as type-safe as the alternative. >> Source/WebGPU/WGSLUnitTests/WGSLLexerTests.mm:31 >> +@interface WGSLLexerTests : XCTestCase > > Seems weird to use XCTest here when we use GTest for the unit tests elsewhere in the code base. Given the cross platform history of WebKit, I don't think using a single platform testing tool likely makes long term sense. The integration point of WebGPU is above this. Non-apple ports want to use Dawn or wgpu instead of WebGPU.framework, so they wouldn't ever run these tests. WebGPU.framework, and this compiler inside it, is already Metal-specific. (In reply to Sam Weinig from comment #10) > How does one get text into the WGSL parser from the web (or rather, how will > one)? If it is always from a file resource, perhaps we should consider > making things UTF-8 only? If it might come from a DOMString / JavaScript > string, that would be much less compelling. It's specified as a USVString passed to a JS method. https://gpuweb.github.io/gpuweb/#dom-gpudevice-createshadermodule > > Source/WebGPU/ChangeLog:17 > > + - space efficiency: we don't use an extra word of memory for the variant's tag > > https://github.com/akrzemi1/compact_optional can solve this. Thanks, I'll mention it in the comment/Changelog as an alternative; I did not know of this library. > > Source/WebGPU/Configurations/WGSLUnitTests.xcconfig:24 > > +EXECUTABLE_PREFIX = lib; > > I think I deleted this on purpose. > https://trac.webkit.org/changeset/288608/webkit Thanks, fixed. > > Source/WebGPU/WGSL/Lexer.cpp:32 > > +Token Lexer<T>::lex() > > Why not generate this? I followed JSC's code. Using lex/yacc would probably have been easier, but would have made the build system even more of a nightmare, and would also have made it much harder to customize the error messages, or generally adding special customizations to the code later on. > > Source/WebGPU/WGSL/Lexer.cpp:44 > > + return makeToken(TokenType::ParenLeft); > > Can you add a Token(TokenType) implicit constructor to avoid all these > makeToken() calls? No: makeToken() also looks at variables inside the Lexer to produce accurate localization. > > Source/WebGPU/WGSL/Lexer.cpp:197 > > + if (isIdentifierStart(m_current)) { > > It might be worth doing the same kind of range check for 1-9 instead of > having a bunch of case statements. I am not sure I understand what you mean, or why it would be better. > > Source/WebGPU/WGSL/Lexer.cpp:372 > > + case ' ': > > + case '\t': > > + case '\n': > > + case '\v': > > + case '\f': > > + case '\r': > > We have more-friendly names for these in wtf/characternames.h I've only found 4 of them, I could not find any equivalent name for `\f` nor `\r`. Do you want me to replace the top 4 anyway? > > Source/WebGPU/WGSL/Lexer.cpp:435 > > +template <> > > +ALWAYS_INLINE bool Lexer<UChar>::isIdentifierStart(UChar ch) > > +{ > > + // WGSL does not currently support any non-ASCII characters in identifiers > > + return Lexer<LChar>::isIdentifierStart(static_cast<LChar>(ch)); > > +} > > Why specialize if the implementation is the same? No idea what I was thinking, fixed, thanks. > > > Source/WebGPU/WGSL/Lexer.h:39 > > + if constexpr (std::is_same<T, LChar>::value) { > > I'd suggest either using template specialization or using `if constexpr` so > the `else` block at the bottom can be a static_assert instead of a > release_assert. Done, thanks. > > Source/WebGPU/WGSL/SourceSpan.h:33 > > + unsigned m_line; > > + unsigned m_lineOffset; > > + unsigned m_offset; > > Why not have at least one of these fields be derived? That way, we don't > have to track as much in the common case (the compile succeeds) This sounds not-trivial to implement, putting it in a FIXME. > > > Source/WebGPU/WGSL/Token.h:38 > > + // - ease of use: everywhere that we check for a given TokenType, we would have to first check that the Token is not nullopt, and then check the TokenType. > > I think this is slightly misstated - it's that users will already have to > check the token type, so if we fold the optional into the token type, users > get 2 checks for the price of one. Easier to use, and slightly more > performant, and still just as type-safe as the alternative. Created attachment 451926 [details]
Patch
(In reply to Robin Morisset from comment #13) > > > Source/WebGPU/WGSL/Lexer.cpp:32 > > > +Token Lexer<T>::lex() > > > > Why not generate this? > > I followed JSC's code. Using lex/yacc would probably have been easier, but > would have made the build system even more of a nightmare, and would also > have made it much harder to customize the error messages, or generally > adding special customizations to the code later on. It might be worth mentioning this in the ChangeLog. > > > Source/WebGPU/WGSL/Lexer.cpp:197 > > > + if (isIdentifierStart(m_current)) { > > > > It might be worth doing the same kind of range check for 1-9 instead of > > having a bunch of case statements. > > I am not sure I understand what you mean, or why it would be better. You don't have case 'a': case 'b': case 'c': ... but you do have case '1': case '2': case '3': I'm just suggesting you might want to use the same approach for both of these. > > > > Source/WebGPU/WGSL/Lexer.cpp:372 > > > + case ' ': > > > + case '\t': > > > + case '\n': > > > + case '\v': > > > + case '\f': > > > + case '\r': > > > > We have more-friendly names for these in wtf/characternames.h > > I've only found 4 of them, I could not find any equivalent name for `\f` nor > `\r`. > Do you want me to replace the top 4 anyway? I think you should add the human-readable names for the missing characters into characternames.h Created attachment 451932 [details]
Patch
Committed r289799 (247264@main): <https://commits.webkit.org/247264@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 451932 [details]. Comment on attachment 451932 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=451932&action=review > Source/WebGPU/WGSL/Lexer.cpp:369 > + switch (ch) { > + case WTF::Unicode::space: > + case WTF::Unicode::tabCharacter: > + case WTF::Unicode::carriageReturn: > + case WTF::Unicode::newlineCharacter: > + case WTF::Unicode::verticalTabulation: > + case WTF::Unicode::formFeed: > + return true; > + default: > + return false; > + } We have carefully-written and optimized versions of some of these checks in ASCIICType.h, and they should be used in places like this. If there is a reason we must not use those, I would like to know it. Here it would be * isASCIISpace*, and I believe its implementation is more efficient than the idiom here so using it could slightly help performance. > Source/WebGPU/WGSL/Lexer.cpp:375 > + return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'); isASCIIAlpha > Source/WebGPU/WGSL/Lexer.cpp:381 > + return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '_'; isASCIIAlphanumeric > Source/WebGPU/WGSL/Lexer.cpp:387 > + return (ch >= '0' && ch <= '9'); isASCIIDigit > Source/WebGPU/WGSL/Lexer.cpp:393 > + return (ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F'); isASCIIHexDigit > Source/WebGPU/WGSL/Lexer.cpp:412 > + ASSERT(isHexadecimal(ch)); > + if (ch >= '0' && ch <= '9') > + return ch - '0'; > + if (ch >= 'a' && ch <= 'f') > + return ch - 'a'; > + ASSERT(ch >= 'A' && ch <= 'F'); > + return ch - 'A'; toASCIIHexValue Thank you, I did not know about these functions in WTF. I made a trivial patch to use them: https://bugs.webkit.org/show_bug.cgi?id=236783. |