RESOLVED FIXED 142031
compile DFA to bytecode
https://bugs.webkit.org/show_bug.cgi?id=142031
Summary compile DFA to bytecode
Alex Christensen
Reported 2015-02-25 17:39:31 PST
Like https://bugs.webkit.org/show_bug.cgi?id=141017 but to byte code instead of machine code.
Attachments
Patch (33.17 KB, patch)
2015-02-25 17:49 PST, Alex Christensen
no flags
Patch (33.98 KB, patch)
2015-02-25 18:08 PST, Alex Christensen
no flags
Patch (34.97 KB, patch)
2015-02-26 12:18 PST, Alex Christensen
no flags
Patch (34.86 KB, patch)
2015-02-26 15:46 PST, Alex Christensen
no flags
Patch (34.92 KB, patch)
2015-02-26 16:15 PST, Alex Christensen
no flags
Patch (32.62 KB, patch)
2015-02-26 17:08 PST, Alex Christensen
no flags
Patch (31.86 KB, patch)
2015-02-26 17:30 PST, Alex Christensen
no flags
Patch (31.93 KB, patch)
2015-02-26 21:34 PST, Alex Christensen
benjamin: review+
Alex Christensen
Comment 1 2015-02-25 17:49:37 PST
Alex Christensen
Comment 2 2015-02-25 18:08:26 PST
Benjamin Poulain
Comment 3 2015-02-26 00:06:57 PST
Comment on attachment 247380 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=247380&action=review Partial review :( I did not get a chance to read everything :( > Source/WebCore/ChangeLog:31 > + (WebCore::ContentExtensions::DFABytecodeInterpreter::interpret): I can't wait to see a DFABytecodeDebugger ;) > Source/WebCore/contentextensions/ContentExtensionRule.h:38 > -enum class ExtensionActionType { > +// FIXME: make this smaller. > +enum class ExtensionActionType : uint64_t { Small mixup here. The output should be the index of the matched rule, not it's action type. Some actions will have additional meta data. For example, when adding the CSS support, each Action will have a selector list. > Source/WebCore/contentextensions/DFABytecode.h:35 > +enum class DFABytecode : uint8_t { IMHO, this is abusing enum class. I would typedef uint8_t to DFABytecode, since the bytecode contains Instruction+data in a dense form. Then I would have a enum class DFABytecodeInstruction with all the instructions types. > Source/WebCore/contentextensions/DFABytecode.h:38 > + // CheckValue is followed by the value to check (1 byte), > + // then the index to jump to if the values are equal (4 bytes). Oh thank Jibbers you started with documentation for the bytecode. The JSC bytecode is completely undocumented, quite a pain. If you want some idea on how to document this, you should have a look how LLVM IR is documented (e.g. http://llvm.org/docs/LangRef.html#add-instruction) > Source/WebCore/contentextensions/DFABytecode.h:41 > + // AppendAction is followed by the action to append (8 bytes). The "action" we will eventually add is the offset in the action file, we can limit ourself to 4 bytes IMHO. > Source/WebCore/contentextensions/DFABytecode.h:44 > + // Terminate is not followed by anything. This comments is a bit vague. Maybe "Terminate ends the execution even if there is input left." > Source/WebCore/contentextensions/DFABytecode.h:50 > + // FIXME: There should be a CheckBit bytecode to do binary searches if there are lots of edges from this node. No need for the FIXME, we'll likely do tons of extensions. I already have ideas :) Just an idea of a new bytecode to make the code more compact and faster: If all the characters exists in lower and upper case version, there could be a ChangeInputToASCIILower bytecode. Then CheckValue only need to check for lowercase characters, we can dismiss the uppercase characters. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:42 > + // There's got to be a better way to do this, but this works. No need for the comment. The fast way to do this is to use memcpy() and let clang optimize it out. On x86, we can do unaligned memory accesses by design. On Apple ARM, we can too but I don't know if Clang knows about that. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:62 > + // There's got to be a better way to do this, but this works. You can remove this comment too. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:78 > + append<DFABytecode>(bytecode, DFABytecode::AppendAction); > + append<ExtensionActionType>(bytecode, static_cast<ExtensionActionType>(action)); I would use the "emitInstruction" pattern to make the code more modular. For example, this would be emitAppendAction(actionId); For example, take BytecodeGenerator::emitLoad() of JSC. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:85 > + append<uint32_t>(bytecode, -1); // This value will be set when linking. It's probably better to put zero here. Any linker mistake would be stuck in an infinite loop on the invalid jump. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:96 > +void DFABytecodeCompiler::linkNode(Vector<DFABytecode>& bytecode, unsigned index) IMHO, a static linker will quickly get in the way of optimizations. What I would do it: 1) Every time the compiler emits a jump, it stores the pair (offset where to write the jump, target node) in a list. 2) Then for linking you go over the pair, and do bytecode[offset where to write the jump] nodeStarts[target node index] > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:120 > + WTFLogAlways("COMPILING\n"); // This happens too often (OOPS!) Whaaaaaaat > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:133 > + // link Link. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:137 > + WTFLogAlways("BUFFER SIZE: %d\n", (int)bytecode.size()); Nope. > Source/WebCore/contentextensions/DFABytecodeCompiler.h:44 > + { } The coding style is to use { } in this case: http://www.webkit.org/coding/coding-style.html#punctuation-member-init > Source/WebCore/contentextensions/DFABytecodeCompiler.h:45 > + Vector<DFABytecode> compile(); We should find a way to give up the ownership, we are dealing with really big objects. > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:43 > + uint32_t bytecodeIndex = 0; I would name that program counter if it always point to an instruction and never to an argument. > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:61 > + bytecodeIndex += sizeof(DFABytecode) + sizeof(uint8_t) + sizeof(uint32_t); There must be a better way. Usually the Instruction are encoded somewhere with their arguments so the size is known.
Alex Christensen
Comment 4 2015-02-26 12:18:29 PST
Alex Christensen
Comment 5 2015-02-26 15:00:23 PST
Comment on attachment 247438 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=247438&action=review > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:57 > + if (url.data()[urlIndex] == getBits<uint8_t>(bytecode, programCounter + sizeof(DFABytecode))) { To correctly match regexes that match right at the end of a string, we need to check for a null character before this. > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:60 > + if (!url.data()[urlIndex]) Not here. > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:69 > + if (!url.data()[urlIndex]) And not here.
Alex Christensen
Comment 6 2015-02-26 15:46:39 PST
Alex Christensen
Comment 7 2015-02-26 16:15:56 PST
Benjamin Poulain
Comment 8 2015-02-26 17:00:19 PST
Comment on attachment 247465 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=247465&action=review > Source/WebCore/contentextensions/DFABytecode.h:56 > +static inline size_t instructionSizeWithArguments(DFABytecodeInstruction instruction) nice > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:48 > + for (;;) { while (true)? > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:56 > + ASSERT(url.data()); This could be out of the loop. You should put url.data() in a temporary to be safe. If you add any function call in the future, clang would have have to test for m_buffer every time if you have a temporary. > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:79 > + ASSERT(urlIndex < url.length()); // We should always terminate at a null character at the end of a String. good > Source/WebCore/contentextensions/DFABytecodeInterpreter.h:31 > +#include "ContentExtensionRule.h" ? > Source/WebCore/contentextensions/DFABytecodeInterpreter.h:34 > +#include <wtf/text/CString.h> You just need to declare CString. > Source/WebCore/contentextensions/DFABytecodeInterpreter.h:44 > + { }; Semi colon. WebKit Style. > Source/WebCore/contentextensions/DFANode.h:31 > +#include "ContentExtensionRule.h" ? That does not seem right, the state machine should know nothing about ContentExtensions. > Source/WebCore/contentextensions/NFAToDFA.cpp:31 > +#include "ContentExtensionRule.h" Ditto.
Alex Christensen
Comment 9 2015-02-26 17:08:40 PST
Mark Lam
Comment 10 2015-02-26 17:21:41 PST
Comment on attachment 247471 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=247471&action=review > Source/WebCore/contentextensions/DFABytecodeInterpreter.h:47 > + HashSet<uint64_t, DefaultHash<uint64_t>::Hash, WTF::UnsignedWithZeroKeyHashTraits<uint64_t>> interpret(const CString&); Suggestion: typedef HashSet<uint64_t, DefaultHash<uint64_t>::Hash, WTF::UnsignedWithZeroKeyHashTraits<uint64_t>> Actions; ... and use Actions (or DFABytecodeInterpreter::Actions) everywhere you would use this. That way, you get to express its purpose rather than its implementation.
Alex Christensen
Comment 11 2015-02-26 17:30:39 PST
Alex Christensen
Comment 12 2015-02-26 21:09:36 PST
Comment on attachment 247473 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=247473&action=review > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:86 > + linkRecords.reserveCapacity(linkRecords.size() + node.transitions.size() * instructionSizeWithArguments(DFABytecodeInstruction::CheckValue)); This should be changed. Right now there is one link record per transition and one CheckValue with arguments per transition. linkRecords.reserveCapacity(linkRecords.size() + node.transitions.size()); bytecode.reserveCapacity(bytecode.size() + node.transitions.size() * instructionSizeWithArguments(DFABytecodeInstruction::CheckValue));
Alex Christensen
Comment 13 2015-02-26 21:34:20 PST
Benjamin Poulain
Comment 14 2015-02-26 23:57:24 PST
Comment on attachment 247496 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=247496&action=review This looks completely correct to me, I could not find any problem. Great work on the bytecode compiler and the interpreter. Clean design and well documented. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:29 > +#include "ContentExtensionRule.h" This is fishy. Why is it not with the other #include after #if ENABLE(CONTENT_EXTENSIONS)? > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:77 > +void DFABytecodeCompiler::compileNode(Vector<DFABytecode>& bytecode, unsigned index) You can drop the argument now, you have a reference to the bytecode as an attribute. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:86 > + bytecode.reserveCapacity(bytecode.size() + node.transitions.size() * instructionSizeWithArguments(DFABytecodeInstruction::CheckValue)); You could have a reserveOutputBufferCapacity(numberOfActions, numberOfTransitions, hasFallback) to help with each node. > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:87 > + linkRecords.reserveCapacity(linkRecords.size() + node.transitions.size()); We should really have a nicer way to do that with a Vector. This is a common pattern. :( > Source/WebCore/contentextensions/DFABytecodeCompiler.h:31 > +#include "ContentExtensionRule.h" This include is fishy. > Source/WebCore/contentextensions/DFABytecodeCompiler.h:43 > + DFABytecodeCompiler(const DFA& dfa, Vector<DFABytecode>& bytecode) I wonder if we'll en up taking a file descriptor as argument, and write the bytecode directly in the file. > Source/WebCore/contentextensions/DFABytecodeCompiler.h:59 > + Vector<DFABytecode>& bytecode; m_bytecode. > Source/WebCore/contentextensions/DFABytecodeCompiler.h:60 > + const DFA& dfa; m_dfa > Source/WebCore/contentextensions/DFABytecodeCompiler.h:62 > + Vector<unsigned> nodeStarts; m_nodeStarts Maybe m_nodeStartOffsets > Source/WebCore/contentextensions/DFABytecodeCompiler.h:65 > + // The first value is the index in the bytecode buffer where the jump is to be written. > + // The second value is the index of the node to jump to. This is why I hate std::pair so much: we cannot name the fields :) > Source/WebCore/contentextensions/DFABytecodeCompiler.h:66 > + Vector<std::pair<unsigned, unsigned>> linkRecords; m_linkRecords > Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:46 > + ASSERT(url.data()); ASSERT(url); I have a feeling you don't work with a debug build ;)
Alex Christensen
Comment 15 2015-02-27 11:21:48 PST
Addressed comments, tested debug builds, and committed to http://trac.webkit.org/changeset/180769 (In reply to comment #14) > > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:86 > > + bytecode.reserveCapacity(bytecode.size() + node.transitions.size() * instructionSizeWithArguments(DFABytecodeInstruction::CheckValue)); > > You could have a reserveOutputBufferCapacity(numberOfActions, > numberOfTransitions, hasFallback) to help with each node. reserveCapacity was making it compile VERY slowly (10 seconds in release). I removed all reserveCapacity calls. We can add them later. > > Source/WebCore/contentextensions/DFABytecodeCompiler.h:43 > > + DFABytecodeCompiler(const DFA& dfa, Vector<DFABytecode>& bytecode) > > I wonder if we'll en up taking a file descriptor as argument, and write the > bytecode directly in the file. That would be cool, but I'm not sure how linking would work. I think we should compile and link in a Vector, then write the whole Vector to a file to avoid lots of unnecessary rewinding and seeking in the file.
Note You need to log in before you can comment on or make changes to this bug.