Bug 142031 - compile DFA to bytecode
Summary: compile DFA to bytecode
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-25 17:39 PST by Alex Christensen
Modified: 2015-02-27 11:21 PST (History)
3 users (show)

See Also:


Attachments
Patch (33.17 KB, patch)
2015-02-25 17:49 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (33.98 KB, patch)
2015-02-25 18:08 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (34.97 KB, patch)
2015-02-26 12:18 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (34.86 KB, patch)
2015-02-26 15:46 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (34.92 KB, patch)
2015-02-26 16:15 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (32.62 KB, patch)
2015-02-26 17:08 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (31.86 KB, patch)
2015-02-26 17:30 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (31.93 KB, patch)
2015-02-26 21:34 PST, Alex Christensen
benjamin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2015-02-25 17:39:31 PST
Like https://bugs.webkit.org/show_bug.cgi?id=141017 but to byte code instead of machine code.
Comment 1 Alex Christensen 2015-02-25 17:49:37 PST
Created attachment 247374 [details]
Patch
Comment 2 Alex Christensen 2015-02-25 18:08:26 PST
Created attachment 247380 [details]
Patch
Comment 3 Benjamin Poulain 2015-02-26 00:06:57 PST
Comment on attachment 247380 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=247380&action=review

Partial review :(
I did not get a chance to read everything :(

> Source/WebCore/ChangeLog:31
> +        (WebCore::ContentExtensions::DFABytecodeInterpreter::interpret):

I can't wait to see a DFABytecodeDebugger ;)

> Source/WebCore/contentextensions/ContentExtensionRule.h:38
> -enum class ExtensionActionType {
> +// FIXME: make this smaller.
> +enum class ExtensionActionType : uint64_t {

Small mixup here.

The output should be the index of the matched rule, not it's action type.

Some actions will have additional meta data. For example, when adding the CSS support, each Action will have a selector list.

> Source/WebCore/contentextensions/DFABytecode.h:35
> +enum class DFABytecode : uint8_t {

IMHO, this is abusing enum class.

I would typedef uint8_t to DFABytecode, since the bytecode contains Instruction+data in a dense form.
Then I would have a enum class DFABytecodeInstruction with all the instructions types.

> Source/WebCore/contentextensions/DFABytecode.h:38
> +    // CheckValue is followed by the value to check (1 byte),
> +    // then the index to jump to if the values are equal (4 bytes).

Oh thank Jibbers you started with documentation for the bytecode.

The JSC bytecode is completely undocumented, quite a pain.

If you want some idea on how to document this, you should have a look how LLVM IR is documented (e.g. http://llvm.org/docs/LangRef.html#add-instruction)

> Source/WebCore/contentextensions/DFABytecode.h:41
> +    // AppendAction is followed by the action to append (8 bytes).

The "action" we will eventually add is the offset in the action file, we can limit ourself to 4 bytes IMHO.

> Source/WebCore/contentextensions/DFABytecode.h:44
> +    // Terminate is not followed by anything.

This comments is a bit vague.

Maybe "Terminate ends the execution even if there is input left."

> Source/WebCore/contentextensions/DFABytecode.h:50
> +    // FIXME: There should be a CheckBit bytecode to do binary searches if there are lots of edges from this node.

No need for the FIXME, we'll likely do tons of extensions. I already have ideas :)

Just an idea of a new bytecode to make the code more compact and faster:
If all the characters exists in lower and upper case version, there could be a ChangeInputToASCIILower bytecode. Then CheckValue only need to check for lowercase characters, we can dismiss the uppercase characters.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:42
> +    // There's got to be a better way to do this, but this works.

No need for the comment.

The fast way to do this is to use memcpy() and let clang optimize it out. On x86, we can do unaligned memory accesses by design. On Apple ARM, we can too but I don't know if Clang knows about that.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:62
> +    // There's got to be a better way to do this, but this works.

You can remove this comment too.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:78
> +        append<DFABytecode>(bytecode, DFABytecode::AppendAction);
> +        append<ExtensionActionType>(bytecode, static_cast<ExtensionActionType>(action));

I would use the "emitInstruction" pattern to make the code more modular.

For example, this would be
   emitAppendAction(actionId);

For example, take BytecodeGenerator::emitLoad() of JSC.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:85
> +        append<uint32_t>(bytecode, -1); // This value will be set when linking.

It's probably better to put zero here. Any linker mistake would be stuck in an infinite loop on the invalid jump.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:96
> +void DFABytecodeCompiler::linkNode(Vector<DFABytecode>& bytecode, unsigned index)

IMHO, a static linker will quickly get in the way of optimizations.

What I would do it:
1) Every time the compiler emits a jump, it stores the pair (offset where to write the jump, target node) in a list.
2) Then for linking you go over the pair, and do
    bytecode[offset where to write the jump] nodeStarts[target node index]

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:120
> +    WTFLogAlways("COMPILING\n"); // This happens too often (OOPS!)

Whaaaaaaat

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:133
> +    // link

Link.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:137
> +    WTFLogAlways("BUFFER SIZE: %d\n", (int)bytecode.size());

Nope.

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:44
> +    { }

The coding style is to use 
{
}
in this case: http://www.webkit.org/coding/coding-style.html#punctuation-member-init

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:45
> +    Vector<DFABytecode> compile();

We should find a way to give up the ownership, we are dealing with really big objects.

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:43
> +    uint32_t bytecodeIndex = 0;

I would name that program counter if it always point to an instruction and never to an argument.

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:61
> +                bytecodeIndex += sizeof(DFABytecode) + sizeof(uint8_t) + sizeof(uint32_t);

There must be a better way.

Usually the Instruction are encoded somewhere with their arguments so the size is known.
Comment 4 Alex Christensen 2015-02-26 12:18:29 PST
Created attachment 247438 [details]
Patch
Comment 5 Alex Christensen 2015-02-26 15:00:23 PST
Comment on attachment 247438 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=247438&action=review

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:57
> +            if (url.data()[urlIndex] == getBits<uint8_t>(bytecode, programCounter + sizeof(DFABytecode))) {

To correctly match regexes that match right at the end of a string, we need to check for a null character before this.

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:60
> +                if (!url.data()[urlIndex])

Not here.

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:69
> +            if (!url.data()[urlIndex])

And not here.
Comment 6 Alex Christensen 2015-02-26 15:46:39 PST
Created attachment 247455 [details]
Patch
Comment 7 Alex Christensen 2015-02-26 16:15:56 PST
Created attachment 247465 [details]
Patch
Comment 8 Benjamin Poulain 2015-02-26 17:00:19 PST
Comment on attachment 247465 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=247465&action=review

> Source/WebCore/contentextensions/DFABytecode.h:56
> +static inline size_t instructionSizeWithArguments(DFABytecodeInstruction instruction)

nice

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:48
> +    for (;;) {

while (true)?

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:56
> +            ASSERT(url.data());

This could be out of the loop.

You should put url.data() in a temporary to be safe. If you add any function call in the future, clang would have have to test for m_buffer every time if you have a temporary.

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:79
> +        ASSERT(urlIndex < url.length()); // We should always terminate at a null character at the end of a String.

good

> Source/WebCore/contentextensions/DFABytecodeInterpreter.h:31
> +#include "ContentExtensionRule.h"

?

> Source/WebCore/contentextensions/DFABytecodeInterpreter.h:34
> +#include <wtf/text/CString.h>

You just need to declare CString.

> Source/WebCore/contentextensions/DFABytecodeInterpreter.h:44
> +    { };

Semi colon.
WebKit Style.

> Source/WebCore/contentextensions/DFANode.h:31
> +#include "ContentExtensionRule.h"

? That does not seem right, the state machine should know nothing about ContentExtensions.

> Source/WebCore/contentextensions/NFAToDFA.cpp:31
> +#include "ContentExtensionRule.h"

Ditto.
Comment 9 Alex Christensen 2015-02-26 17:08:40 PST
Created attachment 247471 [details]
Patch
Comment 10 Mark Lam 2015-02-26 17:21:41 PST
Comment on attachment 247471 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=247471&action=review

> Source/WebCore/contentextensions/DFABytecodeInterpreter.h:47
> +    HashSet<uint64_t, DefaultHash<uint64_t>::Hash, WTF::UnsignedWithZeroKeyHashTraits<uint64_t>> interpret(const CString&);

Suggestion:
typedef HashSet<uint64_t, DefaultHash<uint64_t>::Hash, WTF::UnsignedWithZeroKeyHashTraits<uint64_t>> Actions;

... and use Actions (or DFABytecodeInterpreter::Actions) everywhere you would use this.  That way, you get to express its purpose rather than its implementation.
Comment 11 Alex Christensen 2015-02-26 17:30:39 PST
Created attachment 247473 [details]
Patch
Comment 12 Alex Christensen 2015-02-26 21:09:36 PST
Comment on attachment 247473 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=247473&action=review

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:86
> +    linkRecords.reserveCapacity(linkRecords.size() + node.transitions.size() * instructionSizeWithArguments(DFABytecodeInstruction::CheckValue));

This should be changed.  Right now there is one link record per transition and one CheckValue with arguments per transition.
linkRecords.reserveCapacity(linkRecords.size() + node.transitions.size());
bytecode.reserveCapacity(bytecode.size() + node.transitions.size() * instructionSizeWithArguments(DFABytecodeInstruction::CheckValue));
Comment 13 Alex Christensen 2015-02-26 21:34:20 PST
Created attachment 247496 [details]
Patch
Comment 14 Benjamin Poulain 2015-02-26 23:57:24 PST
Comment on attachment 247496 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=247496&action=review

This looks completely correct to me, I could not find any problem.

Great work on the bytecode compiler and the interpreter. Clean design and well documented.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:29
> +#include "ContentExtensionRule.h"

This is fishy.

Why is it not with the other #include after #if ENABLE(CONTENT_EXTENSIONS)?

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:77
> +void DFABytecodeCompiler::compileNode(Vector<DFABytecode>& bytecode, unsigned index)

You can drop the argument now, you have a reference to the bytecode as an attribute.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:86
> +    bytecode.reserveCapacity(bytecode.size() + node.transitions.size() * instructionSizeWithArguments(DFABytecodeInstruction::CheckValue));

You could have a reserveOutputBufferCapacity(numberOfActions, numberOfTransitions, hasFallback) to help with each node.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:87
> +    linkRecords.reserveCapacity(linkRecords.size() + node.transitions.size());

We should really have a nicer way to do that with a Vector. This is a common pattern. :(

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:31
> +#include "ContentExtensionRule.h"

This include is fishy.

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:43
> +    DFABytecodeCompiler(const DFA& dfa, Vector<DFABytecode>& bytecode)

I wonder if we'll en up taking a file descriptor as argument, and write the bytecode directly in the file.

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:59
> +    Vector<DFABytecode>& bytecode;

m_bytecode.

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:60
> +    const DFA& dfa;

m_dfa

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:62
> +    Vector<unsigned> nodeStarts;

m_nodeStarts

Maybe m_nodeStartOffsets

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:65
> +    // The first value is the index in the bytecode buffer where the jump is to be written.
> +    // The second value is the index of the node to jump to.

This is why I hate std::pair so much: we cannot name the fields :)

> Source/WebCore/contentextensions/DFABytecodeCompiler.h:66
> +    Vector<std::pair<unsigned, unsigned>> linkRecords;

m_linkRecords

> Source/WebCore/contentextensions/DFABytecodeInterpreter.cpp:46
> +    ASSERT(url.data());

ASSERT(url);

I have a feeling you don't work with a debug build ;)
Comment 15 Alex Christensen 2015-02-27 11:21:48 PST
Addressed comments, tested debug builds, and committed to http://trac.webkit.org/changeset/180769

(In reply to comment #14)
> > Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:86
> > +    bytecode.reserveCapacity(bytecode.size() + node.transitions.size() * instructionSizeWithArguments(DFABytecodeInstruction::CheckValue));
> 
> You could have a reserveOutputBufferCapacity(numberOfActions,
> numberOfTransitions, hasFallback) to help with each node.
reserveCapacity was making it compile VERY slowly (10 seconds in release).  I removed all reserveCapacity calls.  We can add them later.
> > Source/WebCore/contentextensions/DFABytecodeCompiler.h:43
> > +    DFABytecodeCompiler(const DFA& dfa, Vector<DFABytecode>& bytecode)
> 
> I wonder if we'll en up taking a file descriptor as argument, and write the
> bytecode directly in the file.
That would be cool, but I'm not sure how linking would work.  I think we should compile and link in a Vector, then write the whole Vector to a file to avoid lots of unnecessary rewinding and seeking in the file.