Bug 25165 - First step towards replacing PCRE and completing regular expression JIT
Summary: First step towards replacing PCRE and completing regular expression JIT
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: Mac OS X 10.5
: P2 Normal
Assignee: Gavin Barraclough
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-04-13 15:40 PDT by Gavin Barraclough
Modified: 2009-04-14 00:07 PDT (History)
1 user (show)

See Also:


Attachments
Yarr! (192.76 KB, patch)
2009-04-13 15:41 PDT, Gavin Barraclough
ggaren: review+
Details | Formatted Diff | Diff
(with review comment fixes, just missing a changelog...) (191.33 KB, patch)
2009-04-13 22:22 PDT, Gavin Barraclough
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gavin Barraclough 2009-04-13 15:40:23 PDT
Land a new regex interpreter, YARR (yet another regex runtime).  This interpreter is a more correct implementation of the ECMA regex language (and closer to Firefox behaviour) than our port of PCRE, and is designed to support a JIT.  Currently the interpreter has not be optimized, and as such will likely perform less well than PCRE, as such we will land it disabled.

Also land the first cut of NARC (not another regex compiler) – YARR's JIT.  This presently supports around the same subset of regular expressions as WREC, however NARC is not designed to fall back to a JIT and as such presently enabling NARC will result in breaking functionality.
Comment 1 Gavin Barraclough 2009-04-13 15:41:57 PDT
Created attachment 29442 [details]
Yarr!
Comment 2 Geoffrey Garen 2009-04-13 17:30:23 PDT
Comment on attachment 29442 [details]
Yarr!

I'm ignoring commented-out code, because Gavin told me that he had a local change to remove it.

Let's rename the NARC #define to YARR_JIT.

+#if !ENABLE(YARR)
         ~RegExp();
+#endif

You can reduce the #ifdef'ing if you leave in ~RegExp in all builds, and just leave out the implementation in YARR builds.

+#if ENABLE(NARC)
+    jitCompileRegex(globalData, m_regExpJITCode, m_pattern, m_numSubpatterns, m_constructionError, ignoreCase(), multiline());
+#else
+    UNUSED_PARAM(globalData);
+    m_regExpBytecode.set(Yarr::byteCompileRegex(m_pattern, m_numSubpatterns, m_constructionError, ignoreCase(), multiline()));
+#endif

Kind of strange that the JIT version of the code does not use the Yarr namespace explicitly, but the bytecode version of the code does.

+CharacterClass* newlineOnce()
+{
+    static CharacterClass* once = newlineCreate();
+    return once;
+}

DANGER WILL ROBINSON. DANGER. Static initialization like this is not thread-safe. If we want to dynamically allocate these character classes, we need to store them in the parser, or the JSGlobalData.

I think function names like this would be easier to read as "newlineCharacterClass()", etc., instead of "newlineOnce()", etc.

+    // If the pattern contains illegal backreferences reset & reparse.
+    if (pattern.containsIllegalBackReference()) {

Let's be more explicit and mention that we're doing this to match a quirk in Firefox's behavior.

+    DisjunctionContext* allocDisjunctionContext(ByteDisjunction* disjunction)
+    {
+        return new(malloc(sizeof(DisjunctionContext) + (disjunction->m_frameSize - 1) * sizeof(uintptr_t))) DisjunctionContext();
+    }
+
+    void freeDisjunctionContext(DisjunctionContext* context)
+    {
+        free(context);
+    }

I think this code, and its client, would be cleaner if you made DisjunctionContext refcounted, and used RefPtr to manage its lifetime automatically.

+    BytecodePattern* compile()
+    {
+        regexBegin(m_pattern.m_body->m_callFrameSize);
+        emitDisjunction(m_pattern.m_body);
+        regexEnd();
+
+        BytecodePattern* bytecode = new BytecodePattern(bodyDisjunction, m_pattern.m_ignoreCase, m_pattern.m_multiline, m_pattern.m_numSubpatterns, m_allParenthesesInfo, m_pattern.m_userCharacterClasses);
+        bodyDisjunction = 0;
+        m_allParenthesesInfo.shrink(0);
+        m_pattern.m_userCharacterClasses.clear();
+        return bytecode;
+    }

RefPtr here would be good, too.

+    ~BytecodePattern()
+    {
+        delete m_body;
+        for (unsigned i = 0; i < m_allParenthesesInfo.size(); ++i)
+            delete m_allParenthesesInfo[i];
+        for (unsigned i = 0; i < m_userCharacterClasses.size(); ++i)
+            delete m_userCharacterClasses[i];
+    }

And here.

+struct RegexJITObject {
+    void* m_jitCode;
+    RefPtr<ExecutablePool> m_executablePool;
+
+    RegexJITObject()
+        : m_jitCode(0)
+    {
+    }
+};

Should we call this something more specific, like Yarr::CodeBlock? "Object" is a pretty generic name for an object.

+// Yarr::parse():
+//

Please use "/**/" syntax for multiline comments.

+    static const unsigned s_maxPatternSize = (1 << 13);

I think you want to standardize with the use of MAX_PATTERN_SIZE elsewhere, to avoid differences between the engines.

r=me
Comment 3 Gavin Barraclough 2009-04-13 22:22:57 PDT
Created attachment 29460 [details]
(with review comment fixes, just missing a changelog...)
Comment 4 Gavin Barraclough 2009-04-14 00:07:40 PDT
Sending        JavaScriptCore/ChangeLog
Sending        JavaScriptCore/JavaScriptCore.xcodeproj/project.pbxproj
Sending        JavaScriptCore/assembler/MacroAssemblerX86Common.h
Sending        JavaScriptCore/assembler/X86Assembler.h
Sending        JavaScriptCore/runtime/RegExp.cpp
Sending        JavaScriptCore/runtime/RegExp.h
Sending        JavaScriptCore/wtf/Platform.h
Adding         JavaScriptCore/yarr
Adding         JavaScriptCore/yarr/RegexCompiler.cpp
Adding         JavaScriptCore/yarr/RegexCompiler.h
Adding         JavaScriptCore/yarr/RegexInterpreter.cpp
Adding         JavaScriptCore/yarr/RegexInterpreter.h
Adding         JavaScriptCore/yarr/RegexJIT.cpp
Adding         JavaScriptCore/yarr/RegexJIT.h
Adding         JavaScriptCore/yarr/RegexParser.h
Adding         JavaScriptCore/yarr/RegexPattern.h
Transmitting file data ...............
Committed revision 42481.