Bug 111066 - svn-apply cannot apply patches which is generated by git to files that contain space characters in their path
Summary: svn-apply cannot apply patches which is generated by git to files that contai...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
: 115281 (view as bug list)
Depends on:
Blocks:
 
Reported: 2013-02-28 03:50 PST by Yuki Sekiguchi
Modified: 2013-06-06 16:37 PDT (History)
12 users (show)

See Also:


Attachments
Patch (2.52 KB, patch)
2013-02-28 04:12 PST, Yuki Sekiguchi
no flags Details | Formatted Diff | Diff
Patch (14.53 KB, patch)
2013-02-28 22:07 PST, Yuki Sekiguchi
no flags Details | Formatted Diff | Diff
Patch (18.48 KB, patch)
2013-05-23 22:17 PDT, Yuki Sekiguchi
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from webkit-ews-09 for mac-mountainlion-wk2 (683.94 KB, application/zip)
2013-05-24 00:29 PDT, Build Bot
no flags Details
Patch (20.46 KB, patch)
2013-05-26 23:18 PDT, Yuki Sekiguchi
no flags Details | Formatted Diff | Diff
Patch (18.23 KB, patch)
2013-06-04 04:54 PDT, Yuki Sekiguchi
no flags Details | Formatted Diff | Diff
Patch (20.48 KB, patch)
2013-06-05 01:31 PDT, Yuki Sekiguchi
no flags Details | Formatted Diff | Diff
Patch (20.54 KB, patch)
2013-06-05 23:56 PDT, Yuki Sekiguchi
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yuki Sekiguchi 2013-02-28 03:50:37 PST
WebKit Review Bot fail to apply the patch:
https://bug-111042-attachments.webkit.org/attachment.cgi?id=190663

> Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-03', 'apply-attachment', '--no-update', '--non-interactive', 190663, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue
> 
> Logging in as webkit.review.bot@gmail.com...
> Fetching: https://bugs.webkit.org/attachment.cgi?id=190663&action=edit
> Fetching: https://bugs.webkit.org/show_bug.cgi?id=111042&ctype=xml&excludefield=attachmentdata
> Processing 1 patch from 1 bug.
> Processing patch 190663 from bug 111042.
> Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force', '--reviewer', 'Alexey Proskuryakov']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue
> 
> Parsed 2 diffs from patch file(s).
> patching file ChangeLog
> patching file 'WebProcess).xcscheme'
> Hunk #1 FAILED at 161.
> 1 out of 1 hunk FAILED -- saving rejects to file 'WebProcess).xcscheme.rej'
> 
> Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force', '--reviewer', 'Alexey Proskuryakov']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue
Comment 1 Yuki Sekiguchi 2013-02-28 04:12:57 PST
Created attachment 190699 [details]
Patch
Comment 2 Yuki Sekiguchi 2013-02-28 04:21:39 PST
Hi Daniel,
Could you review this patch? I heard you are the best person to review svn-apply from Vivek.
Comment 3 Daniel Bates 2013-02-28 16:55:56 PST
Comment on attachment 190699 [details]
Patch

We need a test for this change. See the existing tests for this function in Tools/Scripts/webkitperl/VCSUtils_unittest/parseGitDiffHeader.pl. Run Tools/Scripts/test-webkitperl to run the Perl unit tests.
Comment 4 Yuki Sekiguchi 2013-02-28 22:07:30 PST
Created attachment 190883 [details]
Patch
Comment 5 Yuki Sekiguchi 2013-02-28 22:22:51 PST
Hi Daniel,

Thank you for reviewing.

(In reply to comment #3)
> (From update of attachment 190699 [details])
> We need a test for this change. See the existing tests for this function in Tools/Scripts/webkitperl/VCSUtils_unittest/parseGitDiffHeader.pl. Run Tools/Scripts/test-webkitperl to run the Perl unit tests.

I added test cases and update expectations.

I found that my fix break --no-prefix diff.
I added support --no-prefix diff with empty file path.
Comment 6 Daniel Bates 2013-05-19 23:30:34 PDT
*** Bug 115281 has been marked as a duplicate of this bug. ***
Comment 7 Daniel Bates 2013-05-19 23:33:11 PDT
Comment on attachment 190883 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=190883&action=review

Please excuse my tardy reply. Work has been very hectic. Regardless, I will work to be more responsive moving forward. Feel free to ping me via either email or IRC to expedite a response.

This patch won't work for git diffs generated using: git diff --no-prefix --find-renames --find-copies. For example, the following diff renames the file named "Primary File" to "Secondary File":

diff --git Primary File Secondary File
similarity index 100%
rename from Primary File
rename to Secondary File

Maybe it's uncommon to rename or copy a file whose path contains a space character?

> Tools/Scripts/VCSUtils.pm:753
> +            $_ = "--- $indexPath\t(revision 0)"; # Convert to SVN format.

We should add a comment above this line that explains that we explicitly emit the suffix "\t(revision 0)" so that both the command patch(1) won't complain when $indexPath contains a space character and to more closely match the SVN diff format.

> Tools/Scripts/VCSUtils.pm:755
> +            $_ = "+++ $indexPath\t(revision 0)"; # Convert to SVN format.

I suggest we emit the suffix "(working copy)" instead of "(revision 0)"  in this line so as match the output of the command svn diff.

Additionally, we should add a comment above this line that explains that we explicitly emit the suffix "\t(working copy)" so that both the command patch(1) won't complain when $indexPath contains a space character and to more closely match the SVN diff format.
Comment 8 Alexey Proskuryakov 2013-05-20 09:24:07 PDT
<rdar://problem/13925627>
Comment 9 Yuki Sekiguchi 2013-05-23 22:17:58 PDT
Created attachment 202766 [details]
Patch
Comment 10 Yuki Sekiguchi 2013-05-23 23:16:25 PDT
Hi Daniel,

(In reply to comment #7)
> (From update of attachment 190883 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=190883&action=review
> 
> Please excuse my tardy reply. Work has been very hectic. Regardless, I will work to be more responsive moving forward. Feel free to ping me via either email or IRC to expedite a response.

No problem. Thank your for your help.

> This patch won't work for git diffs generated using: git diff --no-prefix --find-renames --find-copies. For example, the following diff renames the file named "Primary File" to "Secondary File":

Fixed that case.

> > Tools/Scripts/VCSUtils.pm:753
> > +            $_ = "--- $indexPath\t(revision 0)"; # Convert to SVN format.
> 
> We should add a comment above this line that explains that we explicitly emit the suffix "\t(revision 0)" so that both the command patch(1) won't complain when $indexPath contains a space character and to more closely match the SVN diff format.
> 
> > Tools/Scripts/VCSUtils.pm:755
> > +            $_ = "+++ $indexPath\t(revision 0)"; # Convert to SVN format.
> 
> I suggest we emit the suffix "(working copy)" instead of "(revision 0)"  in this line so as match the output of the command svn diff.
> 
> Additionally, we should add a comment above this line that explains that we explicitly emit the suffix "\t(working copy)" so that both the command patch(1) won't complain when $indexPath contains a space character and to more closely match the SVN diff format.

Changed suffix to "(working copy)" and added comment.
Comment 11 Build Bot 2013-05-24 00:29:25 PDT
Comment on attachment 202766 [details]
Patch

Attachment 202766 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.appspot.com/results/695004

New failing tests:
editing/style/5065910.html
Comment 12 Build Bot 2013-05-24 00:29:27 PDT
Created attachment 202776 [details]
Archive of layout-test-results from webkit-ews-09 for mac-mountainlion-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: webkit-ews-09  Port: mac-mountainlion-wk2  Platform: Mac OS X 10.8.3
Comment 13 Daniel Bates 2013-05-24 12:33:33 PDT
Comment on attachment 202766 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=202766&action=review

Thank you Yuki for updating the patch.

The idea presented in this patch is good. I suggest that we encapsulate the parsing machinery to determine the path to the target file into a separate function, say parseGitDiffHeaderForIndexPath, that takes a reference to a file handle F and the last line read from F as input and returns the path to the target file as output (or the empty string on error). parseGitDiffHeaderForIndexPath() will guarantee that F isn't advanced when it returns. That is, if F is at byte b_0 when parseGitDiffHeaderForIndexPath() is called then F will be at byte b_0 when the call returns. This function would have the form:

# Parse the Git diff header for the index path from the given file handle.
#
# This subroutine doesn't advance the handle.
#
# Args:
#   $fileHandle: a file handle; this should be at a line beginning
#                with "diff --git". The handle will not be advanced.
#   $line: the line last read from $fileHandle
#
# Returns $indexPath: the path of the target file.
sub parseGitDiffHeaderForIndexPath($$)
{
    my ($fileHandle, $line) = @_;
    $_ = $line;
    if (/$gitDiffStartWithPrefixRegEx/ || /$gitDiffStartWithoutPrefixNoSpaceRegEx/) {
        return $2;
    }
    # Assume the diff was generated with --no-prefix (e.g. git diff --no-prefix).
    my $sourcePath = "";
    my $destinationPath = "";
    my $savedPosition = tell($fileHandle);
    while (1) {
        # Temporarily strip off any end-of-line characters to simplify
        # regex matching below.
        s/([\n\r]+)$//;

        my $foundHeaderEnding;
        if (/^(copy|rename) to ([^\t\r\n]+)/) {
            $destinationPath = $2;
        } elsif (/^--- ([^\t\r\n]+)/) {
            $sourcePath = $1;
        } elsif (/^\+\+\+ ([^\t\r\n]+)/) {
            $destinationPath = $1;
            $foundHeaderEnding = 1;
        } elsif (/^GIT binary patch$/) {
            $foundHeaderEnding = 1;
        }

        $_ = <$fileHandle>; # Not defined if end-of-file reached.
        last if (!defined($_) || /$gitDiffStartRegEx/ || $foundHeaderEnding); 
    }
    seek($fileHandle, $savedPosition, 0);
    my $indexPath = "";
    if ($destinationPath eq "/dev/null") {
        # Deletion
        $indexPath = $sourcePath;
    } else {
        # Addition, modification, copy, or rename.
        $indexPath = $destinationPath;
    }
    return $indexPath;
}

Then in parseGitDiffHeader() we can write $indexPath = adjustPathForRecentRenamings(parseGitDiffHeaderForIndexPath($fileHandle, $line)) when $_ matches $gitDiffStartRegEx. When $indexPath is a non-empty string we construct the "Index:" line taking care to append the correct line ending (we'll need to save $POSTMATCH before calling parseGitDiffHeaderForIndexPath() as it will change on function return). Otherwise, we die() with an error message, say "Could not determine target file path from diff."

> Tools/Scripts/VCSUtils.pm:110
>  my $gitDiffStartRegEx = qr#^diff --git (\w/)?(.+) (\w/)?([^\r\n]+)#;

We seem to only use this regular expression to match the prefix "diff --git" and to extract the end of line character. Can we simplify this expression?

> Tools/Scripts/VCSUtils.pm:111
> +my $gitDiffStartWithPrefixRegEx = qr#^diff --git a/(.+) b/([^\r\n]+)#;

This is OK as-is. Notice that the source- (a/) and destination- prefix (b/) can be modified by specifying to the git diff command the optional command line arguments --src-prefix and --dst-prefix per <https://www.kernel.org/pub/software/scm/git/docs/v1.7.3/git-diff.html>. I assume this influenced the choice of the special escape \w in the regular expression $gitDiffStartRegEx as opposed to hardcoding the prefixes. The use of the special escape \w is insufficient to match an arbitrary source and destination prefix. Obviously the use of the special escape \w matches more variants of the diff --git line than a diff --git line with hardcoded source and destination prefixes.

> Tools/Scripts/VCSUtils.pm:701
> +            if (/^\+\+\+ (.+)/) {
> +                $indexPath = adjustPathForRecentRenamings($1);

Consider the following non-prefixed diff that deletes the file named "TWO WORDS.txt":

diff --git TWO WORDS.txt TWO WORDS.txt
deleted file mode 100644
index 03a05ce..0000000
--- TWO WORDS.txt       
+++ /dev/null
@@ -1 +0,0 @@
-Sample

Then $indexPath will be /dev/null. But it should be "TWO WORDS.txt".

> Tools/Scripts/VCSUtils.pm:704
> +                $indexPath = $1;

Don't we need to adjust the index path here (i.e. call adjustPathForRecentRenamings())?

> Tools/Scripts/VCSUtils.pm:707
> +                $indexPath = $1;

Ditto.

> Tools/Scripts/VCSUtils.pm:748
> +        } elsif (/^copy from ([^\r\n]+)/) {

I suggest we strengthen this regular expression and exclude tab characters from being considered in a path.

> Tools/Scripts/VCSUtils.pm:750
> +        } elsif (/^rename from ([^\r\n]+)/) {

Ditto.

> Tools/Scripts/VCSUtils.pm:762
> +            # The patch(1) command thinks a file path is characters before the parenthesis.

Is this correct? I thought patch(1) reads all characters up to the first tab character.

> Tools/Scripts/VCSUtils.pm:768
> +            # Please watch the comments in the "elsif (/^--- \S+/)" line for detail.

I think it would be clearer to duplicate the last two lines of the comment for the "elsif (/^--- \S+/)" branch than to refer to it. If you don't like such duplication then I suggest updating this sentence to read (*): See the comment for the /^--- \S+/ case (above) for more details.

(*) It's unusual to say "watch the comments" as the word "watch" implies that a person should keep the comment in view as if it were to change at any moment.
Comment 14 Yuki Sekiguchi 2013-05-26 23:18:00 PDT
Created attachment 202952 [details]
Patch
Comment 15 Yuki Sekiguchi 2013-05-27 00:16:37 PDT
Hi Daniel,

Thank you for reviewing.

(In reply to comment #13)
> (From update of attachment 202766 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=202766&action=review
> 
> Thank you Yuki for updating the patch.
> 
> The idea presented in this patch is good. I suggest that we encapsulate the parsing machinery to determine the path to the target file into a separate function, say parseGitDiffHeaderForIndexPath, that takes a reference to a file handle F and the last line read from F as input and returns the path to the target file as output (or the empty string on error). parseGitDiffHeaderForIndexPath() will guarantee that F isn't advanced when it returns. That is, if F is at byte b_0 when parseGitDiffHeaderForIndexPath() is called then F will be at byte b_0 when the call returns. This function would have the form:

Refactoring is good, but why should we use file handle as input of parseGitDiffHeaderForIndexPath()?
We have to read file to create @diffLines.
I think creating @diffLines and using it from parseGitDiffHeaderForIndexPath() and parseGitDiffHeader() is better than reading file twice.
My patch use @diffLines for the input of parseGitDiffHeaderForIndexPath().

> Then in parseGitDiffHeader() we can write $indexPath = adjustPathForRecentRenamings(parseGitDiffHeaderForIndexPath($fileHandle, $line)) when $_ matches $gitDiffStartRegEx. When $indexPath is a non-empty string we construct the "Index:" line taking care to append the correct line ending (we'll need to save $POSTMATCH before calling parseGitDiffHeaderForIndexPath() as it will change on function return). Otherwise, we die() with an error message, say "Could not determine target file path from diff."

Fixed.

> 
> > Tools/Scripts/VCSUtils.pm:110
> >  my $gitDiffStartRegEx = qr#^diff --git (\w/)?(.+) (\w/)?([^\r\n]+)#;
> 
> We seem to only use this regular expression to match the prefix "diff --git" and to extract the end of line character. Can we simplify this expression?

Yes. Simplified.

> > Tools/Scripts/VCSUtils.pm:111
> > +my $gitDiffStartWithPrefixRegEx = qr#^diff --git a/(.+) b/([^\r\n]+)#;
> 
> This is OK as-is. Notice that the source- (a/) and destination- prefix (b/) can be modified by specifying to the git diff command the optional command line arguments --src-prefix and --dst-prefix per <https://www.kernel.org/pub/software/scm/git/docs/v1.7.3/git-diff.html>. I assume this influenced the choice of the special escape \w in the regular expression $gitDiffStartRegEx as opposed to hardcoding the prefixes. The use of the special escape \w is insufficient to match an arbitrary source and destination prefix. Obviously the use of the special escape \w matches more variants of the diff --git line than a diff --git line with hardcoded source and destination prefixes.

Used \w+, and added comment the limitation.

> > Tools/Scripts/VCSUtils.pm:701
> > +            if (/^\+\+\+ (.+)/) {
> > +                $indexPath = adjustPathForRecentRenamings($1);
> 
> Consider the following non-prefixed diff that deletes the file named "TWO WORDS.txt":
> 
> diff --git TWO WORDS.txt TWO WORDS.txt
> deleted file mode 100644
> index 03a05ce..0000000
> --- TWO WORDS.txt       
> +++ /dev/null
> @@ -1 +0,0 @@
> -Sample
> 
> Then $indexPath will be /dev/null. But it should be "TWO WORDS.txt".

I added a test case and confirmed that your suggested function fixed this.
Thanks!

> > Tools/Scripts/VCSUtils.pm:704
> > +                $indexPath = $1;
> 
> Don't we need to adjust the index path here (i.e. call adjustPathForRecentRenamings())?
> 
> > Tools/Scripts/VCSUtils.pm:707
> > +                $indexPath = $1;
> 
> Ditto.

Yes, we need.

> > Tools/Scripts/VCSUtils.pm:748
> > +        } elsif (/^copy from ([^\r\n]+)/) {
> 
> I suggest we strengthen this regular expression and exclude tab characters from being considered in a path.
> 
> > Tools/Scripts/VCSUtils.pm:750
> > +        } elsif (/^rename from ([^\r\n]+)/) {
> 
> Ditto.

Added \t.

> > Tools/Scripts/VCSUtils.pm:762
> > +            # The patch(1) command thinks a file path is characters before the parenthesis.
> 
> Is this correct? I thought patch(1) reads all characters up to the first tab character.

This is not correct. You are right.
In GNU patch code, fetchname() @ util.c do that.

> > Tools/Scripts/VCSUtils.pm:768
> > +            # Please watch the comments in the "elsif (/^--- \S+/)" line for detail.
> 
> I think it would be clearer to duplicate the last two lines of the comment for the "elsif (/^--- \S+/)" branch than to refer to it. If you don't like such duplication then I suggest updating this sentence to read (*): See the comment for the /^--- \S+/ case (above) for more details.

Just copied commets.

> (*) It's unusual to say "watch the comments" as the word "watch" implies that a person should keep the comment in view as if it were to change at any moment.

English is very difficult for me like perl :(
Comment 16 Yuki Sekiguchi 2013-05-31 04:50:22 PDT
Hi Daniel,
Could you review my patch?
Comment 17 Daniel Bates 2013-06-01 14:05:37 PDT
I spoke with Chris Jerdonek, who wrote most of the Git diff parsing machinery, about this patch. We felt that it's sufficient to start with a simple solution and iterate on it as needed. One such simple solution is to use the first path segment w of the source file path to determine where the target file path begins. Consider the following diff --git line:

diff --git WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess).xcscheme WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess)2.xcscheme

Take w := "WebKit.xcworkspace/". Then locate the second occurrence of w and split the string accordingly. The source and destination file paths would then be "WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess).xcscheme", and "WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess)2.xcscheme", respectively.

Such a solution would have the form:

sub parseGitDiffHeader($$)
{
    ...
    my $indexPath;
    if (/$gitDiffStartRegEx/) {
        ...
        $indexPath = adjustPathForRecentRenamings(parseGitDiffStartLineForTargetPath($_));
        # Use $POSTMATCH to preserve the end-of-line character.
        $_ = "Index: $indexPath$POSTMATCH"; # Convert to SVN format.
    } else {
        die("Could not parse leading \"diff --git\" line: \"$line\".");
    }
    ...
    while (1) {
        ...
        } elsif (/^--- \S+/) {
            ...
            $_ = "--- $indexPath\t(revision 0)";
        } elsif (/^\+\+\+ \S+/) {
            ...
            $_ = "+++ $indexPath\t(working copy)";
        }
        ...
    }
    ...
}

sub parseGitDiffStartLine($)
{
    ...
    if (/$gitDiffStartWithPrefixRegEx/ || /$gitDiffStartWithoutPrefixNoSpaceRegEx/) {
        return $2;
    }
    ...
}

We expect that this approach will handle almost all cases in practice. However, initially it won't handle cases where the top-level directory changes or certain rare cases where the first path segment may occur later in the source path (e.g. diff --git A/B A/C.txt A/D.txt; source path = "A/B A/C.txt", target path = "A/D.txt"). Moreover, if we ever want to handle non-prefix Git diffs for binary files, the approach of parsing only the first line will be necessary because such diffs only reference the source and target path in this line.

Starting with a simple solution is attractive as the implementation tends to be concise and maintains the readability and hackability of the code base. Should the simple solution turn out to be insufficient we can iterate on it towards a more comprehensive solution.
Comment 18 David Farler 2013-06-03 16:28:25 PDT
(In reply to comment #17)
> I spoke with Chris Jerdonek, who wrote most of the Git diff parsing machinery, about this patch. We felt that it's sufficient to start with a simple solution and iterate on it as needed. One such simple solution is to use the first path segment w of the source file path to determine where the target file path begins. Consider the following diff --git line:
> 
> diff --git WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess).xcscheme WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess)2.xcscheme
> 
> Take w := "WebKit.xcworkspace/". Then locate the second occurrence of w and split the string accordingly. The source and destination file paths would then be "WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess).xcscheme", and "WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess)2.xcscheme", respectively.
> 
> Such a solution would have the form:
> 
> sub parseGitDiffHeader($$)
> {
>     ...
>     my $indexPath;
>     if (/$gitDiffStartRegEx/) {
>         ...
>         $indexPath = adjustPathForRecentRenamings(parseGitDiffStartLineForTargetPath($_));
>         # Use $POSTMATCH to preserve the end-of-line character.
>         $_ = "Index: $indexPath$POSTMATCH"; # Convert to SVN format.
>     } else {
>         die("Could not parse leading \"diff --git\" line: \"$line\".");
>     }
>     ...
>     while (1) {
>         ...
>         } elsif (/^--- \S+/) {
>             ...
>             $_ = "--- $indexPath\t(revision 0)";
>         } elsif (/^\+\+\+ \S+/) {
>             ...
>             $_ = "+++ $indexPath\t(working copy)";
>         }
>         ...
>     }
>     ...
> }
> 
> sub parseGitDiffStartLine($)
> {
>     ...
>     if (/$gitDiffStartWithPrefixRegEx/ || /$gitDiffStartWithoutPrefixNoSpaceRegEx/) {
>         return $2;
>     }
>     ...
> }
> 
> We expect that this approach will handle almost all cases in practice. However, initially it won't handle cases where the top-level directory changes or certain rare cases where the first path segment may occur later in the source path (e.g. diff --git A/B A/C.txt A/D.txt; source path = "A/B A/C.txt", target path = "A/D.txt"). Moreover, if we ever want to handle non-prefix Git diffs for binary files, the approach of parsing only the first line will be necessary because such diffs only reference the source and target path in this line.
> 
> Starting with a simple solution is attractive as the implementation tends to be concise and maintains the readability and hackability of the code base. Should the simple solution turn out to be insufficient we can iterate on it towards a more comprehensive solution.

If we like, we can also just see how git does it:

builtin/apply.c

static int find_header(const char *line, unsigned long size, int *hdrsize, struct patch *patch)
:1490
static int parse_git_header(const char *line, int len, unsigned int size, struct patch *patch)
:1280
static char *git_header_name(const char *line, int llen)
Comment 19 Yuki Sekiguchi 2013-06-04 04:54:43 PDT
Created attachment 203682 [details]
Patch
Comment 20 Yuki Sekiguchi 2013-06-04 05:05:22 PDT
Hi Daniel,

(In reply to comment #17)
> I spoke with Chris Jerdonek, who wrote most of the Git diff parsing machinery, about this patch. We felt that it's sufficient to start with a simple solution and iterate on it as needed. One such simple solution is to use the first path segment w of the source file path to determine where the target file path begins. Consider the following diff --git line:
> 
> diff --git WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess).xcscheme WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess)2.xcscheme
> 
> Take w := "WebKit.xcworkspace/". Then locate the second occurrence of w and split the string accordingly. The source and destination file paths would then be "WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess).xcscheme", and "WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess)2.xcscheme", respectively.

OK. Clean and simple code is good :)

> We expect that this approach will handle almost all cases in practice. However, initially it won't handle cases where the top-level directory changes or certain rare cases where the first path segment may occur later in the source path (e.g. diff --git A/B A/C.txt A/D.txt; source path = "A/B A/C.txt", target path = "A/D.txt"). Moreover, if we ever want to handle non-prefix Git diffs for binary files, the approach of parsing only the first line will be necessary because such diffs only reference the source and target path in this line.
> 
> Starting with a simple solution is attractive as the implementation tends to be concise and maintains the readability and hackability of the code base. Should the simple solution turn out to be insufficient we can iterate on it towards a more comprehensive solution.

I found that moving a file through sub directories of top directory is not supported, too (e.g diff --git A/B.txt C/B.txt).
However, I think this is also rare case.

I partial reverted your suggestion in Comment #13 which support --src-prefix and --dst-prefix.
We cannot distinguish the prefix from path prefix, so I think a single character is --src-prefix or --dst-prefix.
Comment 21 Yuki Sekiguchi 2013-06-04 05:15:39 PDT
(In reply to comment #18)
> If we like, we can also just see how git does it:
> 
> builtin/apply.c
> 
> static int find_header(const char *line, unsigned long size, int *hdrsize, struct patch *patch)
> :1490
> static int parse_git_header(const char *line, int len, unsigned int size, struct patch *patch)
> :1280
> static char *git_header_name(const char *line, int llen)

Thank you for good information.

I looked the code.

Summary:
* git get paths from "git --diff" line only if the file is changed.
* When the file is renamed, copied or deleted, git gets paths from diff file (e.g. "+++ foo.h").
* To get paths from a diff line which contains paths which have space, git splits the diff line with space or tab, and it checks the two paths is same.
* To support --src-prefix and --dst-prefix, git use -p parameter which is same as patch(1) or  heuristically guess -p parameter of a diff using the diff and relative path from the .git directory .
Comment 22 Daniel Bates 2013-06-04 14:54:00 PDT
Comment on attachment 203682 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=203682&action=review

This patch looks good. I have some minor nits.

> Tools/ChangeLog:8
> +        If the path in a diff has space characters, parseGitDiffHeader() considers last nonspace characters as file path.

I'm unclear what you mean by this sentence.

> Tools/ChangeLog:9
> +        When the diff have prefix, we consider next characters after "b/" as file path.

Is this sentence correct? I thought we currently considered the next characters after "b/" as part of a file path?

> Tools/ChangeLog:12
> +        We only support --src-prefix and --dst-prefix don't have a \W and end with '/' because we cannot distinguish the prefix from directory path.

For your consideration, I suggest we write out the meaning of \W in English and hence change "have a \W" to "contain a non-word character (\W)".

> Tools/Scripts/VCSUtils.pm:111
> +my $gitDiffStartWithPrefixRegEx = qr#^diff --git \w/(.+) \w/([^\r\n]+)#; # We suppose that --src-prefix and --dst-prefix don't have a \W and end with '/'.

For your consideration, I suggest we write out the meaning of \W in English and hence change "have a \W" to "contain a non-word character (\W)".

> Tools/Scripts/VCSUtils.pm:114
> +my $gitDiffPathPrefix = qr#^diff --git ([^\/]+\/)#;

It's unnecessary to escape the '/' character in this regular expression since we use '#' as the delimiter. Thus, we can simplify the regular expression in this line to read: ^diff --git ([^/]+/)

On another note, I wish we could come up with a more descriptive name for this variable. Maybe $gitDiffStartWithoutPrefixFirstDirectoryNameRegExp? or $gitDiffStartWithoutPrefixSourceDirectoryPrefixRegExp? or $gitDiffStartPathPrefixRegExp?

> Tools/Scripts/VCSUtils.pm:624
> +#   $line: the line last read from $fileHandle

The description of this argument is incorrect. We expect $line to be the diff --git line.

> Tools/Scripts/VCSUtils.pm:626
> +# Returns $indexPath: the path of the target file.

There is no such local variable called $indexPath. I would update this line to read:

Returns the path of the target file.

> Tools/Scripts/VCSUtils.pm:636
> +        # Moving top directory file is not supported (e.g diff --git A.txt B.txt).

We should make this a FIXME comment.

> Tools/Scripts/VCSUtils.pm:637
> +        die("Could not find '/' in  \"diff --git\" line: \"$line\".\nPlease try git diff without --no-prefix");

I suggest we further elaborate on the reason we failed similar to the comment on the line above. Maybe something like:

"Could not find '/' in \"diff --git\" line: \"$line\"; only non-prefixed git diffs (i.e. not generated with --no-prefix) that move a top-level directory file are supported."

> Tools/Scripts/VCSUtils.pm:640
> +    if (!/^diff --git $pathPrefix.+ ($pathPrefix.+)$/) {

We should disable all metacharacters in $pathPrefix when interpolating its value by surrounding it with the special escapes \Q, \E:

/^diff --git \Q$pathPrefix\E.+ (\Q$pathPrefix\E.+)$/

For completeness, the special escape characters \Q and \E are discussed in <http://perldoc.perl.org/perlop.html#Quote-and-Quote-like-Operators>.

> Tools/Scripts/VCSUtils.pm:641
> +        # Moving a file through sub directories of top directory is not supported (e.g diff --git A/B.txt C/B.txt).

We should make this a FIXME coment.

> Tools/Scripts/VCSUtils.pm:642
> +        die("Could not find '/' in  \"diff --git\" line: \"$line\".\nPlease try git diff without --no-prefix");

I suggest we further elaborate on the reason we failed similar to the comment on the line above. Maybe something like:

"Could not find '/' in \"diff --git\" line: \"$line\"; only non-prefixed git diffs (i.e. not generated with --no-prefix) that move a file between top-level directories are supported."

> Tools/Scripts/VCSUtils.pm:740
> +            # This suffix make our diff more closely match the SVN diff format.

Nit: make => makes

> Tools/Scripts/VCSUtils.pm:746
> +            # This suffix make our diff more closely match the SVN diff format.

Nit: make => makes

> Tools/Scripts/webkitperl/VCSUtils_unittest/parseGitDiffHeader.pl:84
> +    diffName => "Modified file which have empty characters in path",

Nit: "empty" => "space"

> Tools/Scripts/webkitperl/VCSUtils_unittest/parseGitDiffHeader.pl:107
> +{   # New test

These test cases are good. For your consideration, I suggest we create a test case that has a diff --git line identical to one in attachment #190663 [details] (which is the patch that revealed the bug in our Git diff parsing code):

diff --git a/WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess).xcscheme b/WebKit.xcworkspace/xcshareddata/xcschemes/All Source (target WebProcess).xcscheme

> Tools/Scripts/webkitperl/VCSUtils_unittest/parseGitDiffHeader.pl:108
> +    diffName => "Modified file which have empty characters in path using --no-prefix",

Nit: "empty" => "space"

> Tools/Scripts/webkitperl/VCSUtils_unittest/parseGitDiffHeader.pl:186
> +    diffName => "delete file which have empty characters in path using --no-prefix",

Nit: "empty" => "space"

> Tools/Scripts/webkitperl/VCSUtils_unittest/parseGitDiffHeader.pl:264
> +    diffName => "copy file which have empty characters in path using --no-prefix (with similarity index 100%)",

Ditto.

> Tools/Scripts/webkitperl/VCSUtils_unittest/parseGitDiffHeader.pl:338
> +    diffName => "rename file which have empty characters in path using --no-prefix (with similarity index 100%)",

Ditto.
Comment 23 Yuki Sekiguchi 2013-06-05 01:31:25 PDT
Created attachment 203769 [details]
Patch
Comment 24 Yuki Sekiguchi 2013-06-05 01:43:57 PDT
Hi Daniel,

Thank you for kindly review.

I fixed codes which you comment.

(In reply to comment #22)
> (From update of attachment 203682 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=203682&action=review
> 
> This patch looks good. I have some minor nits.
> 
> > Tools/ChangeLog:8
> > +        If the path in a diff has space characters, parseGitDiffHeader() considers last nonspace characters as file path.
> 
> I'm unclear what you mean by this sentence.

I want to describe bug cause.
I changed this to the past tense.
Is this make sense?

> 
> > Tools/ChangeLog:9
> > +        When the diff have prefix, we consider next characters after "b/" as file path.
> 
> Is this sentence correct? I thought we currently considered the next characters after "b/" as part of a file path?

Sorry, I cannot understand your comment.
I added "part of a", but my fix might be wrong.

> > Tools/Scripts/VCSUtils.pm:114
> > +my $gitDiffPathPrefix = qr#^diff --git ([^\/]+\/)#;
> 
> It's unnecessary to escape the '/' character in this regular expression since we use '#' as the delimiter. Thus, we can simplify the regular expression in this line to read: ^diff --git ([^/]+/)
> 
> On another note, I wish we could come up with a more descriptive name for this variable. Maybe $gitDiffStartWithoutPrefixFirstDirectoryNameRegExp? or $gitDiffStartWithoutPrefixSourceDirectoryPrefixRegExp? or $gitDiffStartPathPrefixRegExp?

I like $gitDiffStartWithoutPrefixSourceDirectoryPrefixRegExp :)
Comment 25 Daniel Bates 2013-06-05 05:07:39 PDT
Comment on attachment 203682 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=203682&action=review

>>> Tools/ChangeLog:8
>>> +        If the path in a diff has space characters, parseGitDiffHeader() considers last nonspace characters as file path.
>> 
>> I'm unclear what you mean by this sentence.
> 
> I want to describe bug cause.
> I changed this to the past tense.
> Is this make sense?

I think a better way to phrase this sentence is to write:

Fixes an issue where parseGitDiffHeader() would extract the wrong substring of the diff --git line as the target file path when the source file path contains a space character.

>>> Tools/ChangeLog:9
>>> +        When the diff have prefix, we consider next characters after "b/" as file path.
>> 
>> Is this sentence correct? I thought we currently considered the next characters after "b/" as part of a file path?
> 
> Sorry, I cannot understand your comment.
> I added "part of a", but my fix might be wrong.

OK. This sentence reads better with the proposed correction.
Comment 26 Daniel Bates 2013-06-05 05:09:49 PDT
Comment on attachment 203769 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=203769&action=review

> Tools/ChangeLog:8
> +        If the path in diff start line has space characters, parseGitDiffHeader() considered last non-space characters as file path.

As aforementioned in comment #25, I think a better way to phrase this sentence would be:

Fixes an issue where parseGitDiffHeader() would extract the wrong substring of the diff --git line as the target file path when the source file path contains a space character.
Comment 27 Yuki Sekiguchi 2013-06-05 23:56:17 PDT
Created attachment 203907 [details]
Patch
Comment 28 Yuki Sekiguchi 2013-06-06 00:07:11 PDT
Hi Daniel,

(In reply to comment #26)
> (From update of attachment 203769 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=203769&action=review
> 
> > Tools/ChangeLog:8
> > +        If the path in diff start line has space characters, parseGitDiffHeader() considered last non-space characters as file path.
> 
> As aforementioned in comment #25, I think a better way to phrase this sentence would be:
> 
> Fixes an issue where parseGitDiffHeader() would extract the wrong substring of the diff --git line as the target file path when the source file path contains a space character.

Thank you.
Fixed.
Comment 29 Daniel Bates 2013-06-06 07:16:36 PDT
Comment on attachment 203907 [details]
Patch

r=me
Comment 30 WebKit Commit Bot 2013-06-06 16:37:25 PDT
Comment on attachment 203907 [details]
Patch

Clearing flags on attachment: 203907

Committed r151300: <http://trac.webkit.org/changeset/151300>
Comment 31 WebKit Commit Bot 2013-06-06 16:37:29 PDT
All reviewed patches have been landed.  Closing bug.