175075 – webkitpy: Allow caller to specify response to unicode encode/decode error in filesystem

RESOLVED FIXED 175075

https://bugs.webkit.org/show_bug.cgi?id=175075

Jonathan Bedard

Reported 2017-08-02 10:10:05 PDT

Currently, if an encode or decode error is encountered while reading or writing a text file, the caller has no way to specify the response; an exception is always thrown.
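The behavior described above can be sketched as follows. This is a minimal illustration, not the webkitpy implementation: `read_text_file` here is a free-standing stand-in whose `errors` parameter mirrors the signature quoted later in this thread, and `write_sample_log` and `demo` are hypothetical helpers added only for the example. It assumes Python's standard `io.open` is used for decoding.

```python
import io
import os
import tempfile


def read_text_file(path, errors='strict'):
    """Hypothetical stand-in for a FileSystem text reader (not the webkitpy code).

    `errors` is passed straight to the UTF-8 decoder: 'strict' raises
    UnicodeDecodeError on malformed input, while 'replace' and 'ignore' do not.
    """
    with io.open(path, 'r', encoding='utf-8', errors=errors) as f:
        return f.read()


def write_sample_log(data):
    """Write raw bytes to a temporary file and return its path (illustrative helper)."""
    fd, path = tempfile.mkstemp(suffix='.crash')
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    return path


def demo():
    # The byte 0xFF never appears in well-formed UTF-8, so strict decoding fails.
    path = write_sample_log(b'Process: Example\n\xff trailing text\n')
    try:
        try:
            read_text_file(path)
            strict_result = 'decoded'
        except UnicodeDecodeError:
            strict_result = 'UnicodeDecodeError'
        # With errors='replace' the bad byte becomes U+FFFD instead of raising.
        relaxed_text = read_text_file(path, errors='replace')
    finally:
        os.remove(path)
    return strict_result, relaxed_text
```

With the default `errors='strict'` the caller sees the exception; passing another handler is what lets the caller choose the response.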

Attachments
Patch (6.10 KB, patch) 2017-08-02 10:12 PDT, Jonathan Bedard, no flags

Jonathan Bedard

Comment 1 2017-08-02 10:12:29 PDT

Created attachment 316970 [details] Patch

David Kilzer (:ddkilzer)

Comment 2 2017-08-02 10:17:53 PDT

Comment on attachment 316970 [details] Patch

r=me

WebKit Commit Bot

Comment 3 2017-08-02 11:22:00 PDT

The commit-queue encountered the following flaky tests while processing attachment 316970 [details]: The commit-queue is continuing to process your patch.

WebKit Commit Bot

Comment 4 2017-08-02 13:01:36 PDT

Comment on attachment 316970 [details] Patch

Clearing flags on attachment: 316970

Committed r220150: <http://trac.webkit.org/changeset/220150>

WebKit Commit Bot

Comment 5 2017-08-02 13:01:37 PDT

All reviewed patches have been landed. Closing bug.

Radar WebKit Bug Importer

Comment 6 2017-08-02 13:02:47 PDT

<rdar://problem/33683253>

Daniel Bates

Comment 7 2017-08-02 14:07:20 PDT

I know that this patch was already reviewed and landed. It seems error-prone to allow a caller to relax UTF-8 decoding rules, and I worry it may lead to a proliferation of sloppy coding.

With the exception of one or two methods, the FileSystem methods that work with text files assume UTF-8 encoded text files. We should raise an error when a text file contains malformed UTF-8 data, and we should make it hard for a person to use FileSystem methods meant for interacting with a text file on a non-text file, because both of these activities indicate a correctness issue.

From talking with Jonathan today in person, the motivation for this change is to ignore decoding errors that have been seen when processing some crash logs. We should acquire one of these problematic crash logs and analyze it. I hope that from this analysis we can revert this change.

David Kilzer (:ddkilzer)

Comment 8 2017-08-04 12:48:28 PDT

(In reply to Daniel Bates from comment #7)
> I know that this patch was already reviewed and landed. It seems error-prone
> to allow a caller to relax UTF-8 decoding rules and I worry it may lead to a
> proliferation of sloppy coding. With the exception of one or two methods,
> the FileSystem methods that work with text files assume UTF-8 encoded text
> files. We should raise an error when a text file contains malformed UTF-8
> data and we should make it hard for a person to use FileSystem methods meant
> for interacting with a text file on a non-text file because both of these
> activities indicate a correctness issue.

So your concern is that the "errors='strict'" default value is too easily overridden by someone just trying to make a script work, and that the reviewer of a future patch wouldn't question why "errors='replace'" or "errors='ignore'" is being used when it shouldn't be?

    def read_text_file(self, path, errors='strict'):

I was assuming a best-case scenario where both patch author and reviewer understood what they were doing, although I can see how it could easily be misused.

> From talking with Jonathan today in person, the motivation for this change
> is to ignore decoding errors that have been seen when processing some crash
> logs. We should acquire one of these problematic crash logs and analyze it.
> I hope that from this analysis we can revert this change.

If the invalid UTF-8 characters are in the "Filtered syslog" section of the crash log, it will be difficult to (a) fix all of them in a timely manner and (b) prevent invalid UTF-8 characters from being introduced in the future.

In that case, do we need a separate method to read a text file with potentially invalid UTF-8 characters named something like this (to make it clear what it does, and to make it easier to raise awareness during code reviews)?

    def read_text_file_ignoring_invalid_utf_8_characters(self, path):
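To make the three `errors` values in this comment concrete, here is a small sketch of how Python's built-in handlers treat one malformed byte, together with one possible body for the dedicated method proposed above. The method name comes from the comment; its body, the free-function form, the sample bytes, and the `decode_with` and `demo` helpers are assumptions made for illustration and are not webkitpy code.

```python
import io
import os
import tempfile

# Well-formed UTF-8 for U+00E9 (0xC3 0xA9) followed by a stray 0xFF byte.
MALFORMED = b'syslog: caf\xc3\xa9 \xff ok'


def decode_with(handler, data=MALFORMED):
    """Decode bytes as UTF-8 using one of Python's built-in error handlers."""
    return data.decode('utf-8', errors=handler)


def read_text_file_ignoring_invalid_utf_8_characters(path):
    """Sketch of the dedicated method proposed above; the body is an assumption."""
    with io.open(path, 'r', encoding='utf-8', errors='ignore') as f:
        return f.read()


def demo():
    """Write the malformed sample to a temp file and read it back leniently."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(MALFORMED)
        return read_text_file_ignoring_invalid_utf_8_characters(path)
    finally:
        os.remove(path)
```

Here `decode_with('strict')` raises `UnicodeDecodeError`, `decode_with('replace')` substitutes U+FFFD for the bad byte, and `decode_with('ignore')` silently drops it. A dedicated, loudly named wrapper only fixes the handler; whether that is easier to notice in review than an `errors=` argument is the judgment call being debated in this thread.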

Daniel Bates

Comment 9 2017-08-07 10:50:36 PDT

(In reply to David Kilzer (:ddkilzer) from comment #8)
> So your concern is that the "errors='strict'" default value is too easily
> overridden by someone just trying to make a script work, and that the
> reviewer of a future patch wouldn't question why "errors='replace'" or
> "errors='ignore'" is being used when it shouldn't be?

My concern is that there may be a better approach to avoiding the UnicodeDecodeError exceptions if the motivation for adding the errors parameter was adequately explained. I was under the impression that it was good programming practice to read a file whose encoding is unknown or may contain malformed characters as a binary file (i.e. use FileSystem.read_binary_file()) as opposed to reading the file as UTF-8 with a decoder error strategy, unless the output of the file will be used in some fashion that requires printable characters.

>     def read_text_file(self, path, errors='strict'):
>
> I was assuming a best-case scenario where both patch author and reviewer
> understood what they were doing, although I can see how it could easily be
> misused.

This has nothing to do with the competency of the author and reviewer. I would like to know the preferred idiom for opening a file that has an unknown encoding, or may contain malformed characters, in order to search it for an ASCII character sequence. More generally, when is it appropriate to read a file as binary vs. read a file as UTF-8 with a decoder error strategy?

> > From talking with Jonathan today in person, the motivation for this change
> > is to ignore decoding errors that have been seen when processing some crash
> > logs. We should acquire one of these problematic crash logs and analyze it.
> > I hope that from this analysis we can revert this change.
>
> If the invalid UTF-8 characters are in the "Filtered syslog" section of the
> crash log, it will be difficult to (a) fix all of them in a timely manner
> and (b) prevent invalid UTF-8 characters from being introduced in the future.

If your hypothesis is correct then adding the optional parameter errors to FileSystem methods is more understandable. (Even more so if the intended use of these functions necessitates printable characters as output.) It would be good to know if we should be reading crash logs on Darwin-based systems as binary files, as we do on Windows.

Regardless, I would not have commented in this bug if the description of this bug (comment #0) or the ChangeLog description explained that this parameter was added because of issues reading crash logs on Darwin-based systems, with the suspected cause being that the "Filtered syslog" section of a crash log may contain arbitrary data. Both the bug description and ChangeLog make it sound like we are adding this parameter on a whim because we encountered Python UnicodeDecodeError exceptions and want to silence them without understanding the underlying cause of the exceptions and whether it makes sense to read the crash logs as text instead of binary.

> In that case, do we need a separate method to read a text file with
> potentially invalid UTF-8 characters named something like this (to make it
> clear what it does, and to make it easier to raise awareness during code
> reviews)?
>
>     def read_text_file_ignoring_invalid_utf_8_characters(self, path):

I do not have a strong opinion on this. I do not get the impression that such a function name makes the code more understandable than the errors parameter.
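The two approaches weighed in this comment, reading as binary versus reading as UTF-8 with a decoder error strategy, can be compared with a short sketch. The function names, the sample crash-log bytes, and the `demo` helper are hypothetical and are not part of webkitpy; the sketch does not settle which idiom is preferred. It relies on one property of UTF-8: bytes in the ASCII range never occur inside a multi-byte sequence, so a byte-level search for an ASCII string cannot match in the middle of another character.

```python
import io
import os
import tempfile


def binary_file_contains(path, needle):
    """Search the raw bytes of a file for an ASCII byte string; nothing is decoded."""
    with io.open(path, 'rb') as f:
        return needle in f.read()


def text_file_contains(path, needle, errors='replace'):
    """Decode the file as UTF-8 with an error handler, then search the text."""
    with io.open(path, 'r', encoding='utf-8', errors=errors) as f:
        return needle in f.read()


def demo():
    # Hypothetical crash-log fragment containing bytes that are not valid UTF-8.
    data = b'Filtered syslog:\n\xff\xfe\x00junk\nException Type: EXC_BAD_ACCESS\n'
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        return (
            binary_file_contains(path, b'EXC_BAD_ACCESS'),
            text_file_contains(path, 'EXC_BAD_ACCESS'),
        )
    finally:
        os.remove(path)
```

Both searches find the marker in this sample. The binary version never raises on malformed input and needs no error strategy, while the text version yields a string that can be printed or formatted, which is the case the comment singles out as requiring decoded output.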


Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Tools / Tests

Assignee

Jonathan Bedard

Reported

2017-08-02 10:10 PDT

Modified

2017-08-07 10:50 PDT

CC List

9 users

URL

Keywords InRadar

Depends on

Blocks