RESOLVED FIXED 175075
webkitpy: Allow caller to specify response to unicode encode/decode error in filesystem
https://bugs.webkit.org/show_bug.cgi?id=175075
Summary webkitpy: Allow caller to specify response to unicode encode/decode error in ...
Jonathan Bedard
Reported 2017-08-02 10:10:05 PDT
Currently, if there is an encode or decode error encountered while reading or writing to a text file, there is no way to specify the response, an exception will always be thrown.
Attachments
Patch (6.10 KB, patch)
2017-08-02 10:12 PDT, Jonathan Bedard
no flags
Jonathan Bedard
Comment 1 2017-08-02 10:12:29 PDT
David Kilzer (:ddkilzer)
Comment 2 2017-08-02 10:17:53 PDT
Comment on attachment 316970 [details] Patch r=me
WebKit Commit Bot
Comment 3 2017-08-02 11:22:00 PDT
The commit-queue encountered the following flaky tests while processing attachment 316970 [details]: The commit-queue is continuing to process your patch.
WebKit Commit Bot
Comment 4 2017-08-02 13:01:36 PDT
Comment on attachment 316970 [details] Patch Clearing flags on attachment: 316970 Committed r220150: <http://trac.webkit.org/changeset/220150>
WebKit Commit Bot
Comment 5 2017-08-02 13:01:37 PDT
All reviewed patches have been landed. Closing bug.
Radar WebKit Bug Importer
Comment 6 2017-08-02 13:02:47 PDT
Daniel Bates
Comment 7 2017-08-02 14:07:20 PDT
I know that this patch was already reviewed and landed. It seems error prone to allow a caller to relax UTF-8 decoding rules and I worry it may led to a proliferation of sloppy coding. With the exception of one or two methods, the FileSystem methods that work with text-files assume UTF-8 encoded text files. We should raise an error when a text file contains malformed UTF-8 data and we should make it hard for a person to use FileSystem methods meant for interacting with a text-file on a non-text file because both of these activities indicate a correctness issue. From talking with Jonathan today in-person, the motivation for this change is to ignore decoding errors that have been seen when processing some crash logs. We should acquire one of these problematic crash logs and analyze it. I hope that from this analysis we can revert this change.
David Kilzer (:ddkilzer)
Comment 8 2017-08-04 12:48:28 PDT
(In reply to Daniel Bates from comment #7) > I know that this patch was already reviewed and landed. It seems error prone > to allow a caller to relax UTF-8 decoding rules and I worry it may led to a > proliferation of sloppy coding. With the exception of one or two methods, > the FileSystem methods that work with text-files assume UTF-8 encoded text > files. We should raise an error when a text file contains malformed UTF-8 > data and we should make it hard for a person to use FileSystem methods meant > for interacting with a text-file on a non-text file because both of these > activities indicate a correctness issue. So your concern is that the "errors='strict'" default value is too easily overridden by someone just trying to make a script work, and that the reviewer of a future patch wouldn't question why "errors='replace'" or "errors='ignore'" is being used when it shouldn't be? def read_text_file(self, path, errors='strict'): I was assuming a best-case scenario where both patch author and reviewer understood what they were doing, although I can see how it could easily be misused. > From talking with Jonathan today in-person, the motivation for this change > is to ignore decoding errors that have been seen when processing some crash > logs. We should acquire one of these problematic crash logs and analyze it. > I hope that from this analysis we can revert this change. If the invalid UTF-8 characters are in the "Filtered syslog" section of the crash log, it will be difficult to (a) fix all of them in a timely manner and (b) prevent invalid UTF-8 characters from being introduced in the future. In that case, do we need a separate method to read a text file with potentially invalid UTF-8 characters named something like this (to make it clear what it does, and to make it easier to raise awareness during code reviews)? def read_text_file_ignoring_invalid_utf_8_characters(self, path):
Daniel Bates
Comment 9 2017-08-07 10:50:36 PDT
(In reply to David Kilzer (:ddkilzer) from comment #8) > So your concern is that the "errors='strict'" default value is too easily > overridden by someone just trying to make a script work, and that the > reviewer of a future patch wouldn't question why "errors='replace'" or > "errors='ignore'" is being used when it shouldn't be? > My concern is that is that there may be a better approach to avoiding the UnicodeDecodeError exceptions if the motivation for adding the errors parameter was adequately explained. I was under the impression that it was good programming practice to read a file whose encoding is unknown or may contain malformed characters as a binary file (i.e. use FileSystem.read_binary_file()) as opposed to reading the file as UTF-8 with a decoder error strategy unless the output of the file will be used in some fashion that requires printable characters. > def read_text_file(self, path, errors='strict'): > > I was assuming a best-case scenario where both patch author and reviewer > understood what they were doing, although I can see how it could easily be > misused. > This has nothing to do with the competency of the author and reviewer. I would like to know what is the preferred idiom for opening a file to read, searching for an ASCII character sequence, that has an unknown encoding or may contain malformed characters. More generally, when is it appropriate to read a file as binary VS read a file as UTF-8 with a decoder error strategy? > > From talking with Jonathan today in-person, the motivation for this change > > is to ignore decoding errors that have been seen when processing some crash > > logs. We should acquire one of these problematic crash logs and analyze it. > > I hope that from this analysis we can revert this change. > > If the invalid UTF-8 characters are in the "Filtered syslog" section of the > crash log, it will be difficult to (a) fix all of them in a timely manner > and (b) prevent invalid UTF-8 characters from being introduced in the future. > If your hypothesis is correct then adding the optional parameter errors to FileSystem methods is more understandable. (Even more so if the intended use of these functions necessitates printable characters as output). It would be good to know if we should be reading crash logs on Darwin-based system as binary files as we do Windows. Regardless, I would not have commented in this bug if the description of this bug (comment #0) or the ChangeLog description explained that this parameter was added because of issues reading crash logs on Darwin-based systems with the suspected cause being that the "Filtered syslog" section of a crash log may contain arbitrary data. Both the bug description and ChangeLog make it sound like we are adding this parameter on a whim because we encountered Python UnicodeDecodeError exceptions and want to silence them without understanding the underlying cause of the exceptions and whether it makes sense to read the crash logs as text instead of binary. > In that case, do we need a separate method to read a text file with > potentially invalid UTF-8 characters named something like this (to make it > clear what it does, and to make it easier to raise awareness during code > reviews)? > > def read_text_file_ignoring_invalid_utf_8_characters(self, path): I do not have a strong opinion on this. I do not get the impression that such a function name makes the code more understandable than the errors parameter.
Note You need to log in before you can comment on or make changes to this bug.