137165 – Support modern for loops over StringView code units and code points

RESOLVED FIXED 137165

Support modern for loops over StringView code units and code points

https://bugs.webkit.org/show_bug.cgi?id=137165

Summary Support modern for loops over StringView code units and code points

Myles C. Maxfield

Reported 2014-09-26 19:49:44 PDT

Support modern for loops over StringViews

Attachments
Patch (9.45 KB, patch) 2014-09-26 20:03 PDT, Myles C. Maxfield	no flags	Details Formatted Diff Diff
Patch (9.95 KB, patch) 2014-09-28 21:53 PDT, Myles C. Maxfield	no flags	Details Formatted Diff Diff
Patch (10.67 KB, patch) 2014-09-28 22:37 PDT, Myles C. Maxfield	no flags	Details Formatted Diff Diff
Patch (10.66 KB, patch) 2014-09-28 22:41 PDT, Myles C. Maxfield	no flags	Details Formatted Diff Diff
Patch (10.67 KB, patch) 2014-09-28 22:43 PDT, Myles C. Maxfield	no flags	Details Formatted Diff Diff
Patch (13.50 KB, patch) 2014-09-29 12:18 PDT, Myles C. Maxfield	no flags	Details Formatted Diff Diff
Patch (10.50 KB, patch) 2014-09-30 09:48 PDT, Myles C. Maxfield	no flags	Details Formatted Diff Diff
Patch (10.04 KB, patch) 2014-10-02 17:31 PDT, Myles C. Maxfield	darin: review+	Details Formatted Diff Diff
Show Obsolete (7) View All Add attachment proposed patch, testcase, etc.

Myles C. Maxfield

Comment 1 2014-09-26 20:03:16 PDT

Created attachment 238754 [details] Patch

Darin Adler

Comment 2 2014-09-28 16:37:54 PDT

Comment on attachment 238754 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=238754&action=review I think this is a pretty good idea. But what guarantees these objects don’t last longer than the underlying StringView? It seems really easy to make the mistake of keeping one of these alive when the StringView is gone. For example, this code would not work: for (UChar character : functionReturningStringView().codeUnits()) I think we should consider a different version of these that will work even if the StringView has been destroyed. Probably the best way to do this is to use a StringView rather than a const StringView& as a data member of the iterator. We need to test how codePoints works on invalid sequences; such as when the last character is the lead and there is no trail, and when a trail appears with no lead. U16_NEXT handles these cases, but I am not sure that our code handles them correctly. I am not sure that > Source/WTF/ChangeLog:8 > + This patch adds two functions, codepoints() and codeunits(), on StringView. Code point and code unit are each two word phrases, not single words. Lets not smash them together into one word. uNless there is a good reason. > Source/WTF/wtf/text/StringView.h:40 > +class StringViewCodepoints { I think this should be a member class. So it would be declared inside the StringView class. class StringView { ... class CodePoints; Then later, like this: class StringView::CodePoints { ... }; > Source/WTF/wtf/text/StringView.h:47 > + bool operator!=(const Iterator&) const; I think we should supply both == and != operators. > Source/WTF/wtf/text/StringView.h:57 > + StringViewCodepoints(const StringView&); Might want this to be explicit. > Source/WTF/wtf/text/StringView.h:60 > +private: Need a blank line before this. > Source/WTF/wtf/text/StringView.h:240 > + StringViewCodepoints codepoints() const > + { > + return StringViewCodepoints(*this); > + } > + > + StringViewCodeunits codeunits() const > + { > + return StringViewCodeunits(*this); > + } Please put the bodies of these functions outside the class. I have a patch coming soon that does this for all StringView member functions. > Source/WTF/wtf/text/StringView.h:348 > + if (m_cacheIsValid) { > + m_index = m_nextIndexCache; > + m_cacheIsValid = false; Would read better with early return. > Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:158 > + for (auto codepoint : StringView(builder.toString()).codepoints()) { I think this code would crash if it was run under a memory debugger; the StringView won’t stay alive.

Myles C. Maxfield

Comment 3 2014-09-28 21:47:57 PDT

Comment on attachment 238754 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=238754&action=review >> Source/WTF/ChangeLog:8 >> + This patch adds two functions, codepoints() and codeunits(), on StringView. > > Code point and code unit are each two word phrases, not single words. Lets not smash them together into one word. uNless there is a good reason. Done. >> Source/WTF/wtf/text/StringView.h:40 >> +class StringViewCodepoints { > > I think this should be a member class. So it would be declared inside the StringView class. > > class StringView { > ... > class CodePoints; > > Then later, like this: > > class StringView::CodePoints { > ... > }; Done. I really like the translation of StringViewCodePoints to StringView::CodePoints. There is a way to describe exactly what I'm trying to describe! >> Source/WTF/wtf/text/StringView.h:47 >> + bool operator!=(const Iterator&) const; > > I think we should supply both == and != operators. Done. >> Source/WTF/wtf/text/StringView.h:57 >> + StringViewCodepoints(const StringView&); > > Might want this to be explicit. Done. >> Source/WTF/wtf/text/StringView.h:60 >> +private: > > Need a blank line before this. Done. >> Source/WTF/wtf/text/StringView.h:240 >> + } > > Please put the bodies of these functions outside the class. I have a patch coming soon that does this for all StringView member functions. Done. >> Source/WTF/wtf/text/StringView.h:348 >> + m_cacheIsValid = false; > > Would read better with early return. Done. >> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:158 >> + for (auto codepoint : StringView(builder.toString()).codepoints()) { > > I think this code would crash if it was run under a memory debugger; the StringView won’t stay alive. I tested this out and you're right. I've made the internal member a StringView instead of a const StringView&.

Myles C. Maxfield

Comment 4 2014-09-28 21:53:23 PDT

Created attachment 238839 [details] Patch

Myles C. Maxfield

Comment 5 2014-09-28 22:37:56 PDT

Created attachment 238840 [details] Patch

Myles C. Maxfield

Comment 6 2014-09-28 22:41:41 PDT

Created attachment 238841 [details] Patch

Myles C. Maxfield

Comment 7 2014-09-28 22:43:30 PDT

Created attachment 238842 [details] Patch

Said Abou-Hallawa

Comment 8 2014-09-29 10:29:38 PDT

I have two comments on this change: 1. There is no way to use this code to traverse the string backward. The std::string creates reverse beginning and reverse end iterators which can be used to traverse the string backward. Maybe you need to do the same for your iterators. Or maybe we can add operator--() to your iterators. 2. It was had for me to keep track to with your code in StringView::CodePoints::Iterator::operator++() and StringView::CodePoints::Iterator::operator*(). It seems you have two different cases: is8Bit() and !is8Bit(). When is8Bit() is true you do not need to cache the values m_nextIndexCache and m_codepointCache. But you are doing the following: a) you are not initializing m_codepointCache in the constructor. I know it is guarded by having m_cacheIsValid set to false in the constructor. And m_codepointCache is going to be fixed in operator++() or operator*() if is8Bit() is false. But I think initializing it to nullptr in the constructor might make things clearer. b) In StringView::CodePoints::Iterator::operator*() and if m_stringView.is8Bit() you are ++m_nextIndexCache. This looks weird to me to have the operator*() increments the index cache in this cache. And actually it can produce bugs: Consider this example: StringView strView = StringView(String("ABC")).codePoints(); Iterator it = strView.begin(); std::cout << *it; std::cout << *it; it++; std::cout << *it; Result will be: AAC Expected: AAB The reason is in operator*() we moved the m_nextIndexCache twice and in operator++() we set the m_index to m_nextIndexCache which was '2'. c) I would suggest in StringView::CodePoints::Iterator::operator*() and StringView::CodePoints::Iterator::operator++() to handle the cases is8Bit() and !is8Bit() in completely different code paths since when is8Bit() is true, you do not need the cache and when is8Bit() is true you need the cache. I personally like the code when it looks like a tree: The code starts and diverges and keep diverging. But I got confused when it looks like a diamond: The code starts, diverge and then converge again. The code in these two functions looks like a diamond.

Myles C. Maxfield

Comment 9 2014-09-29 11:27:56 PDT

Comment on attachment 238842 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=238842&action=review > Source/WTF/wtf/text/StringView.h:387 > + ++m_nextIndexCache; This is wrong if operator++() has been called multiple times since the last call to operator*()

Myles C. Maxfield

Comment 10 2014-09-29 12:14:23 PDT

Replies inline. (In reply to comment #8) > I have two comments on this change: > > 1. There is no way to use this code to traverse the string backward. The std::string creates reverse beginning and reverse end iterators which can be used to traverse the string backward. Maybe you need to do the same for your iterators. Or maybe we can add operator--() to your iterators. Yep. That is a non-goal. If we need this in the future, we can add it in. > > 2. It was had for me to keep track to with your code in StringView::CodePoints::Iterator::operator++() and StringView::CodePoints::Iterator::operator*(). It seems you have two different cases: is8Bit() and !is8Bit(). When is8Bit() is true you do not need to cache the values m_nextIndexCache and m_codepointCache. But you are doing the following: > > a) you are not initializing m_codepointCache in the constructor. I know it is guarded by having m_cacheIsValid set to false in the constructor. And m_codepointCache is going to be fixed in operator++() or operator*() if is8Bit() is false. But I think initializing it to nullptr in the constructor might make things clearer. The m_cacheIsValid should be enough, but it's cheap enough to do this that I'll do it for readability. > b) In StringView::CodePoints::Iterator::operator*() and if m_stringView.is8Bit() you are ++m_nextIndexCache. This looks weird to me to have the operator*() increments the index cache in this cache. And actually it can produce bugs: Consider this example: > StringView strView = StringView(String("ABC")).codePoints(); > Iterator it = strView.begin(); > std::cout << *it; > std::cout << *it; > it++; > std::cout << *it; > > Result will be: > AAC > > Expected: > AAB > > The reason is in operator*() we moved the m_nextIndexCache twice and in operator++() we set the m_index to m_nextIndexCache which was '2'. Line 390 makes this not possible. However, you bring up a good point - I don't have any tests regarding calling operator++() or operator*() multiple times in a row. I've added some tests. > c) I would suggest in StringView::CodePoints::Iterator::operator*() and StringView::CodePoints::Iterator::operator++() to handle the cases is8Bit() and !is8Bit() in completely different code paths since when is8Bit() is true, you do not need the cache and when is8Bit() is true you need the cache. I personally like the code when it looks like a tree: The code starts and diverges and keep diverging. But I got confused when it looks like a diamond: The code starts, diverge and then converge again. The code in these two functions looks like a diamond. I agree with your premise but disagree with your conclusion. I don't think that having four iterator classes would be worth the amount of complexity that could be avoided. There are only 4 variables inside CodePoints, and each of the 4 is a (variation on an) int.

Myles C. Maxfield

Comment 11 2014-09-29 12:18:20 PDT

(In reply to comment #8) > I have two comments on this change: > > 1. There is no way to use this code to traverse the string backward. The std::string creates reverse beginning and reverse end iterators which can be used to traverse the string backward. Maybe you need to do the same for your iterators. Or maybe we can add operator--() to your iterators. > > 2. It was had for me to keep track to with your code in StringView::CodePoints::Iterator::operator++() and StringView::CodePoints::Iterator::operator*(). It seems you have two different cases: is8Bit() and !is8Bit(). When is8Bit() is true you do not need to cache the values m_nextIndexCache and m_codepointCache. But you are doing the following: > > a) you are not initializing m_codepointCache in the constructor. I know it is guarded by having m_cacheIsValid set to false in the constructor. And m_codepointCache is going to be fixed in operator++() or operator*() if is8Bit() is false. But I think initializing it to nullptr in the constructor might make things clearer. > b) In StringView::CodePoints::Iterator::operator*() and if m_stringView.is8Bit() you are ++m_nextIndexCache. This looks weird to me to have the operator*() increments the index cache in this cache. And actually it can produce bugs: Consider this example: > StringView strView = StringView(String("ABC")).codePoints(); > Iterator it = strView.begin(); > std::cout << *it; > std::cout << *it; > it++; > std::cout << *it; > > Result will be: > AAC > > Expected: > AAB > > The reason is in operator*() we moved the m_nextIndexCache twice and in operator++() we set the m_index to m_nextIndexCache which was '2'. > c) I would suggest in StringView::CodePoints::Iterator::operator*() and StringView::CodePoints::Iterator::operator++() to handle the cases is8Bit() and !is8Bit() in completely different code paths since when is8Bit() is true, you do not need the cache and when is8Bit() is true you need the cache. I personally like the code when it looks like a tree: The code starts and diverges and keep diverging. But I got confused when it looks like a diamond: The code starts, diverge and then converge again. The code in these two functions looks like a diamond. I've added some ASSERTs that should make it more obvious what is going on.

Myles C. Maxfield

Comment 12 2014-09-29 12:18:44 PDT

Created attachment 238865 [details] Patch

Myles C. Maxfield

Comment 13 2014-09-30 09:48:12 PDT

Created attachment 238931 [details] Patch

Darin Adler

Comment 14 2014-10-02 16:17:54 PDT

Comment on attachment 238931 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=238931&action=review review- because we don’t want StringView == StringView to do what is coded here. > Source/WTF/wtf/text/StringView.h:41 > + class CodePoints; > + class CodeUnits; I suggest defining these right before they are used instead of at the top of the class. > Source/WTF/wtf/text/StringView.h:188 > + bool operator==(const StringView&) const; > + bool operator!=(const StringView&) const; These operators should be implemented as free functions, not member functions. If these are member functions, then no type conversions can be done on the left side of the == and !=, but if they are free functions they can be. > Source/WTF/wtf/text/StringView.h:247 > +#if !ASSERT_DISABLED > + mutable Mode m_mode; > +#endif I think a boolean named: bool m_alreadyIncremented; or something like that, would be clearer than this enum, which I find mysterious. > Source/WTF/wtf/text/StringView.h:339 > +inline bool StringView::operator==(const StringView& other) const > +{ > + return m_characters == other.m_characters && m_length == other.m_length; > +} I don’t think it’s good idea to have a StringView == operator that checks identity rather than contents of the strings. I think an == operator on two string views should compare the actual characters, not the pointers. And should return true if the two views look at two different ranges of memory with the same characters, even if one is 8-bit and the other 16-bit. Or we could just not have an == operator, but having one with this meaning seems like a bad idea. > Source/WTF/wtf/text/StringView.h:396 > + return m_stringView == other.m_stringView && m_index == other.m_index; This should assert that &m_stringView == &other.m_stringView and should not look at m_stringView at all in non-debug builds. > Source/WTF/wtf/text/StringView.h:438 > + return m_stringView == other.m_stringView && m_index == other.m_index; This should assert that &m_stringView == &other.m_stringView and should not look at m_stringView at all in non-debug builds.

Myles C. Maxfield

Comment 15 2014-10-02 17:01:50 PDT

Comment on attachment 238931 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=238931&action=review >> Source/WTF/wtf/text/StringView.h:41 >> + class CodeUnits; > > I suggest defining these right before they are used instead of at the top of the class. Done. >> Source/WTF/wtf/text/StringView.h:188 >> + bool operator!=(const StringView&) const; > > These operators should be implemented as free functions, not member functions. If these are member functions, then no type conversions can be done on the left side of the == and !=, but if they are free functions they can be. Given your comment below, I think providing these operators is out of scope for this patch. >> Source/WTF/wtf/text/StringView.h:247 >> +#endif > > I think a boolean named: > > bool m_alreadyIncremented; > > or something like that, would be clearer than this enum, which I find mysterious. Done. >> Source/WTF/wtf/text/StringView.h:339 >> +} > > I don’t think it’s good idea to have a StringView == operator that checks identity rather than contents of the strings. I think an == operator on two string views should compare the actual characters, not the pointers. And should return true if the two views look at two different ranges of memory with the same characters, even if one is 8-bit and the other 16-bit. > > Or we could just not have an == operator, but having one with this meaning seems like a bad idea. Given your comment below, I think providing an == operator is out of scope for this patch. There are different ways to look at the characters inside a StringView: we could compare them code point by code point, or we could use ICU to canonicalize the strings and then compare the canonicalized versions. We could compare code units instead if we wanted to avoid the U16_NEXT() and know that both strings are in UTF-16. >> Source/WTF/wtf/text/StringView.h:396 >> + return m_stringView == other.m_stringView && m_index == other.m_index; > > This should assert that &m_stringView == &other.m_stringView and should not look at m_stringView at all in non-debug builds. Done. >> Source/WTF/wtf/text/StringView.h:438 >> + return m_stringView == other.m_stringView && m_index == other.m_index; > > This should assert that &m_stringView == &other.m_stringView and should not look at m_stringView at all in non-debug builds. Done.

Myles C. Maxfield

Comment 16 2014-10-02 17:31:09 PDT

Created attachment 239162 [details] Patch

Darin Adler

Comment 17 2014-10-02 17:36:15 PDT

Comment on attachment 239162 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=239162&action=review > Source/WTF/wtf/text/StringView.h:223 > + mutable bool m_alreadyIncremented; I guess this isn’t as clear as I thought it would be, because I meant exactly the opposite of what you did! > Source/WTF/wtf/text/StringView.h:389 > + m_alreadyIncremented = false; My idea is that we would be setting this to true here, because as part of the * operator, we “already incremented” the iterator.

Darin Adler

Comment 18 2014-10-03 09:08:20 PDT

Comment on attachment 239162 [details] Patch This is OK, but I think the m_alreadyIncremented boolean is confusing. What I proposed was actually the opposite of what you wrote, showing that the two of us don’t even agree on this!

Myles C. Maxfield

Comment 19 2014-10-03 09:11:32 PDT

(In reply to comment #18) > (From update of attachment 239162 [details]) > This is OK, but I think the m_alreadyIncremented boolean is confusing. What I proposed was actually the opposite of what you wrote, showing that the two of us don’t even agree on this! Interesting. I thought you meant I should make line 380 set it to true because, once you've run the increment operator, the iterator has already been incremented, so you can't increment it again.

Darin Adler

Comment 20 2014-10-03 09:17:04 PDT

(In reply to comment #19) > (In reply to comment #18) > > This is OK, but I think the m_alreadyIncremented boolean is confusing. What I proposed was actually the opposite of what you wrote, showing that the two of us don’t even agree on this! > > Interesting. I thought you meant I should make line 380 set it to true because, once you've run the increment operator, the iterator has already been incremented, so you can't increment it again. Yes, makes logical sense but is not what I meant. I meant that once you use the * operator the iterator is in an unusual state. It has already been incremented, so when you then actually use the ++ operator you don’t have to do any incrementing. Further, it’s not safe to do operations like == on it when it’s in this “already incremented” state because they will yield incorrect results. So, critically, StringView::CodePoints::Iterator::operator== is missing assertions that check that neither of the two iterators is in this unusual “already incremented” state, whatever we call the boolean.

Myles C. Maxfield

Comment 21 2014-10-03 09:34:44 PDT

(In reply to comment #20) > (In reply to comment #19) > > (In reply to comment #18) > > > This is OK, but I think the m_alreadyIncremented boolean is confusing. What I proposed was actually the opposite of what you wrote, showing that the two of us don’t even agree on this! > > > > Interesting. I thought you meant I should make line 380 set it to true because, once you've run the increment operator, the iterator has already been incremented, so you can't increment it again. > > Yes, makes logical sense but is not what I meant. > > I meant that once you use the * operator the iterator is in an unusual state. It has already been incremented, so when you then actually use the ++ operator you don’t have to do any incrementing. Further, it’s not safe to do operations like == on it when it’s in this “already incremented” state because they will yield incorrect results. So, critically, StringView::CodePoints::Iterator::operator== is missing assertions that check that neither of the two iterators is in this unusual “already incremented” state, whatever we call the boolean. Ah, I understand. Your thinking of the term "increment" is referring to the internal m_index variable, and my thinking of the term "increment" is referring to the operator++ API call. I'll switch the boolean (and add the missing ASSERTs).

Myles C. Maxfield

Comment 22 2014-10-03 10:58:11 PDT

http://trac.webkit.org/changeset/174271

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component New Bugs

Assignee

Myles C. Maxfield

Reported

2014-09-26 19:49 PDT

Modified

2014-10-03 10:58 PDT History

CC List

10 users Show

URL

Keywords

Depends on

Blocks