ASSIGNED 209804
[WPE][GTK] Avoid copying JSON source in webkit_user_content_filter_store_save() and friends
https://bugs.webkit.org/show_bug.cgi?id=209804
Summary [WPE][GTK] Avoid copying JSON source in webkit_user_content_filter_store_save...
Adrian Perez
Reported 2020-03-31 05:56:13 PDT
API functions webkit_user_content_filter_store_save() and webkit_user_content_filter_store_save_from_file() result in passing a GBytes to webkitUserContentFilterStoreSaveBytes(), the the later instantiates a WTF::String with the contents of the GBytes. This last step creates a copy of the data, which can be potentially big (in the megabytes magnitude). Moreover, webkit_user_content_filter_store_save_from_file() will try to mmap() local files (good ☺) but to the data copy to create the WTF::String will make the UI thread block on page faults to load the data from disk, defeating the purpose of the optimization (bad ☹). It should be possible to use a WTF::ExternalStringImpl to create a WTF::String pointing to the data from the GBytes without copying, keeping a ref so the GBytes stays alive and dropping the ref in ExternalStringImplFreeFunction callback.
Attachments
Patch (2.84 KB, patch)
2020-03-31 07:46 PDT, Adrian Perez
no flags
Patch (3.07 KB, patch)
2020-03-31 14:31 PDT, Adrian Perez
aperez: review?
aperez: commit-queue?
Adrian Perez
Comment 1 2020-03-31 07:46:39 PDT
EWS Watchlist
Comment 2 2020-03-31 07:47:21 PDT
Thanks for the patch. If this patch contains new public API please make sure it follows the guidelines for new WebKit2 GTK+ API. See http://trac.webkit.org/wiki/WebKitGTK/AddingNewWebKit2API
Charlie Turner
Comment 3 2020-03-31 09:07:42 PDT
Comment on attachment 395037 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395037&action=review > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:193 > + RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { }); It seems like the rule list compiler will copy one way or the other though. Before this patch, we're getting a copy in the call to the compiler, after this patch, we get a copy when the rule list is dispatched to the compile queue?
Adrian Perez
Comment 4 2020-03-31 09:25:35 PDT
(In reply to Charlie Turner from comment #3) > Comment on attachment 395037 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=395037&action=review > > > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:193 > > + RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { }); > > It seems like the rule list compiler will copy one way or the other though. > Before this patch, we're getting a copy in the call to the compiler, after > this patch, we get a copy when the rule list is dispatched to the compile > queue? Well, you do have a point: there is a call to String::isolatedCopy() in ContentRuleListStore::compileContentRuleList(). But this patch still saves one memory copy, which is a good thing. Note that it String::isolatedCopy() calls ::isSafeToSendToAnotherThread() and if that returns ”true” then it just moves the object instead of making a copy… In this case the String should have only one ref, because we are creating it and moving it into the call to ::compileContentRuleList() so I *think* that the copy in this case is being avoided.
Charlie Turner
Comment 5 2020-03-31 09:41:19 PDT
(In reply to Adrian Perez from comment #4) > (In reply to Charlie Turner from comment #3) > > Comment on attachment 395037 [details] > > Patch > > > > View in context: > > https://bugs.webkit.org/attachment.cgi?id=395037&action=review > > > > > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:193 > > > + RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { }); > > > > It seems like the rule list compiler will copy one way or the other though. > > Before this patch, we're getting a copy in the call to the compiler, after > > this patch, we get a copy when the rule list is dispatched to the compile > > queue? > > Well, you do have a point: there is a call to String::isolatedCopy() in > ContentRuleListStore::compileContentRuleList(). But this patch still > saves one memory copy, which is a good thing. > > Note that it String::isolatedCopy() calls ::isSafeToSendToAnotherThread() > and if that returns ”true” then it just moves the object instead of making > a copy… In this case the String should have only one ref, because we are > creating it and moving it into the call to ::compileContentRuleList() so > I *think* that the copy in this case is being avoided. I started reading the const& overload of isolatedCopy, not the &&-version... An ASSERT() at the callsite about the refCount could be useful to easily catch a performance regression. I wonder if the compiler could be adjusted to do its work from a immutable StringView, don't see why it'd need to modify the source. Anyway, drive-by over. Informal r+ from me :)
Adrian Perez
Comment 6 2020-03-31 14:30:16 PDT
(In reply to Charlie Turner from comment #5) > (In reply to Adrian Perez from comment #4) > > (In reply to Charlie Turner from comment #3) > > > Comment on attachment 395037 [details] > > > Patch > > > > > > View in context: > > > https://bugs.webkit.org/attachment.cgi?id=395037&action=review > > > > > > > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:193 > > > > + RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { }); > > > > > > It seems like the rule list compiler will copy one way or the other though. > > > Before this patch, we're getting a copy in the call to the compiler, after > > > this patch, we get a copy when the rule list is dispatched to the compile > > > queue? > > > > Well, you do have a point: there is a call to String::isolatedCopy() in > > ContentRuleListStore::compileContentRuleList(). But this patch still > > saves one memory copy, which is a good thing. > > > > Note that it String::isolatedCopy() calls ::isSafeToSendToAnotherThread() > > and if that returns ”true” then it just moves the object instead of making > > a copy… In this case the String should have only one ref, because we are > > creating it and moving it into the call to ::compileContentRuleList() so > > I *think* that the copy in this case is being avoided. > > I started reading the const& overload of isolatedCopy, not the &&-version... > > An ASSERT() at the callsite about the refCount could be useful to easily > catch a performance regression. Even better, we can construct the String first, then write the assertion as “ASSERT(jsonString.isSafeToSendToAnotherThread())”, and then move the string. Thanks for the idea! > I wonder if the compiler could be adjusted to do its work from a immutable > StringView, don't see why it'd need to modify the source. Anyway, drive-by > over. Informal r+ from me :) Maybe it would make sense to change this, but that would a separate bug :) Note in case that gets done: Instead of capturing the GRefPtr<GBytes> object we would need to keep it around in the SaveTaskData struct.
Adrian Perez
Comment 7 2020-03-31 14:31:45 PDT
Carlos Garcia Campos
Comment 8 2020-04-02 01:48:10 PDT
Comment on attachment 395091 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395091&action=review > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:192 > + RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { }); Bytes data is UTF-8 and String expects UTF-16. Aren't we losing the conversion here?
Adrian Perez
Comment 9 2020-04-02 05:51:21 PDT
(In reply to Carlos Garcia Campos from comment #8) > Comment on attachment 395091 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=395091&action=review > > > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:192 > > + RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { }); > > Bytes data is UTF-8 and String expects UTF-16. Aren't we losing the > conversion here? String can be *both* 8-bit (ASCII, UTF-8) or 16-bit (UTF-16). It depends on how the StringImpl backing it is constructed. In the case of this patch, note that the following factory function is used: ExternalStringImpl::create(const LChar* characters, unsigned length, ExternalStringImplFreeFunction&&) This takes an 8-bit string (LChar*, I think the “L” there is for “Latin1 Character” but don't take my word for it), and ends up calling: StringImpl::createWithoutCopying(const LChar*, unsigned length) which constructs with: StringImpl::StringImpl(const LChar* characters, unsigned length, ConstructWithoutCopyingTag) and ends up in with: inline StringImplShape::StringImplShape(unsigned refCount, unsigned length, const LChar* data8, unsigned hashAndFlags) : m_refCount(refCount) , m_length(length) , m_data8(data8) , m_hashAndFlags(hashAndFlags) { } Note how this sets the m_data8 member, which means the string has 8-bit character data. So… yes, the GBytes is always 8-bit data, but we are constructing a WTF::String which *knows* that the data is 8-bit. Side note: all this is fine and dandy, and no UTF-16 conversions are done, ever, because the JSON spec mandates UTF-8 and therefore the content extensions compiler expects UTF-8 as input as well. Life is always better on the UTF-8 side of life 😎️
Adrian Perez
Comment 10 2020-04-02 05:52:32 PDT
…not to mention that the content extensions layout tests are passing with this patch applied, too!
Darin Adler
Comment 11 2020-04-06 18:44:29 PDT
Comment on attachment 395091 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395091&action=review >>> Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:192 >>> + RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { }); >> >> Bytes data is UTF-8 and String expects UTF-16. Aren't we losing the conversion here? > > String can be *both* 8-bit (ASCII, UTF-8) or 16-bit (UTF-16). It depends > on how the StringImpl backing it is constructed. In the case of this patch, > note that the following factory function is used: > > ExternalStringImpl::create(const LChar* characters, unsigned length, > ExternalStringImplFreeFunction&&) > > This takes an 8-bit string (LChar*, I think the “L” there is for “Latin1 > Character” but don't take my word for it), and ends up calling: > > StringImpl::createWithoutCopying(const LChar*, unsigned length) > > which constructs with: > > StringImpl::StringImpl(const LChar* characters, unsigned length, > ConstructWithoutCopyingTag) > > and ends up in with: > > inline StringImplShape::StringImplShape(unsigned refCount, > unsigned length, const LChar* data8, unsigned hashAndFlags) > : m_refCount(refCount) > , m_length(length) > , m_data8(data8) > , m_hashAndFlags(hashAndFlags) > { > } > > Note how this sets the m_data8 member, which means the string has > 8-bit character data. > > So… yes, the GBytes is always 8-bit data, but we are constructing a > WTF::String which *knows* that the data is 8-bit. Side note: all this > is fine and dandy, and no UTF-16 conversions are done, ever, because > the JSON spec mandates UTF-8 and therefore the content extensions > compiler expects UTF-8 as input as well. Life is always better on > the UTF-8 side of life 😎️ That’s close, but incorrect. WTF::String can be 8-bit (ASCII, Latin-1) or 16-bit (UTF-16). It cannot be UTF-8. That’s unfortunate because we all love UTF-8.
Darin Adler
Comment 12 2020-04-06 18:44:55 PDT
And yes, LChar means "Latin-1 character".
Darin Adler
Comment 13 2020-04-06 18:48:27 PDT
Comment on attachment 395091 [details] Patch This patch creates a bug where we no longer correctly handle non-ASCII characters. We can do this optimization of the string is all ASCII, but if there are any non-ASCII characters, then we need to use String::fromUTF8 to translate them to UTF-16. You should construct a test case to demonstrate this in case I am wrong.
Adrian Perez
Comment 14 2020-04-10 05:55:23 PDT
(In reply to Darin Adler from comment #13) > Comment on attachment 395091 [details] > Patch > > This patch creates a bug where we no longer correctly handle non-ASCII > characters. We can do this optimization of the string is all ASCII, but if > there are any non-ASCII characters, then we need to use String::fromUTF8 to > translate them to UTF-16. > > You should construct a test case to demonstrate this in case I am wrong. You are indeed right: so far I had been testing only with JSON content extension rulesets which are all ASCII. As you suggested to me by chat, an option would be to check whether the input JSON source string is all ASCII and avoid the copy in that case and otherwise use String::fromUTF8(). Another option would be to change ContentRuleListStore to use data buffers, likely WebCore::SharedBuffer, which can wrap GBytes/NSData/CFData/etc.—but this would be a bigger rework, so let's repurpose this bug for the smaller optimization for now.
Darin Adler
Comment 15 2020-04-10 08:51:53 PDT
(In reply to Adrian Perez from comment #14) > Another option would be to change ContentRuleListStore to use data buffers, > likely WebCore::SharedBuffer, which can wrap GBytes/NSData/CFData/etc.—but > this would be a bigger rework, so let's repurpose this bug for the smaller > optimization for now. Please note that the character set issue gets even more challenging if we use data buffers, unless we are deciding they are UTF-8 on all platforms.
Adrian Perez
Comment 16 2020-04-10 09:38:26 PDT
(In reply to Darin Adler from comment #15) > (In reply to Adrian Perez from comment #14) > > Another option would be to change ContentRuleListStore to use data buffers, > > likely WebCore::SharedBuffer, which can wrap GBytes/NSData/CFData/etc.—but > > this would be a bigger rework, so let's repurpose this bug for the smaller > > optimization for now. > > Please note that the character set issue gets even more challenging if we > use data buffers, unless we are deciding they are UTF-8 on all platforms. I think we should not assume any encoding for data buffers: they are just a pile of bytes. In this particular case, my thinking is that it would be fine to assume that JSON must be UTF-8 because RFC-8295 mandates it for exchange—and loading a JSON rule set which may have been generated by some other software or even by hand qualifies as “JSON text exchanged between systems that are not part of a closed ecosystem” (quoting the RFC, see https://tools.ietf.org/html/rfc8259#section-8.1). ;-)
Darin Adler
Comment 17 2020-04-10 09:45:33 PDT
(In reply to Adrian Perez from comment #16) > I think we should not assume any encoding for data buffers: they are > just a pile of bytes. My point exactly. We can’t replace strings with data buffers because strings come with information about character encoding and buffers do not, we have to replace strings with data buffers along with information about character encoding. > In this particular case, my thinking is that it would be fine to assume > that JSON must be UTF-8 OK.
Note You need to log in before you can comment on or make changes to this bug.