Bug 219091

Summary: [Concurrent display lists] Synchronize display list rendering across remote image buffers
Product: WebKit
Reporter: Wenson Hsieh <wenson_hsieh>
Component: WebKit2
Assignee: Wenson Hsieh <wenson_hsieh>
Status: RESOLVED FIXED
Severity: Normal
CC: ggaren, rniwa, sabouhallawa, simon.fraser, thorton, webkit-bug-importer
Priority: P2
Keywords: InRadar
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Bug Depends on: 219067, 219089, 219130, 219240, 219262
Bug Blocks: 219586
Attachments (description / flags):
- Work in progress (none)
- Work in progress (none)
- Rebase on trunk (ews-feeder: commit-queue-)
- Rebase on trunk (ews-feeder: commit-queue-)
- Try to fix Win build (none)
- v2 (none)
- v2 (remove unused variable) (none)
- v2 (rebase on trunk) (ggaren: review+)
- Rebase on trunk again (none)

Description Wenson Hsieh 2020-11-18 08:29:37 PST
Remove the display list item batching mechanism in RemoteImageBufferProxy. Instead, we should:

- Attempt to wake up the GPU process as early as possible (i.e. as soon as the first item is appended)
- Always bump the unread bytes count when appending items.
Comment 1 Wenson Hsieh 2020-11-19 15:26:27 PST
(In reply to Wenson Hsieh from comment #0)
> Remove the display list item batching mechanism in RemoteImageBufferProxy.
> Instead, we should:
> 
> - Attempt to wake up the GPU process as early as possible (i.e. as soon as
> the first item is appended)

(Rather than "as early as possible", this should actually be generic, so that it's "after n items")

> - Always bump the unread bytes count when appending items.
Comment 2 Wenson Hsieh 2020-11-24 15:19:37 PST
Created attachment 414854 [details]
Work in progress
Comment 3 Wenson Hsieh 2020-11-25 13:10:10 PST
Created attachment 414868 [details]
Work in progress
Comment 4 Radar WebKit Bug Importer 2020-11-26 03:18:35 PST
<rdar://problem/71747695>
Comment 5 Wenson Hsieh 2020-11-27 12:16:19 PST
Created attachment 414965 [details]
Rebase on trunk
Comment 6 Wenson Hsieh 2020-11-27 12:46:15 PST
Created attachment 414967 [details]
Rebase on trunk
Comment 7 Wenson Hsieh 2020-11-27 13:35:15 PST
Created attachment 414970 [details]
Try to fix Win build
Comment 8 Tim Horton 2020-12-01 18:12:32 PST
Comment on attachment 414970 [details]
Try to fix Win build

View in context: https://bugs.webkit.org/attachment.cgi?id=414970&action=review

> Source/WebKit/WebProcess/GPU/graphics/RemoteRenderingBackendProxy.cpp:126
> +    if (imageBuffer->isAccelerated())

We need to fix this

> Source/WebKit/WebProcess/GPU/graphics/RemoteRenderingBackendProxy.cpp:170
> +    // FIXME: We can remove this either once we introduce a GetImageData display list item.

either??
Comment 9 Tim Horton 2020-12-01 23:50:10 PST
Comment on attachment 414970 [details]
Try to fix Win build

View in context: https://bugs.webkit.org/attachment.cgi?id=414970&action=review

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:148
> +            // FIXME: We can't block forever here while waiting for new display list items.
> +            // We will need some mechanism here to stop waiting after a certain time limit
> +            // and coordinate with the web process so that we receive a new wakeup message
> +            // when there are more items to be processed.
> +            sleep(8_us);

"Here", at the moment, is the GPU process main thread, right? While I recognize that you've got a FIXME, and a plan, can we even land it in this state? Won't things that come via IPC (e.g. Media stuff) get dropped on the floor (or, rather, stuck...)?
Comment 10 Wenson Hsieh 2020-12-02 08:04:41 PST
Comment on attachment 414970 [details]
Try to fix Win build

View in context: https://bugs.webkit.org/attachment.cgi?id=414970&action=review

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:148
>> +            sleep(8_us);
> 
> "Here", at the moment, is the GPU process main thread, right? While I recognize that you've got a FIXME, and a plan, can we even land it in this state? Won't things that come via IPC (e.g. Media stuff) get dropped on the floor (or, rather, stuck...)?

Yeah, I was pretty iffy about this bit...I had considered preserving the "old model" of going to sleep in the GPUP when we're done processing all display list items, and then just taking the resulting (sizable) performance hit on MM — at least, until we figure out a fast way for the web process to signal the GPUP to let it know that it should do more work. Perhaps this is preferable (take the temporary perf hit and try to recover it later)?

>> Source/WebKit/WebProcess/GPU/graphics/RemoteRenderingBackendProxy.cpp:126
>> +    if (imageBuffer->isAccelerated())
> 
> We need to fix this

Yeah, I'd like to make this better too. Maybe endDisplayList() could be a no-op stub on ImageBuffer that's implemented in RemoteImageBufferProxy? (I'm not sure if that would be an improvement, but it would remove the need for this casting and isAccelerated check)

>> Source/WebKit/WebProcess/GPU/graphics/RemoteRenderingBackendProxy.cpp:170
>> +    // FIXME: We can remove this either once we introduce a GetImageData display list item.
> 
> either??

Whoops, fixed.
Comment 11 Geoffrey Garen 2020-12-02 12:43:39 PST
Comment on attachment 414970 [details]
Try to fix Win build

View in context: https://bugs.webkit.org/attachment.cgi?id=414970&action=review

> Source/WebCore/platform/graphics/displaylists/DisplayListItemBuffer.cpp:915
> +void ItemBuffer::didAppendDataToWritableBuffer(size_t numberOfBytes)

I think it makes code harder to follow when, transiting from point A -> B -> C, the name of something changes ever so slightly. In this case, we have appendEncodedData -> didAppendDataToWritableBuffer -> didAppendData. But I think all three transit points are talking about the same event. So I don't understand what we're trying to convey when we drop "EncodedData", then add "ToWritableBuffer", then remove "ToWritableBuffer". Can you change this sequence to use more consistent naming? I think appendEncodedData -> didAppend -> didAppend might be best. But I might even drop "EncodedData" from "appendEncodedData"; C++ will do the right thing.

> Source/WebCore/platform/graphics/displaylists/DisplayListItemBuffer.cpp:918
> +    if (m_writingClient)

I'm curious: Under what conditions do we intend to append in a mode where m_writingClient is null?

> Source/WebKit/GPUProcess/graphics/DisplayListReaderHandle.cpp:34
> +    auto previousUnreadBytes = header().unreadBytes.exchangeSub(amount);

You can use std::memory_order::relaxed here. This logic doesn't guard any reads or writes -- it just needs to atomically update our shared understanding of how many bytes are left.

> Source/WebKit/GPUProcess/graphics/DisplayListReaderHandle.cpp:37
> +        ASSERT_NOT_REACHED();

Until we make it terminate the web process, this should be RELEASE_ASSERT.
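A minimal sketch of these two suggestions together, using std::atomic as a stand-in (the names and types here are illustrative, not WebKit's actual WTF::Atomic or DisplayListReaderHandle API): a relaxed decrement of the shared unread-byte count, plus a release-mode crash on underflow instead of a debug-only assertion.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Reader side: after consuming `amount` bytes of display list data, shrink
// the shared unread-byte count. std::memory_order_relaxed suffices here
// because this counter doesn't guard any other reads or writes -- it only
// keeps the shared byte count itself consistent.
struct ReaderHandleSketch {
    std::atomic<size_t> unreadBytes { 0 };

    size_t didConsume(size_t amount)
    {
        size_t previousUnreadBytes = unreadBytes.fetch_sub(amount, std::memory_order_relaxed);
        // Mirrors the RELEASE_ASSERT suggestion: underflow means the writer
        // claimed more data than was available, so crash loudly even in
        // release builds rather than silently wrapping around.
        if (previousUnreadBytes < amount) {
            fprintf(stderr, "unread byte count underflow\n");
            abort();
        }
        return previousUnreadBytes - amount;
    }
};
```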

>>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:148
>>> +            sleep(8_us);
>> 
>> "Here", at the moment, is the GPU process main thread, right? While I recognize that you've got a FIXME, and a plan, can we even land it in this state? Won't things that come via IPC (e.g. Media stuff) get dropped on the floor (or, rather, stuck...)?
> 
> Yeah, I was pretty iffy about this bit...I had considered preserving the "old model" of going to sleep in the GPUP when we're done processing all display list items, and then just taking the resulting (sizable) performance hit on MM — at least, until we figure out a fast way for the web process to signal the GPUP to let it know that it should do more work. Perhaps this is preferable (take the temporary perf hit and try to recover it later)?

If we're slower without this sleep then we must have a bug.

Think about it: Previously, the GPU Process only learned about new items every 512 items. That meant that the GPU Process could go to sleep with [0, 512) items still in the queue. That meant that the GPU Process went to sleep more than it had to and had to be woken up more.

In this patch, the GPU Process learns about new items right away. That means that the GPU Process can only go to sleep with [0, 1) items in the queue. That means that the GPU Process will go to sleep less and have to be woken up less.

If a version of this patch that allows the GPU Process to go to sleep causes the GPU Process to go to sleep *more*, that has to be a bug. We should look at that version of the patch and debug it.

> Source/WebKit/Shared/SharedDisplayListHandle.h:53
> +        return header().unreadBytes.load();

You can use std::memory_order::acquire here for better performance. The default (std::memory_order_seq_cst) guarantees that all previous instructions commit before this instruction commits, and no subsequent instructions commit before this instruction commits. But we only need to ensure that all subsequent reads from the buffer commit after this instruction commits. std::memory_order::acquire does that.
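A sketch of that acquire load (names illustrative, not WebKit's exact header layout): the reader loads unreadBytes with acquire ordering, which pairs with the writer's release increment so that subsequent reads of the item buffer observe everything the writer published, without paying for a full seq_cst fence.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct SharedHandleSketch {
    std::atomic<size_t> unreadBytes { 0 };

    size_t unreadBytesForReading() const
    {
        // acquire: reads of buffer contents issued after this load are
        // ordered after it, pairing with the writer's release store.
        return unreadBytes.load(std::memory_order_acquire);
    }
};
```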

> Source/WebKit/WebProcess/GPU/graphics/DisplayListWriterHandle.cpp:39
>      if (UNLIKELY(checkedWritableOffset.hasOverflowed()))
>          RELEASE_ASSERT_NOT_REACHED();

Wouldn't just calling get() do this for you?

> Source/WebKit/WebProcess/GPU/graphics/DisplayListWriterHandle.cpp:43
> +    CheckedSize checkedUnreadBytes = header().unreadBytes.exchangeAdd(amount);

You can use std::memory_order::release here for better performance. The default (std::memory_order_seq_cst) guarantees that all previous instructions commit before this instruction commits, and no subsequent instructions commit before this instruction commits. But we only need to ensure that all previous writes to the buffer commit before this instruction commits. std::memory_order::release does that.

Related side note: It's slightly more efficient to do this exchange at the top of the function. That way, you're saying that it's OK to proceed with all the checked arithmetic in this function without waiting for the unreadBytes change to commit.
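A sketch of the writer side under both suggestions (names and types illustrative): the release exchange-add happens first, so any subsequent checked arithmetic in the function doesn't have to wait for the counter update to commit, and release ordering guarantees that a reader doing an acquire load of the new count also sees all buffer writes made before the increment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct WriterHandleSketch {
    std::atomic<size_t> unreadBytes { 0 };

    size_t didAppend(size_t amount)
    {
        // Done at the top of the function, per the side note. release: every
        // write into the shared item buffer made before this point becomes
        // visible to a reader that acquire-loads the updated count.
        size_t previousUnreadBytes = unreadBytes.fetch_add(amount, std::memory_order_release);
        return previousUnreadBytes + amount;
    }
};
```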

> Source/WebKit/WebProcess/GPU/graphics/DisplayListWriterHandle.cpp:48
>      return checkedUnreadBytes.unsafeGet();

Same comment about get().

> Source/WebKit/WebProcess/GPU/graphics/DisplayListWriterHandle.cpp:76
> +    if (header().hasReader.load())
> +        return false;

Why can't we move the write cursor when we have a reader? If the reader read everything, we should be good to go.

> Source/WebKit/WebProcess/GPU/graphics/RemoteImageBufferProxy.h:243
> +        if (m_remoteRenderingBackendProxy->setDestinationImageBuffer(m_renderingResourceIdentifier) == RemoteRenderingBackendProxy::IsFirstDestination::Yes)
> +            m_isAppendingDisplayListItemForInitialDestinationImageBuffer = true;
>      }
>  
> -    void willAppendItemOfType(WebCore::DisplayList::ItemType) override
> +    void didAppendData(const WebCore::DisplayList::ItemBufferHandle& handle, size_t numberOfBytes) override
>      {
> -        constexpr size_t DisplayListBatchSize = 512;
> -        auto& displayList = m_drawingContext.displayList();
> -        if (++m_itemCountInCurrentDisplayList < DisplayListBatchSize)
> +        if (UNLIKELY(!m_remoteRenderingBackendProxy))
>              return;
>  
> -        m_itemCountInCurrentDisplayList = 0;
> -        submitDisplayList(displayList);
> -        displayList.clear();
> +        auto needsWakeupMessage = m_isAppendingDisplayListItemForInitialDestinationImageBuffer
> +            ? RemoteRenderingBackendProxy::NeedsWakeupMessage::Yes
> +            : RemoteRenderingBackendProxy::NeedsWakeupMessage::No;

This might be the source of the performance bug. It seems that this patch removes *all* batching. So, we send one item right away, and then invite the GPU Process to go to sleep right away. Any display list item for which processing is faster than generation will cause a sleep. That's a lot of sleep!

In a way, this changes the best case into the worst case: When the GPU Process is hot and ready to crush its workload, we virtually guarantee that the next display list item will trigger a sleep. Kind of like the Hare in the Tortoise and the Hare.

Instead, I'd suggest maintaining a batch limit of 512 before we wake up, but always keeping the unread bytes count up to date eagerly.
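The suggested policy could be sketched like this (constants and names are illustrative): the unread byte count is still bumped eagerly on every append, so a GPU Process that is already awake keeps draining items, but the more expensive IPC wakeup message is only sent once 512 items have accumulated while it is asleep.

```cpp
#include <cassert>
#include <cstddef>

class WakeupHysteresisSketch {
public:
    static constexpr size_t itemsPerWakeup = 512;

    // Called once per appended display list item; returns true when an IPC
    // wakeup message should be sent to the GPU process.
    bool didAppendItem(bool gpuProcessIsSleeping)
    {
        if (!gpuProcessIsSleeping) {
            // The GPU process is already draining the eagerly-updated unread
            // byte count; no wakeup needed.
            m_itemsSinceWakeup = 0;
            return false;
        }
        if (++m_itemsSinceWakeup < itemsPerWakeup)
            return false;
        m_itemsSinceWakeup = 0;
        return true;
    }

private:
    size_t m_itemsSinceWakeup { 0 };
};
```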
Comment 12 Wenson Hsieh 2020-12-04 13:52:39 PST
Comment on attachment 414970 [details]
Try to fix Win build

View in context: https://bugs.webkit.org/attachment.cgi?id=414970&action=review

Thanks for the review! Comments inline:

>> Source/WebCore/platform/graphics/displaylists/DisplayListItemBuffer.cpp:915
>> +void ItemBuffer::didAppendDataToWritableBuffer(size_t numberOfBytes)
> 
> I think it makes code harder to follow when, transiting from point A -> B -> C, the name of something changes ever so slightly. In this case, we have appendEncodedData -> didAppendDataToWritableBuffer -> didAppendData. But I think all three transit points are talking about the same event. So I don't understand what we're trying to convey when we drop "EncodedData", then add "ToWritableBuffer", then remove "ToWritableBuffer". Can you change this sequence to use more consistent naming? I think appendEncodedData -> didAppend -> didAppend might be best. But I might even drop "EncodedData" from "appendEncodedData"; C++ will do the right thing.

Good point — this does make the code easier to follow. Renamed appendEncodedData to just append, and renamed didAppendDataToWritableBuffer to didAppendData to match the rest of the codepath.

>> Source/WebCore/platform/graphics/displaylists/DisplayListItemBuffer.cpp:918
>> +    if (m_writingClient)
> 
> I'm curious: Under what conditions do we intend to append in a mode where m_writingClient is null?

m_writingClient is null in the case of in-process display lists. AFAIK, these in-process display lists are only used (1) for a text painting optimization where we cache computed glyph data when painting the same text at high frequency, and (2) for some layout tests where we use it to record painting and dump it as text.

There's also a whole codepath for fully in-process DisplayList rendering, but I don't think that is really maintained anymore.

>> Source/WebKit/GPUProcess/graphics/DisplayListReaderHandle.cpp:34
>> +    auto previousUnreadBytes = header().unreadBytes.exchangeSub(amount);
> 
> You can use std::memory_order::relaxed here. This logic doesn't guard any reads or writes -- it just needs to atomically update our shared understanding of how many bytes are left.

Sounds good! I'll give `std::memory_order_relaxed` a try here.

>> Source/WebKit/GPUProcess/graphics/DisplayListReaderHandle.cpp:37
>> +        ASSERT_NOT_REACHED();
> 
> Until we make it terminate the web process, this should be RELEASE_ASSERT.

Okay — changed to a RELEASE_ASSERT.

>>>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:148
>>>> +            sleep(8_us);
>>> 
>>> "Here", at the moment, is the GPU process main thread, right? While I recognize that you've got a FIXME, and a plan, can we even land it in this state? Won't things that come via IPC (e.g. Media stuff) get dropped on the floor (or, rather, stuck...)?
>> 
>> Yeah, I was pretty iffy about this bit...I had considered preserving the "old model" of going to sleep in the GPUP when we're done processing all display list items, and then just taking the resulting (sizable) performance hit on MM — at least, until we figure out a fast way for the web process to signal the GPUP to let it know that it should do more work. Perhaps this is preferable (take the temporary perf hit and try to recover it later)?
> 
> If we're slower without this sleep then we must have a bug.
> 
> Think about it: Previously, the GPU Process only learned about new items every 512 items. That meant that the GPU Process could go to sleep with [0, 512) items still in the queue. That meant that the GPU Process went to sleep more than it had to and had to be woken up more.
> 
> In this patch, the GPU Process learns about new items right away. That means that the GPU Process can only go to sleep with [0, 1) items in the queue. That means that the GPU Process will go to sleep less and have to be woken up less.
> 
> If a version of this patch that allows the GPU Process to go to sleep causes the GPU Process go to sleep *more*, that has to be a bug. We should look at that version of the patch, and debug it.

I've rewritten parts of this patch so that:
- The GPUP now goes to sleep when it's done processing all unread bytes, instead of waiting (indefinitely) for more data.
- We now have a configurable hysteresis (I'm currently starting at 512 items) between when we first append unread data after the GPUP has gone to sleep, and when we actually send the wakeup message.
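The reworked consumer model described above could be sketched as follows (names illustrative, item decoding elided): on each wakeup, the GPU process drains everything currently unread and then simply returns to its run loop instead of spin-sleeping on the main thread, so the next wakeup IPC message restarts it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct ConsumerSketch {
    std::atomic<size_t> unreadBytes { 0 };
    size_t itemsProcessed { 0 };

    // Drain all currently unread bytes, then return ("go to sleep") rather
    // than blocking the GPU process main thread waiting for more data.
    void handleWakeup(size_t bytesPerItem)
    {
        while (true) {
            size_t unread = unreadBytes.load(std::memory_order_acquire);
            if (!unread)
                return; // Done: back to the run loop until the next wakeup message.
            // (Real code decodes and replays display list items here.)
            itemsProcessed += unread / bytesPerItem;
            unreadBytes.fetch_sub(unread, std::memory_order_relaxed);
        }
    }
};
```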

>> Source/WebKit/Shared/SharedDisplayListHandle.h:53
>> +        return header().unreadBytes.load();
> 
> You can use std::memory_order::acquire here for better performance. The default (std::memory_order_seq_cst) guarantees that all previous instructions commit before this instruction commits, and no subsequent instructions commit before this instruction commits. But we only need to ensure that all subsequent reads from the buffer commit after this instruction commits. std::memory_order::acquire does that.

Good catch! Will give this a try...

>> Source/WebKit/WebProcess/GPU/graphics/DisplayListWriterHandle.cpp:39
>>          RELEASE_ASSERT_NOT_REACHED();
> 
> Wouldn't just calling get() do this for you?

Oh, true. Changed to just unsafeGet().

>> Source/WebKit/WebProcess/GPU/graphics/DisplayListWriterHandle.cpp:43
>> +    CheckedSize checkedUnreadBytes = header().unreadBytes.exchangeAdd(amount);
> 
> You can use std::memory_order::release here for better performance. The default (std::memory_order_seq_cst) guarantees that all previous instructions commit before this instruction commits, and no subsequent instructions commit before this instruction commits. But we only need to ensure that all previous writes to the buffer commit before this instruction commits. std::memory_order::release does that.
> 
> Related side note: It's slightly more efficient to do this exchange at the top of the function. That way, you're saying that it's OK to proceed with all the checked arithmetic in this function without waiting for the unreadBytes change to commit.

👍🏻 Will try `std::memory_order_release` here.

>> Source/WebKit/WebProcess/GPU/graphics/DisplayListWriterHandle.cpp:48
>>      return checkedUnreadBytes.unsafeGet();
> 
> Same comment about get().

Fixed!

>> Source/WebKit/WebProcess/GPU/graphics/DisplayListWriterHandle.cpp:76
>> +        return false;
> 
> Why can't we move the write cursor when we have a reader? If the reader read everything, we should be good to go.

I've deleted this in the latest version of this patch (which also removes the `hasReader` flag entirely).

The only reason this was here before was that if the GPUP stopped in the middle of reading to wait for more items, the WP would reset the write cursor and start appending items at the start of the buffer, but the GPUP would still expect items to come in from the offset where it began waiting. I suppose I could've made this work by adding some way for the WP to let the GPUP know that it has reset and should read from the start again, but since we now go to sleep after we're done in the GPUP, this edge case is no longer relevant.
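The safety condition described above can be sketched without any reader-presence flag (names illustrative): the writer may rewind its cursor to the start of the shared buffer only when there are no unread bytes, because a reader that is mid-buffer still expects data at its current offset.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct WriteCursorSketch {
    std::atomic<size_t> unreadBytes { 0 };
    size_t writeOffset { 0 };

    bool moveWriteCursorToStartIfPossible()
    {
        // The reader may still be partway through the buffer; don't rewind
        // underneath it. Once unreadBytes hits zero, the reader has caught
        // up (and, in the reworked model, gone to sleep), so rewinding is safe.
        if (unreadBytes.load(std::memory_order_acquire))
            return false;
        writeOffset = 0;
        return true;
    }
};
```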

>> Source/WebKit/WebProcess/GPU/graphics/RemoteImageBufferProxy.h:243
>> +            : RemoteRenderingBackendProxy::NeedsWakeupMessage::No;
> 
> This might be the source of the performance bug. It seems that this patch removes *all* batching. So, we send one item right away, and then invite the GPU Process to go to sleep right away. Any display list item for which processing is faster than generation will cause a sleep. That's a lot of sleep!
> 
> In a way, this changes the best case into the worst case: When the GPU Process is hot and ready to crush its workload, we virtually guarantee that the next display list item will trigger a sleep. Kind of like the Hare in the Tortoise and the Hare.
> 
> Instead, I'd suggest maintaining a batch limit of 512 before we wake up, but always keeping the unread bytes count up to date eagerly.

Yep! I've added an n-item hysteresis before sending the wakeup message in the latest version of my patch (which I'll upload shortly after I update the changelogs and do some more performance testing).
Comment 13 Wenson Hsieh 2020-12-04 14:26:34 PST
> > In a way, this changes the best case into the worst case: When the GPU Process is hot and ready to crush its workload, we virtually guarantee that the next display list item will trigger a sleep. Kind of like the Hare in the Tortoise and the Hare.
> > 
> > Instead, I'd suggest maintaining a batch limit of 512 before we wake up, but always keeping the unread bytes count up to date eagerly.
> 
> Yep! I've added an n-item hysteresis before sending the wakeup message in
> the latest version of my patch (which I'll upload shortly after I update the
> changelogs and do some more performance testing).

Overall, this seems to slightly regress performance on lines/paths/arcs, by ~4% (in comparison, I measured the changes in https://bugs.webkit.org/attachment.cgi?id=414970&action=review to be a 3-5% *progression*).

I think going with this approach is fine for now, but we'll need to do something to recover the lost perf (maybe by experimenting with ways to sleep for a brief period in the GPUP after unread count drops to 0, and wait for additional data).
Comment 14 Wenson Hsieh 2020-12-04 14:43:17 PST
> 
> Sounds good! I'll give `std::memory_order_relaxed` a try here.
> 
> > You can use std::memory_order::acquire here for better performance. The default (std::memory_order_seq_cst) guarantees that all previous instructions commit before this instruction commits, and no subsequent instructions commit before this instruction commits. But we only need to ensure that all subsequent reads from the buffer commit after this instruction commits. std::memory_order::acquire does that.
> 
> Good catch! Will give this a try...
> 
> > You can use std::memory_order::release here for better performance. The default (std::memory_order_seq_cst) guarantees that all previous instructions commit before this instruction commits, and no subsequent instructions commit before this instruction commits. But we only need to ensure that all previous writes to the buffer commit before this instruction commits. std::memory_order::release does that.
> > 
> > Related side note: It's slightly more efficient to do this exchange at the top of the function. That way, you're saying that it's OK to proceed with all the checked arithmetic in this function without waiting for the unreadBytes change to commit.
> 
> 👍🏻 Will try `std::memory_order_release` here.

Using the `std::memory_order_*`s in these three spots above seems to be about perf-neutral (or possibly slightly worse — the difference was not statistically significant).

I'll hold off on landing this bit for now (it seems like we ought to revisit this — maybe some other details of the implementation are preventing it from being a progression).
Comment 15 Wenson Hsieh 2020-12-04 14:56:03 PST
Created attachment 415460 [details]
v2
Comment 16 Wenson Hsieh 2020-12-04 15:00:27 PST
Created attachment 415461 [details]
v2 (remove unused variable)
Comment 17 Wenson Hsieh 2020-12-04 15:46:03 PST
Created attachment 415469 [details]
v2 (rebase on trunk)
Comment 18 Geoffrey Garen 2020-12-04 16:48:09 PST
Comment on attachment 415469 [details]
v2 (rebase on trunk)

r=me
Comment 19 Wenson Hsieh 2020-12-05 13:26:55 PST
Created attachment 415497 [details]
Rebase on trunk again
Comment 20 Wenson Hsieh 2020-12-05 14:17:37 PST
Committed r270478: <https://trac.webkit.org/changeset/270478>