Bug 230860 - [GPU Process] Use IPC streaming for concurrent display list processing
Summary: [GPU Process] Use IPC streaming for concurrent display list processing
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit2 (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Wenson Hsieh
URL:
Keywords: InRadar
Depends on: 230714 231263 231296 231301 231304 231305 231319 231326 231335 231447 231483 231484 231532 231584
Blocks: 231596
  Show dependency treegraph
 
Reported: 2021-09-27 15:10 PDT by Wenson Hsieh
Modified: 2021-10-13 15:06 PDT (History)
7 users (show)

See Also:


Attachments
Extremely hacky WIP - no resource caching (193.12 KB, patch)
2021-09-27 15:15 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
WIP - supports cached images (324.15 KB, patch)
2021-10-01 17:17 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
WIP - clip to mask, get|putImageData, draw glyphs, paint media (359.45 KB, patch)
2021-10-03 16:30 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
WIP - rebase on trunk (354.75 KB, patch)
2021-10-04 18:10 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
Final piece (still blocked on subtasks) (98.99 KB, patch)
2021-10-11 08:52 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
Rebase on trunk (101.37 KB, patch)
2021-10-11 17:38 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
Rebase again (100.79 KB, patch)
2021-10-12 09:39 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
Rebase again + fix test failures (100.28 KB, patch)
2021-10-12 15:37 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Wenson Hsieh 2021-09-27 15:10:23 PDT
.
Comment 1 Wenson Hsieh 2021-09-27 15:15:01 PDT
Created attachment 439399 [details]
Extremely hacky WIP - no resource caching
Comment 2 Kimmo Kinnunen 2021-09-28 00:23:24 PDT
Comment on attachment 439399 [details]
Extremely hacky WIP - no resource caching

View in context: https://bugs.webkit.org/attachment.cgi?id=439399&action=review

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:705
> +    MESSAGE_CHECK(m_renderingDestinationImageBuffer, "No destination image buffer specified.");

So this is a bit highlighting the pondering issue:
The "Save" is a message to the context of the destination image buffer.
How can we be in the middle of dispatching "save" to non-existing object?

The implementation would be:

void RemoteGraphicsContext::save(DisplayList::Save&& item) {
   item.apply(m_context);
}

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.messages.in:25
> +messages -> RemoteRenderingBackend NotRefCounted Stream {

I know this is WIP and I see the value of having it this way due to existing code. However, I want to put this out there so you can ponder about it. (Also I don't mean to say you didn't already)

Nomenclature:
ImageBuffer == drawing target == drawing context == display list context == roughly GraphicsContext

Now you have structure of:

GPUConnectionToWebProcess:
  - Map of RemoteRenderingBackends

RemoteRenderingBackend:
  - Map of ImageBuffers

So identity reference (id) gatekeeping is done at two levels: 
- I want to send message to RemoteRenderingBackend by this id (CreateImageBuffer)
- I want to draw to this ImageBuffer ( HandleMetaCommandChangeDestinationImageBuffer )

The first level of gatekeeping is redundant. Each individual RemoteRenderingBackend do not _really_ contribute to the algorithm.

So if you think of C++, you wouldn't necessarily have API like:

  MyRenderingBackend::handleSave(ImageBuffer, DisplayList::Save);

Instead, you would have
  MyImageBuffer::handleSave(DisplayList::Save)

Which we already actually have:
  GraphicsContext::save(...);

The current strangeness is visible in the API, since following can happen:
  HandleSave
  HandleMetaCommandChangeDestinationImageBuffer
  HandleFillRect

E.g. it is obvious that the ImageBuffers carry the state that is modified through the RemoteRenderingBackend, while RemoteRenderingBackend cannot carry that.
So then the question is: why isn't the command sent to the ImageBuffer?


So the structure could be:

GPUConnectionToWebProcess:
  - Map of RemoteImageBuffers / RemoteGraphicsContext / RemoteDisplayListContext

GPUConnectionToWebProcess::CreateImageBuffer

The "stream" signifies the sequence of ordered operations on a group of objects.

So the stream actually already has implemented `HandleMetaCommandChangeDestinationImageBuffer`, it's IPC control message `SetStreamDestinationID`.

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.messages.in:28
> +    HandleSave(WebCore::DisplayList::Save item)

Also this I know might be just because of how the WIP is organized, but mentioning just in case.

If there's value in having the args already in DisplayList items this makes sense. E.g. if we anyway construct the lists.

If we don' t construct the lists, then this wraps a message inside a message. So these could be just plain GraphicsContext args.


.messages.in:
HandleFillRect(FloatRect rect) 

proxy:
void RemoteGraphicsContextProxy::fillRect(FloatRect rect) {
  m_streamConnection->send(Messages::HandleFillRect { rect } );
}

remote:
void RemoteGraphicsContext::handleFillRect(Messages::HandleFillRect&& msg) {
  m_context.fillRect(msg.rect)
}
Comment 3 Kimmo Kinnunen 2021-09-28 00:48:02 PDT
Comment on attachment 439399 [details]
Extremely hacky WIP - no resource caching

View in context: https://bugs.webkit.org/attachment.cgi?id=439399&action=review

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.messages.in:39
> +    HandleSetLineCap(WebCore::DisplayList::SetLineCap item)

So one example where the logic of "use DisplayList:: items so we can memcpy things around and that's that" is flawed is that
the IPC layer was built inherently for the reason of security, and as such the messages need to be verified by the 
machinery. Currently the DisplayList items are insecure.
Comment 4 Wenson Hsieh 2021-09-28 09:07:55 PDT
(In reply to Kimmo Kinnunen from comment #3)
> Comment on attachment 439399 [details]
> Extremely hacky WIP - no resource caching
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=439399&action=review
> 
> > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.messages.in:39
> > +    HandleSetLineCap(WebCore::DisplayList::SetLineCap item)
> 
> So one example where the logic of "use DisplayList:: items so we can memcpy
> things around and that's that" is flawed is that
> the IPC layer was built inherently for the reason of security, and as such
> the messages need to be verified by the 
> machinery. Currently the DisplayList items are insecure.

Note that this patch drops the `isValid()` checks for each of the inline display list items, which we currently have in trunk. Once we add these back in, DisplayList items sent via SimpleArgumentCoder will no longer be insecure (and validation failure should result in a MESSAGE_CHECK).
Comment 5 Wenson Hsieh 2021-09-28 09:57:19 PDT
Comment on attachment 439399 [details]
Extremely hacky WIP - no resource caching

View in context: https://bugs.webkit.org/attachment.cgi?id=439399&action=review

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:705
>> +    MESSAGE_CHECK(m_renderingDestinationImageBuffer, "No destination image buffer specified.");
> 
> So this is a bit highlighting the pondering issue:
> The "Save" is a message to the context of the destination image buffer.
> How can we be in the middle of dispatching "save" to non-existing object?
> 
> The implementation would be:
> 
> void RemoteGraphicsContext::save(DisplayList::Save&& item) {
>    item.apply(m_context);
> }

This is my eventual goal (in this patch, I even included empty RemoteGraphicsContext/RemoteGraphicsContextProxy files that I added with this intention, but did not build on, since adding logic to the existing RRB/RRBP pipeline was more convenient).

I plan to make both RRB and the new RemoteGraphicsContext streaming IPC message receivers that share the same StreamServerConnection; then, RemoteGraphicsContextProxy in the web process would be able to directly message RemoteGraphicsContext.
Comment 6 Wenson Hsieh 2021-09-28 10:08:29 PDT
(In reply to Wenson Hsieh from comment #4)
> (In reply to Kimmo Kinnunen from comment #3)
> > Comment on attachment 439399 [details]
> > Extremely hacky WIP - no resource caching
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=439399&action=review
> > 
> > > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.messages.in:39
> > > +    HandleSetLineCap(WebCore::DisplayList::SetLineCap item)
> > 
> > So one example where the logic of "use DisplayList:: items so we can memcpy
> > things around and that's that" is flawed is that
> > the IPC layer was built inherently for the reason of security, and as such
> > the messages need to be verified by the 
> > machinery. Currently the DisplayList items are insecure.
> 
> Note that this patch drops the `isValid()` checks for each of the inline
> display list items, which we currently have in trunk. Once we add these back
> in, DisplayList items sent via SimpleArgumentCoder will no longer be
> insecure (and validation failure should result in a MESSAGE_CHECK).

I would also add that we know for a fact that very fast, low-overhead encoding/decoding of display list items and their arguments is necessary in order to hit our performance goals, particularly on the Canvas subtests.

In theory, we could use normal IPC encoding/decoding and validation instead, but this would require us to make equivalent progressions elsewhere to mitigate the performance hit.
Comment 7 Kimmo Kinnunen 2021-09-29 10:06:27 PDT
(In reply to Wenson Hsieh from comment #6)

> I would also add that we know for a fact that very fast, low-overhead
> encoding/decoding of display list items and their arguments is necessary in
> order to hit our performance goals, particularly on the Canvas subtests.
> 
> In theory, we could use normal IPC encoding/decoding and validation instead,
> but this would require us to make equivalent progressions elsewhere to
> mitigate the performance hit.

This statement is just not logical.
Encoding of IPC messages filled with display list items:
 - Put draw args to display list item
 - memcpy item to message
 - send message
 - receive message
 - memcpy item from message
 - extract args from display list item
 - call with args

Encoding of IPC messages filled with draw args:
 - Put draw args to message
 - send message
 - receive message
 - extract args from display list item
 - call with args


All other things being equal, this does not make sense. The only way it would make sense is we already had the display list items.

Think of it the other way:
Putting FloatRect into display list item is *superior* to putting it to plain IPC message directly.
Why would we use inferior method for IPC messages?
Where is the magic that encodes FloatRects faster than IPC encoder?
What is the fundamental reason that magic is not the default mode of the IPC encoder?
Comment 8 Kimmo Kinnunen 2021-09-29 10:07:14 PDT
Err, it should have been:

Encoding of IPC messages filled with draw args:
 - Put draw args to message
 - send message
 - receive message
 - extract args from message
 - call with args
Comment 9 Wenson Hsieh 2021-09-29 10:23:22 PDT
(In reply to Kimmo Kinnunen from comment #7)
> (In reply to Wenson Hsieh from comment #6)
> 
> > I would also add that we know for a fact that very fast, low-overhead
> > encoding/decoding of display list items and their arguments is necessary in
> > order to hit our performance goals, particularly on the Canvas subtests.
> > 
> > In theory, we could use normal IPC encoding/decoding and validation instead,
> > but this would require us to make equivalent progressions elsewhere to
> > mitigate the performance hit.
> 
> This statement is just not logical.
> Encoding of IPC messages filled with display list items:
>  - Put draw args to display list item
>  - memcpy item to message
>  - send message
>  - receive message
>  - memcpy item from message
>  - extract args from display list item
>  - call with args
> 
> Encoding of IPC messages filled with draw args:
>  - Put draw args to message
>  - send message
>  - receive message
>  - extract args from display list item
>  - call with args

This is not what I'm comparing against. "normal IPC encoding/decoding" was referring to the established IPC argument encoding technique of allocating a new chunk of memory and encoding the arguments (one fixed-length piece of data at a time), resizing the buffer and copying out the contents of buffer as needed.

I would hope that this:

> Encoding of IPC messages filled with display list items:

..and this:

> Encoding of IPC messages filled with draw args:

..would yield about the same performance in IPC streams, though the latter will make multiple calls to `memcpy` whereas the former would just make a single call.

> 
> 
> All other things being equal, this does not make sense. The only way it
> would make sense is we already had the display list items.
> 
> Think of it the other way:
> Putting FloatRect into display list item is *superior* to putting it to
> plain IPC message directly.
> Why would we use inferior method for IPC messages?
> Where is the magic that encodes FloatRects faster than IPC encoder?

I'm not sure I totally understand what you mean, but FloatRect encoding is optimized to use SimpleArgumentCoder, which performs a single `memcpy`.

> What is the fundamental reason that magic is not the default mode of the IPC
> encoder?

(..following from above, SimpleArgumentCoder is not the default mode if the IPC encoder because we can't assume objects consist only of memcpy-able data with no pointers by default).
Comment 10 Kimmo Kinnunen 2021-09-29 10:30:08 PDT
Yeah, you're probably right that the dl items will be in practice faster due to the nature of how the IPC encoders iterate field by field.
Maybe someday we could just memcpy the messages that have trivially copyable fields.
Comment 11 Wenson Hsieh 2021-10-01 17:17:13 PDT
Created attachment 439939 [details]
WIP - supports cached images
Comment 12 Wenson Hsieh 2021-10-03 16:30:59 PDT
Created attachment 440023 [details]
WIP - clip to mask, get|putImageData, draw glyphs, paint media
Comment 13 Kimmo Kinnunen 2021-10-04 11:06:26 PDT
Comment on attachment 440023 [details]
WIP - clip to mask, get|putImageData, draw glyphs, paint media

View in context: https://bugs.webkit.org/attachment.cgi?id=440023&action=review

Such an amazingly clean patch for a WIP.
Some small comment, did not go through all of it yet.

> Source/WebKit/GPUProcess/graphics/RemoteImageBuffer.h:61
> +        m_remoteRenderingBackend.streamConnection().startReceivingMessages(m_context.get(), Messages::RemoteGraphicsContext::messageReceiverName(), renderingResourceIdentifier.toUInt64());

You need the startReceivingIPC - stopReceivingIPC pattern, since you will start receiving vfunc calls immediately after this method call, and the vtable ptr is not yet set up.
(or initialize-stopReceivingIPC, cannot remember which).

> Source/WebKit/GPUProcess/graphics/RemoteImageBuffer.h:67
> +        m_remoteRenderingBackend.streamConnection().stopReceivingMessages(Messages::RemoteGraphicsContext::messageReceiverName(), m_renderingResourceIdentifier.toUInt64());

The object lifetime has already ended, and you might still receive messages until this method returns. So the initialize - stopReceivingIPC is needed.

> Source/WebKit/Platform/IPC/Timeout.h:37
> +        : m_deadline(std::isinf(timeDelta) ? MonotonicTime::infinity() : MonotonicTime::now() + timeDelta)

Great addition!

Overall, I think passing infinity generally should be a bug, so in production something else is needed for the call sites?
Either that approximate time or then maybe we could compute the deadline upon first would-block call?
E.g. explain in timeout that it's not strictly accurate timeout, rather it would start approximately during the first blocking call..
Comment 14 Wenson Hsieh 2021-10-04 11:48:01 PDT
Comment on attachment 440023 [details]
WIP - clip to mask, get|putImageData, draw glyphs, paint media

View in context: https://bugs.webkit.org/attachment.cgi?id=440023&action=review

>> Source/WebKit/GPUProcess/graphics/RemoteImageBuffer.h:61
>> +        m_remoteRenderingBackend.streamConnection().startReceivingMessages(m_context.get(), Messages::RemoteGraphicsContext::messageReceiverName(), renderingResourceIdentifier.toUInt64());
> 
> You need the startReceivingIPC - stopReceivingIPC pattern, since you will start receiving vfunc calls immediately after this method call, and the vtable ptr is not yet set up.
> (or initialize-stopReceivingIPC, cannot remember which).

Good catch! Definitely going to fix this for my next WIP iteration.

>> Source/WebKit/GPUProcess/graphics/RemoteImageBuffer.h:67
>> +        m_remoteRenderingBackend.streamConnection().stopReceivingMessages(Messages::RemoteGraphicsContext::messageReceiverName(), m_renderingResourceIdentifier.toUInt64());
> 
> The object lifetime has already ended, and you might still receive messages until this method returns. So the initialize - stopReceivingIPC is needed.

πŸ‘πŸ»

>> Source/WebKit/Platform/IPC/Timeout.h:37
>> +        : m_deadline(std::isinf(timeDelta) ? MonotonicTime::infinity() : MonotonicTime::now() + timeDelta)
> 
> Great addition!
> 
> Overall, I think passing infinity generally should be a bug, so in production something else is needed for the call sites?
> Either that approximate time or then maybe we could compute the deadline upon first would-block call?
> E.g. explain in timeout that it's not strictly accurate timeout, rather it would start approximately during the first blocking call..

Ah, yes β€” using ∞ and adding the optimization here was just meant to be temporary, until I could adopt a faster, coarse-granularity alternative to `MonotonicTime::now()`.

Yusuke recently added `WTF::ApproximateTime` in r283161, so my next step was to adopt that, change the ∞ timeouts in this patch to something reasonable, and finally re-run Canvas Lines to verify that it's sufficient to address the performance regression on this subtest.
Comment 15 Radar WebKit Bug Importer 2021-10-04 15:11:21 PDT Comment hidden (obsolete)
Comment 16 Wenson Hsieh 2021-10-04 16:09:05 PDT
Comment on attachment 440023 [details]
WIP - clip to mask, get|putImageData, draw glyphs, paint media

View in context: https://bugs.webkit.org/attachment.cgi?id=440023&action=review

>>> Source/WebKit/Platform/IPC/Timeout.h:37
>>> +        : m_deadline(std::isinf(timeDelta) ? MonotonicTime::infinity() : MonotonicTime::now() + timeDelta)
>> 
>> Great addition!
>> 
>> Overall, I think passing infinity generally should be a bug, so in production something else is needed for the call sites?
>> Either that approximate time or then maybe we could compute the deadline upon first would-block call?
>> E.g. explain in timeout that it's not strictly accurate timeout, rather it would start approximately during the first blocking call..
> 
> Ah, yes β€” using ∞ and adding the optimization here was just meant to be temporary, until I could adopt a faster, coarse-granularity alternative to `MonotonicTime::now()`.
> 
> Yusuke recently added `WTF::ApproximateTime` in r283161, so my next step was to adopt that, change the ∞ timeouts in this patch to something reasonable, and finally re-run Canvas Lines to verify that it's sufficient to address the performance regression on this subtest.

Hm…using `ApproximateTime::now()` does partially mitigate the regression, but using ∞ and skipping the `mach_(approximate|absolute)_time` call altogether is still significantly faster.

I might stick with an indefinite wait for now, with a FIXME to refactor it to speed this up. I suspect that we might be able to cache an IPC timeout once per frame, and use that instead.
Comment 17 Wenson Hsieh 2021-10-04 18:10:40 PDT Comment hidden (obsolete)
Comment 18 Jon Lee 2021-10-06 14:27:18 PDT
rdar://83438038
Comment 19 Wenson Hsieh 2021-10-11 08:52:26 PDT Comment hidden (obsolete)
Comment 20 Wenson Hsieh 2021-10-11 17:38:17 PDT Comment hidden (obsolete)
Comment 21 Wenson Hsieh 2021-10-12 09:39:00 PDT
Created attachment 440949 [details]
Rebase again
Comment 22 Tim Horton 2021-10-12 12:28:51 PDT
Comment on attachment 440949 [details]
Rebase again

View in context: https://bugs.webkit.org/attachment.cgi?id=440949&action=review

> Source/WebKit/WebProcess/GPU/graphics/RemoteRenderingBackendProxy.cpp:78
> +        m_streamConnection = makeUnique<IPC::StreamClientConnection>(gpuProcessConnection.connection(), 1 << 21);

What in the world is this 1 << 21
Comment 23 Wenson Hsieh 2021-10-12 12:34:07 PDT
Comment on attachment 440949 [details]
Rebase again

View in context: https://bugs.webkit.org/attachment.cgi?id=440949&action=review

>> Source/WebKit/WebProcess/GPU/graphics/RemoteRenderingBackendProxy.cpp:78
>> +        m_streamConnection = makeUnique<IPC::StreamClientConnection>(gpuProcessConnection.connection(), 1 << 21);
> 
> What in the world is this 1 << 21

…good point! I'll pull this out into a `constexpr` variable.

> Source/WebKit/WebProcess/GPU/graphics/RemoteRenderingBackendProxy.cpp:204
> +    sendSyncToStream(Messages::RemoteRenderingBackend::GetDataURLForImageBuffer(mimeType, quality, preserveResolution, renderingResourceIdentifier), Messages::RemoteRenderingBackend::GetDataURLForImageBuffer::Reply(urlString), 3_s);

Also going to change these timeouts back to 1_s from 3_s, as we discussed. I think I changed these when debugging `get/putImageData` and forgot to change them back.
Comment 24 Wenson Hsieh 2021-10-12 15:37:46 PDT
Created attachment 441004 [details]
Rebase again + fix test failures
Comment 25 EWS 2021-10-12 22:50:43 PDT
Committed r284079 (242907@main): <https://commits.webkit.org/242907@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 441004 [details].