Bug 211080

Summary: Stop using waiting for a BinarySemaphore on the main thread in the NetworkProcess
Product: WebKit Reporter: Alex Christensen <achristensen>
Component: New BugsAssignee: Alex Christensen <achristensen>
Status: RESOLVED FIXED    
Severity: Normal CC: cdumez, darin, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 211207    
Bug Blocks:    
Attachments:
Description Flags
Patch
none
Patch cdumez: review+

Description Alex Christensen 2020-04-27 09:33:51 PDT
Stop using waiting for a BinarySemaphore on the main thread in the NetworkProcess
Comment 1 Alex Christensen 2020-04-27 09:36:57 PDT
Created attachment 397690 [details]
Patch
Comment 2 Alex Christensen 2020-04-27 09:37:00 PDT
<rdar://problem/62377357>
Comment 3 Chris Dumez 2020-04-27 09:45:05 PDT
Comment on attachment 397690 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397690&action=review

> Source/WebKit/ChangeLog:9
> +        There was an out-of-date comment suggesting we needed to use a semaphore, but we've since added these in the destructor:

Yes, when this code was written, this store lived in the UIProcess, not the network process. Seems have changed a lot since.
Comment 4 Chris Dumez 2020-04-27 12:30:56 PDT
------------------------------
Test suite failed
Failed
    TestWebKitAPI.ResourceLoadStatistics.ChildProcessesNotLaunchedDatabase
        
        /Volumes/Data/worker/macOS-Mojave-Release-Build-EWS/build/Tools/TestWebKitAPI/Tests/WebKitCocoa/ResourceLoadStatistics.mm:448
        Value of: [message isEqualToString:@"PopulatedWithoutGrandfathering"]
          Actual: false
        Expected: true
        
    TestWebKitAPI.ResourceLoadStatistics.ShouldNotGrandfatherOnStartupDatabase
        _RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.
        
        /Volumes/Data/worker/macOS-Mojave-Release-Build-EWS/build/Tools/TestWebKitAPI/Tests/WebKitCocoa/ResourceLoadStatistics.mm:330
        Value of: [message isEqualToString:@"PopulatedWithoutGrandfathering"]
          Actual: false
        Expected: true
Comment 5 Alex Christensen 2020-04-27 12:32:08 PDT
When running an individual test, everything's fine.  When I run multiple tests, I see lots of these:

ERROR: SQLite database failed to set journal_mode to WAL, error: database is locked
        ./platform/sql/SQLiteDatabase.cpp(176) : void WebCore::SQLiteDatabase::useWALJournalMode()
        ERROR: SQLite database failed to checkpoint: database is locked
        ./platform/sql/SQLiteDatabase.cpp(186) : void WebCore::SQLiteDatabase::useWALJournalMode()
        ERROR: SQLite database failed to checkpoint: disk I/O error
        ./platform/sql/SQLiteDatabase.cpp(186) : void WebCore::SQLiteDatabase::useWALJournalMode()
        ASSERTION FAILED: resetResult == 0
Comment 6 Chris Dumez 2020-04-27 12:34:05 PDT
(In reply to Chris Dumez from comment #4)
> ------------------------------
> Test suite failed
> Failed
>     TestWebKitAPI.ResourceLoadStatistics.ChildProcessesNotLaunchedDatabase
>         
>        
> /Volumes/Data/worker/macOS-Mojave-Release-Build-EWS/build/Tools/
> TestWebKitAPI/Tests/WebKitCocoa/ResourceLoadStatistics.mm:448
>         Value of: [message isEqualToString:@"PopulatedWithoutGrandfathering"]
>           Actual: false
>         Expected: true
>         
>    
> TestWebKitAPI.ResourceLoadStatistics.ShouldNotGrandfatherOnStartupDatabase
>         _RegisterApplication(), FAILED TO establish the default connection
> to the WindowServer, _CGSDefaultConnection() is NULL.
>         
>        
> /Volumes/Data/worker/macOS-Mojave-Release-Build-EWS/build/Tools/
> TestWebKitAPI/Tests/WebKitCocoa/ResourceLoadStatistics.mm:330
>         Value of: [message isEqualToString:@"PopulatedWithoutGrandfathering"]
>           Actual: false
>         Expected: true

Interestingly, the 2 corresponding API tests that do not use the Database backend (ResourceLoadStatistics.ChildProcessesNotLaunched & ResourceLoadStatistics.ShouldNotGrandfatherOnStartup) do not fail, so the failure seems specific to using the database backend for ITP.
Comment 7 Chris Dumez 2020-04-27 12:37:01 PDT
(In reply to Alex Christensen from comment #5)
> When running an individual test, everything's fine.  When I run multiple
> tests, I see lots of these:
> 
> ERROR: SQLite database failed to set journal_mode to WAL, error: database is
> locked
>         ./platform/sql/SQLiteDatabase.cpp(176) : void
> WebCore::SQLiteDatabase::useWALJournalMode()
>         ERROR: SQLite database failed to checkpoint: database is locked
>         ./platform/sql/SQLiteDatabase.cpp(186) : void
> WebCore::SQLiteDatabase::useWALJournalMode()
>         ERROR: SQLite database failed to checkpoint: disk I/O error
>         ./platform/sql/SQLiteDatabase.cpp(186) : void
> WebCore::SQLiteDatabase::useWALJournalMode()
>         ASSERTION FAILED: resetResult == 0

 
useWALJournalMode() gets called when opening a SQLite Database. This may indicate that we are trying to open a database file that is still open (or wasn't closed properly). Given that the API test that fail use the SQLite backend, this is likely related. If I had to bet, I would say that the previous code was making sure that the database file was properly flushed/closed before something else was happening, and that's not the case anymore after your change.
Comment 8 Chris Dumez 2020-04-27 12:41:42 PDT
(In reply to Chris Dumez from comment #7)
> (In reply to Alex Christensen from comment #5)
> > When running an individual test, everything's fine.  When I run multiple
> > tests, I see lots of these:
> > 
> > ERROR: SQLite database failed to set journal_mode to WAL, error: database is
> > locked
> >         ./platform/sql/SQLiteDatabase.cpp(176) : void
> > WebCore::SQLiteDatabase::useWALJournalMode()
> >         ERROR: SQLite database failed to checkpoint: database is locked
> >         ./platform/sql/SQLiteDatabase.cpp(186) : void
> > WebCore::SQLiteDatabase::useWALJournalMode()
> >         ERROR: SQLite database failed to checkpoint: disk I/O error
> >         ./platform/sql/SQLiteDatabase.cpp(186) : void
> > WebCore::SQLiteDatabase::useWALJournalMode()
> >         ASSERTION FAILED: resetResult == 0
> 
>  
> useWALJournalMode() gets called when opening a SQLite Database. This may
> indicate that we are trying to open a database file that is still open (or
> wasn't closed properly). Given that the API test that fail use the SQLite
> backend, this is likely related. If I had to bet, I would say that the
> previous code was making sure that the database file was properly
> flushed/closed before something else was happening, and that's not the case
> anymore after your change.

When NetworkSession::setResourceLoadStatisticsEnabled() gets called (and it likely gets called during testing to toggle ITP), it destroys/reconstructs the WebResourceLoadStatisticsStore.
setResourceLoadStatisticsEnabled(false) would call NetworkSession::destroyResourceLoadStatistics() which would call WebResourceLoadStatisticsStore::didDestroyNetworkSession() which would trigger the code you modified. Then if the test calls setResourceLoadStatisticsEnabled(true) right away, it will try to reopen the same DB file without synchronization (because the new WorkQueue will be used).
Comment 9 Alex Christensen 2020-04-27 12:44:09 PDT
Ah, I can probably make that test-only function use a CompletionHandler to fix it.
Comment 10 Chris Dumez 2020-04-27 12:45:07 PDT
(In reply to Chris Dumez from comment #8)
> (In reply to Chris Dumez from comment #7)
> > (In reply to Alex Christensen from comment #5)
> > > When running an individual test, everything's fine.  When I run multiple
> > > tests, I see lots of these:
> > > 
> > > ERROR: SQLite database failed to set journal_mode to WAL, error: database is
> > > locked
> > >         ./platform/sql/SQLiteDatabase.cpp(176) : void
> > > WebCore::SQLiteDatabase::useWALJournalMode()
> > >         ERROR: SQLite database failed to checkpoint: database is locked
> > >         ./platform/sql/SQLiteDatabase.cpp(186) : void
> > > WebCore::SQLiteDatabase::useWALJournalMode()
> > >         ERROR: SQLite database failed to checkpoint: disk I/O error
> > >         ./platform/sql/SQLiteDatabase.cpp(186) : void
> > > WebCore::SQLiteDatabase::useWALJournalMode()
> > >         ASSERTION FAILED: resetResult == 0
> > 
> >  
> > useWALJournalMode() gets called when opening a SQLite Database. This may
> > indicate that we are trying to open a database file that is still open (or
> > wasn't closed properly). Given that the API test that fail use the SQLite
> > backend, this is likely related. If I had to bet, I would say that the
> > previous code was making sure that the database file was properly
> > flushed/closed before something else was happening, and that's not the case
> > anymore after your change.
> 
> When NetworkSession::setResourceLoadStatisticsEnabled() gets called (and it
> likely gets called during testing to toggle ITP), it destroys/reconstructs
> the WebResourceLoadStatisticsStore.
> setResourceLoadStatisticsEnabled(false) would call
> NetworkSession::destroyResourceLoadStatistics() which would call
> WebResourceLoadStatisticsStore::didDestroyNetworkSession() which would
> trigger the code you modified. Then if the test calls
> setResourceLoadStatisticsEnabled(true) right away, it will try to reopen the
> same DB file without synchronization (because the new WorkQueue will be
> used).

A few early fix ideas:
1. Make NetworkSession::setResourceLoadStatisticsEnabled() async and take a completion handler that only gets called after flushing / closing has happened on the background queue
2. Use a single WorkQueue for all WebResourceLoadStatisticsStore objects instead of one per WebResourceLoadStatisticsStore (so things get serialized properly even after constructing the WebResourceLoadStatisticsStore)
3. Add some synchronization in NetworkSession::setResourceLoadStatisticsEnabled()
Comment 11 Alex Christensen 2020-04-27 12:49:35 PDT
recreateResourceLoadStatisticStore was the problem.  Adding CompletionHandler...
Comment 12 Chris Dumez 2020-04-27 12:55:53 PDT
(In reply to Alex Christensen from comment #11)
> recreateResourceLoadStatisticStore was the problem.  Adding
> CompletionHandler...

Yes, a completion handler would work. I may be a bit intrusive as it may require a lot of piping the completion handler through (maybe even though IPC and SPI).
Note that option 2 I suggested earlier may be less work (We'd have a serial ITP WorkQueue on the NetworkProcess and use that for all WebResourceLoadStatisticsStore objects). This way things would get serialized nicely (flushing+close then re-opening).
Comment 13 Alex Christensen 2020-04-27 13:43:58 PDT
Created attachment 397729 [details]
Patch
Comment 14 Chris Dumez 2020-04-27 13:58:06 PDT
Comment on attachment 397729 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397729&action=review

r=me but let's wait for EWS to be green to land.

> Source/WebKit/NetworkProcess/NetworkProcess.cpp:270
> +        networkSession.destroyResourceLoadStatistics([callbackAggregator = callbackAggregator.copyRef()] { });

Yes, this is probably a good idea to do.
Comment 15 Alex Christensen 2020-04-27 15:33:46 PDT
http://trac.webkit.org/r260791
Comment 16 Alex Christensen 2020-04-28 11:10:45 PDT
This change is suspected to cause iOS crashes like those in API tests in https://build.webkit.org/builders/Apple%20iOS%2013%20Simulator%20Release%20WK2%20%28Tests%29/builds/4034
Investigating now.
Comment 17 Alex Christensen 2020-04-28 12:01:17 PDT
False alarm.  Those tests were caused by https://trac.webkit.org/changeset/260764/webkit which is unrelated and fixed in http://trac.webkit.org/r260836 which is not needed for this fix.
Comment 18 Chris Dumez 2020-04-29 14:34:18 PDT
This introduced a pretty bad regression unfortunately:
https://bugs.webkit.org/show_bug.cgi?id=211207