Bug 225890 - Use UTF-8 internally from Strings inside SQLiteStatement
Summary: Use UTF-8 internally from Strings inside SQLiteStatement
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Chris Dumez
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-05-17 15:00 PDT by Chris Dumez
Modified: 2021-05-18 12:00 PDT (History)
8 users (show)

See Also:


Attachments
Patch (4.63 KB, patch)
2021-05-17 15:02 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (4.63 KB, patch)
2021-05-17 15:32 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (4.71 KB, patch)
2021-05-17 15:56 PDT, Chris Dumez
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (5.88 KB, patch)
2021-05-17 16:52 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (6.99 KB, patch)
2021-05-17 18:41 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (7.26 KB, patch)
2021-05-18 07:42 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (7.25 KB, patch)
2021-05-18 10:09 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Dumez 2021-05-17 15:00:31 PDT
Use UTF-8 internally from Strings inside SQLiteStatement. Our SQLite databases use UTF-8 internally for text encoding (see we open the database with sqlite3_open_v2() and not sqlite3_open16).

Previously, we were providing UTF-16 strings to SQLite, which meant it had to convert it internally to UTF-8.
Also, whenever we were requesting a string from SQLite, it had to convert it from UTF-8 to UTF-16 before returning it to us. Our String constructed from returned from SQLiteStatement would also be 16-bit encoded, even if all ASCII so we were potentially wasting memory too.

We now provide UTF-8 to SQLite by using StringView::utf8(). We also request UTF-8 strings from SQLite and rely on String::fromUTF8() to convert it to a WTF::String. For all ASCII characters, this means we'll end up with a 8-bit String and save memory.
Comment 1 Chris Dumez 2021-05-17 15:02:21 PDT
Created attachment 428876 [details]
Patch
Comment 2 Chris Dumez 2021-05-17 15:32:22 PDT
Created attachment 428878 [details]
Patch
Comment 3 Chris Dumez 2021-05-17 15:56:59 PDT
Created attachment 428883 [details]
Patch
Comment 4 Chris Dumez 2021-05-17 16:52:20 PDT
Created attachment 428892 [details]
Patch
Comment 5 Chris Dumez 2021-05-17 18:41:01 PDT
Created attachment 428904 [details]
Patch
Comment 6 Alex Christensen 2021-05-17 20:13:37 PDT
Is this compatible with existing databases in files?
Comment 7 Chris Dumez 2021-05-17 20:15:36 PDT
(In reply to Alex Christensen from comment #6)
> Is this compatible with existing databases in files?

Yes, there is no compatibility issue here. Our databases already use UTF-8 encoding. Now we also use UTF-8 in SQLiteStatement to reduce the amount of conversion.
Comment 8 Darin Adler 2021-05-17 23:18:25 PDT
Comment on attachment 428904 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=428904&action=review

> Source/WebCore/platform/sql/SQLiteStatement.cpp:120
> +    auto utf8Text = text.utf8();

Seems like we should consider an idiom that uses the characters in place for all-ASCII strings. No need to make a copy on the heap in that case, and that may be the common case, and the scan through the characters to check they are all ASCII is probably efficient enough that it’s an optimization on balance. We could make a function that returns a structure that sometimes contains a CString but always contains a const char* pointing to the UTF-8 characters, and a length. Or we could make a function that takes a lambda and calls it with the UTF-8 characters and length. Or we can just write it out here:

    CString buffer;
    const char* characters;
    unsigned length;
    if (text.is8Bit() && text.isAllASCII()) {
        characters = reinterpret_cast<const char*>(text.characters8());
        length = text.length();
    } else {
        buffer = text.utf8();
        characters = utf8Buffer.data();
        length = utf8Buffer.length();
    }
    return sqlite3_bind_text(m_statement, index, characters, length, SQLITE_TRANSIENT);
Comment 9 Chris Dumez 2021-05-18 07:38:32 PDT
(In reply to Darin Adler from comment #8)
> Comment on attachment 428904 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=428904&action=review
> 
> > Source/WebCore/platform/sql/SQLiteStatement.cpp:120
> > +    auto utf8Text = text.utf8();
> 
> Seems like we should consider an idiom that uses the characters in place for
> all-ASCII strings. No need to make a copy on the heap in that case, and that
> may be the common case, and the scan through the characters to check they
> are all ASCII is probably efficient enough that it’s an optimization on
> balance. We could make a function that returns a structure that sometimes
> contains a CString but always contains a const char* pointing to the UTF-8
> characters, and a length. Or we could make a function that takes a lambda
> and calls it with the UTF-8 characters and length. Or we can just write it
> out here:
> 
>     CString buffer;
>     const char* characters;
>     unsigned length;
>     if (text.is8Bit() && text.isAllASCII()) {
>         characters = reinterpret_cast<const char*>(text.characters8());
>         length = text.length();
>     } else {
>         buffer = text.utf8();
>         characters = utf8Buffer.data();
>         length = utf8Buffer.length();
>     }
>     return sqlite3_bind_text(m_statement, index, characters, length,
> SQLITE_TRANSIENT);

Yes, I agree all ASCII is likely a common case. I'll make this optimization.
Comment 10 Chris Dumez 2021-05-18 07:42:20 PDT
Created attachment 428936 [details]
Patch
Comment 11 Chris Dumez 2021-05-18 08:56:49 PDT
Comment on attachment 428936 [details]
Patch

This last minute optimization caused a regression, I must have messed up something. I'll investigate.
Comment 12 Chris Dumez 2021-05-18 10:09:54 PDT
Created attachment 428950 [details]
Patch
Comment 13 EWS 2021-05-18 11:59:37 PDT
Committed r277667 (237869@main): <https://commits.webkit.org/237869@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 428950 [details].
Comment 14 Radar WebKit Bug Importer 2021-05-18 12:00:23 PDT
<rdar://problem/78166478>