Bug 217766 - Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues()
Summary: Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues()
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Audio (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Chris Dumez
URL:
Keywords: InRadar
Depends on:
Blocks: 212611
  Show dependency treegraph
 
Reported: 2020-10-15 10:43 PDT by Chris Dumez
Modified: 2020-10-15 14:16 PDT (History)
11 users (show)

See Also:


Attachments
Patch (2.06 KB, patch)
2020-10-15 10:44 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Dumez 2020-10-15 10:43:05 PDT
Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().
Comment 1 Chris Dumez 2020-10-15 10:44:53 PDT
Created attachment 411460 [details]
Patch
Comment 2 Geoffrey Garen 2020-10-15 10:47:16 PDT
Comment on attachment 411460 [details]
Patch

r=me
Comment 3 Darin Adler 2020-10-15 11:03:17 PDT
Comment on attachment 411460 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=411460&action=review

> Source/WebCore/ChangeLog:8
> +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().

Does this get us any vectorization or parallelization? I know that’s the long term ambition of the C++ library, they even have std::execution::seq/par/par_unseq/unseq that you can pass as the first argument.
Comment 4 Chris Dumez 2020-10-15 11:11:47 PDT
(In reply to Darin Adler from comment #3)
> Comment on attachment 411460 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=411460&action=review
> 
> > Source/WebCore/ChangeLog:8
> > +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().
> 
> Does this get us any vectorization or parallelization? I know that’s the
> long term ambition of the C++ library, they even have
> std::execution::seq/par/par_unseq/unseq that you can pass as the first
> argument.

I actually don't know. I figured the code was more concise with std::fill and we *might* get more optimized code. I guess I could write a simple benchmark to compare the 2.

Also note that we could very easily add a VectorMath function that uses vDSP_fill() [1] to guarantee we get vectorization. What do you think?

[1] https://developer.apple.com/documentation/accelerate/1450501-vdsp_vfill?language=objc
Comment 5 Darin Adler 2020-10-15 11:18:00 PDT
Comment on attachment 411460 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=411460&action=review

>>> Source/WebCore/ChangeLog:8
>>> +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().
>> 
>> Does this get us any vectorization or parallelization? I know that’s the long term ambition of the C++ library, they even have std::execution::seq/par/par_unseq/unseq that you can pass as the first argument.
> 
> I actually don't know. I figured the code was more concise with std::fill and we *might* get more optimized code. I guess I could write a simple benchmark to compare the 2.
> 
> Also note that we could very easily add a VectorMath function that uses vDSP_fill() [1] to guarantee we get vectorization. What do you think?
> 
> [1] https://developer.apple.com/documentation/accelerate/1450501-vdsp_vfill?language=objc

Like most optimization situations, I think:

1) We should optimize if it makes a measurable difference.
2) Outside of that, we should choose an idiom that is both easy to understand and reasonably optimized by default.

I think that using std::fill_n already accomplishes (2) and we could go further if we measure something that shows an optimization opportunity.
Comment 6 Sam Weinig 2020-10-15 11:18:27 PDT
(In reply to Chris Dumez from comment #4)
> (In reply to Darin Adler from comment #3)
> > Comment on attachment 411460 [details]
> > Patch
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=411460&action=review
> > 
> > > Source/WebCore/ChangeLog:8
> > > +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().
> > 
> > Does this get us any vectorization or parallelization? I know that’s the
> > long term ambition of the C++ library, they even have
> > std::execution::seq/par/par_unseq/unseq that you can pass as the first
> > argument.
> 
> I actually don't know. I figured the code was more concise with std::fill
> and we *might* get more optimized code. I guess I could write a simple
> benchmark to compare the 2.
> 
> Also note that we could very easily add a VectorMath function that uses
> vDSP_fill() [1] to guarantee we get vectorization. What do you think?
> 
> [1]
> https://developer.apple.com/documentation/accelerate/1450501-
> vdsp_vfill?language=objc

I think using accelerate as much as possible in VectorMath is what we should be doing. I really see it as a "platform" abstraction around it.

I still think using std::fill_n here for the non-HAVE(ACCELERATE) case it the right way to go. It doesn't guarantee any vectorization, but since it is closer the the compiler, if autovectorization is in the compiler, it is more likely the compiler will ensure it works with standard library idioms.
Comment 7 Chris Dumez 2020-10-15 11:20:01 PDT
(In reply to Sam Weinig from comment #6)
> (In reply to Chris Dumez from comment #4)
> > (In reply to Darin Adler from comment #3)
> > > Comment on attachment 411460 [details]
> > > Patch
> > > 
> > > View in context:
> > > https://bugs.webkit.org/attachment.cgi?id=411460&action=review
> > > 
> > > > Source/WebCore/ChangeLog:8
> > > > +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().
> > > 
> > > Does this get us any vectorization or parallelization? I know that’s the
> > > long term ambition of the C++ library, they even have
> > > std::execution::seq/par/par_unseq/unseq that you can pass as the first
> > > argument.
> > 
> > I actually don't know. I figured the code was more concise with std::fill
> > and we *might* get more optimized code. I guess I could write a simple
> > benchmark to compare the 2.
> > 
> > Also note that we could very easily add a VectorMath function that uses
> > vDSP_fill() [1] to guarantee we get vectorization. What do you think?
> > 
> > [1]
> > https://developer.apple.com/documentation/accelerate/1450501-
> > vdsp_vfill?language=objc
> 
> I think using accelerate as much as possible in VectorMath is what we should
> be doing. I really see it as a "platform" abstraction around it.
> 
> I still think using std::fill_n here for the non-HAVE(ACCELERATE) case it
> the right way to go. It doesn't guarantee any vectorization, but since it is
> closer the the compiler, if autovectorization is in the compiler, it is more
> likely the compiler will ensure it works with standard library idioms.

Yes, I think this is a good idea. I will still benchmark std::fill_n() and vDSP_fill() because I am curious now :)
Comment 8 Chris Dumez 2020-10-15 11:55:36 PDT
(In reply to Chris Dumez from comment #7)
> (In reply to Sam Weinig from comment #6)
> > (In reply to Chris Dumez from comment #4)
> > > (In reply to Darin Adler from comment #3)
> > > > Comment on attachment 411460 [details]
> > > > Patch
> > > > 
> > > > View in context:
> > > > https://bugs.webkit.org/attachment.cgi?id=411460&action=review
> > > > 
> > > > > Source/WebCore/ChangeLog:8
> > > > > +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().
> > > > 
> > > > Does this get us any vectorization or parallelization? I know that’s the
> > > > long term ambition of the C++ library, they even have
> > > > std::execution::seq/par/par_unseq/unseq that you can pass as the first
> > > > argument.
> > > 
> > > I actually don't know. I figured the code was more concise with std::fill
> > > and we *might* get more optimized code. I guess I could write a simple
> > > benchmark to compare the 2.
> > > 
> > > Also note that we could very easily add a VectorMath function that uses
> > > vDSP_fill() [1] to guarantee we get vectorization. What do you think?
> > > 
> > > [1]
> > > https://developer.apple.com/documentation/accelerate/1450501-
> > > vdsp_vfill?language=objc
> > 
> > I think using accelerate as much as possible in VectorMath is what we should
> > be doing. I really see it as a "platform" abstraction around it.
> > 
> > I still think using std::fill_n here for the non-HAVE(ACCELERATE) case it
> > the right way to go. It doesn't guarantee any vectorization, but since it is
> > closer the the compiler, if autovectorization is in the compiler, it is more
> > likely the compiler will ensure it works with standard library idioms.
> 
> Yes, I think this is a good idea. I will still benchmark std::fill_n() and
> vDSP_fill() because I am curious now :)

Interestingly, std::fill_n() seems consistently faster for large arrays than vDSP_fill() on my MacBook Pro:

std::fill_n() took 1.0884ms
vDSP::vfill() took 1.20753ms

Benchmark:

// clang++ -O2 -std=c++14 -framework Accelerate fill_benchmark.cpp -o fill_benchmark
  
#include <algorithm>
#include <chrono>
#include <iostream>
#include <Accelerate/Accelerate.h>

int main()
{
    constexpr unsigned N = 524288;
    const float pi = 3.1415926535;
    float array1[N];
    float array2[N];

    auto start = std::chrono::steady_clock::now();
    std::fill_n(array1, N, pi);
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double> diff = end - start;
    std::cout << "std::fill_n() took " << diff.count() * 1000 << "ms\n";

    start = std::chrono::steady_clock::now();
    vDSP_vfill(&pi, array2, 1, N);
    end = std::chrono::steady_clock::now();

    diff = end - start;
    std::cout << "vDSP::vfill() took " << diff.count() * 1000 << "ms\n";

    std::cout << array1[2000] << " " << array2[2000] << "\n";

    return 0;
}
Comment 9 Sam Weinig 2020-10-15 13:59:37 PDT
(In reply to Chris Dumez from comment #8)
> (In reply to Chris Dumez from comment #7)
> > (In reply to Sam Weinig from comment #6)
> > > (In reply to Chris Dumez from comment #4)
> > > > (In reply to Darin Adler from comment #3)
> > > > > Comment on attachment 411460 [details]
> > > > > Patch
> > > > > 
> > > > > View in context:
> > > > > https://bugs.webkit.org/attachment.cgi?id=411460&action=review
> > > > > 
> > > > > > Source/WebCore/ChangeLog:8
> > > > > > +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().
> > > > > 
> > > > > Does this get us any vectorization or parallelization? I know that’s the
> > > > > long term ambition of the C++ library, they even have
> > > > > std::execution::seq/par/par_unseq/unseq that you can pass as the first
> > > > > argument.
> > > > 
> > > > I actually don't know. I figured the code was more concise with std::fill
> > > > and we *might* get more optimized code. I guess I could write a simple
> > > > benchmark to compare the 2.
> > > > 
> > > > Also note that we could very easily add a VectorMath function that uses
> > > > vDSP_fill() [1] to guarantee we get vectorization. What do you think?
> > > > 
> > > > [1]
> > > > https://developer.apple.com/documentation/accelerate/1450501-
> > > > vdsp_vfill?language=objc
> > > 
> > > I think using accelerate as much as possible in VectorMath is what we should
> > > be doing. I really see it as a "platform" abstraction around it.
> > > 
> > > I still think using std::fill_n here for the non-HAVE(ACCELERATE) case it
> > > the right way to go. It doesn't guarantee any vectorization, but since it is
> > > closer the the compiler, if autovectorization is in the compiler, it is more
> > > likely the compiler will ensure it works with standard library idioms.
> > 
> > Yes, I think this is a good idea. I will still benchmark std::fill_n() and
> > vDSP_fill() because I am curious now :)
> 
> Interestingly, std::fill_n() seems consistently faster for large arrays than
> vDSP_fill() on my MacBook Pro:
> 
> std::fill_n() took 1.0884ms
> vDSP::vfill() took 1.20753ms
> 
> Benchmark:
> 
> // clang++ -O2 -std=c++14 -framework Accelerate fill_benchmark.cpp -o
> fill_benchmark
>   
> #include <algorithm>
> #include <chrono>
> #include <iostream>
> #include <Accelerate/Accelerate.h>
> 
> int main()
> {
>     constexpr unsigned N = 524288;
>     const float pi = 3.1415926535;
>     float array1[N];
>     float array2[N];
> 
>     auto start = std::chrono::steady_clock::now();
>     std::fill_n(array1, N, pi);
>     auto end = std::chrono::steady_clock::now();
> 
>     std::chrono::duration<double> diff = end - start;
>     std::cout << "std::fill_n() took " << diff.count() * 1000 << "ms\n";
> 
>     start = std::chrono::steady_clock::now();
>     vDSP_vfill(&pi, array2, 1, N);
>     end = std::chrono::steady_clock::now();
> 
>     diff = end - start;
>     std::cout << "vDSP::vfill() took " << diff.count() * 1000 << "ms\n";
> 
>     std::cout << array1[2000] << " " << array2[2000] << "\n";
> 
>     return 0;
> }

Interesting. Probably worth a radar to the Accelerate folks.
Comment 10 Chris Dumez 2020-10-15 14:12:47 PDT
(In reply to Sam Weinig from comment #9)
> (In reply to Chris Dumez from comment #8)
> > (In reply to Chris Dumez from comment #7)
> > > (In reply to Sam Weinig from comment #6)
> > > > (In reply to Chris Dumez from comment #4)
> > > > > (In reply to Darin Adler from comment #3)
> > > > > > Comment on attachment 411460 [details]
> > > > > > Patch
> > > > > > 
> > > > > > View in context:
> > > > > > https://bugs.webkit.org/attachment.cgi?id=411460&action=review
> > > > > > 
> > > > > > > Source/WebCore/ChangeLog:8
> > > > > > > +        Use std::fill_n() instead of for loops in AudioParam::calculateFinalValues().
> > > > > > 
> > > > > > Does this get us any vectorization or parallelization? I know that’s the
> > > > > > long term ambition of the C++ library, they even have
> > > > > > std::execution::seq/par/par_unseq/unseq that you can pass as the first
> > > > > > argument.
> > > > > 
> > > > > I actually don't know. I figured the code was more concise with std::fill
> > > > > and we *might* get more optimized code. I guess I could write a simple
> > > > > benchmark to compare the 2.
> > > > > 
> > > > > Also note that we could very easily add a VectorMath function that uses
> > > > > vDSP_fill() [1] to guarantee we get vectorization. What do you think?
> > > > > 
> > > > > [1]
> > > > > https://developer.apple.com/documentation/accelerate/1450501-
> > > > > vdsp_vfill?language=objc
> > > > 
> > > > I think using accelerate as much as possible in VectorMath is what we should
> > > > be doing. I really see it as a "platform" abstraction around it.
> > > > 
> > > > I still think using std::fill_n here for the non-HAVE(ACCELERATE) case it
> > > > the right way to go. It doesn't guarantee any vectorization, but since it is
> > > > closer the the compiler, if autovectorization is in the compiler, it is more
> > > > likely the compiler will ensure it works with standard library idioms.
> > > 
> > > Yes, I think this is a good idea. I will still benchmark std::fill_n() and
> > > vDSP_fill() because I am curious now :)
> > 
> > Interestingly, std::fill_n() seems consistently faster for large arrays than
> > vDSP_fill() on my MacBook Pro:
> > 
> > std::fill_n() took 1.0884ms
> > vDSP::vfill() took 1.20753ms
> > 
> > Benchmark:
> > 
> > // clang++ -O2 -std=c++14 -framework Accelerate fill_benchmark.cpp -o
> > fill_benchmark
> >   
> > #include <algorithm>
> > #include <chrono>
> > #include <iostream>
> > #include <Accelerate/Accelerate.h>
> > 
> > int main()
> > {
> >     constexpr unsigned N = 524288;
> >     const float pi = 3.1415926535;
> >     float array1[N];
> >     float array2[N];
> > 
> >     auto start = std::chrono::steady_clock::now();
> >     std::fill_n(array1, N, pi);
> >     auto end = std::chrono::steady_clock::now();
> > 
> >     std::chrono::duration<double> diff = end - start;
> >     std::cout << "std::fill_n() took " << diff.count() * 1000 << "ms\n";
> > 
> >     start = std::chrono::steady_clock::now();
> >     vDSP_vfill(&pi, array2, 1, N);
> >     end = std::chrono::steady_clock::now();
> > 
> >     diff = end - start;
> >     std::cout << "vDSP::vfill() took " << diff.count() * 1000 << "ms\n";
> > 
> >     std::cout << array1[2000] << " " << array2[2000] << "\n";
> > 
> >     return 0;
> > }
> 
> Interesting. Probably worth a radar to the Accelerate folks.

Ok, rdar://problem/70351530.
Comment 11 EWS 2020-10-15 14:15:23 PDT
Committed r268553: <https://trac.webkit.org/changeset/268553>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 411460 [details].
Comment 12 Radar WebKit Bug Importer 2020-10-15 14:16:20 PDT
<rdar://problem/70351684>