Re: Unexpectedly low floating-point performance in C
Posted: Tue Dec 18, 2018 3:20 pm
I've tried the following modified example, compiled with the GCC 8 toolchain and running on an ESP32 CPU core at 240 MHz. SPI flash/RAM is 80 MHz QSPI, but that doesn't matter here, as the code and data easily fit into the cache.
The benchmark loops have been modified to do 4 independent instructions per loop, as opposed to 4 sequentially dependent instructions (roughly like the sketch below). If some instructions can be executed in parallel, this should hopefully make for a better test. I've also made sure that the compiler actually generates tight arithmetic code.
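To illustrate the difference: the dependent variant chains every add through a single accumulator, so each instruction has to wait for the previous result and pays the full FPU latency. A rough reconstruction of such a loop, for illustration only (not the exact code from earlier in the thread):

Code:
// Dependent variant: each add consumes the result of the previous one,
// so the FPU pipeline cannot overlap them.
[[gnu::noinline]]
float dependent_addition (float f0)
{
    for (int j = 0; j < 10000/4; j++)
    {
        f0 += 1.00001f;
        f0 += 1.00001f;   // waits for the previous add to finish
        f0 += 1.00001f;
        f0 += 1.00001f;
    }
    return f0;
}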
Code:
#include <cstdio>
#include <chrono>
#include <array>
#include <functional>

// Set the condition to 1 for the integer variant, 0 for floating point.
#if 0
typedef unsigned int ftype;
const int N = 10000;             // arithmetic operations per test run
const ftype C = 1.00001;         // truncates to 1 for the integer type
const ftype CI = (100/1.00001);  // truncates to 99
const ftype CII = 13;
#else
typedef float ftype;
// typedef double ftype;         // enable to measure software double precision
const int N = 10000;             // arithmetic operations per test run
const ftype C = 1.00001;         // near 1, so repeated additions stay well-behaved
const ftype CI = (1/1.00001);    // near 1, so repeated multiplications stay well-behaved
const ftype CII = 0.001;
#endif

const int M = 3;                 // repetitions of each test

std::array<ftype, 4> test_values;

// noinline keeps each loop a self-contained, measurable unit; fast-math lets
// GCC contract a*b+c into a fused multiply-add (madd.s) where the FPU has one.
[[gnu::noinline, gnu::optimize ("fast-math")]]
void test_addition (void)
{
    // Four independent accumulators: the four adds in the loop body have no
    // data dependencies between them, so they may overlap in the pipeline.
    auto f0 = test_values[0];
    auto f1 = test_values[1];
    auto f2 = test_values[2];
    auto f3 = test_values[3];
    for (int j = 0; j < N/4; j++)
    {
        f0 += C;
        f1 += C;
        f2 += C;
        f3 += C;
    }
    test_values[0] = f0;
    test_values[1] = f1;
    test_values[2] = f2;
    test_values[3] = f3;
}

[[gnu::noinline, gnu::optimize ("fast-math")]]
void test_multiplication (void)
{
    auto f0 = test_values[0];
    auto f1 = test_values[1];
    auto f2 = test_values[2];
    auto f3 = test_values[3];
    for (int j = 0; j < N/4; j++)
    {
        f0 *= CI;
        f1 *= CI;
        f2 *= CI;
        f3 *= CI;
    }
    test_values[0] = f0;
    test_values[1] = f1;
    test_values[2] = f2;
    test_values[3] = f3;
}

[[gnu::noinline, gnu::optimize ("fast-math")]]
void test_multiply_accumulate (void)
{
    auto f0 = test_values[0];
    auto f1 = test_values[1];
    auto f2 = test_values[2];
    auto f3 = test_values[3];
    // Cross-linking the accumulators keeps the compiler from collapsing the
    // loop into a closed-form expression.
    for (int j = 0; j < N/4; j++)
    {
        f0 = f0 + f3 * CII;
        f1 = f1 + f2 * CII;
        f2 = f2 + f1 * CII;
        f3 = f3 + f0 * CII;
    }
    test_values[0] = f0;
    test_values[1] = f1;
    test_values[2] = f2;
    test_values[3] = f3;
}

void run_test (const char* name, std::function<void (void)> func)
{
    std::printf ("%s ... \n", name);
    for (int i = 0; i < M; ++i)
    {
        auto start_time = std::chrono::high_resolution_clock::now ();
        // The std::function indirection hides the callee from the optimizer;
        // the unreachable else branch tells GCC the call always happens.
        if (func)
            func ();
        else
            __builtin_unreachable ();
        auto end_time = std::chrono::high_resolution_clock::now ();
        auto duration = std::chrono::duration_cast<std::chrono::nanoseconds> (end_time - start_time);
        // Printing the accumulators keeps the computation observable, so the
        // optimizer cannot delete the loops entirely.
        std::printf ("f0 = %lf f1 = %lf f2 = %lf f3 = %lf\n",
                     (double)test_values[0], (double)test_values[1],
                     (double)test_values[2], (double)test_values[3]);
        std::printf ("%lf ns / insn\n", duration.count () / (double)N);
        std::printf ("%lf MOPS\n\n", (1'000'000'000 / (duration.count () / (double)N)) / 1'000'000);
    }
}

extern "C"
void app_main (void)
{
    test_values[0] = 1;
    test_values[1] = 2;
    test_values[2] = 3;
    test_values[3] = 4;
    run_test ("addition", &test_addition);
    run_test ("multiplication", &test_multiplication);
    run_test ("multiply-accumulate", &test_multiply_accumulate);
}
The following are the results. At 240 MHz, 1 clock cycle = 4.17 ns.
Integer Addition
1.600000 ns / insn
625.000000 MOPS
Integer Multiplication
23.400000 ns / insn
42.735043 MOPS
This test is meaningless: the mul* instructions are not used, and the compiler applies other arithmetic optimizations instead (see the barrier sketch at the end of this post).
Integer Multiply-Accumulate
22.500000 ns / insn
44.444444 MOPS
This test is meaningless as well: the addx instruction is not used, and the compiler again optimizes the arithmetic away.
Float Addition
5.800000 ns / insn
172.413793 MOPS
Float Multiplication
5.800000 ns / insn
172.413793 MOPS
Float Multiply-Accumulate
9.900000 ns / insn
101.010101 MOPS
This case actually does use the madd.s instruction.
Since there is no hardware support for double-precision floating point, there's little point in testing its performance, but just for completeness' sake ...
Double Addition
246.500000 ns / insn
4.056795 MOPS
Double Multiplication
456.600000 ns / insn
2.190101 MOPS
Double Multiply-Accumulate
667.200000 ns / insn
1.498801 MOPS
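For reference, it helps to convert these figures into CPU cycles. A quick back-of-the-envelope helper (host-side, not part of the benchmark; the numbers are simply copied from the float and double results above):

Code:
#include <cstdio>

// Back-of-the-envelope: cycles/insn = (ns/insn) / (ns/cycle) at 240 MHz.
int main ()
{
    const double ns_per_cycle = 1000.0 / 240.0;  // ~4.17 ns per cycle
    const struct { const char* name; double ns_per_insn; } results[] = {
        { "float add",  5.8   },
        { "float mul",  5.8   },
        { "float mac",  9.9   },
        { "double add", 246.5 },
        { "double mul", 456.6 },
        { "double mac", 667.2 },
    };
    for (const auto& r : results)
        std::printf ("%-10s %7.1f ns/insn  ~%5.1f cycles/insn\n",
                     r.name, r.ns_per_insn, r.ns_per_insn / ns_per_cycle);
    return 0;
}

That puts a float add or multiply at roughly 1.4 cycles, the float MAC at about 2.4 cycles, and the software doubles somewhere between 59 and 160 cycles per operation.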
Notice that this is a synthetic benchmark. It looks a bit as if the compiler has trouble with register allocation in larger code snippets, so the actual computation gets slowed down by register moves and the like. As usual with such benchmarks, real-world performance will be lower than this, except in hand-tuned compute functions.
So yeah, hmm ... FP performance is a little low on the ESP32.
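If anyone wants to make the integer variants meaningful, the usual trick is an empty inline-asm barrier that makes the operand opaque to the optimizer, so the mul*/addx instructions cannot be constant-folded or strength-reduced away. A minimal sketch, untested on the ESP32 (the function name and constants are just for illustration):

Code:
typedef unsigned int itype;

[[gnu::noinline]]
itype test_int_multiplication (void)
{
    itype f0 = 3;
    for (int j = 0; j < 10000; j++)
    {
        // The empty asm makes f0 opaque to the optimizer on every iteration,
        // forcing a real multiply instead of a folded closed-form result.
        asm volatile ("" : "+r" (f0));
        f0 *= 13;
    }
    return f0;  // returning the result keeps it alive
}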