Who am I?

  • Senior Software Engineer at Google
  • Data Pipelining Efficiency and Infrastructure (MapReduce and more)
  • Previously: Core Search Engine at Yandex
  • ClickHouse infra and efficiency
  • C++ library and compiler contributor
  • C++ teacher at universities

There are rumors going around...

  • You should trust your compiler
  • Compilers are smart
  • Don't overoptimize
  • C++ is fast

People don't understand how the compiler works, even after thousands of hours of practice:


static constexpr auto foo() noexcept -> auto&&

We prioritize freedom

  • No centralized build system or compiler

People choose C++ for control

But they often have no idea about the compiler's internals

Either way, you and your company choose to write C++ for:
  • Speed
  • Expressiveness
  • History
  • Knowledge
  • Hiring
  • Library Ecosystem
  • Tooling
  • Put your reason here

Compiler level

  • Speed
  • Tooling

Compilers

And others

Focusing mostly on Clang/LLVM; most points apply to GCC as well

Same C++, same set of inputs, same compilers

Source: simdjson

What happened?

Well, that's complicated:

  • SIMD (Single Instruction, Multiple Data)
  • Compiler options
  • Hundreds of hours of assembly-generation debugging

Point 1: don't blame compilers for being slow; they will not generate the best possible code for you

Point 2: there is a chance your code will grow a lot and you won't notice

Forgotten story 1: visibility

// library.h
void Foo();
void Bar();
// library.cc
void Foo() { ... }
void Bar() { ... }
// main.cc
#include "library.h"
Foo();
Bar();

⟹ ?

// compiled.o
call _Z3Bar
call _Z3Foo

// main.cc
#include "library.h"
Foo();
Bar();

⟹ No*

// compiled.o
call _Z3Bar
call _Z3Foo

They can use global memory:

void Foo() {
  LOG(INFO) << ...
}
void Bar() {
  LOG(INFO) << ...
}
Compilers just do not see it

Solution: write in headers!

Well... please don't write everything in headers:
// library.h
// Don't do that often
inline void Foo() {
}
// Don't do that often
inline void Bar() {
}
Compile times are also important, don't ignore them.

But we lose performance!

Productivity is also important

Link time optimizations
  • LTO sees the whole program
  • ThinLTO works from per-module summaries (it still sees all the potentially optimizable parts of the program)
  • -flto, -flto=thin (the latter is LLVM-only); see the sketch below
  • 0-10% performance gain; at Google we saw 3-4%
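
A minimal sketch of what ThinLTO enables (file names, function names, and the choice of the lld linker are illustrative):

// library.cc
int Foo() { return 42; }  // out of line: invisible to main.cc without LTO

// main.cc
int Foo();  // normally declared in a shared header

int main() {
    // With -flto=thin this call can be inlined across translation units
    // at link time; without LTO it stays an opaque `call _Z3Foov`.
    return Foo();
}

// Illustrative build:
//   clang++ -O2 -flto=thin -c library.cc main.cc
//   clang++ -flto=thin -fuse-ld=lld library.o main.o -o app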

(Possible) Workflow

  • Usual builds without ThinLTO
  • Release builds with ThinLTO
  • Gains and no blow up in compile times
  • Full LTO may not scale
Open Source ClickHouse (400k+ LOC)

-flto did not finish

Caveats

You don't test what you actually ship

At Google, no problems were seen across thousands of targets

Extensive sanitizer usage helps

Integration testing helps

Point 3: giving your compiler more visibility opens up more optimizations
You probably heard of optimization levels `-O0`, `-O1`, `-O2`, `-O3`, `-Ofast`

What exactly do they mean?

Let's first go over a high-level architectural overview of compilers

  • Front end: parses C++ into the first IR
  • Middle end: optimizations over IR
  • Back end: codegen for the architecture

Clang+LLVM

GCC

  • Middle end: optimizations over IR
    • Cleanup and canonicalization
    • Simplification and more canonicalization
    • Target-specific optimizations and lowering
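
You can inspect each stage yourself; a sketch using standard Clang flags (file and function names are illustrative):

// pipeline.cc
int Square(int x) { return x * x; }

// Front end output (unoptimized LLVM IR):
//   clang++ -O0 -S -emit-llvm pipeline.cc -o pipeline.O0.ll
// Middle end output (optimized LLVM IR):
//   clang++ -O2 -S -emit-llvm pipeline.cc -o pipeline.O2.ll
// Back end output (target assembly):
//   clang++ -O2 -S pipeline.cc -o pipeline.s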
LLVM and GCC:

  • `-O0` almost no passes
  • `-O1` includes some passes
  • `-O2` includes most passes
  • `-O3` includes all passes
  • `-Ofast` includes `-O3` + `-ffast-math` + some more exotic things (dangerous!)
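
One illustration of why `-Ofast` is dangerous: `-ffast-math` implies `-ffinite-math-only`, so checks like the one below may be folded away (a sketch; exact behavior varies by compiler and version):

#include <cmath>
#include <cstdio>

int main() {
    double huge = 1e308;
    // Under -Ofast the compiler may fold this isinf check to `false`,
    // because infinities are assumed not to exist; the program would
    // then print 0 instead of 1.
    std::printf("%d\n", std::isinf(huge * 10.0) ? 1 : 0);
}
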
What -O3 includes in GCC (clickable):
/* -O3 optimizations.  */
{ OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_floop_interchange, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_floop_unroll_and_jam, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fpeel_loops, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fpredictive_commoning, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fsplit_loops, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fsplit_paths, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribution, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_funswitch_loops, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL, VECT_COST_MODEL_DYNAMIC },
{ OPT_LEVELS_3_PLUS, OPT_fversion_loops_for_strides, NULL, 1 },
/* -ffast-math */
if (!opts->frontend_set_flag_unsafe_math_optimizations) {
	opts->x_flag_unsafe_math_optimizations = set;
	set_unsafe_math_optimizations_flags (opts, set);
}
if (!opts->frontend_set_flag_finite_math_only)
	opts->x_flag_finite_math_only = set;
if (!opts->frontend_set_flag_errno_math)
	opts->x_flag_errno_math = !set;
Loop vectorization is enabled by default at -O2 since GCC 12!
https://gcc.godbolt.org/z/ofKqnbKWY
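
A loop of the kind this affects; a sketch (before GCC 12, vectorizing it needed `-O3` or an explicit `-ftree-loop-vectorize`):

// GCC 12 auto-vectorizes this at plain -O2 into SIMD loads, adds and
// stores; __restrict rules out overlap between a and b.
void Add(int* __restrict a, const int* __restrict b, int n) {
    for (int i = 0; i < n; ++i) a[i] += b[i];
}
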
What -O3 includes in LLVM:

if (Level == OptimizationLevel::O3)
  FPM.addPass(AggressiveInstCombinePass());
// FIXME: It isn't at all clear why this should
// be limited to O3.
if (Level == OptimizationLevel::O3)
  MainCGPipeline.addPass(ArgumentPromotionPass());
if (Level == OptimizationLevel::O3)
  EarlyFPM.addPass(CallSiteSplittingPass());

LPM.addPass(SimpleLoopUnswitchPass(/* NonTrivial */ Level ==
                                   OptimizationLevel::O3));
https://gcc.godbolt.org/z/redozT7aK
Point 4: Use `-O0 -g` for debugging purposes, use `-O3` for optimization purposes

Problem: -O0 is too slow for debugging!

  • -O1
  • -fno-omit-frame-pointer
  • -fno-optimize-sibling-calls

Problem: -O0 is too slow for debugging!

  • -O1
    • Some optimizations
  • -fno-omit-frame-pointer
    • Better stacktraces (saves the base pointer)
  • -fno-optimize-sibling-calls
    • Disables tail-call elimination (debuggers get less confused)

Source: LLVM better sanitizer debugging
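
Putting the three flags together; a sketch (function names are illustrative):

// Without -fno-optimize-sibling-calls the optimizer may compile `Outer`
// as a plain jump to `Inner`, so `Outer` vanishes from backtraces.
int Inner(int x) { return x + 1; }
int Outer(int x) { return Inner(x); }  // tail-call candidate

// An illustrative debug-friendly build:
//   clang++ -O1 -g -fno-omit-frame-pointer -fno-optimize-sibling-calls ...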

-Og: optimize the debugging experience

Works nicely only in GCC

Source: GCC optimize Options

Request from a friend:

Hey, buddy! We likely have a compiler bug. If we move the function out of the anonymous namespace, it works. It also works in debug mode.

My response:

No problem, you can bisect optimizations https://llvm.org/docs/OptBisect.html

-mllvm -opt-bisect-limit=N
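
An illustrative bisection workflow (the file name is hypothetical; the BISECT pass indices are what you binary-search over):

// Print every pass with its index; -1 means "run everything, just log":
//   clang++ -O2 -c -mllvm -opt-bisect-limit=-1 buggy.cc 2>&1 | grep BISECT
// Then binary-search over N until the bug flips:
//   clang++ -O2 -c -mllvm -opt-bisect-limit=40000 buggy.cc   // bug present
//   clang++ -O2 -c -mllvm -opt-bisect-limit=20000 buggy.cc   // bug gone
// The smallest N that still reproduces the bug names the guilty pass.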

I don't know of any infra for this in GCC

Friend:

BISECT: running pass (35481) Two-Address instruction pass on function (_ZN3NYT7NDetail10TBindStateILb1EZNS0_11ApplyHelperIN12_GLOBAL__N_111TFirstClassEiFS4_RKNS_8TErrorOrIiEEEEENS_7TFutureIT_EENS_11TFutureBaseIT0_EENS_9TCallbackIT1_EEEUlS8_E_NS_4NMpl9TSequenceIJEEEJEE3RunIJS8_EEEDaPNS0_14TBindStateBaseEDpOT_)

It really was a bug 😳

Source

Request from a friend:

I change < to != and get +30% loop performance!

  • -Rpass-missed=loop-vectorize
    • Reports loops where vectorization was missed
  • -Rpass=loop-vectorize
    • Reports loops that were vectorized
  • -Rpass-analysis=loop-vectorize
    • Explains why a loop could not be vectorized

LLVM vectorizer debugging: https://llvm.org/docs/Vectorizers.html

// example.cpp:44:9: remark: loop not vectorized:
// could not determine number of loop iterations
// [-Rpass-analysis=loop-vectorize]
for (; it < span_end; ++it)
^
					
Macro
https://github.com/ClickHouse/ClickHouse/pull/9498

Forgotten story 2: Microarchitecture

  • Hardware evolves
  • New instruction sets are added
  • Caches and cachelines grow

Note: it depends on your hardware

x86. Westmere era (2010)

  • -msse2, -msse3, -mssse3, -msse4.1, -msse4.2
  • -mpclmul
  • -maes
  • -mcx16
  • -mpopcnt

x86. Westmere era (2010)

  • -msse2, -msse3, -mssse3, -msse4.1, -msse4.2
    • SSE instruction sets (16 byte registers)
  • -mpclmul
    • Carry-less multiplication (beneficial for CRC and erasure codes)
  • -maes
    • Advanced Encryption Standard (AES) instructions
  • -mcx16
    • Lock free 16 byte atomics
  • -mpopcnt
    • Popcount instruction
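
A taste of what these flags buy; a sketch (needs C++20 for <bit>):

#include <bit>
#include <cstdint>

// With -mpopcnt this compiles to a single popcnt instruction; with
// baseline x86-64 the compiler emits a longer bit-twiddling sequence.
int CountBits(std::uint64_t x) { return std::popcount(x); }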

x86. Haswell era (2013)

  • -mavx, -mavx2
  • -mbmi2
  • -mfma

x86. Haswell era (2013)

  • -mavx, -mavx2
    • Wider 32 byte registers
  • -mbmi2
    • Bit manipulation
  • -mfma
    • Fused multiply-add: a += b * c
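
A sketch of what `-mfma` changes (whether the contraction actually happens also depends on the compiler's `-ffp-contract` default):

// With -mavx2 -mfma this loop can become vfmadd instructions operating
// on 32-byte registers; without them, separate multiplies and adds.
void MulAdd(float* __restrict a, const float* __restrict b,
            const float* __restrict c, int n) {
    for (int i = 0; i < n; ++i) a[i] += b[i] * c[i];
}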

x86. Why not just AVX/AVX2/AVX512 by default?

  • It depends on your hardware and workload
  • Mixing AVX with non-AVX code -> CPU downclocking
  • Heavy AVX (floating point) -> CPU downclocking
    • On all CPUs up to and including Skylake
    • From Ice Lake onward it gets better
⚠️ Next slide is absolutely subjective and comes from my biased experience
Point 5: compilers assume really old hardware by default (e.g. Core 2 for x86, 2006); use better options
Point 6: `-march=native` is extremely dangerous because of illegal instructions and non-homogeneous hardware
Ok, I don't use any SIMD, why would I want that?
Third-party libraries may require it:
simdjson, Hyperscan, Turbo-Base64, ZSTD, Snappy, Yandex FastOps, OpenSSL
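
If you cannot raise the baseline everywhere, runtime dispatch is a common middle ground (the Yandex results below mention it for hot paths). A minimal sketch with GCC/Clang builtins; all names are illustrative:

#include <cstddef>

// The target attribute lets just this function use AVX2 codegen while
// the rest of the translation unit stays at the baseline ISA.
__attribute__((target("avx2")))
static float SumAvx2(const float* p, std::size_t n) {
    float s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

static float SumBaseline(const float* p, std::size_t n) {
    float s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

float Sum(const float* p, std::size_t n) {
    // Checked once; __builtin_cpu_supports exists in GCC and Clang on x86.
    static const bool has_avx2 = __builtin_cpu_supports("avx2");
    return has_avx2 ? SumAvx2(p, n) : SumBaseline(p, n);
}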

Forgotten story 3: Function inlining

Generally inlining increases performance (see ThinLTO)
Function inlining is driven by a cost model: the compiler computes a cost and compares it against a threshold:
// LLVM's inline cost constants (from llvm/Analysis/InlineCost.h):
const int OptSizeThreshold = 50;
const int OptMinSizeThreshold = 5;
const int OptAggressiveThreshold = 250;
const int InstrCost = 5;
const int IndirectCallThreshold = 100;
const int LoopPenalty = 25;
const int LastCallToStaticBonus = 15000;
const int ColdccPenalty = 2000;
const unsigned TotalAllocaSizeRecursiveCaller = 1024;
const uint64_t MaxSimplifiedDynamicAllocaToInline = 65536;
You cannot inline everything, and sometimes inlining is bad (e.g. for the instruction cache)
`-mllvm -inline-threshold=N` Default is 225
  • Inlining increases your build size
  • And compile time
The inline keyword matters! (hehe) And so does __attribute__((always_inline))
Idea: run with `-mllvm -inline-threshold=1000`, compare benchmarks, then set always_inline attributes on a dozen hot calls
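
A sketch of pinning the decision for one hot function once benchmarks justify it (the function is illustrative):

// The attribute overrides the cost model for this one function.
// Use sparingly: oversized inlining bloats the binary and can hurt
// the instruction cache.
__attribute__((always_inline)) inline int HotHelper(int x) {
    return x * 37 + 1;
}
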
Good problem to have: "I have an absolutely unrelated commit, but benchmarks got worse"
Answer: check function/loop alignment
Caution: awful code!
Stackoverflow: Why does GCC generate 15-20% faster code if I optimize for size instead of speed?
It was a jump crossing a 32-byte boundary

Bad version:

00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret
Good version:
00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

[...]

  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>
Another problem: Jump Conditional Code Erratum
  • `-mbranches-within-32B-boundaries`
  • `-falign-functions=32`
These flags noticeably stabilized benchmarks, for example at Google and in ClickHouse

Standard library

Compilers typically ship with their own standard library

Main differences:

  • Small string optimization size
  • Vector/string resizing policy
  • Additional options
#if defined(_LIBCPP_ABI_UNSTABLE) || _LIBCPP_ABI_VERSION >= 2
// Re-worked external template instantiations for std::string with a focus on
// performance and fast-path inlining.
#define _LIBCPP_ABI_STRING_OPTIMIZED_EXTERNAL_INSTANTIATION
// Enable clang::trivial_abi on std::unique_ptr.
#define _LIBCPP_ABI_ENABLE_UNIQUE_PTR_TRIVIAL_ABI
// Enable clang::trivial_abi on std::shared_ptr and std::weak_ptr
#define _LIBCPP_ABI_ENABLE_SHARED_PTR_TRIVIAL_ABI
2-2.5% binary size win at Google (it requires building everything from source)
C++20 char8_t is one of the best features for performance: it lets you express non-overlapping (non-aliasing) pointers. Source: ClickHouse
This means that, unlike unsigned char, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types
Source
memmove/memcpy
ClickHouse is column-oriented, and marking columns as char8_t is desirable
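
A sketch of the aliasing difference (needs C++20):

#include <cstddef>

// If `dst` were `char*` or `unsigned char*`, the stores could alias
// `*sum`, forcing a reload of `*sum` on every iteration. `char8_t`
// cannot alias `int`, so `*sum` can stay in a register for the loop.
void Fill(char8_t* dst, std::size_t n, const int* sum) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = static_cast<char8_t>(*sum);
}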

Summary:

  • Visibility
  • Optimization Passes
  • Microarchitecture
  • Standard Library
  • Inlining

Many more

  • Profile guided optimization
  • Post link optimization (BOLT/Propeller)
  • New pass manager (default in clang-13)
  • Tooling: clang-tidy, AST matchers

Many more

  • Profile guided optimization
    • Collect profiles and have better loop/inline decisions: LLVM PGO
  • Post link optimization (BOLT/Propeller)
    • Moves hot basic blocks together without recompilation
  • New pass manager (default in clang-13)
    • Better pass manager strategy (mostly inlining)
  • Tooling: clang-tidy, AST matchers
    • Static analysis
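
An illustrative instrumentation-based PGO cycle with Clang (file names are hypothetical):

// 1. Build with instrumentation:
//   clang++ -O2 -fprofile-instr-generate app.cc -o app
// 2. Run a representative workload; it writes default.profraw.
// 3. Merge the raw profiles:
//   llvm-profdata merge -output=app.profdata default.profraw
// 4. Rebuild with the profile guiding inlining and block layout:
//   clang++ -O2 -fprofile-instr-use=app.profdata app.cc -o app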

Results

Mozilla Firefox

  • Move from GCC/MSVC to Clang
  • -flto, PGO
  • -fexperimental-new-pass-manager (by default in clang-13)

5-7% win on Talos tests

Mozilla Blog

Pass Manager Commit

Update: As there’s been some interest on reddit and HN, and I failed to mention it originally, it’s worth noting that comparing GCC+PGO vs. clang+LTO or GCC+PGO vs. clang+PGO was a win for clang overall in both cases, although GCC was winning on a few benchmarks. If I remember correctly, clang without PGO/LTO was also winning against GCC without PGO.

Stockfish

  • -flto
  • -fexperimental-new-pass-manager (by default in clang-13)
  • Options up to AVX512
  • Performance parity with GCC

3% win or around 20 Elo points

Stockfish issue with the whole discussion

Yandex Search Engine

  • -msse4.2
  • -flto=thin
  • AVX/FMA dynamic dispatch on hot paths
  • libc++
  • Moved from GCC to LLVM
  • Inline threshold 1000

7% better average query response time

One trading company

  • -flto=thin
  • -march=haswell
  • Inline threshold 1000

4% faster strategy response. Dropped GCC in favor of LLVM

ClickHouse — open source analytics database
  • Transition from GCC to LLVM
  • -O3
  • -msse4.2
  • -flto=thin
  • Transition from libstdc++ to libc++
  • libc++ unstable ABI
  • char8_t
  • Function alignment 32
  • Inline threshold 1000 (dropped in the end)

7-10% better query response

12% binary size decrease

Google's perspective

  • We run all infrastructure built with Clang/LLVM
  • We build LLVM from (almost) head
  • ThinLTO for release builds, for testing no LTO
  • PGO for highly important binaries
  • -msse4.2, -march=haswell
  • Moved from libstdc++ to libc++

3-5% fleetwide savings

2-2.5% binary size win

Thanks to carzil@ for proofreading!