People don't understand how the compiler works, even after thousands of hours of practice
static constexpr auto foo() noexcept -> auto&&
We prioritize freedom
People choose C++ for control
But they have little idea of the compiler internals
We focus mostly on Clang/LLVM; most points also apply to GCC
Source: simdjson
// library.h
void Foo();
void Bar();
// library.cc
void Foo() { ... }
void Bar() { ... }
// main.cc
#include "library.h"
int main() {
  Foo();
  Bar();
}
\[ \Longrightarrow \textrm{ ?} \]
// compiled.o
call _Z3Bar
call _Z3Foo
// main.cc
#include "library.h"
int main() {
  Foo();
  Bar();
}
\[ \Longrightarrow \textrm{ No*} \]
// compiled.o
call _Z3Bar
call _Z3Foo
void Foo() {
LOG(INFO) << ...
}
void Bar() {
LOG(INFO) << ...
}
// library.h
// Don't do that often
inline void Foo() {
}
// Don't do that often
inline void Bar() {
}
But we lose performance!
Productivity is important too
-flto builds did not finish
You don't test what you really build
At Google, no problems were seen across thousands of targets
Extensive sanitizer usage helps
Integration testing helps
First, let's go over a high-level architectural overview of compilers
Clang+LLVM
GCC
/* -O3 optimizations. */
{ OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_floop_interchange, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_floop_unroll_and_jam, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fpeel_loops, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fpredictive_commoning, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fsplit_loops, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fsplit_paths, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribution, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_funswitch_loops, NULL, 1 },
{ OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL, VECT_COST_MODEL_DYNAMIC },
{ OPT_LEVELS_3_PLUS, OPT_fversion_loops_for_strides, NULL, 1 },
/* -ffast-math */
if (!opts->frontend_set_flag_unsafe_math_optimizations) {
opts->x_flag_unsafe_math_optimizations = set;
set_unsafe_math_optimizations_flags (opts, set);
}
if (!opts->frontend_set_flag_finite_math_only)
opts->x_flag_finite_math_only = set;
if (!opts->frontend_set_flag_errno_math)
opts->x_flag_errno_math = !set;
if (Level == OptimizationLevel::O3)
FPM.addPass(AggressiveInstCombinePass());
// FIXME: It isn't at all clear why this should
// be limited to O3.
if (Level == OptimizationLevel::O3)
MainCGPipeline.addPass(ArgumentPromotionPass());
if (Level == OptimizationLevel::O3)
EarlyFPM.addPass(CallSiteSplittingPass());
LPM.addPass(SimpleLoopUnswitchPass(/* NonTrivial */ Level ==
OptimizationLevel::O3));
Source: LLVM better sanitizer debugging
-Og. Optimize debugging experience
Works nicely only in GCC
Source: GCC optimize Options
Hey, buddy! We likely have a compiler bug: if we move the function out of the anonymous namespace, it works. It also works in debug mode.
No problem, you can bisect optimizations https://llvm.org/docs/OptBisect.html
-mllvm -opt-bisect-limit=N
I don't know of any infra for this in GCC

BISECT: running pass (35481) Two-Address instruction pass on function (_ZN3NYT7NDetail10TBindStateILb1EZNS0_11ApplyHelperIN12_GLOBAL__N_111TFirstClassEiFS4_RKNS_8TErrorOrIiEEEEENS_7TFutureIT_EENS_11TFutureBaseIT0_EENS_9TCallbackIT1_EEEUlS8_E_NS_4NMpl9TSequenceIJEEEJEE3RunIJS8_EEEDaPNS0_14TBindStateBaseEDpOT_)
I change < to != and get +30% loop performance!
-Rpass-missed=loop-vectorize
-Rpass=loop-vectorize
-Rpass-analysis=loop-vectorize
LLVM vectorizer debugging: https://llvm.org/docs/Vectorizers.html
// example.cpp:44:9: remark: loop not vectorized:
// could not determine number of loop iterations
// [-Rpass-analysis=loop-vectorize]
for (; it < span_end; ++it)
^
const int OptSizeThreshold = 50;
const int OptMinSizeThreshold = 5;
const int OptAggressiveThreshold = 250;
const int InstrCost = 5;
const int IndirectCallThreshold = 100;
const int LoopPenalty = 25;
const int LastCallToStaticBonus = 15000;
const int ColdccPenalty = 2000;
const unsigned TotalAllocaSizeRecursiveCaller = 1024;
const uint64_t MaxSimplifiedDynamicAllocaToInline = 65536;
Bad version:
00000000004004fd <_ZL3addRKiS0_.isra.0>:
4004fd: 8d 04 37 lea eax,[rdi+rsi*1]
400500: c3                      ret

Good version:
00000000004004fa <_ZL3addRKiS0_.isra.0>:
4004fa: 8d 04 37 lea eax,[rdi+rsi*1]
4004fd: c3 ret
[...]
40051a: e8 db ff ff ff call 4004fa <_ZL3addRKiS0_.isra.0>
Main differences:
#if defined(_LIBCPP_ABI_UNSTABLE) || _LIBCPP_ABI_VERSION >= 2
// Re-worked external template instantiations for std::string with a focus on
// performance and fast-path inlining.
#define _LIBCPP_ABI_STRING_OPTIMIZED_EXTERNAL_INSTANTIATION
// Enable clang::trivial_abi on std::unique_ptr.
#define _LIBCPP_ABI_ENABLE_UNIQUE_PTR_TRIVIAL_ABI
// Enable clang::trivial_abi on std::shared_ptr and std::weak_ptr
#define _LIBCPP_ABI_ENABLE_SHARED_PTR_TRIVIAL_ABI
This means that, unlike unsigned char, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types. Source
Summary:
Many more
Mozilla Firefox
Update: As there's been some interest on reddit and HN, and I failed to mention it originally, it's worth noting that comparing GCC+PGO vs. clang+LTO or GCC+PGO vs. clang+PGO was a win for clang overall in both cases, although GCC was winning on a few benchmarks. If I remember correctly, clang without PGO/LTO was also winning against GCC without PGO.
Stockfish
Yandex Search Engine
One trading company
Thanks to carzil@ for proofreading!