Who am I?
- Senior Software Engineer at Google
- Data center (DC) efficiency
Agenda for today
- History of sorting
- Why have we decided to change anything?
- Bugs, bugs, bugs
- Implementation
- What can you do?
Reminders
- Sorting is the ordering of elements
- std::sort, std::stable_sort, ranges::sort, etc
Quicksort
- Take any element as the pivot
- Smaller elements go to the left
- Larger elements go to the right
- Recurse on both parts
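A minimal sketch of the idea in C++ (an illustration only, not any library's implementation):

```cpp
#include <algorithm>
#include <vector>

// Quicksort sketch: pick a pivot, partition smaller elements to the left,
// larger ones to the right, then recurse on both sides.
void quicksort(std::vector<int>& v, int lo, int hi) {  // sorts v[lo, hi)
  if (hi - lo < 2) return;
  int pivot = v[lo + (hi - lo) / 2];  // take any element
  auto less = std::partition(v.begin() + lo, v.begin() + hi,
                             [&](int x) { return x < pivot; });
  auto greater = std::partition(less, v.begin() + hi,
                                [&](int x) { return !(pivot < x); });
  quicksort(v, lo, int(less - v.begin()));     // recurse on the left part
  quicksort(v, int(greater - v.begin()), hi);  // recurse on the right part
}
```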
Standard library evolved from STL
Quicksort's worst case is $O(n^2)$
Just keep picking the worst element as the pivot
Introsort
- Almost like quicksort
- If we suspect too many recursive calls, fall back to heap sort
- GCC libstdc++ uses $2 \log_2 n$ depth limit
- Microsoft standard library does the same
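A compact sketch of the depth-limit idea, reusing the partition scheme above (not the actual libstdc++ or MSVC code):

```cpp
#include <algorithm>
#include <bit>
#include <vector>

// Introsort sketch: quicksort with a recursion budget of 2*log2(n).
// When the budget runs out, fall back to heap sort, which is
// O(n log n) even in the worst case.
void introsort_impl(std::vector<int>& v, int lo, int hi, int budget) {
  if (hi - lo < 2) return;
  if (budget == 0) {  // too deep: suspicious input, switch to heap sort
    std::make_heap(v.begin() + lo, v.begin() + hi);
    std::sort_heap(v.begin() + lo, v.begin() + hi);
    return;
  }
  int pivot = v[lo + (hi - lo) / 2];
  auto less = std::partition(v.begin() + lo, v.begin() + hi,
                             [&](int x) { return x < pivot; });
  auto greater = std::partition(less, v.begin() + hi,
                                [&](int x) { return !(pivot < x); });
  introsort_impl(v, lo, int(less - v.begin()), budget - 1);
  introsort_impl(v, int(greater - v.begin()), hi, budget - 1);
}

void introsort(std::vector<int>& v) {
  int log2n = int(std::bit_width(v.size()));  // roughly log2(n) + 1
  introsort_impl(v, 0, int(v.size()), 2 * log2n);
}
```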
LLVM libc++ had no such fallback at all for a long time
7 years
Only 0.01% of all calls got into the heap sort fallback in production
What is a good sort?
- $O(n\log n)$ comparisons
- Recognizes almost sorted patterns
- Fast for modern hardware
- Fewer comparisons when comparisons are expensive
[Benchmark plots: libc++ vs. libstdc++ on ascending inputs]
How has libc++ achieved that?
Insertion sort is tried at every step, giving up after 8 insertions.
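A sketch of that trick (a hypothetical helper, not the libc++ source): attempt insertion sort and bail out once too many elements turn out to be out of place.

```cpp
#include <cstddef>
#include <vector>

// Attempt insertion sort; give up after 8 elements had to be moved.
// Returns true if the range ended up sorted. On bailout the range is
// still a permutation of the input, so the caller can keep quicksorting.
bool bounded_insertion_sort(std::vector<int>& v) {
  const int kLimit = 8;
  int insertions = 0;
  for (std::size_t i = 1; i < v.size(); ++i) {
    int x = v[i];
    std::size_t j = i;
    if (x < v[j - 1]) {
      if (++insertions > kLimit) return false;  // too unsorted: bail out
      do {
        v[j] = v[j - 1];
        --j;
      } while (j > 0 && x < v[j - 1]);
      v[j] = x;
    }
  }
  return true;
}
```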
Overall, every now and then one implementation sorts better than the other.
Let's change it and save some cycles
We tried
Problem: ties
It took a year to fix all golden tests
Problem: ties. How do we find them?
Shuffle the range before sorting
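A sketch of the idea (not the libc++ implementation): in debug builds, shuffle first, so code that silently depended on how equal elements were ordered fails fast.

```cpp
#include <algorithm>
#include <random>

// In debug builds, shuffle before sorting: the order of equal elements
// then changes from run to run, exposing code that relied on it.
template <class It, class Cmp>
void debug_sort(It first, It last, Cmp cmp) {
#ifndef NDEBUG
  static std::mt19937 gen{std::random_device{}()};
  std::shuffle(first, last, gen);
#endif
  std::sort(first, last, cmp);
}
```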
Problem: ties. How do we find them?
Use address space layout randomization (ASLR) as the source of randomness
Randomization went beyond std::sort
Also to std::nth_element, std::partial_sort
std::nth_element, std::partial_sort
Randomize before the call, and randomize the unspecified ranges after it
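For std::nth_element, say, only three facts about the output are guaranteed; everything else is unspecified and can be shuffled in a debug build. A sketch of the idea, assuming random-access iterators (not the libc++ code):

```cpp
#include <algorithm>
#include <random>

// After nth_element: elements in [first, nth) are <= *nth, *nth is in
// its final sorted position, and elements in (nth, last) are >= *nth.
// The order *within* the two sides is unspecified, so shuffle them
// afterwards (and the whole input beforehand) to flush out code that
// relies on it.
template <class It>
void debug_nth_element(It first, It nth, It last) {
  static std::mt19937 gen{std::random_device{}()};
  std::shuffle(first, last, gen);    // randomize before
  std::nth_element(first, nth, last);
  if (nth == last) return;
  std::shuffle(first, nth, gen);     // randomize the unspecified left side
  std::shuffle(nth + 1, last, gen);  // ... and the unspecified right side
}
```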
Determinism
Example: ClickHouse
Determinism
- Assume you sort some data
- Cache it
- Now ties are handled randomly
- Lower hit rate
Determinism
Use std::stable_sort
Let's write a median function
Before the change you might get lucky, because nth_element used to leave elements around the nth position close to totally ordered
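A typical example of such a latent bug (hypothetical code):

```cpp
#include <algorithm>
#include <vector>

// For an even count, this code assumes *(mid - 1) is the maximum of the
// lower half after nth_element. That order is unspecified: it merely
// used to work because elements around nth were left close to sorted.
double buggy_median(std::vector<double> v) {  // assumes !v.empty()
  auto mid = v.begin() + v.size() / 2;
  std::nth_element(v.begin(), mid, v.end());
  if (v.size() % 2 == 1) return *mid;
  return (*mid + *(mid - 1)) / 2;  // BUG: *(mid - 1) is arbitrary
  // Correct: use *std::max_element(v.begin(), mid) as the lower value.
}
```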
We found so many of them!
Use _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY with libcxx
Strict weak ordering
- Irreflexivity: $x < x$ is false
- Asymmetry: $x < y$ and $y < x$ cannot be both true
- Transitivity: $x < y$ and $y < z$ imply $x < z$
- Transitivity of incomparability: $x == y$ and $y == z$ imply $x == z$, where $x == y$ means $x < y$ and $y < x$ are both false.
To check this naively we must examine all triples of elements: $O(n^3)$
But what are the risks?
On paper, UB. But in practice?
Risks
- Incorrect comparator on $\leq 30$ elements: all good
- Incorrect comparator on $> 30$ elements: can fail with out-of-bounds (OOB) accesses
- Have you written a well-thought-out test with 30+ elements?
At Google we consistently see an outage somewhere every few months because of this
Checking all triples is $O(n^3)$, while sorting itself is only $O(n \log n)$.
There is an $O(n^2)$ algorithm to check for strict weak ordering.
$O(n^2)$ is still a lot, but we can run it on, say, the first 100 elements.
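A sketch of how such a quadratic check can work, assuming the range has already been sorted with the comparator under test (an illustration of the idea, not the libc++ source):

```cpp
#include <cstddef>
#include <vector>

// O(n^2) strict weak ordering check on a range already sorted by comp:
// split it into equivalence classes, verify every pair inside a class
// is incomparable, and every element of a class is less than every
// later element.
template <class T, class Cmp>
bool is_strict_weak_ordering(const std::vector<T>& v, Cmp comp) {
  std::size_t n = v.size();
  std::size_t a = 0;
  while (a != n) {
    std::size_t b = a + 1;                    // find the end of the
    while (b != n && !comp(v[a], v[b])) ++b;  // equivalence class of v[a]
    for (std::size_t i = a; i != b; ++i)
      for (std::size_t j = i; j != b; ++j)    // j == i checks irreflexivity
        if (comp(v[i], v[j]) || comp(v[j], v[i])) return false;
    for (std::size_t i = a; i != b; ++i)
      for (std::size_t j = b; j != n; ++j)    // class precedes the rest
        if (!comp(v[i], v[j]) || comp(v[j], v[i])) return false;
    a = b;
  }
  return true;
}
```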
Most sorts are small!
Caveats
- The sort itself must not crash before the check runs
- It only tells whether strict weak ordering holds on the given elements
- An arbitrary comparator can still do anything
- We can also check that std::sort itself does not read OOB
Submitted SWO checks for std::{stable_,partial_}sort; rolled out in LLVM 17 (commit 7e1ee1)
-D_LIBCPP_DEBUG_STRICT_WEAK_ORDERING_CHECK -D_LIBCPP_ENABLE_DEBUG_MODE=1
We found fewer of these bugs, but they were more dangerous.
Can we sort {1.02, 1.01, 1.0}?
Yes
Can we sort {3.0, NaN, 4.0}?
No
NaN compares equivalent to both 3.0 and 4.0, yet 3.0 < 4.0, so transitivity of incomparability is broken
Can we sort {NaN, 3.0, NaN}?
Yes
All elements are pairwise equivalent, which is a valid strict weak ordering
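Both answers can be verified directly with operator< on doubles (a small demonstration):

```cpp
#include <cmath>
#include <iostream>

// Every < comparison involving NaN is false, so under operator< a NaN
// is "equivalent" to any number.
int main() {
  double nan = std::nan("");
  // {3.0, NaN, 4.0}: NaN is equivalent to 3.0 and to 4.0, so transitivity
  // of incomparability demands 3.0 equivalent to 4.0, but 3.0 < 4.0.
  std::cout << (3.0 < nan) << (nan < 3.0)   // 00: equivalent
            << (4.0 < nan) << (nan < 4.0)   // 00: equivalent
            << (3.0 < 4.0) << '\n';         // 1:  contradiction, not a SWO
  // {NaN, 3.0, NaN}: every pairwise comparison is false, so all three
  // elements form one equivalence class. That is a valid SWO.
}
```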
What are we changing?
- After these cleanups we can switch to any sort
- We chose a mix of BlockQuickSort and pdqsort
Data dependent branches are hard to predict
CTZ - Count Trailing Zeros, BLSR - Reset Lowest Set Bit
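A sketch of the branchless, BlockQuickSort-style idea (an illustration, not the libc++ partition code): comparison results for a block of elements are packed into a bitmask with no data-dependent branches, and the set bits are then walked with CTZ and BLSR.

```cpp
#include <bit>
#include <cstdint>

// Compare 64 elements against the pivot without data-dependent branches,
// packing the results into one 64-bit mask.
uint64_t build_mask(const int* block, int pivot) {
  uint64_t mask = 0;
  for (int i = 0; i < 64; ++i)
    mask |= uint64_t(block[i] >= pivot) << i;  // branchless: setcc + shift
  return mask;
}

// Walk the set bits: CTZ finds the next offset, BLSR clears it.
template <class Visit>
void for_each_offset(uint64_t mask, Visit visit) {
  while (mask != 0) {
    visit(std::countr_zero(mask));  // CTZ: index of the next element
    mask &= mask - 1;               // BLSR: reset lowest set bit
  }
}
```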
We also improved pivot selection: the median of 9 elements, the so-called "Tukey's ninther"
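The ninther is the median of the medians of three triples sampled across the range; a sketch (the sample positions here are a simplification):

```cpp
#include <algorithm>

// Median of three values.
int median3(int a, int b, int c) {
  return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Tukey's ninther: the median of medians of three triples, giving a far
// more robust pivot than a single sample. Requires n >= 8 here.
int ninther(const int* v, int n) {
  int s = n / 8;  // simplified stride across the range
  return median3(median3(v[0 * s], v[1 * s], v[2 * s]),
                 median3(v[3 * s], v[4 * s], v[5 * s]),
                 median3(v[6 * s], v[7 * s], v[n - 1]));
}
```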
Benchmark                    old          new          delta
BM_Sort_uint64_Rand_1        4.26ns ± 0%  4.41ns ± 1%   +3.44%
BM_Sort_uint64_Rand_4        1.96ns ± 0%  1.95ns ± 1%   -0.34%
BM_Sort_uint64_Rand_16       10.5ns ± 0%  11.6ns ± 1%  +10.77%
BM_Sort_uint64_Rand_64       18.8ns ± 0%  17.5ns ± 1%   -6.60%
BM_Sort_uint64_Rand_256      26.4ns ± 0%  21.7ns ± 1%  -17.88%
BM_Sort_uint64_Rand_1024     33.8ns ± 1%  24.6ns ± 1%  -27.15%
BM_Sort_uint64_Rand_16384    48.7ns ± 0%  30.1ns ± 1%  -38.10%
BM_Sort_uint64_Rand_262144   63.4ns ± 1%  35.7ns ± 1%  -43.71%
On other patterns we sometimes regressed, but only slightly; random inputs are the hardest case.
Benchmark                    old          new          delta
BM_Sort_string_Random_1      4.66ns ± 6%  4.40ns ± 4%   -5.55%
BM_Sort_string_Random_4      14.9ns ± 3%  15.0ns ± 6%        ~
BM_Sort_string_Random_16     45.5ns ± 6%  35.8ns ± 8%  -21.37%
BM_Sort_string_Random_64     66.6ns ± 4%  58.2ns ± 3%  -12.69%
BM_Sort_string_Random_256    86.0ns ± 5%  77.4ns ± 3%  -10.01%
BM_Sort_string_Random_1024   106ns  ± 3%  96ns   ± 6%   -9.39%
BM_Sort_string_Random_16384  154ns  ± 3%  141ns  ± 5%   -8.03%
BM_Sort_string_Random_262144 213ns  ± 4%  197ns  ± 4%   -7.59%
Some thoughts
- The default standard sort is unstable
- Python, Rust, and Java default to stable sorts
- Why does std::sort read OOB when the comparator is broken?
- std::ranges::sort was designed to take one range instead of an iterator pair, yet we found 0 std::sort bugs caused by mismatched iterators
Some more thoughts
- std::weak_ordering in C++20 helps you make fewer of these bugs
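For instance, declaring the comparison category forces you to confront NaN at compile time (a hypothetical example): a double's operator<=> yields std::partial_ordering, which does not convert to std::weak_ordering, so a comparator that forgets about NaN simply does not compile.

```cpp
#include <compare>

struct Measurement {
  int id;
  double value;

  // OK: int's <=> yields std::strong_ordering, which converts to
  // std::weak_ordering.
  std::weak_ordering operator<=>(const Measurement& o) const {
    return id <=> o.id;
    // return value <=> o.value;  // would NOT compile: doubles compare
    //                            // as std::partial_ordering (NaN!)
  }
  bool operator==(const Measurement& o) const { return id == o.id; }
};
```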
Results
- We called it a comparator sanitizer
- We saw 7-8% perf improvement in production
- Reduced branch mispredictions a lot
- Fixed thousands of bugs
- Fought Hyrum's Law
- Even the simplest things are hard to do right
- Things take time (2.5 years)
What can you do?
- If you use libcxx, enable -D_LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY -D_LIBCPP_DEBUG_STRICT_WEAK_ORDERING_CHECK -D_LIBCPP_ENABLE_DEBUG_MODE=1 in your debug builds
- Randomization can probably be applied to std::partition and std::make_heap
- Add SWO checks for std::nth_element, etc.
What can you do?
- Remember how Bjarne talked about the invalidation of vector iterators?
- In debug mode, force a reallocation on each of the first 100 push_backs (a sketch follows)
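A sketch of that idea (hypothetical, not an existing library mode):

```cpp
#include <iterator>
#include <utility>
#include <vector>

// Debug-mode idea: force a reallocation on each of the first 100
// push_backs, so iterators and pointers into the old buffer dangle
// immediately and sanitizers catch the use-after-free.
template <class T>
void debug_push_back(std::vector<T>& v, T value, int& call_count) {
  v.push_back(std::move(value));
  if (++call_count <= 100) {
    std::vector<T> fresh(std::make_move_iterator(v.begin()),
                         std::make_move_iterator(v.end()));
    v.swap(fresh);  // the old buffer is freed when 'fresh' goes out of scope
  }
}
```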