Introduction Link to heading
In the high-stakes world of financial trading, where milliseconds can mean the difference between profit and loss, software efficiency isn’t just a nice-to-have—it’s a competitive edge. High-frequency trading (HFT) systems process millions of orders per second, demanding code that runs at breakneck speeds while managing vast amounts of real-time data. At the heart of these systems lies the order book, a dynamic structure that tracks buy (bid) and sell (ask) orders for a given asset, sorted by price and time priority.
But why C++? Its low-level control over memory, direct hardware access, and support for advanced optimizations make it the language of choice for performance-critical applications like trading engines. In this blog post, we’ll dive into practical optimization techniques tailored for an order book implementation.
From leveraging data structures to exploiting cache locality and multithreading, we’ll explore how to squeeze every ounce of performance out of your code. Along the way, I’ll share code snippets and benchmarks to illustrate real-world gains, drawing from common pitfalls in trading software development.
Let’s get started by examining the core components of an order book and how naive implementations fall short in high-volume scenarios.
Understanding Order Book Data Structures Link to heading
A limit order book (LOB) is the core data structure in a trading matching engine. It maintains all outstanding buy (bid) and sell (ask) limit orders for an asset, organized to enable fast matching based on price-time priority:
- Price priority: Best bids (highest price) and best asks (lowest price) match first.
- Time priority: At the same price, older orders match before newer ones (FIFO).
The book is split into two sides:
- Bids: Sorted descending by price (best at the top).
- Asks: Sorted ascending by price (best at the top).
Each price level aggregates orders at that price, tracking total volume and often a queue of individual orders.
Key Operations and Requirements Link to heading
A production-grade order book must support:
- Add order: Insert at a price level (append to time queue).
- Cancel order: Remove by order ID.
- Execute/reduce: Partially or fully fill an order.
- Queries: Get best bid/offer (BBO), volume at price, depth traversal.
These need to be extremely fast (often O(1) or amortized O(1)) because trading systems process millions of events per second. Cache efficiency, low latency, and predictable performance are critical in C++ implementations.
Common Data Structure Choices Link to heading
The most widely used and performant design (seen in many open-source HFT books and production systems) is:
- Price levels container: A sorted map or tree for price levels.
- Orders at each price level: A FIFO queue.
- Fast order lookup: A hash map from order ID to its resting order (or its price level), enabling O(1) cancels and amends.
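A minimal sketch of this classic layout (type names like MapBook and RestingOrder are illustrative, not from any particular codebase):

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <unordered_map>

using OrderID = uint64_t;
using Price = int64_t;
using Quantity = uint64_t;

struct RestingOrder { OrderID id; Quantity qty; };

// Bids sorted descending, asks ascending: the best price is always begin().
struct MapBook {
    std::map<Price, std::deque<RestingOrder>, std::greater<Price>> bids;
    std::map<Price, std::deque<RestingOrder>> asks;
    std::unordered_map<OrderID, Price> id_to_price; // fast cancel lookup

    void add_bid(OrderID id, Price px, Quantity qty) {
        bids[px].push_back({id, qty}); // FIFO: appending preserves time priority
        id_to_price[id] = px;
    }

    Price best_bid() const { return bids.begin()->first; }
};
```

Insertion and lookup are O(log N) in the number of price levels; the FIFO deque per level gives time priority for free.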
The Dense Vector-Based Order Book: A High-Performance Alternative Link to heading
While the classic std::map-based order book excels in generality for sparse price levels, many real-world trading scenarios—especially in crypto, futures, or equities with fixed tick sizes—benefit from a dense array (vector) implementation. This approach trades some memory for blazing-fast access and updates, making it ideal when prices are discretized (e.g., integer ticks or multiplied by 100 for cents) and the price range is bounded.
The core idea is that every book entry is indexed by its price. Assume prices are integers (or can be mapped to integers via tick size scaling). For example:
- Minimum price: 0
- Maximum price: 1,000,000 (covers most assets with sub-cent precision)
We use two vectors:
std::vector<BookEntry> bids_;
std::vector<BookEntry> offers_;
Each vector is sized to max_price + 1. The index corresponds directly to the price:
- bids_[price] holds the aggregated quantity at that bid price.
- offers_[price] holds the aggregated quantity at that ask price.
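A stripped-down sketch of this dense layout (DenseBook is an illustrative name, and the real BookEntry described later also carries the price):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using OrderPrice = uint64_t;
using OrderQuantity = uint64_t;

struct BookEntry { OrderQuantity quantity = 0; };

// Dense book: the price *is* the index, so every update is O(1).
struct DenseBook {
    std::vector<BookEntry> bids_;
    std::vector<BookEntry> offers_;

    explicit DenseBook(OrderPrice max_price)
        : bids_(max_price + 1), offers_(max_price + 1) {}

    void add_bid(OrderPrice px, OrderQuantity qty)    { bids_[px].quantity += qty; }
    void reduce_bid(OrderPrice px, OrderQuantity qty) { bids_[px].quantity -= qty; }
    OrderQuantity bid_volume(OrderPrice px) const     { return bids_[px].quantity; }
};
```

No tree rebalancing, no node allocations: an add, reduce, or volume query is a single indexed load or store into contiguous memory.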
Key Advantages Link to heading
- O(1) access and updates: Add/cancel/reduce quantity at a price → direct index access.
- Cache-friendly: Contiguous memory, excellent locality when scanning depth or matching.
- No dynamic allocations per order (if aggregating quantity only).
- Fast depth queries: Cumulative volume from best price is a simple loop over contiguous array.
Trade-offs and Limitations Link to heading
- Memory usage: Fixed size based on the price range, which can be huge for instruments spanning many ticks.
- No time priority per se: This aggregates all orders at a price level into one quantity. True FIFO requires a separate queue per price (e.g., vector of deques or lists), adding complexity.
- Sparse markets: Wasteful if prices span a huge range with few active levels (use map instead).
- Cancel by ID: Requires a separate unordered_map<OrderID, Price> to locate the price level quickly.
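As a sketch of how such a side table can work (CancelIndex is an illustrative name, not an actual type from this implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using OrderID = uint64_t;
using OrderPrice = uint64_t;
using OrderQuantity = uint64_t;

// Side table mapping order IDs to their resting price and quantity, so a
// cancel can reach the right dense-vector slot without scanning the book.
struct CancelIndex {
    std::vector<OrderQuantity> level_qty; // one dense book side
    std::unordered_map<OrderID, std::pair<OrderPrice, OrderQuantity>> live;

    explicit CancelIndex(OrderPrice max_price) : level_qty(max_price + 1) {}

    void add(OrderID id, OrderPrice px, OrderQuantity qty) {
        level_qty[px] += qty;
        live[id] = {px, qty};
    }

    bool cancel(OrderID id) {
        auto it = live.find(id);
        if (it == live.end()) return false; // unknown or already cancelled
        level_qty[it->second.first] -= it->second.second;
        live.erase(it);
        return true;
    }
};
```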
When to Use This Structure Link to heading
Perfect for:
- Cryptocurrency spot markets (often integer or fixed-decimal prices).
- High-volume tick-based instruments.
- When you prioritize raw speed and can aggregate quantities (many HFT systems do this for the core book, handling individual orders separately).
A Ring approach Link to heading
While the vector-based order book is a very intuitive design, I’ll be implementing a ring over each vector so the book is less prone to resizing.
With this new design, the book tracks the best bid and ask using a pointer into the vector. Reading the best entry remains O(1).
auto Ring::get_best_bid() const -> const BookEntry &
{
    assert(bids_end_ != nullptr);
    return *bids_end_;
}

auto Ring::get_best_offer() const -> const BookEntry &
{
    assert(offers_begin_ != nullptr);
    return *offers_begin_;
}

auto Ring::top() const -> std::tuple<const BookEntry &, const BookEntry &>
{
    assert(bids_end_ != nullptr);
    assert(offers_begin_ != nullptr);
    return {*bids_end_, *offers_begin_};
}
Book entry Link to heading
BookEntry is simple: it’s a price and quantity combo.
struct alignas(16) BookEntry
{
    OrderPrice price;
    OrderQuantity quantity;

public:
    BookEntry() : price{0}, quantity{0} {}
    BookEntry(OrderPrice price, OrderQuantity quantity) : price{price}, quantity{quantity} {}
    // Copy constructors are required for vector::resize operations (if resizing book depth becomes necessary)
};
Using concepts, it can be defined like this:
#if __cplusplus >= 202002L
#include <concepts>
namespace libmarket::book
{
    template <typename T>
    concept IsBookEntry = requires(T e) {
        { e.price } -> std::convertible_to<libmarket::market_data::OrderPrice>;
        { e.quantity } -> std::convertible_to<libmarket::market_data::OrderQuantity>;
    };
}
#endif
Handling an incoming order Link to heading
Handling an incoming order in the book is a hot path. It has to be fast and should be easily benchmarked.
The goal is to first find the book entry slot (aka orders index) corresponding to the order price.
auto Ring::handle(const Order &e) -> bool
{
    // Decide if operating on bids_ or offers_ in a branchless manner
    auto type = static_cast<std::underlying_type_t<OrderType>>(e.type);
    auto **begin = std::array<BookEntry **, 2>{&bids_begin_, &offers_begin_}[type];
    auto **end = std::array<BookEntry **, 2>{&bids_end_, &offers_end_}[type];
    auto *orders = std::array<std::vector<BookEntry> *, 2>{&bids_, &offers_}[type];

    const auto index = find_index(orders, begin, end, e.price);
    if (index == InvalidSlot)
    {
        return false;
    }

    // Implying new order or order (atomic) update
    orders->at(index) = BookEntry(e.price, e.quantity);
    return true;
}
Note: Order deletion and trade execution are not discussed here, as they wouldn’t add anything to our code optimization topic.
Branchless Side Selection (Bids vs. Offers/Asks) Link to heading
The code avoids an explicit if (e.type == BID) { ... } else { ... } branch when deciding whether to operate on the bids or asks side.
Why branchless? Modern CPUs use branch prediction to speculatively execute code, but mispredictions cost cycles, which is devastating in tight loops processing millions of orders/sec. In trading, order types are often somewhat predictable (e.g., bursts of bids), but not perfectly. Making the selection branchless eliminates prediction risk entirely.
To do so, I cast enum OrderType to its underlying integer and use that integer as an index into small std::arrays of pointers. The compiler generates a simple address calculation + load, no conditional jump.
This is a classic array lookup dispatch. It’s fast, predictable, and cache-friendly.
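A toy version of this dispatch outside the order book, to make the pattern concrete (names are mine):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class OrderType : uint8_t { Bid = 0, Ask = 1 };

// Array-lookup dispatch: the enum's underlying value indexes a small table
// of pointers, so side selection compiles down to an address computation
// plus a load instead of a conditional jump.
inline int &select_side(OrderType type, int &bid_total, int &ask_total)
{
    auto idx = static_cast<uint8_t>(type);
    std::array<int *, 2> sides{&bid_total, &ask_total};
    return *sides[idx];
}
```

The technique relies on the enum values being dense (0, 1, …) so they can index the table directly.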
Pointer-to-Pointer for Begin/End Iterators Link to heading
Notice that auto **begin and auto **end are pointers to pointers: they point at the member pointers bids_begin_ and bids_end_ (or their offer-side counterparts).
These are member variables that cache pointers to the active range of the dense vector (e.g., the region with non-zero quantities). Using ** allows passing them by reference into find_index without duplicating code for bids vs. asks.
This enables find_index to potentially update the cached begin/end if the price extends the book depth—common in dense implementations to avoid scanning the entire vector every time.
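A reduced illustration of the pattern (names are mine, not the post's real code): because the helper receives a pointer *to* the caller's cached pointer, it can move the window boundary in place, and the same function serves either side.

```cpp
#include <cassert>
#include <vector>

// Moving the boundary through the double pointer updates the caller's
// member variable, whichever side (bids or offers) it belongs to.
inline void extend_window(std::vector<int> &slots, int **cached_begin, int index)
{
    *cached_begin = slots.data() + index;
}
```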
Finding the order index in the ring Link to heading
Now, the find_index function is where the complexity lies, because of the ring design. In a simpler dense layout, you would use the order price directly as the index.
auto Ring::find_index(std::vector<BookEntry> *orders, BookEntry **begin, BookEntry **end, OrderPrice price) const -> int64_t
{
    assert(price > 0);
    [[assume(price > 0)]];

    // The begin / end window is not set yet
    if (*begin == nullptr)
    {
        *begin = *end = orders->data();
        return 0;
    }

    // Test if the price/boundary distance fully overlaps the container
    const auto fully_overlaps = (udiff((*end)->price, price) / depth_) > 0;
    if (fully_overlaps)
    {
        return InvalidSlot;
    }

    // Compute begin index (BI)
    const int64_t begin_index = *begin - orders->data();

    // Test if price is inside the begin / end window
    // We must not update pointers

    // |00|00|00|13|00|00|00|07|00|00|
    //           EI          BI
    //  -------- -----| valid ranges for setting the new price
    if ((*begin)->price <= price && price <= (*end)->price)
    {
        auto deviation = price - (*begin)->price;
        auto price_index = libmarket::math::mod((begin_index + deviation), depth_);
        return static_cast<int64_t>(price_index);
    }

    // Price is outside the begin / end window
    // We must update pointers

    // Compute price index (PI)
    auto deviation = libmarket::math::mod(udiff((*begin)->price, price), depth_);
    auto pushing_low_boundary = price < (*begin)->price;

    // XXX: check under/overflow for price_index computation
    // Being branchless here does not significantly decrease branch-misses, nor does moving the modulo
    auto new_price_indexes = std::array<uint64_t, 2>{
        (begin_index + deviation),         // higher price index
        (begin_index + depth_ - deviation) // lower price index
    };
    auto price_index = libmarket::math::mod(new_price_indexes[static_cast<uint8_t>(pushing_low_boundary)], depth_);

    // Compute end index (EI)
    const int64_t end_index = *end - orders->data();

    // |10|00|00|13|00|00|00|07|00|00|
    //           EI          BI
    //  -------- | valid range for pushing data outside begin / end window
    if (end_index < begin_index)
    {
        if (price_index <= end_index)
        {
            return InvalidSlot;
        }

        if (price_index >= begin_index)
        {
            return InvalidSlot;
        }

        goto price_index_is_valid;
    }

    // |00|00|00|07|00|00|00|10|00|00|
    //           BI          EI
    //  -------- -----| valid ranges for pushing data outside begin / end window
    if (begin_index < end_index)
    {
        if (price_index < begin_index)
        {
            goto test_boundary_collision;
        }

        if (price_index > end_index)
        {
            goto test_boundary_collision;
        }

        return InvalidSlot;
    }

test_boundary_collision:

    if (price_index == begin_index or price_index == end_index)
    {
        return InvalidSlot;
    }

price_index_is_valid:

    // Push left or right boundary up to new price index
    auto *boundary = pushing_low_boundary ? begin : end;
    *boundary = orders->data() + price_index;

    return static_cast<int64_t>(price_index);
}
This function has a straightforward design with emphasis on the hot path. Gains come from speeding up common operations such as the modulo and the absolute distance.
[[assume]] attribute Link to heading
The code snippet demonstrates a common C++ optimization technique introduced in C++23: pairing a runtime assert with the standard [[assume]] attribute.
The assume expression is not evaluated at runtime so no code is generated for it.
However, the compiler can use this knowledge for aggressive optimizations (e.g., eliminating bounds checks, choosing faster instruction sequences, propagating constants, or simplifying control flow).
The compiler can:
- Eliminate redundant checks downstream (e.g., no need for division-by-price safeguards if price > 0 guarantees no zero-division).
- Strength-propagate values (e.g., treat price as positive for better codegen in loops or math operations).
- Improve branch prediction or vectorization by knowing certain paths are impossible.
- Optimize operations like 1.0 / price into faster approximations or multiply-by-reciprocal if positivity/range is known.
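One way to package this pairing, sketched with a hypothetical LM_ASSUME macro (the macro name and helper function are mine, not from the post's codebase):

```cpp
#include <cassert>
#include <cstdint>

// Pair the debug-time assert with the release-time compiler hint.
// [[assume]] is C++23; the feature test keeps older compilers on the
// plain assert so the code stays portable.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(assume) >= 202207L
#define LM_ASSUME(cond)   \
    do                    \
    {                     \
        assert(cond);     \
        [[assume(cond)]]; \
    } while (0)
#endif
#endif
#ifndef LM_ASSUME
#define LM_ASSUME(cond) assert(cond)
#endif

inline uint64_t ticks_from_price(uint64_t price, uint64_t tick_size)
{
    LM_ASSUME(tick_size > 0); // the division below can never be by zero
    return price / tick_size;
}
```

In debug builds the condition is checked; in optimized builds the compiler may use it to drop the zero-divisor path entirely.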
Better modulo Link to heading
And because the book depth is a power of two, I can compute the mod faster.
The standard % operator performs full integer division, which is slow on most CPUs. However, when the modulus n is a power of two (e.g., 1024 = 2¹⁰), a mathematical identity allows replacement with a simple bitwise AND: a % n == a & (n - 1) for a >= 0 and n = 2^k
namespace libmarket::math
{
    [[nodiscard]] constexpr auto is_power_of_two(uint64_t n) -> bool
    {
        return std::has_single_bit(n);
    }

    [[nodiscard]] inline auto _mod_branchless_power(uint64_t a, uint64_t n) -> uint64_t
    {
        assert(is_power_of_two(n));
        return (a & (n - 1));
    }

    [[nodiscard]] inline auto mod(uint64_t a, uint64_t n) -> uint64_t
    {
        return _mod_branchless_power(a, n);
    }
}
This replaces a full integer division with a single bitwise AND, saving a significant number of cycles on the hot path.
Enhanced absolute distance Link to heading
Here is the udiff function that computes absolute difference ((a > b) ? a - b : b - a):
// Branchless absolute difference for unsigned
[[nodiscard]] auto udiff(OrderPrice a, OrderPrice b) -> OrderPrice
{
    auto n = (unsigned)((long long)((unsigned long long)a - (unsigned long long)b) >> 32);
    unsigned result = a - b;
    return (result ^ n) - n;
}
These are nifty bit tricks, and they can help when you need guaranteed branchless behavior on obscure platforms. In practice, on x86-64 with modern compilers, simpler code wins in both speed and readability (think cmov).
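For comparison, here is the straightforward version, which modern x86-64 compilers typically lower to a branch-free cmov sequence on their own:

```cpp
#include <cassert>
#include <cstdint>

using OrderPrice = uint32_t;

// The plain ternary: GCC and Clang usually emit sub + cmov (no branch),
// so there is nothing to gain from manual sign-mask tricks here.
inline OrderPrice udiff_simple(OrderPrice a, OrderPrice b)
{
    return a > b ? a - b : b - a;
}
```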
These kinds of micro-optimizations however remind us why C++ remains king in latency-sensitive trading engines: (total) control over the generated assembly when you need it.
Benchmarking Link to heading
In performance-critical systems like a trading order book, optimizations often target microsecond or nanosecond gains. A change that looks brilliant in theory can underperform in practice due to cache effects, branch prediction, or allocator overhead. Without rigorous benchmarking, you risk deploying slower code or wasting time on irrelevant tweaks.
Benchmarking provides:
- Objective evidence: Quantifies if an optimization actually improves latency, throughput, or memory usage.
- Guidance for decisions: In our vector-based order book, benchmarks will reveal if O(1) indexed access beats std::map’s O(log N) in real workloads.
- Detection of regressions: Ensures new features don’t degrade performance.
- Realism in trading: HFT systems handle bursty, high-volume order flows; benchmarks simulating market data feeds expose bottlenecks that unit tests miss.
Effective Benchmarking Techniques Link to heading
Reliable microbenchmarks (measuring small operations like adding an order) require care to avoid misleading results:
- Warm-up and steady state: Run the operation millions of times to warm caches and JIT (if applicable).
- Isolate the code: Benchmark the specific function (e.g., handle(order)) in a loop, not the entire application.
- Prevent compiler over-optimization: Use benchmark::DoNotOptimize or consume results to avoid dead code elimination.
- Statistical rigor: Run multiple iterations, report mean/median/min/max, and standard deviation.
- Realistic workloads: Use traces from real market data (e.g., NASDAQ ITCH feeds) or synthetic generators mimicking order arrival rates.
- Control the environment: Pin threads to cores, disable frequency scaling, use performance governors, isolate from OS noise.
- Measure relevant metrics:
- Latency (ns per operation)
- Throughput (orders/sec)
- Memory allocation rate
- Cache misses (via perf or VTune if applicable)
Common pitfalls:
- Benchmarking in Debug mode.
- Measuring wall-clock time without high-resolution timers.
- Ignoring branch misprediction in random price accesses.
Example with Google Benchmark Link to heading
namespace
{
    std::random_device rd;  // Will be used to obtain a seed for the random number engine
    std::mt19937 gen(rd()); // Standard mersenne_twister_engine seeded with rd()

    inline auto prepare_ring_order(const libmarket::book::Ring &book, benchmark::State &state) -> libmarket::book::Ring::Order
    {
        std::uniform_int_distribution<uint64_t> dis(1, std::numeric_limits<uint16_t>::max());
        return libmarket::book::Ring::Order{
            .type = static_cast<libmarket::market_data::OrderType>(dis(gen) % 2),
            .price = static_cast<libmarket::market_data::OrderPrice>(dis(gen) % book.depth()),
            .quantity = static_cast<libmarket::market_data::OrderQuantity>(dis(gen)),
            .action = static_cast<libmarket::market_data::OrderAction>(dis(gen) % 3),
        };
    }
}

static void BM_Ring_get_best_bid(benchmark::State &state)
{
    auto book = libmarket::book::Ring(10);
    auto order = prepare_ring_order(book, state);
    order.action = libmarket::market_data::OrderAction::New;
    order.type = libmarket::market_data::OrderType::Bid;
    book.handle(order);

    for (auto _ : state)
    {
        std::ignore = book.get_best_bid();
    }

    state.SetItemsProcessed(1 * state.iterations());
}

BENCHMARK(BM_Ring_get_best_bid);
You can then use perf to compute metrics:
- cycles: number of CPU cycles executed.
- instructions: number of instructions executed.
- branches: number of branch instructions executed.
- cache-references: memory accesses to the cache.
- cache-misses: memory accesses that require fetching data from a higher-level cache or main memory.
- faults: number of page faults, which occur when a process tries to access a page of memory that is not currently mapped in its address space.
- migrations: number of times a task is migrated between CPU cores.
perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses,faults,migrations ./build/libmarket-benchmark/libmarket_benchmark --benchmark_filter="$@*"