Coding – The ryg blog

FSE/ANS history correction


I’ve been meaning to write another proper post on this for a while, but the last few months have been very busy and I didn’t feel like writing in my spare time.

This is not that post, sadly. This is about something else: namely, me only citing Jarek Duda’s work on ANS and not Yann Collet’s work on FSE. Apparently, there were multiple versions of Jarek’s ANS paper, and the second version (which contains rANS, the topic I’ve been writing about) was significantly influenced by Yann’s experiences with integrating ideas from the first version into FSE.

Anyway, I did not mention Yann’s work in my rANS posts at all. I just want to make clear that this was because I was writing about rANS not tANS (the family that FSE is a member of), and I simply wasn’t aware that Yann’s work and input significantly influenced the second version of the ANS paper. My apologies; this was a simple oversight, not a deliberate attempt to talk down Yann’s contribution!

On a separate but related note: in mid-February, I wrote a short paper on how entropy coders (with a focus on ANS) can be interleaved on the encode side to allow the decoder side to exploit instruction-level parallelism and/or SIMD instructions. I originally meant to do a separate post about it here, but on trying to write it I discovered that I didn’t have much to say on the topic that wasn’t in the paper. Hence, no separate blog post. But I figured I should at least link to it once from here.

Anyway, more regular blog updates should start again soon. Until then!



Cache coherency primer


I’m planning to write a bit about data organization for multi-core scenarios. I started writing a first post but quickly realized that there’s a bunch of basics I need to cover first. In this post, I’ll try just that.

Caches

This is a whirlwind primer on CPU caches. I’m assuming you know the basic concept, but you might not be familiar with some of the details. (If you are, feel free to skip this section.)

In modern CPUs (almost) all memory accesses go through the cache hierarchy; there’s some exceptions for memory-mapped IO and write-combined memory that bypass at least parts of this process, but both of these are corner cases (in the sense that the vast majority of user-mode code will never see either), so I’ll ignore them in this post.

The CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly – it’s physically impossible; the necessary wires don’t exist! Instead, they talk to their L1 caches, which are supposed to handle it. And about 20 years ago, the L1 caches would indeed talk to memory directly. At this point, there’s generally more cache levels involved; this means the L1 cache doesn’t talk to memory directly anymore, it talks to an L2 cache – which in turn talks to memory. Or maybe to an L3 cache. You get the idea.

Caches are organized into “lines”, corresponding to aligned blocks of either 32 (older ARMs, 90s/early 2000s x86s/PowerPCs), 64 (newer ARMs and x86s) or 128 (newer Power ISA machines) bytes of memory. Each cache line knows what physical memory address range it corresponds to, and in this article I’m not going to differentiate between the physical cache line and the memory it represents – this is sloppy, but conventional usage, so well, get used to it. In particular, I’m going to say “cache line” to mean a suitably aligned group of bytes in memory, no matter whether these bytes are currently cached (i.e. present in any of the cache levels) or not.
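
If you want to make the “suitably aligned group of bytes” view concrete, here’s a quick C++ sketch. The 64-byte line size is an assumption on my part – substitute whatever your target actually uses, per the list above.

#include <cstdint>

// Assumed line size; 64 bytes is typical for current x86s/ARMs, but as
// noted above, 32 and 128 also occur in the wild.
constexpr uintptr_t kLineSize = 64;

// Base address of the cache line an address falls into.
uintptr_t cache_line_base(const void *p) {
    return reinterpret_cast<uintptr_t>(p) & ~(kLineSize - 1);
}

// Two addresses share a cache line iff their line bases agree.
bool same_cache_line(const void *a, const void *b) {
    return cache_line_base(a) == cache_line_base(b);
}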

When the CPU core sees a memory load instruction, it passes the address to the L1 data cache (or “L1D$”, playing on the “cache” being pronounced the same way as “cash”). The L1D$ checks whether it contains the corresponding cache line. If not, the whole cache line is brought in from memory (or the next-deeper cache level, if present) – yes, the whole cache line; the assumption being that memory accesses are localized, so if we’re looking at some byte in memory we’re likely to access its neighbors soon. Once the cache line is present in the L1D$, the load instruction can go ahead and perform its memory read.

And as long as we’re dealing with read-only access, it’s all really simple, since all cache levels obey what I’ll call the

Basic invariant: the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses, at all times.

Things get a bit more complicated once we allow stores, i.e. memory writes. There’s two basic approaches here: write-through and write-back. Write-through is the easier one: we just pass stores through to the next-level cache (or memory). If we have the corresponding line cached, we update our copy (or maybe even just discard it), but that’s it. This preserves the same invariant as before: if a cache line is present in the cache, its contents match memory, always.

Write-back is a bit trickier. The cache doesn’t pass writes on immediately. Instead, such modifications are applied locally to the cached data, and the corresponding cache lines are flagged “dirty”. Dirty cache lines can trigger a write-back, at which point their contents are written back to memory or the next cache level. After a write-back, dirty cache lines are “clean” again. When a dirty cache line is evicted (usually to make space for something else in the cache), it always needs to perform a write-back first. The invariant for write-back caches is slightly different.

Write-back invariant: after writing back all dirty cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses.

In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines).

Write-through caches are simpler, but write-back has some advantages: it can filter repeated writes to the same location, and if most of the cache line changes on a write-back, it can issue one large memory transaction instead of several small ones, which is more efficient.

Some (mostly older) CPUs use write-through caches everywhere; some use write-back caches everywhere; some have a simpler write-through L1$ backed by a write-back L2$. This may generate redundant traffic between L1$ and L2$ but gets the write-back benefits for transfers to lower cache levels or memory. My point being that there’s a whole set of trade-offs here, and different designs use different solutions. Nor is there a requirement that cache line sizes be the same at all levels – it’s not unheard-of for CPUs to have 32-byte lines in L1$ but 128-byte lines in L2$ for example.

Omitted for simplicity in this section: cache associativity/sets; write-allocate or not (I described write-through without write-allocate and write-back with, which is the most common usage); unaligned accesses; virtually-addressed caches. These are all things you can look up if you’re interested, but I’m not going to go that deep here.

Coherency protocols

As long as that single CPU core is alone in the system, this all works just fine. Add more cores, each with their own caches, and we have a problem: what happens if some other core modifies data that’s in one of our caches?

Well, the answer is quite simple: nothing happens. And that’s bad, because we want something to happen when someone else modifies memory that we have a cached copy of. Once we have multiple caches, we really need to keep them synchronized, or we don’t really have a “shared memory” system, more like a “shared general idea of what’s in memory” system.

Note that the problem really is that we have multiple caches, not that we have multiple cores. We could solve the entire problem by sharing all caches between all cores: there’s only one L1$, and all processors have to share it. Each cycle, the L1$ picks one lucky core that gets to do a memory operation this cycle, and runs it.

This works just fine. The only problem is that it’s also slow, because cores now spend most of their time waiting in line for their next turn at an L1$ request (and processors do a lot of those, at least one for every load/store instruction). I’m pointing this out because it shows that the problem really isn’t so much a multi-core problem as it is a multi-cache problem. We know that one set of caches works, but when that’s too slow, the next best thing is to have multiple caches and then make them behave as if there was only one cache. This is what cache coherency protocols are for: as the name suggests, they ensure that the contents of multiple caches stay coherent.

There’s multiple types of coherency protocols, but most computing devices you deal with daily fall into the category of “snooping” protocols, and that’s what I’ll cover here. (The primary alternative, directory-based coherency, has higher latency but scales better to systems with lots of cores).

The basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores: the caches themselves are independent, but memory itself is a shared resource, and memory access needs to be arbitrated: only one cache gets to read data from, or write back to, memory in any given cycle. Now the idea in a snooping protocol is that the caches don’t just interact with the bus when they want to do a memory transaction themselves; instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing. So if one cache wants to read from or write to memory on behalf of its core, all the other cores notice, and that allows them to keep their caches synchronized. As soon as one core writes to a memory location, the other cores know that their copies of the corresponding cache line are now stale and hence invalid.

With write-through caches, this is fairly straightforward, since writes get “published” as soon as they happen. But if there’s write-back caches in the mix, this doesn’t work, since the physical write-back to memory can happen a long time after the core executed the corresponding store – and for the intervening time, the other cores and their caches are none the wiser, and might themselves try to write to the same location, causing a conflict. So with a write-back model, it’s not enough to broadcast just the writes to memory when they happen; if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy. Working out the details, the easiest solution that fits the bill and works for write-back caches is what’s commonly called the MESI protocol.

MESI and friends

This section is called “MESI and friends” because MESI spawned a whole host of closely related coherency protocols. Let’s start with the original though: MESI are the initials for the four states a cache line can be in for any of the multiple cores in a multi-core system. I’m gonna cover them in reverse order, because that’s the better order to explain them in:

  • Invalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it’s as if it wasn’t in the cache in the first place.
  • Shared lines are clean copies of the contents of main memory. Cache lines in the shared state can be used to serve reads but they can’t be written to. Multiple caches are allowed to have a copy of the same memory location in “shared” state at the same time, hence the name.
  • Exclusive lines are also clean copies of the contents of main memory, just like the S state. The difference is that when one core holds a line in E state, no other core may hold it at the same time, hence “exclusive”. That is, the same line must be in the I state in the caches of all other cores.
  • Modified lines are dirty; they have been locally modified. If a line is in the M state, it must be in the I state for all other cores, same as E. In addition, modified cache lines need to be written back to memory when they get evicted or invalidated – same as the regular dirty state in a write-back cache.

If you compare this to the presentation of write-back caches in the single-core case above, you’ll see that the I, S and M states already had their equivalents: invalid/not present, clean, and dirty cache lines, respectively. So what’s new is the E state denoting exclusive access. This state solves the “we need to tell other cores before we start modifying memory” problem: each core may only write to cache lines if their caches hold them in the E or M states, i.e. they’re exclusively owned. If a core does not have exclusive access to a cache line when it wants to write, it first needs to send an “I want exclusive access” request to the bus. This tells all other cores to invalidate their copies of that cache line, if they have any. Only once that exclusive access is granted may the core start modifying data – and at that point, the core knows that the only copies of that cache line are in its own caches, so there can’t be any conflicts.

Conversely, once some other core wants to read from that cache line (which we learn immediately because we’re snooping the bus), exclusive and modified cache lines have to revert back to the “shared” (S) state. In the case of modified cache lines, this also involves writing their data back to memory first.

The MESI protocol is a proper state machine that responds both to requests coming from the local core, and to messages on the bus. I’m not going to go into detail about the full state diagram and what the different transition types are; you can find more in-depth information in books on hardware architecture if you care, but for our purposes this is overkill. As a software developer, you’ll get pretty far knowing only two things:

Firstly, in a multi-core system, getting read access to a cache line involves talking to the other cores, and might cause them to perform memory transactions.
Writing to a cache line is a multi-step process: before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents (a so-called “Read For Ownership” request).

And secondly, while we have to do some extra gymnastics, the end result actually does provide some pretty strong guarantees. Namely, it obeys what I’ll call the

MESI invariant: after writing back all dirty (M-state) cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In addition, at all times, when a memory location is exclusively cached (in E or M state) by one core, it is not present in any of the other cores’ caches.

Note that this is really just the write-back invariant we already saw with the additional exclusivity rule thrown in. My point being that the presence of MESI or multiple cores does not necessarily weaken our memory model at all.

Okay, so that (very roughly) covers vanilla MESI (and hence also CPUs that use it, ARMs for example). Other processors use extended variants. Popular extensions include an “O” (Owned) state similar to “E” that allows sharing of dirty cache lines without having to write them back to memory first (“dirty sharing”), yielding MOESI, and MERSI/MESIF, which are different names for the same idea, namely making one core the designated responder for read requests to a given cache line. When multiple cores hold a cache line in Shared state, only the designated responder (which holds the cache line in “R” or “F” state) replies to read requests, rather than everyone who holds the cache line in S state. This reduces bus traffic. And of course you can add both the R/F states and the O state, or get even fancier. All these are optimizations, but none of them change the basic invariants provided or guarantees made by the protocol.

I’m no expert on the topic, and it’s quite possible that there are other protocols in use that only provide substantially weaker guarantees, but if so I’m not aware of them, or any popular CPU core that uses them. So for our purposes, we really can assume that coherency protocols keep caches coherent, period. Not mostly-coherent, not “coherent except for a short window after a change” – properly coherent. At that level, barring hardware malfunction, there is always agreement on what the current state of memory should be. In technical terms, MESI and all its variants can, in principle anyway, provide full sequential consistency, the strongest memory ordering guarantee specified in the C++11 memory model. Which begs the question, why do we have weaker memory models, and “where do they happen”?

Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model.

I won’t go into the details of memory models here; it quickly gets really technical, and is outside the scope of this article. But I do want to talk a bit about “how they happen” – that is, where the weakened guarantees (compared to the full sequential consistency we can get from MESI etc.) come from, and why. And as usual, it all boils down to performance.

So here’s the deal: you will indeed get full sequential consistency if a) the cache immediately responds to bus events on the very cycle it receives them, and b) the core dutifully sends each memory operation to the cache, in program order, and waits for it to complete before sending the next one. And of course, in practice modern CPUs normally do none of these things:

  • Caches do not respond to bus events immediately. If a bus message triggering a cache line invalidation arrives while the cache is busy doing other things (sending data to the core for example), it might not get processed that cycle. Instead, it will enter a so-called “invalidation queue”, where it sits for a while until the cache has time to process it.
  • Cores do not, in general, send memory operations to the cache in strict program order; this is certainly the case for cores with Out-of-Order execution, but even otherwise in-order cores may have somewhat weaker ordering guarantees for memory operations (for example, to ensure that a single cache miss doesn’t immediately make the entire core grind to a halt).
  • In particular, stores are special, because they’re a two-phase operation: we first need to acquire exclusive ownership of a cache line before a store can go through. And if we don’t already have exclusive ownership, we need to talk to the other cores, which takes a while. Again, having the core idle and twiddling thumbs while this is happening is not a good use of execution resources. Instead, what happens is that stores start the process of getting exclusive ownership, then get entered into a queue of so-called “store buffers” (some refer to the entire queue as “store buffer”, but I’m going to use the term to refer to the entries). They stay around in this queue for a while until the cache is ready to actually perform the store operation, at which point the corresponding store buffer is “drained” and can be recycled to hold a new pending store.

The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue), stores actually finish later than their position in the code would suggest, and everything gets even more vague when Out of Order execution is involved. So going back to memory models, there’s essentially two camps:

Architectures with a weak memory model do the minimum amount of work necessary in the core that allows software developers to write correct code. Instruction reordering and the various buffering stages are officially permitted; there’s no guarantees. If you need guarantees, you need to insert the appropriate memory barriers – which will prevent reordering and drain queues of pending operations where required.
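
To make that concrete, here’s the classic “store buffer” litmus test, sketched with C++11 atomics, using relaxed ordering to stand in for “no guarantees by default”. Take it as an illustration, not production code.

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed); // may read 0
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed); // may read 0
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join(); b.join();
    // r1 == 0 && r2 == 0 is a permitted outcome: each store can sit in a
    // store buffer while the load of the other location runs ahead of it.
    // Making all four accesses memory_order_seq_cst (or inserting the
    // right fences) rules that outcome out.
    return 0;
}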

Architectures with stronger memory models do a lot more bookkeeping on the inside. For example, x86 processors keep track of all pending memory operations that aren’t fully finished (“retired”) yet, in a chip-internal data structure that’s called the MOB (“memory ordering buffer”). As part of the Out of Order infrastructure, x86 cores can roll back non-retired operations if there’s a problem – say an exception like a page fault, or a branch mispredict. I covered some of the details, as well as some of the interactions with the memory subsystem, in my earlier article “Speculatively speaking”. The gist of it is that x86 processors actively watch out for external events (such as cache invalidations) that would retroactively invalidate the results of some of the operations that have already executed, but not been retired yet. That is, x86 processors know what their memory model is, and when an event happens that’s inconsistent within that model, the machine state is rolled back to the last time when it was still consistent with the rules of the memory model. This is the “memory ordering machine clear” I covered in yet another earlier post.

So, weaker memory models make for simpler (and potentially lower-power) cores. Stronger memory models make the design of cores (and their memory subsystems) more complex, but are easier to write code for. In theory, the weaker models allow for more scheduling freedom and can be potentially faster; in practice, x86s seem to be doing fine on the performance of memory operations, for the time being at least. So it’s hard for me to call a definite winner so far. Certainly, as a software developer I’m happy to take the stronger x86 memory model when I can get it.

Anyway. That’s plenty for one post. And now that I have all this written up on my blog, the idea is that future posts can just reference it. We’ll see how that goes. Thanks for reading!


Atomic operations and contention


Last time, we covered the basics of how cache coherency works. Today, let’s talk about some of the primitives necessary to build useful systems on top of a coherent cache, and how they work.

Atomicity and atomic operations

A crucial building block for all of this is atomic operations. This has nothing to do with nuclear physics and everything to do with the root of the word atom, the Ancient Greek “ἄτομος” (atomos, “indivisible”). An atomic operation is one that cannot be divided into any smaller parts, or at least appears that way for the purposes of other cores in the system. To see why this matters, consider what happens when two cores both try to increment a counter at almost the same time, running the equivalent of the C statement counter++;:

Cycle #   Core 1                   Core 2
0         reg = load(&counter);
1         reg = reg + 1;           reg = load(&counter);
2         store(&counter, reg);    reg = reg + 1;
3                                  store(&counter, reg);

In compiled code, this single statement turns into a load operation, a register increment, and finally a store operation (here written in C-esque pseudocode). These three steps are distinct and will execute in sequence (note that on the micro-architectural level, this is true on x86 as well, even though the instruction set architecture actually features a read-modify-write add [memory], value instruction). And because of this splitting into multiple cycles, it’s possible for Core 2 to read counter after Core 1 has read it (and started incrementing it), but before it has written the result back. The end result is that, even though both cores ran code to increment the counter, the final value of the counter is only incremented by 1; one of the two increment operations was “lost”.

This problem is exactly what atomic operations are there to prevent; if we use an atomic increment (or more generally, atomic add) instead of a regular increment, the active core will make sure that the three steps above (load, add, store) appear to happen as one operation, hence atomic; no other core is allowed to “peek” while the increment is going on.
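
In C++11 terms, the fix looks like this (a minimal sketch; the thread count and iteration count are arbitrary):

#include <atomic>
#include <thread>

std::atomic<int> counter{0};

void worker() {
    for (int i = 0; i < 100000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed); // atomic increment
}

int main() {
    std::thread a(worker), b(worker);
    a.join(); b.join();
    // counter is now exactly 200000. With a plain non-atomic counter++,
    // lost updates as in the table above can make it come up short.
    return counter.load() == 200000 ? 0 : 1;
}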

How atomics are implemented

Now the question is, how is this done? Conceptually, the easiest way to do this is using a locking mechanism: only one core is allowed to execute an atomic operation at any point in time. The core enters the lock before it starts the operation, and leaves it once the operation is complete. This is what the x86 LOCK prefix originally used to mean (approximately; I’m simplifying here). Here, the lock enter operation consists of a message on the bus that says “okay, everyone, back off the bus for a bit” (for our purposes, this means “stop doing memory operations”). Then the core that sent that message needs to wait for all other cores to finish memory operations they’re in the middle of doing, after which they will acknowledge the lock. Only after every other core has acknowledged can the core attempting the locked operation proceed. Finally, once the lock is released, it again needs to send a message to everyone on the bus saying that “all clear, you can resume issuing requests on the bus now”.

This works. It is also incredibly slow. x86 CPUs still support this (or an equivalent), but only as an absolute emergency, when-all-else-fails fallback path; they need a fallback because the x86 ISA permits very dubious constructs like unaligned atomic operations that cross multiple cache lines, for backwards compatibility. Other architectures generally just don’t allow atomic operations on values that aren’t naturally aligned, nor on values that are “too big”. These constraints guarantee that a single atomic operation always takes place within a single cache line. And once we have that, we’re in good shape: as we saw last time when discussing cache coherency protocols, inter-core communication synchronizes memory at cache line granularity anyway, so in principle we can do complex modifications to a single cache line and then publish all changes at once by pushing the new cache line. Moreover, the MESI state machine features two states (M and E, “modified” and “exclusive”) that guarantee exclusive ownership of a cache line by one core – while a cache line is exclusively owned, no other core can “peek”. We can use that as substitute for our locking protocol.

So here’s the punchline: in a MESI (or derived) system, all we need to do to make sure an operation touching a single cache line is atomic is to a) make sure we issue the right memory barriers so memory operations are correctly ordered with reference to the surrounding code (see previous post), b) acquire exclusive ownership of the cache line before we read anything, c) only write back the results if we had exclusive ownership for the entire duration of the atomic operation. This guarantees that no other core saw any half-finished data. There’s multiple ways to accomplish c). For example, we can build hardware to make a limited set of atomic operations complete in a single bus clock cycle; if we have exclusive ownership of our cache line by the start of a cycle, we can have our modified data by the end of it. Since a cache line can’t possibly “change hands” within a cycle, this is fast enough. Depending on the bus protocol, we might also start playing games where we respond to coherency messages immediately, but might send the data a bit late if we’re in the middle of an atomic operation. Finally, we can just decide not to play games with timing at all; instead we implement steps b) and c) directly: any cache line that’s being used for an atomic operation is “monitored”, and if some other core looks at our cache line before the atomic operation is complete, we need to start over. This is what leads to the load-link/store-conditional (LL/SC) operations present in most RISC CPUs.
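
The software-visible analogue of an LL/SC retry loop, if you want one, is a compare-exchange loop. A quick sketch:

#include <atomic>

// Add "delta" to "v" the LL/SC way: load, compute, then try to commit.
// If another core touched the value in between, the compare-exchange
// fails - like a failed store-conditional - and we start over.
int add_via_cas(std::atomic<int> &v, int delta) {
    int expected = v.load(std::memory_order_relaxed);
    while (!v.compare_exchange_weak(expected, expected + delta))
        ; // on failure, "expected" was refreshed with the current value
    return expected + delta;
}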

And by the way, on the bus side (and hence to other cores), properly aligned atomic operations don’t look any different than normal memory updates. Again, all of the processing is internal to the core that does it; other cores neither know nor care whether memory was updated from an atomic compare-and-swap operation bracketed by memory barriers or a relaxed store.

This all sounds nice and simple, and conceptually it is, but the devil’s in the details. The bad news is that if you’re a CPU architect, every single detail of this process is of crucial importance; your internal implementation of memory operations needs to avoid starvation (every core that wants to gain exclusive access to a cache line should be able to do so, eventually, no matter what the other cores are doing), and make sure that it’s possible to implement certain fundamental atomic operations without deadlocks or livelocks. It sounds obvious, but these guarantees are not automatic for low-level primitives like atomic operations or LL/SC. This stuff is hard to get right, and CPUs need to have an implementation that’s not just correct, it also needs to be fast for “typical cases” (whatever they are). The good news is that if you’re not working at a company designing CPUs, none of this is your problem, and you can rest assured that somebody else has thought this through, backed up by an army of validation engineers trying very hard to find test cases that break it.

The cost of memory operations

Back on the SW side, let’s assume we’re on a typical CPU architecture and are running code on multiple cores. What’s the cost of a memory operation in that environment? It breaks down into several components:

Execution. Executing a memory operation isn’t free. Suppose for now that only one core is active and running single-threaded code; even then, memory access is still complicated. Programs deal with virtual addresses, but coherent caches and memory buses normally deal exclusively in physical memory addresses. So every memory operation first starts with a virtual to physical address conversion (these translations are themselves cached in what’s commonly called the Translation Lookaside Buffer, or TLB). If you’re unlucky, that virtual address isn’t currently mapped to physical memory and needs to be brought in from storage; whenever this happens, the OS is going to schedule another thread on the active core for a while, since IO takes a long time in processor terms. But let’s assume that doesn’t happen here.

With the physical address known, the operation starts to go through the memory hierarchy. For example, to complete a memory load, the relevant data normally needs to be brought into the L1 cache first. If it’s not there yet, this can be a multi-step process that – in the worst case – involves a real memory access and then populating all the intermediate cache levels for the relevant cache line. With poor (i.e. not nicely localized) memory access patterns, waiting for cache levels to get populated is one of the main ways a CPU core spends its time. But for now, let’s assume that doesn’t happen (too often) either.

So how fast can we run memory operations if everything goes well? Pretty fast, it turns out. Usually at least one memory operation (load or store) completes per clock cycle, often more than that. Reasonably cache-friendly code will complete billions of memory operations per second on a single 3GHz core.

Memory barriers and atomic read-modify-write operations. For the next step, let’s suppose we’re running code that’s intended for multi-threaded operation, but we’re still doing so on only a single core. As a result, we will now see memory barriers and atomic operations, but no actual interference from another core; let’s just suppose that all relevant cache lines are already exclusively held by our own core. In that situation, how expensive is, say, updating a reference count using an atomic integer addition?

Well, that really depends on the CPU core. In general, micro-architectures with aggressive reordering of memory operations have more expensive memory barriers and atomic operations than those with only slight reordering or in-order execution of memory operations. For example, incrementing a reference count using LOCK INC [mem] on an Intel Atom core (an in-order design) has essentially the same cost as a regular INC [mem] instruction, and somewhat more complicated atomic operations like exchange or exchange-add end up costing about 2x to 3x as much as a “vanilla” memory read-modify-write instruction. By contrast, on Intel and AMD’s out-of-order x86 desktop parts, an atomic increment has about 10x-25x the cost of the non-atomic version; that’s the cost of ensuring proper memory ordering. And again, to reiterate: this is still on code that is executing single-threaded. There’s no actual cross-core communication going on yet; this extra cost is incurred purely within a single core, to make the code safe for multi-core execution.

Bus traffic and cache coherency. Some percentage of memory accesses actually misses the cache and goes straight to memory; and once cache lines that haven’t been used in a while get evicted, we start getting write-backs. All these events cause bus traffic (and memory traffic). Bus and memory bandwidth is a limited resource; as we start saturating their capacities, things start to get slower.

Moreover, once we switch to running multiple threads of our program on multiple cores, we actually start getting cache coherency traffic, as the cores continually synchronize their respective views of memory. If every thread works on its own independent memory region, this doesn’t really do much; if a given region of memory is only used by one core, then there’s no sharing, and getting exclusive ownerships of one of the corresponding cache lines is easy and doesn’t cause any invalidations elsewhere.

By contrast, if two or more cores frequently access the same cache lines, then these cache lines need to be kept synchronized. Updates to one of these cache lines require exclusive ownership, which means all other cores need to invalidate their copies of that cache line first. As a result, the next time that cache line is accessed by another core, its contents need to be sent across the bus. So we get both extra cache misses (on the remote core) and extra bus traffic. This phenomenon of multiple cores hitting a cache line that is being updated regularly is called “cache (line) contention”, and it is probably the easiest way to make parallel code in shared-memory environments crawl.

Cache line contention

To get cache line contention, we need multiple cores frequently accessing the same cache line, with at least some of these regular accesses being writes. Private data (cache lines only accessed by one thread) is never a problem; neither is immutable data (written once and then not modified until the end of its lifetime). What’s tricky is data that is both shared and mutable: using such data requires a lot of communication to maintain a consistent (as per the constraints of the memory model) view of memory between cores, and communication is expensive – and only keeps getting more so as more parties get involved.

How much more expensive are we talking? I wrote a test (for x86/Windows) a few weeks ago. This test is by no means user-friendly or easy to read, and it assumes a quad-core CPU with simultaneous multi-threading, or a similar topology, with at least 8 logical processors. Here’s the gist of it: as explained above, replacing a read-modify-write add of a value in memory with an atomic “add” operation generally makes it about 10x-25x as expensive (how much exactly depends on the micro-architecture). If you need a rule of thumb, just assume about 10x (good enough for Fermi estimation).

Once there is a second core reading that cache line, though, the cost explodes. If there’s another core generating lots of read traffic on the cache line in a tight loop, the atomic add gets more expensive – much more expensive: typically, 4x to 6x more (that’s on top of the ~10x hit we take from using an atomic add to begin with!). And this cost only goes up if there are more readers, and even more so if there are other writers too. Now, please don’t take these values as gospel; this is a completely synthetic benchmark that doesn’t do anything useful. The actual cost of coherency traffic very much depends on context. All I want to do here is give you a very rough feel for the cost of coherency traffic and the communication it does (namely: it’s not negligible).

Some of this communication is not actually necessary. For example, because cache coherency is tracked at cache line granularity, it’s possible to get lots of bogus coherency traffic simply because the different types of data – immutable, private and shared – are intermingled within the same cache line (or similarly, because one cache line contains private data for multiple threads). This is called “false sharing”. Luckily, this kind of problem is fairly easy to find with a profiler, and also relatively straightforward to fix by either reordering the data in memory (possibly while adding padding to make sure two different kinds of data don’t end up in the same cache line) or just removing some of the offending data entirely. My older post “Cores don’t like to share” gives an example.
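
The padding fix usually looks something like this (a sketch; the 64-byte line size is again an assumption):

#include <atomic>
#include <cstdint>

// One counter per thread. alignas(64) pads each element out to a full
// (assumed 64-byte) cache line, so two threads bumping adjacent counters
// no longer tug on the same line.
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

PaddedCounter per_thread_counter[8];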

What’s left over after this process is “real” contention – contended access to shared data. This includes both actual shared mutable data structures and certain kinds of metadata such as locks and other synchronization objects. Exactly how well this works depends on the precise layout of data in memory, as well as the operations used to access it.

In general, the way to get scalable multi-processor code is to avoid contention as much as possible, and to make whatever contention remains pass quickly – in that order. And to do a decent job at this, it’s important to know how cache coherency works (in broad strokes, anyway), what kind of messages cores exchange to maintain memory coherency, and when that coherency traffic happens. Now that we have those basics covered, we can look at somewhat higher levels of the stack. This post is long enough already, but in the next post, I’m planning to have a look at locks and lock-free data structures, and discuss some of the trade-offs. Until then, take care!


Binary alias coding


Applying the rANS-with-alias-table construction from “rANS with static probability distributions” to Huffman codes has some interesting results. In a sense, there’s nothing new here once you have these two ingredients. I remember mentioning this idea in a mail when I wrote ryg_rans, but it didn’t seem worth writing an article about. I’ve changed my mind on that: while the restriction to Huffman-like code lengths is strictly weaker than “proper” arithmetic coding, we do get a pretty interesting variant on table/state machine-style “Huffman” decoders out of the deal. So let’s start with a description of how they usually operate and work our way to the alias rANS variant.

Table-based Huffman decoders

Conceptually, a Huffman decoder starts from the root, then reads one bit at a time, descending into the sub-tree denoted by that bit. If that sub-tree is a leaf node, return the corresponding symbol. Otherwise, keep reading more bits and descending into smaller and smaller sub-trees until you do hit a leaf node. That’s all there is to it.

Except, of course, no serious implementation of Huffman decoding works that way. Processing the input one bit at a time is just a lot of overhead for very little useful work done. Instead, normal implementations effectively look ahead by a bunch of bits and table-drive the whole thing. Peek ahead by k bits, say k=10. You also prepare a table with 2^k entries that encodes what the one-bit-at-a-time Huffman decoder would do when faced with those k input bits:

struct TableEntry {
    int num_bits; // Number of bits consumed
    int symbol;   // Index of decoded symbol
};

If it reaches a leaf node, you record the ID of the symbol it arrived at, and how many input bits were actually consumed to get there (which can be less than k). If not, the next symbol takes more than k bits, and you need a back-up plan. Set num_bits to 0 (or some other value that’s not a valid code length) and use a different strategy to decode the next symbol: typically, you either chain to another (secondary) table or fall back to a slower (one-bit-at-a-time or similar) Huffman decoder with no length limit. Since Huffman coding only assigns long codes to rare symbols – that is, after all, the whole point – it doesn’t tend to matter much; with well-chosen k (typically, slightly larger than the log2 of the size of your symbol alphabet), the “long symbol” case is pretty rare.
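
For reference, filling in such a table is straightforward. Here’s a sketch for a canonical MSB-first code, where the per-symbol code words and lengths (code[s], len[s]) are assumed given, and the table is assumed zero-initialized so that over-long codes keep num_bits == 0 and hit the fallback path:

void build_table(TableEntry *table, int k,
                 const int *code, const int *len, int nsyms)
{
    for (int s = 0; s < nsyms; ++s) {
        if (len[s] == 0 || len[s] > k)
            continue; // unused symbol or longer than k bits: fallback path
        int free_bits = k - len[s];
        int base = code[s] << free_bits; // first k-bit string with this prefix
        // Every k-bit lookahead value whose top len[s] bits match the
        // code decodes to this symbol.
        for (int i = 0; i < (1 << free_bits); ++i) {
            table[base + i].num_bits = len[s];
            table[base + i].symbol = s;
        }
    }
}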

So you get an overall decoder that looks like this:

while (!done) {
    // Read next k bits without advancing the cursor
    int bits = peekBits(k);

    // Decode using our table
    int nbits = table[bits].num_bits;
    if (nbits != 0) { // Symbol
        *out++ = table[bits].symbol;
        consumeBits(nbits);
    } else {
        // Fall-back path for long symbols here!
    }
}

This ends up particularly nice combined with canonical Huffman codes, and some variant of it is used in most widely deployed Huffman decoders. All of this is classic and much has been written about it elsewhere. If any of this is news to you, I recommend Moffat and Turpin’s 1997 paper “On the implementation of minimum redundancy prefix codes”. I’m gonna assume it’s not and move on.

State machines

For the next step, suppose we fix k to be the length of our longest codeword. Anything smaller and we need to deal with the special cases just discussed; anything larger is pointless. A table like the one above then tells us what to do for every possible combination of k input bits, and when we turn the k-bit lookahead into explicit state, we get a finite-state machine that decodes Huffman codes:

state = getBits(k); // read initial k bits
while (!done) {
    // Current state determines output symbol
    *out++ = table[state].symbol;

    // Update state (assuming MSB-first bit packing)
    int nbits = table[state].num_bits;
    state = (state << nbits) & ((1 << k) - 1);
    state |= getBits(nbits);
}

state is essentially a k-bit shift register that contains our lookahead bits, and we need to update it in a way that matches our bit packing rule. Note that this is precisely the type of Huffman decoder Charles talks about here while explaining ANS. Alternatively, with LSB-first bit packing:

state = getBits(k);
while (!done) {
    // Current state determines output symbol
    *out++ = table[state].symbol;

    // Update state (assuming LSB-first bit packing)
    int nbits = table[state].num_bits;
    state >>= nbits;
    state |= getBits(nbits) << (k - nbits);
}

This is still the exact same table as before, but because we’ve sized the table so that each symbol is decoded in one step, we don’t need a fallback path. But so far this is completely equivalent to what we did before; we’re just explicitly keeping track of our lookahead bits in state.

But this process still involves, essentially, two separate state machines: one explicit for our Huffman decoder, and one implicit in the implementation of our bitwise IO functions, which ultimately read data from the input stream at least one byte at a time.

A bit buffer state machine

For our next trick, let’s look at the bitwise IO we need and turn that into an explicit state machine as well. I’m assuming you’ve implemented bitwise IO before; if not, I suggest you stop here and try to figure out how to do it before reading on.

Anyway, how exactly the bit IO works depends on the bit packing convention used, the little/big endian of the compression world. Both have their advantages and their disadvantages; in this post, my primary version is going to be LSB-first, since it has a clearer correspondence to rANS which we’ll get to later. Anyway, whether LSB-first or MSB-first, a typical bit IO implementation uses two variables, one for the “bit buffer” and one that counts how many bits are currently in it. A typical implementation looks like this:

uint32_t buffer;   // The bits themselves
uint32_t num_bits; // Number of bits in the buffer right now

uint32_t getBits(uint32_t count)
{
    // Return low "count" bits from buffer
    uint32_t ret = buffer & ((1 << count) - 1);

    // Consume them
    buffer >>= count;
    num_bits -= count;

    // Refill the bit buffer by reading more bytes
    // (kMinBits is a constant here)
    while (num_bits < kMinBits) {
        buffer |= *in++ << num_bits;
        num_bits += 8;
    }

    return ret;
}

Okay. That’s fine, but we’d like for there to be only one state variable in our state machine, and preferably not just on a technicality such as declaring our one state variable to be a pair of two values. Luckily, there’s a nice trick to encode both the data and the number of bits in the bit buffer in a single value: we just keep an extra 1 bit in the state, always just past the last “real” data bit. Say we have an 8-bit state, then we assign the following codes (in binary):

in_binary(state)   num_bits
0 0 0 0 0 0 0 1    0
0 0 0 0 0 0 1 *    1
0 0 0 0 0 1 * *    2
0 0 0 0 1 * * *    3
0 0 0 1 * * * *    4
0 0 1 * * * * *    5
0 1 * * * * * *    6
1 * * * * * * *    7

The locations denoted * store the actual data bits. Note that we’re fitting 1 + 2 + … + 128 = 255 different states into an 8-bit byte, as we should. The only value we’re not using is “0”. Also note that we have num_bits = floor(log2(state)) precisely, and that we can determine num_bits using bit scanning instructions when we need to. Let’s look at how the code comes out:

uint32_t state; // As described above

uint32_t getBits(uint32_t count)
{
    // Return low "count" bits from state
    uint32_t ret = state & ((1 << count) - 1);

    // Consume them
    state >>= count;

    // Refill the bit buffer by reading more bytes
    // (kMinBits is a constant here)
    // Note num_bits is a local variable!
    uint32_t num_bits = find_highest_set_bit(state);
    while (num_bits < kMinBits) {
        // Need to clear 1-bit at position "num_bits"
        // and add a 1-bit at bit "num_bits + 8", hence the
        // "+ (256 - 1)".
        state += (*in++ + (256 - 1)) << num_bits;
        num_bits += 8;
    }

    return ret;
}

Okay. This is written to be as similar as possible to the implementation we had before. You can phrase the while condition in terms of state and only compute num_bits inside the refill loop, which makes the non-refill case slightly faster, but I wrote it the way I did to emphasize the similarities.

Consuming bits is slightly cheaper than the regular bit buffer, refilling is a bit more expensive, but we’re down to one state variable instead of two. Let’s call that a win for our purposes (and it certainly can be when low on registers). Note I only covered LSB-first bit packing here, but we can do a similar trick for MSB bit buffers by using the least-significant set bit as a sentinel instead. It works out very similar.
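
In case you’re wondering about find_highest_set_bit: it’s just floor(log2(state)), i.e. the position of the sentinel bit. One way to write it (GCC/Clang builtin shown; _BitScanReverse on MSVC or C++20’s std::bit_width work just as well):

#include <cstdint>

static inline uint32_t find_highest_set_bit(uint32_t state)
{
    // state is never 0 in this scheme ("0" is the one encoding we don't
    // use), so the builtin's undefined-for-zero case can't occur.
    return 31u - (uint32_t)__builtin_clz(state);
}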

So what happens when we plug this into the finite-state Huffman decoder from before?

State machine Huffman decoder with built-in bit IO

Note that our state machine decoder above still just kept the k lookahead bits in state, and that they’re not exactly hard to recover from our bit buffer state. In fact, they’re pretty much the same. So we can just fuse them together to get a state machine-based Huffman decoder that only uses byte-wise IO:

state = 1; // No bits in buffer
refill();  // Run "refill" step from the loop once

while (!done) {
    // Current state determines output symbol
    index = state & ((1 << k) - 1);
    *out++ = table[index].symbol;

    // Update state (consume bits)
    state >>= table[index].num_bits;

    // Refill bit buffer (make sure at least k bits in it)
    // This reads bytes at a time, but could just as well
    // read 16 or 32 bits if "state" is large enough.
    num_bits = find_highest_set_bit(state);
    while (num_bits < k) {
        state += (*in++ + (256 - 1)) << num_bits;
        num_bits += 8;
    }
}

The slightly weird refill() call at the start is just to keep the structure as similar as possible to what we had before. And there we have it, a simple Huffman decoder with one state variable and a table. Of course you can combine this type of bit IO with other Huffman approaches, such as multi-table decoding, too. You could also go even further and bake most of the bit IO into tables like Charles describes here, effectively using a table on the actual state and not just its low bits, but that leads to enormous tables and is really not a good idea in practice; not only are the tables too large to fit in the cache, general-purpose compressors will also usually spend more time building these tables than they ever spend using them (since it’s rare to use a single Huffman table for more than a few dozen kilobytes at a time).

Okay. So far, there’s nothing in here that’s not at least 20 years old.

Let’s get weird, stage 1

The decoder above still reads the exact same bit stream as the original LSB-first decoder. But if we’re willing to prescribe the exact form of the decoder, we can use a different refilling strategy that’s more convenient (or cheaper). In particular, we can do this:

state = read_3_bytes() | (1 << 24); // might as well!

while (!done) {
    // Current state determines output symbol
    index = state & ((1 << k) - 1);
    *out++ = table[index].symbol;

    // Update state (consume bits)
    state >>= table[index].num_bits;

    // Refill
    while (state < (1 << k))
        state = (state << 8) | *in++;
}

This is still a workable Huffman decoder, and it’s cheaper than the one we saw before, because refilling got cheaper. But it also got a bit, well, strange. Note we’re reading 8 bits and putting them into the low bits of state; since we’re processing bits LSB-first, that means we added them at the “front” of our bit queue, rather than appending them as we used to! In principle, this is fine. Bits are bits. But processing bits out-of-sequence in that way is certainly atypical, and means extra work for the encoder, which now needs to do extra work to figure out exactly how to permute the bits so the decoder reads them in the right order. In fact, it’s not exactly obvious that you can encode this format efficiently to begin with.

But you definitely can, by encoding backwards. Because, drum roll: this isn’t a regular table-driven Huffman decoder anymore. What this actually is is a rANS decoder for symbols with power-of-2 probabilities. The state >>= table[index].num_bits; is what the decoding state transition function for rANS reduces to in that case.

In other words, this is where we start to see new stuff. It might be possible that someone did a decoder like this before last year, but if they did, I certainly never encountered it before. And trust me, it is weird; the byte stream the corresponding encoder emits is uniquely decodable and has the same length as the bit stream generated for the corresponding Huffman or canonical Huffman code, but the bit-shuffling means it’s not even a regular prefix code stream.
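
For concreteness, here’s roughly what the per-symbol encoder step ends up looking like under these conventions (take it as a sketch, not tested code): symbols get encoded in reverse order, bytes get emitted back-to-front so the decoder can read them front-to-back, and the state invariant [2^k, 2^(k+8)) matches the byte-wise refill loop above. “code” and “len” are the symbol’s bit pattern (as the decoder expects it in the low bits) and code length.

void encode_symbol(uint32_t &state, uint8_t *&out_rev,
                   uint32_t code, uint32_t len, uint32_t k)
{
    // Renormalize first: emit low bytes until (state << len) | code is
    // guaranteed to land back inside [2^k, 2^(k+8)).
    while (state >= (1u << (k + 8 - len))) {
        *--out_rev = (uint8_t)(state & 0xff);
        state >>= 8;
    }
    // The encode-side inverse of the decoder's "state >>= num_bits":
    state = (state << len) | code;
}

The final encoder state then gets written at the front of the stream, which is what the decoder’s initial read_3_bytes() picks back up.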

Let’s get weird, stage 2: binary alias coding

But there’s one more, which is a direct corollary of the existence of alias rANS: we can use the alias method to build a fast decoding table with size proportional to the number of symbols in the alphabet, completely independent of the code lengths!

Note the alias method allows you to construct a table with an arbitrary number of entries, as long as it’s larger than the number of symbols. For efficiency, you’ll typically want to round up to the next power of 2. I’m not going to describe the exact encoder details here, simply because it’s just rANS with power-of-2 probabilities, and the ryg_rans encoder/decoder can handle that part just fine. So you already have example code. But that means you can build a fast “Huffman” decoder like this:

kMaxCodeLen = 24; // max code len in bits
kCodeMask = (1 << kMaxCodeLen) - 1;
kBucketShift = kMaxCodeLen - SymbolStats::LOG2NSYMS;

state = read_3_bytes() | (1 << 24); // might as well!

while (!done) {
    // Figure out bucket in alias table; same data structures as in
    // ryg_rans, except syms->slot_nbits (number of bits in Huffman
    // code for symbol) instead of syms->slot_nfreqs is given.
    uint32_t index = state & kCodeMask;
    uint32_t bucket_id = index >> kBucketShift;
    uint32_t bucket2 = bucket_id * 2;
    if (index < syms->divider[bucket_id])
        ++bucket2;

    // bucket determines output symbol
    *out++ = syms->sym_id[bucket2];

    // Update state (just D(x) for pow2 probabilities)
    state = (state & ~kCodeMask) >> syms->slot_nbits[bucket2];
    state += index - syms->slot_adjust[bucket2];

    // Refill (make sure at least kMaxCodeLen bits in buffer)
    while (state <= kCodeMask)
        state = (state << 8) | *in++;
}

I find this remarkable because essentially all other fast (~constant time per symbol) Huffman decoding tricks have some dependence on the distribution of code lengths. This one does not; the alias table size is determined strictly by the number of symbols. The only fundamental data-dependency is how often the “refill” code is run (it runs, necessarily, once per input byte, so it will run less often – relatively speaking – on highly compressible data than it will on high-entropy data). (I’m not counting the computation of bucket2 here because it’s just a conditional add, and is in fact written the way it is precisely so that it can be mapped to a compare-then-add-with-carry sequence.)

Note that this one really is a lot weirder still than the previous variant, which at least kept the “space” assigned to individual codes connected. This one will, through the alias table construction, end up allocating small parts of the code range for large symbols all over the place. It’s still exactly equivalent to a Huffman coder in terms of compression ratio and code “lengths”, but the underlying construction really doesn’t have much to do with Huffman at all at this point, and we’re not even emitting particular bit strings for code words anymore.

All that said, I don’t think this final variant is actually interesting in practice; if I did, I would have written about it earlier. If you’re bothering to implement rANS and build an alias table, it really doesn’t make sense to skimp out on the one extra multiply that turns this algorithm into a full arithmetic decoder (as opposed to quasi-Huffman), unless your multiplier is really slow that is.

But I do find it to be an interesting construction from a theoretical standpoint, if nothing else. And if you don’t agree, well, maybe you at least learned something about certain types of Huffman decoders and their relation to table-based ANS decoders. :)


From the archives: “Alias Huffman coding.”


This is precisely what the last post was about. So nothing new. This is just my original mail on the topic with some more details that might be interesting and/or amusing to a few people. :)

Date: Wed, 05 Feb 2014 16:43:36 -0800
From: Fabian Giesen
Subject: Alias Huffman coding.

Huffman <= ANS (strict subset)
(namely, power-of-2 frequencies)

We can take any discrete probability distribution of N events and use the Alias method to construct a O(N)-entry table that allows us to sample from that distribution in O(1) time.

We can apply that same technique to e.g. rANS coding to map from (x mod M) to “what symbol is x”. We already have that.

Ergo, we can construct a Huffman-esque coder that can decode symbols using a single table lookup, where the table size only depends on N_sym and not the code lengths. (And the time to build said table given the code lengths is linear in N_sym too).

Unlike regular/canonical Huffman codes, these can have multiple unconnected ranges for the same symbol, so you still need to deal with the range remapping (the “slot_adjust” thing) you have in Alias table ANS; basically, the only difference ends up being that you have a shift instead of a multiply by the frequency.

But there’s still some advantages in that a few things simplify; for example, there’s no need (or advantage) to using an L that’s larger than M. An obvious candidate is choosing L=M=B so that your Huffman codes are length-limited to half your word size and you never do IO in smaller chunks than that.

Okay. So where does that get us? Well, something like the MSB alias rANS decoder, with a shift instead of a multiply, really:

   // decoder state
   // suppose max_code_len = 16
   U32 x;
   U16 const * input_ptr;

   U32 const m = (1 << max_code_len) - 1;
   U32 const bucket_shift = max_code_len - log2_nbuckets;

   // decode:
   U32 xm = x & m;
   U32 xm_shifted = xm >> bucket_shift;
   U32 bucket = xm_shifted * 2;
   if (xm < hufftab_divider[xm_shifted])
     bucket++;

   x = (x & ~m) >> hufftab_shift[bucket];
   x += xm - hufftab_adjust[bucket];

   if (x < (1<<16))
     x = (x << 16) | *input_ptr++;

   return hufftab_symbol[bucket];

So with a hypothetical compiler that can figure out the adc-for-bucket thing, we’d get something like

   ; x in eax, input_ptr in esi
   movzx    edx, ax     ; x & m (for bucket id)
   shr      edx, 8      ; edx = xm_shifted
   movzx    ebx, ax     ; ebx = xm
   cmp      ax, [hufftab_divider + edx*2]
   adc      edx, edx    ; edx = bucket
   xor      eax, ebx    ; eax = x & ~m
   mov      cl, [hufftab_shift + edx]
   shr      eax, cl
   movzx    ecx, word [hufftab_adjust + edx*2]
   add      eax, ebx    ; x += xm
   movzx    edx, byte [hufftab_symbol + edx] ; symbol
   sub      eax, ecx    ; x -= adjust[bucket]
   cmp      eax, (1<<16)
   jae      done
   shl      eax, 16
   movzx    ecx, word [esi]
   add      esi, 2
   or       eax, ecx
done:

   ; new x in eax, new input_ptr in esi
   ; symbol in edx

which is actually pretty damn nice considering that’s both Huffman decode and bit buffer rolled into one. Especially so since it handles all cases – there’s no extra conditions and no cases (rare though they might be) where you have to grab more bits and look into another table. Bonus points because it has an obvious variant that’s completely branch-free:

   ; same as before up until...
   sub      eax, ecx    ; x -= adjust[bucket]
   movzx    ecx, word [esi]
   mov      ebx, eax
   shl      ebx, 16
   or       ebx, ecx
   lea      edi, [esi+2]
   cmp      eax, (1<<16)
   cmovb    eax, ebx
   cmovb    esi, edi

Okay, all that’s nice and everything, but for x86 it’s nothing we haven’t seen before. I have a punch line though: the same thing works on PPC – the adc thing and “sbb reg, reg” both have equivalents, so you can do branch-free computation based on some carry flag easily.

BUT, couple subtle points:

  1. this thing has a bunch of (x & foo) >> bar (left-shift or right-shift) kind of things, which map really really well to PPC because there’s rlwinm / rlwimi.
  2. The in-order PPCs hate variable shifts (something like 12+ cycles microcoded). Well, guess what, everything we multiply with is a small per-symbol constant, so we can just store (1 << len) per symbol and use mullw. That’s 9 cycles non-pipelined (and causes a stall after issue), but still, better than the microcode. But… wait a second.

    If this ends up faster than your usual Huffman, and there’s a decent chance that it might (branch-free and all), the fastest “Huffman” decoder on in-order PPC would, in fact, be a full-blown arithmetic decoder. Which amuses me no end.

   # NOTE: LSB of "bucket" complemented compared to x86

   # r3 = x, r4 = input ptr
   # r20 = &tab_divider[0]
   # r21 = &tab_symbol[0]
   # r22 = &tab_mult[0]
   # r23 = &tab_adjust[0]

   rlwinm    r5, r3, 24, 23, 30   # r5 = (xm >> bucket_shift) * 2
   rlwinm    r6, r3, 0, 16, 31    # r6 = xm
   lhzx      r7, r20, r5          # r7 = tab_divider[xm_shifted]
   srwi      r8, r3, 16           # r8 = x >> log2(m)
   subfc     r9, r7, r6           # (r9 ignored but sets carry)
   lhz       r10, 0(r4)           # *input_ptr
   addze     r5, r5               # r5 = bucket
   lbzx      r9, r21, r5          # r9 = symbol
   add       r5, r5, r5           # r5 = bucket word offs
   lhzx      r7, r22, r5          # r7 = mult
   li        r6, 0x10000          # r6 = op for sub later
   lhzx      r5, r23, r5          # r5 = adjust
   mullw     r7, r7, r8           # r7 = mult * (x >> m)
   subf      r5, r5, r6           # r5 = xm - tab_adjust[bucket]
   add       r5, r5, r7           # r5 = new x
   subfc     r6, r6, r5           # sets carry iff (x >= (1<<16))
   rlwimi    r10, r5, 16, 0, 16   # r10 = (x << 16) | *input_ptr
   subfe     r6, r6, r6           # ~0 if (x < (1<<16)), 0 otherwise
   slwi      r7, r6, 1            # -2 if (x < (1<<16)), 0 otherwise
   and       r10, r10, r6
   andc      r5, r5, r6
   subf      r4, r7, r4           # input_ptr++ if (x < (1<<16))
   or        r5, r5, r10          # new x

That should be a complete alias rANS decoder assuming M = L = b = 2^16.

-Fabian


A small note on SIMD matrix-vector multiplication


Suppose we want to calculate a product between a 4×4 matrix M and a 4-element vector v:

Mv = \begin{pmatrix}a_x & b_x & c_x & d_x \\ a_y & b_y & c_y & d_y \\ a_z & b_z & c_z & d_z \\ a_w & b_w & c_w & d_w\end{pmatrix} \begin{pmatrix}v_x \\ v_y \\ v_z \\ v_w\end{pmatrix}

The standard approach to computing Mv using SIMD instructions boils down to taking a linear combination of the four column vectors a, b, c and d, using standard SIMD componentwise addition, multiplication and broadcast shuffles.

  // Given M as its four constituent column vectors a, b, c, d,
  // compute r=M*v.
  r = v.xxxx*a + v.yyyy*b + v.zzzz*c + v.wwww*d;

This computes the vector-matrix product using four shuffles, four (SIMD) multiplies, and three additions. This is all bog-standard. And if the ISA we’re working on has free broadcast swizzles (ARM NEON for example), we’re done. But if not, can we do better? Certainly if we know things about M or v: if M has a special structure, or some components of v are known to be always 0, 1 or -1, chances are good we can save a bit of work (whether it makes a difference is another matter). But what if M and v are completely general, and all we know is that we want to transform a lot of vectors with a single M? If v is either given as or returned in SoA form (structure-of-arrays), we can reduce the number of per-vector shuffles greatly if we’re willing to preprocess M a bit and have enough registers available. But let’s say we’re not doing that either: our input v is in packed form, and we want the results packed too. Is there anything we can do?

There’s no way to reduce the number of multiplies or additions in general, but we can get rid of exactly one shuffle per vector, if we’re willing to rearrange M a bit. The trick is to realize that we’re using each of v.x, v.y, v.z, and v.w exactly four times, and that the computations we’re doing (a bunch of component-wise multiplies and additions) are commutative and associative, so we can reorder them, in exact arithmetic anyway. (This type of computation is usually done in floating point, where we don’t actually have associativity, but I’m going to gloss over this.)

Let’s look at our first set of products, v.xxxx * a. We’re just walking down a column of M, multiplying each element we see by v.x. What if we walk in a different direction? Going along horizontals turns out to be boring (it’s essentially the same, just transposed), but diagonals of M are interesting, the main diagonal in particular.

So here’s the punch line: we form four new vectors by walking along diagonals (with wrap-around) as follows:

e = \begin{pmatrix} a_x \\ b_y \\ c_z \\ d_w \end{pmatrix} \quad  f = \begin{pmatrix} b_x \\ c_y \\ d_z \\ a_w \end{pmatrix} \quad  g = \begin{pmatrix} c_x \\ d_y \\ a_z \\ b_w \end{pmatrix} \quad  h = \begin{pmatrix} d_x \\ a_y \\ b_z \\ c_w \end{pmatrix}

Phrasing the matrix multiply in terms of these four vectors, we get:

  r = v*e + v.yzwx*f + v.zwxy*g + v.wxyz*h;

Same number of multiplies and adds, but one shuffle per vector less (because the swizzle pattern for v in the first term is xyzw, which is the natural ordering of v). Also note that forming e, f, g, and h given M in column vector form is also relatively cheap: it’s a matrix transposition with a few post-swizzles to implement the cyclic rotations. If you have M as row vectors (for example because it’s stored in row-major order), it’s even cheaper.
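
For concreteness, here’s a minimal sketch of both halves using SSE intrinsics. The function names are mine, and this particular shuffle network for the setup is just one way to do it (note that if M already comes as rows, the first transpose disappears):

  #include <xmmintrin.h>

  // One-time preprocessing: turn the columns a,b,c,d of M into the
  // diagonal vectors e,f,g,h. The x component lives in lane 0.
  static void make_diagonals(__m128 a, __m128 b, __m128 c, __m128 d,
                             __m128 *e, __m128 *f, __m128 *g, __m128 *h)
  {
      _MM_TRANSPOSE4_PS(a, b, c, d);                     // a..d = rows of M
      b = _mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 3, 2, 1)); // rotate left by 1
      c = _mm_shuffle_ps(c, c, _MM_SHUFFLE(1, 0, 3, 2)); // rotate left by 2
      d = _mm_shuffle_ps(d, d, _MM_SHUFFLE(2, 1, 0, 3)); // rotate left by 3
      _MM_TRANSPOSE4_PS(a, b, c, d);                     // gather diagonals
      *e = a; *f = b; *g = c; *h = d;
  }

  // Per-vector work: three shuffles, four multiplies, three adds.
  static __m128 mat_vec(__m128 v, __m128 e, __m128 f, __m128 g, __m128 h)
  {
      __m128 vy = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 3, 2, 1)); // v.yzwx
      __m128 vz = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)); // v.zwxy
      __m128 vw = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)); // v.wxyz
      __m128 r = _mm_mul_ps(v, e);
      r = _mm_add_ps(r, _mm_mul_ps(vy, f));
      r = _mm_add_ps(r, _mm_mul_ps(vz, g));
      r = _mm_add_ps(r, _mm_mul_ps(vw, h));
      return r;
  }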

So: multiplying a packed 4-vector with a constant 4×4-matrix takes one shuffle less than the standard approach, if we’re willing to do some preprocessing on M (or store our matrices in a weird layout to begin with). Does this matter? It depends. On current desktop x86 cores, it’s pretty marginal, because SIMD shuffles can execute in parallel (during the same cycle) with additions and multiplications. On older cores with fewer execution resources, on in-order SIMD CPUs, and on low-power parts, it can definitely help though.

For what it’s worth: if your 4D vectors come from graphics or physics workloads and are actually homogeneous 3-vectors with a constant w=1 and no projective transforms anywhere in sight, you can exploit that structure explicitly for higher gains than this. But I ran into this with a DSP workload (with v just being a vector of 4 arbitrary samples), and in that case it’s definitely useful to know, especially since anything convolution-related tends to have highly diagonal (Toeplitz, to be precise) structure to begin with.


The Modulith


Much has been written about all the myriad ways to go wrong when writing software. Poor management; scope creep; too little structure, not modular enough, and it’s a “big ball of mud”. Too much (or too rigid) and it’s a “software crystal”, impossible to alter. And so on.

Suppose you get all that right and actually ship a useful system to users: it solves their problems well enough, and the code is reasonably clean, with a sound design and a modular structure with interfaces that, while not perfect, work okay. That’s about as good as it gets! Well done.

Alas, you’re not out of the woods. Success has its own failure modes, and I want to talk about one in particular that affects modular designs.

The arguments for modularity are well known: separating concerns breaks large systems down into smaller constituent parts that can be understood individually, with clearly-defined interfaces between them. Ideally, modules are designed so they can be developed and tested in isolation, and if an individual module is found wanting (say it’s unreliable, faulty or there are simply better solutions available), it can be replaced with another module provided it has the same interface.

And there really are systems like that, where the interfaces are rigid and well-specified, components come only in a handful of “shapes”, and everything cleanly fits together, like Lego bricks. But more commonly, shipping systems look like this (prepare for an extended metaphor):

"Dry stone wall, Island of Mull". Photo by Jan Smith, CC-BY 2.0

“Dry stone wall, Island of Mull”. Photo by Jan Smith, CC-BY 2.0

The modules have irregular shapes and irregular sizes. Some are big, some are quite small. Some closely align with their neighbors; others have big gaps between them. They add up to a coherent whole, but it’s clear that for most of the development time, none of these components really had to have any particular shape. Occasionally you need a small piece with a specific shape to fill a gap, but for the most part, you just work with the materials you have.

The result is still “modular”; it’s built out of smaller pieces, each with their own clearly defined boundaries. But it’s not very regular, and outright weird in some places. That chipped corner on one of the bottom pieces was just an early mistake, but it made for a good place to stick that one flat rock on and somehow that ended up being one of the primary supports for the whole thing. And while building that wall, “I need a rock, about this big” was the only constraint you really had, and you just sort of piled it on. But when repairing it after one of the pieces has been damaged, working out the right shape, finding a replacement that meets that description and getting it in place is really tricky, fiddly work. (End of extended metaphor.)

Know any systems like that? I certainly do. And the end result is what I hereby dub a “modulith” (I am sure this has been observed and named before, but I haven’t seen it elsewhere yet). Made out of small, distinct, cleanly separable pieces, but still, everything but the topmost layer is actually kind of hard to disentangle from the rest, due to a myriad of small interactions with everything surrounding it. Because once you use a module as a building block for something else, there’s a disturbing tendency for all of its remaining quirks and bugs to effectively become part of the spec, as other modules (implicitly or explicitly) start to rely on them.

This is related to, but distinct from, other concepts such as software entropy and technical debt, which primarily deal with effects within a single codebase over time. Here we are dealing with something slightly different: as a particular component is successfully used or re-used (in unmodified form!), the users of said code tend to end up relying (often inadvertently) on various unspecified or underspecified behaviors, implicitly assuming a stronger contract than the component is actually supposed to provide. At that point, your choices are to either make those assumed behaviors actually contractual (not breaking existing code at the cost of severely constraining future evolution of said component), or to fix all users that make stronger assumptions than what is guaranteed (easier said than done if the component in question is popular; often causes ripple effects that break yet more code).

Either way, I don’t have any good solutions, but I’m feeling whimsical and haven’t seen this exact problem described before, so I’m naming it. In the extremely likely case that this has already been described and named by someone else, I’d appreciate a reference!



Intervals in modular arithmetic


I wrote about regular interval overlap checking before. Let’s consider a somewhat trickier case, where our intervals are not defined over the real numbers or regular integers, but are instead subsets of the integers modulo N (for some fixed N). In this post, I’ll just consider the math; but this is something I’ve used in several places, so I expect I’ll be writing about some concrete uses eventually. Throughout this document, I’ll be writing \mathbb{Z}_N for the set \mathbb{Z} / N\mathbb{Z} of integers mod N (feel free to ignore the notation if you’re not familiar with it).

Intervals mod N, first attempt

The first thing we need to do is agree on what we mean by an interval mod N. Suppose that N=16. In that setting, the meaning of the interval [5,7] is pretty clear: the set {5, 6, 7}. And we can also take something like the integer interval [14,17] representing the set {14, 15, 16, 17} and reduce it mod N to get {14, 15, 0, 1}. But doing this required us to use values outside of {0, …, N-1} to specify the end points. This is often a useful model – I talk about some advantages when doing this for ring buffers in this old post – but it’s a bit of a cheat; we’d like a way to specify intervals mod N that don’t require us to leave the set of integers mod N to handle the “wraparound” case.

This means that we want to define something like the interval [14,1]. For the regular integers, this would be the empty set; but mod N, it makes more sense to properly wrap around (for example, when you say “the trains don’t run from 11:30pm to 5:30am”, it’s understood that you mean 5:30am the next day, not an empty interval). One way to do this is by handling the wrap-around case separately. Assuming a, b are reduced mod N, we could for example define:

[a,b] \mod N := \begin{cases}  \{ a, a + 1, \hdots, b \} & \mbox{if } a \le b \\  \{ a, a + 1, \hdots, N - 1 \} \cup \{ 0, 1, \hdots, b \} & \mbox{if } a > b  \end{cases}

This works, but it’s messy, and chopping up our single connected interval mod N into two pieces for the wrap-around case feels like it’s missing something fundamental about arithmetic mod N. We’re thinking in terms of a number line:

Interval [14,1] on the number line

But the integers mod N aren’t very “line-like” at all, and they’re commonly (and more appropriately) drawn as a circle (like a clock, with 0 at the top and numbers increasing clockwise, to maximize familiarity—apologies to all mathematicians who expect positive angles to move counter-clockwise).

Interval [14,1] on a circle

And in this visualization, there is absolutely nothing special about our interval [14,1]. We happen to pass through 0, but 0 isn’t actually special. It’s just another point on the circle. The reason 0 becomes special when we think in terms of regular integers (and hence a number line) is that it’s usually the place where we decide to cut open the circle so we can flatten it into a line. But that has nothing to do with the circle (or the integers mod N); it’s an incidental artifact of the representation we’re picking.

A different approach

The key problem here is that intervals are normally defined in terms of ordering relationships. For example, real-number intervals are usually defined as sets

[a,b] = \{ x \in \mathbb{R} | a \le x \le b \}.

based on a total order “≤”. But in our circle, we don’t have a useful ordering relation. When you write “4 < 7”, this means that 4 is to the left of 7 on the real number line. If you start at 4 and keep walking right, you’re gonna come by 7 eventually. If you instead were to walk left, the numbers would just keep getting smaller indefinitely.

On our circle mod N, this is not true. If you start at 4 and keep walking in the positive direction (clockwise in our case), you’ll reach 7 after three steps. If you instead walk counterclockwise from 4, you still reach 7 – this time after taking the long way round, in thirteen steps. This is true for any pair of numbers on the circle, which after all represent congruence classes of integers mod N (if you don’t know the terminology, just ignore the rest of this paragraph). It makes sense to say that the integer 4 is less than the integer 7, but the set \bar{4} = 4 + N\mathbb{Z} = \{ 4 + kN | k \in \mathbb{Z} \} is not in any useful sense “less” or “greater” than the set \bar{7} = 7 + N\mathbb{Z}. Viewed in that light, the definition for [a,b] mod N given above is outright weird; the case distinction between “a ≤ b” and “a > b” is testing a condition that doesn’t really make sense in terms of the concepts being expressed.

So ordering relationships are on shaky footing. We can, however, usefully talk about distances. For example, “4” and “7” are definitely three steps apart when we take the shortest path, also three steps apart when we’re only allowed to move clockwise, and thirteen steps apart when we’re only allowed to move counterclockwise. In general, we can define distance functions for the “increasing” (clockwise), “decreasing” (counter-clockwise) and shortest-path distances between two points (we won’t actually be using that latter one, I just mention it for completeness):

d^{\,+}(a,b) := (b-a) \bmod N
d^{\,-}(a,b) := (a-b) \bmod N
d(a, b) := \min(d^{\,+}(a,b), d^{\,-}(a,b))

These distance functions are, technically speaking, mappings d : \mathbb{Z}_N \times \mathbb{Z}_N \rightarrow \mathbb{N}_0 from the integers mod N to the natural numbers (hence the explicit use of mod as a binary operation). The distances are regular non-negative integers; while points in the integers mod N can’t be meaningfully compared, distances can. That’s why they’re useful to us. Also note that the “mod N” here is the modulus under Euclidean/floor division – that is, a non-negative value smaller than N.

Most programming languages use truncating division instead, which means the modulus has absolute value less than N but might be negative; you need to consider this when turning any of the equations in here into code!
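
To make that concrete, here’s a minimal sketch of the Euclidean modN helper that the code below assumes; N is a compile-time constant here (16, as in the examples), but any fixed modulus works the same way:

enum { N = 16 }; // example modulus

// Euclidean (always non-negative) x mod N. C's % truncates toward
// zero, so the result needs fixing up for negative x.
static int modN(int x)
{
    int m = x % N;
    return (m < 0) ? m + N : m;
}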

Given all this, let’s go back to our problem of defining intervals nicely. Well, how do we draw something like the interval [14,1] on our circle? We just move the pen to “14” and start drawing a clockwise arc until we hit the “1”. We can use that exact idea to define intervals: namely, a point x is inside the interval [a,b] if, starting from a and walking in increasing order (clockwise), we hit x before we leave the interval. That leads to our improved definition of a closed interval mod N:

[a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,+}(a,x) \le d^{\,+}(a,b) \}
or the equivalent
[a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,-}(b,x) \le d^{\,-}(b,a) \}

and the generalizations to half-open intervals are straightforward:

[a,b) \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,+}(a,x) < d^{\,+}(a,b) \}
(a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,-}(b,x) < d^{\,-}(b,a) \}

Things are simpler if we always start measuring from the closed (inclusive) end, so that’s what we do. I’ll drop the (mod N) for the rest of the article; we know that’s our setting.

Point-in-interval tests and symmetry

This definition can be turned into code immediately and leads to fairly elegant point-in-interval tests that don’t break down into multiple cases:

// modN(x) is assumed to calculate Euclidean (=non-negative) x % N.

// x in [a,b] (mod N)?
static bool point_in_closed_interval(int x, int a, int b)
{
    return modN(x - a) <= modN(b - a);
}

// x in [a,b) (mod N)?
static bool point_in_half_open_interval(int x, int a, int b)
{
    return modN(x - a) < modN(b - a);
}
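
Plugging in a few values makes the wrap-around behavior tangible. A quick sanity-check sketch, with N = 16 and the interval [14,1] = {14, 15, 0, 1} from earlier:

#include <assert.h>

// Sanity checks against [14,1] (mod 16), using the functions above.
static void check_point_tests(void)
{
    assert(point_in_closed_interval(15, 14, 1)); // modN(1)  = 1 <= modN(-13) = 3
    assert(point_in_closed_interval(0,  14, 1)); // modN(-14) = 2 <= 3
    assert(!point_in_closed_interval(7, 14, 1)); // modN(-7)  = 9 >  3
}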

At this point, you should also be noticing something funny about that code (it’s a bit harder to see in the definitions above since the detour through the distance functions obscures the actual computation going on); namely, the fact that we’re subtracting a and then reducing mod N on both sides.

It’s confession time. This whole notion with measuring distances along the circle is not how I derived this; it’s the best way I know to think about this problem, but that knowledge came in retrospect. I got here the long way round, with several false starts. The key idea turned out to be thinking about symmetries, which is always worth doing if the problem you’re working on has any; see for example this post I wrote 6 years ago!

In this case, the integers mod N are a cyclic group—they wrap around. That’s why it makes sense to draw them as a circle. And that’s why being attached to any particular point being 0 is a bit silly: a circle has continuous rotational symmetry, and our discrete cyclic group has N-fold rotational symmetry. We get the same image if we rotate by an N’th of a full turn. And going back to our setting, since we really only care about distances, we can cyclically rotate our points around any way we want without changing the results.

What these tests really do is exploit this symmetry, “translating” (or more appropriately, rotating) everything by -a. This turns testing x against the interval [a,b] (or the half-open variants) into testing (x-a) \bmod N against [0,(b-a) \bmod N). Once we move the start point of the interval to 0, we don’t have to worry about wrap-around happening in inconvenient places anymore; comparing the integers directly works fine. Score one for symmetry.

Interval overlap

This takes care of points. Now for something trickier: how do we test for interval overlap?

The standard tests for interval overlap are slick, but not really applicable in our situation: the center-extent trick actually generalizes just fine to integers mod N and much more general settings (it works in arbitrary metric spaces provided the sets in question can be expressed as balls in the target metric) but is not ideal in a discrete setting, and the direct tests rely heavily on an order structure we don’t have in our cyclic world.

But what we do have are reasonably simple point-in-interval tests. Can we build an interval overlap test out of them? Well, we can try. Suppose we have two intervals [a,b] and [c,d]. For example, surely, if the “left” endpoint c of [c,d] falls inside [a,b], the two intervals overlap – after all, we know a point that is in both of them! And by symmetry, swapping the roles of the two intervals, if a falls inside [c,d], we again must have overlap.

If you start drawing a few pictures of intervals in various configurations, you’ll soon notice that testing both of these conditions seems to detect all configurations where the intervals actually overlap (and this works with intervals in the real numbers as well as in our discrete setting). The question is, can we prove that testing these two points is sufficient, or are we merely lucky? It turns out we didn’t just luck out; here’s a quick proof:

Lemma: ([a,b] \cap [c,d]) \ne \emptyset \Leftrightarrow (c \in [a,b]) or (a \in [c,d]). In words, the two (non-empty) intervals overlap if and only if at least one of a or c is inside the respective other interval.
Proof: “\Leftarrow”: c \in [c,d], so if we also have c \in [a,b], then that gives us a point in the intersection of [a,b] and [c,d], which therefore can’t be empty. Likewise with a. Therefore, if either of the two conditions on the right-hand side holds, we indeed have a non-empty intersection.
“\Rightarrow”: The intersection isn’t empty, so take x \in ([a,b] \cap [c,d]). x is in both intervals. Informally, we now “slide” x to the “left” (in negative direction) through the intersection until we hit either of the interval end points a or c. Formally, consider the distances from the interval start points to x: d_a := d^{\,+}(a,x) and d_c := d^{\,+}(c,x). Suppose that d_a ≤ d_c. Then we have

0 \le d^{\,+}(c,a) = a - c = a - x + x - c
= (x - c) - (x - a) = d_c - d_a \le d_c \le d^{\,+}(c,d)

In words, a’s distance from c, in positive direction, is no more than x’s; since x already was inside of [c,d], surely a must be too. If instead d_a > d_c, we learn that c is inside [a,b]. In either case, this proves the claim.

This takes care of closed intervals. Note that this proof leans heavily on the intervals in question being non-empty. We can readily adapt it to half-open intervals of the form [a,b), but we do need to make sure to catch either interval being empty first, in which case the intersection of intervals is necessarily empty too. Likewise, you can also easily adapt it to half-open intervals of type (a,b], but in this case you want to be using the “right” end points of intervals for testing, not the left end points.

This may all sound complicated, but the implementation is actually quite short and simple:

// do [a,b] and [c,d] overlap?
static bool closed_intervals_overlap(int a, int b, int c, int d)
{
    return modN(c - a) <= modN(b - a) || modN(a - c) <= modN(d - c);
}

// do [a,b) and [c,d) overlap?
static bool half_open_intervals_overlap(int a, int b, int c, int d)
{
    int w0 = modN(b - a);
    int w1 = modN(d - c);

    return (w1 != 0 && modN(c - a) < w0) ||
           (w0 != 0 && modN(a - c) < w1);
}
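
Again a couple of quick checks (N = 16) to see the wrap-around case handled correctly:

#include <assert.h>

static void check_overlap_tests(void)
{
    // [14,1] = {14,15,0,1} and [0,4] share {0,1}.
    assert(closed_intervals_overlap(14, 1, 0, 4));
    // [2,5) and [5,14) merely touch at 5, so no overlap.
    assert(!half_open_intervals_overlap(2, 5, 5, 14));
}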

And there we go. Interval overlap tests mod N.

Implementation notes and variations

The most common case in systems programming involves power-of-2 N. In that scenario, modN is readily implemented via bit masking as a single binary AND operation. If N is 2^32 or 2^64, 32- or 64-bit unsigned integers can (and should) be used directly, in which case there is no need for explicit masking in the code (although on some 64-bit architectures, unsigned 32-bit arithmetic compiles into 64-bit arithmetic with masking anyway). This is one of the rare cases where unsigned integer overflow works exactly the way we need. In this case, you want to be using unsigned integers throughout.

As presented, we’re working with intervals given in terms of two end points, because that’s the most common presentation. But computationally, all of the functions in the code shown actually use a single endpoint (on the “closed” end) along with the width of the interval. That’s the modN(b - a) and modN(d - c) terms we keep computing. So if you’re working a lot with intervals mod N, or storing them, you probably want to consider working in that representation directly.
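
A sketch of that representation (the type and field names are made up for illustration; it reuses the modN helper from above):

// Half-open interval [start, start+len) mod N, stored as start + width.
typedef struct {
    int start; // in [0, N)
    int len;   // modN(b - a), i.e. the interval's width
} IntervalN;

// x in [start, start+len) (mod N)?
static bool interval_contains(IntervalN iv, int x)
{
    return modN(x - iv.start) < iv.len;
}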

The intervals in this article are intentionally defined with their endpoints coming from \mathbb{Z}_N, to force us to think about the effects of the cyclical wrap-around in the integers mod N cleanly. Now that we’ve spent some time thinking it through, we can relax that requirement. In particular, when using ring buffers with what I call the “virtual stream” model (read/write cursors not reduced mod N), it can make sense to just not reduce the interval lengths mod N at all—that is, turn occurrences of modN(b - a) and modN(d - c) in the code into plain b - a and d - c, respectively, or reduce with respect to a modulus that’s a larger multiple of N. Among other things, this allows us to have half-open intervals covering the entirety of \mathbb{Z}_N, something the fully reduced variant cannot easily do.

And as a final closing remark, this article comes in at 2500 words to explain less than 20 lines of code doing nothing but straightforward integer arithmetic. That has to be some sort of personal record. I’m not sure if that’s good or bad.


Reading and writing are less symmetric than you (probably) think


I am talking about the I/O operations as used in computing here. A typical example of how this kind of thing is exposed are the POSIX syscalls read(2) and write(2), which have the following C function prototypes:

ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);

Now these are raw system calls; user programs can use them directly, but they usually don’t. They normally go through some buffered IO layer; in the C standard library, this means FILE* and the functions fread and fwrite, which split count into a product of two values in a vestigial nod to record-based IO but are otherwise equivalent. For concreteness, suppose we’re interfacing with actual storage (i.e. not a pipe, socket, virtual filesystem etc.). Then conceptually, a “read”-class operation (like read or fread) grabs bytes from a file, say on a disk somewhere, and puts them into the specified memory buffer, and a “write”-class operation takes bytes in a memory buffer and writes them to the disk. Which definitely sounds nice and symmetric—but there are some important behavioral asymmetries between them, especially when errors are in the mix. The reasons have to do with buffering.

Buffered I/O

In general, file I/O operations in your program will not go directly to a storage device; data instead makes its way through several buffering layers (most of which can be disabled using various flags, but in normal usage these layers are on). These layers are there for good reason: on the kernel side, there’s what’s traditionally called the “buffer cache”. Storage devices are “block devices”, which means they store data in blocks. The block size depends on the device; on old hard disks it used to be 512 bytes, CDs, DVDs etc. tend to use 2k blocks, newer storage devices are now on 4k blocks. Block devices only read entire blocks at a time; that means random byte-aligned IO requests such as “read 100 bytes from disk at byte offset 1234567” or “write 2000 bytes to location 987654” can’t be directly passed to the device at all. The buffer cache is used to translate these requests into block-aligned read and write operations that the device understands; non-block-aligned writes also require reading the previous contents of the block that are not overwritten, and those go in the buffer cache as well. And of course, as the name suggests, it acts as a cache.

On the user-space side, we also have buffers, albeit for a different reason: read and write are system calls, and as such incur a transition to kernel space and back. They also need to check for and report errors every time they are invoked. And of course they actually need to do the work we want them to do – copy the data from (read) or to (write) the buffer cache. System call overhead varies between OSes, but it’s safe to assume that the whole process takes at least a couple hundred clock cycles in the best case. So for the overhead not to completely dominate the actual work being done, you generally want to be reading or writing at least a few kilobytes at a time. For scale reference, typical IO buffer sizes as of this writing are 4096 bytes (e.g. Visual C++ 2013 FILE*, Go bufio.Reader/bufio.Writer) or 8192 bytes (e.g. GNU libc FILE*, Java BufferedReader/BufferedWriter).

Often there are more buffers too. For example, most hard drives and RAID controllers have their own caches, and it is not uncommon for user-space code to have several layers of buffering for various reasons. But this is enough to illustrate the basic structure.

All of these buffers are used in much the same way for reading and writing. So where’s the behavioral asymmetry between reading and writing that I’m talking about? You need to think about the state of the world (so to speak) after you call a read-type call and how it differs from the state of the world after a write-type call.

What happens when you issue an IO operation

Let’s look at what goes into servicing a read-type call first: say you open a C FILE* and want to read the first 100 bytes via fread. The C standard I/O library notices that its buffers are currently empty, and tries to fill them up, issuing a read system call to read say 4k worth of data. The kernel in turn asks the file system where the data for the first 4k of the file is located, checks the buffer cache to see if it already has a copy in memory, and if not, it issues a block read command to the storage device. Either way, the kernel makes sure to get those 4k of data into the buffer cache and from there copies them into the standard IO buffers in user-space memory, then returns. The standard IO library looks at the result of the system call, updates the state of its IO buffers, and then copies the 100 requested bytes into the memory buffer the app supplied.

And what if anything goes wrong? Say the file is smaller than 100 bytes, or there was an error reading from disk, or the file is on a network file system that’s currently inaccessible. Well, if that happens, we catch it too: if something goes wrong filling up the buffer cache, the kernel notices and returns an appropriate error to the I/O library, which can in turn pass errors on to the app. Either way, anything that can go wrong will go wrong before the fread call returns. All the intervening layers need to do is make sure to keep the error information around so it can be passed to the app at the appropriate time.

Now let’s go the other way round: let’s open a fresh new file with a 4k write buffer[1] and issue a 100-byte fwrite. This time, the IO library copies the 100 bytes from the app buffer to the write buffer… and immediately returns, reporting success. The underlying write system call will not be executed until either the buffer fills up or is flushed as a result of calling fflush, fseek, fclose or similar.

Quick imaginary show of hands: who reading this habitually checks return codes of fread or fwrite at all? Of those saying “yes”, who also remembers to check return codes of fflush, fseek or fclose? Probably not a lot. Well, if you don’t, you’re not actually checking whether your writes succeeded at all. And while these remarks are C-specific, this general pattern holds for all buffered writer implementations. Buffered writing delays making the actual write system call; that’s kind of the point. But it implies that error reporting is delayed too!
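
To spell that out, here’s a minimal sketch of what careful error checking looks like with buffered stdio; the point is that any of the three calls can be the one that reports the failure:

#include <stdio.h>

// Write a buffer to a file, catching errors no matter how late they
// surface. (A sketch; error details are left to the caller.)
static int write_whole_file(const char *path, const void *buf, size_t size)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;

    int ok = fwrite(buf, 1, size, f) == size; // may just fill the buffer...
    ok &= fflush(f) == 0;  // ...so the write may only fail here...
    ok &= fclose(f) == 0;  // ...or even here.
    return ok ? 0 : -1;
}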

More buffers

This type of problem is not restricted to user-space buffering either. The implementation of write itself has similar issues: generally, after a successful write call, your data made it to the buffer cache, but it hasn’t hit actual storage yet. The kernel will make its best effort to write that data to storage eventually (hopefully within the next few seconds), but if there’s a device error or a system crash, that data could still be lost. Both of these are relatively rare these days, so we don’t worry about them too much, right? Except for those of us who do.

Oh, and while write will go to some lengths to make sure there are no nasty surprises when writing to local filesystems (for example, even with delayed write-back, you want to make sure to reserve free space on the disk early[2], lest you run out during write-back), at least on POSIX systems there can still be write errors that you only get notified about on close, especially when network filesystems such as NFS or SMB/CIFS are in play (I’m not aware of any such late-reported error conditions on Windows, but that doesn’t mean there aren’t any). Something to be aware of: if you’re using these system calls and are not checking the return code of close, you might be missing errors.

Which brings up another point: even on local file systems, you only have the guarantee that the data made it to the buffer cache. It hasn’t necessarily made it to the storage device yet! If you want that (for example, you’ve just finished writing some important data and want to make sure it actually made it all the way), you need to call fsync[3] on the file descriptor before you close it. The Windows equivalent is FlushFileBuffers.
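
Putting the pieces together at the system-call level, here’s a hedged sketch of the full sequence (EINTR handling and the directory fsync discussed below are omitted):

#include <unistd.h>

// POSIX version: write everything, force it to the device, close.
// Every step is checked because every step can fail.
static int write_and_sync(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0)
            return -1;      // error reported at write time...
        p += n;
        count -= (size_t)n;
    }
    if (fsync(fd) != 0)
        return -1;          // ...or only when the data hits the device...
    return close(fd);       // ...or as late as close (e.g. on NFS).
}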

So, if you make sure to check error codes on every write, and you fsync before you close (again checking errors), that means that once you’ve done all that, you’re safe and the data has successfully made it to permanent storage, right?

Well, two final wrinkles. First, RAID controllers and storage devices themselves have caches too. They’re supposed to have enough capacitors so that if the system suddenly loses power, they still have sufficient power to actually get your data written safely. Hopefully that’s actually true. Good luck. Second, the data may have made it to storage, but that doesn’t necessarily mean it’s actually visible, because the metadata necessary to reach it might not have been written yet. Quoting the Linux man page on fsync(2):

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

For better or for worse, I can’t recall ever seeing code doing this in the wild, though. I’m honestly not sure what the actual guarantees are that popular Linux file systems provide about these things. If you’re handling really really important data, you should probably find out.

Conclusion and summary

Buffering on the read side is great and pretty much transparent because if anything goes wrong, it will go wrong before you ever get to see the data, and you’ll get a proper error code.

Buffering on the write side is much trickier because it delays actual writing and error reporting in ways that most programmers are supposed to be aware of, but usually aren’t. Few are aware of the actual steps necessary to ensure that data made it to storage safely, and some of the OS abstractions involved don’t exactly make things easier (see the fsync quote above). Here be dragons.

Footnotes

[1] Full buffering not line buffering mode, in case anyone’s feeling nit-picky.
[2] Actual block allocation—as in, selecting which physical location on the device file writes will end up at—is often delayed in modern file systems, to make it easier to assign mostly-contiguous space to large files where possible. However, even with delayed allocation, you want to keep track of how much space is going to be available on the device once all currently pending writes complete, so that you can return “out of disk space” errors appropriately instead of discovering that you’re out of space 10 seconds after the user exited the app he was using to edit and save his Important Document. Because that would be bad. This sounds as though it’s just a matter of accounting, but it gets tricky with file systems that use extents and not bitmap-based block allocation: getting the last few discontinuous blocks on the device means that you might need extra space to store the file extents! All of which is to say: this stuff is tricky to get right.
[3] Yes, the name looks like it’s part of the C library buffered IO package, but it’s a proper syscall.


A small note on SIMD matrix-vector multiplication

$
0
0

Suppose we want to calculate a product between a 4×4 matrix M and a 4-element vector v:

Mv = \begin{pmatrix}a_x & b_x & c_x & d_x \\ a_y & b_y & c_y & d_y \\ a_z & b_z & c_z & d_z \\ a_w & b_w & c_w & d_w\end{pmatrix} \begin{pmatrix}v_x \\ v_y \\ v_z \\ v_w\end{pmatrix}

The standard approach to computing Mv using SIMD instructions boils down to taking a linear combination of the four column vectors a, b, c and d, using standard SIMD componentwise addition, multiplication and broadcast shuffles.

  // Given M as its four constituent column vectors a, b, c, d,
  // compute r=M*v.
  r = v.xxxx*a + v.yyyy*b + v.zzzz*c + v.wwww*d;

This computes the vector-matrix product using four shuffles, four (SIMD) multiplies, and three additions. This is all bog-standard. And if the ISA we’re working on has free broadcast swizzles (ARM NEON for example), we’re done. But if not, can we do better? Certainly if we know things about M or v: if M has a special structure, or some components of v are known to be always 0, 1 or -1, chances are good we can save a bit of work (whether it makes a difference is another matter). But what if M and v are completely general, and all we know is that we want to transform a lot of vectors with a single M? If v is either given as or returned in SoA form (structure-of-arrays), we can reduce the number of per-vector shuffles greatly if we’re willing to preprocess M a bit and have enough registers available. But let’s say we’re not doing that either: our input v is in packed form, and we want the results packed too. Is there anything we can do?

There’s no way to reduce the number of multiplies or additions in general, but we can get rid of exactly one shuffle per vector, if we’re willing to rearrange M a bit. The trick is to realize that we’re using each of v.x, v.y, v.z, and v.w exactly four times, and that the computations we’re doing (a bunch of component-wise multiplies and additions) are commutative and associative, so we can reorder them, in exact arithmetic anyway. (This type of computation is usually done in floating point, where we don’t actually have associativity, but I’m going to gloss over this.)

Let’s look at the our first set of products, v.xxxx * a. We’re just walking down a column of M, multiplying each element we see by vx. What if we walk in a different direction? Going along horizontals turns out to be boring (it’s essentially the same, just transposed), but diagonals of M are interesting, the main diagonal in particular.

So here’s the punch line: we form four new vectors by walking along diagonals (with wrap-around) as follows:

e = \begin{pmatrix} a_x \\ b_y \\ c_z \\ d_w \end{pmatrix} \quad  f = \begin{pmatrix} b_x \\ c_y \\ d_z \\ a_w \end{pmatrix} \quad  g = \begin{pmatrix} c_x \\ d_y \\ a_z \\ b_w \end{pmatrix} \quad  h = \begin{pmatrix} d_x \\ a_y \\ b_z \\ c_w \end{pmatrix}

Phrasing the matrix multiply in terms of these four vectors, we get:

  r = v*e + v.yzwx*f + v.zwxy*g + v.wxyz*h;

Same number of multiplies and adds, but one shuffle per vector less (because the swizzle pattern for v in the first term is xyzw, which is the natural ordering of v). Also note that forming e, f, g, and h given M in column vector form is also relatively cheap: it’s a matrix transposition with a few post-swizzles to implement the cyclic rotations. If you have M as row vectors (for example because it’s stored in row-major order), it’s even cheaper.

So: multiplying a packed 4-vector with a constant 4×4-matrix takes one shuffle less than the standard approach, if we’re willing to do some preprocessing on M (or store our matrices in a weird layout to begin with). Does this matter? It depends. On current desktop x86 cores, it’s pretty marginal, because SIMD shuffles can execute in parallel (during the same cycle) with additions and multiplications. On older cores with less execution resources, on in-order SIMD CPUs, and on low-power parts, it can definitely help though.

For what it’s worth: if your 4D vectors come from graphics or physics workloads and are actually homogeneous 3-vectors with a constant w=1 and no projective transforms anywhere in sight, you can exploit that structure explicitly for higher gains than this. But I ran into this with a DSP workload (with v just being a vector of 4 arbitrary samples), and in that case it’s definitely useful to know, especially since anything convolution-related tends to have highly diagonal (Toeplitz, to be precise) structure to begin with.


The Modulith

$
0
0

Much has been written about all the myriad ways to go wrong when writing software. Poor management; scope creep; too little structure, not modular enough, and it’s a “big ball of mud”. Too much (or too rigid) and it’s a “software crystal”, impossible to alter. And so on.

Suppose you get all that right, and actually ship a useful system to users, it solves their problems well enough, and the code is reasonably clean, has a sound design and a modular structure with interface that, while not perfect, work okay. That’s about as good as it gets! Well done.

Alas, you’re not out of the woods. Success has its own failure modes, and I want to talk about one in particular that affects modular designs.

The arguments for modularity are well known: separating concerns breaks large systems down into smaller constituent parts that can be understood individually, with clearly-defined interfaces between them. Ideally, modules are designed so they can be developed and tested in isolation, and if an individual module is found wanting (say it’s unreliable, faulty or there are simply better solutions available), it can be replaced with another module provided it has the same interface.

And there really are systems like that, where the interfaces are rigid and well-specified, components come only in a handful of “shapes”, and everything cleanly fits together, like Lego bricks. But more commonly, shipping systems look like this (prepare for an extended metaphor):

"Dry stone wall, Island of Mull". Photo by Jan Smith, CC-BY 2.0

“Dry stone wall, Island of Mull”. Photo by Jan Smith, CC-BY 2.0

The modules have irregular shapes and irregular sizes. Some are big, some are quite small. Some closely align with their neighbors; others have big gaps between them. They add up to a coherent whole, but it’s clear that for most of the development time, none of these components really had to have any particular shape. Occasionally you need a small piece with a specific shape to fill a gap, but for the most part, you just work with the materials you have.

The result is still “modular”; it’s built out of smaller pieces, each with their own clearly defined boundaries. But it’s not very regular, and outright weird in some places. That chipped corner on one of the bottom pieces was just an early mistake, but it made for a good place to stick that one flat rock on and somehow that ended up being one of the primary supports for the whole thing. And while building that wall, “I need a rock, about this big” was the only constraint you really had, and you just sort of piled it on. But when repairing it after one of the pieces has been damaged, working out the right shape, finding a replacement that meets that description and getting it in place is really tricky, fiddly work. (End of extended metaphor.)

Know any systems like that? I certainly do. And the end result is what I hereby dub a “modulith” (I am sure this has been observed and named before, but I haven’t seen it elsewhere yet). Made out of small, distinct, cleanly separable pieces, but still, everything but the topmost layer is actually kind of hard to disentangle from the rest, due to a myriad of small interactions with everything surrounding it. Because once you use a module as a building block for something else, there’s a disturbing tendency for all of its remaining quirks and bugs to effectively become part of the spec, as other modules (implicitly or explicitly) start to rely on them.

This is related to, but distinct from, other concepts such as software entropy and technical debt, which primarily deal with effects within a single codebase over time. Here we are dealing with something slightly different: as a particular component is successfully used or re-used (in unmodified form!), the users of said code tend to end up relying (often inadvertently) on various unspecified or underspecified behaviors, implicitly assuming a stronger contract than the component is actually supposed to provide. At that point, your choices are to either make those assumed behaviors actually contractual (not breaking existing code at the cost of severely constraining future evolution of said component), or to fix all users that make stronger assumptions than what is guaranteed (easier said than done if the component in question is popular; often causes ripple effects that break yet more code).

Either way, I don’t have any good solutions, but I’m feeling whimsical and haven’t seen this exact problem described before, so I’m naming it. In the extremely likely case that this has already been described and named by someone else, I’d appreciate a reference!


Intervals in modular arithmetic

$
0
0

I wrote about regular interval overlap checking before. Let’s consider a somewhat trickier case, where our intervals are not defined over the real numbers or regular integers, but are instead subsets of the integers modulo N (for some fixed N). In this post, I’ll just consider the math; but this is something I’ve used in several places, so I expect I’ll be writing about some concrete uses eventually. Throughout this document, I’ll be writing \mathbb{Z}_N for the set \mathbb{Z} / N\mathbb{Z} of integers mod N (feel free to ignore the notation if you’re not familiar with it).

Intervals mod N, first attempt

The first thing we need to do is agree on what we mean by an interval mod N. Suppose that N=16. In that setting, the meaning of the interval [5,7] is pretty clear: the set {5, 6, 7}. And we can also take something like the integer interval [14,17] representing the set {14, 15, 16, 17} and reduce it mod N to get {14, 15, 0, 1}. But doing this required us to use values outside of {0, …, N-1} to specify the end points. This is often a useful model – I talk about some advantages when doing this for ring buffers in this old post – but it’s a bit of a cheat; we’d like a way to specify intervals mod N that don’t require us to leave the set of integers mod N to handle the “wraparound” case.

This means that we want to define something like the interval [14,1]. For the regular integers, this would be the empty set; but mod N, it makes more sense to properly wrap around (for example, when you say “the trains don’t run from 11:30pm to 5:30am”, it’s understood that you mean 5:30am the next day, not an empty interval). One way to do this is by handling the wrap-around case separately. Assuming a, b are reduced mod N, we could for example define:

[a,b] \mod N := \begin{cases}  \{ a, a + 1, \hdots, b \} & \mbox{if } a \le b \\  \{ a, a + 1, \hdots, N - 1 \} \cup \{ 0, 1, \hdots, b \} & \mbox{if } a > b  \end{cases}

This works, but it’s messy, and chopping up our single connected interval mod N into two pieces for the wrap-around case feels like it’s missing something fundamental about arithmetic mod N. We’re thinking in terms of a number line:

Interval [14,1] on the number line

But the integers mod N aren’t very “line-like” at all, and they’re commonly (and more appropriately) drawn as a circle (like a clock, with 0 at the top and numbers increasing clockwise, to maximize familiarity—apologies to all mathematicians who expect positive angles to move counter-clockwise).

Interval [14,1] on a circle

And in this visualization, there is absolutely nothing special about our interval [14,1]. We happen to pass through 0, but 0 isn’t actually special. It’s just another point on the circle. The reason 0 becomes special when we think in terms of regular integers (and hence a number line) is that it’s usually the place where we decide to cut open the circle so we can flatten it into a line. But that has nothing to do with the circle (or the integers mod N); it’s an incidental artifact of the representation we’re picking.

A different approach

The key problem here is that intervals are normally defined in terms of ordering relationships. For example, real-number intervals are usually defined as sets

[a,b] = \{ x \in \mathbb{R} | a \le x \le b \}.

based on a total order “≤”. But in our circle, we don’t have a useful ordering relation. When you write “4 < 7”, this means that 4 is to the left of 7 on the real number line. If you start at 4 and keep walking right, you’re gonna come by 7 eventually. If you instead were to walk left, the numbers would just keep getting smaller indefinitely.

On our circle mod N, this is not true. If you start at 4 and keep walking in the positive direction (clockwise in our case), you’ll reach 7 after three steps. If you instead walk counterclockwise from 4, you still reach 7 – this time after taking the long way round, in thirteen steps. This is true for any pair of numbers on the circle, which after all represent congruence classes of integers mod N (if you don’t know the terminology, just ignore the rest of this paragraph). It makes sense to say that the integer 4 is less than the integer 7, but the set \bar{4} = 4 + N\mathbb{Z} = \{ 4 + kN | k \in \mathbb{Z} \} is not in any useful sense “less” or “greater” than the set \bar{7} = 7 + N\mathbb{Z}. Viewed in that light, the definition for [a,b] mod N given above is outright weird; the case distinction between “a ≤ b” and “a > b” is testing a condition that doesn’t really make sense in terms of the concepts being expressed.

So ordering relationships are on shaky footing. We can, however, usefully talk about distances. For example, “4” and “7” are definitely three steps apart when we take the shortest path, also three steps apart when we’re only allowed to move clockwise, and thirteen steps apart when we’re only allowed to move counterclockwise. In general, we can define distance functions for the “increasing” (clockwise), “decreasing” (counter-clockwise) and shortest-path distances between two points (we won’t actually be using that latter one, I just mention it for completeness):

d^{\,+}(a,b) := (b-a) \bmod N
d^{\,-}(a,b) := (a-b) \bmod N
d(a, b) := \min(d^{\,+}(a,b), d^{\,-}(a,b))

These distance functions are, technically speaking, mappings d : \mathbb{Z}_N \times \mathbb{Z}_N \rightarrow \mathbb{N}_0 from the integers mod N to the natural numbers (hence the explicit use of mod as a binary operation). The distances are regular non-negative integers; while points in the integers mod N can’t be meaningfully compared, distances can. That’s why they’re useful to us. Also note that the “mod N” here is the modulus under Euclidean/floor division – that is, a non-negative value smaller than N.

Most programming languages use truncating division instead, which means the modulus has absolute value less than N but might be negative; you need to consider this when turning any of the equations in here into code!
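
For concreteness, here's a minimal sketch of such a function in C (assuming N is a positive integer constant defined elsewhere; this is the modN the code below relies on):

// Euclidean (non-negative) x mod N. C's % operator truncates towards
// zero, so x % N can be negative when x is; add N back in that case.
static int modN(int x)
{
    int m = x % N;
    return (m < 0) ? m + N : m;
}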

Given all this, let’s go back to our problem of defining intervals nicely. Well, how do we draw something like the interval [14,1] on our circle? We just move the pen to “14” and start drawing a clockwise arc until we hit the “1”. We can use that exact idea to define intervals: namely, a point x is inside the interval [a,b] if, starting from a and walking in increasing order (clockwise), we hit x before we leave the interval. That leads to our improved definition of a closed interval mod N:

[a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,+}(a,x) \le d^{\,+}(a,b) \}
or the equivalent
[a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,-}(b,x) \le d^{\,-}(b,a) \}

and the generalizations to half-open intervals are straightforward:

[a,b) \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,+}(a,x) < d^{\,+}(a,b) \}
(a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,-}(b,x) < d^{\,-}(b,a) \}

Things are simpler if we always start measuring from the closed (inclusive) end, so that’s what we do. I’ll drop the (mod N) for the rest of the article; we know that’s our setting.

Point-in-interval tests and symmetry

This definition can be turned into code immediately and leads to fairly elegant point-in-interval tests that don’t break down into multiple cases:

// modN(x) is assumed to calculate Euclidean (=non-negative) x % N.

// x in [a,b] (mod N)?
static bool point_in_closed_interval(int x, int a, int b)
{
    return modN(x - a) <= modN(b - a);
}

// x in [a,b) (mod N)?
static bool point_in_half_open_interval(int x, int a, int b)
{
    return modN(x - a) < modN(b - a);
}

At this point, you should also be noticing something funny about that code (it’s a bit harder to see in the definitions above since the detour through the distance functions obscures the actual computation going on); namely, the fact that we’re subtracting a and then reducing mod N on both sides.

It’s confession time. This whole notion with measuring distances along the circle is not how I derived this; it’s the best way I know to think about this problem, but that knowledge came in retrospect. I got here the long way round, with several false starts. The key idea turned out to be thinking about symmetries, which is always worth doing if the problem you’re working on has any; see for example this post I wrote 6 years ago!

In this case, the integers mod N are a cyclic group—they wrap around. That’s why it makes sense to draw them as a circle. And that’s why being attached to any particular point being 0 is a bit silly: a circle has continuous rotational symmetry, and our discrete cyclic group has N-fold rotational symmetry. We get the same image if we rotate by an N’th of a full turn. And going back to our setting, since we really only care about distances, we can cyclically rotate our points around any way we want without changing the results.

What these tests really do is exploit this symmetry, “translating” (or more appropriately, rotating) everything by -a. This turns testing x against the closed interval [a,b] into testing (x-a) \bmod N against [0,(b-a) \bmod N], and likewise for the half-open variants. Once we move the start point of the interval to 0, we don’t have to worry about wrap-around happening in inconvenient places anymore; comparing the integers directly works fine. Score one for symmetry.

Interval overlap

This takes care of points. Now for something trickier: how do we test for interval overlap?

The standard tests for interval overlap are slick, but not really applicable in our situation. The center-extent trick actually generalizes just fine to the integers mod N and to much more general settings (it works in arbitrary metric spaces, provided the sets in question can be expressed as balls in the target metric), but it is not ideal in a discrete setting; and the direct endpoint-comparison tests rely heavily on an order structure we don’t have in our cyclic world.

But what we do have are reasonably simple point-in-interval tests. Can we build an interval overlap test out of them? Well, we can try. Suppose we have two intervals [a,b] and [c,d]. Surely, if the “left” endpoint c of [c,d] falls inside [a,b], the two intervals overlap – after all, we know a point that is in both of them! And by symmetry, swapping the roles of the two intervals, if a falls inside [c,d], we again must have overlap.

If you start drawing a few pictures of intervals in various configurations, you’ll soon notice that testing both of these conditions seems to detect all configurations where the intervals actually overlap (and this works with intervals in the real numbers as well as in our discrete setting). The question is, can we prove that testing these two points is sufficient, or are we merely lucky? It turns out we didn’t just luck out; here’s a quick proof:

Lemma: ([a,b] \cap [c,d]) \ne \emptyset \Leftrightarrow (c \in [a,b]) or (a \in [c,d]). In words, the two (non-empty) intervals overlap if and only if at least one of a or c is inside the respective other interval.
Proof: “\Leftarrow”: c \in [c,d], so if we also have c \in [a,b], then that gives us a point in the intersection of [a,b] and [c,d], which therefore can’t be empty. Likewise with a. Therefore, if either of the two conditions on the right-hand side holds, we indeed have a non-empty intersection.
“\Rightarrow”: The intersection isn’t empty, so take x \in ([a,b] \cap [c,d]). x is in both intervals. Informally, we now “slide” x to the “left” (in negative direction) through the intersection until we hit either of the interval end points a or c. Formally, consider the distances from the interval start points to x: d_a := d^{\,+}(a,x) and d_c := d^{\,+}(c,x). Suppose that d_a \le d_c. Then we have

0 \le d^{\,+}(c,a) = (a - c) \bmod N = ((x - c) - (x - a)) \bmod N
= (d_c - d_a) \bmod N = d_c - d_a \le d_c \le d^{\,+}(c,d)

where the step (d_c - d_a) \bmod N = d_c - d_a holds because 0 \le d_c - d_a < N. In words, a’s distance from c, in positive direction, is no more than x’s; since x already was inside of [c,d], surely a must be too. If instead d_a > d_c, the symmetric argument shows that c is inside [a,b]. In either case, this proves the claim.

This takes care of closed intervals. Note that this proof leans heavily on the intervals in question being non-empty. We can readily adapt it to half-open intervals of the form [a,b), but we do need to make sure to catch either interval being empty first, in which case the intersection of intervals is necessarily empty too. Likewise, you can also easily adapt it to half-open intervals of type (a,b], but in this case you want to be using the “right” end points of intervals for testing, not the left end points.

This may all sound complicated, but the implementation is actually quite short and simple:

// do [a,b] and [c,d] overlap?
static bool closed_intervals_overlap(int a, int b, int c, int d)
{
    return modN(c - a) <= modN(b - a) || modN(a - c) <= modN(d - c);
}

// do [a,b) and [c,d) overlap?
static bool half_open_intervals_overlap(int a, int b, int c, int d)
{
    int w0 = modN(b - a);
    int w1 = modN(d - c);

    return (w1 != 0 && modN(c - a) < w0) ||
           (w0 != 0 && modN(a - c) < w1);
}

And there we go. Interval overlap tests mod N.

Implementation notes and variations

The most common case in systems programming involves power-of-2 N. In that scenario, modN is readily implemented via bit masking as a single binary AND operation. If N is 2^32 or 2^64, 32- or 64-bit unsigned integers can (and should) be used directly, in which case there is no need for explicit masking in the code (although on some 64-bit architectures, unsigned 32-bit arithmetic compiles into 64-bit arithmetic with masking anyway). This is one of the rare cases where unsigned integer overflow works exactly the way we need. In this case, you want to be using unsigned integers throughout.
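
For illustration, here's the half-open overlap test specialized to N = 2^32 (a sketch; with uint32_t operands, unsigned wraparound performs all the modular reduction implicitly):

// do [a,b) and [c,d) overlap mod 2^32? Sketch of the unsigned variant:
// all subtractions wrap mod 2^32, so no explicit modN calls are needed.
static bool half_open_intervals_overlap_u32(uint32_t a, uint32_t b,
                                            uint32_t c, uint32_t d)
{
    uint32_t w0 = b - a; // width of [a,b)
    uint32_t w1 = d - c; // width of [c,d)

    return (w1 != 0 && c - a < w0) ||
           (w0 != 0 && a - c < w1);
}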

As presented, we’re working with intervals given in terms of two end points, because that’s the most common presentation. But computationally, all of the functions in the code shown actually use a single endpoint (on the “closed” end) along with the width of the interval. That’s the modN(b - a) and modN(d - c) terms we keep computing. So if you’re working a lot with intervals mod N, or storing them, you probably want to consider working in that representation directly.
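
If you do, here's a sketch of what that might look like (hypothetical names, reusing the modN and conventions from above):

// Interval stored as inclusive start point plus width.
struct IntervalN {
    int start; // in [0, N)
    int width; // element count: represents [start, start + width) mod N
};

// is x in the half-open interval i?
static bool interval_contains(struct IntervalN i, int x)
{
    return modN(x - i.start) < i.width;
}

// do the half-open intervals i and j overlap?
static bool intervals_overlap(struct IntervalN i, struct IntervalN j)
{
    return (j.width != 0 && modN(j.start - i.start) < i.width) ||
           (i.width != 0 && modN(i.start - j.start) < j.width);
}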

The intervals in this article are intentionally defined with their endpoints coming from \mathbb{Z}_N, to force us to think about the effects of the cyclical wrap-around in the integers mod N cleanly. Now that we’ve spent some time thinking it through, we can relax that requirement. In particular, when using ring buffers with what I call the “virtual stream” model (read/write cursors not reduced mod N), it can make sense to just not reduce the interval lengths mod N at all—that is, turn occurrences of modN(b - a) and modN(d - c) in the code into plain b - a and d - c, respectively, or reduce with respect to a modulus that’s a larger multiple of N. Among other things, this allows us to have half-open intervals covering the entirety of \mathbb{Z}_N, something the fully reduced variant cannot easily do.

And as a final closing remark, this article comes in at 2500 words to explain less than 20 lines of code doing nothing but straightforward integer arithmetic. That has to be some sort of personal record. I’m not sure if that’s good or bad.


Reading and writing are less symmetric than you (probably) think

I am talking about the I/O operations as used in computing here. A typical example of how this kind of thing is exposed are the POSIX syscalls read(2) and write(2), which have the following C function prototypes:

ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);

Now these are raw system calls; user programs can use them directly, but they usually don’t. They normally go through some buffered IO layer; in the C standard library, this means FILE* and functions fread and fwrite, which split count into a product of two values in a vestigial nod to record-based IO but are otherwise equivalent. For concreteness, suppose we’re interfacing with actual storage (i.e. not a pipe, socket, virtual filesystem etc.). Then conceptually, a “read”-class operation (like read or fread) grabs bytes from a file (say, on a disk somewhere) and puts them into the specified memory buffer, and a “write”-class operation takes bytes in a memory buffer and writes them to the disk. Which definitely sounds nice and symmetric—but there are some important behavioral asymmetries between them, especially when errors are in the mix. The reasons have to do with buffering.

Buffered I/O

In general, file I/O operations in your program will not go directly to a storage device; data instead makes its way through several buffering layers (most of which can be disabled using various flags, but in normal usage these layers are on). These layers are there for good reason: on the kernel side, there’s what’s traditionally called the “buffer cache”. Storage devices are “block devices”, which means they store data in blocks. The block size depends on the device: old hard disks used 512 bytes; CDs, DVDs etc. tend to use 2k blocks; newer storage devices are now on 4k blocks. Block devices only read and write entire blocks at a time; that means random byte-aligned IO requests such as “read 100 bytes from disk at byte offset 1234567” or “write 2000 bytes to location 987654” can’t be directly passed to the device at all. The buffer cache is used to translate these requests into block-aligned read and write operations that the device understands; non-block-aligned writes also require reading the previous contents of the block that are not overwritten, and those go in the buffer cache as well. And of course, as the name suggests, it acts as a cache.

On the user-space side, we also have buffers, albeit for a different reason: read and write are system calls, and as such incur a transition to kernel space and back. They also need to check for and report errors every time they are invoked. And of course they actually need to do the work we want them to do – copy the data from (read) or to (write) the buffer cache. System call overhead varies between OSes, but it’s safe to assume that the whole process takes at least a couple hundred clock cycles in the best case. So for the overhead not to completely dominate the actual work being done, you generally want to be reading or writing at least a few kilobytes at a time. For scale reference, typical IO buffer sizes as of this writing are 4096 bytes (e.g. Visual C++ 2013 FILE*, Go bufio.Reader/bufio.Writer) or 8192 bytes (e.g. GNU libc FILE*, Java BufferedReader/BufferedWriter).

Often there are more buffers too. For example, most hard drives and RAID controllers have their own caches, and it is not uncommon for user-space code to have several layers of buffering for various reasons. But this is enough to illustrate the basic structure.

All of these buffers are used in much the same way for reading and writing. So where’s the behavioral asymmetry between reading and writing that I’m talking about? You need to think about the state of the world (so to speak) after a read-type call returns, and how it differs from the state of the world after a write-type call returns.

What happens when you issue an IO operation

Let’s look at what goes into servicing a read-type call first: say you open a C FILE* and want to read the first 100 bytes via fread. The C standard I/O library notices that its buffers are currently empty, and tries to fill them up, issuing a read system call to read, say, 4k worth of data. The kernel in turn asks the file system where the data for the first 4k of the file is located, checks the buffer cache to see if it already has a copy in memory, and if not, it issues a block read command to the storage device. Either way, the kernel makes sure to get those 4k of data into the buffer cache and from there copies them into the standard IO buffers in user-space memory, then returns. The standard IO library looks at the result of the system call, updates the state of its IO buffers, and then copies the 100 requested bytes into the memory buffer the app supplied.

And what if anything goes wrong? Say the file is smaller than 100 bytes, or there was an error reading from disk, or the file is on a network file system that’s currently inaccessible. Well, if that happens, we catch it too: if something goes wrong filling up the buffer cache, the kernel notices and returns an appropriate error to the I/O library, which can in turn pass errors on to the app. Either way, anything that can go wrong will go wrong before the fread call returns. All the intervening layers need to do is make sure to keep the error information around so it can be passed to the app at the appropriate time.

Now let’s go the other way round: let’s open a fresh new file with a 4k write buffer[1] and issue a 100-byte fwrite. This time, the IO library copies the 100 bytes from the app buffer to the write buffer… and immediately returns, reporting success. The underlying write system call will not be executed until either the buffer fills up or is flushed as a result of calling fflush, fseek, fclose or similar.

Quick imaginary show of hands: who reading this habitually checks return codes of fread or fwrite at all? Of those saying “yes”, who also remembers to check return codes of fflush, fseek or fclose? Probably not a lot. Well, if you don’t, you’re not actually checking whether your writes succeeded at all. And while these remarks are C-specific, this general pattern holds for all buffered writer implementations. Buffered writing delays making the actual write system call; that’s kind of the point. But it implies that error reporting is delayed too!
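
To make this concrete, here's a sketch of what checking all the relevant return codes looks like in C (hypothetical helper; standard I/O only, fsync comes later):

// Write a buffer to a file, checking every step. fwrite reporting
// success means little on its own; errors may only surface at fflush
// or fclose time. (Uses <stdio.h> and <stdbool.h>.)
static bool write_whole_file(const char *path, const void *buf, size_t size)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return false;

    bool ok = fwrite(buf, 1, size, f) == size;
    // fflush forces the buffered data out via the underlying write;
    // many errors are first reported here, not by fwrite.
    ok = fflush(f) == 0 && ok;
    // fclose can fail too (e.g. late-reported network FS errors).
    // Close unconditionally, but don't lose its error status.
    ok = fclose(f) == 0 && ok;
    return ok;
}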

More buffers

This type of problem is not restricted to user-space buffering either. The implementation of write itself has similar issues: generally, after a successful write call, your data made it to the buffer cache, but it hasn’t hit actual storage yet. The kernel will make its best effort to write that data to storage eventually (hopefully within the next few seconds), but if there’s a device error or a system crash, that data could still be lost. Both of these are relatively rare these days, so we don’t worry about them too much, right? Except for those of us who do.

Oh, and while write will go to some lengths to make sure there are no nasty surprises when writing to local filesystems (for example, even with delayed write-back, you want to make sure to reserve free space on the disk early[2], lest you run out during write-back), at least on POSIX systems there can still be write errors that you only get notified about on close, especially when network filesystems such as NFS or SMB/CIFS are in play (I’m not aware of any such late-reported error conditions on Windows, but that doesn’t mean there aren’t any). Something to be aware of: if you’re using these system calls and are not checking the return code of close, you might be missing errors.

Which brings up another point: even on local file systems, you only have the guarantee that the data made it to the buffer cache. It hasn’t necessarily made it to the storage device yet! If you want that (for example, you’ve just finished writing some important data and want to make sure it actually made it all the way), you need to call fsync[3] on the file descriptor before you close it. The Windows equivalent is FlushFileBuffers.

So, if you make sure to check error codes on every write, and you fsync before you close (again checking errors), that means that once you’ve done all that, you’re safe and the data has successfully made it to permanent storage, right?

Well, two final wrinkles. First, RAID controllers and storage devices themselves have caches too. They’re supposed to have enough capacitors so that if the system suddenly loses power, they still have sufficient power to actually get your data written safely. Hopefully that’s actually true. Good luck. Second, the data may have made it to storage, but that doesn’t necessarily mean it’s actually visible, because the metadata necessary to reach it might not have been written yet. Quoting the Linux man page on fsync(2):

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

For better or for worse, I can’t recall ever seeing code doing this in the wild, though. I’m honestly not sure what the actual guarantees are that popular Linux file systems provide about these things. If you’re handling really really important data, you should probably find out.
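
For the record, here's roughly what it would look like (a sketch, Linux-flavored, with hypothetical naming; uses <fcntl.h> and <unistd.h>):

// Push file data to the storage device, then fsync the containing
// directory so the file's directory entry is durable as well.
static int commit_file(int fd, const char *dir_path)
{
    if (fsync(fd) != 0)         // buffer cache -> storage device
        return -1;
    if (close(fd) != 0)         // may still report late errors
        return -1;

    int dir_fd = open(dir_path, O_RDONLY);
    if (dir_fd < 0)
        return -1;
    int ret = fsync(dir_fd);    // directory entry -> storage
    close(dir_fd);
    return ret;
}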

Conclusion and summary

Buffering on the read side is great and pretty much transparent because if anything goes wrong, it will go wrong before you ever get to see the data, and you’ll get a proper error code.

Buffering on the write side is much trickier because it delays actual writing and error reporting in ways that most programmers are supposed to be aware of, but usually aren’t. Few are aware of the actual steps necessary to ensure that data made it to storage safely, and some of the OS abstractions involved don’t exactly make things easier (see the fsync quote above). Here be dragons.

Footnotes

[1] Full buffering not line buffering mode, in case anyone’s feeling nit-picky.
[2] Actual block allocation—as in, selecting which physical location on the device the written data will end up at—is often delayed in modern file systems, to make it easier to assign mostly-contiguous space to large files where possible. However, even with delayed allocation, you want to keep track of how much space is going to be available on the device once all currently pending writes complete, so that you can return “out of disk space” errors appropriately instead of discovering that you’re out of space 10 seconds after the user exited the app he was using to edit and save his Important Document. Because that would be bad. This sounds as though it’s just a matter of accounting, but it gets tricky with file systems that use extents and not bitmap-based block allocation: getting the last few discontinuous blocks on the device means that you might need extra space to store the file extents! All of which is to say: this stuff is tricky to get right.
[3] Yes, the name looks like it’s part of the C library buffered IO package, but it’s a proper syscall.


rANS in practice

This year, we (RAD) shipped two new lossless codecs, both using rANS. One of the two is Oodle LZNA (released in May), which Charles has already written about. The other is called BitKnit, which first shipped in July as part of Granny, and is slated for inclusion into more RAD products.

So, with two production-quality versions written and successfully shipped, this seems like a good time to write up some of the things we’ve learned, especially in terms of implementation concerns. Let’s get cracking! (I’m assuming you’re familiar with ANS. If not, I wrote a paper that has a brief explanation, and various older blog posts that give more details on the proofs. I’m also assuming you’re familiar with “conventional” arithmetic coding; if not, you’re not gonna get much out of this.)

One small note before we start…

I’ll be referring to the ANS family as a class of arithmetic coders, because that’s what they are (and so are “range coders“, by the way). So here’s a small historical note before we dive in: the “bottom-up” construction of ANS and the LIFO encoding seem quite foreign once you’re used to most “modern” arithmetic coders, but what’s interesting is that some of the earliest arithmetic coders actually looked very similar.

In particular, I’m talking about Rissanen’s 1976 paper “Generalized Kraft Inequality and Arithmetic Coding” (which coined the term!). Note the encoding and decoding functions C and D on the second page, very reminiscent of the “bottom-up” construction of ANS (with the code being represented by a number that keeps growing), and the decoder returning symbols in the opposite order they were encoded!

Rissanen’s coder (with its rather cumbersome manual truncated floating point arithmetic subject to careful rounding considerations) never saw any widespread application, as far as I can tell. The coders that actually got traction use the now familiar top-down interval subdivision approach, and a different strategy to adapt to fixed-precision operation. But reading Rissanen’s paper from today’s perspective is really interesting; it feels like a very natural precursor to ANS, and much closer in spirit to ANS than to most of the other algorithms that descended from it.

Why rANS (and not FSE/tANS)?

On my blog especially, I’ve been talking almost exclusively about rANS, and not so much FSE/tANS, the members of the ANS family that have probably been getting the most attention. Why is that?

Briefly, because they’re good at different things. FSE/tANS are (nearly) drop-in replacements for Huffman coding, and have similar strengths and weaknesses. They have very low (and similar) per-symbol decode overhead, but the fast decoders are table-driven, where the table depends on the symbol probabilities. Building Huffman decoding tables is somewhat faster; FSE/tANS offer better compression. Both Huffman and FSE/tANS can in theory support adaptive probabilities, but there’s little point in doing anything but periodic rebuilds; true incremental updates are too slow to be worthwhile. At that point you might as well use a coder which is more suited to incremental adaptation.

Which brings us to rANS. rANS is (again, nearly) a drop-in replacement for multi-symbol arithmetic coders (such as range coders). It uses fewer registers than most arithmetic coders, has good precision, and the decoder is division-free without introducing any approximations that hurt coding efficiency. Especially with the various tweaks I’ll describe throughout this post, rANS has what are easily the fastest practical multi-symbol alphabet arithmetic decoders I know. rANS coders are also quite simple in implementation, with none of the tricky overflow and underflow concerns that plague most arithmetic coders.

So rANS is a pretty sweet deal if you want a fast arithmetic coder that deals well with relatively fast-changing probabilities. Great! How do we make it work?

Reverse encoding

As mentioned above, ANS coders are LIFO: whatever order you encode symbols in, the decoder will produce them in the opposite order. All my ANS coders (including the public ryg_rans) use the convention that the encoder processes the data in reverse (working from the end towards the beginning), whereas the decoder works forwards (beginning towards end).

With a static model, this is odd, but not especially problematic. With an adaptive model, decoder and model have to process data in the same direction, since the decoder updates the model as it goes along and needs current model probabilities to even know what to read next. So the decoder and the model want to process data in the same direction (forward being the natural choice), and the rANS encoder needs to be processing symbols in the opposite order.

This is where it comes in handy that rANS has an interface very much like a regular arithmetic coder. For example, in my sample implementation, the symbol is described by two values, start and freq, which are equivalent to the symbol interval lower bound and size in a conventional arithmetic coder, respectively.

Most arithmetic coders perform the encoding operation right there and then. In rANS, we need to do the actual encoding backwards, which means we need to buffer the symbols first (the first description I’ve seen of this idea was in Matt Mahoney’s fpaqa):

// Say our probabilities use 16-bit fixed point.
struct RansSymbol {
  uint16_t start; // start of symbol interval
  uint16_t range; // size of symbol interval
};

class BufferedRansEncoder {
  std::vector<RansSymbol> syms; // or similar

public:
  void encode(uint16_t start, uint16_t range)
  {
    assert(range >= 1);
    assert(start + range <= 0x10000); // no wrap-around

    RansSymbol sym = { start, range };
    syms.push_back(sym);
  }

  void flush_to(RansEncoder &coder);
};

With this, we can use rANS exactly like we would any other arithmetic coder. However, it will not be generating the bitstream incrementally during calls to encode; instead, it simply buffers up operations to be performed later. Once we’re done we can then pop off the symbols one by one, in reverse order, and generate the output bitstream. Easy:

void BufferedRansEncoder::flush_to(RansEncoder &coder)
{
  // Replays the buffered symbols in reverse order to
  // the actual encoder.
  while (!syms.empty()) {
    RansSymbol sym = syms.back();
    coder.encode(sym.start, sym.range);
    syms.pop_back();
  }
}

Once you have this small piece of scaffolding, you really can use rANS as a drop-in replacement for a conventional arithmetic coder. There are two problems with this, though: if you use this to encode an entire large file, your symbol buffer can get pretty huge, and you won’t get a single bit of output until you’ve processed the entire input stream.

The solution is simple (and can also be found in the aforementioned fpaqa): instead of accumulating all symbols emitted over the entire input stream and doing one big flush at the end, you just process the input data in chunks and flush the coder periodically, resetting the rANS state every time. That reduces compression slightly but means the encoder memory usage remains bounded, which is an important practical consideration. It also guarantees that output is not delayed until the end of stream; finally, lots of compressors are already chunk-based anyway. (For the opportunity to send incompressible data uncompressed rather than wasting time on the decoder end, among other things.)

Basic interleaving

One thing that rANS makes easy is interleaving the output from several encoders into a single bitstream without needing any extra signaling. I’m not going into detail why that works here; I wrote a paper on the subject if you’re interested in details. But the upshot is that you can use multiple rANS encoders and decoders simultaneously, writing to the same output bitstream, rather than having a single one.

Why do we care? Because this is what decoding a single symbol via rANS looks like (adapted from my public ryg_rans code):

static const uint32_t kProbBits = 16;
static const uint32_t kProbMask = (1 << kProbBits) - 1;

class RansDecoder {
  uint32_t state; // current rANS state
  // (IO stuff omitted)

  uint32_t renormalize_state(uint32_t x)
  {
    // Byte-wise for simplicity; can use other ways.
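    // (RANS_L is the lower renormalization bound; 1u << 23 in ryg_rans.)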
    while (x < RANS_L)
      x = (x << 8) | read_byte();

    return x;
  }

public:
  uint32_t decode_symbol()
  {
    uint32_t x = state; // Current state value

    uint32_t xm = x & kProbMask; // low bits determine symbol
    Symbol sym = lookup_symbol(xm); // (various ways to do this)

    // rANS state advance
    x = sym.range * (x >> kProbBits) + xm - sym.start;
    x = renormalize_state(x);

    // Save updated state and return symbol
    state = x;
    return sym.id;
  }
};

Note how literally every single line depends on the results of the previous one. This translates to machine code that has a single, very long, dependency chain with relatively low potential for instruction-level parallelism (ILP). This is fairly typical for all entropy coder inner loops, by the way, not just rANS. And because superscalar processors depend on ILP to deliver high performance, this is bad news; we’re not making good use of the machine.

Hence interleaving. The idea is that we have two RansDecoder instances, each with their own state, but implicitly sharing the same bitstream read pointer (referenced by read_byte). Now, when we have code like this:

  RansDecoder dec0, dec1;
  // ...
  uint32_t sym0 = dec0.decode_symbol();
  uint32_t sym1 = dec1.decode_symbol();

the processor’s out-of-order execution logic can overlap execution of both decodes, effectively running them at the same time. The renormalize step for dec0 needs to happen before the renormalize of dec1, but other than that, there’s no dependencies between the two. For what it’s worth, this does not actually require out-of-order execution; a compiler for an in-order architecture can also work with this, provided it has enough dataflow information to know that dec0 calling read_byte() does not influence anything that dec1 does before its renormalize step. So what interleaving does is convert a very serial task into one that’s much more amenable to superscalar execution.

What it boils down to is this: a regular rANS decoder is a fast, precise, divide-less arithmetic decoder (which is nice to begin with). Interleave two streams using what is otherwise the exact same code, and you get a pretty good boost in performance; (very roughly) around 1.4× faster, on both the decoder and encoder. But this is old hat by now; this was all in the initial release of ryg_rans.

Some of the early experiments leading up to what later became BitKnit used this directly, pretty much the same as in the ryg_rans example code, but it turns out it was a bit of a pain in the neck to work with: because the two streams need to interleave properly, the BufferedRansEncoder needs to keep track of which symbol goes to which stream, and both the encoder and decoder code need to (somewhat arbitrarily) assign symbols to either stream 0 or stream 1. You’d prefer the streams to keep alternating along any given control-flow path, but that’s not always possible, since sometimes you have conditionals where there’s an even number of symbols sent on one path, and an odd number sent on the other! So having two explicit streams: not so great. But we found a better way.

Implicit interleaving to the rescue

What we actually ended up doing was interleaving with a twist – literally. We give the underlying rANS encoders (and decoders) two separate state values, and simply swap the two stream states after every encoding and decoding operation (that’s where the “BitKnit” name comes from – it keeps two active states on the same “needle” and alternates between them). The modifications from the decoder shown above are pretty small:

class RansDecoder {
  uint32_t state1; // state for "thread 1"
  uint32_t state2; // state for "thread 2"

  // ...

public:
  uint32_t decode_symbol()
  {
    uint32_t x = state1; // Pick up thread 1

    // ---- BEGIN of code that's identical to the above

    uint32_t xm = x & kProbMask; // low bits determine symbol
    Symbol sym = lookup_symbol(xm); // (various ways to do this)

    // rANS state advance
    x = sym.range * (x >> kProbBits) + xm - sym.start;
    x = renormalize_state(x);

    // ---- END of code that's identical to the above

    // Save updated state, switch the threads, and return symbol
    state1 = state2; // state2 becomes new state1
    state2 = x;      // updated state goes to state2

    return sym.id;
  }
};

The changes to the encoder are analogous and just as simple. It turns out that this really is enough to get all the performance benefits of 2× interleaving, with none of the extra interface complexity. It just looks like a regular arithmetic decoder (or encoder). And assuming you write your implementation carefully, compilers are able to eliminate the one extra register-register move instruction we get from swapping the threads on most paths. It’s all win, basically.

Bypass coding

Borrowing a term from CABAC here; the “bypass coding mode” refers to a mode in the arithmetic coder that just sends raw bits, which you use for data that’s known a priori to be essentially random/incompressible, or at least not worth modeling further. With conventional arithmetic coders, you really need special support for this, since interleaving an arithmetic code stream with a raw bitstream is not trivial.

With rANS, that’s much less of a problem: you can just use a separate bitbuffer and mix it into the target bitstream with no trouble. However, you may not want to: rANS has essentially all of the machinery you need to act as a bit buffer. Can you do it?

Well, first off, you can just use the arithmetic coder with a uniform distribution to send a set number of bits (up to the probability resolution). This works with any arithmetic coder, rANS included, and is fairly trivial:

  // write value "bits" using "numbits"
  coder.encode(bits << (kProbBits - numbits),
                  1 << (kProbBits - numbits));

and the equivalent on the decoder side. However, this is not particularly fast. Fortunately, it’s actually really easy to throw raw bit IO into a rANS coder: we just add the bits at the bottom of our state variable (or remove them from there in the decoder). That’s it! The only thing we need to do is work out the renormalization condition in the encoder. Using the conventions from the bytewise ryg_rans, an example implementation of the encoder is:

static inline void RansEncPutBits(RansState* r, uint8_t** pptr,
  uint32_t val, uint32_t nbits)
{
  assert(nbits <= 16);
  assert(val < (1u << nbits));

  // nbits <= 16!
  RansState x = RansEncRenorm(*r, pptr, 1 << (16 - nbits), 16);

  // x = C(s,x)
  *r = (x << nbits) | val;
}

and the corresponding getbits in our ongoing example decoder looks like this:

class RansDecoder {
  // ...

  uint32_t get_bits(uint32_t nbits)
  {
    uint32_t x = state1; // Pick up thread 1

    // Get value from low bits then shift them out and
    // renormalize
    uint32_t val = x & ((1u << nbits) - 1);
    x = renormalize_state(x >> nbits);

    // Save updated state, switch the threads, and return value
    state1 = state2; // state2 becomes new state1
    state2 = x;      // updated state goes to state2

    return val;
  }
};

Note that except for the funky state swap (which we carry through for consistency), this is essentially just a regular bit IO implementation. So our dual-state rANS admits a “bypass mode” that is quite cheap; usually cheaper than having a separate bit buffer would be (which would occupy yet another CPU register in the decoder), at least in my tests.

Note that if you combine this with the buffering encoder described above, you need a way to flag whether you want to emit a regular symbol or a burst of raw bits, so our RansSymbol structure (and the code doing the actual encoding) gets slightly more complicated since we now have two separate types of “opcodes”.

The implementation above has a limit of 16 bits you can write in a single call to RansEncPutBits. How many bits you can send at once depends on the details of your renormalization logic, and how many bits of rANS state you keep. If you need to send more than 16, you need to split it into multiple operations.
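
Here's a sketch of one way to do that split on top of RansEncPutBits. Mind the LIFO order: assuming the encoder runs in reverse stream order as usual, we put the low half first, so that the decoder (reading forwards) reads the high half first via get_bits(nbits - 16) followed by get_bits(16):

static inline void RansEncPutBits32(RansState* r, uint8_t** pptr,
  uint32_t val, uint32_t nbits)
{
  assert(nbits <= 32);

  if (nbits <= 16) {
    RansEncPutBits(r, pptr, val, nbits);
  } else {
    // LIFO: encoded second = decoded first, so the high half goes last.
    RansEncPutBits(r, pptr, val & 0xffff, 16);
    RansEncPutBits(r, pptr, val >> 16, nbits - 16);
  }
}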

Tying the knot

I got one more: a rANS encoder needs to write its final state to the bitstream, so that the decoder knows where to start. You can just send this state raw; it works just fine. That’s what the ryg_rans example code does.

However, rANS states are not equally likely. In fact, state x occurs with a probability proportional to 1/x. That means that an ideal code should spend approximately \log_2(x) bits to encode a final state of x. Charles has already written about this. Fortunately, the ideal coder for this distribution is easy: we simply send the index of the highest set bit in the state (using a uniform code), followed by the remaining bits.

One option is to do this using regular bit I/O. But now you still need a separate bit IO implementation!

Fortunately, we just covered how to send raw bits through a rANS encoder. So one thing we can do is encode the final state value of stream 2 using the “stream 1” rANS as the output bit buffer, using the putbits functionality just described (albeit without the thread-switching this time). Then we send the final state of the “stream 1” rANS raw (or using a byte-aligned encoding).

This approach is interesting because it takes a pair of two rANS encoder threads and “ties them together” – making a knot, so to speak. In the decoder, undoing the knot is serial (and uses a single rANS decoder), but immediately after initialization, you have a working dual-stream coder. This saves a few bytes compared to the sloppier flushing and is just plain cute.

This technique really comes into its own for the wide-interleave SIMD rANS coders described in my paper, because it can be done in parallel on lots of simultaneous rANS coders in a reduction tree: group lanes into pairs, have each odd-indexed lane flush into its even-indexed neighbor. Now look at groups of 4 lanes; two have already been flushed, and we can flush the rightmost “live” lane into the leftmost lane coder. And so forth. This allows flushing a N× interleaved SIMD rANS coder in O(\log(N)) coding operations, and still has some parallelism while doing so. This is not very exciting for a 2× or 4× interleaved coder, but for GPU applications N is typically on the order of 32 or 64, and at that level it’s definitely interesting.

Conclusion and final notes

Using the techniques described in this post, you can write rANS encoders and decoders that have about the same amount of code as a conventional arithmetic coder with comparable feature set, have a similar interface (aside from the requirement to flush the encoder regularly), are significantly faster to decode (due to the combination of the already-fast rANS decoder with implicit interleaving), and have very cheap “bypass coding” modes.

This is a really sweet package, and we’re quite happy with it. Anyone interested in (de)compression using adaptive models should at least take a look. (For static models, FSE/tANS are stronger contenders.)

What example code there is in this article uses byte-wise renormalization. That’s probably the simplest way, but not the fastest. Oodle LZNA uses a 63-bit rANS state with 32-bits-at-a-time renormalization, just like rans64.h in ryg_rans. That’s a good choice if you’re primarily targeting 64-bit platforms and can afford a 64-bit divide in the encoder (which is quite a bit more expensive than a 32-bit divide on common CPUs). BitKnit uses a 32-bit rANS state with 16-bits-at-a-time renormalization, similar to the coder in ryg_rans rans_word_sse41.h. This is friendlier to 32-bit targets and admits a branch-free renormalize option, but also means the coder has somewhat lower precision. Using a probability range of 16 bits would not be wise in this case; BitKnit uses 14 bits.



End-of-buffer checks in decompressors

This post is about general techniques for handling end-of-buffer checks in code that processes an input stream a byte at a time, or a few bytes at a time at the most. Concretely, I’ll be talking about decompression code, but many of these ideas are also applicable to related sequential input processing tasks like lexical analysis.

A basic decoder

To show how the problem crops up, let’s look at a simple decompressor and at what happens when we try to make an efficient implementation. Here’s our simple decoder for a toy LZ77 variant:

while (!done) { // main loop
    if (get_bits(1) != 0) { // match
        int offset = 1 + get_bits(13);
        int len = 3 + get_bits(5);

        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
       *dest++ = get_bits(8);
    }
}

This particular coding scheme is just arbitrarily chosen to have a simple example, by the way. It’s not one I would actually use.

What does get_bits look like? The design space of bit IO is a big topic on its own, and I won’t be spending any time on the trade-offs here; let’s just use a basic variant with MSB-first (big endian-like) bit packing, reading the input stream from a memory buffer, one byte at a time:

const uint8_t *input_cursor;    // current input cursor
const uint8_t *input_end;       // end of input buffer

uint8_t read_byte()
{
    // If we reached the end of the input buffer, return 0!
    if (input_cursor >= input_end)
        return 0;

    return *input_cursor++;
}

uint32_t bitcount; // number of bits in bitbuf
uint32_t bitbuf;   // values of bits (bitcount bits from MSB down)

uint32_t get_bits(uint32_t nbits)
{
    assert(0 < nbits && nbits <= 24);

    // Refill: read extra bytes until we have enough bits
    // in buffer. Insert new bits below the ones we already
    // have.
    while (bitcount < nbits) {
        bitbuf |= read_byte() << (24 - bitcount);
        bitcount += 8;
    }

    // The requested bits are the top nbits bits of bitbuf.
    uint32_t ret = bitbuf >> (32 - nbits);

    // Shift them out.
    bitbuf <<= nbits;
    bitcount -= nbits;
    return ret;
}

Note we do an explicit end-of-buffer check in read_byte and return a defined value (0 in this case) past the end of the input stream. This kind of check is generally required to avoid crashes (or buffer overrun bugs!) if there is any chance the input stream might be invalid or corrupted – be it as the result of a deliberate attack, or just a transmission error. Returning 0 past the end of buffer is an arbitrary choice, but a convention I tend to stick with in my code.

Reducing overhead

As for get_bits, the implementation is a fairly typical one. However, as should be obvious, reading a few bits like this is still a relatively involved process, because every call to get_bits involves the refill check and an update of the bit buffer state. A key trick in many decompressors is to reduce this overhead by separating looking at bits from consuming them, which allows us to grab lots of bits at once (speculatively), and then later decide how far to move the input cursor. This basically boils down to splitting get_bits into two parts:

uint32_t peek_bits(uint32_t nbits)
{
    assert(0 < nbits && nbits <= 24);

    // Refill: read extra bytes until we have enough bits
    // in buffer. Insert new bits below the ones we already
    // have.
    while (bitcount < nbits) {
        bitbuf |= read_byte() << (24 - bitcount);
        bitcount += 8;
    }

    // Return requested bits, starting from the MSB in bitbuf.
    return bitbuf >> (32 - nbits);
}

void consume_bits(uint32_t nbits)
{
    assert(nbits <= bitcount);
    bitbuf <<= nbits; // shift them out
    bitcount -= nbits;
}

Using this new interface, we can modify our decoder to reduce bit IO overhead, by doing a single peek_bits call early and then manually extracting the different sub-fields from it:

while (!done) { // main loop
    // We read up to 19 bits; grab them all at once!
    uint32_t bits = peek_bits(19);
    if (bits & (1u << 18)) { // match bit set?
        int offset = 1 + ((bits >> 5) & 0x1fff);
        int len = 3 + (bits & 0x1f);

        consume_bits(19); // 1b flag + 13b offs + 5b len
        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = (uint8_t) (bits >> 10);
        consume_bits(9); // 1b flag + 8b value
    }
}

This trick of peeking ahead and deciding later how many bits were actually consumed is very important in practice. The example given here is a simple one; a very important use case is decoding Huffman codes (or other variable-length codes) aided by a look-up table.
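
As a sketch of that use case (with a hypothetical table layout), decoding one variable-length-coded symbol reduces to a peek, a table lookup, and a consume:

// Table-driven decode using peek_bits/consume_bits. decode_table is
// indexed by the next MAX_CODE_LEN bits of the stream; each entry
// holds the decoded symbol and the actual length of its code, so we
// only consume as many bits as that code really used.
#define MAX_CODE_LEN 11

typedef struct {
    uint8_t symbol; // decoded symbol value
    uint8_t len;    // code length in bits, 1 <= len <= MAX_CODE_LEN
} TableEntry;

static TableEntry decode_table[1 << MAX_CODE_LEN];

static uint8_t decode_vlc_symbol(void)
{
    uint32_t bits = peek_bits(MAX_CODE_LEN); // may peek past last code
    TableEntry e = decode_table[bits];
    consume_bits(e.len);
    return e.symbol;
}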

Note, however, that we changed the input behavior: before, we really only called read_byte when we knew it was necessary to complete reading the current code. Now, we peek ahead more aggressively, and will actually peek past the end of the input bitstream whenever the last token is a literal. It’s possible to avoid this type of problem by being more restrained in the usage of peek_bits: only ever peek ahead by the minimum amount of bits that we know is going to get consumed no matter what. However, doing so forces us to do a bit more work at runtime than the code fragment shown above entails.

However, the variant shown above is still completely correct: our implementation of read_byte checks for the end of the input stream, and returns zeroes once we’ve passed it. But this is no longer an exceptional condition: rather than being a “contingency plan” in case of corrupted input data, we can now expect to hit this path when decoding many valid bit streams.

In short, we’re taking a check we need for correctness (the end-of-buffer check) and making it serve double duty to simplify the rest of our decoder. So far, all the code we’ve seen is very standard and not remarkable at all. The resulting bit-IO implementation is fairly typical, more so once we stop trying to only call read_byte when strictly necessary and simplify the buffer refill logic slightly by always refilling to have >24 bits in the buffer no matter what the peek amount is.

Even beyond such details, though, this underlying idea is actually quite interesting: the end-of-buffer check is not one we can easily get rid of without losing correctness (or at least robustness in the face of invalid data). But we can leverage it to simplify other parts of the decoder, reducing the “sting”.

How far can we push this? If we take as granted that reading past the end of the buffer is never acceptable, what is the least amount of work we can do to enforce that invariant?

Relaxed requirements

In fact, let’s first go one further and just allow reading past the end-of-buffer too. You only live once, right? Let’s pull out all the stops and worry about correctness later!

It turns out that if we’re allowed to read a few bytes past the end of the buffer, we can use a nifty branch-free refill technique. At this point, I’m going to manually inline the bit IO so we can see more clearly what’s going on:

while (!done) { // main loop
    // how many bytes to read into bit buffer?
    uint32_t refill_bytes = (32 - bitcount) / 8;

    // refill!
    bitbuf |= read_be32_unaligned(input_cursor) >> bitcount;
    bitcount += refill_bytes * 8;
    input_cursor += refill_bytes;

    assert(bitcount > 24);

    // peek at next 19 bits
    uint32_t bits = bitbuf >> (32 - 19);

    if (bits & (1u << 18)) { // match bit set?
        int offset = 1 + ((bits >> 5) & 0x1fff);
        int len = 3 + (bits & 0x1f);

        // consume_bits(19);
        bitbuf <<= 19;
        bitcount -= 19;
        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = (uint8_t) (bits >> 10);
        // consume_bits(9);
        bitbuf <<= 9;
        bitcount -= 9;
    }
}

This style of branchless bit IO is used in e.g. Yann Collet’s FSE and works great when the target machine supports reading unaligned 32-bit big endian values quickly — the read_be32_unaligned function referenced above. This is the case on x86 (MOV and BSWAP or just MOVBE where supported), ARMv6 and later (LDR provided unaligned accesses are allowed, plus REV when in little-endian mode) and POWER/PPC; not sure about other architectures. And for what it’s worth, I’m only showing 32-bit IO here, but this technique really comes into its own on 64-bit architectures, since having at least 56 bits in the buffer means we can usually go for a long while without further refill checks.
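
In portable C, a sketch of read_be32_unaligned looks like this; decent compilers recognize the pattern and emit a single unaligned load (plus byte swap where needed) on targets that support it:

// Read 4 bytes starting at p as a 32-bit big-endian value. Byte-wise
// assembly avoids unaligned-access UB and works on any endianness.
static uint32_t read_be32_unaligned(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}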

That’s a pretty nice decoder! The only problem being that we have no insurance against corrupted bit streams at all, and even valid streams will read past the end of the buffer as part of regular operation. This is, ahem, hardly ideal.

But all is not lost. We know exactly how this code behaves: every iteration, it will try reading 4 bytes starting at input_cursor. We just need to make sure that we don’t execute this load if we know it’s going to be trouble.

Let’s say we work out the location of the spot where we need to start being careful:

// Before the decoder runs:
const uint8_t *input_mark;

if (input_end - input_cursor >= 4)
    input_mark = input_end - 4;
else
    input_mark = input_cursor;

The simplest thing we can do with that information is to just switch over to a slower (but safe) decoder once we’re past that spot:

while (!done && input_cursor <= input_mark) {
    // fast decoder here: we know that reading 4 bytes
    // starting at input_cursor is safe, so we can use
    // branchless bit IO
}

while (!done) {
    // finish using safe decoder that refills one byte at
    // a time with careful checks!
}

This works just fine, and is the technique chosen in e.g. the zlib inflate implementation: one fast decoder that runs when the buffer pointers are well away from the boundaries, and a slower decoder that does precise checking.

Note that the input_cursor <= input_mark check is the only addition to our fast decoder that was necessary to make the overall process safe. We have some more prep work, and it turns out we ended up with an entire extra copy of the decoder for the cold “near the end of the buffer” path, but the path we expect to be much more common — decoding while still being safely away from the end of the input stream — really only does that one extra compare (and branch) more than the “fast but unshippable” decoder does!

And now that I’ve done my due diligence and told you about the boring way that involves code duplication, let’s do something much more fun instead!

One decoder should be enough for anyone!

The problem we’re running into is that our buffer is running out of bytes to read. The “safe decoder” solution just tries to be really careful in that scenario. But if we’re not feeling very careful today, well, there’s always the ham-fisted alternative: just switch to a different input buffer that’s not as close to being exhausted yet!

Our input buffers are just arrays of bytes. If we start getting too close to the end of our “real” input buffer, we can just copy the remaining bytes over to a small temp buffer that ends with a few padding bytes:

uint8_t temp_buf[16]; // any size >=4 bytes will do.

while (!done) {
    if (input_cursor >= input_mark) {
        assert(input_cursor < input_end);

        // copy remaining bytes to temp_buf
        size_t bytes_left = (size_t) (input_end - input_cursor);
        assert(bytes_left < sizeof(temp_buf));
        memmove(temp_buf, input_cursor, bytes_left);

        // fill rest of temp_buf with zeros
        memset(temp_buf + bytes_left, 0, sizeof(temp_buf) - bytes_left);

        // and update our buffer pointers!
        input_cursor = temp_buf;
        input_end = temp_buf + sizeof(temp_buf);
        input_mark = input_end - 4;
    }

    assert(input_cursor <= input_mark);
    // rest of fast decoder using branchless bit IO
}

And with that little bit of extra logic, we can use our fast decoder for everything: note that we never read past the bounds of the original buffer. Also note that the logic given above can generate an arbitrary amount of trailing zero bytes: if after swapping buffers around, our input cursor hits the mark again, we just hit the refill path again to generate more zeroes. (This is why the copying code uses memmove).

This is nifty already, but we can push this idea much further still.

Switching input buffers

So far, we’re effectively switching from our regular input buffer to the conceptual equivalent of /dev/zero. But there’s no need for that restriction: we can use the same technique to switch over to a different input buffer altogether.

We again use a temporary transition buffer that we switch to when we reach the end of the current input buffer, but this time, we copy over the first few bytes from the next input buffer after the end of the current buffer, instead of filling the rest with zeroes. We still do this using our small temp buffer.

We place our input mark at the position in the temp buffer where data from the new input buffer starts. Once our input cursor is past that mark, we can change pointers again to resume reading from the new input buffer directly, instead of copying data to the temp buffer.

Note that handling cases like really short input buffers (shorter than our 4-byte “lookahead window”) requires some care here, whereas it’s not a big deal when we do the bounds checking on every consumed input byte. We’re not getting something for nothing here: our “sloppy” end-of-input window simplifies the core loop at the expense of adding some complexity in the boundary case handling.

Once we reach the actual end of the input stream, we start zero-filling, just as before. This all dovetails nicely with my old post “Buffer-centric IO”, which combines very well with this technique. Together, we get almost-zero-copy IO, except for the copies into the transition buffer near buffer boundaries, which only touch a small fraction of all bytes and are there to make our lives easier.

A final few generalizations

The example I’ve been using was based on a single get_bits (or later peek_bits) call. But this is really not essential at all. The crucial property we’re exploiting in the decoder above is that we have a known bound on the number of bytes that can be consumed by a single iteration of the loop. As long as we can establish such a bound, we can do a single check per iteration; in general, we need to check our input cursor at least once inside every loop that consumes a potentially unbounded (or at least large) number of input bytes — which in this example is only the main loop.

For the final generalization, note that a lot of compressors use a stream interface similar to zlib. In essence, this is a buffer interface similar to the one described in “Buffer-centric IO” for both the input and output buffers; the decompressor then gets called and processes data until the input or output buffers are exhausted, the end of stream is reached, or an error occurs, whichever happens first. This type of interface makes the (de)compressor somewhat harder to write but is much more convenient for the client code, which can just hand in whatever.

A typical way to implement this type of interface is described in Simon Tatham’s old article “Coroutines in C” — the key property being that the called function needs to be able to save its state at any point where I/O happens, in case it runs out of buffer space; and furthermore it needs to be able to later resume at exactly that point.

The solution is to effectively turn the (de)compressor into a state machine, and Tatham’s article describes a way to do so using a variant of Duff’s Device, quite probably the most infamous coding trick in the C language. Most (de)compressors with a zlib-like interface end up using this technique (or an equivalent) so they can jump into the middle of the decoder and resume where they left off.
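
If you haven’t seen it before, the pattern boils down to something like this (a highly condensed sketch with made-up names, not lifted from any real codec):

enum { DECODE_DONE, DECODE_NEED_INPUT };

typedef struct {
    int resume_point; // which case label to jump back to
    const uint8_t *in_cursor, *in_end;
    int done;
} DecState;

int decode_some(DecState *s) {
    switch (s->resume_point) {
    case 0:
        while (!s->done) {
            if (s->in_cursor == s->in_end) {
                s->resume_point = 1;
                return DECODE_NEED_INPUT; // save state and bail...
    case 1:         ;                     // ...then resume right here
            }
            // actual decoding work would go here; as a placeholder,
            // just consume one input byte:
            s->in_cursor++;
        }
    }
    return DECODE_DONE;
}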

So why do I mention all this? Well, the technique I’ve outlined in this article is applicable here as well: Tatham’s description assumes byte-level granularity IO, which means there’s generally lots of points inside the decoder main loop where we might need to save our state and resume later. If the decoder instead ensures there’s enough bytes left in the buffer to make it through one full iteration of the main loop no matter what, that means we have many fewer points where we need to save our state and later resume, often only in a single location.

What’s particularly interesting about combining the relaxed-refill technique with a coroutine-style decoder is that all of the refill and transition buffer logic can be pulled outside of the decoder proper. In library code, that means it can be shared between multiple decoders; so the logic that deals with the transition buffers and short input buffers only needs to be implemented and debugged once.

Discussion

The key simplification in this scheme is relaxing the strict “check for end of buffer on every byte consumed” check. Instead, we establish an upper bound N on the number of input bytes that can be consumed in a single iteration through our decoder main loop, and make sure that our current input buffer always has at least N bytes left — by switching to a different temporary input buffer if necessary.

This allows us to reduce the number of end-of-buffer checks we need to execute substantially. More importantly, it greatly increases the applicability of branch-less refill techniques in bit IO and arithmetic coding, without having to keep a separate “safe” decoder around.

The net effect is one of concentrating a little complexity from several places in hot code paths (end-of-buffer checks on every byte consumed) into somewhat increased complexity in a single cold code path (buffer switching). This is often desirable.

The biggest single caveat with this technique is that as a result of the decoder requiring N bytes in the input buffer at all times, the decoder effectively “lags behind” by that many bytes – or, depending on your point of view, it “looks ahead” by N bytes, reading from the input stream sooner than strictly necessary.

This can be a problem when, for example, several compressed streams are concatenated into a single file: the decoder may only get to decoding the “end of stream” symbol for stream A after N bytes from stream B have already been submitted to the decoder. The decoder would then need to “un-read” (in the sense of ungetc) the last few bytes or seek backwards. No matter how you dice it, this is annoying and awkward.

As a result, this technique is not all that useful when this is a required feature (e.g. as part of a DEFLATE decoder obeying the zlib interface). However, there are ways to sidestep this problem: if the bitstream specifies the compressed size for either the entire stream or individual blocks, or if the framing format ends in N or more trailing “footer” bytes (a checksum or something similar), we can use this approach just fine.

UPDATE: As commenter derf_ notes on Hacker News, there’s a nice trick to produce implicit trailing zero bits in a bit reader like the one described above by just setting bitcount to a high value once the last byte’s been read into bitbuf. However, this only works with a decoder exactly like the one shown above. The nice part about switching to an explicit zero-padding buffer is that it works not just with all bit IO implementations I’m aware of, but also with byte-normalized (or larger) arithmetic coders like typical range coders or rANS.


Repeated match offsets in BitKnit

I spent the majority of last year working on LZ77-style codecs. I’ve written about some results before. But there were also several smaller (in scope) but still quite neat discoveries along the way.

One of them has to do with repeated match offsets. BitKnit was originally designed for Granny files, which usually contain 3D meshes, animations, sometimes textures, and can also store other user-defined data. As far as a compressor is concerned, Granny files are highly structured, mostly consisting of a few large, homogeneous arrays of fixed-size records.

Repeated match offsets

Often, there is significant correlation between adjacent records, for various reasons. What this means in a LZ77-style dictionary compressor is that there will usually be a lot of matches with a match distance (or match offset) that is a small integer multiple of the record size, and matches with the same offset tend to clump together.

The way LZ77 compressors typically exploit this fact is by reserving special codes for “reuse a recent match distance”. To my knowledge, this technique first appeared in LZX, which keeps a 3-element cache of recent match offsets with a LRU eviction policy. The basic idea seems to have spread from there. Many compressors (too many to list) reserve at least a single special, cheaper code to send another match with the same offset as the previous one (corresponding to a 1-element “cache”). This, among other things, gives a cheaper way to code “gap matches” (a match that resumes after being interrupted by a few mismatching bytes) and appears to be beneficial on most types of data.

On text and data that skews towards variable-size records, having extra codes for more repeated match offsets doesn’t help much, if at all (at least, they don’t seem to hurt, either). However, on data heavy on fixed-size records, it is often a big win. LZX, as mentioned before, has 3 “repeat offset slots”. LZMA uses 4. Several experiments early in the design of BitKnit indicated that at least for the highly structured Granny files it was designed for, there was a good case to be made for having even more repeat offset slots. We re-evaluated this several times, but a repeat offset count of 8 made it into the final codec; essentially, having a larger number of offset slots allows us to “get away” with an overall less sophisticated offset coder (reducing compression, but improving decoder speed), and is a very solid win on the highly structured data that was the target use case.

Experiments with lots of repeat match offsets

Two interesting problems arise from making the “repeat offset cache” this large. First, 8 entries is large enough that it’s worth thinking about different algorithmic variants. Second, at that size, it makes sense to investigate different eviction policies as well as other strategies such as maybe “pinning” a few match distances that we expect to be useful (for example, multiples of the record size in front of homogeneous sections).

Second part first: the effect of either “preloading” or pinning useful match distances was either “in the noise” (almost any change to a LZ encoder using adaptive models will make some files larger and others smaller simply due to getting a slightly different parse) or strictly worse in all our tests. Considering how many interface complications this implies (the pre-loaded offsets for different sections need to get to the compressor somehow, and they either need to be known in the decoder from other sources or stored in the compression stream, reducing the gains even further), that makes the idea fairly uninteresting. Empirically, at least in our tests, LZ compressors find useful match distances quickly, and once they’re in the cache, they tend to stick around. Since such a cache is naturally adaptive to the data (whereas static pinned offsets are not), keeping them fully dynamic seems like a good idea.

The next test was to decouple the eviction policy from offset modeling. LZX, LZMA etc. always keep their list of recent match offsets in MRU order: slot 0 is the most recently used offset, slot 1 is the second-most recent, and so forth. One experiment we tried with a very early version of BitKnit was a variant I dubbed “stable index MRU”: offsets are still evicted on a LRU basis, but instead of shuffling the indices around on every match so that the new most recent match gets offset 0, new offsets would get inserted into the least recently referenced slot without moving the slot IDs around.

This affects the modeling; before, you have a very skewed distribution: slot 0 (most recent) is much more important than slot 1, which is more important than slot 2, and so forth. After, they are more spread out; but the idea was that in highly structured files where the same few offsets stick around for a fairly long time, you might capture more useful correlations by keeping these offsets in a single spot (which the entropy coder then tried to capitalize on).

Here are the results on a few Granny files, listing the compressed sizes in bytes, with “anchor” being an x86 executable file that doesn’t have a significant amount of record-structured data in it, as a baseline. “MTF” refers to the move-to-front index update policy, “stable” is the stable-index variant just described.

Configuration granny1 granny2 granny3 anchor
4 offsets, MTF 18561994 22127519 15156085 1300427
4 offsets, stable 18533728 22261691 15177584 1300980
8 offsets, MTF 17825495 21746140 14800616 1300534
8 offsets, stable 17580201 21777192 14819439 1304406
12 offsets, MTF 17619775 21640749 14677706 1301007
12 offsets, stable 17415560 21448142 14681142 1306434
16 offsets, MTF 17554769 21523058 14600474 1300984
16 offsets, stable 17341154 21462241 14669793 1308508

First off, as you can see from these experiments, going from 4 to 8 repeat match offsets really does help significantly on these files; an extra 1.5%-4% reduction in file size may not sound like much, but it’s a fairly big deal in compression terms. The experiments with even more repeat offsets were mainly to get a feel for when we start to hit diminishing returns; also, as you can see, the compressed size for the “anchor” file (which is not record-structured) doesn’t seem to care much about the difference between 4 and 8 repeat offset slots, and gets worse after.

As for stable-index coding, well, it’s a mixed bag. It does help on some files, and on the files that do seem to improve from it, it’s a bigger win when using more offset slots, but on e.g. “granny3” and “anchor” it was a net negative. Interesting experiment, but it didn’t go in.

Insertion policy

Another experiment we ran was on insertion policy. Specifically, our hypothesis was that at least on highly structured data (where the repeat offsets really help), we really want to make sure the important offsets stay “in front”. But occasionally, you will still get other matches that “don’t fit the pattern”. The problem is that this puts some other random offset in front that will then slowly “slide down” and meanwhile cause our actually important offsets to be more expensive to code.

This is more of a problem with greedy LZ parsers (which make their decisions locally); optimizing parsers (which usually try to look ahead by a few kilobytes or so) are better at correctly estimating the cost of “disrupting” the set of offsets. Either way, it’s annoying.

We tried a couple different approaches with this; the best overall approach we found in our tests was to stick with a basic MTF coding scheme and LRU eviction policy (bog-standard in other words), but distinguish between updates caused by repeat matches and those caused by inserting a new offset not currently in the repeat offset set. The former (repeat matches) just do a full move-to-front step, as usual. The latter don’t; instead of inserting a new offset all the way in front, we insert it further back. If it then gets reused a second time, it really will be moved all the way to the front of the list; but if it doesn’t get referenced again, it will drop out more quickly and with less disruption of the remaining repeat offsets.

Here’s the batch of test results, from the same compressor version. “kSlotNew” is the slot where new offsets are inserted; 0 corresponds to inserting in front (regular MTF), 1 is the second position, and so forth.

Configuration granny1 granny2 granny3 anchor
4 offsets, kSlotNew=0 18561994 22127519 15156085 1300427
4 offsets, kSlotNew=1 18450961 22153774 15154609 1304707
4 offsets, kSlotNew=2 18118014 22000266 15181128 1305755
4 offsets, kSlotNew=3 17736663 22002942 15209073 1307550
8 offsets, kSlotNew=0 17825495 21746140 14800616 1300534
8 offsets, kSlotNew=4 17327247 21546289 14771634 1305128
8 offsets, kSlotNew=6 17197347 21425116 14713121 1305588
16 offsets, kSlotNew=0 17554769 21523058 14600474 1300984
16 offsets, kSlotNew=14 17122510 21177337 14578492 1305432

We can see that the anchor prefers pure MTF, but the Granny files definitely see a win from not moving new offsets all the way to the front the first time they’re seen. There were a few more tests than the ones shown, but in general, inserting new offsets in the second-to-last slot seemed like a good rule of thumb for the Granny files.

This one is definitely more contextual. As you can see, different types of files really prefer different settings here. BitKnit went with 8 offsets and insertion at the second-to-last slot (corresponding to the “8 offsets, kSlotNew=6” row above), because it produced the overall best results on the data it was designed for. (As evaluated on a larger test set not shown here.)

So, this is fairly neat, and a comparatively major win over the baseline 4 offsets and insert-in-front variant (a la LZMA) for the data in question. Now how to implement this efficiently?

Implementation notes

The basic implementation of the offset maintenance logic in the decoder is dead simple. You just keep an array of recent offsets and shuffle it around with something like this:

if (is_repeat_match) {
    // move slot "rep_idx" to front.
    // this involves grabbing the offset at the corresponding
    // location and then sliding everything before that position
    // down by one slot.
    tmp = offsets[rep_idx];
    for (uint i = rep_idx; i > 0; --i)
        offsets[i] = offsets[i - 1];
    offsets[0] = tmp;
} else {
    // implement the "insert in second-to-last position"
    // rule, which touches exactly two elements.
    offsets[kNumReps - 1] = offsets[kNumReps - 2];
    offsets[kNumReps - 2] = newOffset;
}

This works just fine, but it has a lot of data-dependent branches in the repeat match case, which is a performance trap in decompressors; generally speaking, you want to avoid branching on data you just read out of a bitstream, because it tends to be relatively high entropy and thus cause a lot of branch mispredictions, which are expensive.

One way to fix this is to add several entries worth of padding in front of the actual used part of offsets, and always copy the same number of entries in the “sliding down” phase. This gets rid of the data-dependent branches and makes it easy to unroll the loop fully (since the trip count is now constant) or express it using a few unaligned SIMD loads/stores (where supported).
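
One way to write that down (a sketch, reusing kNumReps and rep_idx from above; note the loop now has a constant trip count, and the index needs to be treated as signed so the extra writes land in the padding):

// kNumReps-1 padding slots in front of the live array absorb
// the extra writes when rep_idx is small.
uint32_t offsets_storage[(kNumReps - 1) + kNumReps];
uint32_t *offsets = offsets_storage + (kNumReps - 1); // live part

// repeat match case with a constant trip count:
int idx = (int) rep_idx;
uint32_t tmp = offsets[idx];
for (int i = 0; i < kNumReps - 1; ++i)
    offsets[idx - i] = offsets[idx - i - 1]; // indices < 0 hit padding
offsets[0] = tmp;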

However, BitKnit uses a different approach derived from our earlier experiments with “stable index MRU” that doesn’t need anything beyond regular integer arithmetic. The basic idea is to leave the offsets array alone; instead, we keep a secondary “data structure” that tells us which logical “repeat offset” list position corresponds to which index in the offsets array.

I write “data structure” in quotes because that information is actually stored in (drum roll)… a single 32-bit unsigned integer! Here’s the idea: we have a uint32_t mtf_state that represents the current offset permutation. It does this by storing the offset array index for the i’th logical repeat offset slot in the i’th nibble (numbered starting from the LSB upwards). At initialization time, we set mtf_state = 0x76543210, the identity mapping: the logical and actual offset indices coincide.

Why does this help? Because the fundamental operation for move-to-front processing is moving a bunch of offsets “one slot down” in their array position. If they’re separate integers, that means either a lot of copying, or less copying but using much wider (e.g. SIMD) instructions. Our array of 4-bit indices is compact enough that 8 indices fit inside a single 32-bit uint; we can slide them all “down” or “up” using nothing but a single bit shift. Now, our code above doesn’t actually move all elements, just the ones at position ≥rep_idx; but that turns out to be easily remedied with some bit masking operations.

So the alternative variant is this:

if (is_repeat_match) {
    // move slot "rep_idx" to front by permuting mtf_state. first,
    // determine the offset slot ID at that position in the list
    uint32_t rep_idx4 = rep_idx*4;
    uint32_t slot_id = (mtf_state >> rep_idx4) & 0xf;
    match_offs = offsets[slot_id]; // decoder needs this later!

    // moved_mtf: slide down everything by one slot, then put
    // "slot_id" in front.
    uint32_t moved_mtf = (mtf_state << 4) + slot_id;
    uint32_t keep_mask = ~0xfu << rep_idx4; // bits that don't move
    mtf_state = (mtf_state & keep_mask) | (moved_mtf & ~keep_mask);
} else {
    // implement the "insert in second-to-last position"
    // rule, which touches exactly two elements. this is easier
    // to do by just modifying the offsets directly.
    uint32_t last = (mtf_state >> ((kNumReps - 1)*4)) & 0xf;
    uint32_t before_last = (mtf_state >> ((kNumReps - 2)*4)) & 0xf;

    offsets[last] = offsets[before_last];
    offsets[before_last] = newOffset;
}

It’s a bit of integer arithmetic, but not a lot, and there’s no dependence on vector instructions, fast unaligned memory access, or in fact anything outside of standard C/C++. BitKnit uses a 32-bit mtf_state to implement an 8-entry LRU cache. Using 64-bit values (and still using nibbles to store array indices), the exact same approach (with essentially no modifications to the source save for type names) can manage a 16-entry LRU.

An 8-entry LRU actually only needs 24 bits (when storing array indices in groups of three bits instead of nibbles), but that’s not a very useful size. A 4-entry LRU state fits in 4*log2(4) = 8 bits, which is nice and compact, although for 4 entries, this way is generally not a win (at least in our tests).

And now it’s time to come clean: I kinda like this approach, and it’s the real reason I wrote this whole thing up. I probably would’ve still written it up even if it hadn’t turned out to be useful in practice, but it did, which is always a nice bonus.

Finally, over the years, I’ve found a few instances like this where packing a small “data structure” (using the term loosely) inside a single register-width integer produces interesting results. There’s a good chance I’ll write about more of them in the future! Until then.


SSE: mind the gap!

SSE and SSE2 are available in every single x86-family CPU with 64-bit support. You too can play around with SIMD, which is great fun! Unfortunately, SSE2 level in particular also happens to be what is probably the most maddeningly non-orthogonal SIMD instruction set in the world, where operations are either available or not available for particular data types with little rhyme or reason, especially where integers are involved. Later revisions (especially starting around SSE4.1) fill in some of the more annoying gaps, but plenty of us are stuck with supporting the older CPUs for at least a few more years, and besides – not to mess with the authentic SSE experience – even on AVX2-supporting CPUs, there’s still a few of the classic gaps remaining.

So, here’s a list of tricks to get you around some of the more common, eh, “idiosyncrasies” of SSE and its descendants. This happens to be mostly focused on the integer side; the floating-point side is generally less, well, weird. I’ll keep the individual descriptions relatively brief since the whole point of this post is to collect lots of tricks. The assumption here is that you’re already somewhat familiar with the instructions, so I’ll not explain the basics (maybe another time). I’ll use the official Intel intrinsics (as exposed in C/C++) since that’s probably the most common way people interact with these instructions intentionally (awkward glance in the direction of auto-vectorization here. No making eye contact. Moving on.)

Branchless “select” (cond ? a : b)

The natural mode of operation in SIMD computations is to do things branchlessly. If some part of a computation is conditional, rather than doing the equivalent of an if, it’s more typical to do both the computation for the “if” and “else” forks, and then merge the results based on the condition. The “select” I mean is the operation which takes the condition and both results and performs the rough equivalent of C’s ternary operator cond ? a : b. You first evaluate both sides, giving a and b. You then evaluate the condition using a SIMD compare, which returns a vector containing a bit mask that has all bits set for lanes that meet cond, and all bits clear for lanes that don’t.

This select operation can always be done using a few bitwise operations (which is well known), but starting in SSE 4.1 we get slightly more efficient variants too (less well known, and the reason I mention this):

  • Integer (all vers): _mm_or_si128(_mm_and_si128(a, cond), _mm_andnot_si128(cond, b)).
  • 32-bit float (all vers): _mm_or_ps(_mm_and_ps(a, cond), _mm_andnot_ps(cond, b)).
  • 64-bit float (all vers): _mm_or_pd(_mm_and_pd(a, cond), _mm_andnot_pd(cond, b)).
  • Integer (SSE4.1+): _mm_blendv_epi8(a, b, cond).
  • 32-bit float (SSE4.1+): _mm_blendv_ps(a, b, cond).
  • 64-bit float (SSE4.1+): _mm_blendv_pd(a, b, cond).

The andnot operations don’t come in handy very often, but they’re the best choice here (pre-SSE4.1).

If you don’t want to use cond but its logical negation, just switch the positions of a and b, since (!cond) ? a : b is the same as cond ? b : a.

Unsigned integer compares

SSE, in all incarnations, offers precisely two types of integer comparisons: test for equality (PCMPEQt, _mm_cmpeq_T, where t and T stand for various type suffixes) and test for signed greater-than (PCMPGTt, _mm_cmpgt_T). Most other comparison types can be produced using nothing but logical negation and standard identities:

  • a == b is supported directly.
  • a != b is !(a == b).
  • a > b (signed) is supported directly.
  • a < b (signed) is the same as b > a (swap a and b).
  • a >= b (signed) is !(a < b) (which in turn is !(b > a)).
  • a <= b (signed) is !(a > b).

See the previous note on selection operations for how to get rid of the NOT in the most common use case. Conspicuously absent from that list is any type of unsigned ordered comparison. However, a trick that works is to bias both integers so that signed comparison does the right thing:

  • a > b (unsigned, 8-bit) is the same as (a - 0x80) > (b - 0x80) (signed, 8-bit).
  • a > b (unsigned, 16-bit) is the same as (a - 0x8000) > (b - 0x8000) (signed, 16-bit).
  • a > b (unsigned, 32-bit) is the same as (a - 0x80000000) > (b - 0x80000000) (signed, 32-bit).

The same argument-swapping and NOT-ing tricks as above still apply to give you the other compare types. In general, the trick is to add (or subtract, or XOR – they all do the same thing in this particular case) the INT_MIN for the respective type to both operands before doing the compare. This turns the smallest possible unsigned integer, 0, into the smallest possible signed integer for the given type; after that, the ordering works out. In particular, when comparing against a constant, this addition (or subtraction, or XOR) can be baked into the constant operand, so the unsigned compare “only” ends up doing one more operation than a signed compare (instead of two).
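
For example, here’s what an unsigned greater-than for bytes might look like on SSE2 (a sketch; the helper name is made up):

// unsigned a > b for 8-bit lanes: flip the sign bits,
// then do a signed compare.
__m128i cmpgt_epu8(__m128i a, __m128i b) {
    __m128i sign = _mm_set1_epi8((char) 0x80);
    return _mm_cmpgt_epi8(_mm_xor_si128(a, sign),
                          _mm_xor_si128(b, sign));
}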

A completely different approach is to use the unsigned integer min/max instructions (more about those in a second) to build less-or-equal or greater-or-equal comparisons:

  • a <= b if and only if max(a, b) == b.
  • a >= b if and only if min(a, b) == b.

The good news is that this reduces unsigned comparisons to either an unsigned min or a max, followed by an equality comparison, which is only 2 instead of 3 operations. The bad news is that the requisite unsigned min/max operations only exist for uint8s in SSE2. The uint16/uint32 variants were finally added with SSE4.1; if your minimum target is earlier, you’re stuck with the bias-then-compare variants above.
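
For the cases where it does apply, the result is pleasantly short (again a sketch, SSE2-level):

// unsigned a <= b for 8-bit lanes: a <= b iff max(a, b) == b.
__m128i cmple_epu8(__m128i a, __m128i b) {
    return _mm_cmpeq_epi8(_mm_max_epu8(a, b), b);
}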

Integer min and max

SSE4.1 has the full set of integer min/max for 8-, 16- and 32-bit types, both signed and unsigned. So if you’re targeting SSE4.1 or later, good for you!

If you’re stuck with anything earlier, you’re decidedly more limited. In SSE2, you get integer min/max for uint8 and int16. If you need min/max for int8, uint16, or anything 32-bit, you’re on your own.

Luckily, we can just combine some of the techniques above to derive a solution. The general patterns here are:

  max(a, b) == (a > b) ? a : b;
  min(a, b) == (a > b) ? b : a;

So this is just a combination of a compare and a “select” operation. When the compare is signed (the int8 and int32 cases), the comparison maps to a single SSE intrinsic. The unsigned compares (uint16 and uint32) can be solved using the bias-then-signed-compare trick which in turn gives us an unsigned min/max.
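
Putting those pieces together, here’s a sketch of two of the missing SSE2 cases (helper names made up):

// signed 32-bit min: compare, then select.
__m128i min_epi32_sse2(__m128i a, __m128i b) {
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);
    return _mm_or_si128(_mm_and_si128(a_gt_b, b),
                        _mm_andnot_si128(a_gt_b, a));
}

// unsigned 16-bit max: bias, signed compare, then select.
__m128i max_epu16_sse2(__m128i a, __m128i b) {
    __m128i bias = _mm_set1_epi16((short) 0x8000);
    __m128i a_gt_b = _mm_cmpgt_epi16(_mm_xor_si128(a, bias),
                                     _mm_xor_si128(b, bias));
    return _mm_or_si128(_mm_and_si128(a_gt_b, a),
                        _mm_andnot_si128(a_gt_b, b));
}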

32-bit and 64-bit loads/stores

This one has nothing to do with the actual instruction set and everything to do with the intrinsics: yes, SSE2 has 32-bit (MOVD) and 64-bit (MOVQ) loads and stores, the standard intrinsics just do their best to confuse you about it:

  • 64-bit loads are _mm_loadl_epi64. This intrinsic takes a __m128i * as an argument. Don’t take that seriously. The actual load is 64-bit sized, not 128-bit sized, and there is no alignment requirement.
  • 64-bit stores are _mm_storel_epi64. Again, the __m128i * is confusing and does not mean that the actual store is 128-bit or that there are alignment requirements. It isn’t and there are not.
  • 32-bit loads are even more hidden! Namely, you write _mm_cvtsi32_si128(*x) where x is a pointer to a 32-bit integer. No direct load intrinsic, but compilers will turn this into a MOVD with memory operand where applicable.
  • 32-bit stores, likewise: *x = _mm_cvtsi128_si32(value). Now you know.
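
If you’d rather not remember all that, it’s easy to hide behind tiny wrappers (hypothetical names; the memcpy also sidesteps the alignment and strict-aliasing trouble that a direct *x dereference can run into):

__m128i load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v)); // safe for unaligned pointers
    return _mm_cvtsi32_si128((int) v);
}

void store_u32(void *p, __m128i x) {
    uint32_t v = (uint32_t) _mm_cvtsi128_si32(x);
    memcpy(p, &v, sizeof(v));
}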

Multiplies

There’s lots of different ways to provide multiplies in a SIMD instruction set, and by now SSE has tried most of them in one form or another.

Let’s start with the (historically) first variant: multiplying 16-bit numbers. The relevant instructions originated in the Pentium MMX and compute the low and high halves (bottom and top 16 bits) of a signed 16-bit×16-bit product. MMX only has signed multiplies, but SSE also added a “high half of unsigned 16-bit times 16-bit product” instruction (the low halves of signed and unsigned products are identical), so we’re not gonna have to worry about that particular problem, not yet anyway.

These instructions are fine if you want the low or high halves of the product. What if you want the full 32-bit product of vectors of 16-bit values? You compute the low and high halves and then merge them using the “unpack” instructions. This is the standard approach, but not very obvious if you haven’t dealt with this kind of thing before. So for a full 16×16→32-bit product (note this produces two vectors worth of results), we get:

  // EITHER: a*b (16-bit lanes), signed
  __m128i lo16 = _mm_mullo_epi16(a, b);
  __m128i hi16 = _mm_mulhi_epi16(a, b);

  // OR: a*b (16-bit lanes), unsigned
  __m128i lo16 = _mm_mullo_epi16(a, b);
  __m128i hi16 = _mm_mulhi_epu16(a, b);

  // THEN: merge results
  __m128i res0 = _mm_unpacklo_epi16(lo16, hi16); // result lanes 0..3
  __m128i res1 = _mm_unpackhi_epi16(lo16, hi16); // result lanes 4..7

But what if you’re working with 32-bit values? There is a 32×32→32-bit product (PMULLD / _mm_mullo_epi32), but it was only added with SSE4.1, and it’s significantly slower than the other SSE2 multiplies in many implementations. So you might either not want to set your minimum target that high, or you might be looking for something quicker.

There’s full 32×32→64-bit products, which are available from SSE2 on as PMULUDQ/_mm_mul_epu32 (unsigned). SSE4.1 adds the signed equivalent PMULDQ/_mm_mul_epi32 (UPDATE: An older version of this post incorrectly stated that PMULDQ was SSE2. Thanks Exophase for pointing it out!). These ones only compute two products (between the even lanes of the two sources) and place them in a 128-bit result. The odd 32-bit lanes are ignored completely, so if you want four 32×32→32-bit products, you need at least two of these multiplies and a lot of wrangling:

  // res = _mm_mullo_epi32(a, b) equivalent using SSE2, via PMULUDQ.

  // even and odd lane products
  __m128i evnp = _mm_mul_epu32(a, b);
  __m128i odda = _mm_srli_epi64(a, 32);
  __m128i oddb = _mm_srli_epi64(b, 32);
  __m128i oddp = _mm_mul_epu32(odda, oddb);

  // merge results
  __m128i evn_mask = _mm_setr_epi32(-1, 0, -1, 0);
  __m128i evn_result = _mm_and_si128(evnp, evn_mask);
  __m128i odd_result = _mm_slli_epi64(oddp, 32);

  __m128i res = _mm_or_si128(evn_result, odd_result);

It works, but it’s a mouthful.

But what if you’re using 32-bit vector lanes, but happen to know that the numbers we’re trying to multiply are in fact in the range [-32768,32767] (i.e. representable as signed 16-bit integers)? We could try narrowing the 32-bit lanes into 16 bits then using the 16×16→32 sequences above, but is that really the best we can do?

It is not: PMADDWD (_mm_madd_epi16), MMX/SSE2’s amazing and strange (but mostly amazing) dot product operation, has our back, for we can do this:

   // a and b have 32-bit lanes with values that fit in int16s.
   // produces the 32-bit result
   //   res[i] = a[i] * b[i]

   // clears high 16 bits of every 32-bit lane
   __m128i bm = _mm_and_si128(b, _mm_set1_epi32(0xffff));

   // after this, madd_epi16 does what we want!
   __m128i res = _mm_madd_epi16(a, bm);

   // can swap role of a and b above too, when convenient.

That’s a lot shorter than narrowing to 16-bit first would be! Alas, it only works for int16 (signed). What if we’re working in 32-bit lanes with values that fit inside a uint16 (unsigned)? It’s not quite as slick, but still, better than narrowing to 16-bit first or dealing with the logistics when synthesizing 32×32→32-bit muls from PMULDQ/PMULUDQ:

   // a and b have 32-bit lanes with values that fit in uint16s,
   // i.e. a[i] == (uint16)a[i] and same for b[i].
   //
   // produces the 32-bit result
   //   res[i] = a[i] * b[i]

   // compute low and high 16-bit products
   __m128i lop = _mm_mullo_epi16(a, b);
   __m128i hip = _mm_mulhi_epu16(a, b);

   // merge results
   __m128i res = _mm_or_si128(lop, _mm_slli_epi32(hip, 16));

Horizontal adds, dot products etc. (float)

SSE3 adds horizontal adds HADDPS (_mm_hadd_ps) and HADDPD (_mm_hadd_pd) and SSE4.1 throws in the dot-product instructions DPPS (_mm_dp_ps) and DPPD (_mm_dp_pd).

Generally, don’t expect these operations to be magic. They exist in the instruction set but are fast precisely nowhere; in all x86 implementations I’m familiar with, they just turn into a canned sequence of more basic (SSE2-level) operations. So more often than not, you will end up requiring a higher minimum CPU target for little to no speed gain. Caveat: these instructions are smaller than their replacement instruction sequence, so using them can reduce code size slightly. But still, don’t expect this to be fast.

If you want good SIMD performance, don’t lean on horizontal and dot-product style operations; process data in batches (not just one vec4 at a time) and transpose on input, or use a SoA layout to begin with.

The other kind of horizontal adds, dot products etc. (integer)

SSE does have a bunch of horizontal add and dot product-style operations that don’t suck, but they’re on the integer pipe, and not what you’d expect.

Nope, not PHADDW/PHADDD (_mm_hadd_epi16/_mm_hadd_epi32). These are SSSE3 and later only and OK but not particularly great (similar disclaimer as for the floating-point horizontal adds applies).

No, I’m talking about PMADDWD (_mm_madd_epi16, SSE2 with its ancestor around since the original MMX), PSADBW (_mm_sad_epu8, SSE2) and PMADDUBSW (_mm_maddubs_epi16, SSSE3). The official manuals describe what these instructions do, so I won’t bother going into too much detail, but here’s the basic idea: PMADDWD and PMADDUBSW are 2-element dot-product instructions between pairs of adjacent SIMD lanes. PMADDWD computes two int16 by int16 products for each pair of 16-bit lanes and sums the 32-bit integer products to yield the 32-bit result lanes. PMADDUBSW computes two uint8 by int8 products for each pair of 8-bit lanes and sums the 16-bit integer products to yield the 16-bit result lanes. These can be used to compute actual dot products; but they also have “degenerate” configurations that are very useful:

  • _mm_madd_epi16(x, _mm_set1_epi16(1)) sums the 16-bit even and odd lanes of x in pairs to yield 32-bit results.
  • _mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), _mm_setr_epi8(1, -1, 1, -1, ..., 1, -1)) happens to be the fastest way to compute the 16-bit signed differences between 8-bit unsigned vectors a and b on processors that support SSSE3.
  • The 16-bit multiply example above shows another special configuration.

Long story short, these dot product instructions are surprisingly versatile in decidedly non-obvious ways.

Finally, PSADBW (_mm_sad_epu8, SSE2). This one is intended for motion estimation in video codecs, but it also happens to be the one actually really fast horizontal add you get on x86. In particular, _mm_sad_epu8(x, _mm_setzero_si128()) computes two 16-bit horizontal sums of groups of 8 uint8 lanes in a single, and quite fast, operation. We can do the same trick we did for compares in reverse to compute the sum of 8 int8s instead: add (or subtract, or XOR) _mm_set1_epi8(-128) to x (before the PSADBW), then subtract 128×8 from the resulting 16-bit sums.
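
As a concrete sketch (helper name made up), summing all 16 byte lanes of a vector boils down to this:

// horizontal sum of all 16 uint8 lanes (SSE2).
uint32_t sum_epu8(__m128i x) {
    // two 16-bit partial sums, one per 8-byte half
    __m128i sums = _mm_sad_epu8(x, _mm_setzero_si128());
    return (uint32_t) _mm_cvtsi128_si32(sums)
         + (uint32_t) _mm_extract_epi16(sums, 4);
}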

To be continued!

There’s a lot more of these, but this feels like enough to chew on for a single blog post. So there will be a sequel covering, at least, integer widening/narrowing and variable shifts. Until then!


Rounding up to the nearest integer that’s congruent to k mod N

It’s fairly well-known (among programmers anyway) that say rounding up x to the nearest multiple of 8 can be accomplished using the formula (x + 7) & ~7, and that in general rounding up to the nearest multiple of N (where N is a power of 2) can be accomplished using (x + N - 1) & ~(N - 1). But sometimes you need a slightly generalized version: round up to the nearest value that is congruent to some k \pmod N; for example, this crops up in boundary tag-using memory allocators when the user requests aligned memory. Such allocators put a header before allocated blocks (just before the address returned to the caller). For the user-visible pointer to be aligned by say 32, that header needs to fall at an address that’s off alignment by a specified distance (which brings us to our problem).

It’s not immediately obvious how to adapt the original formula to this case (there is a way; I’ll get to it in a second). Now, this is not exactly a frequent problem, nor is there any real need for a clever solution, but it turns out there is a very nice, satisfying solution anyway, and I wanted to write a few words about it. The solution is simply x + ((k - x) & (N - 1)) for power-of-2 N. The basic approach works in principle for arbitrary N, but x + ((k - x) % N) will not work properly in environments using truncated division, where taking the modulus of a negative argument can return negative results – which sadly is most of them. That said, in the remainder of this short post I’ll write % N instead of & (N - 1), with a “N needs to be a power of 2” disclaimer anyway, since there’s nothing about the method itself that requires it. Finally, this expression works fine even in overflowing unsigned integer arithmetic when N is a power of 2, but not for non-power-of-2 N.

What I like about this solution is that, once you see it written down, it’s fairly clear that and why it works (unlike many bit manipulation tricks), provided you know the rules of modular arithmetic: x + ((k - x) \bmod N) \equiv x + (k - x) = k \pmod N. We’re adding a non-negative value to x, so it’s clear that the result is ≥ x (provided there is no overflow). And we’re adding the smallest possible value we can to get to a value that’s congruent to k (mod N); I wrote about similar things before in my post “Intervals in modular arithmetic”.

There’s an equivalent expression for rounding down to the nearest value congruent to k (mod N): x - ((x - k) % N) that works (and is easy to prove) the same way.

It’s interesting to consider the case k=0. The round-down variant, x - (x % N), feels fairly natural and is something I’ve seen in “real-world” code more than once. The round-up variant, x + (-x % N) is something I’ve never seen anywhere. Once you throw the k in there, it all makes sense, but without it the expression looks quite odd.

Finally, here’s the aforementioned way to adapt the “regular” round-up formula to produce a value that’s congruent to k (instead of 0) mod N (and we’re back to requiring power-of-2 N here): ((x - k + N - 1) & ~(N - 1)) + k. This uses a different trick from the intervals in modular arithmetic paper: shift the origin around. In this case, we don’t have a formula for arbitrary k, but we do have a formula to round up to the nearest multiple of N. So we first subtract k; in this new shifted coordinate system, we want to round up to the next-larger multiple of N, which we know how to do. And finally, we add back k. It gets the job done, but it’s not as pretty as the other variant (in my opinion anyway), and it takes some thinking to convince yourself that it works at all.
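
For reference, here are all three variants spelled out as code (a sketch; N must be a power of 2 throughout, and unsigned wraparound is fine as discussed above):

// round x up to the nearest value congruent to k (mod N)
uint32_t round_up_congruent(uint32_t x, uint32_t N, uint32_t k) {
    return x + ((k - x) & (N - 1));
}

// round x down to the nearest value congruent to k (mod N)
uint32_t round_down_congruent(uint32_t x, uint32_t N, uint32_t k) {
    return x - ((x - k) & (N - 1));
}

// the origin-shifting variant from the last paragraph
uint32_t round_up_congruent2(uint32_t x, uint32_t N, uint32_t k) {
    return ((x - k + N - 1) & ~(N - 1)) + k;
}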


Reading bits in far too many ways (part 1)

Resurrecting an old tradition of this blog, let’s take a simple problem, go over a way too long list of alternative solutions, and evaluate their merits.

Our simple problem is this: we want to read data encoded using some variable-bit-length code from a byte stream. Reading individual bytes, machine words etc. is directly supported by most CPUs and many programming languages, but for bit-granularity IO, you generally need to implement it yourself.

This sounds simple enough, and in some sense it is. The first source of problems is that this operation tends to be a hot spot in codecs – and yes, compute bound, not memory or I/O bound. So we’d like not just an implementation that works; we’d like it to be efficient as well. And along the way we’ll run into many other complications: interactions with IO buffering, end-of-buffer handling, corner cases in the way bit shifts are specified both in C/C++ and in various processor architectures, as well as other bit shift peculiarities.

I’ll mainly focus on many different ways to handle the reader side in this post; essentially all techniques covered in here apply equally to the writer side, but I don’t want to double the number of algorithm variations I’m presenting. There will be plenty as it is.

Degrees of freedom

“Read a variable number of bits” is not a sufficient problem specification. There are lots of plausible ways to pack bits into bytes, and all have their strengths and weaknesses that I’ll go into later. For now, let’s just cover the differences.

The first major decision to make is whether fields are packed MSB-first or LSB-first (“most significant bit” and “least significant bit”, respectively). That is, if we call our to-be-implemented function getbits and run the code sequence

a = getbits(4);
b = getbits(3);

on some bitstream that we just opened, we expect both values to come from the same byte, but how are they arranged in that byte? If bits are packed MSB-first, then “a” occupies 4 bits starting at the MSB, and “b” is below “a”, leading to this arrangement:

MSB-first bit packing

I’m numbering bits with the LSB being bit 0 and increasing towards the MSB, which is the conventional ordering in most contexts. LSB-first packing is the opposite, where the first field occupies bit 0 and later fields proceed upwards:

LSB-first bit packing

Both conventions are used in everyday file formats. For example, JPEG uses MSB-first bit packing for its bitstream, and DEFLATE (zip) uses LSB-first.

The next question we have to settle is what’s supposed to happen when a value ends up spanning multiple bytes. Say we have another value, “c”, that we want to encode in 5 bits. What do we want to end up with? We can postpone the issue slightly by declaring that we’re packing values into 32-bit words or 64-bit words and not bytes, but ultimately we’ll have to settle on something. This is where we suddenly get a whole lot of different variants, and I’m only going to cover the main contenders.

We can think of MSB-first bit packing as iterating over our bitfield “c” from its MSB towards the LSB, inserting one bit at a time. And once we fill up one byte, we start with the next byte. Following these rules for our new bitfield c, this is where the bits end up in our stream:

MSB-first bit packing across multiple bytes

Note that by following these rules, we end up with the same two bytes we would have gotten from MSB-first bit packing into a big integer and then storing it in big-endian byte order. If we had instead decided to split c such that its LSB goes into the first byte and the four higher-order bits into the second byte, this wouldn’t have worked. I’m going to call bit-packing rules that are self-consistent like that “natural” rules.

LSB-first bit-packing of course also has its corresponding natural rule, which is to insert the new value bit by bit starting from the LSB upwards, and if we do that here, we end up with this bit stream:

LSB-first bit packing across multiple bytes

LSB-first natural packing gives us the same bytes as LSB-first packing into a big integer then storing it in little-endian byte order. Also, we’re clearly running into some awkwardness with the drawing here: the logically contiguous packing of c into multiple bytes looks discontinuous when drawing it like this, whereas the drawing for the MSB-first packing looked the way you’d expect. But there’s trouble brewing there too: in the MSB-first drawing, we’re numbering the bits in increasing order from right to left (as is common) and the bytes in increasing order from left to right (as is also common).

Here’s what happens with the LSB-first bit diagram if we draw bit 0 (the LSB) in each byte at the left and bit 7 (the MSB) at the right:

LSB-first bitpacking across multiple bytes, LSB->MSB from left to right

If you draw it this way, it looks the way you’d expect. Putting the MSB on the right feels weird when thinking of a byte as a number and much less weird when you just think about it as an array of 8 bits (which is, effectively, what we’re treating it as when we’re doing bit-wise IO).

Incidentally, some big-endian architectures (e.g. IBM POWER) do number bits this way – bit 0 is the MSB, bit 31 (or 63) is the LSB. Diagramming MSB-first bit packing on such a machine with bit 0=MSB, and also numbering our own bit fields so that their bit 0 corresponds to the MSB, we’d get the exact same diagram (but it would mean something slightly different). This convention makes bit and byte ordering consistent (nice) but breaks the handy convention of bit k corresponding to a value of 2^k (less nice).

And if the idea of renumbering bits so bit 0 is the MSB breaks your head, you can also leave the bit numbering as it is but be a rebel and draw byte addresses increasing to the left. Or alternatively keep drawing increasing addresses to the right but be a slightly different kind of rebel and write your byte stream backwards. If you do either of those, you end up with this layout:

LSB-first bitpacking across multiple bytes, reverse byte order

I realize this is getting confusing, and I’ll stop now, but I’m trying to make an actual point here: you should not get too attached to the way this stuff is drawn. It’s easy to fool yourself into thinking that one variant is better than another because it looks prettier, but the conventions of drawing bytes left to right and bits within them right to left are completely arbitrary, and no matter which one you pick, they always end up having that One Weird Thing about them.

It turns out that MSB-first and LSB-first packing conventions both have advantages and disadvantages, and it’s much more useful to think of them as tools with different areas of application than it is to designate one as the “right way” and the other as the “wrong way”. As for byte order, and whether to pack values into bytes, words, or something larger, I highly recommend that you use whatever the natural order for your bit-packing convention is: MSB-first corresponds naturally to big-endian style byte ordering, and LSB-first corresponds naturally to little-endian byte ordering. Unless you’re writing byte streams in reverse—believe it or not, there’s good reasons to do that too—in which case we have reverse-order MSB-first corresponding to little-endian and reverse-order LSB-first corresponding to big-endian.

The reason to prefer “natural” orders is that they tend to give you more freedom for different implementations. A natural-order stream admits a variety of different decoders (and encoders), all with different trade-offs (and good on different targets). “Unnatural” orders are usually designed with exactly one implementation in mind and tend to be very awkward to decode any other way.

Our first getbits (bit extract)

Now that we’ve specified the problem sufficiently, we can implement a solution. A particularly simple version is possible if we assume the entire bit stream is sequential in memory (as a byte array) and also completely ignore, for the time being, such pesky issues as running into the end of the array. Let’s just pretend it’s infinite! (Or large enough and zero-padded, anyway.)

In that case, we can base everything on a purely functional “bit extract” function, which I am then going to use to illustrate a number of issues that come up in all kinds of bit-reading functions. Let’s start with LSB-first bit packing:

// Treating buf[] as a giant little-endian integer, grab "width"
// bits starting at bit number "pos" (LSB=bit 0).
uint64_t bit_extract_lsb(const uint8_t *buf, size_t pos, int width) {
    assert(width >= 0 && width <= 64 - 7);

    // Read a 64-bit little-endian number starting from the byte
    // containing bit number "pos" (relative to "buf").
    uint64_t bits = read64LE(&buf[pos / 8]);

    // Shift out the bits inside the first byte that we've
    // already consumed.
    // After this, the LSB of our bit field is in the LSB of bits.
    bits >>= pos % 8;

    // Return the low "width" bits, zeroing the rest via bit mask.
    return bits & ((1ull << width) - 1);
}

// State variable, assumed to be local variables, or factored
// into some object; hopefully not actual globals.
const uint8_t *bitstream; // The input bitstream
size_t bit_pos; // Current position in the stream, in bits.

uint64_t getbits_extract_lsb(int width) {
    // Read the bits
    uint64_t result = bit_extract_lsb(bitstream, bit_pos, width);
    // Advance the cursor
    bit_pos += width;
    return result;
}

We’re just using the fact I noted earlier that a LSB-first bit stream is just a big little-endian number. We first grab 64 contiguous byte-aligned bits starting from the first byte containing any bits we care about, do a right shift to get rid of the remaining 0-7 extra bits below the first bit we care about, and then return the result masked to the desired width.

Depending on the value of pos, that right shift can cost us up to 7 extra bits. So even though we read a full 64 bits, the maximum number of bits we can read in one go with the code above is 64-7=57 bits.

With bit_extract in hand, getbits is straightforward; we just keep track of the current position in the bitstream (in bits), and increment it after reading. Easy.
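
The code above assumes a little read64LE helper (and the MSB-first variant below assumes a matching read64BE). In case you’re wondering what those look like, here’s a portable sketch; decent compilers recognize patterns like these and turn them into a single 64-bit load (plus a byte swap where necessary):

uint64_t read64LE(const uint8_t *p) {
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits |= (uint64_t) p[i] << (i * 8);
    return bits;
}

uint64_t read64BE(const uint8_t *p) {
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | p[i];
    return bits;
}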

The corresponding MSB-first variant works out quite similarly, except for one annoying issue I’ll explain after showing the code:

// Treating buf[] as a giant big-endian integer, grab "width"
// bits starting at bit number "pos" (MSB=bit 0).
uint64_t bit_extract_msb(const uint8_t *buf, size_t pos, int width) {
    assert(width >= 1 && width <= 64 - 7);

    // Read a 64-bit big-endian number starting from the byte
    // containing bit number "pos" (relative to "buf").
    uint64_t bits = read64BE(&buf[pos / 8]);

    // Shift out the bits we've already consumed.
    // After this, the MSB of our bit field is in the MSB of bits.
    bits <<= pos % 8;

    // Return the top "width" bits.
    return bits >> (64 - width);
}

uint64_t getbits_extract_msb(int width) {
    // Read the bits
    uint64_t result = bit_extract_msb(bitstream, bit_pos, width);
    // Advance the cursor
    bit_pos += width;
    return result;
}

This works very similarly to the version before: read 64 contiguous byte-aligned bits (big-endian this time), do a left shift to align the top of the bit field we want with the MSB of bits (where before, we did a right shift to align the bottom of our bit field with the LSB of bits), and then do a right shift to place the top width bits at the bottom so we can return them, because when somebody calls getbits(3), they generally expect to see a value between 0 and 7 inclusive.

Shift boundary cases

So what’s the problem? Well, this version doesn’t allow width to be zero. The issue is that if we allow width == 0, then the final shift will try to right-shift a 64-bit value by 64 bits, and that’s undefined behavior in C/C++! In this case, we may only shift by 0 through 63.

Now, in some cases, C/C++ leave details unspecified for backwards-compatibility with machines essentially nobody cares about at this point. Not requiring that the representation of signed numbers use two’s complement being a famous example; while non-two’s complement architectures exist, they’re all museum pieces at this point.

Sadly, this is not one of these cases. Here’s what happens on different widespread CPU architectures when the shift amount is out of range:

  • For 32-bit x86 and x86-64, shift amounts are interpreted mod 32 for operand widths of 32 bits and lower, and mod 64 for 64-bit operands. So right-shifting a 64-bit value by 64 will yield the same as shifting by 0, i.e. a no-op.
  • In 32-bit ARM (A32/T32 instruction sets), shift amounts are taken mod 256. Right-shifting a 32-bit value by 32 (or 64) will hence yield 0, as will right-shifting it by 255, but right-shifting it by 256 will leave the value untouched.
  • In 64-bit ARM (A64 ISA), shift amounts are taken mod 32 for 32-bit shifts and mod 64 for 64-bit shifts (essentially the same as x86-64).
  • RISC-V also follows the same rule: 32-bit shift distances are mod 32, 64-bit shift distances are mod 64.
  • For POWER/PowerPC, 32-bit shifts take the shift amount mod 64 and 64-bit shifts take the shift amount mod 128.

To make matters even more confusing, in most of these instruction sets with SIMD extensions, the SIMD integer instructions have different out-of-range shift behavior than the non-SIMD instructions. In short, this is one of those cases where there are actual architectural differences between mainstream platforms; POWER and RISC-V might be a bit obscure for most of you, but say 32-bit vs. 64-bit ARM both have hundreds of millions of devices sold at this point.

Therefore, even if all the compiler does with that right-shift is to emit the corresponding right-shift instruction for the target architecture (which is generally what happens), you will still see different behavior on different architectures, and on both ARM A64 and x86-64, the result of a shift by 64 will be effectively a no-op, and getbits(0) will therefore (generally) return a non-0 value, where one would expect that it’s always zero.

Why does it matter in the first place? Having a hard-coded getbits(0) is indeed not an interesting use case; but sometimes you might want to perform a getbits(x) for some variable x, where x can be zero in some cases, and it’s nice for that case to just work and not require some special-case testing.

If you really need that case to work, one option is to explicitly test for width == 0 and handle that specially; another is to use a branchless expression that works for zero widths, for example this variant used in Yann Collet‘s FSE:

    // Return the top "width" bits, avoiding width==0 edge cases.
    return (bits >> 1) >> (63 - width);

This particular case is easier to handle with LSB-first bit streams. And while I’m mentioning them, let’s talk about the masking operation I used to isolate the lowest width bits:

    // Return the low "width" bits, zeroing the rest via bit mask.
    return bits & ((1ull << width) - 1);

There’s an equivalent form that is slightly cheaper on architectures with a three-operand and-with-complement (AND-NOT) instruction. This includes many RISC CPUs as well as x86s with BMI1. Namely, we can take a mask of all-1 bits, left-shift it to introduce width zeroes at the bottom, then complement the whole thing:

    return bits & ~(~0ull << width);

If you’re on an x86 with not just BMI1, but also BMI2 support, you can also use the BZHI instruction that is tailor-made for this use case, if you can figure out how to make your compiler emit it (or use assembly). Yet another option that is advantageous in some cases is to simply prepare a small look-up table of bit masks, which then simplifies the code into

    return bits & width_to_mask_table[width];

It may feel ridiculous to prepare a lookup table storing the result of two integer operations, especially since computing the address of the table element to load usually involves both a shift and an addition—the exact two operations we’d be performing if we didn’t have the table!—but there’s a method to this madness: the required address calculation can be done as part of the memory access in a single load instruction on e.g. x86 and ARM machines, and so this shift and add get executed on an Address Generation Unit (AGU) as part of the load pipeline of the CPU, not with integer arithmetic and shift instructions. So counter-intuitive as it may sound, replacing two integer ALU instructions with one integer load instruction like this can result in a noticeable speed-up in bit-IO-intensive code, because it balances the workload better between the available execution units.
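
For completeness, here’s one way to set up that table (a sketch; the width == 64 entry needs special handling, because 1ull << 64 is undefined, as discussed above):

uint64_t width_to_mask_table[65];

void init_mask_table(void) {
    for (int i = 0; i < 64; ++i)
        width_to_mask_table[i] = (1ull << i) - 1;
    width_to_mask_table[64] = ~0ull; // 1ull << 64 would be UB
}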

Another interesting property is that the LSB-first version using the mask table lookup only performs one shift by a variable amount (to shift out the already-consumed bits). This matters because, for a variety of (often banal) reasons, integer shifts by a variable amount are more expensive than shifts by a compile-time constant amount on many micro-architectures. For example, the Cell PPU and Xbox 360 Xenon CPU were infamous for variable-distance shifts that stalled the core for a whopping 12 cycles, whereas regular shifts were pipelined and would complete in two cycles. Less drastically, on many Intel x86 microarchitectures, “traditional” variable x86 shifts (SHR reg, cl and friends) are about three times as expensive as shifts by a compile-time constant.

This seems like another way in which the MSB-first variant is behind: the variants we’ve seen so far perform either two or three shifts per extract operation, two of which are variable-distance. But there’s a trick to get down to a single shift-type operation, namely by using a bit rotate (cyclical shift) instead of a regular shift:

    // Treating buf[] as a giant big-endian integer, grab "width"
    // bits starting at bit number "pos" (MSB=bit 0).
    uint64_t bit_extract_rot(const uint8_t *buf, size_t pos, int width) {
        assert(width >= 0 && width <= 64 - 7);

        // Read a 64-bit big-endian number starting from the byte
        // containing bit number "pos" (relative to "buf").
        uint64_t bits = read64BE(&buf[pos >> 3]);

        // Rotate left to align the bottom of our bit field with the LSB.
        bits = rotate_left64(bits, (pos & 7) + width);

        // Return the bottom "width" bits.
        // (Using a table here, but the other ways of masking discussed
        // for LSB-first bit IO also work.)
        return bits & width_to_mask_table[width];
    }

Here, the rotate-left (which you need to figure out how to best access in your C compiler yourself; I want to avoid further digressions) first takes up the work of the original left-shift, and then rotates by an extra “width” bits to make our bit field wrap around from the most-significant bits of the value down to the least-significant, where we can then mask it the same way as in the LSB-first variant.
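For reference, here is one portable formulation that current GCC, Clang and MSVC versions generally pattern-match into a single rotate instruction (a sketch; do verify what your compiler actually emits):

    #include <stdint.h>

    // Rotate x left by amt bits. Written with masked shift counts so that
    // amt == 0 (and amt == 64) avoid undefined behavior; this is also the
    // shape compilers recognize and turn into a hardware rotate.
    static inline uint64_t rotate_left64(uint64_t x, unsigned amt) {
        return (x << (amt & 63)) | (x >> (-amt & 63));
    }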

Oh, and I’ve also started writing the division by 8 as shift-right by 3, and the mod-8 of our unsigned position as a binary AND with 7. These are equivalent substitutions, and you’ll see both forms in real-world code, so I figured I’d mix it up a little.

This rotate-based extraction was how we (being RAD Game Tools) used to perform bit reading on the Cell PPU and Xbox 360, purely because the variable-distance shifts were so dismal on that particular core. And I’ll note that this version also has no problems whatsoever with width == 0; the only drawback is its dependence on rotate instructions, which are available (and fast) on most architectures, but tend to be somewhat inconvenient to access from C code.

At this point, I’ve talked a lot about many, many different ways to do bit shifting and masking, and I’ll be referring back to these later. What I haven’t shown you yet is any actual practical bit IO logic—the bit extraction form is a useful starting point, and convenient for explaining these concepts without having to worry about state, but you’re unlikely to ever use it as presented in any production code, not least because it assumes that the entire bit stream is in memory and will gleefully read past the end of the buffer during normal operation!

However, this post is already long enough that WordPress editing is starting to get sluggish on me. So what I’ll do is split this up into a multi-part series; you get to take a breather, and we’ll meet back in the next part. Until then!
