Page-Fault-Safe SIMD Parsing of Variable-Length Unsigned Integers
Parsing variable-length unsigned integers with SIMD while avoiding page faults is a subtle problem that combines Win32 paging behavior, x86 SIMD semantics, and C’s undefined-behavior landmines.
Problem: SIMD over untrusted string tails
A common micro-optimization is to parse decimal strings into integers using SIMD, loading 16 bytes at a time into an __m128i and processing multiple digits in parallel. For short or untrusted inputs, it is tempting to assume there are at least 16 accessible bytes from data, then do _mm_loadu_si128((__m128i*)data) for the “tail” of the string and mask off non-digit bytes.
This falls apart when data is near the end of a page and the string is short; a 16-byte load can cross into an unmapped page and crash with an access violation.
Page boundaries vs. tail loads
On modern x86, the performance difference between aligned and unaligned 16-byte loads is small; the real danger is crossing a page boundary into an unmapped page. A 16-byte unaligned load is fine as long as all touched pages are mapped and readable, but if data points to the last few bytes of a page, a naive load at data may access bytes on the next page, which might not exist.
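To make the boundary condition concrete, here is a minimal sketch of a check for whether a read touches more than one page. The 4096-byte page size and the name read_crosses_page are assumptions for illustration; production code should query the real page size from the OS.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size: 4 KiB is typical on x86 but not guaranteed. */
#define ASSUMED_PAGE_SIZE ((uintptr_t)4096)

/* Hypothetical helper: true if reading `len` bytes starting at `addr`
 * touches more than one page. */
static bool read_crosses_page(uintptr_t addr, size_t len)
{
    if (len == 0) {
        return false;
    }
    uintptr_t first_page = addr / ASSUMED_PAGE_SIZE;
    uintptr_t last_page = (addr + len - 1) / ASSUMED_PAGE_SIZE;
    return first_page != last_page;
}
```

A 16-byte load starting 16 bytes before the end of a page stays inside it; one starting 15 bytes before the end spills a single byte into the next page.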
The “problem size” is small: the tail needs at most 16 bytes, and because page sizes are multiples of 16, a 16-byte-aligned load can never straddle a page boundary. The logic can therefore reason in 16-byte units without knowing the OS page size. This leads to a pattern where you detect “safe” versus “dangerous” positions relative to the page boundary, and adjust the load address accordingly.
A robust tail-load pattern
One robust pattern to safely load up to 16 bytes from a potentially short string without page faults is:
- Process full 16-byte SIMD blocks in a loop while size >= 16, with straightforward loads at data.
- For the tail (size < 16), compute a load address that is guaranteed not to cross a page boundary while still covering all bytes you need, then shuffle and mask digits into place.
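The first step of this pattern can be sketched as follows. This is an illustration only: the function name scan_full_digit_blocks is hypothetical, and the per-block digit validation stands in for whatever real work the loop does. It uses only SSE2 intrinsics and leaves the tail to the caller.

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>

/* Hypothetical helper: consumes whole 16-byte blocks that contain only
 * ASCII digits and returns the number of bytes consumed. The remaining
 * tail starts at data + result and is shorter than 16 bytes, unless a
 * non-digit block stopped the loop early. */
static size_t scan_full_digit_blocks(const unsigned char *data, size_t size)
{
    const __m128i below_zero = _mm_set1_epi8('0' - 1);
    const __m128i above_nine = _mm_set1_epi8('9' + 1);
    size_t done = 0;

    while (size - done >= 16) {
        /* Safe unaligned load: all 16 bytes belong to the string. */
        __m128i v = _mm_loadu_si128((const __m128i *)(data + done));
        /* Signed compares: bytes >= 0x80 are negative and fail the first test. */
        __m128i is_digit = _mm_and_si128(_mm_cmpgt_epi8(v, below_zero),
                                         _mm_cmplt_epi8(v, above_nine));
        if (_mm_movemask_epi8(is_digit) != 0xFFFF) {
            break; /* at least one non-digit in this block */
        }
        done += 16;
    }
    return done;
}
```

Because every load in the loop reads only bytes of the string itself, no page reasoning is needed until the tail.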
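The core trick is to load from a 16-byte-aligned base in the same page. When [data, data + size) fits inside a single aligned block, that block covers every byte you need; when the range straddles two blocks, the load address has to be adjusted. After the load, use pshufb or similar to move the interesting bytes down and mask the garbage bytes.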
// data: pointer to first digit
// size: number of bytes (0 < size <= 16)
uintptr_t addr = (uintptr_t)data;
uintptr_t address16 = addr & ~(uintptr_t)0xF; // 16-byte aligned base
size_t extra = (addr + size - 1) & 0xF; // offset of the last useful byte within its own 16-byte block
// Adjust load_base so that we never cross a page boundary
// while still covering all bytes you need.
uintptr_t load_base = address16;
/* Refined logic omitted: real code checks whether data+size
* would cross a 16-byte block & adjusts to the previous one.
* See the original discussion for the minimal-instruction form.
*/
__m128i block = _mm_load_si128((const __m128i*)load_base); // aligned load is valid: load_base is 16-byte aligned
// Then use pshufb + mask / blend to put bytes you care about
// into the low lanes.
The actual formulas discussed minimize instructions and ensure correctness of the boundary check, specifically handling “data sits at the last 16-byte block of a page” versus “data + size spills into the next page”.
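One way to fill in the omitted boundary logic is sketched below. It is not the minimal-instruction form from the discussion, and the name load_tail_page_safe is hypothetical. Two substitutions are made to stay within SSE2: the variable byte shift goes through a small stack buffer instead of pshufb, and the mask comes from a sliding window over a constant table. The reasoning is that if the string fits in one aligned block, an aligned load covers it; otherwise the string occupies two adjacent blocks, both of which must be mapped, and the 16 bytes ending at data + size lie entirely inside those two blocks. Note that the loads deliberately read bytes outside the string, which the C standard does not define; the sketch relies on machine-level behavior.

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the `size` bytes at `data` (1 <= size <= 16)
 * in the low lanes with the remaining lanes zeroed, touching only 16-byte
 * blocks that contain at least one byte of the string. */
static __m128i load_tail_page_safe(const unsigned char *data, size_t size)
{
    /* 16 bytes of 0xFF followed by 16 zero bytes. */
    static const unsigned char mask_table[32] = {
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
    };
    uintptr_t addr = (uintptr_t)data;
    size_t offset = (size_t)(addr & 0xF); /* position inside the aligned block */
    unsigned char spill[32] = {0};        /* upper half stays zero */
    size_t shift;
    __m128i block;

    if (offset + size <= 16) {
        /* Whole string sits in one aligned block: load that block. */
        block = _mm_load_si128((const __m128i *)(addr - offset));
        shift = offset;
    } else {
        /* String straddles two blocks: load the 16 bytes ending at its end. */
        block = _mm_loadu_si128((const __m128i *)(addr + size - 16));
        shift = 16 - size;
    }

    /* Shift right by `shift` bytes with zero fill via store and reload. */
    _mm_storeu_si128((__m128i *)spill, block);
    __m128i shifted = _mm_loadu_si128((const __m128i *)(spill + shift));

    /* Keep only the first `size` bytes. */
    __m128i mask = _mm_loadu_si128((const __m128i *)(mask_table + 16 - size));
    return _mm_and_si128(shifted, mask);
}
```

The store and reload will likely cost a store-forwarding stall, so a pshufb-based shift is the better choice where SSSE3 is available.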
Using memcpy and SWAR for the last 4 digits
For very small tails (for example, 1–4 remaining digits) it is often simpler and competitive to use SWAR (SIMD Within A Register) on a scalar uint32_t rather than another round of SIMD. The key is to avoid undefined behavior: do not cast a potentially misaligned pointer to uint32_t*; use memcpy into a scalar instead.
union {
uint32_t value;
uint8_t bytes[4];
} u;
// Assume the buffer holds at least 4 readable bytes ending at data + size,
// and a little-endian target.
memcpy(&u.value, data + size - 4, 4);
// The tail occupies the high `size` bytes of the loaded word.
// Construct the mask from size without arrays (valid for 1 <= size <= 4):
uint32_t mask = 0xFFFFFFFFu << ((4 - size) * 8);
u.value &= mask;
// Subtract '0' from each kept byte (0x30303030u & mask) and accumulate with SWAR techniques.
The 0x80.pl “parsing decimal numbers” articles show how to do the SWAR math without branching per digit and how to extend to SSE4.1 for 16-digit chunks.
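As a minimal sketch of the SWAR step for a 1 to 4 digit tail, the function below loads the last four bytes of the buffer, zeroes the bytes that precede the tail, and combines the digits with two multiplications. The name parse_tail_digits_swar is hypothetical, and the code assumes a little-endian target, at least 4 readable bytes ending at data + size, and input that has already been validated as digits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: parses the `size` ASCII digits at `data`
 * (1 <= size <= 4). Requires 4 readable bytes ending at data + size. */
static uint32_t parse_tail_digits_swar(const unsigned char *data, size_t size)
{
    uint32_t v;
    /* memcpy avoids the misaligned, aliasing-unsafe uint32_t* cast. */
    memcpy(&v, data + size - 4, 4);

    /* Little-endian: the tail sits in the high `size` bytes. */
    uint32_t mask = 0xFFFFFFFFu << ((4 - size) * 8);
    v &= mask;                /* bytes before the tail become zero digits */
    v -= 0x30303030u & mask;  /* ASCII to 0..9, only for the kept bytes */

    /* Byte 0 becomes d0*10 + d1 and byte 2 becomes d2*10 + d3; no carries,
     * since every byte stays below 100. */
    v = v * 10 + (v >> 8);
    return (v & 0xFF) * 100 + ((v >> 16) & 0xFF);
}
```

For example, with the buffer "5550987", calling the function on the last three bytes yields 987; the leading byte that was loaded alongside them is masked to a zero digit.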
When you can skip the dance
If a higher-level invariant guarantees that there are always at least 16 accessible bytes starting at data (for example, input buffers padded with at least 15 extra readable bytes), then you can load once at data with _mm_loadu_si128 and mask as needed without worrying about page faults. In that case, the page-safe alignment logic is unnecessary overhead.
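Under that padding invariant, the tail load reduces to one unaligned load and one mask. The sketch below assumes the caller guarantees 16 readable bytes at data; the name load_tail_padded and the sliding-window mask table are illustrative choices, not something the original discussion prescribes.

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>

/* Hypothetical helper: returns the first `size` bytes at `data`
 * (0 <= size <= 16) with the remaining lanes zeroed. The caller must
 * guarantee that 16 bytes starting at `data` are readable. */
static __m128i load_tail_padded(const unsigned char *data, size_t size)
{
    /* 16 bytes of 0xFF followed by 16 zero bytes. */
    static const unsigned char mask_table[32] = {
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
    };
    __m128i block = _mm_loadu_si128((const __m128i *)data);
    /* Window starting at 16 - size has exactly `size` leading 0xFF bytes. */
    __m128i mask = _mm_loadu_si128((const __m128i *)(mask_table + 16 - size));
    return _mm_and_si128(block, mask);
}
```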
In summary, safe SIMD tail parsing hinges on picking a load address that never crosses a page boundary while still covering all needed bytes, then using pshufb or SWAR tricks to reposition digits, with tiny tails often best handled in a single scalar register loaded via memcpy.