
Underrust: What does vec![0u8; 1024] really do?

In Rust vec![0u8; 1024] allocates a 1K block of memory on the heap and sets it to zero. It does that by calling libc’s calloc. What does that do? What if we fill it with 1s instead of 0s? What if we allocate a much bigger chunk? Let’s dig in.

I’m using Rust 1.79 nightly and glibc 2.38. Most of this will be specific to 64-bit Linux, and when we really get into it, it will be specific to the CPU; here an Intel i7 (Tiger Lake), with four Willow Cove cores.

Here’s my test program arr.rs:

const SIZE: usize = 1024;
fn main() {
    let mut a = vec![0u8; SIZE];

    // make the page real and defeat the optimizer
    let pos = (&a as *const _ as u32 % SIZE as u32) as usize;
    a[pos] = 42;
    std::process::exit(a[pos + 1] as i32);
}

Only the first line is what we’re testing. The last three lines use the memory at a random address (see A random number you already have) to prevent the optimizer from removing our allocation, and to make sure the page actually exists.

Expand the macro

vec! is a macro so the very first task is expanding it:

$ cargo rustc --bin arr --profile=check -- -Zunpretty=expanded

It becomes alloc::vec::from_elem(0u8, 1024) (from_elem source) which is not visible in the docs because it’s marked as #[doc(hidden)], although it’s otherwise a normal public function.
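Trimmed down, our main now starts like this (abridged from the expanded output; the exact paths can vary by toolchain):

    fn main() {
        let mut a = ::alloc::vec::from_elem(0u8, SIZE);
        // ... rest of main unchanged
    }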

After a few more steps, which are collapsed in release mode, we need to find the memory allocator.

Find the allocator

from_elem eventually calls alloc_zeroed - note how we’re already special-cased for 0, we’ll come back to this - which needs to find the allocator.

It does this by calling the special symbol __rust_alloc_zeroed. If we had defined our own global allocator with #[global_allocator], this is where it would get plugged in, and none of what follows would apply to us.
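For illustration, here is a minimal sketch of such a custom allocator. It just forwards to the system allocator, but with it compiled in, __rust_alloc_zeroed would route to our alloc_zeroed instead of the default chain below (MyAlloc is a made-up name):

    use std::alloc::{GlobalAlloc, Layout, System};

    struct MyAlloc;

    unsafe impl GlobalAlloc for MyAlloc {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            System.alloc(layout)
        }
        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            System.dealloc(ptr, layout)
        }
        // alloc_zeroed has a default implementation (alloc + zeroing),
        // but we can override it, just as the default allocator does.
        unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
            System.alloc_zeroed(layout)
        }
    }

    #[global_allocator]
    static GLOBAL: MyAlloc = MyAlloc;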

We’re using the default allocator, so this calls __rdl_alloc_zeroed. That hooks into the Platform Abstraction Layer; the allocation piece lives in sys/pal/unix/alloc.rs.

I’m hand-waving a bit because there’s some internal compiler magic here. I don’t know how it gets wired up.

The platform abstraction layer is what hooks into the actual allocator.

Call libc

This is Rust’s default allocator on Linux:

    #[inline]
    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
        // See the comment above in `alloc` for why this check looks the way it does.
        if layout.align() <= MIN_ALIGN && layout.align() <= layout.size() {
            libc::calloc(layout.size(), 1) as *mut u8
        } else {
            let ptr = self.alloc(layout);
            if !ptr.is_null() {
                ptr::write_bytes(ptr, 0, layout.size());
            }
            ptr
        }
    }

Alignment for a byte array is 1, size is 1024, and MIN_ALIGN on x86_64 is 16. This minimum alignment is defined by the ABI: malloc guarantees that every pointer it returns is at least 16-byte aligned.

I’m using the u8 (byte) type in all these examples because the type doesn’t matter. The allocator allocates byte arrays, which we can use to store N bytes, or Q objects of M bytes each, as long as Q*M is N.

The code above means that unless our allocation needs alignment larger than 16, or is smaller than its alignment, we’re calling calloc. And we’re leaving Kansas.
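We can check which branch a given allocation takes by mirroring that if condition ourselves. A quick sketch (MIN_ALIGN copied from the std source above):

    use std::alloc::Layout;

    const MIN_ALIGN: usize = 16; // x86_64 value from the std source above

    fn takes_calloc_path(layout: Layout) -> bool {
        layout.align() <= MIN_ALIGN && layout.align() <= layout.size()
    }

    fn main() {
        // Our byte array: align 1, size 1024 -> the calloc fast path.
        assert!(takes_calloc_path(Layout::array::<u8>(1024).unwrap()));

        // An over-aligned type takes the alloc + write_bytes path instead.
        #[repr(align(64))]
        struct CacheLine([u8; 64]);
        assert!(!takes_calloc_path(Layout::new::<CacheLine>()));
    }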

Asking the kernel for some memory

We’re in C land now, no more Rust. How is memory allocated?

I wrote about this in How memory is allocated. It’s actually really simple: We ask the kernel to move the top of the heap, also called the “program break”, up. That new space is our memory. To free memory we move the break back down.

If we strace our program we’ll see these two lines:

    brk(NULL)                               = 0x55fc418b4000
    brk(0x55fc418d5000)                     = 0x55fc418d5000

The first call queries the position of the program break (libc sbrk(0)). We’ll only need to do this once.

The second call sets the break 132K higher. Why 132K when we only asked for 1K?

This is malloc’s setup call. It will ask for 132K whether you request 1 byte or 1 Gig. malloc needs 640 bytes for its thread-local cache. This is a set of chunks malloc can hand to us without locking the arena, and it is the first place malloc looks when trying to satisfy an allocation request.

Next malloc adds 16 bytes for the chunk header, which stores the size of the previous chunk and the size of this one. malloc adds this to any allocation request; this initial one just happens to come from malloc itself. Our 640 bytes become 656. Additionally, malloc ensures that every address it returns is at least 16-byte aligned, which 656 already is.

System calls are expensive, so the allocator requests memory in large chunks. It does this by adding padding to the request, defined by M_TOP_PAD in mallopt. That defaults to 128K, so we’re at 656 bytes + 128K. malloc only seems to add the M_TOP_PAD padding when it is equal to or greater than the allocation amount.

Finally, the Linux kernel deals in “pages” of memory, 4K each, so the brk request must be rounded up to a 4K multiple. 128K is already a page multiple, but our extra 656 bytes push us over, so we go to the next page boundary: 128K + 4K = 132K. That’s the 132K we see in the brk syscall.
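The whole calculation, as a sketch (constants are the glibc defaults described above):

    // Reproduce the sizing arithmetic behind that first brk call.
    const TCACHE: usize = 640; // malloc's thread-local cache
    const CHUNK_HEADER: usize = 16; // prev-size + size fields
    const M_TOP_PAD: usize = 128 * 1024; // default brk padding
    const PAGE: usize = 4096;

    fn main() {
        let request = TCACHE + CHUNK_HEADER + M_TOP_PAD; // 131,728 bytes
        let rounded = request.next_multiple_of(PAGE); // 135,168 = 132K
        println!("{request} -> {rounded} bytes ({}K)", rounded / 1024);
    }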

Now that malloc has 132K, it will manage this memory, lending chunks to us on request, re-using them on free, and requesting more from the operating system if this initial 132K is all in use. Using the program at the top of this post, if we print the address of the Vec’s backing buffer (a.as_ptr()) it will be in this managed memory, somewhere between the previous position of the break (0x55fc418b4000) and the new position (0x55fc418d5000).
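We can confirm where the buffer lands from inside the program by reading the break on either side of the allocation. A sketch using the libc crate, assuming the vec really is our program’s first allocation:

    fn main() {
        let before = unsafe { libc::sbrk(0) }; // current program break
        let a = vec![0u8; 1024];
        let after = unsafe { libc::sbrk(0) }; // break after malloc's setup
        // The buffer sits inside the newly claimed 132K.
        println!("break {before:p} -> {after:p}, buffer at {:p}", a.as_ptr());
    }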

memset to zero

Finally, we need to zero that memory.

Or do we? Memory the OS gives us, whether via brk or mmap (see later), is always zeroed, to prevent us reading data previously owned by another process. The kernel does this zeroing lazily: it maps each new page to a special zero page, which is copied-on-write to a physical memory page as needed - more on this later.

calloc does have an optimization to not clear freshly allocated memory, but that optimization doesn’t trigger here. An if statement prevents it when our allocation size (csz in the linked code, 1K) is less than oldtopsize (132K), which it is. I don’t understand this.

The only time we need to zero memory is if malloc is re-using a chunk we previously released with free. calloc gets memory from malloc. I guess it doesn’t know if the memory is being re-used, so it always clears it.
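A quick safe-Rust sketch shows why the clearing matters: dirty a buffer, free it, then immediately ask for a zeroed one of the same size. malloc will very likely hand back the same chunk, and calloc must clean it:

    fn main() {
        let mut a = vec![0xAAu8; 1024];
        a[512] = 99; // dirty the chunk
        drop(a); // the chunk goes back to malloc's free lists
        let b = vec![0u8; 1024]; // likely the same chunk, now via calloc
        assert!(b.iter().all(|&x| x == 0)); // memset did its job
    }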

calloc calls memset to do that clearing, and memset is very fast because:

  1. Writing zeros to memory is something computers do a lot. Accessing uninitialized memory is a reliable source of bugs, and modern languages try to prevent us doing it. Go, for example, always zeroes memory before letting us have it.

  2. Writing zeros to memory is a fun thing to optimize.

As a result memset is an incredibly well optimized operation. There are specialized variants for every instruction set (SSE, AVX2, AVX512, etc), for whether the memory is aligned or not, and so on.

On my CPU, for 1024 values, this is the heart of the memset loop:

    vmovdqa64 YMMWORD PTR [rdi],ymm16
    vmovdqa64 YMMWORD PTR [rdi+0x20],ymm16
    vmovdqa64 YMMWORD PTR [rdi+0x40],ymm16
    vmovdqa64 YMMWORD PTR [rdi+0x60],ymm16
    sub    rdi,0xffffffffffffff80 # add 128 to rdi
    cmp    rdi,rcx
    jb     0x7ffff7f10300 <__memset_evex_unaligned_erms+192> # Jump to top if we haven't reached rcx yet

ymm16 is a 32 byte (not bit!) register containing all zeros. vmovdqa64 writes it to memory.

rdi points to the memory we have allocated and are zeroing, working from bottom to top. rcx is the aligned top of our new allocation. Note how it adds 128 by subtracting an enormous number: 0xffffffffffffff80 is -128 in two’s complement, and subtracting -128 fits in a one-byte immediate where adding +128 would not.

It’s writing 128 bytes to memory at a time. If I’m reading the diagrams correctly, a Willow Cove core has four memory access ports so maybe those four memory writes can be in flight at the same time.

There is a version of that instruction that takes a zmm register, which is 64 bytes, so we could be writing 256 bytes at once. My Tiger Lake CPU has AVX-512 and my RUSTFLAGS has -Ctarget-cpu=native, and yet it’s not being used. My guess is that AVX-512’s very aggressive thermal throttling slows the CPU down enough that the larger writes are no longer worth it.

The two important points here are:

  1. yes it’s incredibly well optimized
  2. but it really is writing a thousand zeros

It’s loading this memory into our caches, evicting other data. There’s no special hardware that magics the memory to zero.

That’s it! Our vec![0u8; 1024] is ready for use.

What about bigger allocations?

A 1K allocation does write a thousand zeros, but a 1 MiB allocation doesn’t write a million zeros.

calloc is really malloc and then (maybe) memset. Let’s look at the two parts separately in reverse order.

memset

What about vec![0u8; 2048] or vec![0u8; 64 * 1024]?

For allocations of 2048 bytes or more (tunable in glibc via glibc.cpu.x86_rep_stosb_threshold), memset switches to this single repeating instruction:

    rep stos BYTE PTR es:[rdi],al

al contains 0, rdi is our memory, and rcx is the size of the allocation. The stos instruction stores the byte in al at the address in rdi, then moves rdi forward. The rep prefix repeats the operation rcx times. Modern CPUs have special support for this instruction pair, and as a result, as that page says:

Note that a REP STOS instruction is the fastest way to initialize a large block of memory.

This applies all the way to the mmap threshold (see below), so a 64K buffer will be rep stosb’d 64K times, even if I only ever access a single byte, and even though, if this is our first use of the allocation, it’s already zeroed.
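If you want to experiment with where the switch happens, glibc lets you override the threshold at run time through the GLIBC_TUNABLES environment variable (value in bytes; assumes your glibc was built with tunables enabled):

    $ GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=65536 ./target/release/arr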

For allocations over 128K memset doesn’t run at all in user space. Read on.

malloc

What about vec![0u8; 1024 * 1024]?

For allocations over 128K (usually; see MMAP_THRESHOLD in mallopt) malloc will allocate our chunk with an anonymous mmap, and this is a big change. malloc is no longer handing out memory it manages and recycles. Instead it creates a private anonymous mapping of exactly our size and gives it to us. The kernel pretends to allocate it, but really just does some bookkeeping and returns us an address.
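We can drag that threshold down and watch the strategy change. A sketch using mallopt via the libc crate (glibc-specific):

    fn main() {
        // Force even small allocations to be served by anonymous mmap.
        unsafe { libc::mallopt(libc::M_MMAP_THRESHOLD, 0) };
        let a = vec![0u8; 1024]; // should now come from mmap - check with strace
        std::process::exit(a[0] as i32);
    }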

mmap is typically used to map a disk file into memory so that we can access the file as if it were a large array in memory.

Initially none of our pages are backed by real memory. When we try to access one, a page fault occurs. In a regular file-backed mmap the kernel would load the data from disk, put it in memory, and resume as if it had always been there.

For our anonymous mapping there is no file. The kernel has an optimization to read from its zero page instead of allocating a real page. This is the magic that makes memset unnecessary for larger allocations: we know the memory isn’t recycled, and we know the kernel gives it to us zeroed. Here is where the glibc calloc source skips zeroing mmapped pages (perturb_byte is a testing and security feature that is usually disabled):

  /* Two optional cases in which clearing not necessary */
  if (chunk_is_mmapped (p))
    {
      if (__builtin_expect (perturb_byte, 0))
        return memset (mem, 0, sz);

      return mem;
    }

Only if the page fault is caused by a write does the kernel need to provide us with physical memory. It is at that point that it zeroes the page, copying the backing zero page to a physical memory page. Although user space doesn’t need to run memset, the kernel is still writing zeros to memory. We’re not avoiding the cost of the operation, just moving it from user space to kernel space.
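One way to watch the cost move into the kernel is to count page faults. A sketch to run under perf stat -e page-faults (a variation on the program at the top of this post; counts will vary):

    const SIZE: usize = 64 * 1024 * 1024;

    fn main() {
        let mut a = vec![0u8; SIZE]; // anonymous mmap; no physical pages yet
        // Touch one byte per 4K page. Each write faults, and the kernel
        // zeroes a fresh physical page for us - memset never runs.
        for i in (0..SIZE).step_by(4096) {
            a[i] = 1;
        }
        std::process::exit(a[0] as i32);
    }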

All of this means our programs don’t need to do clever tricks with uninitialized memory. The kernel already does the optimal thing, and we can’t really avoid that cost. The kernel is not going to let us have uninitialized memory.

Non-zero initial values

Earlier our function call specialized to alloc_zeroed. There are other special cases for zero in the call chain: the IsZero trait and the AllocInit::Zeroed enum variant (I skipped all that earlier).

Small ones are the same

So how does this change if we do this?

vec![1u8; 1024]

At that size, not much. We can’t use calloc so SpecFromElem will do malloc + memset itself.

First it does Vec::with_capacity_in which, using the default allocator, calls __rdl_alloc, which calls libc::malloc(layout.size()). That still moves the program break; no change there.

Next it initializes the memory with ptr::write_bytes.

This is an interesting function because it’s an “intrinsic”: there’s no body. The Rust compiler wires it up to the memset intrinsic that LLVM provides (llvm.memset). In other words, we leave the function blank and ask LLVM to fill it in at build time. LLVM lowers that intrinsic to a call to libc’s memset (the ‘normal’ one), so with a few extra hoops, we’re back in the same memset the zero-value initialization took us to.

memset is the same whatever value it is storing, so 0 or 1, it’s the same code.
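Put together, the non-zero path for small allocations is morally equivalent to this sketch (simplified; the real SpecFromElem code is generic and more careful):

    use std::ptr;

    fn byte_vec(val: u8, n: usize) -> Vec<u8> {
        let mut v: Vec<u8> = Vec::with_capacity(n); // __rdl_alloc -> libc::malloc
        unsafe {
            ptr::write_bytes(v.as_mut_ptr(), val, n); // llvm.memset -> libc memset
            v.set_len(n);
        }
        v
    }

    fn main() {
        let a = byte_vec(1, 1024); // equivalent to vec![1u8; 1024]
        assert_eq!(a[0], 1);
    }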

Big ones are expensive

What about this?

vec![1u8; 1024 * 1024]

Once we reach the mmap threshold things change a lot. We can no longer rely on the zero-page optimization. Now we really do have to write a million ones to memory.

We can confirm this with perf stat. In the program at the start of this post change SIZE to 1024 * 1024 but leave the initial value at 0u8. Build it and run perf stat -e cycles ./target/release/arr. I get about 400k cycles.

Change the 0u8 to 1u8 and re-run perf stat. Now I get 750k cycles, and the count grows with SIZE. The 0 version’s did not.

This cycle difference between initializing with 0s and 1s only appears once we hit the mmap threshold.

Conclusions

  • Rust vector initialization uses either libc’s calloc (0 values) or malloc + memset (non-0 values).
  • Initialize with 0s, not any other value. Our types should have an all-zero default value.
  • If we’re only using part of a large allocation we don’t need tricks with uninitialized memory and unsafe Rust functions. The kernel already handles this.

Happy allocating!