One of the ways in which the Arm64 architecture differs from x86 is the ability to configure the size of memory pages in the Memory Management Unit (MMU) of the CPU to 4K, 16K, or 64K. This article summarizes what memory page size is, how to configure page size on Linux systems, and when it might make sense to use a different page size for your applications.
Introduction to memory page size
As we previously discussed in Diagnosing and Fixing a Page Fault Performance Issue with Arm64 Atomics, operating systems present a virtual memory address space to applications, and map physical memory pages to virtual memory addresses using a page table. The CPU then provides a mechanism called the Translation Lookaside Buffer (TLB) to ensure that recently accessed pages of memory can be identified and read faster using L1 or L2 CPU cache.
The size of physical memory pages (called granules) on the x86 architecture is a fixed 4KB. On Arm64 systems like Ampere Altra(R) or AmpereOne(R), however, the developer can configure the size of physical memory pages to be 4KB, 16KB, or 64KB.
When to Use Larger Page Sizes?
As changing the page size can impact the memory efficiency and performance of your system, it is important to understand when it makes sense to use a larger page size, and the trade-offs involved. Larger page sizes can lead to less efficient use of memory, because pages that are only partly filled waste space.
For example, if we store 7KB of data in memory, it will use two 4KB pages for a total of 8KB of memory on a system with 4KB kernel pages, an efficiency of 87.5%. On a system with 64KB pages, however, we are now consuming a single 64KB page with 7KB of data, for an efficiency of 11% with the single allocation above.
However, the MMU and the OS kernel are smart enough to use contiguous blocks of memory that have previously been allocated but are not full for future memory allocations. If the same process allocates 32KB of memory later, we are still only using one 64KB page with 39KB occupied. With a 4K page size, we will now be managing ten 4KB pages.
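The arithmetic in this example can be reproduced with a small shell sketch. The function names pages_needed and efficiency_pct are hypothetical helpers introduced here for illustration, and the model is deliberately simplified: it assumes the data is packed into as few whole pages as possible and ignores allocator overhead.

```shell
#!/bin/sh
# pages_needed DATA_KB PAGE_KB
# Number of whole pages required to hold DATA_KB (ceiling division).
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# efficiency_pct DATA_KB PAGE_KB
# Percentage of the allocated page memory that actually holds data.
efficiency_pct() {
    pages=$(pages_needed "$1" "$2")
    awk -v d="$1" -v p="$2" -v n="$pages" \
        'BEGIN { printf "%.1f\n", 100 * d / (n * p) }'
}

# 7KB is the first allocation from the text; 39KB is 7KB plus the later 32KB.
for data_kb in 7 39; do
    for page_kb in 4 64; do
        echo "${data_kb}KB of data, ${page_kb}KB pages:" \
             "$(pages_needed "$data_kb" "$page_kb") page(s)," \
             "$(efficiency_pct "$data_kb" "$page_kb")% of allocated memory used"
    done
done
```

Running it for other sizes gives a quick feel for how much memory a given allocation pattern might waste at each granule size.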
The second trade-off is in performance due to cache misses for page table look-ups. There are a relatively small number of page entries stored in the TLB for each level of cache (L1, L2, System Level Cache).
With larger page sizes, these TLB entries cover a larger amount of physical memory. On Ampere Altra and Altra Max processors, for example, the L1 data TLB has 48 entries, and the L2 TLB has 1280 entries.
This means that with a 4KB granule, the L1 TLB can cache addresses for 192KB of physical memory, and the L2 TLB can store page addresses covering 5MB of physical memory.
With 64KB page sizes, this increases to 3MB for the L1 data TLB and 80MB for the L2 TLB. Each cache miss in the TLB adds time for a page walk to find the physical page matching a virtual memory lookup, caching the page once located, and updating the TLB accordingly. With larger pages, you have fewer cache misses, and better performance for memory-intensive workloads.
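The coverage figures above are simply the number of TLB entries multiplied by the page size. A minimal sketch, assuming each TLB entry maps exactly one page of the configured granule (the helper name tlb_reach_kb is hypothetical):

```shell
#!/bin/sh
# tlb_reach_kb ENTRIES PAGE_KB
# Memory covered by a fully populated TLB, in KB.
tlb_reach_kb() {
    echo $(( $1 * $2 ))
}

L1_ENTRIES=48     # L1 data TLB entries on Ampere Altra, as quoted in the text
L2_ENTRIES=1280   # L2 TLB entries on Ampere Altra, as quoted in the text

for page_kb in 4 16 64; do
    l1_kb=$(tlb_reach_kb "$L1_ENTRIES" "$page_kb")
    l2_kb=$(tlb_reach_kb "$L2_ENTRIES" "$page_kb")
    echo "${page_kb}KB pages: L1 covers ${l1_kb}KB, L2 covers $(( l2_kb / 1024 ))MB"
done
```

Substituting the TLB sizes of another CPU shows how much memory it can address without a page walk at each page size.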
You also improve I/O performance by having larger zones of contiguous memory available. As a result, data-intensive applications that have a lot of data in memory or in transit can benefit from larger page sizes. Some of these applications are:
- Databases: Database systems tend to store a lot of information in memory for caching purposes and have a lot of disk I/O for large datasets. Both characteristics make database servers great candidates for large memory page sizes.
- Virtualization infrastructure: Virtual Machines (VMs) include a disk image, comprising an operating system kernel and all the applications required by that VM, and range in size from hundreds of megabytes to hundreds of gigabytes. As a result, they can use large amounts of memory and can benefit from larger page sizes.
- Build servers for Continuous Integration: Tasks like building the Linux kernel process thousands of source files and use a lot of RAM while compiling them. Because this is a high-throughput workload, hosts configured with larger page sizes tend to perform better as build servers.
- Network or I/O heavy applications: For applications with a lot of network I/O and in-memory data processing, like object caches, load balancers, firewalls, or video streaming, large memory pages can result in fewer page faults, improving performance.
- Memory-intensive applications like AI inference: AI inference, executing a trained model like a recommendation engine or an LLM chatbot, is a memory and CPU intensive workload, where large memory page sizes can help provide high performance.
In general, the performance of these types of applications with larger page sizes will depend on several factors, including the data sets involved and the pattern of memory accesses of the application.
If you believe that your application could benefit from larger memory pages, you should benchmark your target workload with both 4K and 64K pages and make your deployment decision based on the results of your tests.
In addition to benchmarking your target application with both 4K and 64K pages using production-style data, you can evaluate the potential benefit of larger page sizes using the "perf" tool, by measuring TLB stalls (that is, how often TLB misses cause the CPU pipeline to stall while waiting for information to be loaded from memory).
First, check that the kernel supports the TLB stall counters on AmpereOne and newer CPUs.
# perf list | grep end_tlb
stall_backend_tlb
stall_frontend_tlb
With kernel support confirmed, the pipeline stalls due to TLB misses can be measured:
# perf stat -e instructions,cycles,stall_frontend_tlb,stall_backend_tlb ./a.out
time for 12344321 * 100M nops: 3.7 s
Performance counter stats for './a.out':
12,648,071,049 instructions # 1.14 insn per cycle
11,109,161,102 cycles
1,482,795,078 stall_frontend_tlb
1,334,751 stall_backend_tlb
3.706937365 seconds time elapsed
3.629966000 seconds user
0.000995000 seconds sys
The ratio (stall_frontend_tlb + stall_backend_tlb)/cycles is an upper bound for the fraction of time that could be saved by using larger memory pages.
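As a worked example, the ratio can be computed from the counter values in the sample output above. This is only a sketch: tlb_stall_pct is a hypothetical helper name, and the numbers are copied by hand from the perf stat output rather than parsed from it.

```shell
#!/bin/sh
# tlb_stall_pct FRONTEND_STALLS BACKEND_STALLS CYCLES
# Percentage of cycles stalled on TLB misses, to one decimal place.
tlb_stall_pct() {
    awk -v f="$1" -v b="$2" -v c="$3" \
        'BEGIN { printf "%.1f\n", 100 * (f + b) / c }'
}

# Counter values from the sample run, with thousands separators removed.
tlb_stall_pct 1482795078 1334751 11109161102
```

For this run the result is roughly 13%, so larger pages could recover at most about that share of the execution time, and the real gain may be smaller.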
Beware, however, that because 4K has been the default page size for so long, some software packages may make that assumption about your system, resulting in low efficiency in memory usage. This is not a very common situation in modern software stacks, but it is advisable to run some testing and benchmarking before committing to larger page sizes.
Configuring Larger Page Sizes on Ampere CPUs
Changing the memory page size requires running an operating system kernel that has been compiled to support your desired size. Popular cloud operating systems like Red Hat Enterprise Linux, Oracle Linux, SUSE Linux Enterprise, or Ubuntu from Canonical ship with pre-built kernels supporting 4KB and 64KB page sizes on Arm64.
To use a kernel with 64KB pages on Red Hat Enterprise Linux 9:
1. Install the kernel-64k package:
dnf -y install kernel-64k
2. Set the 64K kernel as the default at boot time:
k=$(echo /boot/vmlinuz*64k)
grubby --set-default=$k \
--update-kernel=$k \
--args="crashkernel=2G-:640M"
To boot a 64KB kernel on Ubuntu 22.04:
1. Install the arm64+largemem ISO, which contains the 64K kernel by default, or:
2. Install the linux-generic-64k package, which will add a 64K kernel option to the boot menu, with the command sudo apt install linux-generic-64k
3. You can set the 64K kernel as the default boot option by updating the grub2 boot menu with the command:
echo "GRUB_FLAVOUR_ORDER=generic-64k" | sudo tee \
/etc/default/grub.d/local-order.cfg
For 64KB pages on Oracle Linux:
1. Install the kernel-uek64k package:
sudo dnf install -y kernel-uek64k
2. Set the 64K kernel as the default at boot time:
sudo grubby --set-default=$(echo /boot/vmlinuz*64k)
3. After rebooting the system, you can verify that you are running the 64K kernel using getconf as described below.
Similar instructions may be available on the websites of other operating system distributions.
If you are building your own Linux kernel, you can use make menuconfig to change the kernel configuration. In the "Kernel Features" submenu, you will find the "Page size" configuration option, which you can change to 16K or 64K.
Alternatively, you can edit the kernel configuration file .config directly to set the value of CONFIG_ARM_PAGE_SHIFT from its default value of 12 (4K = 2^12 bytes) to 14 (16K = 2^14 bytes) or 16 (64K = 2^16 bytes). You can then choose which kernel to boot by creating multiple entries in your bootloader for the kernels with different page sizes and selecting the appropriate kernel at boot time.
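The relationship between a page shift value and the resulting page size is a power of two, which a short shell sketch can confirm. The helper name page_size_from_shift is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# page_size_from_shift SHIFT
# Page size in bytes for a given page shift (2 raised to the power SHIFT).
page_size_from_shift() {
    echo $(( 1 << $1 ))
}

for shift in 12 14 16; do
    bytes=$(page_size_from_shift "$shift")
    echo "page shift ${shift}: ${bytes} bytes ($(( bytes / 1024 ))K)"
done
```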
To verify the page size setting of your current Linux kernel, you can use the system getconf utility. With a 64K page size, it will show the following:
$ getconf PAGESIZE
65536
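If you want to check the page size from a deployment or benchmark script rather than by eye, the getconf output can be mapped to a readable label. This is a minimal sketch; describe_page_size is a hypothetical helper, and it only recognizes the three granule sizes discussed in this article.

```shell
#!/bin/sh
# describe_page_size BYTES
# Map a page size in bytes to the label used in this article.
describe_page_size() {
    case "$1" in
        4096)  echo "4K" ;;
        16384) echo "16K" ;;
        65536) echo "64K" ;;
        *)     echo "unexpected ($1 bytes)" ;;
    esac
}

# Report the page size of the kernel that is currently running.
echo "Running kernel uses $(describe_page_size "$(getconf PAGESIZE)") pages"
```

A benchmark harness could record this label alongside each result so that 4K and 64K runs are never mixed up.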
Conclusion
To summarize: Changing the kernel memory page size on your cloud systems can have a positive impact on application performance for many common cloud workloads. If your application involves a lot of disk, memory, or network I/O, you may be able to improve your performance significantly by using a kernel with 16K or 64K pages enabled on Arm hosts.
However, this is not a panacea, and your mileage may vary. We recommend that you test with both synthetic and real-world benchmarks to see if changing the page size will have a positive impact on your bottom line.
Many common Linux distributions with Arm64 builds already include multiple kernels in their distribution repositories. By installing these kernel packages and booting them at start-up, the cost of trying larger page sizes to test whether they provide a performance improvement is relatively low.