Lisbon, October 20, 2005
This is a short article about the performance impact of the PaX memory protection patch for the linux kernel. PaX greatly enhances linux's security by blocking common types of attacks that exploit software bugs such as buffer overflows. The cost of the memory protection mechanisms it implements must be evaluated and quantified to help people make decisions and investments.
Copyright (C) 2005 by Pedro Venda
I am an engineering student with a professional interest in computer networks, distributed systems and computer security. I have also been a linux user since 1998 and have done professional system administration work.
This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
The most recent version of this document can be found on my website http://www.pjvenda.org under the "linux articles" section. It can be browsed online or downloaded in several formats: [dvi] and [pdf]. [not yet available. hopefully online in a couple of days]
Corrections, suggestions and questions are welcome. Feel free to ask questions and point out errors or misleading information.
PaX (PAge eXecute) is a security patchset for the linux kernel that aims to prevent the exploitation of memory corruption bugs, which could otherwise give an attacker arbitrary privileges. This class of bugs includes various forms of buffer overflows (stack or heap based), user-supplied format string bugs, etc.
PaX is a security enhancement per se, but it is not nearly enough on its own to achieve a hardened environment. The grsecurity patchset includes PaX and is a much more complete approach to security enhancement for linux.
The PaX patchset includes mechanisms for:
Memory protection for the x86 (IA-32) architecture is one of the most important features brought by PaX. By improving memory protection, PaX is able to defeat some common exploits without being overly intrusive on the operating system. All applications should work without changes and, on architectures with hardware support (not IA-32), without any performance impact. PaX implements two methods for memory protection: PAGEEXEC and SEGMEXEC.
The PAGEEXEC memory protection algorithm was originally an idea from the freemware/plex86 virtualization project, meant to protect memory accesses between different running virtual operating systems while keeping performance mostly the same. The idea was later adopted by PaX because of its evident security potential.
PAGEEXEC implements page protection using the paging logic of IA-32 (x86) based CPUs. It is the least intrusive and most elegant approach, but also the most penalizing one in the worst case. The MMU of this architecture lacks hardware support for non-executable pages in the form of a No-eXecute (NX) bit (a small number of recent Intel 32-bit CPUs already support an NX bit, but at the moment PaX doesn't use it). These simpler MMUs overload one bit for READ/EXEC permission settings, so every READable page is also EXECutable.
Fortunately, IA-32 CPUs also have split TLBs (Translation Lookaside Buffers) for code and data memory pages. These tables cache physical page locations and access rights (user/supervisor and read/exec) obtained from (expensive) page table walks. All page table entries (PTE) have well defined (good or bad) states (READ+EXEC is ok for code pages but bad for data pages, as the latter aren't by definition executable) as well as good/bad transitions between states. Code pages are generally READ+EXEC while data pages are READ(and WRITE)+NOEXEC. A userspace attempt to execute code in a data page (NOEXEC) generates a page fault exception, which is handled by the kernel and the PaX PAGEEXEC code, and the access is denied.
PAGEEXEC sits in the memory paging subsystem, intercepting page fault handling and replacing access rights whenever a page would end up in a "bad" state.
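The permission policy described above can be illustrated with a toy model. This is only a sketch of the good/bad access decision, not of the kernel mechanism itself: the `Page` class, the `check_access` function and the example pages are hypothetical names introduced for illustration, and the split-TLB trick that makes the check possible on IA-32 is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Toy page descriptor (hypothetical, for illustration only)."""
    readable: bool
    writable: bool
    executable: bool  # tracked in software; plain IA-32 paging has no such bit


def check_access(page: Page, access: str) -> bool:
    """Return True if the access is allowed, False if it must be denied."""
    if access == "read":
        return page.readable
    if access == "write":
        return page.writable
    if access == "exec":
        # An instruction fetch from a NOEXEC page is the case that gets denied.
        return page.executable
    raise ValueError(f"unknown access type: {access}")


# Code pages are READ+EXEC, data pages are READ(+WRITE)+NOEXEC.
code_page = Page(readable=True, writable=False, executable=True)
data_page = Page(readable=True, writable=True, executable=False)
```

In this model `check_access(data_page, "exec")` is the denied case, which corresponds to an attempt to run injected code from a data page.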
SEGMEXEC is a reimplementation of the memory protection mechanism exclusively for the IA-32 architecture, taking advantage of the hardware segmentation features.
PaX's VMMIRROR code is the heart of SEGMEXEC: it allows coherent mapping of the same set of physical pages at two different linear addresses in any given task. Consistency is assured in case of swap cycles and copy-on-write operations. In the typical case of a 3 GB userland linear address range, two 1.5 GB regions are created from it and the code/data segment descriptors are set up to cover only one or the other. Specifically, data memory is mapped in the bottom half (0-1.5GB in linear address range) and executable memory is mapped in the top half (1.5-3GB in linear address range). Executable code mappings must be visible in the code segment region (1.5-3GB in linear address space) and, since such mappings may contain data too, they need to be mirrored at the same linear addresses in the data segment (0-1.5GB in linear address space).
Using PaX's VMMIRROR code, user data memory is mapped into the first half, while executable memory goes into the second. Instruction fetches are translated from the top half while data fetches come from the bottom half maps. A non-executable memory space would be mapped in the bottom half only while executable data would be mapped into both halves since executable data can also be read. The VMMIRROR code allows identification of access type through the linear address requested.
Non-executable memory is only mapped in the 0-1.5GB half of the linear address space, so instruction fetches from it (which are illegal) are trivially detected: the requested addresses are translated in the top half (1.5-3GB), where nothing is mapped for them, which raises a page fault and exposes the illegal execution attempt.
SEGMEXEC uses VMMIRROR to do most of the work and, like PAGEEXEC, sits in the memory subsystem intercepting page faults to determine permission violation attempts.
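The address split described above can be sketched as a small model. It assumes the typical 3 GB userland range from the text, a 4 KB page size, and that an instruction fetch is simply offset by 1.5 GB into the top half. The `AddressSpace` class and the example addresses are hypothetical and only illustrate the idea; they are not taken from the PaX code.

```python
GIB = 1 << 30
USERLAND_SIZE = 3 * GIB           # typical 3 GB userland linear address range
HALF_SIZE = USERLAND_SIZE // 2    # 1.5 GB for each half
PAGE_SIZE = 4096                  # assumed page size


class AddressSpace:
    """Toy model of the mirrored layout (hypothetical, for illustration)."""

    def __init__(self):
        self.mapped_pages = set()  # linear page numbers that have a mapping

    def map_page(self, addr, executable):
        """Map one page in the bottom half; mirror it on top if executable."""
        if not 0 <= addr < HALF_SIZE:
            raise ValueError("address must lie in the 0-1.5GB half")
        self.mapped_pages.add(addr // PAGE_SIZE)
        if executable:
            self.mapped_pages.add((addr + HALF_SIZE) // PAGE_SIZE)

    def data_access_ok(self, addr):
        """Data accesses are translated in the bottom half."""
        return addr // PAGE_SIZE in self.mapped_pages

    def instruction_fetch_ok(self, addr):
        """Instruction fetches are translated in the top half."""
        return (addr + HALF_SIZE) // PAGE_SIZE in self.mapped_pages


space = AddressSpace()
space.map_page(0x08048000, executable=True)    # hypothetical code page
space.map_page(0x10000000, executable=False)   # hypothetical data page
```

Here `space.instruction_fetch_ok(0x10000000)` is false because the data page has no mirror in the top half, which is the situation that ends in a page fault.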
My benchmarks were done with a practical and simple situation, indicative of relative performance impact caused by different PaX configurations. What I'm really looking for is the impact caused by the different memory protection implementations on IA-32 CPUs. Enhanced memory protection is one of the most important security enhancements brought by PaX, but without hardware support on most CPUs it comes with a performance cost.
Benchmarks were based on timing a kernel tree compilation on otherwise idle computers. This is a reliable benchmark because it can take about one hour to complete and, above all, it is very CPU and memory intensive. On the SMP testbench, I took advantage of both CPUs by parallelizing the make process (make -j3).
The configuration was explicitly generated to make the compilation process take as long as possible. This means that every kernel feature should be enabled, hence make allyesconfig was used to generate the full configuration file automatically.
Sources were cleaned with make mrproper before starting and between successive compilations. Each time measurement was taken from 3 separate compilations under the same conditions.
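The procedure above could be automated roughly as follows. This is a minimal sketch and not the script actually used for the article: the `time_command` and `benchmark` functions and the `kernel_tree` parameter are hypothetical, and the configuration is regenerated after each make mrproper on the assumption that cleaning also removes the generated configuration file.

```python
import resource
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd once and return (real, user, system) seconds spent in it."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, cwd=cwd, check=True)
    real = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return real, user, system


def benchmark(kernel_tree, runs=3, jobs=3):
    """Average (real, user, system) times over several clean compilations."""
    samples = []
    for _ in range(runs):
        # Clean the tree and regenerate the full configuration before each run.
        subprocess.run(["make", "mrproper"], cwd=kernel_tree, check=True)
        subprocess.run(["make", "allyesconfig"], cwd=kernel_tree, check=True)
        samples.append(time_command(["make", f"-j{jobs}"], cwd=kernel_tree))
    return [sum(column) / runs for column in zip(*samples)]
```

Calling `benchmark("/path/to/kernel/tree")` (a placeholder path) would return the three averaged times in seconds. Splitting the result into user and system time matters here because the PaX overhead shows up mostly in system time.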
Table: Normalized performance results of average compilation times with different PaX configurations
| time | No protection (reference) | PAGEEXEC implementation | SEGMEXEC implementation |
Note 1: Reference times yield 100%. Results below 100% represent longer compilation times. An example value of 50% means the compilation took twice as long as the reference.
Note 2: Illustrative graphics represent the average amount of time in seconds spent in the compilation processes, split by system and user time.
Note 3: Absolute compilation times and some simple statistical analysis can be found in the appendices of this document.
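The normalization in Note 1 can be written down explicitly. The sketch below assumes the normalized value is the reference time divided by the measured time, which is the reading consistent with the 50% example. The function name and the sample times are hypothetical and are not measurements from this study.

```python
def normalized_performance(reference_seconds, measured_seconds):
    """Performance relative to the reference run, as a percentage."""
    return 100.0 * reference_seconds / measured_seconds


# Hypothetical times: a run taking twice as long as the reference scores 50%.
example = normalized_performance(3600, 7200)
```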
Implementing features that the hardware does not support is bound to have a negative impact on performance. AMD64 based CPUs let us use PAGEEXEC "for free" because the architecture supports it explicitly, but that is not the case for generic x86 CPUs (i386-Pentium, Pentium Pro/II/III, Celeron, Centrino, Pentium IV, AMD K7/Athlon, etc). These CPUs matter because of their overwhelming and presently unchallenged position in the desktop and low-end server market.
Surprisingly, PAGEEXEC has little performance impact, even though it implies many more page table walks and page fault exceptions to allow software permission management. The benchmarking results for testbench 2 (Pentium III) show only a 2.7% performance drop on total (real) time and a more significant 18.8% drop on system time, reflecting the increased load on the memory paging subsystem.
The most notable result is the particularly negative impact found on testbench 1 (Pentium IV). The netburst architecture somehow shows a non-modest performance drop of 22.1% on total time and a whopping 73.2% on system time.
On IA-32 architecture, SEGMEXEC based memory protection should have a small if at all noticeable performance impact. And that's exactly what the benchmarks show.
All benchmarks show very small performance hits of about 3.5% of system time and less than 1% of total time. The performance increase of 0.32% of real time found in testbench 2 can be covered by statistical errors. Finally, there was no SEGMEXEC benchmark for AMD64 because the patch doesn't support it and it would be a waste of time. AMD64 can use PAGEEXEC "for free" as seen in the PAGEEXEC benchmark result.
Normalized benchmark results can be viewed on the chart below. Results are represented as percentage values, normalized against the reference (no PaX) for each testbench. Percentage values below 100% represent decreased performance and vice-versa. The visible performance hits can be found in testbench 2 but mostly in testbench 1, as expected in PAGEEXEC setups.
The PaX patchset is a big step towards increased operating system security, adding protection features against the introduction and/or execution of arbitrary code and against user changes to code execution order. Memory protection features are only a part of the PaX project and are most welcome on the IA-32 architecture, since it doesn't support them (decently, I mean). AMD64, PPC64, Alpha and other architectures benefit from PaX's PAGEEXEC memory protection at no cost whatsoever!
IA-32 CPUs suffer a small performance drop when using PAGEEXEC -- 2.7% in real time and 18.8% in system time -- with the exception of the netburst architecture (Pentium IV and Xeon), which suffers a much bigger hit -- 22.1% on real time and 73.2% on system time.
As expected, AMD64 architecture uses PAGEEXEC for free. There is no performance drop whatsoever.
Comment from the PaX team on these results: the PAGEEXEC/i386 performance impact is lower than what it used to be because a year ago I implemented a speed-up trick on 2.6 (it is not documented yet). you can try to test 2.4 and you'll see what I mean, especially on the P4 where the old PAGEEXEC logic would cause a huge slowdown, maybe a 100x or on that order (so what you observed shows just how effective the speed-up is in real life). I don't know why the P4 is so impacted, it is probably in part because user/kernel transitions are very slow in general, but there must be something else as well (the PaX specific page table manipulation maybe).
For IA-32 CPUs with significant performance drops caused by PAGEEXEC (either on netburst architecture or on others with memory sensitive applications), the PaX project developed an alternative algorithm -- SEGMEXEC -- which achieves the same goal as PAGEEXEC but by taking advantage of the hardware segmentation logic of x86 CPUs, making the associated performance drop nearly null. Both testbenches 1 and 2 showed the expected similar results. These are very small performance drops which measure only 3.5% of system time and less than 1% of total time. Curiously, testbench 2 had a performance increase of 0.32% on real time, but I believe that can be covered by the statistical errors. On a final note, SEGMEXEC was not tested on AMD64 because the patch doesn't support it (SEGMEXEC on AMD64) and it would be a waste of time. AMD64 can use PAGEEXEC "for free".
This small study has revealed that PaX's PAGEEXEC memory protection can be expensive in some cases, but that on the IA-32 architecture SEGMEXEC is a perfectly functional alternative with a completely negligible performance hit. SEGMEXEC's associated drawbacks are also not at all easy to run into.
Comment from the PaX team on these results: it is interesting that you also found the same small performance increase for SEGMEXEC/P3 that i'd observed in the past, it is really small but consistent: userland time goes down (unexpected), kernel goes up a bit (expected), on the order of a few seconds on a kernel compilation. I have no explanation for this, but it is quite reproducible (i used to think that randomization causes it but then it affects only virtual addresses, not the physical addresses which play a role in caching, and you didn't have randomization enabled at all, I think).
SEGMEXEC/amd64 doesn't exist because the CPU itself doesn't make it possible (nor necessary, of course), at least in 64 bit mode. in 32 bit mode you can of course boot the normal i386 kernels and test both non-exec methods.
It is my opinion that PaX is a very good patchset, being an important step towards improved operating system and therefore services' security. The memory protection plays an important role but the effectiveness of the patchset is maximized in conjunction with the other mechanisms supplied. grsecurity includes PaX and presents a very complete approach for improved linux security.
Some applications that were badly written, aggressively optimized or derived from very old and thus crippled code may not work with this kind of security patches. There is no hope for those applications other than two solutions:
More information on the netburst architecture could shed light on its particularly negative results with PAGEEXEC, so a comment on that would be useful to explain the bad performance. Future revisions of this article may include an explanation. Feel free to contribute.
One other useful PaX feature is also bound to cause a negative performance impact and could also be studied. Address Space Layout Randomization randomizes several process memory mappings to prevent exploit techniques that rely on known memory vectors.
On system calls like fork() or exec() some address vectors must be generated from a known base address and an amount of randomization. The randomization "quality" and "quantity" for the different base addresses can cause a performance penalty, which could be measured.
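The address generation mentioned above can be sketched as a known base plus a random, page-aligned offset. This is a simplified illustration under stated assumptions and not PaX's actual algorithm: the function name, the 4 KB page size, the base address and the number of random bits are all hypothetical.

```python
import secrets

PAGE_SHIFT = 12  # assumed 4 KB pages


def randomized_base(base, random_bits, page_shift=PAGE_SHIFT):
    """Return base plus a random page-aligned offset of random_bits bits."""
    offset = secrets.randbits(random_bits) << page_shift
    return base + offset


# Hypothetical example: 16 bits of randomization on top of a fixed base.
mmap_base = randomized_base(0x40000000, 16)
```

More random bits make addresses harder to guess but spread mappings over a wider range, which is one way the "quantity" of randomization could influence the cost that a future benchmark would measure.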
This work is licensed under a Creative Commons Attribution 2.5 License.