
PaX performance hit

PaX performance impact

Pedro Venda

Lisbon, October 20, 2005

This is a short article about the performance impact of the PaX memory protection patch for the Linux kernel. PaX greatly enhances Linux's security by blocking common types of attacks that exploit software bugs such as buffer overflows. The cost of its memory protection mechanisms must be measured and quantified to help people make decisions and plan investments.

Index

1. Copyright, acknowledgements, updates and feedback

2. PaX features

PaX (PAge eXecute) is a security patchset for the Linux kernel that aims to prevent the exploitation of memory corruption bugs, which could otherwise give an attacker arbitrary privileges. This class of bugs includes various forms of buffer overflows (stack or heap based), user-supplied format string bugs, etc.

PaX is a security enhancement per se, but it is not nearly enough on its own to achieve a hardened environment. The grsecurity patchset includes PaX and is a much more complete approach to hardening Linux.

The PaX patchset includes mechanisms for:

3. Memory protection

Memory protection for the x86 (IA-32) architecture is one of the most important features brought by PaX. By improving memory protection, PaX defeats some common exploit techniques without being overly intrusive on the operating system. All applications should work without changes and, on architectures with hardware support (not IA-32), without any performance cost. PaX implements two methods for memory protection: PAGEEXEC and SEGMEXEC.

PAGEEXEC

The PAGEEXEC memory protection algorithm originated as an idea for the freemware/plex86 virtualization project, where it was meant to protect memory accesses between different running virtual operating systems while keeping performance mostly the same. PaX later adopted the idea because of its evident security potential.

PAGEEXEC implements page protection features using the paging logic of IA-32 (x86) CPUs. It is the least intrusive and most elegant approach, but also the most penalizing one in the worst case. The IA-32 MMU lacks hardware support for non-executable pages through a No-eXecute (NX) bit (a small number of recent Intel 32-bit CPUs already support an NX bit, but at the moment PaX doesn't use it). These simpler MMUs overload one bit for READ/EXEC permission settings, so every READable page is also EXECutable.

Fortunately, IA-32 CPUs also have split TLBs (Translation Lookaside Buffers) for code and data memory pages. These tables cache physical page locations and access rights (user/supervisor and read/exec) obtained from (expensive) page table walks. Every page table entry (PTE) has a well-defined state that is either good or bad (READ+EXEC is fine for code pages but bad for data pages, as the latter are by definition not executable), and transitions between states are likewise good or bad. Code pages are generally READ+EXEC while data pages are READ(and WRITE)+NOEXEC. A userspace attempt to execute code in a data page (NOEXEC) generates a page fault exception, which is handled by the kernel and the PaX PAGEEXEC code, and the access is denied.

PAGEEXEC sits in the memory paging subsystem, intercepting page fault handling and replacing access rights whenever a page would end up in a "bad" state.
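The decision taken in the page fault path can be illustrated with a small user-space model. This is only a sketch, not PaX code: the names `Page`, `Verdict` and `handle_user_fault` are made up for illustration, and the rule used to recognize an instruction fetch (the faulting address equals the instruction pointer) is an assumption about how such a handler can tell fetches from data accesses.

```python
from dataclasses import dataclass
from enum import Enum

PAGE_SIZE = 4096  # 4 KB pages, as on IA-32


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class Page:
    """Logical permissions the kernel tracks for one page (hypothetical model)."""
    writable: bool
    executable: bool


def handle_user_fault(pages, fault_addr, instruction_pointer):
    """Classify a userspace page fault on a protected page.

    pages maps page numbers to Page objects. The access is treated as an
    instruction fetch when the faulting address is the instruction pointer
    (an assumption of this sketch).
    """
    page = pages.get(fault_addr // PAGE_SIZE)
    if page is None:
        return Verdict.DENY  # nothing mapped at this address
    is_fetch = fault_addr == instruction_pointer
    if is_fetch and not page.executable:
        return Verdict.DENY  # execution attempt in a NOEXEC (data) page
    return Verdict.ALLOW  # data access, or fetch from a code page


# Page 1 holds code (READ+EXEC), page 2 holds data (READ+WRITE, NOEXEC).
pages = {1: Page(writable=False, executable=True),
         2: Page(writable=True, executable=False)}
```

In this model, a jump into the data page (`handle_user_fault(pages, 0x2000, 0x2000)`) is denied, while a read of the same page issued by code running in page 1 is allowed.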

SEGMEXEC

SEGMEXEC is a reimplementation of the memory protection mechanism exclusively for the IA-32 architecture, taking advantage of the hardware segmentation features.

PaX's VMMIRROR code is the heart of SEGMEXEC: it allows coherent mapping of the same set of physical pages at two different linear addresses in any given task. Consistency is assured across swap cycles and copy-on-write operations. In the typical case of a 3 GB userland linear address range, the range is split into two 1.5 GB regions and the code and data segment descriptors are set up to cover only one or the other. Specifically, data memory is mapped in the bottom half (0-1.5GB in the linear address range) and executable memory is mapped in the top half (1.5-3GB in the linear address range). Executable mappings must be visible in the code segment region (1.5-3GB), and since such mappings may also contain data, they are mirrored at the same addresses within the data segment region (0-1.5GB).

[Figures: linear mapping space; SEGMEXEC changes to the linear mapping space]

Using PaX's VMMIRROR code, user data memory is mapped into the first half, while executable memory goes into the second. Instruction fetches are translated from the top half while data fetches come from the bottom half maps. A non-executable memory space would be mapped in the bottom half only while executable data would be mapped into both halves since executable data can also be read. The VMMIRROR code allows identification of access type through the linear address requested.

[Figures: memory mappings for executable data; memory mappings for non-executable data]

Non-executable memory is only mapped in the 0-1.5GB half of the linear address space, so instruction fetches of this data (which are illegal) are trivially detected: the requested address is translated through the top half (1.5-3GB), where nothing is mapped, which raises a page fault and exposes the illegal execution attempt.

SEGMEXEC uses VMMIRROR to do most of the work and, like PAGEEXEC, sits in the memory subsystem intercepting page faults to determine permission violation attempts.
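The address arithmetic behind the mirroring can be sketched as follows. This is an illustrative model rather than PaX code: `Mapping`, `linear_ranges` and `access_ok` are hypothetical names, and the sketch assumes that the code segment is based at 1.5 GB, so that a fetch from address X is looked up at linear address X + 1.5 GB.

```python
from dataclasses import dataclass

GB = 1 << 30
USER_SPACE = 3 * GB       # typical userland linear address range
HALF = USER_SPACE // 2    # 1.5 GB: size of the data and code regions


@dataclass(frozen=True)
class Mapping:
    """One userland mapping, described by its address in the bottom half."""
    start: int
    length: int
    executable: bool


def linear_ranges(mapping):
    """Linear ranges backing a mapping: always the data range in the bottom
    half, plus a mirror in the top half when the mapping is executable."""
    ranges = [(mapping.start, mapping.start + mapping.length)]
    if mapping.executable:
        ranges.append((mapping.start + HALF,
                       mapping.start + mapping.length + HALF))
    return ranges


def access_ok(mappings, addr, is_fetch):
    """Return True if the access finds a mapping, False if it would fault."""
    if not 0 <= addr < HALF:
        return False  # outside the 1.5 GB covered by either segment
    # Assumption: fetches are translated through the top half.
    linear = addr + HALF if is_fetch else addr
    return any(lo <= linear < hi
               for m in mappings
               for lo, hi in linear_ranges(m))
```

With an executable text mapping and a non-executable stack mapping, reads succeed on both, fetches succeed only on the text mapping, and a fetch from the stack lands in an unmapped part of the top half, which is the page fault described above.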

4. Benchmarks

My benchmarks were done with a practical and simple situation, indicative of relative performance impact caused by different PaX configurations. What I'm really looking for is the impact caused by the different memory protection implementations on IA-32 CPUs. Enhanced memory protection is one of the most important security enhancements brought by PaX, but without hardware support on most CPUs it comes with a performance cost.

Benchmarks were based on timing the compilation of a kernel tree on otherwise idle computers. This is a reliable benchmark because it can take about one hour to complete and, above all, it is very CPU and memory intensive. On the SMP testbenches, I took advantage of both CPUs by parallelizing the make process (make -j3).

The configuration was deliberately generated to make the compilation take as long as possible. This means that every kernel feature should be enabled, so make allyesconfig was used to generate a full configuration file automatically.

Sources were cleaned with make mrproper before starting and between successive compilations. Each time measurement was taken from 3 different compilations under the same conditions.
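The measurement procedure can be automated with a small harness such as the one below. It is a sketch of one possible approach, not the script used for this article: `time_command` and `benchmark` are hypothetical helpers, and the child resource counters from Python's `resource` module (Unix only) stand in for the shell's `time`.

```python
import resource
import statistics
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return (real, user, system) seconds."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (real,
            after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def benchmark(build_cmd, prepare_cmds=(), runs=3):
    """Time build_cmd `runs` times, running prepare_cmds (untimed) before each.

    Returns {"real" | "user" | "system": (mean, sample standard deviation)}.
    """
    samples = []
    for _ in range(runs):
        for cmd in prepare_cmds:
            subprocess.run(cmd, check=True)
        samples.append(time_command(build_cmd))
    result = {}
    for name, values in zip(("real", "user", "system"), zip(*samples)):
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        result[name] = (statistics.mean(values), spread)
    return result
```

Run from the kernel source directory, the sequence described above would correspond to `benchmark(["make"], prepare_cmds=[["make", "mrproper"], ["make", "allyesconfig"]])`.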

Testbenches were:

Testbench 1

  • Intel Pentium IV 2.2 GHz
  • 2x256 MB DDR RAM (dual channel)
  • /usr/src on RAID1 device based on 2x40GB IDE hard drives
  • Gentoo Linux Hardened (with USE flags hardened and pic and -fstack-protector in CFLAGS)
  • Gentoo Hardened-v1 kernel (1)
    • with PAGEEXEC
    • with SEGMEXEC
    • without PaX protections
  • benchmark command sequence: make mrproper && make allyesconfig && time make

Testbench 2

  • Dual Intel Pentium III 866 MHz
  • 1x512MB 133MHz ECC registered SDRAM
  • /usr/src on RAID5 device based on 4x80GB IDE hard drives
  • Gentoo Linux Hardened (with USE flags hardened and pic and -fstack-protector in CFLAGS)
  • kernel 2.6.11.7-grsec
    • with PAGEEXEC
    • with SEGMEXEC
    • without PaX protections
  • benchmark command sequence: make mrproper && make allyesconfig && time make -j3

Testbench 3

  • HP Proliant DL154
  • Dual Opteron 248
  • 4x1GB Registered ECC RAM
  • /usr/src on RAID1 device based on 2x160GB IDE hard drives
  • Gentoo kernel 2.6.11-r13
    • with PAGEEXEC
    • SEGMEXEC is not applicable for this CPU architecture
    • without PaX protections
  • benchmark command sequence: make mrproper && make allyesconfig && time make -j3

Note 1: This kernel is version 2.6.11 patched for security hardening by the Gentoo people. Patches include grsecurity which also incorporates PaX into the kernel.


Normalized performance results of average compilation times with different PaX configurations

              time     No protection (reference)   PAGEEXEC   SEGMEXEC
Testbench 1   real     100%                        87.9%      99.4%
              user     100%                        95.9%      99.5%
              system   100%                        36.2%      96.5%
Testbench 2   real     100%                        97.26%     100.03%
              user     100%                        99.33%     100.32%
              system   100%                        81.19%     96.11%
Testbench 3   real     100%                        100.8%     n/a
              user     100%                        100.44%    n/a
              system   100%                        100.21%    n/a

[Figures: testbench 1 benchmark results; testbench 2 benchmark results; testbench 3 benchmark results]
Note 1: Reference times yield 100%. Results below 100% represent longer compilation times. For example, a value of 50% means the compilation took twice as long as the reference.
Note 2: The illustrative graphics represent the average amount of time, in seconds, spent in the compilation processes, split into system and user time.
Note 3: Absolute compilation times and some simple statistical analysis can be found in the appendix of this document.
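The normalization in Note 1 amounts to dividing the reference time by the measured time. A minimal sketch, with hypothetical function names:

```python
def normalized_performance(reference_seconds, measured_seconds):
    """Performance relative to the unprotected reference, in percent.

    100% means the same time as the reference; 50% means the
    compilation took twice as long.
    """
    return 100.0 * reference_seconds / measured_seconds


def performance_drop(normalized_percent):
    """Performance drop, in percentage points, relative to the reference."""
    return 100.0 - normalized_percent
```

A compilation that takes 200 seconds against a 100 second reference yields `normalized_performance(100, 200) == 50.0`, and the 97.26% real-time figure of testbench 2 under PAGEEXEC corresponds to a drop of about 2.7 points.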

5. PAGEEXEC performance impact

Implementing features without hardware support is bound to have a negative impact on performance. AMD64 based CPUs let us use PAGEEXEC "for free" because the architecture supports it explicitly, but that is not the case for generic x86 CPUs (i386-Pentium, Pentium Pro/II/III, Celeron, Centrino, Pentium IV, AMD K7/Athlon, etc). These CPUs matter because of their overwhelming and presently unthreatened share of the desktop and low end server market.

Surprisingly, PAGEEXEC has little performance impact, even though it implies many more page table walks and page fault exceptions to manage permissions in software. The benchmarking results for testbench 2 (Pentium III) show only a 2.7% performance drop in total (real) time and a more significant 18.8% drop in system time, which reflects the increased load on the memory paging subsystem.

The most notable result is the particularly negative impact found on testbench 1 (Pentium IV). For reasons that are not clear, the netburst architecture shows a considerable performance drop of 22.1% in total time and a whopping 73.2% in system time.

6. IA-32 alternative: SEGMEXEC performance impact

On IA-32 architecture, SEGMEXEC based memory protection should have a small if at all noticeable performance impact. And that's exactly what the benchmarks show.

All benchmarks show very small performance hits of about 3.5% in system time and less than 1% in total time. The performance increase of 0.32% in user time (0.03% in real time) found in testbench 2 can be attributed to statistical error. Finally, there was no SEGMEXEC benchmark for AMD64 because the patch doesn't support it there and the test would be a waste of time: AMD64 can use PAGEEXEC "for free", as seen in the PAGEEXEC benchmark results.
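Whether a difference this small can be attributed to statistical error can be checked against the spread of the individual runs. The sketch below is one simple way to do it, assuming the per-run times are available; `within_noise` is a hypothetical helper and the threshold of two standard errors is an arbitrary choice, not the analysis used in the appendix.

```python
import math
import statistics


def within_noise(reference_runs, candidate_runs, k=2.0):
    """Return True if the means differ by at most k combined standard errors.

    Each argument is a list of at least two run times in seconds.
    """
    diff = abs(statistics.mean(reference_runs) - statistics.mean(candidate_runs))
    standard_error = math.sqrt(
        statistics.variance(reference_runs) / len(reference_runs)
        + statistics.variance(candidate_runs) / len(candidate_runs)
    )
    return diff <= k * standard_error
```

With only 3 runs per configuration the estimate of the spread is itself rough, so a result of this kind is best read as an indication rather than a proof.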

7. Conclusion

Normalized benchmark results can be viewed on the chart below. Results are represented as percentage values, normalized against the reference (no PaX) for each testbench. Percentage values below 100% represent decreased performance and vice-versa. The visible performance hits can be found in testbench 2 but mostly in testbench 1, as expected in PAGEEXEC setups.

[Figure: relative benchmark comparison chart]

PAGEEXEC

The PaX patchset is a big step towards increased operating system security, adding protection against the introduction and/or execution of arbitrary code and against user-induced changes to the code execution order. Memory protection features are only a part of the PaX project and are most welcome on the IA-32 architecture, since it doesn't support them (decently, I mean). AMD64, PPC64, Alpha and other architectures benefit from PaX's PAGEEXEC memory protection at no cost whatsoever!

IA-32 CPUs suffer a small performance drop when using PAGEEXEC -- 2.7% in real time and 18.8% in system time -- with the exception of the netburst architecture (Pentium IV and Xeon), which suffers a much bigger hit -- 22.1% in real time and 73.2% in system time.

As expected, the AMD64 architecture uses PAGEEXEC for free. There is no performance drop whatsoever.

Comment from the PaX team on these results: the PAGEEXEC/i386 performance impact is lower than what it used to be because a year ago I implemented a speed-up trick on 2.6 (it is not documented yet). you can try to test 2.4 and you'll see what I mean, especially on the P4 where the old PAGEEXEC logic would cause a huge slowdown, maybe a 100x or on that order (so what you observed shows just how effective the speed-up is in real life). I don't know why the P4 is so impacted, it is probably in part because user/kernel transitions are very slow in general, but there must be something else as well (the PaX specific page table manipulation maybe).

SEGMEXEC

For IA-32 CPUs that suffer significant performance drops with PAGEEXEC (either on the netburst architecture or on others running memory-sensitive applications), the PaX project developed an alternative algorithm -- SEGMEXEC -- which achieves the same goal as PAGEEXEC by taking advantage of the hardware segmentation logic of x86 CPUs, making the associated performance drop nearly null. Testbenches 1 and 2 both showed the expected, similar results: very small performance drops of about 3.5% in system time and less than 1% in total time. Curiously, testbench 2 showed a performance increase of 0.32% in user time (0.03% in real time), but I believe that falls within the statistical error. On a final note, SEGMEXEC was not tested on AMD64 because the patch doesn't support it there and the test would be a waste of time: AMD64 can use PAGEEXEC "for free".

This small study has shown that PaX's PAGEEXEC memory protection can be expensive in some cases, but that on the IA-32 architecture SEGMEXEC is a perfectly functional alternative with a completely negligible performance hit. SEGMEXEC's associated drawbacks are also not at all easy to find.

Comment from the PaX team on these results: it is interesting that you also found the same small performance increase for SEGMEXEC/P3 that i'd observed in the past, it is really small but consistent: userland time goes down (unexpected), kernel goes up a bit (expected), on the order of a few seconds on a kernel compilation. I have no explanation for this, but it is quite reproducible (i used to think that randomization causes it but then it affects only virtual addresses, not the physical addresses which play a role in caching, and you didn't have randomization enabled at all, I think).

SEGMEXEC/amd64 doesn't exist because the CPU itself doesn't make it possible (nor necessary, of course), at least in 64 bit mode. in 32 bit mode you can of course boot the normal i386 kernels and test both non-exec methods.

Overall Conclusion

It is my opinion that PaX is a very good patchset and an important step towards improved security of the operating system and, therefore, of the services it runs. Memory protection plays an important role, but the patchset is most effective in conjunction with the other mechanisms it supplies. grsecurity includes PaX and presents a very complete approach to improved Linux security.

Some applications that were badly written, aggressively optimized or derived from very old and thus crippled code may not work with this kind of security patch. There is no hope for those applications other than two solutions:

8. Future work

More information on the netburst architecture could shed light on its particularly negative results with PAGEEXEC, so a comment that explains the bad performance would be useful. Future revisions of this article may include an explanation. Feel free to contribute.

One other useful PaX feature is also bound to have a negative performance impact and could be studied as well. Address Space Layout Randomization randomizes several process memory mappings to prevent exploit techniques that rely on known memory addresses.

On system calls like fork() or exec(), some addresses must be generated from a known base address plus an amount of randomization. The randomization "quality" and "quantity" for the different base addresses can cause a performance penalty, which could be measured.
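The idea of deriving an address from a known base plus an amount of randomization can be sketched as below. This is an illustration only: `randomized_base` is a hypothetical helper, and the number of random bits, the page-aligned offset and the direction of the shift are assumptions rather than the values PaX uses.

```python
import secrets

PAGE_SHIFT = 12  # 4 KB pages, as on IA-32


def randomized_base(base, random_bits, page_shift=PAGE_SHIFT):
    """Return base plus a page-aligned random offset.

    random_bits (at least 1) is the amount of entropy: the offset is one
    of 2**random_bits possible multiples of the page size.
    """
    offset = secrets.randbits(random_bits) << page_shift
    return base + offset
```

With `random_bits=16`, for instance, the result lies somewhere in a 256 MB window above `base` and stays page aligned as long as `base` is.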

A. Absolute benchmark values

[Tables: testbench 1 absolute benchmark values; testbench 2 absolute benchmark values; testbench 3 absolute benchmark values]
All values are in seconds, except those in the performance column, which are percentages.

This work is licensed under a Creative Commons Attribution 2.5 License.