Whetstone Benchmark on PowerPC 440 CPU

Performance Assessment using the Whetstone v1.2 Benchmark on the Xilinx ML507 FPGA Platform
 

Table of Contents

Overview
Test Setup
Hardware Architecture
   Clocking Infrastructure
   Glossary
The Whetstone Benchmark
IBM PowerPC Performance Libraries
Test Results
   Measurement Results w/ FPU
   Measurement Results w/o FPU
   Inter-/Extrapolated Results
Summary & Conclusions
References
 

Overview

This project describes the performance assessment of the IBM PowerPC 440 CPU using the Whetstone version 1.2 benchmark. The tests were executed on the Xilinx Virtex-5 ML507 FPGA prototyping platform with an embedded PowerPC 440 CPU running at 400 MHz and a double-precision floating-point unit (FPU) running as a coprocessor at 200 MHz. The main objectives of this project are summarized as follows:

I am aware of the saying 'fake, lie, benchmark'. I therefore try to follow best practices by clearly specifying the overall setup, including the hardware system, the software and compiler versions, and the compilation flags used. Moreover, I do not tweak any system specifics to reach the highest scores, but rely on ordinary compiler flags such as '-O2' and '-O3'. The corresponding results are listed below. Any interpretation is left to the reader.

Note that in general, benchmark numbers are meaningless without proper specification of compiler settings and benchmarking conditions.

Last but not least, keep in mind Dilbert:

"A misleading benchmark test can accomplish in minutes what years of good engineering can never do."
 

Test Setup

Hardware:     IBM PowerPC 440 CPU running at 400 MHz
              Double-precision floating-point unit (FPU) running at 200 MHz
              CPM & PLB running at 133 MHz
              Peripherals: 32 kB BRAM, RS232 UART, interrupt controller, timer
Software:     Whetstone benchmark version 1.2 (language: C) [5]
              Selected tests employ the IBM PowerPC Performance Libraries [6]
Compiler:     powerpc-eabi-gcc (GCC) 4.1.1 20060524
Compilation:  Program compiled with variations of the following compilation flags:
              CFLAGS := -g -Wall -Werror -std=c99 -O3 -mcpu=440 -mfpu=dp_full

Hardware Architecture

The hardware architecture is depicted below. Only essential peripheral blocks have been attached to the CPU.

[Figure: Hardware architecture]
Hardware architecture for the performance assessment of the embedded IBM PowerPC 440 CPU using the Whetstone benchmark. A double-precision floating-point unit (DP-FPU), implemented as a soft IP block and clocked at 200 MHz, is connected through the fabric coprocessor bus (FCB) to the CPU core running at 400 MHz. The attached peripheral blocks are 32 kB of on-board memory, a timer, an interrupt controller, and an RS232 UART for serial communication.

Clocking Infrastructure

The clocking scheme of the system architecture has to adhere to a number of rules imposed by the various interconnects and devices. The applied clocking scheme and the corresponding clock ratios are listed below.

Clock               Frequency   Clock Ratio
CPU core clock      400 MHz     -
FCM clock (APU)     200 MHz     CPU:FCM 2:1
CPM clock           133 MHz     CPU:CPM 3:1
MPLB (PLB_v46_0)    133 MHz     CPU:MPLB 3:1, CPM:MPLB 1:1
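
The relation between the listed frequencies and ratios can be checked with a few lines of Python. This is only a sketch: it assumes that the 133 MHz entries are the rounded value of 400/3 MHz, and the names CPU_RATIOS and derived_clock_mhz are made up for illustration.

```python
from fractions import Fraction

CPU_MHZ = 400  # CPU core clock from the table above

# Clock ratios relative to the CPU core clock, as listed in the table.
CPU_RATIOS = {"FCM": 2, "CPM": 3, "MPLB": 3}

def derived_clock_mhz(cpu_mhz, ratio):
    """Frequency of a clock that runs at cpu_mhz divided by an integer ratio."""
    return Fraction(cpu_mhz, ratio)

clocks = {name: derived_clock_mhz(CPU_MHZ, r) for name, r in CPU_RATIOS.items()}
for name, mhz in clocks.items():
    print(f"{name}: {float(mhz):.1f} MHz (CPU:{name} {CPU_RATIOS[name]}:1)")

# CPM and MPLB use the same divider (assumed 400/3 MHz), hence their 1:1 ratio.
print("CPM:MPLB", clocks["CPM"] / clocks["MPLB"])
```

Exact fractions are used so that the 3:1 divider does not pick up floating-point rounding before the ratio is formed.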

Glossary

APU Auxiliary Processing Unit
BRAM Block Random Access Memory
CPM Clock and Power Management
DAC Digital-to-Analog Converter
DCR Device Control Register
DMA Direct Memory Access
FCB Fabric Coprocessor Bus
FCM Fabric Coprocessor Module
FPU Floating-Point Unit
GPIO General Purpose Input/Output
MCI Memory Controller Interface
MPLB Processor Local Bus Master
SPLB Processor Local Bus Slave
PLB Processor Local Bus
PPC PowerPC

The Whetstone Benchmark

The Whetstone benchmark is a synthetic benchmark originally written in 1972 in the Algol 60 programming language [3]. The source code was later ported to FORTRAN and C [4, 5]. The Whetstone benchmark aims to quantify CPU performance by exercising various CPU-centric operations within approx. 150 code statements. The benchmark itself performs a number of iterations over a set of modules. Each module loops over certain instructions such as procedure calls, branching statements, fixed- and floating-point arithmetic, and trigonometric operations. The number of executions of each loop is a predefined weight multiplied by the initially specified number of iterations. The dominant code section usually accounts for 30%-50% of the execution time and performs floating-point operations [4]. As stated in the code, a loop number of 10 corresponds to one million Whetstone instructions. By specifying the loop number and measuring the overall execution time of the benchmark on a given CPU, the CPU performance can be expressed in a benchmark-specific metric called Million Whetstone Instructions per Second (MWIPS).
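
The MWIPS metric follows directly from the loop number and the measured execution time. The following minimal sketch illustrates the arithmetic. The function and constant names are made up for illustration and are not taken from the benchmark source. The example values (400'000 loops in 296.9 s) are the '-O2' run with FPU reported in the results section.

```python
# A loop number of 10 corresponds to one million Whetstone instructions,
# i.e. one loop corresponds to 100,000 Whetstone instructions.
WHETSTONE_INSTRUCTIONS_PER_LOOP = 100_000

def mwips(loops, seconds):
    """Million Whetstone Instructions per Second for a timed run."""
    if seconds <= 0:
        raise ValueError("execution time must be positive")
    return loops * WHETSTONE_INSTRUCTIONS_PER_LOOP / seconds / 1e6

# 400'000 loops in 296.9 s, as measured with '-O2' and the FPU enabled.
print(f"{mwips(400_000, 296.9):.1f} MWIPS")
```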

In general, a synthetic benchmark like Whetstone has to satisfy conflicting design requirements [3]: On the one hand, the benchmark must be simple enough to be easily ported from one machine to another and translated to different programming languages. On the other hand, the benchmark has to be sufficiently complex to represent a sound measure of CPU performance without depending on language-specific peculiarities, while at the same time preventing excessive compiler optimizations. According to [3], at the time the Whetstone benchmark was designed, code movement out of loops and similar extensive optimization techniques were rarely performed by ALGOL 60 compilers, hence procedure calls and loops were viable techniques when confined to the ALGOL 60 programming language.

Roy Longbottom [4] makes an interesting statement about the effect of compiler optimizations on the Whetstone benchmark:

"The code was designed to be non-optimisable and optimising compilers did not have a significant impact until the introduction of in-lining of subroutine instructions. Although this produces code outside the definition of Whetstone instructions, which include a specific proportion of procedure calls, it is a valid technique to obtain the best performance out of modern systems and may well be the compiler default optimisation level. As reflected in the PC results, a good compiler can halve the execution time by in-lining, careful choice of instructions and sequence, and omission of intermediate stores/loads.

With in-lining and global optimisation, a small number of compilers identified that the dominant loop did not have to be executed and immediately lead to an apparent more than doubling of MWIPS speeds. This was identified by the 1980 enhancements and fixed in 1987, essentially by changing the name of one variable. Unlike some other standard benchmarks, Whetstone results were generally verified as part of the Central Computer and Telecommunications Agency (CCTA) system appraisal, in project related benchmarking sessions or during acceptance trials. It was also standard practice to run the tests with different levels of optimisation and obvious over - optimised results were not published."

IBM PowerPC Performance Libraries

Selected tests without a floating-point unit (FPU) employ the compilation flag '-mppcperflib', which links in the IBM PowerPC Performance Libraries for optimized low-level integer and floating-point emulation and optimized string handling routines [6]. According to Xilinx, the IBM PowerPC Performance Libraries may yield, on average, a threefold speedup for applications that use these routines heavily.

Caution: The IBM PowerPC Performance Libraries are only intended for speeding up emulated floating-point arithmetic and hence cannot be used in conjunction with floating-point hardware, i.e., with an active '-mfpu' switch.
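
The flag combinations used in the tests, together with the exclusion rule stated in the caution above, can be sketched as follows. This is an illustrative helper only: the function name make_cflags and its parameters are hypothetical and not part of any toolchain. The flags themselves are the ones listed in this report.

```python
def make_cflags(opt_level, use_fpu=False, use_perflib=False):
    """Compose a CFLAGS string for one benchmark configuration.

    Hypothetical helper: it only reproduces the flag sets listed in this
    report and enforces that '-mppcperflib' is not combined with '-mfpu'.
    """
    if use_fpu and use_perflib:
        raise ValueError("-mppcperflib cannot be combined with -mfpu")
    flags = ["-g", "-Wall", "-Werror", "-std=c99", f"-O{opt_level}", "-mcpu=440"]
    if use_fpu:
        flags.append("-mfpu=dp_full")   # hardware double-precision FPU
    if use_perflib:
        flags.append("-mppcperflib")    # optimized software emulation
    return " ".join(flags)

print("CFLAGS :=", make_cflags(3, use_fpu=True))
print("CFLAGS :=", make_cflags(2, use_perflib=True))
```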

Test Results

The source code has been compiled using GCC with the compiler flags indicated below. Any interpretation is left to the reader.

Measurement Results w/ FPU

IBM PowerPC 440 @ 400 MHz with dedicated double-precision FPU @ 200 MHz

The gray shaded table entries represent compiler settings violating the intended compilation rules of the synthetic benchmark.

Compiler optimization flags                          -O3¹     -O2²     -O1³
Execution time for 400'000 loops [s]                 83.8     296.9    310.4
Million Whetstone instructions per second (MWIPS)    477.2    134.7    128.9
Time for one loop through Whetstone [us]             210      742      776
MWIPS/MHz                                            1.19     0.34     0.32

¹ CFLAGS := -g -Wall -Werror -std=c99 -O3 -mcpu=440 -mfpu=dp_full
² CFLAGS := -g -Wall -Werror -std=c99 -O2 -mcpu=440 -mfpu=dp_full
³ CFLAGS := -g -Wall -Werror -std=c99 -O1 -mcpu=440 -mfpu=dp_full

Measurement Results w/o FPU

IBM PowerPC 440 @ 400 MHz without dedicated FPU, using software-emulated floating-point

The gray shaded table entries represent compiler settings violating the intended compilation rules of the synthetic benchmark.

Compiler optimization flags                          -O3 -mppcperflib¹⁰   -O3¹¹    -O2 -mppcperflib²⁰   -O2²¹
Execution time for 100'000 loops [s]                 75.8                 211.3    283.2                673.4
Million Whetstone instructions per second (MWIPS)    131.9                47.3     35.3                 14.9
Time for one loop through Whetstone [us]             758                  2113     2832                 6734
MWIPS/MHz                                            0.33                 0.12     0.09                 0.04

¹⁰ CFLAGS := -g -Wall -Werror -std=c99 -O3 -mcpu=440 -mppcperflib
¹¹ CFLAGS := -g -Wall -Werror -std=c99 -O3 -mcpu=440
²⁰ CFLAGS := -g -Wall -Werror -std=c99 -O2 -mcpu=440 -mppcperflib
²¹ CFLAGS := -g -Wall -Werror -std=c99 -O2 -mcpu=440
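
To put the measured numbers into relation, the speedup factors between the configurations can be computed from the MWIPS values in the two result tables. The sketch below does only that. The dictionary keys and the function name are arbitrary labels chosen for illustration.

```python
# MWIPS values copied from the two result tables above (CPU at 400 MHz).
# The key names are arbitrary labels, not identifiers from the benchmark.
MWIPS = {
    "fpu_O2": 134.7,
    "soft_O2_perflib": 35.3,
    "soft_O2": 14.9,
    "fpu_O3": 477.2,
    "soft_O3_perflib": 131.9,
    "soft_O3": 47.3,
}

def speedup(fast, slow):
    """Ratio of the MWIPS values of two configurations."""
    return MWIPS[fast] / MWIPS[slow]

for level in ("O2", "O3"):
    perflib = speedup(f"soft_{level}_perflib", f"soft_{level}")
    fpu = speedup(f"fpu_{level}", f"soft_{level}_perflib")
    print(f"-{level}: perflib vs. plain emulation {perflib:.2f}x, "
          f"FPU vs. perflib {fpu:.2f}x")
```

For this benchmark, the Performance Libraries give a speedup of roughly 2.4 to 2.8 over plain software emulation, which is somewhat below the average factor of three quoted by Xilinx.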

Inter-/Extrapolated Results

PPC w/ FPU, using 134.7 MWIPS @ 400 MHz as reference data:

CPU clock [MHz]    100     200     300     400     500     550
MWIPS              34      67      101     135     168     185
MWIPS/MHz          0.34    0.34    0.34    0.34    0.34    0.34

PPC w/o FPU, using 35.3 MWIPS @ 400 MHz as reference data:

CPU clock [MHz]    100     200     300     400     500     550
MWIPS              9       18      26      35      44      49
MWIPS/MHz          0.09    0.09    0.09    0.09    0.09    0.09
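
The values in both tables are consistent with a plain linear scaling of the 400 MHz reference measurement, i.e. a constant MWIPS/MHz figure. The following sketch reproduces them under that assumption. It does not model the FPU or bus clocks, and the function name is chosen for illustration only.

```python
REFERENCE_MHZ = 400
CLOCKS_MHZ = (100, 200, 300, 400, 500, 550)

def extrapolate_mwips(reference_mwips, cpu_mhz, reference_mhz=REFERENCE_MHZ):
    """Scale a measured MWIPS value linearly with the CPU clock frequency."""
    return reference_mwips * cpu_mhz / reference_mhz

# Reference data: '-O2' results with FPU (134.7) and with perflib (35.3).
with_fpu = [round(extrapolate_mwips(134.7, f)) for f in CLOCKS_MHZ]
without_fpu = [round(extrapolate_mwips(35.3, f)) for f in CLOCKS_MHZ]
print("w/ FPU: ", with_fpu)
print("w/o FPU:", without_fpu)
```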

Summary & Conclusions

By using an embedded PowerPC 440 CPU running at 400 MHz in conjunction with a double-precision floating-point unit running at 200 MHz, a Whetstone performance of 135 MWIPS was measured with cache-adjusted code size and compiler settings not violating the rules of the benchmark. Extrapolated to the fastest available Virtex-5 FPGA speed grade (i.e., a CPU clock frequency of 550 MHz), the PowerPC 440 would achieve 185 MWIPS for the version 1.2 benchmark.

When using a PowerPC 440 CPU running at 400 MHz without a dedicated FPU coprocessor, all floating-point arithmetic has to be performed in software. In this configuration, a Whetstone performance of 35 MWIPS was measured when linking in the IBM PowerPC Performance Libraries during compilation. Without the PowerPC Performance Libraries, the Whetstone performance of the employed system decreases to merely 15 MWIPS.

However, the relevance of these performance numbers is in general questionable: The code must run entirely from cache without any I/O transfers to show the best results. As soon as larger code size, costly I/O transfers, and different compiler options are involved, these numbers are merely theoretical.

Last but not least, it is impressive to see how strongly the different code optimization techniques of the compiler influence the execution time of an identical piece of code. This makes it obvious that a CPU performance measuring tool like a benchmark needs to be designed with the architectures and capabilities of hardware, software, and compilers clearly in mind.
 

References

[1] Xilinx Inc., Virtex-5 Family Brochure, Dec 2008

[2] Xilinx Inc., Virtex-5 Website

[3] H.J. Curnow, B.A. Wichmann, "A synthetic benchmark" (Whetstone), Computer Journal, Vol 19, No 1, pp 43-49, 1976

[4] Roy Longbottom, Whetstone Benchmark History and Results

[5] netlib.org, Benchmark Programs and Reports

[6] sourceforge.net, IBM PowerPC Performance Libraries

 

Last updated: 2012/12/30

 
