POWER6 Decimal Floating Point (DFP)
Introduction
The POWER6 CPU is the heart of the next range of System p machines. It is not only a superbly fast processor at 4.7 GHz, it also allows for some industry-leading new features. One of these is Decimal Floating Point (DFP): a new raw data type within the CPU (like the current characters, integers and floating point numbers) but with a high number of digits of accuracy, which means it can be used for money calculations. Most current applications and databases have to use Binary Coded Decimal (BCD) to perform calculations with money to sufficient accuracy and this is very heavy work for the processor. Decimal Floating Point is here to fix this and speed up most applications by getting the processor to do the hard work with up to 34 digit accuracy.
Wikipedia is always a good starting point and says this about Decimal Floating Point:
- Decimal floating point refers to both a representation and operations
- The key is preserving base 10 exponents wherever possible
- The conversion to binary floating point exponents can lead to repeat rounding errors and is therefore unsuitable for precision mathematics
For decades we have had: character, integer & floating point data types in C, FORTRAN etc but there is a need for one more ...
- Floating point has accuracy issues for high numbers of decimal places and when doing thousands of calculations the inaccuracy builds up.
- Totally unacceptable for currency calculations - so applications and databases are forced to use Binary Coded Decimal (BCD)
- BCD is like doing maths with strings V E R Y ... S L O W L Y
- Like using long division on paper rather than a calculator
The Answer is DFP
DFP gives you:
- Massive performance boost for these calculations
- Set to be a new industry standard, with IBM leading the way
Adoption:
- IBM's largest ISVs are releasing DFP products in 2007.
- Financial customers with their own code, for example, are adopting it too
I am going to skip the technical details of how bits are packed in memory etc. This is all very interesting but who cares! It's like knowing your car engine has 55.5 mm pistons, right? There are references at the bottom where you can find the full gory details.
There are two important questions:
- How do you get this working in your code and how much effort is involved?
- How much faster is it? i.e. the pay-back for your efforts.
There are two ways to implement this:
- Native DFP language support via a compiler, using the IBM XLC C Compiler release 9 for AIX
- Decimal Floating Point Abstraction Layer (DFPAL) via a library, freely downloadable from IBM
Both are covered below.
Native DFP language support via the XLC compiler
Native DFP language support comes with the IBM XLC C Compiler release 9 for AIX.
If you have a licence for this compiler you can upgrade to version 9 (I think) free of charge.
If not you can get an evaluation copy from the IBM website (I think it's for 60 days).
The C draft standard includes the following new data types (these are native data types just like integer, long and float, double etc):
- _Decimal32
- _Decimal64
- _Decimal128
As you can imagine, the higher the number at the end, the more digits of accuracy: 7 digits for _Decimal32, 16 for _Decimal64 and 34 for _Decimal128. We are going to concentrate on the best, the _Decimal128 with 34 digits. If you want to reduce the size in memory or in, for example, data records and you are absolutely sure you don't need 34 digits then you should investigate the 32 and 64 bit versions.
Note: printf() uses new length modifiers to print these new data types:
- _Decimal32 use %Hf
- _Decimal64 use %Df
- _Decimal128 use %DDf
A worked example on Native DFP
Here is a code sample using the new data types, that calculates interest:
#include <stdio.h>
#include <stdlib.h>

_Decimal128 atodecimal(char *s); /* support function, see the next section */

int main(int argc, char **argv)
{
    long i, count;
    double dfund, dinterest;
    _Decimal128 Dfund, Dinterest; /* Declaring the new data type */

    if (argc < 4) { /* expects: fund interest count */
        fprintf(stderr, "Usage: %s fund interest count\n", argv[0]);
        return 1;
    }
    dfund = atof(argv[1]);
    dinterest = atof(argv[2]);
    Dfund = atodecimal(argv[1]); /* Assigning values just like other data types */
    Dinterest = atodecimal(argv[2]);
    count = atol(argv[3]);
    printf("double fund=%20.10f interest=%40.30f\n", dfund, dinterest);
    printf("Decimal fund=%20.10DDf interest=%40.30DDf\n", Dfund, Dinterest); /* printing them with the new printf specifiers */
    for (i = 0; i < count; i++) {
        dfund = dfund * dinterest;
        Dfund = Dfund * Dinterest; /* performing maths */
    }
    printf("Print final funds\n");
    printf("double fund=%30.10f\n", dfund);
    printf("Decimal fund=%30.10DDf\n", Dfund);
    return 0;
}
Support function
You may have noticed the use of an atodecimal() function here. In the future there will be a strtod128() function in the C library but not at the time of writing this wiki page. This atodecimal() function had to be written for this program and the code is below:
#include <ctype.h>
#include <string.h>

/* Takes a string with a decimal number and returns a _Decimal128
 * Format: [+ -]digits.digits */
_Decimal128 atodecimal(char *s)
{
    _Decimal128 top = 0, bot = 0, result;
    int negative = 0, i;

    if (s[0] == '-') {
        negative = 1;
        s++;
    } else if (s[0] == '+')
        s++;
    /* whole part: read the digits left to right */
    for (; isdigit((unsigned char)*s); s++) {
        top = top * 10;
        top = top + (*s - '0');
    }
    if (*s == '.') {
        s++;
        /* fraction: read the digits right to left, back to the decimal point */
        for (i = (int)strlen(s) - 1; i >= 0 && isdigit((unsigned char)s[i]); i--) {
            bot = bot / 10;
            bot = bot + (_Decimal128)(s[i] - '0') / (_Decimal128)10;
        }
    }
    result = top + bot;
    if (negative)
        result = -result;
    return result;
}
This function is not complex and is a good exercise for the student - you may have a more concise way of doing this but it does the job for now.
Compiling
For hardware supported DFP:
cc dfp.c -o dfp_hw -qdfp -qarch=pwr6
For software emulation of DFP:
cc dfp.c -o dfp_sw -qdfp -qarch=pwr6 -qfloat=dfpemulate
If you want to compile for non-POWER6 machines:
cc dfp.c -o dfp_sw_old -qdfp -qfloat=dfpemulate
To use the hardware when on POWER6 but software emulation otherwise, i.e. a binary that runs everywhere:
cc dfp.c -o dfp_any -qdfp -qarch=ppc -qipa=clonearch=pwr6 -qfloat=dfpemulate
Running on non-POWER6 with no emulation
If we run the version that expects hardware DFP support on old H/W like POWER5 or POWER4 where there is no DFP support:
./dfp_hw 10 1.000001 60000000
Illegal instruction (core dump)
As these processors do not support DFP, the instruction is not valid, hence the core dump.
If there is no XLC release 9 compiler installed (actually it is the runtime library you need), you get missing library functions and a failure to launch instead:
Symbol __b64_to_d128 (number 6) is not exported from dependent module /usr/lib/libc.a(shr.o).
You need the compiler runtime support library.
Running on POWER6 with the Compiler Runtime support
So let us take a look at it running properly and with very carefully chosen values to illustrate a few points.
First running on a POWER6 4.7 GHz machine with AIX6 (could have been AIX 5.3 ML6) so we have the hardware to do native DFP.
# time ./dfp_hw 10 1.000001 60000000
double fund= 10.0000000000 interest= 1.000000999999999917733362053700
Decimal fund= 10.0000000000 interest= 1.000001000000000000000000000000
Print final funds
double fund =1141973124493563816969240576.0000000000
Decimal fund=1141973130130727445029596475.9717600000
real 0m0.72s
user 0m0.72s
sys 0m0.00s
Notes:
- The difference between the double binary and DFP values of 1.000001 is very important. This already shows how the rounding errors in float, double and the like lead to errors after a lot of maths.
- The values are printed out to highlight the level of accuracy in the DFP numbers
- The difference between the final double and DFP fund is extremely large and starts at the 9th digit.
- The double maths calculation has "short changed" the bank customer by roughly 5.6 x 10^18 Pounds, Dollars, Euros, Yen (pick your own currency here)
This proves the accuracy allowed with DFP but what about the performance? What does the 0.7 seconds mean?
Let us do the same calculation but within software on the same machine as above - this is much like the maths that applications have to perform now with Binary Coded Decimal (BCD) numbers before DFP came along.
# time ./dfp_sw 10 1.000001 60000000
double fund= 10.0000000000 interest= 1.000000999999999917733362053700
Decimal fund= 10.0000000000 interest= 1.000001000000000000000000000000
Print final funds
double fund =1141973124493563816969240576.0000000000
Decimal fund=1141973130130727445029596475.9717600000
real 0m54.81s
user 0m54.70s
sys 0m0.00s
Exactly the same results show the software emulation to be accurate - which is good but the time is much longer at 54.8 seconds.
So hardware DFP is 76 times faster (54.81 / 0.72 seconds) - THAT IS WORTH HAVING
I am told that results normally vary from 30 times to 60 times depending on the calculations. In this small example, we were doing just the multiply operation i.e. no divide, add, subtract.
DFP Native - Conclusions
1) QED - we were sizing a machine for 50 CPUs so now we only need 1 CPU?
- No - maths calculations are only a proportion of the application (unlike our example)
- But still a big impact on most applications that manipulate large sums of money
2) IBM has been busy
- Analysed existing code in real applications for potential speed up
- Forecast the silicon cost in CPUs (development / fabrication)
- Decided it is justified, i.e. the difference is large enough (clearly it is huge)
- Implemented it in POWER6 (Note: I am told it is already implemented on the recent System z machines in z/OS V1R9)
- Measured the performance jump once implemented
3) It is important that DFP is compiled into AIX 6 applications by ISVs
- IBM is working on its own Software Group's set of middleware and applications
- IBM is busy signing up large key ISVs too.
- At the time of writing SAP Netweaver announced availability of the new version with DFP support and we heard major RDBMS vendors will have DFP support in their next release (expected in late 2007).
But wait ... there is more. DFP can be accessed a different way, by using a library ... see below.
Decimal Floating Point Abstraction Layer (DFPAL) via a library
Many applications that are using Binary Coded Decimals (BCD) today use a library to perform the maths. Changing to a native data type could be hard work and then you may have an issue with one code set for AIX on POWER6 and one for other platforms not blessed with DFP. The solution to this is an alternative to the native support called the Decimal Floating Point Abstraction Layer (DFPAL). This contains:
- A header file to include in your code and
- the DFPAL library
How do you get access to this? It is downloadable from http://www2.hursley.ibm.com/decimal/
and then look for "DFPAL". You can download the complete source code and will need to compile it on your system which is very easy and took me less than a minute.
If you have hardware support for DFP then you use the library to access the functions.
If you don't have hardware support (or want to compare the hardware and software emulation) you can force the use of software emulation by setting a shell variable before running your application like so:
export DFPAL_EXE_MODE=DNSW
Once you have recompiled the DFPAL library, include the DFPAL header file and compile with (in this example, the program is called dfploop):
xlc_r -O2 -o dfploop -q64 dfploop.c -L . -lDFPAL
Notes:
- The -L . means search the current directory for the library
- The -lDFPAL (minus and lowercase L) means include the library in the file called libDFPAL.a
Here is a sample program that does much the same as the native DFP example above:
#include <stdio.h>
#include <stdlib.h>
#include "dfpal.h" /* the DFPAL header file */

int main(int argc, char **argv)
{
    int i, count;
    char mystring[100];
    decimal128 n1, n2; /* Declare the decimal floating point numbers */

    if (argc < 4) { /* expects: amount interest count */
        fprintf(stderr, "Usage: %s amount interest count\n", argv[0]);
        return 1;
    }
    if (dfpalInit((void *)malloc(dfpalMemSize())) /* start the library */
        != DFPAL_ERR_NO_ERROR) {
        fprintf(stderr, "DFPAL init error\n");
        dfpalEnd(free);
        return 1;
    }
    n1 = dec128FromString(argv[1]);
    n2 = dec128FromString(argv[2]); /* assign a value by converting the input string to a decimal */
    count = atoi(argv[3]); /* how many times to apply the interest */
    for (i = 0; i < count; i++)
        n1 = dec128Multiply(n1, n2); /* do the maths */
    printf("Final amount=%s\n", dec128ToString(n1, mystring)); /* print the results */
    dfpalEnd(free); /* end the library cleanly */
    return 0;
}
The results for hardware supported DFP via the DFPAL library.
# time ./dfploop 10 1.000001 60000000
Version of DFPAL: 2.10
DFPAL is operating in hardware
amount = 10
interest = 1.000001
count = 60000000
double fund= 1141973124493563816969240576.0000000000
Decimal fund=1141973130130727445029596475.971760
real 0m2.37s
user 0m2.30s
sys 0m0.00s
The results for DFP emulated in software via the DFPAL library.
# export DFPAL_EXE_MODE=DNSW
# time ./dfploop 10 1.000001 60000000
Version of DFPAL: 2.10
DFPAL is operating in software
amount = 10
interest = 1.000001
count = 60000000
double fund= 1141973124493563816969240576.0000000000
Decimal fund=1141973130130727445029596475.971760
real 0m44.64s
user 0m44.37s
sys 0m0.01s
Comments on the results
Results - with the latest version DFPAL at the time of writing (version 2.10)
- In Hardware real ~2.37 seconds
- In Software real ~44.64 seconds
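This is about 19 times faster (44.64 / 2.37 seconds) - the ratio is lower than before only because the S/W emulation is faster
This is equally impressive when compared with native DFP via the compiler:
- Hardware DFP via the library is slower than native (2.37 against 0.72 seconds) - I assume this is due to the library function call overhead
- Software DFP is actually around 20% faster (44.64 against 54.81 seconds) - this is a highly optimised library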
Note: DFP library can be compiled on non-POWER6 platforms
How do I find out if my Applications are using DFP?
The Power processors have special internal hardware counters that are mostly used in the Austin Development Labs for very fine tuning and monitoring. This is true for the POWER6 processor and using them you can find out if DFP instructions are being used inside the processor. There are two commands: hpmstat for monitoring the whole system and hpmcount for monitoring a single program. See below for examples of using them. Run your applications and then run the hpmstat command with a five second interval and a count of ten to find out what operations the whole LPAR/machine is doing.
# hpmstat -g 90 5 10
Execution time (wall clock time): 5.000044511 seconds
Group: 90
Counting mode: user+kernel+hypervisor
Counting duration: 10.000090000 seconds
PM_DFU_ENC_BCD_DPD (DFU Encode BCD to DPD) : 0
PM_DFU_EXP_EQ (DFU operand exponents are equal for add type): 12
PM_DFU_FIN (DFU instruction finish) : 12000104
PM_DFU_SUBNORM (DFU result is a subnormal) : 0
PM_RUN_INST_CMPL (Run instructions completed) : 53134198
PM_RUN_CYC (Run cycles) : 751761527
Normalization base: time
Counting mode: user+kernel+hypervisor
Derived metric group: hpc_metrics_grp_counts
Total run cycles for thread. : 751761527.000
Derived metric group: metrics_AEM
Cycles Per Instruction (Run) : 14.148
Derived metric group: basic_performance
Run cycles per run instruction : 14.148
......
For Decimal Floating Point the important line is
PM_DFU_FIN (DFU instruction finish) : 12000104
The command hpmstat samples the POWER6 hardware counters and here we see 12 million DFP operations have finished in this 5 second capture. To monitor a particular program use hpmcount as follows:
# hpmcount -g 90 ./dfp_hw 1 1.000001 60000000
double fund= 1.0000000000 interest= 1.000000999999999917733362053700
Decimal fund= 1.0000000000 interest= 1.000001000000000000000000000000
Print final funds
double fund=114197312449509773876920320.0000000000
Decimal fund=114197313013072744502959647.5971760000
Workload context: ./dfp_hw 1 1.000001 60000000 (pid: 303226)
Execution time (wall clock time): 0.724730958 seconds
######## Resource Usage Statistics ########
Total amount of time in user mode : 0.721439 seconds
Total amount of time in system mode : 0.000901 seconds
Maximum resident set size : 160 Kbytes
Average shared memory use in text segment : 9 Kbytes*sec
Average unshared memory use in data segment : 105 Kbytes*sec
Number of page faults without I/O activity : 47
Number of page faults with I/O activity : 0
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 0
Number of involuntary context switches : 20
####### End of Resource Statistics ########
Group: 90
Counting mode: user
Counting duration: 0.724156279 seconds
PM_DFU_ENC_BCD_DPD (DFU Encode BCD to DPD) : 0
PM_DFU_EXP_EQ (DFU operand exponents are equal for add type): 0
PM_DFU_FIN (DFU instruction finish) : 7609311
PM_DFU_SUBNORM (DFU result is a subnormal) : 0
PM_RUN_INST_CMPL (Run instructions completed) : -
PM_RUN_CYC (Run cycles) : -
Normalization base: time
Counting mode: user
#
Here the important line is "PM_DFU_FIN (DFU instruction finish)", where the count is 7609311.
You can also use the tprof command to monitor the use of DFP within a program - look for the option: tprof -E PM_MRK_DFU_FIN. tprof uses the AIX trace subsystem. This will tell you which functions are using DFP and how often.
What next?
Some places for more information:
- First note Decimal Floating Point for System p requires POWER6 and AIX 5.3 ML6+ or AIX6
- http://www2.hursley.ibm.com/decimal/
- for DFPAL and lots of important information - this is the IBM centre for all things DFP and related
- http://en.wikipedia.org/wiki/Floating_point
has lots of good pointers for more information
- IBM XLC C Compiler Version 9 for native DFP - find the compiler download at the www.ibm.com website (sorry the URL keeps changing)
DFP and GCC
I am told (but have not tried it yet) that the GCC C compiler version 4.2.2 is available for AIX 6.1 and this supports native DFP. As far as I know, DFP support in GCC is switched on when the compiler itself is built, using the configure option --enable-decimal-float=dpd, so check that the copy you install was built that way.
This GCC compiler can be downloaded pre-compiled in RPM format from http://www.perzl.org/aix/index.php?n=Main.Gcc
There are many more Open source packages available from this site. Report back here if you give it a try, please.
The End
Well that finishes my brief tour of the two ways of getting Decimal Floating Point working, the huge performance impact it gives you in addition to the accuracy we need in handling money values (there are other uses too). I hope you either use it as a programmer or look forward to running DFP supporting applications so we can make POWER6 computers even faster than they are already.