BigDecimal and your Money

I often see Java developers attempting to use the BigDecimal class to store monetary values for financial applications. This often seems like a great idea when you start out, but almost always comes back as a flesh-eating zombie intent on devouring your entire time some time later on. The BigDecimal class is designed around accuracy and has an almost infinite size. However: this choice will almost always come back to bite you, or someone else attempting to find a minor bug later on. It’s an especially bad idea when it comes to banking applications. So why is BigDecimal not well suited to storing monetary values then? For one thing: it doesn’t behave in a manner that is practical for financial systems. While this statement flies in the face of everything you’ve probably been taught about this class, it’s true. Read-on and I’ll tell you why.

Theres more numbers in there than you think!

A BigDecimal is not a standard floating point number. Instead: it’s a binary representation of a number. This means: 0.0 != 0.00. While this doesn’t seem like a problem at first: I’ve seen it cause no-end of strange little bugs. The only way to accurately determine whether two BigDecimal objects have an equal value is by using the compareTo method. Try these two little unit tests:

@Test
public void testScaleFactor() {
    final BigDecimal zero = new BigDecimal("0.0");
    final BigDecimal zerozero = new BigDecimal("0.00");

    assertEquals(zero, zerozero);
}

@Test
public void testScaleFactorCompare() {
    final BigDecimal zero = new BigDecimal("0.0");
    final BigDecimal zerozero = new BigDecimal("0.00");

    assertTrue(zero.compareTo(zerozero) == 0);
}

This technique works when you’re in control of the data and the comparison, but it breaks when you want to put a BigDecimal object into most other Java data-structures. I’ve actually seen someone use a BigDecimal as a key to a HashMap, which of course didn’t work. The solution in this case was simple: change the HashMap for a TreeMap and things were happy. However it won’t always be this simple.

They’re true high precision structures.

This doesn’t just mean that they are precise, it also means that they won’t run any calculation that wouldn’t result in a representable answer. Take the following code snippet as an example:

@Test
public void testArithmatic() {
    BigDecimal value = new BigDecimal(1);
    value = value.divide(new BigDecimal(3));
}

Primitive numeric types would just swallow this and represent the 0.3* as best they could, while a BigDecimal throws an ArithmeticException instead of attempting to represent a recurring number. In some cases getting an error will be desirable, but I’ve actually seen someone resolve the ArithmaticException like this:

try {
    return decimal1.divide(decimal2);
} catch(ArithmaticException ae) {
    return new BigDecimal(decimal1.doubleValue() / decimal2.doubleValue());
}

Yes folks, unfortunately I’m quite serious here. This is the sort of bug introduced by an error occurring, computations stop running, and someone adds a “hack” to just “make it work quickly and we’ll fix it later“. It’s a total disaster, but I see it far to often.

They don’t play nice with Databases.

According to the JDBC spec database drivers implement a getBigDecimal, setBigDecimal and updateBigDecimal functions. They seem like a great idea, until you ponder that your database may not have a suitable storage type for these values. When storing a BigDecimal in a database, it’s common to type the column as a DECIMAL or REAL SQL type. These are both standard floating-point types, with all the rounding errors that implies. They are also limited in capacity and will often overflow or cause a SQLException when attempting to store very large BigDecimal values.

The only practical solution which will keep all the BigDecimal functionality and accuracy in a database is to type the amounts a BLOB columns. Try to imagine the following table structure if you will:

CREATE TABLE transactions (
    initial_date DATETIME NOT NULL,
    effective_date DATETIME NOT NULL,
    description VARCHAR(30) NOT NULL,
    source_id BIGINT NOT NULL,
    destination_id BIGINT NOT NULL,
    in_amount BLOB NOT NULL,
    in_amount_currency CHAR(3) NOT NULL,
    effective_amount BLOB NOT NULL,
    effective_amount_currency CHAR(3) NOT NULL,
    charge_amount BLOB NOT NULL,
    tax_amount BLOB NOT NULL
);

That required four different BLOB columns, each one of which will be stored outside of table space. BLOB objects are very expensive both to store, and to work with. Each one often uses it’s own database resources (much like an internal cursor) to read or write the value. This translates to much more time and network usage between your application and it’s database. To add to the misery a BLOB is generally not readable by a SQL tool, one of the major reasons for sticking with a SQL database is that it can be managed from outside of your application.

Performance.

This is often raised as an issue, but ignored in favor of “accuracy”. The performance of BigDecimal is often considered “good enough” for general computing, and it’s fine if you want to add tax to an item every once in a while, but consider the number of interest calculations per month a moderate sized bank do. This may seem like an extreme case, but if your application ran a simple shipping and tax calculation for items on an online store in a JSP you’ve got effectively the same problem. In a very simple multiplication test BigDecimal performed over 2300 times slower than a simple long value. While this may only be milliseconds per mutation, a performance-factor of this size very quickly adds up to more computational time than is actually available to the system.

Also remember that BigDecimal (like most Number subclasses) are immutable. That means every calculation requires a copy of the existing BigDecimal. These copies are generally cleaned away by the eden-space collector (and G1 is very good at handling them), but when you put such a system into production it leads to a massive change in your heap requirements. Your BigDecimal objects must be allocated in such a way that a minimum number of them survive a garbage collection, the memory requirement of such a space quickly spirals out of control.

To add to the performance argument: the compareTo method is quite a bit slower than the equals method, and gets significantly slower as the size of the BigDecimal increases.

A Cure to BigDecimal Woes:

A standard long value can store the current value of the Unites States national debt (as cents, not dollars) 6477 times without any overflow. Whats more: it’s an integer type, not a floating point. This makes it easier and accurate to work with, and a guaranteed behavior. You’ll notice that several different behaviors in BigDecimal are either not well defined, or have multiple implementations. That said: depending on your application you may need to store the values as hundredths or even thousandths of cents. However this is highly dependent on your application, and theres almost always someone who can tell you exactly what unit the business works in. Bare in mind also that there are often de-facto (or even mandated) standards which exist between businesses about what unit of money they deal in, using more or less precision can lead to some serious problems when interfacing with suppliers or clients.

The mechanism I generally try to use is a custom-built MoneyAmount class (each application has different requirements) to store both the actual value, and it’s Currency. Building your own implementation opens the opportunity to use factory methods instead of a constructor. This will allow you to decide on the actual data-type at runtime, even during arithmetic operations. 99% of the time, an int or long value will suffice – when they don’t the implementation can change to using a BigInteger. The MoneyAmount class also enables you to define your own rounding schemes, and how you wish to handle recursive decimal places. I’ve seen systems that required several different rounding mechanisms depending on the context of the operation (currency pairs, country of operation and even time of day). For an example of this kind of factory discussion: take a look at the source-code for the java.util.EnumSet class. Two different implementations exist: the RegularEnumSet class uses a long to store a bit-set of all the selected constants. Given that very few enum values have more than 64 constants this implementation will cover most cases, just like a long will cover most requirements in a financial system.

Summary

This post is to warn people who are busy (or about to start) writing a system that will run financial calculations and are tempted to use BigDecimal. While it’s probably the most common type used for this purpose in the “enterprise” world, I’ve seen it backfire more times than I care to recount. My advise here is really to consider your options carefully. Taking shortcuts in implementation almost always leads to pain in the long-run (just look at the java.util.Properties class as an example of this).

Advertisements

6 Reasons to Develop Software on a MacBook Pro

Most of my life I’ve worked on either Windows or Linux (in various flavors). Most of my software development experience has been on a Linux machine, and they’re amazingly productive when compared to Windows. The main reason for the increased productivity is that Linux machines require significantly less configuration for developers. Most of the toolchain you use in development is already there, or can be installed and configured into place in under 10 clicks of your mouse.

Recently however I’ve been using a MacBook Pro, and I’ve gone from a fanatical devotion to my Linux machine to a fondness for the Apple. I decided to write out a list of the things I’ve noticed about these machines that makes them so ideal for software development.

Note: this is really just my opinion on the subject, don’t take it to seriously 😉

Spotlight is Productivity Defined

There are several desktop based search tools for both windows and linux, but they don’t even come close to how powerful Apple Spotlight is. If you’re used to a keyboard driven environment you’ll absolutely love it. Not only for just starting your software, but for amazingly quickly finding that document you were reading the other day… the one that said something about “too much fresh fruit”.

It’s BSD Based Underneath

Thus a flavor of Unix. Whatever Unix derivative I’ve developed on over the years, they have always been (for me) more productive than Windows. I’ve found that the single most important reason for this actually appears to be the implementation of the file system layer. Unix implementations tend to have a more robust and flexible file system implementation, while Windows seems to be of the opinion that it’s always going to be a desktop machine, and if something goes wrong you can shut it down and do a scandisk (and yes I have used Windows 7, it’s still broken).

The better file system layer plays directly into software development. Source control tools such as Subversion, databases and web servers all tend to have less problems on a Unix file system than on Windows. Yes the tools could probably be modified to better handle the Windows filesystem, but should that really be necessary?

Gesture Recognizing Touch Pad

Spend in hour or two using the touch pad on a Mac and you may find it difficult using any other touch pad. The ability to switch from scrolling to clicking to navigating without actually moving to different areas of the pad makes a huge difference.

The other thing everyone comments on, and I’ll definitely add my voice to is the size of the touchpad. It’s so huge I don’t miss a mouse in the slightest. In fact quite the opposite, I would far rather develop software with a MacBook Pro touchpad than with a mouse. Main reason: it’s much closer to the keyboard, I don’t need to move my hand over to the mouse to scroll or navigate code or documentation.

Finally (and a bit unusual maybe), the ability to smoothly scroll horizontally as well as scale (with the pinch gesture) makes viewing diagrams much easier than previously. Most touchpads emulate a mouse-wheel when they scroll, and most diagram software tends to interpret this as very slow scrolling, or scrolls in “lines”. Both of these behaviors are very annoying and effectively wrong.

The Keyboard is Incredible

It’s not just the fact that it’s a clicklit keyboard, it’s the size and spacing of the keys. The arrangement of the keyboard is similar to a desktop keyboard, but since it’s a laptop the keys don’t sink as far, they tap away with a very satisfying “click” sound. The backlight is also brilliantly useful for late-night coding, but you can find that on many different laptops today.

The default keyboard shortcuts also make abundant sense when you’re using these keyboards. Some of the characters are not quite where you expect to find them at first, after a short time of using them I’m wondering why all QWERTY keyboards aren’t arranged like this. Finally: many of the symbols that you’d normally open a character map application for are accessible directly from the keyboard. It’ll take you a bit of time to get used to the fact that you can type a ± or Ω with just 2 keys, but it’s a brilliantly useful ability.

It’s Just Built Better

Having looked at a large number of laptops, both off-the-shelf and customized builds I’ve found that the MacBook Pro is actually surprisingly good value for money: if you know whats inside. It’s small details such as one laptop using a Core2-Duo while the same priced MacBook is an i5 CPU. Theres also the size of the battery, which for some is less important than others. If you take a look at the image on the right you’ll see just how much physical space the battery takes up in the 15″ variation of the MacBook Pro: it’s huge! It’s actual capacity is 77.5 watt-hours, compared to a similarly priced Lenovo laptop which has just 48 watt-hours.

Another “feature” is the fact that the MacBooks all have DDR3 memory in them where most of the competition is still using DDR2 (and often really cheap DDR2 at that). The MacBooks also generally have more memory by default than similarly priced machines.

Other tiny features that make a huge difference are the amazing lack of cooling vents on the case, the mag-safe power connector that simply falls away when someone kicks your cable, the battery meter on the outside of the laptop.

Add to all of this that the MacBook is generally thinner than it’s competitors and has an aluminium case instead of the normal plastic nonsense and you’ll find that the MacBook will not just out-perform it’s competition but also just feels more solid when you carry it around.

It Just Works

This sounds really corny, and it’s generally a bit of a Mac idiom, but it’s true. When I plugged my Android phone in to do some on-the-device debugging work, all I had installed with the stock Android SDK. With no special device drivers or funny setup work, the device just worked. I set breakpoints, hit them, edited variables, simply put: I controlled the phone from my MacBook with no special setup or configuration. All things considered, this is quite amazing.

Where C++ Templates Fall Down

There are many articals out there detailing comparisons between C++ Templates and Java Generics (Google around and you’ll quickly find loads of them). Almost universally they view the erasure of Java generics as a fail-point. Even the ones written by “Java Gurus” seem to favour C++ templates as being more flexable. I recently stumbled upon a simple case that I’ve been doing in Java for ages, and while trying to implement the same concept in C++ discovered that it’s impossible.

The case is a generic Visitor model:

public interface Visitor<R, P> {
    R visitFirstNodeType(FirstNodeType node, P parameter);
    R visitSecondNodeType(SecondNodeType node, P parameter);
}

public abstract class AbstractNodeType {
    public abstract <R, P> R accept(Visitor<R, P> visitor, P parameter);
}

Looks simple enough, but try and implement it in C++ and you discover that you cannot use templates on a virtual method. At first this doesn’t make sense (even to a friend of mine who lives in C++). The simple reason is the way templates are handled: they are code generation shortcuts. Therefore the compiler needs to know the exact method to run, but by declaring the method virtual you are telling the compiler to deferre that descision until runtime. In Java the generics are “erased” (although they are kept in the signature), resolving to a single method implementation. This means that there is no problem trying to decide which method to run.

C++ can create the same structure without templates, but then you loose the type-safety contract that you defined in Java. It’s not really a fail-point on C++ (rather just a product of the implementation), but it is something that seems to be overlooked on the Java side of the fence. Generic erasure buys you fairly cool compile-time contracts.

Declaring things final in Java

This is probably a really weird thing to be blogging about. I mean: whats to know, really? The “final” keyword is a pretty underused feature of Java. It’s most common (for some people it’s only use) is for static fields:

public static final int ONE = 1;

However, there are several tricks to final that some people know nothing about. For example: when used on a field, you don’t have to assign a value immediately. Amazingly, the following code is in fact valid:

public class NumberStringThingy {
private final String value;

public NumberStringThingy(int number) {
if(number < 0) { value = "negative"; } else { value = Integer.toString(number); } } } [/sourcecode] This is a really awful example in some ways, but it serves the point. You can make a decision in the constructor about what to the field, so long as every possible branch leads to the field being assigned. You can also declare local variables and parameters as final. The final keyword can serve as a hint to the Hotspot compiler at runtime. In some ways the opposite to volatile. There is another, much more important reason to declare fields, variables and parameters (almost all of them in fact) as final:

It makes your code more readable.

I started experimenting a few months ago, declaring anything that didn’t change value as final. The result: I can read my code much faster now, and my older code? When I see things that could be final, I’ve started changing them. It’s a massive psychological effect it has: if a variable isn’t final, it must change somewhere. It means that with a glance at the code, you can be very certain what the value of a variable is.

I strongly recommend giving this a try for a few weeks, you may find it very hard to go back to normal.

Note: I don’t recommend declaring your classes and methods final unless it really makes sense. Just fields, variables and parameters.

Misleading Microbeanchmarks

You’ll often see people show a micro-benchmark to prove a point. I can think of many examples offhand:

  1. Using a synchronized singleton SimpleDateFormat is slower than tying them to a ThreadLocal or putting them in a pool
  2. Unfair ReadWriteLocks cause write-lock starvation (writeLock.lock() will never return)
  3. Using a HashSet of String’s is much faster than doing an indexOf()

These are all true in the confines of a micro-benchmark. The problem is: micro-benchmarks are misleading. For example, number 2. I recently read a micro-benchmark proving that under high contention situations the writeLock may never be acquired. Firstly the documentation does tell you this:

The constructor for this class accepts an optional fairness parameter. When set true, under contention, locks favor granting access to the longest-waiting thread. Otherwise this lock does not guarantee any particular access order.

Most people assume that the write-lock will have preference over read-locks. However, this is a general-purpose implementation of a ReadWriteLock (ReentrantReadWriteLock). That means it’s designed to work well (and fast) under most situations. The fast is: in most situations, lock contention is relatively low and/or fluctuates over time.

What does this have to do with micro-benchmarking? A micro-benchmark is generally built not to simulate the fluctuation of the systems load. A micro-benchmark is generally designed to simulate a very high load, often where the only thing done is what is being tested (formatting dates, acquiring locks, etc). This results in a very unrealistic picture of whats actually going on.

If we take 3 for example. Looking for a specific substring in a String list (for example “AN” in a comma separated String containing “GH,JK,IH,TA,AN,FR,MN,SA”), there are several different approaches to this:

  • Simply use indexOf to see if the substring exists
  • Use String.split(“,”) and then iterate through the resulting array to find the substring
  • Add the substring tokens to a HashSet and use the contains() method
  • Write a method to iterate through the String and test the substring tokens without using substring (ie: a char[] or some such).

Obviously the fastest (assuming the data can be re-used) will almost always be to add the strings to a HashSet. However if the HashSet is not kept around for re-use later, this is the slowest method. The fastest in this case would be indexOf() or a specialized method. In a micro-benchmark however, it would be easy to prove that a HashSet is by far the fastest, when the String is actually only tested once every day. In which case, whats the point in optimizing?

Don’t guess, measure

When dealing with performance problems, I’ve noticed an alarming trend: Profiling seems to be something special! Profiling should be your first stop when trying to improve performance, and not pretend profiling either. When you profile a large system, you should profile parts of the production system.

Production Profiling 1.0.1

This doesn’t mean you have to put your profiling into production. I’ve often used a record / playback system for profiling. A recorder of sorts writes a binary log file detailing the actual actions of real users on the production system. A profiling system is then setup (after a few weeks of recording), and you play back the recording, capturing the profiling information. This method means that your users don’t experience bad performance because of the profiling instrumentation of the code.

Caching Data

A common answer to a slow system is to start caching some of the information. But what information should you cache? How long should it be lived? How big should to cache be? Most decent caching systems offer metrics that will tell you what the cache is doing, use this information! There’s not point in gathering information you don’t use.

On-going Profiling

Finally, there are parts of your system you will always want to know about. These parts should continuously feed you information about their performance, information like:

  • How often that special piece of optimized code is actually being used
  • How much memory is being consumed in the process
  • How long it takes to execute
    • Both the entire process, and key sub-parts of the process
  • Which users make the most use of the feature
  • How often that area of the system is used by day / week / month

If you’re using Java, I would strongly recommend looking into writing some MBeans. JConsole will give you graphing of your stats for free, and also makes it very easy to create custom views of you MBeans.

Measured insight into your system is possibly the only way to truly improve the performance of a system.

EoD SQL 2.0-beta is up and about

Thanks to a lot of bug reports and feature requests, I’ve released EoD SQL 2.0-beta. What’s changed? Not a massive amount is visible, besides a long list of bug fixes:

  • QueryTool.select is now working again, and includes a new variant that uses a Connection instead of a DataSource.
  • Added “char” and “Character” as default primitive types
  • The Java primitive TypeMappers (short, int, long, boolean, etc.) now use valueOf instead of new

Some of the bug-fixes are very important, so if you’re using EoD SQL 2.0-alpha you’ll want to download this new version asap.