Why memcached is probably a bad idea for your Java app

memcached has gotten a lot of press lately, and in the web-application world people are swallowing it as a cure-all for performance-related problems. However, in a Java application I would now consider memcached not just a bad idea: it’s in fact harmful.

Don’t get me wrong: memcached is a great little piece of software. It’s simple (both in itself and in its usage), and that’s one of its greatest virtues.

So now, why is it a bad idea for a Java app?

Remote caching

memcached is a 100% remote cache. Every time you get an object from memcached, it does a network fetch. In a PHP / Python / Perl style application, where you likely have memory separation, this is a logical course of action. In Java, however, there’s often a good chance you already have that object lying around locally, especially since many smaller clusters are configured with sticky sessions. The Java clients for memcached that I have seen make no attempt to bind the object through even a WeakReference map of some sort. Therefore, every time you get something from memcached, you allocate more memory for an object you may already have dozens of copies of in memory. Considering the amount of memory thrown at servers today, it can take the garbage collector ages to get to those objects.

Another side effect of this structure is that when storing Java objects, you incur a serialization overhead in both directions. While not a huge expense, it takes a bit of the edge off.

Storing Database Objects

Most of the use of memcached seems to be storing database objects. This is something your database layer in Java may well be doing for you already! Tools like Hibernate already have caches built in to reduce the number of database hits. Yes, memcached is a distributed cache, but so is SwarmCache. SwarmCache and similar products have the added advantage of existing within your Java VM, and thus not producing extra copies of the objects fetched.

Another point I’d like to make: objects fetched from the DB in a web application are mostly used to build one or more pages. Why not rather store the fully rendered page, or at least parts of it, in memcached?

2GB Memory Limit

Okay, this is in some ways one of memcached’s advantages. However, when you confront it with the fact that a Java VM can quite happily cope with 10GB of memory, why start multiple processes instead of just one? Why burden your scheduler (which is probably a server non-pre-emptive scheduler) with the additional load? Once again, this also relates to the “not inside the VM” problem.

Very Large Hash-Tables

Hash-tables inherently get slower as you put more objects in them. This is generally not such a big deal, but breaking your cache into multiple smaller caches can have a huge impact on performance. Rather cache each object type in its own cache (which is generally what you want to do anyway). Each memcached instance running in your network is one nice big hash-table. Given that most objects take up only a few KB of memory, and that memcached instances normally allocate the full 2GB limit, there are often hundreds of thousands of objects stored in each memcached instance.
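
To make the "one cache per object type" idea concrete, here is a minimal sketch. The class name TypedCaches and its methods are made up for illustration and do not come from any caching library; it only shows the partitioning, and makes no claim about how much faster it is in practice.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry that keeps one small cache per object type. */
final class TypedCaches {
    // Outer map: object type -> that type's own cache (key -> value).
    private final Map<Class<?>, Map<Object, Object>> caches = new ConcurrentHashMap<>();

    /** Stores a value in the cache belonging to its type, creating that cache on first use. */
    <T> void put(Class<T> type, Object key, T value) {
        caches.computeIfAbsent(type, t -> new ConcurrentHashMap<>()).put(key, value);
    }

    /** Looks only in the cache for the given type; returns null on a miss. */
    <T> T get(Class<T> type, Object key) {
        Map<Object, Object> cache = caches.get(type);
        return cache == null ? null : type.cast(cache.get(key));
    }

    /** Number of entries cached for one type. */
    int size(Class<?> type) {
        Map<Object, Object> cache = caches.get(type);
        return cache == null ? 0 : cache.size();
    }
}
```

A side benefit of this layout is that the same key (say, a database ID of 1) can be used for two different object types without colliding.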

So what do I suggest?

So I’ve outlined a few problems here, but what sort of solutions would I suggest? First I would say take a look at pure Java caching solutions such as SwarmCache. SwarmCache itself is fairly old now, and hasn’t seen much activity in a while, but it’s stable and well known.

If you really want to ride the memcached wave with Java, I would suggest putting a local cache in front of it with WeakReferences. This means you will never fetch an object from memcached that is already within your VM.
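
A rough sketch of such a local front follows. Everything here is an assumption for illustration: WeakFrontCache is a made-up name, and the remoteFetch function stands in for whatever get call your memcached client provides. An object is only served locally while something else in the VM still holds a strong reference to it; once it has been garbage collected, the next lookup goes back to the network.

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical local front for a remote cache; values are held only weakly. */
final class WeakFrontCache<K, V> {
    private final ConcurrentMap<K, WeakReference<V>> local = new ConcurrentHashMap<>();
    private final Function<K, V> remoteFetch; // stand-in for a memcached client lookup

    WeakFrontCache(Function<K, V> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    V get(K key) {
        WeakReference<V> ref = local.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value != null) {
            return value; // still alive in this VM: no network trip, no new copy
        }
        if (ref != null) {
            local.remove(key, ref); // referent was collected: drop the stale entry
        }
        value = remoteFetch.apply(key); // fall back to the remote cache
        if (value != null) {
            local.put(key, new WeakReference<>(value));
        }
        return value;
    }

    /** Call this when the entry is updated or dropped remotely. */
    void invalidate(K key) {
        local.remove(key);
    }
}
```

Two threads that miss at the same moment may both go to the remote cache; for a cache that is usually acceptable. Note that the local copy knows nothing about changes made by other machines, so this works best with objects that are invalidated explicitly or never change.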

A final, but important note about caching: I would strongly suggest making your cached objects immutable! This will save you from the possibility of concurrent modification. If changes are required, use a Builder that can take an existing object as a template.
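
As a sketch of that pattern, assuming a made-up CachedUser class with three illustrative fields: the object itself has only final fields and no setters, and a change is made by copying it into a Builder, adjusting what differs, and building a new instance.

```java
/** Hypothetical immutable cache entry: final fields, no setters. */
final class CachedUser {
    private final long id;
    private final String name;
    private final String email;

    private CachedUser(Builder b) {
        this.id = b.id;
        this.name = b.name;
        this.email = b.email;
    }

    long getId() { return id; }
    String getName() { return name; }
    String getEmail() { return email; }

    /** Mutable helper; only the Builder changes, never a built CachedUser. */
    static final class Builder {
        private long id;
        private String name;
        private String email;

        Builder() { }

        /** Starts from an existing object, used as a template. */
        Builder(CachedUser template) {
            this.id = template.id;
            this.name = template.name;
            this.email = template.email;
        }

        Builder id(long id) { this.id = id; return this; }
        Builder name(String name) { this.name = name; return this; }
        Builder email(String email) { this.email = email; return this; }

        CachedUser build() { return new CachedUser(this); }
    }
}
```

Updating an entry then looks like `new CachedUser.Builder(original).email("new@example.com").build()`: the cached original stays untouched for any thread still reading it, and the new object replaces it in the cache.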

13 Responses to “Why memcached is probably a bad idea for your Java app”

  1. Dhan Says:

    You seem to have taken a very simplistic approach to caching. While I agree that memcache is remote caching and that there are remoting, marshalling, and memory allocation overheads, there simply is no other (easy) way around it for a distributed system. When requests that need the same data can hit any one of a farm of servers, then where else do I cache the data other than a distributed cache?

    I would like to add to the list of “suggestions” –
    When the content being cached belongs to a restricted entity (like a single user / session), then try to hold the data in a JVM cache (Session / singleton objects). Take advantage of session-affinity-based redirection in load balancers.

  2. Jason Says:

    Hi Dhan, thanks for the comment.

    I know well that marshalling and memory allocation overheads are a basic part of networking. Mostly what I was pointing out is that because memcached doesn’t run in a Java VM, it can’t take advantage of the fact that Java applications often do their own load balancing and may already have the cached data in memory.

    My main point: take a look at Java caches before choosing memcached, unless you’re not writing in Java.

  3. Dhan Says:

    “take a look at Java caches before choosing memcached” — doing that in a distributed server environment is tough. Let me illustrate :
    I have 2 servers (A & B). Now, a user hits Server A and updates some data, causing the cache to be dropped / updated. With memcache-like non-redundant replication this is easier, since the key for the cache will resolve to one particular cache instance, thus invalidating the cache in one shot. However, with distribution-aware caches like SwarmCache, the change has to be broadcast to the entire cluster of cache servers, each of which potentially has to be updated.
    With very high memory these days, if some data is static (or can be assumed so in some scope), then the data can be cached locally within a JVM as singleton objects. However, I am sure that no one will use a product like memcache for basic non-changing data that can be cached within a JVM. For all the rest, a memcache-like service is better.
    Ain’t it?

  4. Jason Says:

    I would advise you to look closer at pure Java caches. Most of them don’t update the entire distributed cache, since as you point out: it’s a cache. Some have highly configurable option stacks which allow 100% replication, power-of-three replication, last-accessed replication and many other options.

  5. Nemanja Says:

    I read the article and all the comments, and I must say that Jason didn’t answer the question: what to use instead if you have a farm of servers and you must track a session for each user?

    • Jason Says:

      Session management for users is a standard feature of the Servlet containers. Why would you introduce memcached to manage sessions in a Java environment? The standard session management is generally going to be much more efficient (and stable) than using memcached (unless you’re using Jetty).

      If you’re concerned with session migration, have your load balancer run sticky sessions, which will provide a much faster solution than memcached, since the users’ sessions will never need to move across the network (we run this way on 80 machines with no problems).

      If you are going to use memcached with Java, use it for a cache. Tracking sessions with memcached in Java is a really bad idea (not to mention a terrible waste of resources).

      • Nemanja Says:

        As far as I know, if someone turns off cookies in the browser (or on mobile, where many browsers don’t support cookies), then the servlet container will track the session using jsessionid, for example. And that session will not be remembered when you come to the site again. And what if I need to track one customer for a whole year, for example?

        As for the load balancer concerns, that is a good idea. And we are using it, already.

        P.S. You could be more polite to people who post here. Just a suggestion. :)

        • Jason Says:

          I apologize if I came across as impolite, that was not my intention.

          As for the cookie problem, there is a solution in the servlet API: HttpServletResponse.encodeURL. This method will automatically detect if the user doesn’t have cookies, and encode URLs with the session ID if they don’t. It’s considered bad practice when it comes to SEO; however, the crawlers all seem to support cookies as well (so they will only see it on one page).

          To fix the URLs in JSP pages, you can use the JSTL tag <c:url> to both produce an absolute path to a resource, and encode the session ID if required.

          I’m curious to know how you would go about tracking a customer for a whole year with memcached, considering it’s likely to throw the session away long before that time is reached, and without cookies or some form of URL encoding you can’t tell one user from the next.

          You could crank the session timeout as high as you like. Inactive sessions are serialized to disk (some servlet containers also do this between restarts), and so their timeout is guaranteed rather than based on load.

  6. Nemanja Says:

    Whole year is just for example.
    We are using URL rewriter to extract sessionid.
    And I don’t want to go off-topic here. I was just curious how to store sessions on multiple servers, and that’s it. And your answer is load balancer over memcache.
    Thank you. :)

  7. Enzo Says:

    So why do java based websites suck so much?

    PHP / Memcached sites are light years ahead of Java sites in terms of speed. Look at any typical Java based banking site…my god are they slow.

    • Jason Says:

      Quite frankly, most banks hire monkeys to do their coding (and yes, I’ve worked in that sector). That said, my bank’s IB site is much faster than the local competition, and it’s the only one built in Java.

      I’ve seen a PHP dev coding Java (apparently for 2 years already) make their doGet and doPost methods synchronized in a Servlet. It’s not that PHP is simpler, just Java has more intricacies (like threads) that monkey devs don’t “understand”.

      One other point on Java and corporates is: most go with J2EE sites in the most stupid ways imaginable. For example:

      Data Object -> Data Store EJB -> JDBC Driver -> Database

      See the additional network round-trip in there? That structure is the norm with Java monkeys, and it’s just plain idiotic. I have a very low opinion of most J2EE developers (there are a few exceptions), since most seem to be covering their own incompetence with a framework.

  8. Brett Says:

    Anyone who is a student of truly large scale server design knows by now that Java servlet sessions and session replication don’t scale. I suggest two items of study. First, watch the Google IO sessions from 2008/09. Strongly dis-recommended is almost any kind of what has traditionally been known as “session state” on the server. Everything is moving to the client. The server should be, _as nearly as practicable_ (let me emphasize that), stateless.

    Second item of study reinforces what Google says. Read ebay’s write-up on scalability (google: ebay scalability). Ebay architecture has evolved through the school of hard knocks. One of those hard knocks was servlet sessions. Ebay was simply being crushed in 2002. Now they serve over two BILLION page views PER DAY. Something that, according to them (don’t take my word for it), they could never have done if they had persisted with their server-side session-based architecture.

    This is “Web 2.0” and beyond, kids. Java can scale massively; better, despite any language debate, than any modern language thanks to the JIT. But a well architected Ruby site (for christ sake!) can beat a poorly architected Java one. Let go of your “servlet sessions” and 1990’s server architecture.

    Memcache, combined properly with client-homed sessions and “stateless” servers, is the way of the future. Computing clouds and instant, seamless fail-over are the goal here. “Session” or “server” affinity are a thing of the past. Log in to ebay and look at your cookies (nearing 2 dozen): that’s ALL your state. Do you think any two of your requests go to the same server at ebay? You would be wrong.

    • Alex Says:

      A great post and some insightful comments.

      What you are talking about is a pretty massive infrastructure needed by only a handful of large companies (i.e. ebay, google, etc); for medium-large companies with traffic up to say 1-10 mil pageviews/day , some degree of server-stickiness with 80 odd servers is probably not the worst thing in the world.

