4. Caching considerations
I didn’t want to discuss caching too much, but a couple of sections about it still ended up spread throughout the text. Eventually I decided to concentrate them here and be (mostly) done with the topic for the purposes of this book.
While the persistence context (EntityManager or session) is sometimes considered a cache too, it is merely a part of the unit-of-work pattern (an identity map).
The real cache sits underneath and is shared on the level of the EntityManagerFactory – or, in the case of distributed caches, even between several of them across various JVMs. This is called the second-level cache.11 It is used to enhance performance, typically by avoiding round-trips to the database. But caching has consequences.
Caching control
[JPspec] doesn’t say much about caching. It describes how to configure it – starting with shared-cache-mode in persistence.xml – but I’d still study the caching documentation of the particular provider, because if we don’t care at all, we don’t even know whether and how we use the cache.
Without an explicitly chosen shared-cache-mode it is up to the JPA provider and its defaults, which may render any use of @Cacheable annotations useless. Currently, Hibernate typically doesn’t cache by default, while EclipseLink caches everything by default. Being oblivious to the cache (not related to cache-oblivious algorithms at all) is rather dangerous, especially if our application is not the only one running against the same database. In that case, setting shared-cache-mode explicitly to NONE is by far the best start. We may revisit the decision later, but at least we know what is happening.
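To make the decision explicit, we can set the mode in persistence.xml or pass the equivalent standard property when creating the EntityManagerFactory. The following is a minimal sketch – the persistence unit name demo-unit is made up, and the Breed entity merely illustrates opting in under ENABLE_SELECTIVE:

```java
import java.util.Collections;
import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;

public class CacheSetupExample {

    public static EntityManagerFactory createFactory() {
        // The standard property mirrors <shared-cache-mode> in persistence.xml.
        // NONE disables the shared cache regardless of provider defaults;
        // ENABLE_SELECTIVE would cache only entities marked @Cacheable(true).
        return Persistence.createEntityManagerFactory("demo-unit",
            Collections.singletonMap("javax.persistence.sharedCache.mode", "NONE"));
    }
}

// With ENABLE_SELECTIVE, an entity opts in explicitly:
@Entity
@Cacheable // same as @Cacheable(true)
class Breed {
    @Id
    private Long id;
    private String name;
}
```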
Probably the most important questions to consider are: Is our application the sole user of a particular database? Is it running in a single process? If yes, we may safely use the cache. But I wouldn’t “just use the defaults” of the JPA provider – which may mean no cache at all. It’s not a good idea to use caching without any thought; I prefer not to use it until I feel the need. When we start using the cache and tuning it, we must be prepared for a walk that may not be as easy as it seems.
An entity cache returning entities by ID really quickly sounds like a good idea because it makes the problem of eager loading of to-one relationships less serious. E.g. when we load a dog, we may get its breed quickly from the cache. It doesn’t fix the problem, though, as all the additional entities become part of the current persistence context whether we want them or not – say we want to update a single attribute of a dog and are not at all interested in its breed. We already mentioned that units of work bigger than necessary are not free.
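A hedged sketch of the situation, assuming a Dog entity with an eagerly loaded to-one breed (as in the examples earlier in the book) and a transaction managed by the caller:

```java
import javax.persistence.EntityManager;

class DogRenamer {
    void renameDog(EntityManager em, Long dogId, String newName) {
        Dog dog = em.find(Dog.class, dogId);
        // Breed may have come cheaply from the second-level cache, but it is
        // now managed by this persistence context anyway and will be
        // dirty-checked at flush time although we never touch it.
        dog.setName(newName);
        // The unit of work is bigger than the single attribute we changed.
    }
}
```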
Second-level cache vs queries
Caching should be transparent, but just turning it on is a kind of premature optimization which – in virtually all cases – ends up being wrong. Any auto-magic can only go so far, and any caching leads to potential inconsistencies. I believe most JPA users don’t understand how the cache is structured (I’m talking from my own experience too, after all). This depends on a concrete ORM, but typically there is an entity cache and a query cache.
The entity cache helps with the performance of EntityManager.find, or generally with loading by the entity’s @Id attribute. But it will not help us if we accidentally obfuscate what we want with a query that would otherwise return the same entity. The provider has no way of knowing which entity (with which ID) will be loaded just by looking at arbitrary where conditions. This is what the query cache is for.
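To illustrate, a sketch assuming the Dog entity again; the hint shown is Hibernate-specific (org.hibernate.cacheable) and only helps when the query cache is enabled in the provider’s configuration:

```java
import javax.persistence.EntityManager;

class DogLookups {
    void compareLookups(EntityManager em) {
        // Lookup by primary key - this is what the entity cache can serve.
        Dog dog = em.find(Dog.class, 1L);

        // Semantically the same result, but the provider does not analyze
        // the where clause; without a query cache this hits the database.
        Dog sameDog = em.createQuery(
                "select d from Dog d where d.id = :id", Dog.class)
            .setParameter("id", 1L)
            .setHint("org.hibernate.cacheable", true) // provider-specific
            .getSingleResult();
    }
}
```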
Bulk updates and deletes using JPQL bypass both of these caches, and the safest way to avoid inconsistent data is to evict all entities of the modified type from the caches. This is often performed by the ORM provider automatically (again, check the documentation and settings).
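A sketch of manual eviction using the standard JPA Cache API; the vaccinated attribute on Dog is made up for illustration:

```java
import javax.persistence.Cache;
import javax.persistence.EntityManager;

class BulkUpdateExample {
    void markAllDogsVaccinated(EntityManager em) {
        // A bulk JPQL update goes straight to the database, bypassing both
        // the persistence context and the second-level cache.
        em.createQuery("update Dog d set d.vaccinated = true")
            .executeUpdate();

        // The standard JPA Cache API lets us evict stale entries ourselves
        // if the provider does not do it automatically.
        Cache cache = em.getEntityManagerFactory().getCache();
        cache.evict(Dog.class); // drops all cached Dog instances
    }
}
```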
If we only work with whole entities all the time and nothing else accesses the database, we can be pretty sure we always get the right result from the entity cache. You may wonder how this cache behaves in a concurrent environment (which any EE/Spring application inherently is). If you imagine it as a Map, even one with synchronized access, you may feel the horror of getting the same entity instance (a Dog with the same ID) in two concurrent persistence contexts (like concurrent HTTP requests) that subsequently modify various fields on the shared instance. Luckily, ORMs provide each thread with its own copy of the entity. Internally they typically keep entities in the cache in some “dehydrated” form.12
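Conceptually – this is not any provider’s actual implementation, and the Dog accessors are assumed – the idea can be pictured like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The shared cache stores no live entity objects, only their state
// as plain values ("dehydrated" form).
class NaiveSecondLevelCache {
    // entity id -> column values in a fixed order
    private final Map<Object, Object[]> dogRegion = new ConcurrentHashMap<>();

    void put(Object id, Object[] state) {
        dogRegion.put(id, state.clone());
    }

    // Each persistence context "hydrates" its own private instance,
    // so two concurrent requests never share one mutable Dog.
    Dog load(Object id) {
        Object[] state = dogRegion.get(id);
        if (state == null) {
            return null;
        }
        Dog dog = new Dog();
        dog.setId((Long) state[0]);
        dog.setName((String) state[1]);
        return dog;
    }
}
```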
Explicit application-level cache
Previously we ignored the case when we cache some entities under multiple different keys, not necessarily retrieved by the same method. Imagine a component that returns some classifiers by various attributes (ID, name, code, whatnot) – this is a pretty realistic scenario. We code it as a managed component with a declarative cache, with separate methods to obtain the classifier by specific attributes. If we retrieve the same classifier by three different attributes, we’ll populate three different caches with – essentially the same – entity stored under a different key (attribute value) in each of these caches.
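A sketch of such a component, assuming Spring’s declarative caching and a hypothetical Classifier entity with name and code attributes:

```java
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class ClassifierService {

    @PersistenceContext
    private EntityManager em;

    @Cacheable("classifierById")
    public Classifier byId(Long id) {
        return em.find(Classifier.class, id);
    }

    @Cacheable("classifierByName")
    public Classifier byName(String name) {
        return em.createQuery(
                "select c from Classifier c where c.name = :name",
                Classifier.class)
            .setParameter("name", name)
            .getSingleResult();
    }

    @Cacheable("classifierByCode")
    public Classifier byCode(String code) {
        return em.createQuery(
                "select c from Classifier c where c.code = :code",
                Classifier.class)
            .setParameter("code", code)
            .getSingleResult();
    }
    // Fetching the same classifier through all three methods stores three
    // detached copies of essentially the same entity under three keys.
}
```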
Even if we ignore that the entities sometimes participate in the persistence context and sometimes don’t, this consumes more memory than necessary. It may still be acceptable, though. Personally I believe the non-determinism regarding the attached/detached state is more serious, but let’s say these entities are only for reading and we may not care. Imagine further that we may filter on these entities – like “give me a list of classifiers with names starting with BA”. Now we have even more variability in cache keys – any distinct filter is a key – and probably many more repeated entities in the results. But these are likely distinct objects even for the same logical entities. This may either explode our cache, or cause frequent evictions rendering our cache useless, probably utilizing more CPU in the process.
If the amount of underlying data is big, we may have no other choice, but for various administered static data, code books or classifiers, the size of the table is typically small. Once our DB admins reported that we queried a table with 20k rows of static data 20 million times a day – while the traffic on our site was on the order of thousands of requests a day. It was an unnecessarily rich relational model, and this part of the system would have been better represented in some kind of document/NoSQL store. We didn’t use the relations in that data much – and they actually prevented some needed data fixes because of an overly restrictive cobweb of foreign keys. But this was the design we had and we needed a fix. “Can’t we just cache it somehow?” The data were localization keys – not for the application itself but for some form templates – so they were part of the data model. We joined these keys based on the user’s language for each row of a possibly multi-row template (often with hundreds of rows).
First we needed to drop the joins. The plan was to ask a cache component for any of these localization keys based on its ID and the language. It took us some time to rewrite all the code that had originally just joined the data in queries, but the result was worth it. The component simply read the whole table and created a map of maps keyed by language and ID. The DB guys were happy, we stopped executing some queries altogether, and we removed unnecessary joins from many others.
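A sketch of such a component follows; the LocalizationKey entity and its attributes are made up for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.persistence.EntityManager;

// The whole (small) table is read once and kept as language -> id -> text.
public class LocalizationKeyCache {

    private final Map<String, Map<Long, String>> byLanguageAndId = new HashMap<>();

    public LocalizationKeyCache(EntityManager em) {
        List<LocalizationKey> keys = em.createQuery(
                "select k from LocalizationKey k", LocalizationKey.class)
            .getResultList();
        for (LocalizationKey key : keys) {
            byLanguageAndId
                .computeIfAbsent(key.getLanguage(), lang -> new HashMap<>())
                .put(key.getId(), key.getText());
        }
    }

    public String text(String language, Long id) {
        Map<Long, String> byId = byLanguageAndId.get(language);
        return byId != null ? byId.get(id) : null;
    }
}
```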
There are other situations when we may seriously consider coding our own cache explicitly rather than relying on a declarative one like JSR 107. A declarative cache doesn’t mean an unattended one anyway – we should limit it, set eviction policies, and so on. It can be extremely handy when we can get results cheaply, which typically happens when a limited subset of a possibly big chunk of data is used repeatedly with the same keys.
A programmatic (explicit) cache can shine in other areas:
- When we work with a limited set of data that we need often – and we want to pre-cache it. This may be a reasonable scenario for a declarative cache too, if we can pre-fill it somehow and there is a single way we obtain the data.
- When we require the same data based on various attributes (different views). We can use multiple maps that point to the same actual instances (see the sketch after this list). This works both when we cache as we go (misses are possible) and when we preload the whole set (the ideal case when it is not too big).
- The cache can cooperate with any administrative features that modify the underlying data. Because we code it with full knowledge of the logic behind the data, we can selectively refresh or update the cache in a smart way. With a declarative cache we often have to evict everything – although complete eviction can still be a good, simple strategy even for programmatic caches, especially when the refresh requires only a single select.
- A full preload requires more memory up-front and slows down the startup of the application (though it can be done lazily on demand), but it deals with the DB once and for all. A declarative cache executes a query for every miss but loads only the data that actually require caching – potentially less efficiently than a full preload.
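As promised above, a sketch of the “multiple views” idea, with two maps pointing to the same preloaded (detached) instances – the Classifier accessors are assumed:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One preloaded set of classifiers exposed under two different keys.
// Both maps reference the same instances, so no entity is duplicated.
public class ClassifierViews {

    private final Map<Long, Classifier> byId = new HashMap<>();
    private final Map<String, Classifier> byCode = new HashMap<>();

    public ClassifierViews(List<Classifier> allClassifiers) {
        for (Classifier classifier : allClassifiers) {
            byId.put(classifier.getId(), classifier);
            byCode.put(classifier.getCode(), classifier);
        }
    }

    public Classifier byId(Long id) {
        return byId.get(id);
    }

    public Classifier byCode(String code) {
        return byCode.get(code);
    }
}
```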
Of course, there are cases when we can’t use a fully preloaded cache. In general, ask yourself whether “cache this by this key” is the best (or a good enough) solution, or whether you can do better with a component that utilizes the logic behind the data – and whether it’s all worth it.
Conclusion
Caching can help tremendously with the performance of our applications – but it can also hurt if done badly. We should be very aware of our cache setup and very clear about how we want to do it. In the code it may look automagical, but it must be explicit somewhere – our strategy must be well known to all the developers who may encounter it in any way (even unknowingly).
We always trade something for something – with caching it’s typically memory for speed.13 Memory can slow us down too, but in general we have plenty of it nowadays. While we stay in a single process, a JPA provider can typically manage data consistency. With a distributed architecture we enter a different world altogether, and I’d think twice before going there. We must feel the need for it and we must measure what we gain – because we’ll definitely gain complexity and will have to think about consistency much more.
Don’t freely mix caching on multiple levels. The database cache is mostly transparent to us, but when we mix two declarative caches we often make matters worse – especially when we cache entities with a technology that is not aware of their lifecycle within the persistence context.
Finally, depending on what our caching keys are, we may waste a lot of memory. The entity ID (as in the second-level cache) is a natural and good key. But if we key on many various selectors that may return the same entities (single ones or even whole collections), we may store many instances of the same entity in the cache, which wastes memory. Knowing more about the logic connecting the keys and values, we may get better results with our own implementation of an explicit cache at the application level. It may require more effort, but the pay-off may be significant.
In short:
- Don’t cache mindlessly. Design what to cache (or not) and size the cache regions properly.
- Realize that with caching we trade memory for CPU – and hope the bigger heap doesn’t cost the GC more CPU than we saved. Check the heap usage after changes.
- Beware of caching on multiple levels, especially combining JPA and non-JPA caches inside JVM.
- Consider implementing your own caching for selected sub-domains. The same data accessed by various criteria may be a good fit for this.
- Measure the impact of any change related to caching.