3. Questionable parts

JPA, like any ORM, is not without its drawbacks. Firstly, it is complex – much deeper than developers realize when they first approach it. Secondly, it is not a perfect abstraction. The more you try to treat it as a perfect abstraction, the worse it gets in the marginal cases. And the margin is not that thin. You may solve 80% of the cases easily, but there are still the 20% of hard cases where you go around your ORM, write native SQL, etc. If you try to avoid that, you’ll probably suffer more than if you accepted it.

We can’t just stay on the JPA level, even for cases where the ORM works well for us. There are some details we should know about the provider we use. For instance, let’s say we have an entity with an auto-generated identifier based on an IDENTITY (or AUTO_INCREMENT) column. We call persist on it and later we want to use its ID somewhere – perhaps just to log the entity creation. And it doesn’t work, because we’re using EclipseLink and we didn’t call em.flush() to actually execute that INSERT. Without it the provider cannot know what value to use for the ID. Maybe our usage of the ID value was not the right ORM way, maybe we should have used the whole entity instead, but the point is that if we do the same with Hibernate it just works. We simply cannot assume that the ID is set.4
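To illustrate, a minimal sketch – it assumes a Dog entity with an IDENTITY-generated ID and a getId() accessor, and an entity manager em inside a transaction:

Dog dog = new Dog();
dog.setName("Rex");
em.persist(dog);
// Hibernate executes the INSERT right away for IDENTITY, so the ID is set;
// EclipseLink defers it, and getId() would still return null here
em.flush(); // forces the INSERT – now the ID is set with either provider
System.out.println("created dog with id " + dog.getId());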

Lazy on basic and to-one fields

While we can map these fields as lazy, the behaviour is actually not guaranteed by the JPA specification [JPspec]. Its section 2.2 states:

And the attached footnote adds:

While ORMs generally have no problem making collections lazy (e.g. both to-many annotations), for to-one mappings this gets more complicated. [PoEAA] offers a couple of solutions for the lazy load pattern: lazy initialization, virtual proxy, value holder, and ghost. Not all of them are usable for to-one mappings.

The essential trouble is that such a field contains the entity directly. There is no indirection like the collection that provides the lazy implementation in the case of to-many mappings. JPA does not offer any generic solution for an indirect value holder. A virtual proxy would require some interface to implement, or bytecode manipulation of the target class; a ghost would definitely require bytecode manipulation of the target class; and lazy initialization would require bytecode manipulation, or at least some special implementation, of the source class. JPA’s design offers neither a reasonable way to introduce this indirection without advanced auto-magic solutions, nor a way to do it explicitly so that a programmer can control it.

Removing to-one mappings and replacing them with raw foreign key values is currently not possible with pure JPA – JPA 2.1 brought the ON clause to JPQL, but it does not allow root entities in JOINs. We will expand on this in the chapter Troubles with to-one relationships.

Generated SQL updates

A programmer using JPA should see the object side of the ORM mapping. This means that an object is also the level of granularity on which the ORM works. If we change a single attribute on an object, the ORM can simply generate a full UPDATE for all columns (except for those marked updatable = false, of course). This by itself is probably not such a big deal performance-wise, but if nothing else it makes the SQL debug output less useful for checking what really changed.

I would not even expect the ORM to eliminate a column from the UPDATE when its value is unchanged; I’d rather expect it to include a column only when it was set. But here we are already in the domain of ORM auto-magic (again), as the provider somehow has to know what changed. Our entities are typically enhanced somehow, either during the build of a project or during class-loading. It would probably be more complex to store the set of touched columns than to mark the whole entity as “dirty”.

To be concrete, EclipseLink out of the box updates only the modified attributes/columns, while Hibernate updates all of them except for the ID, which is part of the WHERE clause.5 There is a Hibernate-specific annotation, @DynamicUpdate, that changes this behaviour. We may even argue about what is better and when. If we load an object in some state, change a single column and then commit the transaction, do we really want to track the changes per attribute, or do we expect the whole object to be “transferred” into the database as-is at the moment of the commit? If we don’t squeeze performance and our transactions are short-lived (and they had better be for most common applications) there is virtually no difference from the consistency point of view either.
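For illustration, a minimal sketch of such an entity (the attributes are made up):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.annotations.DynamicUpdate;

@Entity
@DynamicUpdate // Hibernate now generates UPDATEs with the changed columns only
public class Dog {
  @Id @GeneratedValue
  private Integer id;
  private String name;
  private Integer age;
  // getters and setters omitted
}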

All in all, this is just a minor annoyance when we’re used to logging generated queries, typically during development, and we simply cannot see what changed among dozens of columns. For this – and for the cases when we want to change many entities at once in the same way – we can use the UPDATE clause, also known as a bulk update. But these tend to interfere with caches and with the persistence context. We will talk about that in the next chapter.

Unit of work vs queries

JPA without direct SQL-like capabilities (that is, without JPQL) would be very limited. Sure, there are projects we can happily sail through with queries based on the criteria API only, but those are the easy ones. I remember a project – an important budgeting system with a hierarchical organizational structure containing thousands of organizations. There were budget items and limits for them in multiple categories, each of them hierarchical too. When we needed to recalculate some items for some category (possibly using wildcards) we loaded these items and then performed the tasks in memory.

Sometimes the update must be done entity by entity – the rules may be complicated, various fields are updated in various ways, etc. But sometimes it doesn’t have to be this way. When the user approved the budget for an organization (and all its sub-units) we merely needed to set a flag on all these items. That’s what we can do with a bulk update. UPDATE and DELETE clauses have been in the JPA specification since day zero; with the latest JPA 2.1 we can use them not only in JPQL, but also in the Criteria API.6

When we can use a single bulk update (we know what to SET and WHERE to set it) we can gain a massive performance boost. Instead of iterating and generating N updates we just send a single SQL statement to the database, and that way we can go down from a cycle taking minutes to an operation taking seconds. But there is one big “but” related to the persistence context (entity manager). If we have the entities affected by the bulk update in our persistence context, they will not be touched at all. Bulk updates go around the entity manager’s “cache” for the unit of work, which means we should not mix bulk updates with modification of entities attached to the persistence context, unless they are completely separate entities. In general, I try to avoid any complex logic with attached entities after I execute a bulk update/delete – and typically the scenario does not require it anyway.

To demonstrate the problem with a snippet of code:

BulkUpdateVsPersistenceContext.java
System.out.println("dog.name = " + dog.getName()); // Rex

new JPAUpdateClause(em, QDog.dog)
  .set(QDog.dog.name, "Dex")
  .execute();

dog = em.find(Dog.class, 1); // find does not do much here
System.out.println("dog.name = " + dog.getName()); // still Rex

em.refresh(dog); // this reads the real data now
System.out.println("after refresh: dog.name = " +
  dog.getName()); // Dex

The same problem applies to JPQL queries executed after changes to attached entities within a transaction on the current persistence context. Here the behaviour is controlled by the entity manager’s flush mode, which defaults to FlushModeType.AUTO.7 Flush mode AUTO forces the persistence context to flush all updates into the database before executing the query. But with flush mode COMMIT we’d get inconsistencies just like in the scenario with the bulk update. Obviously, flushing the changes is a reasonable option – we’d flush them sooner or later anyway. The bulk update scenario, on the other hand, requires us to refresh attached entities, which is much more disruptive and also costly.
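A minimal sketch of the difference (assuming the Dog entity from the previous listing):

em.getTransaction().begin();
Dog dog = em.find(Dog.class, 1);
dog.setName("Dex"); // attached entity changed, nothing flushed yet

em.setFlushMode(FlushModeType.AUTO); // the default
long count = em.createQuery(
    "select count(d) from Dog d where d.name = 'Dex'", Long.class)
  .getSingleResult(); // AUTO flushes the pending UPDATE first, so count is 1
// with FlushModeType.COMMIT the flush may be skipped before the query
// and it could return stale results, just like with the bulk update
em.getTransaction().commit();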

We can’t escape SQL and the relational model

Somewhere under the covers SQL is lurking, and we had better know how it works. We will tackle this topic with more passion in the second part of the book – for now we will just demonstrate that not knowing does not work. Imagine we want a list of all dog owners, and because there are many of them we want to paginate it. This is enterprise application 101. In JPA we can use the methods setFirstResult and setMaxResults on the Query object, which correspond to the SQL OFFSET and LIMIT clauses.8

Let’s have a model situation with the following owners (and their dogs): Adam (with dogs Alan, Beastie and Cessna), Charlie (no dog), Joe (with Rex and Lassie) and Mike (with Dunco). If we query for the first two owners ordered by name – pagination without order doesn’t make sense – we’ll get Adam and Charlie. However, imagine we want to display the names of their dogs in each row. If we join the dogs too, we’ll just get Adam twice, courtesy of his many dogs. We may select without the join and then select the dogs for each row, which is our infamous N+1 select problem. This may not be a big deal for a page of 2, but for 30 or 100 we can see the difference. We will talk about this particular problem later in the chapter about N+1.
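A sketch of the paging query (an Owner entity with a lazy dogs collection is assumed):

List<Owner> owners = em.createQuery(
    "select o from Owner o order by o.name", Owner.class)
  .setFirstResult(0) // OFFSET – where the page starts
  .setMaxResults(2)  // LIMIT – the page size
  .getResultList(); // Adam, Charlie

// the naive follow-up – one extra SELECT per row, N+1 in the making
for (Owner owner : owners) {
  System.out.println(owner.getDogs().size());
}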

These are the effects of the underlying relational model and we cannot escape them. It’s not difficult to deal with them if we accept the relational world underneath. If we fight it, it fights back.

Additional layer

Reasoning about common structured, procedural code is quite simple for simple scenarios. We add higher-level concepts and abstractions to deal with ever more complex problems. When we use JDBC we know exactly where in the code our SQL is sent to the database; it’s easy to control it, debug it, monitor it, etc. With JPA we are one level higher. We can still try to measure the performance of our queries – after all, a typical query is executed where we call it – but there are some twists.

First, the result can be cached in a query cache – which may be good if it provides correct results – but it also significantly distorts any performance measurement. The JPA layer itself takes some time: a query has to be parsed (add Querydsl serialization to it when used) and entities have to be created and registered with the persistence context, so they are managed as expected. This distorts the results for the worse, not to mention that for big results it may trigger additional GC that plain JDBC would not.

The best bet is to monitor the performance of the SQL on the server itself. Most decent RDBMSs provide some capabilities in this respect. We can also use a JDBC proxy driver that wraps the real one and performs some logging on the way. Maybe our ORM provides this to a degree, at least in the logs if nowhere else. This may not be easy to process, but it’s still better than no visibility at all. A more sophisticated system may provide nice measurements, but it can also add a performance penalty – which perhaps doesn’t affect the measured results too much, but it can still affect the overall application performance. Of course, monitoring overhead is not related only to JPA; we would get it using plain JDBC as well.

I will not cover monitoring topics in this book – they are natural problems with any framework, although we can argue that access to the RDBMS is kinda critical. The unit of work pattern causes the real database work to happen somewhere other than where the domain code would indicate. For simple CRUD-like scenarios this is not a problem, not even from the performance perspective (mostly). For complex scenarios, for which the pattern was designed in the first place, we may need to revisit what we send to the database if we encounter performance issues. This may also affect our domain. Maybe there are clean answers for this, but I don’t know them. I typically rather tune down how I use the JPA.

All in all, the fact that JPA hides some complexity and adds another set of it is a natural aspect of any layer we add to our application – especially the technological one. This is probably one of the least questionable parts of JPA. We either want it and accept its “additional layer-ness” or choose not to use it. Know that when we discover any problem we will likely have to deal with the complexity of the whole stack.

Big unit of work

I believe that the unit of work pattern is really neat, especially when we have support from ORM tools. But there are legitimate cases when we run into trouble because the context is simply too big. This may easily happen with complicated business scenarios, and it may cause no problem at all. Often, though, users do see the problem: a request takes too long, or a nightly scheduled task is still running late in the morning. The code looks good, it’s as ORM-ish as it can be – it’s just slow.

We can monitor or debug how many objects are managed; often we can see the effects on the heap. When this happens something has to change, obviously. Sometimes we can deal with the problem within our domain model and let the ORM shield us from the database. Sometimes that’s not possible and we have to let the relational world leak into our code.

Let’s say we have some test cases creating dog and breed objects. In an ideal case we would delete all of them between tests, but as it happens, we are working on a database that also contains some fixed set of dogs and breeds (don’t ask). So we mark our test breeds with a ‘TEST’ description. Dog creation is part of the tested code, but we know the dogs will be of our testing breeds. To delete all the dogs created during the test we may then write:

delete from Dog d where d.breed.description = 'TEST'

That’s pretty clear JPQL. Besides the fact that it fails on Hibernate, it does not touch our persistence context at all and does its job. We can do the same with a subquery (which works on Hibernate as well) – or we can fetch the testing breeds into a list (they are now managed by the persistence context) and then execute JPQL like this:

delete from Dog d where d.breed in :breeds

Here breeds would be a parameter whose value is a list of the testing breeds. We may fetch a plain list of breed.id values instead – these do not get managed, take less memory and pull less data from the database with the same effect; we just say where d.breed.id in :breedIds instead – if our ORM supports it like this, but it’s definitely OK with JPA. I’ve heard arguments that this is less object-oriented. I was like “what?”
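Put together, the whole sequence might look like this (a sketch assuming a Breed entity with an Integer ID and a description attribute):

List<Integer> breedIds = em.createQuery(
    "select b.id from Breed b where b.description = 'TEST'", Integer.class)
  .getResultList(); // plain IDs, nothing gets managed

em.createQuery("delete from Dog d where d.breed.id in :breedIds")
  .setParameter("breedIds", breedIds)
  .executeUpdate();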

Finally, what we can do is start by fetching the testing breeds, then fetch all the dogs of these breeds and call em.remove(dog) in a cycle. I hope this, object-oriented as it is, is considered a bit of a stretch even by OO programmers. But I saw it in teams where JPQL and bulk updates were not very popular (read also as “not part of the knowledge base”).

The persistence context (unit of work) has a lot to do when the transaction is committed. It needs to be flushed, which means it needs to check all the managed (or attached) entities and figure out whether they changed or not (dirty checking). How the dirty checking is performed is beyond the scope of this book; typically some bytecode manipulation is used. The problem is that this takes longer when the persistence context is big. When it’s big for a reason, then we have to do what we have to do, but when it’s big because we didn’t harness bulk updates/deletes, then it’s simply wasteful. Often it also takes a lot of code when a simple JPQL statement says the same. I don’t accept the answer that reading JPQL or SQL is hard. If we use an RDBMS we must know it reasonably well.

Sometimes a bulk update is not a feasible option, but we still want to loop through some result, modify each entity without interacting with the others, and flush it (or batch the updates if possible). This sounds similar to the missing “cursor-like result set streaming” mentioned in the section covering missing features that some ORM providers do have – although that case covers read-only scenarios without any interaction with the persistence context. If we want to do something with each single entity as we loop through the results and we hit performance problems (mostly related to memory), we may try to load smaller chunks of the data, and after processing each chunk flush the persistence context and also clear it with EntityManager.clear().
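A minimal sketch of such chunked processing (the chunk size and the per-entity logic are illustrative; a getAge/setAge pair on Dog is assumed):

final int CHUNK_SIZE = 100;
for (int offset = 0; ; offset += CHUNK_SIZE) {
  List<Dog> chunk = em.createQuery(
      "select d from Dog d order by d.id", Dog.class)
    .setFirstResult(offset)
    .setMaxResults(CHUNK_SIZE)
    .getResultList();
  if (chunk.isEmpty()) {
    break;
  }
  for (Dog dog : chunk) {
    dog.setAge(dog.getAge() + 1); // whatever per-entity change we need
  }
  em.flush(); // writes the updates for this chunk
  em.clear(); // detaches the entities so the context stays small
}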

Talking about read-only scenarios, it’s also a shame that JPA as of 2.2 still does not have any notion of read-only entities. EclipseLink has provided the @ReadOnly annotation for entity classes since TopLink times; Hibernate has its @Immutable annotation that works not only for classes but also for fields/methods. This does not make the persistence context smaller by itself, but such classes can be skipped during dirty checking, not to mention the benefit of explicitly stating that they are supposed to be read-only.
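For example, with Hibernate (the codebook class is hypothetical):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Immutable;

@Entity
@Immutable // Hibernate can skip these instances during dirty checking
public class BreedCodebook {
  @Id
  private Integer id;
  private String description;
  // getters only – the mapping is explicitly for reading
}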

Other entity manager gotchas

Consider the following snippet:

em.getTransaction().begin();
Dog dog = new Dog();
dog.setName("Toothless");
em.persist(dog);
dog.setAge(4);
em.getTransaction().commit();

We create the dog named Toothless and persist it, setting its age afterwards. Finally, we commit the transaction. What statements do we expect in the database? Hibernate does the following:

  1. On persist it obtains an ID. If that means querying a sequence, it will do it. If it needs to call INSERT because the primary key is of type AUTO_INCREMENT or IDENTITY (depending on the database), it will do so. If the INSERT is used, obviously, the age is not set yet.
  2. When flushing, which happens during the commit(), it will call INSERT if it wasn’t called in the previous step (that is, when a sequence or another mechanism was used to get the ID). Interestingly enough, it will call it with the age column set to NULL.
  3. Next an additional UPDATE is executed, setting the age to 4. The transaction is committed.

The good thing is that whatever mechanism is used to get the ID value, we get a consistent sequence of INSERT and UPDATE. But why? Is it necessary?9

There is a reason I’m talking about Hibernate, as you might have guessed. EclipseLink simply reconciles it all into a single INSERT during the commit(). If it is absolutely necessary to insert the entity into the database in a particular state (maybe for some strange constraint reasons) and update it afterwards, we have to “push” it with an explicit flush() placed somewhere in between. That is probably the only reliable way to do it if we don’t want to rely on the behaviour of a particular JPA provider.10
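A sketch of that provider-independent “push”:

em.getTransaction().begin();
Dog dog = new Dog();
dog.setName("Toothless");
em.persist(dog);
em.flush(); // guarantees the INSERT happens here, with age still not set
dog.setAge(4); // flushed as an UPDATE during the commit
em.getTransaction().commit();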

JPA alternatives?

When JPA suddenly stands in our way instead of helping us, we can still fall back gracefully to the JDBC level. Personally I don’t mix these within a single scenario, but we can. For that we have to know what provider we use, unwrap its concrete implementation of EntityManager and ask it for a java.sql.Connection in a non-portable manner. When I don’t mix scenarios, I simply ask Spring (or possibly another container) to inject the underlying javax.sql.DataSource, and then I can access the Connection without using JPA at all. Talking about Spring, I definitely go for their JdbcTemplate to avoid all the JDBC boilerplate. Otherwise I prefer JPA for all the reasons mentioned in the Good Parts.
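For instance with Hibernate the unwrapping may look like this (a non-portable sketch):

import java.sql.Statement;
import org.hibernate.Session;

Session session = em.unwrap(Session.class);
session.doWork(connection -> {
  // plain JDBC from here on
  try (Statement statement = connection.createStatement()) {
    statement.executeUpdate("update Dog set age = age + 1");
  }
});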

We’ve lightly compared JPA with concrete ORMs already, but they are still the same concept – it is much more fun to compare it with something else altogether, let’s say a different language – like Groovy. We’re still firmly on the JVM, although it’s not very likely we’d do our persistence in Groovy and the rest of the application in Java. Firstly, the Groovy language also has its own ORM. It’s called GORM, and while not built into the core project, it is part of the Grails framework. I don’t have any experience with it, but I don’t expect a radical paradigm shift, as it uses Hibernate ORM to access the RDBMS (although it also supports NoSQL solutions). Knowing Groovy, I’m sure it brings some fun into the mix, but it still is an ORM.

I often use core Groovy support for relational databases and I really like it. It is no ORM, but it makes working with a database really easy compared to JDBC. I readily use it to automate data population, as it is much more expressive than plain SQL statements – you mostly just use syntax based on Groovy maps. With a little support code you can create helper insert/update methods that provide reasonable defaults for columns you don’t want to specify every time, or insert whole aggregates (master-slave table structures). It’s convenient to assign the returned auto-increment primary keys into Groovy variables and use them as foreign keys where needed. It’s also very easy to create repetitive data in a loop.

I use this basic database support in Groovy even on projects where I already have entities in JPA but for whatever reason don’t want to use them. We actually don’t need to map anything into objects; mostly a couple of methods will do. Sure, we have to rename columns in the code when we refactor, but the column names are hidden in just a couple of places, often a single one. Bottom line? Very convenient, natural type conversion (although not perfect, mind you), little to no boilerplate code, and it all plays nicely with Groovy syntax. It definitely doesn’t bring as much negative passion with it as JPA does – perhaps because the level of sophistication is not as high. But in many cases simple solutions are more than enough.