I JPA Good Times, Bad Times

The biggest trouble with JPA (and ORM in general) is that it is hard – harder than people realize. The topic is complex, yet the solutions somehow make it look easier than it really is. JPA is part of Java EE and the Spring framework supports it very well too, so we simply can’t go wrong with it. Or can we? Various problems – especially complexity – are inherent to ORM solutions in general. But there is also no widespread alternative. JDBC? Too low level, although we may use some frameworks above it (like Spring’s JdbcTemplate). Anything else? Not widespread.

JPA/ORM also somehow makes people believe they don’t need to understand SQL. Or JPQL (or any QL) for that matter – and it shows (pain)fully both in the code and at runtime, typically in the performance aspects.

Technology like this deserves to be studied more deeply than it usually is. It cannot be just slapped on the application. Yet that is what I mostly see and that is what all those false implied promises deliver. JPA is now one of many embodiments of “EE” – of enterprise software. Applications are bigger and more complex than they need to be. And slower, of course. Developers map the entities but don’t follow through, don’t investigate, and let entropy do the rest of the work instead.

JPA is not perfect and I could name a couple of things I’d love to see there. Many of them may not be right for the JPA philosophy, others would probably help tremendously, but in any case we are responsible for doing our best to learn about such a key technology – that is, if we use it. It reads and writes data from/to our databases, which is a crucial thing for most enterprise applications. So let’s talk about the limits, let’s talk about implications, about lazy loads and their consequences, etc. Don’t “just fix it with OSIV” (Open Session in View).

In the following chapters we will talk about what is really good in JPA, what is missing and what is limiting or questionable. I tried to avoid “plain bad” for these parts because it is a rather unfairly harsh statement and some of these features are… well, features. Just questionable on many occasions, that’s all. Because the questionable portion is naturally the more fertile topic, I let it overflow into separate chapters about caching and how people often feel about JPA.

1. Good Parts

First we will look at all the good things JPA offers us. Why did I start to prefer JPA over a particular provider? Mind that this part is just an extended enumeration of various features, not tutorial material.

The good parts are not absolute though. If we compare JPA as a standard with a concrete ORM we gain the standardization, but there are related drawbacks. Comparing JPA with SQL is not quite appropriate as JPA covers a much bigger stack. We may compare JPA with JDBC (as the good parts below mostly do), or we may compare the Java Persistence Query Language (JPQL) with SQL to a degree.

JPA is standard

When we started with ORM, we picked Hibernate. There were not many other ORM options for Java, or at least they were not both high-profile and free. Yet even Hibernate’s creators are now glad when their tool is used via the JPA interfaces.

A standard is not always necessarily good, but in the case of JPA it is well established and respected. Now we can choose between multiple JPA 2.1 implementations or just deploy to any Java EE 7 application server. I’d not bet on the promise that we can easily swap one provider (or application server) for another – my experiences are terrible in this respect – but there is a common set of annotations for mapping our entities and the same core classes like EntityManager.

But standards also have their drawbacks, most typically that they are some common subset of the features offered in the area. JPA’s success is probably based on the fact that it emerged as a rock-solid subset of the features offered by the major players in the Java ORM world. But starting this way, it must have trailed behind them. JPA 1 did not offer any criteria API, can you imagine that? Now JPA 2.1 is somewhere else entirely. The features cover nearly anything we need for typical applications talking to relational databases.

Even if something is missing we may combine JPA with proprietary annotations of a particular ORM provider – but only to the necessary degree. And perhaps we can switch those to JPA in its next version. In any case, JPA providers allow us to spice up the JPA soup with their specifics easily. That way JPA does not restrain us.

Database independence

Let’s face it – it’s not possible to easily switch from one database to another when we have invested in one, especially if we use stored procedures or anything else beyond plain SQL. But JPA is not the problem here, on the contrary. It will require far fewer changes (if any) and the application will keep working.

One of the best examples of this is pagination. There has been a standard way to do it since SQL:2008 using the FETCH FIRST clause (along with OFFSET and ORDER BY), but only newer database versions support it. Before that, pagination required anything from a LIMIT/OFFSET combo (e.g. PostgreSQL) to an outer select (for Oracle before 12c). JPA makes this problem go away as it uses a specific dialect to generate the SQL. This portion of the database idiosyncrasies is really something we can live without, and JPA delivers.
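
A minimal sketch (the Owner entity is illustrative, em is an EntityManager): the same code runs unchanged on PostgreSQL, Oracle or any other supported database, because the provider’s dialect generates the appropriate paging SQL.

List<Owner> page = em
    .createQuery("select o from Owner o order by o.name", Owner.class)
    .setFirstResult(20)  // translated to OFFSET or an equivalent construct
    .setMaxResults(10)   // translated to LIMIT/FETCH FIRST or an outer select
    .getResultList();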

Hence, if we start another project and go for a different database, our application programmers using JPA probably don’t have to learn the specifics of its SQL. This cuts both ways though. I’ve heard that “you don’t have to learn SQL” (meaning “at all”) is considered a benefit of JPA, but every time programmers didn’t understand SQL they didn’t produce good results.

Natural SQL-Java type mapping

JPA can persist anything that is a so-called persistable type. These are:

  • entity classes (annotated with @Entity) – these loosely match database tables;
  • mapped superclasses (@MappedSuperclass) – used when we map class hierarchies;
  • embeddable classes (@Embeddable) – used to group multiple columns into a value object;

These were all types on the class level, more or less matching whole database tables or at least groups of columns. Now we will move to the field level:

  • simple Java types (optionally annotated with @Basic) like primitive types, their wrappers, String, BigInteger and BigDecimal;
  • temporal types like java.util.Date/Calendar in addition to types from the java.sql package;
  • collections – these might be relations to other entities or embeddable types;
  • enum types – here we can choose whether it’s persisted by its name() or by ordinal().

Concepts in the first list are not covered by JDBC at all – these are the most prominent features of ORM. But we can compare JDBC and JPA in how easy it is to get the value of any column as our preferred type. Here JDBC locks us into low-level types without any good conversion capabilities, not even for omnipresent types like java.util.Date. In JPA we can declare a field as java.util.Date and just say that the column represents a time, date or timestamp using the @Temporal annotation. I feel no need to use the java.sql package anymore.

Enums are also much easier, although this applies only to simple cases, not to mention that for evolving enums @Enumerated(EnumType.ORDINAL) is not an option (it should not be an option at all, actually). More in the chapter Mapping Java enum.
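
To illustrate both, a minimal sketch of a mapping (the Dog entity and its Gender enum are made up for the example):

@Entity
public class Dog {

    @Id @GeneratedValue
    private Integer id;

    private String name; // implicitly @Basic

    @Temporal(TemporalType.DATE)
    private java.util.Date born; // stored as SQL DATE, no java.sql types needed

    @Enumerated(EnumType.STRING)
    private Gender gender; // persisted by name(), safe for evolving enums
}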

The bottom line is that mapping from SQL types to Java field types is much more natural with JPA than with JDBC. And we haven’t even mentioned the two big guns JPA offers – custom type converters and large object support.

Convenient type conversion

Up to JPA 2.0 we could not define custom conversions, but JPA 2.1 changed this. Now we can simply annotate our field with @Convert and implement AttributeConverter<X, Y> with two very obvious methods:

  • Y convertToDatabaseColumn(X attribute)
  • X convertToEntityAttribute(Y dbData)

And we’re done! Of course, don’t forget to add the @Converter annotation on the converter itself, just like we annotate other classes with @Entity or @Embeddable. Even better, we can declare the converter as @Converter(autoApply = true) and do without @Convert on the field. This is extremely handy for java.time types from Java 8, because JPA 2.1 does not support those (it was released before Java SE 8, remember).
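
For instance, a sketch of such a converter for Java 8’s LocalDate on JPA 2.1 (imports from javax.persistence and java.time assumed):

@Converter(autoApply = true)
public class LocalDateConverter
        implements AttributeConverter<LocalDate, java.sql.Date> {

    @Override
    public java.sql.Date convertToDatabaseColumn(LocalDate attribute) {
        return attribute != null ? java.sql.Date.valueOf(attribute) : null;
    }

    @Override
    public LocalDate convertToEntityAttribute(java.sql.Date dbData) {
        return dbData != null ? dbData.toLocalDate() : null;
    }
}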

Large objects

Compared with JDBC, JPA makes working with large objects (like SQL CLOB and BLOB types) a breeze. Just annotate the field with @Lob and use a proper type, that is byte[] for BLOBs and String for CLOBs. We may also use Byte[] (why would we do that?) or Serializable for BLOBs and char[] or Character[] for CLOBs.

In theory we may annotate the @Lob field as @Basic(fetch=FetchType.LAZY), however this is a mere hint to the JPA provider and we can bet it will not be honoured. More about lazy on basic and to-one fields in a dedicated section.
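
A typical mapping might look like this (field names are illustrative):

@Lob
private String biography; // CLOB

@Lob
@Basic(fetch = FetchType.LAZY) // a hint only – the provider may ignore it
private byte[] photo; // BLOB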

Getting LOBs via JDBC from some databases may not be such a big deal, but if you’ve ever looped over LOB content using a 4 KB buffer you will appreciate the straightforward JPA mapping.

Flexible mapping

It had better be flexible, because flexible mapping is the most important reason for the whole ORM concept. Field-to-column mapping can be specified as annotations on the fields or on the getters – or in XML files. Annotation placement implies the access strategy, but it can be overridden using @Access on the class. In the book I’ll mostly use field access, but property access can be handy when we want to do something in the get/setters. For instance, before JPA 2.1 we did simple conversions there: the field already held the converted value and additional get/setters annotated as @Transient served the application, leaving the JPA-annotated accessors for JPA purposes only.

Mentioning @Transient – sometimes we want additional fields in our JPA entities and this annotation clearly tells JPA not to touch them. There are arguments about how much or how little should be in a JPA entity; I personally prefer rather less than more – which, on the other hand, is criticized as an anemic domain model – and we will return to this topic later.

Unit of work

Unit of work is a pattern described in [PoEAA]. In Hibernate it was represented by its Session; in JPA the more descriptive (and arguably more EE-ish) name EntityManager was chosen. To make things just a little tougher on newcomers, the entity manager manages something called a persistence context – but we make no big mistake if we treat entity manager and persistence context as synonyms. Nowadays EntityManager is typically injected where needed, and combined with declarative transactions it is very natural to use. Funnily enough, it’s not injected with the standard @Inject but with @PersistenceContext.

To explain the pattern simply: whatever we read or delete through it, or whatever new we persist with it, will be “flushed” at the end of a transaction and the relevant SQL statements will get executed. We get all of this from EntityManager, and while there is some ideal way to use it, it is flexible enough and can be “bent”. We can flush things earlier if we need to force some order of SQL statements, etc.
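
A short sketch of the pattern in code (reusing the illustrative Dog entity):

em.getTransaction().begin();
Dog dog = em.find(Dog.class, 1); // dog is now managed by the persistence context
dog.setName("Azor");             // no SQL is sent yet
em.flush();                      // forces the UPDATE now if the ordering matters
// ... more work within the same unit of work ...
em.getTransaction().commit();    // any remaining changes are flushed here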

Because the unit of work remembers all the work related to it, it is sometimes called a “session cache” or “1st level cache”. This “cache” however does not survive the unit of work itself, and talking about a cache gives this pattern an additional – and confusing – meaning.

After all the years of experience with Session and EntityManager it was actually really refreshing to read about the unit of work pattern, to see what operations are typical for it, to have it explicitly stated, and also to read Fowler’s speculation about how it would be used and implemented – all written back in 2002, and most of it still valid!

Declarative transactions

JPA has basic transaction support, but when it comes to declarative transactions they are part of the Java Transaction API (JTA). JPA and JTA are so seamlessly integrated from the programmer’s point of view that we simply take @Transactional support for granted.

Programmatically we can get to the transaction using EntityManager’s getTransaction() method. This returns an EntityTransaction implementation that allows us to check whether a transaction is active, begin one, commit it or roll it back. This is all good for simple cases while using transaction-type set to RESOURCE_LOCAL in our persistence.xml – typical for SE deployment. But in the Java EE world (or when using Spring, even outside an application server) we’ll probably use the @Transactional annotation for declarative transaction management, possibly joining a distributed transaction if necessary.

We will not cover transaction management; there are better resources and plenty of blogs. I can recommend [ProJPA2] which covers both the @Transactional and @TransactionAttribute annotations, and much more. A bit of a warning for Spring users: we can use Spring’s @Transactional annotation – this way we can even mark a transaction as read-only when needed – but we cannot mix both @Transactional annotations at will as they work differently. It is advisable to use only Spring’s annotation in a Spring environment.

Other JPA 2.1 goodies

We talked about custom converters already, but JPA 2.1 brought much more. Let’s cover the most interesting points quickly:

  • As 2.0 brought the Criteria API, 2.1 extended its usage to updates/deletes as well (although I’d use Querydsl for all these cases).
  • We can now call stored procedures in our queries.
  • We can map results into constructors and create custom DTO objects for specific queries (again, Querydsl has it too and it works for older JPAs).
  • A feature called entity graph allows us to define which relations to load in specific situations, and thus to fetch relations marked as lazy when we know up-front we will need them. I’m not covering this in the book.
  • There’s also an option to call any custom FUNCTION from JPQL. This includes native functions of a particular database if needed. While it limits us to that vendor (until we rewrite it), it allows us to perform duties beyond and above JPA’s built-in functions.
  • JPA 2.1 also specifies properties that allow us to generate database schema or run SQL scripts, which unifies this configuration for various JPA providers.

For the whole list see either this part of the Java Persistence wikibook or this post, which covers the features even better.

2. The Missing Parts

In this part we’ll briefly compare the JPA standard with capabilities offered by various ORM providers and go through the things missing in the Java Persistence Query Language (JPQL) when compared with SQL.

Compared to ORM providers

A couple of years back this section would have been a long one – but JPA narrowed the gap with 2.0 and virtually closed it with 2.1. There are still things that may be missing when we compare it with concrete ORM providers. The list of various ORM features not included in JPA is probably long. However, the right questions are how much a feature is really missed, how often we would use it and how serious an impact its omission has.

In most cases we can step out of the comfortable “standard” zone and use the capabilities of a particular ORM without affecting the parts that can stay pure JPA. Maybe we add some provider-specific annotations to our mapping – so be it, let’s be pragmatic.

One particular feature I once missed a lot is something that allows me to go through a result set just like in JDBC, but using entities. These don’t have to be put into the persistence context as I can process them as they come. Actually – they must not be put there, because they would just bloat the heap without any real use. It’s like streaming the result set. This can be extremely handy when we produce some very long export that is streamed straight to the client browser or a file. Hibernate offers ScrollableResults; I personally used JDBC with Spring’s JdbcTemplate instead and solved the problem without JPA altogether – obviously, I had to do the mapping myself, while Hibernate can do it for us. Even so, as mentioned in this StackOverflow answer, this may still cause an OutOfMemoryError or similar memory-related issue – this time not on the JPA/ORM level, but because of a silly JDBC driver (or even because of database limitations, but that’s rare).
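
For illustration, a rough sketch using Hibernate’s ScrollableResults (Hibernate 5.x API assumed; writeToExport is a hypothetical processing method):

Session session = em.unwrap(Session.class);
ScrollableResults results = session.createQuery("select d from Dog d")
    .setReadOnly(true)
    .setFetchSize(100) // a hint to the JDBC driver, not always honoured
    .scroll(ScrollMode.FORWARD_ONLY);
try {
    while (results.next()) {
        Dog dog = (Dog) results.get(0);
        writeToExport(dog); // process the row as it comes
        session.evict(dog); // keep the persistence context from growing
    }
} finally {
    results.close();
}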

Another area that is not covered by JPA that much is caching. JPA is rather vague about the structure of caches, although it does specify some configuration options and annotations. But the ORM implementations can still differ significantly. We tackle this topic in a separate chapter.

Finally, with the introduction of the ON clause we could get much more low-level with our queries when it suits us. ON is intended for additional JOIN conditions, but it would also give us new ways to approach the relations, which can be bothersome from time to time. We could use ON to explicitly add our primary JOIN condition – but all this is for naught because JPA 2.1 does not allow using a root entity (representing the table itself) in the JOIN clause. More about this topic and how to do it with EclipseLink and Hibernate (5.1.0 and higher) in the chapter Removing to-one altogether.

Comparing JPQL to SQL

JPQL gets more and more powerful with every new specification version, but obviously it cannot match SQL with its proprietary features. Being an “object query language” it does have an expressiveness of its own though – for instance we may simply say “delete all the dogs where the breed name is <this and this>” without explicitly using joins (although this relies on the to-one mapping we will try to get rid of). When object relations are mapped properly, joins are implied by the mapping and the JPA provider will take care of generating the proper SQL for us. We may also ask for dog owners with their dogs and the ORM can load it all in a single query and provide it as a list of owners, each with their list of dogs – this is the difference between the relational and the object view.
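
In JPQL the two examples might look like this – the first one relies on the implied join via the to-one mapping (provider support for it in bulk deletes varies, as we will see later), the second one loads owners together with their dogs in one go:

delete from Dog d where d.breed.name = 'Collie'

select o from Owner o left join fetch o.dogs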

But there are some very useful constructs that are clearly missing. I personally never needed a right outer join; so far I was always able to choose the right entity to start with and then use left outer joins, but this one may hit us sometimes. There is also no full outer join, but this relates to the fact that we’re working with objects, not with rows – although technically we may work with tuples (and with relations in the RDBMS meaning). This dumbs down the JPA capabilities a bit, but in many cases it may be a good way and actually simplify things, provided we understand SQL – which we should.

When compared to SQL, probably the most striking JPQL limitations are related to subqueries. Some scenarios can be replaced by JOINs, but some can’t. For instance, we cannot put subqueries into a SELECT clause – this would allow for aggregated results per row. We cannot put them into a FROM clause and use the select result as any other table (or, in the relational database sense, as a relation). This would allow us, among other things, to count rows for results that current JPA does not support.1

JPA offers a palette of the most popular functions, but of course it does not provide all possible functions.2 Before JPA 2.1 we would have to use ORM-provider custom functionality to overcome this, or fall back to native SQL. Just because we are consciously sacrificing database portability does not mean we don’t want to use JPQL. JPA 2.1 provides us with the FUNCTION construct, where the name of the called function is the first argument, with the other arguments following. Easy to use and very flexible – this effectively closes the gap for functions we can use.
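
For example, calling a native function from JPQL (soundex exists on many, but not all, databases – exactly the portability trade-off mentioned above):

select d from Dog d
where function('soundex', d.name) = function('soundex', :name)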

Other missing parts

The best way to find corner cases that are not supported by the specification [JPspec] is simply to search for the phrase “not supported”. Some of these are related to embeddables, some are pretty natural (e.g. “applying setMaxResults or setFirstResult to a query involving fetch joins over collections is undefined”).3

What I dearly miss in JPA is better control over the fetching of to-one relations. The current solution tries to be transparent both in terms of object-relational mapping and Java language features, but it may kill our performance or require caching, potentially a lot of it. While to-many relations can be loaded lazily with a special collection implementation enabling it, to-one cannot work like this without bytecode modification. I believe, though, that developers should be allowed to decide that a particular to-one relationship will not be loaded and that only detached entities with IDs will be provided. But let’s save this discussion for the opinionated part.

3. Questionable parts

JPA, as any ORM, is not without its drawbacks. Firstly, it is complex – much deeper than developers realize when they approach it. Secondly, it is not a perfect abstraction. The more you want to play it as a perfect abstraction, the worse it probably gets in marginal cases. And the margin is not that thin. You may solve 80% of cases easily, but there are still the 20% of hard cases where you go around your ORM, write native SQL, etc. If you try to avoid this you’ll probably suffer more than if you accepted it.

We can’t just stay on the JPA level, even for cases where ORM works well for us. There are some details we should know about the provider we use. For instance, let’s say we have an entity with an auto-generated identifier based on an IDENTITY (or AUTO_INCREMENT) column. We call persist on it and later we want to use its ID somewhere – perhaps just to log the entity creation. And it doesn’t work, because we’re using EclipseLink and we didn’t call em.flush() to actually execute that INSERT. Without it, the provider cannot know what value the ID should have. Maybe our usage of the ID value was not the right ORM way, maybe we should have used the whole entity instead, but the point is that if we do the same with Hibernate it just works. We simply cannot assume that the ID is set.4
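
A sketch of the situation (IDENTITY-based ID assumed):

em.getTransaction().begin();
Dog dog = new Dog();
dog.setName("Rex");
em.persist(dog);
// Hibernate has already executed the INSERT here and the ID is set;
// EclipseLink has deferred it, so getId() may still return null.
System.out.println("created dog id = " + dog.getId());
em.flush(); // after an explicit flush the ID is set on any provider
em.getTransaction().commit();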

Lazy on basic and to-one fields

While we can map these fields as lazy, the behaviour is actually not guaranteed by the JPA specification [JPspec]. Its section 2.2 and the attached footnote make the point clear: lazy fetching is merely a hint to the persistence provider, which may fetch the data eagerly anyway.

While ORMs generally have no problem making collections lazy (e.g. for both to-many annotations), for to-one mappings this gets more complicated. [PoEAA] offers a couple of solutions for the lazy load pattern: lazy initialization, virtual proxy, value holder, and ghost. Not all of them are usable for to-one mappings.

The essential trouble is that such a field contains the entity directly. There is no indirection like the collection that provides the lazy implementation in the case of to-many mappings. JPA does not offer any generic solution for an indirect value holder. A virtual proxy would require some interface to implement, or bytecode manipulation of the target class; a ghost would definitely require bytecode manipulation of the target class; and lazy initialization would require bytecode manipulation, or at least some special implementation, of the source class. JPA’s design offers neither a reasonable way to introduce this indirection without advanced auto-magic solutions nor a way to do it explicitly so that a programmer can control it.

Removing to-one mappings and replacing them with raw foreign key values is currently not possible with pure JPA – even though JPA 2.1 brought the ON clause to JPQL, it does not allow root entities in JOINs. We will expand on this in the chapter Troubles with to-one relationships.

Generated SQL updates

A programmer using JPA should see the object side of the ORM mapping. This means that an object is also the level of granularity on which the ORM works. If we change a single attribute on an object, the ORM can simply generate a full update of all columns (except for those marked updatable = false, of course). This by itself is probably not such a big deal performance-wise, but if nothing else it makes SQL debug output less useful for checking what really changed.

I would not even expect the ORM to eliminate a column from the update when its value is unchanged; I’d rather expect it to include the column only when it was set. But we are already in the domain of ORM auto-magic (again), as the ORM somehow has to know what has changed. Our entities are typically enhanced somehow, either during the build of a project or during class-loading. It would probably be more complex to track touched columns instead of marking the whole entity as “dirty”.

To be concrete, EclipseLink out of the box updates only modified attributes/columns, while Hibernate updates all of them except for the ID, which is part of the WHERE clause.5 There is a Hibernate-specific annotation @DynamicUpdate that changes this behaviour. We may even argue about what is better and when. If we load an object in some state, change a single column and then commit the transaction, do we really want to follow the changes per attribute, or do we expect the whole object to be “transferred” into the database as-is at the moment of the commit? If we don’t squeeze performance and our transactions are short-lived (and they had better be for most common applications), there is virtually no difference from the consistency point of view either.
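
With Hibernate the per-column behaviour can be switched on per entity:

@Entity
@org.hibernate.annotations.DynamicUpdate // UPDATE contains only changed columns
public class Dog {
    // ... mapping as usual
}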

All in all, this is just a minor annoyance when we’re used to logging generated queries, typically during development, and we simply cannot see what changed among dozens of columns. For this – and for the cases when we want to change many entities at once in the same way – we can use the UPDATE clause, also known as bulk update. But these tend to interfere with caches and with the persistence context. We will talk about that in the next chapter.

Unit of work vs queries

JPA without direct SQL-like capabilities (that is, without JPQL) would be very limited. Sure, there are projects we can happily sail through with queries based on the criteria API only, but those are the easy ones. I remember a project – an important budgeting system with a hierarchical organizational structure containing thousands of organizations. There were budget items and limits for them with multiple categories, each of them hierarchical too. When we needed to recalculate some items for some category (possibly using wildcards) we loaded these items and then performed the tasks in memory.

Sometimes the update must be done entity by entity – the rules may be complicated, various fields are updated in various ways, etc. But sometimes it doesn’t have to be this way. When the user approved the budget for an organization (and all its sub-units) we merely needed to set a flag on all its items. That’s what we can do with a bulk update. UPDATE and DELETE clauses have been in the JPA specification since day one; with the latest JPA 2.1 we can use them not only in JPQL, but also in the Criteria API.6

When we can use a single bulk update (we know what to SET and WHERE to set it) we can gain a massive performance boost. Instead of iterating and generating N updates we just send a single SQL statement to the database, and that way we can go from a cycle taking minutes down to an operation taking seconds. But there is one big “but” related to the persistence context (entity manager). If the entities affected by the bulk update are in our persistence context, they will not be touched at all. Bulk updates go around the entity manager’s “cache” for the unit of work, which means we should not mix bulk updates with modification of entities attached to the persistence context, unless completely separate entities are involved. In general, I try to avoid any complex logic with attached entities after I execute a bulk update/delete – and typically the scenario does not require it anyway.

To demonstrate the problem with a snippet of code:

BulkUpdateVsPersistenceContext.java

System.out.println("dog.name = " + dog.getName()); // Rex

new JPAUpdateClause(em, QDog.dog)
  .set(QDog.dog.name, "Dex")
  .execute();

dog = em.find(Dog.class, 1); // find does not do much here
System.out.println("dog.name = " + dog.getName()); // still Rex

em.refresh(dog); // this reads the real data now
System.out.println("after refresh: dog.name = " + dog.getName()); // Dex

The same problem applies to JPQL queries executed after changes to the entities attached to the current persistence context within a transaction. Here the behaviour is controlled by the entity manager’s flush mode, which defaults to FlushModeType.AUTO.7 Flush mode AUTO forces the persistence context to flush all updates into the database before executing a query. But with flush mode COMMIT we would get inconsistencies just like in the bulk update scenario. Obviously, flushing the changes is a reasonable option – we’d flush them sooner or later anyway. The bulk update scenario, on the other hand, requires us to refresh attached entities, which is much more disruptive and also costly.
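
A sketch of the default AUTO behaviour:

em.getTransaction().begin();
Dog dog = em.find(Dog.class, 1);
dog.setName("Dex"); // not flushed yet

// With FlushModeType.AUTO the pending UPDATE is flushed before this query
// runs, so the query sees the new name. With COMMIT it would read stale data.
List<Dog> dexes = em
    .createQuery("select d from Dog d where d.name = 'Dex'", Dog.class)
    .getResultList();
em.getTransaction().commit();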

We can’t escape SQL and relational model

Somewhere under the covers SQL is lurking and we had better know how it works. We will tackle this topic with more passion in the second part of the book – for now we will just demonstrate that not knowing does not work. Imagine we want a list of all dog owners, and because there are many of them we want to paginate it. This is the 101 of any enterprise application. In JPA we can use the methods setFirstResult and setMaxResults on the Query object, which correspond to the SQL OFFSET and LIMIT clauses.8

Let’s have a model situation with the following owners (and their dogs): Adam (with dogs Alan, Beastie and Cessna), Charlie (no dog), Joe (with Rex and Lassie) and Mike (with Dunco). If we query for the first two owners ordered by name – pagination without order doesn’t make sense – we’ll get Adam and Charlie. However, imagine we want to display the names of their dogs in each row. If we join the dogs too, we’ll just get Adam twice, courtesy of his many dogs. We may select without the join and then select the dogs for each row, which is our infamous N+1 select problem. This may not be a big deal for a page of 2, but for 30 or 100 we can see the difference. We will talk about this particular problem later in the chapter about N+1.

These are the effects of the underlying relational model and we cannot escape them. It’s not difficult to deal with them if we accept the relational world underneath. If we fight it, it fights back.

Additional layer

Reasoning about common structured, procedural code is quite simple for simple scenarios. We add higher-level concepts and abstractions to deal with ever more complex problems. When we use JDBC we know exactly where in the code our SQL is sent to the database; it’s easy to control it, debug it, monitor it, etc. With JPA we are one level higher. We can still try to measure the performance of our queries – after all, a typical query is executed where we call it – but there are some twists.

First, a query can be cached in a query cache – which may be good if it provides correct results – but this also significantly distorts any performance measurement. The JPA layer itself takes some time. A query has to be parsed (add Querydsl serialization to it when used) and entities created and registered with the persistence context so they are managed as expected. This distorts the result for the worse, not to mention that for big results it may trigger additional GC that plain JDBC would not have.

The best bet is to monitor the performance of the SQL on the server itself. Most decent RDBMSs provide some capabilities in this respect. We can also use a JDBC proxy driver that wraps the real one and performs some logging on the way. Maybe our ORM provides this to a degree, at least in the logs if nowhere else. This may not be easy to process, but it’s still better than no visibility at all. A more sophisticated system may provide nice measurements, but can also add a performance penalty – which perhaps doesn’t affect the measured results too much, but can still affect the overall application performance. Of course, monitoring overhead is not related only to JPA; we would get it using plain JDBC as well.

I will not cover monitoring topics in the book – they are natural problems with any framework, although we can argue that access to the RDBMS is kind of critical. The unit of work pattern causes the real DB work to happen somewhere else than the domain code would indicate. For simple CRUD-like scenarios this is not a problem, not even from the performance perspective (mostly). For complex scenarios, for which the pattern was designed in the first place, we may need to revisit what we send to the database if we encounter performance issues. This may also affect our domain. Maybe there are clean answers for this, but I don’t know them. I typically rather tune down how I use JPA.

All in all, the fact that JPA hides some complexity and adds another set of it is a natural aspect of any layer we add to our application – especially a technological one. This is probably one of the least questionable parts of JPA. We either want it and accept its “additional layer-ness” or choose not to use it. Know that when we discover any problem we will likely have to deal with the complexity of the whole stack.

Big unit of work

I believe that the unit of work pattern is really neat, especially when we have support from ORM tools. But there are legitimate cases when we run into trouble because the context is simply too big. This may easily happen with complicated business scenarios and it may cause no problem at all. Often, though, users do see the problem: a request takes too long, or a nightly scheduled task is still running late in the morning. The code looks good, it’s as ORM-ish as it can be – it’s just slow.

We can monitor or debug how many objects are managed; often we can see the effects on the heap. When this happens something has to change, obviously. Sometimes we can deal with the problem within our domain model and let the ORM shield us from the database. Sometimes it’s not possible and we have to let the relational world leak into our code.

Let’s say we have some test cases creating dog and breed objects. In an ideal case we would delete all of them between tests but, as it happens, we are working on a database that contains some fixed set of dogs and breeds as well (don’t ask). So we mark our test breeds with a ‘TEST’ description. Dog creation is part of the tested code, but we know the dogs will be of our testing breeds. To delete all the dogs created during the test we may then write:

delete from Dog d where d.breed.description = 'TEST'

That’s pretty clear JPQL. Besides the fact that it fails on Hibernate, it does not touch our persistence context at all and does its job. We can do the same with a subquery (which works on Hibernate as well) – or we can fetch the testing breeds into a list (they are then managed by the persistence context) and execute JPQL like this:

delete from Dog d where d.breed in :breeds

Here breeds would be a parameter whose value is a list of the testing breeds. We may fetch a plain list of breed.id instead – this does not get managed, takes less memory and pulls less data from the database with the same effect; we just say where d.breed.id in :breedIds instead – if our ORM supports it like this, but it’s definitely OK with JPA. I’ve heard arguments that this is less object-oriented. I was like “what?”

Finally, what we can do is start by fetching the testing breeds, then fetch all the dogs of these breeds and call em.remove(dog) in a cycle. I hope this, object-oriented as it is, is considered a bit of a stretch even by OO programmers. But I saw it in teams where JPQL and bulk updates were not very popular (read also as “not part of the knowledge base”).

The persistence context (unit of work) has a lot to do when the transaction is committed. It needs to be flushed, which means it needs to check all the managed (or attached) entities and figure out whether they have changed or not (dirty checking). How the dirty checking is performed is beyond the scope of this book; typically some bytecode manipulation is used. The problem is that this takes longer when the persistence context is big. When it’s big for a reason, we have to do what we have to do; but when it’s big because we didn’t harness bulk updates/deletes, then it’s simply wasteful. Often it also takes a lot of code when a simple JPQL statement says the same. I don’t accept the answer that reading JPQL or SQL is hard. If we use an RDBMS we must know it reasonably well.

Sometimes a bulk update is not a feasible option, but we still want to loop through some result, modify each entity without interacting with the others and flush it (or batch the updates if possible). This sounds similar to the missing “cursor-like result set streaming” mentioned in the chapter about missing parts – a feature some ORM providers do have – although that case covers read-only scenarios without any interaction with the persistence context. If we want to do something with each single entity as we loop through the results and we hit performance problems (mostly memory-related), we may try to load smaller chunks of the data, and after processing each chunk flush the persistence context and also clear it with EntityManager.clear().
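
A sketch of such chunked processing (run inside a transaction; the chunk size and the modification are illustrative):

int chunkSize = 100;
for (int first = 0; ; first += chunkSize) {
    List<Dog> chunk = em
        .createQuery("select d from Dog d order by d.id", Dog.class)
        .setFirstResult(first)
        .setMaxResults(chunkSize)
        .getResultList();
    if (chunk.isEmpty()) break;

    for (Dog dog : chunk) {
        dog.setAge(dog.getAge() + 1); // some per-entity change
    }
    em.flush(); // send the UPDATEs for this chunk
    em.clear(); // detach processed entities, keeping the context small
}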

Talking about read-only scenarios, it’s also a shame that JPA as of 2.2 still does not have any notion of read-only entities. EclipseLink has provided the @ReadOnly annotation for entity classes since TopLink times; Hibernate has its @Immutable annotation that works not only for classes but also for fields/methods. This does not make the persistence context smaller by itself, but such classes can be skipped during dirty checking, not to mention the benefit of the explicit information that they are supposed to be read-only.

Other entity manager gotchas

Consider the following snippet:

em.getTransaction().begin();
Dog dog = new Dog();
dog.setName("Toothless");
em.persist(dog);
dog.setAge(4);
em.getTransaction().commit();

We create the dog named Toothless and persist it, setting its age afterwards. Finally, we commit the transaction. What statements do we expect in the database? Hibernate does the following:

  1. On persist it obtains an ID. If that means querying a sequence, it will do it. If it needs to call INSERT because the primary key is of type AUTO_INCREMENT or IDENTITY (depending on the database), it will do so. If the insert is used, age is obviously not set yet.
  2. When flushing, which happens during the commit(), it will call INSERT if it wasn’t called in the previous step (that is, when a sequence or another mechanism was used to get the ID). Interestingly enough, it will call it with the age column set to NULL.
  3. Next, an additional UPDATE is executed, setting the age to 4. The transaction is committed.

The good thing is that whatever mechanism is used to get the ID value, we get a consistent sequence of INSERT and UPDATE. But why? Is it necessary?9

There is a reason I’m talking about Hibernate, as you might have guessed. EclipseLink simply reconciles it all into a single INSERT during the commit(). If it is absolutely necessary to insert the entity into the database in a particular state (maybe for some strange constraint reasons) and update it afterwards, we have to “push” it with an explicit flush() placed somewhere in between. That is probably the only reliable way to do it if we don’t want to rely on the behaviour of a particular JPA provider.10

JPA alternatives?

When JPA suddenly stands in our way instead of helping us, we can still fall back gracefully to the JDBC level. Personally I don’t mix these within a single scenario, but we can. For that we have to know what provider we use, unwrap its concrete implementation of EntityManager and ask it for a java.sql.Connection in a non-portable manner. When I don’t mix scenarios, I simply ask Spring (or possibly another container) to inject the underlying javax.sql.DataSource and then I can access a Connection without using JPA at all. Talking about Spring, I definitely go for their JdbcTemplate to avoid all the JDBC boilerplate. Otherwise I prefer JPA for all the reasons mentioned in the Good Parts.
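
With Hibernate, for instance, the unwrapping might look like this (the query is illustrative):

Session session = em.unwrap(Session.class);
session.doWork(connection -> {
    // connection is the plain java.sql.Connection backing the entity manager
    try (Statement stmt = connection.createStatement();
         ResultSet rs = stmt.executeQuery("select count(*) from dog")) {
        rs.next();
        System.out.println("dogs in the table: " + rs.getLong(1));
    }
});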

We’ve lightly compared JPA with concrete ORMs already, but they are still the same concept – it is much more fun to compare it with something else altogether, let’s say a different language – like Groovy. We’re still firmly on the JVM, although it’s not very likely we’d do our persistence in Groovy and the rest of the application in Java. Firstly, Groovy also has its own ORM. It’s called GORM and, while not built into the core project, it is part of the Grails framework. I don’t have any experience with it, but I don’t expect a radical paradigm shift as it uses Hibernate ORM to access the RDBMS (although it also supports NoSQL solutions). Knowing Groovy, I’m sure it brings some fun into the mix, but it still is an ORM.

I often use the core Groovy support for relational databases and I really like it. It is no ORM, but it makes working with a database really easy compared to JDBC. I readily use it to automate data population as it is much more expressive than SQL statements – you mostly just use syntax based on Groovy Maps. With a little support code you can create helper insert/update methods that provide reasonable defaults for columns you don’t want to specify every time, or insert whole aggregates (master-slave table structures). It’s convenient to assign returned auto-increment primary keys to Groovy variables and use them as foreign keys where needed. It’s also very easy to create repetitive data in a loop.

I use this basic database support in Groovy even on projects where I already have entities in JPA but, for whatever reason, don’t want to use them. We actually don’t need to map anything into objects; mostly a couple of methods will do. Sure, we have to rename columns in the code when we refactor, but the column names are hidden in just a couple of places, often a single one. Bottom line? Very convenient, natural type conversion (although not perfect, mind you), little to no boilerplate code, and it all plays nicely with Groovy syntax. It definitely doesn’t bring as much negative passion with it as JPA does – perhaps because the level of sophistication is not so high. But in many cases simple solutions are more than enough.

4. Caching considerations

I didn’t want to discuss caching too much, but I gradually spread a couple of sections about it throughout the text. Later I decided to concentrate them here and be (mostly) done with it for the sake of this book.

While the persistence context (EntityManager or session) is sometimes considered a cache too, it is merely a part of the unit-of-work pattern (an identity map). The real cache sits underneath and is shared on the level of the EntityManagerFactory – or even among several of them across various JVMs in the case of distributed caches. This is called the second-level cache.11 It is used to enhance performance, typically by avoiding round-trips to the database. But caching has consequences.

Caching control

[JPspec] doesn’t say much about caching. It says how to configure it – starting with shared-cache-mode in the persistence.xml. But I’d rather study the caching documentation of a particular provider, because if we don’t care at all, we don’t even know whether and how we use the cache.

Without choosing shared-cache-mode it is up to the JPA provider and its defaults. This may render any use of @Cacheable annotations useless. Currently, Hibernate typically doesn’t cache by default, while EclipseLink caches everything by default. Being oblivious to the cache (not related to cache-oblivious algorithms at all) is rather dangerous, especially if our application is not the only one running against the same database. In that case setting shared-cache-mode explicitly to NONE is by far the best start. We may revisit our decisions later, but at least we know what is happening.

Probably the most important questions to consider are: Is our application the sole user of a particular database? Is it running in a single process? If yes, we may safely use the cache. But I’d not “just use the defaults” of the JPA provider – which may be no cache at all. It’s not a good idea to use caching without any thought. I prefer not to use it until I feel the need. When we start using the cache and tuning it, we must be prepared for a walk that may not be as easy as it seems.

An entity cache returning entities by ID really quickly sounds like a good idea because it makes the problem of eager loads of to-one relationships less serious. E.g. when we load a dog we may get its breed quickly from the cache. It doesn’t fix the problem though, as all the additional entities become part of the current persistence context whether we want them or not – e.g. when we want to update a single attribute of a dog we are not interested in its breed at all. We already mentioned that units of work bigger than necessary are not for free.

Second-level cache vs queries

Caching should be transparent, but just turning it on is a kind of premature optimization which – in virtually all cases – ends up being wrong. Any auto-magic can only go so far, and any caching leads to potential inconsistencies. I believe most JPA users don’t understand how the cache is structured (I’m talking from my own experience too, after all). This depends on the concrete ORM, but typically there is an entity cache and a query cache.

The entity cache helps with the performance of EntityManager.find, or generally with loading by the entity’s @Id attribute. But it will not help us if we accidentally obfuscate what we want with a query that would otherwise return the same entity. The provider has no way of knowing what entity (with what ID) will be loaded just by looking at arbitrary WHERE conditions. This is what the query cache is for. Bulk updates and deletes using JPQL go around both of these caches, and the safest way to avoid inconsistent data is to evict all entities of the modified type from the caches. This is often performed by the ORM provider automatically (again, check the documentation and settings).
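
If the provider does not evict automatically, we can do it ourselves through the standard javax.persistence.Cache interface:

Cache cache = em.getEntityManagerFactory().getCache();
cache.evict(Dog.class); // evict all cached Dog entities after a bulk update
// cache.evict(Dog.class, 1); ...or a single entity by its ID
// cache.evictAll(); ...or everything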

If we only ever work with whole entities and nothing else accesses the database, we can be pretty sure we always get the right result from the entity cache. You may wonder how this cache behaves in a concurrent environment (as any EE/Spring application inherently is). If you imagine it as a Map, even with synchronized access, you may feel the horror of getting the same entity instance (a Dog with the same ID) in two concurrent persistence contexts (like concurrent HTTP requests) that subsequently modify various fields on the shared instance. Luckily, ORMs provide each persistence context with its own copy of the entity. Internally they typically keep entities in the cache in some “dehydrated” form.12

Explicit application-level cache

Previously we ignored the case when we cache some entities under multiple different keys, not necessarily retrieved by the same method. Imagine a component that returns some classifiers by various attributes (ID, name, code, whatnot) – a pretty realistic scenario. We code it as a managed component with a declarative cache. There are separate methods to obtain the classifier by specific attributes. If we retrieve the same classifier by three different attributes we’ll populate three different caches with – essentially – the same entity stored under a different key (attribute value) in each of them.

Even if we ignore the fact that the entities sometimes participate in the persistence context and sometimes don’t, this consumes more memory than necessary. It may still be acceptable though. Personally I believe that the non-determinism regarding the attached/detached state is more serious, but let’s say these entities are only for reading and we may not care. Imagine further that we may filter on these entities – like “give me a list of classifiers with names starting with BA”. Now we have even more variability in the cache keys – any distinct filter is a key – and probably many more repeated entities in the results. And these are likely distinct objects even for the same logical entities. This may either explode our cache, or cause frequent evictions rendering our cache useless, probably utilizing more CPU in the process.

If the amount of underlying data is big we may have no other choice, but in the case of various administered static data, code books or classifiers, the size of the table is typically small. Once our DB admins reported that we queried a table with 20k rows of static data 20 million times a day – while the traffic on our site was on the order of thousands of requests a day. It was an unnecessarily rich relational model and this part of the system would have been better represented in some kind of document/NoSQL store. We didn’t use the relations in that data much – and they actually prevented some needed data fixes because of the overly restrictive cobweb of foreign keys. But this was the design we had and we needed a fix. “Can’t we just cache it somehow?” The data were localization keys – not for the application itself but for some form templates – so they were part of the data model. We joined these keys based on the user’s language for each row of a possibly multi-row template (often with hundreds of rows).

First we needed to drop the joins. The plan was to ask a cache component for any of these localization keys based on its ID and the language. It took us some time to rewrite all the code that could utilize it and originally just joined the data in queries. But the result was worth it. The component simply read the whole table and created a map of maps keyed by language and ID. The DB guys were happy. We stopped executing some queries altogether and removed unnecessary joins from many others.
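
A condensed sketch of such a component (all names are made up, error handling omitted):

public class LocalizationCache {

    private final Map<String, Map<Integer, String>> textsByLanguageAndId =
        new HashMap<>();

    public LocalizationCache(EntityManager em) {
        // one full-table query replaces the joins previously repeated everywhere
        List<LocalizationKey> keys = em.createQuery(
            "select k from LocalizationKey k", LocalizationKey.class)
            .getResultList();
        for (LocalizationKey key : keys) {
            textsByLanguageAndId
                .computeIfAbsent(key.getLanguage(), lang -> new HashMap<>())
                .put(key.getId(), key.getText());
        }
    }

    public String text(String language, Integer id) {
        Map<Integer, String> texts = textsByLanguageAndId.get(language);
        return texts != null ? texts.get(id) : null;
    }
}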

There are other situations when we may seriously consider coding our own cache explicitly instead of relying on a declarative one like JSR 107. A declarative cache doesn’t mean an unattended one anyway – we should limit it, set policies, etc. It can be extremely handy when we can get results cheaply, which typically happens when a limited subset of a possibly big chunk of data is used repeatedly with the same keys.

Programmatic (explicit) cache can shine in other areas:

  • When we work with a limited set of data and we need it often – and we want to pre-cache it. This may be a reasonable scenario for a declarative cache too, if we can pre-fill it somehow and there is a single way we obtain the data.
  • If we require the same data based on various attributes (different views). We can use multiple maps that point to the same actual instances. This can work both for cases when we cache as we go (when misses are possible) and when we preload the whole set (ideal case when not too big).
  • The cache can cooperate with any administrative features that modify the underlying data. Because we code it with full knowledge of the logic behind the data we can selectively refresh or update the cache in a smart way. In declarative cache we often have to evict it completely – although this can still be a good simple strategy even for programmatic caches, especially when the refresh requires a single select.
  • Full preload requires more memory up-front and slows down the startup of the application (can be done lazily on-demand) but deals with the DB once and for all. Declarative cache executes a query on demand for every miss but loads only some of the data that require caching – potentially less efficiently than full preload.

Of course, there are cases when we can’t use a fully pre-loaded cache. In general, ask yourself whether “cache this by this key” is the best (or a good enough) solution, or whether you can implement a component utilizing the logic behind the data better – and whether it’s all worth it.

Conclusion

Caching can help tremendously with the performance of our applications – but it can also hurt if done badly. We should be very aware of our cache setup and very clear about how we want to do it. In the code it may look automagical, but it must be explicit somewhere – our strategy must be well known to all the developers who may encounter it in any way (even unknowingly).

We always trade something for something – with caching it’s typically memory for speed.13 Memory can slow us down too, but in general we have plenty of it nowadays. While we are in a single process a JPA provider can typically manage the data consistency. If we have a distributed architecture we enter a different world altogether and I’d think twice before going there. We must feel the need for it and we must measure what we get – because we’ll definitely get the complexity and we have to think about consistency much more.

Don’t freely mix caching on multiple levels. The database cache is mostly transparent to us, but when we mix two declarative caches we often make matters worse – especially when we cache entities with a technology that is not aware of their lifecycle within the persistence context.

Finally, depending on what our caching keys are, we may waste a lot of memory. An entity ID (as in the second-level cache) is a natural and good key. But if we key on many various selectors that may return the same entities (single ones or even whole collections) we may store many instances of the same entity in the cache. That wastes memory. Knowing more about the logic behind the keys and values, we may get better results with our own implementation of an explicit cache on the application level. It may require more effort but the pay-off may be significant.

Shortly:

  • Don’t cache mindlessly. Design what to cache (or not) and size the cache regions properly.
  • Realize that with caching we trade memory for CPU – and hope that the additional GC pressure doesn’t eat the savings. Check the heap usage after changes.
  • Beware of caching on multiple levels, especially combining JPA and non-JPA caches inside JVM.
  • Consider implementing your own caching for selected sub-domains. The same data accessed by various criteria may be fit for this.
  • Measure the impact of any change related to caching.

5. Love and hate for ORM

ORM is typically seen as a good fit with domain-driven design (DDD). But ORM happened to become extremely popular and people thought it would solve all their database access problems without the need to learn SQL. This approach obviously failed and hurt ORM’s reputation a lot, too. I’ll say it again: ORM is very difficult, and even using a well documented ORM (like the JPA standard) is hard – there’s simply too much in it. We’re dealing with a complex problem, with a mismatch – and quite a mismatch it is. And it’s not the basic principle that hurts; we always get burnt on many, too many, details.

Vietnam of Computer Science

One of the best balanced texts critiquing ORM, and probably one of the most famous, is actually quite old. In 2006 Ted Neward wrote an extensive post with a fitting, if provoking, name: The Vietnam of Computer Science. If you seriously want to use ORM on any of your projects, you should read it – unless data access is not an important part of that project (who are we kidding, right?). You may skip the history of the Vietnam war, but definitely give the technical part the dive it deserves.

ORM wasn’t that young anymore in 2006 and various experiences had shown that it easily brought more problems than benefits – especially if approached with partial knowledge, typically based on the assumption that “our developers don’t need to know SQL”. When some kind of non-programming architect recommends this for a project, and is long gone when the first troubles appear, it’s really easy to recommend it again. It was so easy to generate entities from our DDL, wasn’t it? Hopefully managers are too high up to be an audience for JPA/ORM recommendations, but technical guys can hurt themselves well enough. The JPA, for instance, is still shiny, it’s a standard after all – and yeah, it is kind of Java EE-ish, but this is the good new EE, right?

Wrong. Firstly, JPA/ORM is so complex when it comes to details that using it as a tool for “cheap developers” who don’t need to learn SQL is as silly as it gets. I don’t know when learning became a bad thing in the first place, but some managers think they can save man-months/time/money when they skip training14. When things get messy – and they will – there is nobody around who really understands ORM and there is virtually no chance to rewrite the data access layer to get rid of it. The easiest thing to do is to blame the ORM.

You may ask: What has changed since 2006? My personal take on the answer would be:

  • Nothing essential could have changed; we merely smoothed some rough edges, got a bit more familiar with an already familiar topic and, in the Java space, standardized the beast (JPA).
  • We added more options to the query languages to make the gap between them and SQL smaller. Funny enough, the JPA initially didn’t help with this as it lagged a couple of years behind the capabilities of the leading ORM solutions.
  • Query-by-API (mentioned in the post) is much better nowadays; state-of-the-art technologies like Querydsl have a very rich fluent API that is also very compact (definitely not “much more verbose than the traditional SQL approach”). Also, both type safety and testing practices are much more developed.

Other than that, virtually all the concerns Ted mentioned are still valid.

Not much love for ORM

Whatever was written back in 2006, ORMs have been on the rise ever since. Maybe the absolute number of ORM experts is higher now than then, but I’d bet the ratio of experts among its users has plummeted (no research, just my experience). ORM/JPA is easily available: Java EE supports it, Spring supports it, you can use it in Java SE easily, you can even generate CRUD scaffolding for your application with JPA using some rapid application development tools. That means a lot of developers are exposed to it. In many cases it seems deceptively easy when you start using it.

ORM has a bad reputation with our DBAs – and for good reasons too. It takes some effort to make it generate reasonable queries, not to mention that you often have to break the ORM abstraction to do so. It’s good to start with clean, untangled code, as it helps tremendously when you need to optimize some queries later. Optimization, however, can complicate the matter. If you explicitly name columns and don’t load whole entities you may get better performance, but it will be unfriendly to the entity cache. The same goes for bulk updates (e.g. “change this column for all entities where…”). There are right ways to do this in the domain model, but learning the right path of OO and domain-driven design is probably even harder than starting with JPA – otherwise we’d see many more DDD-based projects around.
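
For illustration – assuming a hypothetical Invoice entity and em being an EntityManager – the two optimizations just mentioned might look like the following sketch; both are faster, and both sidestep the entity cache:

    // Projection: explicit columns instead of whole entities.
    List<Object[]> rows = em.createQuery(
        "select i.id, i.total from Invoice i where i.customer.id = :cid",
        Object[].class)
        .setParameter("cid", customerId)
        .getResultList();

    // Bulk update: a single statement, but the persistence context and
    // the entity cache know nothing about the changed rows.
    int updated = em.createQuery(
        "update Invoice i set i.paid = true where i.dueDate < :today")
        .setParameter("today", today)
        .executeUpdate();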

We talked about caching already, and I’d say that misunderstandings around ORM caching are the reason for a lot of performance problems and potentially for data corruption too. This is when it starts to hurt – and when you can’t get out easily, hate often comes with it. When you make a mistake with a UI library, you may convince someone to let you rewrite it – and they can see the difference. Rewriting the data access layer gives seemingly nothing to the client, unless the data access is really slow – but then the damage done is already quite big anyway.

But a lot of hate

When you need to bash some technology, ORM is a safe bet. I don’t know whether I should even mention ORM Is an Offensive Anti-Pattern, but as it is now the top result on a Google search for ORM, I’ll do it anyway. It wouldn’t be fair to say the author doesn’t provide an alternative, but I had read a lot of his other posts (before I stopped) to see where “SQL-speaking objects” are headed. I cannot see the single responsibility principle in it at all, and while SRP doesn’t have to be the single holy grail of OOP, putting everything into the domain object itself is not a good idea.15

There are other flaws in this particular post. Mapping is explained on a very primitive case, while ORM uses the unit-of-work for cases where you want to execute multiple updates in a single transaction, possibly on the same object. If every elementary change on the object emits an SQL statement and we want to set many properties of the same underlying table row in one transaction, we get performance even worse than non-tuned ORM! You can answer with an object exposing various methods to update this and that, which possibly leads to a combinatorial explosion. Further, in the age of dependency injection we are shown the most verbose way to do ORM, one I haven’t seen for years.
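
To sketch what the unit-of-work buys us – using a hypothetical Person entity – many in-memory changes collapse into a single UPDATE at commit:

    em.getTransaction().begin();
    Person person = em.find(Person.class, personId); // one SELECT
    person.setFirstName("Anna");
    person.setLastName("Novak");
    person.setEmail("anna.novak@example.com");
    // no SQL emitted so far - the persistence context merely tracks the dirty entity
    em.getTransaction().commit(); // a single UPDATE covering all three columns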

“SQL-speaking objects” bring us to the more general topic of modelling objects in our programs. We hardly ever model real-life objects as they act in real life, because in many cases we should not. Information systems allow changing data about things that normally cannot change, because someone might have entered the information incorrectly in the first place.

How should we model the behaviour of a tin can? Should it have an open method? Even in real life someone opens the tin with a tin opener – an interaction of three objects. Why do we insist on objects storing themselves then? It may still be a perfectly valid pattern – as I said, real life is not always a good answer to our modelling needs – but it is often overused. While in real life a human does it, we often have various helper objects for behaviour, often ending with -er. I see why this is considered an antipattern, but I personally dislike the generalization of rules like “objects (classes) ending with -er are evil”. Sure, they make me think – but TinOpener ends with “-er” too and it’s a perfectly valid (even real-life!) object.

In any case I agree with the point that we should not avoid SQL. If we use ORM we should also know its QL (JPQL for JPA) and how it maps to SQL. We generally should not ignore what happens down the stack, especially when the abstraction is not perfect. And ORM, no question about it, is not a perfect abstraction.

For a much better case against ORM, read ORM is an anti-pattern. Here we can find a summary of all the bad things related to ORM; there is hardly anything to argue about, and if you read Ted Neward’s post too, you can easily map the problems from one post to the other. This brings us full circle back to Martin Fowler and his ORM Hate. We simply have to accept ORM as it is and either avoid it, or use it with its limitations, knowing that the abstraction is not perfect. If we avoid it, we have to choose either the relational or the object world, but we can hardly have both.

Is tuning-down a way out?

Not so long ago Uncle Bob wrote an article called Make the Magic go away. It nicely sums up many points related to using frameworks, and although ORM is not mentioned, it definitely fits the story.

Using less of the ORM and relying less on complex auto-magic features is the way I propose. It builds on the premise that we should use ORM where it helps us, avoid it where it does not, and know the consequences of both styles and their interactions. Using both ways may add complexity too, but in my experience it does not, and it is not an inherent problem. JPA maps values from the DB to objects much better than JDBC does, so even if you used your entities as dumb DTOs it would still be worth it. It also abstracts the concrete SQL flavour away, which has its benefits – and unless this is a real issue for more than a couple of queries, you can resolve the rest with either the native SQL support in JPA or a JDBC-based solution.
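
As a sketch of this mix-and-match style – again with a hypothetical Invoice entity – JPQL covers the portable part and JPA’s native query support covers the rest, still mapping results back to entities:

    // JPQL where the abstraction fits...
    List<Invoice> unpaid = em.createQuery(
        "select i from Invoice i where i.paid = false", Invoice.class)
        .getResultList();

    // ...native SQL where it leaks (vendor-specific syntax, hints, etc.).
    @SuppressWarnings("unchecked")
    List<Invoice> overdue = em.createNativeQuery(
        "SELECT * FROM invoice WHERE due_date < CURRENT_DATE", Invoice.class)
        .getResultList();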

Coming to relationships, there may be many where to-one poses no problem. In that case make your life easier and map them to objects. If cascading loads of to-one relationships cause problems, you can try how well LAZY is supported by your ORM. Otherwise you have to live with it for non-critical cases and work around it for the critical ones with queries – we will get to this in the chapter Troubles with to-one relationships.
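
Declaring it is trivial – here on a hypothetical Dog/Breed pair – but remember that for to-one relationships LAZY is merely a hint, and some providers honour it only with bytecode weaving:

    @Entity
    public class Dog {
        @Id
        private Integer id;

        @ManyToOne(fetch = FetchType.LAZY) // a hint, not a guarantee
        @JoinColumn(name = "breed_id")
        private Breed breed;
    }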

If your ORM allows it, you may go even lower on the abstraction scale and map raw foreign key values instead of related objects. While the mapping part is possible with the JPA standard, it is not enough as JPA does not allow you to join on such relationships. EclipseLink offers the last missing ingredient and this solution is described in the chapter Removing to-one altogether.
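
A minimal sketch of the idea with the same hypothetical entities – plain JPA maps the column just fine, it is the ad-hoc join in the query that needs the EclipseLink extension:

    @Entity
    public class Dog {
        @Id
        private Integer id;

        @Column(name = "breed_id")
        private Integer breedId; // raw FK value - no object graph, no surprise loads
    }

    // EclipseLink allows joining two unrelated entities with an explicit ON clause:
    // select d, b from Dog d join Breed b on d.breedId = b.id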

This all renders ORM a somewhat lower-level tool than intended, but still an extremely useful one. It still allows you to generate the schema from objects if you have control over your RDBMS (sometimes you don’t), or even just to document your database with a class diagram of the entities16. We still have to keep the unit-of-work and caching in check, but both are very useful when used well. I definitely don’t avoid using EntityManager – how could I?

And there is more

This list of potential surprises is far from complete, but for the rest of the book we will narrow our focus to relationships. We will review their mapping and how it affects querying and generated SQL queries.

I’d like to mention one more problem, a natural outcome of real-life software development: JPA is implemented by a couple of projects, and these projects have bugs. If a bug is critical in general, they fix it quite soon. If it’s critical only for your project, they may not. Most ORM projects are now open source and you may try to fix it yourself, although managing patches against an evolving OSS project is rather painful.

It’s all about how serious your trouble is – if you can find a work-around, use it. For instance, due to a bug (now fixed) EclipseLink returned an empty stream() for lazy lists. Officially they didn’t support Java 8, while in fact they extended Vector improperly, breaking its invariants. We simply copied the list and called stream() on the copy, and we had a utility for it. It wasn’t nice, but it worked and it was very easy to remove later. When they fixed the issue we simplified the code in the utility method, then inlined all its occurrences, and it looked like it had never happened.
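
The utility was as dull as possible – roughly this sketch (the class and method names here are made up) – copy first, then stream, so the broken Vector subclass was never streamed directly:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.stream.Stream;

    public final class Lists {
        // defensive copy works around the provider's broken lazy list
        public static <T> Stream<T> safeStream(Collection<T> collection) {
            return new ArrayList<>(collection).stream();
        }
    }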

You may think about switching your provider – and JPA 2.1 brought a lot of goodies to make this easier, as many properties in persistence.xml are now non-proprietary. But you still have to go through the configuration (focus on caching especially) to make it work “as before” – and then you may get caught in what I call a bug cross-fire. I had been using Hibernate for many years when I joined another project using EclipseLink. After a few months we wanted to switch to Hibernate, as we discovered that most of the team was more familiar with it. But some of our JPA queries didn’t work on Hibernate because of a bug.
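
For illustration, the standardized javax.persistence.* properties (JDBC settings since JPA 2.0, schema generation since JPA 2.1) can live in persistence.xml or be passed programmatically; a minimal sketch with a hypothetical unit name and H2 settings:

    import java.util.HashMap;
    import java.util.Map;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;

    Map<String, String> props = new HashMap<>();
    props.put("javax.persistence.jdbc.driver", "org.h2.Driver");
    props.put("javax.persistence.jdbc.url", "jdbc:h2:mem:demo");
    props.put("javax.persistence.jdbc.user", "sa");
    props.put("javax.persistence.schema-generation.database.action", "drop-and-create");
    EntityManagerFactory emf = Persistence.createEntityManagerFactory("demo-unit", props);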

So even something that should work in theory may be quite a horror in practice. I hope you have tests to catch any potential bug affecting your project.