Skip to content

Instantly share code, notes, and snippets.

@odrotbohm
Created August 28, 2012 10:36
Show Gist options
  • Save odrotbohm/3497047 to your computer and use it in GitHub Desktop.
Save odrotbohm/3497047 to your computer and use it in GitHub Desktop.
JPQL / SQL riddle
JPQL: select p from Person p
left outer join p.address address
order by address.city
SQL: select person0_ from Person person0_
left outer join Address address1_ on person0_.id=address1_.person_id
order by address1_.city
JPQL: select p from Person p
left outer join p.address address
order by p.address.city
SQL: select person0_ from Person person0_
left outer join Address address1_ on person0_.id=address1_.person_id
cross join Address address2_ where person0_.id=address2_.person_id
order by address2_.city
The core difference here is that qualifying the sorting criteria to p.address.city
instead of address.city results in an additional JOIN clause added which results in
persons not having an address being ruled out from the result set. Environment is
Hibernate 4.1.5.SP1, H2 1.3.168.
1. Should the two JPQL queries result in the SQL queries shown? In particular
shouldn't they actually generate the very same SQL?
2. Why does the second JPQL query add the additional JOIN which effectively rules
out persons without an address?
3. Is it a good idea that adding a sort criterion potentially adds an additional
join which influences the actual result set return (i.e. "the number of results
changes just because I added a sort criteria")
@rgielen
Copy link

rgielen commented Aug 28, 2012

  1. No
  2. Because you actually join it twice, once you created an alias for your first join (join p.address address) and dont't use it for the path to your order by expression
  3. Basically this is ok, as you could join the same table for various reasons (aka roles). Of course it looks moron to a human...

Moreover, the cross join (added FTOMH by eg. Hibernate 4 "for better JPA 2 compatibility") is a killer for most DMBS when having serious amount of data.

@odrotbohm
Copy link
Author

I tend to disagree :). Digging a bit deeper into the issue I found out that the core problem for it is the way JPA defines a path expression:

Path expression navigability is composed using “inner join” semantics. That is, if the value of a non-terminal field in the path expression is null, the path is considered to have no value, and does not participate in the determination of the result.

As this definition is also considered when applying the order by clauses to the query the additional cross join is added to the query. Whether it has to be a cross join or not doesn't matter from a general perspective (it might from a performance one but who's using relational databases for really performant data access anyway? ;) ). So here's what I disagree on in particular:

  1. If path expression was defined to take mapping information into account it could be discovered that the additional join is not needed as it's already present in the query. Implicit joins added to be able to sort the result set must not influence the actual result set except of ordering.
  2. I don't join it twice, I add a sorting clause to the query. This must not affect the actual query criteria.
  3. The additional join is solely derived from the (IMHO wrong) requirement to implement path expressions as inner joins which actually causes the shown side effect. I've brought this issue to the attention of the JPA EG.

Anyway, thanks a lot for sharing your opinion!

@rgielen
Copy link

rgielen commented Aug 28, 2012

I tend to disagree to some of your disagreement :)

  1. For an optimizing JPQL query parser, it would be great to recognize an existing join with exactly same semantics as the path expression navigation rule implies. Actually this does IMO not apply to your statement, because you have a left outer join instead of an inner join. The seen behavior is predictable once you apply the rule, though your human intuition sees more options. Why did you introduce the alias in the first place? Why didn't you re-use it within the query? Since when did SQL query parsers and planners do a good job without us hinting them to the right direction? :)
  2. You don't join twice, right - but you imply a second join with your path expression not matching an existing join. So the cited rule has to be applied. Once again, I think you get a perfectly predictable result. Btw, I wouldn't be too surprised seeing Hibernate constructing two joins even if your explicit one was an inner join...
  3. On one hand, I can follow your arguments - the left join is a strictly widening operation, all column values addressed with the order by path expression will be present, at least. On the other hand, getting more values - especially null values - for the order column is a direct side effect on your result, including glitches like nulls first / nulls last variations. How many people might already have used such a path expression following the stated rules to get rid of an explicit join definition, "optimizing" their code in a valid, backed-by-spec way? How many people (still) construct dynamic queries from StringBuilder magic, rather than Criteria API? For such queries, the existence of any left join would in addition to a decoupled order by term would produce clearly different results, without being able to omit then implicit widening any more.

Even after rethinking it, I tend to favor the state and rules as is. Especially given that I as the programmer only had to explicitly navigate my shiny little handcrafted alias to get exactly what I'm looking for :)

@odrotbohm
Copy link
Author

Interesting food for thought :). First of all, I don't think this is a parsing issue actually but one touching the translation from JPQL into SQL. It seems your core argument is that the JPQL query explicitly aliases the join criteria. Let's take a step back and try to simplify the scenario. Actually the query could be formulated as:

select p from Person p order by p.address.city

Simply the fact, that this query does not return Person instances without Addresses feels utterly wrong to me. Nowhere is expressed that Persons without Addresses should be dropped. We want all Persons ordered by their Address' city. Here we don't need to discuss potentially equivalent joins and how to find out whether they actually mean the same. The behavior is also not caused by some join semantics but by a 3 line paragraph forcing a path expression into an inner join. To me, this just feels plain wrong. Ironically you've described that a user has to explicitly define what she likes to get back when discussing the duplicate joins. Here we have exactly the opposite scenario: the SQL outcome of define JPQL query is not what the user has expressed through it's mapping metadata and JPQL query.

So the JPQL answer is to force the user to explicitly add the outer join, which is not something the user adds to influence the original outcome of the query. What is restated here has already been defined in the entity mapping.

select p from Person p

and

select p from Person p outer join p.address address

are semantically equivalent. It's not a join to form a criteria, it's a join to work around the path expression definition as defined in the JPA and the fact that adding an order by address.city to the first query would result an additional inner join being added, whereas it wouldn't for the second one. This is somewhat comprehensible as the address in the order by clause refers to different things then. Still it's just awkward to have to add all this boilerplate stuff to get a result which could be derived from what is already defined in the mapping.

Regarding the concatenation story: of course you wouldn't string-concatenate JPQL queries from scratch (honestly I think there's still too many people using JPQL which is mostly because of the verbosity of the Criteria API). The usage scenario is derived from Spring Data JPA where you can simply annotated a basic query to a repository method and apply dynamic sorting to it:

interface PersonRepository extends Repository<Person, Long> {

  @Query("select p from Person p")
  List<Person> findPeopleSortedBy(Sort sort);
}

The interface is then backed by a proxy which translates the given Sort object (consisting of a path expression and direction). To be safe against ambiguities we prefix the given sort by the alias given which leads us to the issue discussed. If you now explicitly join the addresses in the query definition, not only the sort binding would have to be different, also the client would have to know details about the query defined when formulating Sort instances. This has been brought to my attention via DATAJPA-252, hence the discussion here.

@rgielen
Copy link

rgielen commented Aug 29, 2012

Good discussion, indeed :)

Also taking a step back: the path notation is a natural way to express object graph navigation. The ORM layer has to map this into the relational model, following a predictable convention. This is done by using join semantics. The default join in SQL is the inner join. Thus, it maps to inner joins by default.

That said, my intuition for what

select p from Person p order by p.address.city

does seems to be completely different from yours - I find it totally natural that this statement imposes an inner join. As a user, I would get a strange feeling to have this path expression resolving to an outer join, while

select p from Person p where p.address.city='Bielefeld'

resolves to an inner join. I would expect a path expression to have a consistent meaning, regardless at what position (selection, projection, order by) it is used. Now if I want different behavior, I have to code my JPQL correspondingly - by adding the outer join explicitly and navigating it in my order by statement.

So, regarding your issue in Spring Data - first off I have to say I feel your pain :)

But what happens here is a natural problem of how we are able to access JPA. We basically have two options, working declaratively (JPQL) or programmatically (Criteria API). We don't have a mix of those two paradigms - while I would highly appreciate it, I understand that correctly parsing and transforming a JPQL statement into a Criteria graph, for adding Criteria programmatically later, is at least complicated if not even impossible. Your very problem seems to be that what Spring Data JPA tries to accomplish is exactly this paradigm mix that the JPA EG most likely thoughtfully dropped. This is where it get's really complicated, and you now seem to be at exactly that point, for what I can see. I see also that JPA introducing outer joins as default semantics for navigational expressions would solve you a big technical problem for your use case - nevertheless, I'd still argue that what JPQL does currently is totally correct for what it is designed, namely a plain declarative way to access your ORM layer...

@odrotbohm
Copy link
Author

I think there are two points where your argumentation is flawed:

  1. The translation of path expression actually takes mapping metdata into account, the problem is: it's only some of it. Property references are replaced by their column references if configured. Why should they not take relation - and thus join - information into account? That's just inconsistent.
  2. The two examples use path expressions in two different contexts. The latter uses them to define a criteria and if I define a criteria on the city property it can of course only be satisfied by Persons actually referring to Address instances. The former query does nothing of that kind. It does not express any intended criteria at all. So completely disagree on "this has to resolve to the same join" and would definitely argue the order by clauses need to be side-effect-free. Would you honestly accept an ORDER BY clause of a SQL statement implicitly adding a criteria to the WHERE clause?

My general statement is that a JPQL order by clause must not have side effects on the criteria defined in the where clause. If there is no where clause it must not implicitly rule out results by adding restricting joins.

@rgielen
Copy link

rgielen commented Aug 29, 2012

I think what I'm trying to say is that this is an issue related the "M" in ORM. M stands for consistent mappings of idioms only available in / suitable for the O tier to the R tier. The path navigation is clearly an O concept, which we have to map to R. For the sake of consistency, it always imposes a side effect, namely an inner join.

My second SQL example was a bit wrong. Better would have been

select p from Person p where p.address.city is null

Here you can see a real side effect of the path expression, without being anyhow explicit about joins. How does this side effect differ from the one imposed by order by, rethinking both examples with both possible join semantics? The consistency lies therein that a path navigation expression solely and always consistently affects the dynaset definition (FROM part) of your query, and not the projection (SELECT), selection (WHERE) or ORDER BY. So regardless where it is used, it should have the same meaning by having the same side effect on the dynaset. And no, it is not a side effect on the WHERE clause, although it may look that way in the generated SQL. Semantically it only adds an INNER JOINto the FROMclause, which Hibernate for whatever reason de-composes to a selection on a cartesian product. This is valid in relational algebra and the correct decomposition of the non-basic JOIN operation - but it is totally idiotic to use it in a DMBS with extended relational algebra representation, meaning anything newer than Oracle 8i :)

To be clear about, I was not talking about SQL at all, so of course I would not feel good about side effects on my dynaset imposed by ORDER BY in my SQL. Ironically SQL would not even allow me to generate side effects, because each and everything I want to use for ordering has to be explicitly represented by my FROMclause, including the right and thoughtfully chosen join needed to order by a criterion in a related table.

PS: The statement corrected above, although not intended, would have been correct for a "German-speaking" DBMS, since in Germany 'Bielefeld' equals null :)

@rgielen
Copy link

rgielen commented Aug 29, 2012

An addition: Say we have the following statement:

select p, privateAddress.city, companyAddress.city  from Person p 
      left outer join p.addresses privateAddress 
      left outer join p.addresses companyAddress
where
     privateAddress.type='PRIVATE' and companyAddress.type='COMPANY'

Regardless whether this looks like the best possible modeling - what exactly should

ORDER BY addresses.city

now do when added, following your recommendation to implicitly / automatically re-use widening joins when having an order by expression?

@odrotbohm
Copy link
Author

Nothing at all, it's invalid as addesses is not known at all. p.addresses would have been. Does the query actually make sense? Why would you want to outer join the tables (retaining null values) if you actually have a where clause that will definitely return false for these values?

Again, as soon as you manually add joins you do so for a reason and then it's up to you to define all related clauses correctly. But if you don't do (select p from Person p), adding an order byclause simply must not restrict the result set more than actually expressed in the query.

@rgielen
Copy link

rgielen commented Aug 29, 2012

OK, I seem to have way too much time (I haven't), but well - it''s an interesting problem.

So, here's another addition.

  1. my last comment over-emphasizes the concept of re-using an existing outer join. If applying the rule of "non-aliased order by path expressions implicitly add an outer join to the resulting SQL FROMclause, it would be clear what the query does and the actual selection in terms of rows would not be affected.
  2. Navigation expressions are all about navigating an object graph, since JPA tries to hide the relational stuff and give us a "purely" object oriented view on our data, right? So when applying the rule from point 1, you would effectively say: this ORDER BY is not an object graph expression any more, because in an object graph we can only walk along transitions - and not non-transitions, which is what a left outer join would construct. So if ORDER BY a.b would result in an outer join, the path expression applied to the object graph would read [a | ].b, which feels utterly wrong to me in an object model.

@rgielen
Copy link

rgielen commented Aug 29, 2012

Regarding your comment to the addresses-example:

Of course it should have read ORDER BY p.addresses. The example does make sense as a classic "map rows to columns" example, relying on an effective query planner and executer in the DBMS, causing least possible communication overhead between DB and JVM. More realistic would be: Generate result rows of form

firstname, lastname, home phone, business phone

given that phone numbers are attributes of an entity with 1:n relation to Persons and a discriminator column type

But again, the example not really good to make any case for my point in opposite to yours, as mentioned in my last comment - so forget about it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment