-
-
Save odrotbohm/3497047 to your computer and use it in GitHub Desktop.
JPQL: select p from Person p | |
left outer join p.address address | |
order by address.city | |
SQL: select person0_ from Person person0_ | |
left outer join Address address1_ on person0_.id=address1_.person_id | |
order by address1_.city | |
JPQL: select p from Person p | |
left outer join p.address address | |
order by p.address.city | |
SQL: select person0_ from Person person0_ | |
left outer join Address address1_ on person0_.id=address1_.person_id | |
cross join Address address2_ where person0_.id=address2_.person_id | |
order by address2_.city | |
The core difference here is that qualifying the sorting criteria to p.address.city | |
instead of address.city results in an additional JOIN clause added which results in | |
persons not having an address being ruled out from the result set. Environment is | |
Hibernate 4.1.5.SP1, H2 1.3.168. | |
1. Should the two JPQL queries result in the SQL queries shown? In particular | |
shouldn't they actually generate the very same SQL? | |
2. Why does the second JPQL query add the additional JOIN which effectively rules | |
out persons without an address? | |
3. Is it a good idea that adding a sort criterion potentially adds an additional | |
join which influences the actual result set return (i.e. "the number of results | |
changes just because I added a sort criteria") |
I tend to disagree :). Digging a bit deeper into the issue I found out that the core problem for it is the way JPA defines a path expression:
Path expression navigability is composed using “inner join” semantics. That is, if the value of a non-terminal field in the path expression is null, the path is considered to have no value, and does not participate in the determination of the result.
As this definition is also considered when applying the order by
clauses to the query the additional cross join
is added to the query. Whether it has to be a cross join
or not doesn't matter from a general perspective (it might from a performance one but who's using relational databases for really performant data access anyway? ;) ). So here's what I disagree on in particular:
- If path expression was defined to take mapping information into account it could be discovered that the additional join is not needed as it's already present in the query. Implicit joins added to be able to sort the result set must not influence the actual result set except of ordering.
- I don't join it twice, I add a sorting clause to the query. This must not affect the actual query criteria.
- The additional join is solely derived from the (IMHO wrong) requirement to implement path expressions as inner joins which actually causes the shown side effect. I've brought this issue to the attention of the JPA EG.
Anyway, thanks a lot for sharing your opinion!
I tend to disagree to some of your disagreement :)
- For an optimizing JPQL query parser, it would be great to recognize an existing join with exactly same semantics as the path expression navigation rule implies. Actually this does IMO not apply to your statement, because you have a left outer join instead of an inner join. The seen behavior is predictable once you apply the rule, though your human intuition sees more options. Why did you introduce the alias in the first place? Why didn't you re-use it within the query? Since when did SQL query parsers and planners do a good job without us hinting them to the right direction? :)
- You don't join twice, right - but you imply a second join with your path expression not matching an existing join. So the cited rule has to be applied. Once again, I think you get a perfectly predictable result. Btw, I wouldn't be too surprised seeing Hibernate constructing two joins even if your explicit one was an inner join...
- On one hand, I can follow your arguments - the left join is a strictly widening operation, all column values addressed with the order by path expression will be present, at least. On the other hand, getting more values - especially null values - for the order column is a direct side effect on your result, including glitches like nulls first / nulls last variations. How many people might already have used such a path expression following the stated rules to get rid of an explicit join definition, "optimizing" their code in a valid, backed-by-spec way? How many people (still) construct dynamic queries from StringBuilder magic, rather than Criteria API? For such queries, the existence of any left join would in addition to a decoupled order by term would produce clearly different results, without being able to omit then implicit widening any more.
Even after rethinking it, I tend to favor the state and rules as is. Especially given that I as the programmer only had to explicitly navigate my shiny little handcrafted alias to get exactly what I'm looking for :)
Interesting food for thought :). First of all, I don't think this is a parsing issue actually but one touching the translation from JPQL into SQL. It seems your core argument is that the JPQL query explicitly aliases the join criteria. Let's take a step back and try to simplify the scenario. Actually the query could be formulated as:
select p from Person p order by p.address.city
Simply the fact, that this query does not return Person
instances without Address
es feels utterly wrong to me. Nowhere is expressed that Person
s without Address
es should be dropped. We want all Person
s ordered by their Address
' city. Here we don't need to discuss potentially equivalent joins and how to find out whether they actually mean the same. The behavior is also not caused by some join semantics but by a 3 line paragraph forcing a path expression into an inner join. To me, this just feels plain wrong. Ironically you've described that a user has to explicitly define what she likes to get back when discussing the duplicate joins. Here we have exactly the opposite scenario: the SQL outcome of define JPQL query is not what the user has expressed through it's mapping metadata and JPQL query.
So the JPQL answer is to force the user to explicitly add the outer join, which is not something the user adds to influence the original outcome of the query. What is restated here has already been defined in the entity mapping.
select p from Person p
and
select p from Person p outer join p.address address
are semantically equivalent. It's not a join to form a criteria, it's a join to work around the path expression definition as defined in the JPA and the fact that adding an order by address.city
to the first query would result an additional inner join being added, whereas it wouldn't for the second one. This is somewhat comprehensible as the address
in the order by
clause refers to different things then. Still it's just awkward to have to add all this boilerplate stuff to get a result which could be derived from what is already defined in the mapping.
Regarding the concatenation story: of course you wouldn't string-concatenate JPQL queries from scratch (honestly I think there's still too many people using JPQL which is mostly because of the verbosity of the Criteria API). The usage scenario is derived from Spring Data JPA where you can simply annotated a basic query to a repository method and apply dynamic sorting to it:
interface PersonRepository extends Repository<Person, Long> {
@Query("select p from Person p")
List<Person> findPeopleSortedBy(Sort sort);
}
The interface is then backed by a proxy which translates the given Sort
object (consisting of a path expression and direction). To be safe against ambiguities we prefix the given sort by the alias given which leads us to the issue discussed. If you now explicitly join the addresses in the query definition, not only the sort binding would have to be different, also the client would have to know details about the query defined when formulating Sort
instances. This has been brought to my attention via DATAJPA-252, hence the discussion here.
Good discussion, indeed :)
Also taking a step back: the path notation is a natural way to express object graph navigation. The ORM layer has to map this into the relational model, following a predictable convention. This is done by using join semantics. The default join in SQL is the inner join. Thus, it maps to inner joins by default.
That said, my intuition for what
select p from Person p order by p.address.city
does seems to be completely different from yours - I find it totally natural that this statement imposes an inner join. As a user, I would get a strange feeling to have this path expression resolving to an outer join, while
select p from Person p where p.address.city='Bielefeld'
resolves to an inner join. I would expect a path expression to have a consistent meaning, regardless at what position (selection, projection, order by) it is used. Now if I want different behavior, I have to code my JPQL correspondingly - by adding the outer join explicitly and navigating it in my order by statement.
So, regarding your issue in Spring Data - first off I have to say I feel your pain :)
But what happens here is a natural problem of how we are able to access JPA. We basically have two options, working declaratively (JPQL) or programmatically (Criteria API). We don't have a mix of those two paradigms - while I would highly appreciate it, I understand that correctly parsing and transforming a JPQL statement into a Criteria graph, for adding Criteria programmatically later, is at least complicated if not even impossible. Your very problem seems to be that what Spring Data JPA tries to accomplish is exactly this paradigm mix that the JPA EG most likely thoughtfully dropped. This is where it get's really complicated, and you now seem to be at exactly that point, for what I can see. I see also that JPA introducing outer joins as default semantics for navigational expressions would solve you a big technical problem for your use case - nevertheless, I'd still argue that what JPQL does currently is totally correct for what it is designed, namely a plain declarative way to access your ORM layer...
I think there are two points where your argumentation is flawed:
- The translation of path expression actually takes mapping metdata into account, the problem is: it's only some of it. Property references are replaced by their column references if configured. Why should they not take relation - and thus join - information into account? That's just inconsistent.
- The two examples use path expressions in two different contexts. The latter uses them to define a criteria and if I define a criteria on the
city
property it can of course only be satisfied byPerson
s actually referring toAddress
instances. The former query does nothing of that kind. It does not express any intended criteria at all. So completely disagree on "this has to resolve to the same join" and would definitely argue theorder by
clauses need to be side-effect-free. Would you honestly accept anORDER BY
clause of a SQL statement implicitly adding a criteria to theWHERE
clause?
My general statement is that a JPQL order by
clause must not have side effects on the criteria defined in the where
clause. If there is no where
clause it must not implicitly rule out results by adding restricting joins.
I think what I'm trying to say is that this is an issue related the "M" in ORM. M stands for consistent mappings of idioms only available in / suitable for the O tier to the R tier. The path navigation is clearly an O concept, which we have to map to R. For the sake of consistency, it always imposes a side effect, namely an inner join.
My second SQL example was a bit wrong. Better would have been
select p from Person p where p.address.city is null
Here you can see a real side effect of the path expression, without being anyhow explicit about joins. How does this side effect differ from the one imposed by order by, rethinking both examples with both possible join semantics? The consistency lies therein that a path navigation expression solely and always consistently affects the dynaset definition (FROM
part) of your query, and not the projection (SELECT
), selection (WHERE
) or ORDER BY
. So regardless where it is used, it should have the same meaning by having the same side effect on the dynaset. And no, it is not a side effect on the WHERE
clause, although it may look that way in the generated SQL. Semantically it only adds an INNER JOIN
to the FROM
clause, which Hibernate for whatever reason de-composes to a selection on a cartesian product. This is valid in relational algebra and the correct decomposition of the non-basic JOIN
operation - but it is totally idiotic to use it in a DMBS with extended relational algebra representation, meaning anything newer than Oracle 8i :)
To be clear about, I was not talking about SQL at all, so of course I would not feel good about side effects on my dynaset imposed by ORDER BY
in my SQL. Ironically SQL would not even allow me to generate side effects, because each and everything I want to use for ordering has to be explicitly represented by my FROM
clause, including the right and thoughtfully chosen join needed to order by a criterion in a related table.
PS: The statement corrected above, although not intended, would have been correct for a "German-speaking" DBMS, since in Germany 'Bielefeld' equals null :)
An addition: Say we have the following statement:
select p, privateAddress.city, companyAddress.city from Person p
left outer join p.addresses privateAddress
left outer join p.addresses companyAddress
where
privateAddress.type='PRIVATE' and companyAddress.type='COMPANY'
Regardless whether this looks like the best possible modeling - what exactly should
ORDER BY addresses.city
now do when added, following your recommendation to implicitly / automatically re-use widening joins when having an order by expression?
Nothing at all, it's invalid as addesses
is not known at all. p.addresses
would have been. Does the query actually make sense? Why would you want to outer join the tables (retaining null values) if you actually have a where
clause that will definitely return false for these values?
Again, as soon as you manually add joins you do so for a reason and then it's up to you to define all related clauses correctly. But if you don't do (select p from Person p
), adding an order by
clause simply must not restrict the result set more than actually expressed in the query.
OK, I seem to have way too much time (I haven't), but well - it''s an interesting problem.
So, here's another addition.
- my last comment over-emphasizes the concept of re-using an existing outer join. If applying the rule of "non-aliased order by path expressions implicitly add an outer join to the resulting SQL
FROM
clause, it would be clear what the query does and the actual selection in terms of rows would not be affected. - Navigation expressions are all about navigating an object graph, since JPA tries to hide the relational stuff and give us a "purely" object oriented view on our data, right? So when applying the rule from point 1, you would effectively say: this
ORDER BY
is not an object graph expression any more, because in an object graph we can only walk along transitions - and not non-transitions, which is what a left outer join would construct. So ifORDER BY a.b
would result in an outer join, the path expression applied to the object graph would read [a | ].b, which feels utterly wrong to me in an object model.
Regarding your comment to the addresses-example:
Of course it should have read ORDER BY p.addresses
. The example does make sense as a classic "map rows to columns" example, relying on an effective query planner and executer in the DBMS, causing least possible communication overhead between DB and JVM. More realistic would be: Generate result rows of form
firstname, lastname, home phone, business phone
given that phone numbers are attributes of an entity with 1:n relation to Persons and a discriminator column type
But again, the example not really good to make any case for my point in opposite to yours, as mentioned in my last comment - so forget about it :)
Moreover, the cross join (added FTOMH by eg. Hibernate 4 "for better JPA 2 compatibility") is a killer for most DMBS when having serious amount of data.