The next release of Presto (version 312) will include a new optimization to remove unnecessary casts which might have been added implicitly by the query planner or explicitly by users when they wrote the query.
This is a long post explaining how the optimization works. If you’re only interested in the results, skip to the last section. For the full details, read on!
Like many programming languages, SQL allows certain operations between values of different types if there are implicit conversions (a.k.a. implicit casts, or coercions) between those types. This improves usability, as it allows writing expressions like 1.5 > 2 without worrying too much about whether the types are compatible (1.5 is of type decimal(2,1), while 2 is an integer).
During query analysis and planning, Presto introduces explicit casts for any implicit conversion in the original query as it translates it into the intermediate query plan representation the engine uses internally for optimization and execution. This eliminates a layer of complexity for the optimizer, which, as a result, doesn’t need to reason about types (type inference) or worry about whether expressions are properly typed.
More importantly, it simplifies the job of defining and implementing operators (e.g., =, <, +). Without implicit conversions, there would need to exist a variant of every operator for every combination of compatible types. For example, it would be necessary to have an implementation of the = operator for (tinyint, smallint), (tinyint, integer), (smallint, integer), and so on.
Given two columns, s :: tinyint and t :: smallint, and an expression such as s = t, the planner determines that tinyint can be implicitly coerced to smallint and derives the following expression:

CAST(s AS smallint) = t
This is not without challenges. The predicate pushdown logic relies on simple equality and range comparisons to move predicates around, and importantly, to infer that certain predicates in one branch of a join can be used to constrain the values on the other side of the join. An expression like the one above is not “simple” from this perspective due to the type conversion involved, and it can defeat the (arguably simplistic) predicate inference algorithm.
When t is a constant (or an expression that is effectively constant), the engine has to convert every value of s it sees during query execution in order to compare it with t. This brings up the obvious question: "can't it somehow convert t to tinyint and compare directly?"
It would look like:
s = CAST(t AS tinyint)
Since t is a constant, the term CAST(t AS tinyint) can be trivially pre-computed and reused for the entire query. It's not that simple in the general case, though. Narrowing casts, such as a conversion from smallint to tinyint, or from double to integer, can fail or alter the value due to rounding or truncation, so we must take special care to avoid errors or a change in query semantics. We discuss this at length in the sections below.
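As a rough illustration (a Python sketch with hypothetical helper names, not Presto code), a narrowing conversion can either fail outright or silently change the value:

```python
# Sketch (hypothetical helper, not Presto code): narrowing conversions
# can fail for out-of-range values or alter in-range ones by rounding.
TINYINT_MIN, TINYINT_MAX = -128, 127

def cast_to_tinyint(value: int) -> int:
    """A checked narrowing cast that fails when the value doesn't fit."""
    if not (TINYINT_MIN <= value <= TINYINT_MAX):
        raise ValueError(f"out of range for tinyint: {value}")
    return value

cast_to_tinyint(100)          # fine: representable as tinyint
try:
    cast_to_tinyint(1000)     # fails: does not fit in a tinyint
except ValueError as exc:
    print(exc)

# A double-to-integer conversion alters the value instead:
print(round(1.9))             # 2 -- the fractional part is lost
```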
Some properties of (well-behaved) implicit casts
Let’s take a short detour and talk briefly about some properties of well-behaved implicit casts we can exploit to do the transformation we described in the previous section.
Since the query engine is free to insert implicit casts wherever it sees fit, these functions need to follow some ground rules. Failure to do so can result in queries producing incorrect results due to changes in query semantics.
Implicit casts need to have the following properties:
- Injective. Given a cast from a type S to a type T, every value in S must map to a distinct value in T (this does not imply that every value in T has to map to a value in S).
- Order-preserving. Given a and b in S with a < b, then CAST(a AS T) < CAST(b AS T).
For exact numeric types (e.g., tinyint, smallint, integer, bigint, decimal, etc.), this holds as long as T has enough integer digits to hold the integral part of S and enough fractional digits to hold the fractional part of S.
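Both properties are easy to check mechanically. A small Python sketch (with hypothetical names) that verifies them for the tinyint → smallint widening, which is the identity on the overlapping range:

```python
# Sketch: check injectivity and order preservation for the widening
# tinyint -> smallint conversion, which preserves values exactly.
def widen_tinyint_to_smallint(v: int) -> int:
    return v   # widening between integer types is lossless

images = [widen_tinyint_to_smallint(v) for v in range(-128, 128)]

# Injective: distinct inputs map to distinct outputs.
assert len(set(images)) == len(images)

# Order-preserving: a < b implies cast(a) < cast(b).
assert all(x < y for x, y in zip(images, images[1:]))
```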
As an example, the picture below depicts how every value of type tinyint, which has a range of [-128, 127], maps to a distinct value of a wider type such as smallint. Also, every value of the wider type that is within the range of representable values of tinyint has a distinct mapping back to a tinyint. So, for the values within the tinyint range, the conversion is bijective. This is not necessary for the transformation to work, but it simplifies one of the cases we'll consider. We'll cover this in more detail later.
On the other hand, some conversions such as those between integer types and decimal types with fractional parts are injective but not bijective, even when excluding the values outside the range of the narrower type.
The properties clearly hold for tinyint → smallint → integer → bigint. They also hold for widening decimal conversions: decimal(20, 1) → …
It even works for conversions between exact and approximate numbers, such as integer → double. It does not work for bigint → double or integer → real, though, because not all bigints fit in a double (64 bits vs 53-bit mantissa) and not all integers fit in a real (32 bits vs 23-bit mantissa). Sadly, for legacy reasons Presto allows those conversions implicitly. We "justify" it with the argument that "since they are dealing with approximate numerics anyway, and given that the conversions only lose precision in the least significant part, they are sort of ok". This is something we'll revisit in the future once we have a reasonable story around dealing with the inherent break in backward compatibility that removing such conversions would entail.
Finally, the properties also apply to conversions between varchar types of increasing length: varchar(2) → …
Getting to the point…
With this in mind, let’s look at the simplest scenario: conversions between integer types.
As in the example we covered in the introduction, the transformation is straightforward
when the constant can be represented in the narrower type. Given
s :: tinyint:
CAST(s AS smallint) = smallint '1'    ⟺ s = tinyint '1'
CAST(s AS smallint) = smallint '127'  ⟺ s = tinyint '127'
CAST(s AS smallint) = smallint '-128' ⟺ s = tinyint '-128'
CAST(s AS smallint) > smallint '10'   ⟺ s > tinyint '10'
CAST(s AS smallint) < smallint '10'   ⟺ s < tinyint '10'
Of course, when the value is at the edge of the range of the narrower type, we can cleverly turn some inequalities into equalities:
CAST(s AS smallint) >= smallint '127'  ⟺ s >= tinyint '127'  ⟺ s = tinyint '127'
CAST(s AS smallint) <= smallint '-128' ⟺ s <= tinyint '-128' ⟺ s = tinyint '-128'
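The edge-of-range rewrites can be sketched as a tiny rule in Python (hypothetical helper names; null handling is ignored here for simplicity):

```python
# Sketch: rewrite >=/<= against the edge of the narrow type's range
# into an equality, since only one value can satisfy the comparison.
TINYINT_MIN, TINYINT_MAX = -128, 127

def rewrite_edge_comparison(op: str, constant: int):
    """Return an equivalent (op, constant) pair for an s-vs-constant test."""
    if op == ">=" and constant == TINYINT_MAX:
        return ("=", constant)   # only the maximum value satisfies it
    if op == "<=" and constant == TINYINT_MIN:
        return ("=", constant)   # only the minimum value satisfies it
    return (op, constant)        # otherwise, leave the comparison alone

print(rewrite_edge_comparison(">=", 127))   # ('=', 127)
print(rewrite_edge_comparison("<", 10))     # ('<', 10)
```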
Additionally, we may be able to tell that an expression is always true or always false. Some care needs to be taken when the value of s is null, though, since in SQL any comparison with null evaluates to null:

CAST(s AS smallint) > smallint '127'   ⟺ s > tinyint '127'   ⟺ if(s is null, null, false)
CAST(s AS smallint) <= smallint '127'  ⟺ s <= tinyint '127'  ⟺ if(s is null, null, true)
CAST(s AS smallint) < smallint '-128'  ⟺ s < tinyint '-128'  ⟺ if(s is null, null, false)
CAST(s AS smallint) >= smallint '-128' ⟺ s >= tinyint '-128' ⟺ if(s is null, null, true)
We can make similar inferences when the value is outside the range of possible values for tinyint. For equality comparisons, it's trivial:
CAST(s AS smallint) = smallint '1000' ⟺ if(s is null, null, false)
CAST(s AS smallint) <> smallint '1000' ⟺ if(s is null, null, true)
Just like the earlier cases involving comparisons with values at the edge of the range, we can apply the same idea when the value falls outside of the range:
CAST(s AS smallint) < smallint '1000'  ⟺ if(s is null, null, true)
CAST(s AS smallint) < smallint '-1000' ⟺ if(s is null, null, false)
CAST(s AS smallint) > smallint '1000'  ⟺ if(s is null, null, false)
CAST(s AS smallint) > smallint '-1000' ⟺ if(s is null, null, true)
Values that are outside the range of the narrower type may not be the only ones without a mapping. For example, for a type such as decimal(2,1), any value with a non-zero fractional part (e.g., 1.5) cannot be represented as a tinyint.
We can tell whether a value t of type T is representable in S by converting it to S and back to T. Let's call this round-tripped value t'. If t <> t', then t is not representable in S, and similar rules as for out-of-range values apply when the expression involves an equality. For instance, given s :: tinyint:
CAST(s AS double) = double '1.1'  ⟺ if(s is null, null, false)
CAST(s AS double) <> double '1.1' ⟺ if(s is null, null, true)
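The round-trip test is straightforward to express. A Python sketch (hypothetical names, with int and float standing in for the narrow and wide types):

```python
# Sketch: t is representable in the narrower type exactly when
# casting down and back yields the same value (t = t').
def is_representable_as_int(t: float) -> bool:
    t_prime = float(int(t))   # T -> S -> T round trip (int truncates)
    return t == t_prime

print(is_representable_as_int(1.0))   # True
print(is_representable_as_int(1.1))   # False
```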
When some values in T are not representable in S, the cast from T to S will generally either truncate
or round. The SQL specification doesn’t mandate which of those alternatives an implementation should follow,
and even allows that to vary for conversions between various combinations of types.
This throws a bit of a wrench in our plans, so to speak. If we can’t tell whether a cast will round or truncate,
how would we know whether a
> comparison should turn into a
>= in the resulting expression? To
illustrate, let’s consider this example. Given
s :: tinyint:
CAST(s AS double) > double '1.9'
If the conversion from double to tinyint truncates, the expression above is equivalent to:
s > tinyint '1'
On the other hand, if the conversion rounds, 1.9 becomes 2, and the expression is equivalent to:
s >= tinyint '2'
In order to know which operator to use in the transformed expression (e.g., > vs. >=), it is therefore crucial to distinguish between those two behaviors. The good news is that there's a simple and elegant way out of this hole.
An important observation is that we don’t need to know how the conversion behaves in general, but only how
it behaves when applied to the constant
t. Regardless of whether the conversion truncates or rounds, for a
given value of
t, the outcome can be seen to round up or round down, as depicted below.
We can easily tell which of those scenarios applies by comparing t with t' (the value obtained by converting t to S and back to T). If t > t', the operation rounded down. Conversely, if t < t', it rounded up. If t = t', the value is representable in S, and the rules from the previous section apply.
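Concretely, comparing t to its round-tripped value t' is all we need. A Python sketch (hypothetical names, simulating a rounding conversion with round()):

```python
# Sketch: classify how the narrowing conversion treated the constant t
# by comparing it to the round-tripped value t'.
def round_direction(t: float) -> str:
    t_prime = float(round(t))   # simulate a rounding T -> S cast
    if t > t_prime:
        return "rounded down"
    if t < t_prime:
        return "rounded up"
    return "representable"      # t = t': previous section's rules apply

print(round_direction(1.9))   # rounded up   (round(1.9) = 2)
print(round_direction(1.2))   # rounded down (round(1.2) = 1)
print(round_direction(2.0))   # representable
```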
Oh, the nullability
Let's take another quick detour and talk about the issue of nullability. After all, no discussion about SQL is complete without an exploration of the semantics of null.

SQL uses three-valued logic. In addition to true and false, logical expressions can evaluate to an unknown value, which is indicated by null. NOT, AND and OR behave according to the usual rules of three-valued logic (e.g., null AND false is false, null AND true is null, null OR true is true, null OR false is null, and NOT null is null). The logical comparison operators =, <>, >, ≥, <, ≤ evaluate to null when one or both operands are null. Therefore, if we know that t is null, our expression cast(s as smallint) = t can be simply replaced with a constant null.
As we mentioned in the previous section, there are cases where cast(s as smallint) = t can be reduced to true or false, except for the fact that if s is null, the expression needs to return null to preserve semantics. So, we use the following forms to capture this:

if(s IS null, null, false)
if(s IS null, null, true)
The catch with that is that the optimizer does not understand the semantics of these
if expressions and cannot
use them for deriving additional properties. In essence, it becomes an optimization barrier. On the other hand,
the optimizer is pretty good at manipulating logical conjunctions (
AND) and disjunctions (
OR). So, let’s see
how we can use boolean logic to obtain an equivalent formulation.
We can exploit the properties of SQL boolean logic to derive expressions that behave in the same manner as the if() constructs from above:

if(s IS null, null, false) ⟺ s IS null AND null
if(s IS null, null, true)  ⟺ s IS NOT null OR null

Let's break it down to see why that works. When s is null, the first form evaluates to true AND null, which is null; when s is not null, it evaluates to false AND null, which is false. Similarly, the second form evaluates to false OR null (i.e., null) when s is null, and to true OR null (i.e., true) otherwise.
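We can spot-check such equivalences with a small model of three-valued logic in Python, using None for null (a sketch, not Presto's evaluator; the AND/OR forms checked here, s IS null AND null and s IS NOT null OR null, are one possible pair of equivalents):

```python
# Sketch: model SQL three-valued AND/OR with None standing in for null,
# and check that the boolean forms match the if() constructs.
def sql_and(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

for s_is_null in (True, False):
    # if(s IS null, null, false)  vs  (s IS null) AND null
    assert sql_and(s_is_null, None) == (None if s_is_null else False)
    # if(s IS null, null, true)   vs  (s IS NOT null) OR null
    assert sql_or(not s_is_null, None) == (True if not s_is_null else None)
```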
Putting it all together
Now that we’ve had a taste of how this optimization works, let’s put it all together into one rule to rule them all.
Given an expression of the following form, where s is a column of type S and t is a constant of type T:

cast(s AS T) ⊕ t, with ⊕ one of =, <>, <, <=, >, >=

we derive a transformation based on the rules below:

1. If t falls outside the range [min(S), max(S)], the comparison is constant for every non-null value of s: = reduces to false, <> to true, and the range comparisons (<, <=, >, >=) to true or false depending on which side of the range t falls on, wrapped in the null-preserving forms from the previous section.
2. If t is within the range, we calculate t' by converting t to S and back to T, and consider the following cases:
   2.1. If t = t' (t is representable in S), the expression is equivalent to s ⊕ CAST(t AS S), with two special cases when a range is defined for S:
        2.1.1. if the comparison holds for every value of S (e.g., >= min(S) or <= max(S)), it reduces to if(s is null, null, true), and symmetrically to if(s is null, null, false) for comparisons that hold for none;
        2.1.2. if the comparison holds for exactly one value of S (e.g., >= max(S) or <= min(S)), it reduces to s = CAST(t AS S).
   2.2. If t <> t', then = reduces to if(s is null, null, false), <> reduces to if(s is null, null, true), and the range comparisons become s ⊕' CAST(t AS S), where ⊕' is adjusted according to whether the conversion rounded up or down.
3. If converting t to S is undefined or fails, or none of the cases above apply, the transformation is not applicable.
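The overall decision procedure can be sketched in Python (hypothetical names; a simplification that models S as ints in the tinyint range and T as floats, omits null handling, and skips the edge-of-range special cases):

```python
# Sketch (illustrative, not Presto code): simplify cast(s AS T) <op> t
# into a comparison over the narrow type S, or a constant result.
S_MIN, S_MAX = -128, 127    # assumed range of the narrow type

def simplify(op: str, t: float):
    """Return an equivalent (op, constant) over S, or a constant truth value."""
    if t > S_MAX:   # out of range above: result is constant
        return {"=": False, "<>": True,
                "<": True, "<=": True, ">": False, ">=": False}[op]
    if t < S_MIN:   # out of range below: result is constant
        return {"=": False, "<>": True,
                "<": False, "<=": False, ">": True, ">=": True}[op]
    t_prime = float(int(t))      # round-trip T -> S -> T (truncates)
    c = int(t)
    if t == t_prime:             # t is representable in S
        return (op, c)
    if op == "=":                # not representable: equality is constant
        return False
    if op == "<>":
        return True
    if t > t_prime:              # the conversion rounded down
        return (">", c) if op in (">", ">=") else ("<=", c)
    return (">=", c) if op in (">", ">=") else ("<", c)  # rounded up

print(simplify("=", 1000.0))   # False
print(simplify(">", 1.9))      # ('>', 1)
print(simplify("<", 1.5))      # ('<=', 1)
```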
As if all of this weren't enough, there's an additional complication we need to handle for types such as real and double. Those types are what the SQL specification calls approximate numeric types. Presto implements them as IEEE-754 single and double precision floating point numbers, respectively.

In addition to finite numbers, IEEE-754 defines an additional set of values: -∞ (negative infinity), +∞ (positive infinity) and NaN (not a number).
It is worth noting that -∞ and +∞ do not behave like ∞ in the mathematical sense. They are actual values in the ordered set of numbers, but they don't represent any finite number. Therefore, the following relations hold:

-∞ < -1.23E30 < 0 < 3.45E25 < +∞
-∞ = -∞
+∞ = +∞
Since -∞ and +∞ can be treated as regular values, we can use them as the minimum and maximum values of the range for these types. Any other choice would not work, since all values of a type must be contained within the range of the type for the transformation to be valid. That is, min(S) ≤ s ≤ max(S) must hold for every value s of type S.
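Python's floats follow the same IEEE-754 rules, which makes it easy to see why the infinities work as range endpoints (a quick sketch):

```python
# Sketch: IEEE-754 infinities are ordinary, ordered values that bound
# every double, so they can serve as min(S) and max(S) of the range.
neg_inf, pos_inf = float("-inf"), float("inf")

assert neg_inf < -1.23e30 < 0 < 3.45e25 < pos_inf
assert neg_inf == neg_inf
assert pos_inf == pos_inf

# Every double value (NaN aside) satisfies min <= v <= max:
for v in (-1.7e308, 0.0, 1.7e308, neg_inf, pos_inf):
    assert neg_inf <= v <= pos_inf
```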
Let's look at an example to understand why this is necessary. Instead of using [-∞, +∞] as the range, let's say we picked the minimum and maximum finite representable values for the real type (-3.4028235E38 and 3.4028235E38), and consider this expression (s :: real):
cast(s AS double) >= double '3.4028235E38'
From the rules in the previous section, t' = 3.4028235E38, and t = t'. Since t' = max(real) and the operator is >=, from rule 2.1.2, the expression reduces to:
s = 3.4028235E38
This is clearly incorrect. When s = Infinity, cast(s AS double) results in double 'Infinity', which is not equal to double '3.4028235E38': the original expression evaluates to true, but the transformed one evaluates to false.
On the other hand, NaN doesn't obey any of the comparison rules. It's neither equal to nor distinct from itself, and it's neither larger nor smaller than any other value:

NaN = NaN ⟺ false
NaN <> NaN ⟺ false
NaN > 0 ⟺ false
NaN = 0 ⟺ false
NaN < 0 ⟺ false
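The ordered comparisons behave the same way with Python floats (a quick sketch):

```python
# Sketch: every ordered comparison against IEEE-754 NaN is false,
# including against itself.
nan = float("nan")

assert (nan == nan) is False
assert (nan > 0) is False
assert (nan < 0) is False
assert (nan == 0) is False
assert (nan >= float("-inf")) is False
```

Note that Python's != returns true for NaN per IEEE-754 "unordered" semantics, unlike the <> behavior shown in the table above; the point that matters for the transformation is that all ordered comparisons against NaN are false.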
Because of this, NaN is not part of the ordered set of values for these types, and the requirement that every value be contained in the range doesn't hold. From rule 2.1.1, an expression such as:

cast(s AS double) >= double '-Infinity'

would reduce to if(s is null, null, true), which is incorrect, since the expression returns false when s is NaN.
Is all hope lost for real and double? Fortunately, not. The range is only needed as an optimization. If we forgo defining a range for types that don't have the required properties, the special cases 2.1.1 and 2.1.2 don't apply, and by rule 2.1, the expression is equivalent to:

s >= real '-Infinity'

which correctly returns false when s is NaN.
So, does all of this even matter? Why, yes! Glad you asked.
As with any performance optimization, you can improve things by working smarter (can you avoid work that can be proven to be unnecessary?) or by working harder (can you do the work you have to do more efficiently?). This optimization does a little of both. Let's consider three scenarios where it has a positive effect.
Dead code elimination

Since in some cases it can prove that the comparisons will always produce true or false, regardless of the input, it can short-circuit entire conditions or subplans before even a single row of data is read. Some query generation tools are not sophisticated enough and may emit queries that contain that kind of construct. Also, everyone makes mistakes, and it's not hard to end up with queries that contain what's effectively dead code. The last thing you want is to sit in front of the screen waiting for a query to complete … waiting … waiting … just for Presto to tell you that there are no results.

For example, given:
CREATE TABLE t(x smallint);
-- <insert lots of rows into t>
SELECT * FROM t WHERE x IS NOT NULL AND x > 1000000
produces the following query plan (Values is an empty inline table):

- Output[x]
  - Values
Improved JOIN performance
What’s nice about this optimization is that it enables other optimizations to work better. We mentioned earlier that comparisons that are not simple expressions between columns, or between columns and constants, make it harder for the predicate pushdown optimization to infer predicates that can be propagated to the other branch of a join.
Given two tables:
CREATE TABLE t1 (v smallint);
CREATE TABLE t2 (v bigint);
And the following query:
SELECT * FROM t1 JOIN t2 ON t1.v = t2.v WHERE t1.v = BIGINT '1';
The query plan without this optimization is:
- Output[name]
  - InnerJoin[expr = v]
    - ScanFilterProject[t1, filter = CAST(v AS bigint) = BIGINT '1']
        expr := CAST(v AS bigint)
    - TableScan[t2]
The optimization allows the predicate pushdown logic to apply the condition to the other side of the join, producing
a much better plan. If data in
t2 is somehow organized by
v (e.g., a partition key in Hive), or if the
connector understands how to apply the filter at the source, the query won’t need to even read certain parts of the
table. The query plan with the optimization enabled:
- Output[name]
  - CrossJoin
    - ScanFilterProject[t1, filter = (v = SMALLINT '1')]
    - ScanFilterProject[t2, filter = (v = BIGINT '1')]
Best bang for the buck
Finally, if the condition absolutely needs to be evaluated, the transformed expression could be significantly
more efficient, especially when the cast between the two types is expensive. To illustrate, given a table
with 1 billion rows and a column
k :: bigint:
SELECT count_if(k > CAST(0 AS decimal(19))) FROM t
Without the optimization:
- [...]
- ScanProject ===> CPU: 3.75m (66.34%), Scheduled: 5.56m (145.22%)
      expr := (CAST("k" AS decimal(19,0)) > CAST(DECIMAL '0' AS decimal(19,0)))

Query 20190515_072240_00006_rgzb4, FINISHED, 4 nodes
Splits: 110 total, 110 done (100.00%)
0:22 [1000M rows, 8.4GB] [46M rows/s, 395MB/s]
With the optimization:
- [...]
- ScanProject ===> CPU: 29.93s (58.17%), Scheduled: 47.44s (145.07%)
      expr := ("k" > BIGINT '0')

Query 20190515_071912_00005_bz6cb, FINISHED, 4 nodes
Splits: 110 total, 110 done (100.00%)
0:03 [1000M rows, 8.4GB] [335M rows/s, 2.81GB/s]
Thirsty for more? Here’s the code. Happy querying!
Many thanks to kasiafi for their thoughtful and thorough feedback on early drafts of this post.