Section author: Adam Scott <adam.scott@noirlab.edu>
Version: 20230829
1.14.1. SQL gotchas¶
1.14.1.1. Respect your LIMITS¶
One table may have hundreds of billions of rows. And you are only interested in a few. This is where the LIMIT
clause in SQL is so important to reduce your query time and load on resources.
It is recommended to always include a LIMIT
clause as you explore datasets, and just in case.
For example, to get the g magnitudes of 100 objects in SMASH:
SELECT id, gmag FROM smash_dr2.object LIMIT 100;
This returns in less than a second under quiet conditions.
If we had not added the limit 100
, the query would run for many minutes, maybe hours and return back over 1.1 billion rows.
If you are querying via TOPCAT or some other ADQL interface, the LIMIT
clause is not supported. Use TOP
as shown below.
SELECT TOP 100 id, gmag FROM smash_dr2.object;
1.14.1.1.1. Using Limits with Joins¶
In this example we have a LIMIT 10
(the last line of the query). We used a LIMIT
so we are safe right?
SELECT L.ls_id, L.ra, L.dec, L.mag_g, L.mag_r, L.mag_z, L.mag_w1, L.mag_w2, L.g_r, L.r_z, L.z_w1, L.w1_w2, L.type, S.class
FROM ls_dr10.tractor AS L
INNER JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj AS X
ON L.ls_id = X.id1
INNER JOIN sdss_dr17.specobj AS S
ON X.id2 = CAST(S.specobjid AS BIGINT)
WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1
and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan' and
L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf' and
L.z_w1 != 'nan' and L.w1_w2 != 'nan')
LIMIT 10;
When the database performs this query, it has to satisfy both of the INNER JOIN
‘s before it can apply
the LIMIT 10
, which means the database has to scan millions of rows before returning 10. Here,
ls_dr10.tractor
has 3 billion rows, ls_dr10.x1p5__tractor__sdss_dr17__specobj
has 4.8 million rows,
and sdss_dr17.specobj
has 5.1 million rows. (If they were around 1,000 rows each, the impact of not using
LIMIT
for most types of queries is negligible - for most types of queries encountered in the wild, anyway.)
So along with using the LIMIT
command, keep in mind the number of rows in each table.
Incidentally (not immediately obvious) there is a missing join in this example. When building your query
sometimes you might forget to add a join. This can happen while experimenting with queries. So this query runs
for a very long time when we only wanted back 10 rows. Even with a missing join we can protect ourselves by
using the LIMIT
clause.
But how do we use the LIMIT
clause to protect ourselves here and elsewhere? Inline views.
1.14.1.1.2. Inline Views¶
If only we had a way to tell the database to limit the rows back from each table before we join them.
Let’s explicitly tell the database we want only 10 rows from ls_dr10.tractor
.
SELECT L.ls_id, L.ra, L.dec, L.mag_g, L.mag_r, L.mag_z, L.mag_w1, L.mag_w2,
L.g_r, L.r_z, L.z_w1, L.w1_w2, L.type, S.class
FROM
(SELECT * FROM ls_dr10.tractor AS L
WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1
and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan'
and L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf'
and L.z_w1 != 'nan' and L.w1_w2 != 'nan')
LIMIT 10) AS l2
INNER JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj AS X
ON l2.ls_id = X.id1
INNER JOIN sdss_dr17.specobj AS S
ON X.id2 = CAST(S.specobjid AS BIGINT)
LIMIT 10;
Let’s break down what we did here. We created an inline view with this portion of the statement.
(SELECT * FROM ls_dr10.tractor AS L
WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1
and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan'
and L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf'
and L.z_w1 != 'nan' and L.w1_w2 != 'nan')
LIMIT 10) AS l2
And we gave it an alias l2 (the AS l2). With this technique we treat the whole
SELECT
as another table and then perform a SELECT
against it. Importantly,
we put a LIMIT
on there. Finally, we moved the WHERE
clause up into the
inline view since the conditions only apply to L (the alias we gave to
ls_dr10.tractor
).
HINT: There’s a lot of text to read with queries, but keep the “skeleton” in mind as you read or scan them. The mental “skeleton” or pattern with this query might be like the following.
SELECT columns
FROM
(select columns from table where constraints limit 10) AS x
INNER JOIN something
ON something
INNER JOIN something
ON something
LIMIT 10;
In summary,
Use the
LIMIT
clause to explore your data to prevent long queries and reduce load on computer resources; you won’t have to wait as long to get back results.Take advantage of inline views to apply the
LIMIT
clause to tables you join against.
1.14.1.2. Make your JOINs work for you not against you¶
Here is an example of a JOIN that will work near optimal performance.
SELECT S.ra, S.dec, S.z, X.ra2, X.dec2, S.class,
S.vdisp, S.vdisp_err
FROM sdss_dr17.specobj as S
JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X
ON S.specobjid = X.id2
WHERE S.ra BETWEEN 126 and 131
and S.dec BETWEEN 7.0 and 12.0
and (S.ra_ivar > 0)
ORDER BY S.ra_ivar
LIMIT 1000;
In the JOIN, the column S.specobjid
is indexed and so is X.id2
.
Also, this is a 1-to-1 join, meaning there is exactly 1 row in sdss_dr17.specobj
for each specobjid
, and there is exactly 1 row in
ls_dr10.x1p5__tractor__sdss_dr17__specobj
for each id2
.
On the other hand, if specobjid
were not indexed in either table, the database
executor for this query would have to read every row in both tables, about 5 million each.
This is 5000 times more rows than necessary.
(Note: we still have the LIMIT 1000
in there just in case as discussed above.)
If this were a many-to-many join, the number of rows to scan can be the square of the number of rows in worst cases.
In summary for best performance,
Join only on indexed columns.
Make sure your joins are not many-to-many but one-to-one.
1.14.1.3. ORDER BY Trick¶
Again looking at the query we talked about above in the discussion of JOINs
(reshown below), the LIMIT 1000
may look like the database needs only to scan
1000 rows. It is more accurate to say that LIMIT
means return only the
specified number of rows from the results. The results may require scanning
many more rows, we do not always know.
In our example, the ORDER BY
forces the database to generate the entire
results regardless of the LIMIT
.
SELECT S.ra, S.dec, S.z, X.ra2, X.dec2, S.class,
S.vdisp, S.vdisp_err
FROM sdss_dr17.specobj as S
JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X
ON S.specobjid = X.id2
WHERE S.ra BETWEEN 126 and 131
and S.dec BETWEEN 7.0 and 12.0
and (S.ra_ivar > 0)
ORDER BY S.ra_ivar
LIMIT 1000;
The Trick We re-write the query as shown below.
SELECT * FROM
(SELECT S.ra, S.dec, S.z, X.ra2, X.dec2, S.class,
S.vdisp, S.vdisp_err, /* added: */ S.ra_ivar
FROM sdss_dr17.specobj as S
JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X
ON S.specobjid = X.id2
WHERE S.ra BETWEEN 126 and 131
and S.dec BETWEEN 7.0 and 12.0
and (S.ra_ivar > 0)
LIMIT 1000
) as x
ORDER BY x.ra_ivar;
We force the database to generate the results of the first 1000 rows before it
performs the ORDER BY
utilizing an inline view.
If we had simply put the LIMIT 1000
before the ORDER BY
without the inline
view it would have caused a syntax error. Note we had to add the S.ra_ivar
column
indicated by the SQL comment /* added: */
so the outer SELECT
could see it.
(You don’t have to add the SQL comment, it’s there for our demonstration).
This runs in less than half the time as the original in our example.
1.14.1.4. Learning more about SQL¶
Khan Academy Intro to SQL: Querying and managing.
Online SQL Tutorial; learn the mechanics interactively.
Try some other online exercises geared to PostgreSQL pgexercises.
PostgreSQL SQL Lanuage from the online manual The SQL Language.