Section author: Adam Scott <adam.scott@noirlab.edu>

Version: 20230829

1.14.1. SQL gotchas

1.14.1.1. Respect your LIMITS

One table may have hundreds of billions of rows. And you are only interested in a few. This is where the LIMIT clause in SQL is so important to reduce your query time and load on resources.

It is recommended to always include a LIMIT clause as you explore datasets, and just in case.

For example, to get the g magnitudes of 100 objects in SMASH:

SELECT id, gmag FROM smash_dr2.object LIMIT 100;

This returns in less than a second under quiet conditions.

If we had not added the limit 100, the query would run for many minutes, maybe hours and return back over 1.1 billion rows.

If you are querying via TOPCAT or some other ADQL interface, the LIMIT clause is not supported. Use TOP as shown below.

SELECT TOP 100 id, gmag FROM smash_dr2.object;

1.14.1.1.1. Using Limits with Joins

In this example we have a LIMIT 10 (the last line of the query). We used a LIMIT so we are safe right?

SELECT L.ls_id, L.ra, L.dec, L.mag_g, L.mag_r, L.mag_z, L.mag_w1, L.mag_w2, L.g_r, L.r_z, L.z_w1, L.w1_w2, L.type, S.class
  FROM ls_dr10.tractor AS L
 INNER JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj AS X
            ON L.ls_id = X.id1
 INNER JOIN sdss_dr17.specobj AS S
            ON  X.id2  = CAST(S.specobjid AS BIGINT)
      WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1
           and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan' and
           L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf' and
           L.z_w1 != 'nan' and L.w1_w2 != 'nan')
LIMIT 10;

When the database performs this query, it has to satisfy both of the INNER JOIN ‘s before it can apply the LIMIT 10, which means the database has to scan millions of rows before returning 10. Here, ls_dr10.tractor has 3 billion rows, ls_dr10.x1p5__tractor__sdss_dr17__specobj has 4.8 million rows, and sdss_dr17.specobj has 5.1 million rows. (If they were around 1,000 rows each, the impact of not using LIMIT for most types of queries is negligible - for most types of queries encountered in the wild, anyway.)

So along with using the LIMIT command, keep in mind the number of rows in each table.

Incidentally (not immediately obvious) there is a missing join in this example. When building your query sometimes you might forget to add a join. This can happen while experimenting with queries. So this query runs for a very long time when we only wanted back 10 rows. Even with a missing join we can protect ourselves by using the LIMIT clause.

But how do we use the LIMIT clause to protect ourselves here and elsewhere? Inline views.

1.14.1.1.2. Inline Views

If only we had a way to tell the database to limit the rows back from each table before we join them.

Let’s explicitly tell the database we want only 10 rows from ls_dr10.tractor.

SELECT L.ls_id, L.ra, L.dec, L.mag_g, L.mag_r, L.mag_z, L.mag_w1, L.mag_w2,
       L.g_r, L.r_z, L.z_w1, L.w1_w2, L.type, S.class
  FROM
       (SELECT * FROM ls_dr10.tractor AS L
                 WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1
                        and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan'
                        and L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf'
                        and L.z_w1 != 'nan' and L.w1_w2 != 'nan')
        LIMIT 10) AS l2
 INNER JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj AS X
            ON l2.ls_id = X.id1
 INNER JOIN sdss_dr17.specobj AS S
            ON  X.id2  = CAST(S.specobjid AS BIGINT)
LIMIT 10;

Let’s break down what we did here. We created an inline view with this portion of the statement.

(SELECT * FROM ls_dr10.tractor AS L
 WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1
        and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan'
        and L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf'
        and L.z_w1 != 'nan' and L.w1_w2 != 'nan')
 LIMIT 10) AS l2

And we gave it an alias l2 (the AS l2). With this technique we treat the whole SELECT as another table and then perform a SELECT against it. Importantly, we put a LIMIT on there. Finally, we moved the WHERE clause up into the inline view since the conditions only apply to L (the alias we gave to ls_dr10.tractor).


HINT: There’s a lot of text to read with queries, but keep the “skeleton” in mind as you read or scan them. The mental “skeleton” or pattern with this query might be like the following.

SELECT columns
  FROM
      (select columns from table where constraints limit 10) AS x
  INNER JOIN something
          ON something
  INNER JOIN something
          ON something
  LIMIT 10;

In summary,

  • Use the LIMIT clause to explore your data to prevent long queries and reduce load on computer resources; you won’t have to wait as long to get back results.

  • Take advantage of inline views to apply the LIMIT clause to tables you join against.

1.14.1.2. Make your JOINs work for you not against you

Here is an example of a JOIN that will work near optimal performance.

SELECT   S.ra, S.dec, S.z, X.ra2, X.dec2, S.class,
         S.vdisp, S.vdisp_err
    FROM sdss_dr17.specobj as S
    JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X
      ON S.specobjid = X.id2
   WHERE S.ra BETWEEN 126 and 131
         and S.dec BETWEEN 7.0 and 12.0
         and (S.ra_ivar > 0)
ORDER BY S.ra_ivar
   LIMIT 1000;

In the JOIN, the column S.specobjid is indexed and so is X.id2. Also, this is a 1-to-1 join, meaning there is exactly 1 row in sdss_dr17.specobj for each specobjid, and there is exactly 1 row in ls_dr10.x1p5__tractor__sdss_dr17__specobj for each id2.

On the other hand, if specobjid were not indexed in either table, the database executor for this query would have to read every row in both tables, about 5 million each.

This is 5000 times more rows than necessary.

(Note: we still have the LIMIT 1000 in there just in case as discussed above.)

If this were a many-to-many join, the number of rows to scan can be the square of the number of rows in worst cases.

In summary for best performance,

  • Join only on indexed columns.

  • Make sure your joins are not many-to-many but one-to-one.

1.14.1.3. ORDER BY Trick

Again looking at the query we talked about above in the discussion of JOINs (reshown below), the LIMIT 1000 may look like the database needs only to scan 1000 rows. It is more accurate to say that LIMIT means return only the specified number of rows from the results. The results may require scanning many more rows, we do not always know.

In our example, the ORDER BY forces the database to generate the entire results regardless of the LIMIT.

SELECT   S.ra, S.dec, S.z, X.ra2, X.dec2, S.class,
         S.vdisp, S.vdisp_err
    FROM sdss_dr17.specobj as S
    JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X
      ON S.specobjid = X.id2
   WHERE S.ra BETWEEN 126 and 131
         and S.dec BETWEEN 7.0 and 12.0
         and (S.ra_ivar > 0)
ORDER BY S.ra_ivar
   LIMIT 1000;

The Trick We re-write the query as shown below.

SELECT * FROM

   (SELECT S.ra, S.dec, S.z, X.ra2, X.dec2, S.class,
           S.vdisp, S.vdisp_err, /* added: */ S.ra_ivar
    FROM sdss_dr17.specobj as S
    JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X
    ON S.specobjid = X.id2
    WHERE S.ra BETWEEN 126 and 131
          and S.dec BETWEEN 7.0 and 12.0
          and (S.ra_ivar > 0)
   LIMIT 1000
   ) as x

ORDER BY x.ra_ivar;

We force the database to generate the results of the first 1000 rows before it performs the ORDER BY utilizing an inline view. If we had simply put the LIMIT 1000 before the ORDER BY without the inline view it would have caused a syntax error. Note we had to add the S.ra_ivar column indicated by the SQL comment /* added: */ so the outer SELECT could see it. (You don’t have to add the SQL comment, it’s there for our demonstration).

This runs in less than half the time as the original in our example.

1.14.1.4. Learning more about SQL

Khan Academy Intro to SQL: Querying and managing.

Online SQL Tutorial; learn the mechanics interactively.

Try some other online exercises geared to PostgreSQL pgexercises.

PostgreSQL SQL Lanuage from the online manual The SQL Language.