.. sectionauthor:: Adam Scott *Version: 20230829* .. index:: join, LIMIT, ORDER BY, SQL .. _sec_SQLGotchas: *********** SQL gotchas *********** Respect your LIMITS =================== One table may have hundreds of billions of rows. And you are only interested in a few. This is where the ``LIMIT`` clause in SQL is so important to reduce your query time and load on resources. It is recommended to always include a ``LIMIT`` clause as you explore datasets, and just in case. For example, to get the g magnitudes of 100 objects in SMASH: :: SELECT id, gmag FROM smash_dr2.object LIMIT 100; This returns in less than a second under quiet conditions. If we had not added the ``limit 100``, the query would run for many minutes, maybe hours and return back over 1.1 billion rows. If you are querying via TOPCAT or some other ADQL interface, the ``LIMIT`` clause is not supported. Use ``TOP`` as shown below. :: SELECT TOP 100 id, gmag FROM smash_dr2.object; Using Limits with Joins ----------------------- In this example we have a ``LIMIT 10`` (the last line of the query). We used a ``LIMIT`` so we are safe right? :: SELECT L.ls_id, L.ra, L.dec, L.mag_g, L.mag_r, L.mag_z, L.mag_w1, L.mag_w2, L.g_r, L.r_z, L.z_w1, L.w1_w2, L.type, S.class FROM ls_dr10.tractor AS L INNER JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj AS X ON L.ls_id = X.id1 INNER JOIN sdss_dr17.specobj AS S ON X.id2 = CAST(S.specobjid AS BIGINT) WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1 and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan' and L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf' and L.z_w1 != 'nan' and L.w1_w2 != 'nan') LIMIT 10; When the database performs this query, it has to satisfy both of the ``INNER JOIN`` 's **before** it can apply the ``LIMIT 10``, which means the database has to scan millions of rows before returning 10. Here, ``ls_dr10.tractor`` has 3 billion rows, ``ls_dr10.x1p5__tractor__sdss_dr17__specobj`` has 4.8 million rows, and ``sdss_dr17.specobj`` has 5.1 million rows. (If they were around 1,000 rows each, the impact of not using ``LIMIT`` for most types of queries is negligible - for most types of queries encountered in the wild, anyway.) So along with using the ``LIMIT`` command, keep in mind the number of rows in each table. Incidentally (not immediately obvious) there is a missing join in this example. When building your query sometimes you might forget to add a join. This can happen while experimenting with queries. So this query runs for a very long time when we only wanted back 10 rows. Even with a missing join we can protect ourselves by using the ``LIMIT`` clause. But how do we use the ``LIMIT`` clause to protect ourselves here and elsewhere? Inline views. Inline Views ------------ If only we had a way to tell the database to limit the rows back from **each** table **before** we join them. Let's explicitly tell the database we want only 10 rows from ``ls_dr10.tractor``. :: SELECT L.ls_id, L.ra, L.dec, L.mag_g, L.mag_r, L.mag_z, L.mag_w1, L.mag_w2, L.g_r, L.r_z, L.z_w1, L.w1_w2, L.type, S.class FROM (SELECT * FROM ls_dr10.tractor AS L WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1 and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan' and L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf' and L.z_w1 != 'nan' and L.w1_w2 != 'nan') LIMIT 10) AS l2 INNER JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj AS X ON l2.ls_id = X.id1 INNER JOIN sdss_dr17.specobj AS S ON X.id2 = CAST(S.specobjid AS BIGINT) LIMIT 10; Let's break down what we did here. We created an `inline view `_ with this portion of the statement. :: (SELECT * FROM ls_dr10.tractor AS L WHERE (L.nobs_r>1 and L.nobs_z>1 and L.nobs_w1>1 and L.mag_g != 'nan' and L.mag_g != 'inf' and L.mag_r != 'nan' and L.mag_r != 'inf' and L.mag_z != 'nan' and L.mag_z != 'inf' and L.z_w1 != 'nan' and L.w1_w2 != 'nan') LIMIT 10) AS l2 And we gave it an alias `l2` (the `AS l2`). With this technique we treat the whole ``SELECT`` as another table and then perform a ``SELECT`` against *it*. Importantly, we put a ``LIMIT`` on there. Finally, we moved the ``WHERE`` clause up into the inline view since the conditions only apply to `L` (the alias we gave to ``ls_dr10.tractor``). -------------------------- HINT: There's a lot of text to read with queries, but keep the "skeleton" in mind as you read or scan them. The mental "skeleton" or pattern with this query might be like the following. :: SELECT columns FROM (select columns from table where constraints limit 10) AS x INNER JOIN something ON something INNER JOIN something ON something LIMIT 10; ------------------------- In summary, * Use the ``LIMIT`` clause to explore your data to prevent long queries and reduce load on computer resources; you won't have to wait as long to get back results. * Take advantage of inline views to apply the ``LIMIT`` clause to tables you join against. Make your JOINs work for you not against you ============================================ Here is an example of a JOIN that will work near optimal performance. :: SELECT S.ra, S.dec, S.z, X.ra2, X.dec2, S.class, S.vdisp, S.vdisp_err FROM sdss_dr17.specobj as S JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X ON S.specobjid = X.id2 WHERE S.ra BETWEEN 126 and 131 and S.dec BETWEEN 7.0 and 12.0 and (S.ra_ivar > 0) ORDER BY S.ra_ivar LIMIT 1000; In the JOIN, the column ``S.specobjid`` is indexed and so is ``X.id2``. Also, this is a 1-to-1 join, meaning there is exactly 1 row in ``sdss_dr17.specobj`` for each ``specobjid``, and there is exactly 1 row in ``ls_dr10.x1p5__tractor__sdss_dr17__specobj`` for each ``id2``. On the other hand, if ``specobjid`` were not indexed in either table, the database executor for this query would have to read every row in both tables, about 5 million each. This is 5000 times more rows than necessary. (Note: we still have the ``LIMIT 1000`` in there just in case as discussed above.) If this were a many-to-many join, the number of rows to scan can be the square of the number of rows in worst cases. In summary for best performance, * Join only on indexed columns. * Make sure your joins are not many-to-many but one-to-one. ORDER BY Trick ============================================ Again looking at the query we talked about above in the discussion of JOINs (reshown below), the ``LIMIT 1000`` may look like the database needs only to scan 1000 rows. It is more accurate to say that ``LIMIT`` means return only the specified number of rows *from the results*. The results may require scanning many more rows, we do not always know. In our example, the ``ORDER BY`` forces the database to generate the entire results regardless of the ``LIMIT``. :: SELECT S.ra, S.dec, S.z, X.ra2, X.dec2, S.class, S.vdisp, S.vdisp_err FROM sdss_dr17.specobj as S JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X ON S.specobjid = X.id2 WHERE S.ra BETWEEN 126 and 131 and S.dec BETWEEN 7.0 and 12.0 and (S.ra_ivar > 0) ORDER BY S.ra_ivar LIMIT 1000; **The Trick** We re-write the query as shown below. :: SELECT * FROM (SELECT S.ra, S.dec, S.z, X.ra2, X.dec2, S.class, S.vdisp, S.vdisp_err, /* added: */ S.ra_ivar FROM sdss_dr17.specobj as S JOIN ls_dr10.x1p5__tractor__sdss_dr17__specobj as X ON S.specobjid = X.id2 WHERE S.ra BETWEEN 126 and 131 and S.dec BETWEEN 7.0 and 12.0 and (S.ra_ivar > 0) LIMIT 1000 ) as x ORDER BY x.ra_ivar; We force the database to generate the results of the first 1000 rows *before* it performs the ``ORDER BY`` utilizing an `inline view `_. If we had simply put the ``LIMIT 1000`` before the ``ORDER BY`` without the inline view it would have caused a syntax error. Note we had to add the ``S.ra_ivar`` column indicated by the SQL comment ``/* added: */`` so the outer ``SELECT`` could see it. (You don't have to add the SQL comment, it's there for our demonstration). This runs in less than half the time as the original in our example. Learning more about SQL ======================= Khan Academy Intro to SQL: `Querying and managing `_. Online SQL Tutorial; `learn the mechanics interactively `_. Try some other online exercises geared to PostgreSQL `pgexercises `_. PostgreSQL SQL Lanuage from the online manual `The SQL Language `_.