Section author: Adam Scott <adam.ascott@noirlab.edu>

1.13.4. Optimizing queries

  • Most of the time, if you reduce the number of rows the query “scans”, you will get your results back faster.

    See the discussion on the LIMIT clause in SQL Basics and Gotchas

  • Instead of SELECT *, select only the specific columns of interest. This is true in general. Instead of selecting 20 columns, select only the ones you need. If you are choosing between two columns to select, pick the one that is smaller in size.

  • If the query JOIN’s one or more tables, be sure the columns you are joining on are indexed.

    To find out if a column is indexed, one way is to lookup the table and see if the column is in bold here Data Lab query webpage

  • To take advantage of indexed columns, be sure to not wrap them in functions such as AVG(), SUM(), or any other function call.

SELECT expnum, fieldid  FROM smash_dr1.exposure e
 WHERE CAST(e.expnum AS INTEGER) > 187566;

Since we put CAST() around e.expnum, even if the expnum column is indexed, it will be slow as the index will be ignored. (This example would not be significant since the exposure table has 5,809 rows–related to the first point above to reduce the number of rows the query scans).

  • When querying an indexed column, if you can add a WHERE clause and you can also query another indexed column mentioned in the WHERE clause, most often the results return more quickly than just using a single indexed column.

Consider the abstracted SELECT.

SELECT <<something>> FROM <<something>> WHERE
<<condition 1>> AND
<<condition 2>> AND
<<condition ...>> AND
<<condition N>>;

The more conditions you can provide on any indexed column, generally the better.

  • Sorting is a relatively expensive operation. If you do not need to order the results, removing the ORDER BY will greatly speed up the results. (GROUP BY also forces a sort.)

  • Indexing columns having only one value in their rows is a waste since all rows satisfy (the database still has to scan through all the rows). Thus, using a WHERE clause on an indexed column having unique values will perform better than an indexed column with fewer unique values.

  • If you search for values in a non-indexed column, the database will search through every row. Add a condition in the where clause for an indexed column to narrow your search.

Here is a query on a table on an un-indexed column.

SELECT ra, dec FROM ls_dr3.tractor t
 WHERE g_r between 0.3 and 1.1;

In the example, where g_r is not indexed (as of this writing), the database will scan across over 330 GB of data, requiring over 10 minutes to run. (The entire tractor table is over 524 million rows).

If you add more conditions in the where clause on indexed columns you can drastically reduce the running time as in the following example (2 minutes to run).

SELECT ra, dec from ls_dr3.tractor_cs t
 WHERE g_r between 0.3 and 1.1
       AND mag_g between 20.5 and 21;

Why not index everything? The main reason is some columns do not have varied enough values, so the planner will then not use an index on those columns. Indexes also take up disk space.

  • If you are exploring the data, try reducing the number of calculations. Or try using simpler calculations: Division is more time consuming than multiplying. Some division operations can be changed into multiplication (e.g. x/10 = x * 0.10)

  • Avoid doing string or text manipulations as they are CPU-intensive operations.

  • If you are doing all of the suggestions above and the query is still slow, you may have hit an optimization wall where we are limited by hardware performance. In the future look for more parallel options to run queries across multiple CPUs and nodes. Presently we have parallel query in place which works well for aggregate queries on large tables, though it still has limited use. Future versions will support parallel operations on even more queries.