
Hello,

I am using the datalab command line tool to crossmatch a personal database to the ls_dr10.tractor_s catalogue.  My personal database has 50k rows.

I am able to successfully complete a nearest-neighbor crossmatch between these tables, and the query runs in ~10 s. However, when I try to include a magnitude cut of mag_z < (22.0), the query times out.

Here is the original code that runs quickly:

WITH mytab AS (
    SELECT * FROM mydb://mytab
),
ls10 AS (
    SELECT ls_id, ra AS ls_ra, dec AS ls_dec,
           mag_g, mag_r, mag_i, mag_z, mag_w1, mag_w2,
           dered_mag_g, dered_mag_r, dered_mag_i, dered_mag_z, dered_mag_w1, dered_mag_w2,
           flux_g, flux_r, flux_i, flux_z, flux_w1,
           dered_flux_g, dered_flux_r, dered_flux_i, dered_flux_z, dered_flux_w1, dered_flux_w2,
           snr_g, snr_r, snr_i, snr_z, snr_w1, snr_w2,
           type, parallax, parallax_ivar
    FROM ls_dr10.tractor_s
)
SELECT mytab.*,
       (q3c_dist(mytab.ra, mytab.dec, xm_ls10.ls_ra, xm_ls10.ls_dec) * 3600.0) AS ls_dist_arcsec,
       xm_ls10.*
FROM mytab
LEFT JOIN LATERAL (
    SELECT ls10.* FROM ls10
    WHERE q3c_join(mytab.ra, mytab.dec, ls10.ls_ra, ls10.ls_dec, 15.0/3600.)
    ORDER BY q3c_dist(mytab.ra, mytab.dec, ls10.ls_ra, ls10.ls_dec) ASC
    LIMIT 1
) AS xm_ls10 ON true

When I modify the ls10 definition as follows, the query times out.

ls10 AS (
    SELECT ...columns from before... FROM ls_dr10.tractor_s WHERE mag_z < (25.0)
)

The resulting error message is: HTTPSConnectionPool(host='datalab.noirlab.edu', port=443): Read timed out. (read timeout=600)

I have found that if I use mag_z < (35.0), or some other very large value, the query runs and does so relatively quickly.

If I run just the SELECT ...columns from before... FROM ls_dr10.tractor_s WHERE mag_z < (25.0) in the web query interface, it works and runs quickly.

I don't understand why it now takes so long to process, and any advice would be appreciated!

by ktkauma (140 points)

1 Answer


Thanks for your question; it was an interesting puzzle to solve.

Basically, the difference between cutting at mag_z < 25 versus 35 comes down to the distribution of the indexed values in that column and what the query planner does with it. If the cutoff is above roughly 28, the planner uses a sequential scan, since that cut includes most of the values in the column anyway; with a cutoff at 25 the execution switches to an index scan, which in this case is much less efficient.
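To see this directly, with direct access to the database you could compare the two plans with EXPLAIN; this is only an illustrative sketch (EXPLAIN is standard PostgreSQL, and I'm not assuming it can be run through the Data Lab query interfaces):

     -- Illustrative only: compare the plan chosen for each cutoff.
     EXPLAIN SELECT ls_id FROM ls_dr10.tractor_s WHERE mag_z < 25.0;  -- index scan in this case
     EXPLAIN SELECT ls_id FROM ls_dr10.tractor_s WHERE mag_z < 35.0;  -- sequential scan, since nearly all rows pass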

There are two ways to fix this: change the statistics computed for the index on that column, or rewrite the query to trick the planner into ignoring the index and doing a sequential scan. The first could be tuned for this particular query but may not be generally useful for us; the second fix is quite simple. By introducing a small calculation into the WHERE constraint you can force the sequential scan, for example

     ... WHERE (mag_z + 1000.0) < (1025.0)
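In the context of your query, the ls10 CTE would then look something like this (column list abbreviated as in your post):

     ls10 AS (
         SELECT ...columns from before... FROM ls_dr10.tractor_s
         -- the arithmetic on the column keeps the planner from choosing the index scan
         WHERE (mag_z + 1000.0) < (1025.0)
     )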

With this change I'm able to run the query in a few seconds using your 'mytab' table.
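For completeness, the first option (adjusting the column statistics and re-analyzing) would be a change on the database side; a rough sketch in standard PostgreSQL, with the statistics target of 1000 purely illustrative, would be:

     -- Administrator-side sketch; the statistics target value is illustrative only.
     ALTER TABLE ls_dr10.tractor_s ALTER COLUMN mag_z SET STATISTICS 1000;
     ANALYZE ls_dr10.tractor_s;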

I hope this helps, let us know if you continue to have problems.

by datalab (20.7k points)
