I have a big catalogue (~3M rows) to cross-match with the ls_dr8.tractor catalogue. My catalogue already contains the brickid, release, and objid, so I only need to match the IDs to query the columns I need. I first uploaded my entire catalogue to myDB and ran a query like this:

datalab query --adql="select m.legacy_id, ra, dec,  mag_w1, mag_w2, mag_w3, psfsize_z, fracdev, shapedev_r, shapedev_e1, shapedev_e2, shapeexp_r,shapeexp_e1,shapeexp_e2 FROM ls_dr8.tractor join mydb://dr2_lsid_all m on release = m.rid and brickid = m.bid and objid = m.oid " --async=true --fmt="csv" --out="vos://dr2_allls" 
The query couldn't finish in a week, so I split the big catalogue into small ones of 30k rows each and reran the query in a loop. It still took about 5000 s (~1.5 h) per small catalogue, and the loop broke from time to time (it seemed to be a connection problem).
Could you tell me how long such a query normally takes? Is there any way to speed it up? It would be quite annoying to spend a month on this query.
by isaaczheng (220 points) | 105 views

1 Answer

Best answer

Hi, thanks for reaching out.

The slowness is most likely due to the multi-column JOIN (your example is not actually a positional crossmatch, but a JOIN between tables).

You may have noticed that the ls_dr8.tractor table has a column 'ls_id', which combines release, brickid, and objid into a single unique ID. The solution is to compute such a column for your own 3-million-row table as well, and then do a simple JOIN on it against ls_dr8.tractor.ls_id.

The formal prescription to generate ls_id from release,brickid,objid is:

(release << 40) | (brickid << 16) | (objid)
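To make the bit layout concrete, here is a minimal Python sketch of the encoding and its inverse (plain integer arithmetic, not part of the Data Lab API):

```python
def make_ls_id(release, brickid, objid):
    # Pack the three IDs into one 64-bit integer:
    # release occupies bits 40 and up, brickid bits 16-39, objid bits 0-15.
    return (release << 40) | (brickid << 16) | objid

def split_ls_id(ls_id):
    # Inverse operation: recover the components with shifts and masks.
    return (ls_id >> 40, (ls_id >> 16) & 0xFFFFFF, ls_id & 0xFFFF)
```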

If you want to generate it by running a query on our DB, you can use something like this (in this example I'm only selecting 1000 rows):

from dl import queryClient as qc

q = """SELECT release, brickid, objid, ra, dec, random_id, ls_id,
            ( (CAST(release AS BIGINT) << 40) |
              (CAST(brickid AS BIGINT) << 16) |
              (CAST(objid AS BIGINT))
            ) AS my_ls_id
          FROM ls_dr8.tractor LIMIT 1000"""
res = qc.query(sql=q, fmt='csv')

But you can also generate the my_ls_id column locally and add it to your table. Then upload the new table and, crucially, index the my_ls_id column.


Now you can do the JOIN:

q = """SELECT m.legacy_id,
              l.ra, l.dec, l.mag_w1, l.mag_w2, l.mag_w3,
              l.psfsize_z, l.fracdev, l.shapedev_r,
              l.shapedev_e1, l.shapedev_e2,
              l.shapeexp_r, l.shapeexp_e1, l.shapeexp_e2
         FROM ls_dr8.tractor AS l
         JOIN mydb://mytable AS m
           ON l.ls_id = m.my_ls_id"""
res = qc.query(sql=q, fmt='csv', out='vos://dr2_allls')

I've tried this with example tables in myDB of up to 100K rows, and it's much faster: 7 seconds for a 100K-row table.

We hope this helps,

by robertdemo (8.1k points)

Welcome to Data Lab Help Desk, where you can ask questions and receive answers from other members of the community.