Taking too long to cross match data

Question

249 views

Hi,

I have a big catalogue (~3M rows) to corss match with the ls8.tractor catalogue. My catalogue already gives the brickid, release and objid so I only need to match the ids to query the columns I need. I first uploaded entire my catalogue to myDB and did query like this:

datalab query --adql="select m.legacy_id, ra, dec, mag_w1, mag_w2, mag_w3, psfsize_z, fracdev, shapedev_r, shapedev_e1, shapedev_e2, shapeexp_r,shapeexp_e1,shapeexp_e2 FROM ls_dr8.tractor join mydb://dr2_lsid_all m on release = m.rid and brickid = m.bid and objid = m.oid " --async=true --fmt="csv" --out="vos://dr2_allls"

And the query couldn't finish in a week. So I splited the big catalogue into small ones with 30k rows each and reran the query in a loop. It still took about 5000s (~1.5h) to finish one small catalogue. However, the loop broke from time to time (seem to be connecting problem).

Could you tell me how long it normally take to finish such an query? Is there any way to boost it? It is quite annoying if I need to spend a month on the query.

asked Jul 16, 2023 by isaaczheng (220 points) | 249 views

1 Answer

Best answer

Hi, thanks for reaching out.

The reason for the slowness is likely the multi-column JOIN (your example is not actually crossmatching, but JOINing tables).

You will have likely seen that the ls_dr8.tractor table has a column 'ls_id', which combines release,brickid, and objid to a single unique ID. The solution to your goal is to compute such a column for your own 3-millions-rows table, too, and then do a simple JOIN on it with ls_dr8.tractor.ls_id.

The formal prescription to generate ls_id from release,brickid,objid is:

(release << 40) | (brickid << 16) | (objid)

If you want to generate it by running a query on our DB, you can use something like this (in this example I'm only creating 1000 rows):

from dl import queryClient as qc
q = """SELECT release, brickid, objid, ra, dec, random_id, ls_id,
            ( (CAST(release AS BIGINT) << 40) |
              (CAST(brickid AS BIGINT) << 16) |
              (CAST(objid AS BIGINT))
            ) AS my_ls_id
          FROM ls_dr8.tractor LIMIT 1000
"""
qc.query(sql=q,out="mydb://mytable")

But you can also generate the my_ls_id column locally, and then add it to your table. Then upload the new table, and crucially, index the my_ls_id column:

qc.mydb_index('mytable','my_ls_id')

Now you can do the JOIN:

q = """SELECT m.legacy_id,
                         l.ra, l.dec, l.mag_w1, l.mag_w2, l.mag_w3,
                         l.psfsize_z, l.fracdev, l.shapedev_r,
                         l.shapedev_e1, l.shapedev_e2, l.shapeexp_r, l.shapeexp_e1, l.shapeexp_e2
          FROM ls_dr8.tractor AS l
          JOIN mydb://mytable AS m
          ON l.ls_id = m.my_ls_id"""
res = qc.query(sql=q,fmt='csv',out='vos://dr2_allls')

I've tried this with example tables in mydb with up to 100K rows in size, and it's much faster: 7 seconds for a 100K rows table.

We hope this helps,
Robert

answered Jul 17, 2023 by robertdemo (9.3k points)
selected Jul 26, 2023 by 0

Taking too long to cross match data

Please log in or register to add a comment.

Your answer

1 Answer

Please log in or register to add a comment.

Categories