How to crossmatch three surveys efficiently to get photometric information?

Question

260 views

Hello! I am currently trying to crossmatch DES DR2 data with VHS and C

The tables I am using are des_dr2.main, vhs_dr5.vhs_cat_v3, and catwise2020.main. I would like to crossmatch the entire survey with the goal of obtaining photometry for all overlapping objects. I do have a for loop that matches based on a central coordinate ra0, dec0, and a radius of search using q3c_radial_query and organizing results by rdist. However, this code takes a very long time to run and will take me over a 100 days of computing time. I was wondering if AstroDataLab has a designated tool for this or if anyone has experience with this that knows how to do crossmatch in a time efficient manner?

For reference, here is my code:

# match des to vista

vista_patch1 = pd.DataFrame()

radius = 0.007

for i in range(len(des_patch1)):

    ra0 = des_patch1.iloc[i, 0] # in degrees, right ascension

    dec0 = des_patch1.iloc[i, 1] # in degrees, declinarion

    query = """

    SELECT ra2000, dec2000, japermag3, ksapermag3, japermag3err, ksapermag3err

    FROM vhs_dr5.vhs_cat_v3,

    q3c_dist(ra2000, dec2000, {:f}, {:f}) as rdist

    WHERE q3c_radial_query(ra2000, dec2000, {:f}, {:f}, {:f})

    order by rdist

    """.format(ra0, dec0, ra0, dec0, radius)

    response = qc.query(sql=query,format='csv')

    sources = convert(response,'pandas')

    sources['ra_des'] = ra0

    sources['dec_des'] = dec0

    sources['ra_comp'] = (sources['ra2000'] - ra0) * np.cos(dec0 * (np.pi/180))   # difference in right ascension

    sources['dec_comp'] = sources['dec2000'] - dec0                               # difference in declination

    closestmatch = sources.head(1)

    vista_patch1 = pd.concat([vista_patch1, closestmatch], ignore_index = True)

# match des to catwise

cat_patch1 = pd.DataFrame()

radius = 0.007

for i in range(len(des_patch1)):

    ra0 = des_patch1.iloc[i, 0] # in degrees, right ascension

    dec0 = des_patch1.iloc[i, 1] # in degrees, declinarion

    query = """

    SELECT ra, dec, w1mpro, w2mpro, w1sigmpro, w2sigmpro, pmra, pmdec

    FROM catwise2020.main,

    q3c_dist(ra, dec, {:f}, {:f}) as rdist

    WHERE cc_flags = '0000'

    AND ab_flags = '00'

    AND q3c_radial_query(ra,dec,{:f},{:f},{:f})

    order by rdist

    """.format(ra0, dec0, ra0, dec0, radius)

    response = qc.query(sql=query,format='csv')

    sources = convert(response,'pandas')

    sources['ra_des'] = ra0

    sources['dec_des'] = dec0

    sources['ra_comp'] = (sources['ra'] - ra0) * np.cos(dec0 * (np.pi/180))   # difference in right ascension

    sources['dec_comp'] = sources['dec'] - dec0                               # difference in declination

    closestmatch = sources.head(1)

    cat_patch1 = pd.concat([cat_patch1, closestmatch], ignore_index = True)

Thank you!

asked Jun 28, 2023 by malinadesai (140 points) | 260 views

1 Answer

Answer 1 · 2023-06-28T14:59:20+0000

Thank you for reaching out. May I offer a few general suggestions / best practices, and also a few specific ones to your proposed workflow. Maybe some, or a combination of them, will help in achieving your goal.

Cross-matches of entire large surveys are very compute-intensive. For most science use cases they are also not the first step (I think). How many des_dr2 objects do you have to start with? How many sources are approx. in each des_patch1 ?

* Is there a way where you could reduce significantly the number of objects you want to crossmatch? Maybe a pre-selection of DES DR2 objects based on colors, SNR, magnitude limits? (your science case drives this of course). The goal is not to have to crossmatch each and every object in two or three giant tables...

* If such a pre-selection can be done, one way to proceed is to save this pre-selected table to your MyDB, e.g. using:

query = """SELECT list,of,the,columns,i,need
                 FROM des_dr.main as d
                 WHERE d.mag_auto_g BETWEEN 23 and 24.5
                 AND d.snr_g > 30
                 AND any-other-conditions"""

result = qc.query(query,out="mydb://des_dr2_mytable")

You can then later access that smaller table, e.g. using

result = qc.query("SELECT * from mydb://des_dr2_mytable")

Ideally, your science use case can also dictate similar quality cuts on the other two tables (VHS, Catwise) to also reduce their initial sizes significantly, before proceeding to crossmatch them.

* As you see in the pseudo-example, it's good to only ask for the columns one really needs (which you are already specifying in your queries). Asking for all columns using '*' just explodes the data volume enormously.

* This is important: don't crossmatch by looping over every single source. Crossmatch using an entire sample at once (maybe the sample in a single patch, or maybe the entire table -- though in this case this would likely not succeed due to the sheer sizes of the initial tables). Our notebook "How to crossmatch tables" shows some examples:

https://github.com/astro-datalab/notebooks-latest/blob/master/04_HowTos/CrossmatchTables/How_to_crossmatch_tables.ipynb

(in section "Run the crossmatch"),

* The notebook also shows an example how you can return just the single nearest neighbor using SQL (since in your workflow you seem to be only keeping the nearest neighbor.) Specifying this in the SQL query prevents all the matching rows from having to be transferred back to your script.

* Note that you can run queries synchronously (that's the default), with up to 600 seconds timeout (default is 300s; set with qc.query(....timeout=600)), or asynchronously, with a runtime up to 24 hours (if memory serves). Handling async queries is a bit different though. Please see for instance the last section in this notebook: https://github.com/astro-datalab/notebooks-latest/blob/master/01_GettingStartedWithDataLab/01_JupyterPythonSQL101.ipynb

Various other notebooks in that repository also use async queries.

* For q3c functions to be more efficient (and this can make a big difference): if one of the two tables being crossmatched is much larger, and has already a q3c index computed on its ra/dec columns (like all object tables hosted at Data Lab do), then that large&indexed table should come second in many q3c functions (q3c_radial_query). See the Q3C documentation for more details: https://github.com/segasai/q3c
And also examples in the DL User Manual: https://datalab.noirlab.edu/docs/manual/Appendices/ExampleQueries/ExampleQueries.html

* If you are looping (maybe over patches or whatever), take the identical stuff outside the loop. For instance in your example (and using Pandas), code like this can be run just once, after the loop has finished:

sources['dec_comp'] = sources['dec2000'] - dec0

If 'dec0' is instead a column in the table, the one-shot line is then:

sources['dec_comp'] = sources['dec2000'] - sources['dec0']

This would do it for an entire column an once. And so on for other similar commands.

* Finally: if you like working with Pandas dataframes, and the returned table from a qc.query() call is not many millions/billions of rows lon, you can ask qc.query() to return a dataframe directly:

query = "SELECT ..."
df = qc.query(query,fmt='pandas')

Please let us know whether these hints were helpful in your quest, and please don't hesitate to contact us if you have further questions.

How to crossmatch three surveys efficiently to get photometric information?

Please log in or register to add a comment.

Your answer

1 Answer

Please log in or register to add a comment.

Categories