0 votes
305 views

I'm looking for a query to filter out unwanted objects from a crossmatch result and I'm not sure how to do it since NOIRLab doesn't support ADQL geometry functions (e.g. CONTAINS).

The resulting crossmatch has columns with ra & dec values (ra, dec, w1mpro, w2mpro, japermag3). I'm looking for a solution to throw these results into the query and remove all objects that are:

  • listed in Gaia DR3 within 0.0006 deg (~ 2 arcsec)
  • or have a magnitude listed in NSC DR2 g- or r-band within 0.0006 deg (~ 2 arcsec)

I couldn't find an example in the 'notebooks-latest' about removing unwanted objects from a crossmatch so I would be thankful for a little hint. :)

by martinkb (720 points) | 305 views

1 Answer

0 votes
Best answer
Hi Martin, thanks for reaching out. How large is the table that you want to rid of some rows? (according to the criteria you mention) If it's not "many millions or billions" of rows, then the simplest solution I can think of would be to load the table into a Pandas dataframe, and then use Panda's built-in filtering capabilities to keep only the rows you want. Some examples of Pandas' row filtering are for instance here: https://blog.hubspot.com/website/filter-rows-pandas

The selection can be probably done with SQL, in your original query, but that query might be too slow. Hard to tell before trying. If you think the Pandas route won't work, and you want to try SQL, can you please post the exact crossmatch query you have so far? (the one that also returns rows that you don't want).

Robert
by robertdemo (8.0k points)
selected by 0
Sorry, I should've given way more details. The crossmatch was between the CatWISE2020 Reject catalog and VHS DR5. The resulting table has ~2,400 rows. I've uploaded it to mydb as 'catw_rjct_vhs_cmatch' and 'mydb_index'ed the ra and dec column.

The crossmatch (3° deg. stripes in VHS & CatWISE2020 Reject catalog) looked like this:
  '''SELECT  c.*, v.japermag3
  FROM mydb://cat_rjct_%c%02d AS c, mydb://vhs_j18t19_%c%02d AS v
  WHERE q3c_join(c.ra, c.dec, v.ra2000, v.dec2000, 0.0006)
  AND v.japermag3-c.w2mpro >= 2 ''' % ('m' if fnr < 0.0 else 'p',abs(fnr),'m' if fnr < 0.0 else 'p',abs(fnr))

I haven't used Pandas dataframe yet, so I don't know if this would work but I'm willing to give it a try if you think this would work.

I was trying out something the last few days and thought something like "NOT q3c_join" would work, but the sync query timed out and async went on for more than 20 hours, so I aborted the job. The query looked like:
  '''SELECT t.*
  FROM %s AS t, gaia_dr3.gaia_source as g
  WHERE NOT q3c_join(t.ra, t.dec, g.ra, g.dec, 0.0006)
  ''' % (tname_source)

('tname_source' was 'mydb://catw_rjct_vhs_cmatch')

I've searched the forum today and found a discussion about cone searches. So the next idea would be to use something like 'NOT q3c_radial_query(t.ra, t.dec, g.ra, g.dec, 0.0006)' but I haven't tried it yet and don't know if this would be the right approach.
Thanks for the additional info. The table of 2400 rows sounds really like a prime candidate for Pandas-style filtering. Would you mind copying the CSV to your vospace, somewhere in the vos://public/ directory? (vos://public/ is readable to all, i.e. I can get it from there). You could use:

qc.query("select * from mydb://catw_rjct_vhs_cmatch",out="vos://public/catw_rjct_vhs_cmatch.csv")

If you prefer, you could also email the CSV file to datalab@noirlab.edu

Robert
I've uploaded it to vospace now (vos://public/catw_rjct_vhs_cmatch.csv). I've checked a bunch of the objects in the meantime, so it's only 2,162 rows.

I've looked into Pandas a little bit, but I'm still not sure how that would work. I can only filter for things that are already in the dataframe but Gaia DR3 membership is not part of it or listed in any column.

By the way, I'm really interested in the solution so that I can do it myself in the future. :)
Hi, I think we have a solution.

I misunderstood first and thought that you have the Gaia membership in the 2100-odd rows table.

Then, I would have tried your approach with NOT before q3c_radial_query() (or before q3c_join() if your table in MyDB has a q3c index computed on ra+dec). To my surprise, this indeed runs very long (and times out). It's not yet clear to me why, but I suspect it's the very way that Q3C works (going down a quad tree).

But: this works, and takes just a split second:

q = """SELECT  t.*, lat.* FROM mydb://catw_rjct_vhs_cmatch AS t
           LEFT JOIN LATERAL (
               SELECT g.ra as gaia_ra, g.dec as gaia_dec
                    FROM
                        gaia_dr3.gaia_source AS g
                    WHERE
                        q3c_join(t.ra, t.dec, g.ra, g.dec, 2./3600)
                    ORDER BY
                        q3c_dist(t.ra,t.dec,g.ra,g.dec)
                    ASC LIMIT 1
               ) as lat ON true;"""

df = qc.query(sql=q,fmt='pandas')  # returns a Pandas datafeame

print(df)

              ra        dec  w1mpro  w2mpro  japermag3     gaia_ra   gaia_dec
0     359.236279 -12.725109  17.272  16.054  18.901869  359.236227 -12.725009
1     342.668594 -37.068164  17.035  15.975  18.822598         NaN        NaN
2     284.883958 -38.395691  16.439  15.193  18.039240  284.884509 -38.396015
3     139.318738 -84.704269  16.401  15.210  18.056204  139.317920 -84.704375
4     285.738926 -42.133567  17.007  15.685  18.530529         NaN        NaN
...          ...        ...     ...     ...        ...         ...        ...
2158   44.970729 -74.375657  17.746  16.554  18.555647         NaN        NaN
2159   25.342299 -78.219318  18.258  16.945  18.946070   25.343585 -78.218890
2160  310.778549 -56.177660  17.158  16.146  18.146202  310.779308 -56.177933
2161  247.471449 -15.539732  17.527  16.455  18.455181  247.471779 -15.539405
2162  258.986891  -6.048619  17.720  16.700  18.700050         NaN        NaN

[2163 rows x 7 columns]

Observe that gaia_ra and gaia_dec columns have 'NaN' for every row where there was no crossmatch in gaia_dr3.gaia_source. We can now use this trivially to remove those rows from the dataframe:

dfclean = df.dropna(subset=['gaia_ra',])

If you now wanted to drop the new 'gaia_ra' and 'gaia_dec' columns from the dfclean dataframe:

dfclean = dfclean.drop(axis=1,labels=['gaia_ra','gaia_dec'])

And finally, reset the index of the dataframe if you wish:

dfclean.reset_index(inplace=True)
print(dfclean)

      index          ra        dec  w1mpro  w2mpro  japermag3
0         0  359.236279 -12.725109  17.272  16.054  18.901869
1         2  284.883958 -38.395691  16.439  15.193  18.039240
2         3  139.318738 -84.704269  16.401  15.210  18.056204
3         7  280.114715 -52.521655  17.024  15.997  18.839478
4         8  248.463188 -12.575940  16.787  15.605  18.447136
...     ...         ...        ...     ...     ...        ...
1071   2150  256.152790 -57.824500  17.885  16.537  18.543484
1072   2152  104.960263 -31.049966  17.067  16.064  18.070358
1073   2159   25.342299 -78.219318  18.258  16.945  18.946070
1074   2160  310.778549 -56.177660  17.158  16.146  18.146202
1075   2161  247.471449 -15.539732  17.527  16.455  18.455181

[1076 rows x 6 columns]

Note the reduced number of rows. You can now for instance write out the dataframe to a new CSV file, etc. And of course, do the same for your "not in nsc_dr within 2 arcseconds" constraints.

Best,
Robert
Thanks Robert. Especially for going into such detail. :)
I'm currently trying to figure out how this works and looking up the Pandas functions.

> Observe that gaia_ra and gaia_dec columns have 'NaN' for every row where there was no crossmatch in gaia_dr3.gaia_source. We can now use this trivially to remove those rows from the dataframe:

I'm basically looking for 'brown dwarf candidates' and a listing in Gaia would rule the object out or it already would have been discovered/mentioned. That's why my intent was to remove all objects that are listed in Gaia DR3. So if I get your approch right then the lines without a crossmatch (NaN) are actually the ones I'm interested in, right?

According to my Google search results for a possible solution it seems the isna() function should be the one to use in this case.
Hi Martin, yes you are right. The rows where 'gaia_ra' and 'gaia_dec' are NaN, are the ones with no match in Gaia. To keep only such rows you can do:

dfclean = df[df['gaia_ra'].isna()]

(and you can drop the gaia_ra and gaia_dec columns if you like, and also reset the index).
I'm still working on my workflow and seem to have struck another problem. How do you get the 'dfclean' Pandas dataframe back into mydb? I was trying to use 'mydb_import' but the table which is created seems to have 0 rows. My current script looks like this:

###########################################################
import time
from getpass import getpass
from datetime import datetime
from dl import authClient as ac, queryClient as qc

def rm_gaia(tname):
    # remove all objects that are listed in Gaia DR3
    query = '''
            SELECT  t.*, lat.*
            FROM %s AS t
            LEFT JOIN LATERAL (
                SELECT g.ra as gaia_ra, g.dec as gaia_dec
                        FROM
                            gaia_dr3.gaia_source AS g
                        WHERE
                            q3c_join(t.ra, t.dec, g.ra, g.dec, 0.0006)
                        ORDER BY
                            q3c_dist(t.ra, t.dec, g.ra, g.dec)
                        ASC LIMIT 1
                ) as lat ON true
            ''' % (tname)

    # table name of objects minus Gaia DR3 -> rm_nsc()
    tname_rm_gaia = '%s_g' % (tname)

    # query Gaia DR3 and return crossmatch as Pandas dataframe
    df = qc.query(sql=query, fmt='pandas', verbose=1)
    
    # remove all Gaia DR3 matches and remove Gaia RA & dec. columns
    dfclean = df[df['gaia_ra'].isna()]
    dfclean = dfclean.drop(axis=1, labels=['gaia_ra','gaia_dec'])

    # reset index after removing rows & dropping columns
    dfclean.reset_index(drop=True, inplace=True)

    # import dataframe into mydb
    qc.mydb_import(tname_rm_gaia, dfclean)


tname = 'mydb://catw_rjct_vhs_cmidx'
rm_gaia(tname)
###########################################################

'mydb://catw_rjct_vhs_cmidx' is the q3c-indexed mydb table (qc.mydb_index(tname,'', q3c='ra,dec', cluster=True) with the 2163 rows. dfclean returns 993 rows, but counting the rows in 'mydb://catw_rjct_vhs_cmidx_g' (notice the '_g' appendix) always returns '0'. What am I doing wrong?

By the way, do I have to q3c-index the new mydb table again or does it still hold the index from the original table?
Oops. I did a little bit debugging and finally found the error. 'tname_rm_gaia' was starting with 'mydb://' and this won't work for a table name in mydb. The solution was to slice the string, then it worked.

tname_rm_gaia = tname[7:] + '_g'

Now the mydb table holds the expected 993 rows.

Still not sure about the q3c-index though. Do I have to redo it prior to crossmatching with NSC DR2?
Glad you figured out the table name issue. Regarding q3c index: they are not propagated if you create a new table using "SELECT * from .." from a table that had the index. For small tables (a few thousand rows) having the q3c index or not should not make much of a difference (as long as you feed the non-indexed ra and dec columns first to e.g. q3c_radial_query(). Otherwise, it's a good habit to quickly compute the q3c index.
Thanks a lot for all the hints/tips Robert. :)

I was finally able to run my little python script and condensed the final table down to 175 rows. Did indeed find a new object with brown dwarf characteristics. Yay! :D (https://www.zooniverse.org/projects/marckuchner/backyard-worlds-planet-9/talk/666/2985419?comment=4903916)

I discovered a problem with the script though when I tried to run it on another dataset. It turns out that NSC DR2 doesn't use 'NaN' to indicate a missing magnitude for an object but instead lists a value of '99.990000'. In order to get only those objects that don't have a magnitude listed in g- or r-mag I used the condition 'dfclean = df[(df['nsc_gmag'] == 99.99) & (df['nsc_rmag'] == 99.99)]'.

The new dataset though returns only 'NaN' values when crossmatching in a 2./3600 radius with NSC DR2 which results in an empty pandas dataframe. That's when I noticed I'm loosing all the Non-matches from the crossmatch with my approach which is not what I intended.

So that's where I'm at right now. I'm looking for a solution to get:
 • all objects that crossmatch with NSC DR2 and don't have a magnitude in g- AND r-band
 • additionally keep the ones that don't crossmatch with NSC DR2 (for further checkup)

Currently reading Pandas documentation, StackOverflow, Google, … to come up with a better solution.
After some testing I think I figured it out. The new condition I'm using is:
dfclean = df[((df['nsc_gmag'] == 99.99) & (df['nsc_rmag'] == 99.99)) | ((df['nsc_gmag'].isna()) & (df['nsc_rmag'].isna()))]

I've just added another 'OR' branch after the prior condition to test for 'isna()' in the nsc_gmag and nsc_rmag columns. In this way I'll also keep all the non-matches. Seems to work. :)

On a sidenote, there seems to be something wrong with the 'Question2Answer' software running here at NOIRLab. I've never gotten an email notification although I add my email address down below most of the time where it says "Email me at this address if a comment is added after mine:"
Hi Martin, sorry for the delay, I also didn't get a notification that there were new comments on this thread. We looked into this issue over the past few days, and tested (successfully): when you check the box "I want to be notified", and you enter your email, then you should be getting notifications. Is it possible that you had one or the other, but not both, on that occasion?
> when you check the box "I want to be notified", and you enter your email, then you should be getting notifications.

The box is usually pre-checked and I only put my email address in. But I wouldn't rule out this has happend in the past. However, I've found the last notification email now buried deep in my spam filter. It seems the feature is working properly then. :)

418 questions

435 answers

440 comments

637 users

Welcome to Data Lab Help Desk, where you can ask questions and receive answers from other members of the community.

Categories