How to filter out unwanted objects

Question

844 views

I'm looking for a query to filter out unwanted objects from a crossmatch result and I'm not sure how to do it since NOIRLab doesn't support ADQL geometry functions (e.g. CONTAINS).

The resulting crossmatch has columns with ra & dec values (ra, dec, w1mpro, w2mpro, japermag3). I'm looking for a solution to throw these results into the query and remove all objects that are:

listed in Gaia DR3 within 0.0006 deg (~ 2 arcsec)
or have a magnitude listed in NSC DR2 g- or r-band within 0.0006 deg (~ 2 arcsec)

I couldn't find an example in the 'notebooks-latest' about removing unwanted objects from a crossmatch so I would be thankful for a little hint. :)

asked Jun 15, 2023 by martinkb (720 points) | 844 views

1 Answer

Best answer

Hi Martin, thanks for reaching out. How large is the table that you want to rid of some rows? (according to the criteria you mention) If it's not "many millions or billions" of rows, then the simplest solution I can think of would be to load the table into a Pandas dataframe, and then use Panda's built-in filtering capabilities to keep only the rows you want. Some examples of Pandas' row filtering are for instance here: https://blog.hubspot.com/website/filter-rows-pandas

The selection can be probably done with SQL, in your original query, but that query might be too slow. Hard to tell before trying. If you think the Pandas route won't work, and you want to try SQL, can you please post the exact crossmatch query you have so far? (the one that also returns rows that you don't want).

Robert

answered Jun 21, 2023 by robertdemo (9.3k points)
selected Sep 12, 2023 by 0

Sorry, I should've given way more details. The crossmatch was between the CatWISE2020 Reject catalog and VHS DR5. The resulting table has ~2,400 rows. I've uploaded it to mydb as 'catw_rjct_vhs_cmatch' and 'mydb_index'ed the ra and dec column.

The crossmatch (3° deg. stripes in VHS & CatWISE2020 Reject catalog) looked like this:
  '''SELECT c.*, v.japermag3
  FROM mydb://cat_rjct_%c%02d AS c, mydb://vhs_j18t19_%c%02d AS v
  WHERE q3c_join(c.ra, c.dec, v.ra2000, v.dec2000, 0.0006)
  AND v.japermag3-c.w2mpro >= 2 ''' % ('m' if fnr < 0.0 else 'p',abs(fnr),'m' if fnr < 0.0 else 'p',abs(fnr))

I haven't used Pandas dataframe yet, so I don't know if this would work but I'm willing to give it a try if you think this would work.

I was trying out something the last few days and thought something like "NOT q3c_join" would work, but the sync query timed out and async went on for more than 20 hours, so I aborted the job. The query looked like:
  '''SELECT t.*
  FROM %s AS t, gaia_dr3.gaia_source as g
  WHERE NOT q3c_join(t.ra, t.dec, g.ra, g.dec, 0.0006)
  ''' % (tname_source)

('tname_source' was 'mydb://catw_rjct_vhs_cmatch')

I've searched the forum today and found a discussion about cone searches. So the next idea would be to use something like 'NOT q3c_radial_query(t.ra, t.dec, g.ra, g.dec, 0.0006)' but I haven't tried it yet and don't know if this would be the right approach.

commented Jun 21, 2023 by martinkb (720 points)
edited Jun 21, 2023 by 0

Hi, I think we have a solution.

I misunderstood first and thought that you have the Gaia membership in the 2100-odd rows table.

Then, I would have tried your approach with NOT before q3c_radial_query() (or before q3c_join() if your table in MyDB has a q3c index computed on ra+dec). To my surprise, this indeed runs very long (and times out). It's not yet clear to me why, but I suspect it's the very way that Q3C works (going down a quad tree).

But: this works, and takes just a split second:

q = """SELECT t.*, lat.* FROM mydb://catw_rjct_vhs_cmatch AS t
           LEFT JOIN LATERAL (
               SELECT g.ra as gaia_ra, g.dec as gaia_dec
                    FROM
                        gaia_dr3.gaia_source AS g
                    WHERE
                        q3c_join(t.ra, t.dec, g.ra, g.dec, 2./3600)
                    ORDER BY
                        q3c_dist(t.ra,t.dec,g.ra,g.dec)
                    ASC LIMIT 1
               ) as lat ON true;"""

df = qc.query(sql=q,fmt='pandas') # returns a Pandas datafeame

print(df)

              ra        dec w1mpro w2mpro japermag3     gaia_ra   gaia_dec
0     359.236279 -12.725109 17.272 16.054 18.901869 359.236227 -12.725009
1     342.668594 -37.068164 17.035 15.975 18.822598         NaN        NaN
2     284.883958 -38.395691 16.439 15.193 18.039240 284.884509 -38.396015
3     139.318738 -84.704269 16.401 15.210 18.056204 139.317920 -84.704375
4     285.738926 -42.133567 17.007 15.685 18.530529         NaN        NaN
...          ...        ...     ...     ...        ...         ...        ...
2158   44.970729 -74.375657 17.746 16.554 18.555647         NaN        NaN
2159   25.342299 -78.219318 18.258 16.945 18.946070   25.343585 -78.218890
2160 310.778549 -56.177660 17.158 16.146 18.146202 310.779308 -56.177933
2161 247.471449 -15.539732 17.527 16.455 18.455181 247.471779 -15.539405
2162 258.986891 -6.048619 17.720 16.700 18.700050         NaN        NaN

[2163 rows x 7 columns]

Observe that gaia_ra and gaia_dec columns have 'NaN' for every row where there was no crossmatch in gaia_dr3.gaia_source. We can now use this trivially to remove those rows from the dataframe:

dfclean = df.dropna(subset=['gaia_ra',])

If you now wanted to drop the new 'gaia_ra' and 'gaia_dec' columns from the dfclean dataframe:

dfclean = dfclean.drop(axis=1,labels=['gaia_ra','gaia_dec'])

And finally, reset the index of the dataframe if you wish:

dfclean.reset_index(inplace=True)
print(dfclean)

      index          ra        dec w1mpro w2mpro japermag3
0         0 359.236279 -12.725109 17.272 16.054 18.901869
1         2 284.883958 -38.395691 16.439 15.193 18.039240
2         3 139.318738 -84.704269 16.401 15.210 18.056204
3         7 280.114715 -52.521655 17.024 15.997 18.839478
4         8 248.463188 -12.575940 16.787 15.605 18.447136
...     ...         ...        ...     ...     ...        ...
1071   2150 256.152790 -57.824500 17.885 16.537 18.543484
1072   2152 104.960263 -31.049966 17.067 16.064 18.070358
1073   2159   25.342299 -78.219318 18.258 16.945 18.946070
1074   2160 310.778549 -56.177660 17.158 16.146 18.146202
1075   2161 247.471449 -15.539732 17.527 16.455 18.455181

[1076 rows x 6 columns]

Note the reduced number of rows. You can now for instance write out the dataframe to a new CSV file, etc. And of course, do the same for your "not in nsc_dr within 2 arcseconds" constraints.

Best,
Robert

commented Jun 28, 2023 by robertdemo (9.3k points)
edited Jun 28, 2023 by 0

Thanks Robert. Especially for going into such detail. :)
I'm currently trying to figure out how this works and looking up the Pandas functions.

> Observe that gaia_ra and gaia_dec columns have 'NaN' for every row where there was no crossmatch in gaia_dr3.gaia_source. We can now use this trivially to remove those rows from the dataframe:

I'm basically looking for 'brown dwarf candidates' and a listing in Gaia would rule the object out or it already would have been discovered/mentioned. That's why my intent was to remove all objects that are listed in Gaia DR3. So if I get your approch right then the lines without a crossmatch (NaN) are actually the ones I'm interested in, right?

According to my Google search results for a possible solution it seems the isna() function should be the one to use in this case.

commented Jun 29, 2023 by martinkb (720 points)
edited Jun 29, 2023 by 0

I'm still working on my workflow and seem to have struck another problem. How do you get the 'dfclean' Pandas dataframe back into mydb? I was trying to use 'mydb_import' but the table which is created seems to have 0 rows. My current script looks like this:

###########################################################
import time
from getpass import getpass
from datetime import datetime
from dl import authClient as ac, queryClient as qc

def rm_gaia(tname):
    # remove all objects that are listed in Gaia DR3
    query = '''
            SELECT t.*, lat.*
            FROM %s AS t
            LEFT JOIN LATERAL (
                SELECT g.ra as gaia_ra, g.dec as gaia_dec
                        FROM
                            gaia_dr3.gaia_source AS g
                        WHERE
                            q3c_join(t.ra, t.dec, g.ra, g.dec, 0.0006)
                        ORDER BY
                            q3c_dist(t.ra, t.dec, g.ra, g.dec)
                        ASC LIMIT 1
                ) as lat ON true
            ''' % (tname)

    # table name of objects minus Gaia DR3 -> rm_nsc()
    tname_rm_gaia = '%s_g' % (tname)

    # query Gaia DR3 and return crossmatch as Pandas dataframe
    df = qc.query(sql=query, fmt='pandas', verbose=1)

    # remove all Gaia DR3 matches and remove Gaia RA & dec. columns
    dfclean = df[df['gaia_ra'].isna()]
    dfclean = dfclean.drop(axis=1, labels=['gaia_ra','gaia_dec'])

    # reset index after removing rows & dropping columns
    dfclean.reset_index(drop=True, inplace=True)

    # import dataframe into mydb
    qc.mydb_import(tname_rm_gaia, dfclean)

tname = 'mydb://catw_rjct_vhs_cmidx'
rm_gaia(tname)
###########################################################

'mydb://catw_rjct_vhs_cmidx' is the q3c-indexed mydb table (qc.mydb_index(tname,'', q3c='ra,dec', cluster=True) with the 2163 rows. dfclean returns 993 rows, but counting the rows in 'mydb://catw_rjct_vhs_cmidx_g' (notice the '_g' appendix) always returns '0'. What am I doing wrong?

By the way, do I have to q3c-index the new mydb table again or does it still hold the index from the original table?

commented Jun 30, 2023 by martinkb (720 points)
edited Jul 1, 2023 by 0

Thanks a lot for all the hints/tips Robert. :)

I was finally able to run my little python script and condensed the final table down to 175 rows. Did indeed find a new object with brown dwarf characteristics. Yay! :D (https://www.zooniverse.org/projects/marckuchner/backyard-worlds-planet-9/talk/666/2985419?comment=4903916)

I discovered a problem with the script though when I tried to run it on another dataset. It turns out that NSC DR2 doesn't use 'NaN' to indicate a missing magnitude for an object but instead lists a value of '99.990000'. In order to get only those objects that don't have a magnitude listed in g- or r-mag I used the condition 'dfclean = df[(df['nsc_gmag'] == 99.99) & (df['nsc_rmag'] == 99.99)]'.

The new dataset though returns only 'NaN' values when crossmatching in a 2./3600 radius with NSC DR2 which results in an empty pandas dataframe. That's when I noticed I'm loosing all the Non-matches from the crossmatch with my approach which is not what I intended.

So that's where I'm at right now. I'm looking for a solution to get:
• all objects that crossmatch with NSC DR2 and don't have a magnitude in g- AND r-band
• additionally keep the ones that don't crossmatch with NSC DR2 (for further checkup)

Currently reading Pandas documentation, StackOverflow, Google, … to come up with a better solution.

commented Jul 6, 2023 by martinkb (720 points)

How to filter out unwanted objects

Please log in or register to add a comment.

Your answer

1 Answer

Please log in or register to add a comment.

Categories