
I'm trying a J-W2 crossmatch between VHS DR5 and CatWISE but haven't been successful for weeks. My queries all seem either to fail with errors or to time out. I think I must be doing something wrong.

My most recent attempt was to prefilter both catalogs (CatWISE & VHS) to get smaller inputs for the crossmatch. That worked for CatWISE, but now even a relatively basic async query (without a q3c function or LATERAL JOIN) on VHS times out. The Jupyter notebook looked like this:

from dl import authClient as ac, queryClient as qc
from getpass import getpass

token = ac.login(input("Enter user name: (+ENTER) "),getpass("Enter password: (+ENTER) "))
if not ac.isValidToken(token):
    raise Exception('Token is not valid. Please check your username/password and execute this cell again.')

query =  '''SELECT ra2000, dec2000, japermag3, mergedclass
            FROM vhs_dr5.vhs_cat_v3
            WHERE japermag3 between 18 and 19
            AND (mergedclass=-1 OR mergedclass=-2)'''

result = qc.query(token=token, sql=query, fmt='csv',
                  out='mydb://vhsfiltered.csv',
                  async_=True, wait=True,
                  timeout=86400, poll=1800,
                  verbose=1)

print(result)

I've run a similar query with TAP via TOPCAT (https://datalab.noirlab.edu/tap) in the past and it always worked, even if it took a couple of hours and produced files too large to download comfortably. But it at least worked. Trying the same thing from a Jupyter notebook always times out and I don't understand why.

Am I maybe doing something wrong? Does logging out of the JupyterHub interfere with a running query? Am I supposed to 'Interrupt the kernel' (Stop button) before logging out and 'Run the selected cells and advance' (Play button) after logging back in? Do I have to stay logged in to JupyterHub for as long as the notebook takes to run? Is there something fundamentally wrong with my query?

I really could use some help.

asked Aug 31 by martinkb (520 points) | 62 views

1 Answer

0 votes

There are a couple of issues here that would explain what you are seeing.

First, your MyDB table name should not contain a ".csv" extension; MyDB outputs are database tables, not files, so that makes the name invalid. Use out='mydb://vhsfiltered' instead.

Second, the query uses the 'mergedclass' column as a constraint; however, this column is not indexed in the table, so a scan of the full 2TB table (1.4B rows) is required, which is inherently slow. For the moment I've added an index to that column to try to speed things up, but that doesn't solve the problem completely: your query result would still have ~207M rows, even though it's a small number of columns.
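A cheap sanity check before pulling any rows is to run the count(*) first as a sync query; a count still scans the table, but it avoids serializing and transferring ~207M rows. The little parsing helper below is just a sketch, based on qc.query() returning two-line CSV text ('count' header, then the value) by default:

```python
# Estimate the result size before requesting the rows themselves.
count_sql = '''SELECT count(*) FROM vhs_dr5.vhs_cat_v3
               WHERE japermag3 BETWEEN 18 AND 19
                 AND (mergedclass = -1 OR mergedclass = -2)'''

def parse_count(csv_text):
    # qc.query() returns CSV like 'count\n207000000' for a count(*) query;
    # pull out the integer on the second line.
    return int(csv_text.splitlines()[1])

# n = parse_count(qc.query(sql=count_sql))   # needs a live Data Lab session
```

If the count comes back in the hundreds of millions, that is your cue to stripe the query before trying to retrieve anything.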

When you submit a job through TOPCAT you are talking directly to the TAP service. A large part of the time is spent simply serializing those 207M rows into the XML VOTable that gets returned directly to the client (TOPCAT), assuming the TAP service doesn't fail with an out-of-memory error because it is also servicing other jobs. When you submit an async job from the queryClient, however, the TAP service still has to serialize the VOTable, but that is passed back to our query manager service, which then has to de-serialize it into the CSV returned to the client or uploaded into MyDB. That introduces a number of timeout issues as well as memory constraints.

We are aware that work needs to be done to better handle large result sets such as this, but in the meantime the recommended solution is to break the query up into smaller pieces, e.g. stripes of Declination, to limit the size of each result. This can also be done such that the stripes run as sync queries; the advantage is that this bypasses the VOTable serialization and works directly in the database, but the downside is that you then have to deal with multiple output tables.

For example, the following code runs in ~45min and queries the complete VHS catalog:

from dl import queryClient as qc
from datetime import datetime

def doit(d):
    # Format the query for a 3deg Dec strip based on the input value and
    # create a unique table name.
    query = '''SELECT ra2000, dec2000, japermag3, mergedclass
               FROM vhs_dr5.vhs_cat_v3
               WHERE japermag3 between 18 and 19
                 AND (mergedclass=-1 OR mergedclass=-2)
                 AND (dec2000 < %g and dec2000 >= %g)
    ''' % (float(d), float(max(-90, d - 3)))
    tname = 'mydb://vhs_%c%02d' % ('m' if d < 0.0 else 'p', abs(d))

    # Submit the query synchronously.  Dec strip size chosen to complete
    # within the timeout based on earlier tests.  Select the row count to
    # validate that the query completed properly, and log the progress.
    print(str(datetime.now()) + ' DEC between: %d and %d' % (d, d - 3))
    result = qc.query(sql=query, out=tname, verbose=1, timeout=600)
    nrows = qc.query(sql='select count(*) from %s' % tname).split()[1]
    print(str(datetime.now()) + '  %s  nrows = %s' % (tname, nrows))

# Loop over the VHS catalog in 3deg strips.  Max Dec is about +1.5, so move
# southward in each strip.
for d in range(2, -90, -3):
    doit(d)

It's clear we need to add an 'append' option to query() to allow output tables to be appended in this type of processing, but for the moment you'll at least have result tables you can query individually or concatenate yourself.
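One way to do that concatenation yourself is a single UNION ALL over the per-strip tables. This is a sketch: the 'vhs_all' output name is made up, and whether a UNION ALL across that many MyDB tables completes within the timeout is untested, so try it on two strips first:

```python
# Build one UNION ALL query over the per-strip tables created above.
def strip_table(d):
    # Same naming scheme as the loop above: vhs_p02, vhs_m01, ..., vhs_m88.
    return 'vhs_%c%02d' % ('m' if d < 0 else 'p', abs(d))

tables = [strip_table(d) for d in range(2, -90, -3)]
union_sql = '\nUNION ALL\n'.join('SELECT * FROM mydb://%s' % t
                                 for t in tables)

# qc.query(sql=union_sql, out='mydb://vhs_all', timeout=3600)  # live session
```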

As for your other questions: if you use the 'wait' option then the query() call remains active, so closing the tab or stopping the kernel is not recommended. Without the wait option, however, query() returns a 'jobID' value that can be used to check status or retrieve results even if you've stopped the notebook.
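A minimal sketch of that no-wait pattern (the qc.status()/qc.results() calls are commented out since they need a live session, and the status strings assume the usual UWS-style phases, so double-check against what qc.status() actually returns in your environment):

```python
# jobid = qc.query(sql=query, out='mydb://vhsfiltered', async_=True)  # returns at once
# print(jobid)   # save this string; it survives a notebook restart

def is_finished(status):
    # True once an async job has reached a terminal state.
    return status in ('COMPLETED', 'ERROR', 'ABORTED')

# Later -- even from a brand-new notebook session:
# while not is_finished(qc.status(jobId=jobid)):
#     time.sleep(600)
# data = qc.results(jobId=jobid)   # fetch once COMPLETED
```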

Hope this helps, post back if you still have questions.

answered Aug 31 by datalab (18,880 points)
First of all thank you for all the tips and the python example. :)

Like you said, I was able to query the whole VHS catalog and ended up with a lot of tables in mydb. I decided to apply the same procedure to the CatWISE catalog and split it into the same 3° strips from Dec +2 down to -90, just to further shrink the input catalogs for the crossmatch.

My final attempt after all the striping looked like this:
____________________________________________________________
from dl import queryClient as qc

query =  '''SELECT  c.*, v.japermag3
            FROM mydb://cat_p02 AS c, mydb://vhs_p02 AS v
            WHERE q3c_join(c.ra, c.dec, v.ra2000, v.dec2000, 0.0014)
              AND v.japermag3-c.w2mpro_pm>=2.4'''

#asynchronous query
jobID = qc.query(sql=query, fmt='csv', out='vos://catwxvhs_p02.csv', async_=True, timeout=86400)

print(jobID)
____________________________________________________________

"cat_p02" contains 889,910 rows and "vhs_p02" has 1,760,256, and those are among the smaller ones. A synchronous query times out after 600 seconds, and the async one is either still running after four days or refuses to time out.

Do you have any more tips on how to further speed up the process or optimize the query?
The output tables produced by the earlier queries are completely un-indexed, which means the crossmatch will be slow (e.g. the q3c_join() has no indexes to use on either table). Aside from the initial striping to get "small enough" sizes, you should index the tables on the columns needed for later matching. For example,

qc.mydb_index(token,'mydb://cat_p02','ra')
qc.mydb_index(token,'mydb://cat_p02','dec')
qc.mydb_index(token,'mydb://cat_p02','japermag3')
qc.mydb_index(token,'mydb://cat_p02','w2mpro_pm')
qc.mydb_index(token,'mydb://cat_p02','', q3c='ra,dec', cluster=True)

And do the same for the other tables as well.
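To avoid typing that for every strip, something like the loop below would work. It is only a sketch: the table-name pattern and the VHS coordinate columns (ra2000/dec2000) are taken from the earlier queries, so adjust if yours differ:

```python
# Generate the strip table names used above: cat_p02, cat_m01, ..., cat_m88.
def strip_names(prefix):
    return ['%s_%c%02d' % (prefix, 'm' if d < 0 else 'p', abs(d))
            for d in range(2, -90, -3)]

# for t in strip_names('cat'):
#     qc.mydb_index(token, 'mydb://' + t, '', q3c='ra,dec', cluster=True)
# for t in strip_names('vhs'):
#     qc.mydb_index(token, 'mydb://' + t, '', q3c='ra2000,dec2000', cluster=True)
```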
Thanks again for this tip about indexing; I honestly had no idea. Indexing isn't mentioned in the tutorial notebook named "How to do an efficient crossmatch of a user table with a table in Data Lab":

> https://github.com/astro-datalab/notebooks-latest/blob/master/04_HowTos/CrossmatchTables/How_to_crossmatch_tables.ipynb

It might be a good idea to add a helpful tip there that large queries might benefit from indexing with mydb_index.
