0 votes
222 views
Suppose I want lightcurves for something like 10 million objects. If I try to execute a query asynchronously, I have the same problem as another user where loading the results takes forever upon job completion. If I try to run the query synchronously and/or store the query results to mydb, I run into the query timeout. Am I forced to work in chunks?
by adriansh (170 points) | 222 views

1 Answer

0 votes
Hi can you post the specific query you are trying to run? Either here, or mail it to datalab@noirlab.edu please.

Robert
by robertdemo (8.9k points)
Sure, here is a quick example:

sub_query = """
SELECT id FROM nsc_dr2.object
    WHERE class_star > 0.9
    AND ndet > 10
    LIMIT 10000
"""

query = f"""
SELECT m.objectid, m.ra, m.dec, m.exposure, m.mjd, m.mag_auto, m.magerr_auto, m.filter, o.class_star
    FROM nsc_dr2.meas AS m, nsc_dr2.object AS o
        WHERE m.objectid IN ({sub_query}) AND m.objectid=o.id
"""

This takes about 12 seconds to run synchronously, but takes a very long time, perhaps forever (I've not sat around waiting long enough to find out), to load the results upon completion of the asynchronous query. Eventually I would like to increase the LIMIT to something like 10^7, or even remove it altogether, but I cannot run such a query synchronously and I can't load the results for 10^4 objects asynchronously.
Moreover, if I rewrite the same query using an INNER JOIN and submit it asynchronously, it just breaks:

sub_query = f"""
SELECT id, class_star FROM nsc_dr2.object
    WHERE class_star > 0.9
    AND ndet > 10
    LIMIT 10000
"""

query = f"""
SELECT m.objectid, m.ra, m.dec, m.exposure, m.mjd, m.mag_auto, m.magerr_auto, m.filter, o.class_star
    FROM nsc_dr2.meas AS m
    INNER JOIN ({sub_query}) AS o
    ON m.objectid=o.id

job_id = qc.query(sql=query, async_=True)
qc.status(job_id)
> 'ERROR'
Hi adriash,

We have a datalab branch that solves this problem, we haven't yet pushed it to the main branch since we want to get more feed back from users.

https://github.com/astro-datalab/datalab/tree/query_async_download

you can do a

git clone -b query_async_download https://github.com/astro-datalab/datalab

cd datalab

and install that version by running

python setup.py install

I tested it by running this query:

SELECT m.objectid, m.ra, m.dec, m.exposure, m.mjd, m.mag_auto, m.magerr_auto, m.filter, o.class_star
    FROM nsc_dr2.meas AS m, nsc_dr2.object AS o
        WHERE m.objectid IN (
SELECT id FROM nsc_dr2.object
    WHERE class_star > 0.9
    AND ndet > 10
    LIMIT 100000
) AND m.objectid=o.id

Notice the subquery's limit is 100k and returned almost 5 million rows. The async query took 16mins and downloading the results about 60s.

Let me us know if this help you.

Thank you,

Igor

452 questions

465 answers

474 comments

659 users

Welcome to Data Lab Help Desk, where you can ask questions and receive answers from other members of the community.

Categories