0 votes
178 views
I've saved a table of lightcurves to mydb and I'm trying to INNER JOIN some small, temporary tables to get subsets of that table. In doing so, I found my queries were running seemingly forever. I've been able to reproduce the issue with a minimal working example, but I don't understand it. mydb://temp is one of the small temporary tables, and mydb://stable_lcs is the big table holding all the lightcurves I'm interested in.

This query: SELECT objectid FROM mydb://temp LIMIT 2 returns results 182752_32789, 185281_28807 in less than a second.

This query: SELECT * FROM mydb://stable_lcs WHERE objectid IN ('182752_32789', '185281_28807')" returns their lightcurves, as expected, also in less than a second.

If I nest these: SELECT * FROM mydb://stable_lcs WHERE objectid IN (SELECT objectid FROM mydb://temp LIMIT 2) this query runs seemingly forever. Moreover, if I stop the execution and rerun the first query, now that query takes forever too. Eventually, like 5-10 minutes or so, I can query mydb://temp again. Why?
by adriansh (170 points)
edited by 0 | 178 views

2 Answers

0 votes
Hi, thanks for reaching out. Would you mind telling us your account name? (you were not logged in when you asked your question, hence we see it as submitted by "anonymous"). If we know your account we might be able to access your temp and stable_lcs tables to see what's going on.

Thanks!
by robertdemo (9.0k points)
Ah, sorry. It's adriansh
0 votes

In order to process the results for long running queries or queries with relatively large result sets use the asynchronous execution mode. Additionally, to speed up the query you can try using an `INNER JOIN` statement instead. See details below:

1. Adjusting the query to a join instead of a sub query should speed up the execution:

SELECT *
FROM mydb://stable_lcs as slcs
INNER JOIN mydb://temp as tmp on tmp.objectid = slcs.objectid

2. Using the asynchronous option for query execution will allow the large result set to be dealt with separately. Note the `async_=True` below.

qid = queryClient.query(sql="""
    SELECT *
    FROM mydb://stable_lcs as slcs
    INNER JOIN mydb://temp as tmp on tmp.objectid = slcs.objectid
""", async_=True)

This returns a job ID for the query that can be used to retrieve the results when the query is finished.

query.results(qid)

Here is a full Python example (without the login steps) that executes in async mode and waits for the query execution to finish:

import time
from dl import queryClient as query

# execute the query in async mode
qid = query.query(sql="""
    SELECT *
    FROM mydb://stable_lcs as slcs
    INNER JOIN mydb://temp as tmp on tmp.objectid = slcs.objectid
""", async_=True)

# Loop and wait for the query to finish. We cannot access query results
# until the query is done.
while (status := query.status(qid)) != 'COMPLETED':
    print("\r", f"Status: {status.ljust(9, ' ')};", end="")
    if status == 'ERROR':
        raise RuntimeError("Query execution failed")
    time.sleep(5)

# when the query is completed the results can be accessed directly
query_results = query.results(qid)
print(query_results[:100])

Let us know if any of these steps resolve the issue for you. We also have many example notebooks that might help you while using the Data Lab client.

Thanks!

by chadddemo (180 points)
edited by 0
Hi chadddemo,

Thanks for your answer, but this hasn't resolved my issue. I just retried my MWE from above using an INNER JOIN, but my queries still run forever. I first did:

qc.query(sql="SELECT objectid FROM mydb://temp LIMIT 2", out="mydb://sub_temp")

To just write the first two object IDs to a tiny table. Then I did

q = """SELECT * FROM mydb://stable_lcs AS s
           INNER JOIN mydb://sub_temp AS t
           ON s.objectid = t.objectid"""
df = qc.query(sql=q, fmt="pandas")

and this query timed out at 5 minutes. I can think of no reason why this query should take more than a second. If I hand-write the object IDs and use a where clause, like in my first post, I get results in less than a second.

Thanks.
Hey adriansh,

I did some tests using these specific tables and I think this has to do with how the query plans are being generated when using the two different approaches. There are different execution plans and execution times when using a sub query compared to using IDs in the query directly.

If feasible you could add the object IDs to the query before execution. It would be an extra step but your first query can pull the object ids from the temp table and then interpolate those object IDs into the main query.

Another option I came across is to convert the objects from the sub query into an array. This improves the plan (and execution time) since the sub query becomes part of the initialization and only executes once. Something like the following:

SELECT *
    FROM mydb://stable_lcs s
    WHERE s.objectid = ANY(
       (select array(select objectid
                     from mydb://temp
                     limit 2)
       )::text[]
    )

i will continue to look into this but in the meantime hopefully one of the above suggestions will allow you to proceed with your work.

Let me know!
Your array suggestion worked for the WHERE clause, thank you, but ultimately I want to do an INNER JOIN so I can use some other columns from temp as well. I still can't understand why the INNER JOIN or WHERE would run for so long. I did exactly this earlier on Friday and everything worked smoothly. I have used WHERE clauses with subqueries like this and INNER JOINs to pull lightcurves for 100k objects at a time from the nsc_dr2.meas table, which is huge, and they rarely ran for longer than 10 minutes. I pulled a subset of lightcurves into mydb precisely so I could speed up this workflow by avoiding querying the meas table over and over.
q = """SELECT * FROM nsc_dr2.meas AS m
           INNER JOIN mydb://sub_temp AS t
           ON m.objectid = t.objectid"""
df = qc.query(sql=q, fmt="pandas")

This just worked in one second. So it seems like the problem is with my stable_lcs table?
OK let me review this in more detail and with some other folks to see what we can find. They are different tables so the plans will differ but it does seem like something is needed in the stable_lcs table (maybe additional indexing).
Okay, thanks. I was using this method to pull lightcurves for 10k objects at a time from the stable_lcs table on friday and it was returning results in ~1 minute before slowing down inexplicably and it hasn't fixed itself since.
Just to provide some additional benchmarking:

I just tried your array suggestion without the LIMIT clause and it worked. I pulled lightcurves from stable_lcs for 9265 objectids in 45 seconds. This is consistent with the performance I saw Friday afternoon using an INNER join on temp and stable_lcs before something bugged out.
Hey adriansh,

Have you tried running mydb_index on the objectid column in both tables? This seems to improve performance quite a bit in my tests. It still slows down if there are many results but limited queries run relatively quick.

The syntax for mydb_index is as follows:
qc.mydb_index("temp", "objectid")

After that, a query using an inner join should finish in about a minute depending on the constraints:
SELECT s.objectid FROM  mydb://stable_lcs AS s
INNER JOIN  mydb://temp AS t on t.objectid = s.objectid
ORDER BY s.objectid
LIMIT 10000

Let me know and thank you for your patience.

460 questions

473 answers

475 comments

662 users

Welcome to Data Lab Help Desk, where you can ask questions and receive answers from other members of the community.

Categories