Small nested query runs forever

Question

505 views

I've saved a table of lightcurves to mydb and I'm trying to INNER JOIN some small, temporary tables to get subsets of that table. In doing so, I found my queries were running seemingly forever. I've been able to reproduce the issue with a minimal working example, but I don't understand it. mydb://temp is one of the small temporary tables, and mydb://stable_lcs is the big table holding all the lightcurves I'm interested in.

This query: SELECT objectid FROM mydb://temp LIMIT 2 returns results 182752_32789, 185281_28807 in less than a second.

This query: SELECT * FROM mydb://stable_lcs WHERE objectid IN ('182752_32789', '185281_28807') returns their lightcurves, as expected, also in less than a second.

If I nest these: SELECT * FROM mydb://stable_lcs WHERE objectid IN (SELECT objectid FROM mydb://temp LIMIT 2) the query runs seemingly forever. Moreover, if I stop the execution and rerun the first query, that query now takes forever too. Only after 5-10 minutes or so can I query mydb://temp again. Why?

asked Oct 19, 2024 by adriansh (170 points)
edited Oct 19, 2024 by 0

2 Answers

Answer 1 · 2024-10-21T16:42:48+0000

Hi, thanks for reaching out. Would you mind telling us your account name? You were not logged in when you asked your question, so we see it as submitted by "anonymous". If we know your account, we might be able to access your temp and stable_lcs tables to see what's going on.

Thanks!

Answer 2 · 2024-10-23T10:03:52+0000

To process the results of long-running queries, or of queries with relatively large result sets, use the asynchronous execution mode. Additionally, to speed up the query, you can try using an `INNER JOIN` statement instead. See details below:

1. Rewriting the query as a join instead of a subquery should speed up the execution:

SELECT *
FROM mydb://stable_lcs as slcs
INNER JOIN mydb://temp as tmp on tmp.objectid = slcs.objectid

2. Using the asynchronous option for query execution will allow the large result set to be dealt with separately. Note the `async_=True` below.

qid = queryClient.query(sql="""
    SELECT *
    FROM mydb://stable_lcs as slcs
    INNER JOIN mydb://temp as tmp on tmp.objectid = slcs.objectid
""", async_=True)

This returns a job ID for the query that can be used to retrieve the results when the query is finished.

queryClient.results(qid)

Here is a full Python example (without the login steps) that executes in async mode and waits for the query execution to finish:

import time
from dl import queryClient as query

# execute the query in async mode
qid = query.query(sql="""
    SELECT *
    FROM mydb://stable_lcs as slcs
    INNER JOIN mydb://temp as tmp on tmp.objectid = slcs.objectid
""", async_=True)

# Loop and wait for the query to finish. We cannot access query results
# until the query is done.
while (status := query.status(qid)) != 'COMPLETED':
    print("\r", f"Status: {status.ljust(9, ' ')};", end="")
    if status == 'ERROR':
        raise RuntimeError("Query execution failed")
    time.sleep(5)

# when the query is completed the results can be accessed directly
query_results = query.results(qid)
print(query_results[:100])
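
The loop above waits indefinitely if the job never reaches 'COMPLETED' or 'ERROR'. A minimal sketch of a bounded wait follows. The helper name `wait_for_query` and its parameters are hypothetical (not part of the Data Lab client), and treating 'ABORTED' as a terminal state is an assumption. The status function is passed in, so `query.status` from the example above can be used as `get_status`.

```python
import time


def wait_for_query(get_status, qid, timeout=600.0, poll=5.0, sleep=time.sleep):
    """Poll get_status(qid) until it returns 'COMPLETED'.

    get_status is any callable that returns a status string,
    for example query.status from the example above.
    Raises RuntimeError on a failed job and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(qid)
        if status == "COMPLETED":
            return status
        # 'ERROR' is taken from the example above;
        # 'ABORTED' is an assumed additional terminal state.
        if status in ("ERROR", "ABORTED"):
            raise RuntimeError(f"Query {qid} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Query {qid} still {status} after {timeout} s")
        sleep(poll)
```

Usage would look like `wait_for_query(query.status, qid, timeout=300)` followed by `query.results(qid)`.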

Let us know if any of these steps resolve the issue for you. We also have many example notebooks that might help you while using the Data Lab client.

Thanks!

answered Oct 23, 2024 by chadddemo (240 points)
edited Oct 23, 2024 by 0

Hi chadddemo,

Thanks for your answer, but this hasn't resolved my issue. I just retried my MWE from above using an INNER JOIN, but my queries still run forever. I first did:

qc.query(sql="SELECT objectid FROM mydb://temp LIMIT 2", out="mydb://sub_temp")

To just write the first two object IDs to a tiny table. Then I did

q = """SELECT * FROM mydb://stable_lcs AS s
INNER JOIN mydb://sub_temp AS t
ON s.objectid = t.objectid"""
df = qc.query(sql=q, fmt="pandas")

and this query timed out at 5 minutes. I can think of no reason why this query should take more than a second. If I hand-write the object IDs and use a where clause, like in my first post, I get results in less than a second.

Thanks.

commented Oct 23, 2024 by adriansh (170 points)

Hey adriansh,

I did some tests using these specific tables, and I think this has to do with how the query plans are generated for the two approaches. The execution plans and execution times differ between using a subquery and putting the IDs in the query directly.

If feasible, you could add the object IDs to the query before execution. It would be an extra step: a first query pulls the object IDs from the temp table, and you then interpolate those object IDs into the main query.
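
The interpolation step could be sketched roughly as follows. The helpers `quote_literal` and `build_in_query` are hypothetical names introduced for illustration, and the commented-out fetch step assumes the `qc.query(..., fmt="pandas")` call used earlier in this thread.

```python
def quote_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_in_query(table, column, ids):
    """Return a SELECT whose IN list holds the given IDs as string literals."""
    ids = list(ids)
    if not ids:
        raise ValueError("need at least one ID; an empty IN list is invalid SQL")
    id_list = ", ".join(quote_literal(i) for i in ids)
    return f"SELECT * FROM {table} WHERE {column} IN ({id_list})"


# Step 1 (assumed usage, not executed here): fetch the IDs from the small table.
#   ids = qc.query(sql="SELECT objectid FROM mydb://temp LIMIT 2",
#                  fmt="pandas")["objectid"]
# Step 2: interpolate them into the main query as literals.
sql = build_in_query("mydb://stable_lcs", "objectid",
                     ["182752_32789", "185281_28807"])
print(sql)
```

The resulting string has the same form as the hand-written query in the original question, which returned in under a second. For very long ID lists it may be worth splitting the IDs into batches and running one query per batch.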

Another option I came across is to convert the objects from the subquery into an array. This improves the plan (and execution time) since the subquery becomes part of the initialization and only executes once. Something like the following:

SELECT *
    FROM mydb://stable_lcs s
    WHERE s.objectid = ANY(
       (select array(select objectid
                     from mydb://temp
                     limit 2)
       )::text[]
    )

I will continue to look into this, but in the meantime hopefully one of the above suggestions will allow you to proceed with your work.

Let me know!

commented Oct 23, 2024 by chadddemo (240 points)
