Cross matching long dataset using local Jupyter notebook

Question

739 views

Hi,

I have large >5GB CSV file and I would like to cross match that without cutting it to smaller pieces.

I am looking for method where I can use anaconda / Jupyter notebook to do this.

My CSV has 20 million rows and I would like to build a code that would query for example 100 000 row blocks and keep writing the result file locally?.

CSV file would have RA and DEC columns that I would like to xmatch with for example GAIA edr3 with radius 1arc sec.

Result: The file with all original rows completed with GAIA data for those objects that have gaia data available.

I hope this makes sense and I will be open for other solutions as well. I am able to cut the file to smaller pieces but that does not feel the best way.

Edit: probably better way would be to use FITS file? So I am open for that because I suck with fits files!

Best LaaLaa

asked Mar 22, 2022 by LaaLaa | 739 views

1 Answer

Answer 1 · 2022-03-22T13:44:31+0000

Hi, thanks for reaching out.

If you want to work in a NB on the DL Jupyter NB server, please first upload your CSV, e.g. to the NB space. The format should have one header line, followed by the data lines, e.g.:

ra,dec,source_id
2.503706331038843,53.16659819583169,396011034259836416
2.5165612839280316,53.171095728972396,396011072920766848
2.519798375886651,53.17298647736945,396011072916668800

Then, in the NB server, upload the table to your MyDB (that's the user relational database). The reason to do so is that then you can run the xmatch directly on the DB, which is faster and should work for 20M rows, without the need to split things. Here's how:

from dl import queryClient as qc
qc.mydb_import('mytable','./mycsvfile.csv')  # table name in mydb, csv file name on local disk

How to write the xmatch queries is shown e.g. in this DL example NB:

https://github.com/astro-datalab/notebooks-latest/blob/master/04_HowTos/CrossmatchTables/How_to_crossmatch_tables.ipynb

(in the section "Run the crossmatch"). Details on writing the SQL string will depend whether you want to keep all matches within the 1 arcsec or just the nearest one, and whether you want to keep rows that have no counterpart in gaia_edr.gaia_source table or not.

Once you have written an xmatch query, you can run it in the NB and write out the result to a new CSV file like this:

xmatchsql = "SELECT ..." # your xmatch query SQL, see the example NB
qc.query(xmatchsql,timeout=600,out='./myxmatch.csv')

BTW, I tried these steps with a 20M rows CSV file (from gaia_edr3 itself), and the qc.mydb_import() step took 22 minutes for me, while a crossmatch query on it vs gaia_edr3.gaia_source took 13 minutes.

Please don't hesitate to contact us again if needed.

Robert

Cross matching long dataset using local Jupyter notebook

Please log in or register to add a comment.

Your answer

1 Answer

Please log in or register to add a comment.

Categories