1.1.3. Survey datasets

Most Data Lab workflows begin with a question whose answer requires querying one or more survey catalog datasets. In the Data Lab, catalogs are stored in databases, with each catalog consisting of a number of separate but linked tables. These tables are accessed via Structured Query Language (SQL) or its astronomical variant, Astronomical Data Query Language (ADQL). From the outset, users are therefore presented with a set of challenges:

  • Learning what measurements the tables from a given survey dataset contain and what they are named

  • Learning how to construct a database query that will retrieve all of the measurements needed for a given question

  • If measurements from more than one table or more than one survey are needed, learning how to join tables in such a way that all of the information is retrieved

  • For complex questions in particular, learning how to optimize the database query for performance

For many users, the first step in answering a question through the Data Lab will thus be to learn about the particular datasets that it contains.
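As a concrete starting point, the sketch below composes a simple cone-search query string. The table name (`smash_dr1.object`), the column names, and the use of the Q3C `q3c_radial_query` function are assumptions based on common Data Lab usage; check the schema browser for what your dataset actually provides.

```python
# Minimal sketch: building an SQL cone-search string for a Data Lab catalog.
# Table, columns, and the Q3C function are illustrative assumptions.
def cone_query(table, ra0, dec0, radius_deg, columns="ra,dec"):
    """Return a cone-search query string around (ra0, dec0), all in degrees."""
    return (
        f"SELECT {columns} FROM {table} "
        f"WHERE q3c_radial_query(ra, dec, {ra0}, {dec0}, {radius_deg})"
    )

query = cone_query("smash_dr1.object", 180.0, -30.5, 0.05)
# To execute on the Data Lab (requires the astro-datalab package):
#   from dl import queryClient as qc
#   result = qc.query(sql=query)
```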

1.1.3.1. What kinds of datasets does the Data Lab contain?

1.1.3.1.1. Core datasets

These are large, high-value datasets served by the Data Lab, possibly with value-added data such as pre-computed columns or external-table crossmatches. Tables are optimized and indexed to support the most common science cases. Examples of current and upcoming datasets are DECaLS and the DESI Targeting Surveys, DES, the DESI survey, and the pixel data contained in NOIRLab's Astro Data Archive.

1.1.3.1.2. Hosted datasets

These are smaller-scale Survey Team or PI datasets: collections of high-level data products delivered by users who want to share the data via Data Lab services. They are relatively static in terms of release frequency and versions, but require some level of Data Lab operational support to be made available to community users. Examples are SMASH, GOGREEN and GCLASS, and GNIRS-DQS.

1.1.3.1.3. Reference datasets

These are large external datasets, mirrored through the Data Lab because of their value as photometric, spectroscopic, or astrometric references. Examples are SDSS, AllWISE, unWISE, Gaia, and USNO A/B.

1.1.3.2. Guidance on understanding table schema

Given the variety of datasets available through the Data Lab database, learning how to identify the tables and table columns of interest can be a challenge. There are several tools to help with this:

  • The Data Lab Data Explorer page contains a schema browser through which you can browse the available datasets, their tables, and the column descriptions.

  • The datalab command has a schema method that will display the schema and table descriptions.

  • The Survey Data webpage contains full dataset descriptions and links to survey documentation.

In general, the survey datasets hosted by the Data Lab contain a few kinds of tables:

  • Overview: These are tables that provide summary information of the survey, such as the spatial organization of the catalog data. These tables generally have many fewer rows than the main catalog tables, as they do not contain individual objects.

  • Object: These are typically the main catalog tables, and contain aggregated information for the astronomical objects identified by the survey. There are often views of these main tables that apply a constraint to yield subsets of objects with similar properties, e.g. the specobj view of SDSS DR17. The object tables are sometimes broken into several tables, each with different columns of information but linked by a unique object identifier.

  • Measurement: These are typically tables containing time-stamped individual measurements of objects in the main catalog tables, in general organized by having one row for every individual epoch of every individual object. While the number of columns in these tables is typically smaller than for the object tables, the number of rows can be much larger, and thus care should be exercised when pulling data from them.

  • Neighbors: These are specialized tables that contain information on all the internal spatial matches within a specified radius of all objects in the object table. Depending on the density of the objects on the sky and matching radius, these tables can be very large.

  • Crossmatch: These tables typically contain the spatially matched cross-identifications of the main object table with object catalogs from one or more external surveys.

  • Exposure: These tables typically contain metadata, such as calibration information, airmass, etc., for every individual exposure taken during the survey. By joining these data through the measurement and object tables, users can assign these metadata values to their objects of interest.

  • Chip: These tables are similar to the Exposure tables, but contain metadata relevant to the individual chips in the mosaics that make up the exposures, e.g. chip-dependent photometric calibration information.

1.1.3.3. Caveat: Data type discrepancies

Some data types in our TapSchema may not exactly match how they are stored in the underlying database. These discrepancies primarily affect columns labeled as DOUBLE, which in certain cases are implemented differently behind the scenes. While these differences generally do not impact standard usage, they may be relevant for advanced queries or data handling. The following subsections outline the specific cases where such mismatches occur and how they are represented.

1.1.3.3.1. NUMERIC

Be aware that some columns are labeled as DOUBLE in our TapSchema but are stored as NUMERIC in our database. If there is a mismatch, it will be noted in the column description. Column descriptions for each table are available on our Data Explorer page. For convenience, each column with the NUMERIC datatype is also listed below:

Table Name                                    Column Name    Database Datatype
allwise.x1p5__source__desi_dr1__zpix          id2            NUMERIC(39,0)
allwise.x1p5__source__desi_edr__zpix          id2            NUMERIC(39,0)
delve_dr3.decade_shear                        dnf_id1        NUMERIC(20,0)
desi_dr1.fiberassign                          id             NUMERIC(39,0)
desi_dr1.potential                            id             NUMERIC(39,0)
desi_dr1.target                               id             NUMERIC(39,0)
desi_dr1.x1p5__zpix__allwise__source          id1            NUMERIC(39,0)
desi_dr1.x1p5__zpix__gaia_dr3__gaia_source    id1            NUMERIC(39,0)
desi_dr1.x1p5__zpix__nsc_dr2__object          id1            NUMERIC(39,0)
desi_dr1.x1p5__zpix__sdss_dr17__specobj       id1            NUMERIC(39,0)
desi_dr1.x1p5__zpix__unwise_dr1__object       id1            NUMERIC(39,0)
desi_dr1.zpix                                 id             NUMERIC(39,0)
desi_dr1.ztile                                id             NUMERIC(39,0)
desi_dr1.ztile                                targetphotid   NUMERIC(39,0)
desi_edr.fiberassign                          id             NUMERIC(39,0)
desi_edr.potential                            id             NUMERIC(39,0)
desi_edr.target                               id             NUMERIC(39,0)
desi_edr.x1p5__zpix__allwise__source          id1            NUMERIC(39,0)
desi_edr.x1p5__zpix__gaia_dr3__gaia_source    id1            NUMERIC(39,0)
desi_edr.x1p5__zpix__nsc_dr2__object          id1            NUMERIC(39,0)
desi_edr.x1p5__zpix__sdss_dr17__specobj       id1            NUMERIC(39,0)
desi_edr.x1p5__zpix__unwise_dr1__object       id1            NUMERIC(39,0)
desi_edr.zpix                                 id             NUMERIC(39,0)
desi_edr.ztile                                id             NUMERIC(39,0)
desi_edr.ztile                                targetphotid   NUMERIC(39,0)
gaia_dr3.x1p5__gaia_source__desi_dr1__zpix    id2            NUMERIC(39,0)
gaia_dr3.x1p5__gaia_source__desi_edr__zpix    id2            NUMERIC(39,0)
nsc_dr2.x1p5__object__desi_dr1__zpix          id2            NUMERIC(39,0)
nsc_dr2.x1p5__object__desi_edr__zpix          id2            NUMERIC(39,0)
sdss_dr17.x1p5__specobj__desi_dr1__zpix       id2            NUMERIC(39,0)
sdss_dr17.x1p5__specobj__desi_edr__zpix       id2            NUMERIC(39,0)
unwise_dr1.x1p5__object__desi_dr1__zpix       id2            NUMERIC(39,0)
unwise_dr1.x1p5__object__desi_edr__zpix       id2            NUMERIC(39,0)
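One practical consequence is that a NUMERIC(39,0) identifier cannot be represented exactly by the 64-bit DOUBLE advertised in the TapSchema. A defensive option, sketched below, is to cast the identifier to text in the query itself so the full integer survives the round trip; the table and column names come from the list above, and the CAST approach is a general SQL suggestion rather than a Data Lab-specific API.

```python
# Sketch: casting a NUMERIC(39,0) identifier to text to preserve all digits.
# Table/column names are taken from the NUMERIC list; CAST syntax is standard SQL.
query = (
    "SELECT CAST(id AS TEXT) AS id_str, targetid "
    "FROM desi_dr1.zpix "
    "LIMIT 10"
)
```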

1.1.3.3.2. ARRAY

Be aware that some columns are labeled as DOUBLE in our TapSchema but are stored as ARRAY (or double precision[]) in our database. If there is a mismatch, it will be noted in the column description. Column descriptions for each table are available on our Data Explorer page. For convenience, each column with the ARRAY (or double precision[]) datatype is also listed below:

Table Name      Column Name    TapSchema Datatype    Database Datatype
desi_dr1.mws    covar          DOUBLE                double precision[]
desi_dr1.mws    elem           DOUBLE                double precision[]
desi_dr1.mws    elem_err       DOUBLE                double precision[]
desi_dr1.mws    param          DOUBLE                double precision[]

The example code below shows how to query these columns and convert them into array values:

from dl import queryClient as qc
import numpy as np
import pandas as pd
from io import StringIO

query = '''SELECT * FROM desi_dr1.mws
           LIMIT 10
        '''

# Save the query in default format (string with comma-separated values).

res = qc.query(sql=query)

# Starting from 'res' as the result from the qc.query() call in default
# format (csv string), the following properly formats the array columns.
# Those need to be listed manually.
# Define the converter to split the array values and convert to float.
# Note: the x[1:-1] gets rid of the opening and closing curly brackets
# in the CSV string.

float_converter = lambda x: np.array(x[1:-1].split(','), dtype='float')

# Define the converters for all array columns.

converters = {'covar': float_converter,
              'elem': float_converter,
              'elem_err': float_converter,
              'param': float_converter}

# Read into a pandas DataFrame.

df = pd.read_csv(StringIO(res), converters=converters)
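The same conversion can be sanity-checked locally, without issuing a query, on a small synthetic CSV string shaped like a Data Lab result (the `targetid` and `param` column names here are illustrative stand-ins):

```python
import numpy as np
import pandas as pd
from io import StringIO

# Synthetic stand-in for a qc.query() result: one array column ('param')
# serialized with curly brackets, plus a scalar column ('targetid').
fake_res = 'targetid,param\n1,"{4.5,2.1,-0.3}"\n2,"{5.0,1.9,0.1}"\n'

# Same converter as above: strip the curly brackets, split, cast to float.
float_converter = lambda x: np.array(x[1:-1].split(','), dtype='float')

df = pd.read_csv(StringIO(fake_res), converters={'param': float_converter})

print(df['param'].iloc[0].shape)  # → (3,)
```

Each cell of the `param` column is now a NumPy array rather than a string, so the usual array operations (slicing, `np.vstack`, etc.) apply directly.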