Tables

Warning

The Tables client is in an Alpha release. Descartes Labs reserves the right to modify this API without any guarantees about backwards compatibility.

The Tables module lets you organize, upload, and query tabular data and vector geometries.

It’s meant for data at the scale of millions of features (rows). Spatial, temporal, and attribute queries are performed server-side, which allows you to work locally with a “table expression” whose syntax is similar to a Pandas dataframe but which does not require the entire table in memory. Tables defers the actual computation and constructs a Pandas dataframe from the results.

A typical use for the Tables service is storing the output from Tasks. For example, a computer vision detector might be run in thousands of tasks over many years of data across a continent; the geographic objects it detects could be saved as features for later querying and analysis.

Installation

Note

tables will install geopandas (https://geopandas.org/en/stable/) as a dependency by default on Linux installations and skip its installation otherwise. If geopandas has not been previously installed on a non-Linux system, then pandas will be used. See the geopandas installation instructions (https://geopandas.org/en/stable/getting_started/install.html#installing-with-pip) for more information on installing on non-Linux systems.

Install with pip; a fresh virtual environment is recommended:

pip install "descarteslabs[tables]"

And confirm that the module was installed

>>> import descarteslabs as dl
>>> client = dl.tables.Tables()

Creating Tables

Note

Table names can only consist of alphanumeric and “_” characters.

Data types

Tables consist of one or more columns, each of which has a specific data type. The column’s type can be one of:

  • text UTF-8 encoded string (str can be used as an alias)

  • int 64-bit integer

  • float 64-bit float

  • json can be used to hold arbitrary structures (dicts, lists) which must be json-serializable.

  • date

  • datetime

  • bool

  • auto can be used to create an auto-incrementing counter

Geospatial vector geometries are handled slightly differently since there are many possible geometry types and spatial referencing systems. The supported geometry types mirror the GeoJSON specification (https://datatracker.ietf.org/doc/html/rfc7946#section-3.1).

  • Point

  • LineString

  • Polygon

  • MultiPoint

  • MultiLineString

  • MultiPolygon

Spatial Referencing

The supported spatial referencing systems are defined by an identifier, called an srid. There are thousands of supported projections and datums; see the reference (https://spatialreference.org/ref/epsg/) for more information. Some of the more common SRIDs you might encounter:

  • 4326 is the WGS84 geographic coordinate system

  • 3857 is the spherical web mercator projection commonly used in interactive tiled mapping

  • 326xx represents the northern-hemisphere UTM zone xx in WGS84

  • 327xx represents the southern-hemisphere UTM zone xx in WGS84
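For example, the SRID for the WGS84 UTM zone covering a point can be derived from its longitude and hemisphere. A minimal sketch (utm_srid is a hypothetical helper, not part of the Tables client):

>>> def utm_srid(lon, lat):
...     zone = int((lon + 180) // 6) + 1  # UTM zones are 6 degrees wide
...     return (32600 if lat >= 0 else 32700) + zone
...
>>> utm_srid(-105.0, 40.0)  # near Denver: UTM zone 13 north
32613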

To learn more about coordinate systems and the implications for your work, see Geospatial 101: Coordinate Systems.

Primary Key

Each table must have at least one column designated as the primary_key. This key is used as a unique identifier and must contain unique, non-null values. It provides an index for fast lookups and allows you to update existing rows. You can specify multiple columns to be used as a primary key (i.e. a composite key).

If you don’t want to maintain your own primary key, the auto datatype will allow the service to maintain it for you. Just omit the primary_key parameter when creating a table and the client will create one for you called auto_id. Note that you cannot use both auto and non-auto keys in the same table.

Examples

Putting it all together, we can create a schema which defines the table structure, column names, and types. In this case, we’re using the Airports dataset from the Natural Earth project (https://www.naturalearthdata.com/downloads/10m-cultural-vectors/airports/).

>>> TABLE_NAME = "airports"
>>>
>>> schema = {
...     "properties": {
...         "scalerank": "int",
...         "type": "text",
...         "name": "text",
...         "wikipedia": "text",
...     },
...     "geometry": "Point",
... }
>>>
>>> srid = 4326
>>>
>>> primary_key = "name"

And invoke the create_table method

>>> client.create_table(TABLE_NAME, schema=schema, srid=srid, primary_key=primary_key)

If you want to use multiple columns as the primary key, just specify it as a list:

>>> primary_key = ["featureid", "tile_key"]

If you want to use an auto incrementing primary key, you can define your schema like so:

>>> schema = {
...     "properties": {
...         "my_id": "auto",
...         "column": "text",
...     },
...     "geometry": "Point",
... }
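Then create the table, omitting the primary_key parameter so the service maintains the auto column for you. A sketch, assuming a hypothetical table name "example_auto":

>>> client.create_table("example_auto", schema=schema, srid=4326)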

If you want a table without a geometry column, use "geometry": None and srid = None when creating the table.

>>> schema = {
...     "properties": {
...         "featureid": "int",
...         "tile_key": "text",
...         "distance": "int",
...         "detected_date": "date",
...         "contextual_data": "json",
...     },
...     "geometry": None,  # non-spatial table, no geometry
... }
>>>
>>> srid = None  # non-spatial table, no srid
>>>
>>> primary_key = "featureid"
>>>
>>> client.create_table("example_nonspatial", schema=schema, srid=srid, primary_key=primary_key)

Inspecting Datasets

The Tables client has a handy function for extracting the schema and srid from known file types and Geo/DataFrames. The caveat here is that there are known issues with the libraries being used to get this information:

  • fiona and geopandas dataframes both have their quirks that prevent some types of information from translating as desired.

  • SRID handling doesn’t translate well, if at all, and is highly dependent on the cleanliness of input data.

>>> schema, srid = client.inspect_dataset("/path/to/file.csv")
>>> schema, srid = client.inspect_dataset(df)

Warning

You should NOT use inspect_dataset in automated scripts or without manually verifying the schema and SRID returned. inspect_dataset does its best to give you a schema and SRID in order to save time, but there are no guarantees that they are correct. After inspecting a dataset, it is incumbent upon the user to manually verify the schema and SRID to ensure they are valid.

Deleting a table

Deleting a table is permanent and will remove the table and all of its rows from your account and from the account of any Descartes Labs user it was shared with.

>>> client.delete_table("example_nonspatial")

Adding Data

Tables provides two mechanisms for loading data into an existing table: uploading from a file or inserting from a Geo/DataFrame.

Uploading from a file

>>> path = "/tmp/ne_10m_airports.geojson"
>>> jobid = client.upload_file(path, TABLE_NAME)

The data file must be readable with OGR (e.g. ogrinfo) and its schema must match the schema of the destination table. The supported file formats include:

  • ESRI shapefile in a .zip file archive

  • GeoJSON

  • Line-delimited GeoJSON, also known as GeoJSON Text Sequences

  • CSV

CSVs can contain an X and Y column which will be automatically cast to a point geometry.
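For example, a CSV matching the airports schema plus X/Y coordinate columns could be written and uploaded directly. A minimal sketch (the path and row values are illustrative):

>>> csv = (
...     "name,type,scalerank,wikipedia,X,Y\n"
...     "Example Int'l,major,2,,-104.67,39.86\n"
... )
>>> with open("/tmp/airports_sample.csv", "w") as f:
...     f.write(csv)
...
>>> jobid = client.upload_file("/tmp/airports_sample.csv", TABLE_NAME)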

Inserting from a dataframe

If you already have your data in memory as a pandas DataFrame or a geopandas GeoDataFrame, you can work with the dataframe directly without needing to write to a local file.

>>> import geopandas
>>> df = geopandas.read_file(path)

Note

The names and types of the dataframe columns must match the table schema.

Then invoke the insert_rows method

>>> jobid = client.insert_rows(df, TABLE_NAME)
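You can also build the dataframe in memory. A minimal sketch matching the airports schema (the row values are illustrative):

>>> from shapely.geometry import Point
>>> new_rows = geopandas.GeoDataFrame(
...     {
...         "scalerank": [2],
...         "type": ["major"],
...         "name": ["Example Int'l"],
...         "wikipedia": [""],
...     },
...     geometry=[Point(-104.67, 39.86)],
...     crs="EPSG:4326",
... )
>>> jobid = client.insert_rows(new_rows, TABLE_NAME)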

Waiting on jobids

The upload and insert calls above are asynchronous. Your raw data is uploaded to temporary storage and put into a queue for processing. To check the status of an upload or insert job, use the returned jobid and poll the service for the status

>>> # check once
>>> client.check_status(jobid)
>>>
>>> # check every 5 seconds
>>> client.wait_until_completion(jobid, poll_interval=5.0)

Upserts

When you add rows to a table, any row whose primary key is new will be inserted. If a row with the same primary key already exists, the data in your new row will overwrite the existing row.

Note

You can also insert partial rows to update a subset of a row. The primary key must be one of the columns being inserted in order to update the correct row.
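For example, to update only the scalerank of an existing airport, insert a partial row that includes the primary key. A minimal sketch:

>>> import pandas as pd
>>> partial = pd.DataFrame({
...     "name": ["Example Int'l"],  # primary key identifies the row to update
...     "scalerank": [1],           # the only column whose value will change
... })
>>> jobid = client.insert_rows(partial, TABLE_NAME)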

Data preparation

When inserting new rows, if your dataset is missing a column, those values are assumed to be None.

When inserting partial rows, if a column has been omitted in the dataset being inserted then the existing values are not altered.

If your dataset has additional columns that were not defined in the table schema, they will be ignored.

The spelling of columns in your dataset must be an exact match to the existing column names and is case-sensitive.

Your data must serialize to JSON, and that JSON representation must be castable to the internal data types listed above. Most scalar python types from the standard library will convert seamlessly, but you may need to pay close attention to timestamps, numpy objects, and other types that don’t have a direct JSON representation.
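For instance, numpy scalars and pandas timestamps can be converted to JSON-friendly types before inserting. A minimal sketch:

>>> import numpy as np
>>> import pandas as pd
>>> np.int64(42).item()  # numpy scalar -> plain Python int
42
>>> pd.Timestamp("2021-06-01").isoformat()  # timestamp -> ISO 8601 string
'2021-06-01T00:00:00'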

If your dataset contains vector geometries, the coordinates are assumed to be in the same srid as the table.

Querying Data

List tables

To list the tables available for querying

>>> client.list_tables()

Returns a dictionary of all tables you have access to, broken out by the level of permissions.

{
    'owner': ['airports'],
    'editor': {},
    'viewer': {}
}
  • owner is all the tables you created, associated directly with your account

  • editor shows all the tables owned by other descarteslabs users that you can read and write to.

  • viewer shows all the tables owned by other descarteslabs users that you can read.

See the Sharing Tables section below for information on how to share tables.

You can also filter tables matching a specific display name pattern.

>>> client.list_tables(name="africa*.json")

Returns a dictionary of the tables that match the search pattern that you have access to, broken out by the level of permissions.

{
    'owner': ['africa.json', 'africa-cairo.json'],
    'editor': {},
    'viewer': {}
}

Table Expressions

Ibis table expressions provide a pandas-like syntax that interacts with database backends with deferred (lazy) evaluation. You can build up complex query expressions without sending any data over the network. This allows you to work with very large datasets and aggregate/filter them until they are more modestly sized.

>>> table_expr = client.table(TABLE_NAME)
>>> table_expr

When you view the representation of the table expression in an interactive environment, you see that it’s not a dataframe but an abstract DatabaseTable object on which you can build the query.

DatabaseTable[table]
  name: airports
  schema:
    wikipedia : string
    type : string
    name : string
    scalerank : int64
    geom : point;0

Only when you execute() (https://ibis-project.org/docs/generated/ibis.expr.api.Expr.execute.html?highlight=execute#ibis.expr.api.Expr.execute) does the request get sent to the server, which runs the query and returns a pandas dataframe (for a non-spatial table) or a geopandas geodataframe (if there is a geometry column and geopandas is installed).

>>> # Take the first 10 rows
>>> table_expr.limit(10).execute()

In the case of a table expression that returns a single value, you get a python scalar (in this case an integer).

>>> table_expr.count().execute()

Example queries

Here are some common queries

  • Aggregate values

  • Aggregate by time period

  • Filter by scalar value comparison

  • Filter by spatial intersection

We’ll demonstrate how to answer these questions in the context of the airports example table.

First, we can aggregate to find summary statistics for a column

>>> table_expr.aggregate([
...     table_expr.scalerank.min().name("min_scalerank"),
...     table_expr.scalerank.mean().name("mean_scalerank"),
...     table_expr.scalerank.max().name("max_scalerank"),
... ]).execute()

Aggregate by timestamp, grouping by month. Note that our airports example doesn’t contain a date or datetime column; this query is only included as a theoretical example.

>>> table_expr.group_by([
...     table_expr.detected_date.month().name('detected_month'),
... ]).aggregate(
...     monthly_detections=table_expr.name.count(),
... ).execute()

Filter by scalar value

>>> table_expr[
...     (table_expr.scalerank >= 4) &
...     (table_expr.scalerank <= 6)
... ].execute()

Filter by spatial intersection with a polygon geometry.

>>> import ibis
>>> from shapely.geometry import box
>>> aoi = box(-109.06, 36.99, -102.04, 41.00)
>>>
>>> table_expr[
...     table_expr.geom.intersects(ibis.literal(aoi, type="polygon;4326"))
... ].execute()
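Because expressions compose lazily, filters and aggregations can be chained before anything is executed. For example, counting the airports within the same bounding box:

>>> table_expr[
...     table_expr.geom.intersects(ibis.literal(aoi, type="polygon;4326"))
... ].count().execute()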

This only scratches the surface of the capabilities of ibis; check out the Ibis Tutorial (https://ibis-project.org/docs/tutorial/index.html#tutorial) for additional examples. If you’re familiar with SQL, you may also wish to read the Ibis for SQL Programmers (https://ibis-project.org/docs/user_guide/sql.html) guide to see how SQL idioms translate to ibis.

Deleting Rows

Deleting features/rows from a table is as simple as providing a list of primary key ids to delete, if the primary key is a single column.

>>> jobid = client.delete_rows(
...     [
...         "San Francisco Int'l",
...         "Sky Harbor Int'l",
...         "Hartsfield-Jackson Atlanta Int'l",
...     ],
...     TABLE_NAME
... )
>>> client.wait_until_completion(jobid, poll_interval=5.0)

We can check that these rows are no longer in the table.

>>> len(client.table(TABLE_NAME).execute())
>>> "San Francisco Int'l" not in client.table(TABLE_NAME)["name"].execute()

For tables with composite primary keys, you will need to provide a list of primary keys as tuples. The order of the elements in each id is controlled by the pk_order argument.

# order primary keys by name first then type
client.delete_rows(
    ids=[('London Heathrow', 'major')],
    table_name="airports_composite_table",
    pk_order=["name", "type"]
)

# or the other way around
client.delete_rows(
    ids=[('major', 'London Heathrow')],
    table_name="airports_composite_table",
    pk_order=["type", "name"]
)

If an id does not match the type of the primary key, or if the pk_order field is incorrectly specified for composite primary keys, an error will be returned. However, providing an id that doesn’t exist won’t delete anything and will return successfully.

Limits

By default, there is a 250,000 row limit which avoids accidentally transferring huge amounts of data by e.g. downloading the entire table. You can override the default and set your own limit by adding .limit(x) to any of the table expressions above.
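For example:

>>> # override the default 250,000-row limit for this query only
>>> table_expr.limit(500_000).execute()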

Sharing Tables

Tables is integrated with Discover, which allows you to grant and revoke table access to other users on the Descartes Labs Platform. You will need to use the Discover client to share tables.

There are two types of roles:

  • editor has permission to upload and insert data to the table

  • viewer has permission to query

In order to share a table, first instantiate a Discover client:

>>> from descarteslabs.discover.client import Discover
>>> d = Discover()

Assuming your email is demo@descarteslabs.com and you want to share a table with someone:

>>> d.list_assets() # to get the name of the table, assume it's "asset/vector/mytable"
>>> d.table("asset/vector/mytable").share(with_="someone@company.com", as_="viewer")

The other party will use it by referring to the table by name plus the additional owner keyword.

>>> # someone@company.com's python code
>>> # should have access to execute queries (but not to insert data)
>>> tbl = client.table("asset/vector/mytable", owner="demo@descarteslabs.com")

And finally, you can revoke access using

>>> d.table("asset/vector/mytable").revoke(from_="someone@company.com")