The Tables client is in an Alpha release. Descartes Labs reserves the right to modify this API without any guarantees about backwards compatibility.

The Tables module lets you organize, upload, and query tabular data and vector geometries.

It’s meant for data at the scale of millions of features (rows). Spatial, temporal, and attribute queries are performed server-side, which allows you to work locally with a “table expression” whose syntax is similar to a Pandas dataframe but which does not require the entire table in memory. Tables defers the actual computation and constructs a Pandas dataframe from the results.

A typical use for the Tables service is storing the output from Tasks. For example, a computer vision detector might be run in thousands of tasks over many years of data across a continent; the geographic objects it detects could be saved as features for later querying and analysis.



Installation

tables will install geopandas as a dependency by default on Linux installations and skip its installation otherwise. If geopandas has not been previously installed on a non-Linux system, then pandas will be used. See the geopandas installation instructions for more information on installing on non-Linux systems.

Install with pip; a fresh virtual environment is recommended:

pip install "descarteslabs[tables]"

And confirm that the module was installed:

>>> import descarteslabs as dl
>>> client = dl.tables.Tables()

Creating Tables


Table names can only consist of alphanumeric and “_” characters.
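As a quick sanity check, a name can be validated locally before calling the service. A minimal sketch (valid_table_name is a hypothetical helper, not part of the client):

```python
import re

def valid_table_name(name):
    # Alphanumeric characters and underscores only
    return bool(re.fullmatch(r"[A-Za-z0-9_]+", name))

valid_table_name("airports")   # True
valid_table_name("my-table")   # False: "-" is not allowed
```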

Data types

Tables consist of one or more columns, each of which has a specific data type. The column’s type can be one of:

  • text: UTF-8 encoded string (str can be used as an alias)

  • int: 64-bit integer

  • float: 64-bit float

  • json: arbitrary JSON-serializable structures (dicts, lists)

  • date

  • datetime

  • bool

  • auto: an auto-incrementing counter
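As a rough guide, the mapping from Python values to these column types can be sketched like this (guess_column_type is a hypothetical helper, not part of the client):

```python
from datetime import date, datetime

def guess_column_type(value):
    # Order matters: bool is a subclass of int, datetime a subclass of date
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, datetime):
        return "datetime"
    if isinstance(value, date):
        return "date"
    if isinstance(value, str):
        return "text"
    return "json"  # dicts, lists, and other JSON-serializable structures

guess_column_type("JFK")      # "text"
guess_column_type({"a": 1})   # "json"
```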

Geospatial vector geometries are handled slightly differently since there are many possible geometry types and spatial referencing systems. The supported geometry types mirror the GeoJSON specification:

  • Point

  • LineString

  • Polygon

  • MultiPoint

  • MultiLineString

  • MultiPolygon

Spatial Referencing

The supported spatial referencing systems are defined by an identifier, called an srid. There are thousands of supported projections and datums; see the reference for more information. Some of the more common SRIDs you might encounter:

  • 4326 is the WGS84 geographic coordinate system

  • 3857 is the spherical web mercator projection commonly used in interactive tiled mapping

  • 326xx represents the northern-hemisphere UTM zone xx in WGS84

  • 327xx represents the southern-hemisphere UTM zone xx in WGS84

To learn more about coordinate systems and the implications for your work, see Geospatial 101: Coordinate Systems.
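The 326xx/327xx patterns above can be computed directly from a coordinate. A minimal sketch (utm_epsg is a hypothetical helper, not part of the client):

```python
def utm_epsg(lat, lon):
    # UTM zones are 6 degrees wide, numbered 1-60 starting at 180 degrees west
    zone = int((lon + 180) // 6) + 1
    # 326xx for the northern hemisphere, 327xx for the southern
    return (32600 if lat >= 0 else 32700) + zone

utm_epsg(39.74, -104.99)   # Denver -> 32613 (UTM zone 13N)
utm_epsg(-33.87, 151.21)   # Sydney -> 32756 (UTM zone 56S)
```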

Primary Key

Each table must have at least one column designated as the primary_key. This key is used as a unique identifier and must contain unique, non-null values. It provides an index for fast lookups and allows you to update existing rows. You can specify multiple columns to be used as a primary key (i.e. a composite key).

If you don’t want to maintain your own primary key, the auto datatype will allow the service to maintain one for you. Just omit the primary_key parameter when creating a table and the client will create a column called auto_id. Note that you cannot use both auto and non-auto keys in the same table.


Putting it all together, we can create a schema which defines the table structure, column names, and types. In this case, we’re using the Airports dataset from the Natural Earth project.

>>> TABLE_NAME = "airports"
>>> schema = {
...     "properties": {
...         "scalerank": "int",
...         "type": "text",
...         "name": "text",
...         "wikipedia": "text",
...     },
...     "geometry": "Point",
... }
>>> srid = 4326
>>> primary_key = "name"

And invoke the create_table method

>>> client.create_table(TABLE_NAME, schema=schema, srid=srid, primary_key=primary_key)

If you want to use multiple columns as the primary key, just specify it as a list:

>>> primary_key = ["featureid", "tile_key"]

If you want to use an auto incrementing primary key, you can define your schema like so:

>>> schema = {
...     "properties": {
...         "my_id": "auto",
...         "column": "text",
...     },
...     "geometry": "Point",
... }

If you want a table without a geometry column, use "geometry": None and srid = None when creating the table.

>>> schema = {
...     "properties": {
...         "featureid": "int",
...         "tile_key": "text",
...         "distance": "int",
...         "detected_date": "date",
...         "contextual_data": "json",
...     },
...     "geometry": None,  # non-spatial table, no geometry
... }
>>> srid = None  # non-spatial table, no srid
>>> primary_key = "featureid"
>>> client.create_table("example_nonspatial", schema=schema, srid=srid, primary_key=primary_key)

Inspecting Datasets

The Tables client has a handy function for extracting the schema and srid from known file types and Geo/DataFrames. The caveat here is that there are known issues with the libraries being used to get this information:

  • fiona and geopandas dataframes both have their quirks that prevent some types of information from translating as desired.

  • SRID handling doesn’t translate well, if at all, and is highly dependent on the cleanliness of input data.

>>> schema, srid = client.inspect_dataset("/path/to/file.csv")
>>> schema, srid = client.inspect_dataset(df)


You should NOT use inspect_dataset in automated scripts or without manually verifying the schema and SRID returned. inspect_dataset does its best to infer a schema and SRID in order to save time, but there are no guarantees that they are correct. After inspecting a dataset, it is incumbent upon you to verify the schema and SRID manually.

Deleting a table

Deleting a table is permanent and will remove the table and all of its rows from your account and from the account of any descarteslabs user that it was shared with.

>>> client.delete_table("example_nonspatial")

Adding Data

Tables provides two mechanisms for loading data into an existing table: uploading from a file or inserting from a Geo/DataFrame.

Uploading from a file

>>> path = "/tmp/ne_10m_airports.geojson"
>>> jobid = client.upload_file(path, TABLE_NAME)

The data file must be readable with OGR (e.g. ogrinfo) and its schema must match the schema of the destination table. The supported file formats include:

  • ESRI shapefile in a .zip file archive

  • GeoJSON

  • Line-delimited GeoJSON, also known as GeoJSON Text Sequences

  • CSV

CSVs can contain an X and Y column which will be automatically cast to a point geometry.
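Conceptually, that cast turns each X/Y pair into a GeoJSON point feature. A rough sketch of the idea (the exact server-side behavior may differ):

```python
import csv
import io

csv_text = "x,y,name\n-122.37,37.62,San Francisco Int'l\n"

features = []
for row in csv.DictReader(io.StringIO(csv_text)):
    features.append({
        "type": "Feature",
        # X -> longitude, Y -> latitude in a GeoJSON point
        "geometry": {"type": "Point",
                     "coordinates": [float(row["x"]), float(row["y"])]},
        "properties": {"name": row["name"]},
    })
```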

Inserting from a dataframe

If you already have your data in memory as a pandas DataFrame or a geopandas GeoDataFrame, you can work with the dataframe directly without needing to write to a local file.

>>> import geopandas
>>> df = geopandas.read_file(path)


The names and types of the dataframe columns must match the table schema.

Then invoke the insert_rows method

>>> jobid = client.insert_rows(df, TABLE_NAME)

Waiting on jobids

The upload and insert calls above are asynchronous. Your raw data is uploaded to temporary storage and put into a queue for processing. To check the status of an upload or insert job, use the returned jobid and poll the service for the status:

>>> # check once
>>> client.check_status(jobid)
>>> # check every 5 seconds
>>> client.wait_until_completion(jobid, poll_interval=5.0)
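Under the hood, wait_until_completion amounts to a polling loop. A simplified, self-contained sketch (wait_for and the check callable are illustrative stand-ins, not the client’s API):

```python
import time

def wait_for(check, poll_interval=5.0, timeout=300.0):
    # Poll check() until it reports a terminal state or we time out
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check()
        if status in ("SUCCESS", "FAILURE"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")

# Example with a stubbed status function
statuses = iter(["PENDING", "RUNNING", "SUCCESS"])
result = wait_for(lambda: next(statuses), poll_interval=0.0)
```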


If your data has a primary key, any row that you add to the table will be inserted. If a row with the same primary key already exists, the data in your new row will overwrite the existing row.


You can also insert partial rows to update a subset of a row. The primary key must be one of the columns being inserted in order to update the correct row.

Data preparation

When inserting new rows, if your dataset is missing a column those values are assumed to be None.

When inserting partial rows, if a column has been omitted in the dataset being inserted then the existing values are not altered.

If your dataset has additional columns that were not defined in the table schema, they will be ignored.

The spelling of columns in your dataset must be an exact match to the existing column names and is case-sensitive.

Your data must serialize to JSON and that JSON representation must be cast-able to the internal data types listed above. Most scalar python types from the standard library will convert seamlessly but you may need to pay close attention to timestamps, numpy objects, and other types that don’t have a direct JSON representation.
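For instance, timestamps need to become ISO strings (and numpy scalars plain Python values) before they will serialize. A hedged sketch (to_json_safe is a hypothetical helper, not part of the client):

```python
import json
from datetime import date, datetime

def to_json_safe(value):
    # Dates and datetimes serialize as ISO-8601 strings
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    # numpy scalars expose .item() to get the plain Python equivalent
    if hasattr(value, "item"):
        return value.item()
    return value

row = {"detected_date": date(2021, 6, 1), "distance": 42}
payload = json.dumps({k: to_json_safe(v) for k, v in row.items()})
```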

If your dataset contains vector geometries, the coordinates are assumed to be in the same srid as the table.
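If your coordinates are in a different system than the table’s srid (say EPSG:4326 longitude/latitude versus an EPSG:3857 table), reproject them first. In practice a library like pyproj handles this; the underlying spherical web mercator math is a short sketch:

```python
import math

R = 6378137.0  # spherical earth radius in meters used by EPSG:3857

def lonlat_to_webmercator(lon, lat):
    # EPSG:4326 (degrees) -> EPSG:3857 (meters)
    x = math.radians(lon) * R
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

x, y = lonlat_to_webmercator(-104.99, 39.74)  # Denver, in meters
```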

Querying Data

List tables

To list the tables available for querying

>>> client.list_tables()

Returns a dictionary of all tables you have access to, broken out by the level of permissions.

{
    'owner': ['airports'],
    'editor': {},
    'viewer': {}
}

  • owner is all the tables you created, associated directly with your account

  • editor shows all the tables owned by other descarteslabs users that you can read and write to.

  • viewer shows all the tables owned by other descarteslabs users that you can read.

See the Sharing Tables section below for information on how to share tables.

You can also filter tables matching a specific display name pattern.

>>> client.list_tables(name="africa*.json")

Returns a dictionary of the tables that match the search pattern that you have access to, broken out by the level of permissions.

{
    'owner': ['africa.json', 'africa-cairo.json'],
    'editor': {},
    'viewer': {}
}

Table Expressions

Ibis table expressions provide a pandas-like syntax that interacts with database backends using deferred (lazy) evaluation. You can build up complex query expressions without sending any data over the network. This allows you to work with very large datasets and aggregate/filter them until they are more modestly sized.

>>> table_expr = client.table(TABLE_NAME)
>>> table_expr

When you view the representation of the table expression in an interactive environment, you see that it’s not a dataframe but an abstract DatabaseTable object on which you can build the query.

  name: airports
    wikipedia : string
    type : string
    name : string
    scalerank : int64
    geom : point;0

Only when you call execute() does the request get sent to the server, which runs the query and returns a pandas dataframe (for a non-spatial table) or a geopandas geodataframe (if there is a geometry column and geopandas is installed).

>>> # Take the first 10 rows
>>> table_expr.limit(10).execute()

In the case of a table expression that returns a single value, you get a python scalar (in this case an integer).

>>> table_expr.count().execute()

Example queries

Here are some common queries

  • Aggregate values

  • Aggregate by time period

  • Filter by scalar value comparison

  • Filter by spatial intersection

We’ll demonstrate how to answer these questions in the context of the airports example table.

First, we can aggregate to find summary statistics for a column

>>> table_expr.aggregate([
...     table_expr.scalerank.min().name("min_scalerank"),
...     table_expr.scalerank.mean().name("mean_scalerank"),
...     table_expr.scalerank.max().name("max_scalerank"),
... ]).execute()

Aggregate by timestamp, grouping by month. Note that our airports example doesn’t contain a date or datetime column; this query is only included as a theoretical example.

>>> table_expr.group_by([
...     table_expr.detected_date.month().name('detected_month'),
... ]).aggregate(
...     table_expr.count().name('n_detections')
... ).execute()

Filter by scalar value

>>> table_expr[
...     (table_expr.scalerank >= 4) &
...     (table_expr.scalerank <= 6)
... ].execute()

Filter by spatial intersection with a polygon geometry.

>>> import ibis
>>> from shapely.geometry import box
>>> aoi = box(-109.06, 36.99, -102.04, 41.00)
>>> table_expr[
...     table_expr.geom.intersects(ibis.literal(aoi, type="polygon;4326"))
... ].execute()

This only scratches the surface of the capabilities of ibis; check out the Ibis Tutorial for additional examples. If you’re familiar with SQL, you may also wish to read the Ibis for SQL Programmers guide to see how SQL idioms translate to ibis.

Deleting Rows

Deleting features/rows from a table is as simple as providing a list of primary key ids to delete if the primary key is a single column.

>>> jobid = client.delete_rows(
...     [
...         "San Francisco Int'l",
...         "Sky Harbor Int'l",
...         "Hartsfield-Jackson Atlanta Int'l",
...     ],
...     TABLE_NAME
... )
>>> client.wait_until_completion(jobid, poll_interval=5.0)

We can check that these rows are no longer in the table.

>>> len(client.table(TABLE_NAME).execute())
>>> "San Francisco Int'l" not in client.table(TABLE_NAME)["name"].execute()

For tables with composite primary keys, you will need to provide a list of primary keys as tuples. The order of the elements in each id is controlled by the pk_order argument.

>>> # order primary keys by name first then type
>>> jobid = client.delete_rows(
...     [('London Heathrow', 'major')],
...     TABLE_NAME,
...     pk_order=["name", "type"]
... )

>>> # or the other way around
>>> jobid = client.delete_rows(
...     [('major', 'London Heathrow')],
...     TABLE_NAME,
...     pk_order=["type", "name"]
... )

If an id does not match the type of the primary key, or the pk_order field is incorrectly specified for composite primary keys, an error will be returned. However, providing an id that doesn’t exist won’t delete anything and will return successfully.
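If the rows to delete start out as dictionaries, the composite id tuples can be built to match pk_order. A small illustrative sketch (not part of the client API):

```python
records = [
    {"name": "London Heathrow", "type": "major"},
    {"name": "London Gatwick", "type": "major"},
]
pk_order = ["name", "type"]

# Each id tuple must list the key columns in the same order as pk_order
ids = [tuple(record[key] for key in pk_order) for record in records]
```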


By default, there is a 250,000 row limit which avoids accidentally transferring huge amounts of data by e.g. downloading the entire table. You can override the default and set your own limit by adding .limit(x) to any of the table expressions above.

Sharing Tables

Tables is integrated with Discover, which allows you to grant and revoke table access to other users on the Descartes Labs Platform. You will need to use the Discover client to share tables.

There are two types of roles:

  • editor has permission to upload and insert data to the table

  • viewer has permission to query

In order to share a table, first instantiate a Discover client:

>>> from descarteslabs.discover import Discover
>>> d = Discover()

Assuming you want to share a table with another user:

>>> d.list_assets() # to get the name of the table, assume it's "asset/vector/mytable"
>>> # "colleague@example.com" is a placeholder for the other user's email
>>> d.table("asset/vector/mytable").share(with_="colleague@example.com", as_="viewer")

The other party will use it by referring to the table by name plus the additional owner keyword.

>>> # the other user's python code; "owner@example.com" is a placeholder
>>> # for the table owner's email
>>> # they should have access to execute queries (but not to insert data)
>>> tbl = client.table("asset/vector/mytable", owner="owner@example.com")

And finally, you can revoke access:

>>> # "colleague@example.com" is a placeholder for the other user's email
>>> d.table("asset/vector/mytable").revoke(from_="colleague@example.com")