EO Datasets 3¶
EO Datasets aims to be the easiest way to write, validate and convert dataset imagery and metadata for the Open Data Cube.
Write a Dataset¶
Here’s a simple example of creating a dataset with one measurement (called “blue”) from an existing image:
collection = Path('/some/output/collection/path')
with DatasetAssembler(collection) as p:
    p.product_family = "blues"

    # Date of acquisition (UTC if no timezone).
    p.datetime = datetime(2019, 7, 4, 13, 7, 5)
    # When the data was processed/created.
    p.processed_now()  # Right now!
    # (If not newly created, set the date on the field: `p.processed = ...`)

    # Write our measurement from the given path, calling it 'blue'.
    p.write_measurement("blue", blue_geotiff_path)

    # Add a jpg thumbnail using our only measurement for the r/g/b bands.
    p.write_thumbnail("blue", "blue", "blue")

    # Complete the dataset.
    p.done()
Note that until you call done(), nothing will exist in the dataset’s final output location. The dataset is written to a hidden temporary folder inside the output location, and renamed by done() once it is complete and valid.
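This temporary-folder-then-rename approach is a general pattern for atomic output. A minimal sketch of the idea using only the standard library (the helper name write_atomically is illustrative, not part of eodatasets):

```python
import os
import shutil
import tempfile
from pathlib import Path


def write_atomically(final_dir: Path, files: dict) -> None:
    """Write `files` (name -> bytes) into final_dir via a hidden temp folder.

    The final directory only appears once everything is written, mirroring
    how the assembler's done() renames its work folder into place.
    """
    parent = final_dir.parent
    parent.mkdir(parents=True, exist_ok=True)
    tmp = Path(tempfile.mkdtemp(prefix=".incomplete-", dir=parent))
    try:
        for name, data in files.items():
            (tmp / name).write_bytes(data)
        # os.rename is atomic on POSIX when both paths are on one filesystem,
        # so readers never see a half-written dataset directory.
        os.rename(tmp, final_dir)
    except Exception:
        shutil.rmtree(tmp, ignore_errors=True)
        raise
```

An incomplete run leaves only a hidden `.incomplete-*` folder to clean up; the final directory either exists fully populated or not at all.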
Custom STAC-like properties can also be set directly on the dataset:
p['fmask:cloud_cover'] = 34.0
Any known properties are automatically normalised:
p.platform = "LANDSAT_8"              # to: 'landsat-8'
p.processed = "2016-03-04 14:23:30Z"  # into a date.
p.maturity = "FINAL"                  # lowercased
p.properties["eo:off_nadir"] = "34"   # into a number
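The normalisation shown above can be imitated with plain Python. A rough sketch of the idea, assuming simple rules; these helpers are illustrative, not the library’s actual implementation:

```python
from datetime import datetime, timezone


def normalise_platform(value: str) -> str:
    # e.g. "LANDSAT_8" -> "landsat-8"
    return value.strip().lower().replace("_", "-")


def normalise_processed(value: str) -> datetime:
    # Parse an ISO-ish timestamp such as "2016-03-04 14:23:30Z" into a
    # timezone-aware UTC datetime.
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
```

The real assembler applies such rules automatically whenever a known property is assigned, so callers can pass values in whichever form they have.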
Most of the time our datasets are processed from an existing (input) dataset and have the same spatial information. We can add them as source datasets, to record the provenance, and the assembler can optionally copy any common metadata automatically:
collection = Path('/some/output/collection/path')
with DatasetAssembler(collection) as p:
    # We add a source dataset, asking to inherit the common properties
    # (eg. platform, instrument, datetime)
    p.add_source_path(level1_ls8_dataset_path, auto_inherit_properties=True)

    # Set our product information.
    # It's a GA product of "numerus-unus" ("the number one").
    p.producer = "ga.gov.au"
    p.product_family = "blues"
    p.dataset_version = "3.0.0"
We can write our new pixels as a numpy array, inheriting the existing grid spatial information (gridspec) from our input dataset:
# Write a measurement from a numpy array, using the source dataset's grid spec.
p.write_measurement_numpy(
    "ones",
    numpy.ones((60, 60), numpy.int16),
    GridSpec.from_dataset_doc(l1_ls8_dataset),
    nodata=-999,
)
Writing only metadata¶
The above examples copy the imagery, converting it to valid COG imagery. But sometimes you don’t want to touch your imagery at all; you only want to write metadata. We can use note_measurement() to refer to the image at its current path:
usgs_level1 = Path('datasets/LC08_L1TP_090084_20160121_20170405_01_T1')

with DatasetAssembler(dataset_location=usgs_level1) as p:
    p.product_family = "level1"
    p.datetime = datetime(2019, 7, 4, 13, 7, 5)

    # Note the measurement in the metadata. (instead of ``write``)
    p.note_measurement('red',
        usgs_level1 / 'LC08_L1TP_090084_20160121_20170405_01_T1_B3.TIF'
    )

    # Or relative to the dataset
    # (this will work unchanged on non-filesystem locations, such as ``s3://`` or tar files)
    p.note_measurement('blue',
        'LC08_L1TP_090084_20160121_20170405_01_T1_B3.TIF',
        relative_to_dataset_location=True
    )
Note that the assembler will throw an error if any measurements live outside the dataset location, as they will have to be recorded as absolute rather than relative paths. (Relative paths are considered best-practice for Open Data Cube.)
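The rule being enforced can be expressed with pathlib alone. A sketch, assuming a filesystem location; the function name path_for_metadata is illustrative, not the library’s API:

```python
from pathlib import Path


def path_for_metadata(
    measurement: Path, dataset_location: Path, allow_absolute: bool = False
) -> str:
    """Return the path to record in metadata for a measurement.

    Paths inside the dataset location are recorded relative to it
    (portable); anything outside is an error unless absolute paths
    are explicitly allowed.
    """
    try:
        return str(measurement.resolve().relative_to(dataset_location.resolve()))
    except ValueError:
        if not allow_absolute:
            raise ValueError(
                f"Measurement {measurement} is outside dataset "
                f"location {dataset_location}"
            )
        return str(measurement.resolve())
```

Relative paths keep a dataset relocatable: the whole folder can be moved, or served from a different URI scheme, without rewriting its metadata.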
You can allow absolute paths with the allow_absolute_paths field on construction:

with DatasetAssembler(
    dataset_location=usgs_level1,
    allow_absolute_paths=True,
):
    ...
API / Class¶
DatasetAssembler(collection_location=None, dataset_location=None, metadata_path=None, dataset_id=None, if_exists=<IfExists.ThrowError: 2>, allow_absolute_paths=False, naming_conventions='default')¶
__init__(collection_location=None, dataset_location=None, metadata_path=None, dataset_id=None, if_exists=<IfExists.ThrowError: 2>, allow_absolute_paths=False, naming_conventions='default')¶
Assemble a dataset with ODC metadata, writing metadata and (optionally) its imagery as COGs.
There are three optional paths that can be specified; at least one must be:

- A collection path: the root folder where datasets will live (in sub-[sub]-folders).
- A dataset location: each dataset has its own location, as stored in an Open Data Cube index. All paths inside the metadata are relative to this location.
- An output metadata document location.
If you’re writing data, you typically only need to specify the collection path, and the others will be automatically generated using the naming conventions.
If you’re only writing a metadata file (for existing data), you only need to specify a metadata path.
If you’re storing data using an exotic URI scheme, such as a ‘tar://’ URL path, you will need to specify this as your dataset location.
if_exists (IfExists) – What to do if the output dataset already exists? By default, throw an error.
allow_absolute_paths (bool) – Allow metadata paths to refer to files outside the dataset location. This means they will have to be absolute paths, and will not be portable. (default: False)
naming_conventions (str) – Naming conventions to use. Supports default or dea. The latter has stricter metadata requirements (try it and see: it will tell you what’s missing).
Record a reference to an additional file. Such as native metadata, thumbnails, checksums, etc. Anything other than ODC measurements.
By convention, the name should be prefixed with its category, such as ‘metadata:’ or ‘thumbnail:’.
add_source_dataset(dataset, classifier=None, auto_inherit_properties=False)¶
Record a source dataset using its metadata document.
It can optionally copy common properties from the source dataset (platform, instrument, etc.).
(see self.INHERITABLE_PROPERTIES for the list of fields that are inheritable)
auto_inherit_properties (bool) – Whether to copy any common properties from the dataset.
classifier – How to classify the kind of source dataset. This will automatically be filled with the family of the dataset if available (eg. “level1”).
You want to set this if you have two datasets of the same type that are used for different purposes, such as having a second level1 dataset that was used for QA (but is not the same scene).
See add_source_path() if you have a filepath reference instead of a document.
add_source_path(*paths, classifier=None, auto_inherit_properties=False)¶
Record a source dataset using the path to its metadata document.
See add_source_dataset() for the other parameters.
Cancel the package, cleaning up temporary files.
This works like DatasetAssembler.close(), but is intentional, so no warning will be raised for forgetting to complete the package first.
Clean up any temporary files, even if the dataset has not been written.
Write the dataset and move it into place.
It will be validated, metadata will be written, and if all is correct, it will be moved to the output location.
The final move is done atomically, so the dataset will only exist in the output location if it is complete.
Raises IncompleteDatasetError if any critical metadata is incomplete.
Returns: The id and final path to the dataset metadata file.
Record extra metadata from the processing of the dataset.
It can be any document suitable for yaml/json serialisation, and will be written into the sidecar “proc-info” metadata.
This is typically used for recording processing parameters or environment information.
Not recommended: this will likely change soon.
Iterate through the list of measurement names that have been written, and their current (temporary) paths.
TODO: Perhaps we want to return a real measurement structure here as it’s not very extensible.
An optional displayable string to identify this dataset.
These are often used when presenting a list of datasets, such as in search results or a filesystem folder. They are unstructured, but should be more humane than showing a list of UUIDs.
By convention they have no spaces, due to their usage in filenames.
A label will be auto-generated using the naming-conventions, but you can manually override it by setting this property.
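A label of this style is simply identifying fields joined without spaces. A hedged sketch of the convention (this is not the library’s actual naming code, and the field choices are illustrative):

```python
def make_label(producer: str, family: str, region: str, date: str) -> str:
    """Join identifying fields with underscores, avoiding spaces so the
    label is safe to use in filenames."""
    parts = [
        producer.split(".")[0],  # eg. "ga.gov.au" -> "ga"
        family,
        region,
        date,
    ]
    return "_".join(p.replace(" ", "-") for p in parts)
```

For example, a GA “blues” dataset over region 090084 might get a label like ga_blues_090084_2019-07-04, which remains readable in a folder listing.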
note_measurement(name, path, expand_valid_data=True, relative_to_dataset_location=False)¶
Reference a measurement from its existing file path.
(No data is copied, but geospatial information is read from it.)
note_software_version(name, url, version)¶
Record the version of some software used to produce the dataset.
write_measurement(name, path, overviews=(8, 16, 32), overview_resampling=<Resampling.average: 5>, expand_valid_data=True, file_id=None)¶
Write a measurement by copying it from a file path.
Assumes the file is gdal-readable.
name (str) – Identifier for the measurement.
overview_resampling (Resampling) – rasterio Resampling method to use.
expand_valid_data (bool) – Include this measurement in the valid-data geometry of the metadata.
write_measurement_numpy(name, array, grid_spec, nodata=None, overviews=(8, 16, 32), overview_resampling=<Resampling.average: 5>, expand_valid_data=True, file_id=None)¶
Write a measurement from a numpy array and grid spec.
The most common case is to copy the grid spec from your input dataset, assuming you haven’t reprojected.
p.write_measurement_numpy(
    "blue",
    new_array,
    GridSpec.from_dataset_doc(source_dataset),
    nodata=-999,
)
See write_measurement() for other parameters.
write_measurement_rio(name, ds, overviews=(8, 16, 32), overview_resampling=<Resampling.average: 5>, expand_valid_data=True, file_id=None)¶
Write a measurement by reading it from an open rasterio dataset.
ds (DatasetReader) – An open rasterio dataset.
See write_measurement() for other parameters.
write_measurements_odc_xarray(dataset, nodata, overviews=(8, 16, 32), overview_resampling=<Resampling.average: 5>, expand_valid_data=True, file_id=None)¶
Write measurements from an ODC xarray.Dataset
The main requirement is that the Dataset contains a CRS attribute and X/Y or lat/long dimensions and coordinates. These are used to create an ODC GeoBox.
dataset (Dataset) – An xarray dataset (as returned by dc.load() and other methods).
See write_measurement() for other parameters.
write_thumbnail(red, green, blue, resampling=<Resampling.average: 5>, static_stretch=None, percentile_stretch=(2, 98), scale_factor=10, kind=None)¶
Write a thumbnail for the dataset using the given measurements (specified by name) as r/g/b.
(the measurements must already have been written.)
A linear stretch is performed on the colour. By default this is a dynamic 2% stretch (the 2% and 98% percentile values of the input). The static_stretch parameter will override this with a static range of values.
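The dynamic percentile stretch can be sketched with numpy. A simplified illustration of the idea, not the library’s implementation:

```python
import numpy as np


def linear_stretch(band, percentile=(2, 98), static_range=None):
    """Rescale a band to 0-255 using a percentile (or static) linear stretch.

    By default the low/high cut points are the 2nd and 98th percentile
    values of the input; passing static_range=(low, high) overrides this
    with a fixed range, as the static_stretch parameter does.
    """
    if static_range is not None:
        low, high = static_range
    else:
        low, high = np.percentile(band, percentile)
    scaled = (band.astype("float64") - low) / max(high - low, 1e-9)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype("uint8")
```

Values below the low cut clip to 0 and values above the high cut clip to 255, so a few extreme pixels cannot wash out the contrast of the whole thumbnail.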
red (str) – Name of the measurement to put in the red band.
green (str) – Name of the measurement to put in the green band.
blue (str) – Name of the measurement to put in the blue band.
kind (str) – If you have multiple thumbnails, you can specify a ‘kind’ name to distinguish them (it will be put in the filename). Eg. GA’s ARD has two thumbnails, one of kind nbar and one of another kind.
scale_factor (int) – How many multiples smaller to make the thumbnail.
resampling (Resampling) – The rasterio.enums.Resampling method to use.