Coding Period: Second Phase Part 2
Phase two of GSoC2020 is over, this blog is little late cos my laptop’s HDD died and I was unable to save my Hugo environment. Overall the summer has been really productive and the API design of the interface is finally reached beta period, let’s look at the progress.
Final API of the Interface
After several weeks of coding, design discussions, and code review, phase 1 of my project which consisted of designing API and all the functionality for the base raster interface is completed. In the last couple of weeks, I had to work on refactoring the methods to align them with the future design of the weights object, along with this there were some new test cases that I had to deal with as my earlier approach was focused only on GDAL raster formats and open_rasterio method which restricted the interface from being used with other datasets like NetCDF and therefore I’d to work on refactoring and thinking what will be the best way to introduce this functionality in the interface.
These new changes were finalized after project meeting with my mentors, they explained that it’ll be ideal if the support for other datasets is extended and the methods could be made more generalized. For this, a parameter dims was added to the weight builders, this argument will accept a dictionary, if dimensions of the DataArray does not match default dimensions which are ['band', 'time', 'lat', 'y', 'lon', 'x'].
e.g. dims dictionary:
|
|
After adding new fixes, I worked on adding documentation and tests which took me a while to get my head around. After a few iterations of code review by my mentors, they notified me that I need to follow the NumPy doc convention for adding docstring. And in the last week of phase 2, I was able to finalize most of the work, you can check this pr which itself documents my progress in the second phase.
I’ve provided an example notebook, which will try to cover the functionality offered by the raster interface.
Example Notebook for Raster awareness API
This notebook will give an overview of newly developed raster interface. We’ll cover basic usage of the functionality offered by the interface which mainly involves:
- converting
xarray.DataArrayobject to the PySAL’s weights object (libpysal.weights.W/WSP). - going back to the
xarray.DataArrayfrom weights object.
using different datasets:
- with missing values.
- with multiple layers.
- with non conventional dimension names.
|
|
Loading Data
The interface only accepts xarray.DataArray, this can be easily obtained from raster data
format using xarray’s I/O functionality which can read from a variety of data formats some of them are listed below:
- GDAL Raster Formats via
open_rasteriomethod. - NetCDF via
open_datasetmethod.
In this notebook we’ll work with NetCDF and GeoTIFF data.
Using local NetCDF dataset
In this small example we’ll build KNN distance weight object using a local NetCDF dataset with different dimensions names which doesn’t belong to the default list of dimensions.
We’ll also see how to speed up the reverse journey (from weights object to DataArray) by passing prebuilt coords and attrs to w2da method.
|
|
<xarray.Dataset>
Dimensions: (latitude: 73, longitude: 144, time: 62)
Coordinates:
* longitude (longitude) float32 0.0 2.5 5.0 7.5 ... 350.0 352.5 355.0 357.5
* latitude (latitude) float32 90.0 87.5 85.0 82.5 ... -85.0 -87.5 -90.0
* time (time) datetime64[ns] 2002-07-01T12:00:00 ... 2002-07-31T18:00:00
Data variables:
tcw (time, latitude, longitude) float32 ...
tcwv (time, latitude, longitude) float32 ...
lsp (time, latitude, longitude) float32 ...
cp (time, latitude, longitude) float32 ...
msl (time, latitude, longitude) float32 ...
blh (time, latitude, longitude) float32 ...
tcc (time, latitude, longitude) float32 ...
p10u (time, latitude, longitude) float32 ...
p10v (time, latitude, longitude) float32 ...
p2t (time, latitude, longitude) float32 ...
p2d (time, latitude, longitude) float32 ...
e (time, latitude, longitude) float32 ...
lcc (time, latitude, longitude) float32 ...
mcc (time, latitude, longitude) float32 ...
hcc (time, latitude, longitude) float32 ...
tco3 (time, latitude, longitude) float32 ...
tp (time, latitude, longitude) float32 ...
Attributes:
Conventions: CF-1.0
history: 2004-09-15 17:04:29 GMT by mars2netcdf-0.92
Out of 17 data variables we’ll use p2t for our analysis. This will give us our desired DataArray object da, we will further group da by day, taking average over the time dimension.
|
|
('day', 'latitude', 'longitude')
Weight builders (from_xarray, da2W, da2WSP) can recognise dimensions belonging to this list [band, time, lat, y, lon, x], if any of the dimension in the DataArray does not belong to the mentioned list then we need to pass a dictionary (specifying that dimension’s name) to the weight builder. We can see that the none of dimensions of da matches with the default dimensions ([band, time, lat, y, lon, x])
This means we have to create a dictionary mentioning the dimensions and ship it to weight builder, similar to our last example.
|
|
|
|
We can now speed up w2da by passing coords from the existing DataArray da which we used earlier.
Along with coords we can also pass attrs of the same DataArray this will help w2da to retain all the properties of original DataArray.
Let’s compare the DataArray returned by w2da and original DataArray. For this we’ll ship the derived data straight to w2da without any statistical analysis.
|
|
True
Using local GeoTIFF dataset
Up until now we’ve only played with netCDF datasets but in this example we’ll use a raster.tif file to see how interface interacts with it. We’ll also see how these methods handle missing data.
Unlike earlier we’ll use weight builder methods from raster.py, which we can call directly. Just a reminder that from_xarray uses methods from raster.py and therefore only difference exists in the API.
To access GDAL Raster Formats xarray offers open_rasterio method which uses rasterio as backend. It loads metadata, coordinate values from the raster file and assign them to the DataArray.
|
|
<xarray.DataArray (band: 1, y: 880, x: 940)>
[827200 values with dtype=float32]
Coordinates:
* band (band) int64 1
* y (y) float64 50.18 50.18 50.18 50.18 ... 49.45 49.45 49.45 49.45
* x (x) float64 5.745 5.746 5.747 5.747 ... 6.525 6.526 6.527 6.527
Attributes:
transform: (0.0008333333297872345, 0.0, 5.744583325, 0.0, -0.0008333...
crs: +init=epsg:4326
res: (0.0008333333297872345, 0.0008333333295454553)
is_tiled: 0
nodatavals: (-99999.0,)
scales: (1.0,)
offsets: (0.0,)
AREA_OR_POINT: Area
|
|
<matplotlib.collections.QuadMesh at 0x7f27287a3550>
We’ll look at how weight builders handle missing values. Firstly we’ll slice the DataArray to reduce overall size for easier visualization.
This time we’ll create WSP object using da2WSP method inside raster.py. Since our DataArray is single banded and all of its dimensions belong to the default list, we’ve to only ship the DataArray to the da2WSP.
|
|
After plotting both contiguities and sliced DataArray, we can see that the missing values are ignored by the da2WSP method and only indices of non missing values are stored in index attribute of WSP object.
|
|
Additional resources
What’s next?
My first deliverable is almost complete (few bugs here and there stills exists and beta testing is still going on). This coding phase is dedicated towards optimizing and integration of raster interface. Regarding optimization, I’ve succesfully built a faster raster weight builder using numba and COO_matrix, I’m working on the multicore implementation for the same. After finishing high performant weight builder, I’ll look into efficiently increasing the neighbors in the weights object.
Find out what happens next in the episode of GSoC2020… until then bye-bye đź‘‹.