GSoC2020 Project Report
Project Details
Field | Value |
---|---|
Project Title | Raster Awareness in PySAL |
Project Link | https://summerofcode.withgoogle.com/projects/#5775104799145984 |
Organization | NumFOCUS (Sub-Org: PySAL) |
Mentors | Stefanie Lumnitz, Dani Arribas-Bel, Levi John Wolf |
Contributed Repo | libpysal, splot |
Working Repo | Mgeeeek/libpysal |
Gitter Room | https://gitter.im/GSoC2020-PySAL/Raster_awareness_in_PySAL |
Project Blog | https://mgeeeek.github.io/tags/gsoc/ |
Project Description
From early on, PySAL
was designed with the focus of performing vector-based spatial analysis and therefore it didn’t have tools to handle input-output of large raster data. This restricted some folks in the geospatial community to use PySAL
’s analytical functionality over raster data. Furthermore, in recent years several geographic data organizations started releasing data in raster format which earlier came in vector format mostly because of advancement in computational capabilities and high storage availability. This led to an increase in the demand for the functionality offered by PySAL
to make it work with raster data.
Taking this into consideration, my main motive was to design and implement a lightweight interface which will provide the functionality for streamlining raster data access and making it more accessible to build the data structure accepted by the analytical methods of the PySAL library (mainly libpysal.weights.W/WSP
objects) from accessed raster data (which will be an instance of xarray.DataArray
). Ultimately, this functionality will open up the use of analytical methods like esda
, spatial regression
over raster data.
Following were the major milestones that we planned to finish for successfully building raster interface:
- Design quality APIs which followed existing structure used across PySAL.
- Deal with different properties of raster including missing data values, different data alignment, and multi-layer data.
- Add transformational methods to support conversion to and from PySAL weights object and
xarray.DataArray
object. - Integrate the interface with PySAL without causing any disturbance to other parts of the library.
- Study performance and memory efficiency of the conversion methods and optimize the interface to support computations for large raster datasets.
- Add proper documentation, tests, and an example notebook.
Since in this report we’ll talk a lot about raster data and weights object, let’s look at the visualization of these 2 things. Weights object in PySAL
is a graph like structure which expresses the relationship of a spatial point with its neighboring points. We can now take a look at the visual example below:
This figure consist of 2 plots of weights object (which are listed below) and 2 of raster data. These weights objects were obtained using a sliced raster. We can see the relationship of spatial points is different in Rook
and Queen
contiguity, interface dealt with nodatavals
while creating the weights object thats why we are seeing those empty pixels.
The development timeline for building the project was divided into 3 phases:
- API design phase
- Integration phase
- Optimization phase
API design phase
Quick Overview
Main goals achieved:
- Studied data structures provided by
xarray
and PySAL’sW
andWSP
objects. - Created an initial skeletal implementation of the interface.
- Iteratively designed and experimented with different APIs structures.
- Created a working prototype of the Base Raster Interface.
Pull Requests, Commits and Issues created:
- Issue: [open] Idea collection/discussion for Raster awareness project
- Commit: final API with working prototype
Blog posts published:
Contents:
- I/O dependency for accessing raster
- Planning main functionality for the interface
- Designing interface’s API (weights builder)
- Designing interface’s API (return journey to
xarray.DataArray
) - Iterative approach for API design
Detailed Summary
-
I/O dependency for accessing raster
The original plan involved using
rasterio
as the main library for raster data access but from the start of the community bonding period, my mentors suggested a shift fromrasterio
toxarray
. After going through the documentation ofxarray
it felt superior in terms of usability and accessibility as the features it provided outweighed that ofrasterio
and it also featuredrasterio
backend to access the raster data formats easily. Moreover, building the interface to work with the data structure provided by thexarray
library made the project available to a broader audience. Therefore I continued with thexarray
orxarray.DataArray
to be specific, as users can easily obtainxarray.DataArray
object from raster data format usingxarray
’s I/O functionality which can read from a variety of data formats some of them are listed below:- GDAL Raster Formats via
open_rasterio
method. - NetCDF via
open_dataset
method.
- GDAL Raster Formats via
-
Planning main functionality for the interface
After understanding the structure of
xarray.DataArray
, it was time to create an initial skeletal implementation of the interface which provided details of the functionality that the interface will offer. As described by Dani in this thread:- The interface would provide the functionality to create both weights object
libpysal.weights.W
&WSP
from user-provided data thoughxarray.DataArray
object. - It should also support the return journey of the data obtained after an analytical operation on the weights object to
xarray.DataArray
.
- The interface would provide the functionality to create both weights object
-
Designing interface’s API (weights builder)
Once the main functionality was finalized, we jumped on to the API designing part of the project. My mentors explained how PySAL was inclined towards an object-oriented paradigm and the majority of the operations are achieved through creating an instance of a particular object using the class method and then calling the instance method. But it was also in the transitioning phase of incorporating functional programming though it was not on the top of their priority list. This motivated us to provide the functionality of direct function calling to convert
DataArray
and create respective weights object, but also implement these methods as a class method using contiguity classes (Rook
andQueen
).1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
from libpysal.weights import raster, Rook import xarray as xr # Accessing raster data using xarray da = xr.open_rasterio("raster.tif") # Method 1 wsp1 = raster.da2WSP(da, criterion="rook") data1 = da.to_series()[wsp1.index] # Method 2 wsp2 = Rook.from_xarray(da, sparse=True) data2 = da.to_series()[wsp2.index] data1.equals(data2)
True
-
Designing interface’s API (return journey to
xarray.DataArray
)In the above example,
.from_xarray
method uses direct methods likeda2W
orda2WSP
under the hood to build weights object fromxarray.DataArray
. A new module calledraster
was introduced to hold all the functionalities offered by the interface, it also helped in structuring the interface by separating it with other areas of the library. Just like what we saw with.from_xarray
method, all the work related to the raster conversion is dispatched to one of the methods in theraster
module.A similar approach was taken for designing the API for the return journey. We implemented 2 types of methods:
toDataArray
which can be called directly.- Another way was to call the instance method
.to_xarray
which provided similar functionality to the direct methods.
In the later phase,
.to_xarray
method was removed due to several limitations because the structure of created weights object was different from the normal ones (more on this later), and therefore these methods would fail if the user passed the weights object not created using the interface. -
Iterative approach for API design
It took quite a few iterations to finalize the correct API of the base interface. Below are the early pull requests that I pushed to get feedbacks for the API of the transformational functions. These are far from clean as I was still experimenting with different design decisions.
- PR: [Closed] Raster Interface v1
- PR: [Closed] Raster Interface v2
- PR: [Closed] Raster Interface v3
Commit 9eb8311 was the point of the 1st phase where the majority of the APIs were locked. There were still some areas left where the community was divided particularly in structuring and naming the parameters of the methods, but these were minor issues that were solved in the later stages.
Integration phase
Quick Overview
Main goals achieved:
- Restructured interface by aligning the methods with future weights object design.
- Made
toDataArray
method work independent of raster metadata. - Added
KNN
weights support to the interface. - Fixed documentation, by following the NumPy-Doc convention for docstrings.
- Added unit-tests and an example notebook.
- Explored different ways to optimize the interface.
Pull Requests, Commits and Issues created:
- PR: [merged] Base raster interface
- Issue: [open] Idea collection/discussion for Raster awareness project
Blog posts published:
Contents:
- Refactoring weights object
- Refactoring
toDataArray
method - Adding distance-based weights object support
- Finalzing base raster interface
- Identifying areas to optimize
Detailed Summary
-
Refactoring weights object
In the second phase, we focused on how to integrate the interface with the libpysal. One major issue related to the integration was that the output weights of the transformation methods (
da2W
/da2WSP
) were different from the normal weights object, this was due to the extra attributes added to the weights object namelyattrs
andcoords
. This addition was implemented in the builder method to make the weights object more interactable withxarray.DataArray
as the user will want to access the raster data easily after building the weights from the DataArray. Thereforecoords
attribute stored the coordinates indices of non-missing values in the form ofpandas.MultiIndex
, andattrs
stored the metadata which was required when going back to thexarray.DataArray
. This issue was addressed in the community meeting. After a detailed meeting with community members regarding the weights object, one of the major change that was proposed was to shift towardsWSP
object (here are the meeting notes). Additionally, theid_order
attribute will be replaced byindex
which will contain an instance ofpandas.Index
aligned with theids
of the weight object.This required a few changes in my existing implementation:
- Renamed
coords
toindex
and removedattrs
attribute. - Added
index
parameter in the__init__
method ofWSP
class. - Fixed creation of
W
fromxarray.DataArray
which now only relies onda2WSP
method.
- Renamed
-
Refactoring
toDataArray
method, making it work with only weights object and independent of metadataA few of the major changes added during the second phase involved the
toDataArray
method. To make the interface more accessible we decided to only build theDataArray
from weights object and data values aligned to the weights object. This was tricky as the weights object does not contain any information related to the missing values except the coordinates in theindex
attribute. After several tries of my own, I shifted towards the source code of thepandas.MultiIndex
. From my earlier impressions, the data structure ofpandas.MultiIndex
was very different. Once I understood the architecture ofpandas.MultiIndex
I pushed a patch with the required changes, this patch refactored the logic and used thecodes
andlevels
property to construct all the coordinates of theDataArray
. Commit e5cd4e6 reflects the changes made targeting this method. LatertoDataArray
was split into 2 separate functionsw2da
andwsp2da
. We can see the use of this functionality below:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
from libpysal.weights import raster, Queen # Creating test DataArray da = raster.testDataArray( shape=(10, 56, 49), time=False, rand=False, missing_vals=True ) wsp = Queen.from_xarray( da=da, z_value=5, coords_label=None, sparse=True ) # Accessing data from da using index attribute data = da.to_series()[wsp.index] # Converting back to the DataArray da_back = wsp2da(data, wsp) da_back
<xarray.DataArray (band: 1, y: 56, x: 49)> -1 211 -1 -1 -1 41 37 -1 62 -1 ... -1 171 118 135 28 151 -1 -1 -1 146 Coordinates: * band (band) int64 5 * y (y) float64 66.1 66.2 66.3 66.4 66.5 ... 71.2 71.3 71.4 71.5 71.6 * x (x) float64 92.1 92.2 92.3 92.4 92.5 ... 96.5 96.6 96.7 96.8 96.9 Attributes: nodatavals: (-1,)
-
Adding distance-based weights object support
Rook
andQueen
contiguities were not sufficient when considered for further analysis on a raster data becauseRook
contiguity provided max 4 neighbors andQueen
provided max 8. To tackle this, the next step was to implement distance weight builder eitherKNN
orDistanceBand
. We started withKNN
, the initial implementation ofKNN
was very straight forward, we first converted cell centroids into points and then shipped them directly into theKNN
builder which is based onKDTree
. There were 2 major issues with this implementation first was very obvious as raster provided a regular lattice-like structure,KNN
could have been more optimized since there was no need for buildingKDTree
. Though the major issue was related to the ordering of the selected neighbors. Since in raster more than 2 neighbors can be equidistant from the focal point, the method inconsistently selected neighbors in a different environment. Therefore this was later removed from the project. -
Adding tests, documentation, example notebook, and finalizing base raster interface
Once all the components of the base raster Interface were finalized, we started working on the documentation and tests. From the beginning I followed the style of other methods inside libpysal when creating docstrings, luckily my mentors notified me early on to follow NumPy-Doc convention for docstrings. After fixing docstring we worked on adding unit-tests since we were not storing any raster dataset inside the project therefore we decided to create a
testDataArray
method (usage provided in the above example) which provided us with the different dataset to test interface correctly.I also added an example notebook that introduced the API of the interface and gave a brief overview of its use-case in different scenarios. The notebook can be accessed from here:
Commits 6565dd0 to 94ec4f4 contains the majority of work finalized in the second phase. Since the main API of the base interface was finalized, I created a pull request targeting the master branch of libpysal:
- PR: [merged] Base raster interface
Note: All my main work for the first and second phase lives inside this pull request. The commit history doesn’t look clean as for some reason I forced pushed a few changes.
-
Identifying areas to optimize for adding support for large rasters.
Along with working on the interface we were also identifying and exploring the different optimization methods to optimize the interface. This thread documents list of ideas we went through during exploration.
Optimization phase
Quick Overview
Main goals achieved:
- Shifted away from
lat2SW
-based weight builder. - Redesigned sparse matrix builder based on
COO_matrix
. - Numba-fied all possible methods using
njit
decorator. - Added multi-threaded implementation for WSP builder using
joblib
. - Added
higher_order
functionality to the modified weight builder. - Added raster-based weights object support inside
plot_spatial_weights
method insplot
.
Pull Requests, Commits and Issues created:
- PR: [WIP] Optimized raster-based weights builder
- PR: [WIP] Added raster weights plotting functionality
- Issue: [open] Distance Band functionality for Raster Interface weights
- Issue: [open] plotting for
libpysal
raster functionality
Blog posts published:
Contents:
- Moving on from
lat2SW
-based sparse weight builder - Optimizing the addition of higher-order neighbors
- Adding visualization support for the interface
Detailed Summary
-
Moving on from
lat2SW
-based (which usesdia_matrix
) to newcoo_matrix
-based sparse weight builderFollowing the idea-collection issue, the first target for optimizing the interface was the weight builders. The existing approach used the
lat2SW
method which provided the main functionality for constructing a sparse matrix for the regular lattice. The main issue while usinglat2SW
was when we’re dealing with data consisting of missing values, in this case, boolean indexing was used for removing missing rows and columns from the created sparse matrix. The performance takes a hit due to boolean indexing ofcsr_matrix
and this raised concerns for building weight objects for larger data. Memory consumption inlat2SW
is also high due to the use oflists
instead ofnp.array
.A proper explanation of this issue and how it was solved is given in this blog post.
Also, check out this notebook for a performance overview. These enhancements provided huge benefits both in terms of memory efficiency and faster performance. This also made the interface usable with large datasets.
-
Optimizing the addition of higher-order neighbors
In the previous section, I explained how
KNN
-based weight object was not very consistent when dealing with the raster-based data. Therefore, we decided to implementhigher_order
functionality, and 2 approaches were taken when adding the higher-order neighbors:-
Implementing the first approach was relatively easier than the second approach, as only a few changes were added inside the modified weights builder. An extra argument was provided which acted as a
threshold
and then using modular arithmetic conditions the algorithms added the neighbors lying inside thethreshold
radius also following contiguity criterion, though this implementation was different as it added higher order neighbors even if the lower order neighbors were missing and therefore it was more likeDistanceBand
. This thread provides some insightful discussion related to howDistanceBand
behaves with raster data: -
As for the second approach, we were able to implement a correct algorithm for
higher_order
functionality inside weight builder, this algorithm used recursion to traverse nth-order neighbors or nodes because raster can be assumed as a graph. Unfortunately, we could not incorporatenumba
andmulti-threading
to the method and therefore the algorithm was very inefficient. But after some testing, we were able to add a proper algorithm for using the method that Martin suggested in this thread.- Issue: higher_order weights of <= k
Since both approaches are different from each other we added
include_nas
argument which ifTrue
, computes weights using the first approach, and ifFalse
the builder uses the second approach. We can take a look at the visual example below which explains this more intuitively. Here we usedtestDataArray
method to create a small raster data filled with random and missing values.Fig. 2
shows theRook
contiguity withk=1
which means all first-order rook neighbors are selected.Fig. 3
shows theRook
contiguity withk=2
but withinclude_nas=False
which means allnon-missing
first and second-order rook neighbors are selected (this uses the second approach as described previously).include_nas
isTrue
forFig. 4
, we can see that in this figure all the first order neighbors are selected but while selecting second order neighbors, the method takes consideration of (missing + nonmissing) first-order neighbors. After completion of these methods, a pull request was created targeting the master branch oflibpysal
which added optimized weights builder andhigher_order
functionality. This pull request describes how these approaches are different from each other: -
This marked the end of the optimization phase.
-
Adding visualization support for the interface.
Along with optimization work, we also worked on adding the support for plotting functionality for a raster-based weight object. This required some changes inside the
plot_spatial_weights
method, mainly making it work with sparse weight objects instead of dense weight objects and adding ada
argument for passingDataArray
object.- Issue: [open] plotting for
libpysal
raster functionality - PR: [WIP] Added raster weights plotting functionality
The visualization aspect of the project is still in progress, once this pull request gets reviewed. I’ll add more enhancements which are listed in the issue.
- Issue: [open] plotting for
Next Phase (post-gsoc)
- Since the weight object is being refactored to remove the legacy part and adopt new features from newly available libraries like
pandas
andxarray
. I’ll need to reflect these changes to the interface once they are incorporated into thelibpysal
. - Additionally, the interface once merged will be in the alpha-testing phase and I’m expecting new bugs to pop-in which we were not able to test. Therefore I’ll have to be available to provide the bug fixes for a few initial months of merging the interface.
- However optimized the weights builder can be, there will be still a chance that the program will run out of memory as the raster size increases. One ambitious idea when working on the interface was related to lazily building the weights object. I did some early explorational prototyping for this part ([WIP] : Lazy weights object for raster interface) but it’ll require a massive change in the structure of the weights object probably a new class only supporting lazy operations. Once it is successfully implemented all the conversion and computation will be lazy which will allow users to use larger than memory rasters, albeit the computations will be slow.
- Even though
xarray
provides great support forDask
, the project has not yet taken advantage of or supportdask
structures. It’ll be a nice functionality to add to the interface in the future.
Huge thanks to PySAL Community
As a computer science undergrad this project was a breath of fresh air for me since I was relatively new to the PySAL library and the field of geospatial data science. I gained lots of experience working on this project especially working on somethings I wasn’t completely familiar with.
I like to thank my mentors Stefanie, Dani, Levi for taking the time to review the prs, clearing my doubts, and providing me with valuable feedback which helped me along the way of building this project. I also appreciate the work of other community members working on this cool library.
I’ll meet you in the new season of surviving 2020.
Mragank Shekhar