Correcting an incorrectly georeferenced raster dataset

Published on 2024-06-11

Objective

To extend the fccsmap Python package to support looking up FCCS fuelbed information within Canada.

Context

fccsmap is a Python package developed by the Pacific Northwest AirFire team, used to look up FCCS fuelbed data within the continental United States and Alaska. It does this by loading a georeferenced raster dataset and checking which value sits at the relevant coordinates.

Prior work includes Miles Epstein adding the dataset fccs_canada.nc to fccsmap and wiring it up for use. However, fccsmap fails to use the dataset: when it's loaded, its dependencies complain that it lacks a geotransform matrix. The dataset does in fact carry a geotransform, but no GIS software I've used appears to recognize it. Specifically, the dataset has the following property in its header:

Band1#transform={1000,0,-2341249.999999993,0,-1000,3833423.97736665}

Clearly, fccsmap fails to recognize the geotransformation of this dataset, even though it recognizes those of the American datasets. Interestingly, PanoplyJ fails to recognize the geotransformations of both the American and Canadian datasets. Meanwhile, QGIS does recognize the geotransformation, but the coordinates it reports are well off from where they should be (the southern tip of Vancouver Island is placed almost 30 degrees east of where it's expected to be). Notably, the American FCCS datasets do not have this issue.
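As a reference point, and an assumption on my part: I'm reading those six numbers as affine coefficients in (a, b, c, d, e, f) order, i.e. a 1000 m pixel with a top-left origin in projected metres. Under that reading, the mapping a geotransform encodes can be sketched in plain Python, no GIS libraries involved:

```python
# The six values from Band1#transform, read as affine (a, b, c, d, e, f):
# a = pixel width, e = (negative) pixel height, (c, f) = top-left corner
a, b, c, d, e, f = 1000, 0, -2341249.999999993, 0, -1000, 3833423.97736665

def pixel_to_coords(col, row):
    """Map a (col, row) pixel index to projected (x, y) coordinates."""
    return a * col + b * row + c, d * col + e * row + f

print(pixel_to_coords(0, 0))  # the top-left corner, (c, f)
print(pixel_to_coords(1, 1))  # one 1000 m step right and down from it
```

If that reading is right, the attribute is internally sensible; the question is why no software picks it up.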

Definitions and ideas

Problem 1: Identifying the geotransform

Approach 1: Upgrading to a later version of GDAL

When using fccs_canada.nc, you get the following error:

>>> ca_lookup = FccsLookUp(is_canada=True)
>>> ca_lookup.filename
'/usr/local/lib/python3.8/dist-packages/fccsmap/data/fccs_canada.nc'
>>> ca_lookup.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
/usr/local/lib/python3.8/dist-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/fccsmap/lookup.py", line 179, in look_up
    stats = self._look_up(new_geo_data)
  File "/usr/local/lib/python3.8/dist-packages/fccsmap/lookup.py", line 305, in _look_up
    stats = zonal_stats(s, self.gridfile_specifier,
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/main.py", line 31, in zonal_stats
    return list(gen_zonal_stats(*args, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/main.py", line 156, in gen_zonal_stats
    fsrc = rast.read(bounds=geom_bounds)
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/io.py", line 289, in read
    win = bounds_window(bounds, self.affine)
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/io.py", line 150, in bounds_window
    row_start, col_start = rowcol(w, n, affine)
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/io.py", line 141, in rowcol
    r = int(op((y - affine.f) / affine.e))
OverflowError: cannot convert float infinity to integer

This error is reminiscent of this issue in the rioxarray package, which the developers identified as a bug in GDAL, fixed in this pull request. It's possible that updating to GDAL 3.6.1 would fix the issue; I am indeed using an outdated version of GDAL.

To test this, I decided to install the latest version of fccsmap using the latest version of GDAL in a Docker container, using the following Dockerfile:

FROM ubuntu:22.04

WORKDIR /root

# Installing tzdata, which is a dependency of one of the following packages,
# needs a little configuration to run non-interactively. This manually
# sets the timezone data it expects
RUN ln -fs /usr/share/zoneinfo/America/Vancouver /etc/localtime

# Setting DEBIAN_FRONTEND here prevents undesired interactive prompts when
# installing certain packages (namely tzdata)
RUN apt-get update && DEBIAN_FRONTEND=noninteractive \
    apt-get install -y \
    python3 python3-dev python3-pip python3-venv \
    libnetcdf-dev libproj-dev libgdal-dev \
    python3-numpy python3-gdal \
    libxml2-dev libxslt1-dev

# Create a python virtual environment and put its bin directory on PATH
RUN python3 -m venv /root/venv
ENV PATH="/root/venv/bin:$PATH"

# Install pip dependencies. The include paths must be set with ENV (a RUN-level
# `export` doesn't persist past its own instruction), and the GDAL python
# bindings must be pinned to the system library's version
RUN pip install numpy
ENV CPLUS_INCLUDE_PATH=/usr/include/gdal
ENV C_INCLUDE_PATH=/usr/include/gdal
RUN pip install gdal==`gdal-config --version`

RUN pip install --no-binary gdal --extra-index-url https://pypi.airfire.org/simple fccsmap==4.1.2

Unfortunately, using the latest version of GDAL didn't resolve the issue, and it failed with a similar error.

Approach 2: Identifying why the geotransformation isn't recognized

Why isn't GDAL recognizing the geotransformation? I suspect it's because the netCDF driver is searching for it somewhere it's not. (If I understand the driver correctly, it expects a space-separated GeoTransform attribute, which doesn't match the comma-and-brace Band1#transform attribute shown earlier.) To find out, we can trace the call stack to see where specifically it's looking. Recall from earlier:

>>> ca_lookup.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
/usr/local/lib/python3.8/dist-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.

Following the chain of dependencies, we find this line which seems to be throwing the warning:

# rasterio_v1.3/__init__.py
dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)

The DatasetReader class can in turn be found in rasterio/io.py; it extends DatasetReaderBase in rasterio/_io.pxd, which in turn extends DatasetBase. That's pretty much where our journey in Pythonland ends, because from here on out everything is defined in the GDAL C++ library.

We can get a bit closer by searching the codebase for NotGeoreferencedWarning, which takes us to rasterio/_base.pyx and the read_transform method:

def read_transform(self):
    """Return the stored GDAL GeoTransform"""
    cdef double gt[6]

    if self._hds == NULL:
        raise ValueError("Null dataset")
    err = GDALGetGeoTransform(self._hds, gt)
    if err == GDALError.failure and not self._has_gcps_or_rpcs():
        warnings.warn(
            ("Dataset has no geotransform, gcps, or rpcs. "
            "The identity matrix will be returned."),
            NotGeoreferencedWarning)

    return [gt[i] for i in range(6)]

Maybe it's worth looking into using GCPs or RPCs if this rabbit hole doesn't take us anywhere.

The key here is GDALGetGeoTransform, which is in fact a reference to a C function: one that takes a handle to a dataset and a pointer to a location in memory where the geotransform matrix will be written. It's declared in the gdal.h header.

This is effectively a dead end for me; tracing the GDAL codebase adds a lot of complexity to this project, particularly now that I have a few other leads that might work better.

Approach 3: Manually configuring the GeoTransform

Approach 2 taught me a lot about how rasterio and GDAL work under the hood. fccsmap exposes the GDAL dataset reader, and I might be able to leverage that to dynamically configure the dataset. Having done that, I might be able to export it as well.

import rioxarray
import rasterio
from fccsmap.lookup import FccsLookUp
old_ds_filename = FccsLookUp(is_canada=True)._filename
old_ds = rasterio.open(old_ds_filename)
## /usr/local/lib/python3.10/dist-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
##   dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)

# One-liner:
# new_ds = rasterio.open('./modified_fccs_canada.nc', 'w', width=old_ds.width, height=old_ds.height, count=old_ds.count, nodata=0, dtype='float32', crs=old_ds.crs, transform=rasterio.transform.Affine(1000,0,-2341249.999999993,0,-1000,3833423.97736665))

new_ds = rasterio.open('./modified_fccs_canada.nc', 'w',
  driver='netCDF',
  dtype='float32',
  width=old_ds.width,
  height=old_ds.height,
  crs=old_ds.crs,
  transform=rasterio.transform.Affine(1000,0,-2341249.999999993,0,-1000,3833423.97736665),
  count=old_ds.count,
  nodata=0)
## Traceback (most recent call last):
##   File "<stdin>", line 1, in <module>
##   File "/usr/local/lib/python3.10/dist-packages/rasterio/env.py", line 451, in wrapper
##     return f(*args, **kwds)
##   File "/usr/local/lib/python3.10/dist-packages/rasterio/__init__.py", line 230, in open
##     raise RasterioIOError(
## rasterio.errors.RasterioIOError: Blacklisted: file cannot be opened by driver 'netCDF' in 'w' mode

rasterio raises this RasterioIOError when the driver doesn't support writing directly, as described in the Writing Datasets section of the rasterio documentation. The implications aren't immediately obvious: if you leave off the driver='netCDF' argument, this code works fine. Presumably, in that case, it's not writing NetCDF at all but falling back to GeoTIFF, which is better supported (new lead: can we use GeoTIFF instead of NetCDF?).

To quote the documentation:

Some other formats that are writable by GDAL can also be written by Rasterio. These use an IndirectRasterUpdater which does not create directly but uses a temporary in-memory dataset and GDALCreateCopy() to produce the final output.

The Raster drivers page in the GDAL documentation lists the NetCDF driver as one supporting "Creation," though it doesn't elaborate on what that means. The documentation for the driver itself claims the NetCDF driver supports both reading and writing.

If you run the same command, dropping the driver kwarg, you get:

new_ds = rasterio.open('./modified_fccs_canada.nc', 'w', width=old_ds.width, height=old_ds.height, crs=old_ds.crs, count=old_ds.count, nodata=0, dtype='float32', transform=rasterio.transform.Affine(1000,0,-2341249.999999993,0,-1000,3833423.97736665))
new_ds.driver
## 'netCDF'

Maybe this is a bug? We're going to run with this for now.

From here, we can get the raster data of fccs_canada.nc and write it to our new dataset:

raster = rioxarray.open_rasterio(old_ds)
new_ds.write(raster)
new_ds.close()

l = FccsLookUp(fccs_fuelload_file='./modified_fccs_canada.nc')
l.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
## Traceback (most recent call last):
##   [...]
## OverflowError: cannot convert float infinity to integer

This has brought us back to where we started. However, I lucked out and made a discovery after landing here: I remembered that this error can also result from looking up a value outside the domain of the dataset. For example, take the known-working continental American dataset:

us = FccsLookUp()
us.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
## Traceback (most recent call last):
##   [...]
## OverflowError: cannot convert float infinity to integer

This fails because the request is being made for a point in Canada—a point in space outside continental US.
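A guess at the mechanism, based on my reading of the traceback rather than anything fccsmap documents: reprojecting a point that lies outside the target projection's valid domain can yield infinite coordinates, and the row/column arithmetic in rasterstats then chokes on truncating infinity to an integer. The arithmetic alone, in plain Python:

```python
# Mirror of the line that raises in rasterstats/io.py:
#   r = int(op((y - affine.f) / affine.e))
def row_index(y, f=3833423.97736665, e=-1000.0):
    return int((y - f) / e)

print(row_index(3_000_000.0))   # an in-domain y gives a sane row index
try:
    row_index(float('inf'))     # an out-of-domain point reprojected to inf
except OverflowError as err:
    print(err)                  # cannot convert float infinity to integer
```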

Previously, I'd noticed that when I loaded the dataset in QGIS, the coordinates seemed to be more eastern than they should have been. In light of this, I tried:

l.look_up({'type': 'MultiPoint', 'coordinates':[[51, -80]]})
{'fuelbeds': OrderedDict([('0.0', {'percent': 100.0, 'grid_cells': 480249})]), 'units': 'm^2', 'sampled_grid_cells': 480249, 'sampled_area': 240128428465.43958}

Of course, the next thing that came to mind was to check this against the existing fccs_canada.nc dataset:

ca = FccsLookUp(is_canada=True)
ca.look_up({'type': 'MultiPoint', 'coordinates':[[51, -80]]})
## Traceback (most recent call last):
##   [...]
## OverflowError: cannot convert float infinity to integer

In the end, fiddling with the dataset was not in fact a red herring: we've marshalled it into working with fccsmap. For completeness, here's the code to generate it:

import rioxarray
import rasterio
from fccsmap.lookup import FccsLookUp

old_ds_filename = FccsLookUp(is_canada=True)._filename
old_ds = rasterio.open(old_ds_filename)

# Create a new dataset carrying the geotransform from the Band1#transform
# header, interpreted in affine (a, b, c, d, e, f) order
new_ds = rasterio.open('./modified_fccs_canada.nc', 'w',
  width=old_ds.width,
  height=old_ds.height,
  crs=old_ds.crs,
  count=old_ds.count,
  nodata=0,
  dtype='float32',
  transform=rasterio.transform.Affine(1000, 0, -2341249.999999993,
                                      0, -1000, 3833423.97736665))

# Copy the raster data over and flush both datasets to disk
raster = rioxarray.open_rasterio(old_ds)
new_ds.write(raster)
old_ds.close()
new_ds.close()

Remaining question: what will it take to fix the offset? Resolving this and patching it into fccsmap will completely resolve the problem.

Problem 2: Correcting the geotransform

Approach 1: Georeferencing

For this, we should be able to use the Georeferencer tool in QGIS. Canada is big, so I figured I'd add a good number of GCPs, which took a while but was worth it: I got the accuracy to within about one or two pixels, which I suspect is reasonable for now. The next move was to export the layer as a NetCDF file, scp it to the server, and mount it as a volume in the container to try it out.

When I did, I got this error:

>>> l = FccsLookUp(fccs_fuelload_file='./better_fccs_canada.nc')
>>> l.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xarray/backends/file_manager.py", line 211, in _acquire_with_cache_info
    file = self._cache[self._key]
  File "/usr/local/lib/python3.10/dist-packages/xarray/backends/lru_cache.py", line 56, in __getitem__
    value = self._cache[key]
KeyError: [<function open at 0x7f9978ce70a0>, ('./better_fccs_canada.nc',), 'r', (('sharing', False),), 'dbded168-54a3-4bf7-8ae3-a8450fda19f1']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 310, in rasterio._base.DatasetBase.__init__
  File "rasterio/_base.pyx", line 221, in rasterio._base.open_dataset
  File "rasterio/_err.pyx", line 221, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: './better_fccs_canada.nc' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 135, in look_up
    stats = self._look_up(new_geo_data)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 22, in _
    r = func(*args,  **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/lookup.py", line 90, in _look_up
    raster = rioxarray.open_rasterio(self._filename)
  File "/usr/local/lib/python3.10/dist-packages/rioxarray/_io.py", line 1124, in open_rasterio
    riods = manager.acquire()
  File "/usr/local/lib/python3.10/dist-packages/xarray/backends/file_manager.py", line 193, in acquire
    file, _ = self._acquire_with_cache_info(needs_lock)
  File "/usr/local/lib/python3.10/dist-packages/xarray/backends/file_manager.py", line 217, in _acquire_with_cache_info
    file = self._opener(*self._args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/rasterio/env.py", line 451, in wrapper
    return f(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/rasterio/__init__.py", line 304, in open
    dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
  File "rasterio/_base.pyx", line 312, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: './better_fccs_canada.nc' not recognized as a supported file format.

Something about this error, though it's hard to articulate what, made me think it had to do with the NetCDF file being mounted as a volume. So, I tried adding a COPY directive to the Dockerfile to bake the file directly into the image, and tried again:

>>> from fccsmap.lookup import FccsLookUp
>>> l = FccsLookUp(fccs_fuelload_file='./better_fccs_canada.nc')
>>> l.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
Traceback (most recent call last):
  [...]
OverflowError: cannot convert float infinity to integer

Ah yes, our old friend. We know this error occurs when the point is outside the domain (in this case, it really shouldn't be; the point is around the BC/Alberta border). I suspect the reason we were getting it before, on our "malformed" dataset, had to do with GDAL not recognizing the geotransform, so that 51 N, 116 W was actually being interpreted as, say, (51, -116) on the cartesian grid of pixels, which is probably out of the domain. That shouldn't be the problem now, though; GDAL really does seem to recognize the geotransform, and we can verify that like so:

>>> import rasterio
>>> ds = rasterio.open('./better_fccs_canada.nc')
>>> ds.get_transform()
[-185.9183, 0.01839601254850161, 0.0, 84.47139999999999, 0.0, -0.01839601518026565]

That's something; it's definitely not the identity matrix.
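To make the earlier identity-fallback hypothesis concrete (the coordinate below is invented for illustration; it's not the actual reprojected value of any particular point): with the identity transform, projected coordinates pass straight through as pixel indices, while a real transform maps them to sensible cells:

```python
def coords_to_pixel(x, y, a=1.0, c=0.0, e=1.0, f=0.0):
    """Invert an axis-aligned geotransform to (col, row); the defaults are
    the identity fallback that rasterio warns about."""
    return int((x - c) / a), int((y - f) / e)

# A made-up point in the grid's projected CRS, hundreds of km from the
# origin. Under the identity fallback it "indexes" an absurd cell:
print(coords_to_pixel(-1_400_000.0, 700_000.0))

# Under the transform from the Band1#transform header, the same point maps
# to a plausible cell of a 1000 m grid:
print(coords_to_pixel(-1_400_000.0, 700_000.0,
                      a=1000.0, c=-2341249.999999993,
                      e=-1000.0, f=3833423.97736665))  # (941, 3133)
```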

The next thing that came to mind was that maybe the coordinates needed to be flipped. When I was working in QGIS, it seemed to read and interpret geographic coordinates in the opposite order from what I expected, so why not try that:

>>> l.look_up({'type': 'MultiPoint', 'coordinates':[[-116, 51]]})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 135, in look_up
    stats = self._look_up(new_geo_data)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 22, in _
    r = func(*args,  **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/lookup.py", line 94, in _look_up
    return self._look_up_in_file(geo_data_df, self._filename)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 22, in _
    r = func(*args,  **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 242, in _look_up_in_file
    stats = zonal_stats(geo_data_df, filename,
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/main.py", line 36, in zonal_stats
    return list(gen_zonal_stats(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/main.py", line 171, in gen_zonal_stats
    fsrc = rast.read(bounds=geom_bounds, boundless=boundless)
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/io.py", line 349, in read
    new_array = self.src.read(
  File "rasterio/_io.pyx", line 627, in rasterio._io.DatasetReaderBase.read
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 633. GiB for an array with shape (1, 412166, 412120) and data type float32

Wow. 633 GiB?? That's a lot of gibibytes. What on earth about this dataset is making rasterstats want to allocate 633 GiB? At first, I thought maybe it was failing to make some kind of optimization, but the number itself is consistent with the reported shape: a non-sparse array of 412166 by 412120 float32 cells works out to 412166 × 412120 × 4 bytes, which is about 633 GiB. So the real question is the shape. This file is only 121.9 MiB, and the grid can't actually be anywhere near 412166 pixels on a side; even with absolutely no compression, the dataset should be closer to 1 GiB. Something is making rasterstats compute an absurdly oversized read window.
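Double-checking that arithmetic in plain Python (the 5000-pixel figure below is just my ballpark for a Canada-wide 1 km grid, not the file's actual dimensions):

```python
rows, cols = 412166, 412120
nbytes = rows * cols * 4            # float32 is 4 bytes per cell
print(nbytes / 2**30)               # about 632.8 GiB: the number checks out

# For contrast, a ballpark Canada-wide grid at 1 km resolution:
print(5000 * 5000 * 4 / 2**20)      # about 95 MiB
```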

This could be an interesting line of inquiry, but the goal here is to get something that works. Earlier, I figured out the new geotransform of the dataset. What if I rewrote the old one, but used the new geotransform this time?

So, I tried following the procedure established in the first part, replacing the geotransform with (-185.9183, 0.01839601254850161, 0.0, 84.47139999999999, 0.0, -0.01839601518026565), and I got:

>>> l.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 135, in look_up
    stats = self._look_up(new_geo_data)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 22, in _
    r = func(*args,  **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/lookup.py", line 94, in _look_up
    return self._look_up_in_file(geo_data_df, self._filename)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 22, in _
    r = func(*args,  **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 242, in _look_up_in_file
    stats = zonal_stats(geo_data_df, filename,
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/main.py", line 36, in zonal_stats
    return list(gen_zonal_stats(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/main.py", line 171, in gen_zonal_stats
    fsrc = rast.read(bounds=geom_bounds, boundless=boundless)
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/io.py", line 308, in read
    win = bounds_window(bounds, self.affine)
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/io.py", line 150, in bounds_window
    row_start, col_start = rowcol(w, n, affine)
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/io.py", line 142, in rowcol
    r = int(op((y - affine.f) / affine.e))
ZeroDivisionError: float division by zero

Subsequently, I tried exporting the dataset as a GeoTIFF file and that made no difference.

After a bit more investigation, I realized that ds.get_transform() and ds.transform return two different things: the former a GDAL-ordered list, the latter an Affine object, and their six coefficients are arranged in different orders. Initially, I'd passed the list's elements positionally into the affine slots, which lands a 0.0 in the pixel-height position and produces the division-by-zero error. Using the Affine object instead, I got the allocation error again. This closes the door on the ZeroDivisionError and takes us back to where we started.
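The mix-up is easy to reproduce without any GIS libraries. A GDAL geotransform list is ordered (c, a, b, f, d, e), so unpacking it positionally into affine (a, b, c, d, e, f) slots drops a 0.0 into e (the expected row index below is my own calculation from the reported numbers):

```python
# The list reported by ds.get_transform() earlier:
gdal_gt = [-185.9183, 0.01839601254850161, 0.0,
           84.47139999999999, 0.0, -0.01839601518026565]

# Mistake: treat the GDAL-ordered list as affine (a, b, c, d, e, f)
a, b, c, d, e, f = gdal_gt
try:
    int((51.0 - f) / e)        # rasterstats' row arithmetic
except ZeroDivisionError as err:
    print(err)                 # float division by zero

# Correct unpacking: GDAL order is (c, a, b, f, d, e)
c, a, b, f, d, e = gdal_gt
print(int((51.0 - f) / e))     # a plausible row index for latitude 51
```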

Some more investigation led me back to the fccsmap package. Notably, this error appears to occur "later" than the previous overflow errors, so it looks like we got something right. And, in fact, we did: the new dataset would work flawlessly if it weren't for an issue in fccsmap:

# fccsmap/baselookup.py
def _create_geo_data_df(self, geo_data):
    logging.debug("Creating data frame of geo-data")
    shape = shapely.geometry.shape(geo_data)
    wgs84_df = geopandas.GeoDataFrame({'geometry': [shape]}, crs="EPSG:4326")
    return wgs84_df.to_crs(self._crs)

fccsmap calls this function, which takes the GeoJSON data passed to it, treats it as projected in the EPSG:4326 coordinate system (the one used by the American datasets), and reprojects it to the coordinate system of the dataset actually in use. This StackExchange question clued me in that such reprojections can lead to comically large memory allocations. It does look like the procedure above should bypass these issues, since the data really is being reprojected by the to_crs method, but if I drop the intermediary step and treat the input GeoJSON as already being in the dataset's coordinate system, it works perfectly.