House Prices

This guide runs through building, training and evaluation of models for estimating house prices in California. The models in this example are using quad pooling operations. In the guide we will be using California Housing dataset. The dataset is comprised of 20640 data points with 8 features each, including geographical coordinates.

Open In Colab

[1]:
!pip install torch-geopooling matplotlib osmnx geopandas torchmetrics scikit-learn -qqq

Defining the Dataset

In order to use the data for training a neural network, we need to define a dataset class, that should inherit from the standard torch.utils.data.Dataset, and implement __len__ and __getitem__.

Since the dataset is relatively small, we could simply download it into memory and then parse CSV file to turn it into the Pandas DataFrame.

[2]:
import pandas as pd
from sklearn.datasets import fetch_california_housing


dataset = fetch_california_housing(as_frame=True).frame
dataset.head()
[2]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Below you can see sample of dataset, or coordinates of houses in California depicted as blue dots. Black lines in the picture represent coastlines from the Open Street Map.

[3]:
import warnings
warnings.simplefilter("ignore")

import osmnx as ox


lon_min = dataset["Longitude"].min()
lon_max = dataset["Longitude"].max()
lat_min = dataset["Latitude"].min()
lat_max = dataset["Latitude"].max()

bounding_box = (lat_max, lat_min, lon_max, lon_min)
coastlines = ox.features_from_bbox(bbox=bounding_box, tags={"natural": "coastline"})
[4]:
from matplotlib import pyplot as plot
from matplotlib import colors
from matplotlib import cm
from shapely import Polygon, LineString


def render_coastlines(ax):
    for coastline in coastlines.itertuples():
        if isinstance(coastline.geometry, Polygon):
            ax.plot(*coastline.geometry.exterior.xy, color="black", lw=1)
        if isinstance(coastline.geometry, LineString):
            ax.plot(*coastline.geometry.xy, color="black", lw=1)

fig, ax = plot.subplots(ncols=2)
fig.set_dpi(300)
fig.set_size_inches(15, 8)


cmap = plot.get_cmap("RdBu_r")
norm = colors.Normalize(
    vmin=dataset["MedHouseVal"].min(),
    vmax=dataset["MedHouseVal"].max(),
)

ax[0].set_xlabel("Longitude, deg")
ax[0].set_ylabel("Latitude, deg")
ax[0].scatter(dataset["Longitude"], dataset["Latitude"], s=1, color="blue", alpha=0.1)
ax[0].set_title("Locations of California houses")

ax[1].set_xlabel("Longitude, deg")
ax[1].set_ylabel("Latitude, deg")
ax[1].scatter(
    dataset["Longitude"], dataset["Latitude"],
    s=20, color=cmap(norm(dataset["MedHouseVal"])),
    alpha=0.1
)
ax[1].set_title("Prices of houses in California")

fig.colorbar(cm.ScalarMappable(norm=norm, cmap=cmap), ax=ax[1], label="House Value, $")

render_coastlines(ax[0])
render_coastlines(ax[1])
../_images/guide_house_prices_7_0.png

On the next step we need to split dataset into training and test sub-sets, in this example we will use 80% of data for training and 20% of data for testing (evaluation) of a model.

[5]:
import numpy as np
import torch
from torch.utils.data import Dataset, random_split


class HousePricesDataset(Dataset):
    def __init__(self, df):
        self.df = df
        self.max_value = df["MedHouseVal"].max()

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Take necessary row from the dataset and convert it to torch.Tensor.
        x = self.df.iloc[idx][["Longitude", "Latitude"]].to_numpy(dtype=np.float64)
        y = self.df.iloc[idx][["MedHouseVal"]].to_numpy(dtype=np.float64)# / self.max_value
        return torch.DoubleTensor(x), torch.DoubleTensor(y)


train_size = int(len(dataset) * 0.8)
test_size = len(dataset) - train_size

houses_prices_dataset = HousePricesDataset(dataset)
train_set, test_set = random_split(houses_prices_dataset, [train_size, test_size])

Defining Models

The model we are going to use will be compare regular quad pooling layer and adaptive quad pooling layer. Essentially, those modules build quadtree for spatial decomposition of the points. We assume that houses within similar geographical coordinates cost similar money.

Adaptive module allows to dynamically shape the decomposition based on density of points: regions with higher density are finer grained than regions with fewer points. Regular pooling module on the other hand, tiles the target polygon into quads of the same size.

Below we define two models QuadNN and AdaptiveQuadNN that use quad pooling operations: regular and adaptive.

[6]:
from torch import nn
from torch_geopooling.nn import AdaptiveQuadPool2d
from torch_geopooling.nn import QuadPool2d


class QuadNN(nn.Module):

    def __init__(self, bbox):
        super().__init__()
        n, s, e, w = bbox
        self.pool = QuadPool2d(
            feature_dim=1,
            polygon=Polygon([(w, s), (e, s), (e, n), (w, n)]),
            exterior=(-180, -90, 360, 180),
            max_depth=10
        )
        self.linear = nn.Linear(1, 1, dtype=torch.float64)

    def forward(self, x):
        x = torch.clamp(self.pool(x), 0.0)
        x = self.linear(x)
        return x


class AdaptiveQuadNN(nn.Module):

    def __init__(self):
        super().__init__()
        self.pool = AdaptiveQuadPool2d(
            feature_dim=1,
            exterior=(-180, -90, 360, 180),
            capacity=10,
            max_depth=12
        )
        self.linear = nn.Linear(1, 1, dtype=torch.float64)

    def forward(self, x):
        x = self.pool(x)
        x = self.linear(x)
        return x

Training Models

Further we use a standard training loop and Stochastic Gradient Descent optimization algorithm to train models. Note, that adaptive pooling operations are using sparse weight tensor and therefore only a limited set of optimization algorithms could be used for training.

[7]:
from torch.utils.data import DataLoader
from torch.optim import SGD


def train(model, lr=0.1, epochs=10):
    loss_fn = nn.HuberLoss()
    optimizer = SGD(model.parameters(), lr=lr)

    for epoch in range(epochs):
        for batch in DataLoader(train_set, batch_size=2048):
            x, y = batch
            optimizer.zero_grad()
            y_pred = model(x)

            loss = loss_fn(y, y_pred)
            loss.backward()
            optimizer.step()
    return model
[8]:
qnn = train(QuadNN(bounding_box), lr=1.0, epochs=50)
[9]:
aqnn = train(AdaptiveQuadNN(), lr=1.0, epochs=50)

Evaluating Models

On the final step, we evaluate models’ performance and compute mean absolute error and weighted mean absolute percentage error. As you could see the from the table below, both models require improvements.

Additionally, torch_geopooling library provides TileWKT transformer that allows to inspect the model and plot quads that where learned by the model. Below are presented learned decomposition for regular and adaptive pooling operations.

[10]:
import warnings
warnings.filterwarnings("ignore")

from torchmetrics import MeanAbsoluteError
from torchmetrics import WeightedMeanAbsolutePercentageError


def test(model):
    mae = MeanAbsoluteError()
    wmape = WeightedMeanAbsolutePercentageError()

    test_loader = DataLoader(train_set, batch_size=len(test_set))
    x, y_true = next(iter(test_loader))
    y_pred = model.eval()(x)

    return {
        "MAE": mae(y_true, y_pred).numpy(),
        "wMAPE": wmape(y_true, y_pred).numpy(),
    }

metrics = {}
with torch.no_grad():
    metrics["QuadNN"] = test(qnn)
    metrics["AdaptiveQuanNN"] = test(aqnn)

pd.DataFrame(metrics)
[10]:
QuadNN AdaptiveQuanNN
MAE 0.62189245 0.8252109
wMAPE 0.30851611 0.5629974

Below are depicted tiles colored with learned prices, the prices are calculated for centers for the learned tiles. As it could be seen, adaptive quad pooling operation changes the resolution of tiles, when the density of data increases.

[11]:
from torch_geopooling.transforms import TileWKT
from shapely import from_wkt


def render_tiles(ax, model, tiles, bbox):
    centers = []
    exterior = (-180, -90, 360, 180)
    transformer = TileWKT(exterior=exterior, internal=False)

    # Calculate centers of terminal nodes in a quadtree and then use these
    # coordinates to compute associated prices for each quad in the final tree.
    for tile in transformer(tiles):
        poly = from_wkt(tile)
        center = poly.representative_point()
        centers.append((center.x, center.y))

    with torch.no_grad():
        values = model(torch.DoubleTensor(centers)).detach().numpy()

    # Use different colors to represent housing prices: warm colors of the
    # spectre are associated with higher prices, while cooler colors of the
    # spectre depic lower prices.
    cmap = plot.get_cmap("RdBu_r")
    norm = colors.Normalize(vmin=np.min(values), vmax=np.max(values))

    for i, tile in enumerate(transformer(tiles)):
        poly = from_wkt(tile)
        ax.fill(*poly.exterior.xy, color=cmap(norm(values[i])), lw=0.5, edgecolor="black", alpha=0.8)

    n, s, e, w = bbox
    ax.set_xlim(w, e)
    ax.set_ylim(s, n)
    ax.set_xlabel("Longitude, deg")
    ax.set_ylabel("Latitude, deg")


fig, ax = plot.subplots(ncols=2)
fig.set_dpi(300)
fig.set_size_inches(15, 8)

ax[0].set_title("Decomposition of QuadPool2d module")
ax[1].set_title("Decomposition of AdaptiveQuadPool2d module")


render_tiles(ax[0], qnn, qnn.pool.tiles, bounding_box)
render_tiles(ax[1], aqnn, aqnn.pool.weight.coalesce().indices().t()[:,:-1], bounding_box)

render_coastlines(ax[0])
render_coastlines(ax[1])
../_images/guide_house_prices_19_0.png