Sprint 3

Sprint 3#

Summary#

In this notebook, I explored how to use an ALS-derived treelist to predict missing tree attributes, namely Diameter, Crown Base Height, and Species. I also compared a TLS (Terrestrial Laser Scanning) derived dataset to a field-derived dataset at Shaver Lake.

Part 1: TLS vs. Field Data at Shaver Lake#

# Task 1 - Your code here
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Loading the full plots inventory
plot_inventory = pd.read_csv("./01_plot_identification.csv")

# Specify our site of interest
site_name = "SHA"

# We only keep the inventory plots for Shaver Lake
shaver_inventory = plot_inventory.loc[plot_inventory.site_name == site_name, ['inventory_id', 'plot_blk']]
shaver_inventory.head()

	inventory_id	plot_blk
0	143_pre_SHA_burn3d	CAFKU_0143_20240721_1
1	140_pre_SHA_burn3d	CAFKU_0140_20240721_1
2	152_pre_SHA_burn3d	CAFKU_0152_20240721_1
3	147_pre_SHA_burn3d	CAFKU_0147_20240720_1
4	151_pre_SHA_burn3d	CAFKU_0151_20240720_1

# We sample the first 5 plots
shaver_sample = shaver_inventory.head()

# Load both the TLS and the field collected treelist
tls_treelist = pd.read_csv("./TLS_treelist.csv")
field_treelist = pd.read_csv("./03_tree.csv")

# Filter tls_treelist based on plot_blk in ind_sample
tls_treelist_shaver = tls_treelist.merge(shaver_sample[['plot_blk']], on='plot_blk', how='inner')

# Filter field_treelist based on inventory_id in ind_sample
field_treelist_shaver = field_treelist.merge(shaver_sample[['inventory_id']], on='inventory_id', how='inner')

# Plot trees from both sources
fig, ax = plt.subplots()
plt.scatter(tls_treelist_shaver.H, tls_treelist_shaver.DBH/2.54, axes = ax, label = 'TLS')
plt.scatter(field_treelist_shaver.tree_ht, field_treelist_shaver.tree_dbh/10, axes = ax, marker = '.',label= 'Field')
plt.xlabel('Height (m)')
plt.ylabel('DBH (cm)')
plt.legend()
plt.title('TLS vs Field Data at Shaver Lake')

Text(0.5, 1.0, 'TLS vs Field Data at Shaver Lake')

_images/38fb612c633263a0be2ef7f539777f290da5a9666d781d45dc4dd9fc312b65c0.png

Our Thoughts#

The plots comparing TLS and Field data for Shaver Lake and Independence Lake reveal notable differences in tree structure and measurement consistency. At Shaver Lake, the TLS and Field data show a broader range of both height and DBH values, with TLS generally capturing larger DBH measurements than the Field method for taller trees. This suggests that TLS may be more effective at capturing larger trees, or that there is a systematic overestimation in TLS measurements. In contrast, the Independence Lake plot showed a tighter clustering of values and a smaller overall range in DBH, with the TLS and Field data aligning more closely, though TLS still slightly overestimates DBH.

The discrepancies between the two treelists may be due to differences in forest structure. Shaver Lake appears to have more mature, taller trees with greater DBH, possibly leading to occlusion or shadowing in TLS measurements. In contrast, Independence Lake may have a denser, younger forest where TLS and Field measurements are more easily matched. These observations generally align with what we found by comparing the two sites in previous sprints as well. Other factors such as terrain, scanner setup, and environmental conditions at the time of data collection could also contribute to these differences.

Part 2: Modeling Diameter, Species, and CBH at Shaver Lake#

import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

ALS_treetops = pd.read_csv('./ALS_treetops.csv')
FF_treelist = pd.read_csv('FF_treelist_all.csv')
fia_ref_species_table = pd.read_csv('./REF_SPECIES.csv')

site_name = 'SHA'

# Filter ALS treetops by site
ALS_treetops_filter = ALS_treetops[ALS_treetops.site_name == site_name]

# For FastFuels treelist, filter by site
FF_treelist_filter = FF_treelist[FF_treelist.site_name == site_name]

# Rename height column for ALS file (if needed)
ALS_treetops_filter = ALS_treetops_filter.rename(columns={'height_m': 'HT'})

# Convert ALS treetops to a GeoDataFrame using coordinates (assumes columns X_4326 and Y_4326)
ALS_treetops_gdf = gpd.GeoDataFrame(
    ALS_treetops_filter,
    geometry=gpd.points_from_xy(ALS_treetops_filter.X_4326, ALS_treetops_filter.Y_4326),
    crs=4326
)

# Use the filtered FastFuels treelist as our training treelist
treelist = FF_treelist_filter
treelist.head()

	SPCD	DIA	HT	STATUSCD	CBH	CR	X	Y	geometry	site_name	X_4326	Y_4326
23266	814	14.732	9.7536	2	NaN	NaN	-2029248.192	1.811832e+06	POINT (-119.2807981866652 37.10055993214569)	SHA	-119.280798	37.100560
23267	814	14.732	9.7536	2	NaN	NaN	-2029228.422	1.811843e+06	POINT (-119.2806115374424 37.10070078752966)	SHA	-119.280612	37.100701
23268	122	NaN	NaN	2	NaN	NaN	-2029212.404	1.811821e+06	POINT (-119.280374586538 37.10054546366893)	SHA	-119.280375	37.100545
23269	814	13.208	6.0960	1	2.13360	0.813356	-2029218.343	1.811834e+06	POINT (-119.2804745751547 37.10064118282663)	SHA	-119.280475	37.100641
23270	814	13.970	5.7912	1	3.18516	0.534938	-2029244.407	1.811816e+06	POINT (-119.2807125425391 37.10043003743448)	SHA	-119.280713	37.100430

Model 1 – Predicting Diameter (DIA) from Height (HT)#

# Diameter prediction: split treelist data into training and testing sets
independent_variables = ["HT"]
dependent_variable = "DIA"
include_variables = independent_variables + [dependent_variable]

trees_train, trees_test = train_test_split(treelist[include_variables].dropna(), test_size=0.2)
print(f"Training set size: {len(trees_train)}")
print(f"Test set size: {len(trees_test)}")

# Train Random Forest model for diameter prediction
model_ht = RandomForestRegressor()
model_ht.fit(trees_train[independent_variables], trees_train[dependent_variable])

# Predict on test set and on ALS data
predicted_dbh_test = model_ht.predict(trees_test[independent_variables])
predicted_dbh = model_ht.predict(ALS_treetops_filter[independent_variables])

# Evaluate model performance: R^2 and RMSE on test set
r2 = model_ht.score(trees_test[independent_variables], trees_test[dependent_variable])
rmse = np.sqrt(((predicted_dbh_test - trees_test[dependent_variable])**2).mean())
print(f"Model R^2: {r2:.2f}")
print(f"Model RMSE: {rmse:.2f} (cm)")

Training set size: 45013
Test set size: 11254
Model R^2: 0.92
Model RMSE: 8.17 (cm)

# Plot actual vs. predicted diameters on test data
fig, ax = plt.subplots()
trees_test = trees_test.copy()  # to avoid SettingWithCopyWarning
trees_test["predicted_diameter"] = predicted_dbh_test
trees_test.plot.scatter(x="DIA", y="predicted_diameter", ax=ax)
upper_dia_limit = max(trees_test["DIA"].max(), trees_test["predicted_diameter"].max()) + 1
ax.plot([0, upper_dia_limit], [0, upper_dia_limit], color='red')
ax.set_xlabel("Actual Diameter (cm)")
ax.set_ylabel("Predicted Diameter (cm)")
ax.set_title("Validation Data: Diameter Prediction")
plt.show()

_images/44f5a964e3d99062910a2e1afbbfe94706e33221943903da8a63970749d823b4.png

Now, we apply our diameter model to the ALS treetops.

# Predict diameter for ALS-derived treetops
ALS_treetops_filter["predicted_diameter"] = predicted_dbh

# Plot predicted diameter versus height for ALS trees
fig, ax = plt.subplots()
ALS_treetops_filter.plot.scatter(x="HT", y="predicted_diameter", ax=ax)
ax.set_xlabel("Height (m)")
ax.set_ylabel("Predicted Diameter (cm)")
ax.set_title("Shaver Lake: Predicted Tree Diameter from ALS")
plt.show()

_images/4a68068280abe8209dcbb3100c4de81095ff17b148060a5fe88e4a943eda8cd3.png

Model 2 – Predicting Species Code (SPCD) from Height#

# Create a dictionary mapping SPCD to common species names using FIA reference table
spcd_to_common_name = dict(zip(fia_ref_species_table['SPCD'], fia_ref_species_table['COMMON_NAME']))

# Prepare data for species classification
independent_variables = ["HT"]
dependent_variable = "SPCD"
include_variables = independent_variables + [dependent_variable]

# Split FastFuels treelist data (filtered for SHA) into training and test sets
trees_train, trees_test = train_test_split(treelist[include_variables].dropna(), test_size=0.2)
print(f"Training set size: {len(trees_train)}")
print(f"Test set size: {len(trees_test)}")

# Train a Random Forest Classifier for SPCD prediction
model_spcd = RandomForestClassifier()
model_spcd.fit(trees_train[independent_variables], trees_train[dependent_variable])

# Predict species on test set and on ALS data
predicted_spcd_test = model_spcd.predict(trees_test[independent_variables])
predicted_spcd = model_spcd.predict(ALS_treetops_filter[independent_variables])

# Evaluate classification performance with a classification report
unique_species = sorted(trees_test[dependent_variable].unique())
species_names = [spcd_to_common_name.get(spcd, f"Unknown ({spcd})") for spcd in unique_species]
report = classification_report(trees_test[dependent_variable], predicted_spcd_test, zero_division=0, target_names=species_names)
print(report)

# Display confusion matrix
cm = confusion_matrix(trees_test[dependent_variable], predicted_spcd_test)
disp = ConfusionMatrixDisplay(cm, display_labels=species_names)
disp.plot(cmap='Blues', values_format='d')
plt.xticks(rotation=90, ha='right')
plt.title("Species Classification Confusion Matrix")
plt.show()

Training set size: 45013
Test set size: 11254
                            precision    recall  f1-score   support

                 white fir       0.57      0.65      0.61      2594
                 grand fir       0.00      0.00      0.00        60
        California red fir       0.00      0.00      0.00         2
           western juniper       0.00      0.00      0.00         1
             western larch       0.00      0.00      0.00         1
             incense-cedar       0.69      0.77      0.73      4097
            lodgepole pine       0.84      0.50      0.63        42
              Jeffrey pine       0.86      0.07      0.13       263
                sugar pine       0.61      0.49      0.54       886
            ponderosa pine       0.71      0.73      0.72      2372
               Douglas-fir       0.51      0.09      0.16       229
           Pacific madrone       0.00      0.00      0.00         7
curlleaf mountain-mahogany       0.00      0.00      0.00        12
        eastern cottonwood       0.00      0.00      0.00         2
             bitter cherry       0.49      1.00      0.66        82
           canyon live oak       0.63      0.41      0.49       318
                Gambel oak       1.00      0.38      0.56        13
          Oregon white oak       0.62      0.23      0.33        35
      California black oak       0.41      0.20      0.27       230
      California white oak       0.00      0.00      0.00         8

                  accuracy                           0.65     11254
                 macro avg       0.40      0.28      0.29     11254
              weighted avg       0.65      0.65      0.63     11254

_images/ed908b89d781725dbc6a2f92e0361c716e5c5a44d395efcd9598a73177dd8878.png

The model appears to misclassify white fir frequently as Douglas-fir, grand fir, and California black oak. There are also notable misclassifications of western larch as incense-cedar, Jeffrey pine as ponderosa pine, and California black oak as canyon live oak, indicating confusion among similar or visually overlapping species.

Model 3 – Predicting Crown Base Height (CBH) from Height#

independent_variables = "HT"  # single predictor
dependent_variable = "CBH"

# Use a 90/10 split for training and testing (adjust as needed)
trees_train, trees_test = train_test_split(treelist[['HT', 'CBH']].dropna(), test_size=0.1)
print(f"Training set size: {len(trees_train)}")
print(f"Test set size: {len(trees_test)}")

# Train Random Forest model for CBH prediction
model_cbh = RandomForestRegressor()
model_cbh.fit(trees_train[[independent_variables]], trees_train[dependent_variable])

# Predict CBH on test data and on ALS treetops
predicted_cbh_test = model_cbh.predict(trees_test[[independent_variables]])
predicted_cbh = model_cbh.predict(ALS_treetops_filter[[independent_variables]])

# Evaluate model performance: R^2 and RMSE on test data
r2_cbh = model_cbh.score(trees_test[[independent_variables]], trees_test[dependent_variable])
rmse_cbh = np.sqrt(((predicted_cbh_test - trees_test[dependent_variable])**2).mean())
print(f"Model R^2: {r2_cbh:.2f}")
print(f"Model RMSE: {rmse_cbh:.2f} (m)")

# Plot actual vs. predicted CBH
fig, ax = plt.subplots()
trees_test = trees_test.copy()
trees_test["predicted_cbh"] = predicted_cbh_test
trees_test.plot.scatter(x="CBH", y="predicted_cbh", ax=ax)
upper_cbh_limit = max(trees_test["CBH"].max(), trees_test["predicted_cbh"].max()) + 1
ax.plot([0, upper_cbh_limit], [0, upper_cbh_limit], color='red')
ax.set_xlabel("Actual CBH (m)")
ax.set_ylabel("Predicted CBH (m)")
ax.set_title("Validation Data: Crown Base Height Prediction")
plt.show()

Training set size: 43967
Test set size: 4886
Model R^2: 0.85
Model RMSE: 3.27 (m)

_images/f92e904da2caf174568f1b4966977e2d8ca9249344bf898d927b6814b13291d8.png

Comparing with Field Data

Now that we have predicted all three tree metrics on the ALS-detected trees, we will save the predictions as a new treelist and compare these results with field-collected observations.

predicted_treelist = pd.DataFrame({
    'treeID': ALS_treetops_filter.treeID,
    'HT': ALS_treetops_filter.HT,
    'DIA': predicted_dbh,
    'SPCD': predicted_spcd,
    'CBH': predicted_cbh,
    'X_4326': ALS_treetops_filter.X_4326,
    'Y_4326': ALS_treetops_filter.Y_4326,
})

predicted_treelist.head()

	treeID	HT	DIA	SPCD	CBH	X_4326	Y_4326
3120	1	23.800797	55.540907	81	8.407979	-119.274157	37.108535
3121	2	16.103474	28.740294	15	6.526245	-119.274331	37.108530
3122	3	37.818615	89.018722	122	20.640655	-119.274482	37.108500
3123	4	21.019176	52.739436	81	12.121794	-119.274285	37.108481
3124	5	16.300404	28.740294	15	6.526245	-119.274088	37.108478

# Compare predictions to field data using spatial join
# Load plot boundaries and field tree data
plots_df = pd.read_csv('./01_plot_identification.csv')

# Create GeoDataFrame for plots (each unique spatial reference system is processed)
plots_intermediate = []
for srs in np.unique(plots_df.plot_coord_srs):
    plots_subset = plots_df[plots_df.plot_coord_srs == srs]
    plots_subset_gdf = gpd.GeoDataFrame(
        plots_subset, 
        geometry=gpd.points_from_xy(plots_subset.plot_coord_x, plots_subset.plot_coord_y), 
        crs=srs
    )
    # Reproject to EPSG 5070
    plots_subset_gdf = plots_subset_gdf.to_crs(5070)
    plots_intermediate.append(plots_subset_gdf)

plots_gdf = pd.concat(plots_intermediate)
plots_gdf = plots_gdf.dropna(subset=['plot_blk'])

plot_size = 1/10  # acre
acre_to_m2 = 4046.86
plot_size_m2 = plot_size * acre_to_m2
plot_radius = np.sqrt(plot_size_m2 / np.pi)
plots_gdf = plots_gdf.set_geometry(plots_gdf.buffer(plot_radius))

# Filter plots for site SHA
plots_filtered = plots_gdf[plots_gdf.site_name == site_name]
plots_filtered.plot()
plt.title(f'Plots at {site_name}')
plt.xlabel('X (meters)')
plt.ylabel('Y (meters)')
plt.show()

_images/7fc4eb3758e64b1aa2ed3650bd84bef0fbcf1e21bcedc97cb06ccf5344ff408f.png

predicted_treelist_gdf = gpd.GeoDataFrame(
    predicted_treelist,
    geometry=gpd.points_from_xy(predicted_treelist.X_4326, predicted_treelist.Y_4326),
    crs=4326
)
predicted_treelist_gdf = predicted_treelist_gdf.to_crs(plots_filtered.crs)

# Spatial join to assign plots to predicted trees
predicted_treelist_plots = predicted_treelist_gdf.sjoin(plots_filtered)


field_data_trees = pd.read_csv('./03_tree.csv')

plot_id = plot_names[0]
trees_filtered = field_data_trees[field_data_trees.inventory_id == plot_id]
predicted_treelist_filtered = predicted_treelist_plots[predicted_treelist_plots.inventory_id == plot_id]

predicted_treelist_filtered.head()

	treeID	HT	DIA	SPCD	CBH	X_4326	Y_4326	geometry	index_right	inventory_id	...	groundpost_cover_total	veg_cover_id	veg_cover_tot_veg_percent	veg_cover_tov_percent	veg_cover_und_percent	veg_cover_tos_percent	veg_cover_herb_percent	sp_invcov_id	sp_cov_inv_obs	sp_cov_inv_obs_label
9545	6436	13.405090	30.773926	81	7.036213	-119.274053	37.100451	POINT (-2028674.954 1811676.71)	38	100_pre_SHA_burn3d	...	100	1	40.0	5.0	1.0	33.0	1.0	1	3.0	Abigail
9563	6454	13.098443	22.465848	81	4.437432	-119.274038	37.100426	POINT (-2028674.356 1811673.73)	38	100_pre_SHA_burn3d	...	100	1	40.0	5.0	1.0	33.0	1.0	1	3.0	Abigail
9567	6458	11.713481	20.281709	15	7.247664	-119.274074	37.100423	POINT (-2028677.552 1811674.188)	38	100_pre_SHA_burn3d	...	100	1	40.0	5.0	1.0	33.0	1.0	1	3.0	Abigail
9588	6479	14.202027	36.443269	81	7.182681	-119.274040	37.100393	POINT (-2028675.429 1811670.091)	38	100_pre_SHA_burn3d	...	100	1	40.0	5.0	1.0	33.0	1.0	1	3.0	Abigail

4 rows × 70 columns

Discussion#

In this notebook we used the ALS-derived treelist and the FastFuels treelist as our training dataset to predict missing tree attributes (DIA, SPCD, and CBH) at Shaver Lake (SHA). Our simple models—based solely on height—perform reasonably well for lower diameters and show promise for species and CBH prediction, although additional predictors could further improve performance.

Training Data Considerations:

FastFuels Treelist:
- Pros: Large sample size and extensive coverage; beneficial for machine learning model training.
- Cons: May not capture local variability at SHA as accurately.
FIA Data:
- Pros: Systematic and standardized measurements across regions.
- Cons: Sampling resolution may be coarser and less representative of the continuous landscape.
Field-Collected Data:
- Pros: Provides site-specific measurements that can capture local conditions and variability.
- Cons: Often limited in sample size, which might reduce model generalizability.

In future iterations we may want to train separate models using FIA or field-collected data as our training dataset. Comparing RMSE, R², and classification accuracy across these models can help determine which source best suits our local prediction needs.