Classifier Evaluation#
To evaluate the trained classifier, we compute a set of metrics on the test dataset and use TorchMetrics, a PyTorch library for computing evaluation metrics, to generate class-wise reports. The following metrics are reported:
Precision: Precision measures the proportion of true positive predictions among all positive predictions made by the model. It indicates how many of the predicted positive cases are actually positive. As a reminder, precision ranges from 0 to 1, where higher values indicate better performance. See the page “Detector evaluate” for the equation to calculate precision.
Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions among all actual positive cases in the dataset. It indicates how many of the actual positive cases the model correctly identifies. As a reminder, recall also ranges from 0 to 1, with higher values indicating better performance. See the page “Detector evaluate” for the equation to calculate recall.
F1-Score: F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall. F1-score is calculated as:
F1-score = 2 x ((Precision x Recall) / (Precision + Recall))
F1-score ranges from 0 to 1, where higher values indicate better performance, and it is particularly useful when the classes in the dataset are imbalanced. For example, with a precision of 0.95 and a recall of 0.93 (as for the complete class below), F1-score = 2 x (0.95 x 0.93) / (0.95 + 0.93) ≈ 0.94.
Note: Support represents the number of actual occurrences of each class in the dataset. It is the number of true instances for each class in the ground truth. Support is not used directly in the computation of other metrics but provides context for understanding the distribution of classes in the dataset.
These metrics are commonly used for evaluating classification models and provide insight into the correctness and completeness of predictions across different classes. We calculate them individually for each class within every category.
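As an illustration, here is a minimal sketch of computing these per-class metrics with TorchMetrics. The tensors, the three-class setup, and the variable names are hypothetical and are not taken from the actual evaluation code.

import torch
from torchmetrics.classification import MulticlassPrecision, MulticlassRecall, MulticlassF1Score

# Hypothetical predictions and ground-truth labels for one category with three classes.
num_classes = 3
preds = torch.tensor([0, 1, 1, 2, 1, 0])
target = torch.tensor([0, 1, 2, 2, 1, 1])

# average=None returns one value per class instead of a single aggregate,
# which corresponds to the class-wise rows in the reports below.
precision = MulticlassPrecision(num_classes=num_classes, average=None)
recall = MulticlassRecall(num_classes=num_classes, average=None)
f1 = MulticlassF1Score(num_classes=num_classes, average=None)

print("per-class precision:", precision(preds, target))
print("per-class recall:   ", recall(preds, target))
print("per-class F1:       ", f1(preds, target))

# The macro and weighted averages in the report summaries correspond to
# average="macro" and average="weighted" respectively.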
Results#
Category: complete
              precision    recall  f1-score   support

    complete       0.95      0.93      0.94      1199
  incomplete       0.69      0.74      0.71       234

    accuracy                           0.90      1433
   macro avg       0.82      0.83      0.83      1433
weighted avg       0.90      0.90      0.90      1433
Category: condition
              precision    recall  f1-score   support

        poor       0.70      0.76      0.73       425
        fair       0.82      0.75      0.79       896
        good       0.42      0.55      0.48       112

    accuracy                           0.74      1433
   macro avg       0.65      0.69      0.66      1433
weighted avg       0.75      0.74      0.74      1433
Category: material
                                           precision    recall  f1-score   support

                        mix-other-unclear       0.45      0.52      0.48       319
                                  plaster       0.89      0.75      0.82       937
           brick_or_cement-concrete_block       0.50      0.73      0.60        86
                            wood_polished       0.53      0.77      0.63        60
stone_with_mud-ashlar_with_lime_or_cement       0.30      0.67      0.41         9
                         corrugated_metal       0.62      0.93      0.74        14
                         wood_crude-plank       0.40      0.40      0.40         5
                        container-trailer       0.29      0.67      0.40         3

                                 accuracy                           0.70      1433
                                macro avg       0.50      0.68      0.56      1433
                             weighted avg       0.74      0.70      0.71      1433
Category: security
              precision    recall  f1-score   support

     secured       0.77      0.82      0.79       361
   unsecured       0.94      0.92      0.93      1072

    accuracy                           0.89      1433
   macro avg       0.85      0.87      0.86      1433
weighted avg       0.90      0.89      0.89      1433
Category: use
                          precision    recall  f1-score   support

            residential       0.98      0.92      0.95      1233
critical_infrastructure       0.20      0.50      0.29         6
                  mixed       0.59      0.72      0.65        97
             commercial       0.52      0.76      0.62        97

               accuracy                           0.89      1433
              macro avg       0.57      0.73      0.63      1433
           weighted avg       0.92      0.89      0.90      1433
Usage#
Here’s an overview of the script classifier_evaluate.py, focusing on its functionality and usage:
Functionality#
Model Loading: The script loads a trained classifier model from a checkpoint file specified by the user.
Model Evaluation: The loaded model is evaluated on the test dataset using the evaluate_model function. Predictions are generated for each category of interest (e.g., completeness, condition, material, security, use); a rough sketch of such an evaluation loop is given after this list.
Classification Reports: A classification report is generated for each category, providing metrics such as precision, recall, and F1-score for each class within that category.
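The snippet below sketches, under assumed interfaces, how such an evaluation loop could collect per-category predictions. The category names, the dictionary-style model outputs and labels, and the helper name collect_predictions are illustrative and may not match the actual evaluate_model implementation.

import torch

# Assumed category names and a generic multi-head model/data interface;
# classifier_evaluate.py may organise its outputs and labels differently.
CATEGORIES = ["completeness", "condition", "material", "security", "use"]

@torch.no_grad()
def collect_predictions(model, test_loader, device="cuda"):
    """Run the model over the test set and gather per-category predictions and targets."""
    model.eval()
    preds = {cat: [] for cat in CATEGORIES}
    targets = {cat: [] for cat in CATEGORIES}
    for images, labels in test_loader:      # labels assumed to be a dict keyed by category
        logits = model(images.to(device))   # model assumed to return a dict of logits per category
        for cat in CATEGORIES:
            preds[cat].append(logits[cat].argmax(dim=1).cpu())
            targets[cat].append(labels[cat])
    return (
        {cat: torch.cat(preds[cat]) for cat in CATEGORIES},
        {cat: torch.cat(targets[cat]) for cat in CATEGORIES},
    )

# The per-category prediction and target tensors can then be passed to the
# TorchMetrics objects shown earlier to produce the per-class reports above.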
How to run#
To run the script, follow these steps:
Command-Line Arguments: Specify the following command-line arguments:
CHECKPOINT_PATH: Path to the trained model checkpoint file.
IMG_DIR: Directory containing the dataset images.
DATA_DIR: Directory containing partitioned CSV files for the dataset.
Run Script: Execute the script using the command:
python classifier_evaluate.py <CHECKPOINT_PATH> <IMG_DIR> <DATA_DIR>
Replace <CHECKPOINT_PATH>, <IMG_DIR>, and <DATA_DIR> with the appropriate values.
Output: The script will generate classification reports for each category of interest, providing insights into the model’s performance on the test dataset.
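For reference, a minimal argument-parsing setup consistent with the command above might look like the following; the actual script may define or name its arguments differently.

import argparse

# Hypothetical argument parsing matching the positional arguments described above.
parser = argparse.ArgumentParser(description="Evaluate a trained classifier on the test dataset.")
parser.add_argument("checkpoint_path", help="Path to the trained model checkpoint file.")
parser.add_argument("img_dir", help="Directory containing the dataset images.")
parser.add_argument("data_dir", help="Directory containing partitioned CSV files for the dataset.")
args = parser.parse_args()

# args.checkpoint_path, args.img_dir, and args.data_dir are then used to load the
# checkpoint and build the test dataloader before running the evaluation.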