Classifier Evaluation#

To evaluate our trained classifier model, we compute a set of metrics on the test dataset and use the TorchMetrics library to generate class-wise reports. TorchMetrics is a PyTorch-compatible library for computing evaluation metrics. The following metrics are reported:

  1. Precision: Precision measures the proportion of true positive predictions among all positive predictions made by the model. It indicates how many of the predicted positive cases are actually positive. As a reminder, precision ranges from 0 to 1, where higher values indicate better performance. See the page “Detector evaluate” for the equation to calculate precision.

  2. Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions among all actual positive cases in the dataset. It indicates how many of the actual positive cases the model correctly identifies. As a reminder, recall also ranges from 0 to 1, with higher values indicating better performance. See the page “Detector evaluate” for the equation to calculate recall.

  3. F1-Score: The F1-score is the harmonic mean of precision and recall, combining both into a single metric. It is calculated as:

    F1-score = 2 x (Precision x Recall) / (Precision + Recall)

    F1-score ranges from 0 to 1, where higher values indicate better performance. It is particularly useful when there is an imbalance between the classes in the dataset.
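
    For example, taking the complete class from the results below: with precision 0.95 and recall 0.93, F1-score = 2 x (0.95 x 0.93) / (0.95 + 0.93) ≈ 0.94, which matches the reported value.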

Note: Support is the number of actual occurrences of each class in the dataset, i.e., the number of ground-truth instances per class. It does not enter the per-class precision, recall, or F1-score calculations, but it weights the classes in the weighted averages and gives context for the class distribution in the dataset.

These metrics are commonly used for evaluating classification models and provide insight into both the correctness and the completeness of predictions across different classes. We compute them individually for each class; a minimal sketch of this per-class computation with TorchMetrics follows.
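
As an illustration, per-class precision, recall, and F1-score can be computed with TorchMetrics along the following lines. The tensors and the three-class setup are made-up examples (loosely mirroring the poor/fair/good condition labels), not values from the actual evaluation:

import torch
from torchmetrics.classification import (
    MulticlassF1Score,
    MulticlassPrecision,
    MulticlassRecall,
)

# Illustrative data: 3 classes, e.g. the "condition" category (poor/fair/good).
num_classes = 3
preds = torch.tensor([0, 1, 1, 2, 1, 0])    # predicted class indices (made up)
target = torch.tensor([0, 1, 2, 2, 1, 1])   # ground-truth class indices (made up)

# average=None returns one score per class instead of a single aggregate value.
precision = MulticlassPrecision(num_classes=num_classes, average=None)
recall = MulticlassRecall(num_classes=num_classes, average=None)
f1 = MulticlassF1Score(num_classes=num_classes, average=None)

print("precision per class:", precision(preds, target))
print("recall per class:   ", recall(preds, target))
print("f1-score per class: ", f1(preds, target))

Setting average="macro" or average="weighted" instead of average=None yields the aggregate values reported in the macro avg and weighted avg rows below.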

Results#

Category: complete
              precision    recall  f1-score   support

    complete       0.95      0.93      0.94      1199
  incomplete       0.69      0.74      0.71       234

    accuracy                           0.90      1433
   macro avg       0.82      0.83      0.83      1433
weighted avg       0.90      0.90      0.90      1433


Category: condition
              precision    recall  f1-score   support

        poor       0.70      0.76      0.73       425
        fair       0.82      0.75      0.79       896
        good       0.42      0.55      0.48       112

    accuracy                           0.74      1433
   macro avg       0.65      0.69      0.66      1433
weighted avg       0.75      0.74      0.74      1433


Category: material
                                           precision    recall  f1-score   support

                        mix-other-unclear       0.45      0.52      0.48       319
                                  plaster       0.89      0.75      0.82       937
           brick_or_cement-concrete_block       0.50      0.73      0.60        86
                            wood_polished       0.53      0.77      0.63        60
stone_with_mud-ashlar_with_lime_or_cement       0.30      0.67      0.41         9
                         corrugated_metal       0.62      0.93      0.74        14
                         wood_crude-plank       0.40      0.40      0.40         5
                        container-trailer       0.29      0.67      0.40         3

                                 accuracy                           0.70      1433
                                macro avg       0.50      0.68      0.56      1433
                             weighted avg       0.74      0.70      0.71      1433


Category: security
              precision    recall  f1-score   support

     secured       0.77      0.82      0.79       361
   unsecured       0.94      0.92      0.93      1072

    accuracy                           0.89      1433
   macro avg       0.85      0.87      0.86      1433
weighted avg       0.90      0.89      0.89      1433


Category: use
                         precision    recall  f1-score   support

            residential       0.98      0.92      0.95      1233
critical_infrastructure       0.20      0.50      0.29         6
                  mixed       0.59      0.72      0.65        97
             commercial       0.52      0.76      0.62        97

               accuracy                           0.89      1433
              macro avg       0.57      0.73      0.63      1433
           weighted avg       0.92      0.89      0.90      1433

Usage#

Here’s an overview of the script classifier_evaluate.py focusing on its functionality and usage:

Functionality#

  1. Model Loading: The script loads a trained classifier model from a checkpoint file specified by the user.

  2. Model Evaluation: The loaded model is evaluated on the test dataset using the evaluate_model function, which generates predictions for each category of interest (completeness, condition, material, security, and use). A minimal sketch of such an evaluation loop is shown after this list.

  3. Classification Reports: A classification report is generated for each category, providing precision, recall, and F1-score for every class within that category.
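
Below is a minimal, self-contained sketch of the kind of evaluation loop described in step 2. The evaluate helper, the dummy linear model, and the random tensors are illustrative stand-ins, not the script's actual evaluate_model function or data pipeline:

import torch
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Run the model over the loader and collect predicted and true labels."""
    model.eval()
    all_preds, all_targets = [], []
    with torch.no_grad():
        for images, labels in loader:
            logits = model(images)
            all_preds.append(logits.argmax(dim=1))
            all_targets.append(labels)
    return torch.cat(all_preds), torch.cat(all_targets)

# Dummy stand-ins so the sketch runs end to end.
model = torch.nn.Linear(8, 3)                 # placeholder "classifier"
images = torch.randn(16, 8)                   # placeholder "test images"
labels = torch.randint(0, 3, (16,))           # placeholder labels
loader = DataLoader(TensorDataset(images, labels), batch_size=4)

# In the real script, the trained weights are restored first (step 1), e.g. via
# model.load_state_dict(torch.load(checkpoint_path)) or a framework-specific loader.
preds, targets = evaluate(model, loader)
print(preds.shape, targets.shape)             # per-example predictions and labels

The predicted and true labels collected this way feed the per-class metrics shown in the reports above.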

How to run#

To run the script, follow these steps:

  1. Command-Line Arguments: Specify the following command-line arguments:

    • CHECKPOINT_PATH: Path to the trained model checkpoint file.

    • IMG_DIR: Directory containing the dataset images.

    • DATA_DIR: Directory containing partitioned CSV files for the dataset.

  2. Run Script: Execute the script using the command:

python classifier_evaluate.py <CHECKPOINT_PATH> <IMG_DIR> <DATA_DIR>

Replace <CHECKPOINT_PATH>, <IMG_DIR>, and <DATA_DIR> with the appropriate values; an illustrative invocation is shown after this list.

  3. Output: The script will generate classification reports for each category of interest, providing insights into the model’s performance on the test dataset.
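
For reference, an invocation with hypothetical paths (adjust them to your own layout) could look like:

python classifier_evaluate.py checkpoints/classifier.ckpt data/images data/partitions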