I am working on an object-based classification of a single false-color image. In order to limit the classification error for my three categories (artificial surface, tree vegetation and herbaceous vegetation), I created 5 classifications and fused them by majority voting (FusionOfClassifications algorithm).
I divided my training vector into 5 subsets: for each classification model, 4 subsets were used for training and the remaining one for validation, so that each subset served as the validation set exactly once.
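Schematically, the split looked like this (a minimal Python sketch; the polygon IDs and counts are illustrative, not my exact code):

```python
import random

# IDs of the training polygons (illustrative; in practice they come from the vector file)
polygon_ids = list(range(1000))
random.seed(42)
random.shuffle(polygon_ids)

# Split into 5 folds of roughly equal size
n_folds = 5
folds = [polygon_ids[i::n_folds] for i in range(n_folds)]

# For model k, the 4 other folds form the training set and fold k the validation set
for k in range(n_folds):
    valid_ids = folds[k]
    train_ids = [pid for i, fold in enumerate(folds) if i != k for pid in fold]
    print(f"model {k + 1}: {len(train_ids)} training polygons, {len(valid_ids)} validation polygons")
```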
I also removed shadow segments before classification (segments with a mean brightness below 70).
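For the shadow removal, this is roughly the kind of filter I applied (a sketch using geopandas; the file and attribute names are illustrative, not my actual data):

```python
import geopandas as gpd

# Drop shadow segments before classification: keep only segments whose mean
# brightness is at least 70 (file and column names are illustrative)
segments = gpd.read_file("segments_with_stats.shp")
lit_segments = segments[segments["mean_brightness"] >= 70]
lit_segments.to_file("segments_no_shadow.shp")
```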
Each of my classification models (from the TrainVectorClassifier algorithm) shows good results in the “log” window (about 0.95 for precision, recall and F-score).
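For reference, each model was trained with something like the call below (a sketch of a TrainVectorClassifier call; the feature names are illustrative and the parameter keys are written from memory, so they may need adjusting to your OTB version):

```python
import otbApplication as otb

# Train one of the 5 models on 4 folds and validate it on the held-out fold.
# Parameter keys are written from memory and may need adjusting to your OTB version.
train = otb.Registry.CreateApplication("TrainVectorClassifier")
train.SetParameterStringList("io.vd", ["training_folds_1_to_4.shp"])
train.SetParameterStringList("valid.vd", ["validation_fold_5.shp"])
train.SetParameterStringList("feat", ["mean_b1", "mean_b2", "mean_b3", "ndvi"])  # illustrative features
train.SetParameterStringList("cfield", ["class"])                                # label field
train.SetParameterString("classifier", "rf")                                     # e.g. random forest
train.SetParameterString("io.out", "model_1.txt")
train.ExecuteAndWriteOutput()
# Per-class precision, recall and F-score are printed in the log at the end of training
```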
However, when I tested my final classification (the fusion of the 5 classifications by majority voting) with the ComputeConfusionMatrix algorithm, using a new subset drawn from all my training vector data, I obtained metrics around 0.75 for F-score, precision and recall for each of the 3 categories.
I think my final classification is good, so its metrics should be roughly as high as those of my classification models (0.95–1 for F-score, precision and recall), yet its evaluation shows much poorer results (about 0.75 for each metric).
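The fusion and evaluation steps looked roughly like this (again a sketch; file names are illustrative and the parameter keys are written from memory, so they may differ in your OTB version):

```python
import otbApplication as otb

# Fuse the 5 classification maps by majority voting
fusion = otb.Registry.CreateApplication("FusionOfClassifications")
fusion.SetParameterStringList("il", ["classif_1.tif", "classif_2.tif", "classif_3.tif",
                                     "classif_4.tif", "classif_5.tif"])
fusion.SetParameterString("method", "majorityvoting")
fusion.SetParameterInt("nodatalabel", 0)       # label of the removed (shadow / nodata) pixels
fusion.SetParameterInt("undecidedlabel", 10)   # label written when there is no majority
fusion.SetParameterString("out", "classif_fused.tif")
fusion.ExecuteAndWriteOutput()

# Evaluate the fused map against an independent validation vector
cm = otb.Registry.CreateApplication("ComputeConfusionMatrix")
cm.SetParameterString("in", "classif_fused.tif")
cm.SetParameterString("ref", "vector")
cm.SetParameterString("ref.vector.in", "validation_subset.shp")
cm.SetParameterString("ref.vector.field", "class")
cm.SetParameterInt("nodatalabel", 0)           # pixels with this label are ignored
cm.SetParameterString("out", "confusion_matrix.csv")
cm.ExecuteAndWriteOutput()
```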
So my question is: how should a classification obtained by majority voting be evaluated? Does it make sense to use the ComputeConfusionMatrix algorithm on a classification fused by majority voting?
Should I average the metrics of the individual models to assess the final classification? Or is my validation vector the problem in the evaluation of the final classification?
I am afraid that the result of the ComputeConfusionMatrix algorithm is a fair way to evaluate the quality of a map, no matter which method was used to produce it. I suggest that you also validate each of your input classifications with the same method, to see whether the majority voting deteriorates your results (e.g. one map is very good but the 4 others fail on the same classes).
The classification results of the models are most of the time optimistic, because they are optimized for a subset of the reality. Another reason that could explain the difference is that the quality of the model is based on the number of objects that are well classified (relevant for building models), whereas a sample of points used by ComputeConfusionMatrix looks at the proportion in area (which is relevant to assess the quality of a final product). So if a few very large polygons are misclassified (e.g. a large crop field), this counts as only 1 error for the model, but its area is much larger than, say, 100 houses. Another possible difference between the model quality result and the map quality result is that the class proportions are not always preserved in the training samples. For a rigorous accuracy assessment, you should make sure that each point has the same sampling probability, or you should apply a correction to the confusion matrix.
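To make that last point more concrete, here is a small numpy sketch (with made-up numbers) of one common area-weighted correction applied to a confusion matrix expressed in sample counts:

```python
import numpy as np

# Confusion matrix in sample counts: rows = map class, columns = reference class
# (made-up numbers, three classes)
n = np.array([[85.0,  5.0, 10.0],
              [ 4.0, 90.0,  6.0],
              [ 8.0,  7.0, 85.0]])

# Proportion of the mapped area covered by each class (taken from the final map)
W = np.array([0.30, 0.25, 0.45])

# Convert counts to estimated area proportions: p[i, j] = W[i] * n[i, j] / n[i, :].sum()
p = (W[:, None] * n) / n.sum(axis=1, keepdims=True)

overall_accuracy = np.trace(p)                    # area-weighted overall accuracy
users_accuracy = np.diag(p) / p.sum(axis=1)       # per map class (comparable to precision)
producers_accuracy = np.diag(p) / p.sum(axis=0)   # per reference class (comparable to recall)
print(overall_accuracy, users_accuracy, producers_accuracy)
```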
So I will evaluate each of my classifications before the majority voting and see the results.
OK, I understand the difference between how the models' quality metrics and ComputeConfusionMatrix are computed.
Since I removed some segments (shadow segments) before classification, do you think that nodata pixels (sometimes crossed by the validation vectors) could bias the evaluation, given that ComputeConfusionMatrix is based on the pixels that are well classified?
And to get the best classification, is it better to create training vectors manually (e.g. by drawing the canopy of a tree) or to sample segments from the segmentation to obtain training vectors for each class?