Maximum sample size in TrainImagesClassifier


I have a question that I cannot find in any of the documentation or forum questions.

I am doing a random forest classification using TrainImagesClassifier.

When setting the maximum training/validation sample size per class to unlimited (-1 for sample.mt and sample.mv), the maximum sample size is still guided by the smallest class. My question is: why is this the case? I thought algorithms like random forest worked well with highly imbalanced classes. Is there a way to turn this off? Or am I misinterpreting the output?

output with sampling.vt=-1

2023-08-28 16:25:35 (INFO) TrainImagesClassifier: Sampling strategy : fit the number of samples based on the smallest class
2023-08-28 16:25:35 (INFO) TrainImagesClassifier: Sampling rates for image 1 : className requiredSamples totalSamples rate
1 1906 136419 0.0139717
2 1906 12516 0.152285
3 1906 56893 0.0335015
4 1906 35371 0.053886
5 1906 177884 0.0107148
6 1906 77607 0.0245596
7 1906 1906 1
8 1906 8907 0.213989

The classification performs well even with this constraint (in terms of F1-scores and so on), but this sampling strategy makes it hard to compare with other implementations of random forest that do not do this.
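The per-class rates in the log above follow directly from fitting every class to the smallest one: rate = smallest_count / class_count. A minimal Python sketch, using the class counts from the log:

```python
# Reproduce OTB's "fit the number of samples based on the smallest class"
# sampling rates from the per-class pixel counts in the log above.
total_samples = {1: 136419, 2: 12516, 3: 56893, 4: 35371,
                 5: 177884, 6: 77607, 7: 1906, 8: 8907}

required = min(total_samples.values())  # 1906, the smallest class
rates = {c: required / n for c, n in total_samples.items()}

for c, n in sorted(total_samples.items()):
    print(c, required, n, round(rates[c], 7))
```

Class 7 (the smallest) gets a rate of 1, and every other class is downsampled to 1906 pixels.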


Dear @grover,
Thank you for using OTB.
I think RF is still biased if your training set is highly unbalanced, but setting sample.mt=-1 should allow you to use an unbalanced training set anyway. This might be a bug.
In your message you say that you set sampling.vt=-1, but the parameter sampling.vt doesn’t exist. Are you sure you used the right parameters? If so, I will open an issue.

Julien :slight_smile:


Hi Julien,

Thanks for the reply. Yes, this is a highly unbalanced dataset. The data are minerals in a rock thin section, so it is hard to get training data on very rare minerals.

That (sampling.vt) was a typo. Here is the call from QGIS with default RF parameters, save changing sample.mt, sample.mv, and sample.bm:

{ 'io.il' : ['X:/biotite_ml/eds_6_3_ML/eds_6-3_vrt_7_redo.vrt'], 'io.vd' : ['X:/biotite_ml/eds_6_3_ML/6_3_train.shp'], 'io.valid' : None, 'io.imstat' : '', 'io.out' : 'TEMPORARY_OUTPUT', 'io.confmatout' : 'TEMPORARY_OUTPUT', 'cleanup' : True, 'sample.mt' : -1, 'sample.mv' : -1, 'sample.bm' : 0, 'sample.vtr' : 0.5, 'sample.vfn' : 'classvalue', 'elev.dem' : '', 'elev.geoid' : '', 'elev.default' : 0, 'classifier' : 'rf', 'classifier.rf.max' : 5, 'classifier.rf.min' : 10, 'classifier.rf.ra' : 0, 'classifier.rf.cat' : 10, 'classifier.rf.var' : 0, 'classifier.rf.nbtrees' : 100, 'classifier.rf.acc' : 0.01, 'rand' : 0 }
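The same call can be made outside QGIS with the otbcli launcher. A sketch with only the changed sampling parameters spelled out (parameter names per the OTB cookbook; RF defaults omitted; output paths are placeholders):

```shell
# Equivalent otbcli invocation of the QGIS call above (defaults omitted).
# model.rf and confusion.csv are placeholder output paths.
otbcli_TrainImagesClassifier \
  -io.il X:/biotite_ml/eds_6_3_ML/eds_6-3_vrt_7_redo.vrt \
  -io.vd X:/biotite_ml/eds_6_3_ML/6_3_train.shp \
  -sample.mt -1 \
  -sample.mv -1 \
  -sample.bm 0 \
  -sample.vtr 0.5 \
  -sample.vfn classvalue \
  -classifier rf \
  -classifier.rf.nbtrees 100 \
  -io.out model.rf \
  -io.confmatout confusion.csv
```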

Pertinent output:

2023-09-19 09:23:56 (INFO) TrainImagesClassifier: Sampling rates...
2023-09-19 09:23:56 (INFO) TrainImagesClassifier: Sampling strategy : fit the number of samples based on the smallest class

The sampling strategy still downsamples the majority classes to balance with the minority class. The documentation seems to say that this cannot be turned off, but the parameter description seems to indicate otherwise(?):

Description of sample.mt from the OTB cookbook:

Maximum size per class (in pixels) of the training sample list (default = 1000) (no limit = -1). If equal to -1, then the maximal size of the available training sample list per class will be equal to the surface area of the smallest class multiplied by the training sample ratio

So, it may be that it always does the downsampling regardless of whether sample.mt is set to -1. I am unsure if this is a bug now.


Well, what if you set sample.mt to the size of your most represented class (and still set sample.bm to zero)?

But according to what I read in the source code, your call should work as expected (use all samples for all classes).

Here’s the sampling strategy output using sample.mt set to the size of the most represented class, with sample.bm still at zero:

2023-09-19 11:48:10 (INFO) TrainImagesClassifier: Sampling rates...
2023-09-19 11:48:10 (INFO) TrainImagesClassifier: Sampling strategy : fit the number of samples based on the smallest class

This is the same as before. This only came up because I wanted to compare the OTB results to scikit-learn’s RF, get feature importances, and do cross-validated hyperparameter tuning. You can almost do a sampling strategy like OTB’s in sklearn (balanced, with a max sample limit), but sklearn bootstraps for every tree within the ensemble.


I opened an issue here. You can follow the progress directly there.


Thank you for the help.