Maximum sample size in TrainImagesClassifier

Hi,

I have a question that I cannot find an answer to in the documentation or in previous forum questions.

I am doing a random forest classification using TrainImagesClassifier.

When setting the max training/validation sample size to unlimited (-1 for sampling.mt/sampling.vt), the max sample size is still guided by the smallest class, even if sample.bm=0. Why is this the case? I thought algorithms like random forest worked well with highly imbalanced classes. Is there a way to turn this off, or am I misinterpreting the output?

output with sampling.mt=-1 sampling.vt=-1 sampling.bm=0

2023-08-28 16:25:35 (INFO) TrainImagesClassifier: Sampling strategy : fit the number of samples based on the smallest class
2023-08-28 16:25:35 (INFO) TrainImagesClassifier: Sampling rates for image 1 : className requiredSamples totalSamples rate
1 1906 136419 0.0139717
2 1906 12516 0.152285
3 1906 56893 0.0335015
4 1906 35371 0.053886
5 1906 177884 0.0107148
6 1906 77607 0.0245596
7 1906 1906 1
8 1906 8907 0.213989
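
For what it's worth, the rates above appear to be requiredSamples / totalSamples, with requiredSamples pinned to the size of the smallest class (class 7, 1906 pixels). A quick Python sketch of that reading (the counts are copied from the log; the formula is just my interpretation of the output, not OTB code):

# Class pixel counts copied from the log above (class id -> totalSamples)
total_samples = {
    1: 136419, 2: 12516, 3: 56893, 4: 35371,
    5: 177884, 6: 77607, 7: 1906, 8: 8907,
}

# "Fit the number of samples based on the smallest class":
# every class is capped at the size of the rarest class (class 7).
required = min(total_samples.values())  # 1906

for cls, total in sorted(total_samples.items()):
    print(cls, required, total, round(required / total, 7))
# class 1 -> 1906 / 136419 = 0.0139717, matching the first line of the log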

The classification performs well even with this constraint (in terms of F1-scores and so on), but this sampling strategy makes it hard to compare with other implementations of random forest that do not do this.

Thanks,
Grover

Dear @grover,
Thank you for using OTB.
I think RF is still biased if your training set is highly unbalanced. But setting sampling.mt=-1, sampling.mv=-1, and sampling.bm=0 should allow you to use an unbalanced training set anyway. This might be a bug.
In your message, you say that you set sampling.vt=-1, but the parameter sampling.vt doesn’t exist. Are you sure you used the right parameters? If so, I will open an issue.

Regards.
Julien :slight_smile:


Hi Julien,

Thanks for the reply. Yes, this is a highly unbalanced dataset. The data are minerals in a rock thin section, so it is hard to get training data on very rare minerals.

That (sampling.vt) was a typo. Here is the call from QGIS with default RF parameters, except for changing sampling.mt=-1, sampling.mv=-1, and sampling.bm=0:

{ 'io.il' : ['X:/biotite_ml/eds_6_3_ML/eds_6-3_vrt_7_redo.vrt'],
  'io.vd' : ['X:/biotite_ml/eds_6_3_ML/6_3_train.shp'],
  'io.valid' : None,
  'io.imstat' : '',
  'io.out' : 'TEMPORARY_OUTPUT',
  'io.confmatout' : 'TEMPORARY_OUTPUT',
  'cleanup' : True,
  'sample.mt' : -1,
  'sample.mv' : -1,
  'sample.bm' : 0,
  'sample.vtr' : 0.5,
  'sample.vfn' : 'classvalue',
  'elev.dem' : '',
  'elev.geoid' : '',
  'elev.default' : 0,
  'classifier' : 'rf',
  'classifier.rf.max' : 5,
  'classifier.rf.min' : 10,
  'classifier.rf.ra' : 0,
  'classifier.rf.cat' : 10,
  'classifier.rf.var' : 0,
  'classifier.rf.nbtrees' : 100,
  'classifier.rf.acc' : 0.01,
  'rand' : 0 }
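
For anyone who prefers to reproduce this outside of QGIS, here is a rough sketch of what I believe is the equivalent call through the OTB Python bindings (the paths are the same as above, io.out is a placeholder model path, and the parameter keys are taken from the QGIS dict; treat it as an untested sketch):

import otbApplication

# Sketch of the same training call via the OTB Python API
# (parameters copied from the QGIS dict above; io.out is a placeholder path)
app = otbApplication.Registry.CreateApplication("TrainImagesClassifier")
app.SetParameterStringList("io.il", ["X:/biotite_ml/eds_6_3_ML/eds_6-3_vrt_7_redo.vrt"])
app.SetParameterStringList("io.vd", ["X:/biotite_ml/eds_6_3_ML/6_3_train.shp"])
app.SetParameterString("io.out", "X:/biotite_ml/eds_6_3_ML/rf_model.txt")
app.SetParameterInt("sample.mt", -1)     # no limit on training samples per class
app.SetParameterInt("sample.mv", -1)     # no limit on validation samples per class
app.SetParameterInt("sample.bm", 0)      # bound-by-minority off (in principle)
app.SetParameterFloat("sample.vtr", 0.5)
app.SetParameterString("sample.vfn", "classvalue")
app.SetParameterString("classifier", "rf")
app.SetParameterInt("classifier.rf.nbtrees", 100)
app.SetParameterInt("rand", 0)
app.ExecuteAndWriteOutput()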

Pertinent output:

2023-09-19 09:23:56 (INFO) TrainImagesClassifier: Sampling rates...
2023-09-19 09:23:56 (INFO) TrainImagesClassifier: Sampling strategy : fit the number of samples based on the smallest class

The sampling strategy still downsamples the majority classes to balance with the minority class. The documentation seems to say that this cannot be turned off, but the sampling.bm parameter seems to indicate otherwise(?):

Description of sampling.mt from the OTB cookbook:

Maximum size per class (in pixels) of the training sample list (default = 1000) (no limit = -1). If equal to -1, then the maximal size of the available training sample list per class will be equal to the surface area of the smallest class multiplied by the training sample ratio
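
If I read that correctly, with my smallest class at 1906 pixels and sample.vtr at the default 0.5, the available sample list would be capped at 1906 per class (the requiredSamples column in the log), of which roughly 1906 × 0.5 = 953 end up in the training list, whenever sampling.mt=-1.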

So it may be that the downsampling always happens regardless of sampling.bm=0. I am now unsure whether this is a bug.

Thanks,
Grover

Well, what if you set sampling.mt to the size of your most represented class (and still set sampling.bm to zero)?

But according to what I read in the source code, your call should work as expected (use all samples for all classes).

Here’s the sampling strategy output using sampling.mt=136419 (the size of the most represented class) and sampling.bm=0:

2023-09-19 11:48:10 (INFO) TrainImagesClassifier: Sampling rates...
2023-09-19 11:48:10 (INFO) TrainImagesClassifier: Sampling strategy : fit the number of samples based on the smallest class

This is the same as before. This only came up because I wanted to compare the OTB results to scikit-learn’s RF, get feature importances, and do cross-validation hyperparameter tuning. You can almost reproduce a sampling strategy like OTB’s in sklearn (balanced with a max sample limit), but sklearn bootstraps for every tree within the ensemble.
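
In case it helps anyone making the same comparison, this is roughly how I approximate OTB’s strategy on the scikit-learn side: downsample every class to the minority class size once, before fitting (a sketch with placeholder X/y, not my actual data):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def downsample_to_smallest(X, y, seed=0):
    """Keep at most n_min samples per class, where n_min is the size of the
    rarest class, i.e. a one-shot version of 'fit to the smallest class'."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# X, y would be the per-pixel features and mineral labels (placeholders here)
# X_bal, y_bal = downsample_to_smallest(X, y)
# rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)

Even then, the difference I mentioned remains: this draws one balanced subset up front, whereas scikit-learn still bootstraps from it for every tree.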

Thanks,
Grover

I opened an issue here. You can follow the progress directly there.

Julien,

Thank you for the help.

-Grover