Improving classification performance on imbalanced class distributions.

1. Tasks Overview

  1. Image Classification

  2. Masked Language Modeling

  3. Machine Translation

2. Image classification

2.1. Datasets

Run either of the commands below to download and prepare the datasets:

  • data/imgcls/get-data.sh: includes the data augmentations provided by the dataset creators.

  • data/imgcls/get-data-uniq.sh: excludes the augmentations provided by the dataset creators. This is useful if you apply dynamic augmentations (we do, as part of the PyTorch data loader).

cd data/imgcls
./get-data.sh        # output dirs: hirise and msl
./get-data-uniq.sh   # output dirs: hirise-uniq and msl-uniq

These scripts download and prepare the HiRISE and MSL datasets.

2.2. Model

3. Masked Language Modeling (MLM)

MLM is based on 🤗 Transformers.

3.1. Datasets

Automatically downloaded.

3.2. Model

See the imblearn/mlm/ directory.

4. (Neural) Machine Translation

4.1. Datasets

TODO:
  • Hindi-English

4.2. Models

5. Configuration file: conf.yml

Here is the basic schema of conf.yml

model:
    name: <name>
    args:
        key1: <value>
        key2: <value>
optimizer:
    name: <name>
    args:
        key1: <value>
schedule:
    name: <name>
    args:
        key1: <value>
loss:
    name: <name>
    args:
        key1: <value>
train:
    data: <path/data/train>
    batch_size: 2    # number of images
    max_step: 300    # maximum number of steps
    max_epoch: 100   # maximum number of epochs
    checkpoint: 100  # validate and checkpoint every these many steps
    keep_in_mem: true  # keep datasets in memory

validation:
    data: <path/data/val>
    batch_size: 10
    patience: 10
    by: macro_f1

tests:
    #test: <path/data/test1>       # don't use tests until the end
    val: <path/data/val>

5.1. Models

5.1.1. Image Classifier

model:
    name: image_classifier
    args:
        n_classes: 19
        intermediate: 40
        dropout: 0.2
        parent: resnext50_32x4d    # torchvision.models.<this>
        pretrained: true           # initialize pretrained parent model
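
The args above suggest the classifier wraps a torchvision parent model and replaces its final layer with an intermediate layer, dropout, and a classification head. Here is a minimal sketch of that idea, assuming this structure (names and wiring are illustrative, not the repository's actual code):

# Sketch only: an assumed mapping from the args above to a torchvision model.
import torch.nn as nn
import torchvision.models

def build_image_classifier(n_classes, intermediate, dropout, parent, pretrained=True):
    parent_model = getattr(torchvision.models, parent)(pretrained=pretrained)
    in_features = parent_model.fc.in_features      # resnet/resnext expose .fc
    parent_model.fc = nn.Sequential(               # replace the final layer
        nn.Linear(in_features, intermediate),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(intermediate, n_classes),
    )
    return parent_model

model = build_image_classifier(n_classes=19, intermediate=40, dropout=0.2,
                               parent='resnext50_32x4d', pretrained=True)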

5.1.2. Masked Language Model

Work in progress

5.2. Optimizers

The following optimizers are supported:

  • adam=torch.optim.Adam

  • sgd=torch.optim.SGD

  • adagrad=torch.optim.Adagrad

  • adam_w=torch.optim.AdamW

  • adadelta=torch.optim.Adadelta

  • sparse_adam=torch.optim.SparseAdam

Example for adam:
optimizer:
    name: adam
    args:
        lr: 0.0005
        betas: [0.9, 0.999]
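
The name selects a torch.optim class from the list above, and args are passed to its constructor. A minimal sketch of how such a config could be resolved (the helper below is hypothetical, not the repository's code):

# Sketch only: resolving the optimizer block from conf.yml.
import torch.optim

OPTIMIZERS = {
    'adam': torch.optim.Adam,
    'sgd': torch.optim.SGD,
    'adagrad': torch.optim.Adagrad,
    'adam_w': torch.optim.AdamW,
    'adadelta': torch.optim.Adadelta,
    'sparse_adam': torch.optim.SparseAdam,
}

def make_optimizer(conf, model_params):
    return OPTIMIZERS[conf['name']](model_params, **conf.get('args', {}))

# usage: make_optimizer({'name': 'adam', 'args': {'lr': 0.0005, 'betas': [0.9, 0.999]}},
#                       model.parameters())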

5.3. Schedule

Schedule is an optional component. Comment or delete the schedule: block in conf.yml to disable it.

The following learning schedules are supported:

5.3.1. inverse_sqrt: Inverse Square Root

schedule:
    name: inverse_sqrt
    args:
        peak_lr: 0.0005
        warmup: 100
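
The exact formula is not given here; a common inverse-square-root schedule with linear warmup looks like the sketch below (an assumption, not a quote from this implementation):

# Assumed form: linear warmup to peak_lr, then decay proportional to 1/sqrt(step).
def inverse_sqrt_lr(step, peak_lr=0.0005, warmup=100):
    step = max(step, 1)
    if step <= warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup / step) ** 0.5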

5.3.2. noam: Noam

schedule:
    name: noam
    args:
        scaler: 2
        model_dim: 100
        warmup: 2000
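
The Noam schedule from "Attention Is All You Need" (Vaswani et al., 2017) is usually written as below; treating scaler as a plain multiplier is an assumption:

# Assumed Noam schedule, scaled by `scaler`.
def noam_lr(step, scaler=2, model_dim=100, warmup=2000):
    step = max(step, 1)
    return scaler * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)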

5.4. Loss

5.4.1. Cross Entropy

loss:
    name: cross_entropy
    args:
        weight_by: inverse_frequency

5.4.2. Weighted Cross Entropy

Manual weights for classes

loss.args.weight can be set to a list with one weight per class, matching the class order in <experiment-dir/classes.csv>.

Using heuristics

Heuristics based on class frequencies in the training corpus can be used to obtain weights. Let \$c\$ be a class, \$f_c\$ its frequency in the training corpus, and \$w_c\$ the weight to be inferred by a heuristic.

Choose a heuristic by setting loss.args.weight_by in conf.yml. The valid values for loss.args.weight_by are listed below (see the sketch after this list):

  1. inverse_frequency
    \$w_c \propto 1/f_c \$

  2. inverse_log
    \$w_c \propto 1/\log(f_c) \$

  3. inverse_sqrt
    \$w_c \propto 1/\sqrt{f_c}\$

  4. information_content
    Uses the information content of each class.
    Let \$\pi_c = \frac{f_c}{\sum_i f_i}\$ be the class probability in the training corpus (i.e. the prior); then
    \$w_c = -\log_2(\pi_c)\$
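
A minimal sketch of these heuristics, computing one weight per class from raw frequencies (illustrative only, not the repository's code):

# Sketch of the weighting heuristics listed above.
import math

def class_weights(freqs, weight_by='inverse_frequency'):
    if weight_by == 'inverse_frequency':
        return [1.0 / f for f in freqs]
    if weight_by == 'inverse_log':
        return [1.0 / math.log(f) for f in freqs]       # assumes f_c > 1
    if weight_by == 'inverse_sqrt':
        return [1.0 / math.sqrt(f) for f in freqs]
    if weight_by == 'information_content':
        total = sum(freqs)
        return [-math.log2(f / total) for f in freqs]
    raise ValueError(f'unknown heuristic: {weight_by}')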

Effective number of samples

Instead of using raw frequencies from the training corpus, we can also use effective frequencies (i.e. the effective number of samples).

Example:

loss:
    name: cross_entropy
    args:
        weight_by: inverse_frequency (1)
        # to use effective number of samples
        eff_frequency: true        (2)
        eff_beta: 0.99            (3)
1 Any of the supported heuristics can be used here.
2 Set eff_frequency: true to enable effective frequencies.
3 \$\beta \in [0, 1)\$ is required when eff_frequency=true.

The effective number of samples acts as a smoothing function for frequencies. If \$\beta = 0\$, every class gets an effective frequency of 1 (which results in unweighted cross entropy); as \$\beta \rightarrow 1\$, the effective frequencies approach the raw frequencies (i.e. no smoothing).
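
This description matches the effective number of samples of Cui et al. (2019), \$E_c = \frac{1 - \beta^{f_c}}{1 - \beta}\$. Assuming that is the formula used here, a sketch of its behaviour:

# Sketch: effective frequency E_c = (1 - beta**f_c) / (1 - beta).
# beta = 0 gives E_c = 1 for every class; beta -> 1 gives E_c -> f_c.
def effective_frequency(f_c, beta=0.99):
    return (1.0 - beta ** f_c) / (1.0 - beta)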

5.4.3. Focal Loss

Implements loss \$= -\sum_c y_c (1-p_c)^\gamma \log(p_c)\$, where \$y_c\$ is the ground truth label, \$p_c\$ is the model's output probability, and \$\gamma\$ is a hyperparameter.

loss:
    name: focal_loss
    args:
        gamma: 2
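
A minimal PyTorch sketch that follows the formula above (not necessarily the repository's implementation):

# Sketch: focal loss for a batch of logits and integer class targets.
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    log_probs = F.log_softmax(logits, dim=-1)
    log_p = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_c of the true class
    p = log_p.exp()
    return -((1 - p) ** gamma * log_p).mean()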

5.4.4. Label Smoothing

Extends cross_entropy

loss:
    name: smooth_cross_entropy
    args: (1)
        #weight_by: inverse_frequency
        #eff_frequency: true
        #eff_beta: 0.99
        smooth_epsilon: 0.05
1 Label smoothing works on top of cross_entropy, so all the args of cross_entropy such as weight_by are valid here.
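
For reference, label smoothing with \$\epsilon\$ replaces the one-hot target with a softened distribution. A sketch of one common variant, which spreads \$\epsilon\$ uniformly over the other classes; whether this matches the implementation here is an assumption:

# Sketch: true class gets 1 - epsilon; the rest is spread over the other K-1 classes.
import torch

def smooth_targets(targets, n_classes, epsilon=0.05):
    smoothed = torch.full((targets.size(0), n_classes), epsilon / (n_classes - 1))
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - epsilon)
    return smoothed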

5.4.5. Balanced Label Smoothing

Extends cross_entropy

This is experimental.
loss:
    name: smooth_cross_entropy
    args:
        #weight_by: inverse_frequency
        smooth_epsilon: 0.05
        smooth_weight_by: inverse_frequency

5.5. Macro Cross Entropy

Extends smooth_cross_entropy

This loss does not accept weight_by; instead it computes a macro average, i.e. an unweighted average across classes. Since it extends smooth_cross_entropy, the label smoothing parameters (smooth_epsilon and smooth_weight_by) are supported and optional.

loss:
    name: macro_cross_entropy
    args:
        smooth_epsilon: 0.1
        smooth_weight_by: inverse_frequency
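
A sketch of the macro-averaging idea without smoothing, assuming it means giving each class equal weight by averaging per-class mean losses (an interpretation, not the repository's code):

# Sketch: unweighted mean of per-class mean cross entropy losses.
import torch
import torch.nn.functional as F

def macro_cross_entropy(logits, targets):
    per_example = F.cross_entropy(logits, targets, reduction='none')
    class_means = [per_example[targets == c].mean() for c in targets.unique()]
    return torch.stack(class_means).mean()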

5.6. Trainer

This example is for the image classifier model and is subject to change.

train:
    data: <path/data/train>   (1)
    batch_size: 20           # number of images
    max_step: 200_000        # maximum number of steps (2)
    min_step: 10_000         # minimum number of steps; ignore early stopping until then
    max_epoch: 100           # maximum number of epochs (2)
    checkpoint: 1000         # validate and checkpoint every these many steps
    keep_in_mem: true        # keep datasets in memory (3)
1 The directory specified by path should be compatible with torchvision.datasets.ImageFolder
2 max_step or max_epoch whichever comes earlier
3 The default is true, which keeps the data in CPU memory. To use GPU memory, set keep_in_mem: cuda. To disable, set it to false or null.
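
For reference, an ImageFolder-compatible directory has one subdirectory per class. A minimal loading sketch (the path is the placeholder from the config; the transform is illustrative):

# Sketch: loading a directory laid out as <path/data/train>/<class_name>/<image>.jpg
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_ds = datasets.ImageFolder('<path/data/train>', transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=20, shuffle=True)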

5.7. Validation

validation:
    data: <path/data/val>  (1)
    batch_size: 10         # this can be larger than train.batch_size
    patience: 10           # patience for early stop
    by: macro_f1           # metric to use for early stop
1 The directory specified by path should be compatible with torchvision.datasets.ImageFolder