DenisVorotyntsev

CategoricalEncodingBenchmark

Benchmarking different approaches for categorical encoding

Reproducibility of results

Requirements

pip install -r requirements.txt

Benchmark the dataset

To benchmark encoders for your dataset:

  1. Install the libraries listed in `requirements.txt`
  2. Process the dataset as shown in `notebooks/1-prepare-datasets.ipynb`
  3. Add the name of the dataset to `dataset_list` in `src/run_experiment.py`
  4. Run `python run_experiment.py`
  5. Run `notebooks/2-show-results.ipynb`

Used datasets and raw scores

All datasets except Poverty_A, Poverty_B, and Poverty_C come from different domains; they differ in the number of observations and in the number of categorical and numerical features. The objective for all datasets is binary classification. Preprocessing was simple: I removed all time-based columns, so the remaining columns were either categorical or numerical. Details of the experiments can be found in my blog post, Benchmarking Categorical Encoders.

Table 1.1 Used datasets

| Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Telecom | 7.0k | 4.2k | 2.8k | 20 | 16 | Churn prediction for telecom data |
| Adult | 48.8k | 29.3k | 19.5k | 15 | 8 | Predict if a person's income exceeds 50k |
| Employee | 32.7k | 19.6k | 13.1k | 10 | 9 | Predict an employee's access needs, given his/her job role |
| Credit | 307.5k | 184.5k | 123k | 121 | 18 | Loan repayment |
| Mortgages | 45.6k | 27.4k | 18.2k | 20 | 9 | Predict if a house mortgage is funded |
| Promotion | 54.8k | 32.8k | 21.9k | 13 | 5 | Predict if an employee will get a promotion |
| Kick | 72.9k | 43.7k | 29.1k | 32 | 19 | Predict if a car purchased at auction is a good or bad buy |
| Kdd_upselling | 50k | 30k | 20k | 230 | 40 | Predict up-selling for a customer |
| Taxi | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
| Poverty_A | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether a given household in a given country is poor |
| Poverty_B | 20.2k | 12.1k | 8.1k | 224 | 191 | Predict whether a given household in a given country is poor |
| Poverty_C | 29.9k | 17.9k | 11.9k | 41 | 35 | Predict whether a given household in a given country is poor |

The ROC AUC scores for each dataset are presented in the tables below. Note: some experiments required too much memory to run, so some values are missing.

Table 1.2 ROC AUC scores for None Validation

| | telecom | adult | employee | credit | mortgages | promotion | kick | kddupselling | taxi | povertyA | povertyB | povertyC |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| BackwardDifferenceEncoder | 0.6454 | 0.8555 | 0.5006 | 0.7442 | 0.5997 | 0.6482 | | | | 0.5149 | 0.5484 | 0.4945 |
| CatBoostEncoder | 0.7666 | 0.868 | 0.5004 | 0.7478 | 0.6279 | 0.7811 | 0.6583 | 0.8549 | 0.5477 | 0.5179 | 0.5638 | 0.5427 |
| FrequencyEncoder | 0.8405 | 0.9291 | 0.807 | 0.7593 | 0.6949 | 0.9052 | 0.7907 | 0.8643 | 0.5656 | 0.7276 | 0.6164 | 0.7177 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.83 | 0.7601 | 0.7001 | 0.9079 | | | | 0.7325 | 0.6343 | 0.7168 |
| JamesSteinEncoder | 0.7195 | 0.8688 | 0.5003 | 0.7485 | 0.6049 | 0.7984 | 0.6592 | 0.8516 | 0.5432 | 0.4918 | 0.5304 | 0.4836 |
| LeaveOneOutEncoder | 0.5 | 0.5214 | 0.6233 | 0.4957 | 0.5 | 0.5457 | 0.5027 | 0.5 | 0.5 | 0.5006 | 0.5002 | 0.4527 |
| MEstimateEncoder | 0.6944 | 0.8617 | 0.4998 | 0.7368 | 0.6086 | 0.8156 | 0.653 | 0.8448 | 0.5091 | 0.5254 | 0.434 | 0.4528 |
| OrdinalEncoder | 0.7409 | 0.8616 | 0.501 | 0.7445 | 0.6008 | 0.7124 | 0.6531 | 0.8448 | 0.5498 | 0.473 | 0.4683 | 0.5611 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | | | | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.7195 | 0.8696 | 0.5003 | 0.7483 | 0.6064 | 0.7971 | 0.6594 | 0.8483 | 0.5428 | 0.4955 | 0.5401 | 0.4751 |
| WOEEncoder | 0.7056 | 0.8645 | 0.5012 | 0.7439 | 0.615 | 0.7345 | 0.6398 | 0.844 | 0.5485 | 0.478 | 0.5356 | 0.4671 |

Table 1.3 ROC AUC scores for Single Validation

| | telecom | adult | employee | credit | mortgages | promotion | kick | kddupselling | taxi | povertyA | povertyB | povertyC |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| BackwardDifferenceEncoder | 0.8382 | 0.9293 | 0.7569 | 0.7595 | 0.6894 | 0.9064 | | | | 0.7323 | 0.6151 | 0.7108 |
| CatBoostEncoder | 0.8392 | 0.9292 | 0.8498 | 0.7594 | 0.6951 | 0.8918 | 0.7901 | 0.8654 | 0.5844 | 0.7429 | 0.6902 | 0.7333 |
| FrequencyEncoder | 0.8392 | 0.9293 | 0.8138 | 0.7592 | 0.6937 | 0.9055 | 0.7902 | 0.8634 | 0.582 | 0.7302 | 0.6128 | 0.7195 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.8344 | 0.7597 | 0.7027 | 0.9083 | | | | 0.7297 | 0.6374 | 0.7196 |
| JamesSteinEncoder | 0.8388 | 0.9292 | 0.7817 | 0.7597 | 0.667 | 0.9053 | 0.5835 | 0.726 | 0.5898 | 0.7303 | 0.6764 | 0.7217 |
| LeaveOneOutEncoder | 0.5 | 0.5182 | 0.6121 | 0.4997 | 0.5 | 0.5403 | 0.4682 | 0.5 | 0.5 | 0.5103 | 0.5 | 0.4959 |
| MEstimateEncoder | 0.8394 | 0.929 | 0.7353 | 0.7593 | 0.6957 | 0.9054 | 0.5877 | 0.5953 | 0.5946 | 0.7302 | 0.6493 | 0.7076 |
| OrdinalEncoder | 0.8404 | 0.9299 | 0.8274 | 0.7585 | 0.6917 | 0.9078 | 0.7809 | 0.8465 | 0.6034 | 0.7337 | 0.6635 | 0.742 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | | | | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.8388 | 0.9293 | 0.815 | 0.7599 | 0.6702 | 0.9057 | 0.7042 | 0.713 | 0.5894 | 0.7292 | 0.6742 | 0.7207 |
| WOEEncoder | 0.8393 | 0.9294 | 0.8325 | 0.7599 | 0.6801 | 0.9056 | 0.7172 | 0.8391 | 0.5903 | 0.7279 | 0.6737 | 0.7224 |

Table 1.4 ROC AUC scores for Double Validation

| | telecom | adult | employee | credit | mortgages | promotion | kick | kddupselling | taxi | povertyA | povertyB | povertyC |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| CatBoostEncoder | 0.8394 | 0.9293 | 0.8529 | 0.7592 | 0.6967 | 0.9056 | 0.7899 | 0.8633 | 0.6031 | 0.7418 | 0.6902 | 0.7343 |
| FrequencyEncoder | 0.8371 | 0.9221 | 0.5563 | 0.755 | 0.6582 | 0.8749 | 0.7655 | 0.8551 | 0.5657 | 0.6873 | 0.6037 | 0.6961 |
| JamesSteinEncoder | 0.8398 | 0.9296 | 0.8489 | 0.7598 | 0.6981 | 0.905 | 0.7901 | 0.8628 | 0.6033 | 0.7412 | 0.6895 | 0.7366 |
| LeaveOneOutEncoder | 0.8393 | 0.9295 | 0.8496 | 0.7595 | 0.6963 | 0.9055 | 0.7902 | 0.8635 | 0.602 | 0.7416 | 0.6931 | 0.7345 |
| MEstimateEncoder | 0.8405 | 0.9292 | 0.8125 | 0.7597 | 0.6939 | 0.9063 | 0.7881 | 0.863 | 0.5984 | 0.7375 | 0.6801 | 0.7204 |
| TargetEncoder | 0.8393 | 0.9294 | 0.8537 | 0.7596 | 0.6954 | 0.9057 | 0.7909 | 0.8643 | 0.6025 | 0.7415 | 0.6903 | 0.7352 |
| WOEEncoder | 0.8401 | 0.9294 | 0.824 | 0.7599 | 0.6977 | 0.9041 | 0.7905 | 0.8631 | 0.6011 | 0.7407 | 0.6911 | 0.7345 |

Results

To determine the best encoder, I min-max scaled the ROC AUC scores within each dataset and then averaged the scaled scores of each encoder across datasets. The result is an average performance score for each encoder (higher is better). The encoders' performance scores for each type of validation are shown in Tables 2.1–2.3.
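The scoring described above can be sketched in a few lines of pandas. This is a minimal illustration, not the repository's code: the function name `performance_scores` is hypothetical, the input is a small subset of Table 1.2, and skipping missing values when averaging is an assumption, so the output will not match Table 2.1.

```python
import pandas as pd

nan = float("nan")

# A few ROC AUC values copied from Table 1.2 (None Validation).
# Rows are encoders, columns are datasets; NaN marks a missing experiment.
scores = pd.DataFrame(
    {
        "telecom": [0.8405, 0.8404, 0.7666, 0.5],
        "adult": [0.9291, 0.9297, 0.868, 0.5214],
        "kick": [0.7907, nan, 0.6583, 0.5027],
    },
    index=[
        "FrequencyEncoder",
        "HelmertEncoder",
        "CatBoostEncoder",
        "LeaveOneOutEncoder",
    ],
)


def performance_scores(scores: pd.DataFrame) -> pd.Series:
    """Min-max scale each dataset column, then average per encoder."""
    col_min = scores.min()  # per-dataset minimum, NaN ignored
    col_max = scores.max()  # per-dataset maximum, NaN ignored
    scaled = (scores - col_min) / (col_max - col_min)
    # Assumption: missing experiments are skipped rather than counted as 0.
    return scaled.mean(axis=1).sort_values(ascending=False)


result = performance_scores(scores)
print(result)
```

Scaling within each dataset keeps an easy dataset (where every encoder scores high) from dominating the average: only the relative position of an encoder between the worst and best score on that dataset counts.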

To determine the best validation strategy, I compared the top score on each dataset for each type of validation. The score improvements (in the top score per dataset and in the average performance score per encoder) are shown in Tables 2.4 and 2.5 below.
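As an illustration of the top-score comparison, the sketch below takes the best ROC AUC per dataset under two validation types and reports the difference. The formula (absolute difference multiplied by 100) is inferred from the published tables rather than taken from the repository's code, and the function name `top_score_improvement` is hypothetical. The inputs are a subset of Tables 1.2 and 1.3.

```python
import pandas as pd

nan = float("nan")
encoders = ["FrequencyEncoder", "HelmertEncoder", "CatBoostEncoder", "OrdinalEncoder"]

# Subset of Table 1.2 (None Validation).
none_val = pd.DataFrame(
    {
        "employee": [0.807, 0.83, 0.5004, 0.501],
        "taxi": [0.5656, nan, 0.5477, 0.5498],
    },
    index=encoders,
)

# Subset of Table 1.3 (Single Validation).
single_val = pd.DataFrame(
    {
        "employee": [0.8138, 0.8344, 0.8498, 0.8274],
        "taxi": [0.582, nan, 0.5844, 0.6034],
    },
    index=encoders,
)


def top_score_improvement(before: pd.DataFrame, after: pd.DataFrame) -> pd.Series:
    """Difference between the best score per dataset, times 100.

    Assumption: the improvement is an absolute difference, not a relative one.
    """
    return ((after.max() - before.max()) * 100).round(2)


improvement = top_score_improvement(none_val, single_val)
print(improvement)
```

For these two datasets the sketch gives 1.98 for employee and 3.78 for taxi, which agrees with the None -> Single column of Table 2.4.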

Table 2.1 Encoders performance scores - None Validation

| | None Validation |
|:---|:---:|
| HelmertEncoder | 0.9517 |
| SumEncoder | 0.9434 |
| FrequencyEncoder | 0.9176 |
| CatBoostEncoder | 0.5728 |
| TargetEncoder | 0.5174 |
| JamesSteinEncoder | 0.5162 |
| OrdinalEncoder | 0.4964 |
| WOEEncoder | 0.4905 |
| MEstimateEncoder | 0.4501 |
| BackwardDifferenceEncoder | 0.4128 |
| LeaveOneOutEncoder | 0.0697 |

Table 2.2 Encoders performance scores - Single Validation

| | Single Validation |
|:---|:---:|
| CatBoostEncoder | 0.9726 |
| OrdinalEncoder | 0.9694 |
| HelmertEncoder | 0.9558 |
| SumEncoder | 0.9434 |
| WOEEncoder | 0.9326 |
| FrequencyEncoder | 0.9315 |
| BackwardDifferenceEncoder | 0.9108 |
| TargetEncoder | 0.8915 |
| JamesSteinEncoder | 0.8555 |
| MEstimateEncoder | 0.8189 |
| LeaveOneOutEncoder | 0.0729 |

Table 2.3 Encoders performance scores - Double Validation

| | Double Validation |
|:---|:---:|
| JamesSteinEncoder | 0.9918 |
| CatBoostEncoder | 0.9917 |
| TargetEncoder | 0.9916 |
| LeaveOneOutEncoder | 0.9909 |
| WOEEncoder | 0.9838 |
| MEstimateEncoder | 0.9686 |
| FrequencyEncoder | 0.8018 |

Table 2.4 Top score improvement (percent)

| | None -> Single | Single -> Double |
|:---|:---:|:---:|
| telecom | 0.00 | 0.01 |
| adult | 0.02 | -0.03 |
| employee | 1.98 | 0.39 |
| credit | -0.01 | -0.00 |
| mortgages | 0.26 | -0.47 |
| promotion | 0.04 | -0.20 |
| kick | -0.05 | 0.06 |
| kddupselling | 0.10 | -0.11 |
| taxi | 3.78 | -0.01 |
| povertyA | 0.74 | -0.11 |
| povertyB | 5.59 | 0.29 |
| povertyC | 0.48 | -0.54 |

Table 2.5 Encoders performance scores improvement (percent)

| | None -> Single | Single -> Double |
|:---|:---:|:---:|
| BackwardDifferenceEncoder | 27.20 | |
| CatBoostEncoder | 20.10 | 0.40 |
| FrequencyEncoder | 0.30 | -4.90 |
| HelmertEncoder | 0.20 | |
| JamesSteinEncoder | 17.70 | 6.30 |
| LeaveOneOutEncoder | 0.20 | 53.20 |
| MEstimateEncoder | 18.90 | 8.10 |
| OrdinalEncoder | 24.10 | |
| SumEncoder | 0.00 | |
| TargetEncoder | 19.60 | 4.20 |
| WOEEncoder | 23.40 | 1.90 |
