Scaling XGBoost: Training Classification Models with Thousands of Classes

By Martin Penchev
10 min read

Training large-scale classification models presents unique computational challenges, particularly when dealing with thousands of classes. While XGBoost is highly optimized for performance, its behavior in multi-GPU environments can lead to unexpected memory constraints that catch many practitioners off guard.

Understanding XGBoost GPU Support

XGBoost offers comprehensive support for CUDA-capable GPUs 1, allowing for significant acceleration of most internal algorithms, including model training, prediction, and evaluation. To enable GPU acceleration, the device parameter should be specified as cuda. For systems with multiple GPUs, a specific device can be targeted using the cuda:<ordinal> syntax (e.g., cuda:0 for the first GPU). Furthermore, XGBoost supports fully distributed multi-GPU training on Linux platforms through integrations with Dask 2 and Spark, enabling it to scale across multiple nodes.
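
For a quick sanity check that the GPU path works at all, a single-device run needs nothing more than the device parameter. The snippet below is a minimal sketch assuming XGBoost 2.0 or later, using synthetic placeholder data:

import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic placeholder data; substitute your own feature matrix and labels.
X, y = make_classification(n_samples=10_000, n_features=50, n_classes=5,
                           n_informative=20, random_state=0)

# device="cuda" enables GPU training; "cuda:0" would pin the job to the first GPU.
clf = xgb.XGBClassifier(device="cuda", tree_method="hist")
clf.fit(X, y)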

When running on a GPU, XGBoost uses a highly optimized data structure for storing the dataset in device memory. It employs a compressed ELLPACK format, where data is quantized and stored as integers with a minimum bit length. This format is efficient for parallel computation and can significantly reduce memory usage compared to standard sparse matrix representations. However, even with these optimizations, certain training configurations can lead to memory pressure.

When using Dask for distributed training, frameworks like dask_cudf can load and partition data across multiple GPUs. XGBoost's Dask interface seamlessly handles the conversion from these distributed DataFrame partitions into its internal compressed format on each worker. This process is designed to be highly efficient, minimizing data transfer and memory overhead. While the initial dataset may be managed efficiently, the core training algorithm requires its own significant allocations for intermediate results. The most substantial of these working memory structures, particularly in multi-class scenarios, is the gradient-pair matrix.
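
Before turning to that matrix, here is a minimal sketch of the loading flow just described. It assumes the training data sits in Parquet files, a `client` is already connected to a GPU cluster, and the path and label column name are placeholders:

import dask_cudf
from xgboost import dask as dxgb

# Assumes `client` is connected to a running dask_cuda cluster
# (see the LocalCUDACluster example later in this article).
df = dask_cudf.read_parquet("/data/train/*.parquet")  # placeholder path
X = df.drop(columns=["label"])
y = df["label"]

# DaskQuantileDMatrix quantizes each partition directly into XGBoost's
# compressed ELLPACK representation, avoiding a second full copy in GPU memory.
dtrain = dxgb.DaskQuantileDMatrix(client, X, y)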

The Gradient-Pair Matrix

When XGBoost is configured for multi-class classification (using the multi:softprob objective), it must first allocate a large matrix on each GPU to hold intermediate training values. This structure, the gradient-pair matrix, is fundamental to the algorithm, and its size is determined by the number of training samples and classes.

The memory calculation is straightforward, but the outcome can be substantial:

Memory Needed = Rows × Classes × 8 bytes

The "8 bytes" in this formula arises from the storage required for the gradient and hessian values, which are fundamental to the gradient boosting algorithm. For each entry in the matrix, XGBoost stores a pair of 32-bit (4-byte) single-precision floating-point numbers: one for the gradient and one for the second-order gradient, or hessian. This precision level is standard for most GPU-accelerated machine learning tasks, as it provides a good balance between numerical accuracy, memory footprint, and computational speed.

This is an internal algorithmic detail and is not exposed as a user-configurable parameter; therefore, it is not possible, for instance, to switch to 16-bit half-precision to reduce memory. The memory constraint imposed by this matrix is fundamental, reinforcing the need to change the overall training strategy rather than attempting to tweak low-level parameters.

For a dataset with 665,148 samples and 5,190 classes, the memory required on each GPU is:

665,148 × 5,190 × 8 bytes ≈ 25.7 GiB per GPU

For a GPU like the NVIDIA A10G with 24 GiB of memory, this allocation fails. Worse, the gradient-pair matrix is replicated in full on every GPU in the cluster, so simply adding more GPUs does not relieve the memory pressure on individual devices: each GPU must still accommodate the entire matrix.
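
The same arithmetic is easy to check in a few lines of Python before launching a job; the figures below reproduce the example above:

def gradient_matrix_gib(n_rows: int, n_classes: int, bytes_per_pair: int = 8) -> float:
    """Per-GPU memory needed for the gradient-pair matrix, in GiB."""
    return n_rows * n_classes * bytes_per_pair / 2**30

# The dataset from this article: 665,148 rows and 5,190 classes.
print(f"{gradient_matrix_gib(665_148, 5_190):.2f} GiB per GPU")  # ~25.72 GiB, more than a 24 GiB A10G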

Three Strategies to Overcome the Memory Wall

Instead of focusing on minor setting adjustments, a change in core strategy is required. These approaches should be viewed as distinct architectural patterns for solving the problem, each with its own trade-offs in terms of complexity, speed, and resource utilization.

| # | Strategy | How it Works | Best For |
| --- | --- | --- | --- |
| 1 | Train on CPU | Leverages system RAM, which is typically much larger and more cost-effective than specialized GPU VRAM. | Scenarios where a machine with high RAM is available and training speed is not critical. |
| 2 | One-vs-Rest on GPUs | Decomposes the problem into thousands of smaller binary classification tasks. | The most common and scalable approach for retaining GPU acceleration. |
| 3 | Hierarchical Softmax | Organizes classes into a tree structure, reducing the prediction space at each node. | Datasets where labels have a natural hierarchical relationship. |

Strategy 1: Train on CPU

The most direct solution is to sidestep the GPU memory constraint by moving the computation to the CPU. A system's main RAM is often significantly larger and more cost-effective than specialized GPU VRAM. By changing the XGBoost tree_method from gpu_hist to hist (or, on XGBoost 2.0 and later, setting device to cpu), the model trains on the CPU and the gradient-pair matrix lives in system memory.

While GPU acceleration can offer a 5-20x speedup for smaller trees, the performance advantage diminishes as tree depth increases. For very large datasets, the difference can be substantial, with CPU training being an order of magnitude slower. However, from a cost perspective, utilizing a CPU-based machine with a large amount of system RAM can be more economical than provisioning high-VRAM GPUs. This makes CPU training a viable strategy for non-urgent tasks or in environments where budget is a primary constraint. Furthermore, CPU training offers deterministic reproducibility, which is not always guaranteed with GPU-based training. It serves as an excellent baseline for verifying model logic before moving to more complex GPU strategies.
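
As a sketch, the switch only touches the training parameters; X and y here stand for an in-memory feature matrix and integer label vector, and the parameter values are illustrative rather than tuned:

import xgboost as xgb

# X and y are assumed to be an in-memory feature matrix and label vector (0..5189).
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softprob",
    "num_class": 5190,
    "tree_method": "hist",  # CPU histogram algorithm
    "device": "cpu",        # keep the gradient-pair matrix in system RAM
    "max_depth": 6,
}

booster = xgb.train(params, dtrain, num_boost_round=100)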

Strategy 2: One-vs-Rest on GPUs

Instead of building one massive multi-class model, this strategy involves training thousands of simple, independent binary models. This approach, known as One-vs-Rest (OvR), reframes the problem. Each model is responsible for answering a single question: "Does this sample belong to class k, or not?"

A binary model's memory footprint is minimal because the number of classes is always two. This allows the computation to fit comfortably within the 24 GiB memory limit of a standard GPU, combining the scalability of a different training paradigm with the speed of GPU hardware. The primary trade-off is the complexity of managing thousands of individual models for training and inference.

from xgboost import dask as dxgb

# This assumes a Dask `client` connected to a GPU cluster and your data (X, y) ready
num_classes = 5190
all_models = []

for k in range(num_classes):
    print(f"Training a model for class {k}...")
    # Make the target binary: 1 for our class, 0 for all others
    y_binary = (y == k).astype("int8")

    # The OvR approach creates highly imbalanced classes (many 0s, few 1s).
    # `scale_pos_weight` compensates for this by giving more weight to the minority class.
    pos_rate = y_binary.mean().compute()
    scale_pos_weight = (1 - pos_rate) / pos_rate if pos_rate > 0 else 1

    params = {
        "objective": "binary:logistic",  # Simple binary goal
        "tree_method": "gpu_hist",       # Use the GPU!
        "scale_pos_weight": scale_pos_weight,
        "max_depth": 4,
    }

    # DaskQuantileDMatrix is a memory-efficient way to load data for XGBoost
    dtrain = dxgb.DaskQuantileDMatrix(client, X, y_binary)

    # Train the small, independent model for class k
    model_k = dxgb.train(
        client,
        params,
        dtrain,
        num_boost_round=100,
    )["booster"]

    all_models.append(model_k)

Strategy 3: Hierarchical Softmax

If the labels possess an inherent tree-like structure, this strategy can be leveraged. It replaces a single flat classification with a series of smaller classifications that traverse a tree of labels.

Each step is a classification task with a much smaller number of outputs, allowing the underlying model to fit easily into memory. This approach is highly efficient but is only applicable when a meaningful and well-defined hierarchy exists in the labels. The main drawback is the risk of error propagation: a mistake at a high level of the tree (e.g., misclassifying a phone as a kitchen appliance) cannot be corrected at lower levels.
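
As an illustration, here is a minimal two-level sketch. It assumes in-memory numpy arrays and a hypothetical coarse_of array mapping each of the 5,190 fine labels to a parent category; in a real system that mapping comes from your label taxonomy, and each model could just as well be trained with the Dask/GPU setup shown elsewhere in this article:

import numpy as np
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Hypothetical mapping: fine label -> coarse parent category (~50 categories here).
# Replace this random placeholder with your real label taxonomy.
coarse_of = np.random.randint(0, 50, size=5190)

# Level 1: predict the coarse category. With ~50 classes, the gradient-pair
# matrix is roughly 100x smaller than the flat 5,190-class version.
y_coarse = coarse_of[y]
level1 = xgb.XGBClassifier(device="cuda", tree_method="hist")
level1.fit(X, y_coarse)

# Level 2: one small model per coarse category, choosing only among its fine labels.
level2 = {}
for c in np.unique(y_coarse):
    mask = y_coarse == c
    encoder = LabelEncoder()  # re-index this category's fine labels to 0..K-1
    y_fine = encoder.fit_transform(y[mask])
    model = xgb.XGBClassifier(device="cuda", tree_method="hist")
    model.fit(X[mask], y_fine)
    level2[c] = (model, encoder)

# Inference walks the tree: predict the coarse category first, then the fine label within it.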

Even with an effective strategy like One-vs-Rest, an optimized cluster configuration is vital for performance and stability. The following Dask setup, based on NVIDIA's recommendations 3, provides a robust starting point for a multi-GPU environment.

import os
from dask_cuda import LocalCUDACluster
from distributed import Client
 
# Prevent RAPIDS libraries from creating a CUDA context at import time, so each Dask worker can initialize its own GPU
os.environ["RAPIDS_NO_INITIALIZE"] = "1"
 
cluster = LocalCUDACluster(
    protocol="ucx",             # Unified Communication X: A high-performance network protocol optimized for HPC and GPU data transfer.
    rmm_pool_size="20GB",         # RAPIDS Memory Manager (RMM): Pre-allocates a 20GB memory pool on each GPU to accelerate memory operations.
    device_memory_limit="18GB",   # Spills data to system RAM if GPU memory usage exceeds 18GB, preventing out-of-memory crashes.
    jit_unspill=True,             # Proactively moves spilled data back to the GPU when it's needed again, improving performance.
    local_directory="/scratch"    # Specifies a directory on a fast local disk (like an NVMe SSD) for temporary spill files.
)
client = Client(cluster)

This configuration uses a high-speed networking protocol (UCX) and sophisticated memory management (RMM with JIT unspill) to ensure your training jobs run as smoothly and efficiently as possible, minimizing bottlenecks from data transfer and memory allocation.

Why Common Hyperparameters Don't Solve the Problem

When facing out-of-memory errors, one might try to adjust model hyperparameters. While this can sometimes help with memory issues that arise during the training process, it is ineffective against the specific pre-training allocation failure caused by a high class count. This initial error must be solved before any tuning can have an effect.

  • num_boost_round (or n_estimators): Controls the number of trees. These are built sequentially, so this parameter affects total training time and final model size, but not the peak memory required at the start.
  • max_depth: While an extremely deep tree can cause its own out-of-memory error by consuming too much working memory during tree construction, this is a runtime error. It is distinct from the pre-training allocation error that is the primary focus of this article. Therefore, reducing max_depth might solve a runtime OOM but will not help if the initial memory allocation fails before training even begins.
  • subsample and colsample_bytree: These are powerful tools for preventing overfitting by sampling data and features. However, this sampling occurs after the initial, full-sized gradient-pair matrix is already in memory, so they offer no relief from the initial allocation pressure.

Key Insights

The out-of-memory error in multi-class XGBoost is not a bug but a predictable hardware bottleneck. It signals that the chosen modeling strategy is mismatched with the problem's scale. The solution is not to tweak parameters but to rethink the architecture.

  • Analyze First: Always perform a calculation of the memory footprint before launching a training job. This simple step saves hours of debugging.
  • Adopt the Right Strategy: The choice between CPU training, One-vs-Rest, or Hierarchical Softmax is a critical design decision based on your hardware, data, and project goals.
  • Optimize Your Environment: Once the core memory issue is solved, a well-configured Dask cluster with optimized networking and memory management will ensure your solution is both scalable and performant.

Understanding the root cause and framing the problem as an architectural challenge allows for the design of robust systems to train models with thousands of classes without hitting a memory wall.

Footnotes

  1. XGBoost developers. (2024). XGBoost Documentation. https://xgboost.readthedocs.io/

  2. Dask developers. (2024). Dask Documentation. https://docs.dask.org/en/latest/

  3. Liu, J. (2023). Unlocking Multi-GPU Model Training with Dask XGBoost. NVIDIA Developer Blog. https://developer.nvidia.com/blog/unlocking-multi-gpu-model-training-with-dask-xgboost/