June 25, 2025
Introducing TabBench: Benchmarking Tabular ML Models for Enterprise Tasks
Introducing TabBench, an open-source benchmark built by Neuralk-AI to evaluate tabular ML models on practical, real-world industry tasks, starting with commerce-related use cases.

While breakthroughs in text-based LLMs and computer vision models often dominate the headlines, the reality is that much of enterprise AI still relies on tabular data. From predicting customer churn to personalizing recommendations or optimizing pricing, machine learning (ML) on tabular datasets powers core business use cases that can play a crucial role in accelerating growth and boosting operational efficiency.

Yet, in practice, the tabular ML landscape remains highly fragmented: training and evaluation datasets vary significantly in size, domain, and preprocessing methods, and evaluation protocols are often inconsistent, with results commonly reported on oversimplified academic tasks or synthetic datasets. This makes it difficult for data teams to assess the true generalization ability and practical value of their models in real enterprise environments, often resulting in missed business opportunities and degraded performance.

To address these challenges, we’re excited to introduce TabBench, an open-source benchmark built by Neuralk-AI to evaluate tabular ML models on practical, real-world industry tasks, starting with commerce-related use cases.

Why TabBench?

TabBench offers, for the first time, a unified, open-source framework to evaluate model performance on industry-focused use cases such as product categorization and deduplication, starting with the Commerce sector.

Despite the dominant role of tabular data in real-world ML applications, many teams still face major setbacks when building, evaluating, and deploying tabular models that meet the specific needs of their use cases. TabBench is designed to address these challenges by offering:

  • Evaluation of models on real industry datasets: Most benchmarks are based on simplified academic tasks and unrealistic datasets. TabBench evaluates industry-relevant use cases on messy, real-world industry data, so results translate into meaningful business impact.*
  • Use-case focused workflows: Common industry tasks like product categorization or deduplication are often absent from traditional benchmarks, partly because their workflows can be complex and hard to standardize. TabBench provides pre-built, modular workflows for each industry-focused use case that reflect how teams operate in real enterprise settings.
  • Standardized preprocessing and metrics: Differences in data cleaning, feature engineering, and evaluation pipelines make it difficult to compare models reliably. TabBench streamlines the entire pipeline through a clear, step-by-step workflow, from preprocessing to final evaluation, ensuring fair and reproducible results.

NICL, a Tabular Foundation Model by Neuralk-AI

To raise the standard of tabular ML performance for these real-world industrial use cases, we developed NICL (Neuralk In-Context Learning) — a novel Tabular Foundation Model designed to deliver state-of-the-art results on industrial predictive tasks.

The TabBench Dashboard gives you a first glance at the performance of NICL compared to classical ML approaches and existing Tabular Foundation Models.

As shown in the plot below, NICL is on par with TabICL in terms of ranking (a commonly used evaluation metric for assessing model robustness across diverse datasets) on a classification benchmark spanning 50 OpenML datasets, followed by TabPFNv2.**
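
Concretely, the ranking metric works as follows: for each dataset, every model is ranked by its score (1 = best), and these per-dataset ranks are then averaged across all datasets, so a lower mean rank indicates a more consistently strong model. Here is a minimal pandas sketch of that computation, using made-up scores that are purely illustrative and not actual benchmark numbers:

import pandas as pd

# Hypothetical accuracy scores: rows = datasets, columns = models.
# These numbers are invented for illustration only.
scores = pd.DataFrame(
    {
        "NICL":     [0.91, 0.87, 0.78],
        "TabICL":   [0.90, 0.88, 0.77],
        "TabPFNv2": [0.89, 0.85, 0.79],
        "XGBoost":  [0.86, 0.84, 0.74],
    },
    index=["dataset_1", "dataset_2", "dataset_3"],
)

# Rank models within each dataset (1 = best), then average across datasets.
ranks = scores.rank(axis=1, ascending=False)
mean_rank = ranks.mean(axis=0).sort_values()
print(mean_rank)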

Getting started with a TabBench Workflow

In TabBench, each use case (e.g., product categorization) is broken down into a sequence of steps, organized into an end-to-end Workflow, from loading the dataset to producing a model's final predictions.

You can quickly start experimenting with a TabBench Workflow by installing it via pip:

$ pip install tabbench


TabBench offers a built-in visualization tool that can be directly used within a notebook environment. With a single command, you can generate a clear, user-friendly diagram showing each Workflow step, its inputs, and its outputs, making the overall process easy to follow and debug.

Let's say you want to run Product Categorization on a product catalog, using an XGBoost classifier.

You can do so by running the following code:

from tabbench.workflow.use_cases import Categorisation

use_case = Categorisation('best_buy_simple_categ')
use_case.notebook_display()


This will generate an interactive display like the one below, showcasing how TabBench effortlessly manages input-output data through a modular, step-by-step workflow. Try it yourself!

A TabBench Workflow for Product Categorization


In the above example, for the Product Categorization use case with an XGBoost classifier, the process includes:

  • LoadDataset: Loads the dataset to be used for evaluation.
  • StratifiedShuffleSplitter: Splits the data into training and test sets while preserving class distribution.
  • PreprocessingStep:
    • Applies standard scaling to continuous features
    • Applies one-hot encoding to categorical features
  • TfIdfVectorizer: Transforms text features using TF-IDF vectorization.
  • LabelEncoding: Encodes the target variable to ensure compatibility with most model formats.
  • XGBoostClassifier: A gradient boosting model that can be used for training and prediction on tabular data.

Each step in the workflow is fully parameterizable: users can easily swap out components (e.g., use LightGBM instead of XGBoost, or choose a different vectorizer or preprocessing technique) to match their specific use case or experimentation needs.
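
To make the anatomy of this workflow concrete, here is a rough, standalone equivalent of the same steps written directly with scikit-learn and XGBoost. This is illustrative only: the file name and column names are hypothetical, and it is not TabBench's internal implementation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# LoadDataset: hypothetical product catalog with numeric, categorical, and text columns.
catalog = pd.read_csv("products.csv")
X, y_raw = catalog.drop(columns="category"), catalog["category"]

# LabelEncoding: turn string categories into integer class ids.
y = LabelEncoder().fit_transform(y_raw)

# StratifiedShuffleSplitter: stratify=y preserves the class distribution across splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# PreprocessingStep + TfIdfVectorizer: scale numeric features, one-hot encode
# categoricals, and TF-IDF the free-text product titles.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["price", "weight"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ("txt", TfidfVectorizer(), "title"),
])

# XGBoostClassifier: gradient-boosted trees trained on the vectorized features.
model = Pipeline([("preprocess", preprocess), ("clf", XGBClassifier())])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

Swapping a component, for instance replacing XGBClassifier with LightGBM's LGBMClassifier, only changes one step of the chain, which is exactly the modularity the Workflow abstraction is meant to capture.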

More generally, TabBench workflows follow a consistent structure across use cases, broken down into four main steps that can handle nearly any ML scenario, sketched in code just after the list:

  1. Load: Loads the data, defines the use case (e.g., Product Categorization), and splits the data accordingly.
  2. Vectorize: Performs necessary preprocessing and converts data entries into vector embeddings.
  3. Predict: Applies a model to the vectorized data. This step can involve training a new model or using a pre-trained one from a selection of choices. Post-processing may also occur depending on the selected model.
  4. Evaluate: Assesses the accuracy and performance of the Predict step.
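
As a mental model only (not TabBench's actual API), these four stages can be pictured as a simple chain of callables, where each stage consumes the previous stage's output:

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToyWorkflow:
    """Toy illustration of a Load -> Vectorize -> Predict -> Evaluate chain.
    A mental model only; class and field names here are hypothetical."""
    load: Callable[[], Any]           # load data, define the use case, split it
    vectorize: Callable[[Any], Any]   # preprocess and embed the raw entries
    predict: Callable[[Any], Any]     # train or apply a model on the vectors
    evaluate: Callable[[Any], dict]   # score the predictions

    def run(self) -> dict:
        # Each stage consumes the previous stage's output.
        return self.evaluate(self.predict(self.vectorize(self.load())))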

To explore TabBench in more depth and train your first model on a Product Categorization task, check out one of our interactive notebooks here.

TabBench currently focuses on classification and categorization tasks, with more use cases being added soon. You can explore all the results via the TabBench Dashboard, and dive deeper into the evaluation pipelines, implementation details, and contribution guidelines directly on our GitHub.


Want to go further?

NICL will soon be available via API for inference and evaluation on your own tabular datasets! If you're interested in integrating it into your ML workflows or exploring how it performs on your custom use case, request early access here.

* Interested in evaluating your model or contributing to TabBench? Please consult our Contribute page.

** Industry datasets are not released as part of TabBench for privacy constraints and to avoid any contamination of the evaluation protocol.