Tabular foundation models are powerful, but we're still in early days. Here's what we don't understand, where they fail, and the fascinating collision course with large language models.


We've spent four articles explaining what tabular foundation models can do. Now let's explore what we think they can do: the claims our intuition supports even when we don't fully understand or can't yet explain them.
This matters because technology hype cycles are real, and the gap between "this is promising" and "this solves all your problems" is often filled with disappointment. By acknowledging both what we know and what we sense, we can make wiser decisions, and appreciate the true breakthroughs even more.
Here's a humbling fact: we don't fully understand why these models work as well as they do.
The theory says: train on synthetic data drawn from a prior, and the model learns to approximate Bayesian inference. In practice, the models often exceed what theory predicts. They generalise in ways that surprise even their creators.
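Stated a little more formally (this is the standard prior-fitted-network objective, written informally here): the model q_θ is trained to minimise the expected prediction loss over datasets drawn from the prior, which in principle makes it approximate the Bayesian posterior predictive.

```latex
\mathcal{L}(\theta) \;=\;
\mathbb{E}_{(D_{\text{train}},\, x_{\text{test}},\, y_{\text{test}}) \sim p(\mathcal{D})}
\Big[ -\log q_\theta\big(y_{\text{test}} \mid x_{\text{test}},\, D_{\text{train}}\big) \Big]
```

Minimising this drives q_θ(y | x, D) toward the true posterior predictive p(y | x, D) under the prior p(𝒟). The surprise is that real-world performance often exceeds what this account alone would predict.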
Several hypotheses have been proposed, but unfortunately, we don't have exact answers.
Every tabular foundation model interprets problems through the lens of a prior: a formalised set of beliefs about what tabular datasets typically look like. Different choices of priors naturally lead to different models.
But what’s the “right” prior for real-world tabular data? Most current generators rely on structural causal models or Bayesian neural networks, producing diverse datasets that cover many possible patterns. Yet, they may still miss structures that are common in specific domains.
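To make "sampling datasets from a prior" concrete, here is a minimal sketch of a structural-causal-model generator: draw a random DAG, propagate noise through random functions, and take one node as the prediction target. The functional forms and parameters are illustrative, not any particular model's actual prior.

```python
import numpy as np

def sample_scm_dataset(n_rows=200, n_nodes=6, seed=0):
    """Draw one synthetic tabular dataset from a toy SCM prior.

    Each node is a noisy nonlinear function of its parents; nodes are
    topologically ordered, so the adjacency matrix is a DAG by construction.
    """
    rng = np.random.default_rng(seed)
    # Random upper-triangular adjacency => a DAG over n_nodes variables.
    adj = np.triu(rng.random((n_nodes, n_nodes)) < 0.5, k=1)
    weights = rng.normal(size=(n_nodes, n_nodes))
    values = np.zeros((n_rows, n_nodes))
    for j in range(n_nodes):
        parents = np.where(adj[:, j])[0]
        signal = values[:, parents] @ weights[parents, j] if len(parents) else 0.0
        values[:, j] = np.tanh(signal) + 0.1 * rng.normal(size=n_rows)
    # Use the last node (a sink) as a binary classification target.
    X = values[:, :-1]
    y = (values[:, -1] > np.median(values[:, -1])).astype(int)
    return X, y

X, y = sample_scm_dataset()
```

A real pre-training pipeline would sample millions of such datasets while varying the number of nodes, graph density, and function families; the choices of those distributions are exactly the "prior" under discussion.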
The research community's focus on public datasets like OpenML's, and the publication pressures that come with them, can exacerbate this problem. Just as computer vision spent years overfitting to ImageNet's quirks and label errors, overreliance on standardised datasets risks overfitting models to the quirks of benchmark data rather than to real-world problems.
At Neuralk, we took a different approach: we analysed the recurring structural patterns in industrial tabular data.
Our current foundation model handles these patterns well, but there may still be value in exploring domain-specific tabular foundation models. The key question remains: can a single prior realistically cover all domains, or are specialised models the better path forward?
In language models, we've discovered scaling laws: bigger models trained on more data predictably get better. You can extrapolate performance from smaller experiments.
For tabular foundation models, the picture is less clear.
More synthetic data helps, but the relationship isn't as clean. If we add real data, how much does it help the model? What's the optimal ratio of synthetic to real? How does model size interact with data diversity?
Some recent work shows encouraging scaling behaviour, but we don't have the confident predictions we have for language models.
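For intuition, language-model scaling laws are typically power laws, L(N) ≈ a·N^(−b), which are linear on a log-log plot, so they can be fitted from small-scale runs with a one-line regression. A sketch; the loss numbers below are invented purely for illustration:

```python
import numpy as np

# Hypothetical small-scale runs: (parameter count, validation loss).
# These values are made up for illustration only.
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([3.10, 2.71, 2.36, 2.06, 1.80])

# A power law L(N) = a * N^(-b) is linear in log-log space,
# so a degree-1 polynomial fit recovers a and b.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
a, b = np.exp(intercept), -slope

def predict_loss(n_params):
    """Extrapolate validation loss to a model size outside the fitted range."""
    return a * n_params ** (-b)

larger_model_loss = predict_loss(1e9)
```

For language models this kind of extrapolation is remarkably reliable; the open question is whether tabular foundation models admit equally clean fits.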

Every model fails somewhere. For certain tabular foundation models, failure modes can include:
Distribution shift: If your real data looks fundamentally different from anything the synthetic generator could produce, the model struggles. Edge cases outside the prior distribution can produce poor predictions with unwarranted confidence. For example, when positive cases in a classification dataset are 0.1% of your data, even well-calibrated models face challenges. The prior may not adequately capture such extreme imbalance.
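A quick illustration of why 0.1% positive rates are treacherous: a degenerate classifier that never predicts the positive class scores 99.9% accuracy while being useless, which is why calibration and rank-based metrics matter in this regime.

```python
# With a 0.1% positive rate, the trivial "always negative" baseline
# looks excellent on accuracy and catastrophic on recall.
n, positives = 100_000, 100          # 0.1% positive rate
y_true = [1] * positives + [0] * (n - positives)
y_pred = [0] * n                     # degenerate all-negative model

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / positives
print(accuracy, recall)  # 0.999 0.0
```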
Adversarial vulnerability: Recent research found that tabular foundation models can be vulnerable to small, carefully designed perturbations. Changing a few feature values in specific ways can flip predictions. For security-critical applications, this matters.
Complex temporal dependencies: Tabular foundation models treat each row as independent. If your problem requires understanding sequences—how a customer's behaviour evolved over time, not just their current state—the standard approach struggles.
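The usual workaround is to flatten each sequence into per-row summary features, so a row-independent model can still see the history. A minimal sketch; the feature names and event format are invented for illustration:

```python
def summarise_history(events, today):
    """Collapse a customer's event history (list of (day, amount) pairs)
    into fixed-width features a row-independent tabular model can consume."""
    amounts = [amount for _, amount in events]
    days = [day for day, _ in events]
    return {
        "n_events": len(events),
        "total_spend": sum(amounts),
        "mean_spend": sum(amounts) / len(amounts),
        "days_since_last_event": today - max(days),
        "trend": amounts[-1] - amounts[0],  # crude direction of change
    }

row = summarise_history([(1, 20.0), (15, 35.0), (40, 30.0)], today=60)
```

This recovers some temporal signal, but it is lossy by design; problems that genuinely hinge on sequence structure may need sequence models instead.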
Long context: how do we maintain state-of-the-art performance when facing very long contexts?
Knowing these failure modes helps us design appropriate safeguards and mitigate their impact. For example, NICL has proven extremely performant even under extreme class imbalance.
Now for the exciting part: what happens when tabular foundation models meet large language models?
This sounds crazy at first. LLMs are trained on text. Your customer data is numbers in columns. What's the connection?
With clever prompting, LLMs can make surprisingly reasonable-sounding tabular predictions. You serialise the table into text, describe what you want to predict, provide a few examples, and ask for predictions.
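Concretely, "serialising the table into text" can be as simple as templating each row into a few-shot prompt. The wording below is illustrative, not a canonical prompt format:

```python
def rows_to_prompt(header, rows, target, query_row):
    """Turn a few labelled rows plus one unlabelled row into a
    few-shot text prompt for an LLM. Each labelled row ends with its label."""
    lines = [f"Predict {target} from the other columns."]
    for row in rows:
        feats = ", ".join(f"{h}={v}" for h, v in zip(header, row[:-1]))
        lines.append(f"{feats} -> {target}={row[-1]}")
    feats = ", ".join(f"{h}={v}" for h, v in zip(header, query_row))
    lines.append(f"{feats} -> {target}=")  # the LLM completes this line
    return "\n".join(lines)

prompt = rows_to_prompt(
    header=["age", "income"],
    rows=[[34, 52000, "yes"], [61, 18000, "no"]],
    target="churn",
    query_row=[45, 31000],
)
```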
The results are significantly below the state-of-the-art, but they come with something unique: natural language explanations of why the model predicted what it did. The problem is, their reasoning is often wrong.
Here's where it gets interesting.
Traditional tabular models don't understand what columns mean. They see "age" as column 7 with numeric values. They don't know that "age" relates to life stages, health, experience, or wisdom.
LLMs, trained on human text, have absorbed massive amounts of world knowledge. They know that "age" and "years of experience" might correlate. They know that "income" in "USD" is different from "income" in "rupees." They understand that "cancer stage" has specific medical meaning.
Models like TabSTAR or SAP’s ContextTab leverage this. They use text encoders (like BERT) to embed column names, incorporating semantic meaning into tabular predictions. A column named "customer_lifetime_value" gets different treatment than "random_id_7" because the model understands what the name implies.
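A toy way to see what semantic column encoding buys you: even a crude token-overlap similarity separates meaningful names from opaque ones. Real systems use learned text embeddings, not this; the example is only a sketch of the idea.

```python
def name_tokens(col):
    """Split a snake_case column name into lowercase word tokens."""
    return set(col.lower().split("_"))

def jaccard(a, b):
    """Token-overlap similarity between two column names, in [0, 1]."""
    ta, tb = name_tokens(a), name_tokens(b)
    return len(ta & tb) / len(ta | tb)

# Semantically related names share tokens; opaque IDs share none.
related = jaccard("customer_lifetime_value", "lifetime_value_estimate")
opaque = jaccard("customer_lifetime_value", "random_id_7")
```

A text encoder like BERT generalises this far beyond literal token overlap, linking "age" to "years_of_experience" even though they share no tokens.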
This opens new possibilities.
The next frontier is probably multi-modal tabular learning. Real business problems don't fit neatly into "tabular" vs. "text" vs. "image" boxes.
Consider insurance claim processing: structured policy and claims records, free-text claim descriptions, and photos of the damage.
A truly powerful system would reason across all of these together. Early research is exploring this direction, combining the pattern-recognition of tabular foundation models with the world knowledge of LLMs and the perceptual capabilities of vision models.
At Neuralk, we're focusing not only on making the best tabular foundation models: we're also challenging ourselves to build the brain and the relationships around the predictive model, scoping the right business question, sourcing and cleaning data (whatever its format), engineering meaningful features from domain knowledge, running predictions, generating actionable insights, and iterating. The goal is not to be the best in a lab, but to be the most valuable in a real-world enterprise setting.

Based on current research trajectories, here's what to expect in the near term:
TabPFN started with 1,000 samples. Version 2.5 handles 50,000. Neuralk’s NICL can handle tens of millions. Foundation models are now viable for the large-scale enterprise datasets where traditional methods currently dominate.
Pure synthetic pre-training vs. pure real-data training is a false dichotomy. The future is hybrid: synthetic pre-training for broad coverage, real-world data to ground the prior, and domain-specific fine-tuning on top.
This mirrors the trajectory of NLP: GPT (general pre-training) → domain-specific models (BioBERT, LegalBERT) → fine-tuning on your data.
Much of data science isn't about prediction—it's about understanding causation. Will this marketing campaign cause higher sales, or just correlate with factors that would increase sales anyway?
Tabular foundation models, trained on structural causal models, are positioned to move beyond prediction into causal inference. Early work like CausalFM extends the PFN framework to estimate causal effects. Expect more development here.
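To see why correlation and causation diverge, here is a minimal simulated example: a confounder drives both treatment and outcome, so the naive treated-vs-untreated contrast is badly biased, while stratifying on the confounder (backdoor adjustment) recovers the true effect. All numbers are synthetic and purely illustrative.

```python
import random

random.seed(0)
TRUE_EFFECT = 2.0

# Confounder z raises both the chance of treatment and the outcome.
data = []
for _ in range(50_000):
    z = random.random() < 0.5
    t = random.random() < (0.8 if z else 0.2)
    y = TRUE_EFFECT * t + 5.0 * z + random.gauss(0, 0.1)
    data.append((z, t, y))

def mean_y(rows):
    ys = [y for _, _, y in rows]
    return sum(ys) / len(ys)

# Naive contrast: biased upward, because treated units tend to have z=1.
naive = mean_y([r for r in data if r[1]]) - mean_y([r for r in data if not r[1]])

# Backdoor adjustment: compare within strata of z, then average over p(z).
adjusted = 0.0
for z_val in (True, False):
    stratum = [r for r in data if r[0] == z_val]
    effect = (mean_y([r for r in stratum if r[1]])
              - mean_y([r for r in stratum if not r[1]]))
    adjusted += effect * len(stratum) / len(data)
```

Here `naive` lands near 5.0 while `adjusted` recovers roughly 2.0; a model trained on structural causal models has, in effect, seen vast numbers of scenarios like this one.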
Combine tabular foundation models with LLMs, add a dash of automated machine learning (AutoML), and you get something like an AI data scientist.
We're not fully there yet, but the pieces are coming together.
If any of these research topics interest you, our team is growing and we’d love to talk.
Tabular foundation models represent a genuine paradigm shift in how we approach structured data. The ability to make competitive predictions without feature engineering, hyperparameter tuning, or extensive training time is remarkable.
But they're not magic. They have limitations. They don't obsolete traditional methods. They're another tool—a powerful one—in the data scientist's toolkit.
The practitioners who will thrive are those who understand both the power and the limits of these new tools.
The technology is young. The best practices are still being written. The papers that will define this field in five years probably haven't been published yet (but we're working on them 😈).
That's what makes it exciting. If you're looking to work in this field, please check out our Careers page.
Across these five articles, we've covered:
Part 1: The Revolution Comes to Spreadsheets
Introduced tabular foundation models—pre-trained AI systems that learn patterns from millions of tables and apply that knowledge to new datasets without retraining.
Part 2
Explored traditional ML methods (decision trees, random forests, gradient boosting) that have dominated tabular data, and why they've worked so well.
Part 3
Revealed how synthetic datasets, generated from structural causal models, train foundation models to learn meta-skills applicable to any tabular problem.
Part 4
Provided practical guidance for enterprises: when to use foundation models vs. traditional methods, and the real-world considerations beyond raw accuracy.
Part 5: The Unknown Frontier
Acknowledged limitations and open questions, explored the convergence with LLMs, and looked ahead at what's coming next.
→ We don't fully understand why tabular foundation models work as well as they do—surprising even their creators
→ Open questions remain around optimal priors, scaling laws, and failure modes
→ Adversarial vulnerability and distribution shift are known weaknesses
→ LLMs bring semantic understanding to tabular data—knowing that "age" means something about life stages, not just column 7
→ The future likely involves hybrid systems: synthetic pre-training + real-world data + domain fine-tuning
→ Integration with causal inference and multi-modal learning are active research frontiers
→ The field is young—best practices are still being written
Thanks for reading this series. The world of tabular AI is evolving rapidly, and we'll continue covering developments as they unfold. Have questions? Want to see specific topics explored? Drop us a note.
Glossary of Terms
- Adversarial perturbation: Small, carefully designed input changes that cause a model to make incorrect predictions
- Causal inference: Statistical methods for determining cause-and-effect relationships, not just correlations
- Distribution shift: When the data a model encounters differs from the data it was trained on
- Inductive bias: The assumptions a learning algorithm makes to generalise beyond training data
- Multi-modal learning: Training models on multiple types of data (text, images, tables) simultaneously
- Scaling laws: Mathematical relationships describing how model performance improves with more data and parameters
- Semantic understanding: Grasping the meaning of concepts, not just their statistical patterns

