The Next Frontier: Open Questions and What's Next for Tabular AI (Part 5 of 5)

Tabular foundation models are powerful, but we're still in early days. Here's what we don't understand, where they fail, and the fascinating collision course with large language models.
March 5, 2026
Hugo Owen

Honest Uncertainty

We've spent four articles explaining what tabular foundation models can do. Now let's turn to what we only *think* they can do: the places where intuition runs ahead of understanding, and the questions we can't yet answer.

This matters because technology hype cycles are real, and the gap between "this is promising" and "this solves all your problems" is often filled with disappointment. By being clear about what we know versus what we merely sense, we can make wiser decisions and better appreciate the genuine breakthroughs.

The Open Questions

1. Why Do They Actually Work?

Here's a humbling fact: we don't fully understand why these models work as well as they do.

The theory says: train on synthetic data drawn from a prior, and the model learns to approximate Bayesian inference. In practice, the models often exceed what theory predicts. They generalise in ways that surprise even their creators.

Some hypotheses:

  • The transformer architecture’s inductive bias makes it well-suited to capturing feature interactions in tabular data that other architectures might miss.
  • Meta-learning across millions of datasets creates emergent capabilities.
  • Synthetic priors don’t need to look like real data; they need to behave like real data. By modeling the joint distribution between features and labels, the model learns the rules of classification during in-context learning, even if it has never seen the specific features before.

Unfortunately, we don't have definitive answers yet.
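One way to build intuition for the in-context learning hypothesis is a toy stand-in. A PFN-style model receives labelled rows and a query row in a single forward pass and outputs a posterior over labels. The distance-weighted vote below is only an analogy for what the trained transformer computes, not the actual architecture:

```python
import numpy as np

def in_context_predict(X_train, y_train, x_query, temperature=1.0):
    """Toy stand-in for PFN-style in-context classification:
    labelled rows act as 'context', and the query is scored
    against them in one pass (here: a softmax-weighted vote)."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    w = np.exp(-d / temperature)
    w /= w.sum()
    classes = np.unique(y_train)
    # posterior over classes = total weight assigned to each class
    return {int(c): float(w[y_train == c].sum()) for c in classes}

X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [2.9, 3.0]])
y = np.array([0, 0, 1, 1])
posterior = in_context_predict(X, y, np.array([2.8, 3.2]))
# the query sits near the class-1 cluster, so class 1 dominates
```

The real model replaces the hand-written distance weighting with attention patterns learned from millions of synthetic tasks, which is precisely the part we don't fully understand.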

2. What's the Right Prior?

Every tabular foundation model interprets problems through the lens of a prior: a formalised set of beliefs about what tabular datasets typically look like. Different choices of priors naturally lead to different models.

But what’s the “right” prior for real-world tabular data? Most current generators rely on structural causal models or Bayesian neural networks, producing diverse datasets that cover many possible patterns. Yet, they may still miss structures that are common in specific domains.

The research community’s focus on public benchmark suites like OpenML’s, and the publication pressures that come with them, can exacerbate this problem. Just as computer vision overfit to ImageNet’s quirks and label errors, overreliance on standardised datasets risks tuning models to the idiosyncrasies of benchmark data rather than real-world problems.

At Neuralk, we took a different approach: we analysed patterns in industrial tabular data. Some of the recurring structures we observed include:

  • Temporal dynamics and regime shifts in financial data
  • Structured correlations between symptoms and diagnoses in medical datasets
  • Periodic patterns and sensor drift in industrial sensor data

Our current foundation model handles these patterns well, but there may still be value in exploring domain-specific tabular foundation models. The key question remains: can a single prior realistically cover all domains, or are specialised models the better path forward?

3. Scaling Laws for Tabular Data

In language models, we've discovered scaling laws: bigger models trained on more data predictably get better. You can extrapolate performance from smaller experiments.

For tabular foundation models, the picture is… less clear.

More synthetic data helps, but the relationship isn't as clean. If we add real data, how much does it help the model? What's the optimal ratio of synthetic to real? How does model size interact with data diversity?

Some recent work shows encouraging scaling behaviour, but we don't have the confident predictions we have for language models.
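When scaling behaviour does look clean, it is usually a power law, and the standard trick is to fit a line in log-log space and extrapolate. A minimal sketch, using made-up illustrative numbers rather than measured results:

```python
import numpy as np

# Illustrative (not measured) validation losses at increasing
# synthetic-data budgets; a power law L = a * N^(-b) is a straight
# line in log-log space, so fit it with ordinary least squares.
n_samples = np.array([1e4, 1e5, 1e6, 1e7])
val_loss  = np.array([0.90, 0.62, 0.43, 0.30])

slope, intercept = np.polyfit(np.log(n_samples), np.log(val_loss), 1)
a, b = np.exp(intercept), -slope

def predicted_loss(n):
    return a * n ** (-b)

# extrapolate one order of magnitude beyond the fitted range
print(round(predicted_loss(1e8), 3))
```

For language models, fits like this extrapolate remarkably well; the open question is whether tabular losses follow curves this clean once real data, prior choice, and model size all vary at once.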

4. When Do They Fail?

Every model fails somewhere. For certain tabular foundation models, failure modes can include:

Distribution shift: If your real data looks fundamentally different from anything the synthetic generator could produce, the model struggles. Edge cases outside the prior distribution can produce poor predictions with unwarranted confidence. For example, when positive cases in a classification dataset are 0.1% of your data, even well-calibrated models face challenges. The prior may not adequately capture such extreme imbalance.
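To make the imbalance point concrete: at a 0.1% positive rate, a degenerate model that predicts "negative" for every row scores 99.9% accuracy while catching zero positives, which is why accuracy alone can't reveal this failure mode. The numbers below are synthetic:

```python
# 100,000 rows with a 0.1% positive rate
n_total, n_pos = 100_000, 100
y_true = [1] * n_pos + [0] * (n_total - n_pos)

# degenerate model: always predict the majority class
y_pred = [0] * n_total

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / n_pos

print(accuracy)  # 0.999
print(recall)    # 0.0
```

Metrics like recall, precision-recall AUC, or calibration curves are the ones that expose whether a model is actually handling the rare class.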

Adversarial vulnerability: Recent research found that tabular foundation models can be vulnerable to small, carefully designed perturbations. Changing a few feature values in specific ways can flip predictions. For security-critical applications, this matters.

Complex temporal dependencies: Tabular foundation models treat each row as independent. If your problem requires understanding sequences—how a customer's behaviour evolved over time, not just their current state—the standard approach struggles.
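A common workaround is to fold the sequence into the row itself: engineer lag and rolling-window features so that each "independent" row carries its own history. A sketch with pandas (the column names are illustrative):

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "month":       [1, 2, 3, 1, 2],
    "spend":       [100.0, 80.0, 20.0, 50.0, 55.0],
})

# each row now carries its own history, so a row-independent
# model can still see how behaviour evolved over time
g = events.sort_values(["customer_id", "month"]).groupby("customer_id")["spend"]
events["spend_prev_month"] = g.shift(1)
events["spend_3mo_mean"] = g.rolling(3, min_periods=1).mean().reset_index(drop=True)
```

This recovers some temporal signal, but it is hand-crafted feature engineering of exactly the kind foundation models promise to eliminate, which is why native sequence handling remains an open problem.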

Handling long context: maintaining state-of-the-art performance on very long contexts remains an open challenge.

Knowing these failure modes helps us design appropriate safeguards and mitigate their impact. For example, NICL has been shown to perform strongly even under extreme class imbalance.

The LLM Convergence

Now for the exciting part: what happens when tabular foundation models meet large language models?

Using LLMs for Tabular Prediction

This sounds crazy at first. LLMs are trained on text. Your customer data is numbers in columns. What's the connection?

With clever prompting, LLMs can make surprisingly reasonable-sounding tabular predictions. You serialise the table into text, describe what you want to predict, provide a few examples, and ask for predictions.
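Serialisation is nothing fancier than turning rows into sentences. A minimal sketch; the prompt template and field names are illustrative, not any specific paper's format:

```python
def row_to_text(row: dict) -> str:
    """Serialise one table row as a comma-separated sentence."""
    return ", ".join(f"{k} is {v}" for k, v in row.items())

def build_prompt(examples, labels, query, target="churned"):
    """Few-shot prompt: labelled rows as examples, then the query row."""
    lines = ["Predict whether each customer churned (yes/no)."]
    for row, label in zip(examples, labels):
        lines.append(f"{row_to_text(row)} -> {target}: {label}")
    lines.append(f"{row_to_text(query)} -> {target}:")
    return "\n".join(lines)

prompt = build_prompt(
    examples=[{"age": 34, "plan": "basic", "monthly_spend": 20}],
    labels=["no"],
    query={"age": 71, "plan": "premium", "monthly_spend": 95},
)
print(prompt)
```

The LLM completes the final line with "yes" or "no". Everything interesting (and everything fragile) happens inside the model's text-level pattern matching.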

The results are significantly below the state-of-the-art, but they come with something unique: natural language explanations of why the model predicted what it did. The problem is, their reasoning is often wrong.

The Semantic Advantage

Here's where it gets interesting.

Traditional tabular models don't understand what columns mean. They see "age" as column 7 with numeric values. They don't know that "age" relates to life stages, health, experience, or wisdom.

LLMs, trained on human text, have absorbed massive amounts of world knowledge. They know that "age" and "years of experience" might correlate. They know that "income" in "USD" is different from "income" in "rupees." They understand that "cancer stage" has specific medical meaning.

Models like TabSTAR or SAP’s ContextTab leverage this. They use text encoders (like BERT) to embed column names, incorporating semantic meaning into tabular predictions. A column named "customer_lifetime_value" gets different treatment than "random_id_7" because the model understands what the name implies.
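The mechanics are easy to illustrate even without a neural encoder. Real systems embed column names with learned text encoders like BERT; as a crude stand-in, character-trigram overlap already separates a meaningful name from a meaningless one:

```python
def trigrams(name: str) -> set:
    """Character trigrams of a normalised column name."""
    name = name.lower().replace("_", " ")
    return {name[i:i + 3] for i in range(len(name) - 2)}

def similarity(a: str, b: str) -> float:
    # Jaccard overlap of trigram sets; a learned encoder would
    # replace this with cosine similarity of dense embeddings
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(similarity("customer_lifetime_value", "client_lifetime_value"))
print(similarity("customer_lifetime_value", "random_id_7"))
```

A BERT-style encoder goes further: it would also place "customer_lifetime_value" near "client_ltv" despite little surface overlap, because it has learned what the words mean rather than how they are spelled.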

This opens possibilities:

  • Better handling of novel columns the model has never seen
  • More intuitive feature relationships based on semantic similarity
  • Natural language interfaces for querying and explaining models

Multi-Modal Futures

The next frontier is probably multi-modal tabular learning. Real business problems don't fit neatly into "tabular" vs. "text" vs. "image" boxes.

Consider insurance claim processing:

  • Structured data: policy type, coverage amount, customer tenure
  • Text: claim description, adjuster notes
  • Images: photos of damage

A truly powerful system would reason across all of these together. Early research is exploring this direction, combining the pattern-recognition of tabular foundation models with the world knowledge of LLMs and the perceptual capabilities of vision models.

At Neuralk, we’re focusing not only on making the best tabular foundation models. We’re also challenging ourselves to build the brain and the relationships around the predictive model: scoping the right business question, sourcing and cleaning data (whatever its format), engineering meaningful features from domain knowledge, running predictions, generating actionable insights, and iterating. The goal is not to be the best in a lab, but to be the most valuable in a real-world enterprise setting.

What's Actually Next?

Based on current research trajectories, here's what to expect in the near term:

Larger Scale

TabPFN started with 1,000 samples. Version 2.5 handles 50,000. Neuralk’s NICL can handle tens of millions. Foundation models are now viable for the large-scale enterprise datasets where traditional methods currently dominate.

Hybrid Systems

Pure synthetic pre-training vs. pure real-data training is a false dichotomy. The future is hybrid:

  • Pre-train on synthetic data for broad coverage
  • Continue pre-training on real-world data for domain adaptation
  • Fine-tune on your specific dataset for maximum performance

This mirrors the trajectory of NLP: GPT (general pre-training) → domain-specific models (BioBERT, LegalBERT) → fine-tuning on your data.

Better Integration with Causal Inference

Much of data science isn't about prediction—it's about understanding causation. Will this marketing campaign cause higher sales, or just correlate with factors that would increase sales anyway?

Tabular foundation models, trained on structural causal models, are positioned to move beyond prediction into causal inference. Early work like CausalFM extends the PFN framework to estimate causal effects. Expect more development here.

Automated Data Science Assistants

Combine tabular foundation models with LLMs, add a dash of automated machine learning (AutoML), and you get something like an AI data scientist:

  • "Here's my dataset. What can you predict?"
  • "Why did this customer churn? What could we have done differently?"
  • "Generate synthetic data that looks like my real data but preserves privacy."

We're not fully there yet, but the pieces are coming together.

If any of these research topics interest you, our team is growing and we’d love to talk.

A Measured Conclusion

Tabular foundation models represent a genuine paradigm shift in how we approach structured data. The ability to make competitive predictions without feature engineering, hyperparameter tuning, or extensive training time is remarkable.

But they're not magic. They have limitations. They don't obsolete traditional methods. They're another tool—a powerful one—in the data scientist's toolkit.

The practitioners who will thrive are those who:

  • Understand when foundation models are appropriate vs. traditional methods
  • Know how to evaluate whether the models are working for their specific problem
  • Can combine multiple approaches intelligently
  • Stay current as the field rapidly evolves

The technology is young. The best practices are still being written. The papers that will define this field in five years probably haven't been published yet (but we’re working on them 😈).

That's what makes it exciting. If you’re looking to work in this field, please check out our Careers page.

Series Recap

Across these five articles, we've covered:

Part 1: The Revolution Comes to Spreadsheets

Introduced tabular foundation models—pre-trained AI systems that learn patterns from millions of tables and apply that knowledge to new datasets without retraining.

Part 2: The Old Guard

Explored traditional ML methods (decision trees, random forests, gradient boosting) that have dominated tabular data, and why they've worked so well.

Part 3: The Art of Fake Data

Revealed how synthetic datasets, generated from structural causal models, train foundation models to learn meta-skills applicable to any tabular problem.

Part 4: Where the Value Lies

Provided practical guidance for enterprises: when to use foundation models vs. traditional methods, and the real-world considerations beyond raw accuracy.

Part 5: The Unknown Frontier

Acknowledged limitations and open questions, explored the convergence with LLMs, and looked ahead at what's coming next.

Key Takeaways

→ We don't fully understand why tabular foundation models work as well as they do—surprising even their creators

→ Open questions remain around optimal priors, scaling laws, and failure modes

→ Adversarial vulnerability and distribution shift are known weaknesses

→ LLMs bring semantic understanding to tabular data—knowing that "age" means something about life stages, not just column 7

→ The future likely involves hybrid systems: synthetic pre-training + real-world data + domain fine-tuning

→ Integration with causal inference and multi-modal learning are active research frontiers

→ The field is young—best practices are still being written

Thanks for reading this series. The world of tabular AI is evolving rapidly, and we'll continue covering developments as they unfold. Have questions? Want to see specific topics explored? Drop us a note.

Glossary of Terms

- Adversarial perturbation: Small, carefully designed input changes that cause a model to make incorrect predictions

- Causal inference: Statistical methods for determining cause-and-effect relationships, not just correlations

- Distribution shift: When the data a model encounters differs from the data it was trained on

- Inductive bias: The assumptions a learning algorithm makes to generalise beyond training data

- Multi-modal learning: Training models on multiple types of data (text, images, tables) simultaneously

- Scaling laws: Mathematical relationships describing how model performance improves with more data and parameters

- Semantic understanding: Grasping the meaning of concepts, not just their statistical patterns