Faking It: Synthetic Datasets and Training Tabular Foundation Models (Part 3 of 5)

TabPFN was trained on 130 million datasets, but they were all fake. This isn't a bug; it's the entire point. Here's why synthetic data is the secret weapon of tabular foundation models.
Hugo Owen | Business | February 17, 2026

The Data Paradox

Here's the fundamental challenge of building a foundation model for tabular data:

GPT-4 was trained on trillions of words from the internet. CLIP learned from billions of image-caption pairs. The foundation model playbook is clear: gather massive amounts of data, train a giant neural network, and watch the magic happen.

But where's the equivalent for tables?

Unlike text (which is everywhere) or images (which fill the internet), high-quality labeled tabular datasets are surprisingly rare. Most business data is proprietary, sensitive, and locked behind corporate firewalls. Medical records can't be shared. Financial data is regulated. And even the public datasets that exist—like those on Kaggle or OpenML—number in the thousands, not millions.

You can't build a foundation model on a few thousand datasets; you need millions. So what do you do?

You make them up.

Why Fake Data Works

At first, this seems absurd. How can a model learn to predict real-world phenomena from data that was never real?

The answer lies in what we're actually trying to teach the model. TabPFN doesn't need to learn that "income predicts credit default" or "tumor size correlates with malignancy." Those are specific relationships in specific domains.

What TabPFN needs to learn is more fundamental:

  • How to detect patterns in columns of numbers
  • How to handle different types of noise and outliers
  • How to weight features of varying importance
  • How to interpolate and extrapolate from training examples
  • How to be appropriately uncertain when data is limited

These are meta-skills—skills for learning itself. And it turns out you can teach meta-skills using carefully constructed synthetic data.

The Structural Causal Model Approach

Tabular foundation models like TabPFN generate their training data using structural causal models (SCMs): mathematical frameworks that describe how variables cause and influence each other.

Imagine you're creating a synthetic dataset about... let's say, plant growth. An SCM might specify:

sunlight ~ Random(0, 10)
water ~ Random(0, 10)  
growth = 0.3*sunlight + 0.5*water + noise

This creates a fake dataset where sunlight and water independently affect growth, with water being more important. A model trained on this data learns: "Ah, sometimes one feature matters more than another. I should figure out which."
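A minimal runnable version of this toy SCM in Python might look like the following (the sample size and NumPy details here are illustrative assumptions, not taken from any real generator):

import numpy as np

rng = np.random.default_rng(42)
n = 500  # rows in our fake table

# Exogenous variables: drawn independently, they have no causes of their own.
sunlight = rng.uniform(0, 10, size=n)
water = rng.uniform(0, 10, size=n)

# Endogenous variable: caused by sunlight and water, plus noise.
growth = 0.3 * sunlight + 0.5 * water + rng.normal(0, 1.0, size=n)

# Stack into a table: two feature columns and one target.
X = np.column_stack([sunlight, water])
y = growth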

But that's just one dataset. The magic happens at scale.

Millions of Variations

Tabular foundation models are trained on millions of synthetic datasets, each generated with randomly varied parameters (a toy sampler is sketched after this list):

Relationship complexity: Some datasets have simple linear relationships. Others have complex, non-linear interactions. Some have threshold effects ("growth only happens if water > 5"). This variety teaches the model to detect different pattern types.

Feature importance: Sometimes all features matter equally. Sometimes only 2 out of 50 features are predictive. The model learns to identify which features are actually useful.

Noise levels: Real data is noisy. By varying the amount of noise added to synthetic data, the model learns to distinguish signal from randomness.

Sample sizes: Training datasets range from tiny (50 samples) to large (thousands of samples). This teaches the model how to behave with varying amounts of evidence.

Missing values: Values are randomly removed to simulate real-world incompleteness. The model learns to handle gaps gracefully.

Class imbalance: Some synthetic datasets have 50% positive cases; others have 5%. The model learns to calibrate probabilities appropriately.

Causal structures: By varying the directed acyclic graphs (DAGs) that define causal relationships, the model sees simple independent causes, complex chains, confounders, and mediators.
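To make this concrete, here is a toy sketch of what sampling a single task from such a generator might look like. Every choice here (the parameter ranges, the threshold trick, the missingness rate) is a made-up illustration rather than the actual recipe used by any particular model:

import numpy as np

def sample_synthetic_task(rng):
    """Draw one random classification dataset; each call is a new 'task'."""
    n_samples = rng.integers(50, 2000)               # varying sample sizes
    n_features = rng.integers(3, 50)                 # varying dimensionality
    n_informative = rng.integers(1, n_features + 1)  # only some features matter

    X = rng.normal(size=(n_samples, n_features))
    weights = np.zeros(n_features)
    weights[:n_informative] = rng.normal(size=n_informative)  # varying importance

    score = X @ weights
    if rng.random() < 0.3:  # occasional threshold effects
        score = np.where(X[:, 0] > rng.normal(), score, 0.0)
    score += rng.normal(scale=rng.uniform(0.1, 2.0), size=n_samples)  # varying noise

    cut = np.quantile(score, rng.uniform(0.5, 0.95))  # varying class imbalance
    y = (score > cut).astype(int)

    X[rng.random(X.shape) < rng.uniform(0, 0.2)] = np.nan  # missing values
    return X, y

rng = np.random.default_rng(0)
tasks = [sample_synthetic_task(rng) for _ in range(1_000)]  # scale to millions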

The result is a model that has "seen" virtually every type of statistical pattern that might appear in real data. It develops intuitions that generalize.

The Prior: Encoding Statistical Beliefs

Here's where things get philosophically interesting.

The synthetic data generation process defines a prior distribution over possible tabular problems. In Bayesian statistics, a prior represents your beliefs before seeing data.

Concretely, the generator encodes beliefs like:

  • Simple relationships are more common than complex ones (Occam's razor)
  • Features tend to have varying importance
  • Real relationships usually have some noise
  • Missing values are a normal part of data

When the model then sees your actual dataset, it combines this prior knowledge with the evidence in your data to form predictions. If your training set is small, the prior matters more—the model relies heavily on its general knowledge. If your training set is large, the data dominates—the model adapts to the specifics of your problem.

This is Bayesian inference, implemented through a neural network.
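In symbols (this is the standard formulation from the PFN literature, lightly simplified): for a query point $x$ and an observed dataset $D$, the network learns to output the posterior predictive distribution

p(y \mid x, D) = \int p(y \mid x, \varphi)\, p(\varphi \mid D)\, d\varphi

where $\varphi$ ranges over the data-generating mechanisms (for example, SCMs) that the prior can produce. The weighting term $p(\varphi \mid D)$ favors mechanisms consistent with your data, which is exactly the prior-versus-evidence trade-off described above.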

The Term You'll Hear: Prior-data Fitted Networks

The technical name for this approach is Prior-data Fitted Network (PFN).

A PFN is trained to approximate Bayesian inference. During pre-training, it sees millions of synthetic datasets (drawn from the prior), each split into a context it can observe and held-out points whose labels it must predict. Minimizing this prediction error pushes the network toward the Bayes-optimal answer, so it effectively internalizes the entire inference process.

At inference time, when you give it a new dataset, it performs what would normally require expensive (Bayesian) computations—but it does so in a single forward pass through the neural network. The prior is "baked in" to the network weights.

This is why tabular foundation models can be so fast. Traditional Bayesian inference is computationally expensive; neural networks running forward passes are fast. PFNs give you the best of both worlds.
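As a concrete illustration, the open-source tabpfn Python package wraps this behavior in a scikit-learn-style interface. The snippet below assumes the package is installed (pip install tabpfn) and uses a small public dataset; exact class names and defaults may vary between versions:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()
clf.fit(X_train, y_train)          # no gradient updates: the training set becomes the context
proba = clf.predict_proba(X_test)  # a forward pass approximates Bayesian prediction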

The Advantages of Synthetic Data

No data privacy concerns: You can't leak sensitive information from data that never contained real people.

Perfect labels: In real data, labels are often noisy—human labelers make mistakes, measurements have errors. Synthetic data has ground truth.

Unlimited scale: Need more training data? Generate it. There's no ceiling on dataset size.

Controlled diversity: You can ensure the training data covers edge cases that might be rare in real datasets.

No benchmark contamination: A major concern in ML is that models memorize benchmark datasets. If your foundation model was accidentally trained on the test set of a famous benchmark, its performance there is meaningless. Synthetic data sidesteps this entirely.

But What About Real Data?

Synthetic data isn't the only option, and recent research suggests combining approaches may be even better.

Some tabular foundation models take a different path. TabDPT, for example, pre-trains on real tabular datasets from OpenML, using a technique called column-masking (hiding some columns and predicting them from others). This lets the model learn from actual real-world statistical patterns, not just simulated ones.
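A toy illustration of the column-masking idea (a conceptual sketch, not TabDPT's actual training code):

import numpy as np

rng = np.random.default_rng(0)
table = rng.normal(size=(256, 8))   # stand-in for a real table from OpenML

col = rng.integers(table.shape[1])  # pick one column to hide
y = table[:, col].copy()            # the hidden column becomes the target
X = np.delete(table, col, axis=1)   # the remaining columns are the inputs

# Training a predictor to recover y from X forces the model to learn the
# table's internal statistical structure; repeating this across many tables
# and columns yields a self-supervised pre-training signal.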

Another TFM, Real-TabPFN, uses a two-stage approach:

  1. Stage 1: Pre-train on synthetic data (like original TabPFN)
  2. Stage 2: Continue pre-training on curated real-world datasets

The intuition is that synthetic data provides broad coverage and prevents overfitting, while real data adds the subtle, domain-specific patterns that synthetic generators might miss.

It's like learning to play music: you can develop solid technique practicing scales and exercises (synthetic), but you also need to play real songs (actual data) to develop musicality.

At Neuralk, we trained our tabular foundation model, NICL, exclusively on synthetic data. However, our synthetic data generation is carefully guided by real-world data to accurately reproduce the behaviors and patterns found in enterprise datasets. This approach ensures that NICL does not develop unintended biases or overfit to specific datasets, enabling robust generalization across diverse real-world applications.

The Craft of Data Generation

Creating the synthetic data generator is itself a research challenge. If your generator produces data that's too simple or too uniform, the model won't learn rich patterns. If it produces data that's too different from real-world distributions, the learned skills won't transfer.

Every tabular foundation model team spends considerable effort engineering its generator, using:

  • Bayesian Neural Networks: Neural networks with uncertainty over their weights, which naturally produce diverse input-output mappings (see the sketch after this list)
  • Structural Causal Models: Directed graphs defining causal relationships between variables
  • Realistic noise models: Not just Gaussian noise, but various distributions and patterns of missingness
  • Varied dimensionality: Different numbers of features and samples

This is meta-engineering: engineering the thing that generates the thing that trains the thing.
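To make the first ingredient concrete: the idea behind Bayesian-neural-network priors is that sampling a fresh set of random weights defines a new input-output mapping, and therefore a new synthetic dataset. A deliberately simplified sketch (not any team's actual generator):

import numpy as np

def dataset_from_random_mlp(rng, n_rows=200, n_features=4, hidden=16):
    """Each call samples new weights, i.e. a new random 'true function'."""
    X = rng.normal(size=(n_rows, n_features))
    W1 = rng.normal(size=(n_features, hidden))
    W2 = rng.normal(size=(hidden,))
    y = np.tanh(X @ W1) @ W2                 # random non-linear mapping
    y += rng.normal(scale=0.1, size=n_rows)  # observation noise
    return X, y

rng = np.random.default_rng(7)
X, y = dataset_from_random_mlp(rng)  # one synthetic task; resample for more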

When Synthetic Falls Short

Synthetic data isn't a silver bullet. Some limitations:

Distribution mismatch: If your real problem has patterns the synthetic generator can't produce, the model may struggle: for example, if your data has specific temporal dynamics or domain-specific structures not captured by general-purpose generators.

Domain knowledge gap: A synthetic generator doesn't know that "age" behaves differently from "temperature" even though both are numbers. Real data from specific domains carries semantic meaning that synthetic data lacks.

The generator is an assumption: Your choice of synthetic data generator implicitly assumes certain patterns are more common than others. If those assumptions don't match reality, performance suffers.

This is why the field is actively exploring hybrid approaches, combining synthetic pre-training with real-data fine-tuning.

Key Takeaways

→ Tabular foundation models face a data scarcity problem—high-quality labeled tables are rare compared to text or images

→ Tabular foundation models solve this by training on millions of synthetic datasets generated from structural causal models

→ Synthetic data teaches meta-skills: detecting patterns, handling noise, weighting features—not domain-specific facts

→ The synthetic data generator defines a "prior"—beliefs about what tabular problems typically look like

→ Prior-data Fitted Networks (PFNs) approximate Bayesian inference in a single forward pass

→ Advantages: unlimited scale, no privacy issues, perfect labels, no benchmark contamination

→ Hybrid approaches (synthetic + real data) may offer the best of both worlds

Next up: Part 4 gets practical. When should enterprises use tabular foundation models? What are the real-world considerations around latency, scale, and interpretability?

Glossary of terms

- Structural Causal Model (SCM): A mathematical framework describing how variables cause and influence each other
- Prior distribution: In Bayesian statistics, your beliefs about possible outcomes before seeing data
- Posterior distribution: Your updated beliefs after combining prior knowledge with observed data
- Prior-data Fitted Network (PFN): A neural network trained to approximate Bayesian inference on new datasets
- Directed Acyclic Graph (DAG): A graph with directed edges and no cycles, used to represent causal relationships
- Benchmark contamination: When a model is accidentally trained on test data, inflating its apparent performance
- Column-masking: A training technique where some columns are hidden and the model predicts them from others
- Forward Pass: The process of feeding input data through a neural network layer by layer, applying weights and activation functions at each step, to produce an output prediction