The Data Pipeline is the New Secret Sauce

Jesse Robbins

Heavybit · by Jesse Robbins · Article

"The biggest challenge emerging is building and operating the infrastructure both for creating and running the data pipelines to build, manage, and maintain a robust, secure body of proprietary data."

— Jesse Robbins

I wrote this at Heavybit in September 2024. The argument: the data pipeline is the differentiating asset in enterprise AI. The piece lays out four inference hosting models and four enterprise maturity phases.

The data pipeline is the bottleneck in enterprise AI.

I wrote that at Heavybit in September 2024. The argument: enterprises that will get real work out of AI are the ones whose pipelines produce a secure first-party dataset and stay correct as the underlying systems change. Buying access to a model is one part of shipping AI into production. The harder work is operational.

I have lived this pattern. The conditions enterprises were grappling with in 2024 mapped one-to-one onto what I saw at Amazon as DevOps practices took shape: slow-to-evolve organizations, regulatory pressure, customer risk, and the same handful of operators carrying the weight. The starting point is different this time. DevOps had to drag the field toward continuous integration and delivery. Data pipelines for AI start there. I made that argument at Data Council earlier in 2024, before writing it up.

In September 2024, roughly 40% of enterprises surveyed said they had deployed an AI program or were actively exploring one. Microsoft was reporting 53,000 organizations using its AI offerings via Azure. Gartner had 87% of “mature organizations” carrying dedicated AI teams. These programs are not bought off the shelf. They produce an artifact: the internal dataset. It is the end result of a complicated toolchain run by a team that does this work full time. The data pipeline is that artifact’s production line. It is the thing that does or does not compound.

The piece maps four inference hosting models and four enterprise maturity phases. Most of the value is in the phase model. Phase 1 looks like real experimentation against a hosted API. Phase 2 is the moment teams realize they have to stand up an internal pipeline to extract durable value from the use cases they have proven. Phase 3 is cost shock. The bill from the API provider becomes the line item that gets the AI program in front of the CFO. Phase 4 is the only one that requires judgment, because the right answer for one workload is rarely the right answer for another. Mature enterprises end up running mixed inference configurations and treating optionality as the durable asset, not any single hosting choice.

The operational point underneath the taxonomies is the one I care about. A data pipeline is a continuous process that begins when the model ships to production. It needs the same monitoring, validation, and team discipline as any other production software, plus the security and privacy work that stops personally identifiable information from leaking on the way through. Without those practices, enterprises either fail to build their internal dataset at all or ship real business risk: privacy leaks, model drift, the cost of retraining on bad data. Cost-control work belongs in the same operational discipline, including model merging and mixture-of-experts as alternatives to retraining on the entire dataset.
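To make the privacy step concrete, here is a minimal Python sketch of a redaction stage that scrubs records before they enter the internal dataset and reports counts that could feed monitoring. It is an illustration under stated assumptions, not a method from the article: the names `PII_PATTERNS`, `redact`, and `scrub_records` are hypothetical, and the two regular expressions cover only simple email addresses and US-style Social Security numbers. Real PII detection needs far broader coverage than this.

```python
import re

# Hypothetical patterns for illustration only; they catch simple emails and
# US-style SSNs and will miss many other kinds of PII.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each match with a placeholder and count what was found."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label.upper()}]", text)
        counts[label] = n
    return text, counts


def scrub_records(records):
    """Redact every record; return the clean records and total counts per PII type."""
    totals = {label: 0 for label in PII_PATTERNS}
    clean = []
    for record in records:
        redacted, counts = redact(record)
        clean.append(redacted)
        for label, n in counts.items():
            totals[label] += n
    return clean, totals
```

Returning the counts alongside the cleaned text is a deliberate choice in this sketch: a sudden jump in redactions for one source is itself a signal that something upstream changed.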

In 2025, Heavybit backed Recce. I joined the board. CL Kao and his team are building data validation for the moment a pipeline change ships to production, which is exactly the discipline this article said enterprises would have to develop. Recce is the practical answer to a question this piece could only frame: how do you know your pipeline is still correct after the change?
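One way to picture that question is a before-and-after comparison of a pipeline's output. The Python sketch below is a generic illustration, not Recce's product or API: `profile` and `compare_profiles` are hypothetical helper names, the thresholds are arbitrary assumptions, and the only checks are row count and per-column null rate on a table represented as a list of dicts.

```python
def profile(rows, columns):
    """Summarize a table (list of dicts): row count plus null rate per column."""
    total = len(rows)
    null_rates = {}
    for col in columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        null_rates[col] = nulls / total if total else 0.0
    return {"row_count": total, "null_rates": null_rates}


def compare_profiles(before, after, max_row_change=0.05, max_null_increase=0.01):
    """Return human-readable findings; an empty list means nothing was flagged.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    findings = []
    if before["row_count"]:
        change = abs(after["row_count"] - before["row_count"]) / before["row_count"]
        if change > max_row_change:
            findings.append(f"row count changed by {change:.1%}")
    for col, old_rate in before["null_rates"].items():
        # A column missing from the new output is treated as entirely null.
        new_rate = after["null_rates"].get(col, 1.0)
        if new_rate - old_rate > max_null_increase:
            findings.append(
                f"null rate for '{col}' rose from {old_rate:.1%} to {new_rate:.1%}"
            )
    return findings
```

In use, the baseline profile would come from the output of the current production pipeline and the candidate profile from the same inputs run through the changed pipeline, with any findings blocking or annotating the change.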

"The biggest challenge emerging is building and operating the infrastructure both for creating and running the data pipelines to build, manage, and maintain a robust, secure body of proprietary data to train, fine-tune, and orchestrate LLM operations, and for running inference, the actual process of models running calculations on inputted data."

The piece runs through two frameworks.

Four inference hosting models.

  • Hosted API. Calling OpenAI, Anthropic, and the rest. The provider absorbs the cost and operational burden of running large models. Enterprises pay in tokens.
  • On-device edge. Smaller models running locally, often on high-end laptops, sometimes paired with three-billion-parameter open-weight checkpoints. Lower latency, better data locality, an unclear scaling story for larger teams.
  • On-premise data center. Everything behind the firewall. Most enterprise IT workloads moved off-prem years ago for total-cost reasons. AI inference is following the same path outside heavily regulated workloads.
  • Off-premise cloud via third-party data center. The managed-inference layer. Resembles traditional cloud computing more every quarter. Introduces network latency and dependency on the provider’s reliability posture.

Four enterprise maturity phases.

  • Phase 1, off-the-shelf cloud start. Most enterprises begin here, against a hosted API. Data science and operations teams focus on identifying valuable use cases. The hosting question is abstracted by the provider contract.
  • Phase 2, scaling what works. Teams have a pipeline that is good enough to deliver value on specific workloads. They harden privacy posture for the data types and jobs that matter. The bill starts to register.
  • Phase 3, cost shock and optimization. API spend hits a number that gets noticed. Teams reassess: continue paying for hosted inference, or invest in an internal model and the inference configuration to run it. Tooling investments here include pretraining datasets, data filtering, model evaluation.
  • Phase 4, specialization. Mature teams run mixed configurations and stop treating any single hosting choice as the answer. They prize optionality over vendor lock-in. They consider model merging and mixture-of-experts as alternatives to retraining on the entire dataset every time.
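The Phase 3 reassessment comes down to a break-even calculation. The Python sketch below shows the arithmetic under a deliberately simple model that is an assumption of this example, not something taken from the article: hosted inference costs a flat price per million tokens, while self-hosting costs a fixed monthly amount plus a smaller marginal price per million tokens. All function and parameter names are hypothetical, and any figures passed in are placeholders.

```python
def monthly_api_cost(tokens_per_month, price_per_million_tokens):
    """Hosted API spend: a flat price per million tokens."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def monthly_self_hosted_cost(fixed_monthly_cost, tokens_per_month, marginal_cost_per_million):
    """Self-hosted spend: fixed cost (hardware, staff) plus a marginal cost per million tokens."""
    return fixed_monthly_cost + tokens_per_month / 1_000_000 * marginal_cost_per_million


def break_even_tokens(price_per_million_tokens, fixed_monthly_cost, marginal_cost_per_million):
    """Monthly token volume above which self-hosting is cheaper.

    Returns None when self-hosting never becomes cheaper under this model.
    """
    saving_per_million = price_per_million_tokens - marginal_cost_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly_cost / saving_per_million * 1_000_000
```

With placeholder inputs of 10.0 per million hosted tokens, 20,000 in fixed monthly cost, and 2.0 per million self-hosted tokens, `break_even_tokens(10.0, 20_000.0, 2.0)` gives 2.5 billion tokens per month. A real decision would also weigh latency, privacy posture, and the engineering time the fixed cost stands in for, which is why Phase 4 tends toward mixed configurations.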

The piece names a specific operational point. The pipeline itself is the artifact, and building one demands operational effectiveness plus the security and privacy work that stops PII from leaking. Without that, enterprises at best fail to build their internal dataset, and at worst ship real business risk from privacy leaks, poor model performance, and the cost of retraining on bad data.

The article also cites Chaoyu Yang of BentoML, who points to specialized AI systems built for specific use cases as a likely source of durable advantage. Specialized infrastructure is where serious teams have ended up in 2025 and 2026.

Read the full article at Heavybit.

Since this came out…

  1. AWS, Azure, and GCP have all shipped managed inference offerings that map closely to the four hosting models in the article. The market converged on the same taxonomy. The open question now is which enterprises reach Phase 4 specialization and which stall at Phase 2.
  2. Heavybit backed Recce. I joined the board. Recce builds data validation for the change-management layer of the pipeline. That is exactly the discipline this article said enterprises would have to develop, applied to the moment a change ships to production.
