Most organisations running analytics are paying a tax they don't see on any invoice.
It's not the cloud bill. Or the headcount. It's the hidden cost of stitching five or more tools together and calling it a Modern Data Platform.
You know the stack. The one the latest hire asked ChatGPT to list on their résumé like a Pokédex. Polars or Pandas for batch processing. Kafka and Flink for streaming. SciPy for statistics. dbt for SQL transformations. And usually more besides.
So your team writes glue layers between the batch engine and the streaming engine. Translation code so the statistician's Python model experiments can run live on production data. Adapter logic extending SQL-focused engines like dbt beyond what they were designed to do. All of this glue ships bugs rather than value.
And then someone asks: "Can we run this analysis in real-time?"
And the honest answer is: "no, it doesn't support it, and we'd need to rebuild it". The batch and the streaming pipelines are different codebases, written in separate programming languages, with their own failure edge cases. The model that works on yesterday's data doesn't just drop into a live feed. Not without weeks (or months) of engineering to bridge the gap.
Now the CEO is asking why the recent multi-million dollar data investment cannot deliver.
This is the analytics stack tax. Not any single tool's fault — the cost lives in the integration surface between all of them.
What if you didn't have to pay the toll?
Lightning-fast. No translation.
SpaceCell Lightning is the Live analytics engine. One compute engine across data manipulation, statistical analysis, and real-time execution.
Most systems are designed to process finite, static datasets, with live streaming bolted on as an afterthought when the tooling doesn't deliver. Lightning inverts this. Batch processing is simply a stream with a known end. The computation drains to completion on the same engine, running the same code with the same semantics — whether the live stream is finite or continuous.
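To make the "batch is a stream with a known end" idea concrete, here is a minimal sketch in plain Rust (std only, not Lightning's API): one aggregation written once against any iterator, fed first by a finite in-memory dataset and then by a channel of arriving values. The same function drains both to completion.

```rust
use std::sync::mpsc;
use std::thread;

// One aggregation, written once against any iterator of f64.
// A finite batch and a live channel both satisfy the same bound,
// so the logic never forks into "batch" and "streaming" variants.
fn running_sum(values: impl Iterator<Item = f64>) -> f64 {
    values.fold(0.0, |acc, v| acc + v)
}

fn main() {
    // Batch: a finite, in-memory dataset.
    let batch = vec![1.0, 2.0, 3.0];
    let batch_total = running_sum(batch.into_iter());

    // "Stream": values arriving over a channel. Dropping the sender
    // is the known end that lets the computation drain to completion.
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for v in [1.0, 2.0, 3.0] {
            tx.send(v).unwrap();
        }
    });
    let stream_total = running_sum(rx.into_iter());

    assert_eq!(batch_total, stream_total);
    println!("batch = {}, stream = {}", batch_total, stream_total);
}
```

The point of the sketch: when the engine is stream-shaped, "batch" is not a second code path, just a stream that happens to end.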
Here's what averaging a column on a data Table looks like:
let result: Table = sales
    .group_by(&["product"])
    .select(&["revenue"])
    .mean()
    .collect()?;
Now here's the same operation running in real time, on a continuous data flow. This is a live data pipeline.
let results: LiveStream<Table> = market_data
    .dam("1h")
    .group_by(&["product"])
    .select(&["revenue"])
    .mean()
    .live()?;

for window in results {
    // process each hourly window, as it completes
}
Most teams need separate systems and translation layers to achieve this. Lightning provides it out of the box in one compute model.
Same group_by. Same semantics. Your data scientist writes the logic once, and it works without translation. No three-month rewrite to make the same logic run continuously on live data.
Live is not a mode. It's the premise.
The previous generation of analytics tools was built to solve a specific problem: making sense of big data lakes. Hadoop was slow. Spark was bloated and infra-heavy. Pandas couldn't scale. Tools like Polars and DuckDB emerged as engineering feats to address that challenge — fast, modern and popular.
But they solve that problem, and only that problem. The assumption baked in is: you have a dataset. You load it into memory. You query it. You get a result. When the dataset doesn't fit, you chunk it.
Here's what actually happened to that dataset, though. Before you loaded it, it accumulated somewhere. A queue, a staging table, an S3 bucket, a Kafka topic with a consumer that batches into Parquet files every hour. Events happened continuously in the real world — transactions, sensor readings, user actions, market ticks — and your infrastructure collected them into a pile so your batch engine could process them later.
The pile is the bottleneck. Not the processing speed. By the time it's been through those systems, you're often already a day late.
So how much do milliseconds matter? That's the underlying issue: you might save 20 milliseconds on a query, but if you're a day (86,400,000 ms) behind the event your organisation wants to influence, how do you hope to get on top of it? The batch assumption means you are studying history lessons for next time.
And the tax: the time spent orchestrating and keeping those touchpoints healthy often exceeds the time spent on the analysis itself. On top of paying the pile's data storage and compliance costs.
Unless you're running surveys or processing census data, events occur continuously. A customer churned while your pipeline was waiting for the next hourly extract. A sensor crossed a threshold while your batch job was queued behind three others. A market moved while your data sat in a staging table waiting for the 2am run that often fails.
Keeping your system up to date means your team, your dashboards, and every workflow built off that data feed get the updated reality this second. Not hourly or nightly. As it happens — with any exceptions routed for immediate resolution.
Lightning is this. Data flows in, computation happens, results flow out live. Continuously. There's no accumulation step unless you choose one.
Want an average over a time period? Whack a Dam on it:
let hourly_avg: LiveStream<Table> = sensor_feed
    .dam("1h")
    .mean()
    .live()?;

for result in hourly_avg {
    // updated average, every hour, as data arrives
}
Fill the dam with as much data as you need. When it's full, it processes.
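The mechanics of a count-based dam can be sketched in a few lines of plain Rust (illustrative only, not Lightning's implementation): buffer incoming records until the window is full, release the window's aggregate, start refilling.

```rust
// A minimal count-based "dam" sketch: accumulate until the window holds
// `capacity` records, then emit that window's mean and empty it.
fn dam_means(feed: impl Iterator<Item = f64>, capacity: usize) -> Vec<f64> {
    let mut window = Vec::with_capacity(capacity);
    let mut out = Vec::new();
    for value in feed {
        window.push(value);
        if window.len() == capacity {
            // The dam is full: process the window, then start refilling.
            out.push(window.iter().sum::<f64>() / capacity as f64);
            window.clear();
        }
    }
    out
}

fn main() {
    let feed = [1.0, 3.0, 5.0, 7.0].into_iter();
    let means = dam_means(feed, 2);
    assert_eq!(means, vec![2.0, 6.0]);
}
```

A time-based dam like dam("1h") works the same way, except the release condition is the clock rather than a record count.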
Need a rolling view of the 1000 latest records for each hardware sensor? Different scenario, same approach:
let rolling: LiveStream<Table> = sensor_feed
    .dam("1000")
    .group_by(&["sensor_id"])
    .mean()
    .live()?;
Or a continuously updating global average from the first record onwards, feeding a trading model:
let global: LiveStream<Table> = sensor_feed
    .mean()
    .live()?;
From here, it is straightforward to push that data directly into any APIs the organisation uses that benefit from this live state.
With a rich set of base streaming primitives you can chop, change and control what you need and when you need it. Live.
Speed that actually matters
Running live changes what your organisation can do, not just how fast it does it.
You stop briefing on yesterday and start operating on today. Decisions get made against the situation as it is, not as it was. Issues get caught and routed before they compound. The work shifts from justifying what happened to driving what happens next.
Lightning runs at the point of capture, so your analysis is ready while it still counts. From there, the enriched feed plugs into whatever matters: live dashboards, risk alerts, application services acting on emerging patterns, algorithmic trades via external APIs, workflow tickets for outbound recourse, or hardware control loops on a production line.
Whip out the Ferrari and drive your organisation to victory.
One mode, any datatype.
Many existing tools require different syntaxes for multiple contexts — and the complexity multiplies. Eager versus lazy expression trees. Each mode is a separate API surface to learn. Lightning takes a different approach. The same .sum() works across every context — array, table, grouped, and live stream — with consistent semantics:
// Array
let total: Array = prices.sum().collect()?;

// Table — broadcasts across columns
let totals: Table = trades.sum().collect()?;

// Grouped — same syntax
let by_region: Table = trades
    .group_by(&["region"])
    .sum()
    .collect()?;

// Live streaming — same syntax
let live: LiveStream<Table> = feed
    .dam("1h")
    .group_by(&["region"])
    .sum()
    .live()?;
Four contexts, same language. Learn the ergonomic, pandas-like pattern once — and it works as you roar from static analysis to live production.
Why Rust, and why now
The above isn't pseudocode — it's Rust, working out there today.
There's a persistent myth that Rust means low-level, verbose, and "fighting the borrow checker". That's a design issue, not a language barrier. Rust can be high-level if the API supports it. Data tooling tends to celebrate complexity. Lightning strives for simplicity.
Ergonomics aside, the market's realised something: AI writes better Rust than Python.
When an LLM generates a Python analytics pipeline, it has to guess the underlying types, because the language hides them from the user. And because there is no compilation step, the pipeline often looks correct yet silently produces the wrong answer at run-time because a column was a string where it expected a float. You find out during the weekend production incident you got paged for, dreading it at 8am over your Weetbix. Or worse, you don't find out at all.
And you don't want to sit there whilst your agent runs the app to check, combing through logs whilst you manually simulate data edge cases. This is why AI agents like Claude Code write better Rust: the compiler lets them analyse your program at compile-time, surface these errors beforehand, and loop you back in once those parts are solved.
In Rust, the compiler catches it. AI thrives on constraints because the guesswork has been eliminated. The types are real. When an LLM generates a Lightning pipeline, either it compiles, or it doesn't and you know before it ever runs. No "it ran but the answer was wrong because Pandas silently coerced an int column to float."
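A small illustration of the guarantee, in plain Rust (the struct and field names here are illustrative, not Lightning's schema API): once a field is declared as f64, no code path can silently hand it a string.

```rust
// Illustrative types: once `revenue` is an f64, a string can never
// sneak into the aggregation at run-time.
struct Trade {
    product: String,
    revenue: f64,
}

fn total_revenue(trades: &[Trade]) -> f64 {
    trades.iter().map(|t| t.revenue).sum()
}

fn main() {
    let trades = vec![
        Trade { product: "A".into(), revenue: 10.0 },
        Trade { product: "B".into(), revenue: 2.5 },
    ];

    // The compiler rejects this line before anything runs:
    // let bad = Trade { product: "C".into(), revenue: "12.5" };
    // error[E0308]: mismatched types — expected `f64`, found `&str`

    assert_eq!(total_revenue(&trades), 12.5);
}
```

The string-where-a-float-was-expected bug becomes a compile error an agent can fix in its loop, rather than a run-time surprise you discover in production.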
Don't fight fires. Avoid lighting matches!
As long as you manage the agent carefully, your AI-generated pipeline can be production-ready the moment it compiles. Rust provides strong guarantees that eliminate an entire class of Python runtime bugs. Your agent iterates against a compiler that provides instant feedback, whilst deferring to your expertise on the hard stuff — freeing you up to multi-task, manage and monitor.
The problem runs deeper than tooling
The stack tax isn't just about running five tools. It's organisational drag.
Hiring splits. Some people know Databricks, others know Flink. Different skill sets, mental models, ecosystems. Your team fragments along tool boundaries and vendor badges instead of around the problems to solve.
Testing gaps. Your batch pipeline has tests. Your streaming pipeline has different tests. You repeat many of the same data quality assertions. The integration between them has whatever someone remembered to write at 4pm on a Friday. The bugs live in the seams — where your coverage is thinnest.
Latency to insight. A data scientist builds a model in a notebook using Pandas and SciPy. Getting that model into production on live data means translating it into a streaming framework. That translation takes weeks and introduces subtle behavioural differences that take more weeks to reconcile.
Operational complexity. More tools mean more upgrade cycles, more breaking changes, more monitoring dashboards, more on-call runbooks. Your platform team spends more time keeping the stack healthy than building on top of it — until the organisation restructures every 2-3 years wondering why it isn't getting enough value out of the data function.
None of these problems are technical in isolation. They're systemic. They come from stitching together tools that were never designed to work together.
Why this matters commercially
The stack tax is real money. But it's not even the most expensive part.
Batch architectures that materialise entire datasets in memory have a useful property — for infrastructure providers. Every out-of-memory error is an upgrade opportunity. Every dataset that outgrows its instance is a pricing tier transition. The tooling ecosystem around data lakes evolved in lockstep with the cloud platforms that host them, and the incentives are not pointing in your direction. Your Out-Of-Memory (OOM) errors are someone else's revenue growth.
Lightning processes data as a live stream. It doesn't need to hold your entire dataset in memory to produce results. There's no OOM cliff that forces you onto a larger instance. Data flows through the computation graph, results come out the other side, and memory scales with your data rate and windowing strategy, rather than your accumulated volume.
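The memory claim is easy to see with a sketch (plain Rust, not Lightning's internals): a running mean whose state is two scalars, no matter how many records flow through. The incremental update below is the standard Welford-style formulation.

```rust
// Constant-memory running mean: two scalars of state regardless of how
// many records have flowed through. A sketch of why stream-shaped
// compute has no OOM cliff — memory tracks the state, not the history.
struct RunningMean {
    count: u64,
    mean: f64,
}

impl RunningMean {
    fn new() -> Self {
        RunningMean { count: 0, mean: 0.0 }
    }

    // Welford-style incremental update: numerically stable, O(1) memory.
    fn push(&mut self, value: f64) {
        self.count += 1;
        self.mean += (value - self.mean) / self.count as f64;
    }
}

fn main() {
    let mut stats = RunningMean::new();
    // A thousand records, none of them retained after processing.
    for i in 0..1_000u64 {
        stats.push(i as f64);
    }
    assert!((stats.mean - 499.5).abs() < 1e-9);
}
```

Aggregates like sum, count, mean, and variance all admit this shape; windowed operations hold only the current window. Accumulated volume never enters the memory equation.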
A mid-size data team running Pandas + Kafka + SciPy + dbt is paying for multiple sets of infrastructure, an integration layer for each pair, and the engineering time to keep the orchestration story coherent-ish. Easily 30-40% of a data platform team's effort goes to integration rather than analysis. Lightning collapses that into one engine surface, one monitoring setup, and one upgrade cycle.
For algorithmic trading desks, this has a sharper edge. The gap between backtesting and live execution is where money leaks. If your backtest runs on one engine and your live system runs on another, every behavioural difference is a risk. Lightning closes that gap structurally — the backtest and the live system run the same code, engine and numerical behaviour. The Python round-trip is 30 microseconds, so you might not need to rewrite your existing stack to use it.
And this isn't just a data transformation engine with a few aggregates bolted on. Lightning ships with full statistical modelling — regression, hypothesis testing, distributions, time-series analysis — the kind of thing you'd normally reach for statsmodels or SciPy for, built in and running at native Rust speed. Your quant's model and your production pipeline are the same code.
For IoT and manufacturing, sensor streams shouldn't need five services between ingestion and dashboard. Rolling statistics, anomaly detection, windowed aggregation — one pipeline with sub-millisecond overhead.
For teams that have been told "we can't do that live" — more often than not, the barrier isn't the analysis itself. It's the cost of rebuilding it for a different set of tools. Lightning gets it right the first time.
Under the surface
Lightning is fast. Aside from operating at the moment of the event, it exploits modern hardware to deliver extreme performance even on commodity machines. By leveraging SIMD-native parallel processing, avoiding async scheduler overhead in the hot path, and retaining strong typing throughout the engine, Lightning pipelines ship with the full suite of Rust optimisations so your decisions are enacted at light-speed.
Built on a custom Apache Arrow implementation, all scientific computations are validated against SciPy reference implementations to stringent numerical accuracy standards. The full architecture is documented separately for engineering teams.
Lightning is the compute layer. Your warehouse, lake, or object store remains where data persists — Iceberg, Delta, Parquet on S3, Snowflake, whatever you're already running. Lightning reads from it, writes to it, and runs the same code whether the source is a live Kafka topic or a backfilled Parquet file. The unification is at the compute layer, not the storage layer. This is deliberate: storage decisions involve cost, compliance, and vendor relationships you probably don't want an analytics engine dictating. That said, because data can be cleaned and transformed in Lightning, running QA and exception handling upfront may let you eliminate a number of persistence steps and simplify that picture.
Persistence, replay, and fan-out are expressed in the same pipeline primitives as the live compute itself — write a stream to disk, read it back through the same code, tee (copy) a feed to multiple sinks. The engine doesn't prescribe a durability model; it gives you the pieces to persist whatever fits within your latency and risk budget. For teams already operating event-driven services, the compute path stays simple while the surrounding durability model stays under your control.
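As a flavour of what fan-out looks like when it's just a pipeline primitive, here is a minimal "tee" sketch in plain Rust with std channels (names and sinks are illustrative, not Lightning's API): one feed, every record delivered to two independent sinks.

```rust
use std::sync::mpsc;

// Deliver every record from one feed to each sink — say, a live
// dashboard and a durable log. Each sink gets its own copy.
fn tee<T: Clone>(feed: impl Iterator<Item = T>, sinks: &[mpsc::Sender<T>]) {
    for record in feed {
        for sink in sinks {
            let _ = sink.send(record.clone());
        }
    }
}

fn main() {
    let (dash_tx, dash_rx) = mpsc::channel();
    let (log_tx, log_rx) = mpsc::channel();

    // The temporary sender array drops after this call, closing both
    // channels and signalling end-of-stream to the receivers.
    tee([1, 2, 3].into_iter(), &[dash_tx, log_tx]);

    let dashboard: Vec<i32> = dash_rx.into_iter().collect();
    let log: Vec<i32> = log_rx.into_iter().collect();
    assert_eq!(dashboard, log);
}
```

Swap one sink for a disk writer and you have persistence; read the file back through the same pipeline code and you have replay. The durability model stays yours.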
Where Lightning shines is the live pipe — a continuous feed in, real computation across it, an enriched stream out the other side. WebSocket feeds, Kafka topics, TCP market data, sensor telemetry — anything where the data is moving and the value is in acting on it before it's too late.
In summary, SpaceCell Lightning is engineered from the foundations up, free of inherited batch-era assumptions, to ensure data keeps up with the urgent needs of the modern enterprise.
The window is now
The stack tax exists because the tools that came before were each built to solve one piece of the puzzle. But every seam between them is a cost — in latency, in hiring, in bugs that live in the integration layer, in decisions made on stale data while fresh data sat in a queue waiting to be processed, and in tools that are unable to talk to each other.
The backtest and the live system can be the same code. The quant's model and the production pipeline can be the same engine. The batch report and the streaming dashboard can agree — because they were computed by the same system.
Lightning processes at the point of capture. Analysis and action live in the same pipeline, responding to fresh events rather than queuing them up for tomorrow's history lesson.
The window is open. It's time to act.
Lightning is for engineers working with live data — market feeds, sensor streams, telemetry, anything where the signal matters in the moment and the analysis has to keep up with real-time operating frequency. If you're building a market-leading system and want compute that runs where your data already flows, from event to outcome — get in touch.
Request early access and we'll send you a heads-up the moment Lightning is available.
Request Early Access