The Bottleneck: Why Traditional APIs Fall Apart
Imagine you're building a data analytics platform and your users want to query millions of rows. A naïve REST API implementation would probably look something like this:
@app.get("/data")
def get_data():
    results = db.execute("SELECT * FROM large_table").fetchall()  # memory spike #1: every row loaded at once
    json_data = json.dumps(results)                               # memory spike #2: a second full copy as a JSON string
    return {"data": json_data}                                    # send everything in one giant response
The result? Your server crashes, users wait 30+ seconds, and your memory usage spikes to gigabytes. Not exactly a great user experience.
The Solution: Multi-layered Streaming Architecture
During a previous internship, I was building a managed geospatial database service where users would upload 100+ GB geospatial datasets. The platform automatically provisioned and configured databases, indexed the data, and enabled users to run spatial joins and large-scale analytical queries efficiently.
Instead of treating queries as a single massive blob, I re-architected the pipeline around streaming. The design looks like this:
DuckDB Query → Chunked Processing → orjson Serialization → ZSTD Compression → NDJSON Streaming → Client
The result: millions of rows delivered in under 5 seconds, with constant memory usage and sub-1-second time-to-first-byte.
Layer 1: DuckDB Database Processing
I chose DuckDB because it's built for analytical workloads.
Columnar storage — Unlike traditional row-oriented databases, DuckDB stores values column by column. This has three direct benefits for geospatial analytics:
- Reduced I/O: Queries rarely need every attribute. If you're running SELECT population FROM parcels WHERE ST_Area(geom) > 1000, DuckDB only reads the population and geom columns, skipping dozens of others. In a row store, you would pull entire rows into memory.
- Better compression: Storing similar values together (all populations, all city codes, etc.) enables dictionary and run-length encoding. This shrinks disk and memory footprints and speeds up scans.
- Vectorised execution: DuckDB processes data in cache-friendly batches (typically ~1,024 values at a time), which lets it apply expensive functions like ST_Intersects across many geometries in a single vectorised pass.
Late materialisation — DuckDB defers stitching columns back into rows until it's certain which rows survive filtering. For example, if 100M data points are filtered down to 2M, only those 2M are ever materialised as rows, saving substantial CPU and memory.
In-process execution — DuckDB runs as an embedded engine. There is no client-server network hop, which eliminates latency overhead and makes it simple to bundle into an API service (like FastAPI). This "bring the database to the app" model is especially powerful for provisioning isolated databases per tenant or per dataset.
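To make the embedded, per-tenant model concrete, here is a minimal sketch; the database path, table, and columns are illustrative, and it assumes DuckDB's spatial extension is available:
import duckdb

# Each tenant/dataset gets its own database file; there is no server to provision.
con = duckdb.connect("/data/tenants/acme/parcels.duckdb")  # hypothetical path
con.install_extension("spatial")
con.load_extension("spatial")

# Thanks to columnar storage, only the population and geom columns are read
# from disk; every other column in the (possibly very wide) table is skipped.
rows = con.execute(
    "SELECT population FROM parcels WHERE ST_Area(geom) > 1000"
).fetchmany(1000)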
Alternative Databases I Considered
- PostgreSQL: better concurrency and transactional guarantees, but slower for analytical scans (8–15s vs 5–7s).
- ClickHouse: excellent analytics engine with columnar storage, but operationally heavier and overkill for embedded use cases.
- BigQuery/Snowflake: serverless and powerful, but high latency and vendor costs made them poor fits for our real-time needs.
For an embedded, tenant-isolated platform dealing with 100+ GB user uploads, DuckDB hit the sweet spot.
Streaming directly from the database
The key breakthrough was implementing incremental streaming directly from the database cursor, rather than the common anti-pattern of loading everything into memory first:
# Before: pre-load everything (5+ second delay)
result = execute_duckdb_query(db_path, query)         # loads ALL data into memory
return stream_chunked_query_results(result["rows"])   # streams data that is already fully loaded
# After: stream directly from the database (<1 second to first byte)
async def stream_query_results_incremental(db_path, query, chunk_size=1000):
    # Send metadata immediately so the client knows the schema before any data arrives
    metadata = {"type": "query_metadata", "columns": [], "total_rows": None}
    yield _to_bytes_line(dumps, metadata)

    # Stream directly from the database cursor, one chunk at a time
    with connect_to_db(db_path) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        while True:
            rows = cursor.fetchmany(chunk_size)  # fetch the next chunk from the DB
            if not rows:
                break
            # Convert and stream the chunk immediately
            data_packet = {"type": "data", "rows": rows}
            yield _to_bytes_line(dumps, data_packet)
This approach eliminates the 5-second delay because:
- Metadata flows immediately (client knows schema before data starts)
- Database cursor streams data (no pre-loading into memory)
- First chunk ready in <1 second (not 5+ seconds)
Layer 2: Chunked Processing for Constant Memory
Instead of loading everything into memory, we stream in batches:
from itertools import islice

def _chunked(iterable, size):
    # Yield successive lists of at most `size` items from the iterable
    it = iter(iterable)
    while (batch := list(islice(it, size))):
        yield batch

def stream_raw_chunked_query_results(query_results, chunk_size=1000):
    for batch in _chunked(query_results, chunk_size):
        yield _to_bytes_line(dumps, {"type": "data", "rows": batch})
Benefits:
- Instead of holding millions of rows in Python lists, you materialise a few thousand at a time, process them, then discard them. Memory stays around 50–100 MB regardless of dataset size.
- The first rows are sent to the client almost instantly, so users perceive the system as responsive even before the query finishes.
- Chunk size is tunable: smaller chunks mean lower latency to the first row, larger chunks mean fewer syscalls and slightly better throughput.
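As a usage sketch, a generator like this can be handed directly to FastAPI's StreamingResponse. The endpoint path and the run_query helper below are illustrative, and compression is layered on top in Layer 4:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/datasets/{dataset_id}/rows")
def stream_rows(dataset_id: str):
    # run_query(...) is a hypothetical helper that returns an iterable of rows
    query_results = run_query(dataset_id, "SELECT * FROM parcels")
    return StreamingResponse(
        stream_raw_chunked_query_results(query_results, chunk_size=1000),
        media_type="application/x-ndjson",
    )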
Layer 3: High-performance Serialisation
Python's built-in json module buckles at scale, so I replaced it with orjson, a Rust-based library that's blazingly fast. orjson uses SIMD (Single Instruction, Multiple Data) instructions for tasks like escaping strings or copying buffers, whereas the standard json library loops through character by character. That's fine for smaller workloads but disastrous when serialising hundreds of megabytes of text.
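The dumps passed into _to_bytes_line throughout these snippets is orjson's dumps. The helper itself isn't shown in this post; a minimal version, assuming the serialiser returns bytes the way orjson does, might look like this:
from orjson import dumps

def _to_bytes_line(dumps_fn, obj) -> bytes:
    # Serialise one record and terminate it with a newline (NDJSON framing).
    # orjson.dumps returns bytes directly, so there is no str.encode() round trip.
    return dumps_fn(obj) + b"\n"

line = _to_bytes_line(dumps, {"type": "data", "rows": [[1, "Berlin"]]})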
Layer 4: Async Compression with ZSTD
So why compress in the first place? Because the bottleneck quickly shifts from CPU to network bandwidth.
NDJSON is verbose: every row repeats keys, brackets, and punctuation. A million rows can balloon into hundreds of megabytes. Even if your server can generate the data in seconds, pushing 500 MB uncompressed over a 100 Mbps link still takes ~40 seconds. Compression shrinks that payload by 60–80%, turning a painful 40-second transfer into under 10 seconds.
But compression has to be done carefully. If you naïvely compress everything at the end, you reintroduce latency — the client won't see any data until the full query finishes. Our solution was to stream and compress incrementally:
import asyncio
import zstandard as zstd

async def zstd_compress_stream(src, level=6):
    cctx = zstd.ZstdCompressor(level=level)
    sink = _ChunkSink()                 # small buffer the writer compresses into
    writer = cctx.stream_writer(sink)
    for chunk in src:
        writer.write(chunk)
        # Flush a complete zstd block and hand the compressed bytes onward
        # without blocking the event loop
        await asyncio.get_running_loop().run_in_executor(None, writer.flush, zstd.FLUSH_BLOCK)
        yield await asyncio.get_running_loop().run_in_executor(None, sink.drain)
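_ChunkSink isn't shown above; the compressor just needs a small file-like object it can write compressed bytes into and that we can drain after each flush. A minimal sketch, assuming that's all it does:
class _ChunkSink:
    # In-memory buffer the zstd stream_writer writes compressed bytes into

    def __init__(self):
        self._buf = bytearray()

    def write(self, data: bytes) -> int:
        # Called by the zstd writer with compressed output
        self._buf.extend(data)
        return len(data)

    def drain(self) -> bytes:
        # Hand back everything buffered so far and reset
        out = bytes(self._buf)
        self._buf.clear()
        return out
With that in place, the NDJSON generator from Layer 2 can be wrapped as zstd_compress_stream(ndjson_generator) before being handed to StreamingResponse.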
Layer 5: NDJSON Streaming
The final piece was choosing the right wire format. JSON arrays don't stream well — clients have to wait for the closing bracket before parsing anything. Instead I went with Newline Delimited JSON (NDJSON):
yield _to_bytes_line(dumps, {"type": "query_metadata", "columns": columns})
for batch in _chunked(query_results, chunk_size):
    yield _to_bytes_line(dumps, {"type": "data", "rows": batch})
Each line is a complete JSON object. Clients can read line by line, parse incrementally, and update their UI progressively. Metadata goes first so the client knows the schema before data arrives. If something fails mid-stream, you can even emit {"type":"error"} as another line instead of tearing down the connection.
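On the client side, consuming this stream is just as simple. Here's a sketch in Python; the URL is hypothetical, and it assumes the response body is the raw zstd stream produced above:
import io
import json

import requests
import zstandard as zstd

with requests.get("http://localhost:8000/datasets/acme/rows", stream=True) as resp:
    resp.raise_for_status()
    # Decompress incrementally and read one NDJSON line at a time
    reader = zstd.ZstdDecompressor().stream_reader(resp.raw)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        packet = json.loads(line)
        if packet["type"] == "query_metadata":
            print("columns:", packet["columns"])
        elif packet["type"] == "data":
            print(f"received {len(packet['rows'])} rows")  # update the UI / accumulate rows here
        elif packet["type"] == "error":
            raise RuntimeError(packet.get("message", "stream failed"))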
Resources: