Streaming Millions of Database Rows

How I built a FastAPI + DuckDB pipeline to handle large data

The Bottleneck: Why Traditional APIs Fall Apart

Imagine you're building a data analytics platform and your users want to query millions of rows. A naïve REST API implementation would probably look something like this.

python
@app.get("/data")
def get_data():
    results = db.execute("SELECT * FROM large_table").fetchall()  # memory spike #1: all rows in memory
    json_data = json.dumps(results)                               # memory spike #2: the entire JSON string
    return {"data": json_data}                                    # send everything at once

The result? Your server crashes, users wait 30+ seconds, and your memory usage spikes to gigabytes. Not exactly a great user experience.

The Solution: Multi-layered Streaming Architecture

I built a managed geospatial database service during a previous internship, where users would upload 100+ GB geospatial datasets. The platform automatically provisioned and configured databases, indexed the data, and enabled users to run spatial joins and large-scale analytical queries efficiently.

Instead of treating queries as a single massive blob, I re-architected the pipeline around streaming. The design looks like this:

bash
DuckDB Query → Chunked Processing → orjson Serialization → ZSTD Compression → NDJSON Streaming → Client

The result: millions of rows delivered in under 5 seconds, with constant memory usage and sub-1-second time-to-first-byte.

Layer 1: DuckDB Database Processing

I chose DuckDB because it's built for analytical workloads.

Columnar storage — Unlike traditional row-oriented databases, DuckDB stores values column by column. This has three direct benefits for geospatial analytics:

  • Reduced I/O: Queries rarely need every attribute. If you're running SELECT population FROM parcels WHERE ST_Area(geom) > 1000, DuckDB only reads the population and geom columns, skipping dozens of others. In a row store, you would pull entire rows into memory.
  • Better compression: Storing similar values together (all populations, all city codes, etc.) enables dictionary and run-length encoding. This shrinks disk and memory footprints and speeds scans.
  • Vectorised execution: DuckDB processes data in cache-friendly batches (typically ~1,024 values at a time). That allows it to apply expensive functions like ST_Intersects across many geometries in a single vectorised pass, rather than one row at a time.

Late materialisation — DuckDB defers stitching columns back into rows until it's certain which rows survive filtering. For example, if 100M data points are filtered down to 2M, only those 2M are ever materialised as rows, saving substantial CPU and memory.

In-process execution — DuckDB runs as an embedded engine. There is no client-server network hop, which eliminates latency overhead and makes it simple to bundle into an API service (like FastAPI). This "bring the database to the app" model is especially powerful for provisioning isolated databases per tenant or per dataset.
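
To make the embedded model concrete, here is a minimal sketch of running an analytical query in-process with DuckDB's spatial extension. The database filename, the parcels table, and its columns are hypothetical stand-ins, not the service's actual schema.

python
import duckdb

# Open (or create) a tenant's database file in-process; there is no server to manage.
conn = duckdb.connect("tenant_1234.duckdb")

# Load the spatial extension for geometry functions such as ST_Area.
conn.execute("INSTALL spatial; LOAD spatial;")

# Thanks to columnar storage, only the `population` and `geom` columns are scanned.
rows = conn.execute(
    "SELECT population FROM parcels WHERE ST_Area(geom) > 1000"
).fetchmany(1000)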

Alternative Databases I Considered

  1. PostgreSQL: better concurrency and transactional guarantees, but slower for analytical scans (8–15s vs 5–7s).
  2. ClickHouse: excellent analytics engine with columnar storage, but operationally heavier and overkill for embedded use cases.
  3. BigQuery/Snowflake: serverless and powerful, but high latency and vendor costs made them poor fits for our real-time needs.

For an embedded, tenant-isolated platform dealing with 100+ GB user uploads, DuckDB hit the sweet spot.

Streaming directly from the database

The key breakthrough was implementing incremental streaming directly from the database cursor, rather than the common anti-pattern of loading everything into memory first:

python
# Anti-pattern: pre-load everything (5+ second delay)
result = execute_duckdb_query(db_path, query)         # loads ALL data into memory
return stream_chunked_query_results(result["rows"])   # streams pre-loaded data

# Better: stream directly from the database (<1 second to first byte)
async def stream_query_results_incremental(db_path, query, chunk_size=1000):
    # Send metadata immediately
    metadata = {"type": "query_metadata", "columns": [], "total_rows": None}
    yield _to_bytes_line(dumps, metadata)

    # Stream directly from the database cursor
    with connect_to_db(db_path) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        while True:
            rows = cursor.fetchmany(chunk_size)  # fetch one chunk from the DB
            if not rows:
                break
            # Convert and stream immediately
            data_packet = {"type": "data", "rows": rows}
            yield _to_bytes_line(dumps, data_packet)

This approach eliminates the 5-second delay because:

  • Metadata flows immediately (client knows schema before data starts)
  • Database cursor streams data (no pre-loading into memory)
  • First chunk ready in <1 second (not 5+ seconds)
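
For completeness, here is a rough sketch of how such an async generator could be wired into a FastAPI endpoint with StreamingResponse. The db_path lookup and query parameter handling are simplified assumptions, not the service's real routing logic.

python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/data")
async def stream_data(query: str):
    # Hypothetical path; the real service resolves this per tenant/dataset.
    db_path = "tenant_1234.duckdb"
    return StreamingResponse(
        stream_query_results_incremental(db_path, query),
        media_type="application/x-ndjson",
    )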

Layer 2: Chunked Processing for Constant Memory

Instead of loading everything into memory, we stream in batches:

python
from itertools import islice

def _chunked(iterable, size):
    # Yield successive lists of `size` items until the iterable is exhausted
    it = iter(iterable)
    while (batch := list(islice(it, size))):
        yield batch

def stream_raw_chunked_query_results(query_results, chunk_size=1000):
    for batch in _chunked(query_results, chunk_size):
        yield _to_bytes_line(dumps, {"type": "data", "rows": batch})

Benefits:

  • Instead of holding millions of rows in Python lists, you materialise a few thousand at a time, process them, then discard. Memory stays ~50–100 MB regardless of dataset size.
  • The first rows are sent to the client almost instantly, so users perceive the system as responsive even before the query finishes.
  • Chunk size is tunable: small chunks mean lower latency to the first row; larger chunks mean fewer syscalls and slightly better throughput.

Layer 3: High-performance Serialisation

Python's built-in json module buckles at scale, so I replaced it with orjson, a Rust-based library that's blazingly fast. orjson uses SIMD (Single Instruction, Multiple Data) instructions for tasks like escaping strings and copying buffers, whereas the standard json library works through the payload character by character. That is fine for smaller workloads but disastrous when serialising hundreds of megabytes of text.
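
The _to_bytes_line helper that appears throughout the snippets isn't shown in full; a minimal sketch of how such a helper might look with orjson, which returns bytes directly, is:

python
from orjson import dumps  # passed around as `dumps` in the snippets above

def _to_bytes_line(dumps_fn, obj) -> bytes:
    # Serialise one NDJSON line: a complete JSON object terminated by a newline.
    return dumps_fn(obj) + b"\n"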

Layer 4: Async Compression with ZSTD

So why compress in the first place? Because the bottleneck quickly shifts from CPU to network bandwidth.

NDJSON is verbose: every row repeats keys, brackets, and punctuation. A million rows can balloon into hundreds of megabytes. Even if your server can generate the data in seconds, pushing 500 MB uncompressed over a 100 Mbps link still takes ~40 seconds. Compression shrinks that payload by 60–80%, turning a painful 40-second transfer into under 10 seconds.

But compression has to be done carefully. If you naïvely compress everything at the end, you reintroduce latency — the client won't see any data until the full query finishes. Our solution was to stream and compress incrementally:

python
import asyncio
import zstandard as zstd

async def zstd_compress_stream(src, level=6):
    cctx = zstd.ZstdCompressor(level=level)
    sink = _ChunkSink()
    writer = cctx.stream_writer(sink)
    loop = asyncio.get_running_loop()
    for chunk in src:
        writer.write(chunk)
        # Flush a complete block off the event loop, then yield whatever landed in the sink
        await loop.run_in_executor(None, writer.flush, zstd.FLUSH_BLOCK)
        yield await loop.run_in_executor(None, sink.drain)
    # Finish the zstd frame so clients see a well-formed stream
    await loop.run_in_executor(None, writer.flush, zstd.FLUSH_FRAME)
    yield await loop.run_in_executor(None, sink.drain)
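
The _ChunkSink object isn't shown above; one plausible minimal implementation is a small in-memory buffer that collects whatever the compressor writes and hands it back on drain():

python
class _ChunkSink:
    """Collects compressed bytes written by the zstd writer until drained."""

    def __init__(self):
        self._buf = bytearray()

    def write(self, data: bytes) -> int:
        # Called by the zstd stream writer whenever compressed bytes are ready.
        self._buf.extend(data)
        return len(data)

    def drain(self) -> bytes:
        # Hand back everything accumulated so far and reset the buffer.
        out = bytes(self._buf)
        self._buf.clear()
        return out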

Layer 5: NDJSON Streaming

The final piece was choosing the right wire format. JSON arrays don't stream well — clients have to wait for the closing bracket before parsing anything. Instead I went with Newline Delimited JSON (NDJSON):

python
yield _to_bytes_line(dumps, {"type": "query_metadata", "columns": columns})
for batch in _chunked(query_results, chunk_size):
    yield _to_bytes_line(dumps, {"type": "data", "rows": batch})

Each line is a complete JSON object. Clients can read line by line, parse incrementally, and update their UI progressively. Metadata goes first so the client knows the schema before data arrives. If something fails mid-stream, you can even emit {"type":"error"} as another line instead of tearing down the connection.
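
On the consuming side, a client can parse the stream one line at a time. Here is a rough sketch using httpx, assuming for simplicity that the response arrives as plain, uncompressed NDJSON:

python
import httpx
import orjson

def consume_ndjson(url: str):
    with httpx.stream("GET", url, timeout=None) as response:
        for line in response.iter_lines():
            if not line:
                continue
            packet = orjson.loads(line)
            if packet["type"] == "query_metadata":
                print("columns:", packet["columns"])
            elif packet["type"] == "data":
                # Rows arrive incrementally; update the UI or accumulate as needed.
                print(f"received {len(packet['rows'])} rows")
            elif packet["type"] == "error":
                raise RuntimeError(packet)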
