Query optimization

Jetliner's scan_avro() integrates with Polars' query optimizer to minimize I/O and memory usage. For buffer and batch size tuning, see Streaming.

Optimizations

Optimization	What it does	Benefit
Projection pushdown	Deserializes only the columns the query uses	Less CPU, less memory
Predicate pushdown	Applies filters while reading	Less memory, faster queries
Early stopping	Stops reading once the row limit is reached	Faster for `head()`/`limit()`

Example

import jetliner
import polars as pl

# Highly optimized: deserializes 2 columns, filters during read, stops at 1000 rows
result = (
    jetliner.scan_avro("data.avro")
    .select(["user_id", "amount"])
    .filter(pl.col("amount") > 100)
    .head(1000)
    .collect()
)

These optimizations also apply when streaming with sink_batches() or collect_batches(); see Streaming.

Polars automatically detects which columns and filters to push down. See the Polars user guide for details on how the query optimizer works.

Using read_avro() with columns

For eager loading, use read_avro() with the columns parameter:

df = jetliner.read_avro("data.avro", columns=["user_id", "amount"])

Limitations

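Sorting: Early stopping doesn't apply when the query sorts before limiting, because all data must be read before the first sorted row is known.
Complex expressions: Some complex filter expressions may not push down. The query still returns the same result, but the filter is applied after the data is read.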