scan_avro

jetliner.scan_avro(source, *, n_rows=None, row_index_name=None, row_index_offset=0, glob=True, include_file_paths=None, ignore_errors=False, storage_options=None, buffer_blocks=4, buffer_bytes=64 * 1024 * 1024, read_chunk_size=None, batch_size=100000, max_block_size=512 * 1024 * 1024)

Scan Avro file(s), returning a LazyFrame with query optimization support.

This function uses Polars' IO plugin system to enable query optimizations:

  • Projection pushdown: Only read columns that are actually used in the query
  • Predicate pushdown: Apply filters during reading, not after
  • Early stopping: Stop reading after the requested number of rows
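
As an illustrative sketch (the file and column names below are placeholders), the optimized plan can be inspected with Polars' LazyFrame.explain() to confirm these optimizations reach the scan:

>>> import jetliner
>>> import polars as pl
>>> lf = (
...     jetliner.scan_avro("events.avro")   # hypothetical file
...     .select(["user_id", "amount"])      # projection pushdown
...     .filter(pl.col("amount") > 100)     # predicate pushdown
...     .head(1000)                         # early stopping
... )
>>> print(lf.explain())  # prints the optimized plan with the column selection, filter, and row limit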

Parameters:

Name Type Description Default
source str | Path | Sequence[str] | Sequence[Path]

Path to Avro file(s). Supports:

  • Local filesystem paths: /path/to/file.avro, ./relative/path.avro
  • S3 URIs: s3://bucket/key.avro
  • Glob patterns with standard wildcards:
      • * matches any characters except / (e.g., data/*.avro)
      • ** matches zero or more directories (e.g., data/**/*.avro)
      • ? matches a single character (e.g., file?.avro)
      • [...] matches character ranges (e.g., data/[0-9]*.avro)
      • {a,b,c} matches alternatives (e.g., data/{2023,2024}/*.avro)
  • Multiple files: ["file1.avro", "file2.avro"]

When multiple files are provided, schemas must be compatible.

required
n_rows int | None

Maximum number of rows to read across all files. None means read all rows.

None
row_index_name str | None

If provided, adds a row index column with this name as the first column. With multiple files, the index continues across files.

None
row_index_offset int

Starting value for the row index (only used if row_index_name is set).

0
glob bool

If True, expand glob patterns in the source path. Set to False to treat patterns as literal filenames.

True
include_file_paths str | None

If provided, adds a column with this name containing the source file path for each row. Useful for tracking data provenance with multiple files.

None
ignore_errors bool

If True, skip corrupted blocks/records and continue reading. If False, fail immediately on the first error. Skipped errors are not accessible in scan mode.

False
storage_options dict[str, str] | None

Configuration for S3 connections. Supported keys:

  • endpoint: Custom S3 endpoint (for MinIO, LocalStack, R2, etc.)
  • aws_access_key_id: AWS access key (overrides environment)
  • aws_secret_access_key: AWS secret key (overrides environment)
  • region: AWS region (overrides environment)

Values here take precedence over environment variables.

None
buffer_blocks int

Number of blocks to prefetch for better I/O performance.

4
buffer_bytes int

Maximum bytes to buffer during prefetching.

64MB
read_chunk_size int | None

Read buffer chunk size in bytes. If None, auto-detects based on source: 64KB for local files, 4MB for S3. Larger values reduce I/O operations but use more memory.

None
batch_size int

Minimum number of rows per DataFrame batch. Rows accumulate until this threshold is reached, then a batch is yielded. Batches may exceed this size because blocks are processed whole, and the final batch may be smaller if fewer rows remain.

100,000
max_block_size int | None

Maximum decompressed block size in bytes. Blocks that would decompress to more than this limit are rejected. Set to None to disable the limit. Default: 512MB (536870912 bytes).

This protects against decompression bombs: maliciously crafted files where a small compressed block expands to consume excessive memory.

512MB
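
As a sketch of how the buffering and safety parameters above can be combined (the values are illustrative, not tuned recommendations):

>>> lf = jetliner.scan_avro(
...     "s3://bucket/large-dataset/*.avro",
...     read_chunk_size=8 * 1024 * 1024,    # larger reads for high-latency object storage
...     buffer_blocks=8,                    # prefetch more blocks ahead of decoding
...     buffer_bytes=128 * 1024 * 1024,     # cap total prefetch memory
...     max_block_size=None,                # disable the limit only for trusted files
... )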

Returns:

Type Description
LazyFrame

A LazyFrame that can be used with Polars query operations.

Examples:

>>> import jetliner
>>> import polars as pl
>>>
>>> # Basic scan
>>> lf = jetliner.scan_avro("data.avro")
>>>
>>> # With query optimization
>>> result = (
...     jetliner.scan_avro("data/*.avro")
...     .select(["col1", "col2"])
...     .filter(pl.col("amount") > 100)
...     .head(1000)
...     .collect()
... )
>>>
>>> # Multiple files with row tracking
>>> result = (
...     jetliner.scan_avro(
...         ["file1.avro", "file2.avro"],
...         row_index_name="idx",
...         include_file_paths="source"
...     )
...     .collect()
... )
>>>
>>> # S3 with custom endpoint (MinIO, LocalStack, R2)
>>> lf = jetliner.scan_avro(
...     "s3://bucket/data.avro",
...     storage_options={
...         "endpoint": "http://localhost:9000",
...         "aws_access_key_id": "minioadmin",
...         "aws_secret_access_key": "minioadmin",
...     }
... )
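>>>
>>> # A hedged sketch: tolerant scan of possibly-corrupted files, capped at 1,000,000 rows
>>> # (paths are placeholders; ignore_errors skips bad blocks instead of failing)
>>> df = (
...     jetliner.scan_avro(
...         "landing/*.avro",
...         ignore_errors=True,
...         n_rows=1_000_000,
...     )
...     .collect()
... )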