scan_avro

jetliner.scan_avro(source, *, n_rows=None, row_index_name=None, row_index_offset=0, glob=True, include_file_paths=None, ignore_errors=False, storage_options=None, buffer_blocks=4, buffer_bytes=64 * 1024 * 1024, read_chunk_size=None, batch_size=100000, max_block_size=512 * 1024 * 1024)

Scan Avro file(s), returning a LazyFrame with query optimization support.

This function uses Polars' IO plugin system to enable query optimizations:

  • Projection pushdown: Only read columns that are actually used in the query
  • Predicate pushdown: Apply filters during reading, not after
  • Early stopping: Stop reading after the requested number of rows
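
As an illustrative sketch (the file and column names below are placeholders), the optimized plan can be inspected with Polars' LazyFrame.explain() to confirm these optimizations reach the scan:

>>> import jetliner
>>> import polars as pl
>>> lf = (
...     jetliner.scan_avro("events.avro")   # hypothetical file
...     .select(["user_id", "amount"])      # projection pushdown
...     .filter(pl.col("amount") > 100)     # predicate pushdown
...     .head(1000)                         # early stopping
... )
>>> print(lf.explain())  # prints the optimized plan with the column selection, filter, and row limit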

Parameters:

Name Type Description Default
source str | Path | Sequence[str] | Sequence[Path]

Path to Avro file(s). Supports:

  • Local filesystem paths: /path/to/file.avro, ./relative/path.avro
  • S3 URIs: s3://bucket/key.avro
  • Glob patterns with standard wildcards:
      • * matches any characters except / (e.g., data/*.avro)
      • ** matches zero or more directories (e.g., data/**/*.avro)
      • ? matches a single character (e.g., file?.avro)
      • [...] matches character ranges (e.g., data/[0-9]*.avro)
      • {a,b,c} matches alternatives (e.g., data/{2023,2024}/*.avro)
  • Multiple files: ["file1.avro", "file2.avro"]

When multiple files are provided, schemas must be compatible.

required
n_rows int | None

Maximum number of rows to read across all files. None means read all rows.

None
row_index_name str | None

If provided, adds a row index column with this name as the first column. With multiple files, the index continues across files.

None
row_index_offset int

Starting value for the row index (only used if row_index_name is set).

0
glob bool

If True, expand glob patterns in the source path. Set to False to treat patterns as literal filenames.

True
include_file_paths str | None

If provided, adds a column with this name containing the source file path for each row. Useful for tracking data provenance with multiple files.

None
ignore_errors bool

If True, skip corrupted blocks/records and continue reading. If False, fail immediately on the first error. Skipped errors are not accessible in scan mode.

False
storage_options dict[str, str] | None

Configuration for S3 connections. Supported keys:

  • endpoint: Custom S3 endpoint (for MinIO, LocalStack, R2, etc.)
  • aws_access_key_id: AWS access key (overrides environment)
  • aws_secret_access_key: AWS secret key (overrides environment)
  • region: AWS region (overrides environment)

Values here take precedence over environment variables.

None
buffer_blocks int

Number of blocks to prefetch for better I/O performance.

4
buffer_bytes int

Maximum bytes to buffer during prefetching.

64MB
read_chunk_size int | None

Read buffer chunk size in bytes. If None, auto-detects based on source: 64KB for local files, 4MB for S3. Larger values reduce I/O operations but use more memory.

None
batch_size int

Minimum number of rows per DataFrame batch. Rows accumulate until this threshold is reached, then a batch is yielded. Batches may exceed this size because blocks are processed whole, and the final batch may be smaller if fewer rows remain.

100,000
max_block_size int | None

Maximum decompressed block size in bytes. Blocks that would decompress to more than this limit are rejected. Set to None to disable the limit. Default: 512MB (536870912 bytes).

This protects against decompression bombs: maliciously crafted files where a small compressed block expands to consume excessive memory.

512MB
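
As a sketch of how the buffering and safety parameters above can be combined (the values are illustrative, not tuned recommendations):

>>> lf = jetliner.scan_avro(
...     "s3://bucket/large-dataset/*.avro",
...     read_chunk_size=8 * 1024 * 1024,    # larger reads for high-latency object storage
...     buffer_blocks=8,                    # prefetch more blocks ahead of decoding
...     buffer_bytes=128 * 1024 * 1024,     # cap total prefetch memory
...     max_block_size=None,                # disable the limit only for trusted files
... )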

Returns:

Type Description
LazyFrame

A LazyFrame that can be used with Polars query operations.

Examples:

>>> import jetliner
>>> import polars as pl
>>>
>>> # Basic scan
>>> lf = jetliner.scan_avro("data.avro")
>>>
>>> # With query optimization
>>> result = (
...     jetliner.scan_avro("data/*.avro")
...     .select(["col1", "col2"])
...     .filter(pl.col("amount") > 100)
...     .head(1000)
...     .collect()
... )
>>>
>>> # Multiple files with row tracking
>>> result = (
...     jetliner.scan_avro(
...         ["file1.avro", "file2.avro"],
...         row_index_name="idx",
...         include_file_paths="source"
...     )
...     .collect()
... )
>>>
>>> # S3 with custom endpoint (MinIO, LocalStack, R2)
>>> lf = jetliner.scan_avro(
...     "s3://bucket/data.avro",
...     storage_options={
...         "endpoint": "http://localhost:9000",
...         "aws_access_key_id": "minioadmin",
...         "aws_secret_access_key": "minioadmin",
...     }
... )
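>>>
>>> # A hedged sketch: tolerant scan of possibly-corrupted files, capped at 1,000,000 rows
>>> # (paths are placeholders; ignore_errors skips bad blocks instead of failing)
>>> df = (
...     jetliner.scan_avro(
...         "landing/*.avro",
...         ignore_errors=True,
...         n_rows=1_000_000,
...     )
...     .collect()
... )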