MultiAvroReader¶
jetliner.MultiAvroReader(paths, *, batch_size=100000, buffer_blocks=4, buffer_bytes=67108864, ignore_errors=False, projected_columns=None, n_rows=None, row_index_name=None, row_index_offset=0, include_file_paths=None, storage_options=None, read_chunk_size=None, max_block_size=536870912)¶
Bases: Iterator[DataFrame]
Multi-file Avro reader for batch iteration over DataFrames.
Use this class when you need to read multiple Avro files with batch-level control:
- Progress tracking across files and batches
- Row index continuity across files
- File path tracking per row
- Per-batch error inspection (with ignore_errors=True)
For most use cases, prefer scan_avro() (lazy with query optimization) or
read_avro() (eager loading) — both support multiple files. Use MultiAvroReader
when you need explicit iteration control over multi-file reads.
For single-file batch iteration, use AvroReader.
Examples:
Basic multi-file iteration:
>>> reader = MultiAvroReader(["file1.avro", "file2.avro"])
>>> for df in reader:
... print(f"Batch: {df.shape}")
With row tracking and file path injection:
>>> with MultiAvroReader(
... ["file1.avro", "file2.avro"],
... row_index_name="idx",
... include_file_paths="source_file"
... ) as reader:
...     for df in reader:
...         print(f"File {reader.current_source_index + 1}/{reader.total_sources}")
...         process(df)
With row limit across all files:
>>> with MultiAvroReader(
... ["file1.avro", "file2.avro"],
... n_rows=100_000
... ) as reader:
...     for df in reader:
...         process(df)
...     print(f"Read {reader.rows_read} rows total")
Create a new MultiAvroReader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| paths | list[str \| PathLike[str]] | List of paths to read. Glob patterns should be expanded before passing. Supports local paths and s3:// URIs (all paths must use the same source type). | required |
| batch_size | int | Target number of rows per DataFrame. | 100_000 |
| buffer_blocks | int | Number of blocks to prefetch. | 4 |
| buffer_bytes | int | Maximum bytes to buffer. | 64MB |
| ignore_errors | bool | If True, skip bad records and continue; if False, fail on the first error. | False |
| projected_columns | list[str] \| None | List of column names to read. None reads all columns. | None |
| n_rows | int \| None | Maximum number of rows to read across all files. None reads all. | None |
| row_index_name | str \| None | If provided, adds a row index column with this name, continuous across files. | None |
| row_index_offset | int | Starting value for the row index. | 0 |
| include_file_paths | str \| None | If provided, adds a column with this name containing the source file path. | None |
| storage_options | dict[str, str] \| None | Dict of S3 configuration options. | None |
| read_chunk_size | int \| None | Read buffer chunk size in bytes. When None, auto-detects: 64KB for local files, 4MB for S3. | None |
| max_block_size | int \| None | Maximum decompressed block size in bytes. Blocks exceeding this limit are rejected. Set to None to disable. | 512MB |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If any file does not exist. |
| PermissionError | If access is denied. |
| SchemaError | If file schemas are incompatible. |
| RuntimeError | For other errors (S3, parsing, etc.). |
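As a further illustration of the constructor options above, a minimal sketch combining projection, batch sizing, and file path tracking (file and column names are hypothetical):
>>> reader = MultiAvroReader(
...     ["part-0.avro", "part-1.avro"],
...     batch_size=50_000,
...     projected_columns=["user_id", "event_ts"],
...     include_file_paths="source_file",
... )
>>> for df in reader:
...     # Each batch holds the projected columns plus the injected "source_file" column.
...     print(df.columns, df.height)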
Attributes¶
batch_size property¶
Get the target number of rows per DataFrame batch.
Returns:
| Type | Description |
|---|---|
| int | The batch size. |
current_source_index property¶
Get the index of the file currently being read (0-based).
Returns:
| Type | Description |
|---|---|
| int | Current source index. |
error_count property¶
Get the number of errors encountered.
Returns:
| Type | Description |
|---|---|
| int | Number of errors. |
errors property¶
Get accumulated errors from skip mode reading.
Available during and after iteration when using ignore_errors=True.
Returns:
| Type | Description |
|---|---|
| list[BadBlockError] | List of errors encountered. |
is_finished property¶
Check if iteration is complete.
Returns:
| Type | Description |
|---|---|
| bool | True if finished, False otherwise. |
polars_schema property¶
Get the Polars schema for the DataFrames being produced.
When projection is used, only the projected columns are included.
Returns:
| Type | Description |
|---|---|
| Schema | A Polars Schema with column names and their data types. |
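For example, the schema can be inspected before consuming any batches (a sketch; file and column names are placeholders):
>>> reader = MultiAvroReader(
...     ["file1.avro", "file2.avro"],
...     projected_columns=["id", "name"],
... )
>>> for column, dtype in reader.polars_schema.items():
...     print(column, dtype)  # only the projected columns appear here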
rows_read property¶
Get the total number of rows read so far across all files.
Returns:
| Type | Description |
|---|---|
| int | Total rows read. |
schema property¶
Get the Avro schema as a JSON string (unified across files).
Returns:
| Type | Description |
|---|---|
| str | The Avro schema as a JSON string. |
schema_dict property¶
Get the Avro schema as a Python dictionary.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The Avro schema as a Python dict. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the schema JSON cannot be parsed. |
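For example, listing field names from the unified schema (a sketch; assumes the top-level Avro schema is a record, whose dict form has "name" and "fields" entries):
>>> reader = MultiAvroReader(["file1.avro", "file2.avro"])
>>> schema = reader.schema_dict
>>> print(schema["name"])
>>> print([field["name"] for field in schema["fields"]])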
total_sources property¶
Get the total number of source files.
Returns:
| Type | Description |
|---|---|
| int | Number of source files. |
Functions¶
__enter__()¶
Enter the context manager.
Returns:
| Type | Description |
|---|---|
| MultiAvroReader | Self, for use in with statements. |
__exit__(exc_type, exc_val, exc_tb)¶
Exit the context manager.
Releases the current file handle and marks the reader as finished, preventing further iteration. Accumulated errors and metadata (schema, rows_read, etc.) remain accessible after exit.
Returns:
| Type | Description |
|---|---|
| bool | False to indicate exceptions should not be suppressed. |
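For instance, metadata can still be read once the with block has exited (a sketch; process is a placeholder):
>>> with MultiAvroReader(["file1.avro", "file2.avro"], ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
>>> # The reader is now closed, but its metadata remains accessible.
>>> print(reader.rows_read, reader.error_count, reader.is_finished)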
__iter__()¶
Return self as the iterator.
__next__()¶
Get the next DataFrame batch.
Returns:
| Type | Description |
|---|---|
| DataFrame | A Polars DataFrame containing the next batch of records. |
Raises:
| Type | Description |
|---|---|
| StopIteration | When all records have been read or the reader is closed. |
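Because the reader is a standard iterator, batches can also be pulled manually, for example to peek at the first batch (a sketch; file names are placeholders):
>>> reader = MultiAvroReader(["file1.avro", "file2.avro"])
>>> try:
...     first = next(reader)
...     print(first.shape)
... except StopIteration:
...     print("no rows in the given files")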