MultiAvroReader

jetliner.MultiAvroReader(paths, *, batch_size=100000, buffer_blocks=4, buffer_bytes=67108864, ignore_errors=False, projected_columns=None, n_rows=None, row_index_name=None, row_index_offset=0, include_file_paths=None, storage_options=None, read_chunk_size=None, max_block_size=536870912)

Bases: Iterator[DataFrame]

Multi-file Avro reader for batch iteration over DataFrames.

Use this class when you need to read multiple Avro files with batch-level control:

  • Progress tracking across files and batches
  • Row index continuity across files
  • File path tracking per row
  • Per-batch error inspection (with ignore_errors=True)

For most use cases, prefer scan_avro() (lazy with query optimization) or read_avro() (eager loading) — both support multiple files. Use MultiAvroReader when you need explicit iteration control over multi-file reads.
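
As a rough comparison, the one-shot alternatives might look like the sketch below (assuming scan_avro() returns a Polars LazyFrame and read_avro() a DataFrame, and that both accept a list of paths as shown; the column name "id" is purely illustrative):

>>> lf = scan_avro(["file1.avro", "file2.avro"])   # lazy; the optimizer can prune columns
>>> df = lf.select("id").collect()
>>> df = read_avro(["file1.avro", "file2.avro"])   # eager; loads everything at once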

For single-file batch iteration, use AvroReader.

Examples:

Basic multi-file iteration:

>>> reader = MultiAvroReader(["file1.avro", "file2.avro"])
>>> for df in reader:
...     print(f"Batch: {df.shape}")

With row tracking and file path injection:

>>> with MultiAvroReader(
...     ["file1.avro", "file2.avro"],
...     row_index_name="idx",
...     include_file_paths="source_file"
... ) as reader:
...     for df in reader:
...         print(f"File {reader.current_source_index + 1}/{reader.total_sources}")
...         process(df)

With row limit across all files:

>>> with MultiAvroReader(
...     ["file1.avro", "file2.avro"],
...     n_rows=100_000
... ) as reader:
...     for df in reader:
...         process(df)
...     print(f"Read {reader.rows_read} rows total")

Create a new MultiAvroReader.

Parameters:

    paths (list[str | PathLike[str]], required):
        List of paths to read. Glob patterns should be expanded before passing. Supports local paths and s3:// URIs (all paths must use the same source type).
    batch_size (int, default 100_000):
        Target number of rows per DataFrame.
    buffer_blocks (int, default 4):
        Number of blocks to prefetch.
    buffer_bytes (int, default 64 MB):
        Maximum bytes to buffer.
    ignore_errors (bool, default False):
        If True, skip bad records and continue; if False, fail on first error.
    projected_columns (list[str] | None, default None):
        List of column names to read. None reads all columns.
    n_rows (int | None, default None):
        Maximum number of rows to read across all files. None reads all.
    row_index_name (str | None, default None):
        If provided, adds a row index column with this name, continuous across files.
    row_index_offset (int, default 0):
        Starting value for the row index.
    include_file_paths (str | None, default None):
        If provided, adds a column with this name containing the source file path.
    storage_options (dict[str, str] | None, default None):
        Dict for S3 configuration with keys like endpoint, aws_access_key_id, aws_secret_access_key, and region (see the sketch after this list).
    read_chunk_size (int | None, default None):
        Read buffer chunk size in bytes. When None, auto-detects: 64 KB for local files, 4 MB for S3.
    max_block_size (int | None, default 512 MB):
        Maximum decompressed block size in bytes. Blocks exceeding this limit are rejected. Set to None to disable.
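
A hypothetical S3 configuration, sketched from the storage_options keys listed above (bucket, endpoint, credentials, and region are placeholders):

>>> reader = MultiAvroReader(
...     ["s3://my-bucket/part-0.avro", "s3://my-bucket/part-1.avro"],
...     storage_options={
...         "endpoint": "https://s3.us-east-1.amazonaws.com",
...         "aws_access_key_id": "...",
...         "aws_secret_access_key": "...",
...         "region": "us-east-1",
...     },
... )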

Raises:

    FileNotFoundError: If any file does not exist.
    PermissionError: If access is denied.
    SchemaError: If file schemas are incompatible.
    RuntimeError: For other errors (S3, parsing, etc.).

Attributes

batch_size property

Get the target number of rows per DataFrame batch.

Returns:

    int: The batch size.

current_source_index property

Get the index of the file currently being read (0-based).

Returns:

    int: Current source index.

error_count property

Get the number of errors encountered.

Returns:

    int: Number of errors.

errors property

Get accumulated errors from skip mode reading.

Available during and after iteration when using ignore_errors=True.

Returns:

    list[BadBlockError]: List of errors encountered.

is_finished property

Check if iteration is complete.

Returns:

    bool: True if finished, False otherwise.

polars_schema property

Get the Polars schema for the DataFrames being produced.

When projection is used, only the projected columns are included.

Returns:

    Schema: A Polars Schema with column names and their data types.
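
For example, with column projection the reported schema only covers the projected columns (a sketch; the file and column names are hypothetical):

>>> with MultiAvroReader(["file1.avro"], projected_columns=["id", "name"]) as reader:
...     print(reader.polars_schema)  # only "id" and "name" appear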

rows_read property

Get the total number of rows read so far across all files.

Returns:

    int: Total rows read.

schema property

Get the Avro schema as a JSON string (unified across files).

Returns:

    str: The Avro schema as a JSON string.

schema_dict property

Get the Avro schema as a Python dictionary.

Returns:

    dict[str, Any]: The Avro schema as a Python dict.

Raises:

    ValueError: If the schema JSON cannot be parsed.
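
For a record-type Avro schema the dict carries a "fields" list, so field names can be pulled out directly (a sketch; assumes the files use a record schema at the top level):

>>> with MultiAvroReader(["file1.avro"]) as reader:
...     print([f["name"] for f in reader.schema_dict.get("fields", [])])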

total_sources property

Get the total number of source files.

Returns:

    int: Number of source files.

Functions

__enter__()

Enter the context manager.

Returns:

    MultiAvroReader: Self for use in with statements.

__exit__(exc_type, exc_val, exc_tb)

Exit the context manager.

Releases the current file handle and marks the reader as finished, preventing further iteration. Accumulated errors and metadata (schema, rows_read, etc.) remain accessible after exit.

Returns:

    bool: False to indicate exceptions should not be suppressed.
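
For example, counters and schema information can still be read once the with block has closed (a sketch based on the behaviour described above):

>>> with MultiAvroReader(["file1.avro", "file2.avro"]) as reader:
...     for df in reader:
...         process(df)
>>> reader.rows_read      # still accessible after exit
>>> reader.is_finished    # True once the context manager has closed the reader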

__iter__()

Return self as the iterator.

__next__()

Get the next DataFrame batch.

Returns:

    DataFrame: A Polars DataFrame containing the next batch of records.

Raises:

    StopIteration: When all records have been read or the reader is closed.
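
Manual batch retrieval with next() follows the standard iterator protocol (a sketch):

>>> reader = MultiAvroReader(["file1.avro", "file2.avro"])
>>> try:
...     while True:
...         process(next(reader))
... except StopIteration:
...     pass  # all records have been read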