MultiAvroReader¶
jetliner.MultiAvroReader(paths, *, batch_size=100000, buffer_blocks=4, buffer_bytes=67108864, ignore_errors=False, projected_columns=None, n_rows=None, row_index_name=None, row_index_offset=0, include_file_paths=None, storage_options=None, read_chunk_size=None, max_block_size=536870912)¶
Bases: Iterator[DataFrame]
Multi-file Avro reader for batch iteration over DataFrames.
Use this class when you need to read multiple Avro files with batch-level control:
- Progress tracking across files and batches
- Row index continuity across files
- File path tracking per row
- Per-batch error inspection (with ignore_errors=True)
For most use cases, prefer scan_avro() (lazy with query optimization) or
read_avro() (eager loading) — both support multiple files. Use MultiAvroReader
when you need explicit iteration control over multi-file reads.
For single-file batch iteration, use AvroReader.
Examples:
Basic multi-file iteration:
>>> reader = MultiAvroReader(["file1.avro", "file2.avro"])
>>> for df in reader:
... print(f"Batch: {df.shape}")
With row tracking and file path injection:
>>> with MultiAvroReader(
... ["file1.avro", "file2.avro"],
... row_index_name="idx",
... include_file_paths="source_file"
... ) as reader:
...     for df in reader:
...         print(f"File {reader.current_source_index + 1}/{reader.total_sources}")
...         process(df)
With row limit across all files:
>>> with MultiAvroReader(
... ["file1.avro", "file2.avro"],
... n_rows=100_000
... ) as reader:
...     for df in reader:
...         process(df)
...     print(f"Read {reader.rows_read} rows total")
Create a new MultiAvroReader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| paths | list[str \| PathLike[str]] | List of paths to read. Glob patterns should be expanded before passing. Supports local paths and s3:// URIs (all paths must use the same source type). | required |
| batch_size | int | Target number of rows per DataFrame. | 100_000 |
| buffer_blocks | int | Number of blocks to prefetch. | 4 |
| buffer_bytes | int | Maximum bytes to buffer. | 64MB |
| ignore_errors | bool | If True, skip bad records and continue; if False, fail on the first error. | False |
| projected_columns | list[str] \| None | List of column names to read. None reads all columns. | None |
| n_rows | int \| None | Maximum number of rows to read across all files. None reads all. | None |
| row_index_name | str \| None | If provided, adds a row index column with this name, continuous across files. | None |
| row_index_offset | int | Starting value for the row index. | 0 |
| include_file_paths | str \| None | If provided, adds a column with this name containing the source file path. | None |
| storage_options | dict[str, str] \| None | Dict of S3 configuration options. | None |
| read_chunk_size | int \| None | Read buffer chunk size in bytes. When None, auto-detects: 64KB for local files, 4MB for S3. | None |
| max_block_size | int \| None | Maximum decompressed block size in bytes. Blocks exceeding this limit are rejected. Set to None to disable. | 512MB |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If any file does not exist. |
| PermissionError | If access is denied. |
| SchemaError | If file schemas are incompatible. |
| RuntimeError | For other errors (S3, parsing, etc.). |
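As a further illustration of the constructor options above, a minimal sketch combining projection, batch sizing, and file path tracking (file and column names are hypothetical):
>>> reader = MultiAvroReader(
...     ["part-0.avro", "part-1.avro"],
...     batch_size=50_000,
...     projected_columns=["user_id", "event_ts"],
...     include_file_paths="source_file",
... )
>>> for df in reader:
...     # Each batch holds the projected columns plus the injected "source_file" column.
...     print(df.columns, df.height)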
Attributes¶
batch_size property¶
Get the target number of rows per DataFrame batch.
Returns:
| Type | Description |
|---|---|
| int | The batch size. |
current_source_index property¶
Get the index of the file currently being read (0-based).
Returns:
| Type | Description |
|---|---|
| int | Current source index. |
error_count property¶
Get the number of errors encountered.
Returns:
| Type | Description |
|---|---|
| int | Number of errors. |
errors property¶
Get accumulated errors from skip mode reading.
Available during and after iteration when using ignore_errors=True.
Returns:
| Type | Description |
|---|---|
| list[BadBlockError] | List of errors encountered. |
is_finished property¶
Check if iteration is complete.
Returns:
| Type | Description |
|---|---|
| bool | True if finished, False otherwise. |
polars_schema property¶
Get the Polars schema for the DataFrames being produced.
When projection is used, only the projected columns are included.
Returns:
| Type | Description |
|---|---|
| Schema | A Polars Schema with column names and their data types. |
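For example, the schema can be inspected before consuming any batches (a sketch; file and column names are placeholders):
>>> reader = MultiAvroReader(
...     ["file1.avro", "file2.avro"],
...     projected_columns=["id", "name"],
... )
>>> for column, dtype in reader.polars_schema.items():
...     print(column, dtype)  # only the projected columns appear here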
rows_read property¶
Get the total number of rows read so far across all files.
Returns:
| Type | Description |
|---|---|
| int | Total rows read. |
schema property¶
Get the Avro schema as a JSON string (unified across files).
Returns:
| Type | Description |
|---|---|
| str | The Avro schema as a JSON string. |
schema_dict property¶
Get the Avro schema as a Python dictionary.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The Avro schema as a Python dict. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the schema JSON cannot be parsed. |
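For example, listing field names from the unified schema (a sketch; assumes the top-level Avro schema is a record, whose dict form has "name" and "fields" entries):
>>> reader = MultiAvroReader(["file1.avro", "file2.avro"])
>>> schema = reader.schema_dict
>>> print(schema["name"])
>>> print([field["name"] for field in schema["fields"]])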
total_sources property¶
Get the total number of source files.
Returns:
| Type | Description |
|---|---|
| int | Number of source files. |
Functions¶
__enter__()¶
Enter the context manager.
Returns:
| Type | Description |
|---|---|
| MultiAvroReader | Self, for use in with statements. |
__exit__(exc_type, exc_val, exc_tb)¶
Exit the context manager.
Releases the current file handle and marks the reader as finished, preventing further iteration. Accumulated errors and metadata (schema, rows_read, etc.) remain accessible after exit.
Returns:
| Type | Description |
|---|---|
| bool | False to indicate exceptions should not be suppressed. |
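For instance, metadata can still be read once the with block has exited (a sketch; process is a placeholder):
>>> with MultiAvroReader(["file1.avro", "file2.avro"], ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
>>> # The reader is now closed, but its metadata remains accessible.
>>> print(reader.rows_read, reader.error_count, reader.is_finished)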
__iter__()¶
Return self as the iterator.
__next__()¶
Get the next DataFrame batch.
Returns:
| Type | Description |
|---|---|
| DataFrame | A Polars DataFrame containing the next batch of records. |
Raises:
| Type | Description |
|---|---|
| StopIteration | When all records have been read or the reader is closed. |
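Because the reader is a standard iterator, batches can also be pulled manually, for example to peek at the first batch (a sketch; file names are placeholders):
>>> reader = MultiAvroReader(["file1.avro", "file2.avro"])
>>> try:
...     first = next(reader)
...     print(first.shape)
... except StopIteration:
...     print("no rows in the given files")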