AvroReader

jetliner.AvroReader(path, *, batch_size=100000, buffer_blocks=4, buffer_bytes=67108864, ignore_errors=False, projected_columns=None, storage_options=None, read_chunk_size=None, max_block_size=536870912)

Bases: Iterator[DataFrame]

Single-file Avro reader for batch iteration over DataFrames.

Use this class when you need control over batch processing:

  • Progress tracking between batches
  • Per-batch error inspection (with ignore_errors=True)
  • Writing batches to external sinks (database, API, etc.)
  • Early termination based on content

For most use cases, prefer scan_avro() (lazy, with query optimization) or read_avro() (eager loading). Use AvroReader when you need explicit iteration control.

For reading multiple files with batch control, use MultiAvroReader.

Examples:

Basic iteration:

>>> for df in jetliner.AvroReader("data.avro"):
...     print(df.shape)

With a context manager (recommended):

>>> with jetliner.AvroReader("data.avro") as reader:
...     for df in reader:
...         process(df)

Error inspection in skip mode:

>>> with jetliner.AvroReader("data.avro", ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
...     if reader.error_count > 0:
...         print(f"Skipped {reader.error_count} errors")
...         for err in reader.errors:
...             print(f"  [{err.kind}] Block {err.block_index}: {err.message}")

Schema inspection:

>>> with jetliner.AvroReader("data.avro") as reader:
...     print(reader.schema_dict["fields"])
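
Early termination with progress tracking (a sketch; the 1_000_000-row budget is illustrative, and rows_read is the documented running total):

>>> with jetliner.AvroReader("data.avro") as reader:
...     for df in reader:
...         process(df)
...         if reader.rows_read >= 1_000_000:  # stop once the row budget is met
...             break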

Create a new AvroReader.

Parameters:

path : str | PathLike[str], required
    Path to the Avro file. Supports local paths and s3:// URIs.

batch_size : int, default 100_000
    Target number of rows per DataFrame.

buffer_blocks : int, default 4
    Number of blocks to prefetch.

buffer_bytes : int, default 67_108_864 (64 MB)
    Maximum bytes to buffer.

ignore_errors : bool, default False
    If True, skip bad records and continue; if False, fail on the first error.

projected_columns : list[str] | None, default None
    List of column names to read. None reads all columns.

storage_options : dict[str, str] | None, default None
    Dict for S3 configuration with keys like endpoint, aws_access_key_id, aws_secret_access_key, and region.

read_chunk_size : int | None, default None
    Read buffer chunk size in bytes. When None, auto-detects: 64 KB for local files, 4 MB for S3. Larger values reduce I/O operations but use more memory.

max_block_size : int | None, default 536_870_912 (512 MB)
    Maximum decompressed block size in bytes. Blocks exceeding this limit are rejected. Set to None to disable. Protects against decompression bombs.

Raises:

FileNotFoundError
    If the file does not exist.

PermissionError
    If access is denied.

RuntimeError
    For other errors (S3, parsing, etc.).

Examples:

Local file:

>>> reader = jetliner.AvroReader("/path/to/file.avro")

S3 file:

>>> reader = jetliner.AvroReader("s3://bucket/key.avro")

With options:

>>> reader = jetliner.AvroReader(
...     "file.avro",
...     batch_size=50000,
...     ignore_errors=True
... )

S3-compatible services (MinIO, LocalStack, R2):

>>> reader = jetliner.AvroReader(
...     "s3://bucket/key.avro",
...     storage_options={
...         "endpoint": "http://localhost:9000",
...         "aws_access_key_id": "minioadmin",
...         "aws_secret_access_key": "minioadmin",
...     }
... )
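
Tuning buffer sizes (a sketch; the values are illustrative, not recommendations):

>>> reader = jetliner.AvroReader(
...     "s3://bucket/large.avro",
...     read_chunk_size=8 * 1024 * 1024,    # 8 MB chunks reduce S3 round trips
...     max_block_size=1024 * 1024 * 1024,  # allow decompressed blocks up to 1 GB
... )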

Attributes

batch_size property

Get the target number of rows per DataFrame batch.

Returns:

int
    The batch size.

error_count property

Get the number of errors encountered during reading.

Quick check for whether any errors occurred, without iterating through the errors list.

Returns:

int
    Number of errors.

Examples:

>>> with jetliner.AvroReader("file.avro", ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
...
...     if reader.error_count > 0:
...         print(f"Warning: {reader.error_count} errors during read")

errors property

Get accumulated errors from skip mode reading.

In skip mode (ignore_errors=True), errors are accumulated rather than causing immediate failure. This property returns all errors encountered.

Returns:

list[BadBlockError]
    List of errors encountered during reading.

Examples:

>>> with jetliner.AvroReader("file.avro", ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
...
...     for err in reader.errors:
...         print(f"[{err.kind}] Block {err.block_index}: {err.message}")

is_finished property

Check if all records have been read or the reader is closed.

Returns:

bool
    True if finished, False otherwise.

pending_records property

Get the number of records waiting to be returned in the next batch.

Returns:

int
    Number of pending records.
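
Examples:

A sketch of progress reporting between batches, using only the documented counters (process is a placeholder):

>>> with jetliner.AvroReader("file.avro") as reader:
...     for df in reader:
...         process(df)
...         print(f"{reader.pending_records} records pending")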

polars_schema property

Get the Polars schema for the DataFrames being produced.

When projection is used, only the projected columns are included.

Returns:

Schema
    A Polars Schema with column names and their data types.

Examples:

>>> reader = jetliner.AvroReader("file.avro")
>>> print(reader.polars_schema)  # Schema({'id': Int64, 'name': String})
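
With projection, only the projected columns appear (continuing the illustrative schema above):

>>> reader = jetliner.AvroReader("file.avro", projected_columns=["id"])
>>> print(reader.polars_schema)  # Schema({'id': Int64})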

rows_read property

Get the total number of rows returned across all batches.

Returns:

int
    Total rows read so far.

schema property

Get the Avro schema as a JSON string.

Returns:

str
    The Avro schema as a JSON string.

Examples:

>>> reader = jetliner.AvroReader("file.avro")
>>> print(reader.schema)  # JSON string

schema_dict property

Get the Avro schema as a Python dictionary.

Returns:

dict[str, Any]
    The Avro schema as a Python dict.

Raises:

ValueError
    If the schema JSON cannot be parsed.

Examples:

>>> reader = jetliner.AvroReader("file.avro")
>>> schema = reader.schema_dict
>>> print(schema["name"])  # Record name
>>> print(schema["fields"])  # List of fields

Functions

__enter__()

Enter the context manager.

Returns:

AvroReader
    Self, for use in with statements.

__exit__(exc_type, exc_val, exc_tb)

Exit the context manager.

Releases resources held by the reader. After this call, the reader cannot be used for iteration.

Returns:

bool
    False, to indicate exceptions should not be suppressed.

__iter__()

Return self as the iterator.

__next__()

Get the next DataFrame batch.

Returns:

DataFrame
    A Polars DataFrame containing the next batch of records.

Raises:

StopIteration
    When all records have been read or the reader is closed.

DecodeError
    If an error occurs during reading (in strict mode).

ParseError
    If the file structure is invalid.
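
Examples:

A sketch of strict-mode error handling, assuming DecodeError and ParseError are exported at the package top level:

>>> reader = jetliner.AvroReader("file.avro")  # strict mode (ignore_errors=False)
>>> try:
...     for df in reader:
...         process(df)
... except jetliner.DecodeError as err:
...     print(f"decode failed: {err}")
... except jetliner.ParseError as err:
...     print(f"invalid file structure: {err}")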