# AvroReader

`jetliner.AvroReader(path, *, batch_size=100000, buffer_blocks=4, buffer_bytes=67108864, ignore_errors=False, projected_columns=None, storage_options=None, read_chunk_size=None, max_block_size=536870912)`

Bases: `Iterator[DataFrame]`

Single-file Avro reader for batch iteration over DataFrames.
Use this class when you need control over batch processing:

- Progress tracking between batches
- Per-batch error inspection (with `ignore_errors=True`)
- Writing batches to external sinks (database, API, etc.)
- Early termination based on content

For most use cases, prefer `scan_avro()` (lazy with query optimization) or `read_avro()` (eager loading). Use `AvroReader` when you need explicit iteration control.

For reading multiple files with batch control, use `MultiAvroReader`.
Examples:
Basic iteration:
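A minimal sketch (`data.avro` and the `process` callback are placeholders):

>>> reader = jetliner.AvroReader("data.avro")
>>> for df in reader:
...     process(df)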
With context manager (recommended):
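A sketch of the same loop with automatic cleanup (placeholder path):

>>> with jetliner.AvroReader("data.avro") as reader:
...     for df in reader:
...         process(df)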
Error inspection in skip mode:
>>> with jetliner.AvroReader("data.avro", ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
...     if reader.error_count > 0:
...         print(f"Skipped {reader.error_count} errors")
...         for err in reader.errors:
...             print(f" [{err.kind}] Block {err.block_index}: {err.message}")
Schema inspection:
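A minimal sketch using the `polars_schema` property documented below (placeholder path; column names and types are hypothetical):

>>> reader = jetliner.AvroReader("data.avro")
>>> reader.polars_schema
Schema({'id': Int64, 'name': String})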
Create a new AvroReader.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| PathLike[str]` | Path to the Avro file. Supports local paths and s3:// URIs. | *required* |
| `batch_size` | `int` | Target number of rows per DataFrame. | `100_000` |
| `buffer_blocks` | `int` | Number of blocks to prefetch. | `4` |
| `buffer_bytes` | `int` | Maximum bytes to buffer. | `67108864` (64 MB) |
| `ignore_errors` | `bool` | If True, skip bad records and continue; if False, fail on the first error. | `False` |
| `projected_columns` | `list[str] \| None` | List of column names to read. None reads all columns. | `None` |
| `storage_options` | `dict[str, str] \| None` | Dict of S3 configuration options, with keys like `endpoint`, `aws_access_key_id`, and `aws_secret_access_key`. | `None` |
| `read_chunk_size` | `int \| None` | Read buffer chunk size in bytes. When None, auto-detects: 64 KB for local files, 4 MB for S3. Larger values reduce I/O operations but use more memory. | `None` |
| `max_block_size` | `int \| None` | Maximum decompressed block size in bytes. Blocks exceeding this limit are rejected; protects against decompression bombs. Set to None to disable. | `536870912` (512 MB) |
Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the file does not exist. |
| `PermissionError` | If access is denied. |
| `RuntimeError` | For other errors (S3, parsing, etc.). |
Examples:
Local file:
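A minimal sketch with a placeholder path:

>>> reader = jetliner.AvroReader("data.avro")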
S3 file:
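A sketch with a placeholder bucket and key:

>>> reader = jetliner.AvroReader("s3://bucket/key.avro")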
With options:
>>> reader = jetliner.AvroReader(
...     "file.avro",
...     batch_size=50000,
...     ignore_errors=True
... )
S3-compatible services (MinIO, LocalStack, R2):
>>> reader = jetliner.AvroReader(
...     "s3://bucket/key.avro",
...     storage_options={
...         "endpoint": "http://localhost:9000",
...         "aws_access_key_id": "minioadmin",
...         "aws_secret_access_key": "minioadmin",
...     }
... )
## Attributes

### `batch_size` *(property)*

Get the target number of rows per DataFrame batch.

Returns:

| Type | Description |
|---|---|
| `int` | The batch size. |
### `error_count` *(property)*

Get the number of errors encountered during reading.

Quick check for whether any errors occurred, without iterating through the `errors` list.

Returns:

| Type | Description |
|---|---|
| `int` | Number of errors. |
Examples:
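A sketch of checking the counter after a skip-mode read (placeholder path; the output assumes a clean file):

>>> with jetliner.AvroReader("data.avro", ignore_errors=True) as reader:
...     for df in reader:
...         pass
...     print(reader.error_count)
0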
### `errors` *(property)*

Get accumulated errors from skip-mode reading.

In skip mode (`ignore_errors=True`), errors are accumulated rather than causing immediate failure. This property returns all errors encountered.

Returns:

| Type | Description |
|---|---|
| `list[BadBlockError]` | List of errors encountered during reading. |
Examples:
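A sketch of walking the accumulated `BadBlockError` entries after a skip-mode read (placeholder path; the `kind`, `block_index`, and `message` attributes appear in the skip-mode example above):

>>> with jetliner.AvroReader("data.avro", ignore_errors=True) as reader:
...     for df in reader:
...         pass
...     for err in reader.errors:
...         print(err.kind, err.block_index, err.message)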
### `is_finished` *(property)*

Check if all records have been read or the reader is closed.

Returns:

| Type | Description |
|---|---|
| `bool` | True if finished, False otherwise. |
### `pending_records` *(property)*

Get the number of records waiting to be returned in the next batch.

Returns:

| Type | Description |
|---|---|
| `int` | Number of pending records. |
### `polars_schema` *(property)*

Get the Polars schema for the DataFrames being produced.

When projection is used, only the projected columns are included.

Returns:

| Type | Description |
|---|---|
| `Schema` | A Polars Schema with column names and their data types. |
Examples:
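A sketch showing projection narrowing the schema (placeholder path; column names and types are hypothetical):

>>> reader = jetliner.AvroReader("data.avro", projected_columns=["id"])
>>> reader.polars_schema
Schema({'id': Int64})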
### `rows_read` *(property)*

Get the total number of rows returned across all batches.

Returns:

| Type | Description |
|---|---|
| `int` | Total rows read so far. |
### `schema` *(property)*
### `schema_dict` *(property)*

Get the Avro schema as a Python dictionary.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | The Avro schema as a Python dict. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the schema JSON cannot be parsed. |
Examples:
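A sketch of reading top-level schema fields (placeholder path; field names are hypothetical, and the top-level type of an Avro file schema is typically a record):

>>> schema = jetliner.AvroReader("data.avro").schema_dict
>>> schema["type"]
'record'
>>> [f["name"] for f in schema["fields"]]
['id', 'name']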
## Functions

### `__enter__()`

### `__exit__(exc_type, exc_val, exc_tb)`

Exit the context manager.

Releases resources held by the reader. After this call, the reader cannot be used for iteration.

Returns:

| Type | Description |
|---|---|
| `bool` | False to indicate exceptions should not be suppressed. |
### `__iter__()`

Return self as the iterator.

### `__next__()`

Get the next DataFrame batch.

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A Polars DataFrame containing the next batch of records. |

Raises:

| Type | Description |
|---|---|
| `StopIteration` | When all records have been read or the reader is closed. |
| `DecodeError` | If an error occurs during reading (in strict mode). |
| `ParseError` | If the file structure is invalid. |
If the file structure is invalid. |