AvroReader

jetliner.AvroReader(path, *, batch_size=100000, buffer_blocks=4, buffer_bytes=67108864, ignore_errors=False, projected_columns=None, storage_options=None, read_chunk_size=None, max_block_size=536870912)

Bases: Iterator[DataFrame]

Single-file Avro reader for batch iteration over DataFrames.

Use this class when you need control over batch processing:

  • Progress tracking between batches
  • Per-batch error inspection (with ignore_errors=True)
  • Writing batches to external sinks (database, API, etc.)
  • Early termination based on content

For most use cases, prefer scan_avro() (lazy, with query optimization) or read_avro() (eager loading). Use AvroReader when you need explicit iteration control.

For reading multiple files with batch control, use MultiAvroReader.

Examples:

Basic iteration:

>>> for df in jetliner.AvroReader("data.avro"):
...     print(df.shape)

With a context manager (recommended):

>>> with jetliner.AvroReader("data.avro") as reader:
...     for df in reader:
...         process(df)

Error inspection in skip mode:

>>> with jetliner.AvroReader("data.avro", ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
...     if reader.error_count > 0:
...         print(f"Skipped {reader.error_count} errors")
...         for err in reader.errors:
...             print(f"  [{err.kind}] Block {err.block_index}: {err.message}")

Schema inspection:

>>> with jetliner.AvroReader("data.avro") as reader:
...     print(reader.schema_dict["fields"])
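
Early termination with progress tracking (a sketch; the 1_000_000-row budget is illustrative, and rows_read is the documented running total):

>>> with jetliner.AvroReader("data.avro") as reader:
...     for df in reader:
...         process(df)
...         if reader.rows_read >= 1_000_000:  # stop once the row budget is met
...             break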

Create a new AvroReader.

Parameters:

path : str | PathLike[str], required
    Path to the Avro file. Supports local paths and s3:// URIs.

batch_size : int, default 100_000
    Target number of rows per DataFrame.

buffer_blocks : int, default 4
    Number of blocks to prefetch.

buffer_bytes : int, default 67_108_864 (64 MB)
    Maximum bytes to buffer.

ignore_errors : bool, default False
    If True, skip bad records and continue; if False, fail on the first error.

projected_columns : list[str] | None, default None
    List of column names to read. None reads all columns.

storage_options : dict[str, str] | None, default None
    Dict for S3 configuration with keys like endpoint, aws_access_key_id, aws_secret_access_key, and region.

read_chunk_size : int | None, default None
    Read buffer chunk size in bytes. When None, auto-detects: 64 KB for local files, 4 MB for S3. Larger values reduce I/O operations but use more memory.

max_block_size : int | None, default 536_870_912 (512 MB)
    Maximum decompressed block size in bytes. Blocks exceeding this limit are rejected. Set to None to disable. Protects against decompression bombs.

Raises:

FileNotFoundError
    If the file does not exist.

PermissionError
    If access is denied.

RuntimeError
    For other errors (S3, parsing, etc.).

Examples:

Local file:

>>> reader = jetliner.AvroReader("/path/to/file.avro")

S3 file:

>>> reader = jetliner.AvroReader("s3://bucket/key.avro")

With options:

>>> reader = jetliner.AvroReader(
...     "file.avro",
...     batch_size=50000,
...     ignore_errors=True
... )

S3-compatible services (MinIO, LocalStack, R2):

>>> reader = jetliner.AvroReader(
...     "s3://bucket/key.avro",
...     storage_options={
...         "endpoint": "http://localhost:9000",
...         "aws_access_key_id": "minioadmin",
...         "aws_secret_access_key": "minioadmin",
...     }
... )
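
Tuning buffer sizes (a sketch; the values are illustrative, not recommendations):

>>> reader = jetliner.AvroReader(
...     "s3://bucket/large.avro",
...     read_chunk_size=8 * 1024 * 1024,    # 8 MB chunks reduce S3 round trips
...     max_block_size=1024 * 1024 * 1024,  # allow decompressed blocks up to 1 GB
... )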

Attributes

batch_size property

Get the target number of rows per DataFrame batch.

Returns:

int
    The batch size.

error_count property

Get the number of errors encountered during reading.

Quick check for whether any errors occurred, without iterating through the errors list.

Returns:

int
    Number of errors.

Examples:

>>> with jetliner.AvroReader("file.avro", ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
...
...     if reader.error_count > 0:
...         print(f"Warning: {reader.error_count} errors during read")

errors property

Get accumulated errors from skip mode reading.

In skip mode (ignore_errors=True), errors are accumulated rather than causing immediate failure. This property returns all errors encountered.

Returns:

list[BadBlockError]
    List of errors encountered during reading.

Examples:

>>> with jetliner.AvroReader("file.avro", ignore_errors=True) as reader:
...     for df in reader:
...         process(df)
...
...     for err in reader.errors:
...         print(f"[{err.kind}] Block {err.block_index}: {err.message}")

is_finished property

Check if all records have been read or the reader is closed.

Returns:

bool
    True if finished, False otherwise.

pending_records property

Get the number of records waiting to be returned in the next batch.

Returns:

int
    Number of pending records.
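
Examples:

A sketch of progress reporting between batches, using only the documented counters (process is a placeholder):

>>> with jetliner.AvroReader("file.avro") as reader:
...     for df in reader:
...         process(df)
...         print(f"{reader.pending_records} records pending")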

polars_schema property

Get the Polars schema for the DataFrames being produced.

When projection is used, only the projected columns are included.

Returns:

Schema
    A Polars Schema with column names and their data types.

Examples:

>>> reader = jetliner.AvroReader("file.avro")
>>> print(reader.polars_schema)  # Schema({'id': Int64, 'name': String})
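
With projection, only the projected columns appear (continuing the illustrative schema above):

>>> reader = jetliner.AvroReader("file.avro", projected_columns=["id"])
>>> print(reader.polars_schema)  # Schema({'id': Int64})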

rows_read property

Get the total number of rows returned across all batches.

Returns:

int
    Total rows read so far.

schema property

Get the Avro schema as a JSON string.

Returns:

str
    The Avro schema as a JSON string.

Examples:

>>> reader = jetliner.AvroReader("file.avro")
>>> print(reader.schema)  # JSON string

schema_dict property

Get the Avro schema as a Python dictionary.

Returns:

dict[str, Any]
    The Avro schema as a Python dict.

Raises:

ValueError
    If the schema JSON cannot be parsed.

Examples:

>>> reader = jetliner.AvroReader("file.avro")
>>> schema = reader.schema_dict
>>> print(schema["name"])  # Record name
>>> print(schema["fields"])  # List of fields

Functions

__enter__()

Enter the context manager.

Returns:

AvroReader
    Self, for use in with statements.

__exit__(exc_type, exc_val, exc_tb)

Exit the context manager.

Releases resources held by the reader. After this call, the reader cannot be used for iteration.

Returns:

bool
    False, to indicate exceptions should not be suppressed.

__iter__()

Return self as the iterator.

__next__()

Get the next DataFrame batch.

Returns:

DataFrame
    A Polars DataFrame containing the next batch of records.

Raises:

StopIteration
    When all records have been read or the reader is closed.

DecodeError
    If an error occurs during reading (in strict mode).

ParseError
    If the file structure is invalid.
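
Examples:

A sketch of strict-mode error handling, assuming DecodeError and ParseError are exported at the package top level:

>>> reader = jetliner.AvroReader("file.avro")  # strict mode (ignore_errors=False)
>>> try:
...     for df in reader:
...         process(df)
... except jetliner.DecodeError as err:
...     print(f"decode failed: {err}")
... except jetliner.ParseError as err:
...     print(f"invalid file structure: {err}")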