
read_avro

jetliner.read_avro(source, *, columns=None, n_rows=None, row_index_name=None, row_index_offset=0, glob=True, include_file_paths=None, ignore_errors=False, storage_options=None, buffer_blocks=4, buffer_bytes=64 * 1024 * 1024, read_chunk_size=None, batch_size=100000, max_block_size=512 * 1024 * 1024)

Read Avro file(s), returning a DataFrame.

Eagerly reads data into memory. For large files or when you need query optimization, use scan_avro() instead.
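
For a sense of the difference (a minimal sketch; it assumes scan_avro() accepts the same source argument and, like other scan functions, returns a lazy frame that defers work until collected):

>>> import jetliner
>>> df = jetliner.read_avro("data.avro")  # eager: all rows in memory now
>>> lf = jetliner.scan_avro("data.avro")  # lazy: work deferred until collection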

Parameters:

Name Type Description Default
source str | Path | Sequence[str] | Sequence[Path]

Path to Avro file(s). Supports:

  • Local filesystem paths: /path/to/file.avro, ./relative/path.avro
  • S3 URIs: s3://bucket/key.avro
  • Glob patterns with standard wildcards:
      • * matches any characters except / (e.g., data/*.avro)
      • ** matches zero or more directories (e.g., data/**/*.avro)
      • ? matches a single character (e.g., file?.avro)
      • [...] matches character ranges (e.g., data/[0-9]*.avro)
      • {a,b,c} matches alternatives (e.g., data/{2023,2024}/*.avro)
  • Multiple files: ["file1.avro", "file2.avro"]

When multiple files are provided, their schemas must be compatible (see the glob sketch after this entry).

required
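
A minimal sketch of the path forms above (file names are hypothetical):

>>> # Recursive glob across a nested directory tree
>>> df = jetliner.read_avro("data/2024/**/*.avro")
>>> # Brace alternatives across two years
>>> df = jetliner.read_avro("data/{2023,2024}/*.avro")
>>> # Explicit list of files with compatible schemas
>>> df = jetliner.read_avro(["file1.avro", "file2.avro"])
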
columns Sequence[str] | Sequence[int] | None

Columns to read. Projection happens during decoding for efficiency. Can be:

  • List of column names: ["col1", "col2"]
  • List of column indices (0-based): [0, 2, 5]
  • None to read all columns

None
n_rows int | None

Maximum number of rows to read across all files. None means read all rows.

None
row_index_name str | None

If provided, adds a row index column with this name as the first column. With multiple files, the index continues across files.

None
row_index_offset int

Starting value for the row index (only used if row_index_name is set).

0
glob bool

If True, expand glob patterns in the source path. Set to False to treat patterns as literal filenames.

True
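
A short sketch of disabling expansion for a literal filename that happens to contain wildcard characters (hypothetical name):

>>> df = jetliner.read_avro("report[draft].avro", glob=False)
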
include_file_paths str | None

If provided, adds a column with this name containing the source file path for each row. Useful for tracking data provenance with multiple files.

None
ignore_errors bool

If True, skip corrupted blocks/records and continue reading. If False, fail immediately on the first error. In read mode, skipped errors are not reported back to the caller.

False
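
A sketch of a best-effort read over a partially corrupted file (hypothetical path):

>>> # Bad blocks are skipped; read mode gives no way to inspect them afterwards
>>> df = jetliner.read_avro("flaky.avro", ignore_errors=True)
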
storage_options dict[str, str] | None

Configuration for S3 connections. Supported keys:

  • endpoint: Custom S3 endpoint (for MinIO, LocalStack, R2, etc.)
  • aws_access_key_id: AWS access key (overrides environment)
  • aws_secret_access_key: AWS secret key (overrides environment)
  • region: AWS region (overrides environment)

Values here take precedence over environment variables (see the sketch after this entry).

None
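
A sketch of reading from an S3-compatible MinIO endpoint using the keys listed above; the endpoint URL and credentials are placeholders:

>>> df = jetliner.read_avro(
...     "s3://bucket/key.avro",
...     storage_options={
...         "endpoint": "http://localhost:9000",
...         "aws_access_key_id": "minioadmin",
...         "aws_secret_access_key": "minioadmin",
...         "region": "us-east-1",
...     },
... )
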
buffer_blocks int

Number of blocks to prefetch for better I/O performance.

4
buffer_bytes int

Maximum bytes to buffer during prefetching.

64MB
read_chunk_size int | None

Read buffer chunk size in bytes. If None, auto-detects based on source: 64KB for local files, 4MB for S3. Larger values reduce I/O operations but use more memory.

None
batch_size int

Minimum rows per DataFrame batch. Rows accumulate until this threshold is reached, then a batch is emitted. Batches may exceed this size because blocks are processed whole; the final batch may be smaller if fewer rows remain.

100,000
max_block_size int | None

Maximum decompressed block size in bytes. Blocks that would decompress to more than this limit are rejected. Set to None to disable the limit. Default: 512MB (536870912 bytes).

This protects against decompression bombs: maliciously crafted files in which a small compressed block expands to consume excessive memory.

512MB
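
A sketch combining the performance and safety knobs above for a large S3 read; the values are illustrative, not recommendations:

>>> df = jetliner.read_avro(
...     "s3://bucket/big.avro",
...     buffer_blocks=8,                    # prefetch more blocks
...     buffer_bytes=128 * 1024 * 1024,     # cap prefetch memory at 128MB
...     read_chunk_size=8 * 1024 * 1024,    # larger range requests, fewer round trips
...     batch_size=500_000,                 # accumulate bigger internal batches
...     max_block_size=1024 * 1024 * 1024,  # allow blocks up to 1GB decompressed
... )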

Returns:

Type Description
DataFrame

A DataFrame containing the Avro data.

Examples:

>>> import jetliner
>>>
>>> # Read all columns
>>> df = jetliner.read_avro("data.avro")
>>>
>>> # Read specific columns by name
>>> df = jetliner.read_avro("data.avro", columns=["col1", "col2"])
>>>
>>> # Read specific columns by index
>>> df = jetliner.read_avro("data.avro", columns=[0, 2, 5])
>>>
>>> # Read with row limit
>>> df = jetliner.read_avro("data.avro", n_rows=1000)
>>>
>>> # Multiple files with row tracking
>>> df = jetliner.read_avro(
...     ["file1.avro", "file2.avro"],
...     row_index_name="idx",
...     include_file_paths="source"
... )