read_avro
jetliner.read_avro(source, *, columns=None, n_rows=None, row_index_name=None, row_index_offset=0, glob=True, include_file_paths=None, ignore_errors=False, storage_options=None, buffer_blocks=4, buffer_bytes=64 * 1024 * 1024, read_chunk_size=None, batch_size=100000, max_block_size=512 * 1024 * 1024)
Read Avro file(s), returning a DataFrame.
Eagerly reads data into memory. For large files or when you need query
optimization, use scan_avro() instead.
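A minimal sketch of the lazy alternative, assuming scan_avro() accepts the same source argument and returns a Polars-style LazyFrame (the select/head/collect calls below are assumptions about that API, not documented here):
>>> import jetliner
>>>
>>> # Lazy scan: projection and the row limit can be pushed into the read
>>> lf = jetliner.scan_avro("data.avro")
>>> df = lf.select(["col1", "col2"]).head(1000).collect()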
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str \| Path \| Sequence[str] \| Sequence[Path] | Path to Avro file(s). Local paths, glob patterns, and S3 URIs are supported; a sequence of paths may also be given. When multiple files are provided, schemas must be compatible. | required |
| columns | Sequence[str] \| Sequence[int] \| None | Columns to read, given as a sequence of column names or a sequence of column indices. Projection happens during decoding for efficiency. | None |
| n_rows | int \| None | Maximum number of rows to read across all files. None means read all rows. | None |
| row_index_name | str \| None | If provided, adds a row index column with this name as the first column. With multiple files, the index continues across files. | None |
| row_index_offset | int | Starting value for the row index (only used if row_index_name is set). | 0 |
| glob | bool | If True, expand glob patterns in the source path. Set to False to treat patterns as literal filenames. | True |
| include_file_paths | str \| None | If provided, adds a column with this name containing the source file path for each row. Useful for tracking data provenance across multiple files. | None |
| ignore_errors | bool | If True, skip corrupted blocks/records and continue reading. If False, fail immediately on the first error. Errors are not accessible in read mode. | False |
| storage_options | dict[str, str] \| None | Configuration for S3 connections. Values given here take precedence over environment variables. | None |
| buffer_blocks | int | Number of blocks to prefetch for better I/O performance. | 4 |
| buffer_bytes | int | Maximum number of bytes to buffer during prefetching. | 64 MB |
| read_chunk_size | int \| None | Read buffer chunk size in bytes. If None, auto-detects based on the source: 64 KB for local files, 4 MB for S3. Larger values reduce I/O operations but use more memory. | None |
| batch_size | int | Minimum rows per DataFrame batch. Rows accumulate until this threshold is reached, then yield. A batch may exceed this size since blocks are processed whole, and the final batch may be smaller if fewer rows remain. | 100,000 |
| max_block_size | int \| None | Maximum decompressed block size in bytes. Blocks that would decompress to more than this limit are rejected; set to None to disable the limit. This protects against decompression bombs: maliciously crafted files where a small compressed block expands to consume excessive memory. | 512 MB (536,870,912 bytes) |
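For instance, the robustness and memory-safety parameters can be combined when reading files you do not fully trust; the values below are illustrative, not recommendations:
>>> import jetliner
>>>
>>> # Skip corrupted blocks rather than failing, and reject any block that
>>> # would decompress to more than 128 MB (tighter than the 512 MB default)
>>> df = jetliner.read_avro(
...     "data.avro",
...     ignore_errors=True,
...     max_block_size=128 * 1024 * 1024,
... )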
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame containing the Avro data. |
Examples:
>>> import jetliner
>>>
>>> # Read all columns
>>> df = jetliner.read_avro("data.avro")
>>>
>>> # Read specific columns by name
>>> df = jetliner.read_avro("data.avro", columns=["col1", "col2"])
>>>
>>> # Read specific columns by index
>>> df = jetliner.read_avro("data.avro", columns=[0, 2, 5])
>>>
>>> # Read with row limit
>>> df = jetliner.read_avro("data.avro", n_rows=1000)
>>>
>>> # Multiple files with row tracking
>>> df = jetliner.read_avro(
... ["file1.avro", "file2.avro"],
... row_index_name="idx",
... include_file_paths="source"
... )
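>>>
>>> # Sketch: read from S3 with explicit credentials. The storage_options
>>> # key names below are assumptions based on common S3 client settings;
>>> # check the storage_options documentation for the supported keys.
>>> df = jetliner.read_avro(
...     "s3://my-bucket/data.avro",
...     storage_options={
...         "aws_access_key_id": "...",
...         "aws_secret_access_key": "...",
...         "aws_region": "us-east-1",
...     },
... )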