GeoParquet

Read and write GeoParquet files.

geoarrow.rust.io.read_parquet

read_parquet(
    path: Union[str, Path, BinaryIO],
    *,
    store: Optional[ObjectStore] = None,
    batch_size: int = 65536
) -> Table

Read a GeoParquet file from a local path, a URL, or an object store into an Arrow Table.

Examples:

Reading from a local path:

from geoarrow.rust.io import read_parquet
table = read_parquet("path/to/file.parquet")

Reading from an HTTP(S) url:

from geoarrow.rust.io import read_parquet

url = "https://raw.githubusercontent.com/opengeospatial/geoparquet/v1.0.0/examples/example.parquet"
table = read_parquet(url)

Reading from a remote file in an S3 bucket:

from geoarrow.rust.io import ObjectStore, read_parquet

options = {
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_region": "..."
}
store = ObjectStore('s3://bucket', options=options)
table = read_parquet("path/in/bucket.parquet", store=store)

Parameters:

  • path (Union[str, Path, BinaryIO]) –

    the path to the file

  • store (Optional[ObjectStore], default: None ) –

    the ObjectStore to read from. Defaults to None.

  • batch_size (int, default: 65536 ) –

    the number of rows to include in each internal batch of the table.

Returns:

  • Table

    Table from GeoParquet file.
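To make the batch_size parameter concrete, here is a minimal sketch of how a row count splits into batches. It assumes every batch except the last is full, which this page does not state explicitly. The helper batch_lengths is hypothetical and not part of geoarrow.rust.io.

```python
def batch_lengths(num_rows: int, batch_size: int = 65536) -> list[int]:
    """Return the row count of each batch when num_rows rows are chunked.

    Assumption: every batch except possibly the last holds batch_size rows.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full, remainder = divmod(num_rows, batch_size)
    lengths = [batch_size] * full
    if remainder:
        # The trailing batch holds whatever rows are left over.
        lengths.append(remainder)
    return lengths


lengths = batch_lengths(150_000)
print(lengths)
```

A smaller batch_size yields more, shorter batches; the total number of rows is unchanged.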

geoarrow.rust.io.read_parquet_async async

read_parquet_async(
    path: Union[str, Path, BinaryIO],
    *,
    store: Optional[ObjectStore] = None,
    batch_size: int = 65536
) -> Table

Asynchronously read a GeoParquet file from a local path, a URL, or an object store into an Arrow Table.

Examples:

Reading from an HTTP(S) url:

from geoarrow.rust.io import read_parquet_async

url = "https://raw.githubusercontent.com/opengeospatial/geoparquet/v1.0.0/examples/example.parquet"
table = await read_parquet_async(url)

Reading from a remote file in an S3 bucket:

from geoarrow.rust.io import ObjectStore, read_parquet_async

options = {
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_region": "..."
}
store = ObjectStore('s3://bucket', options=options)
table = await read_parquet_async("path/in/bucket.parquet", store=store)

Parameters:

  • path (Union[str, Path, BinaryIO]) –

    the path to the file

  • store (Optional[ObjectStore], default: None ) –

    the ObjectStore to read from. Defaults to None.

  • batch_size (int, default: 65536 ) –

    the number of rows to include in each internal batch of the table.

Returns:

  • Table

    Table from GeoParquet file.
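One reason to prefer the async variant is that several reads can be in flight at once. The sketch below shows that pattern with asyncio.gather. The names read_many and fake_reader are hypothetical; fake_reader stands in for read_parquet_async so the example runs without network access or the library installed.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def read_many(
    reader: Callable[[str], Awaitable[Any]], urls: Sequence[str]
) -> list[Any]:
    """Start one read per URL and wait for all; results keep the input order."""
    return await asyncio.gather(*(reader(url) for url in urls))


async def fake_reader(url: str) -> str:
    # Hypothetical stand-in for read_parquet_async (assumption, not the real API).
    await asyncio.sleep(0)
    return f"table:{url}"


results = asyncio.run(read_many(fake_reader, ["a.parquet", "b.parquet"]))
print(results)
```

In real code, read_parquet_async would be passed as reader, for example wrapped with functools.partial to bind the store argument.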

geoarrow.rust.io.write_parquet

write_parquet(
    table: ArrowStreamExportable,
    file: Union[str, Path, BinaryIO],
    *,
    encoding: GeoParquetEncoding | GeoParquetEncodingT = GeoParquetEncoding.WKB
) -> None

Write an Arrow RecordBatch, Table, or RecordBatchReader to a GeoParquet file on disk.

If a RecordBatchReader is passed, only one batch at a time will be materialized in memory.

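Parameters:

  • table (ArrowStreamExportable) –

    the Arrow RecordBatch, Table, or RecordBatchReader to write.

  • file (Union[str, Path, BinaryIO]) –

    the path or file object to write to.

  • encoding (GeoParquetEncoding | GeoParquetEncodingT, default: GeoParquetEncoding.WKB ) –

    the geometry encoding to use. Defaults to GeoParquetEncoding.WKB.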

geoarrow.rust.io.ParquetDataset

num_row_groups property

num_row_groups: int

The total number of row groups across all files

num_rows property

num_rows: int

The total number of rows across all files.

schema_arrow property

schema_arrow: Schema

Access the Arrow schema of the generated data

__init__

__init__(paths: Sequence[str], store: ObjectStore) -> None

Construct a new ParquetDataset

This will synchronously fetch metadata from all listed files.

Parameters:

  • paths (Sequence[str]) –

    a list of string URLs to read from.

  • store (ObjectStore) –

    the file system interface to read from.

Returns:

  • None

    A new ParquetDataset object.

crs

crs(column_name: str | None = None) -> CRS

Access the CRS of this dataset.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

Returns:

  • CRS

    CRS
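The column_name rule above can be sketched as a small resolution function. This illustrates the documented behavior only and is not the library's implementation; resolve_geometry_column and its error messages are hypothetical.

```python
from typing import Optional, Sequence


def resolve_geometry_column(
    geometry_columns: Sequence[str], column_name: Optional[str] = None
) -> str:
    """Pick the geometry column to use, mirroring the documented rule."""
    if column_name is not None:
        if column_name not in geometry_columns:
            raise ValueError(f"unknown geometry column: {column_name!r}")
        return column_name
    if len(geometry_columns) == 1:
        # Unambiguous: the only geometry column is used.
        return geometry_columns[0]
    raise ValueError(
        "column_name is required unless there is exactly one geometry column"
    )


print(resolve_geometry_column(["geometry"]))
print(resolve_geometry_column(["geometry", "centroid"], "centroid"))
```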

read

read(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[int | float] | None = None,
    bbox_paths: BboxCovering | None = None
) -> Table

Perform a sync read with the given options

Parameters:

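  • batch_size (int | None, default: None ) –

    the number of rows to include in each internal batch of the returned table. Defaults to None.

  • limit (int | None, default: None ) –

    the maximum number of rows to read. Defaults to None.

  • offset (int | None, default: None ) –

    the number of rows to skip before reading. Defaults to None.

  • bbox (Sequence[int | float] | None, default: None ) –

    a bounding box used to spatially filter the rows that are read. Defaults to None.

  • bbox_paths (BboxCovering | None, default: None ) –

    the column paths of the bounding box covering (see BboxCovering). Unnecessary for GeoParquet 1.1 files that include covering metadata. Defaults to None.

Returns:

  • Table

    The rows read, as an Arrow Table.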
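As a rough illustration of how limit and offset might combine, the sketch below applies them to an in-memory sequence. It assumes they behave like SQL LIMIT and OFFSET (skip offset rows, then return at most limit rows); this page does not confirm that. The helper apply_offset_limit is hypothetical.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def apply_offset_limit(
    rows: Sequence[T],
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> list[T]:
    """Skip `offset` rows, then keep at most `limit` rows (assumed semantics)."""
    start = offset or 0
    stop = None if limit is None else start + limit
    return list(rows[start:stop])


rows = list(range(10))
print(apply_offset_limit(rows, limit=3, offset=4))
```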

read_async async

read_async(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[int | float] | None = None,
    bbox_paths: BboxCovering | None = None
) -> Table

Perform an async read with the given options

Parameters:

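  • batch_size (int | None, default: None ) –

    the number of rows to include in each internal batch of the returned table. Defaults to None.

  • limit (int | None, default: None ) –

    the maximum number of rows to read. Defaults to None.

  • offset (int | None, default: None ) –

    the number of rows to skip before reading. Defaults to None.

  • bbox (Sequence[int | float] | None, default: None ) –

    a bounding box used to spatially filter the rows that are read. Defaults to None.

  • bbox_paths (BboxCovering | None, default: None ) –

    the column paths of the bounding box covering (see BboxCovering). Unnecessary for GeoParquet 1.1 files that include covering metadata. Defaults to None.

Returns:

  • Table

    The rows read, as an Arrow Table.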

geoarrow.rust.io.ParquetFile

num_row_groups property

num_row_groups: int

The number of row groups in this file.

num_rows property

num_rows: int

The number of rows in this file.

schema_arrow property

schema_arrow: Schema

Access the Arrow schema of the generated data

__init__

__init__(path: str, store: ObjectStore) -> None

Construct a new ParquetFile

This will synchronously fetch metadata from the provided path

Parameters:

  • path (str) –

    a string URL to read from.

  • store (ObjectStore) –

    the file system interface to read from.

Returns:

  • None

    A new ParquetFile object.

crs

crs(column_name: str | None = None) -> CRS

Access the CRS of this file.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

Returns:

  • CRS

    CRS

file_bbox

file_bbox() -> List[float] | None

Access the bounding box of the given column for the entire file

If no column name is passed, retrieves the bbox from the primary geometry column.

An error is raised if the column name does not exist in the dataset. None is returned if the metadata does not contain bounding box information.

read

read(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[int | float] | None = None,
    bbox_paths: BboxCovering | None = None
) -> Table

Perform a sync read with the given options

Parameters:

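  • batch_size (int | None, default: None ) –

    the number of rows to include in each internal batch of the returned table. Defaults to None.

  • limit (int | None, default: None ) –

    the maximum number of rows to read. Defaults to None.

  • offset (int | None, default: None ) –

    the number of rows to skip before reading. Defaults to None.

  • bbox (Sequence[int | float] | None, default: None ) –

    a bounding box used to spatially filter the rows that are read. Defaults to None.

  • bbox_paths (BboxCovering | None, default: None ) –

    the column paths of the bounding box covering (see BboxCovering). Unnecessary for GeoParquet 1.1 files that include covering metadata. Defaults to None.

Returns:

  • Table

    The rows read, as an Arrow Table.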

read_async async

read_async(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[int | float] | None = None,
    bbox_paths: BboxCovering | None = None
) -> Table

Perform an async read with the given options

Parameters:

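  • batch_size (int | None, default: None ) –

    the number of rows to include in each internal batch of the returned table. Defaults to None.

  • limit (int | None, default: None ) –

    the maximum number of rows to read. Defaults to None.

  • offset (int | None, default: None ) –

    the number of rows to skip before reading. Defaults to None.

  • bbox (Sequence[int | float] | None, default: None ) –

    a bounding box used to spatially filter the rows that are read. Defaults to None.

  • bbox_paths (BboxCovering | None, default: None ) –

    the column paths of the bounding box covering (see BboxCovering). Unnecessary for GeoParquet 1.1 files that include covering metadata. Defaults to None.

Returns:

  • Table

    The rows read, as an Arrow Table.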

row_group_bounds

row_group_bounds(
    row_group_idx: int, bbox_paths: BboxCovering | None = None
) -> List[float]

Get the bounds of a single row group.

Parameters:

  • row_group_idx (int) –

    The row group index.

  • bbox_paths (BboxCovering | None, default: None ) –

    For files written with spatial partitioning, you don't need to pass in these column names, as they'll be specified in the metadata. Defaults to None.

Returns:

  • List[float]

    The bounds of a single row group.

row_groups_bounds

row_groups_bounds(bbox_paths: BboxCovering | None = None) -> NativeArray

Get the bounds of all row groups.

As of GeoParquet 1.1 you won't need to pass in these column names, as they'll be specified in the metadata.

Parameters:

  • bbox_paths (BboxCovering | None, default: None ) –

    For files written with spatial partitioning, you don't need to pass in these column names, as they'll be specified in the metadata. Defaults to None.

Returns:

  • NativeArray

    A geoarrow "box" array with bounds of all row groups.
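Row group bounds are useful for deciding which row groups can possibly contain features inside a query window. The sketch below shows that pruning step on plain Python lists. It assumes each bounds entry is ordered (xmin, ymin, xmax, ymax), which this page does not state, and the helper names are hypothetical.

```python
from typing import Sequence

# Assumed order for every bounds value: (xmin, ymin, xmax, ymax).
Bounds = Sequence[float]


def intersects(a: Bounds, b: Bounds) -> bool:
    """Return True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def overlapping_row_groups(
    row_group_bounds: Sequence[Bounds], query: Bounds
) -> list[int]:
    """Return the indices of row groups whose bounds intersect the query box."""
    return [
        idx
        for idx, bounds in enumerate(row_group_bounds)
        if intersects(bounds, query)
    ]


bounds = [
    [0.0, 0.0, 10.0, 10.0],
    [20.0, 20.0, 30.0, 30.0],
    [5.0, 5.0, 25.0, 25.0],
]
selected = overlapping_row_groups(bounds, [8.0, 8.0, 12.0, 12.0])
print(selected)
```

Only the selected row groups would need to be fetched; the rest cannot contain matching rows.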

geoarrow.rust.io.ParquetWriter

Writer interface for a single Parquet file.

This allows you to write Parquet files that are larger than memory.

close

close() -> None

Close this file.

The recommended use of this class is as a context manager, which will close the file automatically.
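The context-manager pattern recommended above can be sketched with a stand-in class. SketchWriter is hypothetical: it only mimics the close(), is_closed(), and write_batch() methods documented on this page and does not write any Parquet data. How a real ParquetWriter is constructed is not shown here.

```python
class SketchWriter:
    """Hypothetical stand-in for a writer used as a context manager."""

    def __init__(self) -> None:
        self._closed = False
        self.batches = []

    def write_batch(self, batch) -> None:
        if self._closed:
            raise ValueError("writer is closed")
        self.batches.append(batch)

    def close(self) -> None:
        self._closed = True

    def is_closed(self) -> bool:
        return self._closed

    def __enter__(self) -> "SketchWriter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions, so the file is always closed.
        self.close()


with SketchWriter() as writer:
    # Writing one batch at a time keeps only a single batch in memory.
    for batch in ([1, 2], [3, 4]):
        writer.write_batch(batch)

print(writer.is_closed())
```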

is_closed

is_closed() -> bool

Returns True if the file has already been closed.

write_batch

write_batch(batch: ArrowArrayExportable) -> None

Write a single RecordBatch to the Parquet file

write_table

write_table(table: ArrowArrayExportable | ArrowStreamExportable) -> None

Write a table or stream of batches to the Parquet file

This accepts an Arrow RecordBatch, Table, or RecordBatchReader. If a RecordBatchReader is passed, only one batch at a time will be materialized in memory.

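Parameters:

  • table (ArrowArrayExportable | ArrowStreamExportable) –

    the Arrow RecordBatch, Table, or RecordBatchReader to write.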

geoarrow.rust.io.types.BboxCovering

Bases: TypedDict

Column names for the per-row bounding box covering used in spatial partitioning.

The spatial partitioning defined in GeoParquet 1.1 allows for a "covering" field. The covering should be four float columns that represent the bounding box of each row of the data.

As of GeoParquet 1.1, this metadata is included in the Parquet file itself, but this typed dict can be used with spatially-partitioned GeoParquet datasets that do not write GeoParquet 1.1 metadata. Providing this information is unnecessary for GeoParquet 1.1 files with included covering information.

xmax instance-attribute

xmax: Sequence[str]

The path to the xmax bounding box column.

xmin instance-attribute

xmin: Sequence[str]

The path to the xmin bounding box column.

ymax instance-attribute

ymax: Sequence[str]

The path to the ymax bounding box column.

ymin instance-attribute

ymin: Sequence[str]

The path to the ymin bounding box column.
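Because each attribute is a sequence of strings, a covering can point at fields nested inside a struct column. The sketch below defines a local TypedDict with the same four keys and fills it in. It assumes a struct column named "bbox" with xmin, ymin, xmax, and ymax fields; that column name is an illustration, and BboxCoveringSketch is a stand-in for the real BboxCovering type.

```python
from typing import Sequence, TypedDict


class BboxCoveringSketch(TypedDict):
    """Local stand-in mirroring the four documented keys."""

    xmin: Sequence[str]
    ymin: Sequence[str]
    xmax: Sequence[str]
    ymax: Sequence[str]


# Each path walks from the top-level column to the nested field.
covering: BboxCoveringSketch = {
    "xmin": ["bbox", "xmin"],
    "ymin": ["bbox", "ymin"],
    "xmax": ["bbox", "xmax"],
    "ymax": ["bbox", "ymax"],
}

print(covering["xmin"])
```

A dictionary of this shape is what the bbox_paths parameters on this page accept.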

geoarrow.rust.io.enums.GeoParquetEncoding

Bases: StrEnum

Options for geometry encoding in GeoParquet.

Native class-attribute instance-attribute

Native = auto()

Use native GeoArrow geometry types when writing GeoParquet files.

Supported as of GeoParquet version 1.1.

This option provides better read and write performance and allows spatial partitioning to be inferred from remote files, but it does not yet have widespread support.

WKB class-attribute instance-attribute

WKB = auto()

Use Well-Known Binary (WKB) encoding when writing GeoParquet files.

geoarrow.rust.io.types.GeoParquetEncodingT module-attribute

GeoParquetEncodingT = Literal['wkb', 'native']

Acceptable strings to be passed into the encoding parameter for write_parquet.
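An encoding parameter that accepts either the enum or one of these strings can be normalized in one step. The sketch below uses a local enum, EncodingSketch, as a stand-in for GeoParquetEncoding; it assumes the member values are the lowercase strings listed above. The function normalize_encoding is hypothetical.

```python
from enum import Enum
from typing import Union


class EncodingSketch(str, Enum):
    """Local stand-in for GeoParquetEncoding (assumed values)."""

    WKB = "wkb"
    Native = "native"


def normalize_encoding(value: Union[EncodingSketch, str]) -> EncodingSketch:
    """Accept an enum member or its string value; raise ValueError otherwise."""
    # Enum lookup by value returns the member for both strings and members.
    return EncodingSketch(value)


print(normalize_encoding("wkb"))
print(normalize_encoding(EncodingSketch.Native))
```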