GeoParquet

Read and write GeoParquet files.

geoarrow.rust.io.GeoParquetDataset

An interface to read from a collection of GeoParquet files with the same schema.

fragments property

fragments: List[GeoParquetFile]

Get the list of files in this dataset.

num_row_groups property

num_row_groups: int

The total number of row groups across all files.

num_rows property

num_rows: int

The total number of rows across all files.

crs

crs(column_name: str | None = None) -> CRS | None

Access the CRS of this dataset.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the dataset, you must specify which you want to read. Defaults to None.

fragment

fragment(path: str) -> GeoParquetFile

Get a single file from this dataset.

open classmethod

Construct a new GeoParquetDataset.

This will synchronously fetch metadata from all listed files.


read

read(
    *,
    bbox: Sequence[int | float] | None = None,
    parse_to_native: bool = True,
    coord_type: CoordTypeInput | None = None,
    batch_size: int | None = None
) -> Table

Perform a synchronous read with the given options.

Other Parameters:

  • bbox (Sequence[int | float] | None) –

    The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.

  • parse_to_native (bool) –

    If True, the data will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None) –

    The coordinate type to use. Defaults to separated coordinates.

  • batch_size (int | None) –

    The number of rows in each internal batch of the table. Defaults to 1024.
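The bbox parameter is a flat sequence of four numbers; following the GeoParquet convention it is assumed here to be ordered (xmin, ymin, xmax, ymax). A minimal sketch of a pre-flight check one might run before passing it to read — the `validate_bbox` helper is hypothetical, not part of this library:

```python
from typing import Sequence

def validate_bbox(bbox: Sequence[float]) -> tuple[float, float, float, float]:
    """Check that a bbox has 4 numeric values with min <= max on both axes."""
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly 4 values")
    xmin, ymin, xmax, ymax = (float(v) for v in bbox)
    if xmin > xmax or ymin > ymax:
        raise ValueError("bbox min values must not exceed max values")
    return (xmin, ymin, xmax, ymax)

bbox = validate_bbox([-122.5, 37.7, -122.3, 37.9])
# With the real library the result would then be passed as:
#   table = dataset.read(bbox=bbox)
```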

read_async async

read_async(
    *,
    bbox: Sequence[int | float] | None = None,
    parse_to_native: bool = True,
    coord_type: CoordTypeInput | None = None,
    batch_size: int | None = None
) -> Table

Perform an asynchronous read with the given options.

Other Parameters:

  • bbox (Sequence[int | float] | None) –

    The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.

  • parse_to_native (bool) –

    If True, the data will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None) –

    The coordinate type to use. Defaults to separated coordinates.

  • batch_size (int | None) –

    The number of rows in each internal batch of the table. Defaults to 1024.
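The async variants compose with asyncio like any other awaitable. A runnable sketch of the calling pattern, using a stand-in coroutine in place of the real read_async (which is shown only in comments):

```python
import asyncio

# Stand-in for dataset.read_async; with the real library the call would be:
#   table = await dataset.read_async(bbox=(xmin, ymin, xmax, ymax))
# A plain coroutine simulates the async I/O so the pattern is runnable here.
async def read_async_stub(**options):
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"num_rows": 3, "options": options}

async def main():
    table = await read_async_stub(bbox=(0.0, 0.0, 10.0, 10.0), batch_size=1024)
    return table

result = asyncio.run(main())
```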

schema_arrow

schema_arrow(
    *, parse_to_native: bool = True, coord_type: CoordTypeInput | None = None
) -> Schema

Access the Arrow schema of the generated data.

Parameters:

  • parse_to_native (bool, default: True ) –

    If True, the schema will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None, default: None ) –

    The coordinate type to use. Defaults to separated coordinates.

geoarrow.rust.io.GeoParquetFile

An interface to read from a single GeoParquet file.

num_row_groups property

num_row_groups: int

The number of row groups in this file.

num_rows property

num_rows: int

The number of rows in this file.

crs

crs(column_name: str | None = None) -> CRS | None

Access the CRS of this file.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

file_bbox

file_bbox() -> List[float] | None

Access the bounding box of the given column for the entire file.

If no column name is passed, retrieves the bbox from the primary geometry column.

An error will be returned if the column name does not exist in the file. None will be returned if the metadata does not contain bounding box information.

open classmethod

open(path: str | PathInput, store: ObjectStore) -> GeoParquetFile

Open a Parquet file from the given path.

This will synchronously fetch metadata from the provided path.

Parameters:

  • path (str | PathInput) –

    The path to read from, as a string URL or a PathInput.

  • store (ObjectStore) –

    The object store interface to read from.

open_async async classmethod

open_async(path: str | PathInput, store: ObjectStore) -> GeoParquetFile

Open a Parquet file from the given path asynchronously.

This will fetch metadata from the provided path in an async manner.

Parameters:

  • path (str | PathInput) –

    The path to read from, as a string URL or a PathInput.

  • store (ObjectStore) –

    The object store interface to read from.

read

read(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[int | float] | None = None,
    parse_to_native: bool = True,
    coord_type: CoordTypeInput | None = None
) -> Table

Perform a synchronous read with the given options.

Other Parameters:

  • bbox (Sequence[int | float] | None) –

    The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.

  • parse_to_native (bool) –

    If True, the data will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None) –

    The coordinate type to use. Defaults to separated coordinates.

  • batch_size (int | None) –

    The number of rows in each internal batch of the table. Defaults to 1024.

  • limit (int | None) –

    The maximum number of rows to read. Defaults to None, which means all rows will be read.

  • offset (int | None) –

    The number of rows to skip before starting to read. Defaults to None, which means no rows will be skipped.
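Together, limit and offset allow paging through a file in fixed-size windows. A minimal pure-Python sketch of that pattern — the `read_windows` helper is hypothetical, and the real per-window call is shown only in a comment:

```python
# Yield (offset, limit) windows covering num_rows rows. With the real
# library each window would be read as: file.read(offset=o, limit=l)
def read_windows(num_rows: int, page_size: int):
    offset = 0
    while offset < num_rows:
        yield offset, min(page_size, num_rows - offset)
        offset += page_size

windows = list(read_windows(num_rows=10, page_size=4))
# windows == [(0, 4), (4, 4), (8, 2)]
```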

read_async async

read_async(
    *,
    bbox: Sequence[int | float] | None = None,
    parse_to_native: bool = True,
    coord_type: CoordTypeInput | None = None,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None
) -> Table

Perform an asynchronous read with the given options.

Other Parameters:

  • bbox (Sequence[int | float] | None) –

    The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.

  • parse_to_native (bool) –

    If True, the data will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None) –

    The coordinate type to use. Defaults to separated coordinates.

  • batch_size (int | None) –

    The number of rows in each internal batch of the table. Defaults to 1024.

  • limit (int | None) –

    The maximum number of rows to read. Defaults to None, which means all rows will be read.

  • offset (int | None) –

    The number of rows to skip before starting to read. Defaults to None, which means no rows will be skipped.

row_group_bounds

row_group_bounds(
    row_group_idx: int, column_name: str | None = None
) -> List[float]

Get the bounds of a single row group.

Parameters:

  • row_group_idx (int) –

    The row group index.

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

Returns:

  • List[float]

    The bounds of a single row group.

row_groups_bounds

row_groups_bounds(column_name: str | None = None) -> Array

Get the bounds of all row groups.

As of GeoParquet 1.1, you won't need to pass in these column names, as they'll be specified in the metadata.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

Returns:

  • Array

    A geoarrow "box" array with bounds of all row groups.

schema_arrow

schema_arrow(
    *, parse_to_native: bool = True, coord_type: CoordTypeInput | None = None
) -> Schema

Access the Arrow schema of the generated data.

Parameters:

  • parse_to_native (bool, default: True ) –

    If True, the schema will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None, default: None ) –

    The coordinate type to use. Defaults to separated coordinates.

geoarrow.rust.io.GeoParquetWriter

Writer interface for a single GeoParquet file.

This allows you to write GeoParquet files that are larger than memory.

__init__

__init__(
    file: str | Path | BinaryIO,
    schema: ArrowSchemaExportable,
    *,
    encoding: GeoParquetEncoding | GeoParquetEncodingT = WKB,
    compression: (
        Literal["uncompressed", "snappy", "lzo", "lz4", "lz4_raw"] | str
    ) = "zstd(1)",
    writer_version: Literal["parquet_1_0", "parquet_2_0"] = "parquet_2_0"
) -> None

Create a new GeoParquetWriter.

Note

This currently only supports writing to local files. Writing directly to object stores will be supported in an upcoming release.

Parameters:

  • file (str | Path | BinaryIO) –

    The output path or binary stream to write to.

  • schema (ArrowSchemaExportable) –

    The Arrow schema of the data to be written.

Other Parameters:

  • encoding (GeoParquetEncoding | GeoParquetEncodingT) –

    the geometry encoding to use. See GeoParquetEncoding for more details on supported geometry encodings.

  • compression (Literal['uncompressed', 'snappy', 'lzo', 'lz4', 'lz4_raw'] | str) –

    the compression algorithm to use. This can be either one of the strings in the Literal type, or a string that contains the compression level, like gzip(9), brotli(11), or zstd(22). The default is zstd(1).

  • writer_version (Literal['parquet_1_0', 'parquet_2_0']) –

    the Parquet writer version to use. Defaults to "parquet_2_0".
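The compression string is either a bare codec name ("snappy") or a codec with a level in parentheses ("zstd(22)"). A runnable sketch of that convention — this parser is purely illustrative, not library code:

```python
import re

# Split a compression spec like "zstd(22)" into ("zstd", 22), or a bare
# codec like "snappy" into ("snappy", None).
def parse_compression(spec: str):
    m = re.fullmatch(r"([a-z0-9_]+)(?:\((\d+)\))?", spec)
    if m is None:
        raise ValueError(f"malformed compression spec: {spec!r}")
    codec, level = m.group(1), m.group(2)
    return codec, int(level) if level is not None else None

codec, level = parse_compression("zstd(22)")
# codec == "zstd", level == 22
```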

close

close() -> None

Close this file.

This is required to ensure that all data is flushed to disk and the file is properly finalized.

The recommended use of this class is as a context manager, which will close the file automatically.

is_closed

is_closed() -> bool

Returns True if the file has already been closed.

write_batch

write_batch(batch: ArrowArrayExportable) -> None

Write a single RecordBatch to the GeoParquet file.

write_table

write_table(table: ArrowArrayExportable | ArrowStreamExportable) -> None

Write a table or stream of batches to the Parquet file.

This accepts an Arrow RecordBatch, Table, or RecordBatchReader. If a RecordBatchReader is passed, only one batch at a time will be materialized in memory, allowing you to write large datasets without running out of memory.

Parameters:

  • table (ArrowArrayExportable | ArrowStreamExportable) –

    The Arrow RecordBatch, Table, or RecordBatchReader to write.

geoarrow.rust.io.enums.GeoParquetEncoding

Bases: StrEnum

Options for geometry encoding in GeoParquet.

GEOARROW class-attribute instance-attribute

GEOARROW = 'geoarrow'

Use native GeoArrow-based geometry types when writing GeoParquet files.

Note

GeoParquet ecosystem support is not as widespread for the GeoArrow encoding as for the WKB encoding.

This is only valid when all geometries are one of the supported single-geometry type encodings (i.e., "point", "linestring", "polygon", "multipoint", "multilinestring", "multipolygon").

Using this encoding may provide better performance, most likely when writing points. Writing points plus an external bounding-box column requires storing each x-y coordinate pair three times instead of once, so this encoding can provide significant file size savings. There has not yet been widespread testing for other geometry types.

These encodings correspond to the separated (struct) representation of coordinates for single-geometry type encodings. This encoding results in useful column statistics when row groups and/or files contain related features.

WKB class-attribute instance-attribute

WKB = 'wkb'

Use Well-Known Binary (WKB) encoding when writing GeoParquet files.

This is the preferred option for maximum portability. See the upstream specification for reference.

geoarrow.rust.io.types.GeoParquetEncodingT module-attribute

GeoParquetEncodingT = Literal['wkb', 'geoarrow']

Acceptable strings to be passed into the encoding parameter for GeoParquetWriter.

geoarrow.rust.io.PathInput

Bases: TypedDict

path instance-attribute

path: str

The path to the file.

size instance-attribute

size: int

The size of the file in bytes.

If this is provided, only bounded range requests will be made instead of suffix requests. This is useful for object stores that do not support suffix requests, in particular Azure.
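PathInput is a plain dict conforming to this TypedDict. A runnable sketch of constructing one — the class is re-declared locally for illustration (the real one lives in geoarrow.rust.io), and `size` is assumed required here as documented above:

```python
from typing import TypedDict

# Local re-declaration of PathInput for illustration only. Supplying size
# lets the reader issue bounded range requests instead of suffix requests,
# which some object stores (notably Azure) do not support.
class PathInput(TypedDict):
    path: str
    size: int

path_input: PathInput = {
    "path": "data/example.parquet",  # hypothetical path
    "size": 1_048_576,               # file size in bytes
}
# With the real library this could then be passed to GeoParquetFile.open:
#   GeoParquetFile.open(path_input, store)
```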