GeoParquet

Read and write GeoParquet files.

geoarrow.rust.io.GeoParquetDataset

An interface to read from a collection of GeoParquet files with the same schema.

fragments property

fragments: List[GeoParquetFile]

Get the list of files in this dataset.

num_row_groups property

num_row_groups: int

The total number of row groups across all files.

num_rows property

num_rows: int

The total number of rows across all files.

crs

crs(column_name: str | None = None) -> CRS | None

Access the CRS of this dataset.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the dataset, you must specify which you want to read. Defaults to None.

fragment

fragment(path: str) -> GeoParquetFile

Get a single file from this dataset.

open classmethod

Construct a new GeoParquetDataset.

This will synchronously fetch metadata from all listed files.


read

read(
    *,
    bbox: Sequence[int | float] | None = None,
    parse_to_native: bool = True,
    coord_type: CoordTypeInput | None = None,
    batch_size: int | None = None
) -> Table

Perform a synchronous read with the given options.

Other Parameters:

  • bbox (Sequence[int | float] | None) –

    The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.

  • parse_to_native (bool) –

    If True, the data will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None) –

    The coordinate type to use. Defaults to separated coordinates.

  • batch_size (int | None) –

    The number of rows in each internal batch of the table. Defaults to 1024.
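The bbox parameter is a flat sequence of four numbers; following the GeoParquet convention it is assumed here to be ordered (xmin, ymin, xmax, ymax). A minimal sketch of a pre-flight check one might run before passing it to read — the `validate_bbox` helper is hypothetical, not part of this library:

```python
from typing import Sequence

def validate_bbox(bbox: Sequence[float]) -> tuple[float, float, float, float]:
    """Check that a bbox has 4 numeric values with min <= max on both axes."""
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly 4 values")
    xmin, ymin, xmax, ymax = (float(v) for v in bbox)
    if xmin > xmax or ymin > ymax:
        raise ValueError("bbox min values must not exceed max values")
    return (xmin, ymin, xmax, ymax)

bbox = validate_bbox([-122.5, 37.7, -122.3, 37.9])
# With the real library the result would then be passed as:
#   table = dataset.read(bbox=bbox)
```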

read_async async

read_async(
    *,
    bbox: Sequence[int | float] | None = None,
    parse_to_native: bool = True,
    coord_type: CoordTypeInput | None = None,
    batch_size: int | None = None
) -> Table

Perform an asynchronous read with the given options.

Other Parameters:

  • bbox (Sequence[int | float] | None) –

    The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.

  • parse_to_native (bool) –

    If True, the data will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None) –

    The coordinate type to use. Defaults to separated coordinates.

  • batch_size (int | None) –

    The number of rows in each internal batch of the table. Defaults to 1024.
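The async variants compose with asyncio like any other awaitable. A runnable sketch of the calling pattern, using a stand-in coroutine in place of the real read_async (which is shown only in comments):

```python
import asyncio

# Stand-in for dataset.read_async; with the real library the call would be:
#   table = await dataset.read_async(bbox=(xmin, ymin, xmax, ymax))
# A plain coroutine simulates the async I/O so the pattern is runnable here.
async def read_async_stub(**options):
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"num_rows": 3, "options": options}

async def main():
    table = await read_async_stub(bbox=(0.0, 0.0, 10.0, 10.0), batch_size=1024)
    return table

result = asyncio.run(main())
```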

schema_arrow

schema_arrow(
    *, parse_to_native: bool = True, coord_type: CoordTypeInput | None = None
) -> Schema

Access the Arrow schema of the generated data.

Parameters:

  • parse_to_native (bool, default: True ) –

    If True, the schema will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None, default: None ) –

    The coordinate type to use. Defaults to separated coordinates.

geoarrow.rust.io.GeoParquetFile

An interface to read from a single GeoParquet file.

num_row_groups property

num_row_groups: int

The number of row groups in this file.

num_rows property

num_rows: int

The number of rows in this file.

crs

crs(column_name: str | None = None) -> CRS | None

Access the CRS of this file.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

file_bbox

file_bbox() -> List[float] | None

Access the bounding box of the given column for the entire file.

If no column name is passed, retrieves the bbox from the primary geometry column.

An error will be returned if the column name does not exist in the file. None will be returned if the metadata does not contain bounding box information.

open classmethod

open(path: str | PathInput, store: ObjectStore) -> GeoParquetFile

Open a Parquet file from the given path.

This will synchronously fetch metadata from the provided path.

Parameters:

  • path (str | PathInput) –

    The path to read from, as a string URL or a PathInput.

  • store (ObjectStore) –

    The object store interface to read from.

open_async async classmethod

open_async(path: str | PathInput, store: ObjectStore) -> GeoParquetFile

Open a Parquet file from the given path asynchronously.

This will fetch metadata from the provided path in an async manner.

Parameters:

  • path (str | PathInput) –

    The path to read from, as a string URL or a PathInput.

  • store (ObjectStore) –

    The object store interface to read from.

read

read(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[int | float] | None = None,
    parse_to_native: bool = True,
    coord_type: CoordTypeInput | None = None
) -> Table

Perform a synchronous read with the given options.

Other Parameters:

  • bbox (Sequence[int | float] | None) –

    The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.

  • parse_to_native (bool) –

    If True, the data will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None) –

    The coordinate type to use. Defaults to separated coordinates.

  • batch_size (int | None) –

    The number of rows in each internal batch of the table. Defaults to 1024.

  • limit (int | None) –

    The maximum number of rows to read. Defaults to None, which means all rows will be read.

  • offset (int | None) –

    The number of rows to skip before starting to read. Defaults to None, which means no rows will be skipped.
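Together, limit and offset allow paging through a file in fixed-size windows. A minimal pure-Python sketch of that pattern — the `read_windows` helper is hypothetical, and the real per-window call is shown only in a comment:

```python
# Yield (offset, limit) windows covering num_rows rows. With the real
# library each window would be read as: file.read(offset=o, limit=l)
def read_windows(num_rows: int, page_size: int):
    offset = 0
    while offset < num_rows:
        yield offset, min(page_size, num_rows - offset)
        offset += page_size

windows = list(read_windows(num_rows=10, page_size=4))
# windows == [(0, 4), (4, 4), (8, 2)]
```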

read_async async

read_async(
    *,
    bbox: Sequence[int | float] | None = None,
    parse_to_native: bool = True,
    coord_type: CoordTypeInput | None = None,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None
) -> Table

Perform an asynchronous read with the given options.

Other Parameters:

  • bbox (Sequence[int | float] | None) –

    The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.

  • parse_to_native (bool) –

    If True, the data will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None) –

    The coordinate type to use. Defaults to separated coordinates.

  • batch_size (int | None) –

    The number of rows in each internal batch of the table. Defaults to 1024.

  • limit (int | None) –

    The maximum number of rows to read. Defaults to None, which means all rows will be read.

  • offset (int | None) –

    The number of rows to skip before starting to read. Defaults to None, which means no rows will be skipped.

row_group_bounds

row_group_bounds(
    row_group_idx: int, column_name: str | None = None
) -> List[float]

Get the bounds of a single row group.

Parameters:

  • row_group_idx (int) –

    The row group index.

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

Returns:

  • List[float]

    The bounds of a single row group.

row_groups_bounds

row_groups_bounds(column_name: str | None = None) -> Array

Get the bounds of all row groups.

As of GeoParquet 1.1, you won't need to pass in these column names, as they'll be specified in the metadata.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

Returns:

  • Array

    A geoarrow "box" array with bounds of all row groups.

schema_arrow

schema_arrow(
    *, parse_to_native: bool = True, coord_type: CoordTypeInput | None = None
) -> Schema

Access the Arrow schema of the generated data.

Parameters:

  • parse_to_native (bool, default: True ) –

    If True, the schema will be parsed to native Arrow types. Defaults to True.

  • coord_type (CoordTypeInput | None, default: None ) –

    The coordinate type to use. Defaults to separated coordinates.

geoarrow.rust.io.GeoParquetWriter

Writer interface for a single GeoParquet file.

This allows you to write GeoParquet files that are larger than memory.

__init__

__init__(
    file: str | Path | BinaryIO,
    schema: ArrowSchemaExportable,
    *,
    encoding: GeoParquetEncoding | GeoParquetEncodingT = WKB,
    compression: (
        Literal["uncompressed", "snappy", "lzo", "lz4", "lz4_raw"] | str
    ) = "zstd(1)",
    writer_version: Literal["parquet_1_0", "parquet_2_0"] = "parquet_2_0"
) -> None

Create a new GeoParquetWriter.

Note

This currently only supports writing to local files. Writing directly to object stores will be supported in an upcoming release.

Parameters:

  • file (str | Path | BinaryIO) –

    The output path or binary stream to write to.

  • schema (ArrowSchemaExportable) –

    The Arrow schema of the data to be written.

Other Parameters:

  • encoding (GeoParquetEncoding | GeoParquetEncodingT) –

    the geometry encoding to use. See GeoParquetEncoding for more details on supported geometry encodings.

  • compression (Literal['uncompressed', 'snappy', 'lzo', 'lz4', 'lz4_raw'] | str) –

    the compression algorithm to use. This can be either one of the strings in the Literal type, or a string that contains the compression level, like gzip(9), brotli(11), or zstd(22). The default is zstd(1).

  • writer_version (Literal['parquet_1_0', 'parquet_2_0']) –

    the Parquet writer version to use. Defaults to "parquet_2_0".
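The compression string is either a bare codec name ("snappy") or a codec with a level in parentheses ("zstd(22)"). A runnable sketch of that convention — this parser is purely illustrative, not library code:

```python
import re

# Split a compression spec like "zstd(22)" into ("zstd", 22), or a bare
# codec like "snappy" into ("snappy", None).
def parse_compression(spec: str):
    m = re.fullmatch(r"([a-z0-9_]+)(?:\((\d+)\))?", spec)
    if m is None:
        raise ValueError(f"malformed compression spec: {spec!r}")
    codec, level = m.group(1), m.group(2)
    return codec, int(level) if level is not None else None

codec, level = parse_compression("zstd(22)")
# codec == "zstd", level == 22
```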

close

close() -> None

Close this file.

This is required to ensure that all data is flushed to disk and the file is properly finalized.

The recommended use of this class is as a context manager, which will close the file automatically.

is_closed

is_closed() -> bool

Returns True if the file has already been closed.

write_batch

write_batch(batch: ArrowArrayExportable) -> None

Write a single RecordBatch to the GeoParquet file.

write_table

write_table(table: ArrowArrayExportable | ArrowStreamExportable) -> None

Write a table or stream of batches to the Parquet file.

This accepts an Arrow RecordBatch, Table, or RecordBatchReader. If a RecordBatchReader is passed, only one batch at a time will be materialized in memory, allowing you to write large datasets without running out of memory.

Parameters:

  • table (ArrowArrayExportable | ArrowStreamExportable) –

    The Arrow RecordBatch, Table, or RecordBatchReader to write.

geoarrow.rust.io.enums.GeoParquetEncoding

Bases: StrEnum

Options for geometry encoding in GeoParquet.

GEOARROW class-attribute instance-attribute

GEOARROW = 'geoarrow'

Use native GeoArrow-based geometry types when writing GeoParquet files.

Note

GeoParquet ecosystem support is not as widespread for the GeoArrow encoding as for the WKB encoding.

This is only valid when all geometries are one of the supported single-geometry type encodings (i.e., "point", "linestring", "polygon", "multipoint", "multilinestring", "multipolygon").

Using this encoding may provide better performance, most likely when writing points. Writing points plus an external bounding-box column requires storing each x-y coordinate pair three times instead of once, so this encoding can provide significant file size savings. There has not yet been widespread testing for other geometry types.

These encodings correspond to the separated (struct) representation of coordinates for single-geometry type encodings. This encoding results in useful column statistics when row groups and/or files contain related features.

WKB class-attribute instance-attribute

WKB = 'wkb'

Use Well-Known Binary (WKB) encoding when writing GeoParquet files.

This is the preferred option for maximum portability. See the upstream specification for reference.

geoarrow.rust.io.types.GeoParquetEncodingT module-attribute

GeoParquetEncodingT = Literal['wkb', 'geoarrow']

Acceptable strings to be passed into the encoding parameter for GeoParquetWriter.

geoarrow.rust.io.PathInput

Bases: TypedDict

path instance-attribute

path: str

The path to the file.

size instance-attribute

size: int

The size of the file in bytes.

If this is provided, only bounded range requests will be made instead of suffix requests. This is useful for object stores that do not support suffix requests, in particular Azure.
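PathInput is a plain dict conforming to this TypedDict. A runnable sketch of constructing one — the class is re-declared locally for illustration (the real one lives in geoarrow.rust.io), and `size` is assumed required here as documented above:

```python
from typing import TypedDict

# Local re-declaration of PathInput for illustration only. Supplying size
# lets the reader issue bounded range requests instead of suffix requests,
# which some object stores (notably Azure) do not support.
class PathInput(TypedDict):
    path: str
    size: int

path_input: PathInput = {
    "path": "data/example.parquet",  # hypothetical path
    "size": 1_048_576,               # file size in bytes
}
# With the real library this could then be passed to GeoParquetFile.open:
#   GeoParquetFile.open(path_input, store)
```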