GeoParquet¶
Read and write GeoParquet files.
geoarrow.rust.io.GeoParquetDataset ¶
An interface to read from a collection of GeoParquet files with the same schema.
crs ¶
crs(column_name: str | None = None) -> CRS | None
Access the CRS of this dataset.
Parameters:
- column_name (str | None, default: None) – The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.
open classmethod ¶
open(
paths: Sequence[str] | Sequence[PathInput], store: ObjectStore
) -> GeoParquetDataset
Construct a new GeoParquetDataset.
This will synchronously fetch metadata from all listed files.
Parameters:
- paths (Sequence[str] | Sequence[PathInput]) – a list of string URLs to read from.
- store (ObjectStore) – the file system interface to read from.
Returns:
- GeoParquetDataset – A new GeoParquetDataset object.
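For illustration, a minimal sketch of opening a dataset over several files. The directory path, file names, and the LocalStore constructor (from the separate obstore package) are assumptions; substitute whatever ObjectStore implementation your installation provides.

from geoarrow.rust.io import GeoParquetDataset
from obstore.store import LocalStore  # assumed object store implementation

# Hypothetical directory containing same-schema GeoParquet files.
store = LocalStore("/data/geoparquet")
paths = ["part-0.parquet", "part-1.parquet"]

# Synchronously fetches GeoParquet metadata for every listed file.
dataset = GeoParquetDataset.open(paths, store)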
read ¶
read(
*,
bbox: Sequence[int | float] | None = None,
parse_to_native: bool = True,
coord_type: CoordTypeInput | None = None,
batch_size: int | None = None
) -> Table
Perform a synchronous read with the given options
Other Parameters:
- bbox (Sequence[int | float] | None) – The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.
- parse_to_native (bool) – If True, the data will be parsed to native Arrow types. Defaults to True.
- coord_type (CoordTypeInput | None) – The coordinate type to use. Defaults to separated coordinates.
- batch_size (int | None) – The number of rows in each internal batch of the table. Defaults to 1024.
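As an example, a bbox-filtered read against the dataset opened in the sketch above; the coordinate values are placeholders and assume a GeoParquet 1.1 source with a bounding box column or native geometry encoding.

# Read only features intersecting a (minx, miny, maxx, maxy) window.
table = dataset.read(
    bbox=(-122.5, 37.7, -122.3, 37.9),  # placeholder bounds
    batch_size=2048,
)
print(table.num_rows)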
read_async async ¶
read_async(
*,
bbox: Sequence[int | float] | None = None,
parse_to_native: bool = True,
coord_type: CoordTypeInput | None = None,
batch_size: int | None = None
) -> Table
Perform an async read with the given options
Other Parameters:
- bbox (Sequence[int | float] | None) – The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.
- parse_to_native (bool) – If True, the data will be parsed to native Arrow types. Defaults to True.
- coord_type (CoordTypeInput | None) – The coordinate type to use. Defaults to separated coordinates.
- batch_size (int | None) – The number of rows in each internal batch of the table. Defaults to 1024.
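A sketch of the async variant, reusing the paths and store from the earlier sketch; note that open() itself still fetches metadata synchronously.

import asyncio

async def main() -> None:
    dataset = GeoParquetDataset.open(paths, store)
    table = await dataset.read_async(bbox=(-122.5, 37.7, -122.3, 37.9))
    print(table.num_rows)

asyncio.run(main())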
schema_arrow ¶
schema_arrow(
*, parse_to_native: bool = True, coord_type: CoordTypeInput | None = None
) -> Schema
Access the Arrow schema of the generated data.
Parameters:
- parse_to_native (bool, default: True) – If True, the schema will be parsed to native Arrow types. Defaults to True.
- coord_type (CoordTypeInput | None, default: None) – The coordinate type to use. Defaults to separated coordinates.
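For example, inspecting the schema that reads will produce (using the dataset from the sketches above):

# Geometry columns are reported as native GeoArrow types with the default
# separated coordinate layout.
schema = dataset.schema_arrow(parse_to_native=True)
print(schema)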
geoarrow.rust.io.GeoParquetFile ¶
crs ¶
crs(column_name: str | None = None) -> CRS | None
Access the CRS of this file.
Parameters:
- column_name (str | None, default: None) – The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.
file_bbox ¶
Access the bounding box of the given column for the entire file
If no column name is passed, retrieves the bbox from the primary geometry column.
An error will be returned if the column name does not exist in the dataset. None will be returned if the metadata does not contain bounding box information.
open classmethod ¶
open(path: str | PathInput, store: ObjectStore) -> GeoParquetFile
Open a Parquet file from the given path.
This will synchronously fetch metadata from the provided path.
Parameters:
- path (str | PathInput) – a string URL to read from.
- store (ObjectStore) – the object store interface to read from.
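A minimal sketch of opening a single file; the bucket name, region, and object path are hypothetical, and the S3Store constructor comes from the separate obstore package (substitute your own ObjectStore implementation).

from geoarrow.rust.io import GeoParquetFile
from obstore.store import S3Store  # assumed object store implementation

store = S3Store("my-bucket", region="us-east-1")  # hypothetical bucket
file = GeoParquetFile.open("cities.parquet", store)
print(file.crs())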
open_async async classmethod ¶
open_async(path: str | PathInput, store: ObjectStore) -> GeoParquetFile
Open a Parquet file from the given path asynchronously.
This will fetch metadata from the provided path in an async manner.
Parameters:
- path (str | PathInput) – a string URL to read from.
- store (ObjectStore) – the object store interface to read from.
read ¶
read(
*,
batch_size: int | None = None,
limit: int | None = None,
offset: int | None = None,
bbox: Sequence[int | float] | None = None,
parse_to_native: bool = True,
coord_type: CoordTypeInput | None = None
) -> Table
Perform a synchronous read with the given options
Other Parameters:
- bbox (Sequence[int | float] | None) – The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.
- parse_to_native (bool) – If True, the data will be parsed to native Arrow types. Defaults to True.
- coord_type (CoordTypeInput | None) – The coordinate type to use. Defaults to separated coordinates.
- batch_size (int | None) – The number of rows in each internal batch of the table. Defaults to 1024.
- limit (int | None) – The maximum number of rows to read. Defaults to None, which means all rows will be read.
- offset (int | None) – The number of rows to skip before starting to read. Defaults to None, which means no rows will be skipped.
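For instance, paging through the file opened in the sketch above with limit and offset (the page size is arbitrary):

first_page = file.read(limit=10_000, offset=0)
second_page = file.read(limit=10_000, offset=10_000)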
read_async async ¶
read_async(
*,
bbox: Sequence[int | float] | None = None,
parse_to_native: bool = True,
coord_type: CoordTypeInput | None = None,
batch_size: int | None = None,
limit: int | None = None,
offset: int | None = None
) -> Table
Perform an async read with the given options
Other Parameters:
- bbox (Sequence[int | float] | None) – The 2D bounding box to use for spatially-filtered reads. Requires the source GeoParquet dataset to be version 1.1 with either a bounding box column or native geometry encoding. Defaults to None.
- parse_to_native (bool) – If True, the data will be parsed to native Arrow types. Defaults to True.
- coord_type (CoordTypeInput | None) – The coordinate type to use. Defaults to separated coordinates.
- batch_size (int | None) – The number of rows in each internal batch of the table. Defaults to 1024.
- limit (int | None) – The maximum number of rows to read. Defaults to None, which means all rows will be read.
- offset (int | None) – The number of rows to skip before starting to read. Defaults to None, which means no rows will be skipped.
row_group_bounds ¶
row_groups_bounds ¶
Get the bounds of all row groups.
As of GeoParquet 1.1 you won't need to pass in the column name, as it will be specified in the metadata.
Parameters:
- column_name (str | None, default: None) – The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.
Returns:
- Array – A geoarrow "box" array with the bounds of all row groups.
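A short illustration, assuming the file from the open() sketch above and a GeoParquet 1.1 source (so no column name is needed):

# One "box" per row group; useful for deciding which row groups a given
# bbox query would need to touch.
bounds = file.row_groups_bounds()
print(len(bounds))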
schema_arrow ¶
schema_arrow(
*, parse_to_native: bool = True, coord_type: CoordTypeInput | None = None
) -> Schema
Access the Arrow schema of the generated data.
Parameters:
- parse_to_native (bool, default: True) – If True, the schema will be parsed to native Arrow types. Defaults to True.
- coord_type (CoordTypeInput | None, default: None) – The coordinate type to use. Defaults to separated coordinates.
geoarrow.rust.io.GeoParquetWriter ¶
Writer interface for a single GeoParquet file.
This allows you to write GeoParquet files that are larger than memory.
__init__ ¶
__init__(
file: str | Path | BinaryIO,
schema: ArrowSchemaExportable,
*,
encoding: GeoParquetEncoding | GeoParquetEncodingT = WKB,
compression: (
Literal["uncompressed", "snappy", "lzo", "lz4", "lz4_raw"] | str
) = "zstd(1)",
writer_version: Literal["parquet_1_0", "parquet_2_0"] = "parquet_2_0"
) -> None
Create a new GeoParquetWriter.
Note
This currently only supports writing to local files. Directly writing to object stores will be supported in a future release.
Parameters:
- file (str | Path | BinaryIO) – the path to the file or a Python file object in binary write mode.
- schema (ArrowSchemaExportable) – the Arrow schema of the data to write.
Other Parameters:
- encoding (GeoParquetEncoding | GeoParquetEncodingT) – the geometry encoding to use. See GeoParquetEncoding for more details on supported geometry encodings.
- compression (Literal['uncompressed', 'snappy', 'lzo', 'lz4', 'lz4_raw'] | str) – the compression algorithm to use. This can be either one of the strings in the Literal type, or a string that contains the compression level, like gzip(9), brotli(11), or zstd(22). The default is zstd(1).
- writer_version (Literal['parquet_1_0', 'parquet_2_0']) – the Parquet writer version to use. Defaults to "parquet_2_0".
close ¶
close() -> None
Close this file.
This is required to ensure that all data is flushed to disk and the file is properly finalized.
The recommended use of this class is as a context manager, which will close the file automatically.
write_batch ¶
write_batch(batch: ArrowArrayExportable) -> None
Write a single RecordBatch to the GeoParquet file.
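A sketch of batch-at-a-time writes, reusing the file from the GeoParquetFile sketches above; the output file name is arbitrary, and the pyarrow conversion is just one way to obtain record batches whose schema carries GeoArrow metadata.

import pyarrow as pa
from geoarrow.rust.io import GeoParquetWriter

table = pa.table(file.read())  # convert via the Arrow C stream interface (recent pyarrow)

writer = GeoParquetWriter("copy.parquet", table.schema)
for batch in table.to_batches():
    writer.write_batch(batch)
writer.close()  # flushes buffered data and finalizes the file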
write_table ¶
write_table(table: ArrowArrayExportable | ArrowStreamExportable) -> None
Write a table or stream of batches to the Parquet file
This accepts an Arrow RecordBatch, Table, or RecordBatchReader. If a RecordBatchReader is passed, only one batch at a time will be materialized in memory, allowing you to write large datasets without running out of memory.
Parameters:
- table (ArrowArrayExportable | ArrowStreamExportable) – the table or stream of record batches to write.
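The same write expressed with the writer used as a context manager, which closes the file automatically; table is the pyarrow Table from the previous sketch and the file name is arbitrary.

from geoarrow.rust.io import GeoParquetWriter

with GeoParquetWriter("out.parquet", table.schema, encoding="wkb") as writer:
    # Accepts a RecordBatch, Table, or RecordBatchReader; a reader is
    # consumed one batch at a time.
    writer.write_table(table)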
geoarrow.rust.io.enums.GeoParquetEncoding ¶
Bases: StrEnum
Options for geometry encoding in GeoParquet.
GEOARROW class-attribute instance-attribute ¶
GEOARROW = 'geoarrow'
Use native GeoArrow-based geometry types when writing GeoParquet files.
Note
GeoParquet ecosystem support is not as widespread for the GeoArrow encoding as for the WKB encoding.
This is only valid when all geometries are one of the supported single-geometry type encodings (i.e., "point", "linestring", "polygon", "multipoint", "multilinestring", "multipolygon").
Using this encoding may provide better performance. Performance is most likely to be improved when writing points. Writing points plus an external bounding-box column requires storing each x-y coordinate pair three times instead of once, so this encoding could provide significant file size savings. There has not yet been widespread testing for other geometry types.
These encodings correspond to the separated (struct) representation of coordinates for single-geometry type encodings. This encoding results in useful column statistics when row groups and/or files contain related features.
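For illustration, requesting the GeoArrow encoding when constructing a writer; the file name and table are carried over from the writer sketches above, and the table is assumed to contain a single supported geometry type.

from geoarrow.rust.io import GeoParquetWriter
from geoarrow.rust.io.enums import GeoParquetEncoding

with GeoParquetWriter(
    "native.parquet", table.schema, encoding=GeoParquetEncoding.GEOARROW
) as writer:
    writer.write_table(table)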
WKB class-attribute instance-attribute ¶
WKB = 'wkb'
Use Well-Known Binary (WKB) encoding when writing GeoParquet files.
This is the preferred option for maximum portability. See upstream specification reference.
geoarrow.rust.io.types.GeoParquetEncodingT module-attribute ¶
GeoParquetEncodingT = Literal['wkb', 'geoarrow']
Acceptable strings to be passed into the encoding parameter for GeoParquetWriter.