GeoParquet
Read and write GeoParquet files.
geoarrow.rust.io.read_parquet
read_parquet(
path: Union[str, Path, BinaryIO],
*,
store: Optional[ObjectStore] = None,
batch_size: int = 65536
) -> Table
Read a GeoParquet file from a local or remote path into an Arrow Table.
Examples:
Reading from a local path:
from geoarrow.rust.io import read_parquet
table = read_parquet("path/to/file.parquet")
Reading from an HTTP(S) url:
from geoarrow.rust.io import read_parquet
url = "https://raw.githubusercontent.com/opengeospatial/geoparquet/v1.0.0/examples/example.parquet"
table = read_parquet(url)
Reading from a remote file on an S3 bucket:
from geoarrow.rust.io import ObjectStore, read_parquet
options = {
"aws_access_key_id": "...",
"aws_secret_access_key": "...",
"aws_region": "..."
}
store = ObjectStore('s3://bucket', options=options)
table = read_parquet("path/in/bucket.parquet", store=store)
Parameters:
- path (Union[str, Path, BinaryIO]) – the path to the file.
- store (Optional[ObjectStore], default: None) – the ObjectStore to read from.
- batch_size (int, default: 65536) – the number of rows to include in each internal batch of the table.
Returns:
- Table – Table from the GeoParquet file.
geoarrow.rust.io.read_parquet_async (async)
read_parquet_async(
path: Union[str, Path, BinaryIO],
*,
store: Optional[ObjectStore] = None,
batch_size: int = 65536
) -> Table
Read a GeoParquet file from a local or remote path into an Arrow Table, asynchronously.
Examples:
Reading from an HTTP(S) url:
from geoarrow.rust.io import read_parquet_async
url = "https://raw.githubusercontent.com/opengeospatial/geoparquet/v1.0.0/examples/example.parquet"
table = await read_parquet_async(url)
Reading from a remote file on an S3 bucket:
from geoarrow.rust.io import ObjectStore, read_parquet_async
options = {
"aws_access_key_id": "...",
"aws_secret_access_key": "...",
"aws_region": "..."
}
store = ObjectStore('s3://bucket', options=options)
table = await read_parquet_async("path/in/bucket.parquet", store=store)
Parameters:
- path (Union[str, Path, BinaryIO]) – the path to the file.
- store (Optional[ObjectStore], default: None) – the ObjectStore to read from.
- batch_size (int, default: 65536) – the number of rows to include in each internal batch of the table.
Returns:
- Table – Table from the GeoParquet file.
geoarrow.rust.io.write_parquet
write_parquet(
table: ArrowStreamExportable,
file: Union[str, Path, BinaryIO],
*,
encoding: GeoParquetEncoding | GeoParquetEncodingT = GeoParquetEncoding.WKB
) -> None
Write an Arrow RecordBatch, Table, or RecordBatchReader to a GeoParquet file on disk.
If a RecordBatchReader is passed, only one batch at a time will be materialized in memory.
Parameters:
- table (ArrowStreamExportable) – the table to write.
- file (Union[str, Path, BinaryIO]) – the path to the file or a Python file object in binary write mode.
- encoding (GeoParquetEncoding | GeoParquetEncodingT, default: GeoParquetEncoding.WKB) – the geometry encoding to use.
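Example:
A minimal usage sketch (file paths are placeholders):
from geoarrow.rust.io import read_parquet, write_parquet

table = read_parquet("input.parquet")
# Write with the default WKB encoding
write_parquet(table, "output.parquet")
# Opt into the native GeoArrow encoding introduced in GeoParquet 1.1
write_parquet(table, "output_native.parquet", encoding="native")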
geoarrow.rust.io.ParquetDataset
__init__
crs
crs(column_name: str | None = None) -> CRS
Access the CRS of this file.
Parameters:
- column_name (str | None, default: None) – the geometry column name. If there is more than one geometry column in the file, you must specify which you want to read.
Returns:
- CRS – the CRS of the given geometry column.
read
read(
*,
batch_size: int | None = None,
limit: int | None = None,
offset: int | None = None,
bbox: Sequence[int | float] | None = None,
bbox_paths: BboxCovering | None = None
) -> Table
Perform a synchronous read with the given options.
Parameters:
- batch_size (int | None, default: None) – the number of rows to include in each internal batch of the table.
- limit (int | None, default: None) – the maximum number of rows to read.
- offset (int | None, default: None) – the number of rows to skip before reading.
- bbox (Sequence[int | float] | None, default: None) – a bounding box (minx, miny, maxx, maxy) used to spatially filter the read.
- bbox_paths (BboxCovering | None, default: None) – the column names of the per-row bounding box covering, for files whose metadata doesn't already specify it.
Returns:
- Table – a Table with the rows matching the given options.
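For example, a sketch assuming dataset is an already-constructed ParquetDataset and that the bbox values are in the dataset's coordinate system:
# Read only the rows that fall in the bounding box, in batches of 10,000 rows
table = dataset.read(bbox=[-120.0, 30.0, -100.0, 50.0], batch_size=10_000)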
read_async (async)
read_async(
*,
batch_size: int | None = None,
limit: int | None = None,
offset: int | None = None,
bbox: Sequence[int | float] | None = None,
bbox_paths: BboxCovering | None = None
) -> Table
Perform an asynchronous read with the given options.
Parameters:
- batch_size (int | None, default: None) – the number of rows to include in each internal batch of the table.
- limit (int | None, default: None) – the maximum number of rows to read.
- offset (int | None, default: None) – the number of rows to skip before reading.
- bbox (Sequence[int | float] | None, default: None) – a bounding box (minx, miny, maxx, maxy) used to spatially filter the read.
- bbox_paths (BboxCovering | None, default: None) – the column names of the per-row bounding box covering, for files whose metadata doesn't already specify it.
Returns:
- Table – a Table with the rows matching the given options.
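The async variant only differs in being awaited (again assuming an existing dataset instance):
import asyncio

async def main():
    # Same options as the synchronous read, but non-blocking
    return await dataset.read_async(limit=1_000)

table = asyncio.run(main())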
geoarrow.rust.io.ParquetFile
__init__
__init__(path: str, store: ObjectStore) -> None
Construct a new ParquetFile.
This will synchronously fetch metadata from the provided path.
Parameters:
- path (str) – a string URL to read from.
- store (ObjectStore) – the file system interface to read from.
Returns:
- None – a new ParquetFile object.
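For example (a sketch; the bucket, path, and options are placeholders):
from geoarrow.rust.io import ObjectStore, ParquetFile

store = ObjectStore("s3://bucket", options={"aws_region": "..."})
parquet_file = ParquetFile("path/in/bucket.parquet", store=store)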
crs
crs(column_name: str | None = None) -> CRS
Access the CRS of this file.
Parameters:
- column_name (str | None, default: None) – the geometry column name. If there is more than one geometry column in the file, you must specify which you want to read.
Returns:
- CRS – the CRS of the given geometry column.
file_bbox
Access the bounding box of the given column for the entire file.
If no column name is passed, retrieves the bbox from the primary geometry column.
An error will be raised if the column name does not exist in the dataset. None will be returned if the metadata does not contain bounding box information.
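For example, continuing from the parquet_file constructed above:
# Bounding box of the primary geometry column, or None if not in the metadata
bbox = parquet_file.file_bbox()
if bbox is not None:
    minx, miny, maxx, maxy = bbox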
read
read(
*,
batch_size: int | None = None,
limit: int | None = None,
offset: int | None = None,
bbox: Sequence[int | float] | None = None,
bbox_paths: BboxCovering | None = None
) -> Table
Perform a synchronous read with the given options.
Parameters:
- batch_size (int | None, default: None) – the number of rows to include in each internal batch of the table.
- limit (int | None, default: None) – the maximum number of rows to read.
- offset (int | None, default: None) – the number of rows to skip before reading.
- bbox (Sequence[int | float] | None, default: None) – a bounding box (minx, miny, maxx, maxy) used to spatially filter the read.
- bbox_paths (BboxCovering | None, default: None) – the column names of the per-row bounding box covering, for files whose metadata doesn't already specify it.
Returns:
- Table – a Table with the rows matching the given options.
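For example, to spatially filter the read (the bbox values are placeholders in the file's coordinate system):
# Read only the rows that fall in the given bounding box
table = parquet_file.read(bbox=[-120.0, 30.0, -100.0, 50.0])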
read_async (async)
read_async(
*,
batch_size: int | None = None,
limit: int | None = None,
offset: int | None = None,
bbox: Sequence[int | float] | None = None,
bbox_paths: BboxCovering | None = None
) -> Table
Perform an asynchronous read with the given options.
Parameters:
- batch_size (int | None, default: None) – the number of rows to include in each internal batch of the table.
- limit (int | None, default: None) – the maximum number of rows to read.
- offset (int | None, default: None) – the number of rows to skip before reading.
- bbox (Sequence[int | float] | None, default: None) – a bounding box (minx, miny, maxx, maxy) used to spatially filter the read.
- bbox_paths (BboxCovering | None, default: None) – the column names of the per-row bounding box covering, for files whose metadata doesn't already specify it.
Returns:
- Table – a Table with the rows matching the given options.
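As with ParquetDataset, a sketch of the awaited variant:
import asyncio

async def fetch():
    # Non-blocking read of the first 1,000 rows
    return await parquet_file.read_async(limit=1_000)

table = asyncio.run(fetch())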
row_group_bounds
row_group_bounds(
row_group_idx: int, bbox_paths: BboxCovering | None = None
) -> List[float]
Get the bounds of a single row group.
Parameters:
- row_group_idx (int) – the row group index.
- bbox_paths (BboxCovering | None, default: None) – the column names of the per-row bounding box covering. For files written with GeoParquet 1.1 covering metadata, you don't need to pass these column names, as they're specified in the metadata.
Returns:
- List[float] – the bounds of the given row group.
row_groups_bounds
row_groups_bounds(bbox_paths: BboxCovering | None = None) -> NativeArray
Get the bounds of all row groups.
As of GeoParquet 1.1, you won't need to pass in these column names, as they'll be specified in the metadata.
Parameters:
- bbox_paths (BboxCovering | None, default: None) – the column names of the per-row bounding box covering. For files written with GeoParquet 1.1 covering metadata, you don't need to pass these column names, as they're specified in the metadata.
Returns:
- NativeArray – a geoarrow "box" array with the bounds of all row groups.
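For example, continuing with the parquet_file from above:
# Bounds of the first row group as [minx, miny, maxx, maxy]
first_bounds = parquet_file.row_group_bounds(0)
# Bounds of every row group as a geoarrow "box" array
all_bounds = parquet_file.row_groups_bounds()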
geoarrow.rust.io.ParquetWriter
Writer interface for a single Parquet file.
This allows you to write Parquet files that are larger than memory.
close
close() -> None
Close this file.
The recommended use of this class is as a context manager, which will close the file automatically.
write_batch
write_batch(batch: ArrowArrayExportable) -> None
Write a single RecordBatch to the Parquet file.
write_table
write_table(table: ArrowArrayExportable | ArrowStreamExportable) -> None
Write a table or stream of batches to the Parquet file.
This accepts an Arrow RecordBatch, Table, or RecordBatchReader. If a RecordBatchReader is passed, only one batch at a time will be materialized in memory.
Parameters:
- table (ArrowArrayExportable | ArrowStreamExportable) – the table or stream of batches to write.
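A usage sketch as a context manager. The constructor arguments shown here (a target path plus the schema of the incoming batches) are an assumption; check the constructor signature:
from geoarrow.rust.io import ParquetWriter, read_parquet

table = read_parquet("input.parquet")
# Assumed constructor: target file and the schema of the batches to write
with ParquetWriter("output.parquet", table.schema) as writer:
    writer.write_table(table)
# The context manager closes the file automatically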
geoarrow.rust.io.types.BboxCovering
Bases: TypedDict
Column names for the per-row bounding box covering used in spatial partitioning.
The spatial partitioning defined in GeoParquet 1.1 allows for a "covering" field. The covering should be four float columns that represent the bounding box of each row of the data.
As of GeoParquet 1.1, this metadata is included in the Parquet file itself, but this typed dict can be used with spatially-partitioned GeoParquet datasets that do not write GeoParquet 1.1 metadata. Providing this information is unnecessary for GeoParquet 1.1 files with included covering information.
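A construction sketch. The field names (xmin, ymin, xmax, ymax, each holding the path to one bounding-box column) are an assumption based on the GeoParquet 1.1 covering layout; verify them against the TypedDict definition:
from geoarrow.rust.io.types import BboxCovering

# Assumed fields: each value is the path to a float column in the file
covering: BboxCovering = {
    "xmin": ["bbox", "xmin"],
    "ymin": ["bbox", "ymin"],
    "xmax": ["bbox", "xmax"],
    "ymax": ["bbox", "ymax"],
}
# Pass the covering to a spatially-filtered read of a pre-1.1 dataset
table = parquet_file.read(bbox=[-120.0, 30.0, -100.0, 50.0], bbox_paths=covering)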
geoarrow.rust.io.enums.GeoParquetEncoding
Bases: StrEnum
Options for geometry encoding in GeoParquet.
WKB (class attribute)
WKB = auto()
Use WKB (well-known binary) geometry encoding when writing GeoParquet files. This is the default, and is the most widely supported encoding.
Native (class attribute)
Native = auto()
Use native GeoArrow geometry types when writing GeoParquet files.
Supported as of GeoParquet version 1.1.
This option provides better read and write performance and allows spatial partitioning to be inferred from remote files, but it does not yet have widespread support.
geoarrow.rust.io.types.GeoParquetEncodingT (module attribute)
GeoParquetEncodingT = Literal['wkb', 'native']
Acceptable strings to be passed into the encoding parameter for write_parquet.
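The strings map directly to the enum members, so these two calls are equivalent (paths are placeholders):
from geoarrow.rust.io import read_parquet, write_parquet
from geoarrow.rust.io.enums import GeoParquetEncoding

table = read_parquet("input.parquet")
write_parquet(table, "output.parquet", encoding="wkb")
write_parquet(table, "output.parquet", encoding=GeoParquetEncoding.WKB)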