Functions

Read from and write to geospatial file formats.

geoarrow.rust.io

ObjectStore

ParquetDataset

num_row_groups property

num_row_groups: int

The total number of row groups across all files.

num_rows property

num_rows: int

The total number of rows across all files.

schema_arrow property

schema_arrow: Schema

Access the Arrow schema of the generated data.

crs

crs(column_name: str | None = None) -> CRS

Access the CRS of this dataset.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

Returns:

  • CRS

    The coordinate reference system of the geometry column.

read

read(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[IntFloat] | None = None,
    bbox_paths: BboxPaths | None = None
) -> Table

Perform a sync read with the given options.

Parameters:

  • batch_size (int | None, default: None ) –

    The number of rows to include in each internal batch of the table. Defaults to None.

  • limit (int | None, default: None ) –

    The maximum number of rows to read. Defaults to None.

  • offset (int | None, default: None ) –

    The number of rows to skip before starting to read. Defaults to None.

  • bbox (Sequence[IntFloat] | None, default: None ) –

    A spatial filter for reading rows, of the format (minx, miny, maxx, maxy). Defaults to None.

  • bbox_paths (BboxPaths | None, default: None ) –

    The paths to the bounding box columns used to evaluate the bbox filter. For files written with spatial partitioning, you don't need to pass these, as they'll be specified in the metadata. Defaults to None.

Returns:

  • Table

    Table with the requested data.
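
For example, a synchronous read with a bounding-box filter might look like the sketch below, assuming dataset is an existing ParquetDataset instance (its construction is not covered in this section) and using hypothetical coordinates:

# Assumes `dataset` is an existing ParquetDataset; the bbox values are
# hypothetical and must be in the dataset's CRS as (minx, miny, maxx, maxy).
table = dataset.read(
    bbox=[-105.0, 39.0, -104.0, 40.0],
    batch_size=65536,
)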

read_async async

read_async(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[IntFloat] | None = None,
    bbox_paths: BboxPaths | None = None
) -> Table

Perform an async read with the given options.

Parameters:

  • batch_size (int | None, default: None ) –

    The number of rows to include in each internal batch of the table. Defaults to None.

  • limit (int | None, default: None ) –

    The maximum number of rows to read. Defaults to None.

  • offset (int | None, default: None ) –

    The number of rows to skip before starting to read. Defaults to None.

  • bbox (Sequence[IntFloat] | None, default: None ) –

    A spatial filter for reading rows, of the format (minx, miny, maxx, maxy). Defaults to None.

  • bbox_paths (BboxPaths | None, default: None ) –

    The paths to the bounding box columns used to evaluate the bbox filter. For files written with spatial partitioning, you don't need to pass these, as they'll be specified in the metadata. Defaults to None.

Returns:

  • Table

    Table with the requested data.
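
The async variant must be awaited inside an event loop; a minimal sketch, again assuming an existing dataset instance:

# Assumes `dataset` is an existing ParquetDataset; reads at most 1000 rows.
table = await dataset.read_async(limit=1000)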

ParquetFile

num_row_groups property

num_row_groups: int

The number of row groups in this file.

num_rows property

num_rows: int

The number of rows in this file.

schema_arrow property

schema_arrow: Schema

Access the Arrow schema of the generated data.

crs

crs(column_name: str | None = None) -> CRS

Access the CRS of this file.

Parameters:

  • column_name (str | None, default: None ) –

    The geometry column name. If there is more than one geometry column in the file, you must specify which you want to read. Defaults to None.

Returns:

  • CRS

    The coordinate reference system of the geometry column.

file_bbox

file_bbox() -> List[float] | None

Access the bounding box of the given column for the entire file.

If no column name is passed, retrieves the bbox from the primary geometry column.

An error will be raised if the column name does not exist in the dataset. None will be returned if the metadata does not contain bounding box information.
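
A minimal sketch, assuming parquet_file is an existing ParquetFile instance:

# Returns None if the file metadata contains no bounding box information.
bbox = parquet_file.file_bbox()
if bbox is not None:
    minx, miny, maxx, maxy = bbox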

read

read(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[IntFloat] | None = None,
    bbox_paths: BboxPaths | None = None
) -> Table

Perform a sync read with the given options.

Parameters:

  • batch_size (int | None, default: None ) –

    The number of rows to include in each internal batch of the table. Defaults to None.

  • limit (int | None, default: None ) –

    The maximum number of rows to read. Defaults to None.

  • offset (int | None, default: None ) –

    The number of rows to skip before starting to read. Defaults to None.

  • bbox (Sequence[IntFloat] | None, default: None ) –

    A spatial filter for reading rows, of the format (minx, miny, maxx, maxy). Defaults to None.

  • bbox_paths (BboxPaths | None, default: None ) –

    The paths to the bounding box columns used to evaluate the bbox filter. For files written with spatial partitioning, you don't need to pass these, as they'll be specified in the metadata. Defaults to None.

Returns:

  • Table

    Table with the requested data.

read_async async

read_async(
    *,
    batch_size: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
    bbox: Sequence[IntFloat] | None = None,
    bbox_paths: BboxPaths | None = None
) -> Table

Perform an async read with the given options.

Parameters:

  • batch_size (int | None, default: None ) –

    The number of rows to include in each internal batch of the table. Defaults to None.

  • limit (int | None, default: None ) –

    The maximum number of rows to read. Defaults to None.

  • offset (int | None, default: None ) –

    The number of rows to skip before starting to read. Defaults to None.

  • bbox (Sequence[IntFloat] | None, default: None ) –

    A spatial filter for reading rows, of the format (minx, miny, maxx, maxy). Defaults to None.

  • bbox_paths (BboxPaths | None, default: None ) –

    The paths to the bounding box columns used to evaluate the bbox filter. For files written with spatial partitioning, you don't need to pass these, as they'll be specified in the metadata. Defaults to None.

Returns:

  • Table

    Table with the requested data.

row_group_bounds

row_group_bounds(
    row_group_idx: int, bbox_paths: BboxPaths | None = None
) -> List[float]

Get the bounds of a single row group.

Parameters:

  • row_group_idx (int) –

    The row group index.

  • bbox_paths (BboxPaths | None, default: None ) –

    The paths to the bounding box columns. For files written with spatial partitioning, you don't need to pass in these column names, as they'll be specified in the metadata. Defaults to None.

Returns:

  • List[float]

    The bounds of a single row group.

row_groups_bounds

row_groups_bounds(bbox_paths: BboxPaths | None = None) -> GeometryArray

Get the bounds of all row groups.

As of GeoParquet 1.1 you won't need to pass in these column names, as they'll be specified in the metadata.

Parameters:

  • bbox_paths (BboxPaths | None, default: None ) –

    The paths to the bounding box columns. For files written with spatial partitioning, you don't need to pass in these column names, as they'll be specified in the metadata. Defaults to None.

Returns:

  • GeometryArray

    A geoarrow "box" array with bounds of all row groups.
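
A short sketch covering this method and row_group_bounds above, assuming parquet_file is an existing ParquetFile instance:

# Bounds of the first row group as [minx, miny, maxx, maxy].
first_bounds = parquet_file.row_group_bounds(0)

# A geoarrow "box" array with one bounding box per row group.
all_bounds = parquet_file.row_groups_bounds()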

ParquetWriter

Writer interface for a single Parquet file.

This allows you to write Parquet files that are larger than memory.

close

close() -> None

Close this file.

The recommended use of this class is as a context manager, which will close the file automatically.

is_closed

is_closed() -> bool

Returns True if the file has already been closed.

write_batch

write_batch(batch: ArrowArrayExportable) -> None

Write a single RecordBatch to the Parquet file.

write_table

write_table(table: ArrowArrayExportable | ArrowStreamExportable) -> None

Write a table or stream of batches to the Parquet file.

This accepts an Arrow RecordBatch, Table, or RecordBatchReader. If a RecordBatchReader is passed, only one batch at a time will be materialized in memory.

Parameters:

  • table (ArrowArrayExportable | ArrowStreamExportable) –

    the Arrow RecordBatch, Table, or RecordBatchReader to write.
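
A sketch of streaming data to disk, assuming writer is a ParquetWriter created elsewhere (its constructor is not covered in this section) and reader is an Arrow RecordBatchReader:

# Using the writer as a context manager closes the file automatically.
with writer:
    # Only one batch of `reader` is materialized in memory at a time.
    writer.write_table(reader)

assert writer.is_closed()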

read_csv

read_csv(
    file: str | Path | BinaryIO,
    geometry_column_name: str,
    *,
    batch_size: int = 65536
) -> Table

Read a CSV file from a path on disk into a Table.

Parameters:

  • file (str | Path | BinaryIO) –

    the path to the file or a Python file object in binary read mode.

  • geometry_column_name (str) –

    the name of the geometry column within the CSV.

  • batch_size (int, default: 65536 ) –

    the number of rows to include in each internal batch of the table.

Returns:

  • Table

    Table from CSV file.
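
For example, using the import path from the other examples on this page and a hypothetical CSV with a geometry column named "geometry":

from geoarrow.rust.core import read_csv

table = read_csv("path/to/file.csv", geometry_column_name="geometry")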

read_flatgeobuf

read_flatgeobuf(
    file: Union[str, Path, BinaryIO],
    *,
    fs: Optional[ObjectStore] = None,
    batch_size: int = 65536,
    bbox: Tuple[float, float, float, float] | None = None
) -> Table

Read a FlatGeobuf file from a path on disk or a remote location into an Arrow Table.

Example:

Reading from a local path:

from geoarrow.rust.core import read_flatgeobuf
table = read_flatgeobuf("path/to/file.fgb")

Reading from a Python file object:

from geoarrow.rust.core import read_flatgeobuf

with open("path/to/file.fgb", "rb") as file:
    table = read_flatgeobuf(file)

Reading from an HTTP(S) url:

from geoarrow.rust.core import read_flatgeobuf

url = "http://flatgeobuf.org/test/data/UScounties.fgb"
table = read_flatgeobuf(url)

Reading from a remote file on an S3 bucket:

from geoarrow.rust.core import ObjectStore, read_flatgeobuf

options = {
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_region": "..."
}
fs = ObjectStore('s3://bucket', options=options)
table = read_flatgeobuf("path/in/bucket.fgb", fs=fs)

Parameters:

  • file (Union[str, Path, BinaryIO]) –

    the path to the file or a Python file object in binary read mode.

Other Parameters:

  • fs (Optional[ObjectStore]) –

    an ObjectStore instance for this url. This is required only if the file is at a remote location.

  • batch_size (int) –

    the number of rows to include in each internal batch of the table.

  • bbox (Tuple[float, float, float, float] | None) –

    A spatial filter for reading rows, of the format (minx, miny, maxx, maxy). If set to None, no spatial filter is applied.

Returns:

  • Table

    Table from FlatGeobuf file.

read_flatgeobuf_async async

read_flatgeobuf_async(
    path: str,
    *,
    fs: Optional[ObjectStore] = None,
    batch_size: int = 65536,
    bbox: Tuple[float, float, float, float] | None = None
) -> Table

Read a FlatGeobuf file from a url into an Arrow Table.

Example:

Reading from an HTTP(S) url:

from geoarrow.rust.core import read_flatgeobuf_async

url = "http://flatgeobuf.org/test/data/UScounties.fgb"
table = await read_flatgeobuf_async(url)

Reading from an S3 bucket:

from geoarrow.rust.core import ObjectStore, read_flatgeobuf_async

options = {
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_region": "..."
}
fs = ObjectStore('s3://bucket', options=options)
table = await read_flatgeobuf_async("path/in/bucket.fgb", fs=fs)

Parameters:

  • path (str) –

    the url or relative path to a remote FlatGeobuf file. If an argument is passed for fs, this should be a path fragment relative to the root passed to the ObjectStore constructor.

Other Parameters:

  • fs (Optional[ObjectStore]) –

    an ObjectStore instance for this url. This is required for non-HTTP urls.

  • batch_size (int) –

    the number of rows to include in each internal batch of the table.

  • bbox (Tuple[float, float, float, float] | None) –

    A spatial filter for reading rows, of the format (minx, miny, maxx, maxy). If set to None, no spatial filter is applied.

Returns:

  • Table

    Table from FlatGeobuf file.

read_geojson

read_geojson(
    file: Union[str, Path, BinaryIO], *, batch_size: int = 65536
) -> Table

Read a GeoJSON file from a path on disk into an Arrow Table.

Parameters:

  • file (Union[str, Path, BinaryIO]) –

    the path to the file or a Python file object in binary read mode.

  • batch_size (int, default: 65536 ) –

    the number of rows to include in each internal batch of the table.

Returns:

  • Table

    Table from GeoJSON file.
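
For example, with a hypothetical local path and the import path used elsewhere on this page:

from geoarrow.rust.core import read_geojson

table = read_geojson("path/to/file.geojson")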

read_geojson_lines

read_geojson_lines(
    file: Union[str, Path, BinaryIO], *, batch_size: int = 65536
) -> Table

Read a newline-delimited GeoJSON file from a path on disk into an Arrow Table.

This expects a GeoJSON Feature on each line of a text file, with a newline character separating each Feature.

Parameters:

  • file (Union[str, Path, BinaryIO]) –

    the path to the file or a Python file object in binary read mode.

  • batch_size (int, default: 65536 ) –

    the number of rows to include in each internal batch of the table.

Returns:

  • Table

    Table from GeoJSON file.
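
For example, reading a hypothetical newline-delimited file where every line is a single GeoJSON Feature:

from geoarrow.rust.core import read_geojson_lines

table = read_geojson_lines("path/to/file.ndjson", batch_size=10000)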

read_parquet

read_parquet(
    path: Union[str, Path, BinaryIO],
    *,
    fs: Optional[ObjectStore] = None,
    batch_size: int = 65536
) -> Table

Read a GeoParquet file from a path on disk or a remote location into an Arrow Table.

Example:

Reading from a local path:

from geoarrow.rust.core import read_parquet
table = read_parquet("path/to/file.parquet")

Reading from an HTTP(S) url:

from geoarrow.rust.core import read_parquet

url = "https://raw.githubusercontent.com/opengeospatial/geoparquet/v1.0.0/examples/example.parquet"
table = read_parquet(url)

Reading from a remote file on an S3 bucket:

from geoarrow.rust.core import ObjectStore, read_parquet

options = {
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_region": "..."
}
fs = ObjectStore('s3://bucket', options=options)
table = read_parquet("path/in/bucket.parquet", fs=fs)

Parameters:

  • path (Union[str, Path, BinaryIO]) –

    the path to the file

  • fs (Optional[ObjectStore], default: None ) –

    the ObjectStore to read from. Defaults to None.

  • batch_size (int, default: 65536 ) –

    the number of rows to include in each internal batch of the table.

Returns:

  • Table

    Table from GeoParquet file.

read_parquet_async async

read_parquet_async(
    path: Union[str, Path, BinaryIO],
    *,
    fs: Optional[ObjectStore] = None,
    batch_size: int = 65536
) -> Table

Read a GeoParquet file from a path on disk or a remote location into an Arrow Table.

Examples:

Reading from an HTTP(S) url:

from geoarrow.rust.core import read_parquet_async

url = "https://raw.githubusercontent.com/opengeospatial/geoparquet/v1.0.0/examples/example.parquet"
table = await read_parquet_async(url)

Reading from a remote file on an S3 bucket:

from geoarrow.rust.core import ObjectStore, read_parquet_async

options = {
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_region": "..."
}
fs = ObjectStore('s3://bucket', options=options)
table = await read_parquet_async("path/in/bucket.parquet", fs=fs)

Parameters:

  • path (Union[str, Path, BinaryIO]) –

    the path to the file

  • fs (Optional[ObjectStore], default: None ) –

    the ObjectStore to read from. Defaults to None.

  • batch_size (int, default: 65536 ) –

    the number of rows to include in each internal batch of the table.

Returns:

  • Table

    Table from GeoParquet file.

read_postgis

read_postgis(connection_url: str, sql: str) -> Optional[Table]

Read a PostGIS query into an Arrow Table.

Parameters:

  • connection_url (str) –

    The PostgreSQL connection URL used to connect to the database.

  • sql (str) –

    The SQL query to execute.

Returns:

  • Optional[Table]

    Table of results from the query.
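
A sketch with a hypothetical connection URL and query, using the import path from the other examples on this page:

from geoarrow.rust.core import read_postgis

# Hypothetical connection URL and query.
table = read_postgis(
    "postgresql://user:password@localhost:5432/mydb",
    "SELECT id, geom FROM parcels LIMIT 100",
)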

read_postgis_async async

read_postgis_async(connection_url: str, sql: str) -> Optional[Table]

Read a PostGIS query into an Arrow Table.

Parameters:

  • connection_url (str) –

    The PostgreSQL connection URL used to connect to the database.

  • sql (str) –

    The SQL query to execute.

Returns:

  • Optional[Table]

    Table of results from the query.
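
The async variant is awaited; a sketch with the same hypothetical connection URL and query:

from geoarrow.rust.core import read_postgis_async

table = await read_postgis_async(
    "postgresql://user:password@localhost:5432/mydb",
    "SELECT id, geom FROM parcels LIMIT 100",
)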

read_pyogrio

read_pyogrio(
    path_or_buffer: Path | str | bytes,
    /,
    layer: int | str | None = None,
    encoding: str | None = None,
    columns: Sequence[str] | None = None,
    read_geometry: bool = True,
    skip_features: int = 0,
    max_features: int | None = None,
    where: str | None = None,
    bbox: Tuple[float, float, float, float] | Sequence[float] | None = None,
    mask=None,
    fids=None,
    sql: str | None = None,
    sql_dialect: str | None = None,
    return_fids=False,
    batch_size=65536,
    **kwargs,
) -> RecordBatchReader

Read from an OGR data source into a stream of Arrow record batches.

Parameters:

  • path_or_buffer (Path | str | bytes) –

    A dataset path or URI, or raw buffer.

  • layer (int | str | None, default: None ) –

    If an integer is provided, it corresponds to the index of the layer within the data source. If a string is provided, it must match the name of the layer in the data source. Defaults to the first layer in the data source.

  • encoding (str | None, default: None ) –

    If present, will be used as the encoding for reading string values from the data source, unless encoding can be inferred directly from the data source.

  • columns (Sequence[str] | None, default: None ) –

    List of column names to import from the data source. Column names must exactly match the names in the data source, and will be returned in the order they occur in the data source. To avoid reading any columns, pass an empty list-like.

  • read_geometry (bool, default: True ) –

    If True, geometry will be read from the data source. If False, only attribute columns will be read. Default: True.

  • skip_features (int, default: 0 ) –

    Number of features to skip from the beginning of the file before returning features. If greater than the available number of features, an empty result will be returned. Using this parameter may incur significant overhead if the driver does not support the capability to randomly seek to a specific feature, because it will need to iterate over all prior features.

  • max_features (int | None, default: None ) –

    Number of features to read from the file. Default: None.

  • where (str | None, default: None ) –

    Where clause to filter features in the layer by attribute values. If the data source natively supports SQL, its specific SQL dialect should be used (eg. SQLite and GeoPackage: SQLITE, PostgreSQL). If it doesn't, the OGRSQL WHERE syntax should be used. Note that it is not possible to override the SQL dialect here; that is only possible when you use the sql parameter.

    Examples: "ISO_A3 = 'CAN'", "POP_EST > 10000000 AND POP_EST < 100000000"

  • bbox (Tuple[float, float, float, float] | Sequence[float] | None, default: None ) –

    If present, will be used to filter records whose geometry intersects this box. This must be in the same CRS as the dataset. If GEOS is present and used by GDAL, only geometries that intersect this bbox will be returned; if GEOS is not available or not used by GDAL, all geometries with bounding boxes that intersect this bbox will be returned. Cannot be combined with mask keyword.

  • mask (Geometry | None, default: None ) –

    If present, will be used to filter records whose geometry intersects this Shapely geometry. This must be in the same CRS as the dataset. If GEOS is present and used by GDAL, only geometries that intersect this geometry will be returned; if GEOS is not available or not used by GDAL, all geometries with bounding boxes that intersect the bounding box of this geometry will be returned. Requires Shapely >= 2.0. Cannot be combined with the bbox keyword.

  • fids (array-like | None, default: None ) –

    Array of integer feature id (FID) values to select. Cannot be combined with other keywords to select a subset (skip_features, max_features, where, bbox, mask, or sql). Note that the starting index is driver and file specific (e.g. typically 0 for Shapefile and 1 for GeoPackage, but can still depend on the specific file). The performance of reading a large number of features using FIDs is also driver specific.

  • sql (str | None, default: None ) –

    The SQL statement to execute. Look at the sql_dialect parameter for more information on the syntax to use for the query. When combined with other keywords like columns, skip_features, max_features, where, bbox, or mask, those are applied after the SQL query. Be aware that this can have an impact on performance (e.g. filtering with the bbox or mask keywords may not use spatial indexes). Cannot be combined with the layer or fids keywords.

  • sql_dialect (str | None, default: None ) –

    The SQL dialect the SQL statement is written in. Possible values:

    • None: if the data source natively supports SQL, its specific SQL dialect will be used by default (eg. SQLite and GeoPackage: SQLITE, PostgreSQL). If the data source doesn't natively support SQL, the OGRSQL dialect is the default.
    • 'OGRSQL': can be used on any data source. Performance can suffer when used on data sources with native support for SQL.
    • 'SQLITE': can be used on any data source. All SpatiaLite functions can be used. Performance can suffer on data sources with native support for SQL, except for GeoPackage and SQLite, as this is their native SQL dialect.

Returns:

  • RecordBatchReader

    A stream of Arrow record batches read from the OGR data source.
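
A sketch with a hypothetical GeoPackage path, layer name, and attribute filter:

from geoarrow.rust.core import read_pyogrio

# `where` uses the data source's native SQL dialect (SQLITE for GeoPackage).
reader = read_pyogrio(
    "path/to/file.gpkg",
    layer="parcels",
    where="POP_EST > 10000000",
    batch_size=65536,
)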

write_csv

write_csv(table: ArrowStreamExportable, file: str | Path | BinaryIO) -> None

Write a Table to a CSV file on disk.

Parameters:

  • table (ArrowStreamExportable) –

    the Arrow RecordBatch, Table, or RecordBatchReader to write.

  • file (str | Path | BinaryIO) –

    the path to the file or a Python file object in binary write mode.

Returns:

  • None

    None
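
For example, assuming table is an existing Arrow table with a geometry column and using the import path from the other examples on this page:

from geoarrow.rust.core import write_csv

write_csv(table, "output.csv")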

write_flatgeobuf

write_flatgeobuf(
    table: ArrowStreamExportable,
    file: str | Path | BinaryIO,
    *,
    write_index: bool = True
) -> None

Write to a FlatGeobuf file on disk.

Parameters:

  • table (ArrowStreamExportable) –

    the Arrow RecordBatch, Table, or RecordBatchReader to write.

  • file (str | Path | BinaryIO) –

    the path to the file or a Python file object in binary write mode.

Other Parameters:

  • write_index (bool) –

    whether to write a spatial index in the FlatGeobuf file. Defaults to True.
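
For example, assuming table is an existing Arrow table with a geometry column:

from geoarrow.rust.core import write_flatgeobuf

# write_index=True (the default) also writes a spatial index.
write_flatgeobuf(table, "output.fgb", write_index=True)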

write_geojson

write_geojson(
    table: ArrowStreamExportable, file: Union[str, Path, BinaryIO]
) -> None

Write to a GeoJSON file on disk.

Note that the GeoJSON specification mandates that coordinates be in the WGS84 (EPSG:4326) coordinate system, but this function will not automatically reproject into WGS84 for you.

Parameters:

  • table (ArrowStreamExportable) –

    the Arrow RecordBatch, Table, or RecordBatchReader to write.

  • file (Union[str, Path, BinaryIO]) –

    the path to the file or a Python file object in binary write mode.

Returns:

  • None

    None
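
For example, assuming table is an existing Arrow table whose coordinates are already in WGS84:

from geoarrow.rust.core import write_geojson

write_geojson(table, "output.geojson")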

write_geojson_lines

write_geojson_lines(
    table: ArrowStreamExportable, file: Union[str, Path, BinaryIO]
) -> None

Write to a newline-delimited GeoJSON file on disk.

Note that the GeoJSON specification mandates that coordinates be in the WGS84 (EPSG:4326) coordinate system, but this function will not automatically reproject into WGS84 for you.

Parameters:

  • table (ArrowStreamExportable) –

    the Arrow RecordBatch, Table, or RecordBatchReader to write.

  • file (Union[str, Path, BinaryIO]) –

    the path to the file or a Python file object in binary write mode.

Returns:

  • None

    None

write_parquet

write_parquet(
    table: ArrowStreamExportable,
    file: Union[str, Path, BinaryIO],
    *,
    encoding: GeoParquetEncoding | GeoParquetEncodingT = GeoParquetEncoding.WKB
) -> None

Write an Arrow RecordBatch, Table, or RecordBatchReader to a GeoParquet file on disk.

If a RecordBatchReader is passed, only one batch at a time will be materialized in memory.

Parameters:

  • table (ArrowStreamExportable) –

    the Arrow RecordBatch, Table, or RecordBatchReader to write.

  • file (Union[str, Path, BinaryIO]) –

    the path to the file or a Python file object in binary write mode.

  • encoding (GeoParquetEncoding | GeoParquetEncodingT, default: GeoParquetEncoding.WKB ) –

    the geometry encoding to use when writing to GeoParquet. Defaults to GeoParquetEncoding.WKB.
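
For example, assuming table is an existing Arrow table and writing with the default WKB geometry encoding:

from geoarrow.rust.core import write_parquet

# The default geometry encoding is GeoParquetEncoding.WKB.
write_parquet(table, "output.parquet")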