partition

Datalake partition utilities.

This module provides functions and classes for managing and manipulating partitions in a data lake stored on S3. It includes utilities for extracting partition information, encoding partition data, and listing partitions.

aws_sdk_polars.s3.partition.decode_hive_partition(s3dir_root: S3Path, s3dir_partition: S3Path) → Dict[str, str]

Extract partition data from the S3 directory path.

Parameters:
  • s3dir_root – The root S3 directory.

  • s3dir_partition – The partition S3 directory.

Example

>>> s3dir_root = S3Path("s3://bucket/folder/")
>>> s3dir_partition = S3Path("s3://bucket/folder/year=2021/month=01/day=15/")
>>> decode_hive_partition(s3dir_root, s3dir_partition)
{'year': '2021', 'month': '01', 'day': '15'}
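
The decoding step above can be sketched in plain Python on URI strings. This is a minimal illustration and not the library's implementation: decode_hive_partition_sketch is a hypothetical name, and the sketch assumes that every path segment below the root has the form key=value.

```python
from typing import Dict


def decode_hive_partition_sketch(root_uri: str, part_uri: str) -> Dict[str, str]:
    """Hypothetical helper: parse the key=value segments of part_uri below root_uri."""
    # Normalize the root so that "s3://bucket/folder" cannot match "s3://bucket/folder2/".
    root = root_uri.rstrip("/") + "/"
    if not part_uri.startswith(root):
        raise ValueError(f"{part_uri!r} is not located under {root!r}")

    kvs: Dict[str, str] = {}
    for segment in part_uri[len(root):].split("/"):
        if not segment:
            continue  # skip the empty piece produced by a trailing slash
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"segment {segment!r} is not in key=value form")
        kvs[key] = value  # values stay strings, e.g. "01" keeps its leading zero
    return kvs


decoded = decode_hive_partition_sketch(
    "s3://bucket/folder/",
    "s3://bucket/folder/year=2021/month=01/day=15/",
)
print(decoded)
```

The values are kept as strings on purpose, so zero-padded values such as "01" survive a decode and re-encode round trip.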

aws_sdk_polars.s3.partition.encode_hive_partition(kvs: Dict[str, str]) → str

Encode partition data into a Hive-style partition string.

Parameters:

kvs – A dictionary of partition key-value pairs.

For example:

>>> encode_hive_partition({"year": "2021", "month": "01", "day": "01"})
'year=2021/month=01/day=01'
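
The encoding can be sketched as a join over the dictionary items. This is only an illustration under stated assumptions: encode_hive_partition_sketch is a hypothetical name, and whether the library itself relies on dictionary order is not confirmed by this page.

```python
from typing import Dict


def encode_hive_partition_sketch(kvs: Dict[str, str]) -> str:
    """Hypothetical helper: join key=value pairs with "/" in dictionary order."""
    return "/".join(f"{key}={value}" for key, value in kvs.items())


encoded = encode_hive_partition_sketch({"year": "2021", "month": "01", "day": "01"})
reordered = encode_hive_partition_sketch({"day": "01", "month": "01", "year": "2021"})
print(encoded)
print(reordered)
```

In this sketch the order of the keys in the dictionary determines the order of the path segments, so the same key-value pairs inserted in a different order produce a different path.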

aws_sdk_polars.s3.partition.build_hive_partition_dir(s3dir_root: S3Path, kvs: Dict[str, str]) → S3Path

Get the S3 directory path of the partition.

Parameters:
  • s3dir_root – The root S3 directory.

  • kvs – A dictionary of partition key-value pairs.

Example

>>> s3dir_root = S3Path("s3://bucket/folder/")
>>> s3dir_partition = build_hive_partition_dir(s3dir_root, {"year": "2021", "month": "01", "day": "01"})
>>> s3dir_partition.uri
's3://bucket/folder/year=2021/month=01/day=01/'

class aws_sdk_polars.s3.partition.S3Partition(root_uri: str, data: Dict[str, str])

Represents a partition in an S3-based data lake.

A partition is a directory in S3 that contains data files but no subdirectories. It typically follows a hierarchical structure based on partition keys.

For example, in the following S3 directory structure:

s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json

Then:

  • s3://bucket/folder/year=2021/month=01/day=01/ is a partition.

  • s3://bucket/folder/year=2021/month=01/ is NOT a partition.

  • s3://bucket/folder/year=2021/ is NOT a partition.

Parameters:
  • root_uri – The S3 URI of the partition's root folder. For example, the root folder of s3://bucket/folder/year=2021/month=01/day=01/ is s3://bucket/folder/.

  • data – A dictionary of partition data. Note that the value is always a string, even if it represents a number. For example: {"year": "2021", "month": "01", "day": "01"}
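
Because partition values are always strings, callers that need typed values have to convert them explicitly. The sketch below shows one way to do that for a year/month/day layout; both function names are hypothetical and the key names are taken from the example above.

```python
from datetime import date
from typing import Dict


def partition_data_to_date_sketch(data: Dict[str, str]) -> date:
    """Hypothetical helper: convert string partition values to a date."""
    # int("01") == 1, so zero-padded strings parse without extra handling.
    return date(int(data["year"]), int(data["month"]), int(data["day"]))


def date_to_partition_data_sketch(d: date) -> Dict[str, str]:
    """Hypothetical helper: format a date back into zero-padded string values."""
    return {"year": f"{d.year:04d}", "month": f"{d.month:02d}", "day": f"{d.day:02d}"}


data = {"year": "2021", "month": "01", "day": "01"}
as_date = partition_data_to_date_sketch(data)
round_trip = date_to_partition_data_sketch(as_date)
print(as_date, round_trip)
```

Formatting with explicit zero padding matters when going back to strings: "1" and "01" are different partition values and therefore different S3 prefixes.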

property s3dir_root: S3Path

The S3 directory path of the root directory.

If the partition is “s3://bucket/folder/year=2021/month=01/day=15/”, then the root directory is “s3://bucket/folder/”.

property s3dir_part: S3Path

The S3 directory path of the partition.

property part_uri: str

The S3 URI of the partition directory.

classmethod from_uri(s3uri_part: str, s3uri_root: str)

Construct an S3Partition object from S3 URIs.

Parameters:
  • s3uri_part – The S3 URI of the partition.

  • s3uri_root – The S3 URI of the root directory.

Returns:

A new S3Partition object.

classmethod from_part_uri(part_uri: str, n_levels: int)

Construct an S3Partition object from a partition URI.

Parameters:
  • part_uri – The S3 URI of the partition.

  • n_levels – The number of levels to go up to reach the root directory.

Returns:

A new S3Partition object.
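
The role of n_levels can be illustrated with plain string handling: going up n_levels path segments from the partition URI yields the root URI. This is a sketch of the idea only, not the library's code; root_uri_from_part_uri_sketch is a hypothetical name, and the check against leaving the bucket is an assumption of this sketch.

```python
def root_uri_from_part_uri_sketch(part_uri: str, n_levels: int) -> str:
    """Hypothetical helper: strip the last n_levels path segments from part_uri."""
    if n_levels < 1:
        raise ValueError("n_levels must be at least 1")
    # "s3://bucket/folder/year=2021/" -> ["s3:", "", "bucket", "folder", "year=2021"]
    segments = part_uri.rstrip("/").split("/")
    # The first three pieces ("s3:", "", bucket name) must always remain.
    if len(segments) - n_levels < 3:
        raise ValueError("n_levels goes above the bucket")
    return "/".join(segments[:-n_levels]) + "/"


root_uri = root_uri_from_part_uri_sketch(
    "s3://bucket/folder/year=2021/month=01/day=01/",
    n_levels=3,
)
print(root_uri)
```

With three partition keys (year, month, day), n_levels is 3, and the remaining prefix is the root folder.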

list_files_by_ext(s3_client: S3Client, ext: str) → List[S3Path]

List files in the partition by file extension. Files in subdirectories are not included.

Parameters:

ext – File extension to filter. For example, “.parquet”.

Returns:

A list of S3Path objects representing the files with the given extension.
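
The filtering rule (direct children only, matched by extension) can be sketched over a plain list of object URIs. This is an illustration under assumptions, not the library's implementation: direct_files_by_ext_sketch is a hypothetical name, and the input list stands in for whatever an S3 listing of the partition prefix would return.

```python
from typing import Iterable, List


def direct_files_by_ext_sketch(part_uri: str, uris: Iterable[str], ext: str) -> List[str]:
    """Hypothetical helper: keep URIs directly under part_uri that end with ext."""
    prefix = part_uri.rstrip("/") + "/"
    matched: List[str] = []
    for uri in uris:
        if not uri.startswith(prefix):
            continue  # object belongs to another partition
        relative = uri[len(prefix):]
        if not relative or "/" in relative:
            continue  # the directory itself, or an object in a subdirectory
        if relative.endswith(ext):
            matched.append(uri)
    return matched


listing = [
    "s3://bucket/folder/year=2021/month=01/day=01/1.parquet",
    "s3://bucket/folder/year=2021/month=01/day=01/2.parquet",
    "s3://bucket/folder/year=2021/month=01/day=01/notes.json",
    "s3://bucket/folder/year=2021/month=01/day=01/tmp/3.parquet",
    "s3://bucket/folder/year=2021/month=01/day=02/4.parquet",
]
parquet_files = direct_files_by_ext_sketch(
    "s3://bucket/folder/year=2021/month=01/day=01/", listing, ".parquet"
)
print(parquet_files)
```

The file under tmp/ is skipped because it sits in a subdirectory, and the file under day=02/ is skipped because it belongs to a different partition.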

aws_sdk_polars.s3.partition.list_partitions(s3_client: S3Client, s3dir_root: S3Path) → List[S3Partition]

Efficiently scan the S3 directory and return a list of partitions.

For example, for the following S3 structure:

s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json

The function will return partitions:

s3://bucket/folder/year=2021/month=01/day=01/
s3://bucket/folder/year=2021/month=01/day=02/
s3://bucket/folder/year=2021/month=02/day=01/
s3://bucket/folder/year=2021/month=02/day=02/

Note

This implementation is faster than get_partitions_v1() because it avoids recursive S3 API calls.
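
One way to find partitions without recursion, assuming a single flat listing of object URIs under the root is already available, is to take the parent directory of every object and keep only the directories that contain no subdirectories. The sketch below illustrates that idea in plain Python; partition_dirs_from_listing_sketch is a hypothetical name, and this page does not confirm that list_partitions() works this way.

```python
from typing import Iterable, List


def partition_dirs_from_listing_sketch(uris: Iterable[str]) -> List[str]:
    """Hypothetical helper: derive leaf directories from a flat list of object URIs."""
    # Parent directory of every object; URIs ending in "/" are directory markers.
    parents = sorted({uri.rsplit("/", 1)[0] + "/" for uri in uris if not uri.endswith("/")})

    leaves: List[str] = []
    for i, directory in enumerate(parents):
        # In sorted order, any directory nested under this one comes right after it,
        # so checking the next entry is enough to detect a subdirectory.
        has_subdir = i + 1 < len(parents) and parents[i + 1].startswith(directory)
        if not has_subdir:
            leaves.append(directory)
    return leaves


listing = [
    "s3://bucket/folder/year=2021/month=01/day=01/data.json",
    "s3://bucket/folder/year=2021/month=01/day=02/data.json",
    "s3://bucket/folder/year=2021/month=02/day=01/data.json",
    "s3://bucket/folder/year=2021/month=02/day=02/data.json",
    "s3://bucket/folder/year=2021/_SUCCESS",
]
partition_dirs = partition_dirs_from_listing_sketch(listing)
print(partition_dirs)
```

The year=2021/ directory holds a file but also has subdirectories, so it is not reported as a partition, which matches the definition of a partition given for S3Partition.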