partition
Datalake partition utilities.
This module provides functions and classes for managing and manipulating partitions in a data lake stored on S3. It includes utilities for extracting partition information, encoding partition data, and listing partitions.
- aws_sdk_polars.s3.partition.decode_hive_partition(s3dir_root: S3Path, s3dir_partition: S3Path) -> Dict[str, str]
Extract partition data from the S3 directory path.
- Parameters:
s3dir_root – The root S3 directory.
s3dir_partition – The partition S3 directory.
Example
>>> s3dir_root = S3Path("s3://bucket/folder/")
>>> s3dir_partition = S3Path("s3://bucket/folder/year=2021/month=01/day=15/")
>>> decode_hive_partition(s3dir_root, s3dir_partition)
{"year": "2021", "month": "01", "day": "15"}
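The decoding step can be illustrated with plain strings. The following is a minimal sketch, not the library's implementation: the name `decode_hive_partition_sketch` is hypothetical, and it assumes URIs are passed as `str` rather than `S3Path` and that every path segment below the root has the form `key=value`.

```python
from typing import Dict


def decode_hive_partition_sketch(root_uri: str, partition_uri: str) -> Dict[str, str]:
    """Parse the ``key=value`` segments that follow ``root_uri`` (hypothetical helper)."""
    if not partition_uri.startswith(root_uri):
        raise ValueError(f"{partition_uri!r} is not under {root_uri!r}")
    # Keep only the part of the path below the root, without surrounding slashes.
    relative = partition_uri[len(root_uri):].strip("/")
    data: Dict[str, str] = {}
    for segment in relative.split("/"):
        if not segment:
            continue  # the partition URI equals the root URI
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"segment {segment!r} is not in key=value form")
        data[key] = value  # values stay strings, e.g. "01" keeps its leading zero
    return data


result = decode_hive_partition_sketch(
    "s3://bucket/folder/",
    "s3://bucket/folder/year=2021/month=01/day=15/",
)
print(result)
```

Values are deliberately left as strings so that zero-padded components such as `01` survive a round trip.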
- aws_sdk_polars.s3.partition.encode_hive_partition(kvs: Dict[str, str]) -> str
Encode partition data into a Hive-style partition string.
- Parameters:
kvs – A dictionary of partition key-value pairs.
For example:
>>> encode_hive_partition({"year": "2021", "month": "01", "day": "01"})
'year=2021/month=01/day=01'
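Encoding is the inverse of decoding: each pair becomes a `key=value` segment, joined by `/`. A minimal sketch, assuming the dictionary's insertion order defines the partition hierarchy; `encode_hive_partition_sketch` is a hypothetical name, not the library function.

```python
from typing import Dict


def encode_hive_partition_sketch(kvs: Dict[str, str]) -> str:
    """Join key-value pairs into a Hive-style relative path (hypothetical helper)."""
    # Python dicts preserve insertion order, so the first key is the top level.
    return "/".join(f"{key}={value}" for key, value in kvs.items())


encoded = encode_hive_partition_sketch({"year": "2021", "month": "01", "day": "01"})
print(encoded)
```

Because the order of keys determines the directory nesting, callers need to build the dictionary in hierarchy order (here year, then month, then day).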
- aws_sdk_polars.s3.partition.build_hive_partition_dir(s3dir_root: S3Path, kvs: Dict[str, str]) -> S3Path
Get the S3 directory path of the partition.
- Parameters:
s3dir_root – The root S3 directory.
kvs – A dictionary of partition key-value pairs.
Example
>>> s3dir_root = S3Path("s3://bucket/folder/")
>>> s3dir_partition = build_hive_partition_dir(s3dir_root, {"year": "2021", "month": "01", "day": "01"})
>>> s3dir_partition.uri
's3://bucket/folder/year=2021/month=01/day=01/'
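Building the partition directory amounts to appending the encoded key-value path to the root and ending with a trailing slash. The sketch below works on plain strings instead of `S3Path`; the name `build_hive_partition_uri_sketch` and the handling of a root without a trailing slash are assumptions for illustration.

```python
from typing import Dict


def build_hive_partition_uri_sketch(root_uri: str, kvs: Dict[str, str]) -> str:
    """Return the directory URI of a partition under ``root_uri`` (hypothetical helper)."""
    # Normalize the root so it always ends with exactly one trailing slash.
    root = root_uri if root_uri.endswith("/") else root_uri + "/"
    relative = "/".join(f"{key}={value}" for key, value in kvs.items())
    # A directory URI ends with "/"; with no keys the partition is the root itself.
    return f"{root}{relative}/" if relative else root


part_uri = build_hive_partition_uri_sketch(
    "s3://bucket/folder/",
    {"year": "2021", "month": "01", "day": "01"},
)
print(part_uri)
```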
- class aws_sdk_polars.s3.partition.S3Partition(root_uri: str, data: Dict[str, str])
Represents a partition in an S3-based data lake.
A partition is a directory in S3 that contains data files but no subdirectories. It typically follows a hierarchical structure based on partition keys.
For example, in the following S3 directory structure:
s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json
Then:
s3://bucket/folder/year=2021/month=01/day=01/ is a partition.
s3://bucket/folder/year=2021/month=01/ is NOT a partition.
s3://bucket/folder/year=2021/ is NOT a partition.
- Parameters:
root_uri – The S3 URI of the root folder of the partition. For example, the root folder of s3://bucket/folder/year=2021/month=01/day=01/ is s3://bucket/folder/.
data – A dictionary of partition data. Note that the value is always a string, even if it represents a number. For example:
{"year": "2021", "month": "01", "day": "01"}
- property s3dir_root: S3Path
The S3 directory path of the root directory.
If the partition is “s3://bucket/folder/year=2021/month=01/day=15/”, then the root directory is “s3://bucket/folder/”.
- property s3dir_part: S3Path
The S3 directory path of the partition.
- classmethod from_uri(s3uri_part: str, s3uri_root: str)
Construct an S3Partition object from S3 URIs.
- Parameters:
s3uri_part – The S3 URI of the partition.
s3uri_root – The S3 URI of the root directory.
- Returns:
A new S3Partition object.
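The relationship between the root URI, the partition data, and the partition URI can be sketched with a small dataclass. This is an illustration under stated assumptions, not the `S3Partition` source: `PartitionSketch` and its `part_uri` property are hypothetical names, URIs are plain strings rather than `S3Path` objects, and the root URI is assumed to end with `/`.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PartitionSketch:
    """Hypothetical stand-in for a partition: a root URI plus key-value data."""

    root_uri: str          # assumed to end with "/"
    data: Dict[str, str]   # values are always strings, e.g. {"month": "01"}

    @property
    def part_uri(self) -> str:
        # Rebuild the partition directory from the root and the key-value pairs.
        relative = "/".join(f"{k}={v}" for k, v in self.data.items())
        return f"{self.root_uri}{relative}/" if relative else self.root_uri

    @classmethod
    def from_uri(cls, s3uri_part: str, s3uri_root: str) -> "PartitionSketch":
        if not s3uri_part.startswith(s3uri_root):
            raise ValueError(f"{s3uri_part!r} is not under {s3uri_root!r}")
        relative = s3uri_part[len(s3uri_root):].strip("/")
        # Split each "key=value" segment once, so values may themselves contain "=".
        data = dict(seg.split("=", 1) for seg in relative.split("/") if seg)
        return cls(root_uri=s3uri_root, data=data)


part = PartitionSketch.from_uri(
    s3uri_part="s3://bucket/folder/year=2021/month=01/day=15/",
    s3uri_root="s3://bucket/folder/",
)
print(part.data)
print(part.part_uri)
```

Constructing from URIs and then reading `part_uri` returns the original partition URI, which is a convenient sanity check for the encode and decode steps.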
- aws_sdk_polars.s3.partition.list_partitions(s3_client: S3Client, s3dir_root: S3Path) -> List[S3Partition]
Efficiently scan the S3 directory and return a list of partitions.
For example, for the following S3 structure:
s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json
The function will return partitions:
s3://bucket/folder/year=2021/month=01/day=01/
s3://bucket/folder/year=2021/month=01/day=02/
s3://bucket/folder/year=2021/month=02/day=01/
s3://bucket/folder/year=2021/month=02/day=02/
Note
This implementation is faster than get_partitions_v1() because it avoids recursive S3 API calls.
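One way to avoid recursion is to take a single flat listing of object URIs under the root and derive the partition directories from it. The sketch below shows only that derivation on an in-memory list; it is an assumption about the general approach, not the library's code. `leaf_partition_dirs` is a hypothetical name, and in practice the URIs might come from a paginated flat listing such as boto3's `list_objects_v2`.

```python
from typing import Iterable, List


def leaf_partition_dirs(object_uris: Iterable[str]) -> List[str]:
    """Return directories that hold files but have no subdirectories (hypothetical helper)."""
    # Parent directory of every object; skip "directory marker" keys ending in "/".
    parents = sorted(
        {uri.rsplit("/", 1)[0] + "/" for uri in object_uris if not uri.endswith("/")}
    )
    leaves: List[str] = []
    for i, directory in enumerate(parents):
        # In sorted order, any directory nested under this one follows it immediately,
        # so checking the next entry is enough to detect subdirectories.
        has_subdir = i + 1 < len(parents) and parents[i + 1].startswith(directory)
        if not has_subdir:
            leaves.append(directory)
    return leaves


uris = [
    "s3://bucket/folder/year=2021/month=01/day=01/data.json",
    "s3://bucket/folder/year=2021/month=01/day=02/data.json",
    "s3://bucket/folder/year=2021/month=02/day=01/data.json",
    "s3://bucket/folder/year=2021/month=02/day=02/data.json",
    "s3://bucket/folder/year=2021/_SUCCESS",  # file in a non-leaf directory
]
partitions = leaf_partition_dirs(uris)
for uri in partitions:
    print(uri)
```

The `_SUCCESS` file shows why the leaf check matters: `year=2021/` contains a file but also has subdirectories, so it is not reported as a partition.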