Getting the Sizes of Top-Level Directories in an AWS S3 Bucket with Boto3

I was recently asked to create a report showing the total size of the files within each top-level folder, including all the subdirectories underneath it, in our S3 buckets.

S3 bucket ‘files’ are really objects, and each object has a key that contains the full path where it is stored within the bucket.
I came up with this function to take a bucket name and iterate over the objects within that bucket. For each object, the key is examined, the top-level folder is pulled out of it, and the object’s size is added to a running total kept in a dictionary.
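
To illustrate what I mean by a top-level folder: splitting a key on '/' gives the path pieces, and the first piece is the folder I want to total up (the key below is just a made-up example):

# A key holds the full 'path' of the object within the bucket.
key = 'reports/2016/summary.csv'
print(key.split('/')[0])   # prints 'reports'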

Here’s what I ended up with.

import boto3

def get_top_dir_size_summary(bucket_to_search):
    """
    This function takes in the name of an s3 bucket and returns a dictionary
    containing the top level dirs as keys and the total file size in bytes as values.
    :param bucket_to_search: a String containing the name of the bucket
    """
    # Setup the output dictionary for running totals
    dirsizedict = {}
    # Create an entry for '.' to represent files sitting at the root of the bucket.
    dirsizedict['.'] = 0

    # ------------
    # Setup the AWS Resource and Clients
    s3 = boto3.resource('s3')
    s3client = boto3.client('s3')

    # This is a check to ensure a bad bucket name wasn't passed in.  I'm sure there is a better
    # way to check this.  If you have a better method, please comment on the article.
    try:
        response = s3client.head_bucket(Bucket=bucket_to_search)
    except Exception:
        print('Bucket ' + bucket_to_search + ' does not exist or is unavailable. - Exiting')
        quit()

    # Since buckets can hold more than 1000 items, use a paginator to iterate 1000 at a time
    paginator = s3client.get_paginator('list_objects')
    pageresponse = paginator.paginate(Bucket=bucket_to_search)

    # Iterate through each object in the bucket via the paginator.
    for pageobject in pageresponse:

        # Check to see if the page has contents; without this an empty bucket would throw an error.
        if 'Contents' in pageobject.keys():

            # If there are contents, then iterate through each 'file'.
            for file in pageobject['Contents']:
                itemtocheck = s3.ObjectSummary(bucket_to_search, file['Key'])

                # Get the top-level directory from the file by splitting the key.
                keylist = file['Key'].split('/')

                # See if the file is at the root: if keylist has 1 item, there are no dirs in the key
                if len(keylist) == 1:
                    dirsizedict['.'] += itemtocheck.size
                else:
                    # Not root: create the dictionary key if needed, otherwise
                    # just add the size to the running total
                    if keylist[0] in dirsizedict:
                        dirsizedict[keylist[0]] += itemtocheck.size
                    else:
                        dirsizedict[keylist[0]] = itemtocheck.size

    return dirsizedict
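
To use it, just call the function with a bucket name and walk the dictionary it hands back (a minimal sketch; 'my-example-bucket' is a placeholder name):

# Print each top-level directory and its running total in bytes.
sizes = get_top_dir_size_summary('my-example-bucket')
for directory, size in sizes.items():
    print(directory.ljust(45) + str(size).rjust(25))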

That script probably looks a little rough to an elite coder, so if you have any thoughts on improvement, let me hear them.

Getting the Size of an S3 Bucket using Boto3 for AWS

I’m writing this on 9/14/2016. I make note of the date because the size of an S3 bucket may seem like a very basic bit of information, yet AWS does not provide an easy method with which to collect it. I fully expect them to add that functionality at some point. As of this date, I could only come up with two methods to get the size of a bucket. The first is to list all the bucket’s objects and iterate over them while keeping a running total of their sizes. That method does work, but I found that for a bucket with many thousands of items, it could take hours per bucket.
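
For reference, that slower list-and-sum approach looks roughly like this (a minimal sketch; the bucket name is a placeholder):

import boto3

s3 = boto3.resource('s3')

# Sum the size of every object in the bucket - simple, but slow for large buckets.
total = sum(obj.size for obj in s3.Bucket('my-example-bucket').objects.all())
print(total)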

A better method uses AWS CloudWatch metrics instead. Every S3 bucket automatically gets two CloudWatch storage metrics (BucketSizeBytes and NumberOfObjects), and I use the first one to pull the Average size over a set period, usually 1 day.

Here’s what I came up with:

import boto3
import datetime

now = datetime.datetime.now()

cw = boto3.client('cloudwatch')
s3client = boto3.client('s3')

# Get a list of all buckets
allbuckets = s3client.list_buckets()

# Header Line for the output going to standard out
print('Bucket'.ljust(45) + 'Size in Bytes'.rjust(25))

# Iterate through each bucket
for bucket in allbuckets['Buckets']:
    # For each bucket item, look up the corresponding metrics from CloudWatch
    response = cw.get_metric_statistics(Namespace='AWS/S3',
                                        MetricName='BucketSizeBytes',
                                        Dimensions=[
                                            {'Name': 'BucketName', 'Value': bucket['Name']},
                                            {'Name': 'StorageType', 'Value': 'StandardStorage'}
                                        ],
                                        Statistics=['Average'],
                                        Period=3600,
                                        StartTime=(now-datetime.timedelta(days=1)).isoformat(),
                                        EndTime=now.isoformat()
                                        )
    # The CloudWatch metric will have a single datapoint for the period, so we just report on it.
    for item in response["Datapoints"]:
        print(bucket["Name"].ljust(45) + str("{:,}".format(int(item["Average"]))).rjust(25))
        # Note the use of "{:,}".format.
        # It's a handy shorthand that adds thousands separators to the output.
        # I just discovered it recently.
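
The other storage metric works the same way. If you also want an object count per bucket, something along these lines (dropped inside the same for-bucket loop) should do it; note that NumberOfObjects is reported against the 'AllStorageTypes' storage type:

    # Pull the object-count metric for the same bucket and time window.
    objresponse = cw.get_metric_statistics(Namespace='AWS/S3',
                                           MetricName='NumberOfObjects',
                                           Dimensions=[
                                               {'Name': 'BucketName', 'Value': bucket['Name']},
                                               {'Name': 'StorageType', 'Value': 'AllStorageTypes'}
                                           ],
                                           Statistics=['Average'],
                                           Period=3600,
                                           StartTime=(now-datetime.timedelta(days=1)).isoformat(),
                                           EndTime=now.isoformat())
    for item in objresponse["Datapoints"]:
        print(bucket["Name"].ljust(45) + str("{:,}".format(int(item["Average"]))).rjust(25))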

Using Python Boto3 with Amazon AWS S3 Buckets

Here I’m adding some additional Python Boto3 examples, this time working with S3 buckets.

So to get started, let’s create the S3 resource and client, and get a listing of our buckets.

import boto3

s3 = boto3.resource('s3')
s3client = boto3.client('s3')

response = s3client.list_buckets()
for bucket in response["Buckets"]:
    print(bucket['Name'])

Here we create the s3 client object and call ‘list_buckets()’. The response is a dictionary with a key called ‘Buckets’ that holds a list of dicts, one per bucket, with each bucket’s details.
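
Each of those dicts carries a few details about the bucket; for instance, the creation date comes back alongside the name:

for bucket in response["Buckets"]:
    # Every entry includes the bucket's 'Name' and its 'CreationDate'.
    print(bucket['Name'], bucket['CreationDate'])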

To list out the objects within a bucket, we can add the following inside the loop above:

    theobjects = s3client.list_objects_v2(Bucket=bucket["Name"])
    for object in theobjects["Contents"]:
        print(object["Key"])

Note that if the bucket has no items, then there will be no ‘Contents’ key in the response and you will get a “KeyError: ‘Contents’” exception thrown.
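
One simple way around that is to fall back to an empty list when ‘Contents’ is missing (just one approach; checking for the key first works too):

    theobjects = s3client.list_objects_v2(Bucket=bucket["Name"])
    # .get() returns an empty list for an empty bucket, so the loop simply does nothing.
    for object in theobjects.get("Contents", []):
        print(object["Key"])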

Each object returned is a dictionary with key/value pairs describing the object. The Boto3 docs are your friend here: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2

Now if the bucket has over 1,000 items, a single list_objects call is limited to 1,000 replies. To get around this, we need to use a Paginator.

import boto3

s3 = boto3.resource('s3')
s3client = boto3.client('s3')

response = s3client.list_buckets()
for bucket in response["Buckets"]:
    # Create a paginator to pull 1000 objects at a time
    paginator = s3client.get_paginator('list_objects')
    pageresponse = paginator.paginate(Bucket=bucket["Name"])

    # pageresponse holds 1000 objects at a time and will continue to repeat in chunks of 1000.
    for pageobject in pageresponse:
        # Guard against empty buckets, which have no 'Contents' key.
        for file in pageobject.get("Contents", []):
            print(file["Key"])
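
Finally, if you only care about part of a bucket, the same paginate() call also accepts a Prefix argument, which restricts the listing to keys that start with that string. Something like this slots into the same bucket loop (the 'reports/' prefix is just an example):

    # Limit the listing to a single 'folder' by passing a Prefix to paginate().
    pageresponse = paginator.paginate(Bucket=bucket["Name"], Prefix='reports/')
    for pageobject in pageresponse:
        for file in pageobject.get("Contents", []):
            print(file["Key"])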