Getting the sizes of Top level Directories in an AWS S3 Bucket with Boto3

I was recently asked to create a report showing the total size of the files within each top-level folder, including all the subdirectories under it, in our S3 buckets.

S3 bucket ‘files’ are really objects; each one has a key that contains the path where the object is stored within the bucket.
I came up with this function to take a bucket name and iterate over the objects within that bucket. For each item, the key is split to find the top-level directory, and the object’s size is added to a running total kept in a dictionary.

Here’s what I ended up with.

import boto3
from botocore.exceptions import ClientError

def get_top_dir_size_summary(bucket_to_search):
    """
    This function takes in the name of an s3 bucket and returns a dictionary
    containing the top level dirs as keys and total filesize as values.
    :param bucket_to_search: a String containing the name of the bucket
    """
    # Setup the output dictionary for running totals
    dirsizedict = {}
    # Create 1 entry for '.' to represent the root folder instead of the default.
    dirsizedict['.'] = 0

    # ------------
    # Setup the AWS client
    s3client = boto3.client('s3')

    # This is a check to ensure a bad bucket name wasn't passed in. I'm sure there is a better
    # way to check this. If you have a better method, please comment on the article.
    try:
        s3client.head_bucket(Bucket=bucket_to_search)
    except ClientError:
        print('Bucket ' + bucket_to_search + ' does not exist or is unavailable. - Exiting')
        quit()

    # Since buckets can hold more than 1000 items, use a paginator to iterate 1000 at a time.
    paginator = s3client.get_paginator('list_objects')
    pageresponse = paginator.paginate(Bucket=bucket_to_search)

    # Iterate through each page of objects returned by the paginator.
    for pageobject in pageresponse:

        # Check that the page has contents; without this, an empty bucket would throw an error.
        if 'Contents' in pageobject:

            # If there are contents, iterate through each 'file'.
            for file in pageobject['Contents']:
                # Get the top-level directory from the file by splitting the key.
                # Each listing entry already includes its 'Size', so no per-object lookup is needed.
                keylist = file['Key'].split('/')

                # If keylist has 1 item, the file sits at the root with no dirs in its key.
                if len(keylist) == 1:
                    dirsizedict['.'] += file['Size']
                else:
                    # Not root: add the size to the running total, creating the key if needed.
                    if keylist[0] in dirsizedict:
                        dirsizedict[keylist[0]] += file['Size']
                    else:
                        dirsizedict[keylist[0]] = file['Size']

    return dirsizedict
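
To use it, just call the function and print the running totals it hands back; the bucket name ‘my-example-bucket’ below is a placeholder for one of your own:

# A quick usage sketch -- 'my-example-bucket' is a hypothetical bucket name.
sizes = get_top_dir_size_summary('my-example-bucket')
for topdir, size in sorted(sizes.items()):
    print(topdir.ljust(40) + "{:,}".format(size).rjust(20))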

That script is probably a little rough to an elite coder, so if you have any thoughts on improvement, let me hear them.

Using Python and Boto3 to get Instance Tag information

Here are 2 sample functions to illustrate how you can get information about Tags on instances using Boto3 in AWS.

import boto3

def get_instance_name(fid):
    # When given an instance ID as str e.g. 'i-1234567', return the instance 'Name' from the name tag.
    ec2 = boto3.resource('ec2')
    ec2instance = ec2.Instance(fid)
    instancename = ''
    # An instance with no tags at all returns None here, so guard against that before iterating.
    for tag in ec2instance.tags or []:
        if tag["Key"] == 'Name':
            instancename = tag["Value"]
    return instancename

In this function, I create the ec2 resource object using the instance ID passed to the function. I iterate through the Tags of the instance until I find the ‘Name’ Tag and return its value. This is a very simple function that can pull any tag value, really.
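
To illustrate that point, here is a generalized sketch that pulls the value of any tag key you pass in; the function name get_instance_tag_value is my own invention, not a Boto3 method:

def get_instance_tag_value(fid, tagkey):
    # Hypothetical helper: return the value of any tag on an instance, or '' if it isn't set.
    ec2 = boto3.resource('ec2')
    ec2instance = ec2.Instance(fid)
    for tag in ec2instance.tags or []:
        if tag["Key"] == tagkey:
            return tag["Value"]
    return ''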

Next up, this function will list all instances with a certain Tag name and certain Value on that tag.

import boto3

def list_instances_by_tag_value(tagkey, tagvalue):
    # When passed a tag key, tag value this will return a list of InstanceIds that were found.

    ec2client = boto3.client('ec2')

    response = ec2client.describe_instances(
        Filters=[
            {
                'Name': 'tag:'+tagkey,
                'Values': [tagvalue]
            }
        ]
    )
    instancelist = []
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            instancelist.append(instance["InstanceId"])
    return instancelist

Here I use ‘response’ to collect the instances which fall into the Filter used. Take note that I used 'tag:' + tagkey as the filter name. Using tag-value instead would return any instance that has this value on any tag, and tag-key would return any instance that has a tag with that name, regardless of the value. I want a specific tag with a specific value.
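
A quick usage sketch tying the two functions together; the tag key and value here are made up for illustration:

# 'Environment' / 'Production' are hypothetical -- substitute a tag you actually use.
for instanceid in list_instances_by_tag_value('Environment', 'Production'):
    print(instanceid + ' - ' + get_instance_name(instanceid))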

Getting the Size of an S3 Bucket using Boto3 for AWS

I’m writing this on 9/14/2016. I make note of the date because the size of an S3 bucket may seem like a very basic bit of information, but AWS does not have an easy method with which to collect it. I fully expect them to add that functionality at some point. As of this date, I could only come up with 2 methods to get the size of a bucket. One could list all of the bucket’s items and iterate over the objects while keeping a running total. That method does work, but I found that for a bucket with many thousands of items, it could take hours per bucket.

A better method uses CloudWatch metrics instead. When an S3 bucket is created, AWS also begins recording 2 CloudWatch storage metrics for it, and I use one of those to pull the Average size over a set period, usually 1 day.

Here’s what I came up with:

 
import boto3
import datetime

now = datetime.datetime.now()

cw = boto3.client('cloudwatch')
s3client = boto3.client('s3')

# Get a list of all buckets
allbuckets = s3client.list_buckets()

# Header Line for the output going to standard out
print('Bucket'.ljust(45) + 'Size in Bytes'.rjust(25))

# Iterate through each bucket
for bucket in allbuckets['Buckets']:
    # For each bucket item, look up the corresponding metrics from CloudWatch
    response = cw.get_metric_statistics(Namespace='AWS/S3',
                                        MetricName='BucketSizeBytes',
                                        Dimensions=[
                                            {'Name': 'BucketName', 'Value': bucket['Name']},
                                            {'Name': 'StorageType', 'Value': 'StandardStorage'}
                                        ],
                                        Statistics=['Average'],
                                        Period=3600,
                                        StartTime=(now-datetime.timedelta(days=1)).isoformat(),
                                        EndTime=now.isoformat()
                                        )
    # The CloudWatch metrics will have a single datapoint, so we just report on it.
    for item in response["Datapoints"]:
        print(bucket["Name"].ljust(45) + str("{:,}".format(int(item["Average"]))).rjust(25))
        # Note the use of "{:,}".format.   
        # This is a new shorthand method to format output.
        # I just discovered it recently. 
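
S3 actually publishes 2 storage metrics, the other being NumberOfObjects. If you want an object count per bucket as well, a snippet like this can sit inside the same bucket loop; note that the StorageType dimension for this metric is 'AllStorageTypes':

    # The object-count metric works the same way, but with a different StorageType dimension.
    countresponse = cw.get_metric_statistics(Namespace='AWS/S3',
                                             MetricName='NumberOfObjects',
                                             Dimensions=[
                                                 {'Name': 'BucketName', 'Value': bucket['Name']},
                                                 {'Name': 'StorageType', 'Value': 'AllStorageTypes'}
                                             ],
                                             Statistics=['Average'],
                                             Period=3600,
                                             StartTime=(now-datetime.timedelta(days=1)).isoformat(),
                                             EndTime=now.isoformat()
                                             )
    for item in countresponse["Datapoints"]:
        print('  Objects: ' + "{:,}".format(int(item["Average"])))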

Using Python Boto3 with Amazon AWS S3 Buckets

Here I’m adding some additional Python Boto3 examples, this time working with S3 Buckets.

So to get started, let’s create the S3 resource, client, and get a listing of our buckets.

import boto3

s3 = boto3.resource('s3')
s3client = boto3.client('s3')

response = s3client.list_buckets()
for bucket in response["Buckets"]:
    print(bucket['Name'])

Here we create the s3 client object and call ‘list_buckets()’. The response is a dictionary with a key called ‘Buckets’ that holds a list of dicts, one with each bucket’s details.
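
Each of those bucket dicts carries a ‘CreationDate’ alongside the ‘Name’, so you can print both:

for bucket in response["Buckets"]:
    print(bucket['Name'] + ' created ' + str(bucket['CreationDate']))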

To list out the objects within a bucket, we can add the following:

    theobjects = s3client.list_objects_v2(Bucket=bucket["Name"])
    for obj in theobjects["Contents"]:
        print(obj["Key"])

Note that if the Bucket has no items, then there will be no Contents to list and you will get a “KeyError: ‘Contents’” thrown.

Each object returned is a dictionary with Key Value pairs describing the object. The Boto3 Docs are your friend here: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
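
For example, you can guard against that empty-bucket KeyError and pull a couple of the other fields, such as ‘Size’ and ‘LastModified’, at the same time:

    theobjects = s3client.list_objects_v2(Bucket=bucket["Name"])
    # Guard against empty buckets, which return no 'Contents' key at all.
    if 'Contents' in theobjects:
        for obj in theobjects["Contents"]:
            print(obj["Key"] + ' - ' + str(obj["Size"]) + ' bytes, modified ' + str(obj["LastModified"]))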

Now if the Bucket has over 1,000 items, list_objects is limited to 1,000 replies at a time. To get around this, we need to use a Paginator.

import boto3

s3 = boto3.resource('s3')
s3client = boto3.client('s3')

response = s3client.list_buckets()
for bucket in response["Buckets"]:
    # Create a paginator to pull 1000 objects at a time
    paginator = s3client.get_paginator('list_objects')
    pageresponse = paginator.paginate(Bucket=bucket["Name"])

    # pageresponse holds 1000 objects at a time and will continue to repeat in chunks of 1000.
    for pageobject in pageresponse:
        # Same empty-bucket guard as above.
        if 'Contents' in pageobject:
            for file in pageobject["Contents"]:
                print(file["Key"])
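
As an aside, the resource API offers a shortcut for this: a bucket’s objects collection pages through the results for you behind the scenes. A minimal sketch:

# The resource-level collections handle pagination automatically.
for bucket in s3.buckets.all():
    for obj in bucket.objects.all():
        print(obj.key)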

How to use Python Boto3 to list Instances in Amazon AWS

Continuing on with simple examples to help beginners learn the basics of Python and Boto3.

This is a very simple tutorial showing how to get a list of instances in your Amazon AWS environment.

import boto3    
ec2client = boto3.client('ec2')
response = ec2client.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        # This sample print will output entire Dictionary object
        print(instance)
        # This will output the value of the Dictionary key 'InstanceId'
        print(instance["InstanceId"])

You can also create a resource object from the instance item as well.

...
    for instance in reservation["Instances"]:
        ec2 = boto3.resource('ec2')
        specificinstance = ec2.Instance(instance["InstanceId"])
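
With that resource object in hand, you can read attributes of the instance directly, for example:

        # A couple of the attributes available on the Instance resource object.
        print(specificinstance.instance_type)
        print(specificinstance.state["Name"])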

Getting started with Amazon AWS and Boto3

If you are starting from scratch with using Python and Boto3 to script and automate Amazon AWS environments, then this should help get you going.

To begin, you’ll need a few items:
1) Download and install the latest Amazon AWS CLI. This is the Command Line client for AWS.
2) You will need credentials for an Amazon AWS environment of course. There are lots of articles on how to set up the AWS CLI with an AWS account.
3) Download and install the latest Python Release from python.org. For these examples I’m using Python 3.5.2.
4) You will need to ‘pip install boto3’ in a Python environment. You can do this from the command line. I recommend creating a Virtual Environment for AWS-Boto3 to keep packages separate.
5) If you’ve never used a Python IDE before, try the free PyCharm Community Edition. I use it and it really helps speed up your coding.

At this point, you should have a working AWS CLI, a Python Interpreter, and have pip installed the boto3 library.

It’s a good idea to keep the boto3 documentation handy in another browser tab. Because of how boto3 works, there is no premade library and, when working in PyCharm, no autocomplete. Having the docs available to reference the objects and methods saves a lot of time.

In my experience with Boto3, there are resources and there are clients.

import boto3
ec2 = boto3.resource('ec2')
ec2client = boto3.client('ec2')

I use the resource to get information or take action on a specific item. I think of it as being at a ‘higher’ level than the client. When using the boto3 resource, you usually have to provide an id of the item. For example:

import boto3
ec2 = boto3.resource('ec2')
vpc = ec2.Vpc('vpc-12345678')

There is usually no searching or enumerating items at the resource level. Now that the vpc object is defined, you can look at its attributes, references, or collections. You will also have a series of actions you can perform on the item.

# Attributes
print(vpc.cidr_block)
print(vpc.state)
# Actions
vpc.attach_internet_gateway(InternetGatewayId="igw-123456")
vpc.delete()
# Collections
vpclist = vpc.instances.all()
for instance in vpclist:
    print(instance)

When using a client instead of a resource, you get a level of control over AWS items that is very close to the AWS CLI interface. With the client, you get more detail, can search, and can get very granular in your filters and tasks. Almost every resource has a client available.
For example:

import boto3
# Ec2
ec2 = boto3.resource('ec2')
ec2client = boto3.client('ec2')
# S3
s3 = boto3.resource('s3')
s3client = boto3.client('s3')

One of the most useful benefits of using a client is that you can describe the AWS items in that resource, you can filter or iterate for specific items, and manipulate or take actions on those items.
In the VPC example above, when defining the VPC resource, we needed to know the ID of the vpc. With a client, you can list all the items and find the one you want.

import boto3
ec2 = boto3.resource('ec2')
ec2client = boto3.client('ec2')
response = ec2client.describe_vpcs()
print(response)

What you get back is a dictionary with the important item called “Vpcs”. This dictionary item is the output containing all the VPCs. (We could expect this from the Boto3 docs).
So let’s list out all the VPCs from the response.

import boto3
ec2 = boto3.resource('ec2')
ec2client = boto3.client('ec2')
response = ec2client.describe_vpcs()
for vpc in response["Vpcs"]:
    print(vpc)

Now you can see how it starts to break down into the info you want. Here you can see a list of dictionaries, one per VPC.
Next, let’s get to the individual items of each VPC.

import boto3
ec2 = boto3.resource('ec2')
ec2client = boto3.client('ec2')
response = ec2client.describe_vpcs()
for vpc in response["Vpcs"]:
    print("VpcId = " + vpc["VpcId"] + " uses Cidr of " + vpc["CidrBlock"])

That’s the very basics of it. Post any questions in the comments below. And I’ll be adding more posts on Boto3 in the coming days and weeks.