How to find the oldest file in an S3 bucket

This Python example searches for the oldest file in a bucket and prints it.

Make sure to install the SDK first: pip install boto3

 

import boto3

# Specify the bucket name
bucket_name = 'my-bucket'

s3 = boto3.resource('s3')

bucket = s3.Bucket(bucket_name)

# Initialize oldest_file and oldest_date to None
oldest_file = None
oldest_date = None

# Iterate through all files in the bucket
for obj in bucket.objects.all():
    # If oldest_file is None, or this file was modified before the oldest file
    # then update oldest_file and oldest_date to this file's name and last_modified date
    if oldest_file is None or obj.last_modified < oldest_date:
        oldest_file = obj.key
        oldest_date = obj.last_modified

# Print the oldest file's name and its last_modified date
if oldest_file is not None:
    print('The oldest file is {0} and was last modified on {1}'.format(oldest_file, oldest_date))
else:
    print('No files in bucket')
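As an aside, the same search can be written more compactly with Python's built-in min(); a quick sketch, assuming the same bucket handle as above:

# Returns the ObjectSummary of the oldest object, or None for an empty bucket
oldest = min(bucket.objects.all(), key=lambda o: o.last_modified, default=None)
if oldest is not None:
    print('The oldest file is {0} and was last modified on {1}'.format(oldest.key, oldest.last_modified))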

 

Getting Started with AWS Glue: A Comprehensive Guide


Introduction

AWS Glue is a fully managed, serverless data integration service offered by Amazon Web Services (AWS) that simplifies the process of extracting, transforming, and loading (ETL) data for analytics purposes. With its scalable, pay-as-you-go model, and a wide range of built-in features, AWS Glue has become a popular choice for data engineers and analysts to streamline their data workflows. In this blog post, we'll walk you through the process of getting started with AWS Glue, from setting up the necessary components to running your first ETL job.

Understanding AWS Glue Components

Before diving into AWS Glue, it's essential to understand its core components:

a. AWS Glue Data Catalog - A central metadata repository that stores information about your data sources, transformations, and targets. The Data Catalog helps manage and discover data assets across various data stores.

b. AWS Glue Crawlers - Automated programs that connect to your data source, extract metadata, and store it in the Data Catalog.

c. AWS Glue ETL Jobs - Scripts that read data from a source, apply transformations, and write the output to a target. These jobs are written in either Python or Scala and run on AWS Glue's distributed, serverless Apache Spark environment.

d. AWS Glue Triggers - Event-driven mechanisms that can start, stop, or chain ETL jobs based on a schedule or the completion of another job.

Setting Up AWS Glue

To get started with AWS Glue, you'll need to perform the following steps:

a. Sign in to your AWS Management Console and navigate to the AWS Glue service.

b. Set up an AWS Identity and Access Management (IAM) role for AWS Glue. This role defines the permissions required to access the necessary resources, such as data stores and Amazon S3 buckets.

c. Create an Amazon S3 bucket to store your data, scripts, and output files. Make sure to configure appropriate access permissions.
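If you'd rather script steps b and c, here's a minimal boto3 sketch; the role and bucket names are placeholders, and the managed AWSGlueServiceRole policy only covers the Glue basics (you'd still grant S3 access to your own buckets):

import json
import boto3

iam = boto3.client('iam')
s3 = boto3.client('s3')

# Trust policy that lets the Glue service assume the role
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'glue.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}
iam.create_role(RoleName='MyGlueRole',
                AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(
    RoleName='MyGlueRole',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole',
)

# Bucket for data, scripts, and output (name and region are placeholders)
s3.create_bucket(Bucket='my-glue-demo-bucket',
                 CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'})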

Creating and Running a Crawler

A crawler connects to your data source, extracts metadata, and creates table definitions in the Data Catalog. To create a crawler:

a. In the AWS Glue Console, navigate to Crawlers and click "Add Crawler."

b. Provide a name, description, and choose the IAM role created earlier.

c. Configure the data store and connection settings, such as the data source type (e.g., S3, JDBC), path or connection URL, and any necessary authentication information.

d. Choose or create a database in the Data Catalog to store the table definitions.

e. Configure a schedule for the crawler to run (e.g., on-demand, hourly, daily).

f. Review the configuration and create the crawler. You can now run the crawler to populate the Data Catalog with table definitions.
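The console steps above map onto a couple of API calls if you prefer to script it; a sketch with boto3, reusing the placeholder role and bucket from the setup step:

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='my-crawler',
    Role='MyGlueRole',
    DatabaseName='my_catalog_db',
    Targets={'S3Targets': [{'Path': 's3://my-glue-demo-bucket/raw/'}]},
    Schedule='cron(0 2 * * ? *)',  # optional: daily at 02:00 UTC; omit to run on demand
)
glue.start_crawler(Name='my-crawler')  # populate the Data Catalog now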

Creating and Running an ETL Job

Now that your Data Catalog is populated, you can create an ETL job to process the data:

a. In the AWS Glue Console, navigate to Jobs and click "Add Job."

b. Provide a name, description, and select the IAM role created earlier.

c. Choose a data source and target from the Data Catalog.

d. Select an ETL language (Python or Scala) and configure the job properties, such as the number of data processing units (DPUs) and timeout.

e. Write or generate an ETL script to define the transformations. AWS Glue can auto-generate a script based on the selected source and target, but you may need to customize it to meet your requirements.

f. Save and run the job. Monitor the progress and view the output in the specified target location.
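Scripted, the same job setup looks roughly like this with boto3; the script location, capacity, and timeout values are just illustrative:

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='my-etl-job',
    Role='MyGlueRole',
    Command={
        'Name': 'glueetl',  # Spark ETL job
        'ScriptLocation': 's3://my-glue-demo-bucket/scripts/my_etl_script.py',
        'PythonVersion': '3',
    },
    MaxCapacity=2.0,  # DPUs; 2 is the minimum for a Spark job
    Timeout=60,       # minutes
)

run = glue.start_job_run(JobName='my-etl-job')
print('Started run:', run['JobRunId'])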


Automating ETL Workflows with Triggers

To automate your ETL workflows, you can use triggers to start, stop, or chain jobs based on specific conditions:

a. In the AWS Glue Console, navigate to Triggers and click "Add Trigger."

b. Provide a name, description, and select a trigger type (schedule, job event, or on-demand).

c. If you choose a schedule-based trigger, configure the schedule (e.g., cron expression). For a job event-based trigger, select the parent job(s) that should trigger the current job upon completion.

d. Add the job(s) that you want to trigger, and set any conditions (e.g., run only if the parent job succeeds).

e. Review the configuration and create the trigger.
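As a sketch of the API equivalent (job names hypothetical): a scheduled trigger is the first call below, while a job event-based trigger uses a predicate on the parent job's state:

import boto3

glue = boto3.client('glue')

# Schedule-based trigger: run the job nightly at 03:00 UTC
glue.create_trigger(
    Name='nightly-run',
    Type='SCHEDULED',
    Schedule='cron(0 3 * * ? *)',
    Actions=[{'JobName': 'my-etl-job'}],
    StartOnCreation=True,
)

# Job event-based trigger: run a downstream job only if the parent succeeded
glue.create_trigger(
    Name='after-etl',
    Type='CONDITIONAL',
    Predicate={'Conditions': [{
        'LogicalOperator': 'EQUALS',
        'JobName': 'my-etl-job',
        'State': 'SUCCEEDED',
    }]},
    Actions=[{'JobName': 'my-downstream-job'}],
    StartOnCreation=True,
)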


Monitoring and Troubleshooting

AWS Glue provides various monitoring and troubleshooting features to help you manage your ETL jobs:

a. Use AWS Glue Console's job history and logs to track job progress, view runtime statistics, and analyze errors.

b. Enable Amazon CloudWatch metrics and alarms for monitoring job performance and sending notifications based on specific thresholds.

c. Access the underlying Apache Spark logs and UI for a more in-depth analysis of your ETL job execution.
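A quick way to check job health from code, sketched with boto3 (job name hypothetical):

import boto3

glue = boto3.client('glue')

# List the most recent runs with their state and any error message
for run in glue.get_job_runs(JobName='my-etl-job', MaxResults=10)['JobRuns']:
    print(run['Id'], run['JobRunState'], run.get('ErrorMessage', ''))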


Conclusion

In this blog post, we've introduced you to AWS Glue, its core components, and the process of setting up and running ETL jobs. By leveraging AWS Glue's serverless, pay-as-you-go model, you can streamline your data integration workflows and focus on deriving valuable insights from your data. Don't hesitate to explore AWS Glue further and dive deeper into its advanced features to make the most out of this powerful data integration service.


Disclaimer: Generated by GPT but checked by a Brian.

PowerShell Oracle DB Backup

Hi all,

I thought I would share this quick PowerShell script I created to export an Oracle database, 7-Zip it, and then upload it to Amazon S3.

Export.ps1

param([String]$dumpname=(Get-Date -Format dd-MM-yyyy))

Set-Alias sz "$env:ProgramFiles\7-Zip\7z.exe"

# Export the database with Oracle Data Pump
expdp <myDbUser>/<mypasswd> `
    "DIRECTORY=dmpdir" "DUMPFILE=$dumpname.dmp" `
    "LOGFILE=$dumpname.log"

# Compress the dump and log, upload to S3, then clean up
sz a -mx "$dumpname.7z" "$dumpname.*"
Write-S3Object -BucketName "company-ps" `
    -ProfileName brian.keating `
    -File "$dumpname.7z" -Key "customer/dbdumps/$dumpname.7z"
Remove-Item "$dumpname.dmp"
Remove-Item "$dumpname.log"

       

 

How it works:

It uses the current date as the dumpname variable (unless specified as a parameter to the script)

It uses Oracle Data Pump (expdp) to export the database, and 7-Zip to compress it (assuming 7-Zip is installed under %ProgramFiles%)

It then writes the 7z file to AWS S3

Finally it removes the dump and log files.

AWS PowerShell

AWS PowerShell is installed on this server and I have already set up a profile “brian.keating” with the following command.

Set-AWSCredentials -AccessKey <ACCESSKEY> -SecretKey <KEY> -StoreAs brian.keating

 

Trivia

The server in this case is running in AWS, the 7-Zipped DB is about 4 GB, and it uploads to S3 in about 30 seconds (nice!)

Out with the old in with the new(er)

With 2016 drawing to a close and 2017 already in full swing for me, I thought this was a good opportunity to reflect on how 2016 went and what 2017 has in store for me from a technological point of view.

2016

If asked how 2016 was from a professional perspective, I'd probably sum it up as follows: “Technology continued to roll out at an ever-increasing pace; not only was new technology appearing faster than ever before, existing technology stacks started to iterate and churn under our feet!”

Nearing the end of 2016 was when I finally admitted defeat and realized that I can't keep up with everything. While I sure am greedy and want to know everything about everything, it was getting to the point that I was becoming a ‘Jack of all trades and master of none’, dare I say a full stack developer! I'd actually like to think I'm master of some, but it was certainly a big effort to stay on top of everything.

What did I get up to?

Azure: I got certified in Azure; this was without doubt my most prized professional achievement of 2016. I've been using Azure for years and I feel quite confident in acclaiming it the best public cloud in the world today.
I've also started work on a state-of-the-art data distribution network using serverless architecture. I finally got down and dirty with Swagger/API Apps/LogicApps/AzureFunctions.
I got a lot better at networking and load balancing resiliency; Azure/AWS causes a devops inner persona to ooze its way to the top.
I listened with bated breath to the weekly Azure Podcast to see what was new (and always scratched my head when Cale got excited about blockchains; perhaps next year I'll look back and kick myself for not being an early adopter, it does seem to be an area that's heating up).

AWS: I got certified as an AWS Solutions Architect. It was great to get a better understanding of AWS, and indeed for a few offerings I'd choose it over Azure. I got heavily involved in AWS CloudFormation and helped regain some control over AWS madness.

Google Cloud: I spent a few weeks playing with it just to see how it's coming along; at least now I'm somewhat informed, but I'd only consider myself a beginner (I'd consider Google Cloud a beginner also; unless they put massive investment into the portal and services, they simply can't compete with Azure and AWS).

Docker: I can create images, start and stop them, and I understand volumes. I didn't get as far as any of the clustering techniques such as Swarm, but I see huge value in Docker!

AngularJS: Architected and delivered a cutting-edge data visualization system based on Angular 1.x, TypeScript, SCSS, and gulp.
Introduced AngularJS into multiple smaller projects.

Typescript: This is a fantastic language, especially now with all the bells and whistles in v2.1 (not least async/await for ES5 targets). If you are writing any JavaScript you need to learn this; no one will ever convince me that a dynamically typed language is better than a statically typed one, for starters, and with all the new standards-based features now baked in, it's certainly taking the industry by storm. I can't see how Babel will continue to fight for its place in the world alongside it.

Ionic2: I wrote another mobile app. I've done this in many languages to date: I started out with iPhones and Xamarin C#, moved to Objective-C and Java, and finally settled back on the TypeScript/Angular2-based Ionic2 framework. It's a pleasure to deal with, and with my other investments in the underlying stack it has become a natural fit.

Java8: Finally spent some time getting up to speed on the new JDK and its offerings. While not strictly Java8, I'm including Spring Boot, the WildFly 10 application server, CDI, JAX-RS, etc. in this section.

Camel: Gained a basic understanding and working knowledge of the Camel EAI framework.

ActiveMQ: I debated about putting this one on here; all queues fulfil the same core requirements to pass messages, right? But I did approach ActiveMQ from three different sides (Camel/C#/Java), so that was interesting.

.NET 2017: I'm now informed about what's coming down the line. Some interesting things like C#7 (which I will admit I had to read twice before I saw the value in the language changes) and better support for the web stacks (although I'll admit with a tear in my eye that I've moved to JetBrains software and am unlikely to come back to Visual Studio unless it's an ASP-based backend).

Client Products: It's not only the development stacks that have been changing; products in use by my clients have been moving at a rapid pace also, and given they pay the bills, I dedicate a fair amount of time to understanding them in depth.

Resource Consumption:
DNR - Listened to nearly every episode of DotNetRocks.
Hanselminutes - Funnily enough, I found DotNetRocks because I used to subscribe to Hanselminutes. I say used to, as I've finally given up on Hanselminutes; it appears to have moved in a different direction in the last year or so. Don't get me wrong, Scott is a great guy and one of the best technical speakers in MS if you ask me (I even follow the weekly ASP.NET stand-ups, which he's in); it's only the podcast that I gave up on in December.

Other recommended podcasts:
Angular In Action
Javascript Jabber
RunAs Radio
Azure Podcast

2017

As you can imagine, it takes a lot of time to get proficient in any of the stacks I've mentioned above. I've been trying to stay on top of them all, and I've now reached the point of realization that I need to let some go (think of Kate Winslet prying Leonardo DiCaprio's icy fingers off that board she was on; it'll be oh so sad). I'm going to narrow down the field. I'll still keep in touch with them, and if I encounter anything I don't understand then I'll make it my business to understand it; I simply won't actively go pursuing them all. I've been burnt before with that approach: I learned Silverlight after all. It wasn't all bad, as I wrote a Windows Phone app and many WPF apps around the same time, so the experience transferred nicely; it's just that I'm not writing much WPF these days, so I'll put effort back in that direction only if and when needed.

Q: So the question remains: where am I going to put my extended effort this year?
A: An Azure-first approach. Azure will be the primary topic of my blogs; whether the implementation is in C#/JavaScript/TypeScript/Java I don't really care. If the backend is .NET or Java, again I don't really care, but I do intend on blogging about practical use cases for Azure services. I may even create a video or two!

Happy new year!

Azure vs. AWS Text to Blob with SDKs

This demonstrates what is involved in writing and reading some text to an Azure and an AWS blob.

Use case

What I set out to achieve was to demonstrate how to read and write some text to a blob with the SDKs. Just to make it a little more interesting, I decided to use .NET for the reading and Java for the writing.

Obtaining the SDKs

Adding the SDKs was a seamless process: for .NET, NuGet was used, and for Java, Maven.

 

[Screenshots: adding the SDKs via NuGet (.NET) and Maven (Java)]

 

Write

Azure

[Screenshot: Java code writing the text to an Azure blob]

 

AWS

[Screenshot: Java code writing the text to an S3 object]

 

Read

Azure

[Screenshot: .NET code reading the text back from the Azure blob]

 

AWS

[Screenshot: .NET code reading the text back from the S3 object]
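The original code screenshots didn't survive the blog migration, so as a stand-in here's the whole round trip sketched in Python rather than the .NET/Java of the original post; the connection string, container, bucket, and key names are all placeholders:

from azure.storage.blob import BlobServiceClient
import boto3

text = 'hello blob'

# Azure: write then read a text blob
service = BlobServiceClient.from_connection_string('<connection-string>')
blob = service.get_blob_client(container='demo', blob='greeting.txt')
blob.upload_blob(text, overwrite=True)
print(blob.download_blob().readall().decode('utf-8'))

# AWS: write then read the same text as an S3 object
s3 = boto3.client('s3', region_name='eu-west-1')  # AWS insists on a region
s3.put_object(Bucket='demo-bucket', Key='greeting.txt', Body=text.encode('utf-8'))
print(s3.get_object(Bucket='demo-bucket', Key='greeting.txt')['Body'].read().decode('utf-8'))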

Conclusions

Both SDKs were trivial to install and use. The Azure SDKs suited my use case a little better in that they didn't need me to deal with files in my application code (I expect text is not a mainstream use case).

AWS, as always, relies on the region being specified, which I can't say I like that much.

Media Indexing In the Cloud

So out of the blue I found myself giving Azure Media Indexing a trial run, for no reason other than that I could. This is why I love cloud tech so much: it brings something that would have been very difficult 5-10 years ago within reach of anyone with a public cloud account.

AWS vs Azure

Both AWS and Azure have media services, typically used to manage digital media and serve it up to consumer playback devices at scale.

AWS has Elastic Transcoder and Azure has Azure Media Services; however, only Azure has the ability to dig into audio or media files and extract the text within.

Azure Media Indexer

Azure Media Indexer enables you to make content of your media files searchable and to generate a full-text transcript for closed captioning and keywords. You can process one media file or multiple media files in a batch. Have a look at this post for some details on how to do it from code https://azure.microsoft.com/en-us/documentation/articles/media-services-index-content/

The code uploads a file, then starts an indexing job, then downloads the results:
Note: The source code has a typo; I've submitted a pull request, so hopefully it will be fixed, but it's easy to spot.
Also, the download part failed with an exception for me, so I just pulled the results down with a little bit of code on a second pass.

[Screenshot: the upload/index/download code]

 

The above code is possibly all you need if you wish to upload content and start the indexing job manually with the old portal.

Here’s how:

On your Media Services account, upload some content

[Screenshots: uploading content in the portal]

 

Once the content is uploaded, start the indexer process. Set a good title, as Azure will reach out to the interweb and use it to seed the language extraction.

[Screenshots: starting the indexing job]

 

There is no way to download the output from the portal, so use the code I shared above to download the content.

I processed the latest podcast at time of publishing from https://www.dotnetrocks.com/
https://s3.amazonaws.com/dnr/dotnetrocks_1276_news_from_build.mp3 

In hindsight it was possibly not the best podcast to index, as it was recorded live @Build (I expect; I'm two episodes behind on DNR this week so have not listened to it yet). The DNR guys typically have exceedingly good audio, so at some stage it might be worth indexing another episode.

Results

You can find the results here. Initially my knee-jerk reaction was “ah, this is poor”, but after reflecting on it I'm blown away by what was done, and so so, sooo easily!

With a bit of editing this can be thrown into Azure Search / SQL Server etc. for full-text search and direct-seek media playback.

See for yourself:

[Screenshot: an excerpt of the generated transcript]

 

For sure it needs some editing, e.g.
“I release the eleventh music decode by”
should in fact be
“I released the eleventh music to code by”

but what a great start!!!

Cloud costs: Shut those VM’s down

The public cloud is fantastic for numerous reasons. If you're not faced with some restriction, such as where your data lives or other factors, then my advice is to get away from private clouds and get to the public clouds as fast as your legs can carry you!

However, once you're there it's not all plain sailing. If you let a team of people loose to play with all these new toys on the back of your company's credit card, then costs can start to accumulate very quickly!

Sometimes your VMs are not being used for production, and what invariably happens is that these machines get forgotten about or are left running for no good reason. While there are a few ways to capture such scenarios, what I'll show you now is a very quick way of scheduling those known VMs to shut down (or start up) on a predefined schedule.

AWS

For AWS, the easiest way of scheduling a single standalone VM to shut down is to use the AWS Data Pipeline service.


Let's quickly show the workflow:

1) Create new Pipeline with CLI Command


 

2) Enter the Stop EC2 CLI commands

[Screenshot: the pipeline activity form with the stop commands entered]

Note: This field only shows as one line of text vertically in Chrome, so I modified the styles to show the full command.


You can see that I have two different stop commands. I could combine these into one command with the two IDs, but then if one fails they both fail, which can be problematic if, for example, an instance gets terminated.
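For reference, the commands entered in step 2 are just plain AWS CLI calls along these lines (the instance IDs are placeholders):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 stop-instances --instance-ids i-0fedcba9876543210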

3) Schedule


 

4) Set log file bucket


 

5) Select role


Choose custom and then select the two defaults.
Security Note: Roles need to be configured to allow Data Pipeline access to your VMs; please see here: https://aws.amazon.com/premiumsupport/knowledge-center/stop-start-ec2-instances/

6) Done


That's it, you now have a scheduled task that will switch off your VMs nightly. It should be noted that this will start a micro EC2 instance for the Data Pipeline run with a default runtime of 50 minutes, so you need to ensure the end justifies the means; better yet, reduce the runtime by editing the workflow to, e.g., 15 minutes.


 

Azure

In order to achieve the same results with Azure, we are going to use Azure Automation.

If you're familiar with Azure you will know that there are currently two ways of creating VMs: the classic approach and the RM (Resource Manager) approach. In this post I'll show you the RM approach, but feel free to substitute classic in its place with a nearly identical approach.

1) Open or create an Azure Automation account.


 

2) Edit Assets


Add a variable for the AzureSubscriptionId you'll be using, and select your service principal account (you'll have to search for it to appear).

3) Runbook

We have two options now: we can use either some PowerShell or some graphically defined workflows. Let's do this with a graphical version; we don't need to create it, we simply import it from the gallery.


After importing, choose Edit on the runbook.


4) Set inputs

[Screenshot: the runbook's input parameters]

Then we set the two Assets we provided earlier, and optionally a ResourceGroupName (to stop all VMs in a resource group) or a VMName. The “Auto” you see above isn't a keyword; it's my badly named ResourceGroup.

5) Publish


 

6) Set schedule

Go back to the Runbook and choose schedule


With the schedule you can specify any of the input parameters and override the defaults if you so wish.

Security Note: Much the same as with AWS, you'll need to ensure you have permission to access the VMs from Azure Automation; the best option is to create a Service Principal application. See: https://azure.microsoft.com/en-us/documentation/articles/resource-group-create-service-principal-portal/

 

Conclusion:

While it does look like the Azure approach is much more convoluted, it is also much more powerful. For example, it is very easy to extend the Azure runbook to check all VMs for a “Production” tag and only shut down VMs that are not production (because shutting those down would be bad, right!); with AWS, we are simply relying on a feature of Data Pipeline that allows us to run simple CLI commands.

Pricing is much of a muchness between the two; with Azure you can run for free (to a limit)

[Screenshot: Azure Automation free-tier limits]

With AWS, the 15 minutes on a micro instance is not even worth worrying about.

Web App deployment to AWS and Azure

As promised, here is the first instalment of the AWS vs Azure blog post saga; again, I'm trying to remain impartial throughout.

What I intend to outline at this stage is how to get started deploying a new application to AWS and to Azure from within Visual Studio. I'm sure there are those of you shouting, “.NET, Visual Studio, Azure? Of course Azure will do it better!!!”; however, rest assured this is only the first of a few posts related to Azure App Service and AWS Elastic Beanstalk, and AWS doesn't fare all that badly.

Sample Application

The sample application in this case is just a File/New ASP.NET MVC5 project using .NET 4.6.1. I'm only hitting the home page as a test and not worrying about databases for now (databases will make another interesting series of blog posts!).

AWS Elastic Beanstalk

AWS has an AWS Toolkit plugin for Visual Studio; this allows you to view and manipulate AWS resources.


It also lets you publish applications to AWS by right-clicking on the solution and choosing “Publish to AWS”.


 

Once you choose this option you’ll be presented with a dialog that lets you choose your environment or create a new one.

 

If you don't already have one, let's create one; you will choose a name for the environment.


 

Next you choose your instance size (the underlying VM size, or any custom Amazon Machine Image you've created previously). Other options of interest include the non-default VPC; this is basically the network you'll be running on, and all AWS accounts get a default VPC per region (if you delete it you'll need to contact AWS to get it back!). The single instance environment option is selected here as this is just a test. If I wasn't running in single instance mode, I would be able to enable Rolling Deployments to keep my app running while it gets updated (more about that here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.rollingupdates.html)


 

Lastly we choose the application settings; I'm just deploying a .NET 4 runtime debug application.


Once you review and finish, you can see your application start deploying on the portal


Once it's finished, which can take a few minutes after the upload, you should see the Health go green and you can access your application.


 

Note: If you're following along and wish to stop this Elastic Beanstalk environment to minimize costs/free-tier bandwidth, then please ensure you terminate it from the Elastic Beanstalk section of the console. Stopping the underlying EC2 instance will only signal the auto scaling group it belongs to to start a new instance and restore the health of the application.
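If you prefer the CLI for that cleanup, termination looks something like this (the environment name is a placeholder):

aws elasticbeanstalk terminate-environment --environment-name my-test-env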

Azure App Service

Now let's deploy this same application to Azure. Right-click on the solution in Solution Explorer and choose Publish


 

Choose to publish to Azure


 

As with AWS, where we chose a server environment, here we need to choose an app hosting plan. With Azure you can sign up for a free trial; if you have a subscription you can choose to deploy a free cloud app (you get 10 free per region; there are some limitations, which we are not concerned with just now).


 

After creating this new hosting plan we arrive back at the publish dialog


Visual Studio then starts the publish task and opens the application in the default web browser specified in Visual Studio.


You can also see your new application spring to life in the Azure portal: http://portal.azure.com


 

Summary

So in this blog post I've run through how to deploy applications to PaaS offerings on AWS and Azure. In the next post I'm going to drill down and do some more comparing and contrasting of these two offerings. Stay tuned!