Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool

Today, we’re delighted to present the Amazon EMR CLI, a brand-new command line tool to package and deploy PySpark projects across different Amazon EMR environments. With the introduction of the EMR CLI, you now have a simple way to not only deploy a wide range of PySpark projects to remote EMR environments, but also integrate with your CI/CD solution of choice.

In this post, we demonstrate how you can use the EMR CLI to create a new PySpark project from scratch and deploy it to Amazon EMR Serverless in one command.

Overview of solution

The EMR CLI is an open-source tool to help improve the developer experience of developing and deploying jobs on Amazon EMR. When you’re just getting started with Apache Spark, there are a variety of options with regard to how to package, deploy, and run jobs, and the choices can be overwhelming or require deep domain expertise. The EMR CLI provides simple commands for these actions that remove the guesswork from deploying Spark jobs. You can use it to create new projects or alongside existing PySpark projects.

In this post, we walk through creating a new PySpark project that analyzes weather data from the NOAA Global Surface Summary of Day open dataset. We’ll use the EMR CLI to do the following:

  1. Initialize the project.
  2. Package the dependencies.
  3. Deploy the code and dependencies to Amazon Simple Storage Service (Amazon S3).
  4. Run the job on EMR Serverless.


For this walkthrough, you should have the following prerequisites:

If you don’t already have an existing EMR Serverless application, you can use the following AWS CloudFormation template or use the emr bootstrap command after you’ve installed the CLI.


Install the EMR CLI

You can find the source for the EMR CLI in the GitHub repo, but it’s also distributed through PyPI. It requires Python version >= 3.7 to run and is tested on macOS, Linux, and Windows. To install the latest version, use the following command:
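Assuming the tool is published to PyPI under the package name emr-cli (an assumption based on the CLI’s command name), installing it is a single pip command:

```shell
# Install the EMR CLI from PyPI (package name assumed to be emr-cli)
pip3 install emr-cli
```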

You should now be able to run the emr --help command and see the different subcommands you can use:

 ❯ emr --help

  Package, deploy, and run PySpark projects on EMR.

Options:
  --help  Show this message and exit.

Commands:
  bootstrap  Bootstrap an EMR Serverless environment.
  deploy     Copy a local project to S3.
  init       Initialize a local PySpark project.
  package    Package a project and dependencies into dist/.
  run        Run a project on EMR, optionally build and deploy.

If you didn’t already create an EMR Serverless application, the bootstrap command can create a sample environment for you and a configuration file with the relevant settings. Assuming you used the provided CloudFormation stack, set the following environment variables using the information on the Outputs tab of your stack. Set the Region in the terminal to us-east-1 and set a few other environment variables we’ll need along the way:

 export AWS_REGION=us-east-1
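The emr run command later in this post also references ${APPLICATION_ID}, ${JOB_ROLE_ARN}, and ${S3_BUCKET}. Assuming your CloudFormation outputs map to those names, the full set of exports might look like the following (all values other than the Region are placeholders; substitute your own):

```shell
# Region that hosts the NOAA GSOD public dataset
export AWS_REGION=us-east-1
# Placeholder values -- copy the real ones from your stack's Outputs tab
export APPLICATION_ID=00aabbccddeeff00
export JOB_ROLE_ARN=arn:aws:iam::123456789012:role/emr-serverless-job-role
export S3_BUCKET=my-emr-cli-bucket
```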

We use us-east-1 because that’s where the NOAA GSOD data bucket is. EMR Serverless can access S3 buckets and other AWS resources in the same Region by default. To access other services, configure EMR Serverless with VPC access.

Initialize a project

Next, we use the emr init command to initialize a default PySpark project for us in the provided directory. The default templates create a simple Python project that uses pyproject.toml to define its dependencies. In this case, we use Pandas and PyArrow in our script, so those are already pre-populated.

 ❯ emr init my-project
[emr-cli]: Initializing project in my-project
[emr-cli]: Project initialized.

After the project is initialized, you can run cd my-project or open the my-project directory in your code editor of choice. You should see the following set of files:

├── Dockerfile
├── entrypoint.py
├── jobs
│   └── extreme_weather.py
└── pyproject.toml
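The pyproject.toml file is where the Pandas and PyArrow dependencies mentioned above are declared. A minimal sketch of such a file follows; the project name and the exact dependency pins are illustrative, not the template’s literal contents:

```toml
[project]
name = "my-project"
version = "0.0.1"
dependencies = [
    "pandas",
    "pyarrow",
]
```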

Note that we also have a Dockerfile here. This is used by the package command to ensure that our project dependencies are built on the right architecture and operating system for Amazon EMR.

If you use Poetry to manage your Python dependencies, you can also add a --project-type poetry flag to the emr init command to create a Poetry project.

If you already have an existing PySpark project, you can use emr init --dockerfile to create the Dockerfile necessary to package things up.

Run the project

Now that we’ve got our sample project created, we need to package our dependencies, deploy the code to Amazon S3, and start a job on EMR Serverless. With the EMR CLI, you can do all of that in one command. Make sure to run the command from the my-project directory:

 emr run \
  --entry-point entrypoint.py \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
  --build \
  --wait

This command performs several actions:

  1. Auto-detects the type of Spark project in the current directory.
  2. Starts a build for your project to package up dependencies.
  3. Copies your entry point and resulting build files to Amazon S3.
  4. Starts an EMR Serverless job.
  5. Waits for the job to finish, exiting with an error status if it fails.

You should now see the following output in your terminal as the job begins running in EMR Serverless:

[emr-cli]: Job submitted to EMR Serverless (Job Run ID: 00f8uf1gpdb12r0l)
[emr-cli]: Waiting for job to complete...
[emr-cli]: Job state is now: SCHEDULED
[emr-cli]: Job state is now: RUNNING
[emr-cli]: Job state is now: SUCCESS
[emr-cli]: Job completed successfully!

And that’s it! If you want to run the same code on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), you can replace --application-id with --cluster-id j-11111111. The CLI will take care of sending the right spark-submit commands to your EMR cluster.
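For example, the same job targeted at an EMR on EC2 cluster might look like the following. The cluster ID is a placeholder, and entrypoint.py is assumed to be the template’s default entry point; no --job-role is passed because an EMR on EC2 cluster already carries its own instance roles:

```shell
emr run \
  --entry-point entrypoint.py \
  --cluster-id j-11111111 \
  --s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
  --build \
  --wait
```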

Now let’s walk through some of the other commands.

emr package

PySpark projects can be packaged in many different ways, from a single file to a complex Poetry project with numerous dependencies. The EMR CLI can help consistently package your projects without you having to worry about the details.

For example, if you have a single file in your project directory, the package command doesn’t need to do anything. If, however, you have multiple files in a typical Python project style, the emr package command will zip these files up as a package that can later be uploaded to Amazon S3 and provided to your PySpark job using the --py-files option. If you have third-party dependencies defined in pyproject.toml, emr package will create a virtual environment archive and start your EMR job with the spark.archives option.

The EMR CLI also supports Poetry for dependency management and packaging. If you have a Poetry project with a corresponding poetry.lock file, there’s nothing else you need to do. The emr package command will detect your poetry.lock file and automatically build the project using the Poetry Bundle plugin. You can use a Poetry project in two ways:

  • Create a project using the emr init command. The command takes a --project-type poetry option that creates a Poetry project for you:
     ❯ emr init --project-type poetry emr-poetry
    [emr-cli]: Initializing project in emr-poetry
    [emr-cli]: Project initialized.
     ❯ cd emr-poetry
     ❯ poetry install

  • If you have a pre-existing project, you can use the emr init --dockerfile option, which creates a Dockerfile that is automatically used when you run emr package.

Finally, as noted previously, the EMR CLI provides you a default Dockerfile based on Amazon Linux 2 that you can use to reliably build package artifacts that are compatible with different EMR environments.

emr deploy

The emr deploy command takes care of copying the necessary artifacts for your project to Amazon S3, so you don’t have to worry about it. Regardless of how the project is packaged, emr deploy will copy the resulting files to your Amazon S3 location of choice.

One use case for this is with CI/CD pipelines. Sometimes you want to deploy a specific version of code to Amazon S3 to be used in your data pipelines. With emr deploy, this is as simple as changing the --s3-code-uri parameter.

For example, let’s assume you’ve already packaged your project using the emr package command. Most CI/CD pipelines allow you to access the git tag. You can use that as part of the emr deploy command to deploy a new version of your artifacts. In GitHub Actions, this is github.ref_name, and you can use this in an action to deploy a versioned artifact to Amazon S3. See the following code:

 emr deploy \
  --entry-point entrypoint.py \
  --s3-code-uri s3://<BUCKET_NAME>/<PREFIX>/${{ github.ref_name }}/

In your downstream jobs, you could then update the location of your entry point files to point to this new location when you’re ready, or you can use the emr run command discussed in the next section.
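Wired into a GitHub Actions workflow, that deploy step might look like the following sketch. The workflow structure, action versions, bucket and prefix names, and the use of a repository secret for the bucket are all illustrative assumptions:

```yaml
name: Deploy versioned artifacts
on:
  push:
    tags: ["*"]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: pip3 install emr-cli
      - name: Deploy artifacts to S3
        run: |
          emr deploy \
            --entry-point entrypoint.py \
            --s3-code-uri s3://${{ secrets.BUCKET_NAME }}/artifacts/${{ github.ref_name }}/
```

A real workflow would also need AWS credentials configured (for example, via an OIDC role assumption step) before the deploy step runs.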

emr run

Let’s take a closer look at the emr run command. We’ve used it before to package, deploy, and run in one command, but you can also use it to run already-deployed artifacts. Let’s look at the specific options:

 ❯ emr run --help
Usage: emr run [OPTIONS]

  Run a project on EMR, optionally build and deploy.

Options:
  --application-id TEXT     EMR Serverless Application ID
  --cluster-id TEXT         EMR on EC2 Cluster ID
  --entry-point FILE        Python or Jar file for the main entrypoint
  --job-role TEXT           IAM Role ARN to use for the job execution
  --wait                    Wait for job to finish
  --s3-code-uri TEXT        Where to copy/run code artifacts to/from
  --job-name TEXT           The name of the job
  --job-args TEXT           Comma-delimited string of arguments to be passed
                            to Spark job
  --spark-submit-opts TEXT  String of spark-submit options
  --build                   Package and deploy job artifacts
  --show-stdout             Show the stdout of the job after it's finished
  --help                    Show this message and exit.

If you want to run your code on EMR Serverless, the emr run command takes --application-id and --job-role parameters. If you want to run on EMR on EC2, you only need the --cluster-id option.

Required for both options are --entry-point and --s3-code-uri. --entry-point is the main script that will be called by Amazon EMR. If you have any dependencies, --s3-code-uri is where they get uploaded to using the emr deploy command, and the EMR CLI will build the relevant spark-submit properties pointing to these artifacts.

There are a few different ways to customize the job:

  • --job-name – Allows you to specify the job or step name
  • --job-args – Allows you to provide command line arguments to your script
  • --spark-submit-opts – Allows you to add additional spark-submit options like --conf spark.jars or others
  • --show-stdout – Currently only works with single-file .py jobs on EMR on EC2, but will display stdout in your terminal after the job is complete
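Putting those options together, a customized run against already-deployed artifacts (so no --build flag) might look like the following. The job name, job arguments, and Spark configuration shown are illustrative, as is the entrypoint.py file name:

```shell
emr run \
  --entry-point entrypoint.py \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
  --job-name extreme-weather-2022 \
  --job-args "2022" \
  --spark-submit-opts "--conf spark.executor.memory=4g" \
  --wait
```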

As we’ve seen before, --build invokes both the package and deploy commands. This makes it easier to iterate on local development when your code still needs to run remotely. You can simply use the same emr run command over and over again to build, deploy, and run your code in your environment of choice.

Future updates

The EMR CLI is under active development. Updates are currently in progress to support Amazon EMR on EKS and allow for the creation of local development environments to make local iteration of Spark jobs even easier. Feel free to contribute to the project in the GitHub repository.

Clean up

To avoid incurring future charges, stop or delete your EMR Serverless application. If you used the CloudFormation template, be sure to delete your stack.


Conclusion

With the release of the EMR CLI, we’ve made it easier for you to deploy and run Spark jobs on EMR Serverless. The utility is available as open source on GitHub. We’re planning a host of new features; if you have specific requests, feel free to file an issue or open a pull request!

About the author

Damon is a Principal Developer Advocate on the EMR team at AWS. He’s worked with data and analytics pipelines for over ten years and splits his time between splitting service logs and stacking firewood.
