Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool

Today, we’re delighted to present the Amazon EMR CLI, a brand-new command line tool to package and deploy PySpark projects across different Amazon EMR environments. With the introduction of the EMR CLI, you now have a simple way to not only deploy a wide range of PySpark projects to remote EMR environments, but also integrate with your CI/CD solution of choice.

In this post, we demonstrate how you can use the EMR CLI to create a new PySpark project from scratch and deploy it to Amazon EMR Serverless in one command.

Overview of solution

The EMR CLI is an open-source tool to help improve the developer experience of developing and deploying jobs on Amazon EMR. When you’re just getting started with Apache Spark, there are a variety of options with regard to how to package, deploy, and run jobs, and the choices can be overwhelming or require deep domain expertise. The EMR CLI provides simple commands for these actions that remove the guesswork from deploying Spark jobs. You can use it to create new projects or alongside existing PySpark projects.

In this post, we walk through creating a new PySpark project that analyzes weather data from the NOAA Global Surface Summary of Day open dataset. We’ll use the EMR CLI to do the following:

  1. Initialize the project.
  2. Package the dependencies.
  3. Deploy the code and dependencies to Amazon Simple Storage Service (Amazon S3).
  4. Run the job on EMR Serverless.


For this walkthrough, you should have the following prerequisites:

If you don’t already have an existing EMR Serverless application, you can use the following AWS CloudFormation template or use the emr bootstrap command after you’ve installed the CLI.


Install the EMR CLI

You can find the source for the EMR CLI in the GitHub repo, but it’s also distributed through PyPI. It requires Python version >= 3.7 to run and is tested on macOS, Linux, and Windows. To install the latest version, use the following command:
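Assuming the tool is published to PyPI under the package name emr-cli (an assumption based on the CLI’s command name), installing it is a single pip command:

```shell
# Install the EMR CLI from PyPI (package name assumed to be emr-cli)
pip3 install emr-cli
```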

You should now be able to run the emr --help command and see the different subcommands you can use:

 ❯ emr --help

  Package, deploy, and run PySpark projects on EMR.

Options:
  --help  Show this message and exit.

Commands:
  bootstrap  Bootstrap an EMR Serverless environment.
  deploy     Copy a local project to S3.
  init       Initialize a local PySpark project.
  package    Package a project and dependencies into dist/.
  run        Run a project on EMR, optionally build and deploy.

If you didn’t already create an EMR Serverless application, the bootstrap command can create a sample environment for you and a configuration file with the relevant settings. Assuming you used the provided CloudFormation stack, set the following environment variables using the information on the Outputs tab of your stack. Set the Region in the terminal to us-east-1 and set a few other environment variables we’ll need along the way:

 export AWS_REGION=us-east-1
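The emr run command later in this post also references ${APPLICATION_ID}, ${JOB_ROLE_ARN}, and ${S3_BUCKET}. Assuming your CloudFormation outputs map to those names, the full set of exports might look like the following (all values other than the Region are placeholders; substitute your own):

```shell
# Region that hosts the NOAA GSOD public dataset
export AWS_REGION=us-east-1
# Placeholder values -- copy the real ones from your stack's Outputs tab
export APPLICATION_ID=00aabbccddeeff00
export JOB_ROLE_ARN=arn:aws:iam::123456789012:role/emr-serverless-job-role
export S3_BUCKET=my-emr-cli-bucket
```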

We use us-east-1 because that’s where the NOAA GSOD data bucket is. EMR Serverless can access S3 buckets and other AWS resources in the same Region by default. To access other services, configure EMR Serverless with VPC access.

Initialize a project

Next, we use the emr init command to initialize a default PySpark project for us in the provided directory. The default templates create a simple Python project that uses pyproject.toml to define its dependencies. In this case, we use Pandas and PyArrow in our script, so those are already pre-populated.

 ❯ emr init my-project
[emr-cli]: Initializing project in my-project
[emr-cli]: Project initialized.

After the project is initialized, you can run cd my-project or open the my-project directory in your code editor of choice. You should see the following set of files:

├── Dockerfile
├── entrypoint.py
├── jobs
│   └── extreme_weather.py
└── pyproject.toml
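The pyproject.toml file is where the Pandas and PyArrow dependencies mentioned above are declared. A minimal sketch of such a file follows; the project name and the exact dependency pins are illustrative, not the template’s literal contents:

```toml
[project]
name = "my-project"
version = "0.0.1"
dependencies = [
    "pandas",
    "pyarrow",
]
```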

Note that we also have a Dockerfile here. This is used by the package command to ensure that our project dependencies are built on the right architecture and operating system for Amazon EMR.

If you use Poetry to manage your Python dependencies, you can also add a --project-type poetry flag to the emr init command to create a Poetry project.

If you already have an existing PySpark project, you can use emr init --dockerfile to create the Dockerfile necessary to package things up.

Run the project

Now that we’ve got our sample project created, we need to package our dependencies, deploy the code to Amazon S3, and start a job on EMR Serverless. With the EMR CLI, you can do all of that in one command. Make sure to run the command from the my-project directory:

 emr run \
  --entry-point entrypoint.py \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
  --build \
  --wait

This command performs several actions:

  1. Auto-detects the type of Spark project in the current directory.
  2. Starts a build for your project to package up dependencies.
  3. Copies your entry point and resulting build files to Amazon S3.
  4. Starts an EMR Serverless job.
  5. Waits for the job to finish, exiting with an error status if it fails.

You should now see the following output in your terminal as the job begins running in EMR Serverless:

[emr-cli]: Job submitted to EMR Serverless (Job Run ID: 00f8uf1gpdb12r0l)
[emr-cli]: Waiting for job to complete...
[emr-cli]: Job state is now: SCHEDULED
[emr-cli]: Job state is now: RUNNING
[emr-cli]: Job state is now: SUCCESS
[emr-cli]: Job completed successfully!

And that’s it! If you want to run the same code on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), you can replace --application-id with --cluster-id j-11111111. The CLI will take care of sending the right spark-submit commands to your EMR cluster.
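For example, the same job targeted at an EMR on EC2 cluster might look like the following. The cluster ID is a placeholder, and entrypoint.py is assumed to be the template’s default entry point; no --job-role is passed because an EMR on EC2 cluster already carries its own instance roles:

```shell
emr run \
  --entry-point entrypoint.py \
  --cluster-id j-11111111 \
  --s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
  --build \
  --wait
```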

Now let’s walk through some of the other commands.

emr package

PySpark projects can be packaged in many different ways, from a single file to a complex Poetry project with numerous dependencies. The EMR CLI can help consistently package your projects without you having to worry about the details.

For example, if you have a single file in your project directory, the package command doesn’t need to do anything. If, however, you have multiple files in a typical Python project style, the emr package command will zip these files up as a package that can later be uploaded to Amazon S3 and provided to your PySpark job using the --py-files option. If you have third-party dependencies defined in pyproject.toml, emr package will create a virtual environment archive and start your EMR job with the spark.archives option.

The EMR CLI also supports Poetry for dependency management and packaging. If you have a Poetry project with a corresponding poetry.lock file, there’s nothing else you need to do. The emr package command will detect your poetry.lock file and automatically build the project using the Poetry Bundle plugin. You can use a Poetry project in two ways:

  • Create a project using the emr init command. The command takes a --project-type poetry option that creates a Poetry project for you:
     ❯ emr init --project-type poetry emr-poetry
    [emr-cli]: Initializing project in emr-poetry
    [emr-cli]: Project initialized.
     ❯ cd emr-poetry
     ❯ poetry install

  • If you have a pre-existing project, you can use the emr init --dockerfile option, which creates a Dockerfile that is automatically used when you run emr package.

Finally, as noted previously, the EMR CLI provides you a default Dockerfile based on Amazon Linux 2 that you can use to reliably build package artifacts that are compatible with different EMR environments.

emr deploy

The emr deploy command takes care of copying the necessary artifacts for your project to Amazon S3, so you don’t have to worry about it. Regardless of how the project is packaged, emr deploy will copy the resulting files to your Amazon S3 location of choice.

One use case for this is with CI/CD pipelines. Sometimes you want to deploy a specific version of code to Amazon S3 to be used in your data pipelines. With emr deploy, this is as simple as changing the --s3-code-uri parameter.

For example, let’s assume you’ve already packaged your project using the emr package command. Most CI/CD pipelines allow you to access the git tag. You can use that as part of the emr deploy command to deploy a new version of your artifacts. In GitHub Actions, this is github.ref_name, and you can use this in an action to deploy a versioned artifact to Amazon S3. See the following code:

 emr deploy \
  --entry-point entrypoint.py \
  --s3-code-uri s3://<BUCKET_NAME>/<PREFIX>/${{ github.ref_name }}/

In your downstream jobs, you could then update the location of your entry point files to point to this new location when you’re ready, or you can use the emr run command discussed in the next section.
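Wired into a GitHub Actions workflow, that deploy step might look like the following sketch. The workflow structure, action versions, bucket and prefix names, and the use of a repository secret for the bucket are all illustrative assumptions:

```yaml
name: Deploy versioned artifacts
on:
  push:
    tags: ["*"]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: pip3 install emr-cli
      - name: Deploy artifacts to S3
        run: |
          emr deploy \
            --entry-point entrypoint.py \
            --s3-code-uri s3://${{ secrets.BUCKET_NAME }}/artifacts/${{ github.ref_name }}/
```

A real workflow would also need AWS credentials configured (for example, via an OIDC role assumption step) before the deploy step runs.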

emr run

Let’s take a closer look at the emr run command. We’ve used it before to package, deploy, and run in one command, but you can also use it to run already-deployed artifacts. Let’s look at the specific options:

 ❯ emr run --help
Usage: emr run [OPTIONS]

  Run a project on EMR, optionally build and deploy.

Options:
  --application-id TEXT     EMR Serverless Application ID
  --cluster-id TEXT         EMR on EC2 Cluster ID
  --entry-point FILE        Python or Jar file for the main entrypoint
  --job-role TEXT           IAM Role ARN to use for the job execution
  --wait                    Wait for job to finish
  --s3-code-uri TEXT        Where to copy/run code artifacts to/from
  --job-name TEXT           The name of the job
  --job-args TEXT           Comma-delimited string of arguments to be passed
                            to Spark job
  --spark-submit-opts TEXT  String of spark-submit options
  --build                   Package and deploy job artifacts
  --show-stdout             Show the stdout of the job after it's finished
  --help                    Show this message and exit.

If you want to run your code on EMR Serverless, the emr run command takes --application-id and --job-role parameters. If you want to run on EMR on EC2, you only need the --cluster-id option.

Required for both options are --entry-point and --s3-code-uri. --entry-point is the main script that will be called by Amazon EMR. If you have any dependencies, --s3-code-uri is where they get uploaded to using the emr deploy command, and the EMR CLI will build the relevant spark-submit properties pointing to these artifacts.

There are a few different ways to customize the job:

  • --job-name – Allows you to specify the job or step name
  • --job-args – Allows you to provide command line arguments to your script
  • --spark-submit-opts – Allows you to add additional spark-submit options like --conf spark.jars or others
  • --show-stdout – Currently only works with single-file .py jobs on EMR on EC2, but will display stdout in your terminal after the job is complete
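Putting those options together, a customized run against already-deployed artifacts (so no --build flag) might look like the following. The job name, job arguments, and Spark configuration shown are illustrative, as is the entrypoint.py file name:

```shell
emr run \
  --entry-point entrypoint.py \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
  --job-name extreme-weather-2022 \
  --job-args "2022" \
  --spark-submit-opts "--conf spark.executor.memory=4g" \
  --wait
```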

As we’ve seen before, --build invokes both the package and deploy commands. This makes it easier to iterate on local development when your code still needs to run remotely. You can simply use the same emr run command over and over again to build, deploy, and run your code in your environment of choice.

Future updates

The EMR CLI is under active development. Updates are currently in progress to support Amazon EMR on EKS and allow for the creation of local development environments to make local iteration of Spark jobs even easier. Feel free to contribute to the project in the GitHub repository.

Clean up

To avoid incurring future charges, stop or delete your EMR Serverless application. If you used the CloudFormation template, be sure to delete your stack.


Conclusion

With the release of the EMR CLI, we’ve made it easier for you to deploy and run Spark jobs on EMR Serverless. The utility is available as open source on GitHub. We’re planning a host of new features; if you have specific requests, feel free to file an issue or open a pull request!

About the author

Damon is a Principal Developer Advocate on the EMR team at AWS. He’s worked with data and analytics pipelines for over ten years and splits his time between splitting service logs and stacking firewood.
