As noted in our brief primer on Dataproc, there are two ways to create and control a Spark cluster on Dataproc: through a form in Google's web-based console, or directly through
gcloud, a.k.a. the Google Cloud SDK. Using the form is easier because it shows you all the possible options for configuring your cluster, but
gcloud is faster when you already know what you want, and you can automate it with a script.
To follow these instructions, you will need a Google login and billing account, as well as the
gcloud command-line utility installed on your computer.
In the browser, from your Google Cloud console, click the main menu's triple-bar icon (the one that looks like an abstract hamburger) in the upper-left corner. Navigate to Menu > Dataproc > Clusters. If this is your first time landing here, click the
Enable API button and wait a few minutes while it enables.
Here's some guidance for filling out the cluster creation form.
The name must be all lowercase, must start and end with a letter, and may contain no spaces or special characters except dashes. Cluster names within a project must be unique, and it's possible to reuse the names of deleted clusters. I named mine parsley because I was growing parsley this summer.
Google Cloud resources are compartmentalized into Regions, which are themselves further divided into Zones, corresponding to the physical locations of the datacenters where the machines are housed. You can find a full list/diagram here. There are three things to watch out for here:
By default the cluster region is set to
global, which neatly sidesteps that latter constraint but costs you a bit more money. You can set it to something more specific, but keep in mind that if you change it to something different from your
gcloud region, your job submissions will fail with errors about nonexistent clusters, for example:
ERROR: (gcloud.dataproc.jobs.submit.spark) NOT_FOUND: No current cluster for project id 'XYZ' with name 'ABC'
The other consequence is that the
Zone parameter should be set to where most of your data resides, to minimize egress fees. Assuming you configured your
gcloud settings to match where (most of) your data lives, you can set the Zone on your cluster to match that as well.
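If you're not sure what your local gcloud configuration currently says, you can check and set it from the terminal. Here's a quick sketch, using us-central1-a purely as an example value; substitute the zone closest to your data:

```shell
# Show the current gcloud configuration, including any compute region/zone defaults
gcloud config list

# Set the default zone (and matching region) to where your data lives;
# us-central1-a / us-central1 are example values only
gcloud config set compute/zone us-central1-a
gcloud config set compute/region us-central1
```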
These determine the number and type of nodes (=machines) that will be in your cluster. The more nodes and the beefier the nodes, the faster your command will run, but the more money it'll cost you. See our Spark-enabled tool and workflow documentation for recommendations where available. Some of our benchmarking efforts are still underway so we may not have recommendations for everything.
The defaults are set to use a basic
n1-standard-1 machine as master and two of the same as workers, all with 10 GB boot disks attached.
The advanced options are accordioned under Preemptible workers, bucket, network, version, initialization, & access options. The most useful of these is
Initialization actions, which allows you to run your own script(s) on the nodes when they get started, e.g. to install programs on the nodes. The scripts have to live in a GCS bucket.
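As a sketch of what that looks like on the command-line side, you can pass a script to the equivalent gcloud command with the --initialization-actions flag. The bucket and script path below are hypothetical placeholders:

```shell
# Run a custom setup script on each node at startup;
# gs://my-bucket/install-tools.sh is a hypothetical path to your own script
gcloud dataproc clusters create parsley \
    --initialization-actions gs://my-bucket/install-tools.sh
```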
Note that the
Image Version setting under this section changes the version of the underlying Dataproc API (afaict), which by default is set to the latest stable version. As of this writing, the default
Image Version is v1.1. Selecting another version, e.g. an earlier version, may cause your GATK command to fail. We try to stay up to date with Google's updates, but sometimes there's a short lag between an update on Dataproc and our ability to update GATK accordingly. When that happens, switching to the previous image version sometimes works. If you run into this and get stuck, let us know in the forum and we'll do our best to get you unstuck.
Click the Create button and you're off to the races. Note that it can take a few minutes for the cluster to provision.
In this view, so long as the status of this VM is "Running", it is costing you money. As of this writing, Dataproc rounds billing up to ten minutes if you use less, and then bills by the minute, according to the Google pricing docs.
You can view the individual cluster nodes under Menu > Compute Engine > VM instances. If there are many VMs listed (e.g. if you share the billing project with others), type the name of the cluster into the text box to filter for specific cluster VMs. Each of the master and worker nodes will be listed separately, e.g. parsley-m, parsley-w-0 and parsley-w-1, where
m stands for master and
w stands for worker.
Instead of clicking through a GUI in your web browser to generate a cluster, you can use the
gcloud command-line utility to create a cluster straight from your terminal. Assuming you've already set up
gcloud on your machine and set up authentication so
gcloud can use your login credentials, you can send commands to Dataproc or any other Google Cloud service you want.
Here is the command that creates a new cluster, called
basil, that is identical in specifications to
parsley. You'll need to replace the
project_id value with your billing project ID.
gcloud dataproc clusters create basil \
    --zone us-central1-a \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 10 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 10 \
    --project project_id
Again, we are careful to match our
--zone to that of our local
gcloud setup, e.g. in my case
us-central1-a. By omitting the
--region argument, we allow Dataproc to set it to the default, global.
Once you've run this command, you can go check the Dataproc status page in the console, where you'll see the new cluster appear if everything went well. Or you can just stay in the terminal and run this
gcloud command to get a text version of the Dataproc status summary:
gcloud dataproc clusters list
We won't go into detail on this here; we have a separate tutorial in the works focused on that. The key points are that you'll need to add a few Spark-specific arguments to your GATK command, using the syntax indicated here:
gatk-launch [GATK arguments] -- [Spark arguments]
Don't forget that
-- separator before the Spark arguments; it's not a copy-paste error, it's necessary to tell the command-line parser to interpret those arguments differently.
For running your jobs on Dataproc, the two most important arguments are
--sparkRunner GCS, which tells GATK you want to run on Google, and
--cluster your_cluster_name which identifies which cluster you want to run on.
Remember that your data will need to be located in Google Cloud Storage (GCS), and your file paths will need to be prefixed
gs:// accordingly. And be sure to look into jar caching to speed up the process!
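Putting those pieces together, a Spark-enabled GATK submission to the basil cluster might look something like this sketch. The tool name and file paths are hypothetical placeholders, not a recommendation for any particular analysis:

```shell
# Hypothetical example: run a Spark-enabled GATK tool on the basil cluster.
# The GATK arguments come first, then the -- separator, then the Spark arguments.
gatk-launch PrintReadsSpark \
    -I gs://my-bucket/my-reads.bam \
    -O gs://my-bucket/output.bam \
    -- \
    --sparkRunner GCS \
    --cluster basil
```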
Seriously. As long as your cluster is in "Running" state, it's costing you money, even if it's not doing anything. Fortunately, a recent Dataproc update makes it possible to set an auto-shutdown timer to avoid incurring catastrophic expenses. The relevant arguments are the following:
--max-idle=30m to shut down after thirty minutes of idleness
--expiration-time, given in ISO 8601 date-time format
--max-age=3d for deletion three days after creation
You can use these individually or in any combination you like; your cluster will be shut down as soon as one of the conditions you specify is satisfied.
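As a sketch, adding an idle timer and a maximum age to the basil cluster from earlier would look like this. Depending on your SDK version, these flags may only be available through the beta track (gcloud beta dataproc ...):

```shell
# Auto-delete after 30 minutes of idleness or 3 days after creation,
# whichever condition is satisfied first
gcloud dataproc clusters create basil \
    --zone us-central1-a \
    --max-idle=30m \
    --max-age=3d
```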
But generally you should shut down your cluster manually once you know you're done with it. You can do so in the Console, or through
gcloud. Note that shutting down the cluster does not delete it; you can restart it and use it again at will. There is of course a separate deletion command you can use to delete unwanted clusters permanently.
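For reference, deleting a cluster from the terminal looks like this, using the basil cluster from earlier as the example:

```shell
# Permanently delete the cluster named basil (you'll be asked to confirm)
gcloud dataproc clusters delete basil
```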