
googleDataprocCluster

Manages a Cloud Dataproc cluster resource within GCP.

!> Warning: Due to limitations of the API, all arguments except labels, clusterConfigWorkerConfigNumInstances and clusterConfigPreemptibleWorkerConfigNumInstances are non-updatable. Changing any other argument will force recreation of the whole cluster!

Example Usage - Basic

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as google from "./.gen/providers/google";
/*The following providers are missing schema information and might need manual adjustments to synthesize correctly: google.
For a more precise conversion please use the --provider flag in convert.*/
new google.dataprocCluster.DataprocCluster(this, "simplecluster", {
  name: "simplecluster",
  region: "us-central1",
});

Example Usage - Advanced

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as google from "./.gen/providers/google";
/*The following providers are missing schema information and might need manual adjustments to synthesize correctly: google.
For a more precise conversion please use the --provider flag in convert.*/
const googleServiceAccountDefault = new google.serviceAccount.ServiceAccount(
  this,
  "default",
  {
    account_id: "service-account-id",
    display_name: "Service Account",
  }
);
new google.dataprocCluster.DataprocCluster(this, "mycluster", {
  cluster_config: [
    {
      gce_cluster_config: [
        {
          service_account: googleServiceAccountDefault.email,
          service_account_scopes: ["cloud-platform"],
          tags: ["foo", "bar"],
        },
      ],
      initialization_action: [
        {
          script:
            "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh",
          timeout_sec: 500,
        },
      ],
      master_config: [
        {
          disk_config: [
            {
              boot_disk_size_gb: 30,
              boot_disk_type: "pd-ssd",
            },
          ],
          machine_type: "e2-medium",
          num_instances: 1,
        },
      ],
      preemptible_worker_config: [
        {
          num_instances: 0,
        },
      ],
      software_config: [
        {
          image_version: "2.0.35-debian10",
          override_properties: [
            {
              "dataproc:dataproc.allow.zero.workers": "true",
            },
          ],
        },
      ],
      staging_bucket: "dataproc-staging-bucket",
      worker_config: [
        {
          disk_config: [
            {
              boot_disk_size_gb: 30,
              num_local_ssds: 1,
            },
          ],
          machine_type: "e2-medium",
          min_cpu_platform: "Intel Skylake",
          num_instances: 2,
        },
      ],
    },
  ],
  graceful_decommission_timeout: "120s",
  labels: [
    {
      foo: "bar",
    },
  ],
  name: "mycluster",
  region: "us-central1",
});

Example Usage - Using a GPU accelerator

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as google from "./.gen/providers/google";
/*The following providers are missing schema information and might need manual adjustments to synthesize correctly: google.
For a more precise conversion please use the --provider flag in convert.*/
new google.dataprocCluster.DataprocCluster(this, "accelerated_cluster", {
  cluster_config: [
    {
      gce_cluster_config: [
        {
          zone: "us-central1-a",
        },
      ],
      master_config: [
        {
          accelerators: [
            {
              accelerator_count: "1",
              accelerator_type: "nvidia-tesla-k80",
            },
          ],
        },
      ],
    },
  ],
  name: "my-cluster-with-gpu",
  region: "us-central1",
});

Argument Reference

  • name - (Required) The name of the cluster, unique within the project and zone.

  • project - (Optional) The ID of the project in which the cluster will exist. If it is not provided, the provider project is used.

  • region - (Optional) The region in which the cluster and associated nodes will be created. Defaults to global.

  • labels - (Optional, Computed) The list of labels (key/value pairs) to be applied to instances in the cluster. GCP generates some itself, including googDataprocClusterName, which is the name of the cluster.

  • virtualClusterConfig - (Optional) Allows you to configure a virtual Dataproc on GKE cluster. Structure defined below.

  • clusterConfig - (Optional) Allows you to configure various aspects of the cluster. Structure defined below.

  • gracefulDecommissionTimeout - (Optional) Allows graceful decommissioning when you change the number of worker nodes directly through a terraform apply. Does not affect auto-scaling decommissioning from an autoscaling policy. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. The timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). The default timeout is 0 (for forceful decommission), and the maximum allowed timeout is 1 day (see the JSON representation of Duration). Only supported on Dataproc image versions 1.2 and higher. For more context see the docs.
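
For example, a minimal sketch (using the same generated bindings as the examples above) that pairs a worker resize with a graceful decommission window; the cluster name and sizes are illustrative:

import * as google from "./.gen/providers/google";

// Illustrative sketch: lowering worker_config.num_instances on a later apply
// lets in-flight jobs finish for up to 120 seconds before nodes are
// forcefully removed.
new google.dataprocCluster.DataprocCluster(this, "resizable", {
  name: "resizable-cluster",
  region: "us-central1",
  graceful_decommission_timeout: "120s",
  cluster_config: [
    {
      worker_config: [
        {
          num_instances: 2,
        },
      ],
    },
  ],
});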


The virtualClusterConfig block supports:

    virtual_cluster_config {
        auxiliary_services_config { ... }
        kubernetes_cluster_config { ... }
    }
  • stagingBucket - (Optional) The Cloud Storage staging bucket used to stage files, such as Hadoop jars, between client machines and the cluster. Note: If you don't explicitly specify a stagingBucket then GCP will auto create / assign one for you. However, you are not guaranteed an auto generated bucket which is solely dedicated to your cluster; it may be shared with other clusters in the same region/zone also choosing to use the auto generation option.

  • auxiliaryServicesConfig (Optional) Configuration of auxiliary services used by this cluster. Structure defined below.

  • kubernetesClusterConfig (Required) The configuration for running the Dataproc cluster on Kubernetes. Structure defined below.


The auxiliaryServicesConfig block supports:

    virtual_cluster_config {
      auxiliary_services_config {
        metastore_config {
          dataproc_metastore_service = google_dataproc_metastore_service.metastore_service.id
        }

        spark_history_server_config {
          dataproc_cluster = google_dataproc_cluster.dataproc_cluster.id
        }
      }
    }
  • metastoreConfig (Optional) The Hive Metastore configuration for this workload.

    • dataprocMetastoreService (Required) Resource name of an existing Dataproc Metastore service.
  • sparkHistoryServerConfig (Optional) The Spark History Server configuration for the workload.

    • dataprocCluster (Optional) Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.

The kubernetesClusterConfig block supports:

    virtual_cluster_config {
      kubernetes_cluster_config {
        kubernetes_namespace = "foobar"

        kubernetes_software_config {
          component_version = {
            "SPARK" : "3.1-dataproc-7"
          }

          properties = {
            "spark:spark.eventLog.enabled": "true"
          }
        }

        gke_cluster_config {
          gke_cluster_target = google_container_cluster.primary.id

          node_pool_target {
            node_pool = "dpgke"
            roles = ["DEFAULT"]

            node_pool_config {
              autoscaling {
                min_node_count = 1
                max_node_count = 6
              }

              config {
                machine_type      = "n1-standard-4"
                preemptible       = true
                local_ssd_count   = 1
                min_cpu_platform  = "Intel Sandy Bridge"
              }

              locations = ["us-central1-c"]
            }
          }
        }
      }
    }
  • kubernetesNamespace (Optional) A namespace within the Kubernetes cluster to deploy into. If this namespace does not exist, it is created. If it exists, Dataproc verifies that another Dataproc VirtualCluster is not installed into it. If not specified, the name of the Dataproc Cluster is used.

  • kubernetesSoftwareConfig (Required) The software configuration for this Dataproc cluster running on Kubernetes.

    • componentVersion (Required) The components that should be installed in this Dataproc cluster. The key must be a string from the KubernetesComponent enumeration. The value is the version of the software to be installed. At least one entry must be specified.

      • NOTE: componentVersion[SPARK] is mandatory to set, or the creation of the cluster will fail.
    • properties (Optional) The properties to set on daemon config files. Property keys are specified in prefix:property format, for example spark:spark.kubernetes.container.image.

  • gkeClusterConfig (Required) The configuration for running the Dataproc cluster on GKE.

    • gkeClusterTarget (Optional) A target GKE cluster to deploy to. It must be in the same project and region as the Dataproc cluster (the GKE cluster can be zonal or regional).

    • nodePoolTarget (Optional) GKE node pools where workloads will be scheduled. At least one node pool must be assigned the default GkeNodePoolTarget.Role. If a GkeNodePoolTarget is not specified, Dataproc constructs a default GkeNodePoolTarget. Each role can be given to only one GkeNodePoolTarget. All node pools must have the same location settings.

      • nodePool (Required) The target GKE node pool.

      • roles (Required) The roles associated with the GKE node pool. One of "DEFAULT", "CONTROLLER", "SPARK_DRIVER" or "SPARK_EXECUTOR".

      • nodePoolConfig (Input only) The configuration for the GKE node pool. If specified, Dataproc attempts to create a node pool with the specified shape. If one with the same name already exists, it is verified against all specified fields. If a field differs, the virtual cluster creation will fail.

        • autoscaling (Optional) The autoscaler configuration for this node pool. The autoscaler is enabled only when a valid configuration is present.

          • minNodeCount (Optional) The minimum number of nodes in the node pool. Must be >= 0 and <= maxNodeCount.

          • maxNodeCount (Optional) The maximum number of nodes in the node pool. Must be >= minNodeCount, and must be > 0.

        • config (Optional) The node pool configuration.

          • machineType (Optional) The name of a Compute Engine machine type.

          • localSsdCount (Optional) The number of local SSD disks to attach to the node, which is limited by the maximum number of disks allowable per zone.

          • preemptible (Optional) Whether the nodes are created as preemptible VM instances. Preemptible nodes cannot be used in a node pool with the CONTROLLER role or in the DEFAULT node pool if the CONTROLLER role is not assigned (the DEFAULT node pool will assume the CONTROLLER role).

          • minCpuPlatform (Optional) Minimum CPU platform to be used by this instance. The instance may be scheduled on the specified or a newer CPU platform. Specify the friendly names of CPU platforms, such as "Intel Haswell" or "Intel Sandy Bridge".

          • spot (Optional) Spot flag for enabling Spot VM, which is a rebrand of the existing preemptible flag.

        • locations (Optional) The list of Compute Engine zones where node pool nodes associated with a Dataproc on GKE virtual cluster will be located.


The clusterConfig block supports:

    cluster_config {
        gce_cluster_config        { ... }
        master_config             { ... }
        worker_config             { ... }
        preemptible_worker_config { ... }
        software_config           { ... }

        # You can define multiple initialization_action blocks
        initialization_action     { ... }
        encryption_config         { ... }
        endpoint_config           { ... }
        metastore_config          { ... }
    }
  • stagingBucket - (Optional) The Cloud Storage staging bucket used to stage files, such as Hadoop jars, between client machines and the cluster. Note: If you don't explicitly specify a stagingBucket then GCP will auto create / assign one for you. However, you are not guaranteed an auto generated bucket which is solely dedicated to your cluster; it may be shared with other clusters in the same region/zone also choosing to use the auto generation option.

  • tempBucket - (Optional) The Cloud Storage temp bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files. Note: If you don't explicitly specify a tempBucket then GCP will auto create / assign one for you.

  • gceClusterConfig (Optional) Common config settings for resources of Google Compute Engine cluster instances, applicable to all instances in the cluster. Structure defined below.

  • masterConfig (Optional) The Google Compute Engine config settings for the master instances in a cluster. Structure defined below.

  • workerConfig (Optional) The Google Compute Engine config settings for the worker instances in a cluster. Structure defined below.

  • preemptibleWorkerConfig (Optional) The Google Compute Engine config settings for the additional instances in a cluster. Structure defined below.

    • NOTE: preemptibleWorkerConfig is an alias for the API's secondaryWorkerConfig. The name doesn't necessarily mean it is preemptible; it is named as such for legacy/compatibility reasons.
  • softwareConfig (Optional) The config settings for software inside the cluster. Structure defined below.

  • securityConfig (Optional) Security related configuration. Structure defined below.

  • autoscalingConfig (Optional) The autoscaling policy config associated with the cluster. Note that once set, if autoscalingConfig is the only field set in clusterConfig, it can only be removed by setting policyUri = "", rather than removing the whole block. Structure defined below.

  • initializationAction (Optional) Commands to execute on each node after config is completed. You can specify multiple versions of these. Structure defined below.

  • encryptionConfig (Optional) The Customer managed encryption keys settings for the cluster. Structure defined below.

  • lifecycleConfig (Optional) The settings for auto deletion cluster schedule. Structure defined below.

  • endpointConfig (Optional) The config settings for port access on the cluster. Structure defined below.

  • dataprocMetricConfig (Optional) The config for collecting Dataproc OSS metrics from the cluster. Structure defined below.

  • metastoreConfig (Optional) The config setting for metastore service with the cluster. Structure defined below.


The clusterConfigGceClusterConfig block supports:

  cluster_config {
    gce_cluster_config {
      zone = "us-central1-a"

      # One of the below to hook into a custom network / subnetwork
      network    = google_compute_network.dataproc_network.name
      subnetwork = google_compute_network.dataproc_subnetwork.name

      tags = ["foo", "bar"]
    }
  }
  • zone - (Optional, Computed) The GCP zone where your data is stored and used (i.e. where the master and the worker nodes will be created). If region is set to 'global' (default) then zone is mandatory, otherwise GCP is able to make use of Auto Zone Placement to determine this automatically for you. Note: This setting additionally determines and restricts which computing resources are available for use with other configs such as clusterConfigMasterConfigMachineType and clusterConfigWorkerConfigMachineType.

  • network - (Optional, Computed) The name or self_link of the Google Compute Engine network the cluster will be part of. Conflicts with subnetwork. If neither is specified, this defaults to the "default" network.

  • subnetwork - (Optional) The name or self_link of the Google Compute Engine subnetwork the cluster will be part of. Conflicts with network.

  • serviceAccount - (Optional) The service account to be used by the Node VMs. If not specified, the "default" service account is used.

  • serviceAccountScopes - (Optional, Computed) The set of Google API scopes to be made available on all of the node VMs under the serviceAccount specified. Both OAuth2 URLs and gcloud short names are supported. To allow full access to all Cloud APIs, use the cloudPlatform scope. See a complete list of scopes here.

  • tags - (Optional) The list of instance tags applied to instances in the cluster. Tags are used to identify valid sources or targets for network firewalls.

  • internalIpOnly - (Optional) By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. If set to true, all instances in the cluster will only have internal IP addresses. Note: Private Google Access (also known as privateIpGoogleAccess) must be enabled on the subnetwork that the cluster will be launched in (see the sketch after this list).

  • metadata - (Optional) A map of the Compute Engine metadata entries to add to all instances (see Project and instance metadata).

  • reservationAffinity - (Optional) Reservation Affinity for consuming zonal reservation.

    • consumeReservationType - (Optional) Corresponds to the type of reservation consumption.
    • key - (Optional) Corresponds to the label key of reservation resource.
    • values - (Optional) Corresponds to the label values of reservation resource.
  • nodeGroupAffinity - (Optional) Node Group Affinity for sole-tenant clusters.

    • nodeGroupUri - (Required) The URI of a sole-tenant node group resource that the cluster will be created on.
  • shieldedInstanceConfig (Optional) Shielded Instance Config for clusters using Compute Engine Shielded VMs.
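
As referenced in the internalIpOnly note above, here is a hedged sketch of an internal-IP-only cluster. It assumes the same generated provider package also exposes google_compute_network and google_compute_subnetwork bindings (not shown elsewhere on this page); names and CIDR ranges are illustrative:

import * as google from "./.gen/providers/google";

// Hypothetical network/subnetwork; private_ip_google_access must be enabled
// on the subnetwork hosting an internal-IP-only cluster.
const network = new google.computeNetwork.ComputeNetwork(this, "dataproc_net", {
  name: "dataproc-net",
  auto_create_subnetworks: false,
});
const subnetwork = new google.computeSubnetwork.ComputeSubnetwork(
  this,
  "dataproc_subnet",
  {
    name: "dataproc-subnet",
    region: "us-central1",
    network: network.id,
    ip_cidr_range: "10.0.0.0/16",
    private_ip_google_access: true,
  }
);
new google.dataprocCluster.DataprocCluster(this, "private_cluster", {
  name: "private-cluster",
  region: "us-central1",
  cluster_config: [
    {
      gce_cluster_config: [
        {
          subnetwork: subnetwork.name,
          internal_ip_only: true,
        },
      ],
    },
  ],
});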


The clusterConfigGceClusterConfigShieldedInstanceConfig block supports:

cluster_config {
  gce_cluster_config {
    shielded_instance_config {
      enable_secure_boot          = true
      enable_vtpm                 = true
      enable_integrity_monitoring = true
    }
  }
}
  • enableSecureBoot - (Optional) Defines whether instances have Secure Boot enabled.

  • enableVtpm - (Optional) Defines whether instances have the vTPM enabled.

  • enableIntegrityMonitoring - (Optional) Defines whether instances have integrity monitoring enabled.


The clusterConfigMasterConfig block supports:

cluster_config {
  master_config {
    num_instances    = 1
    machine_type     = "e2-medium"
    min_cpu_platform = "Intel Skylake"

    disk_config {
      boot_disk_type    = "pd-ssd"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
  • numInstances - (Optional, Computed) Specifies the number of master nodes to create. If not specified, GCP will default to a predetermined computed value (currently 1).

  • machineType - (Optional, Computed) The name of a Google Compute Engine machine type to create for the master. If not specified, GCP will default to a predetermined computed value (currently n1Standard4).

  • minCpuPlatform - (Optional, Computed) The name of a minimum generation of CPU family for the master. If not specified, GCP will default to a predetermined computed value for each zone. See the guide for details about which CPU families are available (and defaulted) for each zone.

  • imageUri (Optional) The URI for the image to use for the master instances. See the guide for more information.

  • diskConfig (Optional) Disk Config

    • bootDiskType - (Optional) The disk type of the primary disk attached to each node. One of "pdSsd" or "pdStandard". Defaults to "pdStandard".

    • bootDiskSizeGb - (Optional, Computed) Size of the primary disk attached to each node, specified in GB. The primary disk contains the boot volume and system libraries, and the smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

    • numLocalSsds - (Optional) The amount of local SSD disks that will be attached to each master cluster node. Defaults to 0.

  • accelerators (Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified multiple times.

    • acceleratorType - (Required) The short name of the accelerator type to expose to this instance. For example, nvidiaTeslaK80.

    • acceleratorCount - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.

~> The Cloud Dataproc API can return unintuitive error messages when using accelerators; even when you have defined an accelerator, Auto Zone Placement does not exclusively select zones that have that accelerator available. If you get a 400 error that the accelerator can't be found, this is a likely cause. Make sure you check accelerator availability by zone if you are trying to use accelerators in a given zone.


The clusterConfigWorkerConfig block supports:

cluster_config {
  worker_config {
    num_instances    = 3
    machine_type     = "e2-medium"
    min_cpu_platform = "Intel Skylake"

    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
  • numInstances - (Optional, Computed) Specifies the number of worker nodes to create. If not specified, GCP will default to a predetermined computed value (currently 2). There is currently a beta feature which allows you to run a Single Node Cluster. In order to take advantage of this you need to set "dataproc:dataprocAllowZeroWorkers" = "true" in clusterConfigSoftwareConfigProperties (see the single-node sketch after this list).

  • machineType - (Optional, Computed) The name of a Google Compute Engine machine type to create for the worker nodes. If not specified, GCP will default to a predetermined computed value (currently n1Standard4).

  • minCpuPlatform - (Optional, Computed) The name of a minimum generation of CPU family for the worker nodes. If not specified, GCP will default to a predetermined computed value for each zone. See the guide for details about which CPU families are available (and defaulted) for each zone.

  • diskConfig (Optional) Disk Config

    • bootDiskType - (Optional) The disk type of the primary disk attached to each node. One of "pdSsd" or "pdStandard". Defaults to "pdStandard".

    • bootDiskSizeGb - (Optional, Computed) Size of the primary disk attached to each worker node, specified in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

    • numLocalSsds - (Optional) The amount of local SSD disks that will be attached to each worker cluster node. Defaults to 0.

  • imageUri (Optional) The URI for the image to use for this worker. See the guide for more information.

  • accelerators (Optional) The Compute Engine accelerator configuration for these instances. Can be specified multiple times.

    • acceleratorType - (Required) The short name of the accelerator type to expose to this instance. For example, nvidiaTeslaK80.

    • acceleratorCount - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.

~> The Cloud Dataproc API can return unintuitive error messages when using accelerators; even when you have defined an accelerator, Auto Zone Placement does not exclusively select zones that have that accelerator available. If you get a 400 error that the accelerator can't be found, this is a likely cause. Make sure you check accelerator availability by zone if you are trying to use accelerators in a given zone.
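
As referenced above, a hedged sketch of a Single Node Cluster: zero workers are only accepted when the zero-workers property is set, mirroring the override used in the Advanced example; the names are illustrative:

import * as google from "./.gen/providers/google";

// Hypothetical single-node cluster: the master runs the worker daemons too,
// so worker_config.num_instances is 0 and the allow.zero.workers property
// must be "true".
new google.dataprocCluster.DataprocCluster(this, "single_node", {
  name: "single-node-cluster",
  region: "us-central1",
  cluster_config: [
    {
      software_config: [
        {
          override_properties: [
            {
              "dataproc:dataproc.allow.zero.workers": "true",
            },
          ],
        },
      ],
      worker_config: [
        {
          num_instances: 0,
        },
      ],
    },
  ],
});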


The clusterConfigPreemptibleWorkerConfig block supports:

cluster_config {
  preemptible_worker_config {
    num_instances = 1

    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}

Note: Unlike workerConfig, you cannot set the machineType value directly. This will be set for you based on whatever was set for the workerConfigMachineType value.

  • numInstances - (Optional) Specifies the number of preemptible nodes to create. Defaults to 0.

  • preemptibility - (Optional) Specifies the preemptibility of the secondary workers. The default value is PREEMPTIBLE (a Spot example is sketched after this list). Accepted values are:

    • PREEMPTIBILITY_UNSPECIFIED
    • NON_PREEMPTIBLE
    • PREEMPTIBLE
    • SPOT
  • diskConfig (Optional) Disk Config

    • bootDiskType - (Optional) The disk type of the primary disk attached to each preemptible worker node. One of "pdSsd" or "pdStandard". Defaults to "pdStandard".

    • bootDiskSizeGb - (Optional, Computed) Size of the primary disk attached to each preemptible worker node, specified in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

    • numLocalSsds - (Optional) The amount of local SSD disks that will be attached to each preemptible worker node. Defaults to 0.
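
As referenced above, a hedged sketch of a secondary worker pool running on Spot VMs; sizes and names are illustrative:

import * as google from "./.gen/providers/google";

// Hypothetical Spot-based secondary workers. machine_type cannot be set here;
// it follows worker_config.machine_type.
new google.dataprocCluster.DataprocCluster(this, "spot_workers", {
  name: "spot-worker-cluster",
  region: "us-central1",
  cluster_config: [
    {
      preemptible_worker_config: [
        {
          num_instances: 2,
          preemptibility: "SPOT",
          disk_config: [
            {
              boot_disk_size_gb: 30,
            },
          ],
        },
      ],
    },
  ],
});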


The clusterConfigSoftwareConfig block supports:

cluster_config {
  # Override or set some custom properties
  software_config {
    image_version = "2.0.35-debian10"

    override_properties = {
      "dataproc:dataproc.allow.zero.workers" = "true"
    }
  }
}
  • imageVersion - (Optional, Computed) The Cloud Dataproc image version to use for the cluster - this controls the sets of software versions installed onto the nodes when you create clusters. If not specified, defaults to the latest version. For a list of valid versions see Cloud Dataproc versions

  • overrideProperties - (Optional) A list of override and additional properties (key/value pairs) used to modify various aspects of the common configuration files used when creating a cluster. For a list of valid properties please see Cluster properties

  • optionalComponents - (Optional) The set of optional components to activate on the cluster. Accepted values are:

    • ANACONDA
    • DRUID
    • FLINK
    • HBASE
    • HIVE_WEBHCAT
    • JUPYTER
    • PRESTO
    • RANGER
    • SOLR
    • ZEPPELIN
    • ZOOKEEPER
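
A hedged sketch enabling optional components alongside an image version; component availability depends on the chosen image, so the values are illustrative:

import * as google from "./.gen/providers/google";

// Hypothetical software_config enabling Jupyter and Zeppelin on the cluster.
new google.dataprocCluster.DataprocCluster(this, "components_cluster", {
  name: "components-cluster",
  region: "us-central1",
  cluster_config: [
    {
      software_config: [
        {
          image_version: "2.0.35-debian10",
          optional_components: ["JUPYTER", "ZEPPELIN"],
        },
      ],
    },
  ],
});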

The clusterConfigSecurityConfig block supports:

cluster_config {
  # Override or set some custom properties
  security_config {
    kerberos_config {
      kms_key_uri = "projects/projectId/locations/locationId/keyRings/keyRingId/cryptoKeys/keyId"
      root_principal_password_uri = "bucketId/o/objectId"
    }
  }
}
  • kerberosConfig (Required) Kerberos Configuration

    • crossRealmTrustAdminServer - (Optional) The admin server (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

    • crossRealmTrustKdc - (Optional) The KDC (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

    • crossRealmTrustRealm - (Optional) The remote realm the Dataproc on-cluster KDC will trust, should the user enable cross realm trust.

    • crossRealmTrustSharedPasswordUri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the shared password between the on-cluster Kerberos realm and the remote trusted realm, in a cross realm trust relationship.

    • enableKerberos - (Optional) Flag to indicate whether to Kerberize the cluster.

    • kdcDbKeyUri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the master key of the KDC database.

    • keyPasswordUri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the password to the user provided key. For the self-signed certificate, this password is generated by Dataproc.

    • keystoreUri - (Optional) The Cloud Storage URI of the keystore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

    • keystorePasswordUri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the password to the user provided keystore. For the self-signed certificated, the password is generated by Dataproc.

    • kmsKeyUri - (Required) The URI of the KMS key used to encrypt various sensitive files.

    • realm - (Optional) The name of the on-cluster Kerberos realm. If not specified, the uppercased domain of hostnames will be the realm.

    • rootPrincipalPasswordUri - (Required) The Cloud Storage URI of a KMS encrypted file containing the root principal password.

    • tgtLifetimeHours - (Optional) The lifetime of the ticket granting ticket, in hours.

    • truststorePasswordUri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the password to the user provided truststore. For the self-signed certificate, this password is generated by Dataproc.

    • truststoreUri - (Optional) The Cloud Storage URI of the truststore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.


The clusterConfigAutoscalingConfig block supports:

cluster_config {
  # Override or set some custom properties
  autoscaling_config {
    policy_uri = "projects/projectId/locations/region/autoscalingPolicies/policyId"
  }
}
  • policyUri - (Required) The autoscaling policy used by the cluster.

Only resource names including project id and location (region) are valid. Examples:

https://www.googleapis.com/compute/v1/projects/[projectId]/locations/[dataprocRegion]/autoscalingPolicies/[policyId]
projects/[projectId]/locations/[dataprocRegion]/autoscalingPolicies/[policyId]

Note that the policy must be in the same project and Cloud Dataproc region.


The initializationAction block (Optional) can be specified multiple times and supports:

cluster_config {
  # You can define multiple initialization_action blocks
  initialization_action {
    script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
    timeout_sec = 500
  }
}
  • script - (Required) The script to be executed during initialization of the cluster. The script must be a GCS file with a gs:// prefix.

  • timeoutSec - (Optional, Computed) The maximum duration (in seconds) which script is allowed to take to execute its action. GCP will default to a predetermined computed value if not set (currently 300).


The encryptionConfig block supports:

cluster_config {
  encryption_config {
    kms_key_name = "projects/projectId/locations/region/keyRings/keyRingName/cryptoKeys/keyName"
  }
}
  • kmsKeyName - (Required) The Cloud KMS key name to use for PD disk encryption for all instances in the cluster.

The dataprocMetricConfig block supports:

dataproc_metric_config {
  metrics {
    metric_source = "HDFS"
    metric_overrides = ["yarn:ResourceManager:QueueMetrics:AppsCompleted"]
  }
}
  • metrics - (Required) Metrics sources to enable.

    • metricSource - (Required) A source for the collection of Dataproc OSS metrics (see available OSS metrics).

    • metricOverrides - (Optional) One or more available OSS metrics (https://cloud.google.com/dataproc/docs/guides/monitoring#available_oss_metrics) to collect for the metric source.


The lifecycleConfig block supports:

cluster_config {
  lifecycle_config {
    idle_delete_ttl = "10m"
    auto_delete_time = "2120-01-01T12:00:00.01Z"
  }
}
  • idleDeleteTtl - (Optional) The duration to keep the cluster alive while idling (no jobs running). After this TTL, the cluster will be deleted. Valid range: [10m, 14d].

  • autoDeleteTime - (Optional) The time when cluster will be auto-deleted. A timestamp in RFC3339 UTC "Zulu" format, accurate to nanoseconds. Example: "2014-10-02T15:01:23.045123456Z".


The endpointConfig block (Optional, Computed, Beta) supports:

cluster_config {
  endpoint_config {
    enable_http_port_access = "true"
  }
}
  • enableHttpPortAccess - (Optional) The flag to enable http access to specific ports on the cluster from external sources (aka Component Gateway). Defaults to false.

The metastoreConfig block (Optional, Computed, Beta) supports:

cluster_config {
  metastore_config {
    dataproc_metastore_service = "projects/projectId/locations/region/services/serviceName"
  }
}
  • dataprocMetastoreService - (Required) Resource name of an existing Dataproc Metastore service.

Only resource names including project id and location (region) are valid. Examples:

projects/[projectId]/locations/[dataprocRegion]/services/[serviceName]

Attributes Reference

In addition to the arguments listed above, the following computed attributes are exported:

  • clusterConfig0MasterConfig0InstanceNames - List of master instance names which have been assigned to the cluster.

  • clusterConfig0WorkerConfig0InstanceNames - List of worker instance names which have been assigned to the cluster.

  • clusterConfig0PreemptibleWorkerConfig0InstanceNames - List of preemptible instance names which have been assigned to the cluster.

  • clusterConfig0Bucket - The name of the cloud storage bucket ultimately used to house the staging data for the cluster. If stagingBucket is specified, it will contain this value, otherwise it will be the auto generated name.

  • clusterConfig0SoftwareConfig0Properties - A list of the properties used to set the daemon config files. This will include any values supplied by the user via clusterConfigSoftwareConfigOverrideProperties

  • clusterConfig0LifecycleConfig0IdleStartTime - Time when the cluster became idle (most recent job finished) and became eligible for deletion due to idleness.

  • clusterConfig0EndpointConfig0HttpPorts - The map of port descriptions to URLs. Will only be populated if enableHttpPortAccess is true.

Import

This resource does not support import.

Timeouts

This resource provides the following Timeouts configuration options:

  • create - Default is 45 minutes.
  • update - Default is 45 minutes.
  • delete - Default is 45 minutes.
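
For example, a hedged sketch of overriding the defaults, assuming the generated bindings expose a timeouts property on the resource config (durations are illustrative):

import * as google from "./.gen/providers/google";

// Hypothetical timeout overrides for create/update/delete operations.
new google.dataprocCluster.DataprocCluster(this, "slow_cluster", {
  name: "slow-cluster",
  region: "us-central1",
  timeouts: {
    create: "60m",
    update: "60m",
    delete: "30m",
  },
});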