
Resource: awsGlueCrawler

Manages a Glue Crawler. More information can be found in the AWS Glue Developer Guide.

Example Usage

DynamoDB Target Example

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  dynamodbTarget: [
    {
      path: "table-name",
    },
  ],
  name: "example",
  role: "${aws_iam_role.example.arn}",
});

JDBC Target Example

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  jdbcTarget: [
    {
      connectionName: "${aws_glue_connection.example.name}",
      path: "database-name/%",
    },
  ],
  name: "example",
  role: "${aws_iam_role.example.arn}",
});

S3 Target Example

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example",
  role: "${aws_iam_role.example.arn}",
  s3Target: [
    {
      path: "s3://${aws_s3_bucket.example.bucket}",
    },
  ],
});

Catalog Target Example

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "example", {
  catalogTarget: [
    {
      databaseName: "${aws_glue_catalog_database.example.name}",
      tables: ["${aws_glue_catalog_table.example.name}"],
    },
  ],
  configuration:
    '{\n  "Version":1.0,\n  "Grouping": {\n    "TableGroupingPolicy": "CombineCompatibleSchemas"\n  }\n}\n',
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example",
  role: "${aws_iam_role.example.arn}",
  schemaChangePolicy: {
    deleteBehavior: "LOG",
  },
});

MongoDB Target Example

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  mongodbTarget: [
    {
      connectionName: "${aws_glue_connection.example.name}",
      path: "database-name/%",
    },
  ],
  name: "example",
  role: "${aws_iam_role.example.arn}",
});

Configuration Settings Example

/*Provider bindings are generated by running cdktf get.
See https://cdk.tf/provider-generation for more details.*/
import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "events_crawler", {
  configuration:
    '${jsonencode(\n    {\n      Grouping = {\n        TableGroupingPolicy = "CombineCompatibleSchemas"\n      }\n      CrawlerOutput = {\n        Partitions = { AddOrUpdateBehavior = "InheritFromTable" }\n      }\n      Version = 1\n    }\n  )}',
  databaseName: "${aws_glue_catalog_database.glue_database.name}",
  name: "events_crawler_${var.environment_name}",
  role: "${aws_iam_role.glue_role.arn}",
  s3Target: [
    {
      path: "s3://${aws_s3_bucket.data_lake_bucket.bucket}",
    },
  ],
  schedule: "cron(0 1 * * ? *)",
  tags: "${var.tags}",
});
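
The configuration value above embeds Terraform's jsonencode inside an interpolation string. If you prefer to build the string in TypeScript, a plain JSON.stringify of the same object should produce an equivalent value; the snippet below is a sketch of that approach and the crawlerConfiguration constant is not part of the generated example.

// Sketch: building the same configuration JSON directly in TypeScript
// and passing the result to the crawler's `configuration` argument.
const crawlerConfiguration = JSON.stringify({
  Version: 1,
  Grouping: { TableGroupingPolicy: "CombineCompatibleSchemas" },
  CrawlerOutput: {
    Partitions: { AddOrUpdateBehavior: "InheritFromTable" },
  },
});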

Argument Reference

~> NOTE: Must specify at least one of dynamodbTarget, jdbcTarget, s3Target, mongodbTarget or catalogTarget.

The following arguments are supported:

  • databaseName (Required) Glue database where results are written.
  • name (Required) Name of the crawler.
  • role (Required) The IAM role friendly name (including path without leading slash), or ARN of an IAM role, used by the crawler to access other resources.
  • classifiers (Optional) List of custom classifiers. By default, all AWS classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification.
  • configuration (Optional) JSON string of configuration information. For more details see Setting Crawler Configuration Options.
  • description (Optional) Description of the crawler.
  • dynamodbTarget (Optional) List of nested DynamoDB target arguments. See Dynamodb Target below.
  • jdbcTarget (Optional) List of nested JDBC target arguments. See JDBC Target below.
  • s3Target (Optional) List of nested Amazon S3 target arguments. See S3 Target below.
  • mongodbTarget (Optional) List of nested MongoDB target arguments. See MongoDB Target below.
  • catalogTarget (Optional) List of nested Glue Catalog target arguments. See Catalog Target below.
  • deltaTarget (Optional) List of nested Delta Lake target arguments. See Delta Target below.
  • schedule (Optional) A cron expression used to specify the schedule. For more information, see Time-Based Schedules for Jobs and Crawlers. For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
  • schemaChangePolicy (Optional) Policy for the crawler's update and deletion behavior. See Schema Change Policy below.
  • lakeFormationConfiguration (Optional) Specifies Lake Formation configuration settings for the crawler. See Lake Formation Configuration below.
  • lineageConfiguration (Optional) Specifies data lineage configuration settings for the crawler. See Lineage Configuration below.
  • recrawlPolicy (Optional) A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run. See Recrawl Policy below.
  • securityConfiguration (Optional) The name of the Security Configuration to be used by the crawler.
  • tablePrefix (Optional) The table prefix used for catalog tables that are created.
  • tags - (Optional) Key-value map of resource tags. If configured with a provider defaultTags configuration block present, tags with matching keys will overwrite those defined at the provider-level.
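
As a rough illustration of how several of the optional top-level arguments combine, the sketch below schedules a nightly crawl with classifiers, a table prefix and a security configuration. The classifier and security configuration references are placeholders, not resources defined in the examples above.

import * as aws from "./.gen/providers/aws";
// Sketch only: the classifier and security configuration references are placeholders.
new aws.glueCrawler.GlueCrawler(this, "optional_arguments_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-optional-arguments",
  role: "${aws_iam_role.example.arn}",
  s3Target: [{ path: "s3://${aws_s3_bucket.example.bucket}" }],
  classifiers: ["${aws_glue_classifier.example.name}"],
  schedule: "cron(0 2 * * ? *)", // run nightly at 02:00 UTC
  securityConfiguration: "${aws_glue_security_configuration.example.name}",
  tablePrefix: "raw_",
  description: "Nightly crawl of the example bucket",
});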

Dynamodb Target

  • path - (Required) The name of the DynamoDB table to crawl.
  • scanAll - (Optional) Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table. Defaults to true.
  • scanRate - (Optional) The percentage of the configured read capacity units to use by the AWS Glue crawler. The valid values are null or a value between 0.1 and 1.5.
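
A minimal sketch of a DynamoDB target that samples rows instead of scanning the full table, assuming the same provider bindings and placeholder references as the examples above:

import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "dynamodb_sampling_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-dynamodb-sampling",
  role: "${aws_iam_role.example.arn}",
  dynamodbTarget: [
    {
      path: "table-name",
      scanAll: false, // sample rows rather than scanning every record
      scanRate: 0.5, // use half of the configured read capacity units
    },
  ],
});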

JDBC Target

  • connectionName - (Required) The name of the connection to use to connect to the JDBC target.
  • path - (Required) The path of the JDBC target.
  • exclusions - (Optional) A list of glob patterns used to exclude from the crawl.
  • enableAdditionalMetadata - (Optional) Specify a value of rawtypes or comments to enable additional metadata in table responses. rawtypes provides the native-level datatype. comments provides comments associated with a column or table in the database.
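
A sketch of a JDBC target that excludes some tables via glob patterns; the exclusion pattern is illustrative and the connection reference is the same placeholder used in the JDBC example above:

import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "jdbc_exclusions_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-jdbc-exclusions",
  role: "${aws_iam_role.example.arn}",
  jdbcTarget: [
    {
      connectionName: "${aws_glue_connection.example.name}",
      path: "database-name/%",
      exclusions: ["database-name/tmp_*"], // illustrative glob pattern to skip
    },
  ],
});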

S3 Target

  • path - (Required) The path to the Amazon S3 target.
  • connectionName - (Optional) The name of a connection which allows the crawler to access data in S3 within a VPC.
  • exclusions - (Optional) A list of glob patterns used to exclude from the crawl.
  • sampleSize - (Optional) Sets the number of files in each leaf folder to be crawled when crawling sample files in a dataset. If not set, all the files are crawled. A valid value is an integer between 1 and 249.
  • eventQueueArn - (Optional) The ARN of the SQS queue to receive S3 notifications from.
  • dlqEventQueueArn - (Optional) The ARN of the dead-letter SQS queue.
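
A sketch of an S3 target that skips temporary prefixes and samples a few files per leaf folder; the exclusion pattern and sample size are illustrative:

import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "s3_sampling_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-s3-sampling",
  role: "${aws_iam_role.example.arn}",
  s3Target: [
    {
      path: "s3://${aws_s3_bucket.example.bucket}",
      exclusions: ["**/_temporary/**"], // illustrative glob pattern to skip
      sampleSize: 10, // crawl at most 10 files per leaf folder
    },
  ],
});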

Catalog Target

  • connectionName - (Optional) The name of the connection for an Amazon S3-backed Data Catalog table to be a target of the crawl when using a Catalog connection type paired with a network Connection type.
  • databaseName - (Required) The name of the Glue database to be synchronized.
  • tables - (Required) A list of catalog tables to be synchronized.
  • eventQueueArn - (Optional) A valid Amazon SQS ARN.
  • dlqEventQueueArn - (Optional) A valid Amazon SQS ARN.

~> Note: deleteBehavior of catalog target doesn't support DEPRECATE_IN_DATABASE.

-> Note: configuration for catalog target crawlers will have { ... "grouping": { "tableGroupingPolicy": "combineCompatibleSchemas"} } by default.

MongoDB Target

  • connectionName - (Required) The name of the connection to use to connect to the Amazon DocumentDB or MongoDB target.
  • path - (Required) The path of the Amazon DocumentDB or MongoDB target (database/collection).
  • scanAll - (Optional) Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table. Default value is true.

Delta Target

  • connectionName - (Optional) The name of the connection to use to connect to the Delta table target.
  • createNativeDeltaTable - (Optional) Specifies whether the crawler will create native tables, to allow integration with query engines that support querying of the Delta transaction log directly.
  • deltaTables - (Required) A list of the Amazon S3 paths to the Delta tables.
  • writeManifest - (Required) Specifies whether to write the manifest files to the Delta table path.
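
There is no Delta target example above, so the following is a minimal sketch based on the arguments in this list. It assumes the provider bindings expose deltaTarget alongside the other target blocks, and the S3 path and connection reference are placeholders:

import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "delta_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-delta",
  role: "${aws_iam_role.example.arn}",
  deltaTarget: [
    {
      connectionName: "${aws_glue_connection.example.name}",
      deltaTables: ["s3://${aws_s3_bucket.example.bucket}/delta/events/"],
      writeManifest: false,
    },
  ],
});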

Schema Change Policy

  • deleteBehavior - (Optional) The deletion behavior when the crawler finds a deleted object. Valid values: LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE. Defaults to DEPRECATE_IN_DATABASE.
  • updateBehavior - (Optional) The update behavior when the crawler finds a changed schema. Valid values: LOG or UPDATE_IN_DATABASE. Defaults to UPDATE_IN_DATABASE.
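
A sketch combining both behaviors, logging deletions while still applying schema changes to the Data Catalog, using the same placeholder references as the examples above:

import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "schema_change_policy_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-schema-change-policy",
  role: "${aws_iam_role.example.arn}",
  s3Target: [{ path: "s3://${aws_s3_bucket.example.bucket}" }],
  schemaChangePolicy: {
    deleteBehavior: "LOG", // only log deleted objects
    updateBehavior: "UPDATE_IN_DATABASE", // apply schema changes to the catalog
  },
});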

Lake Formation Configuration

  • accountId - (Optional) Required for cross-account crawls. For same-account crawls as the target data, this can be omitted.
  • useLakeFormationCredentials - (Optional) Specifies whether to use Lake Formation credentials for the crawler instead of the IAM role credentials.
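
A sketch of a crawler that uses Lake Formation credentials instead of the IAM role credentials; the commented accountId is a placeholder and would only be needed for a cross-account crawl:

import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "lake_formation_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-lake-formation",
  role: "${aws_iam_role.example.arn}",
  s3Target: [{ path: "s3://${aws_s3_bucket.example.bucket}" }],
  lakeFormationConfiguration: {
    useLakeFormationCredentials: true,
    // accountId: "123456789012", // placeholder; only for cross-account crawls
  },
});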

Lineage Configuration

  • crawlerLineageSettings - (Optional) Specifies whether data lineage is enabled for the crawler. Valid values are: ENABLE and DISABLE. Default value is DISABLE.
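
A sketch enabling lineage collection for a crawler, using the same placeholder references as the examples above:

import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "lineage_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-lineage",
  role: "${aws_iam_role.example.arn}",
  s3Target: [{ path: "s3://${aws_s3_bucket.example.bucket}" }],
  lineageConfiguration: {
    crawlerLineageSettings: "ENABLE", // defaults to DISABLE
  },
});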

Recrawl Policy

  • recrawlBehavior - (Optional) Specifies whether to crawl the entire dataset again, crawl only folders that were added since the last crawler run, or crawl what S3 notifies the crawler of via SQS. Valid values are: CRAWL_EVENT_MODE, CRAWL_EVERYTHING and CRAWL_NEW_FOLDERS_ONLY. Default value is CRAWL_EVERYTHING.
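
A sketch that recrawls only folders added since the last run; CRAWL_EVENT_MODE would additionally require an eventQueueArn on the S3 target, which is omitted here:

import * as aws from "./.gen/providers/aws";
new aws.glueCrawler.GlueCrawler(this, "recrawl_policy_example", {
  databaseName: "${aws_glue_catalog_database.example.name}",
  name: "example-recrawl-policy",
  role: "${aws_iam_role.example.arn}",
  s3Target: [{ path: "s3://${aws_s3_bucket.example.bucket}" }],
  recrawlPolicy: {
    recrawlBehavior: "CRAWL_NEW_FOLDERS_ONLY", // defaults to CRAWL_EVERYTHING
  },
});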

Attributes Reference

In addition to all arguments above, the following attributes are exported:

  • id - Crawler name
  • arn - The ARN of the crawler
  • tagsAll - A map of tags assigned to the resource, including those inherited from the provider defaultTags configuration block.

Import

Glue Crawlers can be imported using the name, e.g.,

$ terraform import aws_glue_crawler.MyJob MyJob