
Content

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare and load your data for analytics. You can create and run an ETL job with a few clicks on the AWS Management Console.

AWS Glue provides a few native connectors to data sources using JDBC drivers. To connect to the Denodo Platform, we will use the JDBC driver distributed with the Denodo Platform and run an ETL job that integrates Denodo Platform views.

This document explains how this connectivity can be accomplished in a ‘non-native’ way, treating Denodo as a data store and extracting views from the Virtual DataPort server.

Pre-requisites

  • Denodo Platform installed on an EC2 instance. Note: You can also connect to an on-premises Denodo installation, provided the necessary network setup is in place. Check with your AWS administrator for the network configuration.
  • VPC and subnets (recommended when installing Denodo on the EC2 instance).
  • Security configuration. Refer to the Default ports used by Denodo Platform Modules section of the Denodo Installation Guide for the ports needed by the Denodo Platform.

Step 1: Configuring S3 buckets

Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services that provides object storage through a web service interface.

As a first step of the ETL process, it is necessary to create S3 buckets to store several pieces of information, including the output of the extraction process.

Sign in to the AWS Management Console, navigate to S3, and then create the necessary buckets for this job following the Glue naming standards.

The location where the tailored script will be stored:

s3://aws-glue-scripts-<ACCOUNT_ID>-<REGION>/<USER>

To store temporary scripts while running the Glue job:

s3://aws-glue-temporary-<ACCOUNT_ID>-<REGION>/<USER>

For storing the Denodo JDBC driver, you can create a similar location:

s3://aws-glue-jdbc-drivers-<ACCOUNT_ID>-<REGION>/<USER>

Also for the Parquet data (output), you can create a similar location:

s3://aws-glue-data-output-<ACCOUNT_ID>-<REGION>/<USER>

Remember, the user denoted in the bucket path is the AWS user who is creating the buckets. In this example, all the S3 buckets are created in the us-east-1 Region to avoid cross-region transfer costs.
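If you prefer to create these buckets from code rather than the console, the following is a minimal boto3 sketch. The bucket names are the examples above with placeholders; the /<USER> part of each path is an object key prefix, which is created implicitly when objects are uploaded.

Python

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Example bucket names from above; replace <ACCOUNT_ID> and <REGION> with real values
buckets = [
    "aws-glue-scripts-<ACCOUNT_ID>-<REGION>",
    "aws-glue-temporary-<ACCOUNT_ID>-<REGION>",
    "aws-glue-jdbc-drivers-<ACCOUNT_ID>-<REGION>",
    "aws-glue-data-output-<ACCOUNT_ID>-<REGION>",
]

for bucket in buckets:
    # In us-east-1, create_bucket is called without a LocationConstraint
    s3.create_bucket(Bucket=bucket)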

Step 2: Creating an IAM Role

AWS Identity and Access Management (IAM) enables you to manage access to AWS services and resources securely. Using IAM, you can create and manage AWS users and groups, and use permissions to allow and deny their access to AWS resources. For the ETL job, we are going to create a Role with a Custom Policy, and then attach it to the Glue Service.

  1. Search for IAM in the AWS Management Console.
  2. Create a policy that allows access to the database credentials stored in AWS Secrets Manager. For that, click Policies in the left-side pane and then Create policy.
  3. Replace the placeholders <REGION> and <ACCOUNT_ID> with the actual values, for example:

JSON

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecretValue",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": [
                "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:Denodo_Database_Connection_Info"
            ]
        }
    ]
}

Note: You could also configure this custom policy through the Visual Editor instead of using JSON. Make sure to define the GetSecretValue and DescribeSecret actions. Here, Denodo_Database_Connection_Info is the secret name used in this example.

  4. Click on Review policy.
  5. Give this policy a name, for example, GlueAccessSecretValue.
  6. On the IAM console, choose Roles in the left navigation pane.
  7. Choose Create role. The trusted entity type must be an AWS service, specifically AWS Glue.
  8. Click Next: Permissions.
  9. Now, select the policies for the IAM role. In the policies search bar, type AWSGlueServiceRole and select that policy.
  10. Search again, now for the GlueAccessSecretValue policy created before.
  11. Finally, search for the SecretsManagerReadWrite policy.
  12. Once the policies are selected, choose Next: Tags and then Next: Review.
  13. Give your role a name, for example, GluePermissions.
  14. Choose Create role.
  15. Review the Summary and confirm that the policies were selected as shown below.
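If you would rather script these IAM steps, a minimal boto3 sketch that creates an equivalent policy and role is shown below. GlueAccessSecretValue and GluePermissions are the example names used above; replace the placeholders before running.

Python

import boto3
import json

iam = boto3.client("iam")

# Custom policy allowing access to the Denodo secret (same document as above)
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AccessSecretValue",
        "Effect": "Allow",
        "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
        "Resource": ["arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:Denodo_Database_Connection_Info"]
    }]
}
policy = iam.create_policy(
    PolicyName="GlueAccessSecretValue",
    PolicyDocument=json.dumps(policy_document)
)

# Trust policy so that the AWS Glue service can assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}
iam.create_role(RoleName="GluePermissions",
                AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the managed and custom policies selected in the console steps
for policy_arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/SecretsManagerReadWrite",
    policy["Policy"]["Arn"],
]:
    iam.attach_role_policy(RoleName="GluePermissions", PolicyArn=policy_arn)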

        

Step 3: Upload the driver

AWS Glue natively supports a few databases for the ETL process, just as Denodo offers adapters for specific databases. Since Denodo is not one of them, we have to use the Denodo JDBC driver to establish the connection. Using the s3://aws-glue-jdbc-drivers-<ACCOUNT_ID>-<REGION>/<USER> bucket previously created, upload the denodo-vdp-jdbcdriver.jar located under the “<DENODO_HOME>\tools\client-drivers\jdbc” directory. The S3 path of this jar file will be used later in the Glue job script.
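As a reference, the upload can also be done with boto3; a minimal sketch, assuming the bucket created in Step 1 and a Windows installation path (adjust the paths as needed):

Python

import boto3

s3 = boto3.client("s3")

# Local path of the driver inside the Denodo installation (replace <DENODO_HOME>)
local_jar = r"<DENODO_HOME>\tools\client-drivers\jdbc\denodo-vdp-jdbcdriver.jar"

# Bucket and key prefix created in Step 1 (replace the placeholders)
s3.upload_file(
    Filename=local_jar,
    Bucket="aws-glue-jdbc-drivers-<ACCOUNT_ID>-<REGION>",
    Key="<USER>/denodo-vdp-jdbcdriver.jar"
)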

        

Step 4: Leveraging AWS Secrets Manager for storing credentials

AWS Secrets Manager helps you protect secrets needed to access your applications, services, and IT resources. The service enables you to easily rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle.

The credentials for establishing the connection to the VDP server from the Glue job script will be stored as a secret. The steps are illustrated below:

  • Open the console, and search for Secrets Manager.
  • In the AWS Secrets Manager console, choose Store a new secret.
  • Under Select a secret type, choose Other type of secrets.
  • In the Secret key/value section, set one row for each of the following parameters:
      • db_username
      • db_password
      • db_url (for example, jdbc:vdb://52.10.22.45:9999/database)
      • db_table (for example, bv_series)
      • driver_name (com.denodo.vdp.jdbc.Driver)
      • output_bucket (for example, aws-glue-data-output-1234567890-us-east-1/aws_user)

Note: The key names are just for illustration purposes; you can provide different names for the parameters.

  • Choose Next.
  • Provide the Secret name, for example Denodo_Database_Connection_Info.
  • Choose Next.
  • Keep the Disable automatic rotation check box selected.
  • Choose Next and then click Store.
  • Once the secret has been created, click Retrieve secret value and it should look like the example below.
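Alternatively, the secret can be created programmatically. The following boto3 sketch uses the example keys and values listed above; replace the username, password, and connection details with your own.

Python

import boto3
import json

secretsmanager = boto3.client("secretsmanager", region_name="us-east-1")

# Example keys and values from the steps above
secret_value = {
    "db_username": "<VDP_USERNAME>",
    "db_password": "<VDP_PASSWORD>",
    "db_url": "jdbc:vdb://52.10.22.45:9999/database",
    "db_table": "bv_series",
    "driver_name": "com.denodo.vdp.jdbc.Driver",
    "output_bucket": "aws-glue-data-output-1234567890-us-east-1/aws_user"
}

secretsmanager.create_secret(
    Name="Denodo_Database_Connection_Info",
    SecretString=json.dumps(secret_value)
)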

Step 5: Create a Glue Job

AWS Glue offers three options for creating a Glue script, and you can select your preferred method. In this example, we are going to author a new script, as explained below:

  1. Search for Glue in the AWS Management Console.
  2. Navigate to Jobs under the ETL section and then click Add job.
  3. Fill in the job properties:
  4. Provide a name for the job, for example, DenodoJob.
  5. Choose the IAM role that we created earlier (GluePermissions).
  6. Choose Type as Spark and select the preferred Glue version. In this example, we have selected Spark 2.2, Python 2 (Glue version 0.9).
  7. For This job runs, choose A new script to be authored by you.
  8. For ETL language, choose Python.
  9. In S3 path where the script is stored, provide the S3 bucket path created for storing the script, and provide the temporary bucket path in Temporary directory.
  10. Scroll down and expand the Security configuration, script libraries, and job parameters section, then choose the S3 location of the Denodo JDBC driver in Dependent jars path. Note that you have to select the jar itself, as shown below.
  11. Choose Next.
  12. On the Connections page, choose Next.
  13. On the summary page, choose Save job and edit script. This creates the job and opens the script editor.

Note: You can customize the script based on your needs, but make sure that you modify the mappings according to the view in the Virtual DataPort server. This example script will retrieve the view data and store the output in Parquet format. If you prefer a different format, you would need to tailor the script in coordination with the development team (see the variation shown after the sample script below).

Sample Job script - DenodoJob

import sys
import boto3
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Getting DB credentials from Secrets Manager
client = boto3.client("secretsmanager", region_name="us-east-1")
get_secret_value_response = client.get_secret_value(
    SecretId="Denodo_Database_Connection_Info"
)
secret = get_secret_value_response['SecretString']
secret = json.loads(secret)

db_username = secret.get('db_username')
db_password = secret.get('db_password')
db_url = secret.get('db_url')
table_name = secret.get('db_table')
jdbc_driver_name = secret.get('driver_name')
s3_output = "s3://" + secret.get('output_bucket') + "/" + table_name

# Connecting to the source (Denodo Virtual DataPort) through JDBC
df = glueContext.read.format("jdbc") \
    .option("driver", jdbc_driver_name) \
    .option("url", db_url) \
    .option("dbtable", table_name) \
    .option("user", db_username) \
    .option("password", db_password) \
    .load()

# Printing the schema and row count of the retrieved view
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")

# Defining mapping for the transformation
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "int", "id", "int"), ("name", "string", "name", "string"), ("dept", "string", "dept", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

# Writing to destination in Parquet format
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

job.commit()
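For example, if JSON output were preferred instead of Parquet, only the final write call of the script above would need to change. A sketch of that variation, reusing the dropnullfields3 frame and s3_output path defined in the script:

Python

# Hypothetical variation of the final write call: JSON output instead of Parquet
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = dropnullfields3,
    connection_type = "s3",
    connection_options = {"path": s3_output},
    format = "json",
    transformation_ctx = "datasink4"
)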

  14. Click Save to save the job.
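If you prefer to create the job programmatically rather than through the console wizard, a minimal boto3 sketch is shown below. The job, role, bucket, and driver names are the examples from the previous steps; the script file name DenodoJob.py is hypothetical.

Python

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Example names and paths from the previous steps; replace the placeholders
glue.create_job(
    Name="DenodoJob",
    Role="GluePermissions",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://aws-glue-scripts-<ACCOUNT_ID>-<REGION>/<USER>/DenodoJob.py"
    },
    DefaultArguments={
        "--TempDir": "s3://aws-glue-temporary-<ACCOUNT_ID>-<REGION>/<USER>",
        "--extra-jars": "s3://aws-glue-jdbc-drivers-<ACCOUNT_ID>-<REGION>/<USER>/denodo-vdp-jdbcdriver.jar"
    },
    GlueVersion="0.9"
)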

Step 6: Run the job

Now that the ETL job has been created, follow the steps below to execute it:

  • Navigate to the Jobs section, and click on the newly created job. On the Action menu, choose Run job.

  • Once the job execution completes successfully, click Logs under the History tab to view the logs in CloudWatch.

  • Select the View as text check box and then scroll down to view the output.

 

  • Since we used df.count() and df.printSchema() in our Python script, we are able to see the metadata extracted from the view.

  • Finally, navigate to the S3 output bucket we created earlier and you should see the Parquet file as expected.
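The same run can also be started and monitored with boto3; a minimal sketch, assuming the DenodoJob name used earlier:

Python

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start the job created in Step 5
run = glue.start_job_run(JobName="DenodoJob")

# Check the run state (RUNNING, SUCCEEDED, FAILED, ...)
response = glue.get_job_run(JobName="DenodoJob", RunId=run["JobRunId"])
print(response["JobRun"]["JobRunState"])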

Conclusion

Now that the Parquet file has been generated, you could use one of the AWS services or any other external tool to proceed further in the ETL process. For instance, you could create a table in Glue with this output file as a data store and run a SELECT query against this Glue table in the Amazon Athena Query Editor.
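As an illustration, once the Parquet output has been registered as a Glue table (for example, with a crawler), the query could also be launched with boto3. The sketch below uses hypothetical database and table names (denodo_output and bv_series) and a results prefix in the output bucket.

Python

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical Glue Data Catalog database/table registered over the Parquet output
athena.start_query_execution(
    QueryString="SELECT * FROM bv_series LIMIT 10",
    QueryExecutionContext={"Database": "denodo_output"},
    ResultConfiguration={
        "OutputLocation": "s3://aws-glue-data-output-<ACCOUNT_ID>-<REGION>/athena-results/"
    }
)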

To conclude, this is how Denodo can be connected to AWS Glue and views can be extracted for ETL. Creating Connections through AWS Glue Studio is not yet supported for Denodo, and hence that approach will not work until Denodo is natively supported as a data store.

References

Denodo Knowledge Base - Installation steps on cloud environment

Denodo Installation Guide - Default ports used by Denodo Platform Modules
