Data is an integral part of business and converting your data into actionable insights helps businesses to make better decisions and predictions and foster business growth. A large set of diverse data is stored in multiple systems. In order to turn the data into actionable intelligence, data needs to be pulled from multiple sources, cleansed, analysed, and transformed into a format that can be efficiently stored and used to create data visualisation to provide intuitive insights.
AWS, the leader in the cloud space, provides the best-of-breed tools and services to derive meaningful insights from data rapidly.
The purpose of this blog post is to walk you through how to implement data analytics and visualisation solutions with minimal coding using AWS fully managed and serverless services that are highly available and scalable. Here, we use ServiceNow and Mobile Device Management Systems data as an example use case.
The diagram below presents the reference architecture to implement data analytics and visualization with ServiceNow and Mobile Device Management Systems data on the AWS cloud. Data is extracted from multiple sources and if direct integration is not possible with the source system, it provides the facility to upload data extracts manually to the Amazon S3 bucket. Once all the data extracts are received, they are cleansed, transformed, and copied to another bucket. The metadata definition of the transformed data is inferred with the crawler, and the table structure is created in the analytics service, which is used as a data source in the visualisation tool such as Amazon QuickSight, Tableau and Power BI.
This is the first step in the flow. Data is extracted from ServiceNow using Amazon Appflow, which is a fully managed integration service and automates data extraction from the SaaS application ServiceNow, with a few clicks. Before creating the flow, the ServiceNow account is required to establish the connection. Once the connection is established, Amazon AppFlow will list all the tables in ServiceNow for you to select the relevant table and allow mapping of the source fields to the destination fields. Furthermore, it allows you to add filters for the data and run the flow on demand, or on schedule.
Amazon Appflow does not support integration with on-premises MDM systems. Custom adapters need to be created as microservice and deployed into AWS ECS Fargate for extracting data from the MDM system and copied to the S3 bucket.
Some source systems do not expose APIs for direct integration for security reasons. Data can be extracted manually from those systems and copied to an S3 bucket via AWS console or AWS CLI.
After extracting raw data into the S3 bucket, the next step in the flow is data cleansing and transformation. AWS Glue DataBrew is a no-code visual data preparation tool that enables data analysts to visually inspect and analyse the data. It provides over 250 pre-built transformations to automate data preparation tasks.
The first step in AWS Glue DataBrew is creating a project and a new dataset. DataBrew dataset allows choosing an Amazon S3 bucket or Amazon AppFlow as one of the sources to import the data for data preparation. The transformed data is saved in an Amazon S3 bucket.
Once finished the data transformation on sample data, the steps can be stored as a recipe, which can be reused with other datasets to automate the transformation.
Metadata Definition Inference
The next step is to determine the schema of transformed data. AWS Glue crawler crawls the transformed data stored in the Amazon S3 bucket and automatically infers schema information and creates tables in the AWS Glue Data Catalog. The table does not contain data and it is the metadata definition of transformed data.
Data Analysis and Visualisation
Now, data is cleansed and transformed, and the structure of data is determined. It is ready for data analysis and visualisation.
Amazon Athena is an interactive analytics service, which integrates with AWS Glue Data Catalog out of the box and uses the tables created in AWS Glue Data Catalog for querying and analysing the data in Amazon S3 buckets using standard SQL. Amazon Athena is serverless and you do not need to manage any underlying compute infrastructure and you pay only for the queries that you run.
Amazon QuickSight is a powerful serverless cloud-native business intelligence service and you can easily configure Amazon Athena as a data source. PowerBI is another visualisation tool with a rich set of features from Microsoft and it connects to Amazon Athena via PowerBI Data Gateway deployed in Amazon EC2 instance as shown in the diagram. Tableau is also an excellent visualisation product, and it requires Athena Driver installation to connect to Amazon Athena and retrieve data.
Extracting data from multiple sources and converting it to actionable insight quickly help the business make an informed decision on time. Automating the data extraction and preparation process with AWS fully managed and serverless services as shown in the reference architecture diagram enables your business to turn the data into business intelligence more effectively in a short period of time with low-code.
Q: What software and services currently integrate with Amazon AppFlow?
AWS keeps on adding new integration all the time. Currently, available integrations are listed at https://aws.amazon.com/appflow/integrations/.
Q: Can Amazon AppFlow reuse the connection?
Yes, you reuse the same connections with flows. When you create the AppFlow, it allows selecting existing connections.
Q: Why is Amazon Athena used?
Why cannot use Amazon Redshift?Amazon Athena provides the easiest way to run ad hoc SQL queries for data in Amazon S3 without the need to set up or manage any servers and analysis unstructured, semi-structured, and structured data stored in Amazon S3. It is very cost-effective.
Amazon Redshift is a data warehouse product and is good for building sophisticated business reports from historical data. It is very costly, but it is optimised to perform well in running queries against highly structured data with lots of joins across lots of very large tables.
If the volume of data is extremely high and you want to run complex queries with a large number of joins, then the best choice is Amazon Redshift, otherwise choose Amazon Athena.
Q: What is the difference between AWS Glue ETL and AWS Glue DataBrew?
AWS Glue DataBrew is a no-code tool and has over 250 built-in transformations whereas AWS Glue ETL has Glue Studio, which allows writing custom code for transformations.
Q: Is Amazon Athena accessible using JDBC and integrate any application that supports JDBC connectivity?
Yes, Amazon Athena provides JDBC drivers, which you can download, install and configure. Follow the instruction on https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html.
If you have any questions or need help implementing the above-mentioned solutions, please leave a comment.