In this three-part blog series, I will describe an approach I have used recently to implement Continuous Integration and Continuous Delivery in Azure Data Factory V2 using a PowerShell module.
In this second part of the series, we will focus on creating a basic Azure Data Factory (ADF) V2 pipeline. The pipeline uses a Copy data activity to copy data from a CSV file in an Azure Storage Account into a table in an Azure SQL Database.
- Go through Part 1 of the blog series to get much-needed context. It describes what CI/CD means in Azure Data Factory V2 and its process workflow. It also explains how to provision Azure resources for the environment setup using Azure CLI.
- An Azure environment already set up and ready to go. If you don't have one, you can follow along by going through the “Azure Resources setup using Azure CLI” section in Part 1 of the blog post series. Essentially, we need the following resources provisioned for both the Development (DEV) and Production (PROD) environments before we can set up a data pipeline:
Azure Data Factory V2
Azure Key Vault
Azure Storage Account
Azure SQL Database
Service Principal with appropriate access to fetch secrets from Azure Key Vault
Step 2: Set up an Azure Data Factory V2 pipeline
If you would prefer to copy the files for the datasets, linked services and pipeline created in this blog post rather than building them through the DEV ADF user interface, you can find them in the “2-SampleDataPipeline” folder at https://github.com/ashisharora1909/ADF-CICD-Demo.
If you want to follow along instead, use the steps below.
Open “adf.azure.com” in your browser to launch Azure Data Factory V2. It will prompt you for your credentials. Once you have logged in successfully, you will be presented with the following screen.
Select the Azure Active Directory tenant and Subscription to navigate to the DEV Data Factory V2, i.e. DF-Ash-Dev, and click Continue. This will open the Data Factory with the master branch selected by default. If it prompts you to select a branch, select the master branch for now; your screen should then look something like this:
The next step is to create a feature branch so that we can start adding the objects that make up the data factory pipeline. Navigate to the New branch option as highlighted in the image below and create a new feature branch, for example feature_XY, where XY is the number of the Azure DevOps work item you are working on.
On creating the feature_XY branch, it is automatically selected, which means it is currently checked out in the repository and any changes made from now on will be committed to that branch. Each time you click Save, a corresponding commit is created in the selected branch.
Let’s start by creating linked services. These are connections to the Azure resources that we want to use in ADF V2.
To create a linked service, go to Manage -> Linked services -> New.
The ones we will be creating are:
Azure Key Vault
To create the Azure Key Vault linked service, select New -> Azure Key Vault -> Continue. This will open a pane to enter the details shown below. Select KV-Ash-Dev from the drop-down to create the linked service for the DEV Key Vault. Click Test connection to see if the connection is successful. If it is not, you may have to grant the Data Factory service managed identity access to your Azure Key Vault. If it is successful, click Create to create the linked service.
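When you click Save, ADF commits the linked service to the Git repository as a JSON file. As a rough sketch of what that file may look like (the linked service name and vault URL below are illustrative placeholders, not necessarily what your repository will contain):

```json
{
    "name": "LS_AzureKeyVault",
    "properties": {
        "type": "AzureKeyVault",
        "typeProperties": {
            "baseUrl": "https://kv-ash-dev.vault.azure.net/"
        }
    }
}
```

The other linked services will reference this one by name to resolve their secrets, so keeping its name stable across environments matters for the CI/CD story later in the series.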
To give the Data Factory service managed identity access to Azure Key Vault, you can run the az keyvault set-policy command in Azure CLI to grant it Get and List permissions on secrets. Alternatively, you can go to the Azure Key Vault and create an access policy under Access Policies to grant the permissions to ADF V2.
Azure Storage Account
To create the Azure Storage Account linked service, select New -> Azure Blob Storage -> Continue. This will open a pane to enter the details shown below. Select Account key as the authentication method and choose Azure Key Vault to fetch the connection string of the Azure Storage Account that we stored there as the secret StorageAccountConnString. Click Test connection to see if the connection is successful, then click Create to create the linked service.
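Behind the scenes, the Key Vault lookup is stored in the linked service JSON as an AzureKeyVaultSecret reference rather than a plain connection string. A sketch of what the committed file may look like, with illustrative names (only the secret name StorageAccountConnString comes from this post):

```json
{
    "name": "LS_AzureBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_AzureKeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "StorageAccountConnString"
            }
        }
    }
}
```

Because the secret name is the same in the DEV and PROD Key Vaults, this file needs no environment-specific edits when it is promoted.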
Azure SQL Database
To create the Azure SQL Database linked service, select New -> Azure SQL Database -> Continue. This will open a pane to enter the details shown below. Choose Azure Key Vault to fetch the connection string of the Azure SQL Database that we stored there as the secret AzureSQLConnString, and select SQL Authentication or Managed Identity as the authentication type. Click Test connection to see if the connection is successful, then click Create to create the linked service.
If the connection fails, it may be because ADF cannot access the Azure SQL Database due to firewall rules. To fix this, go to the SQL Server (ADBSrv-Ash-Dev), select Firewalls and virtual networks, and set the property "Allow Azure services and resources to access this server" to Yes. This should fix the issue in most cases.
We now have all three linked services created.
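The committed JSON for this linked service follows the same Key Vault pattern as the storage one; a sketch with illustrative names (the secret name AzureSQLConnString comes from the post, the rest are placeholders):

```json
{
    "name": "LS_AzureSqlDatabase",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_AzureKeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "AzureSQLConnString"
            }
        }
    }
}
```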
To save time, I have already uploaded a sample CSV file (input.csv) to the source container of the Azure Storage Account in the DEV environment (saashdev); the contents of the file look like this:
I have also created a table dbo.output in the Azure SQL Database. Below is the CREATE TABLE script:
Let’s move on to creating the data pipeline. Follow the steps below to create the pipeline and add a Copy data activity that copies data from input.csv in Azure Storage to the dbo.output table in the Azure SQL Database in the DEV ADF environment.
1. Create a new pipeline
2. Name the pipeline samplePipeline and add a Copy data activity to it.
3. Configure the Source properties of the Copy data activity. These should point to the source dataset, which in our case is the CSV file in the Azure Storage Account (input.csv). As we haven’t created a dataset for input.csv yet, we will create it from inside the pipeline itself. Alternatively, you can create a dataset with the same steps by clicking Datasets in the Author tab and then New.
To create a dataset that serves as the source within the pipeline, click Source and then New.
This will open another pane to select a data store. Select Azure Blob Storage, choose DelimitedText as the format and click Continue. Enter the property values as shown below and click OK to create the dataset. You can use Preview data to check that the dataset is set up correctly.
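The dataset is likewise committed to the repository as JSON. A sketch of what a DelimitedText dataset over input.csv may look like, assuming illustrative dataset and linked service names and the source container mentioned earlier:

```json
{
    "name": "DS_SourceCsv",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_AzureBlobStorage",
            "type": "LinkedServiceReference"
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "source",
                "fileName": "input.csv"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```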
4. Configure the Sink properties of the Copy data activity. These should point to the sink dataset, which in our case is the table in the Azure SQL Database (dbo.output). As we haven’t created a dataset for dbo.output yet, we will create it from inside the pipeline itself. Alternatively, you can create a dataset with the same steps by clicking Datasets in the Author tab and then New.
To create a dataset that serves as the sink within the pipeline, click Sink and then New.
This will open another pane to select a data store. Select Azure SQL Database and click Continue. Enter the property values as shown below and click OK to create the dataset.
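A sketch of the corresponding sink dataset JSON pointing at dbo.output, again with illustrative dataset and linked service names:

```json
{
    "name": "DS_SinkTable",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_AzureSqlDatabase",
            "type": "LinkedServiceReference"
        },
        "type": "AzureSqlTable",
        "typeProperties": {
            "schema": "dbo",
            "table": "output"
        }
    }
}
```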
5. The next step is to go to the Mapping tab and click Import schemas to import the column mapping from source to destination.
This completes the configuration of the data pipeline. Click Validate to check whether any errors are present. If there are none, click Save to create a corresponding commit for the pipeline in the feature_XY branch. You will also notice the commits for the linked services and datasets in the feature_XY branch. Amazing, isn’t it?
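The pipeline itself is committed as one more JSON file. A sketch of what samplePipeline with its Copy activity may look like (the activity and dataset names are illustrative placeholders; the actual files are in the repository folder mentioned earlier):

```json
{
    "name": "samplePipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyCsvToSqlTable",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "DS_SourceCsv", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "DS_SinkTable", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            }
        ]
    }
}
```

Note that the pipeline only references datasets, and datasets only reference linked services; it is this layering that lets the same pipeline JSON be promoted unchanged between environments.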
Now is also a good time to test the pipeline by using the Debug option, which triggers a debug run of the current pipeline. On success, a popup reports the status of the pipeline. You can also view the status of each activity in the pipeline in the Output window.
If you now check the dbo.output table in the DEV SQL Database, you will find that the data from input.csv has been loaded into the table.
This concludes part two of the blog series, in which we created a data pipeline in the DEV environment. At this stage, the feature_XY branch in Azure DevOps should look something like this:
In the next post, we will package the sample data pipeline created today as a build artefact (Continuous Integration) and then release it to Production (Continuous Delivery) using Azure Pipelines.
Thanks for reading and do check out the final post in this blog series.