Overview
In this series of blog posts, I will be describing an approach I have used recently to implement Continuous Integration and Continuous Delivery in Azure Data Factory V2.
Part 1 describes the following:
- What CI/CD is in general, along with its benefits
- What CI/CD means in Azure Data Factory V2, and its process workflow. We will be implementing this using a PowerShell module called azure.datafactory.tools.
- Provisioning of Azure resources using Azure CLI for environment setup.
Part 2 describes the following:
- Creating a basic Azure Data Factory V2 pipeline which copies data from a CSV file present in Azure Storage Account to a table in Azure SQL Database.
- Creating linked services that fetch the Azure Storage Account key and the Azure SQL Database connection string from Azure Key Vault.
Part 3 describes the following:
- Creating a build pipeline using YAML that will package the artifacts for continuous integration whenever the code changes. Currently, there are no unit & integration testing options for Azure Data Factory V2. You can find more information here.
- Creating a release pipeline using a PowerShell module called azure.datafactory.tools. To view the source code, refer here.
Main Goals
- Whenever a new Pull Request (PR) is merged from the feature branch to the master branch, the code should be built and packaged in a build artifact along with its dependent objects.
- Whenever a build artifact is generated successfully by the build pipeline, it should be deployed automatically to the DEV environment using the release pipeline. Any issues with the code will be picked up here if the deployment fails.
- The release pipeline should be able to promote the changes in the build artifact from one stage to another (e.g. DEV -> PROD) using the configuration settings defined for each stage in Azure DevOps, which are picked up directly from Azure Key Vault.
- An automated email is sent to the developers, reviewers and release approvers to notify them of the status of the change in the release process so they can take action accordingly.
Prerequisites
- An Azure Subscription for setting up the environment to create a Data factory V2 pipeline to copy data from Azure Storage Account to Azure SQL Database along with storing keys and secrets in Azure Key Vault. Go to https://azure.microsoft.com to get started.
- An Azure DevOps account for creating the build and release pipelines used in this series. Source code is hosted in a git repository in Azure Repos. Go to https://dev.azure.com to get started.
- You must have the Azure CLI installed.
What is CI/CD?
CI/CD is commonly defined as the combined practices of Continuous Integration and Continuous Delivery in modern software engineering. It bridges the gap between development and operations teams by injecting automation at different stages of the development life cycle, including the building, testing and deployment of applications.
Below are some of the benefits of implementing CI/CD in your development lifecycle.
Improvement in Team Collaboration and Code Quality
- Continuous integration results in fewer merge conflicts due to a shorter commit lifecycle, saving many developer hours and therefore increasing productivity.
- With frequent code commits and automated tests, it's easier to identify defects and fix them rapidly, thereby increasing code quality.
- Increased team transparency and accountability.
Faster Software Delivery and a Better User Experience, Making the Customer Happy
- Mean time to resolution (MTTR) is shorter because of the smaller code changes and quicker fault isolation.
- End-user involvement and feedback during continuous development can lead to usability improvements.
- Faster time to market makes the product more likely to succeed ahead of market demands or users' expectations, thereby increasing ROI considerably.
What is CI/CD in Azure Data Factory V2?
In Azure Data Factory V2, Continuous Integration and Delivery simply means copying data factory pipelines and their dependent objects (linked services, datasets, triggers, integration runtimes) from one environment (Development) to another (Production) using automated pipelines.
In this blog, we will be implementing the Azure Data Factory CI/CD process using a PowerShell module called azure.datafactory.tools. The key advantage of this module is that it can publish all of the Azure Data Factory code from JSON files by calling one method. Some of the key benefits that pushed me to use this method are:
- Finding the right order for deploying objects
- Starting/stopping triggers automatically (instead of creating pre- and post-deployment scripts in an ARM template deployment, as explained here)
- Including/excluding deployment of objects by name and/or type
Some of the features that are planned for the future are:
- Unit testing of pipelines and linked services
- Build function to support validation of files, dependencies and config
For the complete list of the methods supported by this PowerShell module, please refer here.
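To give a sense of what that single method looks like, here is a minimal sketch of installing the module and publishing a factory from a folder of JSON files. The folder path is a placeholder, and the full script (including per-stage configuration) is covered in Part 3.

```powershell
# Install the module from the PowerShell Gallery (one-off)
Install-Module -Name azure.datafactory.tools -Scope CurrentUser

# Publish every ADF object defined as JSON under the given folder
# to the target factory in a single call
Publish-AdfV2FromJson -RootFolder "<path-to-adf-json-folder>" `
                      -ResourceGroupName "RG-Ash-Dev" `
                      -DataFactoryName "DF-Ash-Dev" `
                      -Location "Australia East"
```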
Process Workflow
The process workflow of CI/CD in Azure Data Factory V2 is as follows:
- A developer creates a feature branch to implement a code change in the Dev ADF instance, which has Git configured.
- The developer makes iterative changes in their feature branch and clicks "Save", which creates corresponding commits in the feature branch. They can test their changes individually in the feature branch (e.g. "Test Connection" to test linked services, the "Debug" option to test a data pipeline, etc.). Please note that a feature branch can't be tested by publishing to Dev ADF on its own, because the Publish option is only available from the collaboration branch, i.e. the master branch by default.
- When the developer is happy with the changes in their feature branch and wants to test them on Dev ADF alongside the existing code, they create a pull request to merge the changes into the master branch. It is good practice to have at least one approver (who is not the author) review the code and merge the pull request into the master branch. When merging the pull request, it is also good practice to select the options to squash changes when merging and to delete the branch after merging. This helps keep the commit history and the repo clean.
- Upon every commit to the master branch, a build pipeline is triggered automatically. It is a good practice to set branch policies on the master branch so that it can only be updated via pull request. Refer here for more details on branch policies. If any required policy is enabled, the branch cannot be deleted and changes must be made via pull request.
- On successful completion of the build pipeline, an artifact of the repository containing the new changes along with the existing code is published.
- A release pipeline is configured as a two-stage deployment (DEV -> PROD) and is set to trigger every time a new build artifact is published by the build pipeline. The artifact is released automatically to Dev ADF. This is where we can identify issues after the deployment and check whether the new changes, along with the existing code, are working as expected by validating the results in the "Data Factory" mode of Dev ADF. The PROD stage has a pre-deployment condition set to approve deployment to PROD, so it will wait for manual intervention before deploying to the PROD stage. You should approve the deployment only if the results in DEV meet expectations. This is also a good place to set up automated test cases in the future.
- If there are issues in the build artifact, go back to Step 1 and create another feature branch to rectify them. You can disregard the current build artifact, as each build artifact is independent of the others and can be rolled out on its own. The PowerShell module deploys all the objects (new and existing) in ADF from each build artifact, so there are no dependencies on previous build artifacts.
- If you are happy with the changes in the DEV stage, approve the release pipeline to deploy to the PROD stage, and your new code changes will be rolled out to Production.
Step 1: Azure Resources setup using Azure CLI
For demonstration purposes, I have assumed that there are two environments which need to be set up, i.e. Development (DEV) and Production (PROD). You can follow the same process to extend this to additional environments, depending on your organisation's model.
I will be following some of the best practices for CI/CD in Azure Data Factory V2 as explained here.
The resources we will be creating are:
- Azure DevOps: This is where we'll set up a version control system (VCS) using Azure Repos to host our code and set up Azure Pipelines for Continuous Integration and Continuous Deployment.
- Azure Resource Group: This is the container that will hold all related resources in our Azure solution.
- Azure Data Factory V2: This is where our sample data factory pipeline will be hosted. I will be setting up Git Integration for DEV ADF to host all objects in Azure Repos to implement version control.
- Azure Storage Account: This is where we will store our sample CSV file which will be used as a source in our sample pipeline.
- Azure SQL Database: This is where we will copy the data from our sample file into a SQL table as a destination in our sample pipeline.
- Azure Key Vault: This is where we will store the Azure Storage Account connection strings and the Azure SQL Database connection strings as secrets. We will also store the Azure Service Principal passwords.
- Azure Service Principal: This will be required to access Azure Key Vault from Azure DevOps.
I will be creating two resources in each category (except Azure DevOps), one for each environment. As a standard naming convention, I will be suffixing their names with Dev and Prod.
Azure DevOps
1. Navigate to dev.azure.com to sign into Azure DevOps. You’ll be greeted with a page similar to the one below:
2. Click on the “+ New Project” Button. Fill in the name of your project and description and click Create:
3. Now navigate to Azure Repos within your new DevOps project; you should be greeted with a page that looks similar to the screenshot below. Click "Initialize with a README or gitignore" to initialize the repository.
After initializing the repository, your screen should look something like this:
Azure CLI
If you would prefer to copy the following commands rather than writing them from scratch, you can access them at https://github.com/ashisharora1909/ADF-CICD-Demo under the "1-ResourceSetup" folder. The file has a ".azcli" extension, which can be opened in Visual Studio Code after installing the free Azure CLI Tools extension from Microsoft.
1. Log in to Azure
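The az commands below run the same way in a PowerShell or Bash session; for example:

```powershell
# Sign in interactively; a browser window opens for authentication
az login
```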
2. List subscriptions in Azure
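For example:

```powershell
# Show every subscription the signed-in account can access
az account list --output table
```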
3. Set the active subscription to the one under which you want to create the resources. In my case, it's Visual Studio Professional.
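For example:

```powershell
# Make the chosen subscription the default for subsequent commands
az account set --subscription "Visual Studio Professional"
```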
4. Create resource groups "RG-Ash-Dev" & "RG-Ash-Prod" and set the location (Australia East).
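Something like:

```powershell
# One resource group per environment, both in Australia East
az group create --name "RG-Ash-Dev" --location "australiaeast"
az group create --name "RG-Ash-Prod" --location "australiaeast"
```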
5. Install the Data Factory extension for Azure CLI. Please note that the extension is currently experimental and not covered by customer support.
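For example:

```powershell
# The data factory commands ship as a separate, experimental extension
az extension add --name datafactory
```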
6. Create data factories "DF-Ash-Dev" & "DF-Ash-Prod". Set the location to Australia East.
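Something along these lines (depending on the extension version, the command group may be az datafactory factory create instead):

```powershell
# Create one factory per environment
az datafactory create --resource-group "RG-Ash-Dev" --factory-name "DF-Ash-Dev" --location "australiaeast"
az datafactory create --resource-group "RG-Ash-Prod" --factory-name "DF-Ash-Prod" --location "australiaeast"
```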
7. Configure Dev ADF to use the Git repository that you created in the Azure DevOps step above.
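A sketch of the command; replace the placeholder values with your own (root-folder simply points at the root of the repository):

```powershell
# Link the DEV factory to the Azure Repos Git repository
az datafactory configure-factory-repo `
    --factory-resource-id "/subscriptions/<subscription-id>/resourceGroups/RG-Ash-Dev/providers/Microsoft.DataFactory/factories/DF-Ash-Dev" `
    --factory-vsts-configuration account-name="<devops-organisation>" collaboration-branch="master" `
        project-name="<devops-project>" repository-name="<repo-name>" root-folder="/" tenant-id="<tenant-id>" `
    --location "australiaeast"
```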
where:
- factory-resource-id is the resource ID of the DEV Azure Data Factory V2
- account-name is the Azure DevOps organisation
- collaboration-branch is master (general practice)
- project-name is the Azure DevOps project
- repository-name is the Data Factory repo
- tenant-id is the Azure Active Directory tenant used by Azure DevOps
Please note that we don't need to configure a Git repository in Prod ADF, as the changes will be pushed to Production using the CD pipeline.
8. Create storage accounts "saashdev" & "saashprod". Set the location to Australia East.
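For example:

```powershell
# Locally-redundant storage accounts, one per environment
az storage account create --name "saashdev" --resource-group "RG-Ash-Dev" --location "australiaeast" --sku "Standard_LRS"
az storage account create --name "saashprod" --resource-group "RG-Ash-Prod" --location "australiaeast" --sku "Standard_LRS"
```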
9. List the account keys of the storage accounts. We will need them in the following steps.
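For example:

```powershell
# Print the access keys for each account
az storage account keys list --account-name "saashdev" --resource-group "RG-Ash-Dev" --output table
az storage account keys list --account-name "saashprod" --resource-group "RG-Ash-Prod" --output table
```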
10. Create a container named "source" in the storage account for Dev & Prod. This container will hold the sample CSV file.
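Something like, using the keys listed in the previous step:

```powershell
# A container named "source" in each storage account
az storage container create --name "source" --account-name "saashdev" --account-key "<saashdev-key>"
az storage container create --name "source" --account-name "saashprod" --account-key "<saashprod-key>"
```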
11. Create an Azure SQL Database server for Dev & Prod. This will be the database server where the SQL Database will be hosted. Then create an Azure SQL Database for Dev & Prod. This will be the SQL Database where the destination table will be created.
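A sketch for the Dev environment; the server and database names and the admin credentials are placeholders of my own choosing (repeat with the Prod resource group for PROD):

```powershell
# Logical SQL server (the name must be globally unique), then the database itself
az sql server create --name "sql-ash-dev" --resource-group "RG-Ash-Dev" --location "australiaeast" `
    --admin-user "<admin-user>" --admin-password "<admin-password>"
az sql db create --name "sqldb-ash-dev" --server "sql-ash-dev" --resource-group "RG-Ash-Dev" --service-objective "Basic"
```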
12. Create an Azure Key Vault for both Dev & Prod. This will be used to store secrets. Set the location to Australia East.
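For example (the vault names are placeholders and must be globally unique):

```powershell
az keyvault create --name "KV-Ash-Dev" --resource-group "RG-Ash-Dev" --location "australiaeast"
az keyvault create --name "KV-Ash-Prod" --resource-group "RG-Ash-Prod" --location "australiaeast"
```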
13. Create a secret to store the Storage Account connection string in Key Vault.
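Something like the following, where the secret name is illustrative (repeat against the Prod vault with the Prod connection string):

```powershell
az keyvault secret set --vault-name "KV-Ash-Dev" --name "StorageConnectionString" --value "<connectionString>"
```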
where connectionString will be something like
"DefaultEndpointsProtocol=https;AccountName=saashdev;AccountKey=XXAccountKey_DEVXX;EndpointSuffix=core.windows.net"
14. Create a secret to store Azure SQL Database Connection String in Key Vault.
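Similarly, with an illustrative secret name:

```powershell
az keyvault secret set --vault-name "KV-Ash-Dev" --name "SqlDatabaseConnectionString" --value "<connectionString>"
```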
where connectionString will be something like
"Server=tcp:XXDatabaseServerNameXX.database.windows.net,1433;Initial Catalog=XXDatabaseNameXX;Persist Security Info=False;User ID=XXUserName_DEVXX;Password=XXPassword_DEVXX;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;"
15. Create a service principal for Dev & Prod
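For example (the display names are placeholders):

```powershell
# Each command returns an appId, password and tenant
az ad sp create-for-rbac --name "SP-Ash-Dev"
az ad sp create-for-rbac --name "SP-Ash-Prod"
```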
Please keep a note of the values that are returned, especially the password, as you will not be able to retrieve it later. We will be using the appId and password in the following steps.
16. Create a secret to store the Service Principal password generated in the previous step in Azure Key Vault for both Dev and Prod.
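Something like (again, the secret names are illustrative):

```powershell
az keyvault secret set --vault-name "KV-Ash-Dev" --name "ServicePrincipalPassword" --value "<sp-dev-password>"
az keyvault secret set --vault-name "KV-Ash-Prod" --name "ServicePrincipalPassword" --value "<sp-prod-password>"
```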
17. Give service principal access to Azure Key Vault.
This will be required to retrieve secrets in Azure DevOps pipelines.
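For example, granting each service principal permission to read secrets from its environment's vault:

```powershell
az keyvault set-policy --name "KV-Ash-Dev" --spn "<sp-dev-appId>" --secret-permissions get list
az keyvault set-policy --name "KV-Ash-Prod" --spn "<sp-prod-appId>" --secret-permissions get list
```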
After going through the above steps, you should see something similar to the following in the resource groups RG-Ash-Dev and RG-Ash-Prod.
RG-Ash-Dev
RG-Ash-Prod
That's all the changes we will be making for now. In the next post, we will focus on creating a data pipeline in the Development environment of Azure Data Factory V2. The pipeline will copy data from a CSV file in an Azure Storage Account to a database table in an Azure SQL Database. In the final post, this sample data pipeline will be packaged as a build artifact and released to Production using Azure Pipelines and the PowerShell module.
Thanks for reading and do check out the follow-up posts in this blog series.