How to Set Up CI/CD for Databricks on Azure DevOps

Data Platform in Use:

The data platform consists of Data Lake, Azure Data Factory, Azure Databricks, Synapse Analytics, and Power BI. Here’s the high-level architecture:

What is Azure DevOps & Why?

1a. What is Azure DevOps?

Azure DevOps provides a set of tools and practices that let teams across an organization collaborate to build, integrate, and deploy applications and services.

1b. Why Azure DevOps?

We opted for Azure DevOps for the following reasons:

Easy and tight integration with Azure-native components such as Azure Data Factory, Synapse Analytics, and Azure Databricks
Maintenance-free operations
Management of security roles and groups
Elastic scaling

Prerequisites:

An Azure DevOps organization and project are set up and configured
A Git repository is already set up in the Repos section of Azure DevOps

Databricks CI/CD with Azure DevOps

1. Configure Azure DevOps Git integration in Databricks by navigating to Azure Databricks -> Settings -> User Settings.

2. Under User Settings -> Git Integration -> Git provider dropdown, select “Azure DevOps Services”.

You can authenticate with either Azure Active Directory (AD) or a personal access token. We preferred AD because it is already integrated across the organization.

Save the changes.

3. Go to the Repos section and click Add Repo.

Connect the Azure DevOps Git repo using your AD login.

4. Click the name of the branch you are currently on (generally “master” if no feature branches have been created). You can create a feature branch from master and develop new code or modify the existing code there.

5. Navigate to Azure DevOps -> Repos -> Branches and create a pull request to merge your feature branch into master.

6. The feature branch can be deleted after it has been merged into master.
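Step 5 can also be scripted instead of clicked through. The following is a minimal Python sketch, assuming the Azure DevOps Services REST endpoint for creating pull requests and a personal access token for authentication. The organization, project, repository, and branch names are hypothetical placeholders, not values from this setup.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def build_pull_request(organization, project, repository, source_branch,
                       target_branch="master", title="Merge feature branch"):
    """Return the (url, payload) pair for a 'create pull request' call."""
    url = (
        f"https://dev.azure.com/{quote(organization)}/{quote(project)}"
        f"/_apis/git/repositories/{quote(repository)}/pullrequests?api-version=7.0"
    )
    payload = {
        # Branches are passed as full Git ref names.
        "sourceRefName": f"refs/heads/{source_branch}",
        "targetRefName": f"refs/heads/{target_branch}",
        "title": title,
    }
    return url, payload


def create_pull_request(url, payload, personal_access_token):
    """Send the request. Needs network access and a token allowed to write code."""
    # A personal access token is sent as the password of HTTP Basic auth
    # with an empty user name.
    credentials = base64.b64encode(f":{personal_access_token}".encode()).decode()
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical names, shown only to illustrate the request shape.
url, payload = build_pull_request("my-org", "my-project", "my-repo", "feature/add-etl")
print(url)
print(json.dumps(payload, indent=2))
```

Only `build_pull_request` runs here; `create_pull_request` is left uncalled so the sketch can be read and tried without credentials.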

CI Process in Azure DevOps for Databricks:

1. To set up CI for Databricks, create a pipeline by clicking Pipelines and choosing “Use the classic editor”.

2. Select the repository and the master branch as the source of the artifacts.

3. Select “Empty job” on the template selection tab.

4. Add a “Publish Artifact: Notebooks” task to the pipeline to build the artifact from the Databricks notebooks.

5. Set “Path to publish” to the notebooks folder in the DevOps Git repository and enter an Artifact name.
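To make the publish step less of a black box, here is a rough Python sketch of what it amounts to: copying the notebook files under the chosen path into an artifact folder while keeping the folder layout. This is an illustration only, not the implementation of the Azure DevOps task. The list of file extensions and all folder names are assumptions.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: notebooks are stored in the repo as source files with these extensions.
NOTEBOOK_SUFFIXES = {".py", ".sql", ".scala", ".r", ".ipynb"}


def stage_notebooks(path_to_publish, staging_dir):
    """Copy notebook files under path_to_publish into staging_dir.

    The relative folder layout is preserved. Returns the staged relative paths.
    """
    source = Path(path_to_publish)
    target = Path(staging_dir)
    staged = []
    for file in sorted(source.rglob("*")):
        if file.is_file() and file.suffix.lower() in NOTEBOOK_SUFFIXES:
            relative = file.relative_to(source)
            destination = target / relative
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(file, destination)
            staged.append(relative.as_posix())
    return staged


# Small demonstration on a throwaway folder with hypothetical file names.
with tempfile.TemporaryDirectory() as workdir:
    repo_notebooks = Path(workdir) / "repo" / "notebooks"
    (repo_notebooks / "etl").mkdir(parents=True)
    (repo_notebooks / "etl" / "load_sales.py").write_text(
        "# Databricks notebook source\nprint('hello')\n"
    )
    (repo_notebooks / "README.md").write_text("not a notebook\n")
    print(stage_notebooks(repo_notebooks, Path(workdir) / "artifact" / "Notebooks"))
```

The artifact name chosen in step 5 corresponds to the `Notebooks` folder in this sketch; the release pipeline later picks up the files from there.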

Set up the CD process for the artifacts built above:

1. Select Releases from the Pipelines section and select New release pipeline.
2. In the Artifacts section, select the artifact built by the CI pipeline.

3. Add a Development job with two tasks: Configure Databricks CLI and Deploy Notebooks to Workspace.

3a. Add the target Databricks workspace URL (it can be copied by opening the Databricks workspace in another tab) and an access token created in Databricks.

To obtain an access token, navigate to Databricks workspace -> Settings -> User Settings -> Generate new token (if one has not been generated already). Copy the generated token and store it somewhere safe for later use, because Databricks displays it only once.

3b. Add the notebooks folder and the target Databricks workspace folder where the artifacts should be deployed.

Save the changes.

4. Repeat step 3 for the Production job and connect it to the Development job.

5. Configure the continuous deployment trigger on the release pipeline by choosing which branch it watches. In this example, any change committed to the “master” branch triggers the release pipeline.

6. Enable and configure pre-deployment conditions on the Production job by adding the users who must approve the changes.
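For readers who want to see what the two tasks in step 3 boil down to, the sketch below approximates them in Python. It assumes the Databricks CLI profile format (an INI file with `host` and `token`) and the Databricks Workspace API import endpoint. It is not the code of the Azure DevOps tasks themselves, and the workspace URL, token, and folder names are hypothetical placeholders.

```python
import base64
import configparser
import json
import os
import tempfile
import urllib.request
from pathlib import Path

# Notebook language is inferred from the file extension (assumed repo convention).
LANGUAGES = {".py": "PYTHON", ".sql": "SQL", ".scala": "SCALA", ".r": "R"}


def write_cli_profile(host, token, config_path):
    """Write a Databricks CLI style profile: [DEFAULT] with host and token."""
    config = configparser.ConfigParser(interpolation=None)
    config["DEFAULT"] = {"host": host, "token": token}
    with open(config_path, "w") as handle:
        config.write(handle)
    os.chmod(config_path, 0o600)  # the file holds a secret, so restrict access


def build_import_request(local_file, source_root, workspace_folder):
    """Build the JSON body that imports one notebook file into the workspace."""
    local_file = Path(local_file)
    relative = local_file.relative_to(source_root)
    # Workspace notebook paths carry no file extension.
    notebook_path = relative.with_suffix("").as_posix()
    return {
        "path": f"{workspace_folder.rstrip('/')}/{notebook_path}",
        "format": "SOURCE",
        "language": LANGUAGES[local_file.suffix.lower()],
        "content": base64.b64encode(local_file.read_bytes()).decode(),
        "overwrite": True,
    }


def import_notebook(host, token, body):
    """POST the body to the workspace import endpoint. Needs network access.

    Assumption: the parent workspace folder already exists.
    """
    request = urllib.request.Request(
        f"{host.rstrip('/')}/api/2.0/workspace/import",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Offline demonstration with placeholder values; nothing is sent to Databricks.
with tempfile.TemporaryDirectory() as workdir:
    root = Path(workdir) / "Notebooks"
    (root / "etl").mkdir(parents=True)
    notebook = root / "etl" / "load_sales.py"
    notebook.write_text("# Databricks notebook source\nprint('hello')\n")
    body = build_import_request(notebook, root, "/Shared/release")
    print(body["path"], body["language"])
    write_cli_profile(
        "https://adb-0000000000000000.0.azuredatabricks.net",
        "dapi-placeholder",
        Path(workdir) / ".databrickscfg",
    )
```

In a real pipeline the token would come from a secret pipeline variable rather than a literal in code. `import_notebook` is left uncalled so the sketch runs without a workspace.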

Conclusion:

This concludes the CI/CD setup for Databricks using Azure DevOps. In future posts, I will cover CI/CD for Azure Data Factory and Synapse Analytics.