How to pass parameters between Data Factory and Databricks

Ilse Epskamp · Azure Tutorials · Jan 10, 2022

Data Factory and Databricks. Image by Azure Tutorials.

When working with data in Azure, running a Databricks notebook as part of a Data Factory pipeline is a common scenario. There are various reasons to include Databricks as an activity in your Data Factory flow, for example when the default Data Factory activities are not sufficient for your data pre-processing or transformation requirements. In this blog I explain how to pass parameters between your Data Factory pipeline and your Databricks notebook, so you can use variables from your Data Factory pipeline in your Databricks notebook and vice versa, and integrate both components in your data workflow.

Prerequisites

· Data Factory and Databricks are available in your Azure resource group
· Linked service for Databricks is available in Data Factory
· Cluster is available in Databricks

For the purpose of this blog, we use a very simple scenario where we:

1. Generate a constant value in a Data Factory pipeline variable named input_value;
2. Pass input_value to a Databricks notebook, execute some simple logic, and return a result variable to Data Factory;
3. Pick up the result from the notebook in Data Factory, and store it in a Data Factory pipeline variable named output_value for further processing.

Set up the Databricks notebook
Let’s start by setting up the Databricks notebook. We create a simple notebook that takes the variable adf_input_value as input and generates an output variable adf_output_value, which we pass back to Data Factory.

As you can see, to fetch a parameter passed by Data Factory, you can use:

dbutils.widgets.get("{fromDataFactoryVariableName}")

To output a value on notebook exit, you can use:

dbutils.notebook.exit(json.dumps({
    "{toDataFactoryVariableName}": {databricksVariableName}
}))
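
Putting these snippets together, here is a minimal sketch of the full notebook. The increment-by-one logic is an assumption that matches the example run later in this post, where an input of 1 returns 2:

import json

# Fetch the parameter passed in by the Data Factory Notebook activity.
# "adf_input_value" must match the Base parameter name configured in Data Factory.
adf_input_value = dbutils.widgets.get("adf_input_value")

# Widget values arrive as strings, so cast before applying the (example) logic.
result = int(adf_input_value) + 1

# Return the result to Data Factory on notebook exit.
# "adf_output_value" is the key we will reference in the pipeline.
dbutils.notebook.exit(json.dumps({
    "adf_output_value": str(result)
}))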

Set up the Data Factory pipeline
Now we set up the Data Factory pipeline. We will create a pipeline with two pipeline variables and three activities. First we create the two pipeline variables input_value and output_value, both of type String:

Create pipeline variables in Data Factory

We add three activities to the pipeline: Set variable, Notebook, and Set variable.

Pipeline workflow in Data Factory

1. Set variable for input_value. Select the activity, and in the Variables tab set the variable input_value to a constant value of 1.

Set a pipeline variable in Data Factory

2. Notebook. In this activity we trigger the Databricks notebook. In the General tab, give the activity a name. You will need this name later when you fetch the notebook output in your pipeline; we called it Run notebook. Next, configure the Databricks linked service in the Azure Databricks tab.

Configure Linked Service for Notebook activity in Data Factory

Next, in the Settings tab, select the notebook that you want to trigger in your Databricks workspace by clicking Browse:

Configure Databricks notebook path in Data Factory

Now we will configure the parameters that we want to pass from Data Factory to our Databricks notebook. This can be done by creating a Base parameter for every variable that you want to pass. In our scenario, we want to pass pipeline variable input_value to the notebook. Click on New to add a new Base parameter and give it a name. Be aware this is the parameter name that you will fetch in your Databricks notebook. In our example, we name it adf_input_value. Next, assign a value to the parameter. Click on the Value text box > Add dynamic content, and select input_value from the pane that appears. This ensures that the value of pipeline variable input_value is passed to the notebook. When done, your Base parameters configuration should look like this:

Set base parameters in Databricks notebook activity
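
Behind the Add dynamic content selection, the Value of the Base parameter resolves to a Data Factory expression that reads the pipeline variable:

@variables('input_value')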

3. Set variable for output_value. Here we fetch the result from the Databricks notebook activity and assign it to the pipeline variable output_value. In our Databricks notebook we configured the notebook to return a variable called "adf_output_value" on exit. In the Variables tab, select the variable output_value. As its value, select adf_output_value from the Notebook activity result:

Set pipeline variable in Data Factory to fetch Databricks output

As you can see, to fetch the output of a notebook activity and assign it to a variable, use:

@activity('{notebookActivityName}').output['runOutput']['{toDataFactoryVariableName}']

Run the pipeline and assess the results of the individual activities. In the above example, we pass "1" to the Databricks notebook and, based on the notebook logic, expect "2" to be returned to Data Factory:

Data Factory Pipeline Run result

Pass Array instead of String
In this example we passed a String-type variable between Data Factory and Databricks. Besides strings, you can also pass arrays. To achieve this, set the type of the relevant Data Factory variable to Array. For example, if the notebook returns an array to Data Factory, make sure the Data Factory pipeline variable that picks up the notebook result is of type Array. In Data Factory you can then fetch individual items from the array using indexes: variableName[0], variableName[1], etc.
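
As a small sketch of the notebook side (the name adf_output_values and the list contents are illustrative), returning an array works the same way as returning a string:

import json

# Hypothetical example: return a list instead of a single value.
adf_output_values = ["a", "b", "c"]
dbutils.notebook.exit(json.dumps({
    "adf_output_values": adf_output_values
}))

In the pipeline, assign @activity('Run notebook').output['runOutput']['adf_output_values'] to an Array-type variable, after which the items are accessible as variableName[0], variableName[1], and so on.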

Summary
To pass parameters between Data Factory and Databricks, we performed the following steps:

(1) Set Data Factory pipeline variable input_value = 1
(2) Set Data Factory Notebook activity Base parameter adf_input_value = input_value
(3) Pick up adf_input_value in the Databricks notebook
(4) Generate and return adf_output_value from Databricks to Data Factory
(5) Set Data Factory pipeline variable output_value = adf_output_value

Key takeaways
- To pass a value from Data Factory to Databricks, configure Base Parameters in the Notebook activity, specifying what Data Factory variables you want to pass.

- To fetch passed parameters in Databricks, use dbutils.widgets.get()

- To return parameters from Databricks to Data Factory, you can use dbutils.notebook.exit(json.dumps({}))

- To access the Databricks result in Data Factory, you can use

@activity('{notebookActivityName}').output['runOutput']['{toDataFactoryVariableName}']

Pipeline JSON:
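
A reconstructed sketch of the relevant pipeline fragments, using the activity, variable, and parameter names from this example; the pipeline name, linked service reference, and notebook path are placeholders:

{
  "name": "databricks-parameters-pipeline",
  "properties": {
    "variables": {
      "input_value": { "type": "String" },
      "output_value": { "type": "String" }
    },
    "activities": [
      {
        "name": "Set input_value",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "input_value",
          "value": "1"
        }
      },
      {
        "name": "Run notebook",
        "type": "DatabricksNotebook",
        "dependsOn": [
          { "activity": "Set input_value", "dependencyConditions": ["Succeeded"] }
        ],
        "linkedServiceName": {
          "referenceName": "AzureDatabricksLinkedService",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "notebookPath": "/Users/<user>/parameters-notebook",
          "baseParameters": {
            "adf_input_value": {
              "value": "@variables('input_value')",
              "type": "Expression"
            }
          }
        }
      },
      {
        "name": "Set output_value",
        "type": "SetVariable",
        "dependsOn": [
          { "activity": "Run notebook", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": {
          "variableName": "output_value",
          "value": {
            "value": "@activity('Run notebook').output['runOutput']['adf_output_value']",
            "type": "Expression"
          }
        }
      }
    ]
  }
}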

Azure Tutorials frequently publishes tutorials, best practices, insights or updates about Azure Services, to contribute to the Azure Community. Azure Tutorials is driven by two enthusiastic Azure Cloud Engineers, combining over 15 years of IT experience in several domains. Stay tuned for weekly blog updates and follow us if you are interested!

https://www.linkedin.com/company/azure-tutorials
