Best practices when using Databricks notebooks in an automated workflow

Ilse Epskamp
Published in Azure Tutorials · 4 min read · Mar 21, 2022


With Azure Databricks in your resource group, you have a powerful tool to handle your data and analytics use cases. The platform provides several great features, and as a data engineer I feel that Databricks combined with Data Factory gives you the tools to handle any data requirement that comes up. In this blog we describe a few best practices and lessons learned that we encountered when designing automated workflows with Databricks, and provide some useful generic (Python) functions which will hopefully come in handy when working on your own automated workflow.

1. Generalize functions

This seems logical, right? However, when handling many data sources with different file formats, delivery timestamps and validation requirements, the pitfall of writing custom functions for specific scenarios is just around the corner. Don’t! Spend a little more effort to keep your functions generic so you can reuse them, even if at the time of writing you don’t yet see a second use for such a function.

Let’s look at an example. As part of your validation process, you run several generic functions on your incoming data to assess whether it meets the set standards (e.g. the file is the latest file). Based on experience, only file ABC has shown issues with column Y. Therefore you write a specific validation function for this file, which checks the content of column Y. When writing this function, don’t limit it to handling only file ABC and column Y, but ensure any file and any column can be passed.

Not generic:

def validate_file_abc_col_y(file_abc):
    return file_abc.select("Y")

Generic:

def validate_col(file, columnname):
    return file.select(columnname)
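
The generic version covers the original scenario and any future one, for example (file_xyz and column Z are hypothetical):

validate_col(file_abc, "Y")
validate_col(file_xyz, "Z")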

2. Centralize functions

What is the use of generic functions if you don’t centralize them and make them available to other notebooks? For example, create a separate notebook that holds all your functions, so it serves as a library for your other notebooks. In every notebook that needs the library, run this to load its content (or edit the path accordingly):

%run ./library_notebook

Another important benefit of centralizing functions is the reduced maintenance effort. If a function requires a fix or update, you only have to do it once!
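
As a minimal sketch, assuming library_notebook contains the validate_col function from the previous section: once the %run cell has executed, everything defined in library_notebook is available in the calling notebook’s scope.

# in the calling notebook, after %run ./library_notebook
validated = validate_col(df_abc, "Y")   # df_abc is a hypothetical dataframe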

3. Assign a separate cell to each function

By assigning each function to its own cell, your notebook not only looks better, it is also easier to troubleshoot in case of code errors.

One function per cell.
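
For example, the library notebook from the previous sections could be split into cells like this (a sketch; the cell boundaries are indicated with comments, and the functions are the ones used in this blog):

# --- cell 1 ---
def get_file_extension(filename):
    return filename.split(".")[-1]

# --- cell 2 ---
def validate_col(file, columnname):
    return file.select(columnname)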

4. Ensure variables are consistent across notebooks and across environments

It shouldn’t matter whether you are working in notebook A, B or C, or in the development, test, acceptance or production (DTAP) environment: common variables, such as the name of a sourced data file passed by Data Factory or the name of the first dataframe that you create, should have the same name in all notebooks. This way you don’t get confused, and more importantly, you speed up development. As a bonus, your notebooks will have the same look and feel, which makes them easier for others to understand.

For example, in every notebook, name the variable that receives the triggerFileName parameter from Data Factory triggerFileName as well.

triggerFileName=dbutils.widgets.get("triggerFileName")

When you use CI/CD to deploy your code across DTAP, the variable names will then be consistent across environments.
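
As a side note (a sketch, assuming Data Factory passes triggerFileName as a base parameter): you can declare the widget with an empty default, so the notebook also runs interactively while Data Factory overrides the value when it triggers the notebook.

# declare the widget with an empty default so the notebook can also be run interactively;
# when triggered from Data Factory, the base parameter overrides this default
dbutils.widgets.text("triggerFileName", "")
triggerFileName = dbutils.widgets.get("triggerFileName")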

Generic functions: some examples

Get the current date in the format yyyymmdd and the current time in hhmmss, useful when generating the current processing datetime:

import datetime

# returns yyyymmdd
def get_yyyymmdd():
    now = datetime.datetime.now().strftime('%Y%m%d')
    return now

# returns hhmmss
def get_hhmmss():
    now = datetime.datetime.now()
    currenthour = str(now.hour)
    currentminute = str(now.minute)
    currentsecond = str(now.second)

    if len(currenthour) == 1:
        currenthour = "0" + currenthour
    if len(currentminute) == 1:
        currentminute = "0" + currentminute
    if len(currentsecond) == 1:
        currentsecond = "0" + currentsecond

    currenttime = currenthour + currentminute + currentsecond
    return currenttime
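
For example, the two functions can be combined into a single processing datetime string (the variable names are just for illustration):

processing_date = get_yyyymmdd()    # e.g. "20220321"
processing_time = get_hhmmss()      # e.g. "093005"
processing_datetime = processing_date + processing_time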

Create a Spark dataframe for csv, txt, parquet, json and orc files, with column names in uppercase, by passing the file path and file extension:

# first fetch the file extension
def get_file_extension(filename):
    return filename.split(".")[-1]

# then pass the filepath and filetype to create a dataframe
def create_dataframe(filepath, filetype):

    if filetype == "csv" or filetype == "txt":
        file_body = (spark.read.option("header", "true")
                     .option("multiline", "true")
                     .option("sep", ",")
                     .csv(filepath))
    elif filetype == "parquet":
        file_body = spark.read.parquet(filepath)
    elif filetype == "json":
        file_body = spark.read.option("multiline", "true").json(filepath)
    else:
        # orc
        file_body = (spark.read.option("inferSchema", True)
                     .option("escape", '"')
                     .option("multiLine", True)
                     .orc(filepath))

    # convert column names into uppercase
    for col in file_body.columns:
        file_body = file_body.withColumnRenamed(col, col.upper())

    return file_body
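
Combining the two, for example (the mount path is hypothetical):

filepath = "/mnt/landing/abc.csv"
filetype = get_file_extension(filepath)    # "csv"
df = create_dataframe(filepath, filetype)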

Azure Tutorials frequently publishes tutorials, best practices, insights and updates about Azure services, to contribute to the Azure community. Azure Tutorials is driven by two enthusiastic Azure Cloud Engineers, combining over 15 years of IT experience in several domains. Stay tuned for weekly blog updates and follow us if you are interested!
https://www.linkedin.com/company/azure-tutorials
