Do’s and Don’ts when working with CosmosDB Gremlin API

Ilse Epskamp
Azure Tutorials
Published in Azure Tutorials · 2 min read · Apr 18, 2022


Photo by Kelly Sikkema on Unsplash

In one of our previous blogs we described how to connect Azure Databricks to CosmosDB with Gremlin API and run queries on the database. Based on our experience with using CosmosDB (Gremlin API) in an automated workflow we share our view on Do’s and Don’ts when setting up the database and ingestion scripts.

Don’t….

  • configure the number of Request Units based on the largest file your workflow will process. For that file it will be sufficient, but for all other files it will be overkill and therefore costly.
    NB: the largest file is not necessarily the file with the most rows. The row count only determines file length, while the number of columns determines the load of creating or editing a vertex. For example, querying or creating a node with 5 properties requires far fewer Request Units than querying or creating a node with 50 properties.
  • postpone your decision on the partition strategy. Once the partition key is set, it cannot be changed without recreating the container.
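The sizing note above can be made concrete with a rough heuristic. This is a sketch, not an official Azure formula: the function name and the per-property cost factor are assumptions, used only to illustrate that write cost scales with property count, not just row count.

```python
# Rough illustration (not an official Azure formula): the RU cost of a
# write grows with the number of properties per vertex, so sizing
# throughput on the longest file alone can be misleading.
def estimate_relative_write_cost(rows: int, properties_per_vertex: int,
                                 ru_per_property: float = 1.0) -> float:
    """Hypothetical relative RU estimate for ingesting one file."""
    return rows * properties_per_vertex * ru_per_property

# A short, wide file can be more expensive than a long, narrow one.
wide = estimate_relative_write_cost(rows=1_000, properties_per_vertex=50)
narrow = estimate_relative_write_cost(rows=5_000, properties_per_vertex=5)
assert wide > narrow
```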

Do….

  • Autoscale the Request Units if that fits your requirements and business policy. Alternatively, use provisioned throughput with a custom autoscale mechanism: determine the required Request Units from the file details and increase or decrease the throughput by triggering a PowerShell script in your pipeline, for example in an Azure DevOps release pipeline. When doing this, make sure you queue your data ingestion pipeline, since the Request Units are tailored to a specific file. Example flow:
Image by Azure Tutorials.
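The queued flow can be sketched as follows. All function names, the RU formula, and the 400 RU floor are illustrative assumptions, not an Azure SDK API; the scaling and ingestion calls are left as comments, since in practice they would trigger e.g. a PowerShell task in a DevOps release pipeline.

```python
import queue

# Sketch of the queued flow described above: each file gets its own RU
# setting, so ingestion jobs must run strictly one at a time.
def required_rus(file_info: dict) -> int:
    # Scale with column count (see the sizing note earlier); 400 RU floor.
    return max(400, file_info["rows"] * file_info["columns"] // 100)

def run_pipeline(files: list[dict]) -> list[int]:
    jobs: "queue.Queue[dict]" = queue.Queue()
    for f in files:
        jobs.put(f)
    applied = []
    while not jobs.empty():
        f = jobs.get()          # one file at a time: RUs are file-specific
        rus = required_rus(f)
        # scale_database(rus)   # e.g. trigger a PowerShell task in DevOps
        # ingest(f)
        # scale_database(400)   # scale back down after ingestion
        applied.append(rus)
    return applied

print(run_pipeline([{"rows": 10_000, "columns": 50},
                    {"rows": 2_000, "columns": 5}]))
```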
  • When adding properties to a vertex, add them with list cardinality so you can easily add more values for that key later. This works for all properties except the primary key and the partition key.
g.addV('VERTEX_LABEL')
.property('{partition_key}','{partition_value}')
.property('primary_key','{primary_key_value}')
.property(list, '{property-name}','{property-value}')
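If you submit traversals as strings (for example via the gremlinpython client, as in our Databricks blog), a small helper can assemble the statement above. The helper itself is hypothetical; only the generated Gremlin mirrors the snippet shown.

```python
def build_add_vertex_query(label: str, partition_key: str, partition_value: str,
                           primary_key_value: str, list_props: dict) -> str:
    """Hypothetical helper that builds the addV traversal shown above."""
    parts = [f"g.addV('{label}')",
             f".property('{partition_key}','{partition_value}')",
             f".property('primary_key','{primary_key_value}')"]
    # 'list' cardinality lets a key hold multiple values later on.
    for name, value in list_props.items():
        parts.append(f".property(list, '{name}','{value}')")
    return "".join(parts)

q = build_add_vertex_query("person", "pk", "partition1", "id-001",
                           {"city": "Amsterdam"})
print(q)
```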
  • When querying data, query the partition that holds the data element. This improves performance and lowers cost, since fewer Request Units are needed. For example, when you partitioned your data based on vertex label and you are looking for a node with a particular label, you can do:
g.V().hasLabel('{vertex_label}')
.has('{partition_key}','{partition_value}')
.has('{primary_key}','{primary_key_value}')
  • Keep queried data in memory to reduce the amount of required read operations on your database.

Azure Tutorials frequently publishes tutorials, best practices, insights and updates about Azure services to contribute to the Azure community. Azure Tutorials is driven by two enthusiastic Azure cloud engineers, combining over 15 years of IT experience across several domains. Stay tuned for weekly blog updates and follow us if you are interested!
https://www.linkedin.com/company/azure-tutorials



Azure Certified IT Engineer with 9+ years of experience in the banking industry. Focus areas: Azure, Data Engineering, DevOps, CI/CD, Automation, Python