CosmosDB (Gremlin) database design: positioning of edges in partitions

Ilse Epskamp
Published in Azure Tutorials
Jul 18, 2022

Photo by Guillermo Ferla on Unsplash

When designing and maintaining CosmosDB with the Gremlin API for your workload, you need to decide on an approach for various topics, such as:

  • Request Units provisioning strategy
  • Database consistency
  • Partition strategy
  • Data versioning
  • Data patterns

This post focuses on CosmosDB data patterns, and looks specifically at the position and usage of edges in your graph database.

Nodes and edges in database partitions
When creating a vertex in CosmosDB, you must specify a value for the partition key to indicate which partition the vertex belongs to. Assume your partition key is called PARTITION and you want to assign a record with primary key 123 to partition abc. The ingestion query could then look like this:

g.addV({label}).property('primary_key', '123').property('PARTITION','abc')

Unlike nodes, edges do not require you to specify a partition when you create them. By default, an edge is stored in the partition of its source vertex, that is, with the vertex it points out from. If the two nodes are in separate partitions, it will look like:

Storage of vertices and edges in logical partitions in CosmosDB. Image by Azure Tutorials.
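
As a minimal sketch of what creating such an edge could look like: the query below assumes the vertex from the earlier example (primary key 123 in partition abc) as the source, and a hypothetical second vertex with primary key 456 in a hypothetical partition xyz as the target. The edge label 'relates_to' is also made up for illustration. Note that no partition is given for the edge itself; it ends up with the source vertex.

```
// Source vertex: primary key 123 in partition abc (from the example above)
// Target vertex: primary key 456 in partition xyz (hypothetical)
g.V().has('PARTITION', 'abc').has('primary_key', '123')
.addE('relates_to')
.to(g.V().has('PARTITION', 'xyz').has('primary_key', '456'))
```

Filtering on the partition key when looking up both vertices keeps each lookup scoped to a single partition.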

From this storage behavior it follows that traversing to a neighbor node with out() (from A to B) is relatively quick and cheap, while traversing with in() (from B to A) is expensive. In the first scenario you are already in the partition where the edge resides; in the second you may have to run a cross-partition query, possibly across many partitions, to locate the edge.
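
To make the difference concrete, here is a sketch of both traversals. It assumes a hypothetical edge labelled 'relates_to' from vertex 123 in partition abc (node A) to a hypothetical vertex 456 in partition xyz (node B); all of these names are illustrative only.

```
// A to B: the edge is stored in A's partition, so it is found locally
g.V().has('PARTITION', 'abc').has('primary_key', '123').out('relates_to')

// B to A: the edge is stored with A, not with B, so locating it
// may require searching other partitions
g.V().has('PARTITION', 'xyz').has('primary_key', '456').in('relates_to')
```

To compare the cost of the two on your own data, you could append CosmosDB's executionProfile() step to each traversal and look at the reported timings per step.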

Example: employees, stores and regions
Let’s look at an example. Assume three entities: an employee, a store, and a region. Each entity is assigned to a dedicated partition, so your database has a partition per employee, a partition per store and a partition per region. Note that we are not implying that partitioning by entity is best practice: the partition strategy suitable for your data depends on the distribution and volume of the data, and on common query patterns.

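The employee can be manager of the store, and the store is located in a specific region. If you model your data exactly like this, with an ‘is_manager_of’ edge from employee to store and a ‘store_located_in’ edge from store to region, it will look like: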

Plotting partitions per entity, you will get:

Now let’s assume that a common query pattern is “give me all the stores, their regions and store managers”. With the above model, the query to fetch this information could look like this:

g.V()
.hasLabel('STORES')
.project('storeName','storeManager','region')
.by('store_name')
.by(inE('is_manager_of').outV().values('employee_name'))
.by(outE('store_located_in').inV().values('region_name'))

Note that to retrieve the store managers, you have to traverse the incoming relation ‘is_manager_of’. Following the recommendation to use outgoing edges where possible, a better implementation would be a relation from store to employee with label ‘has_storemanager’. The model would look like this:

Now you can use outgoing relations from the store entity for both lookups. Of course, this is just a small and simple example of handling edges in CosmosDB. In practice we have seen scenarios with large volumes of data where the execution time and the Request Units consumed by a query dropped by 50–80% after providing efficient traversal paths!
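
With the ‘has_storemanager’ relation in place, the store query could be rewritten along these lines. This is a sketch that reuses the labels and property names of the earlier query and assumes every store has a ‘has_storemanager’ edge to exactly one employee:

```
g.V()
.hasLabel('STORES')
.project('storeName','storeManager','region')
.by('store_name')
// outgoing edge, stored in the partition of the store itself
.by(outE('has_storemanager').inV().values('employee_name'))
.by(outE('store_located_in').inV().values('region_name'))
```

Both lookups now start from edges that live in the store’s own partition. If you do not need the edge itself, out('has_storemanager') is an equivalent shorthand for outE('has_storemanager').inV().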

Designing CosmosDB database patterns
When starting with CosmosDB in a project, the exact common query patterns might not be known yet. Therefore consider:

  • building flexible ingestion logic, so you can easily add, remove or update edges in your physical implementation as consumers’ common query patterns become clear
  • actively involving your consumers in the database design, so you understand their data requirements and can design fitting database patterns
  • using Logs in the Azure Portal to analyze querying behaviour and find improvement areas for your database design. For example, you can analyse the top queries by RU usage:
AzureDiagnostics
| where Category == "GremlinRequests"
| where TimeGenerated > ago(6h)
// requestCharge_s is a string column; convert it so the sort is numeric
| project piiCommandText_s, activityId_g, operationType_s, requestCharge = todouble(requestCharge_s), TimeGenerated
| order by requestCharge desc
| take 100
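
As an illustration of the first point in the list above, adapting edges to a newly discovered query pattern can be as small as adding one edge and dropping another. The sketch below switches a single store from the incoming ‘is_manager_of’ relation to the outgoing ‘has_storemanager’ relation. The property values 'Store 1' and 'Alice' are hypothetical, and in practice you would also filter on the partition key of each vertex to avoid cross-partition lookups.

```
// Add the new outgoing edge from the store to its manager
g.V().hasLabel('STORES').has('store_name', 'Store 1')
.addE('has_storemanager')
.to(g.V().has('employee_name', 'Alice'))

// Optionally remove the old edge once no consumer depends on it
g.V().has('employee_name', 'Alice').outE('is_manager_of').drop()
```

Keeping both edges for a while is also an option if different consumers traverse in different directions, at the cost of maintaining two edges per relation.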

Azure Tutorials frequently publishes tutorials, best practices, insights or updates about Azure Services, to contribute to the Azure Community. Azure Tutorials is driven by two enthusiastic Azure Cloud Engineers, combining over 15 years of IT experience in several domains. Stay tuned for weekly blog updates and follow us if you are interested!
https://www.linkedin.com/company/azure-tutorials

Ilse Epskamp
Azure Tutorials

Azure Certified IT Engineer with 8+ years of experience in the banking industry. Focus areas: Azure, Data Engineering, DevOps, CI/CD, Automation, Python.