SQL operations you might need one day

Ilse Epskamp
4 min readMay 2, 2024
Photo by fabio on Unsplash

If you work with data you will most likely use SQL every now and then. If data is not populated in your SQL tables according to basic normalization standards, you might find yourself struggling to find the correct SQL syntax to handle the data properly. For example, multiple values could be stored in a column separated by a comma, or you need to do some date conversions or calculations. In my work as a data engineer I’ve encountered several scenarios like this and thought of sharing a somewhat random collection of SQL operations to handle such scenarios.

In this blog:

  • Use SUBSTRING() to split values on whitespace and take the value of the first position
  • Use ROWNUMBER() to select the latest version of an entity by last update date
  • Use COALESCE() to assign a true/false indicator based on occurrence of a value in another table
  • Use STRING_AGG() to convert a 1-to-many structure into a single column with comma separated values

Use SUBSTRING() to split values on whitespace and take the value of the first position

Scenario: The first and last name of a person is stored in a single column. You need to select the first name. NB in this example I assume the first name has no whitespace.

Table data:

+-----+----------------+
| ID | NAME |
+-----+----------------+
| 123 | Keanu Reeves |
| 456 | Sandra Bullock |
+-----+----------------+

SQL:

SELECT
ID,
NAME,
SUBSTRING(NAME,0,CHARINDEX(' ', NAME)) as FIRST_NAME
FROM table

Returns:


+-----+----------------+------------+
| ID | NAME | FIRST_NAME |
+-----+----------------+------------+
| 123 | Keanu Reeves | Keanu |
| 456 | Sandra Bullock | Sandra |
+-----+----------------+------------+

Use ROWNUMBER() to select the latest version of an entity by last update date

Scenario: An item has an owner, which changed over time. You need to select the current owner based on latest update date.

Table data:

+---------+------------+------------------+
| ITEM_ID | OWNER_NAME | DATE_LAST_UPDATE |
+---------+------------+------------------+
| 1 | James | 20220101 |
| 1 | Noah | 20220305 |
| 1 | Kelly | 20230512 |
| 2 | William | 20230601 |
| 2 | Patricia | 20231001 |
| 2 | David | 20240206 |
+---------+------------+------------------+

SQL:

First rank by date:

SELECT
ITEM_ID,
OWNER_NAME,
DATE_LAST_UPDATE,
ROW_NUMBER() OVER (PARTITION BY ITEM_ID ORDER BY DATE_LAST_UPDATE DESC) as DATE_LAST_UPDATED_RANKED
FROM table

Returns:

+---------+------------+------------------+-------------------------+
| ITEM_ID | OWNER_NAME | DATE_LAST_UPDATE | DATE_LAST_UPDATE_RANKED |
+---------+------------+------------------+-------------------------+
| 1 | James | 20220101 | 3 |
| 1 | Noah | 20220305 | 2 |
| 1 | Kelly | 20230512 | 1 |
| 2 | William | 20230601 | 3 |
| 2 | Patricia | 20231001 | 2 |
| 2 | David | 20240206 | 1 |
+---------+------------+------------------+-------------------------+

Then select the records where DATE_LAST_UPDATE_RANKED=1.
Full query:

WITH

item_owners_ranked_by_date AS (
SELECT
ITEM_ID,
OWNER_NAME,
DATE_LAST_UPDATE,
ROW_NUMBER() OVER (PARTITION BY ITEM_ID ORDER BY DATE_LAST_UPDATE DESC) as DATE_LAST_UPDATED_RANKED
FROM table
)

SELECT
ITEM_ID,
OWNER_NAME,
DATE_LAST_UPDATE
FROM item_owners_ranked_by_date
WHERE DATE_LAST_UPDATED_RANKED=1

Returns:

+---------+------------+------------------+
| ITEM_ID | OWNER_NAME | DATE_LAST_UPDATE |
+---------+------------+------------------+
| 1 | Kelly | 20230512 |
| 2 | David | 20240206 |
+---------+------------+------------------+

Use COALESCE() to assign a true/false indicator based on occurrence of a value in another table

Scenario: let’s say you have a table with IDs called table_valid_ids. The IDs in that table are considered to be valid. You have another table called all_ids, and need to add a valid indicator column. In case the ID occurs in table_valid_ids, you assign 1, else 0.

Table data:

all_ids

+----+
| ID |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
| 11 |
| 12 |
| 13 |
| 14 |
| 15 |
| 16 |
+----+


valid_ids

+----+
| ID |
+----+
| 1 |
| 2 |
| 5 |
| 6 |
| 8 |
| 9 |
| 10 |
| 12 |
| 16 |
+----+

SQL:


SELECT
ID,
COALESCE((SELECT DISTINCT('1') FROM valid_ids WHERE ID = a.ID), '0') as VALID_INDICATOR
FROM all_ids a

Returns:

+----+-----------------+
| ID | VALID_INDICATOR |
+----+-----------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| 5 | 1 |
| 6 | 1 |
| 7 | 0 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 0 |
| 12 | 1 |
| 13 | 0 |
| 14 | 0 |
| 15 | 0 |
| 16 | 1 |
+----+-----------------+

Use STRING_AGG() to convert a 1-to-many structure into a single column with comma separated values

Scenario: you have a table with a 1-many relation and need to convert this structure into a list, so the result returns a single records per ID.

Table data:

+---------+-------+
| ITEM_ID | USER |
+---------+-------+
| 1 | Jane |
| 1 | Kevin |
| 2 | John |
| 2 | David |
| 3 | Kat |
| 3 | Lucas |
| 3 | Noah |
+---------+-------+

SQL:

SELECT 
ITEM_ID,
STRING_AGG([USER], ',') WITHIN GROUP (ORDER BY [USER]) AS userList
FROM mdlake.tst
GROUP BY ITEM_ID

Returns:

+---------+----------------+
| ITEM_ID | userList |
+---------+----------------+
| 1 | Jane,Kevin |
| 2 | David,John |
| 3 | Kat,Lucas,Noah |
+---------+----------------+

NB: The equivalent function for STRING_AGG() in SparkSQL is array_join():

 array_join(collect_set(USER), ',') userList

Happy data engineering!

--

--

Ilse Epskamp

Azure Certified IT Engineer with 9+ years of experience in the banking industry. Focus areas: Azure, Data Engineering, DevOps, CI/CD, Automation, Python