SQL operations you might need one day
If you work with data you will most likely use SQL every now and then. If data is not populated in your SQL tables according to basic normalization standards, you might find yourself struggling to find the correct SQL syntax to handle the data properly. For example, multiple values could be stored in a column separated by a comma, or you need to do some date conversions or calculations. In my work as a data engineer I’ve encountered several scenarios like this and thought of sharing a somewhat random collection of SQL operations to handle such scenarios.
In this blog:
- Use SUBSTRING() to split values on whitespace and take the value of the first position
- Use ROWNUMBER() to select the latest version of an entity by last update date
- Use COALESCE() to assign a true/false indicator based on occurrence of a value in another table
- Use STRING_AGG() to convert a 1-to-many structure into a single column with comma separated values
Use SUBSTRING() to split values on whitespace and take the value of the first position
Scenario: The first and last name of a person is stored in a single column. You need to select the first name. NB in this example I assume the first name has no whitespace.
Table data:
+-----+----------------+
| ID | NAME |
+-----+----------------+
| 123 | Keanu Reeves |
| 456 | Sandra Bullock |
+-----+----------------+
SQL:
SELECT
ID,
NAME,
SUBSTRING(NAME,0,CHARINDEX(' ', NAME)) as FIRST_NAME
FROM table
Returns:
+-----+----------------+------------+
| ID | NAME | FIRST_NAME |
+-----+----------------+------------+
| 123 | Keanu Reeves | Keanu |
| 456 | Sandra Bullock | Sandra |
+-----+----------------+------------+
Use ROWNUMBER() to select the latest version of an entity by last update date
Scenario: An item has an owner, which changed over time. You need to select the current owner based on latest update date.
Table data:
+---------+------------+------------------+
| ITEM_ID | OWNER_NAME | DATE_LAST_UPDATE |
+---------+------------+------------------+
| 1 | James | 20220101 |
| 1 | Noah | 20220305 |
| 1 | Kelly | 20230512 |
| 2 | William | 20230601 |
| 2 | Patricia | 20231001 |
| 2 | David | 20240206 |
+---------+------------+------------------+
SQL:
First rank by date:
SELECT
ITEM_ID,
OWNER_NAME,
DATE_LAST_UPDATE,
ROW_NUMBER() OVER (PARTITION BY ITEM_ID ORDER BY DATE_LAST_UPDATE DESC) as DATE_LAST_UPDATED_RANKED
FROM table
Returns:
+---------+------------+------------------+-------------------------+
| ITEM_ID | OWNER_NAME | DATE_LAST_UPDATE | DATE_LAST_UPDATE_RANKED |
+---------+------------+------------------+-------------------------+
| 1 | James | 20220101 | 3 |
| 1 | Noah | 20220305 | 2 |
| 1 | Kelly | 20230512 | 1 |
| 2 | William | 20230601 | 3 |
| 2 | Patricia | 20231001 | 2 |
| 2 | David | 20240206 | 1 |
+---------+------------+------------------+-------------------------+
Then select the records where DATE_LAST_UPDATE_RANKED=1.
Full query:
WITH
item_owners_ranked_by_date AS (
SELECT
ITEM_ID,
OWNER_NAME,
DATE_LAST_UPDATE,
ROW_NUMBER() OVER (PARTITION BY ITEM_ID ORDER BY DATE_LAST_UPDATE DESC) as DATE_LAST_UPDATED_RANKED
FROM table
)
SELECT
ITEM_ID,
OWNER_NAME,
DATE_LAST_UPDATE
FROM item_owners_ranked_by_date
WHERE DATE_LAST_UPDATED_RANKED=1
Returns:
+---------+------------+------------------+
| ITEM_ID | OWNER_NAME | DATE_LAST_UPDATE |
+---------+------------+------------------+
| 1 | Kelly | 20230512 |
| 2 | David | 20240206 |
+---------+------------+------------------+
Use COALESCE() to assign a true/false indicator based on occurrence of a value in another table
Scenario: let’s say you have a table with IDs called table_valid_ids. The IDs in that table are considered to be valid. You have another table called all_ids, and need to add a valid indicator column. In case the ID occurs in table_valid_ids, you assign 1, else 0.
Table data:
all_ids
+----+
| ID |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
| 11 |
| 12 |
| 13 |
| 14 |
| 15 |
| 16 |
+----+
valid_ids
+----+
| ID |
+----+
| 1 |
| 2 |
| 5 |
| 6 |
| 8 |
| 9 |
| 10 |
| 12 |
| 16 |
+----+
SQL:
SELECT
ID,
COALESCE((SELECT DISTINCT('1') FROM valid_ids WHERE ID = a.ID), '0') as VALID_INDICATOR
FROM all_ids a
Returns:
+----+-----------------+
| ID | VALID_INDICATOR |
+----+-----------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| 5 | 1 |
| 6 | 1 |
| 7 | 0 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 0 |
| 12 | 1 |
| 13 | 0 |
| 14 | 0 |
| 15 | 0 |
| 16 | 1 |
+----+-----------------+
Use STRING_AGG() to convert a 1-to-many structure into a single column with comma separated values
Scenario: you have a table with a 1-many relation and need to convert this structure into a list, so the result returns a single records per ID.
Table data:
+---------+-------+
| ITEM_ID | USER |
+---------+-------+
| 1 | Jane |
| 1 | Kevin |
| 2 | John |
| 2 | David |
| 3 | Kat |
| 3 | Lucas |
| 3 | Noah |
+---------+-------+
SQL:
SELECT
ITEM_ID,
STRING_AGG([USER], ',') WITHIN GROUP (ORDER BY [USER]) AS userList
FROM mdlake.tst
GROUP BY ITEM_ID
Returns:
+---------+----------------+
| ITEM_ID | userList |
+---------+----------------+
| 1 | Jane,Kevin |
| 2 | David,John |
| 3 | Kat,Lucas,Noah |
+---------+----------------+
NB: The equivalent function for STRING_AGG() in SparkSQL is array_join():
array_join(collect_set(USER), ',') userList
Happy data engineering!