Our discovery led us to find albums containing tracks of multiple genres, and for the analytics we were then pursuing, we wanted to clean the data set and assign a single genre per album. We did that in SQL of course, and didn’t actually edit the data.
Finding the most frequent input value in a group is a job for the
WITHIN GROUP (ORDER BY sort_expression) Ordered-Set Aggregate Function, as
documented in the PostgreSQL page about Aggregate
For more about the Mode function name, read the Wikipedia article that explain the Mathematics background behind it at Mode (statistics).
In order to dive into the data, let’s have a look at it from the genre point of view. As we saw in the Chinook database model a given album is always edited by a single artist, but might contain several different genres, one per track. Do we have multi-genre albums?
select title as album, string_agg( distinct genre.name, ', ' order by genre.name ) as genres from track join genre using(genreid) join album using (albumid) group by title having count(distinct genre.name) > 1;
We have the following list of multi-genre albums in our collection:
album │ genres ════════════════════════════════╪═════════════════════════════════════════════ Battlestar Galactica, Season 3 │ Sci Fi & Fantasy, Science Fiction, TV Shows Greatest Hits │ Metal, Reggae, Rock Heroes, Season 1 │ Drama, TV Shows LOST, Season 4 │ Drama, TV Shows Live After Death │ Heavy Metal, Metal Lost, Season 2 │ Drama, TV Shows Lost, Season 3 │ Drama, TV Shows Rock In Rio [CD2] │ Metal, Rock The Number of The Beast │ Metal, Rock The Office, Season 3 │ Comedy, TV Shows Unplugged │ Blues, Latin (11 rows)
Selecting a single genre per album
In the previous article, we manually counted the number of tracks of each genre per album and selected the most frequent genre: the one with the greatest number of tracks.
SQL provides an Ordered-Set Aggregate Function that does just that:
with multi_genre_albums as ( select albumid from track group by albumid having count(distinct genreid) > 1 ) select artist.name as artist, album.title as title, genre.name as genre, mode() within group (order by genreid) from multi_genre_albums join genre using(genreid) join album using(albumid) join artist using (artistid);
In this query we limit our processing to albums containing tracks of several
genres, thanks to the
multi_genre_albums CTE, where we use a simple
HAVING clause. We then use the result of the CTE and fetch
supplementary information about the albums, and use the
Aggregate Function to retain only the most frequent
artist │ title │ genre ══════════════════════╪════════════════════════════════╪═════════════════ Eric Clapton │ Unplugged │ Latin Iron Maiden │ The Number of The Beast │ Metal Iron Maiden │ Rock In Rio [CD2] │ Rock Iron Maiden │ Live After Death │ Metal Lenny Kravitz │ Greatest Hits │ Rock Battlestar Galactica │ Battlestar Galactica, Season 3 │ Science Fiction Heroes │ Heroes, Season 1 │ Drama Lost │ LOST, Season 4 │ Drama Lost │ Lost, Season 2 │ TV Shows Lost │ Lost, Season 3 │ Drama The Office │ The Office, Season 3 │ Comedy (11 rows)
Full documentation for the
mode() function says:
returns the most frequent input value (arbitrarily choosing the first one if there are multiple equally-frequent results)
Use it next time you want to select only the most-frequent element of a group in your result-set.