Sometimes you need to dive in an existing data set that you know very little about. Let’s say we’ve been lucky to have had a high level description of the business case covered by a database, and then access to it. Our next step is figuring out data organisation, content and quality. Our tool box: the world’s most advanced open source database, PostgreSQL, and its Structured Query Language, SQL.
In this article we are going to explore the music catalog parts of the Chinook sample database, so that you can easily reproduce our queries locally should you want to.
Being the author of pgloader I use the Chinook SQLite distribution and then load it with pgloader. It’s a good test case for its migration capabilities. Nowadays a PostgreSQL specific distribution is available tho.
Table of Contents
Artists, Albums, Tracks, Genre
The Chinook database model contains a music collection of tracks, artists,
albums and genre. Each track has a genreid which is a foreign key to the
genre reference table, and also an albumid which is a foreing key to
the album table. Then album has a foreign key to artist. To discover that
we use the \d
facility of psql, here’s an example for the track table:
chinook# \d track
Table "public.track"
Column │ Type │ Modifiers
══════════════╪═════════╪═══════════
trackid │ bigint │ not null
name │ text │
albumid │ bigint │
mediatypeid │ bigint │
genreid │ bigint │
composer │ text │
milliseconds │ bigint │
bytes │ bigint │
unitprice │ numeric │
Indexes:
"idx_189229_ipk_track" PRIMARY KEY, btree (trackid)
"idx_189229_ifk_trackalbumid" btree (albumid)
"idx_189229_ifk_trackgenreid" btree (genreid)
"idx_189229_ifk_trackmediatypeid" btree (mediatypeid)
Foreign-key constraints:
"track_albumid_fkey" FOREIGN KEY (albumid) REFERENCES album(albumid)
"track_genreid_fkey" FOREIGN KEY (genreid) REFERENCES genre(genreid)
"track_mediatypeid_fkey" FOREIGN KEY (mediatypeid) REFERENCES mediatype(mediatypeid)
Referenced by:
TABLE "invoiceline" CONSTRAINT "invoiceline_trackid_fkey" FOREIGN KEY (trackid) REFERENCES track(trackid)
TABLE "playlisttrack" CONSTRAINT "playlisttrack_trackid_fkey" FOREIGN KEY (trackid) REFERENCES track(trackid)
I could offer a diagram of the relations in between those tables but instead let’s write our first query against this model, showing some details about a single album:
> \set albumid 193
> select artist.name as artist,
album.title as album,
track.name as track
from track
join album using(albumid)
join artist using(artistid)
where albumid = :albumid;
Note: we saw the \set
trick in our previous
entry How to Write SQL.
We see that the model has been done with surrogate keys: the primary keys
are all derived from a sequence. Also, those key names embed the table
names, which makes them unique in the whole schema, allowing us to use the
JOIN ... USING(...)
syntax. This syntax has a nice side effect: in the
query output we find the joining column only once.
artist │ album │ track
═══════════════════════╪═══════════════════════╪══════════════════════════════
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ The Power Of Equality
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ If You Have To Ask
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Breaking The Girl
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Funky Monks
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Suck My Kiss
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ I Could Have Lied
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Mellowship Slinky In B Major
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ The Righteous & The Wicked
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Give It Away
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Blood Sugar Sex Magik
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Under The Bridge
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Naked In The Rain
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Apache Rose Peacock
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ The Greeting Song
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ My Lovely Man
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Sir Psycho Sexy
Red Hot Chili Peppers │ Blood Sugar Sex Magik │ They're Red Hot
(17 rows)
In the track table we saw some more interesting columns, such as milliseconds and bytes:
select name,
milliseconds * interval '1 millisecond' as duration,
pg_size_pretty(bytes) as bytes
from track
where albumid = :albumid;
Which gives the following view of the album of choice here:
name │ duration │ bytes
══════════════════════════════╪══════════════╪═════════
The Power Of Equality │ 00:04:03.591 │ 7957 kB
If You Have To Ask │ 00:03:36.79 │ 7030 kB
Breaking The Girl │ 00:04:55.497 │ 9576 kB
Funky Monks │ 00:05:23.395 │ 10 MB
Suck My Kiss │ 00:03:37.234 │ 6962 kB
I Could Have Lied │ 00:04:04.506 │ 7899 kB
Mellowship Slinky In B Major │ 00:04:00.091 │ 7785 kB
The Righteous & The Wicked │ 00:04:08.084 │ 7943 kB
Give It Away │ 00:04:43.01 │ 9091 kB
Blood Sugar Sex Magik │ 00:04:31.229 │ 8731 kB
Under The Bridge │ 00:04:24.359 │ 8479 kB
Naked In The Rain │ 00:04:25.717 │ 8520 kB
Apache Rose Peacock │ 00:04:42.226 │ 9094 kB
The Greeting Song │ 00:03:13.593 │ 6198 kB
My Lovely Man │ 00:04:39.118 │ 9004 kB
Sir Psycho Sexy │ 00:08:16.692 │ 16 MB
They're Red Hot │ 00:01:11.941 │ 2326 kB
(17 rows)
If you’re interested into compression at the album level, you might want to compare rolling sum of both the album duration and its size in bytes, track after track. The following query uses window functions to track that:
select name,
round( sum(bytes::numeric) over(order by trackid)
/ sum(milliseconds) over(order by trackid), 2)
as bps,
pg_size_pretty(
sum(bytes) over(order by trackid)
) as album_bytes,
sum(milliseconds) over(order by trackid)
* interval '1 millisecond' as progres
from track
where albumid = :albumid
order by trackid;
And here’s our rolling aggregate result where we can see the bps column for bytes per second, computed after the accumulated sum of milliseconds and bytes throughout the album’s track data:
name │ bps │ album_bytes │ progres
══════════════════════════════╪═══════╪═════════════╪══════════════
The Power Of Equality │ 33.45 │ 7957 kB │ 00:04:03.591
If You Have To Ask │ 33.34 │ 15 MB │ 00:07:40.381
Breaking The Girl │ 33.28 │ 24 MB │ 00:12:35.878
Funky Monks │ 33.23 │ 34 MB │ 00:17:59.273
Suck My Kiss │ 33.16 │ 41 MB │ 00:21:36.507
I Could Have Lied │ 33.15 │ 49 MB │ 00:25:41.013
Mellowship Slinky In B Major │ 33.15 │ 56 MB │ 00:29:41.104
The Righteous & The Wicked │ 33.11 │ 64 MB │ 00:33:49.188
Give It Away │ 33.08 │ 73 MB │ 00:38:32.198
Blood Sugar Sex Magik │ 33.07 │ 81 MB │ 00:43:03.427
Under The Bridge │ 33.05 │ 90 MB │ 00:47:27.786
Naked In The Rain │ 33.03 │ 98 MB │ 00:51:53.503
Apache Rose Peacock │ 33.03 │ 107 MB │ 00:56:35.729
The Greeting Song │ 33.01 │ 113 MB │ 00:59:49.322
My Lovely Man │ 33.02 │ 122 MB │ 01:04:28.44
Sir Psycho Sexy │ 33.01 │ 137 MB │ 01:12:45.132
They're Red Hot │ 33.01 │ 140 MB │ 01:13:57.073
(17 rows)
If that’s your first encounter with a window function you might want to read Understanding Window Functions, an all time favorite article of this website.
Album genres
So, what kind of music do we have in this collection of Chinook data:
select genre.name, count(*),
sum(count(*)) over () as tracks,
round(100.0 * count(*) / sum(count(*)) over(), 2) as pct
from genre
left join track using(genreid)
group by genre.name order by genre.name;
We have here the repartition of tracks per genre and the percentage (in the pct column) of them with respect to the whole set of our 25 different music genres here:
name │ count │ tracks │ pct
════════════════════╪═══════╪════════╪═══════
Alternative │ 40 │ 3503 │ 1.14
Alternative & Punk │ 332 │ 3503 │ 9.48
Blues │ 81 │ 3503 │ 2.31
Bossa Nova │ 15 │ 3503 │ 0.43
Classical │ 74 │ 3503 │ 2.11
Comedy │ 17 │ 3503 │ 0.49
Drama │ 64 │ 3503 │ 1.83
Easy Listening │ 24 │ 3503 │ 0.69
Electronica/Dance │ 30 │ 3503 │ 0.86
Heavy Metal │ 28 │ 3503 │ 0.80
Hip Hop/Rap │ 35 │ 3503 │ 1.00
Jazz │ 130 │ 3503 │ 3.71
Latin │ 579 │ 3503 │ 16.53
Metal │ 374 │ 3503 │ 10.68
Opera │ 1 │ 3503 │ 0.03
Pop │ 48 │ 3503 │ 1.37
R&B/Soul │ 61 │ 3503 │ 1.74
Reggae │ 58 │ 3503 │ 1.66
Rock │ 1297 │ 3503 │ 37.03
Rock And Roll │ 12 │ 3503 │ 0.34
Sci Fi & Fantasy │ 26 │ 3503 │ 0.74
Science Fiction │ 13 │ 3503 │ 0.37
Soundtrack │ 43 │ 3503 │ 1.23
TV Shows │ 93 │ 3503 │ 2.65
World │ 28 │ 3503 │ 0.80
(25 rows)
A more visual way to have a look at this data right from the console is with the following query:
select genre.name, count(*),
repeat('■', ( 100.0
* count(*)
/ sum(count(*)) over()
)::integer
) as pct
from genre
left join track using(genreid)
group by genre.name
order by genre.name;
Which gives this time:
│ count │ pct
════════════════════╪═══════╪═══════════════════════════════════════
Alternative │ 40 │ ■
Alternative & Punk │ 332 │ ■■■■■■■■■
Blues │ 81 │ ■■
Bossa Nova │ 15 │
Classical │ 74 │ ■■
Comedy │ 17 │
Drama │ 64 │ ■■
Easy Listening │ 24 │ ■
Electronica/Dance │ 30 │ ■
Heavy Metal │ 28 │ ■
Hip Hop/Rap │ 35 │ ■
Jazz │ 130 │ ■■■■
Latin │ 579 │ ■■■■■■■■■■■■■■■■■
Metal │ 374 │ ■■■■■■■■■■■
Opera │ 1 │
Pop │ 48 │ ■
R&B/Soul │ 61 │ ■■
Reggae │ 58 │ ■■
Rock │ 1297 │ ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Rock And Roll │ 12 │
Sci Fi & Fantasy │ 26 │ ■
Science Fiction │ 13 │
Soundtrack │ 43 │ ■
TV Shows │ 93 │ ■■■
World │ 28 │ ■
(25 rows)
Now we easily spot that the music collection distribution by genre is skewed over Rock, Latin, Metal and Punk music. Also there’s a non trivial amount of TV Shows in there.
Multi-genres albums
In order to dive into the data, let’s have a look at it from the genre point of view. As we saw in the Chinook database model a given album is always edited by a single artist, but might contain several different genres, one per track. Do we have multi-genre albums?
select title as album,
array_agg(distinct genre.name order by genre.name) as genres
from track
join genre using(genreid)
join album using (albumid)
group by title
having count(distinct genre.name) > 1;
In this query, we use the array_agg PostgreSQL aggregates. As is name
suggests, array_agg build an array in which it aggregates all the data
it sees, in this case the data is distinct genre.names
in a specific
ordering.
So apparently we have 11 albums with tracks from different genres:
album │ genres
════════════════════════════════╪═══════════════════════════════════════════════════
Battlestar Galactica, Season 3 │ {"Sci Fi & Fantasy","Science Fiction","TV Shows"}
Greatest Hits │ {Metal,Reggae,Rock}
Heroes, Season 1 │ {Drama,"TV Shows"}
LOST, Season 4 │ {Drama,"TV Shows"}
Live After Death │ {"Heavy Metal",Metal}
Lost, Season 2 │ {Drama,"TV Shows"}
Lost, Season 3 │ {Drama,"TV Shows"}
Rock In Rio [CD2] │ {Metal,Rock}
The Number of The Beast │ {Metal,Rock}
The Office, Season 3 │ {Comedy,"TV Shows"}
Unplugged │ {Blues,Latin}
(11 rows)
Many of those seem to not actually be music but rather video based content such as TV Shows and Drama, maybe we can filter them out? What I would like to do is to filter out entire albums based on a list of genres we’re not interested into (such as TV Shows and Drama, not that we don’t like them, but they won’t probably compare well with the rest of the music collection).
If we have a look at the Unplugged album, we can see that we have a data quality problem here:
select genre.name,
array_agg(track.name order by trackid) as tracks
from track
join album using(albumid)
join genre using(genreid)
where album.title = 'Unplugged'
group by genre.name;
─[ RECORD 1 ]─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
name │ Blues
tracks │ {Signe,"Before You Accuse Me","Hey Hey","Tears In Heaven","Lonely Stranger","Nobody Knows You When You're Down & Out",Layla,"Running On Faith","Walkin' Blues",Alberta,"San Francisco Bay Blues","Malted Milk","Old Love","Rollin' And Tumblin'"}
═[ RECORD 2 ]═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
name │ Latin
tracks │ {"A Novidade","Tenho Sede",Refazenda,Realce,Esotérico,Drão,"A Paz","Beira Mar",Sampa,Parabolicamará,"Tempo Rei","Expresso 2222","Aquele Abraço",Palco,"Toda Menina Baiana","Sítio Do Pica-Pau Amarelo"}
My guess is that the Latin tracks associated with the album are actually from another album and got there erroneously. That situation with a real production data set might trigger some data quality assessment work.
The way to have the complete list of tracks per genre for each multi-genred album is the following:
with multi_genre_albums(albumid, artistid, title) as
(
select albumid, artistid, title
from track
join genre using(genreid)
join album using (albumid)
group by albumid, artistid, title
having count(distinct genre.name) > 1
)
select artist.name as artist,
title as album,
genre.name as genre,
array_agg(track.name order by trackid) as tracks
from multi_genre_albums
join track using(albumid)
join genre using(genreid)
join artist using(artistid)
group by artist, album, genre;
We reuse the previous query in a Common Table Expression and this time only output interesting columns: those we’re going to reuse in the main query, either for reference in later joins, or because we already have it at this stage.
Selecting a single genre per album
In order to be able to filter by genre at the album level, we need to first be able to assign a single genre per album. Let’s have some more statistics about the multi-genred albums:
select artist.name as artist,
album.title as album,
track.name as track,
row_number()
over(partition by albumid
order by trackid)
as n,
count(genreid)
over(partition by albumid, genreid)
as same
from track
join
(
select albumid,
array_agg(distinct genre.name
order by genre.name)
as genres
from track
join genre using(genreid)
group by albumid
having count(distinct genre.name) > 1
)
as multi_genre_album(albumid, genres) using (albumid)
join genre using(genreid)
join album using(albumid)
join artist using(artistid)
order by artist, album, trackid;
In this query n is the track order within its album and same is how many tracks of the same genre do we have in said album. It’s meant as an intermediate query, the one you would cook just to have a manual look over the data. Also you might notice that this time rather than using multi_genre_album as a Common Table Expression we inlined it as a subquery.
What we learn from the result of this query is that we can select the genre having the highest number of tracks attached in an album as being the album’s genre. In particular we don’t see equal scores when using that simple rule, so we don’t need a tie-breaker. Let’s have a try at that:
with album_genre_count(albumid, genreid, count, max)
as (
select albumid,
genreid,
count(trackid) as tracks,
max(count(trackid)) over(partition by albumid)
from track
group by albumid, genreid
order by albumid
)
select artist.name as artist,
album.title as title,
genre.name as genre
from album_genre_count
join genre using(genreid)
join album using(albumid)
join artist using (artistid)
where count = max
and albumid in (select albumid
from track
group by albumid
having count(distinct genreid) > 1
order by albumid);
In the query we limit the output to those albums with more than one genre attached to their tracks so that we can manually check our query. If you remove this restriction by albumid, the query applies as-is to the whole data set we have, so it could be made a view for later use. So let’s have a look at our problematic cases:
artist │ title │ genre
══════════════════════╪════════════════════════════════╪═════════════════
Eric Clapton │ Unplugged │ Latin
Iron Maiden │ Live After Death │ Metal
Iron Maiden │ Rock In Rio [CD2] │ Rock
Iron Maiden │ The Number of The Beast │ Metal
Lenny Kravitz │ Greatest Hits │ Rock
Battlestar Galactica │ Battlestar Galactica, Season 3 │ Science Fiction
Heroes │ Heroes, Season 1 │ Drama
Lost │ Lost, Season 3 │ Drama
Lost │ Lost, Season 2 │ TV Shows
The Office │ The Office, Season 3 │ Comedy
Lost │ LOST, Season 4 │ Drama
(11 rows)
All those albums are successfully assigned a single genre now. The genre we chose is the most common value found in the album’s track, and for our interactive discovery of the data set, it’s plenty good enough.
There’s a better way to write this query, as you can read in the comments below this post. I’ve written another post about the technique, which uses the Mode Ordered-Set Aggregate Function.
Statistics per genre
Now that we are able to actually play with a single genre per album, what can we learn about musical genre in that music collection? What about the average track length? Well as we all know averages are not saying much (if anything at all) about the data, let’s have a look at the median, 90, 95, and 99 percentiles instead:
with album_genre_count(albumid, genreid, count, max)
as (
select albumid,
genreid,
count(trackid) as tracks,
max(count(trackid)) over(partition by albumid)
from track
group by albumid, genreid
order by albumid
)
select genre.name,
percentile_cont(array[0.5, 0.9, 0.95, 0.99])
within group (order by ceil(milliseconds/60000))
as distribution,
repeat('■',
ceil(
percentile_cont(0.99)
within group (order by ceil(milliseconds/60000))
)::integer
)
as hist
from genre
left join track using(genreid)
left join
(
select trackid, album_genre_count.genreid
from track
join album_genre_count using(albumid)
where count = max
)
as normalized_track_genre using(genreid)
where genre.name not in ('Comedy',
'Drama',
'Sci Fi & Fantasy',
'Science Fiction',
'TV Shows'
)
group by genre.name
order by genre.name;
And now we can see our genre stats applied to only the music selection:
name │ distribution │ hist
════════════════════╪═══════════════════════════╪════════════════
Alternative │ {3.5,5,6,11} │ ■■■■■■■■■■■
Alternative & Punk │ {3,5,6,8} │ ■■■■■■■■
Blues │ {4,6,7,9} │ ■■■■■■■■■
Bossa Nova │ {3,6,6,6} │ ■■■■■■
Classical │ {4,8,9,9} │ ■■■■■■■■■
Easy Listening │ {3,4,4,4} │ ■■■■
Electronica/Dance │ {5,6,7,8} │ ■■■■■■■■
Heavy Metal │ {4.5,7,8,8} │ ■■■■■■■■
Hip Hop/Rap │ {3,3,5,6} │ ■■■■■■
Jazz │ {4,6.10000000000036,9,14} │ ■■■■■■■■■■■■■■
Latin │ {3,5,5,7} │ ■■■■■■■
Metal │ {4,7,8,10} │ ■■■■■■■■■■
Opera │ {2,2,2,2} │ ■■
Pop │ {3,4,5,11} │ ■■■■■■■■■■■
R&B/Soul │ {3,5,5,6} │ ■■■■■■
Reggae │ {3,5,5,6} │ ■■■■■■
Rock │ {4,6,8,13} │ ■■■■■■■■■■■■■
Rock And Roll │ {2,2,2,2} │ ■■
Soundtrack │ {4,4,5,6} │ ■■■■■■
World │ {3,4,4,5} │ ■■■■■
(20 rows)
The array of numbers are the median and then the 0.90, 0.95 and 0.99 percentiles of the duration of tracks per genre, to the next minute. It means that the median duration of a Rock track is under 4 minutes and up to 5 minutes in Electronica/Dance genre. Also the longest tracks are to be found in Jazz and Rock with up to respectively 14 and 13 minutes.
The quick histogram is based on the 0.99 percentile, so we have a visual of the longest tracks per genre.
The genre we skipped in the query have duration distribution in between 21 and 88 minutes, so they would have ruined our quick analysis here.
Conclusion
When given a new database we know nothing about, it’s good to do some preliminary exploration of the data. Which angle we pick exactly is not that important first, because we want to have grasp of what we are playing with. It’s all about having a taste of what’s in there.
In this article we took the choice of not creating any SQL object. We could have chosen to create views and maybe even to store their result for the next query thanks to a Materialized View.
When the dataset is too big to explore that way, we can also use the TABLESAMPLE clause, as per the PostgreSQL SELECT documentation.
Mainly this exploration exercise is meant to teach some SQL, so I hope you did discover several tricks you didn’t know yet!