Exploring a Data Set in SQL

Sometimes you need to dive in an existing data set that you know very little about. Let’s say we’ve been lucky to have had a high level description of the business case covered by a database, and then access to it. Our next step is figuring out data organisation, content and quality. Our tool box: the world’s most advanced open source database, PostgreSQL, and its Structured Query Language, SQL.

In this article we are going to explore the music catalog parts of the Chinook sample database, so that you can easily reproduce our queries locally should you want to.

Being the author of pgloader I use the Chinook SQLite distribution and then load it with pgloader. It’s a good test case for its migration capabilities. Nowadays a PostgreSQL specific distribution is available tho.

Artists, Albums, Tracks, Genre

Album genres

Multi-genres albums

Selecting a single genre per album

Statistics per genre

Conclusion

Artists, Albums, Tracks, Genre

The Chinook database model contains a music collection of tracks, artists, albums and genre. Each track has a genreid which is a foreign key to the genre reference table, and also an albumid which is a foreing key to the album table. Then album has a foreign key to artist. To discover that we use the \d facility of psql, here’s an example for the track table:

chinook# \d track
        Table "public.track"
    Column    │  Type   │ Modifiers 
══════════════╪═════════╪═══════════
 trackid      │ bigint  │ not null
 name         │ text    │ 
 albumid      │ bigint  │ 
 mediatypeid  │ bigint  │ 
 genreid      │ bigint  │ 
 composer     │ text    │ 
 milliseconds │ bigint  │ 
 bytes        │ bigint  │ 
 unitprice    │ numeric │ 
Indexes:
    "idx_189229_ipk_track" PRIMARY KEY, btree (trackid)
    "idx_189229_ifk_trackalbumid" btree (albumid)
    "idx_189229_ifk_trackgenreid" btree (genreid)
    "idx_189229_ifk_trackmediatypeid" btree (mediatypeid)
Foreign-key constraints:
    "track_albumid_fkey" FOREIGN KEY (albumid) REFERENCES album(albumid)
    "track_genreid_fkey" FOREIGN KEY (genreid) REFERENCES genre(genreid)
    "track_mediatypeid_fkey" FOREIGN KEY (mediatypeid) REFERENCES mediatype(mediatypeid)
Referenced by:
    TABLE "invoiceline" CONSTRAINT "invoiceline_trackid_fkey" FOREIGN KEY (trackid) REFERENCES track(trackid)
    TABLE "playlisttrack" CONSTRAINT "playlisttrack_trackid_fkey" FOREIGN KEY (trackid) REFERENCES track(trackid)

I could offer a diagram of the relations in between those tables but instead let’s write our first query against this model, showing some details about a single album:

> \set albumid 193
> select artist.name as artist,
         album.title as album,
         track.name as track
    from track
         join album using(albumid)
         join artist using(artistid)
   where albumid = :albumid;

Note: we saw the \set trick in our previous entry How to Write SQL.

We see that the model has been done with surrogate keys: the primary keys are all derived from a sequence. Also, those key names embed the table names, which makes them unique in the whole schema, allowing us to use the JOIN ... USING(...) syntax. This syntax has a nice side effect: in the query output we find the joining column only once.

        artist         │         album         │            track             
═══════════════════════╪═══════════════════════╪══════════════════════════════
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ The Power Of Equality
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ If You Have To Ask
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Breaking The Girl
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Funky Monks
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Suck My Kiss
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ I Could Have Lied
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Mellowship Slinky In B Major
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ The Righteous & The Wicked
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Give It Away
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Blood Sugar Sex Magik
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Under The Bridge
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Naked In The Rain
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Apache Rose Peacock
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ The Greeting Song
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ My Lovely Man
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ Sir Psycho Sexy
 Red Hot Chili Peppers │ Blood Sugar Sex Magik │ They're Red Hot
(17 rows)

In the track table we saw some more interesting columns, such as milliseconds and bytes:

select name,
       milliseconds * interval '1 millisecond' as duration,
       pg_size_pretty(bytes) as bytes
  from track
 where albumid = :albumid;

Which gives the following view of the album of choice here:

             name             │   duration   │  bytes  
══════════════════════════════╪══════════════╪═════════
 The Power Of Equality        │ 00:04:03.591 │ 7957 kB
 If You Have To Ask           │ 00:03:36.79  │ 7030 kB
 Breaking The Girl            │ 00:04:55.497 │ 9576 kB
 Funky Monks                  │ 00:05:23.395 │ 10 MB
 Suck My Kiss                 │ 00:03:37.234 │ 6962 kB
 I Could Have Lied            │ 00:04:04.506 │ 7899 kB
 Mellowship Slinky In B Major │ 00:04:00.091 │ 7785 kB
 The Righteous & The Wicked   │ 00:04:08.084 │ 7943 kB
 Give It Away                 │ 00:04:43.01  │ 9091 kB
 Blood Sugar Sex Magik        │ 00:04:31.229 │ 8731 kB
 Under The Bridge             │ 00:04:24.359 │ 8479 kB
 Naked In The Rain            │ 00:04:25.717 │ 8520 kB
 Apache Rose Peacock          │ 00:04:42.226 │ 9094 kB
 The Greeting Song            │ 00:03:13.593 │ 6198 kB
 My Lovely Man                │ 00:04:39.118 │ 9004 kB
 Sir Psycho Sexy              │ 00:08:16.692 │ 16 MB
 They're Red Hot              │ 00:01:11.941 │ 2326 kB
(17 rows)

If you’re interested into compression at the album level, you might want to compare rolling sum of both the album duration and its size in bytes, track after track. The following query uses window functions to track that:

  select name,
         round(  sum(bytes::numeric) over(order by trackid)
               / sum(milliseconds) over(order by trackid), 2)
           as bps,
  
         pg_size_pretty(
            sum(bytes) over(order by trackid)
         ) as album_bytes,
         
         sum(milliseconds) over(order by trackid)
            * interval '1 millisecond' as progres

    from track
   where albumid = :albumid
order by trackid;

And here’s our rolling aggregate result where we can see the bps column for bytes per second, computed after the accumulated sum of milliseconds and bytes throughout the album’s track data:

             name             │  bps  │ album_bytes │   progres    
══════════════════════════════╪═══════╪═════════════╪══════════════
 The Power Of Equality        │ 33.45 │ 7957 kB     │ 00:04:03.591
 If You Have To Ask           │ 33.34 │ 15 MB       │ 00:07:40.381
 Breaking The Girl            │ 33.28 │ 24 MB       │ 00:12:35.878
 Funky Monks                  │ 33.23 │ 34 MB       │ 00:17:59.273
 Suck My Kiss                 │ 33.16 │ 41 MB       │ 00:21:36.507
 I Could Have Lied            │ 33.15 │ 49 MB       │ 00:25:41.013
 Mellowship Slinky In B Major │ 33.15 │ 56 MB       │ 00:29:41.104
 The Righteous & The Wicked   │ 33.11 │ 64 MB       │ 00:33:49.188
 Give It Away                 │ 33.08 │ 73 MB       │ 00:38:32.198
 Blood Sugar Sex Magik        │ 33.07 │ 81 MB       │ 00:43:03.427
 Under The Bridge             │ 33.05 │ 90 MB       │ 00:47:27.786
 Naked In The Rain            │ 33.03 │ 98 MB       │ 00:51:53.503
 Apache Rose Peacock          │ 33.03 │ 107 MB      │ 00:56:35.729
 The Greeting Song            │ 33.01 │ 113 MB      │ 00:59:49.322
 My Lovely Man                │ 33.02 │ 122 MB      │ 01:04:28.44
 Sir Psycho Sexy              │ 33.01 │ 137 MB      │ 01:12:45.132
 They're Red Hot              │ 33.01 │ 140 MB      │ 01:13:57.073
(17 rows)

If that’s your first encounter with a window function you might want to read Understanding Window Functions, an all time favorite article of this website.

Album genres

So, what kind of music do we have in this collection of Chinook data:

  select genre.name, count(*),
         sum(count(*)) over () as tracks,
         round(100.0 * count(*) / sum(count(*)) over(), 2) as pct
    from           genre
         left join track using(genreid)
group by genre.name order by genre.name;

We have here the repartition of tracks per genre and the percentage (in the pct column) of them with respect to the whole set of our 25 different music genres here:

        name        │ count │ tracks │  pct  
════════════════════╪═══════╪════════╪═══════
 Alternative        │    40 │   3503 │  1.14
 Alternative & Punk │   332 │   3503 │  9.48
 Blues              │    81 │   3503 │  2.31
 Bossa Nova         │    15 │   3503 │  0.43
 Classical          │    74 │   3503 │  2.11
 Comedy             │    17 │   3503 │  0.49
 Drama              │    64 │   3503 │  1.83
 Easy Listening     │    24 │   3503 │  0.69
 Electronica/Dance  │    30 │   3503 │  0.86
 Heavy Metal        │    28 │   3503 │  0.80
 Hip Hop/Rap        │    35 │   3503 │  1.00
 Jazz               │   130 │   3503 │  3.71
 Latin              │   579 │   3503 │ 16.53
 Metal              │   374 │   3503 │ 10.68
 Opera              │     1 │   3503 │  0.03
 Pop                │    48 │   3503 │  1.37
 R&B/Soul           │    61 │   3503 │  1.74
 Reggae             │    58 │   3503 │  1.66
 Rock               │  1297 │   3503 │ 37.03
 Rock And Roll      │    12 │   3503 │  0.34
 Sci Fi & Fantasy   │    26 │   3503 │  0.74
 Science Fiction    │    13 │   3503 │  0.37
 Soundtrack         │    43 │   3503 │  1.23
 TV Shows           │    93 │   3503 │  2.65
 World              │    28 │   3503 │  0.80
(25 rows)

A more visual way to have a look at this data right from the console is with the following query:

  select genre.name, count(*),
         repeat('■', (  100.0
                      * count(*)
                      / sum(count(*)) over()
                     )::integer
               ) as pct
    from genre
         left join track using(genreid)
group by genre.name
order by genre.name;

Which gives this time:

                    │ count │                  pct                  
════════════════════╪═══════╪═══════════════════════════════════════
 Alternative        │    40 │ ■
 Alternative & Punk │   332 │ ■■■■■■■■■
 Blues              │    81 │ ■■
 Bossa Nova         │    15 │ 
 Classical          │    74 │ ■■
 Comedy             │    17 │ 
 Drama              │    64 │ ■■
 Easy Listening     │    24 │ ■
 Electronica/Dance  │    30 │ ■
 Heavy Metal        │    28 │ ■
 Hip Hop/Rap        │    35 │ ■
 Jazz               │   130 │ ■■■■
 Latin              │   579 │ ■■■■■■■■■■■■■■■■■
 Metal              │   374 │ ■■■■■■■■■■■
 Opera              │     1 │ 
 Pop                │    48 │ ■
 R&B/Soul           │    61 │ ■■
 Reggae             │    58 │ ■■
 Rock               │  1297 │ ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
 Rock And Roll      │    12 │ 
 Sci Fi & Fantasy   │    26 │ ■
 Science Fiction    │    13 │ 
 Soundtrack         │    43 │ ■
 TV Shows           │    93 │ ■■■
 World              │    28 │ ■
(25 rows)

Now we easily spot that the music collection distribution by genre is skewed over Rock, Latin, Metal and Punk music. Also there’s a non trivial amount of TV Shows in there.

Multi-genres albums

In order to dive into the data, let’s have a look at it from the genre point of view. As we saw in the Chinook database model a given album is always edited by a single artist, but might contain several different genres, one per track. Do we have multi-genre albums?

  select title as album,
         array_agg(distinct genre.name order by genre.name) as genres
    from      track
         join genre using(genreid)
         join album using (albumid)
group by title
  having count(distinct genre.name) > 1;

In this query, we use the array_agg PostgreSQL aggregates. As is name suggests, array_agg build an array in which it aggregates all the data it sees, in this case the data is distinct genre.names in a specific ordering.

So apparently we have 11 albums with tracks from different genres:

             album              │                      genres                       
════════════════════════════════╪═══════════════════════════════════════════════════
 Battlestar Galactica, Season 3 │ {"Sci Fi & Fantasy","Science Fiction","TV Shows"}
 Greatest Hits                  │ {Metal,Reggae,Rock}
 Heroes, Season 1               │ {Drama,"TV Shows"}
 LOST, Season 4                 │ {Drama,"TV Shows"}
 Live After Death               │ {"Heavy Metal",Metal}
 Lost, Season 2                 │ {Drama,"TV Shows"}
 Lost, Season 3                 │ {Drama,"TV Shows"}
 Rock In Rio [CD2]              │ {Metal,Rock}
 The Number of The Beast        │ {Metal,Rock}
 The Office, Season 3           │ {Comedy,"TV Shows"}
 Unplugged                      │ {Blues,Latin}
(11 rows)

Many of those seem to not actually be music but rather video based content such as TV Shows and Drama, maybe we can filter them out? What I would like to do is to filter out entire albums based on a list of genres we’re not interested into (such as TV Shows and Drama, not that we don’t like them, but they won’t probably compare well with the rest of the music collection).

If we have a look at the Unplugged album, we can see that we have a data quality problem here:

  select genre.name,
         array_agg(track.name order by trackid) as tracks
    from      track
         join album using(albumid)
         join genre using(genreid)
   where album.title = 'Unplugged'
group by genre.name;

─[ RECORD 1 ]─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
name   │ Blues
tracks │ {Signe,"Before You Accuse Me","Hey Hey","Tears In Heaven","Lonely Stranger","Nobody Knows You When You're Down & Out",Layla,"Running On Faith","Walkin' Blues",Alberta,"San Francisco Bay Blues","Malted Milk","Old Love","Rollin' And Tumblin'"}
═[ RECORD 2 ]═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
name   │ Latin
tracks │ {"A Novidade","Tenho Sede",Refazenda,Realce,Esotérico,Drão,"A Paz","Beira Mar",Sampa,Parabolicamará,"Tempo Rei","Expresso 2222","Aquele Abraço",Palco,"Toda Menina Baiana","Sítio Do Pica-Pau Amarelo"}

My guess is that the Latin tracks associated with the album are actually from another album and got there erroneously. That situation with a real production data set might trigger some data quality assessment work.

The way to have the complete list of tracks per genre for each multi-genred album is the following:

with multi_genre_albums(albumid, artistid, title) as
 (
    select albumid, artistid, title
      from      track
           join genre using(genreid)
           join album using (albumid)
  group by albumid, artistid, title
    having count(distinct genre.name) > 1
 )
  select artist.name as artist,
         title as album,
         genre.name as genre,
         array_agg(track.name order by trackid) as tracks
    from multi_genre_albums
         join track using(albumid)
         join genre using(genreid)
         join artist using(artistid)
group by artist, album, genre;

We reuse the previous query in a Common Table Expression and this time only output interesting columns: those we’re going to reuse in the main query, either for reference in later joins, or because we already have it at this stage.

Selecting a single genre per album

In order to be able to filter by genre at the album level, we need to first be able to assign a single genre per album. Let’s have some more statistics about the multi-genred albums:

select artist.name as artist,
       album.title as album,
       track.name as track,
       row_number()
         over(partition by albumid
                  order by trackid)
         as n,
       count(genreid)
         over(partition by albumid, genreid)
         as same

  from track
       join
       (
          select albumid,
                 array_agg(distinct genre.name
                           order by genre.name)
                   as genres
            from track
                 join genre using(genreid)
        group by albumid
          having count(distinct genre.name) > 1
        )
        as multi_genre_album(albumid, genres) using (albumid)

        join genre using(genreid)
        join album using(albumid)
        join artist using(artistid)
        
order by artist, album, trackid;

In this query n is the track order within its album and same is how many tracks of the same genre do we have in said album. It’s meant as an intermediate query, the one you would cook just to have a manual look over the data. Also you might notice that this time rather than using multi_genre_album as a Common Table Expression we inlined it as a subquery.

What we learn from the result of this query is that we can select the genre having the highest number of tracks attached in an album as being the album’s genre. In particular we don’t see equal scores when using that simple rule, so we don’t need a tie-breaker. Let’s have a try at that:

with album_genre_count(albumid, genreid, count, max)
 as (
      select albumid,
             genreid,
             count(trackid) as tracks,
             max(count(trackid)) over(partition by albumid)
        from track
    group by albumid, genreid
    order by albumid
 )
 select artist.name as artist,
        album.title as title,
        genre.name as genre
   from      album_genre_count
        join genre using(genreid)
        join album using(albumid)
        join artist using (artistid)
  where count = max
    and albumid in (select albumid
                      from track
                  group by albumid
                    having count(distinct genreid) > 1
                  order by albumid);

In the query we limit the output to those albums with more than one genre attached to their tracks so that we can manually check our query. If you remove this restriction by albumid, the query applies as-is to the whole data set we have, so it could be made a view for later use. So let’s have a look at our problematic cases:

        artist        │             title              │      genre      
══════════════════════╪════════════════════════════════╪═════════════════
 Eric Clapton         │ Unplugged                      │ Latin
 Iron Maiden          │ Live After Death               │ Metal
 Iron Maiden          │ Rock In Rio [CD2]              │ Rock
 Iron Maiden          │ The Number of The Beast        │ Metal
 Lenny Kravitz        │ Greatest Hits                  │ Rock
 Battlestar Galactica │ Battlestar Galactica, Season 3 │ Science Fiction
 Heroes               │ Heroes, Season 1               │ Drama
 Lost                 │ Lost, Season 3                 │ Drama
 Lost                 │ Lost, Season 2                 │ TV Shows
 The Office           │ The Office, Season 3           │ Comedy
 Lost                 │ LOST, Season 4                 │ Drama
(11 rows)

All those albums are successfully assigned a single genre now. The genre we chose is the most common value found in the album’s track, and for our interactive discovery of the data set, it’s plenty good enough.

There’s a better way to write this query, as you can read in the comments below this post. I’ve written another post about the technique, which uses the Mode Ordered-Set Aggregate Function.

Statistics per genre

Now that we are able to actually play with a single genre per album, what can we learn about musical genre in that music collection? What about the average track length? Well as we all know averages are not saying much (if anything at all) about the data, let’s have a look at the median, 90, 95, and 99 percentiles instead:

with album_genre_count(albumid, genreid, count, max)
 as (
      select albumid,
             genreid,
             count(trackid) as tracks,
             max(count(trackid)) over(partition by albumid)
        from track
    group by albumid, genreid
    order by albumid
 )
  select genre.name, 

         percentile_cont(array[0.5, 0.9, 0.95, 0.99])
           within group (order by ceil(milliseconds/60000))
         as distribution,

         repeat('■',
            ceil(
              percentile_cont(0.99)
                within group (order by ceil(milliseconds/60000))
            )::integer
         )
         as hist

    from genre
         left join track using(genreid)
         left join
         (
             select trackid, album_genre_count.genreid
               from track 
                    join album_genre_count using(albumid)
              where count = max
         )
         as normalized_track_genre using(genreid)
where genre.name not in ('Comedy',
                         'Drama',
                         'Sci Fi & Fantasy',
                         'Science Fiction',
                         'TV Shows' 
                         )
group by genre.name
order by genre.name;

And now we can see our genre stats applied to only the music selection:

        name        │       distribution        │      hist      
════════════════════╪═══════════════════════════╪════════════════
 Alternative        │ {3.5,5,6,11}              │ ■■■■■■■■■■■
 Alternative & Punk │ {3,5,6,8}                 │ ■■■■■■■■
 Blues              │ {4,6,7,9}                 │ ■■■■■■■■■
 Bossa Nova         │ {3,6,6,6}                 │ ■■■■■■
 Classical          │ {4,8,9,9}                 │ ■■■■■■■■■
 Easy Listening     │ {3,4,4,4}                 │ ■■■■
 Electronica/Dance  │ {5,6,7,8}                 │ ■■■■■■■■
 Heavy Metal        │ {4.5,7,8,8}               │ ■■■■■■■■
 Hip Hop/Rap        │ {3,3,5,6}                 │ ■■■■■■
 Jazz               │ {4,6.10000000000036,9,14} │ ■■■■■■■■■■■■■■
 Latin              │ {3,5,5,7}                 │ ■■■■■■■
 Metal              │ {4,7,8,10}                │ ■■■■■■■■■■
 Opera              │ {2,2,2,2}                 │ ■■
 Pop                │ {3,4,5,11}                │ ■■■■■■■■■■■
 R&B/Soul           │ {3,5,5,6}                 │ ■■■■■■
 Reggae             │ {3,5,5,6}                 │ ■■■■■■
 Rock               │ {4,6,8,13}                │ ■■■■■■■■■■■■■
 Rock And Roll      │ {2,2,2,2}                 │ ■■
 Soundtrack         │ {4,4,5,6}                 │ ■■■■■■
 World              │ {3,4,4,5}                 │ ■■■■■
(20 rows)

The array of numbers are the median and then the 0.90, 0.95 and 0.99 percentiles of the duration of tracks per genre, to the next minute. It means that the median duration of a Rock track is under 4 minutes and up to 5 minutes in Electronica/Dance genre. Also the longest tracks are to be found in Jazz and Rock with up to respectively 14 and 13 minutes.

The quick histogram is based on the 0.99 percentile, so we have a visual of the longest tracks per genre.

The genre we skipped in the query have duration distribution in between 21 and 88 minutes, so they would have ruined our quick analysis here.

Conclusion

PostgreSQL is YeSQL

When given a new database we know nothing about, it’s good to do some preliminary exploration of the data. Which angle we pick exactly is not that important first, because we want to have grasp of what we are playing with. It’s all about having a taste of what’s in there.

In this article we took the choice of not creating any SQL object. We could have chosen to create views and maybe even to store their result for the next query thanks to a Materialized View.

When the dataset is too big to explore that way, we can also use the TABLESAMPLE clause, as per the PostgreSQL SELECT documentation.

Mainly this exploration exercise is meant to teach some SQL, so I hope you did discover several tricks you didn’t know yet!