In our previous article Aggregating NBA data, PostgreSQL vs MongoDB we spent time comparing the pretty new MongoDB Aggregation Framework with the decades old SQL aggregates. Today, let’s showcase more of those SQL aggregates, producing a nice histogram right from our SQL console.
PostgreSQL and Mathematics
The other day while giving a
Practical SQL training my attention drifted to
width_bucket function available as part of the
Mathematical Functions and Operators PostgreSQL is offering to its fearless
Here’s what the documentation says about it:
The function width_bucket(op numeric, b1 numeric, b2 numeric, count int) returns (as an int) the bucket to which operand would be assigned in an equidepth histogram with count buckets, in the range b1 to b2.
Let’s have a look at our dataset from the NBA games and statistics, and get
back to counting
rebounds in the
drb field. A preliminary query informs us
that we have stats ranging from 10 to 54 rebounds per team in a single game,
a good information we can use in the following query:
select width_bucket(drb, 10, 54, 9) as buckets, count(*) from team_stats group by buckets order by buckets;
We asked for 9 separations so we have 10 groups as a result:
width_bucket | count --------------+------- 1 | 52 2 | 1363 3 | 8832 4 | 20917 5 | 20681 6 | 9166 7 | 2093 8 | 247 9 | 20 10 | 1 (10 rows)
Now, what would it take to actually be able to display the full story right
psql console, for preview before actually integrating a new
diagram in our reporting solution? Turns out it’s not very complex.
First, we want to avoid hard coding the range of rebounds we’re processing, so we are going to compute that in a first step. Then we want the histogram data, which is a ordered list of ranges with a min and a max value and a frequency, which is how many games were recorded with a number or rebounds within any given bucket range. And last, we want to display something a little more visual than just a list of numbers:
with drb_stats as ( select min(drb) as min, max(drb) as max from team_stats ), histogram as ( select width_bucket(drb, min, max, 9) as bucket, int4range(min(drb), max(drb), '') as range, count(*) as freq from team_stats, drb_stats group by bucket order by bucket ) select bucket, range, freq, repeat('■', ( freq::float / max(freq) over() * 30 )::int ) as bar from histogram;
The first part of the query is the drb_stats common table expression, and its role is to fetch the min and max values of drb in our whole dataset. It’s better than doing the query separately and then injecting the result in the next query, because there’s no roundtrip, but it’s basically the idea.
Then the histogram common table expression is the meat of the query, where we compute the bucket distribution of our data set. Finally, we use a cheap trick with the repeat function where we scale down the frequency into a display of 30 characters:
bucket | range | freq | bar --------+---------+-------+-------------------------------- 1 | [10,15) | 52 | 2 | [15,20) | 1363 | ■■ 3 | [20,25) | 8832 | ■■■■■■■■■■■■■ 4 | [25,30) | 20917 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 5 | [30,35) | 20681 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 6 | [35,40) | 9166 | ■■■■■■■■■■■■■ 7 | [40,45) | 2093 | ■■■ 8 | [45,50) | 247 | 9 | [50,54) | 20 | 10 | [54,55) | 1 | (10 rows) Time: 53.570 ms
Also, as we’re using PostgreSQL, just having two columns with the min and
max as separate values is not enough, what we actually need is
of rebounds for each bucket, hence using the
The only remaining step then consists into hacking our way into actually
displaying something visual enough for a quick less-than-1-minute effort
of data crunching, using the
repeat function which is part
PostgreSQL String Functions and Operators.
Note that we’re using
the Window Function expression
max(freq) over() to have access the highest frequency value from each and
every result row.