In our previous article
Aggregating NBA data, PostgreSQL vs MongoDB we spent
time comparing the pretty new
*MongoDB Aggregation Framework* with the decades
old SQL aggregates. Today, let’s showcase more of those SQL aggregates,
producing a nice
*histogram* right from our SQL console.

# PostgreSQL and Mathematics

The other day while giving a
Practical SQL training my attention drifted to
the
`width_bucket`

function available as part of the
Mathematical Functions and Operators PostgreSQL is offering to its fearless
SQL users.

Here’s what the documentation says about it:

The function width_bucket(op numeric, b1 numeric, b2 numeric, count int) returns (as an int) the bucket to which operand would be assigned in an equidepth histogram with count buckets, in the range b1 to b2.

Let’s have a look at our dataset from the NBA games and statistics, and get
back to counting
*rebounds* in the
`drb`

field. A preliminary query informs us
that we have stats ranging from 10 to 54 rebounds per team in a single game,
a good information we can use in the following query:

```
select width_bucket(drb, 10, 54, 9) as buckets,
count(*)
from team_stats
group by buckets
order by buckets;
```

We asked for 9 separations so we have 10 groups as a result:

```
width_bucket | count
--------------+-------
1 | 52
2 | 1363
3 | 8832
4 | 20917
5 | 20681
6 | 9166
7 | 2093
8 | 247
9 | 20
10 | 1
(10 rows)
```

# Console Histograms

Now, what would it take to actually be able to display the full story right
into our `psql`

console, for preview before actually integrating a new
diagram in our reporting solution? Turns out it’s not very complex.

First, we want to avoid hard coding the range of *rebounds* we’re
processing, so we are going to compute that in a first step. Then we want
the *histogram* data, which is a ordered list of ranges with a min and a max
value and a *frequency*, which is how many games were recorded with a number
or *rebounds* within any given *bucket* range. And last, we want to display
something a little more visual than just a list of numbers:

```
with drb_stats as (
select min(drb) as min,
max(drb) as max
from team_stats
),
histogram as (
select width_bucket(drb, min, max, 9) as bucket,
int4range(min(drb), max(drb), '[]') as range,
count(*) as freq
from team_stats, drb_stats
group by bucket
order by bucket
)
select bucket, range, freq,
repeat('■',
( freq::float
/ max(freq) over()
* 30
)::int
) as bar
from histogram;
```

The first part of the query is the *drb_stats* common table expression, and
its role is to fetch the min and max values of *drb* in our whole dataset.
It’s better than doing the query separately and then injecting the result in
the next query, because there’s no *roundtrip*, but it’s basically the idea.

Then the *histogram* common table expression is the meat of the query, where
we compute the bucket distribution of our data set. Finally, we use a cheap
trick with the *repeat* function where we scale down the frequency into a
display of 30 characters:

```
bucket | range | freq | bar
--------+---------+-------+--------------------------------
1 | [10,15) | 52 |
2 | [15,20) | 1363 | ■■
3 | [20,25) | 8832 | ■■■■■■■■■■■■■
4 | [25,30) | 20917 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
5 | [30,35) | 20681 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
6 | [35,40) | 9166 | ■■■■■■■■■■■■■
7 | [40,45) | 2093 | ■■■
8 | [45,50) | 247 |
9 | [50,54) | 20 |
10 | [54,55) | 1 |
(10 rows)
Time: 53.570 ms
```

Also, as we’re using PostgreSQL, just having two columns with the min and
max as separate values is not enough, what we actually need is
a
discrete range
of *rebounds* for each *bucket*, hence using the `int4range`

range
constructor function.

# Conclusion

The only remaining step then consists into hacking our way into actually
displaying something *visual enough* for a quick less-than-1-minute effort
of data crunching, using the `repeat`

function which is part
of
PostgreSQL String Functions and Operators.
Note that we’re using
the Window Function expression
`max(freq) over()`

to have access the highest frequency value from each and
every result row.

By the way, the whole scripting and data and SQL is available at github/dimitri/nba, and there’s an Hacker News entry to comment on the article if you’re interested.