Friday, October 05 2012
PostgreSQL, Window-Functions, YeSQL

Reset Counter

I've been given a nice puzzle that I think is a good blog article opportunity, as it involves some thinking and window functions.

What's to solve

Say we store in a table entries from a counter that only increases and the time stamp when we did the measurement. So that when you read 30 then later 40 in fact that means we counted 10 more the second reading when compared to the first, in other words the first 30 are counted again in the second counter value, 40.

Now of course it's a real world counter. Think network traffic counter on a network interface, if you want something real to play with in your mind. So the counter will sometime reset and you will read measure sequences such as 40, 0, 20 if you happen to read just when the counter is reset, or most of the time that will look like 45, 25, 50.

The question we want to answer is, given a series of that counter measures, including some resets, what is the current logical value of the counter?

Given the sequence of measures 0, 10, 20, 30, 40, 0, 20, 30, 60 the result we want is 40 + 60, that is 100. Right?

Playing with some data

Let's model an hypothetical dataset easy enough to play with. What about just the previous example? We also need to time stamp the measurements, let's just use a tick for now, as it's easier to think about:

create table measures(tick int, nb int);

insert into measures
     values (1, 0), (2, 10), (3, 20), (4, 30), (5, 40),
            (6, 0), (7, 20), (8, 30), (9, 60);

Now that we have some data in a table to play with, let's try to find out the numbers we are interested in: we only want to keep the latest measure we read on the counter just before it wraps. That means values where the next one (in tick or time stamping order) is lesser than the current counter value.

As we are lucky enough to be playing with the awesome PostgreSQL which brings window functions on the table, we can easily implement just what we said in a readable way:

select tick, nb,
         case when lead(nb) over w < nb then nb
              when lead(nb) over w is null then nb
              else null
          end as max
    from measures
  window w as (order by tick);

The firt case is the exact translation of the problem as spelled in english in just the previous paragraph where we stated we want to keep the current counter value in case of a wraparound, so I guess it's easy enough to get at.

Then we have a couple of tricks in that query in order to massage the data as we want it. First, the last row of the output won't have a lead, that window function call is going to return NULL. In that case, we keep the current counter value as if we just did a wraparound. And finally, when there's no wraparound, we don't care about the data. Well, for the purpose of knowing the current logical value of the counter, that is.

And we get that encouraging result:

tick | nb  | max 
------+-----+-----
    1 |   0 |    
    2 |  10 |    
    3 |  20 |    
    4 |  30 |    
    5 |  40 |  40
    6 |   0 |    
    7 |  20 |    
    8 |  30 |    
    9 |  60 |  60

As you see, we have been able to create a new column out of the dataset, and that new column only contains the data we are interested into.

Finding the current counter value

All we have to do now is sum this computed columns entries. Remember that the sum() aggregate function will simply discard nulls, so that we don't have to turn them into a bunch of 0.

with t(tops) as (
  select case when lead(nb) over w < nb then nb
              when lead(nb) over w is null then nb
              else null
          end as max
    from measures
  window w as (order by tick)
)
select sum(tops) from t;

And here's the expected result:

sum 
-----
 100

Now what about testing with another set of data or two, just to be sure that the counter is allowed to wrap more than once within our solution?

insert into measures
     values (10, 0), (11, 10), (12, 30), (13, 35), (14, 45),
            (15, 25), (16, 50), (17, 100), (18, 110);

Then we have:

with t(tops) as (
   select case when lead(nb) over w < nb then nb
               when lead(nb) over w is null then nb
               else null
           end as max
     from measures
   window w as (order by tick)
)
select sum(tops) from t;
 sum 
-----
 255
(1 row)

All good!

Counter logical value over a given period

Now of course what we want is to find the logical value of the counter for a given day's or month's worth of measures. We then need to pay attention to the value of the counter at the start of our period so that we know to substract it from the logical sum over the period.

Here's an SQL version of the same sentence, applied to the period in between ticks 4 and 14, in a completely arbitrary choosing of mine:

with t as (
  select tick,
         first_value(nb) over w as first,
         case when lead(nb) over w < nb then nb
              when lead(nb) over w is null then nb
              else null
          end as max
    from measures
   where tick >= 4 and tick < 14
  window w as (order by tick)
)
select sum(max) - min(first) as sum
 from t;

Here we are using the first_value() window function to retain it in the whole resultset of the Common Table Expression (the inner query introduced by the keyword WITH is called that way). And when doing the sum we're interested in at the outer level, we didn't forget to substract the first value: we need to use an aggregate here because we're doing a sum() aggregate at the same query level, and we have the same value in each row of the resultset, so we used min(), max() would have been as good.

Another important trick we're using in that query is how to express the date range. Never use between for that, as you would end up counting boundaries twice, and customer won't like your accounting process if you do that. Always use a combo of inclusive and exclusive boundaries comparison, as in that WHERE clause in the previous query.

Let's have a quick look at the raw data in that range, using another nice aggregate that PostgreSQL comes with:

select array_agg(nb) from measures where tick >= 4 and tick < 14;
           array_agg           
-------------------------------
 {30,40,0,20,30,60,0,10,30,35}
(1 row)

And now, the logical counter value for that period is computed as the following value by the previous query:

sum 
-----
 105
(1 row)

We can verify it manually, we want 40 + 60 + 35 - 30, I think we're all good again. Don't forget we have to substract the first measure from the period!

Extending the problem

Another interesting problem, that we didn't have here but that I find interesting enough to extend this article, is finding the ranges of time (here, ticks) within which the counter didn't reset.

The query is more complex because we need to split the data into partitions, each partition containing data from the same counter series of measures without wrapping. The usual trick is to self-join our data set so that for each given row we have a set of rows from the same partition, we are going to instead use a correlated subquery to go fetch the next wraparound value:

with tops as (
  select tick, nb,
         case when lead(nb) over w < nb then nb
              when lead(nb) over w is null then nb
             else null
         end as max
    from measures
  window w as (order by tick)
)
  select tick, nb, max,
         (select tick
            from tops t2
           where t2.tick >= t1.tick and max is not null
        order by t2.tick
           limit 1) as p
    from tops t1;

 tick | nb  | max | p  
------+-----+-----+----
    1 |   0 |     |  5
    2 |  10 |     |  5
    3 |  20 |     |  5
    4 |  30 |     |  5
    5 |  40 |  40 |  5
    6 |   0 |     |  9
    7 |  20 |     |  9
    8 |  30 |     |  9
    9 |  60 |  60 |  9
   10 |   0 |     | 14
   11 |  10 |     | 14
   12 |  30 |     | 14
   13 |  35 |     | 14
   14 |  45 |  45 | 14
   15 |  25 |     | 18
   16 |  50 |     | 18
   17 | 100 |     | 18
   18 | 110 | 110 | 18
(18 rows)

With that as an input it's then possible to build ranges of ticks including non wrapping set of measures from our counter, and get for each range the logical value tat the counter had at the end of it:

with tops as (
  select tick, nb,
         case when lead(nb) over w < nb then nb
              when lead(nb) over w is null then nb
             else null
         end as max
    from measures
  window w as (order by tick)
),
     parts as (
  select tick, nb, max,
         (select tick
            from tops t2
           where t2.tick >= t1.tick and max is not null
        order by t2.tick
           limit 1) as p
    from tops t1
),
     ranges as (
  select first_value(tick) over w as start,
         last_value(tick) over w as end,
         max(max) over w
    from parts
  window w as (partition by p order by tick)
)
select * from ranges where max is not null;

 start | end | max 
-------+-----+-----
     1 |   5 |  40
     6 |   9 |  60
    10 |  14 |  45
    15 |  18 | 110
(4 rows)

Conclusion

What I hope to have shown here, apart from some window function tips and some nice use cases for common table expressions, is that as a developper adding SQL to your tool set is a very good idea.

You don't want to have several parts of your code dealing with a logical counter like this, because you want the reporting, accounting, quota, billing and other software to all agree on the values. And you most probably want to avoid to fetch a huge result set of data and process it in the application memory (it'd better fit) rather than just get back a single integer column single row resultset, right?

If you find this SQL example to be off the limits, it's a good sign that you need to improve on your skills so that SQL is a real asset of your developer multi languages multi paradygm talents.