Blog

  • Avr 17 2014

    New York!

    A couple of week ago I had the chance to participate into the PGConf NYC 2014 Conference, one of the biggest conferences about PostgreSQL worldwide.

    I presented one of my favourite talks over there, where the whole goal is to blow the mind of innocent developers and show them how much they can do in just SQL.

    PostgreSQL for developers, window functions galore!

    The basis for the talk is this detailed blog entry about the Reset Counter application and how to leverage SQL to write the code for it.

     

  • Avr 14 2014

    Announcing Barman 1.3.1

    Version 1.3.1 of Barman, Backup and Recovery Manager for PostgreSQL, has been released. Barman now supports concurrent...  

  • Avr 02 2014

    Indexando ando: Índices GIN

    PostgreSQL tiene varios tipos de índices: B-tree, Hash, GiST, Gin y SP-GiST. Obviamente cada uno de ellos...  

  • Mar 25 2014

    Nordic PostgreSQL Day 2014

    Last week some PostgreSQL users, contributors and advocates have organized a really great conference in Stockholm, Sweden, where I had the please to give the following talk:

    PostgreSQL is YeSQL!

    Nordic PostgreSQL Conference

    The conference was very well put together and the occasion to meet with plenty of well known PostgreSQL friends and newcomers to the community too. If you were there, I hope you had as much of a good time than I did!

    All the slides from this day of conferences have now been uploaded on the PostgreSQL Wiki Page for Nordic PGDAY, so that if you couldn't make it to the event you can still read through the material and send emails around to the authors with any and all questions about the contents!

    See you at the next PostgreSQL conference maybe, this time in New York City!

    The Hacker's Guide to Python

    In other news, my good friend Julien Danjou (you might know him for having hacked together the awesome window manager) just released today The Hacker's Guide to Python where you will Learn everything you need to build a successful Python project. I've had the pleasure to answer's Julien's questions about how to best benefit from PostgreSQL when coding in python, which makes it for an interview chapter within:

    In this book, we'll see how you can leverage Python to efficiently tackle your problems and build great Python applications.

    And while at it:

    Have a good read!

     

  • Mar 19 2014

    Announcing repmgr 2.0

    Version 2.0 of repmgr, Replication Manager for PostgreSQL clusters, has been released. This release introduces a new...  

  • Mar 03 2014

    Announcing repmgr 2.0RC2

    Version 2.0RC2 of repmgr, Replication Manager for PostgreSQL clusters, has been released. This release introduces a new...  

  • Fév 26 2014

    Row security in PostgreSQL – overview and current status

    In the next week I will be writing a series of posts about the row-security work I’ve...  

  • Fév 25 2014

    Streaming replication slots in PostgreSQL 9.4

    Streaming replication slots are a pending feature in PostgreSQL 9.4, as part of the logical changeset extraction....  

  • Fév 21 2014

    PostgreSQL, Aggregates and Histograms

    In our previous article Aggregating NBA data, PostgreSQL vs MongoDB we spent time comparing the pretty new MongoDB Aggregation Framework with the decades old SQL aggregates. Today, let's show case those SQL aggregates some more with producing a nice histogram rigth from our SQL console.

    PostgreSQL and Mathematics

    The other day while giving a Practical SQL training my attention drifted to the width_bucket function available as part of the Mathematical Functions and Operators PostgreSQL is offering to its fearless SQL users.

    Here's what the documentation says about it:

    The function width_bucket(op numeric, b1 numeric, b2 numeric, count int) returns (as an int) the bucket to which operand would be assigned in an equidepth histogram with count buckets, in the range b1 to b2.

    For example width_bucket(5.35, 0.024, 10.06, 5) returns 3.

    Let's have a look at our dataset from the NBA games and statistics, and get back to counting rebounds in the drb field. A preliminary query informs us that we have stats ranging from 10 to 54 rebounds per team in a single game, an good information we can use in the following query:

    select width_bucket(drb, 10, 54, 9), count(*)
        from team_stats
    group by 1
    order by 1;
     width_bucket | count 
    --------------+-------
                1 |    52
                2 |  1363
                3 |  8832
                4 | 20917
                5 | 20681
                6 |  9166
                7 |  2093
                8 |   247
                9 |    20
               10 |     1
    (10 rows)
    
    We've just calibrated our games in terms of rebounds here

    Console Histograms

    Now, what would it take to actually be able to display the full story right into our psql console, for preview before actually integrated a new diagram in our reporting solution? Turns out it's not very complex.

    First, we want to avoid hard coding the range of rebounds we're processing, so we are going to compute that in a first step. Then we want the histogram data, which is a ordered list of ranges with a min and a max value and a frequency, which is how many games were recorded with a number or rebounds within any given bucket range. And last, we want to display something a little more visual than just a list of numbers:

    with drb_stats as (
        select min(drb) as min,
               max(drb) as max
          from team_stats
    ),
         histogram as (
       select width_bucket(drb, min, max, 9) as bucket,
              int4range(min(drb), max(drb), '[]') as range,
              count(*) as freq
         from team_stats, drb_stats
     group by bucket
     order by bucket
    )
     select bucket, range, freq,
            repeat('*', (freq::float / max(freq) over() * 30)::int) as bar
       from histogram;
    
     bucket |  range  | freq  |              bar               
    --------+---------+-------+--------------------------------
          1 | [10,15) |    52 | 
          2 | [15,20) |  1363 | **
          3 | [20,25) |  8832 | *************
          4 | [25,30) | 20917 | ******************************
          5 | [30,35) | 20681 | ******************************
          6 | [35,40) |  9166 | *************
          7 | [40,45) |  2093 | ***
          8 | [45,50) |   247 | 
          9 | [50,54) |    20 | 
         10 | [54,55) |     1 | 
    (10 rows)
    
    Time: 53.570 ms
    

    The query is using the Common Table Expressions WITH syntax so that it's easier to read and understand, then the data-relatec magic happens in the histogram CTE. We use the width_bucket function to get a calibration number for each of our games, and we GROUP BY bucket to be able to aggregate the min and max values.

    As we're using PostgreSQL though, just having two columns with the min and max as separate values is not enough, what we actually need is a discrete range of rebounds for each bucket, hence using the int4range range constructor function.

    Of course, within the same GROUP BY aggregation here it's still possible to count the number of games having a rebounds stat within the bucket, defining the histogram's frequency.

    The only remaining step then consists into hacking our way into actually displaying something visual enough for a quick less-than-1-minute effort of data crunching, using the repeat function which is part of PostgreSQL String Functions and Operators. Note that we're using the Window Function expression max(freq) over() to have access the highest frequency value from each and every result row.

    So... Did I mention lately?

    PostgreSQL is YeSQL!

    By the way, the whole scripting and data and SQL is available at github/dimitri/nba, and there's an Hacker News entry to comment on the article if you're interested.

     

  • Fév 17 2014

    Aggregating NBA data, PostgreSQL vs MongoDB

    When reading the article Crunching 30 Years of NBA Data with MongoDB Aggregation I coulnd't help but think that we've been enjoying aggregates in SQL for 3 or 4 decades already. When using PostgreSQL it's even easy to actually add your own aggregates given the SQL command create aggregate.

    Photo Credit: Copyright All rights reserved by Segward Graupner

    The next step after thinking how obvious the queries written in the mentionned article would be to express in SQL was to actually load the data into PostgreSQL and write the aggregate queries, of course.

    Loading the data

    With the help of a little bit of Common Lisp code and using the mongo-cl-driver it was easy enough to parse the given BSON file. What was more complex was to actually understand enough of the data model to produce a relational design out of it, avoiding data redundancy as much as possible.

    We call that step normalization in old-style relational databases, and the goal of that process is to avoid functional dependency so that the data is easier to understand, verify and process the data once loaded.

    For instance, rather than have both scores from each team and a column won per team, which would be a boolean but is a number in the given BSON file, we store only the scores. Here's the main table definition of the stats we are going to be playing with, the game table:

    create table nba.game (
      id          serial primary key,
      date        timestamptz,
      host        int references nba.team(id),
      guest       int references nba.team(id),
      host_score  int,
      guest_score int
    );
    

    As much of the aggregates in the referenced article are playing with statistics from teams who actually won the game, let's create a view to simplify our SQL queries thereafter:

    create view winners as
      select id,
             date,
             case when host_score > guest_score
                  then host
                  else guest
              end as winner
        from game;
    

    If you're not doing much SQL, remember that creating such a view is common practice in the relational world.

    Running the Aggregates

    Now that we have the extra useful view, it's possible to implement the first MongoDB query in SQL. First, let's have a look at the MongoDB query:

    db.games.aggregate([
      {
        $match : {
          date : {
            $gt : ISODate("1999-08-01T00:00:00Z"),
            $lt : ISODate("2000-08-01T00:00:00Z")
          }
        }
      },
      {
        $unwind : '$teams'
      },
      {
        $match : {
          'teams.won' : 1
        }
      },
      {
        $group : {
          _id : '$teams.name',
          wins : { $sum : 1 }
        }
      },
      {
        $sort : { wins : -1 }
      },
      {
        $limit : 5
      }
    ]);
    

    I don't know about you, but I have quite a hard time deciphering what that query is actually doing, and when the explanation text talks about using a 6-stage pipeline my understanding is that the application developper has been writing the execution plan of the query here. Let's ignore the query format itself, as it's obviously meant to be generated by a tool rather than typed by a human being.

    Here's the same query in SQL, with the result this time:

      SELECT abbrev, name, count(*)
        FROM winners JOIN team ON team.id = winners.winner
       WHERE     date > '1999-08-01T00:00:00Z'
             AND date < '2000-08-01T00:00:00Z'
    GROUP BY winner, abbrev, name
    ORDER BY count(*) DESC
       LIMIT 5;
    
     abbrev |          name          | count 
    --------+------------------------+-------
     LAL    | Los Angeles Lakers     |    67
     POR    | Portland Trail Blazers |    59
     IND    | Indiana Pacers         |    56
     UTA    | Utah Jazz              |    55
     SAS    | San Antonio Spurs      |    53
    (5 rows)
    
    Time: 8.101 ms
    

    What we have here is a pretty basic query using a join, a where clause to restrict the data set we are playing with, a group by clause to define which data to computa the aggregates against, with an order by and a limit clause for presenting the result. To be realistic, if you've ever done any SQL at all, then you know how to read that query because you've been writing dozens of similar ones.

    Here's, as in the original article, the same query against a much larger data set this time, with all games of the 2000s decade:

       SELECT abbrev, name, count(*)
         FROM winners join team on team.id = winners.winner
        WHERE     date > '2000-08-01T00:00:00Z'
              AND date < '2010-08-01T00:00:00Z'
     GROUP BY winner, abbrev, name
     ORDER BY count(*) DESC
        LIMIT 5;
    
     abbrev |        name        | count 
    --------+--------------------+-------
     SAS    | San Antonio Spurs  |   579
     DAL    | Dallas Mavericks   |   568
     LAL    | Los Angeles Lakers |   524
     PHO    | Phoenix Suns       |   495
     DET    | Detroit Pistons    |   489
    (5 rows)
    
    Time: 24.713 ms
    

    Correlating stats with wins

    The goal here is to compute how often a team wins when they record more defensive rebounds than their opponent across the entire data set.

    To be able to compute the percentage, we have to have a count of all registered games, of course. Then we are going to count how many times the winner team registered a greater team_stats.drb than the loser, and count how many times in SQL is usually written as a sum(case when <condition> then 1 else 0 end), which is what we're doing here:

    select count(*) as games,
           sum(case when ws.drb > ls.drb then 1 else 0 end) as drb,
           sum(case when ws.drb > ls.drb then 1 else 0 end)::float / count(*) * 100 as pct
      from winlose wl
           join team w on wl.winner = w.id
           join team l on wl.loser = l.id
           join team_stats ws on ws.game = wl.id and ws.team = wl.winner
           join team_stats ls on ls.game = wl.id and ls.team = wl.loser;
    
     games |  drb  |       pct        
    -------+-------+------------------
     31686 | 22292 | 70.3528372151739
    (1 row)
    
    Time: 276.669 ms
    

    We note here than in the original MongoDB article the aggregation query is short of computing the percentage directly, apparently it's been done in the client tool, maybe using a spreadsheet application or something.

    Defensive Rebounds and Total Rebounds Versus Win Percentage

    Next, still following on our inspirational article Crunching 30 Years of NBA Data with MongoDB Aggregation, we’re going to compute what percentage of the time a team wins as a function of the number of defensive rebounds they recorded.

    I'm not sure I understand what they achieve with averaging ones when a team wins and zero when a team loses, so I couldn't quite reproduce their result. Here's an approaching query tho:

    with game_stats as (
        select t.id, count(*)
          from team t join game on game.host = t.id or game.guest = t.id
       group by t.id
    )
    select ts.team, round(avg(drb), 2) as drb,
           round(count(*) / gs.count::numeric * 100, 2) as winpct,
           count(*) as wins, gs.count as games
      from team_stats ts
           join game on game.id = ts.game
                    and game.host = ts.team
                    and game.host_score > game.guest_score
           join game_stats gs on gs.id = ts.team
    group by ts.team, gs.count;
    
     team |  drb  | winpct | wins | games 
    ------+-------+--------+------+-------
        4 | 31.59 |  31.46 |  710 |  2257
        7 | 32.55 |  31.89 |  720 |  2258
       16 | 31.70 |  37.62 |  849 |  2257
    

    I only pasted the first few lines of the result because I'm not sure how to make sense of it, really.

    Interesting factoid

    What I find most interesting in the following factoid proposed in the MongoDB article is the complete lack of the query you need to run in order to grab the matching data:

    An interesting factoid: the team that recorded the fewest defensive rebounds in a win was the 1995-96 Toronto Raptors, who beat the Milwaukee Bucks 93-87 on 12/26/1995 despite recording only 14 defensive rebounds.

    When doing the necessary query in SQL, using a Common Table Expression (the WITH syntax) and a Window Function for good measure, we get actually 4 different games with the minimum defensive rebounds in our history of NBA games, 14:

    with stats(game, team, drb, min) as (
        select ts.game, ts.team, drb, min(drb) over ()
          from team_stats ts
               join winners w on w.id = ts.game and w.winner = ts.team
    )
    select game.date::date,
           host.name || ' -- ' || host_score as host,
           guest.name || ' -- ' || guest_score as guest,
           stats.drb as winner_drb
      from stats
           join game on game.id = stats.game
           join team host on host.id = game.host
           join team guest on guest.id = game.guest
     where drb = min;
    
    -[ RECORD 1 ]----------------------------
    date       | 1995-12-26
    host       | Toronto Raptors -- 93
    guest      | Milwaukee Bucks -- 87
    winner_drb | 14
    -[ RECORD 2 ]----------------------------
    date       | 1996-02-02
    host       | Golden State Warriors -- 114
    guest      | Toronto Raptors -- 111
    winner_drb | 14
    -[ RECORD 3 ]----------------------------
    date       | 1998-03-31
    host       | Vancouver Grizzlies -- 101
    guest      | Dallas Mavericks -- 104
    winner_drb | 14
    -[ RECORD 4 ]----------------------------
    date       | 2009-01-14
    host       | New York Knicks -- 128
    guest      | Washington Wizards -- 122
    winner_drb | 14
    
    Time: 126.276 ms
    

    To understand all there's to know about window functions, have a look at my article on the topic: Understanding Window Functions.

    Total rebounds and wins

    The next interesting aside is the following:

    As an aside, the Cleveland Cavaliers beat the New York Knicks 101-97 on April 11, 1996, despite recording only 21 total rebounds. Inversely, the San Antonio Spurs lost to the Houston Rockets, 112-110, on January 4, 1992 despite recording 75 total rebounds.

    Which we translate in SQL as the following query:

    with stats as (
        select ts.game, ts.team, trb,
               min(trb) over () as min,
               max(trb) over () as max
          from team_stats ts
               join winners w on w.id = ts.game and w.winner = ts.team
    )
    select game.date::date,
           host.name || ' -- ' || host_score as host,
           guest.name || ' -- ' || guest_score as guest,
           stats.trb as winner_trb
      from stats
           join game on game.id = stats.game
           join team host on host.id = game.host
           join team guest on guest.id = game.guest
     where trb = min or trb = max;
    
    -[ RECORD 1 ]--------------------------
    date       | 1995-12-28
    host       | Dallas Mavericks -- 103
    guest      | Vancouver Grizzlies -- 101
    winner_trb | 76
    -[ RECORD 2 ]--------------------------
    date       | 1996-04-11
    host       | New York Knicks -- 97
    guest      | Cleveland Cavaliers -- 101
    winner_trb | 21
    -[ RECORD 3 ]--------------------------
    date       | 2007-01-29
    host       | Utah Jazz -- 115
    guest      | New Jersey Nets -- 116
    winner_trb | 21
    
    Time: 127.771 ms
    

    Again it's easy enough in SQL to have more details about the aside presented in our source article, and we get a slightly different story.

    Conclusion

    It's quite hard for me to appreciate the work done in the MongoDB aggregation framework really, when we've been enjoying advanced aggregation and statistics in PostgreSQL for a very long time. With the addition of Window Functions and Aggregate Functions for Statistics it's possible to implement advanced analysis right into your SQL queries.

    In next PostgreSQL release the set of analytical functions is going to expand again withe addition of both Ordered-Set Aggregate Functions (also known as inverse distribution functions) and Hypothetical-Set Aggregate Functions (also known as WITHIN GROUP).

    PostgreSQL is YeSQL!

    When the problem you have to solve involves analyzing data, one of the more advanced tooling you can find around certainly is the SQL language, in particular its implementation in PostgreSQL!

     

© 2001-2014 2ndQuadrant Ltd. All rights reserved. | Privacy Policy