• Nearest Big City

    In this article, we want to find the town with the greatest number of inhabitants near a given location.

    A very localized example

    We first need to find and import some data, and I found at the following place a CSV listing of french cities with coordinates and population and some numbers of interest for the exercise here.

    To import the data set, we first need a table, then a COPY command:

    CREATE TABLE lion1906 (
      insee       text,
      nom         text,
      altitude    integer,
      code_postal text,
      longitude   double precision,
      latitude    double precision,
      pop99       bigint,
      surface     double precision
    );
    
    \copy lion1906 from 'villes.csv' with csv header delimiter ';' encoding 'latin1'
    

    With that data in place, we can find the 10 nearest towns of a random choosing of us, let's pick Villeurbanne which is in the region of Lyon.

       select code_postal, nom, pop99
         from lion1906
     order by point(longitude, latitude) <->
              (select point(longitude, latitude)
                 from lion1906
                where nom = 'Villeurbanne')
        limit 10;
    
     code_postal |          nom           | pop99
    -------------+------------------------+--------
     69100       | Villeurbanne           | 124215
     69300       | Caluire-et-Cuire       |  41233
     69120       | Vaulx-en-Velin         |  39154
     69580       | Sathonay-Camp          |   4336
     69140       | Rillieux-la-Pape       |  28367
     69000       | Lyon                   | 445452
     69500       | Bron                   |  37369
     69580       | Sathonay-Village       |   1693
     01700       | Neyron                 |   2157
     69660       | Collonges-au-Mont-d'Or |   3420
    (10 rows)
    

    We find Lyon in our list in there, and we want the query now to return only that one as it has the greatest number of inhabitants in the list:

    with neighbours as (
       select code_postal, nom, pop99
         from lion1906
     order by point(longitude, latitude) <->
              (select point(longitude, latitude)
                 from lion1906 where nom = 'Villeurbanne')
        limit 10
    )
      select *
        from neighbours
    order by pop99 desc
       limit 1;
    
     code_postal | nom  | pop99
    -------------+------+--------
     69000       | Lyon | 445452
    (1 row)
    

    Well, thank you PostgreSQL, that was easy!

    Note that you can actually index such queries, that's called a KNN index. PostgreSQL knows how to use some kind of indexes to fetch data matching an expression such as ORDER BY a <-> b, which allow you to consider a KNN search in your application.

    Let's get worldwide

    The real scope of our exercise is to associate every known town in the world with some big city around, so let's first fetch and import some worldwide data this time, from http://download.maxmind.com/download/worldcities/worldcitiespop.txt.gz.

    CREATE TABLE maxmind_worldcities (
            country_code text,
            city_lower text,
            city_normal text,
            region_code text DEFAULT '',
            population INT DEFAULT '0',
            latitude float8 DEFAULT '0',
            longitude float8 DEFAULT '0'
    );
    
    \copy maxmind_worldcities FROM '/tmp/worldcitiespop.txt' WITH  DELIMITER ',' QUOTE E'\f' CSV HEADER ENCODING 'LATIN1';
    
    alter table maxmind_worldcities add column loc point;
    update maxmind_worldcities set loc = point(longitude, latitude);
    

    This time you can see that I created an extra column with the location in there, so that I don't have to compute it each time I need it, like I did before.

    Now is the time to test that data set and hopefully fetch the same result as before when we only had french cities loaded:

    with neighbours as (
       select country_code, city_lower, population
         from maxmind_worldcities
        where population is not null
     order by loc <->
              (select loc
                 from maxmind_worldcities
                where city_lower = 'villeurbanne')
       limit 10
    )
       select * from neighbours order by population desc limit 1;
    
      country_code | city_lower | population
    --------------+------------+------------
     fr           | lyon       |     463700
    (1 row)
    

    Ok, looks like we're all set for the real problem. Now we want to pick for each of those cities it's nearest neighboor, so here's how to do that:

    create index on maxmind_worldcities(country_code, region_code, city_lower);
    create index on maxmind_worldcities using gist(loc);
    
    create table maxmind_neighbours as
      select country_code, region_code, city_lower,
             (with neighbours as (
                 select country_code, city_lower, population
                   from maxmind_worldcities
                  where population is not null
                        and country_code = wc.country_code
                        and region_code = wc.region_code
               order by loc <-> wc.loc
                 limit 10)
                 select city_lower
                   from neighbours
               order by population desc
                  limit 1
             ) as neighbour
        from maxmind_worldcities wc ;
    

    To be fair, I have to tell you that this query took almost 2 hours to complete on my laptop here, but as I'm doing that for friend and a blog article, I've been lazy and didn't try to optimise it. It could be using LATERAL for sure, I don't know if that would help very much with performances: I didn't try.

    With that in hands we can now check some cities and their biggest neighbours, as in the following query:

    select * from maxmind_neighbours where city_lower = 'villeurbanne';
     country_code | region_code |  city_lower  | neighbour
    --------------+-------------+--------------+-----------
     fr           | B9          | villeurbanne | lyon
    (1 row)
    

    And looking for New-York City suburbs I did find a chinatown, which is a pretty common smaller town name apparently:

    select * from maxmind_neighbours where city_lower = 'chinatown';
     country_code | region_code | city_lower |   neighbour
    --------------+-------------+------------+---------------
     sb           | 08          | chinatown  | honiara
     us           | CA          | chinatown  | san francisco
     us           | DC          | chinatown  | washington
     us           | HI          | chinatown  | honolulu
     us           | IL          | chinatown  | chicago
     us           | MT          | chinatown  | missoula
     us           | NV          | chinatown  | reno
     us           | NY          | chinatown  | new york
    (8 rows)
    

    Big Cities in the big world

    We might need to change some of our views

    So, let's see how many smaller towns each of those random big cities have:

       select country_code, region_code, neighbour, count(*)
        from maxmind_neighbours
       where neighbour in ('london', 'new york', 'moscow',
                           'paris', 'tokyo', 'sao polo', 'chicago')
    group by country_code, region_code, neighbour;
     country_code | region_code | neighbour | count
    --------------+-------------+-----------+-------
     gb           | H9          | london    |     2
     jp           | 40          | tokyo     |   414
     us           | NY          | new york  |   131
     ca           | 08          | london    |    16
     ru           | 48          | moscow    |   245
     fr           | A8          | paris     |    16
     us           | IL          | chicago   |    13
    (7 rows)
    

    And now let's be fair and see where are the cities with the greatest number of towns nearby them, with the following query:

      select country_code, region_code, neighbour, count(*)
        from maxmind_neighbours
       where neighbour is not null
    group by country_code, region_code, neighbour
    order by 4 desc
       limit 25;
    
     country_code | region_code | neighbour  | count
    --------------+-------------+------------+-------
     cn           | 03          | nanchang   | 16759
     cn           | 26          | xian       | 12864
     id           | 18          | kupang     | 10715
     cn           | 24          | taiyuan    | 10550
     mm           | 11          | taunggyi   | 10253
     id           | 38          | makasar    |  9471
     ir           | 15          | ahvaz      |  9461
     id           | 01          | banda aceh |  9161
     cn           | 14          | lasa       |  8841
     cn           | 15          | lanzhou    |  8618
     ir           | 29          | kerman     |  8579
     id           | 26          | medan      |  7787
     ir           | 04          | iranshahr  |  7249
     ir           | 07          | shiraz     |  7219
     ma           | 55          | agadir     |  7121
     ir           | 42          | mashhad    |  7107
     af           | 08          | gazni      |  7011
     ir           | 33          | tabriz     |  6586
     cn           | 01          | hefei      |  6521
     bd           | 81          | dhaka      |  6480
     ir           | 08          | rasht      |  6471
     id           | 17          | mataram    |  6467
     id           | 33          | cilegon    |  6287
     af           | 23          | qandahar   |  6213
     cn           | 07          | fuzhou     |  6089
    (25 rows)
    

  • Retention of backups with Barman

    Defining a disaster recovery plan involves defining backup policies. A key aspect of backup policies is to define how long backup data is retained for disaster recovery purposes. This applies to all digital content, including PostgreSQL databases.

    Barman 1.2.0 introduces automated management of backup retention policies of PostgreSQL servers.

    Retention policies were one of the midterm goals that we had given ourselves when the whole Barman idea began to take shape. I am glad that, thanks to the vision of a French company (which explicitly requested to remain anonymous), we were able to undertake their open-source development.

    What are retention policies?

    A retention policy is a user-defined set of guidelines and principles that determines how long backups and their related archive logs need to be retained for recovery procedures. In a PostgreSQL database scenario:

    • with “backup” we refer to a full physical backup, performed periodically, when the database is online (hence the term “hot” very often used in this context);
    • with “archive logs” we refer to Write Ahead Log (WAL) files, responsible for implementing differential backup by continuously archiving every change made to the data files of the database server.

    Base backups and archive logs form the so-called “backup catalogue” and allow database administrators to perform Point-In-Time recovery operations.

    Further information on PostgreSQL physical backup and continuous archiving can be found in the Postgres documentation.

    Why are retention policies so important?

    In some countries and environments, it is the law that requires ICT departments to enforce them for data security and protection reasons. In Italy, for instance, the “Codice dell’Amministrazione Digitale” (CAD) requires that public organisations report retention periods in official documents such as their Disaster Recovery Plan.

    From a more practical point of view, retention policies improve automation of a backup solution, while reducing the management and configuration complexity.

    How are they implemented in Barman?

    Barman at any time retains:

    • the periodical backups required to satisfy the current retention policy for a given PostgreSQL server;
    • the archived WAL files required for the complete recovery of those backups.

    On the same topic, you may wish to read one of my previous blog posts about the backup catalogue and the WAL archive in Barman.

    Barman users can define a retention policy in terms of:

    • backup redundancy (how many periodical backups), or
    • a recovery window (how long).
    In case of a retention policy based on redundancy, the administrator decides how many periodical backups to keep.
    On the other hand, a retention policy based on recovery window allows the DBA to specify a period of time (recovery window). Barman ensures retention of backups and/or archived WAL files required for point-in-time recovery to any time during that recovery window.
    Retention policies are managed in Barman by the configuration option ‘retention_policy’ and the ‘barman cron’ command (responsible for maintenance operations). In the next article we will go through the configuration of retention policies with some simple examples.
    Where can I find Barman?
    Barman is an open source application for disaster recovery developed and maintained by 2ndQuadrant. It is written in Python and it can be installed through sources or PyPI. However, RPM packages and Debian packages are available for installation on RHEL/CentOS 5/6, Debian or Ubuntu 12.04 LTS.
    More information on Barman can be found on the website (www.pgbarman.org) and the available documentation.

  • Bulk Replication

    In the previous article here we talked about how to properly update more than one row at a time, under the title Batch Update. We did consider performances, including network round trips, and did look at the behavior of our results when used concurrently.

    A case where we want to apply the previous article approach is when replicating data with a trigger based solution, such as SkyTools and londiste. Well, maybe not in all cases, we need to have a amount of UPDATE trafic worthy of setting up the solution. As soon as we know we're getting to replay important enough batches of events, though, certainly using the batch update tricks makes sense.

    It so happens that londiste 3 includes the capability to use handlers. Those are plugins written in python (like all the client side code from SkyTools) whose job is to handle the processing of the event batches. Several of them are included in the londiste sources, and one of them is named bulk.py.

    Bulk loading data with londiste

    To use set in londiste.ini:

    handler_modules = londiste.handlers.bulk
    

    then add table with one of those commands:

    londiste3 add-table xx --handler="bulk"
    londiste3 add-table xx --handler="bulk(method=X)"
    

    The default method is 0, and the available methods are the following:

    correct (0)

    • inserts as COPY into table
    • update as COPY into temp table and single UPDATE from there
    • delete as COPY into temp table and single DELETE from there

    delete (1)

    • as correct, but update are done as DELETE then COPY

    merged (2)

    • as delete, but merge insert rows with update rows

    Conclusion

    Yes, by using that handler which is provided by default in londiste, you will apply the previous article tricks in your replication solution. And you can even choose to use that for only some of the tables you are replicating.

  • Batch Update

    Performance consulting involves some tricks that you have to teach over and over again. One of them is that SQL tends to be so much better at dealing with plenty of rows in a single statement when compared to running as many statements, each one against a single row.

    Another kind of Batch to update

    So when you need to UPDATE a bunch of rows from a given source, remember that you can actually use a JOIN in the update statement. Either the source of data is already in the database, in which case it's as simple as using the FROM clause in the update statement, or it's not, and we're getting back to that in a minute.

    UPDATE FROM

    It's all about using that FROM clause in an update statement, right?

        UPDATE target t
           SET counter = t.counter + s.counter,
          FROM source s
         WHERE t.id = s.id
    

    Using that, you can actually update thousands of rows in our target table in a single statement, and you can't really get faster than that.

    Preparing the Batch

    Now, if you happen to have the source data in your application process' memory, the previous bits is not doing you any good, you think. Well, the trick is that pushing your in-memory data into the database and then joining against the now local source of data is generally faster than looping in the application and having to do a whole network round trip per row.

    What about that round trip?

    Let's see how it goes:

    CREATE TEMP TABLE source(LIKE target INCLUDING ALL) ON COMMIT DROP;
    
    COPY source FROM STDIN;
    
    UPDATE target t
       SET counter = t.counter + s.counter,
      FROM source s
     WHERE t.id = s.id
    

    As we're talking about performances, the trick here is to use the COPY protocol to fill in the temporary table we just create to hold our data. So we're now sending the whole data set in a temporary location in the database, then using that as the UPDATE source. And that's way faster than doing a separate UPDATE statement per row in your batch, even for small batches.

    Also, rather than using the SQL COPY command, you might want to look up the docs of the PostgreSQL driver you are currently using in your application, it certainly includes some higher level facilities to deal with pushing the data into the streaming protocol.

    Insert or Update

    And now sometime some of the rows in the batch have to be updated while some others are new and must be inserted. How do you do that? Well, PostgreSQL 9.1 brings on the table WITH support for all DML queries, which means that you can do the following just fine:

    WITH upd AS (
        UPDATE target t
           SET counter = t.counter + s.counter,
          FROM source s
         WHERE t.id = s.id
     RETURNING s.id
    )
    INSERT INTO target(id, counter)
         SELECT id, sum(counter)
           FROM source s LEFT JOIN upd USING(id)
          WHERE t.id IS NULL
       GROUP BY s.id
      RETURNING t.id
    

    That query here is updating all the rows that are known in both the target and the source and returns what we took from the source in the operation, so that we can do an anti-join in the next step of the query, where we're inserting any row that was not taken care of in the update part of the statement.

    Note that when the batch gets to bigger size it's usually better to join against the target table in the INSERT statement, because that will have an index on the join key.

    Concurrency patterns

    Now, you will tell me that we just solved the UPSERT problem. Well what happens if more than one transaction is trying to do the WITH (UPDATE) INSERT dance at the same time? It's a single statement, so it's a single snapshot. What can go wrong?

    Concurrent processing

    What happens is that as soon as the concurrent sources contain some data for the same primary key, you get a duplicate key error on the insert. As both the transactions are concurrent, they are seeing the same target table where the new data does not exists, and both will conclude that they need to INSERT the new data into the target table.

    There are two things that you can do to avoid the problem. The first thing is to make it so that you're doing only one batch update at any time, by architecting your application around that constraint. That's the most effective way around the problem, but not the most practical.

    The other thing you can do, is force the concurrent transactions to serialize one after the other, using an explicit locking statement:

    LOCK TABLE target IN SHARE ROW EXCLUSIVE MODE;
    

    That lock level is not automatically acquired by any PostgreSQL command, so the only way it helps you is when you're doing that for every transaction you want to serialize. When you know you're not at risk (that is, when not playing the insert or update dance), you can omit taking that lock.

    Conclusion

    The SQL language has its quirks, that's true. It's been made for efficient data processing, and with recent enough PostgreSQL releases you even have some advanced pipelining facilities included in the language. Properly learning how to make the most out of that old component of your programming stack still makes a lot of sense today!

  • Emacs Conference

    The Emacs Conference is happening, it's real, and it will take place at the end of this month in London. Check it out, and register at Emacs Conference Event Brite. It's free and there's still some availability.

    It's all about Emacs, and it rocks!

    We have a great line-up for this conference, which makes me proud to be able to be there. If you've ever been paying attention when using Emacs then you've already heard those names: Sacha Chua is frequently blogging about how she manages to improve her workflow thanks to Emacs Lisp, John Wiegley is a proficient Emacs contributor maybe best known for his ledger Emacs Mode, then we have Luke Gorrie who hacked up SLIME among other things, we also have Nic Ferrier who is starting a revolution in how to use Emacs Lisp with elnode. And more! Including Steve Yegge!

    See you there in London.

  • HyperLogLog Unions

    In the article from yesterday we talked about PostgreSQL HyperLogLog with some details. The real magic of that extension has been skimmed over though, and needs another very small article all by itself, in case you missed it.

    Which Set Operation do you want for counting unique values?

    The first query here has the default level of magic in it, really. What happens is that each time we do an update of the HyperLogLog hash value, we update some data which are allowing us to compute its cardinality.

    => select date,
              #users as daily,
              pg_column_size(users) as bytes
         from daily_uniques
     order by date;
        date    |      daily       | bytes
    ------------+------------------+-------
     2013-02-22 | 401676.779509985 |  1287
     2013-02-23 | 660187.271908359 |  1287
     2013-02-24 | 869980.029947449 |  1287
     2013-02-25 | 580865.296677817 |  1287
     2013-02-26 | 240569.492722719 |  1287
    (5 rows)
    

    And has advertized the data is kept in a static sized data structure. The magic here all happens at hll_add() time, the function you have to call to update the data.

    Now on to something way more magic!

    Are those the aggregates you're looking for?

    => select to_char(date, 'YYYY/MM') as month,
              round(#hll_union_agg(users)) as monthly
         from daily_uniques group by 1;
      month  | monthly
    ---------+---------
     2013/02 | 1960380
    (1 row)
    

    The HyperLogLog data structure is allowing the implementation of an union algorithm that will be able to compute how many unique values you happen to have registered in both one day and the next. Extended in its general form, and doing SQL, what you get is an aggregate that you can use in GROUP BY constructs and window functions. Did you read about them yet?

  • PostgreSQL HyperLogLog

    If you've been following along at home the newer statistics developments, you might have heard about this new State of The Art Cardinality Estimation Algorithm called HyperLogLog. This technique is now available for PostgreSQL in the extension postgresql-hll available at https://github.com/aggregateknowledge/postgresql-hll and soon to be in debian.

    How to Compute Cardinality?

    Installing postgresql-hll

    It's as simple as CREATE EXTENSION hll; really, even if to get there you must have installed the package on your system. We did some packaging work for debian and the result should appear soon in a distro near you.

    Then you also need to keep your data in some table, straight from the documentation we can use that schema:

    -- Create the destination table
    CREATE TABLE daily_uniques (
    DATE            DATE UNIQUE,
    users           hll
    );
    

    Then to add some data for which you want to know the cardinality of, it's as simple as in the following UPDATE statement:

    UPDATE daily_uniques
       SET users = hll_add(users, hll_hash_text('123.123.123.123'))
     WHERE date = current_date;
    

    So in our example what you see is that we want to decipher how many unique IP addresses we saw, and we do that by first creating a hash of that source data then calling hll_add() with the current value and the hash result.

    The current value must be initialized using hll_empty().

    Concurrency

    The most awake readers among you have already spotted that: using an UPDATE on the same row over and over again is a good recipe to kill any form of concurrency, so you don't want to do that on your production setup unless you don't care about those UPDATE waiting piling up in your system.

    The idea is then to fill-in a queue of updates and asynchronously update the daily_uniques table from that queue, possibly using the hll_add_agg aggregate that the extension provides, so that you do only one update per batch of values to process.

    ∅: Empty Set and NULL

    Yes there's a unicode entry for that, ∅

    Now, what happens when the batch of new unique values you want to update from is itself empty? Well I would have expected hll_add_agg over an empty set to return an empty hll value, the same as returned by hll_empty(), but it turns out it's returning NULL instead.

    And then hll_add(users, NULL) will happily return NULL. So the next UPDATE is cancelling all the previous work, which is not nice. We had to cater for that case explicitely in the UPDATE query that's working from the batch of new values to add to our current HyperLogLog hash entry, and I can't resist to show off one of the most awesome PostgreSQL features here: writable CTE.

    WITH hll(agg) AS (
      SELECT hll_add_agg(hll_hash_text(value)) FROM new_batch
    )
      UPDATE daily_uniques
         SET users = CASE WHEN hll.agg IS NULL THEN users
                          ELSE hll_union(users, hll.agg)
                      END
        FROM hll
       WHERE date = current_date;
    

    That's how you protect against an empty set being turned into a NULL. I think the real fix would need to be included in postgresql-hll itself, in making it so that the hll_add_agg aggregate returns hll_empty() on an empty set, and I will report that bug (with that very article as the detailed explanation of it).

    Using postgresql-hll

    When using postgresql-hll on the production system, we were able to get some good looking numbers from our daily_uniques table:

    with stats as (
      select date, #users as daily, #hll_union_agg(users) over() as total
        from daily_uniques
    )
      select date,
             round(daily) as daily,
             round((daily/total*100)::numeric, 2) as percent
        from stats
    order by date;
        date    | daily  | percent
    ------------+--------+---------
     2013-02-22 | 401677 |   25.19
     2013-02-23 | 660187 |   41.41
     2013-02-24 | 869980 |   54.56
     2013-02-25 | 154996 |    9.72
    (4 rows)
    

    I coulnd't resist to show off two of my favorite SQL constructs in that example query here, which are the Common Table Expressions (or CTE) and window functions. If that over() clause reads strange to you, take a minute now and go read about it. Yes, do that now, we're waiting.

    The data here is showing that we did setup the facility in the middle of the first day, and that the morning's activity is quite low.

    Conclusion

    The HyperLogLog DV estimator

    When using postgresql-hll you need to be careful not to kill your application concurrency abilities, and you need to protect yourself against the ∅ killer too. The other thing to keep in mind is that the numbers you get out of the hll technique are estimates within a given precision, and you might want to read some more about what it means for your intended usage of the feature.

  • Playing with pgloader

    While making progress with both Event Triggers and Extension Templates, I needed to make a little break. My current keeping sane mental exercise seems to mainly involve using Common Lisp, a programming language that ships with about all the building blocks you need.

    Yes, that old language brings so much on the table

    When using Common Lisp, you have an awesome interactive development environment where you can redefine function and objects while testing them. That means you don't have to quit the interpreter, reload the new version of the code and put the interactive test case together all over again after a change. Just evaluate the change in the interactive environement: functions are compiled incrementally over their previous definition, objects whose classes have changed are migrated live.

    See, I just said objects and classes. Common Lisp comes with some advanced Object Oriented Programming facilities named CLOS and MOP where the Java and Python and C++ object models are just a subset of what you're being offered. Hint, those don't have Multiple Dispatch.

    And you have a very sophisticated Condition System where Exceptions are just a subset of what you can do (hint: have a look a restarts and tell me you didn't wish your programming language of choice had them). And it continues that way for about any basic building bloc you might want to be using.

    Loading data

    Back to pgloader will you tell me. Right. I've been spending a couple of evening on hacking on the new version of pgloader in Common Lisp, and wanted to share some preliminary results.

    Playing with the loader

    The current status of the new pgloader still is pretty rough, if you're not used to develop in Common Lisp you might not find it ready for use yet. I'm still working on the internal APIs and trying to make something clean and easy to use for a developer, and then I will provide some external ways to play with it, user oriented. I missed that step once with the Python based version of the tool, I don't want to do the same errors again this time.

    So here's a test run with the current pgloader, on a small enough data set of 226 MB of CSV files.

    time python pgloader.py -R.. --summary -Tc ../pgloader.dbname.conf
    
    Table name        |    duration |    size |  copy rows |     errors
    ====================================================================
    aaaaaaaaaa_aaaa   |      2.148s |       - |      24595 |          0
    bbbbbbbbbb_bbbb...|      0.609s |       - |        326 |          0
    cccccccccc_cccc...|      2.868s |       - |      25126 |          0
    dddddddddd_dddd...|      0.638s |       - |          8 |          0
    eeeeeeeeee_eeee...|      2.874s |       - |      36825 |          0
    ffffffffff_ffffff |      0.667s |       - |        624 |          0
    gggggggggg_gggg...|      0.847s |       - |       5638 |          0
    hhh_hhhhhhh       |      9.907s |       - |     120159 |          0
    iii_iiiiiiiiiiiii |      0.574s |       - |        661 |          0
    jjjjjjj           |      6.647s |       - |      30027 |          0
    kkk_kkkkkkkkk     |      0.439s |       - |         12 |          0
    lll_llllll        |      0.308s |       - |          4 |          0
    mmmm_mmm          |      2.139s |       - |      29669 |          0
    nnnn_nnnnnn       |      8.555s |       - |     100197 |          0
    oooo_ooooo        |     13.781s |       - |      93555 |          0
    pppp_ppppppp      |      8.275s |       - |      76457 |          0
    qqqq_qqqqqqqqqqqq |      8.568s |       - |     126159 |          0
    ====================================================================
    Total             |  01m09.902s |       - |     670042 |          0
    

    Streaming data

    With the new code in Common Lisp, I could benefit from real multi threading and higher level abstraction to make it easy to use: lparallel is a lib providing exactly what I need here, with workers and queues to communicate data in between them.

    What I'm doing is that two threads are separated, one is reading the data from either a CSV file or a MySQL database directly, and pushing that data in the queue; while the other thread is pulling data from the queue and writing it into our PostgreSQL database.

    CL-USER> (pgloader.csv:import-database "dbname"
                :csv-path-root "/path/to/csv/"
                :separator #\Tab
                :quote #\"
                :escape "\"\""
                :null-as ":null:")
                        table name       read   imported     errors       time
    ------------------------------  ---------  ---------  ---------  ---------
                   aaaaaaaaaa_aaaa      24595      24595          0     0.995s
              bbbbbbbbbb_bbbbbbbbb        326        326          0     0.570s
           cccccccccc_cccccccccccc      25126      25126          0     1.461s
          dddddddddd_dddddddddd_dd          8          8          0     0.650s
    eeeeeeeeee_eeeeeeeeee_eeeeeeee      36825      36825          0     1.664s
                 ffffffffff_ffffff        624        624          0     0.707s
         gggggggggg_ggggg_gggggggg       5638       5638          0     0.655s
                       hhh_hhhhhhh     120159     120159          0     3.415s
                 iii_iiiiiiiiiiiii        661        661          0     0.420s
                           jjjjjjj      30027      30027          0     2.743s
                     kkk_kkkkkkkkk         12         12          0     0.327s
                        lll_llllll          4          4          0     0.315s
                          mmmm_mmm      29669      29669          0     1.182s
                       nnnn_nnnnnn     100197     100197          0     2.206s
                        oooo_ooooo      93555      93555          0     9.683s
                      pppp_ppppppp      76457      76457          0     5.349s
                 qqqq_qqqqqqqqqqqq     126159     126159          0     2.495s
    ------------------------------  ---------  ---------  ---------  ---------
                 Total import time     670042     670042          0    34.836s
    NIL
    

    As you can see the control is still made for interactive developer usage, which is fine for now but will have to change down the road, when the APIs stabilize.

    Now, let's compare to reading directly from MySQL:

    CL-USER> (pgloader.mysql:stream-database "dbname")
                        table name       read   imported     errors       time
    ------------------------------  ---------  ---------  ---------  ---------
                   aaaaaaaaaa_aaaa      24595      24595          0     0.887s
              bbbbbbbbbb_bbbbbbbbb        326        326          0     0.617s
           cccccccccc_cccccccccccc      25126      25126          0     1.497s
          dddddddddd_dddddddddd_dd          8          8          0     0.582s
    eeeeeeeeee_eeeeeeeeee_eeeeeeee      36825      36825          0     1.697s
                 ffffffffff_ffffff        624        624          0     0.748s
         gggggggggg_ggggg_gggggggg       5638       5638          0     0.923s
                       hhh_hhhhhhh     120159     120159          0     3.525s
                 iii_iiiiiiiiiiiii        661        661          0     0.449s
                           jjjjjjj      30027      30027          0     2.546s
                     kkk_kkkkkkkkk         12         12          0     0.330s
                        lll_llllll          4          4          0     0.323s
                          mmmm_mmm      29669      29669          0     1.227s
                       nnnn_nnnnnn     100197     100197          0     2.489s
                        oooo_ooooo      93555      93555          0     9.148s
                      pppp_ppppppp      76457      76457          0     6.713s
                 qqqq_qqqqqqqqqqqq     126159     126159          0     4.571s
    ------------------------------  ---------  ---------  ---------  ---------
              Total streaming time     670042     670042          0    38.272s
    NIL
    

    The streaming here is a tad slower than the importing from files. Now if you want to be fair when comparing those, you would have to take into account the time it takes to export the data out from its source. When doing that export/import dance, a quick test shows a timing of 1m4.745s. Now, if we do an export only test, it runs in 31.822s. So yes streaming is a good thing to have here.

    Conclusion

    We just got twice as fast as the python version.

    Some will say that I'm not comparing fairly to the Python version of pgloader here, because I could have implemented the streaming facility in Python too. Well actually I did, the option are called section_threads and split_file_reading, that you can set so that a reader is pushing data into a set of queues and several workers are feeding each from its own queue. It didn't help with performances at all. Once again, read about the infamous Global Interpreter Lock to understand why not.

    So actually it's a fair comparison here where the new code is twice as fast as the previous one, with only some hours of hacking and before spending any time on optimisation. Well, apart from using a producer, a consumer and a queue, which I almost had to have for streaming in between two database connections anyways.

  • Live Upgrading PGQ

    Some skytools related new today, it's been a while. For those who where at my FOSDEM's talk about Implementing High Availability you might have heard that I really like working with PGQ. A new version has been released a while ago, and the most recent verion is now 3.1.3, as announced in the Skytools 3.1.3 email.

    Upgrade time!

    Skytools 3.1.3 enters debian

    First news is that Skytools 3.1.3 has been entering debian today (I hope that by the time you reach that URL, it's been updated to show information according to the news here, but I might be early). As there's current a debian freeze to release wheezy (and you can help squash some bugs), this version is only getting uploaded to experimental for now. Thanks to the tireless work of Christoph Berg though, this version is already available from apt.postgresql.org.

    Upgrading to PGQ 3

    The other news is that I've been testing live upgrade scenario where we want to upgrade from PGQ to PGQ3, and it works pretty well, and it's quite simple to achieve too. Here's how.

    So the first thing is to shut down the current ticker process. Then we install the new packages, assuming that you did follow the step in the wiki pointed above, please go read apt.postgresql.org again now if needs be.

    pgqadm.py ticker.ini -s
    sudo apt-get install postgresql-9.1-pgq3 skytools3-ticker skytools3
    

    The ticker is not running anymore, we have the right version of the software installed. Next step is to upgrade the database parts of PGQ:

    psql -f /usr/share/skytools3/pgq.upgrade_2.1_to_3.0.sql ...
    psql -1 -f /usr/share/postgresql/9.1/contrib/pgq.upgrade.sql ...
    

    Of course replace those ... with options such as your actual connection string. I tend to always add -vON_ERROR_STOP=1 to all these commands, so that I don't depend on having the right .psqlrc on the particular server I'm connected to. Also remember that if you want to do that for more than one database, you need to actually run that pair of commands for each of them.

    Now it's time to restart the new ticker. The main changes from the previous one is that it is now a C program called pgqd that knows how to tick for any number of databases, so that you only have to have one instance around per cluster now.

    sudo /etc/init.d/skytools3 start
    tail -f /var/log/skytools/pgqd.log
    

    Those two commands are taking for granted that you did prepare the pgqd setup the debian and skytools way, by adding your config in /etc/skytools3/pgqd.ini and editing /etc/skytools.ini accordingly, so that it's automatically taken into account at machine boot.

    Note that I did actually exercised the procedure above while running a pgbench test replicated with londiste. Of course the replication has been lagging a little while no ticker was running, and then it catched-up as fast as it could, in that case:

    INFO {count: 245673, ignored: 0, duration: 422.104366064}
    

    Happy Hacking!

    So if you have any batch processing needs, remember to consider what PGQ has to offer. And yes if you're running some cron job to compute things out of the database for you, you are doing some batch processing.

    Yes, I did search for Transactional Batch Processing

  • The first Australian PostgreSQL Day

    The first Australian PostgreSQL conference has come to an end and I am extremely happy that I was part of it.

    The event took place in Melbourne, at the offices of Experian Hitwise in St Kilda Rd. I must confess that, going back to the same building where I had worked in 2005/2006, was a strange but positive feeling. Hitwise is a long time Postgres user and it would appear that their cluster of PostgreSQL databases has grown to become one of the largest in the world. So, a double-special occasion for me!

    After the opening session introduced by Jason, Josh Berkus gave an overview of PostgreSQL 9.2 with his “Full throttle” presentation.
    I followed with an introduction to PostgreSQL, more focused on business reasons for its adoption in business contexts. My talk, entitled “Why use PostgreSQL? 10 reasons for using it”, is the result of several years of experience with Postgres, both as a member of the community and a consultant for 2ndQuadrant.

    Over the years I have been repeatedly asked questions from both existing and potential new users. From these questions I have tried to form topics and have come up with 10 broad reasons.

    The last talk of the morning was about the very interesting MADLib library for database analytics.

    In the afternoon, following Josh’s talk on performance, I brought Barman DownUnder and introduced the audience to Disaster Recovery concepts and methodology using our open source tool developed at 2ndQuadrant. It was a good occasion to promote the new 1.2.0 version which introduces retention policy management.

    Then it was over to Jason Godden who gave an extended review and coverage of replication tools with PostgreSQL. For some of them he did provide a demo session as well.

    We were very pleased with the number of participants and overall turnout, considering it was the first event of its kind. At any one time there was always a minimum of 40 participants, while in some cases there were over 60 people present (including of course staff from Hitwise).

    Finally, many thanks to Jason Godden of the Melbourne PostgreSQL Users Group, who organised the event at the local level and who arranged to have Experian Hitwise host the conference and offer food and beverages throughout the day.

    I am confident that some PostgreSQL users and fans that were here today in Melbourne can form an active group, led by Jason, and make PGDay a recurring national event.

  • Page 1 of 33 ( 329 articles )
  • >>

© 2001-2013 2ndQuadrant S.A.S. Tous droits réservés. | Politique de confidentialité