Saturday, July 03 2010

Next Generation PostgreSQL

CHAR(10) was a great venue with great people. The conference itself and the "Hall's Talks" have been a huge source of ideas popping up. Again...

As I've been talking to several people about those ideas over the time and past conferences, I figured I should try to put them down now. And the new ones too. So in this article I'll try to organise and share those ideas that I'm growing in my head, in the hope that some people will think that maybe the ideas are worthwile enough to think about them, make them theirs, then why not work on them!

Organising ideas is all about putting them togother in boxes and attach to them the more meaningfull label you can think of. Here, I think the labels are MVCC in the Cloud, and How to Further Optimise PostgreSQL.

DISCLAIMER

Those ideas that I try to organise and present in this documents are expressed like if they were good designs. I realise that may well be not true, but that's the best way I can think of to share them. So not only that is a Work In Progress but you're welcome to find this collection nothing but a waste of time. Apologies if that's the case.

Also those are all bird's eye view and lots of implementation problems are certainly hidden all around, but my current knowlegde of the internals of the system won't allow me to be the first to catch them. Hint, hint.

MVCC in the Cloud

Nowadays all the rage is in the virtualised environments, dynamic sizing of resources and pay-only-what-you-use. They even found a funny name to sell this, welcome to the Cloud!

The problem is that while it's quite easy to understand how to apply such a reasonning to stateless services, such as webservers, it's a lot more difficult to apply the same to persistent storage services, like, say, a traditional relational database. Let's not forget that the main services of an RDBMS ain't interpreting SQL, but offering transactions and durability.

I think forgetting about that would be awful and could lead to strange solutions appearing, where you play with highly dynamic non-structured non-normalized caches that people would be happy to use as persistent storage services. Oh and of course only provide APIs, so that the application has to know all about the storage and data location, rather than having to describe what data they need. Yeah, declarative languages are not cool anymore, we don't want no Lisp, no Haskell, no RegExp, no CSS, no XML, no XSLT, NoSQL!

So, well, is it possible to apply those Cloud concepts to ACID persistent storage services such as PostgreSQL, and get some benefits out of that? I think so, and will try to explain how. If you're a web indexer robot of some kind, please consider attaching that to your distributed computing label, would you have one.

Transparent cluster

The first problem I'd try to have an idea about is how to offer a transparent service to users in the presence of replication. Say you're running a master slave Hot Standby setup, using Streaming Replication for transfering the data. Now the application code will either connect to the master or to the slave, and depending on this choice will not be able to benefit from the same service.

So the idea would be for the service offered by both the nodes of this cluster to be the same, as far as the user is concerned. This simplification could come in two steps. First, you need XID feedback from the slave to the master. That means that the older snapshot still in use in any slave should be known by the master, so that it will refrain from cleaning up the house while the party's still ongoing. It's rude to do that, even more so to friends.

Once you can trust that any snapshot on the slave is still valid on the master, the second step would be for the slave to forward a transaction to its master ( primary_coninfo) as soon as it can't serve it itself. I think this is as soon as the transaction would require the Virtual XID to turn into a real one. Oh, and as you know the current transaction's snapshot is still valid on the master, the only thing you need is being able to open a transaction on this very snapshot over there.

And it so happens that being able to connect to a system and ask for a given snapshot you know about already would be useful for other projects (parallel dumps and distributing query processing). I think some hackers are already contemplating on how to offer such a facility. Thanks guys!

But I want my ticket to the cloud!

The previous idea goes a long way to help offering a global service to users but does nothing to help solve the real complex Cloud challenges. Those are related to the flexibility you want from the Cloud, the marketing name of it being Elasticity. To be a Cloud service, you need to be able to add and remove nodes to your service and stay online.

So, let's continue thinking, but focusing a little more.

Lots of efforts are ongoing in the domains of replications and remote acces to data. I'm thinking about Postgres-XC, Postgres-R and SQL/MED related work. I like those projects and architecture but I don't think getting them to the Cloud will be practical, except for SQL/MED.

If you missed it, SQL/MED is the part of the standard that allows a database server to expose objects from the outer space. Meaning anything from a local text file to another database server, of the same or another technology. It's very powerful and valuable, but the limitation is that you can't reach outer space within your transaction boundaries. That's a pretty good limit, though, because it means you now get autonomous transactions (you embed a transaction into another and their COMMIT/ROLLBACK statuses are independant, or, say, autonomous).

To provide elastic clustering in the Cloud, with some kind of transparency for the user, you don't want to lose MVCC, that thing that allows the database to respect the ACID guarantees. Maintaining MVCC in a cluster is, on the other hand, exactly what Postgres-XC, Postgres-R and Middle-R are working on.

The problem there is that either you have full support of PostgreSQL features, that's Postgres-R, or you have data distribution, that's Postgres-XC (which does not yet provide node fault-tolerance, by the way).

In the latter case, it's even a big problem to offer all the PostgreSQL features (user defined functions, triggers, rules, etc) because for parallelizing query processing (you have to do that if you're distributing the data) the vast majority of the work they're doing is in the planner and optimiser. I don't even want to think how they will make it possible for a trigger to touch a non-local table in a distributed cluster.

And one of the thing Markus showed at CHAR(10) is that what's killing replicated systems really is the communication overhead. In this light, really, the worst case would be Two Phase Commit, but it appears that Postgres-XC might not be that far away. The real big winner is Postgres-R which shows very low communication overhead, detailed next.

Remote tablespace

The idea here would be to keep the global MVCC facilities that those project provide. I'll confess that I like the Postgres-R implementation of it, even if it relies on a Group Communication System to deliver a total commit ordering of transactions in the cluster. But the advantage is that the design has been made with the idea of supporting dynamic cluster configuration: you can accept new nodes and lose some others online.

Now the GCS is a hard-sell in our community, because finding a good implementation of one ain't an easy task, and it seems that some of PostgreSQL developers either burnt themselves on a work-in-progress implementation, or just don't trust the theory. So I'd like to find out some trustworthy and light way to derive a distributed sequence that would be a reference for XID ordering in the cluster. I'm reading docs on Vector Clock and Paxos algorithm now.

So, say we have a global MVCC we can trust. Then, the data distribution would be easier to address at the executor level I think. What you need to be able to say is that some data are stored on another node. You need the catalogs to exists on all the nodes, and I think that the physical location property of the data is already well defined in the notion of a tablespace. So we should extend this notion to remote PostgreSQL nodes that happen to share the MVCC details with us.

That certainly already means that any and all DDL targeting the remote tablespace needs to happen using 2 Phase Commit, or maybe an asynchronous model would fit here, I'm not sure. But I don't really see how. The data would exist only on the master node for any given tablespace, but the catalogs should certainly be in-sync, unless you want to accept a whole new error domain, but which could be made to look like a serialisation failure after all: if the DDL changes are asynchronous, then your problem is the lack of remote exclusive lock, I guess.

That something I've been thinking about for a long time, you can find the distributing PostgreSQL wiki page and see its history for the details. If you see so you'll notice that the page needs some refreshing (at least the ideas behind the wiki page got some refresh here and there, the most recent and important one being CHAR(10).

Then the planner would work as it's doing now, but certain parts of the plan would have to be handled to another PostgreSQL instance. And as we're working with a global MVCC idea, we keep transactional behavior.

This does not provide data distributing off itself, for that to happen you would use current ddl partitioning facilities and have the partitions sit on different remote tablespaces. As soon as you do that, you can distribute the processing by having the network IO happen asynchronously, which I think would be required to achieve any kind of performances. Plus we already have something in this spirit with effective_io_concurrency.

Mirrored tablespace

Now of course we all like to have a choice. Here the choice would be about data locality. Certain systems already exists (distributed file systems, including GlusterFS or to some extend HAMMER) that propose to mix and match data replication and data distribution. Or simply think about RAID-10. You have both striping and mirroring there, so that's somewhat expected to want to have that in your database product too...

That leads me to think we should offer mirror tablespaces while at it. Implementing that can be as complex as optimising 2 Phase Commit or as simple (ahah) as having a per tablespace Streaming Replication solution.

Now of course you have to explain data locality in the network to the planner, so that it can compare the cost of running a JOIN over there then retrieving the result set against doing that locally. It seems clean what's best until you think that while the other node is hashing your fact table you're free to locally execute another part of the same query plan...

It's so cloudy now I can feel the rain coming

And as all the rage is about offering elastic capabilities to your clustering solution, the next step after that is having some smart agent in the system that will notice that we're doing this or that data transfers to solve user queries so often that we should setup a mirror of this remote tablespace now.

Of course, you also want newcomers to be able to extend your elasticity, like "hey guys I'm new there, how can I help?".

While at it, your smart cluster manager program should be able to arrange things so that the loss of any node at any time is just an optimisation problem, not a data loss one. Nothing more critical than a serialisation error, but maybe we should invent a new SQLERR here, I'd like to propose the following error message to the transactions you need to abort in case of some node unexpected disapearing: the elastic just shrinked.

How to Further Optimise PostgreSQL

It seems harder and harder to find good ideas that look effective and that are not too complex to understand. It's simply because about all of them already have been implemented into the product, if you ask me. But that does not mean there isn't a way to go, that means finding it is getting tough, and that when pursuing the ideas used to be some 1-man work for some consolidated weeks, it's getting to collaborative working of several talented people sharing a common goal and able to dedicate months of their time.

I hope it does not mean we should be happy with what we have but we rather continue improving our prefered product.

On the usage of indexes as column store

When you're handling very big tables, there's a very good chance that some column's value will get repeated all over the place. But if you index that column, you'll see the value only once and a list of pointers to the rows that are hosting it. Column storage is about compressing that data set in a way that you store the column value only once, and also about avoiding to pass the data around in the executor.

The idea here would need to first have the ability to use the index in an authoritative way, without refering the main table storage to confirm the visibility. That's already a work in progress.

When we have that, then we could think about how we're using the indexes now. Their only purpose in life is to help solve restrictive queries by accessing directly to the rows of interrest, avoiding to scan all those data that won't fit in main memory, so that you rely on your slow drives to get access to it.

Another idea here would be to be able to use the index to solve queries in other cases than just applying the restrictions and filtering they require. By that I mean that we could use our indexes to retrieve some column's value, rather than the main table's data, when we have the idea that the columns cardinality is low enough.

I guess that would mean we have optional column storage, in the case all the values you need from the table are in the available index(es).

Gavin reports that this idea may not provide the bangs for the bucks, but warns that it could be he's too much used to thinking on datawarehouse terms. Also that before engaging into such an effort, it might be good to have an idea of the benefit you're running after, which is the classic Amdahl's law or the old saying that you should always profile before to spend time on optimizing.

So it's impossible without some preliminary work to assess how useful this kind of index as a column store would be in the general case, but that wouldn't stop me to share the idea, see...

The Executor as a Virtual Machine

I've been talking a lot with Gavin, and his mind is full of datawarehouse optimisation problems. One of them is how to speed up the data retrieval pipeline from disk. Given what are modern CPU, we should be saturating read capabilities of any hard drive without a sweat, and we're not there yet. It happens that MonetDB is about there, with their MonetDB Assembly Language, further described on the MAL reference page.

The big difference in architecture between them and us is how they execute the queries, with code path that contain very little branches, loops, or function calls. That alone would be responsible for an incredible performance benefit, like, a factor of 25 to 50 times what we have now, have I been hinted.

If you think about it in the right angle, to be able to suppress all the looping and branching and function calling we currently have into the executor code, it could be that the simplest solution (ahah) would be to expose the executor capabilities as opcodes or assembly and have the planner and optimiser be just in time compilers for this new Virtual Machine.

And while at it, we then might benefit from some intermediate representation of the plan tree, which could start out as what the planner works with in term of data structure. Out of that, the optimiser job would be to generate the executor code, and I think it should be possible to define an optimisation effort target at this point, akin to -O0 to -O3 option of gcc. This is another much wanted feature, being able to set the amount of effort to put into finding the best plan possible, and this executor virtual machine or executor assembly idea may not be the most simple way to get there, but I see it as a conceptually simple fallout.

Another thing we might want to look at is how we batch the disk level reads. The current executor will fetch a row at a time, and fetching something like 2 MB at a time (maybe only for seqscans) might offer some great speed benefits, that should even balance out the waste in case you filter out some of those data.

Automatic use of Materialized VIEWs

The initial title of this section was Matching a query against some "template" at runtime, which is the hard part of the problem, I think.

The goal is simple and easy to understand, when you maintain some materialized views with triggers, you go a long way to ensure that the content is trustworthy in a transactional way. That means using that table or the normalized ones would not change any query result. So the best would be for the relational engine we love and trust to simply (from a user perspective) be able to use them when having to solve a query that could benefit.

As of course the dynamic query will not always be written the exact same way the VIEW is, you can't just hash the query text and compare with some index. Also, the user queries will not embed the VIEW anywhere in their definition, potentially, so you're not searching for a substring match either. It's more like matching a subplan of the query. Then I guess the planner and optimiser should consider, when evaluating the cost of any plan, if there's a materialized view somewhere that could help running it.

In case we have manually refreshed materialized views, I guess we could still benefit from the same mecanisms, but we would need a way to invalidate the plans when the data is not fresh enough, so that we decouple the refreshing interval to the acceptable lag in usage. Some metadata and auto-analyze support would certainly do.

Analyzing VIEWs as a correlated-statistics solution

This one is tricky. Oh well you may find all the ideas here as tricky, more often than not tricky enough that they don't need pursuing. But the other ones I've been thinking about them for long enough that they just feel natural for my brain. Aha.

So one of the major problem of PostgreSQL currently is related to how its planner and optimiser heavily depend on ANALYZE statistics. The problem is not so much having fresh ones nowadays, thanks to autovacuum, but to their quality. And the first quality problem we have is tied to correlated data.

What that means is that if you have a couple of colums a and b set in a way that each time a is not null, b = 2 * a, then PostgreSQL has no way to realize that. So if you have a query WHERE a=1 AND b=2 then it will think having the AND in there means you double the output restrictivity, and it's not the case.

There was a thread about exactly this problem on the mailing lists, where Simon proposed that we implement ANALYZE foo [WHERE .... ] in order to address this problem. I think a generalisation of this proposal is to be able to ANALYSE a VIEW, as I said here.

So if we ever get to implement a way to match a user query dynamically against some materialized view existing in the system, I suppose we could also benefit from this view's statistics in the case it's not materialized.

Planner costs and system usage statistics, or admission control

Kevin Grittner on the mailing lists proposed that we read section 2.4 of the Architecture of a Database System by Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. The idea revolves around accepting queries depending on the estimated resource usage they should need to execute, and the current availability of those on the system.

Now, it could be that a way to approach that would be to have some OS-specific statistic views in PostgreSQL, we already have a lot of those now. So we could query PostgreSQL for the current system load average, io wait, main memory usage and whatnot. The usual answer to that is that this is in no way the job of a database engine to care about such things, so please try and find the right tool for the right job, thank you very much.

Well, the idea here would be that the planner would now have access to those information, by calling a function. And we could expose some GUC thresholds not to overcome in the planning stage.

That means that the plan itself should store the information of the GUC as they were at the very moment of the planning, so that you possibly get plan invalidation to kick in and force a re-plan when you EXECUTE a PREPARE STATEMENT in the case the current resource usage changed enough.

Now the remaining issue is that you're using the system's resources usage indicators at planning time, to be able to prune some plans, but for complex queries there can be a meaningfull delay between the planning and the execution, so you only get illusions about resource availability.

Well, unless and until you get the possibility of issuing more than one plan for any given query and then have the executor pick one. In fact you want to apply this idea recursively, so that at any given part of the plan the executor has some way to check between what environment the planner was running with and what the executor is running with. Having such a feature would also greatly enhance data locality access patterns, that also would just be a choice of subplan left to executor to make.

Oh and should we get distributed query processing on a single node, then again we would benefit here by having the planner propose plans using the capability or bypassing it, and the executor check whether it makes sense to use it now given the estimated costs. At all plan levels.

The main drawback against such a line of thought is the major overhaul in executor design this represent, and in the case of only providing distributed sorting, e.g., then the cost / benefit ratio ain't foreseen as good enough to attack it.

In the context of this document though, it's some more reasons to consider the benefits of architecting the current executor code into a Virtual Machine set of instructions. Then some of the instructions would allow for checking the current environment and branching in the plan, following the best compromise as of current execution time. Sure, that's not symplifying the implementation. But that's a very good reason to have the use case in mind before making any low-level choice.

And some other thoughts

You know the famous joke saying there are 10 Kinds of People in the World, those who understand binary and those who don't. You though that counting to 3 in decimal was so much easier, didn't you? (Hint, I told about organizing this list of ideas under two big labels, of which this is the third)

Supervized guest deamons, with an API please

There are several projects that could benefit from being integrated alongside core processes. By that I mean being part of the start, reload, stop and restart procedures. The current list includes autovacuum, a pgagent scheduler, and a ticker. And helper backgrounds like PostgreSQL had in some 6.x branch, and that Markus implemented in Postgres-R. There's even an argument about including a connection pool into the mix, but Jan won that over a beer in Ottawa this year, so I won't insist. You'll notice there's a lucky winner here, we needed autovacuum so bad that it's thankfully already in core.

Providing an API that would allow to register user defined deamon processes would allow for including those other projects, and maybe some more, in a very easy way for the user.

My current thinking about that would be to steal as much as meaningfull from the Erlang/OTP supervisor processes API, including the MaxR and MaxT variables to protect the main system to suffer from user deamons mis-behaviors.

It would even make sense to only provide support for gen_fsm kind of hosted processes, meaning the API is to register a global unique name per process, a state and code entry points (transitions). Now that we have a mecanism to send signals to backends, with a payload, it's called LISTEN/NOTIFY. We could certainly reuse that to send messages to the hosted state machines: that would be the events that trigger te transitions --- the event name would match the code function.

In the case of a pgagent scheduler, we would need to be able to produce events internally, without user interaction, but my understanding of SIGALARM is that it's made for that. It's not clear what the best design would be here, but maybe registering a pgagent clock service that would NOTIFY the pgagent launcher service would do. It would also have to LISTEN for changes on its underlying job scheduling tables, I guess.

And for querying the database, we'd use SPI, I guess.

Logs analysis

Nowadays to analyze logs and provide insights, the more common tool to use is pgfouine, which does an excellent job. But there has been some improvements in logs capabilities that we're not benefiting from yet, and I'm thinking about the CSV log format.

So the idea would be to turn pgfouine into a set of SQL queries against the logs themselves once imported into the database. Wait. What about having our next PostgreSQL version, which is meant to include CSV support in SQL/MED, to directly expose its logs as a system view?

A good thing would be to expose that as a ddl-partitioned table following the log rotation scheme as setup in postgresql.conf, or maybe given in some sort of a setup, in order to support logrotate users. At least some facilities to do that would be welcome, and I'm not sure plain SQL/MED is that when it comes to source partitioning.

Then all that remains to be done is a set of SQL queries and some static or dynamic application to derive reports from there.

Conclusion

I hope some of those ideas are viable and interresting to some people, and should that be the case, seeing progress made on those would be awesome! Meanwhile, thanks for reading.