CHAR(10) was a great venue with great people. The conference itself and the "Hall's Talks" have been a huge source of ideas popping up. Again...
As I've been talking to several people about those ideas over the time and past conferences, I figured I should try to put them down now. And the new ones too. So in this article I'll try to organise and share those ideas that I'm growing in my head, in the hope that some people will think that maybe the ideas are worthwile enough to think about them, make them theirs, then why not work on them!
Organising ideas is all about putting them togother in boxes and attach to them the more meaningfull label you can think of. Here, I think the labels are MVCC in the Cloud, and How to Further Optimise PostgreSQL.
Those ideas that I try to organise and present in this documents are expressed like if they were good designs. I realise that may well be not true, but that's the best way I can think of to share them. So not only that is a Work In Progress but you're welcome to find this collection nothing but a waste of time. Apologies if that's the case.
Also those are all bird's eye view and lots of implementation problems are certainly hidden all around, but my current knowlegde of the internals of the system won't allow me to be the first to catch them. Hint, hint.
MVCC in the Cloud
Nowadays all the rage is in the virtualised environments, dynamic sizing of resources and pay-only-what-you-use. They even found a funny name to sell this, welcome to the Cloud!
The problem is that while it's quite easy to understand how to apply such a reasonning to stateless services, such as webservers, it's a lot more difficult to apply the same to persistent storage services, like, say, a traditional relational database. Let's not forget that the main services of an RDBMS ain't interpreting SQL, but offering transactions and durability.
I think forgetting about that would be awful and could lead to strange solutions appearing, where you play with highly dynamic non-structured non-normalized caches that people would be happy to use as persistent storage services. Oh and of course only provide APIs, so that the application has to know all about the storage and data location, rather than having to describe what data they need. Yeah, declarative languages are not cool anymore, we don't want no Lisp, no Haskell, no RegExp, no CSS, no XML, no XSLT, NoSQL!
So, well, is it possible to apply those Cloud concepts to ACID persistent storage services such as PostgreSQL, and get some benefits out of that? I think so, and will try to explain how. If you're a web indexer robot of some kind, please consider attaching that to your distributed computing label, would you have one.
The first problem I'd try to have an idea about is how to offer a transparent service to users in the presence of replication. Say you're running a master slave Hot Standby setup, using Streaming Replication for transfering the data. Now the application code will either connect to the master or to the slave, and depending on this choice will not be able to benefit from the same service.
So the idea would be for the service offered by both the nodes of this cluster to be the same, as far as the user is concerned. This simplification could come in two steps. First, you need XID feedback from the slave to the master. That means that the older snapshot still in use in any slave should be known by the master, so that it will refrain from cleaning up the house while the party's still ongoing. It's rude to do that, even more so to friends.
Once you can trust that any snapshot on the slave is still valid on the
master, the second step would be for the slave to forward a transaction to
its master (
primary_coninfo) as soon as it can't serve it itself. I think
this is as soon as the transaction would require the
Virtual XID to turn
into a real one. Oh, and as you know the current transaction's snapshot is
still valid on the master, the only thing you need is being able to open a
transaction on this very
snapshot over there.
And it so happens that being able to connect to a system and ask for a given snapshot you know about already would be useful for other projects (parallel dumps and distributing query processing). I think some hackers are already contemplating on how to offer such a facility. Thanks guys!
But I want my ticket to the cloud!
The previous idea goes a long way to help offering a global service to users but does nothing to help solve the real complex Cloud challenges. Those are related to the flexibility you want from the Cloud, the marketing name of it being Elasticity. To be a Cloud service, you need to be able to add and remove nodes to your service and stay online.
So, let's continue thinking, but focusing a little more.
Lots of efforts are ongoing in the domains of replications and remote acces
to data. I'm thinking about
work. I like those projects and architecture but I don't think getting them
Cloud will be practical, except for
If you missed it,
SQL/MED is the part of the
standard that allows a database
server to expose objects from the
outer space. Meaning anything from a local
text file to another database server, of the same or another
technology. It's very powerful and valuable, but the limitation is that you
can't reach outer space within your transaction boundaries. That's a pretty
limit, though, because it means you now get autonomous transactions
(you embed a transaction into another and their
COMMIT/ROLLBACK statuses are
independant, or, say,
To provide elastic clustering in the Cloud, with some kind of transparency for the user, you don't want to lose MVCC, that thing that allows the database to respect the ACID guarantees. Maintaining MVCC in a cluster is, on the other hand, exactly what Postgres-XC, Postgres-R and Middle-R are working on.
The problem there is that either you have full support of PostgreSQL features, that's Postgres-R, or you have data distribution, that's Postgres-XC (which does not yet provide node fault-tolerance, by the way).
In the latter case, it's even a big problem to offer all the PostgreSQL features (user defined functions, triggers, rules, etc) because for parallelizing query processing (you have to do that if you're distributing the data) the vast majority of the work they're doing is in the planner and optimiser. I don't even want to think how they will make it possible for a trigger to touch a non-local table in a distributed cluster.
And one of the thing Markus showed at CHAR(10) is that what's killing replicated systems really is the communication overhead. In this light, really, the worst case would be Two Phase Commit, but it appears that Postgres-XC might not be that far away. The real big winner is Postgres-R which shows very low communication overhead, detailed next.
The idea here would be to keep the global MVCC facilities that those project provide. I'll confess that I like the Postgres-R implementation of it, even if it relies on a Group Communication System to deliver a total commit ordering of transactions in the cluster. But the advantage is that the design has been made with the idea of supporting dynamic cluster configuration: you can accept new nodes and lose some others online.
GCS is a hard-sell in our community, because finding a good
implementation of one ain't an easy task, and it seems that some of
PostgreSQL developers either burnt themselves on a
implementation, or just don't trust the theory. So I'd like to find out some
trustworthy and light way to derive a distributed sequence that would be a
reference for XID ordering in the cluster. I'm reading docs on
Paxos algorithm now.
So, say we have a global MVCC we can trust. Then, the data distribution would be easier to address at the executor level I think. What you need to be able to say is that some data are stored on another node. You need the catalogs to exists on all the nodes, and I think that the physical location property of the data is already well defined in the notion of a tablespace. So we should extend this notion to remote PostgreSQL nodes that happen to share the MVCC details with us.
That certainly already means that any and all
DDL targeting the remote
tablespace needs to happen using
2 Phase Commit, or maybe an asynchronous
model would fit here, I'm not sure. But I don't really see how. The data
would exist only on the master node for any given tablespace, but the
catalogs should certainly be in-sync, unless you want to accept a whole new
error domain, but which could be made to look like a serialisation failure
after all: if the
DDL changes are asynchronous, then your problem is the
lack of remote
exclusive lock, I guess.
That something I've been thinking about for a long time, you can find the distributing PostgreSQL wiki page and see its history for the details. If you see so you'll notice that the page needs some refreshing (at least the ideas behind the wiki page got some refresh here and there, the most recent and important one being CHAR(10).
Then the planner would work as it's doing now, but certain parts of the plan would have to be handled to another PostgreSQL instance. And as we're working with a global MVCC idea, we keep transactional behavior.
This does not provide data distributing off itself, for that to happen you
would use current
ddl partitioning facilities and have the partitions sit on
remote tablespaces. As soon as you do that, you can distribute the
processing by having the network IO happen asynchronously, which I think
would be required to achieve any kind of performances. Plus we already have
something in this spirit with
Now of course we all like to have a choice. Here the choice would be about
data locality. Certain systems already exists (distributed file systems,
GlusterFS or to some extend
HAMMER) that propose to mix and match
data replication and data distribution. Or simply think about
mirroring there, so that's somewhat expected to want
to have that in your database product too...
That leads me to think we should offer mirror tablespaces while at it. Implementing that can be as complex as optimising 2 Phase Commit or as simple (ahah) as having a per tablespace Streaming Replication solution.
Now of course you have to explain data locality in the network to the
planner, so that it can compare the cost of running a
JOIN over there then
retrieving the result set against doing that locally. It seems clean what's
best until you think that while the other node is hashing your fact table
you're free to locally execute another part of the same query plan...
It's so cloudy now I can feel the rain coming
And as all the rage is about offering elastic capabilities to your clustering solution, the next step after that is having some smart agent in the system that will notice that we're doing this or that data transfers to solve user queries so often that we should setup a mirror of this remote tablespace now.
Of course, you also want newcomers to be able to extend your elasticity, like "hey guys I'm new there, how can I help?".
While at it, your smart cluster manager program should be able to arrange
things so that the loss of any node at any time is just an optimisation
problem, not a data loss one. Nothing more critical than a
error, but maybe we should invent a new
SQLERR here, I'd like to propose the
following error message to the transactions you need to abort in case of
some node unexpected disapearing:
the elastic just shrinked.
How to Further Optimise PostgreSQL
It seems harder and harder to find good ideas that look effective and that are not too complex to understand. It's simply because about all of them already have been implemented into the product, if you ask me. But that does not mean there isn't a way to go, that means finding it is getting tough, and that when pursuing the ideas used to be some 1-man work for some consolidated weeks, it's getting to collaborative working of several talented people sharing a common goal and able to dedicate months of their time.
I hope it does not mean we should be happy with what we have but we rather continue improving our prefered product.
On the usage of indexes as column store
When you're handling very big tables, there's a very good chance that some column's value will get repeated all over the place. But if you index that column, you'll see the value only once and a list of pointers to the rows that are hosting it. Column storage is about compressing that data set in a way that you store the column value only once, and also about avoiding to pass the data around in the executor.
The idea here would need to first have the ability to use the index in an authoritative way, without refering the main table storage to confirm the visibility. That's already a work in progress.
When we have that, then we could think about how we're using the indexes now. Their only purpose in life is to help solve restrictive queries by accessing directly to the rows of interrest, avoiding to scan all those data that won't fit in main memory, so that you rely on your slow drives to get access to it.
Another idea here would be to be able to use the index to solve queries in other cases than just applying the restrictions and filtering they require. By that I mean that we could use our indexes to retrieve some column's value, rather than the main table's data, when we have the idea that the columns cardinality is low enough.
I guess that would mean we have optional column storage, in the case all the values you need from the table are in the available index(es).
Gavin reports that this idea may not provide the bangs for the bucks, but warns that it could be he's too much used to thinking on datawarehouse terms. Also that before engaging into such an effort, it might be good to have an idea of the benefit you're running after, which is the classic Amdahl's law or the old saying that you should always profile before to spend time on optimizing.
So it's impossible without some preliminary work to assess how useful this kind of index as a column store would be in the general case, but that wouldn't stop me to share the idea, see...
The Executor as a Virtual Machine
I've been talking a lot with
Gavin, and his mind is full of datawarehouse
optimisation problems. One of them is how to speed up the data retrieval
pipeline from disk. Given what are modern
CPU, we should be saturating read
capabilities of any hard drive without a sweat, and we're not there yet. It
MonetDB is about there, with their
MonetDB Assembly Language,
further described on the
MAL reference page.
The big difference in architecture between them and us is how they execute
the queries, with code path that contain very little branches, loops, or
function calls. That alone would be responsible for an incredible
performance benefit, like, a factor of
50 times what we have now, have
I been hinted.
If you think about it in the right angle, to be able to suppress all the looping and branching and function calling we currently have into the executor code, it could be that the simplest solution (ahah) would be to expose the executor capabilities as opcodes or assembly and have the planner and optimiser be just in time compilers for this new Virtual Machine.
And while at it, we then might benefit from some intermediate representation
of the plan tree, which could start out as what the planner works with in
term of data structure. Out of that, the optimiser job would be to generate
executor code, and I think it should be possible to define an
optimisation effort target at this point, akin to
-O3 option of
gcc. This is another much wanted feature, being able to set the amount of
effort to put into finding the best plan possible, and this
executor assembly idea may not be the most simple way to get
there, but I see it as a conceptually simple fallout.
Another thing we might want to look at is how we batch the disk level
reads. The current executor will fetch a row at a time, and fetching
2 MB at a time (maybe only for
seqscans) might offer some
great speed benefits, that should even balance out the waste in case you
filter out some of those data.
Automatic use of Materialized VIEWs
The initial title of this section was Matching a query against some "template" at runtime, which is the hard part of the problem, I think.
The goal is simple and easy to understand, when you maintain some materialized views with triggers, you go a long way to ensure that the content is trustworthy in a transactional way. That means using that table or the normalized ones would not change any query result. So the best would be for the relational engine we love and trust to simply (from a user perspective) be able to use them when having to solve a query that could benefit.
As of course the dynamic query will not always be written the exact same way
VIEW is, you can't just
hash the query text and compare with some
index. Also, the user queries will not embed the
VIEW anywhere in their
definition, potentially, so you're not searching for a
either. It's more like matching a
subplan of the query. Then I guess the
planner and optimiser should consider, when evaluating the
cost of any plan,
if there's a materialized view somewhere that could help running it.
In case we have manually refreshed materialized views, I guess we could still benefit from the same mecanisms, but we would need a way to invalidate the plans when the data is not fresh enough, so that we decouple the refreshing interval to the acceptable lag in usage. Some metadata and auto-analyze support would certainly do.
Analyzing VIEWs as a correlated-statistics solution
This one is tricky. Oh well you may find all the ideas here as tricky, more often than not tricky enough that they don't need pursuing. But the other ones I've been thinking about them for long enough that they just feel natural for my brain. Aha.
So one of the major problem of PostgreSQL currently is related to how its
planner and optimiser
heavily depend on
ANALYZE statistics. The problem is
not so much having fresh ones nowadays, thanks to
autovacuum, but to their
quality. And the first quality problem we have is tied to
What that means is that if you have a couple of colums
b set in a way
that each time a is not null,
b = 2 * a, then PostgreSQL has no
way to realize that. So if you have a query
WHERE a=1 AND b=2
then it will think having the
AND in there means you double the output
restrictivity, and it's not the case.
There was a
thread about exactly this problem on the mailing lists, where
Simon proposed that we implement
ANALYZE foo [WHERE .... ] in order to
address this problem. I think a generalisation of this proposal is to be
VIEW, as I
So if we ever get to implement a way to match a user query dynamically against some materialized view existing in the system, I suppose we could also benefit from this view's statistics in the case it's not materialized.
Planner costs and system usage statistics, or admission control
Kevin Grittner on the mailing lists proposed that we read section 2.4 of the Architecture of a Database System by Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. The idea revolves around accepting queries depending on the estimated resource usage they should need to execute, and the current availability of those on the system.
Now, it could be that a way to approach that would be to have some OS-specific statistic views in PostgreSQL, we already have a lot of those now. So we could query PostgreSQL for the current system load average, io wait, main memory usage and whatnot. The usual answer to that is that this is in no way the job of a database engine to care about such things, so please try and find the right tool for the right job, thank you very much.
Well, the idea here would be that the planner would now have access to those
information, by calling a function. And we could expose some
not to overcome in the planning stage.
That means that the plan itself should store the information of the
they were at the very moment of the planning, so that you possibly get plan
invalidation to kick in and force a re-plan when you
STATEMENT in the case the current resource usage changed enough.
Now the remaining issue is that you're using the system's resources usage indicators at planning time, to be able to prune some plans, but for complex queries there can be a meaningfull delay between the planning and the execution, so you only get illusions about resource availability.
Well, unless and until you get the possibility of issuing more than one plan for any given query and then have the executor pick one. In fact you want to apply this idea recursively, so that at any given part of the plan the executor has some way to check between what environment the planner was running with and what the executor is running with. Having such a feature would also greatly enhance data locality access patterns, that also would just be a choice of subplan left to executor to make.
Oh and should we get distributed query processing on a single node, then again we would benefit here by having the planner propose plans using the capability or bypassing it, and the executor check whether it makes sense to use it now given the estimated costs. At all plan levels.
The main drawback against such a line of thought is the major overhaul in executor design this represent, and in the case of only providing distributed sorting, e.g., then the cost / benefit ratio ain't foreseen as good enough to attack it.
In the context of this document though, it's some more reasons to consider the benefits of architecting the current executor code into a Virtual Machine set of instructions. Then some of the instructions would allow for checking the current environment and branching in the plan, following the best compromise as of current execution time. Sure, that's not symplifying the implementation. But that's a very good reason to have the use case in mind before making any low-level choice.
And some other thoughts
You know the famous joke saying there are 10 Kinds of People in the World, those who understand binary and those who don't. You though that counting to 3 in decimal was so much easier, didn't you? (Hint, I told about organizing this list of ideas under two big labels, of which this is the third)
Supervized guest deamons, with an API please
There are several projects that could benefit from being
core processes. By that I mean being part of the start, reload,
stop and restart procedures. The current list includes
scheduler, and a ticker. And
helper backgrounds like PostgreSQL had in some
6.x branch, and that
Markus implemented in
Postgres-R. There's even an
argument about including a connection pool into the mix, but
Jan won that
over a beer in Ottawa this year, so I won't insist. You'll notice there's a
lucky winner here, we needed
autovacuum so bad that it's thankfully already
Providing an API that would allow to register user defined deamon processes would allow for including those other projects, and maybe some more, in a very easy way for the user.
My current thinking about that would be to steal as much as meaningfull from
the Erlang/OTP supervisor processes API, including the
variables to protect the main system to suffer from user deamons
It would even make sense to only provide support for
gen_fsm kind of hosted
processes, meaning the API is to register a global unique name per process,
a state and code entry points (transitions). Now that we have a mecanism to
send signals to backends, with a payload, it's called
could certainly reuse that to send messages to the hosted state machines:
that would be the events that trigger te transitions --- the event name
would match the code function.
In the case of a pgagent scheduler, we would need to be able to produce
events internally, without user interaction, but my understanding of
SIGALARM is that it's made for that. It's not clear what the best design
would be here, but maybe registering a
pgagent clock service that would
pgagent launcher service would do. It would also have to
for changes on its underlying job scheduling tables, I guess.
And for querying the database, we'd use
SPI, I guess.
Nowadays to analyze logs and provide insights, the more common tool to use
pgfouine, which does an excellent job. But there has been some
improvements in logs capabilities that we're not benefiting from yet, and
I'm thinking about the
CSV log format.
So the idea would be to turn
pgfouine into a set of
SQL queries against the
logs themselves once imported into the database. Wait. What about having our
next PostgreSQL version, which is meant to include CSV support in
to directly expose its logs as a system view?
A good thing would be to expose that as a ddl-partitioned table following
the log rotation scheme as setup in
postgresql.conf, or maybe given in some
sort of a setup, in order to support
logrotate users. At least some
facilities to do that would be welcome, and I'm not sure plain
that when it comes to
Then all that remains to be done is a set of
SQL queries and some static or
dynamic application to derive reports from there.
I hope some of those ideas are viable and interresting to some people, and should that be the case, seeing progress made on those would be awesome! Meanwhile, thanks for reading.