PostgreSQL, memory and the cloud™

Intro to Memory Management

Most software cannot predict its exact memory needs at compile time, so memory is dynamically allocated at run time. When a program needs more memory, it requests a new chunk from the operating system, which can either grant or deny the request. Most modern software does not always care very much about the amount that is actually needed and tries to get a lot more than is currently necessary - just to have some spare for bad times.

The Broken Promise

Operating systems honor this behavior by not actually reserving memory exclusively for a program when a memory request is granted. The OS just hopes that the requestor will not actually use all of the granted memory, and therefore potentially promises way more memory than is physically available.
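
On Linux this gap between promise and reality can be inspected in /proc/meminfo: Committed_AS is the sum of everything the kernel has promised so far, while CommitLimit is what it would allow itself to promise under strict accounting. A quick look (the field names are standard, the values obviously depend on your system):

# how much memory has been promised vs. how much is actually there
grep -E 'MemTotal|CommitLimit|Committed_AS' /proc/meminfo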

If - unexpectedly - the software starts using the promised memory and would exceed the available resources, the OS has to act drastically. Potential actions are:

  • Panic, halt the system
  • Freeze the requesting program
  • Terminate the requesting program
  • Start an OOM-Killer that terminates memory-consuming processes

In most cases a so-called OOM-Killer is started that terminates processes according to given metrics; these could be:

  • Memory consumption
  • Arbitrary priority
  • Combination of memory usage and priority

The common procedure is the following: when the kernel is not able to free enough memory in time, the OOM-Killer is started and kills the processes with the highest scores until the memory stress situation is over.
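
On Linux the score and its administrative adjustment can be read per process from /proc. A small sketch for looking at the values of all running PostgreSQL backends (assuming the processes are simply named postgres; adjust the pgrep pattern if not):

# print OOM score and adjustment for every postgres process
for pid in $(pgrep postgres); do
        echo "PID $pid  score $(cat /proc/$pid/oom_score)  adj $(cat /proc/$pid/oom_score_adj)"
done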

Example - Desktop

This behavior is reasonable for many use cases. Think of a desktop system where a user has many running programs and leaves them running while starting more, or opens more and more new tabs in the browser. If in such a situation the OS decides to terminate a game running in the background or the web browser, then this might be better for the user than not being able to start new software.

Example - Application Server

Think of an application server running hundreds of processes to serve hundreds of clients concurrently. Now a situation occurs in one of the processes that leads to massive memory consumption, and this process tries to use more memory than is available. Here again the OOM-Killer is activated and will most likely kill the problematic process, resolving the situation with minimal impact.

Developers honor this behavior of modern operating systems by not bothering with proper memory management - like making sure not to leak memory - or by requesting huge amounts of memory when far less will actually be used.

Example - Database Server

Think of a PostgreSQL database server that has hundreds of processes serving hundreds of simultaneous users. One or more users have a high memory consumption, and in sum all processes want to consume more memory than the OS can provide fast enough - the OOM-Killer is started. Most likely one of the PostgreSQL processes executing queries is terminated. Unfortunately this leads to catastrophic failure: PostgreSQL processes can access shared memory segments (shared_buffers), and when one process is terminated without being able to “clean up”, the Postmaster cannot be sure that the shared memory is still consistent. Therefore the whole service needs to restart immediately, aborting all queries and terminating all connections. After this unclean shutdown PostgreSQL needs to perform a recovery on the next start, which can lead to long, unexpected downtimes.

Solution

The solution is simple: the memory overcommit functionality is only needed for software with loose memory management and can therefore be deactivated on database servers. On Linux this is done by setting vm.overcommit_memory = 2. Now the kernel will deny a memory request if it is obvious that not enough memory is available to fulfill it. PostgreSQL handles this gracefully. If a query needs more memory, new dynamic memory is allocated. If the allocation fails - due to the “don’t overcommit” policy - the corresponding query is aborted and the involved memory is freed. So only one query from one user is affected here.
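
A minimal sketch of how to apply this on a plain Linux box (run as root; the file name under /etc/sysctl.d/ is just an example):

# apply immediately
sysctl -w vm.overcommit_memory=2

# keep the setting across reboots
echo 'vm.overcommit_memory = 2' > /etc/sysctl.d/90-overcommit.conf

Keep in mind that with vm.overcommit_memory = 2 the kernel limits allocations to swap plus vm.overcommit_ratio percent of RAM (50 by default), so on machines with little or no swap vm.overcommit_ratio may need to be raised as well.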

This is the default and expected behavior for PostgreSQL and many other database management systems, as mentioned in the documentation.

Testing your PostgreSQL

The low-memory situation can happen for many reasons: wrong settings, not enough resources, too many connections, a “hungry” extension, other processes on the server, and many more.

Here is a simple test to trigger the problem in many cases.

SET work_mem TO '128GB'; -- should be >=2 times available RAM
EXPLAIN ANALYZE
        SELECT * FROM
                pg_class AS a,
                pg_class AS b,
                pg_class AS c
        ORDER BY
                random();

We start by allowing the coming queries to use more memory than is available; otherwise PostgreSQL could just use the disk for execution. Then we do a triple cross join on a catalog table and sort the result randomly, which requires storing all intermediate results. Sorting alone would do the trick, but we also use random() to prevent any caching. To avoid transmitting uninteresting output we run the query with EXPLAIN ANALYZE.
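
With overcommit disabled, the test should end with a plain "ERROR: out of memory" for this one session while every other connection keeps working; if instead the whole instance restarts, the OOM-Killer got there first. One rough way to tell the two cases apart is to check the server log afterwards (the log location below is just an example and differs between installations):

# graceful case: the query is cancelled with an out-of-memory error
# OOM-Killer case: a backend is reported as terminated by signal 9 and recovery starts
grep -E 'out of memory|terminated by signal 9' /var/log/postgresql/postgresql.log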

The Cloud™ - Brave new World

If you don’t host services yourself, they are running on someone else’s computer and under their control. The good thing is that you and your staff can’t make bad decisions there - but you can’t correct bad decisions someone else has made either.

If we encounter the behavior briefly described above on a server of our own, we just have to configure it according to the PostgreSQL documentation. When this or other problems materialize on Software as a Service (SaaS) offerings, the options are … limited.

As an example, please find recycled extracts from a public bug report below, where I tried to explain the problem.

PostgreSQL defined behavior

It certainly is not defined behavior of PostgreSQL itself to crash in an out-of-memory situation. PostgreSQL will handle any memory allocation failure reported by the operating system and report it to the client, cancelling only the specific, memory-hungry queries with a proper “ERROR: out of memory” message. All other clients are unaffected in this situation and the service remains uninterrupted, even in the case of a prolonged low-memory stress situation.

What we see with your service is that your operating system doesn’t report the out-of-memory situation to PostgreSQL /at all/; otherwise the aforementioned error message would have been logged. Instead, the PostgreSQL processes are killed by the operating system. Our expectation is that proper “out of memory” errors are logged and the service does not crash.

Normal Behavior under Memory Stress and OOM

If we run PostgreSQL on an identically sized VM instance and produce low-memory stress, we see the following sequence. The same is true at some other PostgreSQL-as-a-Service providers. This is the expected, normal behavior.

     A0
    ┌───────────────────────────────┐
    │                               │
 ┌─►│       Normal operation        |
 |  │                               │
 |  └──────────────┬────────────────┘
 |                 │
 |   A1            ▼
 |  ┌───────────────────────────────┐
 |  │                               │   We can trigger this with the
 │  |   Events that consume memory  |   provided SQL example. But it
 |  │                               │   occurs during normal operation
 |  └──────────────┬────────────────┘   as well.
 |                 │
 |   A2            ▼
 |  ┌───────────────────────────────┐
 |  │    Out of memory Situation    |   If no OOM-Killer is active
 |  │                               |   memory can be consumed,
 |  │    (nearly) no memory left    │   until further mallocs fail.
 |  └──────────────┬────────────────┘
 |                 |
 |   A3            ▼
 |  ┌───────────────────────────────┐
 |  ┌───────────────────────────────┐
 |  |       claim more memory,      |   the malloc() call fails.
 |  |       memory claim fails!     |
 |  └──────────────┬────────────────┘
 |                 |
 |   A4            ▼
 |  ┌───────────────────────────────┐   Only the one backend failing
 |  │   PostgreSQL rolls back the   |   to claim memory has to act.
 |  │   query failing to allocate   |   It cancels only one query.
 |  |          new memory           │   All dynamic memory is freed
 |  └──────────────┬────────────────┘   by this backend.
 └─────────────────┘

Here only the query failing to claim memory gets aborted. The RDBMS stays operational; no other user, connection or query is disturbed.

Failure Case

This is the chain of events as we observed them on different Cloud SQL instances. Please correct us if you think the graph can be improved; not all information is transparent to us.

     B0
    ┌───────────────────────────────┐
    │                               │
 ┌─►│       Normal operation        |
 |  │                               │
 |  └──────────────┬────────────────┘
 |                 │
 |   B1            ▼
 |  ┌───────────────────────────────┐
 |  │                               │   We can trigger this with the
 |  │   Events that consume memory  |   provided SQL example. But it
 |  │                               │   occurs during normal operation
 |  └──────────────┬────────────────┘   as well.
 |                 │
 |   B2            ▼
 |  ┌───────────────────────────────┐   Situation can always occur,
 |  │     Low memory situation      │   even without bad intention.
 ├─►│                               |   Unclear what the threshold is,
 |  │     "Low" is not defined      │   could be 95%, 80%, or even 30%
 |  └──────────────┬────────────────┘   of available memory consumed.
 |                 │
 |   B3            ▼
 |  ┌───────────────────────────────┐   This is not transparent to us,
 |  │  Some component in the GCP    │   but we observe a SIGKILL event.
 |  │    sends a SIGKILL to a       │   Educated guess: Linux OOM-Killer
 |  |  PostgreSQL worker process    |   with k8s setting `oom_score_adj`
 |  └──────────────┬────────────────┘   per instance related to size.
 |                 |
 |   B4            ▼
 |  ┌───────────────────────────────┐   This is the normal and expected
 |  │ PostgreSQL needs to shutdown  |   behaviors in this failure case,
 |  │ in order to prevent the effect|   behavior in this failure case,
 |  | of possible memory corruption │   If this step is reached, we have
 |  └──────────────┬────────────────┘   already lost the battle.
 |                 │
 |   B5            ▼
 |  ┌───────────────────────────────┐
 |  │                               │
 |  │     PostgreSQL (re)starts     |
 |  │                               │
 |  └──────────────┬────────────────┘
 |                 │
 |   B6            ▼
 |  ┌───────────────────────────────┐   This creates bloat which lets
 |  │      Disaster Recovery        │   the volume grow dramatically.
 |  │ after ungraceful shutdown B4, |   Bloat would be cleaned by VACUUM
 |  │   dead tuple remain as bloat  │   but the instance does not stay
 |  └──────────────┬────────────────┘   in normal operation long enough.
 └─────────────────┘

Step B3 is the behavior that is not expected and has nothing to do with PostgreSQL itself. This is what unavoidably leads to the outage. B4 and later are expected routines of PostgreSQL, but they should never occur during normal operation. This behavior is only in place to compensate for disasters - like hardware failure, kernel panic or power outage - not low-memory stress.