
10 Issues to Avoid When Setting Up and Managing Your PostgreSQL Database

By Matt Yonkovit

PostgreSQL is a hugely popular database and the leading example of how an open-source database can operate successfully for its community and for companies. It has been named database of the year by DB-Engines for three out of the past four years, and it will be no surprise if that continues.

However, there are areas where PostgreSQL can be problematic. There are common – and not-so-common – mistakes that can result in serious performance and security issues. In this article, I detail the 10 most common problems you might face around PostgreSQL and why they matter.

1. Tool selection

Alongside your database, you will normally need tools to help you manage that implementation. Picking the wrong tool at the start can make that job harder and more time-consuming, outweighing any time saving the tool was meant to deliver for a particular task.

For example, we can look at three common PostgreSQL tools for load balancing and connection management: 

  • HAProxy 
  • pgBouncer  
  • pgPool-II 

These tools look similar, but they have different strengths and weaknesses. HAProxy is a lightweight load balancing tool that manages workloads across multiple PostgreSQL instances. PgBouncer is a similarly lightweight connection pooling tool. PgPool-II provides both connection pooling and load balancing services, but it is not a lightweight implementation.

At face value, each tool covers very similar ground. But picking the right one can make a huge difference to performance and maintenance. The wrong choice will add to your workload rather than reduce it. 

As part of this process, it is also important to check the version of each tool you use, especially those that ship as the default in your operating system distribution. Bundling useful tools can help you get up and running faster, but an old tool can seriously impact productivity. For example, PostgreSQL on CentOS version 7 ships with HAProxy 1.5, while CentOS version 8 includes HAProxy 1.8 as standard. Both are behind the current HAProxy release (version 2.4 at the time of writing), so you should update this tool if you intend to use it.
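
As a quick sanity check, you can confirm what you are actually running. The snippet below is a minimal sketch: the first query runs against PostgreSQL itself, while the remaining commands run against pgBouncer's admin console (reached by connecting to the special "pgbouncer" database on the pooler's port), assuming pgBouncer is part of your setup.

    -- Against the PostgreSQL server itself
    SELECT version();

    -- Against the pgBouncer admin console (connect to the "pgbouncer" database on the pooler port)
    SHOW VERSION;
    SHOW POOLS;   -- current pool status per database/user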

2. Lax security

One of the biggest security problems of the past few years has been poor database management. It is easy to discover database instances on the web that have no security controls implemented, or where misconfigurations have made them publicly accessible.

For developers, database security is probably the last thing on their minds, but getting this right is crucial. Common mistakes include not using SSL/TLS to encrypt traffic between the application and the database nodes, not using role-based access control or granting permissions too widely, and setting unix_socket permissions too loosely. Similarly, it is easy to leave data in the public schema, where it can be accessed by unauthorized individuals.
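
As a hedged sketch of the kind of hardening involved, the statements below create a narrowly scoped role, lock down the public schema, and turn on server-side TLS. The role, database, and schema names are placeholders for illustration, and enabling SSL assumes certificate files are already in place on the server.

    -- Create an application role with only the privileges it needs (names are illustrative)
    CREATE ROLE app_user LOGIN PASSWORD 'use-a-strong-secret';
    GRANT CONNECT ON DATABASE app_db TO app_user;
    GRANT USAGE ON SCHEMA app_schema TO app_user;
    GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA app_schema TO app_user;

    -- Stop everyone from creating objects in the public schema
    REVOKE CREATE ON SCHEMA public FROM PUBLIC;

    -- Turn on server-side TLS (requires certificate and key files, then a reload)
    ALTER SYSTEM SET ssl = on;
    SELECT pg_reload_conf();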

To solve this, it is important to audit your activity and check that security best practices are being followed. If you are not doing this already, you should implement auditing immediately.
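
If you have no auditing in place, the pgaudit extension is one common option. The following is a minimal sketch, assuming the pgaudit package is installed on the server; note that changing shared_preload_libraries requires a restart and overwrites any existing list, so append rather than replace in a real system.

    -- Load pgaudit at server start (requires a restart; append to any existing list)
    ALTER SYSTEM SET shared_preload_libraries = 'pgaudit';

    -- After restarting, enable the extension and choose what to log
    CREATE EXTENSION IF NOT EXISTS pgaudit;
    ALTER SYSTEM SET pgaudit.log = 'ddl, role';
    SELECT pg_reload_conf();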

3. Vacuuming

Every database has to manage data over time, particularly when it comes to keeping that data organized. Vacuuming is how PostgreSQL deals with dead tuples, which would otherwise take up space, hurt performance, and lead to page splits if not removed. Carrying out vacuuming correctly can improve your overall database performance.

However, the vacuum task itself carries a significant performance cost, as the database has to be scanned to identify and remove the dead tuples. If this happens while your application is under normal traffic load, it can affect how the application performs and what your customers see.

To manage vacuuming well, plan ahead. One approach is to schedule the operation for periods with lower traffic volumes, to avoid impacting performance. Another is to tune autovacuum so that vacuuming runs only once enough dead tuples have accumulated to make the operation worthwhile, without it becoming excessively impactful, as in the sketch below.
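
As a hedged sketch, you can monitor dead tuples and tune autovacuum thresholds per table so that vacuuming kicks in when it is worthwhile. The table name and threshold values below are illustrative, not recommendations.

    -- See how many dead tuples each table is carrying
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;

    -- Make autovacuum trigger earlier on a busy table (illustrative values)
    ALTER TABLE orders SET (
        autovacuum_vacuum_scale_factor = 0.05,
        autovacuum_vacuum_threshold = 1000
    );

    -- Or run a manual vacuum during a quiet window
    VACUUM (ANALYZE, VERBOSE) orders;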

The latest version of PostgreSQL offers some additional help on this front. PostgreSQL 14 has improved the vacuuming process, so dead tuples can be detected and removed between vacuums, reducing page splits and index bloat. The vacuum system has also been enhanced to eagerly remove deleted pages: previously it took two vacuum cycles to do this, with the first marking the page as deleted and the second actually freeing up that space. This is another good reason to ensure you are running the latest version of the software, which brings new features and improvements.

4. Poor connection management

Every database must connect to an application to provide and receive data. Each application can hold a number of connections to the database to read and write that data. However, too many connections can lead to problems.

A big part of the problem is that developers look only at the connection itself, rather than factoring in its overheads. PostgreSQL's baseline allocation of 2-3MB per connection sounds modest, but it is not the whole story: the actual overhead is more like 20-50MB once you account for semaphores, lock management, and maintaining old snapshots. Improving connection management, and budgeting for each connection in real-world terms, can help improve performance.
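
A hedged sketch of how to see the real connection picture: compare the configured limit with actual usage, broken down by state, before deciding whether a pooler such as pgBouncer is needed.

    -- The configured connection limit
    SHOW max_connections;

    -- How many connections are actually in use, and what they are doing
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;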

5. Sticking with default settings

For some products, the default settings are the best options for the widest range of people. They help users get started quickly without requiring any changes. Databases do not fall into that category, and they require tuning to perform at their best for any given application.

As part of any implementation, you should check the default settings on your database and determine whether they are right for your application and use case. In PostgreSQL, you should be checking a number of settings, including cache sizes, shared buffers, and how auto-vacuuming is implemented. In addition, you should check your write-ahead logging (WAL) and checkpointing, and decide how you will be managing logs and debugging data. Each of these individual elements should be updated so that overall performance does not suffer.
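
A hedged sketch of reviewing and adjusting a few of these settings follows. The values shown are placeholders that depend entirely on your hardware and workload, not recommendations.

    -- Review the current values and where they came from
    SELECT name, setting, unit, source
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'effective_cache_size',
                   'work_mem', 'autovacuum', 'max_wal_size');

    -- Adjust them for your workload (illustrative values only)
    ALTER SYSTEM SET shared_buffers = '4GB';          -- requires a restart
    ALTER SYSTEM SET effective_cache_size = '12GB';
    ALTER SYSTEM SET log_min_duration_statement = '500ms';
    SELECT pg_reload_conf();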

Changing default settings also touches on the issue of security mentioned earlier. Many databases do not, by default, require a password or other credential to access them. This should be one of the first changes you make when you set up any new instance.

6. Under- and over-indexing

Creating an index of data within your database makes it easier and faster to return results when a query takes place. However, indexing should be approached based on the Goldilocks model – not too much and not too little – as it creates additional work for the database instance.

For example, you should not index every column in a database, as this leads to over-indexing and creates additional work and overheads that far exceed any improvement in performance. At the same time, any column you regularly filter on is a strong candidate for an index.

Equally, it is important to understand the available index types and apply the right one for your use case. For instance, multi-value data held in JSON is usually best handled with a GIN index, while GiST is better suited to geospatial data. Simple equality lookups can be handled with a hash index, and a B-tree is a reliable default that works well for many data types.
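
The statements below are a hedged sketch of matching index types to data; the table and column names are purely illustrative, and the GIN example assumes a JSONB column while the GiST example assumes the PostGIS extension.

    -- B-tree: the default, good for most equality and range queries
    CREATE INDEX idx_orders_created_at ON orders (created_at);

    -- GIN: multi-value data such as JSONB documents
    CREATE INDEX idx_orders_attrs ON orders USING gin (attributes);

    -- GiST: geospatial data (assuming a PostGIS geometry column)
    CREATE INDEX idx_places_geom ON places USING gist (geom);

    -- Hash: simple equality lookups only
    CREATE INDEX idx_sessions_token ON sessions USING hash (token);

    -- Spot indexes that are never used and may be candidates for removal
    SELECT indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0;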

7. Know your extensions

PostgreSQL stands out for how much functionality can be added to it through extensions. These can be very powerful additions that help performance and data handling, but they can also break your application if you are not careful.

The wrong extension can substantially increase memory consumption, affect hardware performance, and even crash servers. It is crucial that you fully understand the extensions you add to your PostgreSQL implementation, whether they are performing their role as expected, and whether you can get support for an extension if you start having issues. Alongside this, checking that extensions originate from known and reputable sources can help you avoid problems in the future.
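
A hedged sketch of keeping track of what is installed and what is available, so you can verify versions and provenance before relying on an extension:

    -- Extensions currently installed in this database, with their versions
    SELECT extname, extversion FROM pg_extension;

    -- Extensions shipped with the server packages that could be installed
    SELECT name, default_version, comment
    FROM pg_available_extensions
    ORDER BY name;

    -- Example: a widely used, well-supported extension for query statistics
    -- (pg_stat_statements must also be listed in shared_preload_libraries)
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;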

8. Missing backups and high availability

Business continuity planning should be included for all applications and database implementations. This means implementing high availability (HA) for applications that need to keep running through failures, and ensuring data is backed up for all database instances. For HA, check that your WAL replication approach is in place and working, and implement a tool like Patroni so the application carries on running when problems occur.

Attacks such as ransomware have put more emphasis on protecting business data and having a good backup process. For database backup, you need to ensure you have a minimum data retention period (at least seven days) in place, and that you have local and remote locations for those backups to be sent to. Finally, it is essential that you test that your backups actually work and that you can recover your data easily and effectively when required.
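
As one building block of backup and point-in-time recovery, you can turn on continuous WAL archiving. The sketch below is hedged: the archive location is a placeholder, and in practice many teams use a dedicated backup tool rather than a raw copy command.

    -- Enable WAL archiving (archive_mode requires a restart; the path is illustrative)
    ALTER SYSTEM SET archive_mode = on;
    ALTER SYSTEM SET archive_command =
        'test ! -f /mnt/backups/wal/%f && cp %p /mnt/backups/wal/%f';

    -- Confirm the settings after the restart
    SHOW archive_mode;
    SHOW archive_command;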

9. Tune your workloads

Every application is different. So why do developers often use the same database and the same settings for every application they create? You should be tuning your database to support each specific application workload.

To do this, look at your application data and see if patterns emerge. This observability data can help you understand short-term changes in demand over the course of a week or a month, as well as longer-term data growth and expansion. This data can then be used to determine how your application’s read and write data levels measure up, and how you can tune the application accordingly.

For example, can your working data set fit into the RAM installed in your machine or available in your cloud instance? In PostgreSQL, you might want to increase the shared_buffers value enough to hold your entire database, so that the whole working set of data can reside in cache. This can improve performance because you avoid round trips to disk to write and read data.
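
A hedged sketch of checking whether your working set could plausibly fit in memory: compare the database size with the current shared_buffers setting. Whether caching the whole database is worthwhile still depends on your access patterns.

    -- Size of the current database versus the configured buffer cache
    SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size,
           current_setting('shared_buffers') AS shared_buffers;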

You can also look at how data moves from the WAL to the data files on disk. This happens when a CHECKPOINT occurs, which is an expensive operation that can cause a huge amount of I/O, since all the dirty pages accumulated since the last checkpoint have to be written out. Increasing the time between checkpoints means each one happens less often but has more work to do: the system accumulates dirty pages until the threshold is met and then performs one big flush.
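
A hedged sketch of the checkpoint-related settings this trade-off involves; the values are illustrative and should be driven by your own I/O capacity and recovery-time targets.

    -- Spread checkpoint writes out and allow more WAL between checkpoints
    -- (illustrative values only)
    ALTER SYSTEM SET checkpoint_timeout = '15min';
    ALTER SYSTEM SET max_wal_size = '4GB';
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    SELECT pg_reload_conf();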

It is also worth looking at the database in the context of the whole application infrastructure. For example, looking at the Linux kernel set-up can offer some opportunities to improve performance around shared memory with SHMMAX and SHMALL. SHMMAX defines the maximum size for a single shared memory instance, and SHMALL covers the system-wide total amount of shared memory pages available. Setting up Huge Pages in Linux can help manage more data in each page, which makes it possible to use a large amount of memory with reduced overheads. This results in fewer page lookups, fewer page faults, and faster read/write operations through larger buffers, culminating in performance improvements.
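
On the PostgreSQL side, huge pages can be requested with the huge_pages setting; the kernel must also have huge pages reserved (for example via vm.nr_hugepages), which sits outside the database itself. A minimal sketch:

    -- Ask PostgreSQL to use huge pages if the kernel has them available
    -- ('try' falls back to normal pages; 'on' refuses to start if none are reserved)
    ALTER SYSTEM SET huge_pages = 'try';   -- requires a restart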

10. Poor database design

The biggest issue affecting database performance is poor database design in the first place. This applies whichever database you decide to run, not just PostgreSQL. Common examples include choosing the wrong data types and poor schema design.
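
As a hedged illustration of what "wrong data types" means in practice, the sketch below contrasts a schema that stores everything as text with one that uses types PostgreSQL can validate and index efficiently; the tables are purely illustrative.

    -- Harder to index, validate, and compare correctly
    CREATE TABLE events_loose (
        id         text,
        created_at text,          -- dates stored as strings
        amount     text           -- numbers stored as strings
    );

    -- Types carry meaning, enforce validity, and index efficiently
    CREATE TABLE events (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        created_at timestamptz NOT NULL DEFAULT now(),
        amount     numeric(12,2) NOT NULL
    );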

One issue for developers is over-reliance on object-relational mapping (ORM) tools to manage their databases. ORMs make it easier for developers to work with data, which is a good thing. However, relying on an ORM can mean that developers don't understand why performance is bad, or where the problems around data occur. This lack of database understanding can also lead to developers picking the wrong hardware or cloud instance to support the workload.

Ultimately, one of the best ways to improve database performance is to store less data. The amount of data we now create, combined with falling storage costs, means we hold far more data than we are ever likely to need, and we keep it for longer than necessary. Cutting data volume can help, but it takes time for developers to understand the principles behind those decisions.

Understanding how databases work is something all developers and IT professionals should have, even if they don't work with databases on a daily basis or get hands-on with performance tuning. The layers of abstraction between applications, data, and the people who put them together mean a lot of this knowledge is being lost.

The number of specialist database administrators within enterprises is dropping, as companies look to reskill their staff and cover more bases across software development, IT operations, cloud service, etc. We also now have more data at our disposal and more pressure to use it productively. By focusing attention on the right areas, you can improve performance and anticipate some of these issues before they become unmanageable.
