PostgreSQL High-Availability Cluster in Practice
A third-party application we run at work uses a PostgreSQL database, and in our scenario that database has to be highly available. Searching around, I found the high-availability solution stolon; before adopting it, let's first look into how it works.
Stolon's Architecture
Stolon is composed of 3 main components
- keeper: it manages a PostgreSQL instance converging to the clusterview computed by the leader sentinel.
- sentinel: it discovers and monitors keepers and proxies and computes the optimal clusterview.
- proxy: the client’s access point. It enforces connections to the right PostgreSQL master and forcibly closes connections to old masters.
For more details and requirements see Stolon Architecture and Requirements
The above is from the Stolon project's README. You can see that the design is essentially similar to the Redis Sentinel approach: both are sentinel models.
The sentinel component discovers and monitors the keepers and proxies, and computes the optimal cluster view.
Each keeper component manages one PostgreSQL instance and configures the instances in the cluster according to the cluster view computed by the sentinel, so that the cluster converges to that optimal state.
On top of that, to let clients access the PostgreSQL cluster transparently, a proxy component handles client requests and routes them to the cluster's current master node. This is better than the Redis Sentinel approach, because client drivers no longer need dedicated sentinel-mode support.
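To make the division of labor concrete, here is a rough sketch of starting the three components by hand, modeled on stolon's simple-cluster example; the cluster name, store backend, data directory, and credentials are all placeholder values, so check the upstream documentation for the exact flags of your version.

```sh
# sentinel: discovers keepers/proxies and computes the cluster view
stolon-sentinel --cluster-name stolon-cluster --store-backend=etcdv3

# keeper: manages one PostgreSQL instance, converging it to the cluster view
stolon-keeper --cluster-name stolon-cluster --store-backend=etcdv3 \
  --uid keeper0 --data-dir data/postgres0 \
  --pg-su-password=supassword \
  --pg-repl-username=repluser --pg-repl-password=replpassword \
  --pg-listen-address=127.0.0.1

# proxy: the client entry point, always pointing at the current master
stolon-proxy --cluster-name stolon-cluster --store-backend=etcdv3 --port 25432
```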
Installing Stolon
The official documentation describes how to deploy a Stolon cluster on Kubernetes. It uses separate YAML manifests for the 3 components, which is still somewhat tedious; fortunately I found a corresponding helm chart.
That makes deploying a Stolon cluster very simple:
```sh
git clone https://github.com/lwolf/stolon-chart
cd stolon-chart/stolon
helm install --namespace test --name stolon . \
  --set store.backend=kubernetes \
  --set persistence.enabled=true \
  --set persistence.storageClassName=defaultScName
```
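After the install, a quick sanity check to confirm the release's pods and services came up (the exact resource names are generated by the chart, so the output is what tells you the real names):

```sh
# list everything the stolon release created in the test namespace
kubectl get pods,svc,statefulset -n test
```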
The way stolon-chart used the storageclass was not quite standard, so I fixed it along the way and sent them a PR, though there seems to have been no response.
Then other pods inside the Kubernetes cluster can reach it by using the stolon-proxy service's FQDN. For example, a Stolon cluster deployed with the command above can be accessed at the following address:

```
stolon-stolon-proxy.test.svc.cluster.local:5432
```
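As a quick connectivity check, something like this can be run from a throwaway client pod; the user, password handling, and database name depend on how the chart configured PostgreSQL, so treat them as placeholders:

```sh
# start a temporary client pod and open a psql session through the proxy
kubectl run psql-client --rm -it --restart=Never --namespace test \
  --image=postgres:11 -- \
  psql "host=stolon-stolon-proxy.test.svc.cluster.local port=5432 user=postgres dbname=postgres"
```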
Verifying High Availability
Since this is about solving the high-availability problem, just deploying it straight from the official docs is not reassuring; the failure scenario still has to be simulated. The official docs describe the verification method:
In a single node setup we can kill the current master keeper pod but usually the statefulset controller will recreate a new pod before the sentinel declares it as failed. To avoid the restart we’ll first remove the statefulset without removing the pod and then kill the master keeper pod. The persistent volume will be kept so we’ll be able to recreate the statefulset and the missing pods will be recreated with the previous data.
```sh
kubectl delete statefulset stolon-keeper --cascade=false
kubectl delete pod stolon-keeper-0
```
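While the master keeper pod is gone, follow the leader sentinel's log to watch the election; the label selector below is an assumption about how the chart labels its pods, so adjust it to whatever `kubectl get pods` shows:

```sh
# stream the sentinel logs during the failover test
kubectl logs -n test -f -l app=stolon-sentinel
```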
You can take a look at the leader sentinel log and will see that after some seconds it’ll declare the master keeper as not healthy and elect the other one as the new master:
```
no keeper info available db=cb96f42d keeper=keeper0
no keeper info available db=cb96f42d keeper=keeper0
master db is failed db=cb96f42d keeper=keeper0
trying to find a standby to replace failed master
electing db as the new master db=087ce88a keeper=keeper1
```
Now, inside the previous `psql` session you can redo the last select. The first time `psql` will report that the connection was closed and then it successfully reconnected:

```
postgres=# select * from test;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=# select * from test;
 id | value
----+--------
  1 | value1
(1 row)
```
In short: delete the stolon-keeper of the current master node, then look at the sentinel's log, where you can clearly see a new master being elected. The PostgreSQL client then reconnects to the new master through the original address, and the verification succeeds.
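To put the cluster back together after the test, the StatefulSet can be recreated; because the persistent volumes were kept, the recreated pods come back with the previous data. A minimal sketch, assuming the same release and values as the original install:

```sh
# re-apply the chart so the deleted StatefulSet is recreated;
# the kept PVs are re-bound and the keeper pods recover their data
helm upgrade stolon . --namespace test \
  --set store.backend=kubernetes \
  --set persistence.enabled=true \
  --set persistence.storageClassName=defaultScName
```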
Another small trick learned here: when deleting a deployment, statefulset, etc., adding `--cascade=false` keeps the pods that belong to those resources alive.
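For reference, on newer kubectl versions (v1.20+) the boolean form is deprecated in favor of an explicit deletion policy; the equivalent is:

```sh
# orphan the dependent pods instead of cascading the delete
kubectl delete statefulset stolon-keeper --cascade=orphan
```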