您刚刚安装了PostgreSQL并启动了您的第一个集群,创建了一些表,加载了数据,甚至对PostgreSQL配置进行了一些调整以提高性能。现在,您正在考虑如何使群集高度可用。不幸的是,PostgreSQL不知道在向导不可用时如何自动切换自身,但是幸运的是,这可以使用第三方实用程序来实现。任务很明确,您开始研究所有实用程序的优缺点,以便选择最佳实用程序。并且...您已经走错了轨道,因为您必须首先确定SLA,RTO和RPO值。在本演讲中,我计划介绍数据库管理员在设置和操作高可用性自动故障转移Postgres集群时犯的许多错误。
! . 6 . Zalando. , high availability postgres’ Patroni.
, , Zalando. , Lamoda. .
Postgres. -.
4- Amazon. Amazon. , , . Amazon EC2 instances, Kubernetes, EC2 instances.
, - . - high availability. , , high availability disaster recovery, .
, , high availability failover.
, . , , .
, HA , , . .
( ?). HA?
. ? . , . , , , .. . .
. - , , , , , BIOS - .
downtime. - , . , , - , : . . . , , . . . , , update delete where clause. , .
. . , --.
, , – . . 15 .
99,95 %, . . . ? , , Amazon, RDS, . , , , , .
, .
, . . , Google Amazon , , .
- ? . IBM, . , .
, , - - . - agreed.
? , level an agreed. , . . , , , , , . . - , , .
, , - level indication. , availability. level objective – , . , . .. , availability , , . , , - .
, .
:
- .
- .
- - .
- .
- .
? . disaster recovery, .. .
HA, . - .
DBA , -. , - disaster, .
. RPO (recovery point objective), . ., . RTO (recovery time objective). .
, SLA, SLI, SLA , . RTO – .
https://en.wikipedia.org/wiki/File:RPO_RTO_example_converted.png
disaster recovery. RPO, RTO. . RTO ( ).
, . , . , .
, , , . PostgreSQL ( ), . - (), .
RPO RTO – . . . , , .
Postgres? , failover . , - . Postgres.
, - archive command. - -. . . , Postgres WAL - -. , -, archive_timeout , . . , . - , , 5 30 , .
pg_receivewal - . , . archive_command.
, - , . , , .
, , Postgres . , - . , , , . Zalando - , . , - , , .
RTO . , , RTO , 15 , failover , , . . . - , DBA. , . , - . , - , DBA .
, failover.
- failover, . , , . . , availability, .
- , – , , - . , , - .
. , . . - . RPO RTO.
?
: « . . , . ».
Postgres? PostgreSQL XC/XL. , – global transaction manager. . - global transaction manager.
BDR . - . , – eventual consistency. , - . , . , , - , . - .
Eventual consistency – - . . . . . , , - , .
Postgres Pro. - . ? – . – , , . . . latency . , , , . , , , , , .
?
-, quorum. Quorum . . . . Google, : « Google ?». : «, ». ? - Google, , , . quorum .
, , — fencing. - , , . , , STONITH (Shoot The Other Node In The Head) , . . - , , .
. - switch . . - .
– watchdog. , Postgres, , , , .
Linux watchdog, . , - , , split-brain.
quorum, fencing, watchdog ? . . , GitHub.
https://github.com/MasahikoSawada/pg_keeper
? , - , slave, . , slave.
? network split, . . . slave . – split-brain.
, . Masahiko Sawada. , .
? - , primary, standby, , .
, – . , . , , .
, - network split, standby. .
standby. . . fencing . , , , , - , .
- , quorum. Quorum , , , , Etcd, , . . Patroni. Patroni Etcd, (), .
, Etcd, standby , .
.
https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
– , , . GoCardless. Corosync + Pacemaker. , , .
Pacemaker quorum. IP , , watchdog. , , .
? RAID-. , . ? , () , Pacemaker , . . , .
, , , postgres’ . crash recovery, , failover .
, Pacemaker. , Pacemaker .
.
? ?
, , , . . . , Airbus A380. . - , .
, postgres’ , , . .
- , , . , , . .
https://github.blog/2018-10-30-oct21-post-incident-analysis/
. . GitHub, Postgres , .
GitHub -. , . . . , , , . , Jobs GitHub.com . Latency - 60 .
? . . , .
Jobs GitHub.com , , 60 . . . , , ( ).
, , . , , . .
? Failover . failover, , , , .
. Pg_rewind . MySQL, , . Postgres , .
https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
, . GitLab. . .
- , . . WAL- , , .
pg_basebackup. pg_basebackup, . Ctrl+C, PGDATA, – .
? pg_basebackup checkpoint spread, . . , checkpoint. verbose mode, pg_basebackup.
- .
? , , . .
Pg_dump, , . , , , c . - , Postgres , , pg_dump .
Postgres, pg_dump. . 9.5, pg_dump 9.2 9.5. . . , .
– Microsoft Azure. , - , .
– LVM, . , staging. , – , . , .
. 6 . . . , . 24 . , , 6- .
?
RPO RTO , .
, - , , RPO 24 , .
. , .
, . Runbooks. , - . – pg_basebackup , .
, , . pg_basebackup , , . , . disaster.
, . . , .
, .
. , Patroni , . . CPU, . ., , , . , , . , , checkpoints . . .
. , alert. RTO ( : RTO, RPO) , , . . , Postgres, , . .
: « Patroni postgres’ ?». Patroni, , max connections , , .
Linux, huge pages, shared memory, semaphores, overcommit . .
Patroni postgres’ , . . shared_buffers, max_wal_size, checkpoint completion_target, random_page_cost . . . , , .
.
. RPO RTO. .
, RTO, , , HA, , , .
availability, . .
, , , .
. , . . disaster recovery .
. , ! , .
Questions
. , . , . . , HA . , . -. . . . disaster recovery – .
. , , , . , , , . , . .
, , . . , , , . – 100 % .
IBM, , .
, 99,99. .
, , . : IBM , . , . . , .
, , 10 .
, , . .
. , – . , . , . . .
switch - , . .
?
.
, , , switches?
, , .
, Patroni. , , - , . Patroni ?
Patroni recovery.conf . recovery_min_apply_delay, . Patroni – load balancing , , , stale . failover .
, !
! . , , , HA? , , witness. . Witness – . . . ?
, . , quorum. Quorum – . , . , . , . , . Postgres. . . -, quorum.
!
! , Zalando, , Amazon. , Amazon , ?*
, . - – . - . . . . , , . , . , , , . Amazon . . , . . . . Amazon .
, - , . , .
! . , Patroni? , , .
没有错。
您是否打算将Patroni现在将json中的Patroni斜杠返回的度量集成到Prometheus格式中?
基本上,Patroni是一个开源项目。如果需要,您可以实施拉取请求。我将很高兴看到它并重击它。
谢谢!
我认为,甚至在该帐户上开了一张票。
没有其他问题了,谢谢大家!