构建高可用性集群时的常见错误以及如何避免它们。亚历山大·库库什金



您刚刚安装了PostgreSQL并启动了您的第一个集群,创建了一些表,加载了数据,甚至对PostgreSQL配置进行了一些调整以提高性能。现在,您正在考虑如何使群集高度可用。不幸的是,PostgreSQL不知道在向导不可用时如何自动切换自身,但是幸运的是,这可以使用第三方实用程序来实现。任务很明确,您开始研究所有实用程序的优缺点,以便选择最佳实用程序。并且...您已经走错了轨道,因为您必须首先确定SLA,RTO和RPO值。在本演讲中,我计划介绍数据库管理员在设置和操作高可用性自动故障转移Postgres集群时犯的许多错误。





! . 6 . Zalando. , high availability postgres’ Patroni.





, , Zalando. , Lamoda. .





Postgres. -.



4- Amazon. Amazon. , , . Amazon EC2 instances, Kubernetes, EC2 instances.





, - . - high availability. , , high availability disaster recovery, .



, , high availability failover.



, . , , .



, HA , , . .





( ?). HA?





. ? . , . , , , .. . .





. - , , , , , BIOS - .



downtime. - , . , , - , : . . . , , . . . , , update delete where clause. , .





. . , --.



, , – . . 15 .



99,95 %, . . . ? , , Amazon, RDS, . , , , , .



, .



, . . , Google Amazon , , .





- ? . IBM, . , .



, , - - . - agreed.





? , level an agreed. , . . , , , , , . . - , , .



, , - level indication. , availability. level objective – , . , . .. , availability , , . , , - .





, .



:



  • .
  • .
  • - .
  • .
  • .




? . disaster recovery, .. .





HA, . - .



DBA , -. , - disaster, .



. RPO (recovery point objective), . ., . RTO (recovery time objective). .



, SLA, SLI, SLA , . RTO – .





https://en.wikipedia.org/wiki/File:RPO_RTO_example_converted.png



disaster recovery. RPO, RTO. . RTO ( ).



, . , . , .



, , , . PostgreSQL ( ), . - (), .



RPO RTO – . . . , , .





Postgres? , failover . , - . Postgres.



, - archive command. - -. . . , Postgres WAL - -. , -, archive_timeout , . . , . - , , 5 30 , .



pg_receivewal - . , . archive_command.



, - , . , , .



, , Postgres . , - . , , , . Zalando - , . , - , , .



RTO . , , RTO , 15 , failover , , . . . - , DBA. , . , - . , - , DBA .



, failover.





- failover, . , , . . , availability, .



- , – , , - . , , - .





. , . . - . RPO RTO.





?





: « . . , . ».



Postgres? PostgreSQL XC/XL. , – global transaction manager. . - global transaction manager.



BDR . - . , – eventual consistency. , - . , . , , - , . - .



Eventual consistency – - . . . . . , , - , .



Postgres Pro. - . ? – . – , , . . . latency . , , , . , , , , , .





?



-, quorum. Quorum . . . . Google, : « Google ?». : «, ». ? - Google, , , . quorum .



, , — fencing. - , , . , , STONITH (Shoot The Other Node In The Head) , . . - , , .



. - switch . . - .



– watchdog. , Postgres, , , , .



Linux watchdog, . , - , , split-brain.





quorum, fencing, watchdog ? . . , GitHub.





https://github.com/MasahikoSawada/pg_keeper



? , - , slave, . , slave.



? network split, . . . slave . – split-brain.



, . Masahiko Sawada. , .





? - , primary, standby, , .





, – . , . , , .





, - network split, standby. .





standby. . . fencing . , , , , - , .





- , quorum. Quorum , , , , Etcd, , . . Patroni. Patroni Etcd, (), .



, Etcd, standby , .





.





https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/



– , , . GoCardless. Corosync + Pacemaker. , , .



Pacemaker quorum. IP , , watchdog. , , .





? RAID-. , . ? , () , Pacemaker , . . , .



, , , postgres’ . crash recovery, , failover .



, Pacemaker. , Pacemaker .



.





? ?



, , , . . . , Airbus A380. . - , .



, postgres’ , , . .



- , , . , , . .





https://github.blog/2018-10-30-oct21-post-incident-analysis/



. . GitHub, Postgres , .



GitHub -. , . . . , , , . , Jobs GitHub.com . Latency - 60 .





? . . , .



Jobs GitHub.com , , 60 . . . , , ( ).



, , . , , . .





? Failover . failover, , , , .



. Pg_rewind . MySQL, , . Postgres , .





https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/



, . GitLab. . .



- , . . WAL- , , .



pg_basebackup. pg_basebackup, . Ctrl+C, PGDATA, – .



? pg_basebackup checkpoint spread, . . , checkpoint. verbose mode, pg_basebackup.



- .



? , , . .



Pg_dump, , . , , , c . - , Postgres , , pg_dump .



Postgres, pg_dump. . 9.5, pg_dump 9.2 9.5. . . , .



– Microsoft Azure. , - , .



– LVM, . , staging. , – , . , .



. 6 . . . , . 24 . , , 6- .





?



RPO RTO , .



, - , , RPO 24 , .



. , .



, . Runbooks. , - . – pg_basebackup , .



, , . pg_basebackup , , . , . disaster.



, . . , .





, .





. , Patroni , . . CPU, . ., , , . , , . , , checkpoints . . .





. , alert. RTO ( : RTO, RPO) , , . . , Postgres, , . .





: « Patroni postgres’ ?». Patroni, , max connections , , .



Linux, huge pages, shared memory, semaphores, overcommit . .



Patroni postgres’ , . . shared_buffers, max_wal_size, checkpoint completion_target, random_page_cost . . . , , .





.



. RPO RTO. .





, RTO, , , HA, , , .



availability, . .



, , , .



. , . . disaster recovery .



. , ! , .





Questions



. , . , . . , HA . , . -. . . . disaster recovery – .



. , , , . , , , . , . .



, , . . , , , . – 100 % .



IBM, , .



, 99,99. .



, , . : IBM , . , . . , .



, , 10 .



, , . .



. , – . , . , . . .



switch - , . .



?



.



, , , switches?



, , .



, Patroni. , , - , . Patroni ?



Patroni recovery.conf . recovery_min_apply_delay, . Patroni – load balancing , , , stale . failover .



, !



! . , , , HA? , , witness. . Witness – . . . ?



, . , quorum. Quorum – . , . , . , . , . Postgres. . . -, quorum.



!



! , Zalando, , Amazon. , Amazon , ?*



, . - – . - . . . . , , . , . , , , . Amazon . . , . . . . Amazon .



, - , . , .



! . , Patroni? , , .



没有错。



您是否打算将Patroni现在将json中的Patroni斜杠返回的度量集成到Prometheus格式中?



基本上,Patroni是一个开源项目。如果需要,您可以实施拉取请求。我将很高兴看到它并重击它。



谢谢!



我认为,甚至在该帐户上开了一张票。



没有其他问题了,谢谢大家!




All Articles