Patroni and Stolon: installation and handling failures. Maxim Milyutin



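Patroni and Stolon are the two best-known and most advanced solutions for orchestration and high availability (automatic failover) of PostgreSQL clusters in a Leader-Followers configuration. However, engineers migrating from older, proven solutions (Corosync & Pacemaker), or coming from other DBMSs, run into difficulties when installing these tools and lack an understanding of what each component does. This master class walks through a typical installation of Patroni and Stolon clusters on virtual machines (not in containers) and examines how these clusters behave under various infrastructure failures. The whole process is demonstrated on three virtual machines run by vagrant from pre-built images. Attendees who have prepared the environment in advance can follow along.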



PGConf.Russia



! . Ozon . . Postgres Pro Patroni Stolon. .





-. , Stolon, Patroni . .



, Ansible , Postgres Pro , .



Patroni , , — https://github.com/vitabaks/postgresql_cluster. .





, .



  • PostgreSQL – shared-nothing, .
  • . , .
  • hot standby, . . .
  • :
  • pg_basebackup , , .
  • . standby .
  • pg_rewind, standby.
  • , .




https://eng.uber.com/mysql-migration/



https://github.com/sorintlab/stolon/issues/519



https://github.com/zalando/patroni/issues/538



  • 10- PostgreSQL . , , , . , , , . Write amplification, - , , WAL full page images, checkpoint. hit beat . . WAL. « PostgreSQL MySQL» .



  • .



  • , , DDL, sequence, , , . WAL. WAL -. GTID MySQL, CSN MS SQL Server.



  • pg_rewind.



  • Stolon Patroni , , , rolling upgrade Postgres .







, ? , . . - , health checks - .





, , – promote . .





, , , .





, ? , promote . .





, split brain . - , .



, , , .



, . .





? Postgres , . , , , , .





? , , , - .



– . , read only. .





fail. , . , .





https://github.com/citusdata/pg_auto_failover



https://github.com/citusdata/pg_auto_failover/issues/12#issuecomment-490551255



. . pg_auto_failover Citus Data.



. , . pg_stat_replication.





, . . , , . primary ( ) , .



, , . , , .



fail. , .





, , .





, . .



, . , , .



.





, . DCS (Distributed Configuration System – ). IP , .



DCS – Consul, Etcd, Raft Zookeeper, Zab. Zab – Paxos.



, DCS.



Patroni/ Stolon.



Postgres Postgres .





, Patroni/ Stolon.



  • -, autofailover. - .
  • . PostgreSQL.
  • , Kubernetes.
  • DBaaS (database as a service).
  • – . , - . , - .


(DCS) Etcd





https://raft.github.io/



. DCS. . , «» . DCS, , .



? . , Postgres, , DCS , , split , split brain. , fail DCS .



, DCS 3-5-7 , , 3- . ? . net split, , DCS.
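
The reason for an odd member count (3, 5 or 7) can be shown with a little arithmetic. The sketch below assumes the usual Raft majority rule; the function names `quorum_size` and `tolerated_failures` are made up for this illustration.

```shell
# Majority needed by a Raft-based DCS with N voting members.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# How many members may be lost while a majority still exists.
tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 3 5 7; do
    echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

An even member count raises the quorum without raising the number of tolerated failures, which is why 4 or 6 members are rarely worth it.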



Etcd RAFT . .





DCS , follower PostgreSQL. RAFT.



. . .



, . follower, . . - RTT fsync.



, follower, . , , . . .



, - .



14 42 .



vagrant status
Current machine states:

node1                     running (virtualbox)
node2                     running (virtualbox)
node3                     running (virtualbox)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.


. vagrant.





: , . , , . . .





. . , .





Etcd . Etcd , Etcd.





config Etcd. , Etcd, , IP , . . ETCD_LISTEN_CLIENT_URLS . ETCD_LISTEN_PEER_URLS .



ETCD_ADVERTISE_CLIENT_URLS ETCD_INITIAL_ADVERTISE_PEER_URLS. . discovery, .



: ETCD_HEARTBEAT_INTERVAL ETCD_ELECTION_TIMEOUT.
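
To make the variables above concrete, here is a minimal sketch that renders an environment file for one Etcd member. The node IP addresses, the data directory and the helper name `render_etcd_env` are assumptions for illustration, not values taken from the demo; ports 2379 and 2380 and the 100 ms / 1000 ms timings are the Etcd defaults.

```shell
# Render an environment file for one etcd member.
# Usage: render_etcd_env NAME IP INITIAL_CLUSTER
render_etcd_env() {
    name=$1
    ip=$2
    cluster=$3
    cat <<EOF
ETCD_NAME=$name
ETCD_DATA_DIR=/var/lib/etcd
ETCD_LISTEN_CLIENT_URLS=http://$ip:2379,http://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=http://$ip:2380
ETCD_ADVERTISE_CLIENT_URLS=http://$ip:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://$ip:2380
ETCD_INITIAL_CLUSTER=$cluster
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
EOF
}

# Hypothetical addresses for a three-node lab cluster.
render_etcd_env node1 172.20.20.11 \
    "node1=http://172.20.20.11:2380,node2=http://172.20.20.12:2380,node3=http://172.20.20.13:2380"
```

The listen URLs are what the process binds to, while the advertise URLs are what it tells clients and peers to use, so they differ only when the node sits behind NAT or has several interfaces.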





. . . Ansible. . , .





. Etcd .





, term 2. Term – timeline PostgreSQL. term .





etcdctl member list. , () , followers.
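
When repeating the failure scenarios it helps to extract the current leader from the member list automatically. The sketch below assumes the v2-style `etcdctl member list` output, where the leader line carries `isLeader=true`; the sample IDs and addresses are invented, and `leader_name` is a hypothetical helper.

```shell
# Print the name of the member flagged isLeader=true.
# Reads `etcdctl member list` output (v2 API format) on stdin.
leader_name() {
    awk '/isLeader=true/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^name=/) { sub(/^name=/, "", $i); print $i }
    }'
}

# Made-up sample output; against a live cluster use:
#   etcdctl member list | leader_name
leader_name <<'EOF'
1a2b3c4d5e6f7a8b: name=node1 peerURLs=http://172.20.20.11:2380 clientURLs=http://172.20.20.11:2379 isLeader=false
2b3c4d5e6f7a8b9c: name=node2 peerURLs=http://172.20.20.12:2380 clientURLs=http://172.20.20.12:2379 isLeader=true
3c4d5e6f7a8b9c0d: name=node3 peerURLs=http://172.20.20.13:2380 clientURLs=http://172.20.20.13:2379 isLeader=false
EOF
```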



sudo pkill -STOP etcd


. , fail , . Etcd , . . .





. . , term.





, , .





«etcdctl cluster-health». , . .





Etcd. , . term follower’.





- . . ? – . Etcd . «comcast». API tables Etcd. , .



? «comcast --device eth1 --packet-loss 100%».





. , . time line. , -, . , term 4.





. , heartbeat_interval election_timeout. , followers , heartbeat , followers , . follower heartbeat - - -, . .



, , - . , . heartbeat_interval – 100 . , -, . election_timeout – .





. . , , RTT , election_timeout. Election_timeout . Ansible. .
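
The relation between RTT and the two timeouts can be sketched as a small helper. It follows the rule of thumb from the Etcd tuning guide (heartbeat interval around the round-trip time, election timeout about ten times that); flooring both values at the Etcd defaults is an assumption of this sketch, and `suggest_timeouts` is a made-up name.

```shell
# suggest_timeouts RTT_MS
# Prints heartbeat interval and election timeout in milliseconds.
suggest_timeouts() {
    heartbeat=$1
    # Never go below etcd's default heartbeat of 100 ms (assumption of this sketch).
    if [ "$heartbeat" -lt 100 ]; then
        heartbeat=100
    fi
    # Rule of thumb: election timeout is about 10x the heartbeat interval.
    election=$(( heartbeat * 10 ))
    echo "ETCD_HEARTBEAT_INTERVAL=$heartbeat"
    echo "ETCD_ELECTION_TIMEOUT=$election"
}

suggest_timeouts 1      # LAN-like round trip
suggest_timeouts 600    # slow link
```

Larger timeouts make the cluster more stable on a slow network at the price of slower detection of a failed leader.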



comcast --device eth1 --stop



: comcast --device eth1 --latency 600. .





latency 600 . 600 – . RTT 200 .





ping . RTT 1 .





. , term . . , - , term. .





, heartbeat_interval election_timeout. , heartbeat , election_timeout 10 . Ansible. . Etcd-config. , . , . . , -. Etcd .





. . follower’.





member list, , , followers .



, , , , - 10 .





- Etcd, . bar. Deadline exceeded – , , . Etcd. timeout . 5 . total_timeout , 10 .





«get», . -. .





. , .





. Election_timeout , heartbeat 100 .



, RAFT - . , : , , . .





. Etcdctl member list. . – follower.





. bash – comcast – device, . . . - sleep . Comcast – device eth1 – stop sleep 1,5. done . , , . .
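
The flapping-network loop described above can be sketched roughly as follows. The cycle count, the length of the "down" phase and the `RUN` dry-run switch are assumptions added for this illustration; only the `comcast` invocations and the 1.5 s pause come from the talk.

```shell
# flap_link DEVICE CYCLES DOWN_SECONDS UP_SECONDS
# Repeatedly cuts and restores traffic on DEVICE.
# Set RUN=echo to print the commands instead of executing them.
flap_link() {
    dev=$1
    cycles=$2
    down=$3
    up=$4
    i=0
    while [ "$i" -lt "$cycles" ]; do
        ${RUN:-} comcast --device "$dev" --packet-loss 100%
        sleep "$down"
        ${RUN:-} comcast --device "$dev" --stop
        sleep "$up"
        i=$(( i + 1 ))
    done
}

# Dry run: shows what two cycles would execute.
RUN=echo flap_link eth1 2 0 0

# Real run on a lab node (needs root and comcast installed):
#   flap_link eth1 20 1.5 1.5
```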





Etcd. , term , , - , term, . Term . . . .





, , Etcd, . 1 . , . . . , , Etcd fsync . , .



. comcast --device eth1 --stop.





https://github.com/etcd-io/etcd/blob/master/Documentation/tuning.md#time-parameters

https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#example-hardware-configurations



. Etcd , .



, , .



, Etcd , , , , .



Patroni Stolon. , .





. netsplit, , DCS. , , Postgres , DCS . , Postgres, .



DCS. , . . . , DCS Patroni Stolon.





Stolon.





. DCS stolon-sentinel, . DCS : election, , statefull .



Postgres’ Stolon-keeper. – stolon-proxy, .





https://github.com/sorintlab/stolon/issues/313



3- , . , , 2- sentinel, . , 2- stolon-proxy. , , 2- Stolon-keeper, postgres-.



41 20



, . . , . , stolon’ . -, . Etcd . . . . – superuser, . . , .



Stolon. stolon.d/test-cluster.conf. , «test.cluster» . , , . Postgres, -. ,



- . , . Superuser, Stolon-keeper . . . , .



«test.cluster»? system/system/Stolon-keeper@.service. template-, . , - . ? Stolon, … . , , - , -, .



Ansible. . . , . . . Stolon-keeper. Name=Stolon-keeper@test-cluster state=started enable=on. .



. Test-cluster. , . lock - . , : Stolon-keeper, sentinel proxy . .



sentinel. . , , DCS. . . . sentinel , sentinel . . State=started enabled=on. - . , . , test-cluster. . , - . .



.





https://postgrespro.ru/docs/postgrespro/12/server-shutdown

https://github.com/sorintlab/stolon/issues/707



workflow Stolon:



  • «stolonctl init».
  • PostgreSQL pg_hba update.
  • , PostgreSQL , , , . . , Keeper, post-master. Stolon-keeper PostgreSQL.
  • «automaticPgRestart», postgres- .
  • , . , max_connections, max_locks_per_transaction postgres-. . , , «max_connections» «max_locks_per_transaction». , , , . .
  • – Stolon-keeper. – Stolon-keeper. . , .


, pg_hba. , hba. /opt/stolon/test-cluster. . . Stolon-test-cluster-spec.json. , . . , .



.





https://github.com/sorintlab/stolon/blob/master/doc/initialization.md

https://github.com/sorintlab/stolon/blob/master/doc/standbycluster.md



Stolon :



  • – .
  • – PITR, . standby cluster.
  • – existing. , DCS. DCS , , . , «existing».


. initdb, checksums, , pg_rewind . . Stolonctl. . .



Keeper, . . Keeper , sentinel, . , initdb, . standby.



. «status». , Keepers, health checks Keepers Postgres, . , , . sentinel.



, . wantedgeneration currentgeneration. Stolon-keeper . sentinel , , , . Keeper . .



. json, . . . Keepers . , , . , . . . , . Etcd .



. : Etcd . , Etcd. , . , , Consul. Consul , . , , , , Stolon-keeper . Postgres, Stolon . , Stolon-keeper. systemd, on abort, kill -9 .



Postgres. kill -9 , . . – . . Stolon-keeper, Ok.



. . - . Postgres . Stolon-keeper . Postgres. .



. fail. Postgres-. , . pgbench.



- , Postgres, ? select , , select.



, checksums, , checksums , . Postgres , . , , checksums , - . Postgres. Patroni/Stolon .



pgbench. . , . 25432. . . Stolon/test-cluster/postgres/pg_hba.conf.



, Stolon superuser, , . , .



. «default», . «pg_hba». «update». json- pgHBA . local all postgres trust. – host all postgres 172.20.20.0/24 trust.
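
As a rough illustration of the update step, the sketch below builds the JSON patch for the `pgHBA` field and prints the corresponding `stolonctl update --patch` call. The helper name `pghba_patch`, the `etcdv3` store backend and the exact wording of the first rule are assumptions; rules containing double quotes would need escaping, which this sketch does not do.

```shell
# Build a JSON patch for the pgHBA cluster-spec field
# from pg_hba.conf rules passed as arguments.
pghba_patch() {
    printf '{ "pgHBA": ['
    sep=' '
    for rule in "$@"; do
        printf '%s"%s"' "$sep" "$rule"
        sep=', '
    done
    printf ' ] }\n'
}

patch=$(pghba_patch \
    "local all postgres trust" \
    "host all postgres 172.20.20.0/24 trust")

# Print the command instead of running it; drop the echo on a real cluster.
echo "stolonctl --cluster-name test-cluster --store-backend etcdv3 update --patch '$patch'"
```

Stolon regenerates pg_hba.conf from the cluster specification, so edits made directly to the file on a keeper node do not survive.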



, . . , Postgres. . Create user postgres superuser. , Postgres . pgbench . HBA user test. Patroni. .



while. 20 , . , . .





Stolon . :



  • sleepInterval – .
  • requestTimeout – deadline PostgreSQL. Deadline DCS – 5 .
  • failInterval – , sentinel , . Sentinel failInterval, , . , , , . . - , . . failInterval .




autofailover Stolon?



1 – fail . Stolon-keeper Postgres . sentinel. , sentinel. . sleepInterval. 10 .





2 – - , , sentinel. , Keeper .





3 – sentinel. Keepers. sleepInterval.



: (λ1 + λ2) * sleepInterval. . .





4 – . DCS. sentinel , .





, , DCS sentinel , failover 25 50 .



fail sentinel’, failover sentinel. sentinel. failover .





, Stolon-proxy Keeper , Keeper read only . Postgres. Postgres Stolon-proxy.



. DCS, , , , .





  • Stolon. Stolon . , DCS . , «deadKeeperRemovalInterval». 48 . , DCS. , . , , WAL. 48 , .



  • , Stolon . . , -, deadlines - Postgres. , dbWaitReadyTimeout deadline . – 60 . checkpoints, deadline .



  • syncTimeout – deadline . 30 . , . .



  • initTimeout – deadline , initdb .



  • -. convergenceTimeout. , Keeper . -. Stolon . - -, Stolon .







Patroni.





Patroni, , , . ? Stolon. Patroni . DCS, , Patroni.





Patroni, . . , DCS time to live . , , . . , - . s… . Patroni , WAL-, REST API, , . WAL . Proxy – .





. . . 3- Etcd. Postgres Pro HAProxy confd, Etcd .



2- Patroni. Patroni Postgres.





https://patroni.readthedocs.io/en/latest/existing_data.html



Patroni , . basebackup’ . Patroni , , .



basebackup. , , tablespace.





https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-ADMIN



workflow Patroni. , bootstrap. , , . . Stolon, , , . bootstrap. .



Patroni? postgresql.conf pg_hba.conf, recovery.conf DCS , Stolon. . .



Patroni postgres-. , , .



, – , Patroni.





https://github.com/zalando/patroni/blob/master/haproxy.cfg

https://github.com/zalando/patroni/tree/master/extras/confd



.



– . . . Patroni- REST API endpoints, , , .



HAProxy, healthchecks Patroni.
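
What such a health check boils down to can be sketched in a few lines. The sketch assumes Patroni's REST API on its default port 8008, where `GET /` answers 200 only on the leader and `GET /replica` answers 200 only on a healthy replica; `backend_state` and `probe` are hypothetical helper names.

```shell
# Translate the HTTP status of a Patroni REST API check into a backend state.
backend_state() {
    if [ "$1" = "200" ]; then
        echo up
    else
        echo down
    fi
}

# probe HOST PATH: print the HTTP status of a Patroni endpoint
# ("000" if the node is unreachable within 2 seconds).
probe() {
    curl -s -o /dev/null -m 2 -w '%{http_code}' "http://$1:8008$2"
}

# Example (needs a running Patroni cluster):
#   backend_state "$(probe node1 /)"          # "up" only on the leader
#   backend_state "$(probe node2 /replica)"   # "up" only on a healthy replica
```

HAProxy applies the same logic with an HTTP check per backend, so write traffic reaches only the node that currently answers 200 on the leader endpoint.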



Patroni callbacks. , . .



HAProxy , DCS HAProxy. HAProxy + confd. consul-template. . .



10- Postgres libpq , , «target_session_attrs», . ? – , target_session_attrs.
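
For the client-side option, a multi-host connection URI with `target_session_attrs` might be assembled as in the sketch below. The host names match the vagrant nodes, while port 5432, the database name and the helper `libpq_uri` are assumptions for illustration.

```shell
# libpq_uri DBNAME HOST[:PORT]...
# Builds a multi-host URI; libpq (PostgreSQL 10+) tries the hosts in order
# and keeps the first one that accepts read-write sessions.
libpq_uri() {
    db=$1
    shift
    hosts=$(printf '%s,' "$@")
    hosts=${hosts%,}
    echo "postgresql://$hosts/$db?target_session_attrs=read-write"
}

libpq_uri postgres node1:5432 node2:5432 node3:5432
# Example use: psql "$(libpq_uri postgres node1:5432 node2:5432 node3:5432)"
```

This needs no proxy at all, but every client has to carry the full host list and use a libpq-based driver that supports the option.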



, , watchdog Postgres, , , Patroni-. ? Postgres , . .



Stolon Patroni , . - , .





https://www.consul.io/docs/guides/forwarding.html

https://learn.hashicorp.com/consul/day-2-operations/advanced-operations/dns-caching

https://pgconf.ru/2019/242817 https://pgconf.ru/2019/242821

https://github.com/cybertec-postgresql/vip-manager



, DNS. Consul . DNS . .



IP-. HAProxy + keepalived. vip-manager, DCS, IP- , . , Postgres Pro , , IP-. , kill stop keepalived’, VRRP IP- HAProxy, IP- . , , . vip-manager. vip-manager , switchover, IP . , .





, , . Stolon :



  • ttl – .
  • loop_wait – Patroni-.
  • retry_timeout – DCS PostgreSQL.
  • master_start_timeout – PostgreSQL ( Patroni-).


, , . , Patroni- Postgres, DCS. - loop_wait. , .
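
These parameters are not independent: the Patroni documentation asks for `loop_wait + 2 * retry_timeout <= ttl`, otherwise the leader key may expire while a healthy leader is still retrying. A minimal sanity check, with the hypothetical helper name `check_patroni_timing`, could look like this:

```shell
# check_patroni_timing TTL LOOP_WAIT RETRY_TIMEOUT   (all in seconds)
# Succeeds when the leader key cannot expire during normal retries.
check_patroni_timing() {
    ttl=$1
    loop_wait=$2
    retry_timeout=$3
    need=$(( loop_wait + 2 * retry_timeout ))
    if [ "$need" -le "$ttl" ]; then
        echo "ok: loop_wait + 2*retry_timeout = $need <= ttl = $ttl"
    else
        echo "too tight: loop_wait + 2*retry_timeout = $need > ttl = $ttl"
        return 1
    fi
}

# Patroni's default values.
check_patroni_timing 30 10 10
```

Lowering `ttl` to speed up failover therefore means lowering `loop_wait` and `retry_timeout` as well.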



failover Patroni?





  • , DCS . Patroni- . . – 20-30 .

  • Patroni- REST API, endpoint Patroni WAL-. - 2 . 2 , , . , , WAL-.




  • DCS. - .




  • .




DCS , - , , 5 .





, , .



, , Patroni-. - - .





https://www.postgresql.org/message-id/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C@yandex-team.ru

https://github.com/zalando/patroni/blob/master/docs/watchdog.rst



.



  • . , , , Postgres. , Postgres WAL-commit .
  • Zalando – watchdog. Patroni- - , : , .
  • HAProxy Confd, . . , .
  • Corosync & Pacemaker — ( ) , . . . , , , .




, HAProxy Confd .





, netsplit? HAProxy Patroni . . health check’ Patroni-.



Confd. Confd , DCS.





, HAProxy PgBouncer. PgBouncer DCS. , , Patroni .





  • , Patroni . . , , - DCS . downtime , . wal_keep_segments, .
  • , , . . . , , , . .
  • Patroni? Patroni Stolon , enterprise . :
  • . .
  • . , , . , , failover , - . Max Availability Oracle Data Guard.
  • PostgreSQL Stolon.


!



.



Etcd, . , - ?



-, . , , Etcd, Consul mail , . fsync , .



-? , . , , , , ?



, -.



, Postgres , . DCS , .



, .



? ? . , , Consul . Etcd . - ?



Consul Etcd. RAFT. fsync . Postgres DCS , , . , . . , .



, !



Zookeeper? , ? ?



Zookeeper , . Etcd . Stolon , Patroni – .



- Patroni? . - ?



. wal_keep_segments, . . WAL- , . , issue Patroni. , Stolon, , , - .



! -! , . Patroni , . , , .



, . . .



. . . , . WAL-, . , WALs . . , . !



. , - -. , . . . switchover failover, . promote checkpoint, WALs . . , , .



! , - - . , ?



enterprise, Patroni. Stolon. , . . -- Kubernetes, , . Keeper, Sentinel. , , .



Patroni . WAL-, , DCS. DCS . (, , ), DCS, . . issue, Consul . Patroni. Stolon . Kubernetes.



. , Stolon ?



.



– master-slave Stolon.



, . , standby . – Stolon, . , , standby .



. . .



, ?



, . . , , , .



, .



. . ?



, .



Patroni . , , , . , . .



, Patroni , . Stolon , Postgres keeper data, .



, Stolon ?



open source, .



, , ?



, issue. .



- , . - . , .



issue. , , , .



-, . .



! . . , HAProxy, . . . . HAProxy "on-marked-down shutdown-sessions", , .



, ? health checks?



, http check REST API.



, -, HAProxy, IP- . – PgBouncer, health checks. HAProxy – , health checks , . , , – Patroni, - .



Patroni Etcd REST API.



, Etcd , Etcd.



Etcd? , , . watchdog, Patroni , , , watchdog reboot.



, watchdog – . watchdog. Patroni PostgreSQL, Patroni. watchdog – , , . .



, .



watchdog -, .. , , Patroni- , reboot. .



watchdog , , , , failover , . .



, . ? .



, …



, Patroni, . . - . watchdog – .



Patroni Etcd , , standby. , watchdog .



. , Patroni , , , . . : watchdog, HAProxy.



.



Etcd. ?



-.



- ?



-.



? , ?



-.



. . , , ?



Yes. I touched on this in the point about unstable configurations. Timeouts are the only way. In particular, these are heartbeat_interval and election_timeout.



Thank you!





