Mesosphere: HA cluster fails to elect a leader, but the logs show no errors and I can't seem to trigger a leader election

I have a cluster of 6 machines. The machines are:

      HOST        MEM (GB) CPU
mesos-primary-1     8       2
mesos-primary-2     8       2
mesos-primary-3     8       2
mesos-worker-1      1       1
mesos-worker-2      1       1
mesos-worker-3      1       1

My quorum size is 2.

The primary machines have IDs 1, 2, and 3 respectively. In the web UI, I visited the individual IP of each of mesos-primary-1, mesos-primary-2, and mesos-primary-3 on port 5050, and none of them redirected me to another IP.

The lack of a redirect makes me think that each machine believes it has its own quorum or something like that, so they don't see each other and never elect a leader.

Visiting port 8080 on any of the machines returns an error because there is no elected leader, but the page does resolve.

$ cat /etc/mesos-master/quorum

outputs 2 on every master machine.
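For reference, a minimal sketch of the arithmetic behind that value, assuming the usual majority rule (quorum = floor(masters / 2) + 1). The function name `quorum_for` is made up for illustration and is not part of Mesos:

```shell
#!/usr/bin/env bash
# Hypothetical helper: majority quorum for a given number of masters.
quorum_for() {
  local masters=$1
  echo $(( masters / 2 + 1 ))
}

# With the 3 primary machines listed above:
quorum_for 3   # prints 2
```

So a quorum of 2 is consistent with 3 masters, which suggests the quorum value itself is not the problem.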

I have also stopped and restarted everything. On the master nodes:

$ sudo service mesos-master stop
$ sudo service marathon stop
$ sudo service zookeeper stop
$ sudo service mesos-master start
$ sudo service marathon start
$ sudo service zookeeper start

And on each of the slave machines:

$ sudo service mesos-slave stop
$ sudo service mesos-slave start

And still none of the slaves are discovered and no leader is elected.

My logs are clean on all 3 IPs (I fetched each one separately, since there are no redirects). You can view each of them here:

mesos-primary-1

Log file created at: 2015/10/02 11:00:01
Running on machine: mesos-primary-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:01.532337 13722 logging.cpp:172] INFO level logging started!
I1002 11:00:01.532865 13722 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:01.532894 13722 main.cpp:231] Version: 0.24.1
I1002 11:00:01.532903 13722 main.cpp:234] Git tag: 0.24.1
I1002 11:00:01.532909 13722 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:01.533020 13722 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:01.546877 13722 leveldb.cpp:176] Opened db in 13.691496ms
I1002 11:00:01.550370 13722 leveldb.cpp:183] Compacted db in 2.522303ms
I1002 11:00:01.550559 13722 leveldb.cpp:198] Created db iterator in 118591ns
I1002 11:00:01.550618 13722 leveldb.cpp:204] Seeked to beginning of db in 1151ns
I1002 11:00:01.550642 13722 leveldb.cpp:273] Iterated through 0 keys in the db in 767ns
I1002 11:00:01.551029 13722 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:01.553994 13743 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:01.556193 13740 recover.cpp:449] Starting replica recovery
I1002 11:00:01.561755 13722 main.cpp:465] Starting Mesos master
I1002 11:00:01.563489 13740 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:01.568989 13722 master.cpp:378] Master 20151002-110001-2874854303-5050-13722 (159.203.90.171) started on 159.203.90.171:5050
I1002 11:00:01.569059 13722 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="159.203.90.171" --initialize_driver_logging="true" --ip="159.203.90.171" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:01.569535 13722 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:01.569581 13722 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:01.569608 13722 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:01.569718 13722 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:01.570199 13722 authenticator.cpp:512] Initializing server SASL
I1002 11:00:01.582969 13722 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:01.584786 13743 contender.cpp:149] Joining the ZK group
I1002 11:00:11.573873 13747 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
I1002 11:01:06.547200 13743 http.cpp:321] HTTP GET for /master/state.json from 173.243.85.102:51963 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10
5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'

mesos-primary-2

Log file created at: 2015/10/02 11:00:01
Running on machine: mesos-primary-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:01.532337 13722 logging.cpp:172] INFO level logging started!
I1002 11:00:01.532865 13722 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:01.532894 13722 main.cpp:231] Version: 0.24.1
I1002 11:00:01.532903 13722 main.cpp:234] Git tag: 0.24.1
I1002 11:00:01.532909 13722 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:01.533020 13722 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:01.546877 13722 leveldb.cpp:176] Opened db in 13.691496ms
I1002 11:00:01.550370 13722 leveldb.cpp:183] Compacted db in 2.522303ms
I1002 11:00:01.550559 13722 leveldb.cpp:198] Created db iterator in 118591ns
I1002 11:00:01.550618 13722 leveldb.cpp:204] Seeked to beginning of db in 1151ns
I1002 11:00:01.550642 13722 leveldb.cpp:273] Iterated through 0 keys in the db in 767ns
I1002 11:00:01.551029 13722 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:01.553994 13743 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:01.556193 13740 recover.cpp:449] Starting replica recovery
I1002 11:00:01.561755 13722 main.cpp:465] Starting Mesos master
I1002 11:00:01.563489 13740 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:01.568989 13722 master.cpp:378] Master 20151002-110001-2874854303-5050-13722 (159.203.90.171) started on 159.203.90.171:5050
I1002 11:00:01.569059 13722 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="159.203.90.171" --initialize_driver_logging="true" --ip="159.203.90.171" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:01.569535 13722 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:01.569581 13722 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:01.569608 13722 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:01.569718 13722 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:01.570199 13722 authenticator.cpp:512] Initializing server SASL
I1002 11:00:01.582969 13722 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:01.584786 13743 contender.cpp:149] Joining the ZK group
I1002 11:00:11.573873 13747 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying

mesos-primary-3

Log file created at: 2015/10/02 11:00:12
Running on machine: mesos-primary-3
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:12.609675 17105 logging.cpp:172] INFO level logging started!
I1002 11:00:12.610414 17105 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:12.610452 17105 main.cpp:231] Version: 0.24.1
I1002 11:00:12.610468 17105 main.cpp:234] Git tag: 0.24.1
I1002 11:00:12.610483 17105 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:12.610576 17105 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:12.618232 17105 leveldb.cpp:176] Opened db in 7.382537ms
I1002 11:00:12.619810 17105 leveldb.cpp:183] Compacted db in 1.512691ms
I1002 11:00:12.619876 17105 leveldb.cpp:198] Created db iterator in 27030ns
I1002 11:00:12.619910 17105 leveldb.cpp:204] Seeked to beginning of db in 1254ns
I1002 11:00:12.619925 17105 leveldb.cpp:273] Iterated through 0 keys in the db in 339ns
I1002 11:00:12.620028 17105 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:12.620930 17125 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:12.621615 17128 recover.cpp:449] Starting replica recovery
I1002 11:00:12.626735 17105 main.cpp:465] Starting Mesos master
I1002 11:00:12.627024 17128 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:12.633635 17123 master.cpp:378] Master 20151002-110012-321094504-5050-17105 (104.131.35.19) started on 104.131.35.19:5050
I1002 11:00:12.633828 17123 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="104.131.35.19" --initialize_driver_logging="true" --ip="104.131.35.19" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:12.635736 17123 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:12.635771 17123 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:12.635802 17123 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:12.635835 17123 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:12.636078 17123 authenticator.cpp:512] Initializing server SASL
I1002 11:00:12.643378 17125 contender.cpp:149] Joining the ZK group
I1002 11:00:12.643826 17123 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:22.633390 17130 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying

I set up the machines following the recommendations in this DigitalOcean guide.

Running

MASTER=$(mesos-resolve `cat /etc/mesos/zk`)
mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"

yields:

2015-10-02 12:30:26,137:14558(0x7f8dbb743700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@716: Client environment:host.name=mesos-primary-1
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-57-generic
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@725: Client environment:os.version=#95-Ubuntu SMP Fri Jun 19 09:28:15 UTC 2015
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@753: Client environment:user.dir=/root
2015-10-02 12:30:26,142:14558(0x7f8dbb743700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181 sessionTimeout=10000 watcher=0x7f8dc3625610 sessionId=0 sessionPasswd=<null> context=0x7f8da8003960 flags=0
2015-10-02 12:30:26,142:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.35.19:2181]
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.35.19:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.117.124:2181]
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.117.124:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:26,145:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [159.203.90.171:2181]
2015-10-02 12:30:26,147:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [159.203.90.171:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,484:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.35.19:2181]
2015-10-02 12:30:29,485:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.35.19:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,485:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.117.124:2181]
2015-10-02 12:30:29,486:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.117.124:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,487:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [159.203.90.171:2181]
2015-10-02 12:30:29,488:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [159.203.90.171:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Failed to detect master from 'zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos' within 5secs
root@mesos-primary-1:~# mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"

Does anyone have any ideas?


asked by yburyug, 02.10.2015


Answers (1)


It looks to me like either your machines cannot reach each other, or the required ports are blocked on some or all of them. Make sure that:

A. Ports 2181 (zookeeper), 2888 and 3888 (ZooKeeper follower connections and leader election respectively), and 5050 (mesos)/8080 (if you are using marathon) are unblocked, the last two so that you can reach the UI from your desktop/laptop. The slaves, I believe, only need 2888 to be reachable from the masters.

B. You can ping all the other masters from one machine, e.g. log in to master 1 and ping masters 2 and 3.

C. You debug the masters forming a cluster properly first, before worrying about the slaves.
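Checks A and B can be roughly scripted. The following is only a sketch, assuming bash and the coreutils `timeout` command are available on the masters. The function names are made up for illustration, and the IPs are the ones from the `--zk` flag in the question's logs:

```shell
#!/usr/bin/env bash
# Sketch: succeed if a TCP connection to host:port can be opened within 3 seconds.
# Uses bash's built-in /dev/tcp redirection, so no extra tools are needed.
port_open() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Hypothetical helper: report every master/port pair from check A.
check_masters() {
  local host port
  for host in 159.203.90.171 104.131.35.19 104.131.117.124; do
    for port in 2181 2888 3888 5050 8080; do
      if port_open "$host" "$port"; then
        echo "open        ${host}:${port}"
      else
        echo "unreachable ${host}:${port}"
      fi
    done
  done
}
```

Source this on each master and run `check_masters`. Note that "unreachable" only means the connection failed: the port may be firewalled, or nothing may be listening on it at that moment.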

You appear to have a good set of settings and the correct quorum configuration. Once you have established that the machines can connect to each other, you can investigate other potential problems. Let us know how it goes!
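One such follow-up, given the "failed while receiving a server response" errors in the question, might be to ask each ZooKeeper server whether it is alive. This is a sketch, assuming bash, coreutils `timeout`, and that ZooKeeper's four-letter `ruok` command is enabled on the servers. The function names are again made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: send ZooKeeper's "ruok" four-letter command and print the reply.
# A running server answers "imok"; an empty reply means the connection
# failed or the server did not respond.
zk_ruok() {
  local host=$1 port=${2:-2181}
  timeout 3 bash -c \
    "exec 3<>/dev/tcp/${host}/${port} && printf ruok >&3 && cat <&3" 2>/dev/null
}

# Hypothetical helper: query the three ZooKeeper servers from the --zk flag.
check_zookeepers() {
  local host reply
  for host in 159.203.90.171 104.131.35.19 104.131.117.124; do
    reply=$(zk_ruok "$host" || true)
    echo "${host}: ${reply:-no reply}"
  done
}
```

An "imok" reply only shows that the ZooKeeper process is running. Whether the ensemble has actually formed is a separate question, which the `stat` four-letter command should help answer.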

answered by chuckwired, 20.01.2016