Масштабируемость чтения Titan / Cassandra, случайные серверы отключаются или отбрасывают сообщения

Мы используем Titan-1.0.0 с поддержкой cassandra 2.1.7. Кластер состоит из 36 виртуальных машин, каждая из которых имеет 16 ГБ ОЗУ (размер кучи 6 ГБ) и 16 ядер ЦП. Мы используем SSD-диск для данных cassandra и обычный HDD для журналов фиксации. RF равно 3, и чтение / запись выполняются при CF = 2 каждое. Версия Java - 1.8.0_45. Через этот кластер обслуживаются тяжелые операции чтения. Мы сталкиваемся со случайными сбоями на некоторых серверах в среднем примерно два-три раза в день, когда сервер выходит из строя. Позвольте мне поделиться конкретным случаем примерно 10 мая 2017 года в 04:00 утра, было слишком много отказов READ / MUTATION, и нам пришлось перезапустить сервер. Ниже приведены некоторые из параметров, которые мы собрали, когда серверы выходят из строя:

Может кто-нибудь, пожалуйста, помогите нам с некоторыми указателями относительно того, что здесь происходит не так и что можно сделать для стабилизации серверов.

  • gc.log

    2017-05-11T21:09:44.018+0530: 1900952.543: Total time for which application threads were stopped: 0.0156063 seconds, Stopping threads took: 0.0125466 seconds 2017-05-11T21:09:45.021+0530: 1900953.546: Total time for which application threads were stopped: 0.0031342 seconds, Stopping threads took: 0.0006697 seconds 2017-05-11T21:09:46.025+0530: 1900954.550: Total time for which application threads were stopped: 0.0037100 seconds, Stopping threads took: 0.0011525 seconds 2017-05-11T21:09:47.031+0530: 1900955.556: Total time for which application threads were stopped: 0.0057972 seconds, Stopping threads took: 0.0030861 seconds 2017-05-11T21:09:48.034+0530: 1900956.559: Total time for which application threads were stopped: 0.0029592 seconds, Stopping threads took: 0.0007129 seconds 2017-05-11T21:09:49.061+0530: 1900957.586: Total time for which application threads were stopped: 0.0268731 seconds, Stopping threads took: 0.0242117 seconds 2017-05-11T21:10:03.095+0530: 1900971.620: Total time for which application threads were stopped: 0.0045643 seconds, Stopping threads took: 0.0009709 seconds 2017-05-11T21:10:04.099+0530: 1900972.624: Total time for which application threads were stopped: 0.0033764 seconds, Stopping threads took: 0.0007673 seconds 2017-05-11T21:10:07.104+0530: 1900975.629: Total time for which application threads were stopped: 0.0056445 seconds, Stopping threads took: 0.0019495 seconds 2017-05-11T21:10:07.447+0530: 1900975.972: Total time for which application threads were stopped: 0.0045721 seconds, Stopping threads took: 0.0009695 seconds {Heap before GC invocations=85258 (full 0):  garbage-first heap   total 6291456K, used 5059586K [0x0000000640000000, 0x0000000640206000, 0x00000007c0000000)  region size 2048K, 1844 young (3776512K), 105 survivors (215040K)  Metaspace       used 36253K, capacity 36654K, committed 36784K, reserved 1081344K   class space    used 3844K, capacity 3965K, committed 4016K, reserved 1048576K 2017-05-11T21:10:07.453+0530:
    1900975.978: [GC pause (GCLocker Initiated GC) (young) Desired survivor size 242221056 bytes, new threshold 15 (max 15)
    - age   1:  182659896 bytes,  182659896 total
    - age   2:   21418160 bytes,  204078056 total
    - age   3:     871016 bytes,  204949072 total
    - age   4:    3600512 bytes,  208549584 total
    - age   5:    1313096 bytes,  209862680 total
    - age   6:      21152 bytes,  209883832 total , 0.0972867 secs]    [Parallel Time: 83.6 ms, GC Workers: 13]
          [GC Worker Start (ms): Min: 1900975980.9, Avg: 1900975986.0, Max: 1900975989.4, Diff: 8.5]
          [Ext Root Scanning (ms): Min: 8.5, Avg: 11.9, Max: 17.4, Diff: 8.9, Sum: 154.3]
          [Update RS (ms): Min: 26.8, Avg: 30.8, Max: 34.2, Diff: 7.4, Sum: 399.8]
             [Processed Buffers: Min: 31, Avg: 87.3, Max: 144, Diff: 113, Sum: 1135]
          [Scan RS (ms): Min: 0.1, Avg: 2.8, Max: 6.2, Diff: 6.1, Sum: 36.6]
          [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 0.9]
          [Object Copy (ms): Min: 31.5, Avg: 32.3, Max: 33.2, Diff: 1.7, Sum: 419.9]
          [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
          [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.3, Sum: 3.1]
          [GC Worker Total (ms): Min: 74.7, Avg: 78.1, Max: 83.3, Diff: 8.7, Sum: 1014.7]
          [GC Worker End (ms): Min: 1900976063.9, Avg: 1900976064.1, Max: 1900976064.3, Diff: 0.3]    [Code Root Fixup: 0.1 ms]    [Code Root Purge: 0.0 ms]    [Clear CT: 1.4 ms]    [Other: 12.1 ms]
          [Choose CSet: 0.0 ms]
          [Ref Proc: 1.0 ms]
          [Ref Enq: 0.0 ms]
          [Redirty Cards: 1.6 ms]
          [Humongous Reclaim: 0.4 ms]
          [Free CSet: 4.5 ms]    [Eden: 3478.0M(3476.0M)->0.0B(252.0M) Survivors: 210.0M->54.0M Heap: 4941.5M(6144.0M)->1175.0M(6144.0M)] Heap after GC invocations=85259 (full 0):  garbage-first heap   total 6291456K, used 1203201K [0x0000000640000000, 0x0000000640206000, 0x00000007c0000000)   region size 2048K, 27 young (55296K), 27 survivors (55296K)  Metaspace       used 36253K, capacity 36654K, committed 36784K, reserved 1081344K   class space    used 3844K, capacity 3965K, committed 4016K, reserved 1048576K }  [Times: user=0.58 sys=0.21, real=0.09 secs]  2017-05-11T21:10:07.551+0530:
    1900976.076: Total time for which application threads were stopped: 0.1015506 seconds, Stopping threads took: 0.0004470 seconds 2017-05-11T21:10:08.557+0530: 1900977.082: Total time for which application threads were stopped: 0.0053236 seconds, Stopping threads took: 0.0010397 seconds {Heap before GC invocations=85259 (full 0):  garbage-first heap   total 6291456K, used 1461249K [0x0000000640000000, 0x0000000640206000, 0x00000007c0000000)   region size 2048K, 153 young (313344K), 27 survivors (55296K)  Metaspace      used 36253K, capacity 36654K, committed 36784K, reserved 1081344K   class space    used 3844K, capacity 3965K, committed 4016K, reserved 1048576K 2017-05-11T21:10:08.995+0530: 1900977.520: [GC pause (G1 Evacuation Pause) (mixed) Desired survivor size 20971520 bytes, new threshold 3 (max 15)
    - age   1:   19769440 bytes,   19769440 total
    - age   2:     801568 bytes,   20571008 total
    - age   3:   20679216 bytes,   41250224 total
    - age   4:     738736 bytes,   41988960 total
    - age   5:    3536096 bytes,   45525056 total
    - age   6:    1242648 bytes,   46767704 total
    - age   7:      19208 bytes,   46786912 total , 0.1879873 secs]    [Parallel Time: 175.3 ms, GC Workers: 13]
          [GC Worker Start (ms): Min: 1900977524.7, Avg: 1900977525.0, Max: 1900977525.3, Diff: 0.6]
          [Ext Root Scanning (ms): Min: 14.5, Avg: 14.8, Max: 15.3, Diff: 0.8, Sum: 192.9]
          [Update RS (ms): Min: 7.1, Avg: 7.9, Max: 13.3, Diff: 6.1, Sum: 103.2]
             [Processed Buffers: Min: 4, Avg: 37.0, Max: 93, Diff: 89, Sum: 481]
          [Scan RS (ms): Min: 55.7, Avg: 60.9, Max: 61.8, Diff: 6.1, Sum: 791.9]
          [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.7, Diff: 0.7, Sum: 0.9]
          [Object Copy (ms): Min: 90.3, Avg: 90.7, Max: 91.2, Diff: 0.9, Sum: 1179.5]
          [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
          [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 0.8]
          [GC Worker Total (ms): Min: 174.3, Avg: 174.6, Max: 174.8, Diff: 0.6, Sum: 2269.2]
          [GC Worker End (ms): Min: 1900977699.6, Avg: 1900977699.6, Max: 1900977699.7, Diff: 0.1]    [Code Root Fixup: 0.2 ms]    [Code Root Purge: 0.0 ms]    [Clear CT: 0.9 ms]    [Other: 11.5 ms]
          [Choose CSet: 3.3 ms]
          [Ref Proc: 0.4 ms]
          [Ref Enq: 0.0 ms]
          [Redirty Cards: 1.1 ms]
          [Humongous Reclaim: 0.1 ms]
          [Free CSet: 3.6 ms]    [Eden: 252.0M(252.0M)->0.0B(3656.0M) Survivors: 54.0M->30.0M Heap: 1427.0M(6144.0M)->890.0M(6144.0M)] Heap after GC invocations=85260 (full 0):  garbage-first heap   total 6291456K, used 911310K [0x0000000640000000, 0x0000000640206000, 0x00000007c0000000)   region size 2048K, 15 young (30720K), 15 survivors (30720K)  Metaspace       used 36253K, capacity 36654K, committed 36784K, reserved 1081344K   class space    used 3844K, capacity 3965K, committed 4016K, reserved 1048576K }  [Times: user=1.59 sys=0.60, real=0.19 secs]

  • jstat

      S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
      0.00 100.00   2.29  37.85  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00   4.63  37.85  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00   7.81  37.85  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00  10.46  37.85  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00  13.17  37.81  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00  15.30  38.11  98.56  95.74  79645 6351.963     0    0.000 6351.963

  • /var/log/cassandra/system.log
WARN  [SharedPool-Worker-1] 2017-05-11 21:08:49,621 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
WARN  [SharedPool-Worker-1] 2017-05-11 21:08:52,647 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
WARN  [GossipTasks:1] 2017-05-11 21:09:05,030 Gossiper.java:714 - Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down)
WARN  [GossipTasks:1] 2017-05-11 21:09:16,588 FailureDetector.java:249 - Not marking nodes down due to local pause of 16886451559 > 5000000000
INFO  [ScheduledTasks:1] 2017-05-11 21:09:16,954 MessagingService.java:888 - 1924 READ messages dropped in last 5000ms
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,153 StatusLogger.java:51 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,154 StatusLogger.java:66 - MutationStage                     0         0       32568360         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - RequestResponseStage              1         0    12112153129         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - ReadRepairStage                   0         0      593141839         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - CounterMutationStage              0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - ReadStage                        15         0    12168491480         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - MiscStage                         0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - HintedHandoff                     0         1           3531         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - GossipStage                       0         0        5798815         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - CacheCleanupExecutor              0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - InternalResponseStage             0         0            537         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - CommitLogArchiver                 0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,157 StatusLogger.java:66 - CompactionExecutor                0         0       29915673         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,157 StatusLogger.java:66 - ValidationExecutor                0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,157 StatusLogger.java:66 - MigrationStage                    0         0           3990         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,157 StatusLogger.java:66 - AntiEntropyStage                  0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,158 StatusLogger.java:66 - PendingRangeCalculator            0         0            255         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,159 StatusLogger.java:66 - Sampler                           0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,159 StatusLogger.java:66 - MemtableFlushWriter               0         0           7705         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,159 StatusLogger.java:66 - MemtablePostFlush                 0         0          44884         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,159 StatusLogger.java:66 - MemtableReclaimMemory             0         0           7705         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:75 - CompactionManager                 0         0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:87 - MessagingService                n/a      57/3
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:97 - Cache Type                     Size                 Capacity               KeysToSave
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:99 - KeyCache                  104857584                104857600                      all
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:105 - RowCache                          0                        0                      all
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:112 - ColumnFamily                Memtable ops,data

INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.compaction_history                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.hints                              0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.IndexInfo                          0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.schema_columnfamilies                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.schema_triggers                    0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.size_estimates         762300,93404798
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.paxos                              0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.peer_events                        0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.range_xfers                        0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.compactions_in_progress                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.peers                       283,103428
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.schema_keyspaces                   0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.schema_usertypes                   0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.local                              0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.sstable_activity             315,63532
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.schema_columns                     0,0    
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.batchlog                           0,0   

INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.titan_ids                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.graphindex         1843,277464
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.system_properties_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.system_properties                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.edgestore_lock_              6,1191
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.systemlog          741,242777
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.txlog           741,242777
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.graphindex_lock_         1332,277240
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.edgestore      75171,16624309
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_05052017.system_properties_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_05052017.system_properties                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.edgestore_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.txlog                  0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.edgestore                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.graphindex                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.titan_ids                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.systemlog                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.graphindex_lock_                 0,0   
WARN  [SharedPool-Worker-24] 2017-05-11 21:09:17,200 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
WARN  [SharedPool-Worker-18] 2017-05-11 21:09:17,201 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,170 MessagingService.java:888 - 1 READ messages dropped in last 5000ms
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,170 StatusLogger.java:51 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - MutationStage                     0         0       32568376         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - RequestResponseStage              0         0    12112193109         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - ReadRepairStage                   0         0      593143878         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - CounterMutationStage              0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - ReadStage                         0         0    12168508581         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - MiscStage                         0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - HintedHandoff                     0         1           3531         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - GossipStage                       0         0        5798844         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - CacheCleanupExecutor              0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - InternalResponseStage             0         0            537         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - CommitLogArchiver                 0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - CompactionExecutor                0         0       29915673         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - ValidationExecutor                0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - MigrationStage                    0         0           3990         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - AntiEntropyStage                  0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - PendingRangeCalculator            0         0            255         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - Sampler                           0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:66 - MemtableFlushWriter               0         0           7705         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:66 - MemtablePostFlush                 0         0          44884         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:66 - MemtableReclaimMemory             0         0           7705         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:75 - CompactionManager                 0         0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:87 - MessagingService                n/a       0/0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:97 - Cache Type                     Size                 Capacity               KeysToSave
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:99 - KeyCache                  104857584                104857600                      all   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,174 StatusLogger.java:105 - RowCache                          0                        0                      all   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:99 - KeyCache                  104857584                104857600                      all   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,174 StatusLogger.java:105 - RowCache                          0                        0                      all   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,174 StatusLogger.java:112 - ColumnFamily                Memtable ops,data
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.titan_ids                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.graphindex         1844,277804
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.system_properties_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.system_properties                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.edgestore_lock_              6,1191
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.systemlog          741,242777
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.txlog           741,242777
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.graphindex_lock_         1332,277240
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.edgestore      75176,16625645
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.system_properties_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.system_properties                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.edgestore_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.txlog                  0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.edgestore                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.graphindex                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.titan_ids                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.systemlog                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.graphindex_lock_                 0,0   
WARN  [SharedPool-Worker-34] 2017-05-11 21:09:24,177 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
  • ThreadDump около 600 экономичных потоков:

    "Thrift:20118" #124110 daemon prio=5 os_prio=0 tid=0x00007feab9a71460 nid=0xb1c6 runnable [0x00007fea6a578000]
       java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked  (a java.io.BufferedInputStream)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:205)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

72 MessagingService-Входящие потоки ...


    "MessagingService-Incoming-/192.168.33.67" #124106 prio=5 os_prio=0 tid=0x00007feab98436b0 nid=0xb0ca runnable [0x00007fea7503a000]    java.lang.Thread.State: RUNNABLE     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)  at sun.nio.ch.IOUtil.read(IOUtil.java:197)  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        - locked  (a java.lang.Object)  at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:192)
        - locked  (a java.lang.Object)  at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
        - locked  (a sun.nio.ch.SocketAdaptor$SocketInputStream)    at net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:215)  at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:149)     at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:101)   at java.io.DataInputStream.readInt(DataInputStream.java:387)    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:169)   at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88)

  • sar -q
09:10:01  IST         5      1396     16.02      9.24      6.31         0

person Pankaj Yadav    schedule 12.05.2017    source источник


Ответы (1)


Я думаю, проблема в надгробии Кассандры

В Cassandra всякий раз при удалении вставляйте явно null или создается Tombstone с истечением срока действия. Надгробия - это механизм, который позволяет Cassandra писать быстро, но за это приходится платить.

Прежде всего, надгробия сами по себе являются записями. Они занимают место и могут существенно увеличить объем требуемого хранилища.

Во-вторых, Большое количество надгробий вызывает задержку и давление кучи

Из системного журнала вашей кассандры кажется, что около 13K надгробных камней создается только для одной вершины (Vertex Key ID: 0xf800000000104600). Это огромная. Вы можете видеть, что когда это предупреждение появляется в system.log cassandra, давление кучи увеличивается, и поэтому gc.log одновременно печатается.

Что мне делать?

Cassandra полностью отбросит эти надгробия при срабатывании сжатия, только после того, как local_delete_time + gc_grace_seconds определено в таблице, к которой принадлежат данные.

Подробнее:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html https://opencredo.com/cassandra-tombstones-common-issues/

person Ashraful Islam    schedule 13.05.2017
comment
Спасибо @Ashraful Islam за ответ. Не было времени проверить это. Позвольте мне рассмотреть предложения и обновлю здесь. - person Pankaj Yadav; 21.05.2017