We have a causal cluster with 5 core nodes and a DB size of ~60GB.
Every couple of days one of the nodes goes offline with the status: "Quarantine marker is present, but unable to read".
This seems to be caused by the node running out of disk space, even though each instance has a 250GB volume.
After closer inspection we see that the problem is the raft.log files growing and growing without being pruned as they are supposed to be. On this particular node we can now see over 700 files of 250MB each, accumulated over the last 3 weeks and taking up more than 175GB of the available space.
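In case anyone wants to reproduce the check, this is roughly how we tallied the segment files on the affected node (a minimal Python sketch; the raft-log directory path and the "raft.log*" file naming are assumptions from our own layout, adjust for yours):

from pathlib import Path

# Raft log location on the affected core member -- this path and the
# "raft.log*" naming pattern are assumptions; adjust to your deployment.
RAFT_LOG_DIR = Path("/var/lib/neo4j/data/cluster-state/db/neo4j/raft-log")

segments = sorted(RAFT_LOG_DIR.glob("raft.log*"))
total_bytes = sum(f.stat().st_size for f in segments if f.is_file())

print(f"segment files: {len(segments)}")
print(f"total size:    {total_bytes / 1024**3:.1f} GiB")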
This is our current (and also default) config for pruning:
"raft_log_entry_prefetch_buffer.max_entries":"1024"
"raft_log_implementation": "SEGMENTED"
"raft_log_prune_strategy": "1g size"
"raft_log_pruning_frequency": "10m"
"raft_log_reader_pool_size": "8"
"raft_log_rotation_size": "250.00MiB"
We would appreciate any ideas. Is there a specific reason why these logs are not being pruned?
Can we prune them manually without affecting the causal cluster?
Thanks in advance!