Using Auto heal to improve cluster uptime for Nginx
Most node runners share a common problem: stale nodes, meaning nodes that are stuck at a certain block height and no longer keep up with the chain tip, give users a poor experience. To mitigate this, we can add extra monitoring that dynamically adds nodes to or removes them from the cluster.
The provided software gives you an entry-level solution for this. As everything is written in Python, you can adjust it to your API setup.
To install, please do the following:
Ensure you have Python installed. If not, download and install it from the official Python website. You'll also need pip, Python's package manager, to install the required libraries. If you're using a Linux-based system, ensure you have NGINX installed and properly configured.
Since the script edits the NGINX config directly, we need to give python3 sudo rights.
Be aware of this when you follow this tutorial.
To clone the AutoHealBot repository, use the following steps:
Install Git: Ensure Git is installed on your system. If not, install it from the official Git website.
Clone the Repository: Open a terminal and run:
Navigate to the Directory: After cloning, change to the repository directory:
This clones the entire repository to your local machine, allowing you to access all files and resources. To proceed with the tutorial, follow additional setup instructions provided in the repository's README or other documentation.
When installing, make sure to install this as root with sudo, otherwise the script will not find the libraries later on.
To configure your environment variables, copy over the .env.example file in the repository.
Replace the placeholders with actual values:
NGINX_CONFIG_PATH: The path to your NGINX configuration file.
BASE_RATE and NODE_MULTIPLIER: Adjust as needed.
RPC_PORT, GRPC_PORT, LCD_PORT: Set to your specific ports.
FILE_PATH: Path to the text file with node URLs.
TIME_BEFORE_FALLEN_BEHIND: Maximum allowed time before a node is considered unhealthy.
UPDATE_TIME: Time between health checks.
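As a minimal sketch of how these variables could be read and validated in Python, the snippet below builds a settings object from a mapping such as os.environ. The Settings class and load_settings function are hypothetical names, not part of AutoHealBot, and the two time values are assumed to be seconds. BASE_RATE and NODE_MULTIPLIER are left out because their exact meaning is defined by the script itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values described above."""
    nginx_config_path: str
    file_path: str
    rpc_port: int
    grpc_port: int
    lcd_port: int
    time_before_fallen_behind: float  # assumption: seconds
    update_time: float                # assumption: seconds


REQUIRED_KEYS = (
    "NGINX_CONFIG_PATH", "FILE_PATH", "RPC_PORT", "GRPC_PORT",
    "LCD_PORT", "TIME_BEFORE_FALLEN_BEHIND", "UPDATE_TIME",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing early if a key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return Settings(
        nginx_config_path=env["NGINX_CONFIG_PATH"],
        file_path=env["FILE_PATH"],
        rpc_port=int(env["RPC_PORT"]),
        grpc_port=int(env["GRPC_PORT"]),
        lcd_port=int(env["LCD_PORT"]),
        time_before_fallen_behind=float(env["TIME_BEFORE_FALLEN_BEHIND"]),
        update_time=float(env["UPDATE_TIME"]),
    )
```

Failing early on a missing key makes a misconfigured .env visible at startup instead of during the first health check.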
Create a text file with the node URLs. For example, create nodes.txt with one URL per line, and make sure to include the RPC port for each node.
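To illustrate the expected shape of this file, here is a minimal sketch of a reader for it. The function names are hypothetical, and the sketch assumes each line looks like either `http://host:port` or `host:port`; the actual script may parse the file differently.

```python
from urllib.parse import urlsplit


def parse_node_lines(lines):
    """Turn lines like 'http://host:26657' or 'host:26657' into (host, port) pairs."""
    nodes = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # urlsplit only recognises the host part when a scheme or '//' is present.
        parts = urlsplit(line if "://" in line else f"//{line}")
        if parts.hostname is None or parts.port is None:
            raise ValueError(f"node entry needs a host and an RPC port: {line!r}")
        nodes.append((parts.hostname, parts.port))
    return nodes


def load_nodes(path):
    """Read the node file (FILE_PATH) and return its (host, port) pairs."""
    with open(path, encoding="utf-8") as handle:
        return parse_node_lines(handle)
```

Rejecting entries without a port catches the most likely mistake, a URL pasted without its RPC port, before any health check runs.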
In the AutoHealBot script, "upstream blocks" refer to sections in the NGINX configuration that specify which backend servers handle different types of traffic. This setup divides the backend nodes into separate streams: RPC, gRPC, and LCD. The script checks the health of these nodes and updates the corresponding upstream blocks to reflect the healthy nodes for each stream. It ensures that traffic is routed to servers that are online and functional.
As a reference, the upstream blocks are defined as:
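A minimal sketch of how one such block could be rendered from a list of healthy hosts is shown here. The upstream name `rpc_backend`, the addresses, and the render_upstream helper are hypothetical and chosen only for illustration; the names used by AutoHealBot may differ.

```python
def render_upstream(name, hosts, port):
    """Render an NGINX upstream block listing one server per healthy host."""
    if not hosts:
        # NGINX rejects an upstream block that contains no servers,
        # so the caller has to decide what to do when every node is down.
        raise ValueError(f"upstream {name!r} needs at least one server")
    lines = [f"upstream {name} {{"]
    lines += [f"    server {host}:{port};" for host in hosts]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example: two healthy nodes serving the RPC stream.
print(render_upstream("rpc_backend", ["10.0.0.1", "10.0.0.2"], 26657))
```

The same helper could be called once per stream (RPC, gRPC, LCD) with the matching port, so that each upstream block only lists nodes that passed the health check.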
Run the script to start the asynchronous health checks and NGINX updates:
Troubleshooting
Environment Variables Not Loaded: Ensure your .env file is in the same directory as the script or specify its path explicitly with dotenv_path.
NGINX Not Reloading: Check if you have the necessary permissions to reload NGINX and ensure systemctl or other command-line utilities are in your PATH.
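To narrow down reload problems, a small check along the following lines can help. It is a sketch with hypothetical helper names, not the script's own reload logic, and it assumes NGINX is managed by systemd. It verifies that the tools are on PATH, validates the configuration with `nginx -t`, and only then reloads.

```python
import shutil
import subprocess


def missing_tools(tools=("nginx", "systemctl"), which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def reload_nginx(run=subprocess.run, which=shutil.which):
    """Validate the NGINX config, then reload the service."""
    missing = missing_tools(which=which)
    if missing:
        raise FileNotFoundError("not found on PATH: " + ", ".join(missing))
    # check=True raises CalledProcessError if a command exits non-zero,
    # for example on a syntax error in the config or missing permissions.
    run(["nginx", "-t"], check=True)
    run(["systemctl", "reload", "nginx"], check=True)
```

Running the config test first means a broken generated config produces an error instead of being handed to the running service.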
With this setup, the script will run asynchronously, periodically checking node health, updating the NGINX configuration, and reloading the NGINX service as needed.
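The overall loop can be sketched as follows. This is an illustration under stated assumptions, not the AutoHealBot implementation: fetch_block_time is a hypothetical coroutine that returns a node's latest block time, on_update is a hypothetical callback that rewrites and reloads NGINX, and timestamps are assumed to be UTC strings ending in `Z`.

```python
import asyncio
from datetime import datetime, timezone


def parse_block_time(value):
    """Parse a UTC timestamp ending in 'Z', trimming sub-microsecond digits."""
    value = value.rstrip("Z")
    if "." in value:
        head, frac = value.split(".", 1)
        value = f"{head}.{frac[:6].ljust(6, '0')}"
    return datetime.fromisoformat(value).replace(tzinfo=timezone.utc)


def is_stale(latest_block_time, now, max_lag_seconds):
    """A node is stale when its latest block is older than the allowed lag."""
    return (now - latest_block_time).total_seconds() > max_lag_seconds


async def check_nodes(nodes, fetch_block_time, max_lag_seconds, now=None):
    """Query all nodes concurrently and return the ones that are not stale."""
    now = now or datetime.now(timezone.utc)
    results = await asyncio.gather(
        *(fetch_block_time(node) for node in nodes), return_exceptions=True
    )
    healthy = []
    for node, result in zip(nodes, results):
        if isinstance(result, BaseException):
            continue  # unreachable nodes are treated as unhealthy
        if not is_stale(result, now, max_lag_seconds):
            healthy.append(node)
    return healthy


async def run_forever(nodes, fetch_block_time, on_update, max_lag_seconds, update_time):
    """Re-check every update_time seconds; call on_update only on changes."""
    previous = None
    while True:
        healthy = await check_nodes(nodes, fetch_block_time, max_lag_seconds)
        if healthy != previous:
            on_update(healthy)
            previous = healthy
        await asyncio.sleep(update_time)
```

Here max_lag_seconds corresponds to TIME_BEFORE_FALLEN_BEHIND and update_time to UPDATE_TIME. Calling on_update only when the healthy set changes is a design choice of this sketch that avoids reloading NGINX on every cycle.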