Using Auto Heal to Improve Cluster Uptime for NGINX
Most node runners share a common problem: stale nodes, meaning nodes that are stuck at a certain block height and no longer keep up with the chain tip, give users a poor experience. To mitigate this, we can add monitoring that dynamically adds nodes to the cluster or removes them from it.
The provided software offers an entry-level solution for this. Since everything is written in Python, you can adapt it to your own API setup.
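To make the notion of a stale node concrete, here is a minimal sketch of a staleness test. It assumes a health check has already obtained the timestamp of each node's latest block; the function name `is_stale` and the 60-second threshold are illustrative assumptions, not taken from the repository.

```python
from datetime import datetime, timedelta, timezone


def is_stale(latest_block_time: datetime, now: datetime, max_lag: timedelta) -> bool:
    """Return True if the node's newest block is older than max_lag."""
    return now - latest_block_time > max_lag


# Illustrative values only: a node whose last block is 5 minutes old,
# checked against a hypothetical 60-second threshold.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
last_block = now - timedelta(minutes=5)
print(is_stale(last_block, now, timedelta(seconds=60)))  # True
```

A node flagged this way would be a candidate for removal from the cluster until it catches up again.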
To install, please do the following:
Prerequisites
Ensure you have Python installed. If not, download and install Python from python.org. You'll also need pip, Python's package manager, to install required libraries. If you're using a Linux-based system, ensure you have NGINX installed and properly configured.
Since the script edits the NGINX config directly, python3 needs to run with sudo rights.
Keep this in mind when you follow this tutorial.
Clone Repository
To clone the AutoHealBot repository, use the following steps:
Install Git: Ensure Git is installed on your system. If not, install it from git-scm.com.
Clone the Repository: Open a terminal and run:
Navigate to the Directory: After cloning, change to the repository directory:
This clones the entire repository to your local machine, allowing you to access all files and resources. To proceed with the tutorial, follow additional setup instructions provided in the repository's README or other documentation.
Install Required Libraries for Python
When installing, make sure to install the libraries as root with sudo, otherwise the script will not find them later on.
Set Up the Environment File
To configure your environment variables, copy the .env.example file in the repository to .env.
Replace the placeholders with actual values:
NGINX_CONFIG_PATH: The path to your NGINX configuration file.
BASE_RATE and NODE_MULTIPLIER: Adjust as needed.
RPC_PORT, GRPC_PORT, LCD_PORT: Set to your specific ports.
FILE_PATH: Path to the text file with node URLs.
TIME_BEFORE_FALLEN_BEHIND: Maximum allowed time before a node is considered unhealthy.
UPDATE_TIME: Time between health checks.
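As a rough illustration of how these variables could be read and validated, the sketch below builds a typed settings object from a mapping of environment variables. The `Settings` class, the `load_settings` helper, the numeric types, and all sample values are assumptions for illustration and may differ from what the script actually does. BASE_RATE and NODE_MULTIPLIER are left out because their meaning is not described here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    nginx_config_path: str
    file_path: str
    rpc_port: int
    grpc_port: int
    lcd_port: int
    time_before_fallen_behind: float  # unit is an assumption; check the repository
    update_time: float                # unit is an assumption; check the repository


REQUIRED = (
    "NGINX_CONFIG_PATH", "FILE_PATH", "RPC_PORT", "GRPC_PORT",
    "LCD_PORT", "TIME_BEFORE_FALLEN_BEHIND", "UPDATE_TIME",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing early on missing keys."""
    missing = [key for key in REQUIRED if key not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return Settings(
        nginx_config_path=env["NGINX_CONFIG_PATH"],
        file_path=env["FILE_PATH"],
        rpc_port=int(env["RPC_PORT"]),
        grpc_port=int(env["GRPC_PORT"]),
        lcd_port=int(env["LCD_PORT"]),
        time_before_fallen_behind=float(env["TIME_BEFORE_FALLEN_BEHIND"]),
        update_time=float(env["UPDATE_TIME"]),
    )


# Sample values are placeholders, not recommendations.
sample_env = {
    "NGINX_CONFIG_PATH": "/etc/nginx/nginx.conf",
    "FILE_PATH": "nodes.txt",
    "RPC_PORT": "26657",
    "GRPC_PORT": "9090",
    "LCD_PORT": "1317",
    "TIME_BEFORE_FALLEN_BEHIND": "60",
    "UPDATE_TIME": "30",
}
settings = load_settings(sample_env)
print(settings.rpc_port)  # 26657
```

Failing early on a missing key tends to be easier to debug than a health check that silently runs with an empty port or path.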
Create a File with Node URLs
Create a text file with the node URLs. For example, create nodes.txt with one URL per line, and make sure to include the RPC port for each node as well.
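A small validation step can catch entries that lack a port before the health checks start. The sketch below assumes each line is a full URL with a scheme, such as `http://host:port`; the helper name `parse_node_lines` and the example hostnames are hypothetical.

```python
from urllib.parse import urlsplit


def parse_node_lines(lines):
    """Return node URLs, skipping blank lines and rejecting entries without a port."""
    nodes = []
    for raw in lines:
        url = raw.strip()
        if not url:
            continue  # tolerate empty lines in the file
        parts = urlsplit(url)
        if parts.hostname is None or parts.port is None:
            raise ValueError(f"node URL needs a host and a port: {url!r}")
        nodes.append(url)
    return nodes


# Hypothetical file content; replace with your own nodes.
example_lines = [
    "http://node1.example.com:26657\n",
    "\n",
    "http://node2.example.com:26657\n",
]
print(parse_node_lines(example_lines))
```

With a real file, the same function can be called as `parse_node_lines(open("nodes.txt"))`, since a file object yields one line at a time.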
In the AutoHealBot script, "upstream blocks" refer to sections in the NGINX configuration that specify which backend servers handle different types of traffic. This setup divides the backend nodes into separate streams: RPC, gRPC, and LCD. The script checks the health of these nodes and updates the corresponding upstream blocks to reflect the healthy nodes for each stream. It ensures that traffic is routed to servers that are online and functional.
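To illustrate how such an update might look, the following sketch renders one NGINX upstream block per stream from a list of healthy hosts. The upstream names (`rpc_backend`, `grpc_backend`, `lcd_backend`), the host addresses, and the ports are hypothetical, and the script's actual rendering logic may differ.

```python
def render_upstream(name, hosts, port):
    """Render one NGINX upstream block listing host:port for each healthy node."""
    if not hosts:
        # NGINX rejects an upstream block that contains no servers.
        raise ValueError(f"upstream {name!r} would be empty")
    servers = "\n".join(f"    server {host}:{port};" for host in hosts)
    return f"upstream {name} {{\n{servers}\n}}\n"


# Hypothetical healthy nodes and stream names, for illustration only.
healthy_hosts = ["10.0.0.1", "10.0.0.2"]
streams = {"rpc_backend": 26657, "grpc_backend": 9090, "lcd_backend": 1317}
config = "\n".join(
    render_upstream(name, healthy_hosts, port) for name, port in streams.items()
)
print(config)
```

Refusing to render an empty block is a deliberate choice in this sketch: if every node is unhealthy, keeping the previous configuration is usually safer than writing one that NGINX will refuse to load.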
As a reference, the upstream blocks are defined as:
Execute the Script
Run the script to start the asynchronous health checks and NGINX updates:
Troubleshooting
Environment Variables Not Loaded: Ensure your .env file is in the same directory as the script, or specify its path explicitly with dotenv_path.
NGINX Not Reloading: Check that you have the necessary permissions to reload NGINX, and ensure systemctl or other command-line utilities are in your PATH.
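The reload-related checks can be partly automated. The sketch below is a hypothetical preflight helper, not part of the repository: it reports tools that are missing from PATH and whether the current user can write to the NGINX config file. The tool names and the config path are assumptions to adjust for your system.

```python
import os
import shutil


def reload_preflight(config_path, tools=("nginx", "systemctl")):
    """Return a list of human-readable problems that would block an NGINX reload."""
    problems = []
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found in PATH")
    # os.access also returns False when the file does not exist.
    if not os.access(config_path, os.W_OK):
        problems.append(f"no write access to {config_path}")
    return problems


# The path below is a placeholder; use your NGINX_CONFIG_PATH value.
for problem in reload_preflight("/etc/nginx/nginx.conf"):
    print("preflight:", problem)
```

An empty result does not guarantee that the reload succeeds, but a non-empty one points directly at the permission or PATH issue.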
With this setup, the script will run asynchronously, periodically checking node health, updating the NGINX configuration, and reloading the NGINX service as needed.
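The overall loop can be sketched with asyncio as follows. This is a simplified illustration under stated assumptions, not the script's actual code: `check_node` and `apply_healthy` are hypothetical callables you would supply (for example, a health probe and a function that rewrites the config and reloads NGINX), and `max_rounds` exists only so the loop can be stopped in a demo or test.

```python
import asyncio


async def check_all(nodes, check_node):
    """Run check_node concurrently for every node and return the healthy ones."""
    results = await asyncio.gather(
        *(check_node(node) for node in nodes), return_exceptions=True
    )
    # Only an explicit True counts as healthy; exceptions count as unhealthy.
    return [node for node, ok in zip(nodes, results) if ok is True]


async def run_forever(nodes, check_node, apply_healthy, update_time, max_rounds=None):
    """Check all nodes, call apply_healthy when the healthy set changes, then sleep."""
    previous = None
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        healthy = await check_all(nodes, check_node)
        if healthy != previous:
            apply_healthy(healthy)  # e.g. rewrite upstream blocks and reload NGINX
            previous = healthy
        rounds += 1
        await asyncio.sleep(update_time)
```

Calling `apply_healthy` only when the healthy set changes avoids reloading NGINX on every cycle when nothing has happened.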