Split Services Deployment

NOTE: Be sure to first read the General deployment instructions.

These instructions are for splitting the VIAME-Web web and worker services across two VMs: a compute-only web VM for annotations, and a worker VM with one or more GPUs that is turned on only when jobs need to run. This is cost-effective because the (expensive) worker VM incurs compute charges only while it is running. See the DIVE docs for more details. To run VIAME-Web on a single VM, see the Default Deployment.

Create GCP Resources

A split services instance of VIAME-Web requires two VMs: a web VM and a worker VM. The infrastructure of the two VMs should be identical, except that the web node has no GPU, should have a slightly larger disk capacity, and needs SSH AllowTcpForwarding enabled.

Click here to download the Terraform code template for the VMs for a split services deployment. Copy this code into your Terraform config file (e.g., main.tf) and update project-specific values as needed. In the Cloud Shell Terminal, run terraform init, and then terraform apply to create the resources.

Provision GCP VMs

First, we set the variables in Cloud Shell that will be used throughout. Note that both install scripts require the internal IP of the web node.

ZONE=us-east4-c
INSTANCE_NAME_WEB=viame-web-web
INSTANCE_NAME_WORKER=viame-web-worker
REPO_URL=https://raw.githubusercontent.com/us-amlr/viame-web-noaa-gcp/main/scripts
WEB_INTERNAL_IP=$(gcloud compute instances describe $INSTANCE_NAME_WEB --zone=$ZONE --format='get(networkInterfaces[0].networkIP)')
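
Because both install scripts take the web node's internal IP as an argument, it can be worth confirming that the lookup above returned something usable before continuing. The following is a minimal sketch; is_ipv4 is a hypothetical helper name and is not part of the repository scripts.

```shell
# Hypothetical helper (not part of the repo scripts): succeeds only if the
# argument has the shape of a dotted-quad IPv4 address.
# It checks shape only, not that each octet is <= 255.
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Usage in Cloud Shell, after setting WEB_INTERNAL_IP as above:
#   is_ipv4 "$WEB_INTERNAL_IP" || echo "WEB_INTERNAL_IP is empty or malformed" >&2
```

An empty value usually means the instance name or zone does not match an existing VM.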

Web VM

Once the VMs have been created and variables have been set, run the following command to download the install script to the VM, make it executable, and finally run the install script.

gcloud compute ssh $INSTANCE_NAME_WEB --zone=$ZONE \
  --command="curl -L $REPO_URL/dive_install_web.sh -o ~/dive_install_web.sh \
  && chmod +x ~/dive_install_web.sh \
  && ~/dive_install_web.sh $WEB_INTERNAL_IP"

After the install script finishes, restart the VM so that the permissions changes take effect. Then, run the startup script for the web node.

gcloud compute instances stop $INSTANCE_NAME_WEB --zone=$ZONE \
  && gcloud compute instances start $INSTANCE_NAME_WEB --zone=$ZONE

gcloud compute ssh $INSTANCE_NAME_WEB --zone=$ZONE --command="/opt/noaa/dive_startup_web.sh"
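
Right after gcloud compute instances start returns, the VM may not yet be accepting SSH connections, so the first gcloud compute ssh call can fail. One way to handle this is a small retry wrapper, sketched below. The helper name retry and the attempt count and delay in the usage comment are illustrative assumptions, not part of the repository scripts.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between failed attempts. Returns 0 as soon as the command succeeds,
# or 1 if every attempt fails.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage (values are illustrative): wait until SSH answers with a harmless
# command, then run the startup script once.
#   retry 5 10 gcloud compute ssh $INSTANCE_NAME_WEB --zone=$ZONE --command="true" \
#     && gcloud compute ssh $INSTANCE_NAME_WEB --zone=$ZONE --command="/opt/noaa/dive_startup_web.sh"
```

Retrying a no-op command rather than the startup script itself avoids running the startup script more than once.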

Worker VM

Next, provision the worker. As above, the following command downloads the install script to the VM, makes it executable, and then runs the install script. Running this install script may take 10-15 minutes.

gcloud compute ssh $INSTANCE_NAME_WORKER --zone=$ZONE \
  --command="curl -L $REPO_URL/dive_install.sh -o ~/dive_install.sh \
  && chmod +x ~/dive_install.sh \
  && ~/dive_install.sh -w $WEB_INTERNAL_IP"

Because of the permissions changes and the NVIDIA driver installation, the VM must now be restarted. Restart the VM and run the startup script to pull updated files and spin up the VIAME-Web stack:

gcloud compute instances stop $INSTANCE_NAME_WORKER --zone=$ZONE \
  && gcloud compute instances start $INSTANCE_NAME_WORKER --zone=$ZONE

gcloud compute ssh $INSTANCE_NAME_WORKER --zone=$ZONE --command="/opt/noaa/dive_startup_worker.sh"

Access VIAME-Web deployment

See Access VIAME-Web

Web and Worker VM Communication

For the split services to work, the web and worker VMs must be able to communicate. You can confirm this through either the DIVE API (recommended) or the VMs directly. Before testing the connection, be sure that 1) both the web and worker VMs are on and the services have been started (i.e., the startup scripts have been run) and 2) both VMs have your viame network tag applied.
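
The two preconditions that concern the VMs themselves (powered on, network tag applied) can be checked from Cloud Shell. The sketch below is a hedged example: has_tag is a hypothetical helper, and "viame" stands in for whatever network tag your project uses. It does not verify that the startup scripts have been run.

```shell
# Hypothetical helper: succeeds if tag $2 appears in the list $1.
# The list may be separated by ';', ',' or spaces, so the check does not
# depend on which separator the gcloud output format uses.
has_tag() {
  printf '%s\n' "$1" | tr ';, ' '\n\n\n' | grep -Fxq -- "$2"
}

# Usage in Cloud Shell ("viame" is a placeholder for your own tag):
#   for vm in $INSTANCE_NAME_WEB $INSTANCE_NAME_WORKER; do
#     gcloud compute instances describe $vm --zone=$ZONE --format='get(status)'  # expect RUNNING
#     tags=$(gcloud compute instances describe $vm --zone=$ZONE --format='get(tags.items)')
#     has_tag "$tags" viame || echo "$vm is missing the network tag" >&2
#   done
```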

DIVE API

Open the Swagger UI at http://{server_url}:{server_port}/api/v1 (likely http://localhost:8010/api/v1).
Under the 'worker' endpoint, issue a GET /worker/status request.
The 'Response Body' section should be a long list of successful connection attempts. If the 'Response Body' values are null, then there is a communication issue.
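
The same request can in principle be issued from a terminal instead of the Swagger UI. The sketch below only builds the URL from the pattern given above. The curl call in the usage comment assumes that the endpoint accepts a Girder-Token header and that you have a token from your own deployment; both are assumptions, and worker_status_url and GIRDER_TOKEN are hypothetical names.

```shell
# Hypothetical helper: build the worker status URL from a server URL and port,
# following the http://{server_url}:{server_port}/api/v1 pattern above.
worker_status_url() {
  printf 'http://%s:%s/api/v1/worker/status\n' "$1" "$2"
}

# Usage (assumes token-based authentication; GIRDER_TOKEN is a placeholder):
#   curl -s -H "Girder-Token: $GIRDER_TOKEN" "$(worker_status_url localhost 8010)"
```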

Other

SSH into the web VM and check that the VM is listening on at least ports 8010 and 5672. Note that you must have root access to run these commands.

# install net-tools if necessary (provides netstat and ifconfig)
sudo apt install net-tools
# check which ports the VM is listening on;
# should list at least 8010 and 5672 as LISTEN
netstat -plaunt

# Get the internal IP of the web VM from the third output block if needed
ifconfig
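
Scanning the full netstat listing by eye is easy to get wrong, so the check for the two expected ports can be scripted. This is a minimal sketch: missing_ports is a hypothetical helper that reads netstat-style output on standard input and prints any expected port that has no LISTEN entry.

```shell
# Hypothetical helper: read netstat-style output on stdin and print each
# expected port (given as arguments) that has no LISTEN entry.
missing_ports() {
  listing=$(cat)
  for port in "$@"; do
    printf '%s\n' "$listing" | grep 'LISTEN' | grep -Eq "[:.]$port[[:space:]]" \
      || echo "$port"
  done
}

# Usage on the web VM (no output means both ports are listening):
#   sudo netstat -plnt | missing_ports 8010 5672
```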

SSH into the worker VM and check whether it can connect to the web VM on the expected ports. Each command should output a string like Connection to ##.##.##.## 8010 port [tcp/*] succeeded!. If the worker VM cannot connect to the web VM, you will get an 'operation timed out' message instead.

WEB_IP=##.##.##.##
nc -v -w3 $WEB_IP 8010
nc -v -w3 $WEB_IP 5672
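
The two nc checks above can also be wrapped in a loop that returns a non-zero status if any port is unreachable, which is convenient in scripts. This is a sketch under the assumption that the installed nc supports the -z (scan only, send no data) flag; check_ports is a hypothetical helper name.

```shell
# Hypothetical helper: try a TCP connection to each port on a host with nc
# and report the result. Returns 1 if any port could not be reached.
check_ports() {
  host=$1; shift
  failed=0
  for port in "$@"; do
    if nc -z -w3 "$host" "$port" >/dev/null 2>&1; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
      failed=1
    fi
  done
  return "$failed"
}

# Usage on the worker VM:
#   check_ports "$WEB_IP" 8010 5672
```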