Network design

This document describes how our network is designed.

Audience

This document is for technically inclined people who stake with us and are curious about how we design things, and for people who wish to use our VaaS (Validator as a Service) platform.

Description

TL;DR: at any point in time, we maintain at least two servers for every validator node.

We have been running production workloads for more than a decade, and hard-earned experience tells us that if something can go wrong, it will go wrong at some point. Duplicating resources has an obvious cost: hosting a single validator node costs at least twice as much. We aim to provide a quality service, so we do not save money by cutting corners or by falling short of hardware specs.

For this reason, we deploy every node with a backup node, so that if a server or an entire data centre suffers an outage, the backup nodes take over. The primary and backup data centres are located in geographically distinct regions to avoid localised problems.
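The takeover rule described above can be sketched in a few lines. This is a minimal illustration only, not our production tooling: the `Server` type, the `healthy` flag, and the `pick_active` function are hypothetical names, and the sketch assumes the health of each server is already known.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Server:
    """Hypothetical description of one machine hosting a validator node."""
    name: str
    datacentre: str
    healthy: bool


def pick_active(primary: Server, backup: Server) -> Optional[Server]:
    """Return the server that should run the node, preferring the primary.

    Returns None when both servers are down, which is the case the
    spare (tertiary) setup exists for.
    """
    if primary.healthy:
        return primary
    if backup.healthy:
        return backup
    return None
```

For example, if the server in Datacentre 1 is marked unhealthy, `pick_active` returns the server in Datacentre 2.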

In case a primary node, a backup node, or an entire data centre fails, we also keep a spare setup. Its nodes are not actively deployed, but we are automation experts and can deploy replacements in minutes. To respond to such failures with enough elasticity, these tertiary nodes are hosted by cloud providers rather than classic datacentres. To increase our resilience further, the spare cloud setup is on a different continent, so it shares no infrastructure with either the primary or the backup datacentre.
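As a rough illustration of when a spare would be deployed, the sketch below lists the nodes that have lost one of their two servers. It shows one possible policy, assuming a replacement is wanted whenever a node is no longer running on both its primary and its backup server. The function name and the shape of the `status` mapping are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_spare_deployments(status: Dict[str, Tuple[bool, bool]]) -> List[str]:
    """List the nodes that need a replacement from the spare cloud setup.

    `status` maps a node name to (primary_up, backup_up). A node needs a
    replacement as soon as either of its two servers is down, so that it
    is again backed by at least two servers.
    """
    return sorted(
        node
        for node, (primary_up, backup_up) in status.items()
        if not (primary_up and backup_up)
    )
```

Given `{"node-0": (True, True), "node-1": (False, True)}`, only `node-1` is returned, because `node-0` still has both of its servers.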

We use different vendors for the locations where our nodes are deployed (Datacentre 1 and Datacentre 2) and where they are ready to deploy (Cloud Region 1). This ensures that a vendor making overnight changes to its acceptable use policy can’t turn off both the primary and the backup servers at the same time, which would be a risk with a single-vendor deployment. The Datacentre 1 and 2 vendors may be either cloud or bare metal vendors.
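The two placement rules, no shared vendor and a spare setup on its own continent, can be expressed as a simple check. This is a hedged sketch: the `Site` type, the role names, and the vendor and continent values used with it are placeholders, not a description of our actual vendors.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Site:
    """Hypothetical record for one location in the layout."""
    role: str       # "primary", "backup" or "tertiary"
    vendor: str     # placeholder vendor name
    continent: str  # placeholder continent name


def check_layout(sites: List[Site]) -> List[str]:
    """Return a list of rule violations; an empty list means the layout passes."""
    problems = []
    vendors = [site.vendor for site in sites]
    if len(set(vendors)) != len(vendors):
        problems.append("two sites share a vendor")
    deployed = [site for site in sites if site.role != "tertiary"]
    for spare in (site for site in sites if site.role == "tertiary"):
        if any(spare.continent == site.continent for site in deployed):
            problems.append("tertiary site shares a continent with a deployed site")
    return problems
```

A layout with three distinct vendors and the tertiary site on its own continent yields an empty list.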

Network diagram

Our simplified network diagram for Elrond validator nodes looks like this:

Primary Region - Active (Datacentre 1)
    Primary Server 1: node-0, node-1, node-2, node-3, node-4
    Primary Server 2: node-5, node-6, node-7, node-8, node-9

Secondary Region - Backup (Datacentre 2)
    Secondary Server 1: node-0, node-1, node-2, node-3, node-4
    Secondary Server 2: node-5, node-6, node-7, node-8, node-9

Tertiary Region - Automation Ready (Cloud Region 1)
    node-0, node-1, node-2, node-3, node-4, node-5, node-6, node-7, node-8, node-9

The number of nodes per machine in the diagram above is for illustration only. The actual number of node services on a server depends on that server's hardware specifications.
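To make the sizing idea concrete, here is a minimal sketch of how a per-server node count could be derived from hardware specifications and then used to lay out nodes. The per-node CPU and RAM requirements are parameters you supply; the figures in the usage note are made up for illustration and are not real validator requirements. Both function names are hypothetical.

```python
from typing import Dict, List


def node_capacity(cpu_cores: int, ram_gb: int,
                  cores_per_node: int, ram_gb_per_node: int) -> int:
    """Number of node services a server can host, limited by its scarcest resource."""
    return min(cpu_cores // cores_per_node, ram_gb // ram_gb_per_node)


def assign_nodes(total_nodes: int, capacities: Dict[str, int]) -> Dict[str, List[str]]:
    """Fill servers in order with consecutive node names (node-0, node-1, ...)."""
    layout: Dict[str, List[str]] = {}
    next_id = 0
    for server, capacity in capacities.items():
        count = min(capacity, total_nodes - next_id)
        layout[server] = [f"node-{i}" for i in range(next_id, next_id + count)]
        next_id += count
    if next_id < total_nodes:
        raise ValueError("not enough capacity for all nodes")
    return layout
```

With made-up figures of 20 cores and 64 GB of RAM per server, and 4 cores and 8 GB per node, `node_capacity` gives 5 nodes per server. Calling `assign_nodes(10, {"Primary Server 1": 5, "Primary Server 2": 5})` then reproduces the split shown in the diagram, with node-0 to node-4 on the first server and node-5 to node-9 on the second.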