Clustering and Failover

A Flux cluster is a group of engines that communicate with one another, both directly over the network and through a shared database, to run workflows as efficiently as possible and to provide a failover mechanism in case one or more engines go down. In a cluster there is no single point of failure: if an engine goes down, any workflows that were running on it fail over and begin executing on another engine in the cluster.

Configuring the Cluster

To set up two or more Flux engines as a cluster, you only need to configure each engine to use the same database tables. The rest of the cluster configuration is handled automatically; as long as all of the engines share the same set of database tables, no further work is needed to enable clustering. You can optionally enable cluster networking to support load balancing, but it is not required for failover.

Engines that do not use the same database tables will not attempt to cluster or communicate with one another.
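
For example, a two-engine cluster is formed simply by pointing both engines at the same database. The sketch below uses a key=value engine configuration file with illustrative property names for the database connection; the exact keys may differ in your installation, so verify them against your engine configuration reference.

    # engine-config.properties (database property names are illustrative)
    # Every engine that uses these same tables automatically joins the same cluster.
    DATABASE_URL=jdbc:postgresql://dbhost:5432/flux
    DATABASE_USER=flux
    DATABASE_PASSWORD=secret
    TABLE_PREFIX=FLUX_

A second engine started on another machine with an identical database configuration joins the cluster automatically; an engine pointed at a different database or table prefix will not.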

Maximum Cluster Size

Flux supports clusters of up to 35 engines.

In some cases, disabling cluster networking gives the best performance. Disabling it may be appropriate if your cluster satisfies all of the following criteria:

  • The cluster contains 5 or more engines
  • 50,000 or more workflows run across the entire cluster per day
  • All engines in the cluster are expected to operate at maximum concurrency

In a cluster like this, disabling cluster networking can increase throughput and performance; the advantages of cluster networking, such as load balancing, matter less when the cluster is running at maximum capacity at all times.
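
As a sketch, disabling cluster networking from the engine configuration file might look like the following. The option name shown is an assumption; confirm the exact key for your version in the engine configuration reference.

    # engine-config.properties (option name assumed for illustration)
    # The engines still cluster and fail over through the shared database,
    # but network-based load balancing is no longer performed.
    CLUSTER_NETWORKING=false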

Load Balancing

Flux engines in the cluster will work together to evenly distribute workflows. Load balancing is provided in one of two ways:

  • By default, workflows that are ready to execute will be assigned to the engine in the cluster that is running the smallest number of workflows. This allows workflows to be spread evenly across the cluster and ensures that no engines are starved of new workflows to execute.
  • If the system resource monitor is enabled (by configuring the SYSTEM_RESOURCE_COMMAND and SYSTEM_RESOURCE_INTERVAL engine configuration options), workflows will be assigned to the engine with the most resources available. This allows workflows to be assigned to the engine that is under the least stress and ensures that the workflows will run on the engine that is most likely to have sufficient resources available.

For more information on configuring the system resource monitor, refer to the example at /examples/software_developers/system_resource_monitoring (while this example uses Java code to configure and start the engine, the process for setting these engine configuration options is the same if you are configuring the engine from a file).
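
As an illustration, the system resource monitor can also be enabled from the engine configuration file. The option names below come from this section; the command path and interval are placeholder values.

    # engine-config.properties
    # Command that reports the engine host's available system resources
    # (placeholder path), refreshed every minute (placeholder interval).
    SYSTEM_RESOURCE_COMMAND=/opt/flux/bin/report_resources.sh
    SYSTEM_RESOURCE_INTERVAL=+1m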

Load balancing is not available if the engines cannot communicate across the network (for example, if a firewall is blocking traffic between the engines).

Concurrent Actions within a Workflow

A Flux cluster can also execute the actions within a workflow across the cluster. If a workflow contains a split (or, in general, if the workflow contains two or more actions that will run at the same time), those actions may run on different engines in the cluster. Because each flow of execution updates the database with all necessary information as it runs, every engine in the cluster can see the full state of the branched workflow.

In short, even if multiple actions within a workflow run on different engines, there will be no differences in behavior or execution of the workflow. This clustering behavior allows actions to execute more quickly and reliably across the cluster without impacting the operation of your engine or workflow.

Clustering and the Network

Clustered engines communicate automatically across the network. If the engines are not able to use the network (for example, if a firewall is blocking traffic between the engines), they will continue to communicate through the database, but load balancing will not be available.

Port Usage

Cluster networking is performed using the port specified by the PORT engine configuration setting. By default, engines use port 7520 for all communications, including clustering.
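
For example, the port can be set explicitly in the engine configuration file (the value shown is the default):

    # engine-config.properties
    # All engine communications, including clustering, use this port.
    PORT=7520

If a firewall sits between engines in the cluster, traffic on this port must be allowed for cluster networking and load balancing to work.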

Clustering and the Database

For Flux clustering to work correctly, the database must be accessible. If an engine loses its connection to the database, it will not be able to participate in the cluster, even if network communication is still available.

Failover

If an engine in the cluster fails, the other engines in the cluster will cooperate to recover and resume any workflows that were running on the failed engine. An engine is considered to have failed any time it becomes inaccessible (the machine crashes, the engine shuts down unexpectedly, database connectivity is lost, etc.).

When a workflow is failed over to another engine instance, it will roll back to the last transaction break it reached successfully and begin running from that point.

A workflow will only be failed over if it was running an action when its engine failed. A workflow that was waiting on a trigger is not affected by failover, since a workflow that is waiting for an event is not yet associated with a particular engine and can still run on any engine in the cluster.

The Failover Time Window

FAILOVER_TIME_WINDOW is an engine configuration property, specified as a time expression, that is used to determine how frequently an engine should check for failed workflow instances (workflows that were running on engines that have become inactive).

When a workflow is running, it maintains a heartbeat in the database to let all of the engines in the cluster know that the workflow is still active. A workflow updates its heartbeat three times during the failover time window, so the heartbeat interval is one-third of the failover time window (failover time window / 3).

By default, the failover time window is +3m, or three minutes. This means three things:

  1. The engine will check every 3 minutes for workflows that were running on failed engines.
  2. Any workflow running on the engine will update its heartbeat every 1 minute (one-third of the failover time window).
  3. If the last heartbeat was recorded more than three minutes ago, the workflow is failed over to the other engines in the cluster.
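
As a sketch, shortening the failover time window in the engine configuration file might look like this; +2m is only an example value, and the same one-third rule determines the heartbeat interval:

    # engine-config.properties
    # Check for failed workflows every 2 minutes; running workflows
    # update their heartbeats every 40 seconds (+2m / 3).
    FAILOVER_TIME_WINDOW=+2m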

To prevent workflows from being prematurely failed over, make sure that the following items are consistent across all engines in the cluster:

  • All engines use the same failover time window. If different engines have different time windows, workflows might not update their heartbeats frequently enough to satisfy all of the failover time windows in the cluster, even though they are actually running correctly.
  • All engines have their system clocks synchronized. The heartbeat is recorded to the database using the system clock of the engine where the workflow is running, so if the clocks in the cluster are not synchronized, the heartbeat can appear older than it actually is.

We recommend using NTP (Network Time Protocol) to synchronize the system clocks across the cluster.

Failover Time Window in a Non-Clustered Environment

The FAILOVER_TIME_WINDOW is still consulted even if your engine is not running in a cluster. When an engine recovers after shutting down or crashing, it will still wait for the duration of the FAILOVER_TIME_WINDOW before attempting to reclaim and execute any workflows that were running when the engine stopped.

The engine must wait before reclaiming its workflows because it cannot guarantee that it is the only engine using the database; the failover time window acts as a safeguard in case another engine instance has already claimed or begun running the failed-over workflows.

Disabling Failover

There is no built-in way to disable failover in Flux. You can, however, set the failover time window to a very large value (like +10y), which effectively prevents Flux from failing over workflows. In general, the lower the failover time window, the faster workflows are failed over; be aware, though, that a value that is too low causes false alarms and workflows that are failed over prematurely.

Make sure that the value you set is a valid time expression; setting something like ‘0’ would cause an engine configuration error.
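
A minimal sketch of this workaround in the engine configuration file:

    # engine-config.properties
    # Effectively disables failover: a workflow is not considered failed
    # until its heartbeat is more than ten years old.
    # The value must be a valid time expression; a bare '0' causes a configuration error.
    FAILOVER_TIME_WINDOW=+10y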