Troubleshooting
Where to find the Flux logs
When using the default Flux installation:
- The engine logs can be found in the logs folder of your Flux installation folder. The logs are named flux-<server name>-dd-MMM-yyyy.log.
- The operations console logs are also located in the logs folder of your Flux installation folder. The logs are named opsconsole-dd-MMM-yyyy.log.
Engine or JVM crashes unexpectedly
Most engine crashes occur due to a bug in the Java Virtual Machine (JVM). To find the known bugs for your Java environment, you can search Sun’s bug database for your version of Java.
When a JVM bug causes the Flux engine to crash, a message is written to standard error (stderr) and a fatal error crash report file is written to the file system. This crash report is normally saved in the temp directory (/tmp on Linux / Unix, or C:\Windows\temp on Windows); if you cannot find the crash report file there, it may also be located in the working directory of the JVM (usually the directory from which the JVM or Flux engine was started).
Another common cause of crashes, particularly when the crash occurs after the environment has been running for a long period of time, is that the environment has run out of resources. If the crash has no apparent cause (no error message is logged or written to stderr), we recommend monitoring the JVM closely on a daily basis to watch memory usage and see whether it increases in the time leading up to the crash. Although Java typically writes an error message to stderr in the event of a memory problem, this is not always reliable, because memory problems are known to cause unexpected behavior.
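One simple way to watch memory usage over time is the jstat tool that ships with the JDK (the process ID and sampling interval below are illustrative):

jps -l
jstat -gcutil <pid> 10000

The first command lists the running JVMs and their process IDs; the second prints heap and garbage collection utilization for the chosen process every 10 seconds (10000 milliseconds).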
If the crash occurs in a long-running environment that was previously stable, and does not appear to be caused by a JVM bug or a memory problem, email support@flux.ly for further assistance, taking special care to note any recent changes to the environment, including:
- System upgrades
- Java upgrades
- Other software upgrades
- Device driver upgrades
- Command line argument changes
- Changes to application code (in particular, any calls to the Java method System.exit())
- Additional load recently added to the Flux system
- Java library upgrades
For memory usage problems, we also recommend enabling the JVM's heap-dump-on-out-of-memory configuration parameter:
-XX:+HeapDumpOnOutOfMemoryError
This will generate a heap dump if the JVM runs out of memory, which can be useful for the Flux team in debugging what caused the application to exceed its memory limitations.
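For example, an engine started with the following JVM options will write a heap dump to the given directory when an OutOfMemoryError occurs (the dump path is illustrative; the rest of your usual startup command is unchanged):

java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/tmp/flux-dumps \
     <your usual Flux engine startup arguments>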
My engine or workflow is stuck in a certain state and does not respond to commands
If an engine or workflow is unresponsive, it is typically because the MAX_CONNECTIONS parameter in the engine configuration is not large enough to accommodate the number of users or client connections for the engine. Every client or user who connects to Flux may need a database connection available (as well as a certain number for workflows and background tasks), so the MAX_CONNECTIONS parameter must be high enough to accommodate all of the potential database connections the engine might require. For more information on setting an ideal MAX_CONNECTIONS parameter, see Max Connections and Concurrency Level.
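As a sketch, assuming your engine is configured through a Java properties file (the file name, location, and value below are illustrative; see Max Connections and Concurrency Level for guidance on choosing a value):

MAX_CONNECTIONS=50

Size this value to cover every concurrent client or user connection, plus the connections needed for running workflows and background tasks.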
If your MAX_CONNECTIONS is configured correctly, or you are certain that the engine has not reached its total number of available database connections, and your engine is still unresponsive, gather as much of the following information as you are able and send it in an email to support@flux.ly for further assistance:
- A log file from the engine, preferably at the FINEST level. A general timestamp of when you started experiencing these issues can be helpful as well.
- A thread dump from the JVM where the engine is running, taken at least 5 minutes after the engine or workflow becomes unresponsive (see the example after this list).
- The full contents of the FLUX_READY table in your database from the time the unresponsiveness occurred.
- The version of Flux that you are running (if you aren’t sure, run the command “java -jar flux.jar” and copy the output).
- Copies of the workflows, if any, that have become stuck.
- A database deadlock report from the database, showing any deadlocks or deadlocked transactions that occurred around the time the engine/workflow became stuck.
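A thread dump can be captured with the jps and jstack tools that ship with the JDK (the process ID and output file name below are illustrative):

jps -l
jstack -l <pid> > flux-thread-dump.txt

The first command lists the running JVMs and their process IDs; the second writes a thread dump, including lock information, for the chosen process to a file.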
Cannot contact a remote engine from a Linux system
This problem is a known Linux and Java issue; it is not specific to Flux. In general, when a Java application tries to look up a remote object on a Linux computer, the remote reference returned to the Java application may contain a reference to 127.0.0.1 (localhost) instead of the remote computer's actual (routable) IP address and host name.
Very likely, the first entry in your Linux system’s /etc/hosts file matches, or is similar to, the following line:
127.0.0.1 localhost
There may be additional lines below this line that specify other IP addresses and host names. However, if the very first line is similar to the above line, this Linux/Java problem can occur.
To resolve this problem, move that first line further down in your /etc/hosts file, below the line that lists your computer's real (routable) IP address and host name.
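For example, a corrected /etc/hosts might look like the following, where the IP address and host names are placeholders for your machine's real entries:

192.168.1.25    myhost.example.com    myhost
127.0.0.1       localhost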
Workflow fires twice
If a database deadlock occurs while your workflow is firing, the database transaction is rolled back and tried again automatically. At this point, your workflow will fire again but only because it did not run to completion successfully the last time it fired. This behavior is normal.
You can completely eliminate the possibility of “your workflow firing twice” by tying your workflow’s work into the same database connection that your Flux workflow uses. That way, if Flux’s database connection rolls back, your workflow’s work rolls back too, and there is no harm done.
You can also tie the Flux database connection into your work by using an XA resource or an XA database connection. Again, if the Flux database connection rolls back, your workflow’s work rolls back too — no harm done.
mmiVerifyTpAndGetWorkSize: stack_height=2 should be zero; exit
You may see the above message on your console. It is a harmless message emitted directly from the IBM Java Runtime Environment (JRE) or the JRE’s Just In Time (JIT) Compiler.
I see database deadlocks! What is wrong?
Probably nothing. Database deadlocks are a normal part of any database application and occur even in normally functioning software.
If a database deadlock occurs while a workflow is running, Flux rolls back the current database transaction and automatically retries the flow chart. No administrative action is required.
If a deadlock occurs while using the Flux Designer, you must manually retry the GUI action that you attempted.
Once your flow chart is successfully added to the engine, database deadlocks do not require any action on your part.
In general, row-level locking is preferred in databases, because it minimizes the opportunity for deadlock and connection timeouts. If possible, enable row-level locking at the database level.
If you see more than an average of one deadlock per hour, or if you can reproduce a deadlock regularly by following a well-defined sequence of steps, contact Flux Support at support@flux.ly with an explanation of the deadlock situation. We will work with you to try to reduce the deadlocks to an average of less than one per hour.
EJB 2.0 restriction on Flux client calls
If you are using EJB 2.0 and are making client calls into a Flux engine from your EJB, Flux will not operate properly if the calling EJB has Container Managed Transactions (CMT) enabled. This issue occurs because the EJB 2.0 specification does not allow other applications (in this case, a Flux engine) to look up user transactions while CMT is enabled. Flux engines utilize user transactions in order to allow them to participate in EJB transactions.
The workaround for this issue is to either configure your EJB 2.0 beans to use Bean Managed Transactions (BMT) or simply have your beans use EJB 1.1.
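For example, a session bean can be switched to Bean Managed Transactions in its EJB 2.0 deployment descriptor (ejb-jar.xml) using the transaction-type element; the bean name below is hypothetical, and the other required elements are omitted:

<session>
  <ejb-name>MyFluxClientBean</ejb-name>
  <transaction-type>Bean</transaction-type>
</session>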
As a first troubleshooting step, make sure the Flux engine's loggers are enabled. These loggers record useful information about the state of Flux and running flow charts. By default, loggers write their logging information to the console (standard out, or stdout), but they can be redirected to other destinations.
Workflows firing late or failing to fire after an engine restart
If some of your previously scheduled workflows seem to fire very late, if at all, after you restart your engine, the cause may be that the engine was not properly disposed. If the engine process is terminated before the shutdown fully completes, some of your workflows may be left in the FIRING state. In this case, when your engine restarts, it leaves these workflows alone, assuming they are being fired by a second, clustered engine instance. After a few minutes, as governed by the FAILOVER_TIME_WINDOW configuration parameter, your engine instance will fail over these workflows, and they will begin running again.
To avoid this delay, be sure to shut down cleanly by calling engine.dispose(). Alternatively, configure your engine to run standalone rather than as part of a cluster. To ensure that an engine runs standalone, it must be the only engine pointed at its set of database tables (clustering is enabled by pointing multiple engines at the same database tables).
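As a minimal Java sketch, assuming you already hold a reference to a running flux.Engine instance (how the engine is created and configured is omitted), a JVM shutdown hook can help ensure dispose() is called even when the process is stopped externally:

import flux.Engine;

public class FluxShutdownHook {

    // Registers a shutdown hook that disposes the engine when the JVM exits,
    // so workflows are not left in the FIRING state after an abrupt stop.
    public static void registerDisposeOnShutdown(final Engine engine) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                engine.dispose();
            } catch (Exception e) {
                // At shutdown, a failed dispose() is usually only worth logging.
                e.printStackTrace();
            }
        }));
    }
}

Note that a shutdown hook only runs on a normal JVM exit; if the process is killed forcibly, stuck workflows will still be recovered through failover as described above.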
Flux reporting a lack of CPU or memory resources available when it runs embedded in my application
Because Flux runs inside the same JVM as the rest of your application, if parts of your application exhaust database, memory, or virtual machine resources, this excessive consumption of resources may be revealed in Flux. If a lack of resources is reported by Flux, it does not necessarily imply that Flux itself leaked or consumed these resources. Other parts of your application residing in the same JVM may have consumed most or all of these resources.
For example, suppose you call a Flux engine method and an SQLException is thrown, indicating that the database has run out of database cursors. This SQLException does not necessarily imply that Flux is leaking database resources. It may mean that other parts of the application are leaking database resources and that the leak was merely exposed by a call to Flux.