Troubleshooting

Connection issues

Data/Forwarder node status shown in gray in the web console

First, check whether the Control Node can reach the web server of the Data/Forwarder node. Sign in to the Logpresso shell on the Control Node, and then run the tcpscan command with the IP address of the Data/Forwarder node and the web server port, as shown below.

# tcpscan <Data/Forwarder node IP> 8443
tcpscan 203.0.113.161 8443

Diagnosing firewall policy issues

Depending on the command output, there may be a firewall policy issue or a federation communication port issue as described below.

timeout

If timeout is displayed some time after you run the tcpscan command, check every firewall policy on the communication path from the Control Node to the Data/Forwarder node. Also check the node's local firewall policy with the firewall-cmd --list-all command.

trying to connect /203.0.113.161:8443
timeout

not opened: Connection refused

If the not opened: Connection refused message is displayed, the federation communication port is not open. Run the httpd.bindings command in the Logpresso shell on the Data/Forwarder node to recheck the port settings. When it is working properly, the output looks like the following.

# httpd.bindings command output
/0.0.0.0:8443 (ssl: key logpresso-web, trust null), opened, default context: webconsole, idle timeout: 0seconds, log file prefix: null, access log: false, error log: false

If the port is not listed as opened, run the httpd.openSsl command as shown below to open the federation communication port.

# httpd.openSsl <port> <context> <key alias>
httpd.openSsl 8443 webconsole logpresso-web
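
The tcpscan outcomes described in this section (timeout, not opened, and opened) each point to a different cause. The following is a minimal sketch that maps pasted tcpscan output to a one-word diagnosis. The function name classify_tcpscan and the diagnosis labels are hypothetical, and the sketch assumes that a successful scan prints a line containing the word opened.

```shell
# Reads tcpscan output on stdin and prints a one-word diagnosis.
# "not opened" is tested first because it also contains the word "opened".
classify_tcpscan() {
  out=$(cat)
  case "$out" in
    *"not opened"*) echo port-closed ;;   # federation port is not bound
    *timeout*)      echo firewall ;;      # packets are dropped on the path
    *opened*)       echo reachable ;;     # assumed wording for a successful scan
    *)              echo unknown ;;
  esac
}
```

Paste the output copied from the Logpresso shell into the function, for example with printf or from a file, and act on the printed label.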

Diagnosing web server certificate issues

If the tcpscan <Data/Forwarder node IP> 8443 command shows opened but the Data/Forwarder node connection status is still shown in gray in the web console, the SSL certificate or the policy synchronization password is set incorrectly. To determine the cause, connect to port 8443 of the node from a web browser on your operations terminal. If ERR_SSL_PROTOCOL_ERROR is displayed as shown below, it is a certificate issue.

Data/Forwarder node web server certificate issue

If it is a web server certificate issue on the Data/Forwarder node, run the master node connection setup again with the sonar.setMaster command. The command downloads and installs the certificate while communicating with the master node. Enter the following values when you run the sonar.setMaster command.

host? 203.0.113.193    # Virtual IP address of the Control Node pair
port? 8443             # Enter 8443
account? root          # Enter the Control Node federation account root
password?              # Enter the password of the Control Node federation account
connect timeout? 10000 # Press Enter to use the default value
read timeout? 10000    # Press Enter to use the default value
secure? true           # Enter true (the default is false)
skip cert check? true  # Enter true (the default is false)

Diagnosing policy synchronization password issues

If the ENT web console screen is displayed properly when you connect to port 8443 of the Data/Forwarder node from a web browser, try signing in on this screen with the federation account name and password. If you cannot sign in, run the dom.resetPassword localhost root command in the Logpresso shell on the Data/Forwarder node to reset the password. Then enter the new password in the Password fields under Node A Settings and Node B Settings on the System > Clusters > Node screen of the web console.

Failure to connect from a Data/Forwarder node to the Control Node RPC

Diagnosing the RPC connection status

Run the following commands in the Logpresso shell on the Data/Forwarder node to check whether the connection to the Control Node RPC port works.

forwarder.connections # For a Forwarder node
sentry.connections    # For a Data node. Can also be run on a Forwarder node

The output should list connections like the following.

Connections
--------------------
[c1a] id=1317075310, peer=(39c2dd55-5bb5-4497-a327-ee6f8cae9ad9, /203.0.113.194:7140), trusted level=Low, ssl=true, props={phase=post_hello, ping_failure=0, type=command}

If the output does not list an RPC connection to the Control Node IP address like the one shown above, it may be a firewall policy issue or an issue with the certificate used for TLS mutual authentication.
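
When the connection list is long, the check for a Control Node peer can be done with a text search. The following is a minimal sketch, assuming the command output has been copied from the Logpresso shell into a file. The function name has_peer and the file name connections.txt are hypothetical.

```shell
# Succeeds when the connection listing on stdin contains a peer at IP $1.
# The leading "/" and trailing ":" keep 203.0.113.19 from matching 203.0.113.194.
has_peer() {
  grep -F -q "/$1:"
}

# Example with a hypothetical file that holds the pasted command output:
#   has_peer 203.0.113.194 < connections.txt || echo "no RPC connection to the Control Node"
```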

Diagnosing firewall policy issues

First, in the Logpresso shell on the Data/Forwarder node, enter the Control Node IP address and RPC port as shown below to check connectivity.

# tcpscan <Control Node IP> 7140
tcpscan 203.0.113.193 7140

If timeout is printed after a certain amount of time as shown below, check every firewall policy on the communication path from the Data/Forwarder node to the Control Node.

trying to connect /203.0.113.193:7140
timeout

Diagnosing SSL certificate issues

In the Logpresso shell on the Data/Forwarder node, run the logger.tail command or view the /opt/logpresso/log/araqne.log file to check the daemon log.

Certificate password error

If a keystore password was incorrect error occurs as shown below, the certificate password is incorrect.

[2025-01-30 09:24:10.812]  WARN (KeyStoreManagerImpl) - getKeyStore() error:
java.io.IOException: keystore password was incorrect
        at java.base/sun.security.pkcs12.PKCS12KeyStore.engineLoad(PKCS12KeyStore.java:2116)
        at java.base/sun.security.util.KeyStoreDelegator.engineLoad(KeyStoreDelegator.java:222)
        at java.base/java.security.KeyStore.load(KeyStore.java:1479)
        at org.araqne.keystore.KeyStoreManagerImpl.getKeyStore(KeyStoreManagerImpl.java:298)
        at org.araqne.keystore.KeyStoreManagerImpl.getKeyManagerFactory(KeyStoreManagerImpl.java:414)
        at org.araqne.rpc.RpcKeyStoreManagerImpl.__M_getKeyManagerFactory(RpcKeyStoreManagerImpl.java:62)
        at org.araqne.rpc.RpcKeyStoreManagerImpl.getKeyManagerFactory(RpcKeyStoreManagerImpl.java)
        at org.logpresso.sentry.impl.ConnectionWatchdogImpl.__M_connect(ConnectionWatchdogImpl.java:216)
        at org.logpresso.sentry.impl.ConnectionWatchdogImpl.connect(ConnectionWatchdogImpl.java)
        at org.logpresso.sentry.impl.ConnectionWatchdogImpl.__M_checkConnections(ConnectionWatchdogImpl.java:171)
        at org.logpresso.sentry.impl.ConnectionWatchdogImpl.checkConnections(ConnectionWatchdogImpl.java)
        at org.logpresso.sentry.impl.ConnectionWatchdogImpl.__M_checkNow(ConnectionWatchdogImpl.java:149)
        at org.logpresso.sentry.impl.ConnectionWatchdogImpl.checkNow(ConnectionWatchdogImpl.java)
        at org.logpresso.sentry.impl.ConnectionWatchdogImpl.__M_run(ConnectionWatchdogImpl.java:123)
        at org.logpresso.sentry.impl.ConnectionWatchdogImpl.run(ConnectionWatchdogImpl.java)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.security.UnrecoverableKeyException: failed to decrypt safe contents entry: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
        ... 16 more

Untrusted certificate error

If a No trusted certificate found error occurs as shown below, the certificate used is not the one issued by the Control Node but the one automatically generated when the daemon first started.

[2025-01-30 09:28:14.623] ERROR (ConnectionWatchdogImpl) - logpresso-sentry: failed to connect, closing connection (No trusted certificate found)
[2025-01-30 09:28:14.624] ERROR (RpcHandler) - araqne rpc: ssl handshake exception from x.x.x.x:7140, channel 2ca2e90a (No trusted certificate found)
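
Both error signatures can also be searched for directly in the log file from an operating system terminal. The following is a minimal sketch. The function name classify_rpc_log and the printed labels are hypothetical; the default path is the log file named in this section.

```shell
# Prints which of the two certificate errors appear in the daemon log.
# $1: log file path (defaults to the araqne.log path used in this guide)
classify_rpc_log() {
  log=${1:-/opt/logpresso/log/araqne.log}
  if grep -F -q 'keystore password was incorrect' "$log"; then
    echo 'certificate password error'
  fi
  if grep -F -q 'No trusted certificate found' "$log"; then
    echo 'untrusted certificate error'
  fi
}
```

The function prints nothing when neither message is present, which suggests looking at the firewall policy first.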

For both errors, run the master node connection setup again with the sonar.setMaster command. The command downloads and installs the certificate while communicating with the master node. Enter the following values when you run the sonar.setMaster command.

host? 203.0.113.193    # Virtual IP address of the Control Node pair
port? 8443             # Enter 8443
account? root          # Enter the Control Node federation account root
password?              # Enter the password of the Control Node federation account
connect timeout? 10000 # Press Enter to use the default value
read timeout? 10000    # Press Enter to use the default value
secure? true           # Enter true (the default is false)
skip cert check? true  # Enter true (the default is false)

Diagnosing Syslog reception

If Syslog is not collected properly through a Forwarder node even after you configure a collector in the web console, diagnose the issue as described below. The following example assumes that packets are sent from the IP address 172.20.100.100.

Forwarder node trace

  1. Connect to the Forwarder node over SSH, and then run the following command to enter the Logpresso shell.

    ssh -p7022 root@localhost
    
  2. Run the syslog.servers command to list the Syslog server settings.

    Syslog Servers
    ----------------
    [logpresso] 0.0.0.0:514 (udp), charset=UTF-8 (override: 0), capacity=20000, rx_buf_size=0, receiver_cpu_id=-1, queue_count=1, buffer_file_path=./, buffer_file_size=10737418240, start from=2024-12-11 13:22:13, received=7
    
  3. Use the syslog.stats logpresso command to view reception statistics by Syslog client IP address.

    Syslog Statistics
    -------------------
    x.x.x.x => 1 (first seen 2025-01-20 10:58:44, last seen 2025-01-20 10:58:44)
    
  4. Use the syslog.trace logpresso command to trace the content of incoming Syslog packets in real time. Press Ctrl+C during the trace to stop it.

Checking Syslog packets

You can verify whether packets reach the Logpresso Forwarder node by using the tcpdump command in the terminal as shown below.

# tcpdump -i <interface> host <Syslog client IP address> port <Forwarder node syslog port> -A
tcpdump -i any host 172.20.100.100 port 514 -A

Checking whether the port is open
  1. Once you confirm that Syslog packets are reaching the Forwarder node, run the netstat -na | grep :514 command to check whether the port is open.

    # netstat -na | grep :514
    udp        0      0 0.0.0.0:514             0.0.0.0:*
    
  2. If the port does not appear in the output as shown above, the JVM permission granting step may have been skipped during installation, so the port could not be opened, or the port settings may have changed. Run the following command to check whether cap_net_bind_service is printed.

    # getcap <java executable path>
    getcap /opt/logpresso/jdk/bin/java
    
  3. If the permission is not granted, use the setcap command to grant the required permissions to the java executable.

    # setcap cap_net_bind_service,cap_sys_time,cap_net_raw=+ep <java executable path>
    setcap cap_net_bind_service,cap_sys_time,cap_net_raw=+ep /opt/logpresso/jdk/bin/java
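
Steps 2 and 3 above can be combined into a single check. The following is a minimal sketch, assuming that getcap prints nothing when the file has no capabilities. The function name has_bind_cap is hypothetical.

```shell
# Succeeds when the getcap output passed as $1 lists cap_net_bind_service.
has_bind_cap() {
  case "$1" in
    *cap_net_bind_service*) return 0 ;;
    *) return 1 ;;
  esac
}

# On the Forwarder node:
#   has_bind_cap "$(getcap /opt/logpresso/jdk/bin/java)" \
#     || echo "capability missing: run setcap as described in step 3"
```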
    
Checking the host firewall policy

If there is no blocked segment on the connection path from the Syslog client to the Forwarder node, but Syslog packets do not reach the Forwarder node at all, run the firewall-cmd --list-ports command to recheck the host firewall policy.

# firewall-cmd --list-ports
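
To avoid reading the port list by eye, the output can be checked for the Syslog port. The following is a minimal sketch. The function name port_listed is hypothetical, and the sketch matches exact entries only, so a port covered by a range or by a firewalld service is not detected.

```shell
# Succeeds when "$2" (for example 514/udp) is one of the space-separated
# entries in "$1", the output of `firewall-cmd --list-ports`.
# Port ranges such as 500-600/udp are not expanded in this sketch.
port_listed() {
  for entry in $1; do
    if [ "$entry" = "$2" ]; then
      return 0
    fi
  done
  return 1
}

# On the Forwarder node (firewalld, administrator privileges):
#   port_listed "$(firewall-cmd --list-ports)" 514/udp || echo "514/udp is not allowed"
```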

Checking the rp_filter setting

If the firewall policy and the port open status are all normal but you cannot confirm reception with the syslog.trace command in the Logpresso shell, the issue may be with the Linux kernel's Reverse Path Filtering setting.

The Linux kernel's Reverse Path Filtering feature validates the source address of each incoming packet to block spoofed packets. The default is 1, which drops a packet when the interface it arrived on is not the interface the kernel would use to route back to the packet's source address.

If the Forwarder node has multiple network interface cards and packets from the same source can arrive through more than one of them, change the rp_filter setting to 2 (loose).

Checking the current setting

  1. Run the cat /proc/sys/net/ipv4/conf/<interface>/rp_filter command to check the rp_filter setting value.

    # cat /proc/sys/net/ipv4/conf/eth2/rp_filter # command output
    1
    
    • 0 (disabled): Does not perform Reverse Path Filtering
    • 1 (default, strict): Checks the path of the interface based on the packet's source IP address, and drops the packet if it does not match
    • 2 (loose): Allows a packet with the same source IP address if it has a valid path through any interface
  2. If the value is 1, add the following setting to the end of the /etc/sysctl.conf file.

    net.ipv4.conf.eth2.rp_filter = 2

  3. Run the following command with administrator privileges to apply the kernel setting.

    sysctl -p
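
When you verify the result, note that the kernel applies the larger of the conf/all/rp_filter value and the per-interface value during source validation. The following is a minimal sketch that computes the effective mode from the two values. The function name effective_rp_filter is hypothetical, and eth2 is the example interface used above.

```shell
# Prints the effective rp_filter mode for an interface.
# $1: value of conf/all/rp_filter, $2: value of conf/<interface>/rp_filter
effective_rp_filter() {
  all=$1
  iface=$2
  if [ "$all" -gt "$iface" ]; then
    echo "$all"
  else
    echo "$iface"
  fi
}

# On the Forwarder node:
#   effective_rp_filter "$(cat /proc/sys/net/ipv4/conf/all/rp_filter)" \
#                       "$(cat /proc/sys/net/ipv4/conf/eth2/rp_filter)"
```

A result of 2 means loose mode is in effect for the interface even if the conf/all value is still 1.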
    

MariaDB

Galera Cluster cannot be restarted

If the cluster does not come back up after you stop and restart all Galera Cluster servers, and the database does not start even with the systemctl start mariadb command, restart MariaDB in the following order.

  1. (Control Node A) Edit the /var/lib/mysql/grastate.dat file with administrator privileges.

    # GALERA saved state
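    version: 2.1
    uuid:    5022f7e5-281a-11e8-98c9-9baa762d13e6
    seqno:   -1
    safe_to_bootstrap: 1
    
    • Change the safe_to_bootstrap value from 0 to 1 so that this node is allowed to bootstrap the cluster. The example above shows the file after the change.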
  2. (Control Node A) Restart the Galera Cluster server.

    sudo galera_new_cluster
    
  3. (Control Node B) Run the following command to start MariaDB, and check whether wsrep_start_position has the same value as Control Node A.

    sudo systemctl start mariadb && \
    ps -ef | grep mysql
    

    Running this command produces a result similar to the following.

    mysql    1141195       1  1 Apr15 ?        22:50:53 /usr/sbin/mariadbd --wsrep_start_position=c6609d0e-091a-11f0-86bb-3e7cc9ee21e7:40512
    
    Caution
    Node B joins an already-started Galera Cluster, so you must not run the galera_new_cluster command. The galera_new_cluster command must be run only on the first node to start in the cluster.
    Note
    If the wsrep_start_position value is "00000000-0000-0000-0000-000000000000", run the systemctl restart mariadb command.
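
The wsrep_start_position check in step 3 can be scripted. The following is a minimal sketch. The function names wsrep_position and is_zero_position are hypothetical, and the usage comment filters on the mariadbd process name shown in the sample output above.

```shell
# Prints every wsrep_start_position value found in `ps -ef` output on stdin.
wsrep_position() {
  sed -n 's/.*--wsrep_start_position=\([^ ]*\).*/\1/p'
}

# Succeeds when position $1 still carries the all-zero UUID.
is_zero_position() {
  case "$1" in
    00000000-0000-0000-0000-000000000000*) return 0 ;;
    *) return 1 ;;
  esac
}

# On Control Node B ('[m]ariadbd' keeps the grep process itself out of the result):
#   pos=$(ps -ef | grep '[m]ariadbd' | wsrep_position)
#   if is_zero_position "$pos"; then sudo systemctl restart mariadb; fi
```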

Diagnosing OOM and memory leaks

Generating a heap dump

If an OutOfMemoryError (OOM) occurs on a node and stops the logpresso process, or a memory leak is suspected, you can generate a heap dump to diagnose the cause.

Caution
The size of the heap dump file can be similar to the JVM heap size. Make sure there is enough free disk space in the save path.

  1. Add the following lines to the end of the /opt/logpresso/etc/logpresso.conf file.

    JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
    JAVA_OPTS="$JAVA_OPTS -XX:HeapDumpPath=/data/heapdump.hprof"
    
  2. Restart the logpresso process.

    sudo systemctl restart logpresso
    
  3. When an OOM occurs afterward, a heap dump file is automatically generated at the specified path. Run the following command to check the file.

    ls -lh /data/heapdump.hprof
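
The free-space requirement in the caution above can be checked before you enable the option. The following is a minimal sketch. The function names free_kb and has_room_for_dump are hypothetical, and the heap size in GiB is a value you supply from your own JVM settings.

```shell
# Prints the free space, in KiB, of the filesystem that holds directory $1.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when directory $1 has at least $2 GiB free.
has_room_for_dump() {
  need_kb=$(( $2 * 1024 * 1024 ))
  [ "$(free_kb "$1")" -ge "$need_kb" ]
}

# Example, assuming a 16 GiB heap and /data as the dump directory:
#   has_room_for_dump /data 16 || echo "not enough free space in /data"
```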
    

Collecting data for failure analysis

  1. Run the jcmd command to find the PID of the java process running araqne-core. The following is an example of the command output.

    3370022 /logpresso/araqne-core-4.0.5-package.jar
    2428126 jdk.jcmd/sun.tools.jcmd.JCmd
    
  2. Generate diagnostic files for the identified PID.

    # Run jcmd with the PID identified in step 1.
    # You can change the paths where the files are saved.
    # Save all files to a directory with sufficient free space.
    
    # Class histogram (jmap-style)
    jcmd 3370022 GC.class_histogram > /data/histogram_yyMMdd.txt
    
    # JFR recording (the file is created after 60 seconds)
    jcmd 3370022 JFR.start duration=60s settings=profile filename=/data/jfr_yyMMdd.jfr
    
    # Thread dump (jstack-style)
    jcmd 3370022 Thread.print > /data/jstack_yyMMdd.txt
    
    # Heap dump
    # The file can be as large as the Java heap. Save it to a directory with ample free space if possible.
    jcmd 3370022 GC.heap_dump -all=true /data/heapdump_yyyyMMdd.hprof
    
  3. Export the generated files for analysis.

Note
If you use OpenJDK distributed as a ZIP file, the java path registered in the PATH environment variable may differ from the one that runs Logpresso Sonar. Run the 'ps -ef | grep java' command first to check the java file path used to run Logpresso Sonar.
If that path is not the one registered in PATH, move to the directory that contains the java file and prefix each command with './' to run the commands above.
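
Steps 1 and 2 above can be partly scripted so that the PID and the date stamp are not typed by hand. The following is a minimal sketch. The function names araqne_pid and dump_path are hypothetical, and the yyMMdd placeholder is assumed to mean the two-digit year, month, and day.

```shell
# Prints the PID of the araqne-core process from the `jcmd` listing on stdin.
araqne_pid() {
  awk '/araqne-core/ { print $1; exit }'
}

# Builds an output path such as /data/jstack_250130.txt.
# $1: directory, $2: file prefix, $3: extension, $4: date stamp (default: today as yyMMdd)
dump_path() {
  printf '%s/%s_%s.%s\n' "$1" "$2" "${4:-$(date +%y%m%d)}" "$3"
}

# On the node:
#   pid=$(jcmd | araqne_pid)
#   jcmd "$pid" Thread.print > "$(dump_path /data jstack txt)"
```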