Druva Agent Installation and Stuck Backups in MS SQL Cluster Environments

Problem Description

When installing the Druva agent on, or migrating it to, an MS SQL Cluster (Failover Cluster Instance) environment, administrators may see the installation fail or find that newly scheduled backup jobs remain queued indefinitely. The migration must also preserve the historical backup chain created under the previous standalone setup.

Cause

Failures typically arise at two distinct points in the migration:

  1. Installation Failures on Cluster Nodes

    • Stale Remnants: Leftover Druva service configurations and registry entries from a previous standalone deployment cause execution path conflicts on the newly clustered node.

    • Activation Conflicts: Attempting a fresh registration when the asset is already indexed inside the Druva infrastructure triggers "device already exists" blocks.

  2. Backup Jobs Stuck in Queue

    • Broken TLS Chain: The local SQL plugin relies on mutual TLS verification to send jobs over port 8082 to the local Falcon agent. A migration can break the certificate trust chain, producing an "x509: certificate signed by unknown authority" error that causes job submissions to fail.

Traceback

Review the main_mssql_service.log file located at C:\ProgramData\Druva\EnterpriseWorkloads\logs\mssql\ and confirm that it contains the following error signature:

level=info filename=commandhandler.go:512 message="Pending jobs Enqueue" JobID=[Job_ID] JobType=backup
level=error filename=commandhandler.go:282 message="failed to enqueue job at falcon" Error="Post \"https://localhost:8082/falcon/v1/jobs/enqueue\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
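The error signature above can also be searched for programmatically. The following is a minimal Python sketch, assuming the log uses the key=value layout shown in the excerpt; the function names parse_fields and find_tls_enqueue_failures are hypothetical helpers and are not part of any Druva tooling.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a bare token without whitespace.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

# Error text taken from the log excerpt in this article.
X509_MARKER = "x509: certificate signed by unknown authority"


def parse_fields(line: str) -> dict[str, str]:
    """Split one key=value log line into a dict, removing surrounding quotes."""
    fields = {}
    for key, raw in PAIR_RE.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def find_tls_enqueue_failures(lines) -> list[dict[str, str]]:
    """Return the parsed error entries whose Error field contains the x509 marker."""
    hits = []
    for line in lines:
        fields = parse_fields(line)
        if fields.get("level") == "error" and X509_MARKER in fields.get("Error", ""):
            hits.append(fields)
    return hits
```

Passing an open file handle for main_mssql_service.log to find_tls_enqueue_failures returns one entry per failed enqueue. An empty result after the fix suggests the error is no longer being logged.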

Resolution

Follow these two implementation phases sequentially to remediate the nodes and restore cluster backup pipelines.

Phase 1: Fix Installation Issues & Drive Mapping

  1. Remove Stale Services: Open an elevated Command Prompt (Run as Administrator) on the affected cluster node and remove the legacy workload service entry:

    sc delete Druva-EnterpriseWorkloadsSVC

  2. Clean the Registry: Open regedit, export a backup of the key below, and then delete the residual registry keys left by the old standalone deployment under:

    • HKEY_LOCAL_MACHINE\SOFTWARE\Druva

  3. Configure the Shared Directory Junction: To ensure backup metadata moves seamlessly during node failovers, the local Druva directory must resolve to your shared cluster resource. Map a directory junction pointing to your shared drive volume:

    mklink /J "C:\ProgramData\Druva" "<Drive_Letter>:\SharedProgramData"

    ⚠️ Note: Replace <Drive_Letter>:\SharedProgramData with the absolute path of your configured Shared Cluster Volume. mklink cannot create the junction if a file or directory already exists at C:\ProgramData\Druva.

  4. Re-register the Server Node: Generate a fresh re-registration token from your Druva Management Console. Apply this token during agent activation to cleanly link the cluster node to your pre-existing asset configuration, thereby safely maintaining historical backup sets.
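Before re-registering, it can help to confirm that the junction from step 3 resolves to the shared volume. The following is a minimal Python sketch, assuming Python 3.8 or later (where os.path.realpath resolves junctions on Windows); resolves_to is a hypothetical helper name.

```python
import os


def resolves_to(link_path: str, expected_target: str) -> bool:
    """Return True if link_path exists and resolves to the same directory as expected_target."""
    if not (os.path.exists(link_path) and os.path.exists(expected_target)):
        return False
    # realpath follows symbolic links and, on Windows with Python 3.8+, junctions.
    resolved_link = os.path.realpath(link_path)
    resolved_target = os.path.realpath(expected_target)
    return os.path.samefile(resolved_link, resolved_target)
```

For example, resolves_to(r"C:\ProgramData\Druva", r"<Drive_Letter>:\SharedProgramData") is expected to return True on the node that currently owns the shared volume, with the placeholder replaced by your actual path.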

Phase 2: Repair Certificate Trust Chains

  1. Clear Stuck Queues: Log into the Druva Cloud Console, navigate to the affected MS SQL resource, and manually Cancel any active or queued backup jobs currently trapped in a pending state.

  2. Regenerate Certificates: Access the active cluster node, open the Services Management console (services.msc), and restart the Druva-EnterpriseWorkloads Service.

    • Why this works: Restarting the service makes the local components discard their stale TLS state, regenerate their certificates, and re-establish the trust chain used for internal communication.
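After the restart, a quick way to see whether anything is listening on the local job port is a plain TCP connection test. The following is a minimal Python sketch, assuming the Falcon agent listens on localhost port 8082 as described in the Cause section; port_accepts_connections is a hypothetical helper name. It only checks that the port accepts connections and does not validate the certificate chain, so the service log remains the authoritative check.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling port_accepts_connections("localhost", 8082) on the active node and getting False suggests the service has not finished starting or is not listening on that port.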

Verification and Validation

Confirm healthy cluster operations by completing the following tests:

  1. Failover Resiliency

    • Action Method: Initiate a manual failover moving ownership to a secondary cluster node.

    • Target Success Condition: The Druva Enterprise Workload service switches dynamically and initializes cleanly on the target node.

  2. Ad-Hoc Backup Pipeline

    • Action Method: Force manual Full and Incremental backup tasks from the console.

    • Target Success Condition: Tasks pass the staging phase instantly and run through to a Success state without queuing blocks.

  3. Log Health Audit

    • Action Method: Re-inspect the main_mssql_service.log path.

    • Target Success Condition: The log confirms active queues dispatching with code 200 OK responses instead of TLS handshake errors.
