
Best practice and pitfalls to avoid for vCenter SFTP file-based backups

  • Mark
  • May 4
  • 5 min read

Updated: May 20

We all know that backing up your vCenters using the built-in file-based backup is critical for recovery. I wanted to share a few important pitfalls I came across recently, in case it helps anyone else who runs into them.


Do not back up multiple vCenters to the same directory at the same start time


Following the recent addition of some new vCenters, we started to see an unusual increase in failures of the associated file-based backups. In this particular environment there were 4 vCenters and an SDDC Manager backing up to the same SFTP server. All 5 appliances were configured to back up to the same directory at exactly the same start time.


My first thought was that the SFTP server was probably running low on space and that the intermittent failures were due to fluctuating free space, but this was quickly ruled out. My second thought was that this could be an issue with the number of concurrent backups being written to the SFTP server, but that was ruled out too. We then enabled debug logging on the Linux-based SFTP server and discovered something very interesting.


When multiple vCenters are configured to back up to the same directory, they create their own subdirectories for separation but share the same top-level directory for the temp files used during the backup job. For example, if vCenter1 starts its backup job at 23:59, it creates a timestamped directory and cfg file in the top-level directory (e.g. backup_20260325-2359/fbbr_write235912.cfg). If vCenter2 then starts backing up to the same path at the same time (and, most importantly, in the same second), it begins referencing the same file vCenter1 is using. vCenter1 then deletes the file, causing vCenter2's backup job to fail. The more vCenters that are present, the more likely this file collision is to occur.

Prior to the increase in vCenters, we would see the odd failure, but not enough to be a concern. It was only when the number of vCenters increased from 2 to 4 that we noticed the obvious rise in failures, which led us to delve deeper.
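To make the collision concrete, here is a minimal Python sketch. It assumes the temp file name is derived only from the backup start time (directory to the minute, file to the second). That is inferred from the example path above, not from any VMware documentation, and the function name and appliance names are hypothetical.

```python
from datetime import datetime

def temp_cfg_path(base_dir: str, started: datetime) -> str:
    """Build a temp cfg path shaped like the example path in this post.

    Assumption: the name depends only on the start time, not on which
    vCenter is writing. This is inferred from the observed path.
    """
    folder = started.strftime("backup_%Y%m%d-%H%M")
    name = started.strftime("fbbr_write%H%M%S.cfg")
    return f"{base_dir}/{folder}/{name}"

start = datetime(2026, 3, 25, 23, 59, 12)

# Shared top-level directory: both appliances resolve to the same file.
shared = {vc: temp_cfg_path("/home/vmware_backup/vCenter", start)
          for vc in ("vcenter1", "vcenter2")}

# Dedicated subdirectory per appliance: the paths can no longer clash.
dedicated = {vc: temp_cfg_path(f"/home/vmware_backup/vCenter/{vc}", start)
             for vc in ("vcenter1", "vcenter2")}

print(shared["vcenter1"] == shared["vcenter2"])        # True
print(dedicated["vcenter1"] == dedicated["vcenter2"])  # False
```

With a shared directory, whichever appliance finishes with the file first removes it from under the other. With per-appliance directories the two paths differ even when the start second is identical.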


Extract from SFTP Server debug log

Here's an extract from the actual debug log (anonymised). At the time of a failed backup, we can see two different vCenters opening a session in the same second. One vCenter then deletes a temp file that the other vCenter is still referencing. The second vCenter then also tries to delete the file but fails because it no longer exists, resulting in the No such file error and, ultimately, a failed backup.


  • Mar 25 23:59:14 sftp-server.acme.com sftp-server[88136]: session opened for local user vmware_backup from [10.0.0.80]

  • Mar 25 23:59:14 sftp-server.acme.com sftp-server[88136]: remove name "/home/vmware_backup/vCenter/backup_20260325-2359/fbbr_write235912.cfg"

  • Mar 25 23:59:14 sftp-server.acme.com sftp-server[88139]: session opened for local user vmware_backup from [10.0.0.7]

  • Mar 25 23:59:14 sftp-server.acme.com sftp-server[88139]: remove name "/home/vmware_backup/vCenter/backup_20260325-2359/fbbr_write235912.cfg"

  • Mar 25 23:59:14 sftp-server.acme.com sftp-server[88139]: sent status No such file
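If you want to check a larger debug log for the same pattern, a rough Python sketch such as the following could help. It only understands the two line shapes shown in the extract above (session opened, remove name), and the function name and regular expressions are my own assumptions, so treat it as a starting point.

```python
import re
from collections import defaultdict

# The anonymised extract from this post, used as sample input.
LOG = """\
Mar 25 23:59:14 sftp-server.acme.com sftp-server[88136]: session opened for local user vmware_backup from [10.0.0.80]
Mar 25 23:59:14 sftp-server.acme.com sftp-server[88136]: remove name "/home/vmware_backup/vCenter/backup_20260325-2359/fbbr_write235912.cfg"
Mar 25 23:59:14 sftp-server.acme.com sftp-server[88139]: session opened for local user vmware_backup from [10.0.0.7]
Mar 25 23:59:14 sftp-server.acme.com sftp-server[88139]: remove name "/home/vmware_backup/vCenter/backup_20260325-2359/fbbr_write235912.cfg"
Mar 25 23:59:14 sftp-server.acme.com sftp-server[88139]: sent status No such file
"""

OPENED = re.compile(r"sftp-server\[(\d+)\]: session opened for local user \S+ from \[([^\]]+)\]")
REMOVED = re.compile(r'sftp-server\[(\d+)\]: remove name "([^"]+)"')

def find_collisions(log_text: str) -> dict[str, list[str]]:
    """Return {path: [client IPs]} for paths removed by more than one client."""
    client_by_pid: dict[str, str] = {}
    removers: dict[str, list[str]] = defaultdict(list)
    for line in log_text.splitlines():
        if m := OPENED.search(line):
            # Remember which client IP owns this sftp-server process ID.
            client_by_pid[m.group(1)] = m.group(2)
        elif m := REMOVED.search(line):
            pid, path = m.groups()
            removers[path].append(client_by_pid.get(pid, f"pid {pid}"))
    return {p: ips for p, ips in removers.items() if len(set(ips)) > 1}

collisions = find_collisions(LOG)
for path, clients in collisions.items():
    print(path, "removed by", clients)
```

Any path reported here was deleted by two different appliances, which is the signature of the collision described above.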


Remediation: I couldn't find this particular issue referenced in the official VMware documentation, but to prevent it, the guidance here is to use a dedicated subdirectory path for each vCenter backup to enforce separation. This allows multiple backups to run at the same time without error. Alternatively, stagger the start times at 1-minute intervals, whilst ensuring the 5-minute window requirement described in the next section is still met.
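As an illustration of the remediation, the sketch below builds a simple plan that gives each appliance its own subdirectory and, optionally, a start time offset by one minute. The appliance names and the helper function are hypothetical; the base path is the one from the log extract.

```python
from datetime import datetime, timedelta

def plan_backups(appliances, base_dir, first_start, step_minutes=1):
    """Give each appliance its own subdirectory and a staggered start time."""
    plan = []
    for i, name in enumerate(appliances):
        start = first_start + timedelta(minutes=i * step_minutes)
        plan.append((name, f"{base_dir}/{name}", start.strftime("%H:%M")))
    return plan

# Hypothetical appliance names for a 4 x vCenter + SDDC Manager environment.
appliances = ["vcenter1", "vcenter2", "vcenter3", "vcenter4", "sddc-manager"]
plan = plan_backups(appliances, "/home/vmware_backup/vCenter",
                    datetime(2026, 3, 25, 23, 55))

for name, path, start in plan:
    print(f"{start}  {name:<13} {path}")
```

With five appliances at 1-minute intervals, the first and last start times are 4 minutes apart, so the plan stays inside a 5-minute window.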


 

All vCenters (and SDDC Manager, if VCF) in a Linked Mode environment must be scheduled to start their backups within a 5-minute window

 

Whilst troubleshooting the issue above, I came across a few customer environments where this guidance wasn't being followed. This matters in recovery situations, because not following it increases the risk of failing to recover a working environment.


Broadcom guidance states: “You must configure the backup jobs for SDDC Manager and all vCenter instances to start within the same 5-minute window”.


Again, I couldn't find this explicitly referenced in the official vSphere docs, but it is referenced in the official VCF docs: VCF 5.2 Backup Docs and VCF 9.0 Backup Docs.


Risks: If you deviate from this guidance, a DR scenario may require SDDC Manager and all vCenters in a Linked Mode configuration to be restored. If the recovery points of these components differ by more than the 5-minute window, the risk of failing to recover from a DR situation increases.
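A quick way to sanity-check a set of schedules against this requirement is sketched below. The schedule values are hypothetical, and two choices are my own assumptions: start times exactly 5 minutes apart are treated as outside the window (the stricter reading), and schedules that straddle midnight are not handled.

```python
from datetime import datetime, timedelta

def within_window(start_times, window=timedelta(minutes=5)):
    """True if the earliest and latest start times are less than `window` apart.

    Times are "HH:MM" strings on the same day; a schedule that straddles
    midnight is not handled by this sketch.
    """
    parsed = [datetime.strptime(t, "%H:%M") for t in start_times]
    return max(parsed) - min(parsed) < window

# Hypothetical schedules for a Linked Mode environment.
good = {"vcenter1": "23:55", "vcenter2": "23:56", "sddc-manager": "23:59"}
bad = {"vcenter1": "23:00", "vcenter2": "23:30", "sddc-manager": "23:59"}

print(within_window(good.values()))  # True
print(within_window(bad.values()))   # False
```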


Ensure your SFTP server is optimally configured for vCenter backups


Another scenario I came across recently was a further case of intermittent failures for multiple vCenters in a VCF environment, this time with a different cause. Initially we dismissed the odd failure as a transient issue, but the more vCenters were added, the more failures we saw.


The backup log is located on the VCSA under /var/log/vmware/applmgmt/backup.log. On examining this file we could see associated errors such as "Cannot write: Broken pipe" and, more revealing in this case, "Failure establishing ssh session: -43, Failed getting banner\n". This is typically seen when the vCenter fails to complete the SSH connection request to the SFTP target, but why was this happening?


First, some background: when a VCSA starts a file-based backup, by default it runs in Parallel mode (no, I didn't know this either). This means a single VCSA attempts to establish multiple concurrent SSH connections to the SFTP target in order to carry out its backup. A single VCSA on its own is manageable in most scenarios, but when multiple vCenters try to back up to the same SFTP target at the same start time, some of those backups can fail. Through SSHD debug logging we found that the failures were occurring because some of the vCenters were hitting the limit for concurrent SSH connection requests on the SFTP server. As a result, connections were being dropped, resulting in backup failures.


A sample from the debug log showed:

May 19 23:59:45 sftp-server.somecompany.com sshd[130682]: drop connection #11 from [172.18.123.12]:41170 on [172.25.1.10]:22 past MaxStartups
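To see which appliances are being turned away, and how often, you could count these messages per source IP. The sketch below assumes the log lines look exactly like the sample above; the regular expression and function name are my own and may need adjusting for other sshd versions.

```python
import re
from collections import Counter

# Matches the "drop connection ... past MaxStartups" shape shown above.
DROP = re.compile(
    r"sshd\[\d+\]: drop connection #(\d+) from \[([^\]]+)\]:\d+ .* past MaxStartups"
)

def drops_by_client(lines):
    """Count dropped connection attempts per client IP."""
    counts = Counter()
    for line in lines:
        m = DROP.search(line)
        if m:
            counts[m.group(2)] += 1
    return counts

sample = [
    "May 19 23:59:45 sftp-server.somecompany.com sshd[130682]: drop connection #11 "
    "from [172.18.123.12]:41170 on [172.25.1.10]:22 past MaxStartups",
]

for client, count in drops_by_client(sample).items():
    print(client, count)
```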


The number of concurrent unauthenticated SSH connections on a Linux SFTP server is controlled by the MaxStartups parameter in sshd_config (as highlighted in the error above), which defaults to 10:30:100. With this setting, up to 10 pending connections are always accepted. Once 10 are pending, sshd starts refusing new attempts at random with a probability of 30%, and that probability rises as more connections pile up, until at 100 pending connections every new attempt is refused (until the number drops back below that threshold). This explains the intermittent nature of the failures.
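The sshd_config man page describes the three MaxStartups values as start:rate:full, with the refusal probability beginning at rate% once start unauthenticated connections are pending and increasing linearly to 100% at full. The sketch below models that description with integer percentages. It is my own approximation for illustration, not code taken from OpenSSH.

```python
def drop_percent(unauthenticated: int, start: int = 10,
                 rate: int = 30, full: int = 100) -> int:
    """Approximate chance (in %) that sshd refuses a new connection.

    `unauthenticated` is the number of connections that are currently
    open but have not finished authenticating.
    """
    if unauthenticated < start:
        return 0
    if unauthenticated >= full:
        return 100
    # Linear ramp from `rate` at `start` up to 100 at `full`.
    return rate + (100 - rate) * (unauthenticated - start) // (full - start)

for n in (5, 10, 55, 100):
    print(n, drop_percent(n))
# 5 0
# 10 30
# 55 65
# 100 100
```

Because refusals are random once the start threshold is reached, the same backup schedule can succeed one night and fail the next.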


The suggestion from VMware support was to switch all vCenter backups to Serial mode, but rather than reconfigure every vCenter (obviously not ideal) we decided it made more sense to increase the MaxStartups parameter on the SFTP server to a more suitable value. But what value do you need? This is where the documentation comes in. This page in the Broadcom documentation states "Backup servers must support a minimum of 10 simultaneous connections for each vCenter Server", so if, for example, you have 5 vCenters backing up to the same SFTP target at the same time, you would need to set MaxStartups to 50:30:100 to accommodate the higher number of connections. This detail is often missed, and it is not an obvious cause of intermittent failures.
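The sizing rule above is simple arithmetic, sketched here as a small helper that renders the sshd_config line. The helper name is hypothetical, and raising the third value whenever the first would exceed it is my own assumption for larger estates; it is not something the Broadcom documentation states.

```python
def max_startups_line(vcenters: int, per_vcenter: int = 10,
                      rate: int = 30, full: int = 100) -> str:
    """Render an sshd_config MaxStartups line sized for N vCenters.

    Uses the documented minimum of 10 simultaneous connections per vCenter.
    Assumption: `full` is raised if needed so it is never below `start`.
    """
    start = vcenters * per_vcenter
    full = max(full, start)
    return f"MaxStartups {start}:{rate}:{full}"

print(max_startups_line(5))  # MaxStartups 50:30:100
```

The resulting line goes into sshd_config on the SFTP server, and sshd needs to be reloaded or restarted before the new value takes effect.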


Conclusion

 

Whilst vCenter backups are typically straightforward, most of us have seen transient errors where file-based backups randomly fail for no apparent reason. Hopefully the above guidance will help alleviate some of these scenarios and ensure more reliable backups for your vCenters.

 
 
 
