Brought to you by Access Group

Hidden VMware Snapshots

Friday 23 October 2015 09:46 : Simon Greaves

From time to time you may find that a snapshot removal fails and that the Delete All option does not work. You are left with a virtual machine that is running from its snapshot disks, while vCenter believes that the virtual machine has no snapshots.
What does this mean and how can I avoid it?  Well first let me explain how the VMware snapshot process works and what should happen.

How Snapshots Work

A snapshot of a virtual machine is a point-in-time image of its current state and data. The state is the virtual machine's current power state, and the data is made up of all the files that make up the virtual machine, including memory, disk, network cards, USB devices and so on.

A snapshot can be created through the vSphere Client or the vSphere Web Client by right-clicking a virtual machine and selecting Snapshot > Create Snapshot. You are then presented with the following options.

Name – Name for the snapshot.
Description – Description of the snapshot.
Snapshot the virtual machine’s memory – All the memory in active use on the virtual machine is written to a memory dump file (vmsn file) that is included in the snapshot.
Quiesce guest file system (Needs VMware Tools installed) – The quiescing process tells the operating system to write transactions out of the memory buffers and in-memory cache to the disk so that the virtual machine can have a consistent state that can be recovered from.

When the snapshot is created, an additional disk called a child disk or delta disk is added to the virtual machine. It consists of two files, <vm-name>-<number>.vmdk and <vm-name>-<number>-delta.vmdk.
The <vm-name>-<number>-delta.vmdk file is a hidden file that will not show up in the datastore browser. You can however view this by connecting to the ESXi host either through SSH or through the vMA (vSphere Management Assistant). Here is an example of the same datastore location through a remote SSH connection.
Snapshot child disks are sparse disks that use a copy-on-write mechanism: data is written to the child disk only when a block changes, so existing data is not replicated. This means that the child delta disks can save quite a bit of space.

In the illustration below the hashed blocks represent changed data blocks and the white blocks represent empty space due to the sparse layout of the disks.
Some additional files are created with the snapshot: the virtual machine snapshot database <vm-name>.vmsd and the virtual machine memory state file <vm-name>.vmsn. The snapshot database file <vm-name>.vmsd contains the snapshot information and is where the Snapshot Manager gets its information from. It is a plain-text file that can prove useful when troubleshooting snapshot issues.

Here is an output of the snapshot .vmsd file associated with the example virtual machine.

.encoding = "UTF-8"
snapshot.lastUID = "1"
snapshot.current = "1"
snapshot0.uid = "1"
snapshot0.filename = "Demo-VM01-Snapshot1.vmsn"
snapshot0.displayName = "Demo-Snapshot01"
snapshot0.description = "Example Snapshot"
snapshot0.type = "1"
snapshot0.createTimeHigh = "316405"
snapshot0.createTimeLow = "-1275531695"
snapshot0.numDisks = "1"
snapshot0.disk0.fileName = "Demo-VM01.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot.numSnapshots = "1"
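
When troubleshooting, it can help to read the .vmsd file programmatically instead of by eye. The following is a minimal sketch that assumes the file follows the simple key = "value" layout shown above. The function names `parse_vmsd` and `list_snapshots` are hypothetical helpers and not part of any VMware tool.

```python
# Sketch: read a .vmsd file (key = "value" lines) and list its snapshots.
# parse_vmsd and list_snapshots are illustrative helper names only.

SAMPLE_VMSD = '''.encoding = "UTF-8"
snapshot.lastUID = "1"
snapshot.current = "1"
snapshot0.uid = "1"
snapshot0.filename = "Demo-VM01-Snapshot1.vmsn"
snapshot0.displayName = "Demo-Snapshot01"
snapshot0.description = "Example Snapshot"
snapshot0.numDisks = "1"
snapshot0.disk0.fileName = "Demo-VM01.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot.numSnapshots = "1"
'''


def parse_vmsd(text):
    """Return the key/value pairs of a .vmsd file as a dict of strings."""
    entries = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank or malformed lines
        entries[key.strip()] = value.strip().strip('"')
    return entries


def list_snapshots(entries):
    """Return (display name, memory file, parent disk files) per snapshot."""
    count = int(entries.get("snapshot.numSnapshots", "0"))
    snapshots = []
    for i in range(count):
        prefix = f"snapshot{i}."
        num_disks = int(entries.get(prefix + "numDisks", "0"))
        disks = [entries[f"{prefix}disk{d}.fileName"] for d in range(num_disks)]
        snapshots.append(
            (entries[prefix + "displayName"], entries[prefix + "filename"], disks)
        )
    return snapshots


for name, memory_file, disks in list_snapshots(parse_vmsd(SAMPLE_VMSD)):
    print(name, memory_file, disks)
```

If the Snapshot Manager shows no snapshots but the virtual machine directory still contains delta disks, an empty result here is consistent with the orphaned snapshot situation described later in this article.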

The snapshot operations are controlled through the VMware API using the following calls.

CreateSnapshot - Creates the snapshot. This is labelled as ‘Take Snapshot‘ in the vSphere Client.
RemoveSnapshot – Removes the snapshot and deletes the associated <vm-name>-<number>.vmdk and <vm-name>-<number>-delta.vmdk disks. This is labelled as ‘Delete’ in Snapshot Manager in the vSphere Client.
RevertToSnapshot – This option takes the running state of the virtual machine back to the state of the last snapshot and changes made since are lost.  You can save the current state of the virtual machine by taking another snapshot should you need to revert back to the currently active state of the virtual machine.  This is labelled as ‘Go to‘ in Snapshot Manager in the vSphere Client.
RemoveAllSnapshots – This option removes all the snapshots by writing the active state of the child disk into the parent disk. Prior to vSphere 4 Update 2, if there are multiple snapshots and thus multiple child disks, each child disk writes its contents into its parent disk all the way up the chain until every child disk has written its changes into its parent. At this point all the child disks are deleted.

Think about what that means for a second: if you have lots of large snapshots, you will also need to ensure there is enough free space to accommodate these snapshots during the RemoveAllSnapshots process.

As an example, let's say that your virtual machine has four snapshots, which are left in place whilst some work is carried out on the server, and these snapshots grow in size as follows.

Original disk – 100GB
Snapshot one – 10GB
Snapshot two – 20GB
Snapshot three – 10GB
Snapshot four – 20GB

When the RemoveAllSnapshots API is called the four snapshots would roll up, so four would roll into three, then three into two, then two into one and finally one into the original disk.  What was originally a 100GB virtual machine disk is suddenly a machine with a potential size requirement of 240GB!

Thankfully that is no longer the case with vSphere 4 Update 2 or later. The change made was that the snapshots roll up starting with the disk closest to the original, so snapshot one rolls into the original disk, then two into the original disk, then three and finally four. This means that not only is space saved during RemoveAllSnapshots, but data is also written only once rather than repeatedly during each snapshot roll-up.
This is labelled as ‘Delete All‘ in Snapshot Manager in the vSphere Client.
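
The roll-up arithmetic in the example above can be sketched in a few lines. This is a worst-case estimate under two stated assumptions: no changed blocks overlap between snapshots, and the original disk is fully allocated, so it does not grow. The 240GB figure is read here as the original disk plus the grown sizes of snapshots one to three. That reading is an interpretation, and counting the untouched snapshot four as well would give 260GB.

```python
# Sketch: worst-case datastore usage for the two RemoveAllSnapshots behaviours.
# Assumes no overlapping blocks and a fully allocated original disk.

def chained_rollup_sizes(snapshots_gb):
    """Size each child can grow to when every child rolls into its parent.

    Child i absorbs everything newer than it, so it can reach the sum of
    its own size and the sizes of all later snapshots.
    """
    grown = []
    running = 0
    for size in reversed(snapshots_gb):
        running += size
        grown.append(running)
    return list(reversed(grown))


base_gb = 100
snapshots_gb = [10, 20, 10, 20]  # snapshots one to four

grown = chained_rollup_sizes(snapshots_gb)

# Older behaviour: original disk plus the grown snapshots one to three.
chained_total_gb = base_gb + sum(grown[:-1])

# Newer behaviour: each child merges straight into the original disk,
# so no child grows and usage stays at the current footprint.
direct_total_gb = base_gb + sum(snapshots_gb)

print("grown child sizes:", grown)
print("chained roll-up total:", chained_total_gb, "GB")
print("direct roll-up total:", direct_total_gb, "GB")
```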
Consolidate – The consolidate option was added in vSphere 5 and is there to allow you to write back the child disks that may have become disassociated from the Snapshot Manager due to a failed RemoveSnapshot or RemoveAllSnapshots command.  This failure can be caused by a time out during the write back of the child disks to the parent disks.

A virtual machine may show up in the vSphere Client as requiring consolidation with a Needs Consolidation alert on the summary tab of the virtual machine.
There is also a Needs Consolidation column in the virtual machines view from any higher level in vCenter, such as the cluster level.

Orphaned Snapshots

The Snapshot Manager may think that the consolidation process is complete, in which case the vSphere Client shows no error about the virtual machine requiring consolidation. However, when you check the .vmx file, or edit the settings and view the location of the virtual machine disk files, you may see that the disk is actually called <vm-name>-<number>.vmdk. If this is the case, look in the datastore browser and you will see the <vm-name>-<number>.vmdk files.
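
The .vmx check can be automated on a copy of the file. The sketch below assumes that snapshot descriptors end in a six-digit number, as in Demo-VM01-000001.vmdk, and that disks appear in the .vmx file as entries such as scsi0:0.fileName. The sample .vmx content and the function name `disks_on_snapshots` are made up for illustration.

```python
import re

# Assumption: snapshot descriptors end in "-NNNNNN.vmdk" (six digits).
# A base disk whose own name ends that way would be a false positive.
SNAPSHOT_DISK = re.compile(r"-\d{6}\.vmdk$", re.IGNORECASE)
DISK_LINE = re.compile(
    r'^\s*((?:scsi|ide|sata)\d+:\d+)\.fileName\s*=\s*"([^"]+)"', re.IGNORECASE
)

# Hypothetical .vmx excerpt, not taken from a real virtual machine.
SAMPLE_VMX = '''displayName = "Demo-VM01"
scsi0:0.present = "TRUE"
scsi0:0.fileName = "Demo-VM01-000001.vmdk"
scsi0:1.fileName = "Demo-VM01_1.vmdk"
'''


def disks_on_snapshots(vmx_text):
    """Return {device node: file name} for disks that point at a snapshot."""
    found = {}
    for line in vmx_text.splitlines():
        match = DISK_LINE.match(line)
        if match and SNAPSHOT_DISK.search(match.group(2)):
            found[match.group(1)] = match.group(2)
    return found


print(disks_on_snapshots(SAMPLE_VMX))
```

A non-empty result while the Snapshot Manager lists no snapshots suggests the virtual machine is running from an orphaned delta disk.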

You can also open an SSH connection to the host to view the <vm-name>-<number>.vmdk and <vm-name>-<number>-delta.vmdk files by listing the contents of the virtual machine's directory. You can do this with the following commands.
# cd /vmfs/volumes/<datastorename>/<VirtualMachineName>
# ls -lah

Here you will see all the disk files, including the hidden flat disks (<vm-name>-flat.vmdk) and snapshot delta disks (<vm-name>-<number>-delta.vmdk). The flat disks are the actual virtual machine disk files; the ‘plain’ .vmdk files are configuration files pointing to the flat disk file.
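
To make sense of a long directory listing, the file names can be sorted into the roles described in this article. This is a rough sketch that relies purely on the naming conventions above, again assuming a six-digit snapshot number. The function name `classify` is hypothetical.

```python
import re

# Order matters: the more specific patterns must be tried first.
PATTERNS = [
    ("snapshot delta", re.compile(r"-\d{6}-delta\.vmdk$")),
    ("flat extent", re.compile(r"-flat\.vmdk$")),
    ("snapshot descriptor", re.compile(r"-\d{6}\.vmdk$")),
    ("base descriptor", re.compile(r"\.vmdk$")),
    ("memory state", re.compile(r"\.vmsn$")),
    ("snapshot database", re.compile(r"\.vmsd$")),
]


def classify(file_name):
    """Return the role of a file in a virtual machine directory."""
    for label, pattern in PATTERNS:
        if pattern.search(file_name):
            return label
    return "other"


for name in [
    "Demo-VM01.vmdk",
    "Demo-VM01-flat.vmdk",
    "Demo-VM01-000001.vmdk",
    "Demo-VM01-000001-delta.vmdk",
    "Demo-VM01.vmsd",
]:
    print(f"{name}: {classify(name)}")
```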
If you see that the VM is running from a snapshot delta you have several options.

Option 1 – Clone the virtual machine. A nice simple fix. To ensure a consistent state you will need to shut the virtual machine down before starting the clone; otherwise the cloned VM will be in the state that the original virtual machine was in when the initial snapshot was taken at the start of the clone process. Please note that this snapshot is crash consistent: there is no option to quiesce the disk or snapshot the memory, so anything on the virtual machine not committed to disk will be lost.
Option 2 – Take and delete a snapshot in the vSphere Client. With this option the snapshot removal also performs the consolidate action and writes the additional delta child disks back to the original parent disk. Should you try this option and find that the snapshot removal doesn’t fix it, try either shutting the virtual machine down first or selecting the option to Quiesce guest file system whilst taking the snapshot.
Option 3 – Take and delete a snapshot using an SSH connection to the host. You may find that the snapshot removal still doesn’t work using the vSphere Client. If so, try the same process from the command line. Use these steps as a guide.

Step 1 – List out the VMID of the virtual machines on the host
# vim-cmd vmsvc/getallvms

Alternatively use grep to list out just the virtual machine name you are looking for. In my example I use
# vim-cmd vmsvc/getallvms | grep Demo

Here is the output.
22     Demo-VM01    [EQL03-SHARED05] Demo-VM01/Demo-VM01.vmx    windows7Server64Guest    vmx-08
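
If you have saved the getallvms output to a file, the VMID lookup can be scripted. A minimal sketch follows, assuming the first column is the VMID and the second is a name without spaces. The header row in the sample and the function name `find_vmids` are assumptions for illustration.

```python
# Sketch: pick VMIDs out of saved `vim-cmd vmsvc/getallvms` output.
# Assumes column 1 is the VMID and column 2 the name (no spaces in the name).

SAMPLE_GETALLVMS = """Vmid   Name         File                                       Guest OS                Version
22     Demo-VM01    [EQL03-SHARED05] Demo-VM01/Demo-VM01.vmx    windows7Server64Guest    vmx-08
"""


def find_vmids(getallvms_output, name_fragment):
    """Return {VM name: VMID} for every VM whose name contains the fragment."""
    matches = {}
    for line in getallvms_output.splitlines():
        fields = line.split()
        # Header rows and wrapped continuation lines do not start with a number.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        vmid, name = int(fields[0]), fields[1]
        if name_fragment in name:
            matches[name] = vmid
    return matches


print(find_vmids(SAMPLE_GETALLVMS, "Demo"))
```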

Step 2 – Verify if the snapshot exists
# vim-cmd vmsvc/snapshot.get [VMID]

Here is the output.
# vim-cmd vmsvc/snapshot.get 22
Get Snapshot:
--Snapshot Name : Demo-Snapshot01
--Snapshot Id : 1
--Snapshot Desciption :
--Snapshot Created On : 2/1/2013 12:11:49
--Snapshot State : powered off
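
The snapshot.get output is easy to turn into structured data if you want to compare it with what the Snapshot Manager shows. The sketch below assumes the line layout shown above, where each field starts with dashes followed by the word Snapshot. The function name `parse_snapshot_get` is hypothetical, and the field names are kept exactly as the command prints them.

```python
# Sketch: turn saved `vim-cmd vmsvc/snapshot.get` output into dictionaries.

SAMPLE_SNAPSHOT_GET = """Get Snapshot:
--Snapshot Name : Demo-Snapshot01
--Snapshot Id : 1
--Snapshot Desciption :
--Snapshot Created On : 2/1/2013 12:11:49
--Snapshot State : powered off
"""


def parse_snapshot_get(output):
    """Return one dict per snapshot, keyed by the field name after 'Snapshot'."""
    snapshots = []
    for line in output.splitlines():
        stripped = line.lstrip("-| ").strip()
        if not stripped.startswith("Snapshot ") or ":" not in stripped:
            continue
        # Split on the first colon only, since timestamps contain colons too.
        key, _, value = stripped.partition(":")
        key = key[len("Snapshot "):].strip()
        if key == "Name":
            snapshots.append({})  # a Name field starts a new snapshot record
        if snapshots:
            snapshots[-1][key] = value.strip()
    return snapshots


for snapshot in parse_snapshot_get(SAMPLE_SNAPSHOT_GET):
    print(snapshot["Id"], snapshot["Name"], snapshot["State"])
```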

Step 3 – Create a new snapshot
# vim-cmd vmsvc/snapshot.create [VmId] [snapshotName] [snapshotDescription] [includeMemory] [quiesced]

Here is the output.
# vim-cmd vmsvc/snapshot.create 22 Demo-Snapshot02 "Snapshot Demo 2 Two" 0 0
Create Snapshot:

Step 4 – Remove all the snapshots  (Labelled as Delete all in Snapshot Manager)
# vim-cmd vmsvc/snapshot.removeall [VMID]

Here is the output.
# vim-cmd vmsvc/snapshot.removeall 22
Remove All Snapshots:

Run a directory list command ls -lah to confirm that the snapshots have all been removed.

You can also take and remove snapshots using the vSphere CLI, the vSphere Management Assistant (vMA) or PowerCLI. The vSphere CLI and vMA use equivalent commands to those above; you just need to specify the remote server that you want to run them against.

For example run this to take a snapshot of a virtual machine running on an ESXi host through vCenter Server.
> vmware-cmd -H <vCenter Server> -T <ESXi host> -U <user_name> -P <password> <path to .vmx file> createsnapshot <name> <description> quiesce [0|1] memory [0|1]

PowerCLI can use the following commands to take a snapshot.
> New-Snapshot [-Name] <Snapshot_Name> [-Description <Description_Of_Snapshot>] [-Memory] [-Quiesce] [-VM] <Virtual_Machine_Name> [-Server <vCenter_Server>]

Checking for virtual machine disk locks

Should any <vm-name>-<number>.vmdk delta disks remain the next step is to see if any virtual machine disks have locks on them.  For this you can use the vmkfstools command set and have a look at the current mode of the relevant .vmdk file.
A virtual machine disk can be in one of four modes.

mode 0 = no lock.
mode 1 = exclusive lock. This will be the case if the virtual machine is powered on and in use. A powered-on virtual machine will also have an up-to-date modification date on the .vmdk file.
mode 2 = read-only lock. This will be the case for the <vm-name>-flat.vmdk of a running virtual machine with snapshots.
mode 3 = multi-writer lock. This will be the mode of the .vmdk if it is used for Microsoft cluster disks or Fault Tolerance virtual machines.

Ensure you are in the relevant virtual machine directory and use the following actions to perform these checks.

Step 1 – Check the mode state of the virtual machine flat disk file  (<vm-name>-flat.vmdk)
# vmkfstools -D <vm-name>-flat.vmdk

Here is the output of the demo VM with a snapshot in place.
# vmkfstools -D Demo-VM01-flat.vmdk

Lock [type 10c00001 offset 159152128 v 123, hb offset 3244032
gen 25, mode 2, owner 00000000-00000000-0000-000000000000 mtime 1190286 nHld 1 nOvf 0]
RO Owner[0] HB Offset 3244032 50b60d57-e9cb48dc-9d82-984be10fc230
Addr <4, 346, 95>, gen 106, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 42949672960, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 3, bs 1048576

As you can see the base disk is in read only mode because all changes are currently being written to the snapshot delta disk.
If I run the same command on the snapshot delta disk I get the following.

# vmkfstools -D Demo-VM01-000001-delta.vmdk

Lock [type 10c00001 offset 262713344 v 152, hb offset 3244032
gen 25, mode 1, owner 50b60d57-e9cb48dc-9d82-984be10fc230 mtime 1190281 nHld 0 nOvf 0]
Addr <4, 598, 134>, gen 147, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 86016, nb 1 tbz 0, cow 0, newSinceEpoch 0, zla 1, bs 1048576

This disk is in exclusive lock mode because the virtual machine is powered on and its changes are being written to this disk. You can see which host holds the lock on this virtual machine disk by looking at the MAC address, which is the last segment of the value given after the word owner (here 984be10fc230).
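
Reading the lock mode and owner out of the vmkfstools -D output can be scripted once the output has been saved. The following sketch assumes the Lock line layout shown in the two examples above and formats the last segment of the owner value as a MAC address. The function name `parse_lock` and the dictionary keys are hypothetical.

```python
import re

# Lock modes as described in this article.
LOCK_MODES = {
    0: "no lock",
    1: "exclusive lock",
    2: "read-only lock",
    3: "multi-writer lock",
}

SAMPLE_DELTA = """Lock [type 10c00001 offset 262713344 v 152, hb offset 3244032
gen 25, mode 1, owner 50b60d57-e9cb48dc-9d82-984be10fc230 mtime 1190281 nHld 0 nOvf 0]
Addr <4, 598, 134>, gen 147, links 1, type reg, flags 0, uid 0, gid 0, mode 600
"""

SAMPLE_FLAT = """Lock [type 10c00001 offset 159152128 v 123, hb offset 3244032
gen 25, mode 2, owner 00000000-00000000-0000-000000000000 mtime 1190286 nHld 1 nOvf 0]
RO Owner[0] HB Offset 3244032 50b60d57-e9cb48dc-9d82-984be10fc230
Addr <4, 346, 95>, gen 106, links 1, type reg, flags 0, uid 0, gid 0, mode 600
"""


def parse_lock(output):
    """Extract the lock mode, owner and owner MAC from vmkfstools -D output."""
    # "gen N, mode M" only appears on the Lock line, not the file mode 600.
    mode = re.search(r"gen \d+, mode (\d+)", output)
    owner = re.search(
        r"owner\s+([0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12}))", output
    )
    if mode is None or owner is None:
        raise ValueError("no lock information found")
    mac_hex = owner.group(2)
    return {
        "mode": int(mode.group(1)),
        "description": LOCK_MODES.get(int(mode.group(1)), "unknown"),
        "owner": owner.group(1),
        "mac": ":".join(mac_hex[i:i + 2] for i in range(0, 12, 2)),
        # Read-only holders are listed on separate "RO Owner" lines.
        "ro_owners": re.findall(r"RO Owner\[\d+\] HB Offset \d+ (\S+)", output),
    }


info = parse_lock(SAMPLE_DELTA)
print(info["description"], "held by", info["mac"])
```

An all-zero owner, as on the flat disk, means no single host holds an exclusive lock. In that case the read-only holders are the values to compare against your hosts' MAC addresses.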

Step 2 – Shut the virtual machine down to see if the lock gets released
Here is the output following a shutdown of the virtual machine.

# vmkfstools -D Demo-VM01-flat.vmdk

Lock [type 10c00001 offset 159152128 v 124, hb offset 3244032
gen 25, mode 0, owner 00000000-00000000-0000-000000000000 mtime 1190723 nHld 0 nOvf 0]
Addr <4, 346, 95>, gen 106, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 42949672960, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 3, bs 1048576

As you can see the mode is 0 on the demonstration virtual machine meaning that the machine disk is not locked by another device.  Once the mode is 0 you should be able to take a snapshot and remove a snapshot successfully.

Step 3 – Forcefully remove the lock
If you find that the mode is anything other than 0 then another device is locking the disk. This may be another host or, depending on your backup software, your backup server. If the file is still locked you should see the MAC address of the owner. If the MAC address corresponds to your backup server, restarting the backup server should release the lock. If it is another host then you will need to unregister the virtual machine from the current host and re-register it on the host with the corresponding MAC address. Once you have registered it on the appropriate host, try to power it on. If it still fails, check whether the virtual machine still has a World ID assigned to it on the host identified as the owner of the lock.

# esxcli vm process list

World ID: 3657905
Process ID: 0
VMX Cartel ID: 3670192
UUID: 42 36 06 d4 0f 1b 35 61-17 aa f9 4b 8d 6c e1 78
Display Name: Demo-VM01
Config File: /vmfs/volumes/4fe306c8-b1c504a6-a734-984be10fb3e4/Demo-VM01/Demo-VM01.vmx

The world ID number (3657905) is the Virtual Machine Monitor (VMM) for vCPU 0.  Run the following command to force the virtual machine to stop by killing the process.

# esxcli vm process kill --type soft --world-id 3657905

If you cannot see the virtual machine name in the output of esxcli vm process list, the virtual machine is not running on this host.
If this is the case, or you are not able to kill the process, you can restart the management agents or reboot the host to release the lock.
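
Looking up the World ID for a given virtual machine can also be done on saved esxcli vm process list output. This is a small sketch that assumes each virtual machine's record contains the field lines shown above, starting with World ID. The function name `find_world_id` is hypothetical, and it returns None when the virtual machine is not listed on this host.

```python
# Sketch: find a VM's World ID in saved `esxcli vm process list` output.

SAMPLE_PROCESS_LIST = """World ID: 3657905
Process ID: 0
VMX Cartel ID: 3670192
UUID: 42 36 06 d4 0f 1b 35 61-17 aa f9 4b 8d 6c e1 78
Display Name: Demo-VM01
Config File: /vmfs/volumes/4fe306c8-b1c504a6-a734-984be10fb3e4/Demo-VM01/Demo-VM01.vmx
"""


def find_world_id(process_list_output, display_name):
    """Return the World ID for display_name, or None if it is not listed."""
    records = []
    for line in process_list_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines without a colon carry no field
        key, value = key.strip(), value.strip()
        if key == "World ID":
            records.append({})  # each World ID line starts a new record
        if records:
            records[-1][key] = value
    for record in records:
        if record.get("Display Name") == display_name:
            return int(record["World ID"])
    return None


print(find_world_id(SAMPLE_PROCESS_LIST, "Demo-VM01"))
```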

It is worth noting that you can use the k command in esxtop to kill a running virtual machine process. SSH to the host and perform the following.

Step 1 – Run esxtop by typing esxtop
Step 2 – Press c to switch to the CPU resource utilization screen (this is the default view)
Step 3 – Press Shift+f to display the list of fields
Step 4 – Press c to add the column for the Leader World ID
Step 5 – Identify the target virtual machine by its Name and Leader World ID (LWID)
Step 6 – Press k
Step 7 – At the World to kill prompt, type in the Leader World ID from step 5 and press Enter
Step 8 – Wait up to 30 seconds and validate that the process is no longer listed