I do not mean to start a panic with this unsolicited advice, but lately I have met a few vendors and even partners who are beginning to advocate combining primary storage and data protection with snapshots, forgoing a separate backup solution entirely. So, are storage snapshots backup? No! Definitely not! But also, maybe sort of. Let me explain.
As you all may know, my previous gig was at a primary storage vendor, Nimble Storage. That product offered efficient redirect-on-write (ROW) snapshots, so we frequently pushed the benefits of thin snapshots: no data movement, fast restores, and low space overhead. I always encouraged customers to snapshot and replicate everything, including their servers, databases, and network file shares.
Storage snaps are an important component of a complete data protection plan, but they are more of a “near-line” backup than a complete backup strategy.
“So if snapshots are so great, why are you telling us not to rely on them?”
Again, I’m a big snapshot fan. Please keep taking them. Just don’t rely on snapshots alone. Why? Snapshots live and die with the underlying primary storage system they are tied to. A snapshot is mostly the same data blocks/files as the running primary copy of that data, with some clever pointer tables to create additional restore points. Snapshots are generally very reliable because the primary storage systems they depend on are reliable, but they are not infallible.
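To make the pointer-table idea concrete, here is a minimal sketch in Python. All names here are illustrative; real arrays implement this at the block layer in firmware, not like this. The point is that taking a redirect-on-write snapshot copies only the pointer table, and that both the snapshot and the live volume live inside the same system:

```python
# Minimal redirect-on-write (ROW) snapshot sketch (illustrative only).
# A volume is modeled as a mapping of logical block number -> block data.
# A snapshot is a frozen copy of that mapping; no block data is copied.

class Volume:
    def __init__(self):
        self.block_map = {}   # live pointer table: logical block -> data
        self.snapshots = {}   # snapshot name -> frozen pointer table

    def write(self, lbn, data):
        # New writes land on fresh blocks; frozen snapshot tables keep
        # pointing at the old, untouched data.
        self.block_map[lbn] = data

    def snapshot(self, name):
        # Taking a snapshot only copies the pointer table: near-instant,
        # and nearly free in space until the live volume diverges.
        self.snapshots[name] = dict(self.block_map)

    def read(self, lbn, snapshot=None):
        source = self.snapshots[snapshot] if snapshot else self.block_map
        return source.get(lbn)

vol = Volume()
vol.write(0, "v1")
vol.snapshot("daily")   # instant restore point, no data movement
vol.write(0, "v2")      # live copy diverges; the snapshot still sees "v1"
```

Note the catch this post is about: the snapshot and the live volume are both inside the same `Volume` object. Lose the object, and you lose both.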
Preface – Please know this is NOT in any way a judgement on NetApp storage. NetApp is an excellent, successful company and they make terrific products. NetApp also pioneered the modern ROW-style snapshots that deliver the aforementioned protection benefits, and it is still the gold standard today. I would gladly discuss this incident with any NetApp team directly, or any other storage vendor for that matter.
Several years ago, a local city municipality (which shall remain shameless) was weighing my storage product against staying with their incumbent NetApp storage and upgrading to a newer array. While they liked both offers, it was no surprise when the city decided to stay with NetApp because of their familiarity, easier public contract procurement, and previously positive experience.
We parted ways as friends, and I assumed I would not be hearing from the city IT team again. Sadly, I was wrong. Approximately 5 months later, we were urgently asked to come back and present our solution again. During the meeting, we were informed that approximately three months after their new system was deployed, the NetApp array suffered a catastrophic system failure, destroying all primary volumes and snapshots.
“But we replicate to another system so we’re good, right?”
This customer also had a second NetApp system they replicated to. The corruption had been replicated as well; the replication jobs even reported successful completion, yet the downstream copies were corrupted beyond recovery. Because the second copy lived on the same format and underlying platform, replication propagated the problem rather than creating an air-gapped copy on a different platform.
After weeks of escalated data recovery efforts with the vendor, the customer was finally able to restore most of their data from the downstream replicated system, but only from a restore point approximately three months old. Roughly three months of public records were completely lost.
The city IT manager explained that they were in active litigation with the storage vendor and reseller to get their money back, and if successful, wanted to know if they could still get that deal on our storage.
Again, no primary storage array is impervious to serious problems such as downtime or, worse, data loss. Any enterprise-grade storage system will include multi-level checksums, redundant hardware, and even snapshots to prevent such issues, yet they still happen. To make matters worse, storage vendors are highly motivated to camouflage or even outright deny data-loss incidents to prevent or minimize bad press, which I believe leads to a false sense of security.
To blur matters further, some hyper-converged infrastructure (HCI) primary vendors are now claiming to “build in” backup to the solution. HCI is becoming more and more popular, but it is relatively new and many customers are woefully under-educated about it. So when a primary HCI vendor comes along and says, “you don’t need to do backup anymore, we do that already,” it sounds like a used car salesman explaining how that hood latch is actually supposed to open with a coat hanger.
To clarify, hyperconverged infrastructure is a newer way to store data and manage infrastructure, combining servers and storage into one scale-out unit. To protect against component failure, HCI platforms typically keep copies of the data across multiple nodes, creating redundancy. Some HCI vendors now call their snapshots “backups” simply because the underlying data is replicated across nodes. This works well for protection from component failures but does little to protect against platform-level events such as the one my customer experienced.
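A toy illustration of that distinction, in Python with made-up names (no real HCI platform works this simply): node-level replication survives losing a node, but a platform-level fault hits every replica, because each copy is written through the same software stack:

```python
# Toy model: HCI-style replication factor vs. a platform-level fault.
# Names and structure are illustrative, not any vendor's design.
REPLICATION_FACTOR = 3

cluster = {f"node{i}": {} for i in range(4)}  # 4-node toy cluster

def write_block(key, data):
    # The platform writes every copy through the same code path, so
    # whatever it writes (good or corrupt) lands on all replicas.
    targets = list(cluster)[:REPLICATION_FACTOR]
    for node in targets:
        cluster[node][key] = data

write_block("vm-disk-1", "good data")

# Component failure: lose one node, the data survives on other replicas.
cluster.pop("node0")
survivors = [n for n in cluster if "vm-disk-1" in cluster[n]]

# Platform-level event: a bug corrupts the write path, and the
# corruption is faithfully replicated to every remaining copy.
write_block("vm-disk-1", "corrupt data")
```

After the node loss, two good replicas remain; after the corrupt write, every copy in the cluster is bad. Replication factor protects against the first scenario, not the second.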
The only way to fully protect data from these sorts of incidents is to create an air-gapped copy. When planning for data protection, organizations should always create a backup on an entirely separate storage platform that is not accessible from the primary network.
I was fortunate never to have a customer case that resulted in such catastrophic data loss, but I would never advocate that customers rely only on vertically integrated snapshot and replication features to protect data.
Snapshots AND Backup – Love Will Keep Us Together
Chips & salsa, crocs & socks, Captain & Tennille: some things are great by themselves but simply unstoppable when combined. Such is the case for primary storage snapshots and backup. Organizations’ growing demands to protect more applications faster are driving the need for the kind of protection only snapshots can initiate: near-instant recovery points with no data movement. The only trouble is that snapshots alone are insufficient to protect workloads from system failures, cybercrime, and site disasters.
Do both! Snapshots and backup are not mutually exclusive but rather two integral parts of a complete data protection strategy. Better yet, choose platforms with tight integration between backup and primary storage. I know what you’re thinking:
“Hey pal, I came to this site looking for Sumo suits- I’m not even sure I like this blog, don’t throw me curve balls like that!”
When Cohesity customers have Pure Storage, Cisco HyperFlex, Isilon, or NetApp, Cohesity can manage and offload the primary storage system’s snapshots to the Cohesity cluster. Cohesity can initiate an instant backup job simply by telling the primary storage system to take a snapshot. Then, in the background and completely automatically, Cohesity backs up the changed snapshot data rather than the running instance of the object, such as a virtual server. This means older snapshots can be deleted from the primary system, where they would otherwise consume valuable resources, yet remain accessible from the secondary Cohesity system. It also takes the backup workflow completely out of band from the primary production network, significantly lowering the impact of backups on the primary server and storage networks. Cohesity has integrations planned with many more popular primary storage systems soon, so stay tuned!
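The workflow described above can be sketched at a high level. To be clear, every class and method name below is a hypothetical placeholder, not Cohesity’s or any storage vendor’s actual API; it only shows the shape of the orchestration: snapshot on the primary, copy the deltas out of band, then retire old snapshots from the primary:

```python
# Hedged sketch of a storage-integrated backup workflow:
#   1. ask the primary array for a snapshot (instant, no data movement),
#   2. copy only the changed blocks out-of-band to the backup cluster,
#   3. delete the older snapshot from the primary to free resources.
# All names are illustrative placeholders, not a real vendor API.

class PrimaryArray:
    def __init__(self):
        self.snapshots = []

    def take_snapshot(self):
        snap_id = f"snap-{len(self.snapshots)}"
        self.snapshots.append(snap_id)
        return snap_id

    def changed_blocks_since(self, prev_snap, snap):
        # Stub: a real array would return the block delta between snaps.
        return [f"{snap}:block-{i}" for i in range(3)]

    def delete_snapshot(self, snap_id):
        self.snapshots.remove(snap_id)

class BackupCluster:
    def __init__(self):
        self.restore_points = {}

    def ingest(self, snap_id, blocks):
        # Older restore points stay recoverable here even after the
        # primary array deletes its local snapshots.
        self.restore_points[snap_id] = blocks

def run_backup(array, cluster, prev_snap=None):
    snap = array.take_snapshot()
    delta = array.changed_blocks_since(prev_snap, snap)
    cluster.ingest(snap, delta)      # out of band from production I/O
    if prev_snap:
        array.delete_snapshot(prev_snap)
    return snap

array, cluster = PrimaryArray(), BackupCluster()
first = run_backup(array, cluster)
second = run_backup(array, cluster, prev_snap=first)
```

After two runs, the primary holds only the newest snapshot while the backup cluster retains both restore points, which is the resource-saving behavior described above.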
I will mercifully summarize: if any workloads today are protected only with snapshots & replication, I recommend augmenting them with a true backup solution on a separate platform. Now, as a reward for reading all the way to the bottom and eating your vegetables, please enjoy the greatest song ever made: