
30,000 IOPS is better than 20,000 IOPS

More is better. Yes, that’s exactly how much the average end user knows about IOPS. The more the merrier, what else could it be? Never mind that perceived performance is a function of capacity, read/write ratio, IO dispersion patterns, number of threads, clients, network, cache algorithms, hit rates, garbage collection, space reclamation, background processes, and durability requirements, among 20 other factors that might equally impact the performance of a storage solution. Who cares? More IOPS can’t hurt, can it? Well, I guess it can’t, unless it’s going to cost more money.

One of the most prominent value propositions of Software-Defined Storage is policy-based self-provisioning. For enterprises, this is all nice and dandy until greed comes into play. In the name of “better to err on the safe side”, people almost always ask for more storage, in capacity and/or performance, than they actually need, unless a tight quota and a penal code are in place. It’s a system administrator’s nightmare to efficiently manage the company’s virtualized storage environment while keeping the end users’ ignorance at bay. It’s less of an issue for capacity, because volumes are always thinly provisioned, but performance is pretty much hard-wired, so to speak; otherwise the IOPS cannot be guaranteed.

For public clouds, customers face a different challenge. Service providers like Amazon are starting to introduce tiered IOPS provisioning for their block storage solutions with SLAs and a billing system that would require a PhD in number theory to decipher. But you have to give AWS credit. Their offerings get better by the hour, figuratively speaking, and if you are one of their customers, hardly a week goes by without you receiving an email or two from Amazon telling you how much they have improved their product lineups or how much more you can do with your existing instances.

Basic EBS from AWS now features a general purpose SSD tier that bursts to 3,000 IOPS for an unspecified duration, as well as a provisioned IOPS SSD tier with up to 4,000 sustained IOPS. Admittedly, AWS has come a long way, and they are objectively THE market leader in the Infrastructure-as-a-Service (IaaS) playground. Still, end users need to have a fairly good idea of the level of performance required before a volume is requested. Additionally, very few workloads exercise storage evenly over time, so volumes have to be over-provisioned to cover the peak values, as illustrated by the dotted red line in the following diagram.

[Diagram: workload IOPS over time, with the dotted red line marking the over-provisioned peak level]
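To put a rough number on that gap, here is a back-of-the-envelope illustration. All figures, including the per-IOPS rate, are made up for the example and are not AWS’s price list; the point is simply how little of the provisioned ceiling an average workload actually uses.

    # Back-of-the-envelope illustration of static IOPS over-provisioning.
    # All numbers are hypothetical; plug in your own workload profile and rates.
    avg_iops = 500                        # what the workload drives most of the day
    peak_iops = 3000                      # the burst the volume must still absorb
    rate_per_iops_month = 0.065           # $/provisioned IOPS-month, placeholder rate

    provisioned = peak_iops               # static provisioning has to size for the peak
    utilization = avg_iops / provisioned  # fraction of the paid-for IOPS actually used
    idle_cost = (provisioned - avg_iops) * rate_per_iops_month

    print(f"Utilization of provisioned IOPS: {utilization:.0%}")
    print(f"Paying for idle headroom: ${idle_cost:.2f}/month")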

Sure, other than wasting some high-value SSD resources (it’s only money, right?), this over-provisioning is not so bad, because at least the performance is there and the customer’s business is taken care of. Now what if the user doesn’t even know what the IOPS requirements are at the time of volume creation? With the current inflexible provisioning system, if a volume’s performance is deemed inadequate, the workload has to be moved to a higher tier, along with data migration and possibly some system downtime. Wouldn’t it be nice if the storage system could learn that the volume needs more speed and just give it to it on the fly?

Until all-flash arrays become substantially more affordable than they are today, hybrid systems will remain the dominant architecture in the market, as they hit the sweet spot between performance and capacity. However, hybrid arrays do not solve the over-provisioning problem shown in the above example. A modern software-defined storage system should do one better than performance tiering. It should learn the traffic pattern of all the LUNs under the storage hypervisor and proactively allocate SSD-based cache space to the volumes that need it the most at any given point in time. Applications that do not need the speed should surrender the precious flash resources to other applications, maximizing the resource utilization rate.
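To make that concrete, here is a minimal sketch of what such an allocation loop could look like. It is not any shipping product’s algorithm; the Volume class, the smoothing factor, the 5% floor, and the rebalance() helper are all invented for illustration. The idea is simply to track each LUN’s recent demand with an exponentially weighted moving average and carve the shared flash cache in proportion to it.

    # Minimal sketch: divide a shared flash cache among volumes in proportion
    # to an exponentially weighted moving average (EWMA) of their recent IOPS.
    # Everything here (class names, alpha, the 5% floor) is illustrative only.

    CACHE_GB = 1024     # total SSD cache to hand out
    ALPHA = 0.3         # EWMA smoothing factor: higher = react faster to bursts
    MIN_SHARE = 0.05    # every volume keeps a small floor so a cold start isn't starved

    class Volume:
        def __init__(self, name):
            self.name = name
            self.ewma_iops = 0.0
            self.cache_gb = 0.0

        def observe(self, iops_this_interval):
            """Fold the latest measurement into the running demand estimate."""
            self.ewma_iops = ALPHA * iops_this_interval + (1 - ALPHA) * self.ewma_iops

    def rebalance(volumes):
        """Re-slice the cache: busy volumes grow, idle volumes surrender space."""
        floor = CACHE_GB * MIN_SHARE
        spare = CACHE_GB - floor * len(volumes)
        total_demand = sum(v.ewma_iops for v in volumes) or 1.0
        for v in volumes:
            v.cache_gb = floor + spare * (v.ewma_iops / total_demand)

    # Example: an OLTP volume heats up while a backup target goes quiet.
    oltp, backup = Volume("oltp"), Volume("backup")
    oltp.observe(8000); backup.observe(200)
    rebalance([oltp, backup])
    print({v.name: round(v.cache_gb) for v in [oltp, backup]})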

The bottom line is, storage performance goes way beyond IOPS provisioning, and the industry as a whole has not really figured out a closed-form solution that relates perceived IOPS to the host of parameters that directly or indirectly impact performance. Throw multi-tenancy and “noisy neighbors” into the mix, and the problem gets out of hand really quickly. Open-loop control does not cut it anymore for today’s demanding applications and IT environments. Considering that IO analytics and machine learning are not exactly new concepts, it does not take a genius to put one and one together and get three. Why all SDS solutions do not already offer some type of traffic modeling and elastic resource control is beyond me, but surely these will be standard features taken for granted in the very near future. I can’t think of any other way.
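For what it’s worth, the closed loop does not have to start out as anything exotic. Even a plain proportional controller that nudges a volume’s flash share whenever its measured latency drifts from a target captures the spirit; the toy sketch below uses invented gains and thresholds and makes no claim about any real product.

    # Toy closed-loop controller: grow or shrink a volume's flash share based on
    # how far its measured latency sits from the target. All numbers are invented.

    TARGET_MS = 2.0     # latency target for this volume
    GAIN = 0.05         # how aggressively to react to each millisecond of error
    MIN_SHARE, MAX_SHARE = 0.05, 0.60   # clamp so one tenant can't take everything

    def adjust_share(current_share, measured_latency_ms):
        """One control step: positive error (too slow) -> more cache, and vice versa."""
        error = measured_latency_ms - TARGET_MS
        new_share = current_share + GAIN * error
        return max(MIN_SHARE, min(MAX_SHARE, new_share))

    # A volume running hot at 5 ms gradually claims more cache across control intervals.
    share = 0.10
    for latency in (5.0, 4.1, 3.0, 2.2, 1.9):
        share = adjust_share(share, latency)
        print(f"latency={latency:.1f} ms -> cache share={share:.2f}")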