When you talk about performance you need to have some numbers. Many people are confused here and just talk about “wee need more IOPS” without even understand what those are. If you wish to analyze your environment performance you have to know what to measure. In this short post I would like to describe some of the definitions.

Bandwidth

The amount of data that is sent transferred along a channel at a give amount of time. Very often measured in MB/s (megabytes per second). Here you are not concered with how much IOPS you are dealing with, but how much data is being send in the fixed amount of time.

Throughput

It’s a number of IO operations that are processed per second over a period of time (IOPS). You are not considered of how much data is being moved, but of the number of transactions.

Bandwidth vs Throughput

So, which characteristic is more important? It all depends what are you dealing with. If your workload is more or less sequential the bandwidth is much more crucial. If your workload is more random and IO size is small much more important is throughput. Of course , it’s never black-or-white type of situation, but in most cases you should be aware what is more important for you – bandwidth or throughput

Response Time

In simple words it’s the interval of time between submitting a command (request), and receiving a response. Normally measured in miliseconds (ms).

Average Seek Distance

Amount of data that the disk transverses during a seek. The larger the distance, the more random the IO. Longer seek distances result in a longer seek times and therefore higher response times. Avg seek distance is measured usually in GB. It is measured in GB, not in tracks, because different disk manufactures might have different design of a hard drives, therefore the numbers wouldn’t be consistent, when using drives from different vendors in one storage system. With the avg seek distance you can check how much randomness there is on a disk level. But remember, randomness on disk level is not the same as randomness on the LUN level, one disk, being a part of a RAID group (or/and pool) can host many different LUNs and you have to consider that.

Queue Length

Number of request within a certain time interval, that are waiting to be served by the component. Recognizing a queue length is crucial, because, my optimizing it, you can resolve many of your performance issues.

Utilization

Utilization measures the fraction of time that a device is busy serving requests, usually reported as a percentage busy.

What is IO

When it comes to performance issues the term you hear really often is IO. IO is a shortcut for input/output and it is basically communication between storage array and the host. Inputs are the data received by the array, and outputs are the data sent from it. To analyze and tune performance, you must understand the workload that an application and/or host is generating. Application workloads have IO characteristics. Those characteristics may be described in terms of:

IO size
IO access pattern
thread count
read/write ratio

In this post I would like to shortly go through those characteristics, because many people understands IO only as a “number of operations”, without the awareness of what this number actually means and how to understand it. Very often when people are talking about IO, what they might mean is IOPS – which is basically number of IOs per second. But to talk about IOPS and understand the number – you first have to understand and consider IO characteristics.

IO size

IO request size has an affect on performance throughput – in general, the larger the IO size, the higher the storage bandwidth. What is very often overlooked is that in most cases tools are showing the average IO size. Bare in mind that most production workloads have a mix of IO sizes.

IO access pattern

In terms of access patterns we very often use terms:

random read/write – there is no sequential activity at all (near zero), the read and writes are purely random, almost impossible to boost performance with cache.
sequential read/write – the exact opposite – purely sequential access, with no randomness at all. In such environments using storaga cache can really boost the performance.

In the real word you will almost never find 100% random or 100% sequential data access. Even in sequential environment there might be number of different sequential activities at the same time, which actually build randomness when switching between them.

IO access pattern may relate to IO size, with larger IO size (like 64kB etc) you most often deal with more sequential data access, whereas with small IO size (8kB for example) the access is most often random

Thread count

How many different activities are going on in the same time. If you have a single threaded IO access patterns, it means that host sends a write to a storage system, and waits for the acknowledge from the storage system that the write has completed. And once its completed it will send next write and so on. In real word most applications will produce multiple threads at the same time. They don’t have to wait for single response to send another request. If we go deeper here – one disk can do only one operation at a time. If multiple IO are targeted to the same disk we have a queue. Using queues storage system can optimize the way it works. The worst type of performance is when the tread count is 1 – how can you optimize single thread?

Read/write ratio

Writes are most expensive (performance wise) then reads. Mostly because the system has to determine where to put new chunk of data. Once it decides where to place the data the write itself is time consuming due to RAID write penalty (Of course it depends on the RAID level placed underneath). With enterprise storage system it actually changes – very often writes are faster, because they are placed into cache, and then the acknowledge to the host is send (later on writes from cache to actual hard drives are send in optimized way by storage system). Whereas reads – especially random reads, are very often un-cached and have to be fetched from hard drives.

Read/Write ratio is really important in terms of replication. Obviously only writes are being replicated, reads are not.

Workload IO and (some of the most popular) RAID Types

RAID 1/0

Best for small random write-intensive workloads. The RAID penalty with RAID 1/0 is only 2. That’s the lowest of all the RAID types (that actually give you some kind of failure protection)

RAID 5

Good mix of performance and protection. But the RAID penalty is 4, so the performance is much worst with small random writes.
Best with client write IO size of 64 KB or lager – full stripe writes can really boost the performance, you can write the entire stripe without a need of reading the parity information – which of course lower the RAID penalty.
Best practice is to use RAID 5 for workloads that have random writes of 30% or less.

RAID 6

More protection due to two parity disks, but at the same time higher RAID penalty – which is 6.
Very often used with NL-SAS drives (or SATA drives) which are the cheapest in terms of TB/$, but are the slowest.

storagefreak – storage & cloud blog

A little bit about NAS, a little bit about SAN… A little bit about cloud.. Wait – whaaat?

Month: February 2015

Storage Performance – what to measure?