Monday, July 6, 2009

Solid-State Drive Benchmarks and the Write Cache

Like many people, we are interested in deploying solid-state drives (SSDs) for our database systems. Jignesh posted some performance test results a while ago, but as I had commented there, this test ran with the write cache on, which concerned me.

The Write Cache

Interlude: The disk write cache is the feature that causes you to lose your data when the server machine crashes or loses power. Just like the kernel may lie to the user-land application about writing stuff to disk unless the user-land application calls fsync(), the disk may lie to the kernel about writing stuff to the metal unless the write cache is turned off. (There is, as far as I know, no easy way to explicitly flush the cache, so write cache off is kind of like open_sync, if you know what that means.) As PostgreSQL pundits know, PostgreSQL issues fsyncs at the right places unless you explicitly turn that off and ignore all the warning signs on the way there. By contrast, however, the write cache is on by default on consumer-grade ATA disks, including SATA disks and, as it turns out, also "enterprise" SSD SATA devices.
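The kernel-level half of this can be seen with plain dd: a buffered write may return long before anything is physically on disk, while conv=fdatasync makes dd call fdatasync() before it finishes. (A minimal sketch; the file sizes here are tiny just to show the flags, not to measure anything.)

```shell
#!/bin/sh
# Plain buffered write: dd may finish while the data still sits in the
# kernel page cache (and, below that, possibly in the disk's write cache).
dd if=/dev/zero of=/tmp/nosync.img bs=1M count=8 2>/dev/null

# With conv=fdatasync, dd calls fdatasync() on the file before exiting,
# so at least the kernel has pushed the data toward the disk. Whether it
# then reaches the platters depends on the disk's write cache setting.
dd if=/dev/zero of=/tmp/withsync.img bs=1M count=8 conv=fdatasync 2>/dev/null

rm -f /tmp/nosync.img /tmp/withsync.img
```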

To query the state of the write cache on a Linux system, use something like hdparm -W /dev/sda. To turn it off, use hdparm -W0 /dev/sda; to turn it back on, hdparm -W1 /dev/sda. If this command fails, you probably have a higher-grade RAID controller that does its own cache management (and doesn't tell you about it), or you might not have a disk at all. ;-) Note to self: None of this appears to be described in the PostgreSQL documentation.
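In script form (the device name is an assumption — substitute your own; hdparm needs root and a real ATA/SATA device, so this sketch bails out politely where it can't run):

```shell
#!/bin/sh
DEV=${1:-/dev/sda}   # assumed device name; substitute yours

# hdparm requires root and a real (S)ATA block device; skip quietly otherwise.
command -v hdparm >/dev/null 2>&1 || { echo "hdparm not installed"; exit 0; }
[ "$(id -u)" -eq 0 ] || { echo "need root to touch $DEV"; exit 0; }
[ -b "$DEV" ] || { echo "$DEV is not a block device"; exit 0; }

hdparm -W  "$DEV"    # query current write cache state
hdparm -W0 "$DEV"    # write cache off: what fsync promises actually holds
hdparm -W1 "$DEV"    # write cache back on
```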

It has been mentioned to me, however, that SSDs require the write cache for write wear leveling, and that turning it off may significantly reduce the lifetime of the device. I haven't seen anything authoritative on this, but it sounds unattractive. Anyone know?

The Tests

Anyway, we have now gotten our hands on an SSD ourselves and given it a try. It's an Intel X25-E from the local electronics shop, because the standard big-name vendor couldn't deliver one. The X25-E appears to be the most common "enterprise" SSD today.

I started with the sequential read and write tests that Greg Smith has described. (Presumably, an SSD is much better at random access than at sequential access, so this is a good worst-case baseline.) Then I ran some bonnie++ numbers for random seeks, which is where SSDs should excel. So to the numbers ...
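Roughly, the commands look like this (paths and the small default size are placeholders; the real runs below used a 16 GB file, at least twice RAM):

```shell
#!/bin/sh
# Minimal sketch of the sequential tests; sizes and paths are placeholders.
FILE=${1:-/tmp/seqtest.img}
SIZE_MB=${2:-64}    # placeholder; use e.g. 16384 for a real 16 GB test

# Sequential write; conv=fdatasync makes dd sync the file before reporting,
# so the rate reflects the disk, not just the kernel page cache.
dd if=/dev/zero of="$FILE" bs=1M count="$SIZE_MB" conv=fdatasync

# Sequential read. For an honest number, drop the page cache first
# (as root: echo 3 > /proc/sys/vm/drop_caches).
dd if="$FILE" of=/dev/null bs=1M

rm -f "$FILE"

# Random seeks via bonnie++, if installed; -s is the file size in MB and
# -n 0 skips the small-file creation tests.
if command -v bonnie++ >/dev/null 2>&1; then
    bonnie++ -d "$(dirname "$FILE")" -s "$SIZE_MB" -n 0 -u "$(id -un)" ||
        echo "bonnie++ run skipped (its -s size must be at least 2x RAM)"
fi
```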

Desktop machine with a single hard disk with LVM and LUKS over it:
  • Write 16 GB file, write caching on: 46.3 MB/s
  • Write 16 GB file, write caching off: 27.5 MB/s
  • Read 16 GB file: 59.8 MB/s (same with write cache on and off)
Hard disk that they put into the server that we put the SSD in:
  • Write 16 GB file, write caching on: 49.3 MB/s
  • Write 16 GB file, write caching off: 14.8 MB/s
  • Read 16 GB file: 54.8 MB/s (same with write cache on and off)
  • Random seeks: 210.2/s
This is pretty standard stuff. (Yes, the file size is at least twice the RAM size.)

SSD Intel X25-E:
  • Write 16 GB file, write caching on: 220 MB/s
  • Write 16 GB file, write caching off: 114 MB/s
  • Read 16 GB file: 260 MB/s (same with write cache on and off)
  • Random seeks: 441.4/s
So I take it that sequential speed isn't a problem for SSDs. I also repeated this test with the disk half full to see if the performance would then suffer because of the write wear leveling, but I didn't see any difference in these numbers.

A 10-disk RAID 10 of the kind that we currently use:
  • Write 64 GB: 274 MB/s
  • Read 64 GB: 498 MB/s
  • Random seeks: 765.1/s
(This device didn't expose the write cache configuration, as explained above.)

So a good disk array still beats a single SSD. In a few weeks, we are expecting an SSD RAID setup (yes, the RAID from the big-name vendor, the SSDs from the shop down the street), and I plan to revisit this test then.

Check the approximate prices of these configurations:
  • plain-old hard disk: < 100 €
  • X25-E 64 GB: 816.90 € retail, 2-5 weeks delivery
  • RAID 10: 5-10k €
For production database use, you probably want at least four X25-Es in a RAID 10, to have some space and reliability. At that point you are approaching the price of the big disk array, but will probably surpass it in performance (to be tested later, see above). Depending on whether you more desperately need space or speed, SSDs can be cost-reasonable.

There are of course other factors to consider when comparing storage solutions, including space and energy consumption, ease of management, availability of the hardware, and reliability of the devices. It looks like it's still a tie there overall.

Next up are some pgbench tests. Thanks Greg for all the performance testing instructions.
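For the record, those runs will look roughly like this (database name, scale factor, and client counts are placeholders; see Greg's instructions for choosing them properly):

```shell
#!/bin/sh
DB=${1:-pgbench_test}   # assumed database name

# pgbench ships with PostgreSQL; skip quietly where it isn't available
# or no server is reachable (pg_isready exists in newer releases).
command -v pgbench >/dev/null 2>&1 || { echo "pgbench not installed"; exit 0; }
pg_isready >/dev/null 2>&1 || { echo "no PostgreSQL server reachable"; exit 0; }

# Initialize a pgbench database; -s is the scale factor (roughly 16 MB per
# unit), so pick it large enough that the data set exceeds RAM.
pgbench -i -s 100 "$DB"

# TPC-B-like read/write run: 8 clients for 5 minutes. This is the mode
# where fsync behavior and the write cache actually matter.
pgbench -c 8 -T 300 "$DB"

# SELECT-only variant for comparison.
pgbench -S -c 8 -T 300 "$DB"
```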

(picture by XaYaNa CC-BY)


  1. There is in fact a quick intro to the Linux hdparm stuff I helped add to the manual at cs/current/static/wal-reliability.html but it falls rather short of complete.

  2. Are those 414 random seeks reads or writes or a mix of both? For reads this number should be much higher.

  3. The bonnie++ random seeks number is a mix of reads and writes using 3 processes. If you look at the "Database hardware benchmarking" presentation on my web page (linked to right at the end) you'll find a slide covering this. Those seek numbers are also relative to the size of the file the benchmark allocates.

  4. This comment has been removed by a blog administrator.

  5. Peter,

    What about lag time? I'd think that the greatest benefit to SSD would be vastly smaller lag time on small writes, rather than throughput on large writes.

  6. Josh,

    The random seek numbers should give an idea about lag time. But in principle I had assumed that the lag time would be much better with SSDs anyway and concentrated on the sequential performance and the effect of the write cache, because those were the unknowns for me.

  7. Peter,

    First, if you use Bonnie 1.95 you can get lag time numbers which would be useful.

    Overall, the SSD numbers don't look very impressive compared to standard HDD RAID, especially given the limited lifetime of SSDs. I'd be interested to see what you can do with a RAID of SSDs, and how much it costs.

  8. It looks like Bonnie did not have sufficient IO queue depth. These SSDs have 10 channels, so their random "seek" performance depends heavily on a long queue, all the way up to 32, which is the SATA NCQ spec limit.

  9. Wow...what an eye-opener. Only 414 R/W IOPS from a device advertised by Intel to do 35,000 reads and 7,000 writes!?!?!

    414 IOPS is only about 2x what a 10,000rpm HDD can do with its write cache disabled. Essentially this means I could use a 4-disk RAID-10 array and get BETTER performance than X25-E SSD, with an extra 500GB of capacity for free.

    I am curious, why does Intel fail to disclose the size of the DRAM write cache in the X25-E documentation?

  10. The test is just wrong. 414 R/W IOPS.

    More like 4,000.

  11. Somebody above said: "The test is just wrong. 414 R/W IOPS. More like 4,000."

    I wish. There are some very interesting comments on a recent Flash benchmark run by Sun/Oracle here that seem to validate the (much) lower 414 IOPS number.

    In a nutshell...Sun/Oracle ran a PeopleSoft Payroll benchmark on a two-tier setup: 40xSSDs (SLC) plus a striped array of 12 15KHDDs.

    They then compared this to a single tier of 62x15KHDDs, and the performance was (as it turned out) identical.

    For simplicity, let's ignore the 12x15K HDDs in the SSD setup and just say that the "performance equivalence" ratio in this application was 40xSSD = 62xHDD, or 1.5:1.

    Looking at the Bonnie++ numbers for HDD and SSD above (210 IOPS for HDD vs. 441 IOPS for SSD), we get a "performance equivalence" ratio of 2.1:1.

    In this light the 441 r/w IOPS for the Intel X25E on Bonnie++ seems right in line.

    Unfortunately...this is between 1/10th and 1/100th of what the SSD makers are putting in spec sheets.

    Re: "It looks like Bonnie did not have sufficient queue depth...up to 32"

    Well...maybe that's why the Bonnie++ results look more like the real world than the vendor spec-sheets, and if so that's a good thing!

    I have been profiling enterprise application workloads for almost 10 years and I can't recall seeing a queue depth at the target (HDD) end of more than 2 or 3 outstanding requests. I notice Intel's (and seemingly everyone else's) numbers are always based on 32-deep queues -- which is meaningless.

    As far as I'm concerned, for any given application I need to see REAL WORLD SSD "performance equivalence" of about 10:1 (replace 10 fast-spinners with 1 SSD) before it's cost-effective to start building two-tier setups for performance critical apps.

    So far, Flash SSD isn't even close.

  12. SSD drives have the advantage of being far more rugged and reliable than any regular hard drives. They are built using the Flash technology for faster and smoother performance. Flash technology means that it erases blocks of memory before writing them.

  13. I wish I had a Fusion-io device to lend you for another test run.
    They claim the cheapest version has 89000 - 119000 IOPS. I wonder what the real-world numbers are. I wonder what happens when you flick the power off in the middle of a db txn on this device.

  14. Interlude: The disk write cache is the feature that causes you to lose your data when the server machine crashes or loses power.

    some SSD devices (at least Intel 320 SSD drives) have capacitors, which allow the drive to flush its cache after a power failure or another incident.

    what do you think about this?

  15. @edo: Yes, that's the sort of thing you need. The X25-E didn't have that, but all the newer SSDs intended for server use ought to have them. The trick might be to get information from the manufacturer or vendor to verify that.