We Bought 1347 Used Data Center SSDs to Look at SSD Endurance


Key Lessons Learned

Some folks just want to see the “so what.” I think we learned a few things.

  1. DWPD is becoming an increasingly silly metric as drive capacities have grown. At 40GB and 80GB, DWPD was important. As we move to drives 100x as large, we transition from only hot data being stored on the SSDs to more data being written once and read many times.
  2. Organizations are still significantly overbuying high-endurance drives. Out of 1347 drives, the only individual drive writing at a pace above its DWPD rating, and the only drive model averaging anywhere near its rating, was the Intel 750, a consumer SSD in data center form factors with low capacities and low endurance ratings. Beyond that, drives were generally used well under their ratings (a worked sketch of the arithmetic follows this list).
  3. Something folks often miss is that drives often do not see 24x7x365 workloads over five years. If a drive is installed and powered on but not brought into production for a week or two, it accumulates hours but not writes. Likewise, maintenance windows, or even just daily and seasonal swings in workloads, mean time elapses with relatively few writes.
  4. We looked only at TBW, not how the data was written. DWPD ratings assume 4K random writes as a worst-case workload, which is not the workload many of our drives saw. Instead, drives often see far more reads and likely more sequential-style writes, especially as they get larger. Once we get to 61.44TB and 122.88TB SSDs, even just the write speeds needed to sustain high drive-write rates become a challenge with QLC NAND.
  5. There are certain applications, such as logging, write caching, and SSDs used as cold memory expansion, where heavy write patterns are very real. Some applications certainly do require higher-endurance drives.
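For readers who want to run this check on their own fleet, here is a minimal sketch of the arithmetic, using made-up SMART values rather than figures from our data set. It assumes an NVMe drive, where smartctl reports "Data Units Written" (counted in 512,000-byte units) and "Power On Hours"; SATA drives report host writes in vendor-specific units, so adjust the conversion accordingly.

```python
# Back-of-the-envelope check of a drive's actual write rate versus its rating.
# The SMART values below are made-up examples, not measurements from our data set.

DATA_UNITS_WRITTEN = 4_200_000_000   # NVMe SMART "Data Units Written" (512,000-byte units)
POWER_ON_HOURS = 26_280              # NVMe SMART "Power On Hours" (~3 years powered on)
CAPACITY_TB = 7.68                   # advertised drive capacity
RATED_DWPD = 1.0                     # vendor endurance rating over the warranty period
WARRANTY_YEARS = 5

tb_written = DATA_UNITS_WRITTEN * 512_000 / 1e12            # total host writes in TB
days_powered = POWER_ON_HOURS / 24
actual_dwpd = tb_written / CAPACITY_TB / days_powered
rated_tbw = RATED_DWPD * CAPACITY_TB * 365 * WARRANTY_YEARS

print(f"Host writes: {tb_written:,.0f} TB over {days_powered:,.0f} powered-on days")
print(f"Actual DWPD: {actual_dwpd:.2f} (rated {RATED_DWPD})")
print(f"Rated TBW:   {rated_tbw:,.0f} TB, of which {tb_written / rated_tbw:.0%} has been used")
```

Pulling these two SMART counters from a handful of production drives is usually enough to show whether a fleet really needs write-intensive parts or whether 1 DWPD class drives would do.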

Of course, our methodology is far from perfect, but the random(-ish) sourcing of drives still makes this an interesting data set.

Final Words

We covered this in our Solidigm D5-P5336 61.44TB SSD Review, but with the popularity of massive SSDs, the idea of using drive writes per day becomes borderline silly. Endurance is certainly an important metric, but as the quantity of NAND on a device grows, the current DWPD regime feels wrong. Large drives are unlikely to see consistent 4K random writes; more reads and more sequential writes are far more likely. With 122.88TB drives, the data being stored is transitioning off hard drive arrays, which offer lower performance and lower duty cycles.
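To put a number on that, here is a quick sketch of the sustained write rate a drive would have to absorb around the clock just to reach 1 DWPD; the capacities are example capacity points and the arithmetic is the only point being made.

```python
# Why DWPD gets strained at large capacities: the sustained host write rate needed
# to actually reach a given DWPD grows linearly with drive size.
SECONDS_PER_DAY = 86_400

def sustained_rate_gbps(capacity_tb: float, dwpd: float) -> float:
    """Average host write rate in GB/s needed to sustain `dwpd`, 24x7."""
    return dwpd * capacity_tb * 1000 / SECONDS_PER_DAY

for capacity_tb in (0.4, 3.84, 15.36, 61.44, 122.88):
    rate = sustained_rate_gbps(capacity_tb, dwpd=1.0)
    print(f"{capacity_tb:7.2f} TB at 1 DWPD -> {rate:5.2f} GB/s written continuously")
```

At 122.88TB, even 1 DWPD works out to roughly 1.4GB/s of writes nonstop for the entire warranty period, a rate very few real workloads sustain and one that QLC NAND would likely struggle to absorb continuously anyway.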

STH 2016 2024 Recycled SSD Population DWPD Actual By Capacity

There is still a very valid case for purchasing lower-capacity, higher-DWPD drives, but often for other reasons. For example, it is common to see drives offered in 1 DWPD and 3 DWPD variants, with the 3 DWPD variants trading some capacity for higher performance. There, performance is the driver, not the endurance rating.

STH 2016 2024 SSD Population DWPD Actual By Vendor And Model Averages

Still, this was a fun exercise to see how a random sample of organizations actually used their drives. Endurance is quickly becoming the wrong metric on which to base purchasing decisions. Now that you have read this, especially with larger drives in mind, we hope you have a new perspective on how much weight traditional DWPD ratings deserve. It is time to move beyond them as an industry.

21 COMMENTS

  1. I, perhaps stupidly, just format to 90% of capacity. Let them burn up the write cycles; the firmware can reallocate as needed.

    Shrug.

  2. I would note that the 3DWPD drives tend to have higher random R/W IOPS. This can be important when you get into heavy DBs or ERP systems.

  3. We bought about a dozen 3 TB Micron 9300 MAX drives with 3 DWPD in 2020. None of them have failed, and it looks like when they hit 5 years they will have 50-70% endurance left, per SMART data.
    No plans to replace them at the end of the 5-year warranty.

  4. How does all this correlate to SSD endurance/DWPD in single-user applications such as laptops or home desktop systems? My issue seems to stem from an apparent lack of data for the home user rather than for enterprise server disk arrays. Also, WHAT ABOUT UNPOWERED DATA RETENTION?! Especially for portable SSDs that connect to a USB port and may not get powered up every day the way server drives do. If I’m ever going to feel comfortable upgrading the “spinning rust” I currently have in my Bravo 17 to a large-capacity, many-TB SSD, I kinda need to know this!

  5. @Stephen

    Consumer SSDs are more reliable than consumer HDDs. In normal desktop usage you won’t come anywhere close to even hitting a 0.3DWPD usage. I’m still using my 240GB SATA SSD from 2013 and it hasn’t crashed.

    In terms of unpowered data retention an SSD should be able to go a year without needing to be powered on to hold the data.

    Going with an SSD over an HDD will make the biggest change to your overall user experience. It is very much worthwhile to go with one.

  6. I made the switch to SSDs when affordable 4TB ones first came out. They were 2-3 times as expensive as magnetic disks, so I got 4TB magnetic ones as write mirrors. Since then the magnetic ones have failed at least once, but the original SSDs still work fine. So if you include the cost of replacing the magnetic disks, the SSDs were actually no more expensive overall, and that was 5-6 years ago. With SSD costs even lower now, the deal is even better.

    Unfortunately I still have to buy magnetic disks because I don’t have enough drive bays to get the capacities I need in SSD form, which is why I’m so excited about these higher capacity SSDs coming out. Finally I might be able to go 100% SSD without sacrificing storage capacity.

    @RJCvn: If you format a drive to 90% of its capacity, the remaining 10% will just sit unused; the drive will never use it to reallocate bad sectors, because the drive firmware operates at a lower level than disk partitions. You need to overprovision the drive using the manufacturer’s firmware utilities, which results in the drive reporting a smaller usable capacity. Then you format 100% of this smaller drive and the firmware will use the extra space for reallocating flaky flash cells.

    But first you might want to look at the SMART data to see how much of this extra space is actually being used; most manufacturers already overprovision their drives, so you may find very little of it in use. With most drive firmware doing wear levelling, it’s less likely for specific blocks to fail and more likely for the whole drive to wear out at once: wear levelling means all the cells have been written roughly the same number of times, so they all start failing at much the same time.

  7. Great analysis, thank you. I suspect I’ll be linking to this every time the ‘QLC write endurance’ discussion starts up again.
    Relax, people – as long as you’re not chia plotting on the drive, it’ll probably be fine.

    I can contribute one data point myself – I bought a used Kioxia CD6-R 7680GB from China in early 2023. It arrived with about 9,400 hours on it and 2,137,464 GB written. Yep, that’s over two petabytes.
    That makes for 278 drive writes in 392 days, or about 0.7 DWPD on quite a big drive.
    Hmm – actually, it’d top your max DWPD chart for the 7680GB capacity category. Yay?

    Either way, SMART tells me that it’s at 3% percentage used (attribute 05) – or 97% health remaining. Presumably sequential workloads, but yeah – good luck writing this drive to death. If it ever dies, I expect it’ll be the controller that goes.

  8. Former server procurement guy here. I’d say endurance is still a factor, and we shouldn’t write it off so quickly. I definitely do agree it’s very often a factor of overestimated importance though. In my former job, I had a hard time convincing ops teams to monitor their actual DWPD (such an easy thing to do!) to show them they really don’t need these ~10 DWPD “write-intensive” (WI) SSDs that they insisted on spec’ing, and that they could easily do with ~3 DWPD “mixed-use” (MU) ones instead and stop over-buying.

    And if we have to discuss endurance, then DWPD, or really DWPD-over-an-N-year-warranty-period (as it should properly be stated, but often isn’t…), is not that bad a metric. Expressing endurance as warranted TBW is another option, and one can easily translate one into the other (see the sketch at the end of this comment), but the DWPD metric has the advantage of remaining comparable across capacity classes and of allowing useful classifications such as the industry-standard “write-intensive” (WI, ~10 DWPD) or “mixed-use” (MU, ~3 DWPD).

    Also, something that people persistently seem unable to grasp is that, within a given SSD product line, a 3.84 TB MU SSD and a 3.2 TB WI SSD are hardware-wise the exact same 4 TiB = 4.4 TB SSD, just with different firmware parameters that set aside more (WI) or less (MU) spare NAND cells.

    And because yes, we tend to overestimate our actual DWPD needs, a fair share of these spare cells ends up unused. This is wasted capacity, it comes at a cost, and it is something worth working on, at the very least by collecting usage data in the telemetry so that people get a better grasp of what they actually use and need.

    I know that (at least some…) hyperscalers go even further and purchase SSDs with special firmware that allows them (among other things) to manage cell wear-levelling their own way and optimize it to their specific workloads, thus reducing this waste and the associated cost to basically nothing.
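    As a rough illustration of the two points above, the DWPD/TBW translation and the shared raw NAND behind MU and WI SKUs, here is a small sketch using the example capacities from this comment; the DWPD values and the derived spare percentages are illustrative assumptions, not vendor specifications.

```python
# Endurance expressed as DWPD (over a warranty period) and as TBW are two views of
# the same number. The MU/WI capacities below are the examples from the comment;
# the DWPD ratings and spare percentages are illustrative assumptions.
WARRANTY_YEARS = 5
RAW_TIB = 4.0                        # same physical NAND assumed for both SKUs
raw_tb = RAW_TIB * 1.024**4          # 4 TiB is roughly 4.40 TB

def dwpd_to_tbw(dwpd: float, capacity_tb: float, years: int = WARRANTY_YEARS) -> float:
    return dwpd * capacity_tb * 365 * years

def tbw_to_dwpd(tbw: float, capacity_tb: float, years: int = WARRANTY_YEARS) -> float:
    return tbw / (capacity_tb * 365 * years)

for label, user_tb, dwpd in (("MU", 3.84, 3.0), ("WI", 3.20, 10.0)):
    spare = 1 - user_tb / raw_tb                 # NAND held back as spare area
    tbw = dwpd_to_tbw(dwpd, user_tb)
    print(f"{label}: {user_tb} TB usable of {raw_tb:.2f} TB raw ({spare:.0%} spare), "
          f"{dwpd:.0f} DWPD is about {tbw:,.0f} TBW over {WARRANTY_YEARS} years")

# And back the other way: 14,016 TBW warranted on a 7.68 TB drive is a 1 DWPD part.
print(f"14,016 TBW on 7.68 TB ~ {tbw_to_dwpd(14_016, 7.68):.1f} DWPD")
```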

  9. @G

    I have some 7.68TB Micron 9300 Pro 1 DWPD NVMe drives in a software-defined storage array. We got them in 2019, and after 5 years of use that includes development DBs, they are at 4% used according to SMART. We also have a VMware vSAN array using P4800X as the cache drive and WD Gold (rebranded Ultrastar DC SN640s) as the capacity drives. After 4 years of use with even more DBs in that cluster, they show 0% used in SMART. You are 100% correct that people overestimate how much will be written to the drives. Unless you have a storage network that can handle the extra IOPS from the MU drives, 99.999% of the time the RI (1 DWPD) drives will last longer than the SAN. Heck, I have 2 240GB Intel D3-S4510s acting as a write cache for a NAS. Those are 2 DWPD disks that I set to a higher over-provisioning. After 4 years of HEAVY writes they still have 35% life left.
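    A naive way to read those percentage-used numbers is a straight linear extrapolation over the time in service. The inputs below are roughly the figures mentioned in the comments above; real wear depends on the workload, so treat this as a sanity check rather than a forecast.

```python
# Linear extrapolation of the NVMe "Percentage Used" endurance estimate.
# Input figures approximate the drives mentioned in the comments; this is a
# rough sanity check, not a lifetime prediction.
def years_to_wear_out(percentage_used: float, years_in_service: float) -> float:
    """Years until 100% used, assuming wear continues at the observed average rate."""
    if percentage_used <= 0:
        return float("inf")   # too little recorded wear to extrapolate from
    return years_in_service * 100 / percentage_used

examples = (
    ("Micron 9300 Pro, dev DBs", 4.0, 5.0),    # ~4% used after ~5 years
    ("Kioxia CD6-R, used import", 3.0, 1.1),   # ~3% used after ~1.1 years of runtime
)
for name, used_pct, years in examples:
    print(f"{name}: {used_pct:.0f}% used in {years:g} years "
          f"-> roughly {years_to_wear_out(used_pct, years):.0f} years to reach 100%")
```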

  10. Why should they be destroyed? Encryption is a thing. We believe in math.

    I would love to see the other stats on these devices, the wearout indicators, time over critical temp, etc.

  11. I think there’s a specific subset of apps and uses that need 30 DWPD or more. For everything else, you’re right that they’re better off with bigger drives.

    You made this point in the video but not in the article, and we saw exactly what you said. A 24-disk RAID array of 16TB HDDs took over a week to finish rebuilding a failed drive due to the activity on it. We just lost a drive in a 24-drive array of 15.36TB Micron SSDs, and it was fully rebuilt in less than a day. That’s worth a lot too.

  12. Is this dataset available? I’d be more interested in seeing total writes (sliced along similar dimensions) rather than daily writes.

  13. Patrick, you’re claiming sequential writes are less problematic than random 4K writes. How is this possible, since sequential writes are just sequentially written blocks?

    I know the internal block size of an SSD might be bigger than 4K, but from everything I have seen, SMART then reports that block size in its metrics, too. Is there any corner case I’m overlooking?

  14. If you write 4K random, the drive ends up programming cells that are not full, so you’re eating P/E cycles without filling the cell with data. If you write big sequential blocks, you’re using most if not all of a cell’s capacity.
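    To make that mechanism concrete, here is a toy flash-translation-layer simulation, not any real drive’s firmware and with made-up block and page counts, showing how scattered 4K overwrites force garbage collection to relocate still-valid pages (extra NAND program/erase wear) while sequential overwrites invalidate whole erase blocks, so garbage collection copies almost nothing.

```python
# Toy FTL model: compare write amplification (NAND page programs per host page
# write) for sequential versus random 4K overwrites. Page/block counts, the
# single open block, and the greedy GC policy are simplifying assumptions.
import random

PAGES_PER_BLOCK = 64     # pages per erase block (toy number)
USER_BLOCKS = 32         # blocks backing the advertised capacity
SPARE_BLOCKS = 4         # overprovisioned blocks the host never sees
TOTAL_BLOCKS = USER_BLOCKS + SPARE_BLOCKS
USER_PAGES = USER_BLOCKS * PAGES_PER_BLOCK

def write_amplification(workload):
    blocks = [[] for _ in range(TOTAL_BLOCKS)]  # logical page numbers, in program order
    location = {}                               # lpn -> (block, slot) of its valid copy
    free = list(range(1, TOTAL_BLOCKS))
    open_block = 0
    nand_writes = 0

    def live_pages(b):
        return [l for i, l in enumerate(blocks[b]) if location.get(l) == (b, i)]

    def garbage_collect():
        # Greedy victim selection: the full block holding the fewest valid pages.
        closed = [b for b in range(TOTAL_BLOCKS)
                  if b != open_block and len(blocks[b]) == PAGES_PER_BLOCK]
        victim = min(closed, key=lambda b: len(live_pages(b)))
        live = live_pages(victim)
        blocks[victim] = []
        free.append(victim)
        for l in live:          # relocating valid data is the "amplified" write
            write_page(l)

    def write_page(lpn):
        nonlocal open_block, nand_writes
        while len(blocks[open_block]) == PAGES_PER_BLOCK:
            if free:
                open_block = free.pop()
            else:
                garbage_collect()
        blocks[open_block].append(lpn)
        location[lpn] = (open_block, len(blocks[open_block]) - 1)
        nand_writes += 1

    for lpn in workload:
        write_page(lpn)
    return nand_writes / len(workload)

n = USER_PAGES * 20   # overwrite the whole logical space ~20 times
sequential = [i % USER_PAGES for i in range(n)]
random_4k = [random.randrange(USER_PAGES) for _ in range(n)]
print(f"Sequential overwrite WAF: {write_amplification(sequential):.2f}")  # close to 1.0
print(f"Random 4K overwrite WAF:  {write_amplification(random_4k):.2f}")   # noticeably higher
```

    With these toy parameters the sequential workload lands near a write amplification of 1.0 while the random 4K workload lands noticeably higher, which is the gap the 4K-random worst-case endurance ratings are built around.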

  15. Just to add, I had a few 1.6TB Intel S3500 drives (pretty old SATA stuff) that were rated at 880TBW at that capacity, and they regularly died just after 4PBW, outlasting their rated writes by nearly five times.
    Of course, this isn’t recommended in the slightest, but it shows that sometimes even the manufacturers can underestimate what their drives are capable of.

  16. @David,

    I would suggest you go F yourself with that kind of attitude toward used drives, but if you’re that anally-retentive, you wouldn’t be able to.

  17. Great data. I always had a feeling that recycled drives, like DRAM, should be a bigger market than it currently is.
