Atomic Commands

At the January T10 meeting, the proposal for an initial atomic read and write command was accepted in the T10 Commands, Architecture, and Protocol (CAP) committee, only to be rejected at the plenary meeting. At the March meeting, a greatly simplified proposal for atomic writes was presented, but it did not make it out of committee (4 yes, 7 no, 6 abstaining). The atomic read and write proposal, along with several variants, has been in play for over three years and has gone through innumerable revisions. What’s going on here?

Some background: an atomic write is a write to a storage device that either completes in full or has no effect. When the command completes, either all of the new data has been written to the device or, if an error occurred, none of it was written and the data that was present before the write is preserved. This avoids the problem of a torn write, where only part of the specified data range was written correctly and the remainder may be old data or, worse, undefined. In database applications, a torn write can leave the database inconsistent. This is usually avoided by writing a log record, which the database application uses to recover from the failure and make the database consistent again. But this requires an additional write for each database record update and slows the system. As large flash-based storage systems become commonplace in database servers, this behavior becomes doubly problematic: not only do these additional writes reduce overall system performance, they also place a further write burden on the flash memory in the storage device.
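To make the difference between a torn write and an atomic write concrete, here is a minimal Python sketch of a simulated block device. The class, method names, and the `fail_after` parameter are all hypothetical and exist only for illustration; nothing here reflects the actual T10 proposal or any real device interface.

```python
class SimulatedDevice:
    """Toy in-memory block device (hypothetical, for illustration only)."""

    def __init__(self, num_blocks):
        self.blocks = ["old"] * num_blocks

    def plain_write(self, start, data, fail_after=None):
        # Blocks are updated in place one at a time, so a failure partway
        # through leaves a torn write: some blocks new, some blocks old.
        for i, value in enumerate(data):
            if fail_after is not None and i == fail_after:
                raise IOError("simulated failure mid-write")
            self.blocks[start + i] = value

    def atomic_write(self, start, data, fail_after=None):
        # The new data is staged on a copy and only swapped in once every
        # block has been staged, so a failure leaves the old data intact.
        staged = list(self.blocks)
        for i, value in enumerate(data):
            if fail_after is not None and i == fail_after:
                raise IOError("simulated failure mid-write")
            staged[start + i] = value
        self.blocks = staged


def run_demo():
    """Return the device contents after a failed plain and atomic write."""
    torn = SimulatedDevice(4)
    try:
        torn.plain_write(0, ["new"] * 4, fail_after=2)
    except IOError:
        pass

    intact = SimulatedDevice(4)
    try:
        intact.atomic_write(0, ["new"] * 4, fail_after=2)
    except IOError:
        pass

    return torn.blocks, intact.blocks
```

Calling `run_demo()` shows the plain write leaving a mix of new and old blocks, while the atomic write leaves every block as it was before the command. An application using the plain write would need its own log record to repair the first case.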

In the past, atomic writes were not seriously considered because they are difficult to implement on disk drives. For rotating media it is more cost-effective to have the application simply do another write to cover the relatively rare error than to build that complexity into cost-sensitive disk drives. With the advent of flash-based storage systems, implementing atomic writes is much easier thanks to the way flash memory is written and managed. Reducing the number of write operations is also a big plus for flash because of the relatively limited write endurance of most flash-based storage devices.
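One plausible reading of "the way flash memory is written and managed" is that flash controllers generally write new data to fresh pages and then update a logical-to-physical mapping, rather than overwriting in place. The sketch below, in Python, illustrates that idea under this assumption. The class and its methods are hypothetical, and a real controller would also have to make the mapping update itself durable and atomic, which this toy ignores.

```python
class ToyFlashTranslationLayer:
    """Toy out-of-place writer (hypothetical, for illustration only)."""

    def __init__(self):
        self.pages = []    # append-only list standing in for physical pages
        self.mapping = {}  # logical block address -> index into self.pages

    def read(self, lba):
        index = self.mapping.get(lba)
        return None if index is None else self.pages[index]

    def atomic_write(self, start_lba, data, fail_after=None):
        # Step 1: write every block to a fresh page. Old pages are untouched.
        new_entries = {}
        for i, value in enumerate(data):
            if fail_after is not None and i == fail_after:
                # Pages written so far are simply orphaned; the mapping
                # still points at the old data, so readers see no change.
                raise IOError("simulated failure mid-write")
            self.pages.append(value)
            new_entries[start_lba + i] = len(self.pages) - 1
        # Step 2: switch the mapping for the whole range in one step.
        self.mapping.update(new_entries)
```

Because the old pages stay valid until the mapping is switched, a failed write costs nothing beyond some orphaned pages for later cleanup, and no second write from the application is needed to recover.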

Another factor that’s important to consider in the atomic write story is the advent of PCIe-based storage. Multiple vendors now produce PCIe cards that provide terabytes of storage on a single card. The NVMe interface was developed to take advantage of this class of device. T10 was not far behind: the SCSI over PCIe (SOP) and PCIe Queuing Interface (PQI) specifications were developed in response and are now reaching maturity. These developments have made database servers much more efficient by giving them access to significant amounts of data at much higher speeds, usually by several orders of magnitude.

These changes have allowed servers to keep large databases in local storage with speeds approaching that of random access memory. More virtual machines can be instantiated on each server, and they perform better than they would with network-attached storage, while overall data center costs and power consumption go down. SCSI is an ideal interface for such installations thanks to its mature and well-developed protocol for handling multiple initiators. Work is under way in the committee to provide and manage millions of LUNs from a single SCSI target. While NVMe is a lightweight, low-latency interface, it does not have the breadth or maturity of the SCSI protocol for handling large numbers of initiators or logical units.

…to be continued…