At the May T10 meeting in Vancouver BC, a revision of the atomic write proposal was brought in by Samsung, HP, and NetApp (SBC-4 SPC-5 Atomic writes and reads (14-043) [Martin, Ballard, Knight]) and was accepted by the committee. This provides for a single atomic write and associated information regarding atomic boundaries and granularities. It is unlikely that any vectored command will be proposed unless a strong champion arrives.
A champion seems unlikely to arrive in the near term, given the unprecedented consolidation now occurring in the solid state storage industry. For example, LSI was acquired by Avago, which then sold the SandForce organization to Seagate, while SanDisk acquired Fusion-io. It is more likely that the features that flash technologies can provide will become available through proprietary software stacks for the time being.
Perhaps the most interesting work around persistent memory is occurring in the SNIA Non Volatile Memory technical work group (NVMP). This group has been dealing with some of the changes software applications need in order to use non-volatile memory as storage. Discussions around atomics and transactional operations are routine in this work group, and participation by industry leaders has been steadily growing, a good indication of interest in this technology. The first draft specification is available now, and work is continuing on further topics, including remote persistent memory and isolation.
The real value of vectored commands and atomic commands becomes apparent when they are combined into vectored atomic writes. A database update that requires multiple writes to update a record (e.g., multiple data fields, or a data field and its links) will be significantly faster if the writes can be combined in a single vectored write command that is performed atomically. In this case, the data segments defined in the vectored command are either all written successfully or, if an error occurs, all restored to their original values from before the atomic write was attempted. In other words, if the vectored write operation succeeds, all of the data segments will contain the new data. If not, all of the data segments will contain the data they had before the vectored write operation was attempted. This aggregation of writes, along with the atomicity properties, can lead to a significant improvement in database performance.
So, what’s the problem?
First, there are issues with vectored commands around error reporting. If one of the segment writes fails, how do you tell the initiator where the command failed? If the writes are all atomic, it doesn’t matter, since all of the segments still hold the old data. It is interesting to note the position of the error for failure analysis, but that can be obtained in a variety of ways, including vendor specific methods.
Another problem is related to support for bi-directional commands. Here’s a flow diagram for a typical, non-vectored READ command:
And here’s the flow for a vectored read command:
In this case the initiator needs to send the segment descriptor list to the target (a data out phase) and then turn the bus around to receive the incoming data (a data in phase), which makes this a bi-directional SCSI command. This has been the single biggest objection to vectored commands. Many implementations of SCSI transports were not designed to accommodate this type of bus transaction. And it’s not just a firmware change – much of this low level processing has been embedded in SCSI controller state machines.
Finally, implementing vectored atomic write operations is difficult in both traditional rotating media and array controller systems. For these types of systems, the expense and maintenance of such functionality simply does not make economic sense. But flash memory based storage systems typically use some sort of write logging mechanism, and for them implementing this functionality is remarkably easy.
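To see why a write-logging design makes this easy, here is a minimal Python sketch. It is a toy model, not any vendor's flash translation layer: the class `LogStructuredStore`, its `vectored_atomic_write` method, and the `fail_at` fault-injection parameter are all hypothetical names. New data is appended to a log, and the logical-to-physical map is updated only after every segment has been written, so a failure simply leaves the old map, and therefore the old data, in place.

```python
class LogStructuredStore:
    """Toy model of a flash device that never overwrites data in place."""

    def __init__(self):
        self.log = []       # append-only physical log of block contents
        self.mapping = {}   # logical block address -> index into self.log

    def read(self, lba):
        """Return the current contents of a logical block (None if unwritten)."""
        idx = self.mapping.get(lba)
        return None if idx is None else self.log[idx]

    def vectored_atomic_write(self, segments, fail_at=None):
        """Write a list of (lba, data) segments as one all-or-nothing operation.

        fail_at is a hypothetical fault-injection hook: if set, writing the
        segment with that index raises an error.
        """
        pending = {}                      # staged map updates, not yet visible
        for i, (lba, data) in enumerate(segments):
            if i == fail_at:
                # The map was never touched, so readers still see old data.
                raise IOError(f"write failed on segment {i}")
            self.log.append(data)         # out-of-place write to the log
            pending[lba] = len(self.log) - 1
        self.mapping.update(pending)      # commit point: publish all segments


store = LogStructuredStore()
store.vectored_atomic_write([(10, "name-v1"), (500, "link-v1")])

try:
    store.vectored_atomic_write([(10, "name-v2"), (500, "link-v2")], fail_at=1)
except IOError:
    pass
after_failure = (store.read(10), store.read(500))   # still the v1 values

store.vectored_atomic_write([(10, "name-v2"), (500, "link-v2")])
after_success = (store.read(10), store.read(500))   # now the v2 values
```

In this model the commit is a dictionary update; a real device would presumably need to make that step durable as well, for example by writing a single commit record to its log.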
…to be continued…
Another factor that’s important to consider in the atomic write story is the advent of PCIe based storage. Multiple vendors now produce PCIe cards that provide terabytes of storage on a single card. The NVMe interface was developed to take advantage of this class of device, but T10 was not far behind: the SCSI over PCIe (SOP) and PCIe Queuing Interface (PQI) specifications were developed in response and are now reaching maturity. These developments have allowed database servers to become much more efficient by speeding up access to significant amounts of data – usually by several orders of magnitude.
With this increase in speed, the system overhead required to process each SCSI command CDB becomes a much larger part of the total time required to process a write or read operation. With rotating media, it may take many milliseconds to process a write command due to the latency inherent in the physical media. In this case, the time required to process the command CDB is very small compared to the overall operation. For flash devices, access is measured in microseconds, and processing the CDB becomes a significant portion of the time needed to complete the write operation.
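A rough calculation makes the point. The numbers below are illustrative assumptions chosen only to show the ratio, not measurements: 20 microseconds of per-command processing overhead, 8 milliseconds of media latency for a rotating disk, and 50 microseconds for a flash device.

```python
def overhead_fraction(overhead_us, media_us):
    """Fraction of the total operation time spent on command processing."""
    return overhead_us / (overhead_us + media_us)


# All three figures are assumed values for illustration only.
CDB_OVERHEAD_US = 20.0     # per-command processing cost
DISK_MEDIA_US = 8000.0     # seek plus rotational latency on rotating media
FLASH_MEDIA_US = 50.0      # flash access time

disk_share = overhead_fraction(CDB_OVERHEAD_US, DISK_MEDIA_US)
flash_share = overhead_fraction(CDB_OVERHEAD_US, FLASH_MEDIA_US)

print(f"rotating media: {disk_share:.1%} of the operation is command overhead")
print(f"flash:          {flash_share:.1%} of the operation is command overhead")
```

Under these assumed figures the same fixed overhead is a fraction of a percent of a disk operation but more than a quarter of a flash operation.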
A solution to this problem is to define vectored read and write operations. These commands permit read and write operations to multiple data segments which do not have to be contiguous, unlike normal read or write commands. This is analogous to the scatter/gather lists employed in typical HBA interfaces and provided in the PQI and SOP specifications. Historically, there has been strong resistance to vectored commands within the T10 committee due to the complexity of error processing. Vectored commands also make little sense for rotating media, where the latencies are large and a queue of many individual commands works as well and is more easily implemented and managed. But for flash memory based storage with a PCIe interface, atomics may provide a strong motivator.
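As a sketch of the idea, the Python below models a device as a list of blocks and compares issuing one read per extent with a single vectored read that carries a list of segment descriptors. The `Segment` tuple, the `ToyDevice` class, and its `commands_processed` counter are hypothetical names for illustration; they are not the descriptor format of any T10 proposal.

```python
from collections import namedtuple

# Hypothetical segment descriptor: starting block and number of blocks.
Segment = namedtuple("Segment", ["lba", "length"])


class ToyDevice:
    def __init__(self, num_blocks):
        self.blocks = [f"block-{i}" for i in range(num_blocks)]
        self.commands_processed = 0   # each command pays the fixed overhead once

    def read(self, lba, length):
        """Ordinary read: one contiguous extent per command."""
        self.commands_processed += 1
        return self.blocks[lba:lba + length]

    def vectored_read(self, segments):
        """Vectored read: many non-contiguous extents in one command."""
        self.commands_processed += 1
        data = []
        for seg in segments:
            data.extend(self.blocks[seg.lba:seg.lba + seg.length])
        return data


segments = [Segment(3, 2), Segment(40, 1), Segment(17, 3)]

# One command per extent.
dev_a = ToyDevice(64)
individual = []
for seg in segments:
    individual.extend(dev_a.read(seg.lba, seg.length))

# The same extents gathered by a single command.
dev_b = ToyDevice(64)
gathered = dev_b.vectored_read(segments)
```

Both approaches return the same six blocks, but the first device processes three commands and the second only one, which is where the per-command overhead is saved.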
…to be continued…
At the January T10 meeting, the proposal for an initial atomic read and write command was accepted in the T10 Commands, Architecture, and Protocol (CAP) committee, only to be rejected at the plenary meeting. In the March meeting, a greatly simplified proposal for atomic writes was presented that did not make it out of committee (4 yes, 7 no, and 6 abstain). The atomic read and write proposal, along with several variants, has been in play for over three years and has had innumerable revisions. What’s going on here?
Some background: an atomic write is a write to a storage device with an all-or-nothing guarantee. When the command completes, either the new data has been successfully written to the device or, in the case of an error, none of the data was written and the data that was present before the write is preserved. This avoids the problem of a torn write, where only part of the specified data range was written correctly and the remainder may be either old data – or worse – undefined. In database applications this sort of behavior can lead to an inconsistent database. This is usually avoided by writing a log record to the database which is used by the database application to recover from the failure and make the database consistent. But this requires an additional write for each database record update and slows the system. As large flash-based storage systems become commonplace in database servers, this behavior becomes doubly problematic. Not only do these additional writes reduce overall system performance, they are also a further write burden on the flash memory system of the storage device.
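The difference can be illustrated with a toy model. In the Python sketch below, a write of several blocks is interrupted partway through; the `ToyDisk` class and the `fail_after` fault-injection parameter are hypothetical and exist only to simulate the failure. The atomic variant here restores a saved copy of the range, which is just one conceivable way to get the all-or-nothing result.

```python
class ToyDisk:
    def __init__(self, blocks):
        self.blocks = list(blocks)

    def write(self, lba, data, fail_after=None):
        """Plain write: blocks land one at a time, so a failure tears the range."""
        for i, block in enumerate(data):
            if fail_after is not None and i == fail_after:
                raise IOError("media error during write")
            self.blocks[lba + i] = block

    def atomic_write(self, lba, data, fail_after=None):
        """Atomic write: on any failure the whole range keeps its old contents."""
        saved = self.blocks[lba:lba + len(data)]
        try:
            self.write(lba, data, fail_after)
        except IOError:
            self.blocks[lba:lba + len(data)] = saved   # undo the partial update
            raise


old = ["old0", "old1", "old2", "old3"]
new = ["new0", "new1", "new2", "new3"]

plain = ToyDisk(old)
try:
    plain.write(0, new, fail_after=2)
except IOError:
    pass
torn_result = plain.blocks       # first two blocks new, last two still old

atomic = ToyDisk(old)
try:
    atomic.atomic_write(0, new, fail_after=2)
except IOError:
    pass
atomic_result = atomic.blocks    # entire range still holds the old data
```

After the plain write fails, the range holds a mix of new and old blocks, which is exactly the torn state a database must otherwise repair from its log.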
In the past, this was never an issue because atomic writes are difficult to implement on disk drives. For rotating media it is more cost effective to have the application simply do another write to cover the relatively rare error, rather than build that complexity into cost sensitive disk drives. With the advent of flash memory based storage systems, implementing atomic writes is much easier thanks to the way flash memory is written and managed. Reducing the number of write operations is also a big plus for flash due to the relatively limited write capacity of most flash based storage devices.
Another factor that’s important to consider in the atomic write story is the advent of PCIe based storage. Multiple vendors now produce PCIe cards that provide terabytes of storage on a single card. The NVMe interface was developed to take advantage of this class of device. T10 was not far behind: the SCSI over PCIe (SOP) and PCIe Queuing Interface (PQI) specifications were developed in response and are now reaching maturity. These developments have allowed database servers to become much more efficient by speeding up access to significant amounts of data – usually by several orders of magnitude.
These changes have allowed servers to keep large databases in local storage with speeds approaching that of random access memory. More virtual machines can be instantiated on each server, with better performance than network attached storage provides and with lower overall data center costs and power consumption. And SCSI is an ideal interface for such installations thanks to a mature and well developed protocol for handling multiple initiators. Work is under way in the committee to provide and manage millions of LUNs from a single SCSI target. While NVMe is a lightweight and low latency interface, it does not have the breadth or maturity of the SCSI protocol for handling large numbers of initiators or logical units.
…to be continued…
The March T10 meeting week was held in New Orleans, Louisiana this year. SOP-1, PQI-1, and SBC-3 all closed public review with no comments. SPC-4 should go to public comment in May 2014. The atomic read and write proposals were overhauled after being rejected at the January plenary meeting, but all to no avail. Work on the initial draft of Zone Block Recording (ZBR) continues in force.
My full notes are here, the official T10 minutes for CAP are here, and the T10 minutes for the plenary sessions are here.
I attended the INCITS T10 (SCSI) meeting from January 13 to 16 in a warm and sunny Orange County. The highlights of the meeting:
- One of my perennial favorites, Atomic Writes and Reads (13-064r8) was passed in the CAP committee meeting with 11 yes and 3 no votes, but when it was reviewed in the plenary meeting, it failed to pass with 4 yes and 8 no votes. Back to the drawing board.
- SPC-4 Letter ballot comment resolution (13-256r3) continued, with some 800 comments remaining to be resolved.
- Work began in earnest on the Zoned Block Commands (ZBC 14-051r3) specification for SCSI zoned block devices, notably those employing shingled write.
My complete notes for the meeting can be found here.