Capacity saving function: data deduplication and compression

When the capacity saving function is in use, the controller of the storage system performs data deduplication and compression to reduce the size of the data to be stored. Capacity saving can be enabled on DP-VOLs in Dynamic Provisioning pools. You can use the capacity saving function with data stored on all drive types, including encrypted flash drives and external storage.

Note: The capacity saving function cannot be used for Virtual Storage Platform G100.

How capacity saving works

The capacity saving function includes deduplication and compression:

  • Deduplication

    The data deduplication function deletes duplicate copies of data written to different addresses in the same pool and maintains only a single copy of the data at one address. The deduplication function is enabled on a Dynamic Provisioning pool and then on the desired DP-VOLs in the pool. When deduplication is enabled, data that exists in multiple copies across the DP-VOLs assigned to that pool is reduced to a single copy.

    When you enable deduplication on a pool, the deduplication system data volume (DSD volume) for that pool is created. The deduplication system data volume is used exclusively by the storage system to manage the deduplication function. A search table in the deduplication system data volume is used to locate redundant data in the pool.

  • Compression

    The data compression function uses the LZ4 compression algorithm to compress the data. Like deduplication, the compression function is enabled per DP-VOL.
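
The two functions above can be illustrated with a small conceptual sketch. It is not the storage system's implementation: the block size, the SHA-256 fingerprint, and the function names are assumptions made for illustration, a Python dictionary stands in for the search table in the deduplication system data volume, and zlib stands in for LZ4 because LZ4 is not part of the Python standard library.

```python
import hashlib
import zlib

BLOCK_SIZE = 8 * 1024  # hypothetical block size, chosen only for this sketch


def store_blocks(data: bytes, search_table: dict) -> list:
    """Split data into blocks and keep one compressed copy per unique block.

    search_table maps a block fingerprint to its compressed payload, loosely
    mirroring the role of the search table in the DSD volume.
    Returns the list of fingerprints needed to rebuild the data.
    """
    refs = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in search_table:
            # zlib is a stand-in for LZ4 in this sketch.
            search_table[fingerprint] = zlib.compress(block)
        refs.append(fingerprint)
    return refs


def read_blocks(refs: list, search_table: dict) -> bytes:
    """Rebuild the original data from its list of fingerprints."""
    return b"".join(zlib.decompress(search_table[ref]) for ref in refs)


# Two "volumes" sharing one pool-wide table: the "A" block is stored once.
pool_table = {}
vol_a = store_blocks(b"A" * BLOCK_SIZE * 3, pool_table)
vol_b = store_blocks(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE, pool_table)
```

Five logical blocks are written across the two volumes, but the shared table holds only two unique compressed blocks, which is the pool-wide effect that deduplication aims for.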

The following figure illustrates the capacity saving function.

[Figure: How deduplication and compression work]

Data received by the storage controller is stored in a temporary area in the pool. When the data is classified as inactive (for Dynamic Provisioning, one hour after the last update), capacity saving processing is performed, and the data after processing is stored in the data storage area. When that data is updated again, the previous copy in the data storage area is no longer required, but it continues to occupy pool capacity until garbage collection reclaims it. The pool capacity that is eventually required is the physical data capacity after capacity saving plus the capacity consumed by metadata.

Note
  • The temporary area and the data storage area are not assigned fixed capacities. They share the pool and use the pool as needed.
  • The temporary area is used when the post-process mode is applied. When the inline mode is applied, capacity saving processing is performed as data is received from the host, and host data is not stored in the temporary area.

The capacity overheads associated with the capacity saving function include the following:

  • Capacity consumed by metadata

    The capacity consumed by metadata for the capacity saving function (deduplication and compression) is approximately 3% of the consumed DP-VOL capacity that has been processed by capacity saving. For example, if the consumed capacity of a DP-VOL is 150 TB and the capacity saving function has processed 100 TB of that capacity and reduced it to 30 TB, the capacity consumed by metadata is approximately 3 TB (3% of 100 TB). The total consumed capacity of this DP-VOL at that point is 83 TB (30 TB of reduced data + 50 TB of unprocessed data + 3 TB of metadata).

  • Capacity consumed by garbage (invalid) data

    The capacity consumed by garbage data is approximately 7% of the total consumed capacity of all DP-VOLs with capacity saving enabled. The capacity is dynamically consumed based on garbage data created by the capacity saving process and cleaned by the background garbage collection process. The garbage collection process is a background process with a lower priority than host I/O, so the capacity consumed by garbage data depends on both the garbage created and the host I/O rate.
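
The overhead figures above can be checked with a short calculation. This is a sketch of the arithmetic only: the function and constant names are hypothetical, and the 3% and 7% ratios are the approximate values stated in this section, not guaranteed limits.

```python
METADATA_RATIO = 0.03  # approx. 3% of the consumed capacity already processed
GARBAGE_RATIO = 0.07   # approx. 7% of the total consumed capacity


def consumed_with_metadata_tb(consumed_tb, processed_tb, reduced_tb):
    """Estimate the consumed capacity of a DP-VOL partway through capacity saving.

    consumed_tb:  consumed capacity before capacity saving
    processed_tb: portion of consumed_tb already processed
    reduced_tb:   size of the processed portion after capacity saving
    """
    unprocessed_tb = consumed_tb - processed_tb
    metadata_tb = processed_tb * METADATA_RATIO
    return reduced_tb + unprocessed_tb + metadata_tb


def garbage_estimate_tb(total_consumed_tb):
    """Estimate garbage data for the given total consumed capacity.

    The caller supplies the total consumed capacity of all DP-VOLs with
    capacity saving enabled; this sketch does not decide how to measure it.
    """
    return total_consumed_tb * GARBAGE_RATIO


# The example from the text: 150 TB consumed, 100 TB processed down to 30 TB.
total_tb = consumed_with_metadata_tb(150, 100, 30)
print(round(total_tb, 2))
```

For the example in the text, the result is approximately 83 TB, matching 30 TB + 50 TB + 3 TB.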

Caution: Do not use capacity saving and NAS deduplication on the same volumes, because the additional processing decreases the I/O performance substantially. For information about the NAS deduplication function, see the File Services Administration Guide.

Capacity saving processing for existing data

Compression and deduplication processing is performed asynchronously for pages that already store data. This increases the free area of the pool over time and can reduce the cost of purchasing drives.

[Figure: Applying capacity saving]

Capacity saving processing for new write data

The capacity saving mode of a DP-VOL (post-process mode or inline mode) determines how capacity saving is applied to new write data from the host:

  • Post-process mode

    When you apply capacity saving with the post-process mode to a DP-VOL, the compression and deduplication processing are performed asynchronously for new write data. Since capacity saving processing is not performed at the time the new data is written, the post-process mode can reduce the impact of capacity saving processing on I/O performance, but pool capacity is required to store the new write data until the capacity saving processing is performed.

    When you enable capacity saving on a DP-VOL using Device Manager - Storage Navigator, post-process mode is applied.

  • Inline mode (RAID Manager only)

    When you apply capacity saving with the inline mode to a DP-VOL, the compression and deduplication processing are performed synchronously for new write data. The inline mode minimizes the pool capacity required to store new write data but can impact I/O performance more than the post-process mode. The inline mode should be applied when writing data with sequential I/Os, for example, when writing data to target volumes of data migration or secondary volumes of copy pairs. When the data migration or copy pair creation has completed, the mode should be changed from the inline mode to the post-process mode.

    If you want to use inline mode, you must use RAID Manager (raidcom add ldev [-capacity_saving_mode <saving mode>]).

The following example illustrates how the pool used capacity changes over time when performing data migration. The red line shows the capacity when the post-process mode is applied, and the black line shows the capacity when the inline mode is applied. This example assumes that new data is written at a higher rate (GB/h) than the rate of the initial capacity saving processing (GB/h).

[Figure: Change in pool used capacity over time for inline and post-process modes]

When the inline mode is applied, capacity saving processing is performed synchronously with data writes. When the post-process mode is applied, capacity saving processing is performed asynchronously, so a temporary area is required to hold the write data. The capacity required for the temporary area depends on the write speed of the new data and on the frequency of data updates during migration.
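
The difference between the two modes during a migration can be approximated with a simple model. This is a rough sketch, not a sizing tool: the function name, the rates, and the reduction ratio are hypothetical, and the model ignores metadata, garbage data, data updates, and the one-hour inactivity period before post-process capacity saving starts.

```python
def simulate_pool_usage(total_gb, write_rate, process_rate, reduction_ratio, hours):
    """Return (post_process, inline) lists of pool used capacity in GB per hour.

    total_gb:        amount of host data to migrate
    write_rate:      host write speed in GB/h
    process_rate:    post-process capacity saving speed in GB/h
    reduction_ratio: stored size / original size after capacity saving
    """
    post_process, inline = [], []
    written = 0.0    # host data written so far
    processed = 0.0  # host data already through post-process capacity saving
    for _ in range(hours):
        written = min(total_gb, written + write_rate)
        processed = min(written, processed + process_rate)
        backlog = written - processed  # still held in the temporary area
        post_process.append(backlog + processed * reduction_ratio)
        inline.append(written * reduction_ratio)
    return post_process, inline


# Hypothetical migration: 1000 GB written at 200 GB/h, processed at 100 GB/h.
post, inl = simulate_pool_usage(1000, 200, 100, 0.3, 12)
for hour, (p, i) in enumerate(zip(post, inl), start=1):
    print(f"hour {hour:2d}: post-process {p:7.1f} GB, inline {i:7.1f} GB")
```

Under these assumptions the post-process curve peaks well above the inline curve while the write rate exceeds the processing rate, and both converge to the same used capacity once all data has been processed.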

The following table shows the processing method (synchronous or asynchronous) for initial data, new write data, and update data. For new write data, the capacity saving processing is performed at different times for the post-process mode and the inline mode.

| Mode | Initial data* | New write data: compression processing | New write data: deduplication processing | Updated write data: compression processing | Updated write data: deduplication processing |
|---|---|---|---|---|---|
| Post-process mode | Asynchronous | Asynchronous | Asynchronous | Synchronous when compressed data is updated. Asynchronous when uncompressed data is updated. | Asynchronous |
| Inline mode | Asynchronous | Synchronous | Synchronous for data whose transfer length is 256 KB or more. Asynchronous for data whose transfer length is less than 256 KB. | Synchronous when compressed data is updated. Asynchronous when uncompressed data is updated. | Asynchronous |

* The initial data is the existing data on the DP-VOL when the capacity saving function is enabled. Both compression and deduplication processing are performed asynchronously for the initial data.
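
The rules in the table can also be expressed as a lookup function, which may help when reasoning about a specific write. This is an illustrative sketch only: the function, its parameter names, and the string values are hypothetical and are not part of any Hitachi tool or API.

```python
SYNC, ASYNC = "synchronous", "asynchronous"
INLINE_DEDUP_THRESHOLD_KB = 256  # transfer length threshold from the table


def processing_method(mode, data_kind, processing,
                      transfer_length_kb=None, target_is_compressed=None):
    """Return how capacity saving is applied, following the table above.

    mode:       "post_process" or "inline"
    data_kind:  "initial", "new", or "update"
    processing: "compression" or "deduplication"
    transfer_length_kb:   needed for inline-mode deduplication of new writes
    target_is_compressed: needed for compression of updated data
    """
    if mode not in ("post_process", "inline"):
        raise ValueError(f"unknown mode: {mode}")
    if data_kind not in ("initial", "new", "update"):
        raise ValueError(f"unknown data kind: {data_kind}")
    if processing not in ("compression", "deduplication"):
        raise ValueError(f"unknown processing: {processing}")

    if data_kind == "initial":
        return ASYNC  # both functions, both modes

    if data_kind == "new":
        if mode == "post_process":
            return ASYNC
        if processing == "compression":
            return SYNC
        if transfer_length_kb is None:
            raise ValueError("transfer_length_kb is required")
        return SYNC if transfer_length_kb >= INLINE_DEDUP_THRESHOLD_KB else ASYNC

    # Updated write data behaves the same way in both modes.
    if processing == "deduplication":
        return ASYNC
    if target_is_compressed is None:
        raise ValueError("target_is_compressed is required")
    return SYNC if target_is_compressed else ASYNC


print(processing_method("inline", "new", "deduplication", transfer_length_kb=512))
print(processing_method("post_process", "update", "compression",
                        target_is_compressed=False))
```

Encoding the table this way makes one point explicit: the mode affects only new write data, while initial data and updated data are handled identically in both modes.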