How deduplication works
Data deduplication technology is typically hosted on a server or appliance that performs a task such as storing backup data. Most deduplication devices work by dividing a file into sub-file segments ranging from 4 KB to 64 KB. Each segment is then processed by a hashing algorithm that generates a hash code for it. For practical purposes, the code is unique to that data segment and can be thought of as its fingerprint. As new data segments arrive, they are passed through the same algorithm. If the algorithm generates a hash identical to one already recorded, the device knows that it has stored that data before and creates a reference to the existing copy instead of storing it a second time. The result can be a significant saving in space.
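The segment-and-fingerprint process can be illustrated with a minimal Python sketch. It is not how any particular product works: the fixed 4 KB segment size, the choice of SHA-256 as the hashing algorithm, and the `DedupStore` class with its method names are all assumptions made for illustration.

```python
import hashlib

# Assumption: fixed 4 KB segments. Real devices may use other or variable sizes.
SEGMENT_SIZE = 4 * 1024


class DedupStore:
    """Hypothetical in-memory store that keeps each unique segment once."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes

    def write(self, data):
        """Store data; return the ordered list of fingerprints that rebuilds it."""
        recipe = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            # A fingerprint seen before means the data is already stored,
            # so only a reference is recorded for it.
            self.segments.setdefault(fingerprint, segment)
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe):
        """Rebuild the original data by following the references."""
        return b"".join(self.segments[fp] for fp in recipe)

    def stored_bytes(self):
        return sum(len(segment) for segment in self.segments.values())


store = DedupStore()
monday = b"A" * 8192 + b"B" * 4096   # three segments, two of them identical
tuesday = b"A" * 8192 + b"C" * 4096  # only the last segment is new
monday_recipe = store.write(monday)
tuesday_recipe = store.write(tuesday)

print(len(monday) + len(tuesday))  # 24576 bytes written by the client
print(store.stored_bytes())        # 12288 bytes actually stored
```

In this toy case, two 12 KB backups occupy 12 KB in total because only three distinct segments exist across both of them.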
The goal of deduplication, no matter where it is implemented, is to reduce the amount of data that needs to be handled by a given process. Many IT professionals consider deduplication a storage technology only, when, in fact, it can be implemented in multiple areas of the data center.
Data deduplication technology in wide area networks
One of the best uses of data deduplication is on the wide area network (WAN). The role of data deduplication over the WAN is to use WAN segments more efficiently, delaying or even preventing the need to purchase additional, expensive WAN bandwidth. A WAN optimization device with deduplication capabilities will have a local cache on both the sending and receiving ends of the connection. If either end (through the process described above) calculates that the data has already been sent to the other site, then only the reference information -- not the entire data set -- is sent. This can dramatically reduce the amount of traffic that needs to traverse the WAN.
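The exchange between the two caches can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a description of any vendor's protocol: the `WanSender` and `WanReceiver` classes, the `("data", ...)` and `("ref", ...)` message format, the fixed 4 KB segments and the use of SHA-256 are all hypothetical, and real devices must also handle cache eviction and keeping both ends in sync.

```python
import hashlib

SEGMENT_SIZE = 4 * 1024  # assumption: fixed 4 KB segments


def fingerprint(segment):
    return hashlib.sha256(segment).hexdigest()


class WanSender:
    """Hypothetical sending-side device: remembers what the far end holds."""

    def __init__(self):
        self.sent = set()  # fingerprints already delivered to the remote cache

    def encode(self, data):
        messages = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fp = fingerprint(segment)
            if fp in self.sent:
                messages.append(("ref", fp))        # reference only
            else:
                self.sent.add(fp)
                messages.append(("data", segment))  # full segment, first time
        return messages


class WanReceiver:
    """Hypothetical receiving-side device: rebuilds data from its local cache."""

    def __init__(self):
        self.cache = {}  # fingerprint -> segment bytes

    def decode(self, messages):
        parts = []
        for kind, payload in messages:
            if kind == "data":
                self.cache[fingerprint(payload)] = payload
                parts.append(payload)
            else:
                parts.append(self.cache[payload])
        return b"".join(parts)


def wire_bytes(messages):
    """Payload bytes that would cross the WAN (framing overhead ignored)."""
    return sum(len(payload) for _, payload in messages)


sender, receiver = WanSender(), WanReceiver()
attachment = b"".join(bytes([i]) * SEGMENT_SIZE for i in range(4))  # 16 KB

first = sender.encode(attachment)
second = sender.encode(attachment)  # same attachment sent again
assert receiver.decode(first) == attachment
assert receiver.decode(second) == attachment

print(wire_bytes(first))   # 16384: every segment travels once
print(wire_bytes(second))  # 256: four 64-character references
```

The second transfer carries only fingerprints, which is where the bandwidth saving comes from in this sketch.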
Typically, WAN deduplication devices are used to improve performance at remote office locations. For example, instead of placing a file server or an email server in each location, only the WAN deduplication device is installed. Then, if an email with a large attachment is sent more than once, the repeated attachments are pulled from the WAN deduplication cache rather than sent across the WAN segment multiple times.
Storage deduplication devices today are often single-purpose; for example, they may deduplicate only backup data. WAN deduplication devices, by contrast, can be leveraged across multiple purposes. The email attachment mentioned above may eventually be stored to a file share, and it may also be copied again as part of the backup process. With WAN deduplication, the attachment's data crosses the WAN only once across all of these transfers, multiplying the savings.
Continue reading part 2 of this article to learn about a WAN manager's role in storage.
About the author:
George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. With 25 years of experience designing storage solutions for data centers across the U.S., he has seen the birth of such technologies as RAID, NAS and SAN. Prior to founding Storage Switzerland, George was chief technology officer at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection. Find Storage Switzerland's disclosure statement here.
This was first published in April 2010