Title: HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud
Source: MSST'17
Authors: Huijun Wu, Chen Wang, Yinjin Fu, Sherif Sakr, Liming Zhu, Kai Lu
Summary
- The problem the paper aims to solve
  - Deduplication for primary storage in the cloud.
  - Primary storage requires low write latency, so deduplication is done either inline or as post-processing (post-processing avoids the latency cost but needs extra peak disk capacity to hold data before it is deduplicated). Inline deduplication, however, relies heavily on the temporal locality of the workload; when multiple streams are mixed together, that locality is destroyed, resulting in a low inline deduplication ratio.
- How does the paper address the problem? What is the main idea?
  - Deploy the deduplication mechanism in the hypervisor.
    - This allows duplicates to be detected across blocks written by multiple virtual machines.
  - Estimate the duplicate locality of each stream and prioritize the fingerprint cache for high-locality streams (see the sampling sketch after this list).
    - Use reservoir sampling so that every element of a stream has an equal chance of being chosen.
    - Apply the unseen algorithm to the samples to estimate each stream's locality (LDSS).
  - Adjust the threshold that trades off exploiting spatial locality against latency for different workloads (see the sequence-threshold sketch after this list).
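The per-stream sampling step can be sketched as below. This is only a minimal illustration assuming standard reservoir sampling (Algorithm R); the class and method names are ours, not the paper's, and the resulting duplicate-frequency histogram is what an unseen-style estimator would consume.

```python
import random

class ReservoirSampler:
    """Uniform fixed-size sample over a stream of block fingerprints
    (Algorithm R); an illustrative sketch, not code from the paper."""

    def __init__(self, capacity):
        self.capacity = capacity  # sample size k
        self.sample = []          # retained fingerprints
        self.seen = 0             # number of stream items observed so far

    def offer(self, fingerprint):
        """Feed one block fingerprint from the stream into the sampler."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(fingerprint)
        else:
            # Replace a random slot with probability k / seen, so every
            # item seen so far is equally likely to be in the sample.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = fingerprint

    def duplicate_histogram(self):
        """Frequency counts over the sample; the input a locality
        estimator such as the unseen algorithm would work from."""
        counts = {}
        for fp in self.sample:
            counts[fp] = counts.get(fp, 0) + 1
        return counts
```

Reservoir sampling keeps memory bounded no matter how long the stream runs, which is what makes an online, per-stream locality estimate feasible.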
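For the threshold in the last item, one plausible reading is an iDedup-style minimum length for runs of consecutive duplicate blocks: short runs are written through to preserve a sequential on-disk layout (spatial locality, lower read latency), while long runs are worth deduplicating. The sketch below only illustrates that assumption; plan_writes and its inputs are hypothetical names, not the paper's.

```python
def plan_writes(blocks, threshold):
    """Group a stream of (block, is_duplicate) pairs into runs and apply
    a sequence-length threshold to each run of duplicates.

    Hedged sketch of an iDedup-style threshold (our assumption about the
    kind of threshold meant above): runs shorter than `threshold` are
    written anyway to keep the layout sequential, longer runs are
    deduplicated despite the fragmentation they introduce.
    """
    plan, run = [], []
    for blk, is_dup in blocks:
        if is_dup:
            run.append(blk)
            continue
        if run:  # a duplicate run ends here; decide what to do with it
            plan.append(("deduplicate" if len(run) >= threshold else "write", run))
            run = []
        plan.append(("write", [blk]))  # unique blocks are always written
    if run:  # flush a trailing duplicate run
        plan.append(("deduplicate" if len(run) >= threshold else "write", run))
    return plan

# Example: with threshold=3 the isolated duplicate "b" is written
# through, while the run "d", "e", "f" is deduplicated.
print(plan_writes([("a", False), ("b", True), ("c", False),
                   ("d", True), ("e", True), ("f", True)], threshold=3))
```

Raising the threshold favors read latency and layout over capacity savings; lowering it does the opposite, which is the per-workload trade-off the item describes.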
Validations
- The evaluation shows improvements in the inline deduplication ratio, cache hit ratio, and cache space distribution, demonstrating that the LDSS estimation is effective for mixed streams.
Strengths
- HPDedup addresses an interesting problem and demonstrates the effectiveness of locality estimation combined with a corresponding cache replacement policy.
- HPDedup incorporates several recent techniques, such as the unseen estimator and D-LRU, to build the complete system.
Weaknesses
- The use of an estimation method lacks analysis and sufficient motivation. Why not simply use a counting mechanism? This justification should be included in the paper.
- The arguments about optimizing fragmentation, as well as the peak disk capacity requirement, are not very convincing.
Comments
- None (see strengths and weaknesses).