Scheduling Queries on Tape-resident Data

Sachin More and Alok Choudhary

Abstract

Advances in storage technology have made near-line tertiary storage a viable alternative for database and data warehouse systems. Tertiary storage systems are employed in cases where secondary storage can not satisfy the data handling requirements or tertiary storage is more cost effective option. Tertiary storage devices have traditionally been used as archival storage. The new application domains require on-demand retrieval of data from these devices. This paper investigates issues in optimizing I/O time for a query whose data resides on automated tertiary storage containing multiple storage devices. We model the problem as a limited storage parallel two-machine flow-shop scheduling problem with additional constraints. Given a query, we establish an upper bound on the number of storage devices for an optimal I/O schedule and provide experimental proof for it. For queries that access small amounts of data from multiple media, we derive an optimal schedule analytically. For queries that access large amount of data we derive a heuristics based scheduling algorithm using analytically proved results. We also discuss practical aspects of about using the theoretical results in a real environment and demonstrate our applicability of our work using an accurate robotic tape library simulator.

Gzipped Postscript version of the paper