NASA - National Aeronautics and Space Administration
Universe
Started: 03/01/2006
Reports
Report: 6/25/2008
Report: 3/18/2007
Latest Quad: 1/17/2007
PI: Jeffrey Gardner
Carnegie Mellon

Enabling Massive Scientific Databases Through Automated Schema Design
As massively parallel platforms continue to expand in processor count, simulations can now generate datasets of unprecedented size relative to the processing power of a serial workstation. NASA has pushed this envelope with the Columbia supercomputer at NASA/ARC, which has 10,240 processors with a combined capability exceeding 50 teraflops. With the ability to run on 10 to 100 times as many processors as we could a mere five years ago, one of the greatest challenges scientists face in using these impressive resources is extracting meaningful scientific knowledge from the immense datasets that they can generate. Simulations are now limited in scope not by the capabilities of the computers that run them, but by the scientist's inability to cope with the resulting data flow.
The most obvious solution when dealing with massive data volumes is to load them into a database. The difficulty lies in the amount of computing and access time required to execute database queries: a cosmology simulation that fully utilized Columbia, for example, would generate at least 20 TB of data. The key to optimizing queries lies in designing an intelligent schema, i.e. a description of how the data is actually arranged within the database. Databases of this size mandate optimal schema design.
Unfortunately, simulation groups, which are often quite small, rarely have the resources to spend on manually designing efficient database schemas. Furthermore, schema design for simulations is typically much more difficult than for observational datasets: a simulation catalog not only incorporates the time dimension, but also frequently employs more layers of relationships. Current automated design tools are not useful for our purposes because they rely on aggressively replicating indexes, a strategy that increases the size of the database by factors of 2 to 3.
We propose to overcome the barriers to efficient query execution by combining work and ideas from the database design community with the needs of the computational cosmology domain. We will design and implement AutoPart, a system that automatically designs scientific databases by partitioning the tables in the original database according to a representative workload. Compared to conventional index-based techniques, AutoPart removes the need for replication by pre-designing the tables according to the query requirements. AutoPart will maximize the performance of scientific queries while eliminating the need for additional space and update overhead, thereby permitting any researcher to interact with their data in the most efficient way possible. Our initial focus will be on enabling the mining of massive cosmological simulations. The results of our work, however, will be applicable far beyond cosmology and will affect any NASA endeavor that can exploit massive datasets.
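To make the idea of workload-driven partitioning concrete, here is a minimal Python sketch of one simple form of it: columns that are read by exactly the same set of queries are grouped into one table fragment, so a query only scans the fragments it needs. This is an illustration under stated assumptions, not a description of AutoPart's actual algorithm, which this abstract does not specify. The column names, the three queries, and the helper functions `atomic_fragments` and `columns_read` are all hypothetical, and the sketch assumes that every fragment would also carry the table's primary key so that rows can be rejoined.

```python
from collections import defaultdict


def atomic_fragments(columns, workload):
    """Group columns that are accessed by exactly the same set of queries.

    columns:  ordered list of non-key column names (hypothetical schema).
    workload: dict mapping a query name to the set of columns it reads.
    Returns a list of fragments (lists of column names). Each fragment
    would be stored as its own table alongside the primary key.
    """
    by_signature = defaultdict(list)
    for col in columns:
        # The "signature" of a column is the set of queries that touch it.
        signature = frozenset(q for q, cols in workload.items() if col in cols)
        by_signature[signature].append(col)
    return list(by_signature.values())


def columns_read(query_cols, fragments):
    """Count the columns scanned when a query reads every fragment it touches."""
    needed = set(query_cols)
    return sum(len(frag) for frag in fragments if needed & set(frag))


# Hypothetical particle table from a cosmological simulation.
columns = ["snapshot", "x", "y", "z", "vx", "vy", "vz", "mass", "halo_id"]

# Hypothetical representative workload: the columns each query reads.
workload = {
    "density_map": {"snapshot", "x", "y", "z", "mass"},
    "velocity_field": {"snapshot", "vx", "vy", "vz"},
    "halo_masses": {"snapshot", "halo_id", "mass"},
}

fragments = atomic_fragments(columns, workload)

# Columns scanned per query: unpartitioned table versus partitioned layout.
full_scan = len(columns)
partitioned_scan = {
    name: columns_read(cols, fragments) for name, cols in workload.items()
}
```

In this toy workload the velocity query touches 4 of the 9 non-key columns instead of all of them, and no column is stored twice apart from the assumed primary key. A real design tool would additionally weigh the cost of rejoining fragments against the savings from narrower scans.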

NASA Official: Joseph H. Bredekamp
Last Updated: 01/18/2005