|
|
|
Distributed and Peer-to-Peer Data Mining for Scalable Analysis of Data from Virtual Observatories
|
|
| Design, implementation, and archiving of very large sky surveys play a critical role in today's Astronomy research. However,
astronomers will be unable to tap the riches of this collection of gigabyte, terabyte, and (eventually) petabyte catalogs without a
computational backbone that includes support for queries and data mining across distributed virtual tables of de-centralized, joined,
and integrated sky survey catalogs. Moreover, use of local data management systems such as MyDB, MySpace in AstroGrid, and Grid
Bricks for storing and managing users' local data is becoming increasingly popular. This opens up the possibility of constructing a
Peer-to-Peer (P2P) network for data sharing and mining. This document proposes research and development for a new generation of
scalable data analytic services for the National Virtual Observatory (NVO) based on advanced distributed and P2P data mining capabilities across multiple data
repositories. This research will develop technology for supporting web services within the NVO that will allow astronomy researchers
to analyze data from multiple surveys using fundamentally distributed algorithms. It will also develop several distributed data mining
algorithms for analysis of distributed Astronomy catalogs without requiring the data to be downloaded and centralized. Specific
objectives include the following items: (1) The project will design and implement distributed algorithms for computing statistical
primitives, principal component analysis, and outlier detection from distributed Astronomy catalogs and their partial images stored in
users' local data management systems. These algorithms will be able to analyze data without requiring source catalogs to be
downloaded and centralized. (2) The project will develop a prototype system that offers a rich collection of web services based on
various distributed data mining (DDM) algorithms. This system will offer a novel augmentation to the existing NVO environment and will support a wide variety
of data mining tasks that run in a distributed fashion. (3) The developed system will be tested using specific astronomical
research problems. In particular, we will explore the multi-dimensional multi-wavelength parameter space of astrophysical properties
of starbursting galaxies. We will search for unusual correlations, outliers, sub-clusters, and fundamental planes within the
multi-dimensional parameter space presented by several large surveys. To carry out this research we will set up a simulated distributed
Astronomy catalog environment in the lab using data from publicly available source catalogs. Our system will be benchmarked
according to speed, communication cost, and accuracy. Accuracy will be validated within the context of the Astronomy problem
described above. This research is directly relevant to the AISR program for several reasons. First, it increases the productivity of
NASA's Science Mission Directorate (SMD) research endeavors through rapid multi-mission correlative analysis, such as distributed
and P2P mining of the large survey catalogs from GALEX, Spitzer, 2MASS, and eventually WISE. Second, this research involves an
interdisciplinary team of researchers in Astrophysics (Borne), Database Technologies (Kargupta, Giannella), Distributed Systems
(Kargupta), and Distributed Data Mining (all). Dr. Borne is a senior member of the NVO project. We also have strong support from
NASA's Space Science Data Operations Office (see attached supporting letter) regarding this proposed collaborative project and its
transition into practice. Finally, this project explicitly demonstrates the relevance, applicability, and potential impact of emerging
information technologies to SMD missions and programs. |
|
|
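As an illustration of the first objective (statistical primitives and principal component analysis computed without centralizing source catalogs), the following minimal sketch shows one possible approach. It is not the project's algorithm. It assumes, for simplicity, that every site holds different rows of a catalog with the same numeric columns. Each site sends only a small summary (row count, column sums, and sum of outer products), and a coordinator merges the summaries into the global mean and covariance. The function names local_summary, merge, and top_component, and the sample values, are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
# (row count, column sums, sum of outer products); a hypothetical summary format
Summary = Tuple[int, Vector, Matrix]


def local_summary(rows: List[Vector]) -> Summary:
    """Runs at one site. Only this small summary leaves the site, not the rows."""
    d = len(rows[0])
    sums = [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for row in rows:
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                outer[i][j] += row[i] * row[j]
    return len(rows), sums, outer


def merge(summaries: List[Summary]) -> Tuple[Vector, Matrix]:
    """Combine per-site summaries into the global mean and sample covariance."""
    d = len(summaries[0][1])
    n = sum(s[0] for s in summaries)
    mean = [sum(s[1][i] for s in summaries) / n for i in range(d)]
    cov = [
        [(sum(s[2][i][j] for s in summaries) - n * mean[i] * mean[j]) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]
    return mean, cov


def top_component(cov: Matrix, iterations: int = 200) -> Vector:
    """Power iteration for the leading principal direction of cov.

    Assumes the all-ones start vector is not orthogonal to that direction
    and that cov is not the zero matrix.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Made-up measurements held at two separate sites (illustrative values only).
site_a = [[1.0, 2.0], [2.0, 4.1], [3.0, 6.2]]
site_b = [[4.0, 7.9], [5.0, 10.0]]

mean, cov = merge([local_summary(site_a), local_summary(site_b)])
direction = top_component(cov)
print(mean, cov, direction)
```

The communication cost per site in this sketch depends on the number of columns, not the number of rows, which is the property that makes summaries of this kind attractive for large catalogs. Catalogs that hold different attributes of the same objects at different sites would need a different scheme.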