Traditional high performance computing clusters (HPCC) and Hadoop clusters are both important platforms in computational and data-enabled science and engineering (CDS&E). Hadoop MapReduce is typically effective for large-scale data analysis, while traditional HPC is more commonly employed for computational problems (petabytes vs. petaflops). An HPCC can include a shim layer that allows Hadoop MapReduce to access HPC storage (denoted by Hadoop+HPCC), local storage (denoted by Hadoop), or a combination of both (denoted by Hadoop/HPCC). This project aims to identify the characteristics of various CDS&E MapReduce computational tasks and to adaptively determine the best platform (Hadoop, Hadoop+HPCC, or Hadoop/HPCC) for each application based on those characteristics. It also aims to optimize data placement between local storage and dedicated remote storage, given performance objectives and system cost metrics.
Broader impacts include critical insights into the suitability of different computing platforms for different CDS&E applications, and a more advanced HPC system. Research results will be disseminated through technology transfer to industry partners, publication in peer-reviewed journals, and software releases. Results will also serve as a catalyst for research in cyberinfrastructure, which serves the CDS&E fields. This project will provide thorough training of students and collaborative research opportunities for participating Clemson graduate students, undergraduates, faculty, and K-12 students. Results will be integrated into courses taught by the PIs. The PIs will recruit new students, particularly those from underrepresented groups, to undertake the study of a STEM discipline.