Abstract: Just as the Internet, web content, and search engines have combined to revolutionize every aspect of our lives, the scientific process is poised to undergo a radical transformation based on the ability to access, analyze, and merge complex data sets. Scientists will be able to combine their own data with that of other scientists, validating models, interpreting experiments, re-using and re-analyzing data, and making use of sophisticated mathematical analyses and simulations to drive the discovery of relationships across data sets. This "scientific web" will yield higher quality science, more insights per experiment, a higher impact from major investments in scientific instruments, and an increased democratization of science, allowing people from a wide variety of backgrounds to participate in the scientific process. At the same time, machine learning is revolutionizing many areas of computer science, and is also impacting the theory and practice of the physical, energy, and life sciences.
What does this "big science data" view of the world have to do with exascale computing, which has primarily targeted modeling and simulation? Scientists have always demanded some of the fastest computers for simulations, but now there is an additional driver for computer performance: the need to analyze large experimental and observational data sets. With exponential growth rates in detectors, sequencers, and other observational technologies, data sets across many science disciplines are outstripping the storage, computing, and algorithmic techniques available to individual scientists. Some machine learning methods are particularly compute intensive, and can make use of the world's fastest computers.
In this talk I will describe some examples of how science disciplines are changing in the face of their own data explosions, and how this leads to a set of open questions for computer scientists arising from the scale of the data sets, the data rates, inherent noise and complexity, and the need to "fuse" disparate data sets. What is needed to support machine learning and other data-driven science workloads in terms of hardware, systems software, algorithms, and programming environments, and how well can those workloads be supported on systems that also run simulation codes?
Bio: Katherine (Kathy) Yelick is a Professor of Electrical Engineering and Computer Sciences at UC Berkeley and the Associate Laboratory Director (ALD) for Computing Sciences at Lawrence Berkeley National Laboratory. Her research is in high performance computing, programming languages, compilers, parallel algorithms, and automatic performance tuning. She currently leads the Berkeley UPC project and co-leads the Berkeley Benchmarking and Optimization (Bebop) group. As ALD for Computing Sciences at LBNL, she oversees the National Energy Research Scientific Computing Center (NERSC), the Energy Sciences Network (ESnet), and the Computational Research Division (CRD), which covers applied math, computer science, data science, and computational science.