Lightweight Top-K Analysis in DBMSs Using Data Stream Analysis Techniques
Data Stream Analysis, Top-k Analysis
Problem determination is the identification of problems and performance issues that occur in an observed system and the discovery of solutions to resolve them. Top-k analysis is a common task in problem determination for database management systems. It involves identifying the set of top-ranked objects according to some criterion, such as the top-k most frequently used tables, the top-k most frequent queries, or the top-k queries with respect to CPU usage or amount of I/O. Effective problem determination requires sufficient monitoring and rapid analysis of the collected monitoring statistics. However, system monitoring often incurs a great deal of overhead and can interfere with the performance of the observed system, and processing vast amounts of data may require several passes through the analysis system and thus be very time-consuming. In this thesis, we present a lightweight top-k analysis framework in which lightweight monitoring tools continuously poll system statistics, producing several continuous data streams that are then processed by stream mining techniques. The results produced by our tool are the “top-k” values for the observed statistics. This information can be valuable to an administrator in determining the source of a problem. We implement the framework as a prototype system called Tempo. Tempo uses IBM DB2’s snapshot API and a lightweight monitoring tool called DB2PD to generate the data streams. The system reports the top-k executed SQL statements and the top-k most frequently accessed tables in an on-line fashion. Several experiments are conducted to verify the feasibility and effectiveness of our approach. The experimental results show that our approach achieves low system overhead.
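To make the idea of one-pass top-k analysis over a data stream concrete, the following is a minimal sketch. The abstract does not name the stream mining algorithm that Tempo uses, so this is an assumption: the sketch uses Space-Saving, a well-known bounded-memory technique for frequent items. The class name `SpaceSaving`, its methods, and the `capacity` parameter are hypothetical and illustrative only; they are not part of Tempo, DB2, or DB2PD.

```python
from typing import Dict, Hashable, List, Tuple


class SpaceSaving:
    """Approximate top-k counter that keeps at most `capacity` counters.

    Illustrative only: an assumed stand-in for the stream mining step,
    not the algorithm described in the thesis.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}

    def offer(self, item: Hashable) -> None:
        """Process one stream element in a single pass."""
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # All counters are in use: evict the item with the smallest
            # count and let the newcomer inherit that count plus one.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1

    def top(self, k: int) -> List[Tuple[Hashable, int]]:
        """Return the k items with the largest (possibly overestimated) counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]
```

In use, each polled statistic (for example, a table name observed in a snapshot) would be passed to `offer`, and `top(k)` could be called at any time to report the current ranking. Memory stays bounded by `capacity` regardless of stream length, at the cost of counts that may be overestimates once evictions occur.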