Date of Award


Document Type


Degree Name

Doctor of Philosophy (PhD)


Computer Science

First Advisor

Donna M. Rizzo

Second Advisor

Byung Suk Lee


Cluster analysis explores the underlying structure of data and organizes it into groups (i.e., clusters) such that observations within the same group are more similar than those in different groups. Quantifying the "similarity" between observations, choosing the optimal number of clusters, and interpreting the results all require careful consideration of the research question at hand, the model parameters, and the amount of data and their attributes. In this dissertation, the first manuscript explores the impact of design choices and the variability in clustering performance on different datasets. This is demonstrated through a benchmark study consisting of 128 datasets from the University of California, Riverside time series classification archive. Next, a multivariate event time series clustering approach is applied to hydrological storm events in watershed science. Specifically, river discharge and suspended sediment data from six watersheds in Vermont are clustered, yielding four types of hydrological water quality events to help inform conservation and management efforts. In a second application, a novel and computationally efficient clustering algorithm called SOMTimeS (Self-organizing Map for Time Series) is designed for the analysis of large time series datasets using dynamic time warping (DTW). The algorithm scales linearly with increasing data, making SOMTimeS, to the best of our knowledge, the fastest DTW-based clustering algorithm to date. As a proof of concept, it is applied to conversational features from a Palliative Care Communication Research Initiative study with the goal of understanding and motivating high-quality communication in serious illness health care settings.



Number of Pages

170 p.