Time Series Clustering with Water Temperature Data
This thesis studies three approaches to clustering long-term water temperature data with hierarchical clustering, an unsupervised pattern recognition method. Cluster quality is assessed using internal cluster validity indexes and a forecast deviation analysis.
Bögli Roman, 2020
Bachelor Thesis, Federal Office for the Environment (FOEN), Hydrology Division
Keywords: Hydrology, Water Temperatures, Hierarchical Clustering, Time Series
The underlying data consist of long-term water temperature measurements from several Swiss water bodies and originate from metering stations managed by the FOEN. The goal is to group these stations according to the resemblance of their hydrologic temperature curves over a period of ten years. Stations that exhibit very similar short-term and long-term temperature behaviour and evolution over time should be grouped into the same cluster. The clusterings should provide a better understanding of the heterogeneity of the data and support future decisions regarding the integration of new stations.
The first part of this work characterises time series data, surveys the field of pattern discovery techniques, explains hierarchical clustering and elaborates on four internal cluster validity indexes. The main part addresses the distance measures applied. These include the two shape-based strategies Pairwise Distance (PDIST) and Dynamic Time Warping (DTW) and the feature-based strategy Discrete Wavelet Transformation (DWT). Finally, the clustering approaches are evaluated using internal cluster validity indexes and a forecast deviation analysis.
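To make the two shape-based strategies more concrete, the following is a minimal sketch in plain Python and not the thesis library itself. It assumes Euclidean distance as the point-by-point pairwise distance, an absolute-difference local cost for Dynamic Time Warping, and average linkage for the hierarchical clustering; the thesis may use different choices. All function names and the toy series are hypothetical.

```python
import math


def pairwise_distance(a, b):
    """Euclidean distance between two equally long series, point by point."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Dynamic time warping cost with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def agglomerate(series, distance, n_clusters):
    """Average-linkage agglomerative clustering (linkage is an assumption).

    Returns a list of clusters, each a sorted list of series indices.
    """
    count = len(series)
    # Pre-compute all pairwise distances once.
    dist = {
        (i, j): distance(series[i], series[j])
        for i in range(count)
        for j in range(i + 1, count)
    }

    def linkage(c1, c2):
        total = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(count)]
    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Toy series (made-up values, not FOEN measurements).
toy_series = [
    [1, 2, 3, 2, 1],
    [1, 2, 3, 2, 1.5],
    [10, 12, 14, 12, 10],
    [10, 12, 14, 12, 11],
]
print(agglomerate(toy_series, pairwise_distance, 2))
print(agglomerate(toy_series, dtw_distance, 2))
```

The quadratic cost table in `dtw_distance` illustrates why DTW is computationally more demanding than a point-by-point comparison, which needs only a single pass over the two series.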
DTW does not outperform the simpler PDIST strategy to a notable extent. The use of DTW should therefore be carefully reconsidered, as it requires more computational effort. In all three metrics derived from the forecast deviation analysis, the point of interest lies between five and eleven clusters. This evidence can be used to narrow the range for the true number of clusters, which may be helpful for later subject-specific analyses in hydrology. DWT demonstrated that a competitive clustering can also be established with a reduced amount of information.
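The idea of clustering on a reduced amount of information can be illustrated with a small wavelet sketch. The example below assumes the Haar wavelet and keeps only the approximation coefficients after a fixed number of decomposition levels; the wavelet family and the number of levels used in the thesis are not stated here, so both are assumptions, and the function names are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail), each half as long as the input.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_features(signal, levels):
    """Keep only the approximation coefficients after `levels` Haar steps."""
    approx = list(signal)
    for _ in range(levels):
        approx, _detail = haar_step(approx)  # detail coefficients are discarded
    return approx


# Toy series (made-up values): 8 points are reduced to 2 coefficients.
toy = [1, 1, 1, 1, 3, 3, 3, 3]
features = haar_features(toy, levels=2)
print(len(toy), "->", len(features))
```

Each level halves the length of the series, so a distance computed on the resulting feature vectors touches far fewer values than one computed on the raw measurements.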
The project initiated by the FOEN is still ongoing, and further distance metrics for time series will be evaluated. This thesis provided initial experience with the advantages and disadvantages of the strategies examined and produced a Python library that can be used to run and parametrise the same cluster analyses on data received from cantonal metering stations. Together with the explanations provided in this work, it supports the process of finding an ideal prioritisation for incorporating cantonal metering stations into the federal network.
Degree programme: Business Information Technology (Bachelor)
Subject area of the thesis: Statistics