What is time series cleaning? Why is it so important?
Time series cleaning is the part of data preprocessing that is specific to time series data. It covers a set of operations and tools used to clean and prepare a series before passing it to the TS Forecasting module and applying machine learning to it. Good forecasts depend on both a well-cleaned series and well-chosen parameters for the ML models.
papAI offers several types of time series cleaning operations, including :
Resampling : resampling a time series means changing the frequency of its observations. A series is sometimes recorded at irregular time intervals, which makes it hard to use for analytics: most time series methods expect equally spaced observations, which is why resampling is essential. Resampling is also useful for converting equally spaced data from one frequency to another (for example, minutes to hours).
Imputation : missing values are a well-known problem in datasets and can lead to issues such as poor data quality or a poor understanding of the data. In time series the problem is even more important, because many methods need a value at every timestamp, so an imputation step (interpolation is one common form) is essential to build a complete series.
Datetime features generation : this step enriches the data with datetime-related features such as month, day, hour, and minute. These new features can be used to extract insights from the data through visualization tools, or to improve the performance of ML models by serving as covariates.
Extract Holidays : some time series behave differently during holidays and weekends. For example, ticket sales of an amusement park can be higher during weekends and school holidays. Holiday detection can therefore be valuable in such use cases. papAI can extract weekends, national holidays, and school holidays for a selected country, and can also flag custom holidays (e.g. a period during which an advertising campaign was running).
Smoothing : this may sound like a very technical term, but smoothing techniques are simply preprocessing techniques that aim to remove noise from the time series. This lets important patterns such as trend and seasonality stand out. In market analysis, for example, smoothed data is often preferred because it generally shows changes in the economy more clearly than unsmoothed data.
Self differentiation : this technique (also known as differencing) can be used to make a time series stationary, meaning that its properties do not depend on the time at which it is observed, so it has no trend and no seasonality. Some statistical models require this kind of behavior to perform well.
Lag generation : lag generation can be used to turn a time series forecasting problem into a supervised learning problem that can be solved directly with a simple regression model. It generates new columns containing previous observations. For example, lagging with up to 3 look-back steps generates 3 new columns holding the t-1, t-2, and t-3 values of a selected column.
Windowing : this operation applies aggregations over a rolling window of timestamps, for example the mean of a measurement over windows of 7 timestamps. It can be used for the same purpose as lag generation (turning the problem into a supervised learning one) or simply for analysis.
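To make these operations more concrete, here is a minimal sketch of most of them using pandas. It is not papAI's implementation. The sample data, the column names (value, lag_1, roll_mean_3, etc.), the custom holiday date and the window sizes are all assumptions chosen for illustration. National and school holiday lookup is left out because it needs a country calendar, so only weekends and a custom holiday are flagged.

```python
import pandas as pd

# Hypothetical irregular series: timestamps and values are made up for illustration.
raw = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0, 18.0],
    index=pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 20:00", "2024-01-02 09:30",
        "2024-01-04 07:15", "2024-01-05 18:45", "2024-01-06 12:00",
    ]),
    name="value",
)

# Resampling: aggregate onto an equally spaced daily grid.
# Days without observations (here 2024-01-03) become NaN.
daily = raw.resample("D").mean()

# Imputation: fill the gaps by time-based linear interpolation.
df = daily.interpolate(method="time").to_frame()

# Datetime features: derive calendar covariates from the index.
df["month"] = df.index.month
df["dayofweek"] = df.index.dayofweek          # Monday=0 ... Sunday=6
df["is_weekend"] = df.index.dayofweek >= 5

# Custom holidays: flag user-defined dates (assumed list, e.g. a campaign day).
custom_holidays = pd.to_datetime(["2024-01-01"])
df["is_custom_holiday"] = df.index.isin(custom_holidays)

# Smoothing: centered moving average to damp noise.
df["smooth_3"] = df["value"].rolling(window=3, center=True, min_periods=1).mean()

# Differencing: first-order difference, which removes a linear trend.
df["diff_1"] = df["value"].diff()

# Lag generation: t-1, t-2 and t-3 values as new columns.
for k in (1, 2, 3):
    df[f"lag_{k}"] = df["value"].shift(k)

# Windowing: trailing rolling mean over 3 timestamps (includes the current one).
df["roll_mean_3"] = df["value"].rolling(window=3).mean()

print(df)
```

The first rows of the lag and rolling columns are NaN because there is not enough history yet. In a supervised learning setup those rows are usually dropped before training.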
Benefits of Time Series Cleaning :
Time series cleaning is an essential step in any time series forecasting pipeline. It lets us correct anomalies in the data (through imputation or resampling, for example), enrich the data with additional features (such as datetime features and holidays), and improve the quality of the data (through smoothing and self differentiation), in order to get better forecasts or simply a better analysis of the data.