Features in machine learning are signals containing information about a measurable property of the system being observed.
Raw sensor data is difficult to use directly in machine learning, simply because there is so much of it. The refrigerator, for example, uses an AC compressor connected to a 60 Hz power supply. By the Nyquist-Shannon sampling theorem, recovering the 60 Hz component requires sampling at at least 120 Hz. In practice, most engineers oversample in an attempt to capture higher frequencies that preserve transients in the signal. In the refrigeration case study the sampling rate is 500 Hz, which means a new data point arrives every 2 ms. Even at this modest sampling frequency it may be computationally hard to apply the classification or regression formula, and we are only interested in algorithms that can compute a diagnosis in real time. Summarizing the signals over time windows solves this computational difficulty.
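The summarization over time windows can be sketched as follows. The window and hop lengths here are illustrative choices, not the study's actual parameters:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_stats(signal, fs=500, window_s=60, hop_s=10):
    """Descriptive statistics over a sliding window.

    fs       -- sampling rate in Hz (500 Hz as in the case study)
    window_s -- window length in seconds (hypothetical choice)
    hop_s    -- hop between consecutive windows in seconds (hypothetical)
    """
    n, hop = fs * window_s, fs * hop_s
    windows = sliding_window_view(signal, n)[::hop]  # view, no copy
    return {
        "min":  windows.min(axis=1),
        "max":  windows.max(axis=1),
        "mean": windows.mean(axis=1),
        "std":  windows.std(axis=1),
    }
```

Instead of 500 values per second, the classifier then sees only a handful of summary statistics per window.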
Another reason for using features is that classification and regression formulas are typically not expressive enough to capture the complex physics of the devices we are trying to diagnose. Manually constructing features bridges physics and model-based reasoning on one side and machine learning on the other. We can even "hide" a fully-fledged model-based diagnosis engine inside a feature: such a feature directly computes whether the device is failing, leaving the classifier no work to do in determining the state of the system.
Table 1 shows the types of features used for diagnosing the refrigerator. They are all computed over a sliding window of samples. The first four features are common sliding-window descriptive statistics. The fifth is the first derivative of temperature (the original signal is first smoothed with a low-pass filter with a cut-off frequency of 1 mHz). The last feature is the product of two signals. Notice that after computing derivatives or products, the descriptive statistics (minimum, maximum, mean, and standard deviation) must again be computed over a sliding window.
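A minimal sketch of the derivative and product features follows. A moving average stands in here for the 1 mHz low-pass filter, and the smoothing window length is a hypothetical choice:

```python
import numpy as np

def derivative_feature(temp, fs=500, smooth_s=600):
    """First derivative of a smoothed temperature signal.

    A moving average approximates the low-pass filter; smooth_s
    (seconds of smoothing) is an illustrative parameter.
    """
    k = fs * smooth_s
    smooth = np.convolve(temp, np.ones(k) / k, mode="same")
    return np.gradient(smooth) * fs        # degrees per second

def product_feature(temp_a, temp_b):
    """Element-wise product of two temperature signals."""
    return temp_a * temp_b
```

The outputs of both functions are themselves signals, which is why the descriptive statistics must be computed over them again.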
In the case of the thermostat, only its value at the time of diagnosis is used. This approach discards the rest of the thermostat's history. To use that information, it is possible to compute a feature that records the time of the last thermostat transition. Somewhat surprisingly, this feature not only fails to help any of the classifiers we have tried; its inclusion significantly decreases the isolation accuracy.
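The time-since-last-transition feature could be computed as in the sketch below, assuming (hypothetically) a binary on/off encoding of the thermostat signal:

```python
import numpy as np

def time_since_transition(thermostat, fs=500):
    """Seconds elapsed since the last thermostat state change.

    thermostat -- binary on/off signal (assumed encoding)
    fs         -- sampling rate in Hz
    """
    out = np.zeros(len(thermostat))
    elapsed = 0
    for i in range(1, len(thermostat)):
        # reset the counter whenever the thermostat switches state
        elapsed = 0 if thermostat[i] != thermostat[i - 1] else elapsed + 1
        out[i] = elapsed / fs
    return out
```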
How is the length of the sliding window determined? A more advanced approach would be to use intelligent segmentation instead of a sliding window; however, segmentation is itself a step toward computing a diagnosis, which poses a bootstrapping problem. In our study we simply take a range of sliding-window lengths and compute a large number of features. We also compute the product of every possible pair of temperature signals. This results in a large number of features: 10,681, to be precise.
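A back-of-the-envelope count of such a feature set might look like the following. The formula is purely illustrative; the study's exact recipe, which yields 10,681 features, is not reproduced here:

```python
from itertools import combinations

def feature_count(n_signals, n_window_lengths, n_stats=4):
    """Rough size of the feature set: every signal plus every pairwise
    product, each summarized by n_stats statistics (min, max, mean, std)
    for each of n_window_lengths sliding-window lengths.

    All parameters are illustrative assumptions, not the study's values.
    """
    n_pairs = len(list(combinations(range(n_signals), 2)))
    return (n_signals + n_pairs) * n_stats * n_window_lengths
```

The count grows quadratically in the number of signals (because of the pairwise products) and linearly in the number of window lengths, which is how a handful of sensors explodes into thousands of features.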
Figure 1 shows the features computed from one temperature sensor for a fixed sliding window size of 30 min.
Machine-learning pipelines often normalize feature ranges by multiplying each feature by a suitable constant. Experiments showed that this does not improve classification accuracy in the case of the refrigerator, so we use the features without scaling.
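For reference, the kind of range normalization that was tried (and found unhelpful) resembles this min-max scaling sketch:

```python
import numpy as np

def min_max_scale(features):
    """Scale each feature column of a 2-D array to the range [0, 1]."""
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0                  # avoid division by zero for constant columns
    return (features - lo) / span
```

Scale-sensitive classifiers (e.g. nearest-neighbor methods) usually benefit from such normalization; that it hurt nothing and helped nothing here is an empirical result specific to this data set.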