This post provides an overview of the paper “Transformer-based Dog Behavior Classification with Motion Sensors”, published in the IEEE Sensors Journal in September 2024. We extend the discussion to highlight how transformers, particularly encoder-only architectures, can be applied to motion classification across various use cases. The focus is on the advantages of transformers for 1D sensor data classification over established architectures (LSTM, CNN, LSTM-CNN), making the discussion relevant to a broad range of real-time applications.
Background
The presented approach for classifying dog behavior using motion sensors (accelerometer and gyroscope) is built on a transformer-based DNN. While the application in this study is dog behavior classification, the architecture and methods we present are broadly applicable to anyone dealing with 1D sensor data for classification tasks. The underlying principles can be extended to numerous domains where accurate temporal dependency capture is crucial.
Traditional models like LSTMs and CNNs often struggle with capturing long-range dependencies in sensor data, particularly in real-time applications. The transformer-based approach effectively leverages the self-attention mechanism to identify relevant features across multiple time scales. This makes it a powerful tool for sensor data processing, where high-dimensionality and temporal patterns play a critical role.
We focus exclusively on the encoder part of the transformer, bypassing the decoder and adding a classifier head to streamline real-time classification. The model was tested on a publicly available dataset of seven distinct dog behaviors, recorded through motion sensors attached to the dog’s back. The model achieved 98.5% accuracy, significantly outperforming traditional time-series DNN models like LSTM, CNN, and hybrid LSTM-CNN architectures. Beyond accuracy, the transformer model’s robustness was compared against these traditional models under various challenging conditions, such as additive noise and missing data, and it consistently exhibited superior resilience. The architecture’s reduced computational complexity, lower latency, and compact design further underscore its suitability for real-time applications, particularly in resource-constrained environments.
Let’s dig in, starting with the transformer encoder architecture...
Transformer Encoder
The transformer encoder (right figure) processes sequential motion sensor data and extracts meaningful features that represent dog behavior from 2-second recorded examples (left figure).
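As a concrete illustration, here is a minimal NumPy sketch of slicing a continuous 6-axis stream (3 accelerometer + 3 gyroscope channels) into 2-second examples. The sampling rate below is an assumption for the sketch; the paper’s dataset may use a different rate.

```python
import numpy as np

FS = 100           # assumed sampling rate in Hz (illustrative, not from the paper)
WIN = 2 * FS       # 2-second window -> 200 samples
N_CH = 6           # 3 accelerometer + 3 gyroscope axes

def make_windows(stream, win=WIN, step=WIN):
    """Slice a (T, N_CH) sensor stream into non-overlapping 2 s examples."""
    n = (len(stream) - win) // step + 1
    return np.stack([stream[i * step : i * step + win] for i in range(n)])

stream = np.zeros((1000, N_CH))    # 10 s of dummy sensor data
examples = make_windows(stream)
print(examples.shape)              # (5, 200, 6)
```

Each resulting example is a (time, channels) matrix, which is the shape the encoder consumes.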
At the heart of the transformer encoder is the self-attention mechanism (more accurately, multi-head attention — MHA), which revolves around three key components: Q (Query), K (Key), and V (Value) — presented in the figure below. These components have a familiar NLP interpretation, but let’s explore what they mean for motion sensing data:
Query (Q): Represents the specific motion feature or segment the model is focusing on, such as a sudden change in the dog’s speed or direction. This could be a behavior-defining motion, like running or walking.
Key (K): Encodes information about the entire sequence of motion data, which allows the model to compare various parts of the input and assess their relevance to the current Query.
Value (V): Holds the actual motion sensor readings — accelerometer and gyroscope data — associated with different segments of the sequence. These are used to compute the final attention-weighted output.
To conclude, in this motion-sensing context, the self-attention mechanism uses Q to find which parts of the sensor data (encoded in K) are most relevant and retrieve the corresponding values from V.
Physically, the model focuses on key behaviors such as rapid movements or specific postures, while ignoring irrelevant or noisy signals. By attending to different time steps in parallel, the transformer efficiently captures both short- and long-term dependencies in the motion data, making it well suited for classification tasks.
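The mechanism above can be sketched in a few lines of NumPy. This is a toy single-head version with random projection weights, meant only to show the Q/K/V flow, not the paper’s implementation:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a (T, d) window of sensor features."""
    q, k, v = x @ wq, x @ wk, x @ wv          # project input to Query, Key, Value
    scores = q @ k.T / np.sqrt(k.shape[-1])   # relevance of every step to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                        # attention-weighted mix of Values

rng = np.random.default_rng(0)
T, d = 200, 16                                # 2 s window, toy feature dimension
x = rng.normal(size=(T, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)                              # (200, 16)
```

Each output row is a weighted blend of the whole window’s values, which is how one time step can draw on behavior-defining motion anywhere in the sequence.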
Performance
The performance of the transformer-based model was compared to three classical DNN architectures — LSTM, LSTM-CNN, and Bi-LSTM — commonly used for time-series and sequential data classification tasks. Each model presented unique strengths, but the transformer-based approach exhibited clear advantages, particularly in terms of computational efficiency and deployment suitability for real-time applications.
Accuracy and F1-Score Comparison
The transformer-based model achieved the highest accuracy of 98.5%, outperforming the LSTM, LSTM-CNN, and Bi-LSTM models. Although the Bi-LSTM model closely followed with an accuracy of 97.8%, and both LSTM and LSTM-CNN achieved 97.6%, the transformer’s ability to process sequences in parallel allowed it to excel in accuracy. In terms of the F1-score (which balances precision and recall), the Bi-LSTM model led slightly with 98.1%, reflecting its strength in identifying relevant instances and minimizing false positives, while the transformer maintained an F1-score of 94.6%.
Precision, Recall, and Efficiency
The Bi-LSTM model achieved the highest precision (98.4%) and recall (97.9%). However, despite the slightly lower precision and recall of the transformer model, its balance between the two metrics ensures robustness in real-time scenarios where high accuracy and efficient decision-making are essential.
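As a quick sanity check, the F1-score is the harmonic mean of precision and recall; plugging in the Bi-LSTM figures above reproduces its reported F1 of about 98.1%:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Bi-LSTM figures reported above: precision 98.4%, recall 97.9%
print(round(f1(0.984, 0.979), 3))  # 0.981, matching the reported 98.1%
```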
Robustness and Real-Time Considerations
Real-Time Considerations
A key advantage of the transformer-based DNN model is its low computational complexity and compact design, making it particularly suited for resource-constrained environments. With only 55,743 parameters and a model size of 218 KB, the transformer-based approach is significantly more efficient than the LSTM-CNN model, which has 113,915 parameters and a size of 445 KB. This reduction in complexity not only reduces memory usage but also speeds up both training and inference times.
In terms of floating-point operations (FLOPs), the transformer model is by far the most efficient, requiring only 146,082 FLOPs, compared to the LSTM-CNN model, which demands a staggering 12,731,195 FLOPs. The low FLOP count of the transformer model makes it ideal for deployment on energy-constrained devices, such as mobile or embedded systems, where computational resources are limited and real-time processing is critical.
Robustness
We evaluated how well the transformer-based model performs under challenging conditions, as real-world sensor data can be subject to noise, missing information, or other degradation factors.
To simulate real-world sensor degradation, we applied varying levels of Gaussian noise to the accelerometer and gyroscope signals. In practical scenarios, motion sensors, especially low-cost MEMS sensors, are prone to performance degradation due to environmental factors such as temperature shifts or wear over time. The introduction of Gaussian noise with standard deviations ranging from 0.001 to 0.01 across all sensor axes allowed us to test how well the models withstand increasing data corruption.
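A minimal sketch of this noise injection, assuming already-scaled sensor windows; the paper’s exact preprocessing may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(signal, sigma):
    """Corrupt a (T, channels) sensor window with zero-mean Gaussian noise."""
    return signal + rng.normal(0.0, sigma, size=signal.shape)

clean = np.ones((200, 6))              # toy 2 s window, 6 sensor axes
for sigma in (0.001, 0.005, 0.01):     # sweep matching the tested range
    noisy = add_gaussian_noise(clean, sigma)
    print(sigma, float(np.abs(noisy - clean).max()))
```

Evaluating a trained model on such corrupted copies of the test set, across the sigma sweep, yields the degradation curves compared in this section.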
Among the models tested, the transformer-based DNN showed significantly higher robustness to noise. As the noise level increased, the transformer maintained superior accuracy, precision, recall, and F1 score compared to LSTM, LSTM-CNN, and Bi-LSTM models. The self-attention mechanism in transformers, which dynamically focuses on the most relevant parts of the input, likely contributed to its ability to handle noisy data more effectively. This makes the transformer a strong candidate for real-world motion sensing applications where sensor reliability may degrade over time.
Missing Data
In addition to noisy data, real-world systems often encounter missing data due to communication errors, power failures, or sensor malfunctions. To evaluate the models’ performance under such conditions, we simulated missing data by randomly replacing portions of the sensor signals (both accelerometer and gyroscope) with zeros. We tested the models across different proportions of missing data, ranging from 0% to 20%.
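The zero-replacement corruption can be sketched as follows; here whole time steps are zeroed across all channels, whereas the evaluation also examines dropping individual axes:

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_samples(signal, fraction):
    """Zero out a random fraction of time steps, simulating dropped sensor data."""
    out = signal.copy()
    n_drop = int(round(fraction * len(out)))
    idx = rng.choice(len(out), size=n_drop, replace=False)
    out[idx] = 0.0
    return out

window = np.ones((200, 6))
for frac in (0.0, 0.1, 0.2):           # 0-20%, matching the tested range
    corrupted = drop_samples(window, frac)
    print(frac, float((corrupted == 0).all(axis=1).mean()))
```

Running each model on such corrupted inputs, over the 0–20% sweep, produces the missing-data comparison discussed next.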
The transformer-based model outperformed LSTM and LSTM-CNN models across various scenarios involving missing data. It exhibited greater resilience to missing information, maintaining effective classification performance when up to 20% of data was missing from the input signals. However, the Bi-LSTM model showed a slight edge in certain situations, particularly when data was missing from the x-axis of the accelerometer and gyroscope, which indicates that the bidirectional nature of Bi-LSTM provides it with a minor advantage in reconstructing missing temporal dependencies. The figure below presents the Recall value as a function of the proportion of missing data.
Summary
The transformer-based architecture, specifically designed to capture temporal dependencies and handle high-dimensional data, demonstrated outstanding performance, achieving an accuracy of 98.5%. Through comparative analysis with traditional models like LSTM, LSTM-CNN, and Bi-LSTM, the transformer-based model proved to be superior, particularly in terms of computational efficiency and suitability for real-time applications.
One of the key advantages of the transformer model is its reduced computational complexity and lower latency. It requires far fewer resources than its counterparts, such as LSTM-CNN, which has a significantly larger footprint. The transformer model’s low FLOPs make it highly suitable for deployment in resource-constrained environments, such as mobile phones or embedded systems, where quick decision-making and energy efficiency are critical.
The implications of this work extend beyond dog behavior classification. By offering a robust, efficient, and accurate framework for processing 1D sensor data, this model holds promise for a variety of real-time applications, including monitoring human activities or other motion-based systems. The ability to maintain high performance under challenging conditions, such as noisy or missing data, further reinforces the potential of transformers in motion sensor-based classification tasks.
The study paves the way for future research and applications of transformer models in real-time motion data processing.