We thought about the YogAI concept for some time. Initially, we envisioned a smart yoga mat based on computer vision for corrective posture advice. We found out that others have taken the approach of embedding sensors within the yoga mat, although it appears they too are interested in vision.

We returned to the idea after seeing interesting work using pose estimation that we wanted to reproduce. However, we decided to put YogAI on a smart mirror platform, more like this.

By framing photos from the perspective of a large mirror on the wall rather than looking up from a yoga mat on the ground, we could train models using yoga photos from the wild. These typically capture the full body from a distance, at the height a photographer would naturally shoot from.

The mirror is a ubiquitous gym tool because of the value of visual perspective when coordinating body motions. We wanted to take YogAI further than any existing commercial product by offering real-time corrective posture advice from an intelligent assistant: one that's ready to train when you are and doesn't send your data to remote servers.

Making a smart mirror is simple enough: you just need an old monitor, a raspberry pi, and a one-way mirror; see how we built it here. We added a camera and a microphone to support a voice user interface (VUI) and visual analysis of the user, all taking place on-device.

To evaluate our concept, we begin by gathering images of yoga poses with an image search for terms like 'yoga tree pose', 'yoga triangle pose', etc. We chose yoga because the movements are relatively static compared to other athletic maneuvers, which makes the frame-rate requirements for inference less demanding. We can quickly filter out irrelevant photos and refine our queries to build a corpus of a couple thousand yoga pose images.

We'd love to achieve the speed of a CNN classifier; however, with only a few thousand images of people in different settings, clothing, positions, etc., we are unlikely to find that path fruitful. Instead, we turn to pose estimation models. These are especially well-suited to our task of reducing all the scene complexity down to the pose information we want to evaluate. They are not quite as fast as classifiers, but with tf-lite we manage roughly 2.5 FPS on a Raspberry Pi 3.
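For context, here is a minimal sketch of what that on-device inference loop can look like with the TF-Lite interpreter. The model filename and input preprocessing are placeholders, not our exact pipeline, but the output matches the (1, 96, 96, 14) belief maps described below.

```python
import cv2
import numpy as np
import tflite_runtime.interpreter as tflite  # or: from tensorflow import lite as tflite

# "pose_estimation.tflite" is a placeholder name for the converted pose model.
interpreter = tflite.Interpreter(model_path="pose_estimation.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def belief_maps(frame_bgr):
    """Resize a camera frame to the model's input size and return keypoint belief maps."""
    _, in_h, in_w, _ = inp["shape"]
    img = cv2.resize(frame_bgr, (in_w, in_h)).astype(np.float32) / 255.0  # assumed normalization
    interpreter.set_tensor(inp["index"], img[np.newaxis, ...])
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])  # shape (1, 96, 96, 14)
```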

Pose estimation gets us part way. To realize YogAI, we need to add something new: a function that takes us from pose estimates to yoga position classes. With up to 14 body keypoints, each of our couple thousand images can be represented as a vector in a 28-dimensional real linear space. By convention, we take the x and y indices of the mode of each keypoint slice of our pose estimation model's belief map. In other words, the pose estimation model outputs a tensor shaped like (1, 96, 96, 14), where each slice along the final axis corresponds to a 96x96 belief map for the location of a particular body keypoint. Taking the argmax of each slice, we find the most likely position of that slice's keypoint relative to the framing of the input image.
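A sketch of that reduction, assuming the (1, 96, 96, 14) belief map output described above:

```python
import numpy as np

def pose_vector(maps):
    """maps: (1, 96, 96, 14) belief maps -> flat vector of 14 (y, x) keypoint indices."""
    maps = maps[0]                                     # drop batch dim -> (96, 96, 14)
    coords = []
    for k in range(maps.shape[-1]):
        idx = np.argmax(maps[..., k])                  # flat index of the mode of this slice
        y, x = np.unravel_index(idx, maps[..., k].shape)
        coords.extend([y, x])
    return np.array(coords, dtype=np.float32)          # shape (28,)
```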

This representation of the input image offers the additional advantage of reducing the dimensionality of our problem for greater statistical efficiency in building a classifier. We regard the pose estimation process as an image feature extractor for a pose classifier based on gradient boosting machines, implemented with XGBoost.
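A minimal sketch of that classifier using XGBoost's scikit-learn API; the file names and hyperparameters here are illustrative placeholders, not the exact settings we used.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Placeholder files: (N, 28) pose vectors from the extractor above, plus integer pose labels.
X = np.load("pose_vectors.npy")
y = np.load("pose_labels.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                        objective="multi:softprob")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```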

By creating a helper function to apply random geometric transformations directly to the 28-dim pose vectors, we were able to augment our dataset without costly pose estimation runs on more images. This let us quickly take our training set from a couple thousand vectors to tens of thousands by randomly applying small rotations, flips, and translations to each sample vector.
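A sketch of such a helper; the transform ranges are illustrative, and a complete version would also swap left/right keypoint indices when flipping.

```python
import numpy as np

def augment(vec, size=96, max_rot=10.0, max_shift=5.0):
    """Apply a random rotation, flip, and translation to a 28-dim (y, x) pose vector."""
    pts = vec.reshape(14, 2).astype(np.float32)

    # small random rotation about the image center
    theta = np.deg2rad(np.random.uniform(-max_rot, max_rot))
    c, s = np.cos(theta), np.sin(theta)
    center = size / 2.0
    yx = pts - center
    pts = np.stack([c * yx[:, 0] - s * yx[:, 1],
                    s * yx[:, 0] + c * yx[:, 1]], axis=1) + center

    # random horizontal flip (note: keypoint left/right indices are not swapped here)
    if np.random.rand() < 0.5:
        pts[:, 1] = (size - 1) - pts[:, 1]

    # random translation, shared across all keypoints
    pts += np.random.uniform(-max_shift, max_shift, size=2)

    return np.clip(pts, 0, size - 1).reshape(-1)
```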

We were then able to quickly demonstrate our approach by training a gradient boosting machine. It didn't take much parameter tweaking before we were able to evaluate our pose classifier in a test run.

yoga poses

It was natural to consider how we might use pose estimates over time to get a more robust view of a figure's position in the frame. Having built a reasonable pose classifier, this raises the question of how we might generalize our work to classifying motion.

Our first idea here was to concatenate the pose vectors from 2 or 3 successive time steps and train the tree model to recognize a motion, as sketched below. To keep things simple, we start by trying to differentiate between standing, squatting, and forward bends (deadlifts). These categories were chosen to test both static and dynamic maneuvers. Squats and deadlifts occupy similar planes of motion and are both leg-dominant moves, though they activate opposing muscle groups.
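A sketch of that windowing step, assuming a (T, 28) array of per-frame pose vectors:

```python
import numpy as np

def stack_frames(pose_vectors, d=3):
    """pose_vectors: (T, 28) array -> (T - d + 1, 28 * d) concatenated windows."""
    T = len(pose_vectors)
    return np.stack([pose_vectors[t:t + d].reshape(-1) for t in range(T - d + 1)])
```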

We found a couple of YouTube videos of high-repetition moves performed by fitness athletes, filmed from a perspective similar to the design of our smart mirror. We split the videos with ffmpeg and ran pose estimation to get our pose vector representation of each frame, after some minor edits to cut out irrelevant video segments.
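A sketch of the frame extraction step, shelling out to ffmpeg from Python; the paths and frame rate here are placeholders.

```python
import os
import subprocess

def extract_frames(video_path, out_dir, fps=5):
    """Dump frames from a (pre-trimmed) video at a fixed rate using ffmpeg."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         os.path.join(out_dir, "frame_%05d.jpg")],
        check=True)
```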

Our gradient boosting machine model seemed to lack the capacity to perform reasonably here. We decided to apply LSTMs to our sequences of pose vectors arranged in 28xd blocks, sweeping d over {2, 3, 5}. After some experimentation, we determined that 2 LSTM blocks followed by 2 fully connected layers on a 28x5 input sequence yielded a reasonable model.
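A sketch of that architecture in Keras, with sequences arranged as (timesteps, features) = (5, 28); layer widths and training settings are illustrative, not the exact configuration we used.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

num_classes = 3  # standing, squatting, forward bend

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(5, 28)),  # first LSTM block
    LSTM(64),                                               # second LSTM block
    Dense(32, activation="relu"),                           # fully connected layers
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_seq, y_seq, epochs=20, batch_size=32)  # X_seq: (N, 5, 28) pose sequences
```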

Now we have basic motion classification!

motion classifier

We expect a big improvement could come from an associated wearable to track heart rate and motion. This would introduce information orthogonal to the vision-based motion and pose classification.

A big goal for YogAI is to direct a yoga workout while offering corrective posture advice. Now that we have a viable approach to the fundamental problem of posture evaluation/classification, we might start to ask questions like: can we adapt/control the workout to target a heart-rate range, adjusting for skill level or optimizing for user satisfaction?

Before all that, we need to work on the interface. Since we have a smart mirror, we will display info visually, but we don't want a yogi to break from position to change the flow. That is why we use snips.ai to implement a VUI on-device.

We use snips to send messages with MQTT. After filling out sample intents, we download the package for our voice assistant. To help YogAI guide a yoga session aloud, we use flite for text-to-speech.
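A sketch of the glue between snips and flite: listening for recognized intents on the Hermes MQTT topics and speaking a response with the flite command-line tool. The intent name "startSession" and the response text are placeholders for whatever intents you define.

```python
import json
import subprocess
import paho.mqtt.client as mqtt

def say(text):
    """Speak a response on-device with the flite CLI."""
    subprocess.run(["flite", "-t", text], check=True)

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload.decode())
    intent = payload["intent"]["intentName"]
    if intent.endswith("startSession"):          # placeholder intent name
        say("Let's begin with mountain pose.")

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)                # snips MQTT broker runs locally
client.subscribe("hermes/intent/#")              # all recognized intents
client.loop_forever()
```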

At this point, we are ready to demo YogAI. See us at the NotImpossible Awards Pitch in San Francisco on Feb 27, 2019.

Future of YogAI

In the future, we want to gather more data by expanding the use cases to other fitness & wellness applications. We want to explore optimizations like training a single end-to-end classifier, or distilling this model into a very small multi-layer perceptron that can run on cheap microcontrollers.

We want to explore new product features with more refined corrective advice. Recent work on related models like BodyPix may help us retain more information from the input image. We want to investigate embedding person-segmented images with unsupervised learning. Perhaps by simplifying the input this way, we can apply variational autoencoders to learn a pose manifold modeled by a 2-dim latent space. By learning the distribution of pose images over many samples, we might discover a projection onto a latent space representing various pose pathologies that can inform our YogAI agent, e.g. 'try tilting your head forward'.

We would ultimately like to implement federated learning to keep data and inference on device while leveraging the information learned by other devices.