key: cord-0061498-1in100ja
authors: Gerard, Charlie
title: Experimenting with inputs
date: 2020-11-17
journal: Practical Machine Learning in JavaScript
DOI: 10.1007/978-1-4842-6418-8_5
sha: 48b9f51d8bc37574964d801af0bcaeb0507d01d0
doc_id: 61498
cord_uid: 1in100ja
In the previous chapters, we looked into how to use machine learning with images and text data to do object detection and classification, as well as sentiment analysis, toxicity classification and question answering.
When you first read the words "audio data," you might think that this section of the book is going to focus on music; however, I am going to dive into using sound more generally.
We don't really think about it often, but a lot of things around us produce sounds that give us contextual information about our environment.
For example, the sound of thunder helps you understand the weather is probably bad without you having to look out the window, or you can recognize the sound of a plane passing by before you even see it, or even hearing the sound of waves indicates you are probably close to the ocean, and so on.
Without us realizing it, recognizing and understanding the meaning of these sounds impacts our daily lives and our actions. Hearing a knock on your door indicates that someone is probably behind it, waiting for you to open it, and hearing the sound of boiling water while you are cooking suggests that it is ready for you to pour something into it.
Using sound data and machine learning could help us leverage the rich properties of sounds to recognize certain human activities and enhance current smart systems such as Siri, Alexa, and so on. This is what is called acoustic activity recognition.
Considering that a lot of the devices we surround ourselves with possess a microphone, there are a lot of opportunities for this technology.
So far, the smart systems some of us may be using recognize words to trigger commands, but they have no understanding of what is going on around them; your phone does not know you are in the bathroom, your Alexa device does not know you might be in the kitchen, and so on. However, they could, and this awareness could be used to create more tailored and useful digital experiences.
Before we dive into the practical part of this chapter and see how to build such systems in JavaScript using TensorFlow.js, it is helpful to start by understanding the basics of what sound is, and how it is translated to data we can use in code.
Sound is the vibration of air molecules.
If you have ever turned the volume of speakers really loud, you might have noticed that they end up moving back and forth with the music. This movement pushes on air particles, changing the air pressure and creating sound waves.
The same phenomenon happens with speech. When you speak, your vocal cords vibrate, disturbing air molecules around and changing the air pressure, creating sound waves.
A way to illustrate this phenomenon is with the following image. When you hit a tuning fork, it will start vibrating. This back and forth movement will change the surrounding air pressure. The movement forward will create a higher pressure and the movement backward will create a region of lower pressure. The repetition of this movement will create waves.
On the receiver side, our eardrums vibrate with the changes of pressure and this vibration is then transformed into an electrical signal sent to the brain.
So, if sound is a change in air pressure, how do we transform a sound wave into data we can use with our devices?
To be able to interpret sound data, our devices use microphones.
There exist different types of microphones, but in general, these devices have a diaphragm or membrane that vibrates when exposed to changes of air pressure caused by sound waves.
These vibrations move a magnet near a coil inside the microphone, which generates a small electrical current. Your computer then converts this signal into numbers that represent both volume and frequency.
In JavaScript, the Web API that lets developers access data coming from the computer's microphone is the Web Audio API.
If you have never used this API before, it's totally fine; we are going to go through the main lines of code you need to get everything set up.
To start, we need to access the AudioContext interface on the global window object, as well as making sure we can get permission to access an audio and video input device with getUserMedia.
Listing 5-1. Setup to use the Web Audio API in JavaScript

window.AudioContext = window.AudioContext || window.webkitAudioContext;
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia;
This code sample takes into consideration cross-browser compatibility. Then, to start listening to input coming from the microphone, we need to wait for a user action on the page, for example, a click.
Once the user has interacted with the web page, we can instantiate an audio context, allow access to the computer's audio input device, and use some of the Web Audio API built-in methods to create a source and an analyser and connect the two together to start getting some data.

document.body.onclick = async () => {
  const audioctx = new window.AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioctx.createMediaStreamSource(stream);
  analyser = audioctx.createAnalyser();
  analyser.smoothingTimeConstant = 0;
  source.connect(analyser);
  analyser.fftSize = 1024;
  getAudioData();
};
In the preceding code sample, we are using navigator.mediaDevices.getUserMedia to get access to the microphone. If you have ever built applications that were using audio or video input devices before, you might be familiar with writing navigator.getUserMedia(); however, this is deprecated, and you should now be using navigator.mediaDevices.getUserMedia().
Writing it the old way will still work but is not recommended as it will probably not be supported in the next few years.
Once the basic setup is done, the getAudioData function filters the raw data coming from the device to only get the frequency data.
Listing 5-3. Function to filter through the raw data to get the frequency data we will use

const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
We also call requestAnimationFrame to continuously call this function and update the data we are logging with live data.
Altogether, you can access live data from the microphone in less than 25 lines of JavaScript!

Listing 5-4. Complete code sample to get input data from the microphone in JavaScript

window.AudioContext = window.AudioContext || window.webkitAudioContext;
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia;

let analyser;

document.body.onclick = async () => {
  const audioctx = new window.AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioctx.createMediaStreamSource(stream);
  analyser = audioctx.createAnalyser();
  analyser.smoothingTimeConstant = 0;
  source.connect(analyser);
  analyser.fftSize = 1024;
  getAudioData();
};

const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
The output from this code is a series of arrays of raw data that we are logging in the browser's console. These arrays represent the frequencies that make up the sounds recorded by the computer's microphone. The default sample rate is 44,100 Hz, which means we get 44,100 samples of data per second.
In the format shown earlier (arrays of integers), finding patterns to recognize some type of activity seems pretty difficult. We wouldn't really be able to identify the difference between speaking, laughing, music playing, and so on.
To help make sense of this raw frequency data, we can turn it into visualizations.
There are different ways to visualize sound. A couple of ways you might be familiar with are waveforms or frequency charts.
Waveform visualizers represent the displacement of sound waves over time.
On the x axis (the horizontal one) is time, and on the y axis (the vertical one) is the amplitude of the wave. Sound happens over a certain period of time and is made of multiple frequencies.
This way of visualizing sound is a bit too minimal to be able to identify patterns. As you can see in the illustration earlier, all the frequencies that make up a sound are reduced to a single line.
Frequency charts are visualizations that represent a measure of how many times a waveform repeats in a given amount of time.
You might be familiar with this type of audio visualization as they are probably the most common one.
This way of visualizing can maybe give you some insights about a beat as it represents repetitions or maybe about how loud the sound is as the y axis shows the volume, but that's about it.
This visualization does not give us enough information to be able to recognize and classify sounds we are visualizing.
Another type of visualization that is much more helpful is called a spectrogram.
A spectrogram is like a picture of a sound. It shows the frequencies that make up the sound from low to high and how they change over time. It is a visual representation of the spectrum of frequencies of a signal, a bit like a heat map of sound.
On the y axis is the spectrum of frequencies and, on the x axis, the amount of time. The axes seem similar to those of the two other types of visualizations we mentioned previously, but instead of representing all frequencies in a single line, we represent the whole spectrum.
In a spectrogram, a third axis can be helpful too, the amplitude. The amplitude of a sound can be described as the volume. The brighter the color, the louder the sound.
Visualizing sounds as spectrograms is much more helpful in finding patterns that would help us recognize and classify sounds.
For example, next is a screenshot of the output of a spectrogram running while I am speaking.
By itself, this might not help you understand why spectrograms are more helpful visualizations. The following is another screenshot of a spectrogram taken while I was clapping my hands three times. Hopefully, it starts to make more sense! If you compare both spectrograms, you can clearly distinguish between the two activities: speaking and clapping my hands.
If you wanted, you could try to visualize more sounds like coughing, your phone ringing, toilets flushing, and so on.
Overall, the main takeaway is that spectrograms help us see the signature of various sounds more clearly and distinguish them between different activities.
If we can make this differentiation by looking at a screenshot of a spectrogram, we can hope that using this data with a machine learning algorithm will also work to find patterns and classify these sounds to build an activity classifier.
A broader example of using spectrograms for activity classification comes from a research paper published by Carnegie Mellon University in the United States. In their paper titled "Ubicoustics: Plug-and-Play Acoustic Activity Recognition," they created spectrograms for various activities, from using a chainsaw to a vehicle driving nearby.

So, before we dive into using sound with machine learning, let's go through how we can turn the live data from the microphone that we logged in the console using the Web Audio API into a spectrogram.
In the code sample we wrote earlier, we created a getAudioData function that was getting the frequency data from the raw data and was logging it to the browser's console.
Listing 5-5. getAudioData function to get frequency data from raw data

const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
Where we wrote our console.log statement, we are going to add the code to create the visualization.
To do this, we are going to use the Canvas API, so we need to start by adding a canvas element to our HTML file like so.
Listing 5-6. Adding a canvas element to the HTML file

<canvas id="canvas"></canvas>
In our JavaScript, we are going to be able to access this element and use some methods from the Canvas API to draw our visualization.
Listing 5-7. Getting the canvas element and context in JavaScript

var canvas = document.getElementById("canvas");
var ctx = canvas.getContext("2d");
The main concept of this visualization is to draw the spectrum of frequencies as they vary with time, so we need to get the current canvas and redraw over it every time we get new live data.
Listing 5-8. Getting the image data from the canvas element and redrawing over it

imagedata = ctx.getImageData(1, 0, canvas.width - 1, canvas.height);
ctx.putImageData(imagedata, 0, 0);

Then, we need to loop through the frequency data we get from the Web Audio API and draw it onto the canvas. Then, we call strokeStyle and pass it a dynamic value that will represent the colors used to display the amplitude of the sound.
After that, we call moveTo to move the visualization 1 pixel to the left and leave space for the new input to be drawn onto the screen at the far right, drawn with lineTo.
Finally, we call the stroke method to draw the line.
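Putting Listings 5-7 and 5-8 together with the drawing steps just described, a minimal sketch of the drawing version of getAudioData could look like the following. It assumes the canvas, ctx, and analyser variables from the previous listings, and the color mapping used here is an illustrative choice rather than a prescribed one.

const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);

  // Shift the existing drawing 1 pixel to the left to make room for the new column.
  const imagedata = ctx.getImageData(1, 0, canvas.width - 1, canvas.height);
  ctx.putImageData(imagedata, 0, 0);

  // Draw the newest column of frequency data at the far right of the canvas.
  for (let i = 0; i < freqdata.length; i++) {
    // Brighter colors for louder frequencies (illustrative mapping).
    ctx.strokeStyle = `hsl(${255 - freqdata[i]}, 100%, 50%)`;
    ctx.beginPath();
    ctx.moveTo(canvas.width - 1, canvas.height - i);
    ctx.lineTo(canvas.width, canvas.height - i);
    ctx.stroke();
  }

  requestAnimationFrame(getAudioData);
};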
Altogether, our getAudioData function should look something like the preceding sketch. You might be wondering why it is important to understand how to create spectrograms. The main reason is that spectrograms are what is used as training data for the machine learning algorithm.
Instead of using the raw data the way we logged it in the browser's console, we use the generated spectrogram images, turning a sound problem into an image one.
Image recognition and classification have advanced significantly over the past few years, and algorithms used with image data have proven to be very performant.
Also, turning sound data into an image means that we can deal with a smaller amount of data to train a model, which results in shorter training times.
Indeed, the default sample rate of the Web Audio API is usually 44.1 kHz, which means that it collects 44,100 samples of data per second.
If we record 2 seconds of audio, that is over 88,000 data points for a single sample.
You can imagine that as we need to record a lot more samples, it would end up being a very large amount of data being fed to a machine learning algorithm, which would take a long time to train.
On the other hand, a spectrogram being extracted as a picture can easily be resized to a smaller size, which could end up being only a 28x28 pixel image, for example, resulting in 784 data points for a 2-second audio clip.

Now that we covered how to access live data from the microphone in JavaScript and how to transform it into a spectrogram visualization, allowing us to see how different sounds create visually different patterns, let's look into how to train a machine learning model to create a classifier.
Instead of creating a custom machine learning algorithm for this, we are instead going to use one of the Teachable Machine experiments dedicated to sound data. You can find it at https://teachablemachine.withgoogle.com/train/audio.
This project allows us to record samples of sound data, label them, train a machine learning algorithm, test the output, and export the model all within a single interface and in the browser! To start, we need to record some background noise for 20 seconds using the section highlighted in red in the following figure.
Then, we can start to record some samples for whatever sound we would like the model to recognize later on.
The minimum amount of samples is 8 and each of them needs to be 2 seconds long.
As this experiment uses transfer learning to quickly retrain a model that has already been trained with sound data, we need to work with the same format the original model was trained with.
Eight samples is the minimum but you can record more if you'd like. The more samples, the better. However, don't forget that it will also impact the amount of time the training will take.
Once you have recorded your samples and labelled them, you can start the live training in the browser and make sure not to close the browser window.
When this step is done, you should be able to see some live predictions in the last step of the experiment. Before you export the model, you can try to repeat the sounds you recorded to verify the accuracy of the predictions. If you don't find it accurate enough, you can record more samples and restart the training.

If you are ready to move on, you can either upload your model to some Google servers and be provided with a link to it, or download the machine learning model that was created.
If you'd like to get a better understanding of how it works in the background, how the machine learning model is created, and so on, I'd recommend having a look at the source code available on GitHub! Even though I really like interfaces like Teachable Machine as they allow anyone to get started and experiment quickly, looking at the source code can reveal some important details. For example, the next image is how I realized that this project was using transfer learning.
While going through the code to see how the machine learning model was created and how the training was done, I noticed the following sample of code.
On line 793, we can see that the method addExample is called. This is the same method we used in the chapter of this book dedicated to image recognition when we used transfer learning to train an image classification model quickly with new input images.
Noticing these details is important if you decide to experiment with re-creating this model on your own, without going through the Teachable Machine interface. Now that we went through the training process, we can write the code to generate the predictions.
Before we can start writing this code, we need to import TensorFlow.js and the speech commands model.
Listing 5-11. Import TensorFlow.js and the speech commands model in an HTML file

As I mentioned earlier, this experiment uses transfer learning, so we need to import the speech commands model that has already been trained with audio data to make it simpler and faster to get started. The speech commands model was originally trained to recognize and classify spoken words, like "yes", "no", "up", and "down". However, here, we are using it with sounds produced by activities, so it might not be as accurate as if we were using spoken words in our samples.
Before going through the rest of the code samples, make sure you have downloaded your trained model from the Teachable Machine platform, unzipped it, and added it to your application folder.
The following code samples will assume that your model is stored in a folder called activities-model at the root of your application.
Overall, your file structure should look something like this:
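For example, assuming an index.html and an index.js file at the root (your file names may differ), and knowing that a Teachable Machine export typically contains a model file, a metadata file, and a weights file, the structure could look something like this:

index.html
index.js
activities-model/
    model.json
    metadata.json
    weights.bin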
In our JavaScript file, we will need to create a function to load our model and start the live predictions, but before that, we can create two variables to hold the paths to our model and metadata files. The paths shown here use localhost:8000; feel free to change the port, and make sure to update them if you decide to release your application to production.
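A sketch of what these two variables could look like follows; the variable names are illustrative, and the port matches the localhost:8000 setup mentioned above.

// Paths to the files exported from Teachable Machine (adjust to your setup).
const modelURL = "http://localhost:8000/activities-model/model.json";
const metadataURL = "http://localhost:8000/activities-model/metadata.json";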
Then, we need to load the model and ensure it is loaded before we continue.
Listing 5-13. Loading the model

In each array of predictions returned, the value closest to 1 represents the label predicted. To match the predicted data with the correct label, we can create an array containing the labels we used for training and use it when calling the setupModel function.
Listing 5-17. Mapping scores to labels

const labels = [
  "Coughing",
  "Phone ringing",
  "Speaking",
  "_background_noise_",
];
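Putting these pieces together, the following is a minimal sketch of what loading the exported model and running live predictions could look like with the speech commands library. The setupModel name matches the prose above, but the listen options and the way the highest score is picked are illustrative.

const setupModel = async (modelURL, metadataURL, labels) => {
  // Create a recognizer that uses the browser's FFT and our custom model files.
  const recognizer = speechCommands.create(
    "BROWSER_FFT",
    undefined,
    modelURL,
    metadataURL
  );
  await recognizer.ensureModelLoaded();

  // Listen continuously and log the label with the highest score.
  recognizer.listen(
    (result) => {
      const scores = Array.from(result.scores);
      const highestScore = Math.max(...scores);
      console.log(labels[scores.indexOf(highestScore)], highestScore);
    },
    { probabilityThreshold: 0.75 }
  );
};

setupModel(modelURL, metadataURL, labels);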
In the previous section, we covered how to record sound samples and train the model using the Teachable Machine experiment, for simplicity. However, if you are looking to implement this in your own application and let users run this same process themselves, you can use the transfer learning API.
This API lets you build your own interface and call API endpoints to record samples, train the model, and run live predictions.
Let's imagine a very simple web interface with a few buttons.
Some of these buttons are used to collect sample data, one button to start the training and the last one to trigger the live predictions.
To get started, we need an HTML file with these six buttons and two script tags to import TensorFlow.js and the Speech Commands model.
Listing 5-19. HTML file
In the JavaScript file, before being able to run these actions, we need to create the model, ensure it is loaded, and pass a main label to our model to create a collection that will contain our audio samples.
Listing 5-20. Set up the recognizers

const init = async () => {
  const baseRecognizer = speechCommands.create("BROWSER_FFT");
  await baseRecognizer.ensureModelLoaded();
  transferRecognizer = baseRecognizer.createTransfer("colors");
};
Then, we can add event listeners on our buttons so they will collect samples on click. For this, we need to call the collectExample method on our recognizer and pass it a string we would like the sample to be labelled with.
Listing 5-21. Collecting samples

const redButton = document.getElementById("red");
redButton.onclick = async () =>
  await transferRecognizer.collectExample("red");
To start the training, we call the train method on the recognizer. Altogether, this code sample looks like the following.
Listing 5-24. Full code sample

let transferRecognizer;

const init = async () => {
  const baseRecognizer = speechCommands.create("BROWSER_FFT");
  await baseRecognizer.ensureModelLoaded();
  transferRecognizer = baseRecognizer.createTransfer("colors");
};

init();

const redButton = document.getElementById("red");
const backgroundButton = document.getElementById("background");
const trainButton = document.getElementById("train");
const predictButton = document.getElementById("predict");

redButton.onclick = async () =>
  await transferRecognizer.collectExample("red");

backgroundButton.onclick = async () =>
  await transferRecognizer.collectExample("_background_noise_");
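The handlers for the train and predict buttons do not appear above; a sketch of what they could look like with the transfer learning API follows. The number of epochs and the listen options are illustrative values.

trainButton.onclick = async () =>
  await transferRecognizer.train({ epochs: 25 });

predictButton.onclick = async () =>
  await transferRecognizer.listen(
    (result) => {
      const words = transferRecognizer.wordLabels();
      const scores = Array.from(result.scores);
      console.log(words[scores.indexOf(Math.max(...scores))]);
    },
    { probabilityThreshold: 0.75 }
  );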
Even though the examples I have used so far for our code samples (speaking and coughing) might have seemed simple, the way this technology is currently being used shows how interesting it can be.
In July 2020, Apple announced the release of a new version of watchOS that included an application triggering a countdown when the user washes their hands. Following advice from public health officials on avoiding the spread of COVID-19, this application uses the watch's microphone to detect the sound of running water and trigger a 20-second countdown.
From the code samples shown in the last few pages, a similar application can be built using JavaScript and TensorFlow.js.
One of my favorite applications for this technology is in biodiversity research and protection of endangered species.
A really good example of this is the Rainforest Connection collective. This collective uses old cell phones and their built-in microphones to detect the sound of chainsaws in the forest and alert rangers to potential illegal deforestation activity.
Using solar panels and attaching the installation to trees, they can constantly monitor what is going on around them and run live predictions.
If this is a project that interests you, they also have a mobile application called Rainforest Connection, in which you can listen to the sound of nature, live from the forest. Another use of this technology is in protecting killer whales. A collaboration between Google, Rainforest Connection, and Fisheries and Oceans Canada (DFO) uses bioacoustics monitoring to track, monitor, and observe the animals' behavior in the Salish Sea.
Another application you might not have noticed is currently implemented in a service you probably know. Indeed, if you are using YouTube, you may have come across live ambient sound captioning.
If you have ever activated captions on a YouTube video, you know that spoken words are displayed as an overlay at the bottom.
However, there is more information in a video than what can be found in the transcript.
Indeed, people without hearing impairments get access to additional information in the form of contextual sounds, like music playing or the sound of rain in a video.
Displaying only spoken words in captions can cut quite a lot of information out for people with hearing impairments.
About 3 years ago, in 2017, YouTube released live ambient sound captioning that uses acoustic recognition to add to the captions details about ambient sounds detected in the soundtrack of a video.
Here is an example.
The preceding screenshot is taken from an interview between Janelle Monae and Pharrell Williams where the captions are activated.
Spoken words are displayed as expected, but we can also see ambient sounds like [Applause].
People with hearing impairment can now have the opportunity to get more information about the video than only dialogues.
At the moment, the ambient sounds that can be detected on YouTube videos include

• Applause
• Music playing
It might not seem like much, but again, this is something we take for granted if we never have to think about the experience some people with disabilities have on these platforms.
Besides, the fact that this feature was implemented about 3 years ago already shows that a major technology company like Google has been actively exploring the potential of using machine learning with audio data and has been working on finding useful applications.
Now that we covered how to experiment with acoustic activity recognition in JavaScript and a few different applications, it is important to be aware of some of the limitations of such technology to have a better understanding of the real opportunities.
If you decide to build a similar acoustic activity recognition system from scratch and write your own model without using transfer learning and the speech commands model from TensorFlow.js, you are going to need to collect a lot more sound samples than the minimum of 8 required when using Teachable Machine.
To gather a large amount of samples, you can either decide to record them yourself or buy them from a professional audio library.
Another important point is to make sure to check the quality of the data recorded. If you want to detect the sound of a vacuum cleaner running, for example, make sure that there is no background noise and that the vacuum cleaner can be clearly heard in the audio track.
One tip to generate samples of data from a single one is to use an audio editing software to change some parameters of a single audio source to create multiple versions of it. You can, for example, modify the reverb, the pitch, and so on.
At the moment, this technology seems to be efficient in recognizing a single sound at once.
For example, say you trained your model to recognize both the sound of someone speaking and the sound of running water. If you placed your system in the kitchen and the user was speaking while washing the dishes, the activity predicted would only be the one with the highest score in the predictions returned.
However, as the system runs continuously, it would probably get confused between the two activities. It would probably alternate between "speaking" and "running water" until one of the activities stopped.
This would definitely become a problem if you built an application that can detect sounds produced by activities that can be executed at the same time.
For example, let's imagine you usually play music while taking a shower and you built an application that can detect two activities: the sound of the shower running and speaking.
You want to be able to trigger a counter whenever it detects that the shower is running so you can avoid taking long showers and save water.
You also want to be able to lower the sound of your speakers when it detects that someone is speaking in the bathroom.
As these two activities can happen at the same time (you can speak while taking a shower), the system could get confused between the two activities and detect the shower running for a second and someone speaking the next.
As a result, it would start and stop the speakers one second, and start/stop the counter the next. This would definitely not create an ideal experience.
However, this does not mean that there is no potential in building applications using acoustic activity recognition; it only means that we would need to work around this limitation.
Besides, some research is being done around developing systems that can handle the detection of multiple activities at once. We will look into it in the next few pages.
When it comes to user experience, there are always some challenges with new technologies like this one.
First of all, privacy.
Having devices listening to users always raises some concerns about where the data is stored, how it is used, whether it is secure, and so on.
Considering that some companies releasing Internet of Things devices do not always put security first in their products, these concerns are very normal.
As a result, the adoption of these devices by consumers can be slower than expected.
Not only should privacy and security be baked into these systems, they should also be communicated to users in a clear way to reassure them and give them a sense of empowerment over their data.
Secondly, another challenge is in teaching users new interactions. For example, even though most modern phones now have voice assistants built-in, getting information from asking Siri or Google is not the primary interaction.
This could be for various reasons including privacy and limitations of the technology itself, but people also have habits that are difficult to change.
Besides, considering the current imperfect state of this technology, it is easy for users to give up after a few trials, when they do not get the response they were looking for.
A way to mitigate this would be to release small applications to analyze users' reactions to them and adapt. The work Apple did by implementing the water detection in their new watchOS is an example of that.
Finally, one of the big challenges of creating a custom acoustic activity recognition system is in the collection of sample data and training by the users.
Even though you can build and release an application that detects the sound of a faucet running because there's a high probability that it produces a similar sound in most homes, some other sounds are not so common.
As a result, empowering users to use this technology would involve letting them record their own samples and train the model so they can have the opportunity to have a customized application.
However, as machine learning algorithms need to be trained with a large amount of data to have a chance to produce accurate predictions, it would require a lot of effort from users and would inevitably not be successful.
Luckily, some researchers are experimenting with solutions to these problems. Now, even though there are some limits to this technology, solutions also start to appear.
For example, in terms of protecting users' privacy, an open source project called Project Alias by Bjørn Karmann attempts to empower voice assistant users.
This project is a DIY add-on made with a Raspberry Pi, a speaker, and a microphone module, all in a 3D-printed enclosure, that aims at blocking voice assistants like Amazon Alexa and Google Home from continuously listening to people.
Through a mobile application, users can train Alias to react to a custom wake word or sound. Once it is trained, Alias can take control over the home assistant and activate it for you. When you don't use it, the add-on prevents the assistant from listening by emitting white noise into its microphone.
As Alias's neural network runs locally, the user's privacy is protected.
Another project, called Synthetic Sensors, aims at creating a system that can accurately predict multiple sounds at once.
Developed by a team of researchers at the Carnegie Mellon University, this project involves a custom-built piece of hardware made of multiple sensors, including an accelerometer, microphone, temperature sensor, motion sensor, and color sensor.
Using the raw data collected from these sensors, researchers created multiple stacked spectrograms and trained algorithms to detect patterns produced by multiple activities at once.

Finally, in terms of user experience, a research project called Listen Learner aims at allowing users to collect data and train a model to recognize custom sounds with minimal effort.
The full name of the project is Listen Learner, Automatic Class Discovery and One-Shot Interaction for Activity Recognition.
It aims at providing high classification accuracy, while minimizing user burden, by continuously listening to sounds in its environment, classifying them by cluster of similar sounds, and asking the user what the sound is after having collected enough similar samples.
The result of the study shows that this system can accurately and automatically learn acoustic events (e.g., 97% precision, 87% recall), while adhering to users' preferences for nonintrusive interactive behavior.
After looking at how to use machine learning with audio data, let's look into another type of input, that is, body tracking.
In this section, we are going to use data from body movements via the webcam using three different TensorFlow.js models.
The first model we are going to experiment with is called Facemesh. It is a machine learning model focused on face recognition that predicts the position of 468 3D facial landmarks on a user's face, returning points with their x, y, and z coordinates.
The main difference between this face recognition model and other face tracking JavaScript libraries like face-tracking.js is that the TensorFlow.js model intends to approximate the surface geometry of a human face and not only the 2D position of some key points.
This model provides coordinates in a 3D environment, which allows us to approximate the depth of facial features as well as track the position of key points even when the user is rotating their face in three dimensions.
To start using the model, we need to load it using the two following lines in your HTML file.
Listing 5-25. Importing TensorFlow.js and Facemesh in an HTML file
As we are going to use the video feed from the webcam to detect faces, we also need to add a video element to our file.
Altogether, the very minimum HTML you need for this is as follows.
Listing 5-26. Core HTML code needed

Then, in your JavaScript code, you need to load the model and the webcam feed using the following code.

As we can see in the preceding two screenshots, the predictions returned contain an important amount of information.
The annotations are organized by landmark areas, in alphabetical order and containing arrays of x, y, and z coordinates.
The bounding box contains two main keys, bottomRight and topLeft, to indicate the boundaries of the position of the detected face in the video stream. These two properties contain an array of only two coordinates, x and y, as the z axis is not useful in this case.
Finally, the mesh and scaledMesh properties contain all coordinates of the landmarks and are useful to render all points in 3D space on the screen.
Altogether, the JavaScript code to set up the model, the video feed, and start predicting the position of landmarks should look like the following.
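A minimal sketch of that setup could look like the following, assuming a single video element on the page and the facemesh global provided by the script import; the helper names are illustrative.

let model;

const setupCamera = async () => {
  const video = document.querySelector("video");
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = stream;
  return new Promise((resolve) => {
    video.onloadedmetadata = () => {
      video.play();
      resolve(video);
    };
  });
};

async function main() {
  const predictions = await model.estimateFaces(document.querySelector("video"));
  if (predictions.length > 0) {
    console.log(predictions);
  }
  requestAnimationFrame(main);
}

const init = async () => {
  await setupCamera();
  model = await facemesh.load();
  main();
};

init();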
To put this code sample into practice, let's build a quick prototype to allow users to scroll down a page by tilting their head back and forth.
We are going to be able to reuse most of the code written previously and make some small modifications to trigger a scroll using some of the landmarks detected.
The specific landmark we are going to use to detect the movement of the head is the lipsLowerOuter and more precisely its z axis.
Looking at all the properties available in the annotations object, using the lipsLowerOuter one is the closest to the chin, so we can look at the predicted changes of z coordinate for this area to determine if the head is tilting backward (chin moving forward) or forward (chin moving backward).
To do this, in our main function, once we get predictions, we can add the following lines of code. In this code sample, I declare a variable that I call zAxis to store the value of the z coordinate I want to track. To get this value, I look into the array of coordinates contained in the lipsLowerOuter property of the annotations object.
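These lines could look like the following sketch; the index used on lipsLowerOuter and the threshold of 5 are explained just below, while the scroll step and the sign convention are illustrative and may need adjusting.

// Inside main(), once predictions have been returned.
// scrollPosition is assumed to be declared earlier and initialized to 0.
const zAxis = predictions[0].annotations.lipsLowerOuter[9][2];

if (zAxis > 5) {
  // Head tilted backward: scroll up (flip the signs if your setup behaves the other way).
  scrollPosition -= 10;
} else if (zAxis < -5) {
  // Head tilted forward: scroll down.
  scrollPosition += 10;
}

window.scrollTo({
  top: scrollPosition,
  left: 0,
  behavior: "smooth",
});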
Based on the annotation objects returned, we can see that the lipsLowerOuter property contains 10 arrays of 3 values each.
This is why the code sample shown just earlier was accessing the z coordinate using predictions[0].annotations.lipsLowerOuter[9][2].
I decided to access the last element ([9]) of the lipsLowerOuter property and its third value ([2]), the z coordinate of the section.
The value 5 was selected after trial and error and seeing what threshold would work for this particular project. It is not a standard value that you will need to use every time you use the Facemesh model. Instead, I decided it was the correct threshold for me to use after logging the variable zAxis and seeing its value change in the browser's console as I was tilting my head back and forth.
Then, assuming that you declared scrollPosition earlier in the code and set it to a value (I personally set it to 0), a "scroll up" event will happen when you tilt your head backward and "scroll down" when you tilt your head forward.
Finally, I set the property behavior to "smooth" so we have some smooth scrolling happening, which, in my opinion, creates a better experience.
If you did not add any content to your HTML file, you won't see anything happen yet though, so don't forget to add enough text or images to be able to test that everything is working!

In less than 75 lines of JavaScript, we loaded a face recognition model, set up the video stream, ran predictions to get the 3D coordinates of facial landmarks, and wrote some logic to trigger a scroll up or down when tilting your head backward or forward!

This model is specialized in detecting face landmarks. Next, we're going to look into another one, to detect keypoints in a user's hands.
The second model we are going to experiment with is called Handpose. This model specializes in recognizing the position of 21 3D keypoints in the user's hands.
The following is an example of the output of this model, once visualized on the screen using the Canvas API.
To implement this, the lines of code will look very familiar if you have read the previous section.
We need to start by requiring TensorFlow.js and the Handpose model:
Listing 5-32. Import TensorFlow.js and the Handpose model
Similarly to the way the Facemesh model works, we are going to use the video stream as input so we also need to add a video element in your main HTML file.
Then, in your JavaScript file, we can use the same functions we wrote before to set up the camera and load the model. The only line we will need to change is the line where we call the load method on the model.
As we are using Handpose instead of Facemesh, we need to replace facemesh.load() with handpose.load().
So, overall the base of your JavaScript file should have the following code.
Once the model is loaded and the webcam feed is set up, we can run predictions and detect keypoints when a hand is placed in front of the webcam.
To do this, we can copy the main() function we created when using Facemesh, but replace the expression model.estimateFaces with model.estimateHands.
As a result, the main function should be as follows.
Listing 5-34. Run predictions and log the output

async function main() {
  const predictions = await model.estimateHands(
    document.querySelector("video")
  );

  if (predictions.length > 0) {
    console.log(predictions);
  }

  requestAnimationFrame(main);
}
The output of this code will log the following data in the browser's console. We can see that the format of this data is very similar to the one when using the Facemesh model! This makes it easier and faster to experiment as you can reuse code samples you have written in other projects. It allows developers to get set up quickly to focus on experimenting with the possibilities of what can be built with such models, without spending too much time in configuration.
The main differences that can be noticed are the properties defined in annotations, the additional handInViewConfidence property, and the lack of mesh and scaledMesh data.
The handInViewConfidence property represents the probability of a hand being present. It is a floating value between 0 and 1. The closer it is to 1, the more confident the model is that a hand is found in the video stream.
At the moment of writing this book, this model is able to detect only one hand at a time. As a result, you cannot build applications that would require a user to use both hands at once as a way of interacting with the interface.
To check that everything is working properly, here is the full JavaScript code sample needed to test your setup.
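A minimal sketch of that full setup follows, reusing the same structure as the Facemesh example; the helper names are illustrative.

let model;

const setupCamera = async () => {
  const video = document.querySelector("video");
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = stream;
  return new Promise((resolve) => {
    video.onloadedmetadata = () => {
      video.play();
      resolve(video);
    };
  });
};

async function main() {
  const predictions = await model.estimateHands(document.querySelector("video"));
  if (predictions.length > 0) {
    console.log(predictions);
  }
  requestAnimationFrame(main);
}

const init = async () => {
  await setupCamera();
  model = await handpose.load();
  main();
};

init();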
To experiment with the kind of applications that can be built with this model, we're going to build a small "Rock Paper Scissors" game.
To understand how we are going to recognize the three gestures, let's have a look at the following visualizations to understand the position of the keypoints per gesture.
The preceding screenshot represents the "rock" gesture. As we can see, all fingers are folded so the tips of the fingers should be further in their z axis than the keypoint at the end of the first phalanx bone for each finger.
Otherwise, we can also consider that the y coordinate of the finger tips should be higher than the one of the major knuckles, keeping in mind that the top of the screen is equal to 0 and the lower the keypoint, the higher the y value.
We'll be able to play around with the data returned in the annotations object to see if this is accurate and can be used to detect the "rock" gesture.

In the "paper" gesture, all fingers are straight, so we can use mainly the y coordinates of different fingers. For example, we could check if the y value of the last point of each finger (at the tips) is less than the y value of the palm or the base of each finger.
Finally, the "scissors" gesture could be recognized by looking at the space in x axis between the index finger and the middle finger, as well as the y coordinate of the other fingers.
If the y value of the tip of the ring finger and little finger is lower than their base, they are probably folded.
Reusing the code samples we have gone through in the previous sections, let's look into how we can write the logic to recognize and differentiate these gestures.
If we start with the "rock" gesture, here is how we could check if the y coordinate of each finger is higher than the one of the base knuckle. if (indexTip > indexBase) { console.log("index finger folded"); } We can start by declaring two variables, one to store the y position of the base of the index finger and one for the tip of the same finger.
Looking back at the data from the annotations object when a finger is present on screen, we can see that, for the index finger, we get an array of 4 arrays representing the x, y, and z coordinates of each key point.
The y coordinate in the first array has a value of about 352.27 and the y coordinate in the last array has a value of about 126.62, which is lower, so we can deduce that the first array represents the position of the base of the index finger, and the last array represents the keypoint at the tip of that finger.
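Based on that reasoning, the two variables could be declared as in the following sketch, assuming the annotations property for the index finger is named indexFinger and that the y coordinate sits at index 1 of each keypoint array.

// y coordinate of the base knuckle and of the fingertip of the index finger.
const indexBase = predictions[0].annotations.indexFinger[0][1];
const indexTip = predictions[0].annotations.indexFinger[3][1];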
We can test that this information is correct by writing the if statement shown earlier that logs the message "index finger folded" if the value of indexTip is higher than the one of indexBase.
And it works! If you test this code by placing your hand in front of the camera and switch from holding your index finger straight to folding it, you should see the message being logged in the console!

If we wanted to keep it really quick and simple, we could stop here and decide that this single check determines the "rock" gesture. However, if we would like to have more confidence in our gesture, we could repeat the same process for the middle finger, ring finger, and little finger.
The thumb would be a little different as we would check the difference in x coordinate rather than y, because of the way this finger folds.
For the "paper" gesture, as all fingers are extended, we could check that the tip of each finger has a smaller y coordinate than the base.
Here's what the code could look like to verify that.
Listing 5-37. Check the y coordinate of each finger for the "paper" gesture

We start by storing the coordinates we are interested in into variables and then compare their values to set the extended states to true or false.
If all fingers are extended, we log the message "paper gesture!". If everything is working fine, you should be able to place your hand in front of the camera with all fingers extended and see the logs in the browser's console.
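A sketch of this check follows, assuming the handpose annotation names (indexFinger, middleFinger, ringFinger, pinky) and the y coordinate at index 1; the variable names are illustrative.

const annotations = predictions[0].annotations;

// A finger is considered extended if its tip is higher on screen (smaller y) than its base.
const indexExtended = annotations.indexFinger[3][1] < annotations.indexFinger[0][1];
const middleExtended = annotations.middleFinger[3][1] < annotations.middleFinger[0][1];
const ringExtended = annotations.ringFinger[3][1] < annotations.ringFinger[0][1];
const pinkyExtended = annotations.pinky[3][1] < annotations.pinky[0][1];

if (indexExtended && middleExtended && ringExtended && pinkyExtended) {
  console.log("paper gesture!");
} else {
  console.log("other gesture");
}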
If you change to another gesture, the message "other gesture" should be logged. The following are two screenshots of the data we get back with this code sample.
We can see that when we do the "scissors" gesture, the value of the diffFingersX variable is much higher than when the two fingers are close together.
Looking at this data, we could decide that our threshold could be 100. If the value of diffFingersX is more than 100 and the ring and little fingers are folded, the likelihood of the gesture being "scissors" is very high.
So, altogether, we could check this gesture with the following code sample.
Listing 5-39. Detect the "scissors" gesture

If everything works properly, you should see the correct message being logged in the console when you do each gesture! Once you have verified that the logic works, you can move away from console.log statements and use these gestures to build a game, control your interface, and so on.
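As a reference for the check described above, here is a minimal sketch, assuming diffFingersX is the horizontal distance between the index and middle fingertips, the threshold of 100 discussed earlier, and the same annotation names as before.

const annotations = predictions[0].annotations;

// Horizontal distance between the tips of the index and middle fingers.
const diffFingersX = Math.abs(
  annotations.indexFinger[3][0] - annotations.middleFinger[3][0]
);

// Ring and little fingers are folded if their tips are lower on screen (larger y) than their bases.
const ringFolded = annotations.ringFinger[3][1] > annotations.ringFinger[0][1];
const pinkyFolded = annotations.pinky[3][1] > annotations.pinky[0][1];

if (diffFingersX > 100 && ringFolded && pinkyFolded) {
  console.log("scissors gesture!");
} else {
  console.log("other gesture");
}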
The most important thing is to understand how the model works, get familiar with building logic using coordinates so you can explore the opportunities, and be conscious of some of the limits.
Finally, the last body tracking model we are going to talk about is called PoseNet.
PoseNet is a pose detection model that can estimate a single pose or multiple poses in an image or video.
Similarly to the Facemesh and Handpose models, PoseNet tracks the position of keypoints in a user's body.
The following is an example of these key points visualized.
Even though this model is also specialized in tracking a person's body using the webcam feed, using it in your code is a little bit different from the two models we covered in the previous sections.
Importing and loading the model follows the same standard as most of the code samples in this book.
Listing 5-41. Import TensorFlow.js and the PoseNet model in HTML

If you want to experiment with the parameters, you can also load the model with an explicit configuration, as in the sketch below.

If you feel a bit confused by the different parameters, don't worry; as you get started, using the default ones provided is completely fine. If you want to learn more about them, you can find more information in the official TensorFlow documentation.
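A sketch of both ways of loading the model follows; the configuration values shown are examples taken from common usage of the library, not required settings.

// Load PoseNet with its default settings.
const loadDefaultModel = async () => await posenet.load();

// Or load it with an explicit configuration.
const loadTunedModel = async () =>
  await posenet.load({
    architecture: "MobileNetV1",
    outputStride: 16,
    inputResolution: { width: 640, height: 480 },
    multiplier: 0.75,
  });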
Once the model is loaded, you can focus on predicting poses.
To get predictions from the model, you mainly need to call the estimateSinglePose method on the model. The image parameter can either be some imageData, an HTML image element, an HTML canvas element, or an HTML video element. It represents the input image you want to get predictions on.
The flipHorizontal parameter indicates if you would like to flip/mirror the pose horizontally. By default, its value is set to false.
If you are using videos, it should be set to true if the video is by default flipped horizontally (e.g., when using a webcam).
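A sketch of this call follows, assuming the model variable from the loading step and a video element set up as in the previous sections.

const detectPose = async () => {
  const video = document.querySelector("video");
  const pose = await model.estimateSinglePose(video, {
    flipHorizontal: true,
  });
  console.log(pose);
};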
The preceding code sample will set the variable pose to a single pose object that will contain a confidence score and an array of keypoints detected, with their 2D coordinates, the name of the body part, and a probability score.
The following is an example of the object that will be returned.
Listing 5-46. Complete object returned as predictions

To detect multiple poses, we can see that some additional parameters are passed in (a sketch follows the list below).
• maxDetections indicates the maximum number of poses we'd like to detect. The value 5 is the default but you can change it to more or less.
• scoreThreshold indicates that you only want instances to be returned if the score value at the root of the object is higher than the value set. 0.5 is the default value.
• nmsRadius stands for nonmaximum suppression and indicates the amount of pixels that should separate multiple poses detected. The value needs to be strictly positive and defaults to 20.
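A minimal sketch of that call could look like this, assuming the estimateMultiplePoses method with the parameters listed above and the same video element as before.

const detectPoses = async () => {
  const video = document.querySelector("video");
  const poses = await model.estimateMultiplePoses(video, {
    flipHorizontal: true,
    maxDetections: 5,
    scoreThreshold: 0.5,
    nmsRadius: 20,
  });
  console.log(poses);
};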
Using this method will set the value of the variable poses to an array of pose objects, like the following.
Altogether, the code sample to set up the prediction of poses in an image is as follows.
Listing 5-49. HTML code to detect poses in an image