key: cord-0061498-1in100ja
authors: Gerard, Charlie
title: Experimenting with inputs
date: 2020-11-17
journal: Practical Machine Learning in JavaScript
DOI: 10.1007/978-1-4842-6418-8_5
sha: 48b9f51d8bc37574964d801af0bcaeb0507d01d0
doc_id: 61498
cord_uid: 1in100ja
In the previous chapters, we looked into how to use machine learning with images and text data to do object detection and classification, as well as sentiment analysis, toxicity classification and question answering.
When you first read the words "audio data," you might think that this section of the book is going to focus on music; however, I am going to dive into using sound more generally.
We don't really think about it often, but a lot of things around us produce sounds that give us contextual information about our environment.
For example, the sound of thunder helps you understand the weather is probably bad without you having to look out the window, or you can recognize the sound of a plane passing by before you even see it, or even hearing the sound of waves indicates you are probably close to the ocean, and so on.
Without us realizing it, recognizing and understanding the meaning of these sounds impacts our daily lives and our actions. Hearing a knock on your door indicates that someone is probably behind it, waiting for you to open it, and hearing the sound of boiling water while you are cooking suggests that it is ready for you to pour something into it.
Using sound data and machine learning could help us leverage the rich properties of sounds to recognize certain human activities and enhance current smart systems such as Siri, Alexa, and so on. This is what is called acoustic activity recognition.
Considering that a lot of the devices we surround ourselves with possess a microphone, there are a lot of opportunities for this technology.
So far, the smart systems some of us may be using recognize words to trigger commands, but they have no understanding of what is going on around them; your phone does not know you are in the bathroom, your Alexa device does not know you might be in the kitchen, and so on. However, they could, and this awareness could be used to create more tailored and useful digital experiences.
Before we dive into the practical part of this chapter and see how to build such systems in JavaScript using TensorFlow.js, it is helpful to start by understanding the basics of what sound is, and how it is translated to data we can use in code.
Sound is the vibration of air molecules.
If you have ever turned the volume of speakers really loud, you might have noticed that they end up moving back and forth with the music. This movement pushes on air particles, changing the air pressure and creating sound waves.
The same phenomenon happens with speech. When you speak, your vocal cords vibrate, disturbing air molecules around and changing the air pressure, creating sound waves.
A way to illustrate this phenomenon is with the following image. When you hit a tuning fork, it will start vibrating. This back and forth movement will change the surrounding air pressure. The movement forward will create a higher pressure and the movement backward will create a region of lower pressure. The repetition of this movement will create waves.
On the receiver side, our eardrums vibrate with the changes of pressure and this vibration is then transformed into an electrical signal sent to the brain.
So, if sound is a change in air pressure, how do we transform a sound wave into data we can use with our devices?
To be able to interpret sound data, our devices use microphones.
There exist different types of microphones, but in general, these devices have a diaphragm or membrane that vibrates when exposed to changes of air pressure caused by sound waves.
These vibrations move a magnet near a coil inside the microphone, which generates a small electrical current. Your computer then converts this signal into numbers that represent both volume and frequency.
In JavaScript, the Web API that lets developers access data coming from the computer's microphone is the Web Audio API.
If you have never used this API before, it's totally fine; we are going to go through the main lines of code you need to get everything set up.
To start, we need to access the AudioContext interface on the global window object, as well as making sure we can get permission to access an audio and video input device with getUserMedia.
Listing 5-1. Setup to use the Web Audio API in JavaScript

window.AudioContext = window.AudioContext || window.webkitAudioContext;
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia;
This code sample takes into consideration cross-browser compatibility. Then, to start listening to input coming from the microphone, we need to wait for a user action on the page, for example, a click.
Once the user has interacted with the web page, we can instantiate an audio context, allow access to the computer's audio input device, and use some of the Web Audio API built-in methods to create a source and an analyser and connect the two together to start getting some data.

document.body.onclick = async () => {
  const audioctx = new window.AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioctx.createMediaStreamSource(stream);
  analyser = audioctx.createAnalyser();
  analyser.smoothingTimeConstant = 0;
  source.connect(analyser);
  analyser.fftSize = 1024;
  getAudioData();
};
In the preceding code sample, we are using navigator.mediaDevices.getUserMedia to get access to the microphone. If you have ever built applications that were using audio or video input devices before, you might be familiar with writing navigator.getUserMedia(); however, this is deprecated, and you should now be using navigator.mediaDevices.getUserMedia().
Writing it the old way will still work but is not recommended as it will probably not be supported in the next few years.
Once the basic setup is done, the getAudioData function filters the raw data coming from the device to only get the frequency data.
Listing 5-3. Function to filter through the raw data to get the frequency data we will use

const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
We also call requestAnimationFrame to continuously call this function and update the data we are logging with live data.
Altogether, you can access live data from the microphone in less than 25 lines of JavaScript!

Listing 5-4. Complete code sample to get input data from the microphone in JavaScript

window.AudioContext = window.AudioContext || window.webkitAudioContext;
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia;

let analyser;

document.body.onclick = async () => {
  const audioctx = new window.AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioctx.createMediaStreamSource(stream);
  analyser = audioctx.createAnalyser();
  analyser.smoothingTimeConstant = 0;
  source.connect(analyser);
  analyser.fftSize = 1024;
  getAudioData();
};

const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
The output from this code is a series of arrays of raw data that we are logging in the browser's console. These arrays represent the frequencies that make up the sounds recorded by the computer's microphone. The default sample rate is 44,100 Hz, which means we get 44,100 samples of data per second.
In the format shown earlier (arrays of integers), finding patterns to recognize some type of activity seems pretty difficult. We wouldn't really be able to identify the difference between speaking, laughing, music playing, and so on.
To help make sense of this raw frequency data, we can turn it into visualizations.
There are different ways to visualize sound. A couple of ways you might be familiar with are waveforms or frequency charts.
Waveform visualizers represent the displacement of sound waves over time.
On the x axis (the horizontal one) is time, and on the y axis (the vertical one) is the amplitude of the wave. Sound happens over a certain period of time and is made of multiple frequencies.
This way of visualizing sound is a bit too minimal to be able to identify patterns. As you can see in the illustration earlier, all the frequencies that make up a sound are reduced to a single line.
Frequency charts are visualizations that represent a measure of how many times a waveform repeats in a given amount of time.
You might be familiar with this type of audio visualization as they are probably the most common one.
This way of visualizing can maybe give you some insights about a beat as it represents repetitions or maybe about how loud the sound is as the y axis shows the volume, but that's about it.
This visualization does not give us enough information to be able to recognize and classify sounds we are visualizing.
Another type of visualization that is much more helpful is called a spectrogram.
A spectrogram is like a picture of a sound. It shows the frequencies that make up the sound from low to high and how they change over time. It is a visual representation of the spectrum of frequencies of a signal, a bit like a heat map of sound.
On the y axis is the spectrum of frequencies and, on the x axis, the amount of time. The axes seem similar to those of the two other types of visualizations we mentioned previously, but instead of representing all frequencies in a single line, we represent the whole spectrum.
In a spectrogram, a third axis can be helpful too, the amplitude. The amplitude of a sound can be described as the volume. The brighter the color, the louder the sound.
Visualizing sounds as spectrograms is much more helpful in finding patterns that would help us recognize and classify sounds.
For example, next is a screenshot of the output of a spectrogram running while I am speaking.
By itself, this might not help you understand why spectrograms are more helpful visualizations. The following is another screenshot of a spectrogram taken while I was clapping my hands three times. Hopefully, it starts to make more sense! If you compare both spectrograms, you can clearly distinguish between the two activities: speaking and clapping my hands.
If you wanted, you could try to visualize more sounds like coughing, your phone ringing, toilets flushing, and so on.
Overall, the main takeaway is that spectrograms help us see the signature of various sounds more clearly and distinguish them between different activities.
If we can make this differentiation by looking at a screenshot of a spectrogram, we can hope that using this data with a machine learning algorithm will also work to find patterns and classify these sounds to build an activity classifier.
A broader example of using spectrograms for activity classification comes from a research paper published by Carnegie Mellon University in the United States. In their paper titled "Ubicoustics: Plug-and-Play Acoustic Activity Recognition," they created spectrograms for various activities, from using a chainsaw to a vehicle driving nearby.

So, before we dive into using sound with machine learning, let's go through how we can turn the live data from the microphone that we logged in the console using the Web Audio API into a spectrogram.
In the code sample we wrote earlier, we created a getAudioData function that was getting the frequency data from the raw data and was logging it to the browser's console.
Listing 5-5. getAudioData function to get frequency data from raw data

const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);
  console.log(freqdata);
  requestAnimationFrame(getAudioData);
};
Where we wrote our console.log statement, we are going to add the code to create the visualization.
To do this, we are going to use the Canvas API, so we need to start by adding a canvas element to our HTML file like so.
Listing 5-6. Adding a canvas element to the HTML file

<canvas id="canvas"></canvas>
In our JavaScript, we are going to be able to access this element and use some methods from the Canvas API to draw our visualization.
Listing 5-7. Getting the canvas element and context in JavaScript

var canvas = document.getElementById("canvas");
var ctx = canvas.getContext("2d");
The main concept of this visualization is to draw the spectrum of frequencies as they vary with time, so we need to get the current canvas and redraw over it every time we get new live data.
Listing 5-8. Getting the image data from the canvas element and redrawing over it

imagedata = ctx.getImageData(1, 0, canvas.width - 1, canvas.height);
ctx.putImageData(imagedata, 0, 0);

Then, we need to loop through the frequency data we get from the Web Audio API and draw it onto the canvas. Then, we call strokeStyle and pass it a dynamic value that will represent the colors used to display the amplitude of the sound.
After that, we call moveTo to move the visualization 1 pixel to the left and leave space for the new input to be drawn onto the screen at the far right, drawn with lineTo.
Finally, we call the stroke method to draw the line.
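Putting Listings 5-7 and 5-8 together with the drawing steps just described, a minimal sketch of the drawing version of getAudioData could look like the following. It assumes the canvas, ctx, and analyser variables from the previous listings, and the color mapping used here is an illustrative choice rather than a prescribed one.

const getAudioData = () => {
  const freqdata = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(freqdata);

  // Shift the existing drawing 1 pixel to the left to make room for the new column.
  const imagedata = ctx.getImageData(1, 0, canvas.width - 1, canvas.height);
  ctx.putImageData(imagedata, 0, 0);

  // Draw the newest column of frequency data at the far right of the canvas.
  for (let i = 0; i < freqdata.length; i++) {
    // Brighter colors for louder frequencies (illustrative mapping).
    ctx.strokeStyle = `hsl(${255 - freqdata[i]}, 100%, 50%)`;
    ctx.beginPath();
    ctx.moveTo(canvas.width - 1, canvas.height - i);
    ctx.lineTo(canvas.width, canvas.height - i);
    ctx.stroke();
  }

  requestAnimationFrame(getAudioData);
};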
Altogether, our getAudioData function should look something like the preceding sketch. You might be wondering why it is important to understand how to create spectrograms. The main reason is that spectrograms are what is used as training data for the machine learning algorithm.
Instead of using the raw data the way we logged it in the browser's console, we use the generated spectrogram images, turning a sound problem into an image one.
Image recognition and classification have advanced significantly over the past few years, and algorithms used with image data have proven to be very performant.
Also, turning sound data into an image means that we can deal with a smaller amount of data to train a model, which results in shorter training times.
Indeed, the default sample rate of the Web Audio API is usually 44.1 kHz, which means that it collects 44,100 samples of data per second.
If we record 2 seconds of audio, that is over 88,000 data points for a single sample.
You can imagine that as we need to record a lot more samples, it would end up being a very large amount of data being fed to a machine learning algorithm, which would take a long time to train.
On the other hand, a spectrogram being extracted as a picture can easily be resized to a smaller size, which could end up being only a 28x28 pixel image, for example, resulting in 784 data points for a 2-second audio clip.

Now that we covered how to access live data from the microphone in JavaScript and how to transform it into a spectrogram visualization, allowing us to see how different sounds create visually different patterns, let's look into how to train a machine learning model to create a classifier.
Instead of creating a custom machine learning algorithm for this, we are instead going to use one of the Teachable Machine experiments dedicated to sound data. You can find it at https://teachablemachine.withgoogle.com/train/audio.
This project allows us to record samples of sound data, label them, train a machine learning algorithm, test the output, and export the model all within a single interface and in the browser! To start, we need to record some background noise for 20 seconds using the section highlighted in red in the following figure.
Then, we can start to record some samples for whatever sound we would like the model to recognize later on.
The minimum amount of samples is 8 and each of them needs to be 2 seconds long.
As this experiment uses transfer learning to quickly retrain a model that has already been trained with sound data, we need to work with the same format the original model was trained with.
Eight samples is the minimum but you can record more if you'd like. The more samples, the better. However, don't forget that it will also impact the amount of time the training will take.
Once you have recorded your samples and labelled them, you can start the live training in the browser and make sure not to close the browser window.
When this step is done, you should be able to see some live predictions in the last step of the experiment. Before you export the model, you can try to repeat the sounds you recorded to verify the accuracy of the predictions. If you don't find it accurate enough, you can record more samples and restart the training.

If you are ready to move on, you can either upload your model to some Google servers and be provided with a link to it, or download the machine learning model that was created.
If you'd like to get a better understanding of how it works in the background, how the machine learning model is created, and so on, I'd recommend having a look at the source code available on GitHub! Even though I really like interfaces like Teachable Machine as they allow anyone to get started and experiment quickly, looking at the source code can reveal some important details. For example, the next image is how I realized that this project was using transfer learning.
While going through the code to see how the machine learning model was created and how the training was done, I noticed the following sample of code.
On line 793, we can see that the method addExample is called. This is the same method we used in the chapter of this book dedicated to image recognition when we used transfer learning to train an image classification model quickly with new input images.
Noticing these details is important if you decide to experiment with re-creating this model on your own, without going through the Teachable Machine interface. Now that we went through the training process, we can write the code to generate the predictions.
Before we can start writing this code, we need to import TensorFlow.js and the speech commands model.
Listing 5-11. Import TensorFlow.js and the speech commands model in an HTML file

As I mentioned earlier, this experiment uses transfer learning, so we need to import the speech commands model that has already been trained with audio data to make it simpler and faster to get started. The speech commands model was originally trained to recognize and classify spoken words, like "yes", "no", "up", and "down". However, here, we are using it with sounds produced by activities, so it might not be as accurate as if we were using spoken words in our samples.
Before going through the rest of the code samples, make sure you have downloaded your trained model from the Teachable Machine platform, unzipped it, and added it to your application folder.
The following code samples will assume that your model is stored in a folder called activities-model at the root of your application.
Overall, your file structure should look something like this:
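For example, assuming an index.html and an index.js file at the root (your file names may differ), and knowing that a Teachable Machine export typically contains a model file, a metadata file, and a weights file, the structure could look something like this:

index.html
index.js
activities-model/
    model.json
    metadata.json
    weights.bin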
In our JavaScript file, we will need to create a function to load our model and start the live predictions, but before that, we can create two variables to hold the paths to our model and metadata files. The paths shown here use localhost:8000; feel free to change the port, and make sure to update them if you decide to release your application to production.
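A sketch of what these two variables could look like follows; the variable names are illustrative, and the port matches the localhost:8000 setup mentioned above.

// Paths to the files exported from Teachable Machine (adjust to your setup).
const modelURL = "http://localhost:8000/activities-model/model.json";
const metadataURL = "http://localhost:8000/activities-model/metadata.json";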
Then, we need to load the model and ensure it is loaded before we continue.
Listing 5-13. Loading the model

In each array of predictions returned, the value closest to 1 represents the label predicted. To match the predicted data with the correct label, we can create an array containing the labels we used for training and use it when calling the setupModel function.
Listing 5-17. Mapping scores to labels

const labels = [
  "Coughing",
  "Phone ringing",
  "Speaking",
  "_background_noise_",
];
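Putting these pieces together, the following is a minimal sketch of what loading the exported model and running live predictions could look like with the speech commands library. The setupModel name matches the prose above, but the listen options and the way the highest score is picked are illustrative.

const setupModel = async (modelURL, metadataURL, labels) => {
  // Create a recognizer that uses the browser's FFT and our custom model files.
  const recognizer = speechCommands.create(
    "BROWSER_FFT",
    undefined,
    modelURL,
    metadataURL
  );
  await recognizer.ensureModelLoaded();

  // Listen continuously and log the label with the highest score.
  recognizer.listen(
    (result) => {
      const scores = Array.from(result.scores);
      const highestScore = Math.max(...scores);
      console.log(labels[scores.indexOf(highestScore)], highestScore);
    },
    { probabilityThreshold: 0.75 }
  );
};

setupModel(modelURL, metadataURL, labels);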
In the previous section, we covered how to record sound samples and train the model using the Teachable Machine experiment, for simplicity. However, if you are looking to implement this in your own application and let users run this same process themselves, you can use the transfer learning API.
This API lets you build your own interface and call API endpoints to record samples, train the model, and run live predictions.
Let's imagine a very simple web interface with a few buttons.
Some of these buttons are used to collect sample data, one button to start the training and the last one to trigger the live predictions.
To get started, we need an HTML file with these six buttons and two script tags to import TensorFlow.js and the Speech Commands model.
Listing 5-19. HTML file
In the JavaScript file, before being able to run these actions, we need to create the model, ensure it is loaded, and pass a main label to our model to create a collection that will contain our audio samples.
Listing 5-20. Set up the recognizers

const init = async () => {
  const baseRecognizer = speechCommands.create("BROWSER_FFT");
  await baseRecognizer.ensureModelLoaded();
  transferRecognizer = baseRecognizer.createTransfer("colors");
};
Then, we can add event listeners on our buttons so they will collect samples on click. For this, we need to call the collectExample method on our recognizer and pass it a string we would like the sample to be labelled with.
Listing 5-21. Collecting samples

const redButton = document.getElementById("red");
redButton.onclick = async () =>
  await transferRecognizer.collectExample("red");
To start the training, we call the train method on the recognizer. Altogether, this code sample looks like the following.
Listing 5-24. Full code sample

let transferRecognizer;

const init = async () => {
  const baseRecognizer = speechCommands.create("BROWSER_FFT");
  await baseRecognizer.ensureModelLoaded();
  transferRecognizer = baseRecognizer.createTransfer("colors");
};

init();

const redButton = document.getElementById("red");
const backgroundButton = document.getElementById("background");
const trainButton = document.getElementById("train");
const predictButton = document.getElementById("predict");

redButton.onclick = async () =>
  await transferRecognizer.collectExample("red");

backgroundButton.onclick = async () =>
  await transferRecognizer.collectExample("_background_noise_");
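The handlers for the train and predict buttons do not appear above; a sketch of what they could look like with the transfer learning API follows. The number of epochs and the listen options are illustrative values.

trainButton.onclick = async () =>
  await transferRecognizer.train({ epochs: 25 });

predictButton.onclick = async () =>
  await transferRecognizer.listen(
    (result) => {
      const words = transferRecognizer.wordLabels();
      const scores = Array.from(result.scores);
      console.log(words[scores.indexOf(Math.max(...scores))]);
    },
    { probabilityThreshold: 0.75 }
  );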
Even though the examples I have used so far for our code samples (speaking and coughing) might have seemed simple, the way this technology is currently being used shows how interesting it can be.
In July 2020, Apple announced the release of a new version of watchOS that included an application triggering a countdown when the user washes their hands. Following advice from public health officials on avoiding the spread of COVID-19, this application uses the watch's microphone to detect the sound of running water and trigger a 20-second countdown.
From the code samples shown in the last few pages, a similar application can be built using JavaScript and TensorFlow.js.
One of my favorite applications for this technology is in biodiversity research and protection of endangered species.
A really good example of this is the Rainforest Connection collective. This collective uses old cell phones and their built-in microphones to detect the sound of chainsaws in the forest and alert rangers to potential illegal deforestation activity.
Using solar panels and attaching the installation to trees, they can constantly monitor what is going on around them and run live predictions.
If this is a project that interests you, they also have a mobile application called Rainforest Connection, in which you can listen to the sound of nature, live from the forest. Another use of this technology is in protecting killer whales. A collaboration between Google, Rainforest Connection, and Fisheries and Oceans Canada (DFO) uses bioacoustics monitoring to track, monitor, and observe the animals' behavior in the Salish Sea.
Another application you might not have noticed is currently implemented in a service you probably know. Indeed, if you are using YouTube, you may have come across live ambient sound captioning.
If you have ever activated captions on a YouTube video, you know that spoken words are displayed as an overlay at the bottom.
However, there is more information in a video than what can be found in the transcript.
Indeed, people without hearing impairments get access to additional information in the form of contextual sounds, like music playing or the sound of rain in a video.
Displaying only spoken words in captions can cut quite a lot of information out for people with hearing impairments.
About 3 years ago, in 2017, YouTube released live ambient sound captioning that uses acoustic recognition to add to the captions details about ambient sounds detected in the soundtrack of a video.
Here is an example.
The preceding screenshot is taken from an interview between Janelle Monae and Pharrell Williams where the captions are activated.
Spoken words are displayed as expected, but we can also see ambient sounds like [Applause].
People with hearing impairment can now have the opportunity to get more information about the video than only dialogues.
At the moment, the ambient sounds that can be detected on YouTube videos include

• Applause
• Music playing
It might not seem like much, but again, this is something we take for granted if we never have to think about the experience some people with disabilities have on these platforms.
Besides, the fact that this feature was implemented about 3 years ago already shows that a major technology company like Google has been actively exploring the potential of using machine learning with audio data and has been working on finding useful applications.
Now that we covered how to experiment with acoustic activity recognition in JavaScript and a few different applications, it is important to be aware of some of the limitations of such technology to have a better understanding of the real opportunities.
If you decide to build a similar acoustic activity recognition system from scratch and write your own model without using transfer learning and the speech commands model from TensorFlow.js, you are going to need to collect a lot more sound samples than the minimum of 8 required when using Teachable Machine.
To gather a large amount of samples, you can either decide to record them yourself or buy them from a professional audio library.
Another important point is to make sure to check the quality of the data recorded. If you want to detect the sound of a vacuum cleaner running, for example, make sure that there is no background noise and that the vacuum cleaner can be clearly heard in the audio track.
One tip to generate samples of data from a single one is to use an audio editing software to change some parameters of a single audio source to create multiple versions of it. You can, for example, modify the reverb, the pitch, and so on.
At the moment, this technology seems to be efficient in recognizing a single sound at once.
For example, say you trained your model to recognize both the sound of someone speaking and the sound of running water. If you placed your system in the kitchen and the user was speaking while washing the dishes, the activity predicted would only be the one with the highest score in the predictions returned.
However, as the system runs continuously, it would probably get confused between the two activities. It would probably alternate between "speaking" and "running water" until one of the activities stopped.
This would definitely become a problem if you built an application that can detect sounds produced by activities that can be executed at the same time.
For example, let's imagine you usually play music while taking a shower and you built an application that can detect two activities: the sound of the shower running and speaking.
You want to be able to trigger a counter whenever it detects that the shower is running so you can avoid taking long showers and save water.
You also want to be able to lower the sound of your speakers when it detects that someone is speaking in the bathroom.
As these two activities can happen at the same time (you can speak while taking a shower), the system could get confused between the two activities and detect the shower running for a second and someone speaking the next.
As a result, it would start and stop the speakers one second, and start/stop the counter the next. This would definitely not create an ideal experience.
However, this does not mean that there is no potential in building applications using acoustic activity recognition; it only means that we would need to work around this limitation.
Besides, some research is being done around developing systems that can handle the detection of multiple activities at once. We will look into it in the next few pages.
When it comes to user experience, there are always some challenges with new technologies like this one.
First of all, privacy.
Having devices listening to users always raises some concerns about where the data is stored, how it is used, whether it is secure, and so on.
Considering that some companies releasing Internet of Things devices do not always put security first in their products, these concerns are very normal.
As a result, the adoption of these devices by consumers can be slower than expected.
Not only should privacy and security be baked into these systems, they should also be communicated to users in a clear way to reassure them and give them a sense of empowerment over their data.
Secondly, another challenge is in teaching users new interactions. For example, even though most modern phones now have voice assistants built-in, getting information from asking Siri or Google is not the primary interaction.
This could be for various reasons including privacy and limitations of the technology itself, but people also have habits that are difficult to change.
Besides, considering the current imperfect state of this technology, it is easy for users to give up after a few trials, when they do not get the response they were looking for.
A way to mitigate this would be to release small applications to analyze users' reactions to them and adapt. The work Apple did by implementing the water detection in their new watchOS is an example of that.
Finally, one of the big challenges of creating a custom acoustic activity recognition system is in the collection of sample data and training by the users.
Even though you can build and release an application that detects the sound of a faucet running because there's a high probability that it produces a similar sound in most homes, some other sounds are not so common.
As a result, empowering users to use this technology would involve letting them record their own samples and train the model so they can have the opportunity to have a customized application.
However, as machine learning algorithms need to be trained with a large amount of data to have a chance to produce accurate predictions, it would require a lot of effort from users and would inevitably not be successful.
Luckily, some researchers are experimenting with solutions to these problems. Now, even though there are some limits to this technology, solutions also start to appear.
For example, in terms of protecting users' privacy, an open source project called Project Alias by Bjørn Karmann attempts to empower voice assistant users.
This project is a DIY add-on made with a Raspberry Pi, a speaker, and a microphone module, all in a 3D-printed enclosure, that aims at blocking voice assistants like Amazon Alexa and Google Home from continuously listening to people.
Through a mobile application, users can train Alias to react to a custom wake word or sound. Once it is trained, Alias can take control over the home assistant and activate it for you. When you don't use it, the add-on prevents the assistant from listening by emitting white noise into its microphone.
As Alias's neural network runs locally, the user's privacy is protected.
Another project, called Synthetic Sensors, aims at creating a system that can accurately predict multiple sounds at once.
Developed by a team of researchers at the Carnegie Mellon University, this project involves a custom-built piece of hardware made of multiple sensors, including an accelerometer, microphone, temperature sensor, motion sensor, and color sensor.
Using the raw data collected from these sensors, researchers created multiple stacked spectrograms and trained algorithms to detect patterns produced by multiple activities at once.

Finally, in terms of user experience, a research project called Listen Learner aims at allowing users to collect data and train a model to recognize custom sounds with minimal effort.
The full name of the project is Listen Learner, Automatic Class Discovery and One-Shot Interaction for Activity Recognition.
It aims at providing high classification accuracy, while minimizing user burden, by continuously listening to sounds in its environment, classifying them by cluster of similar sounds, and asking the user what the sound is after having collected enough similar samples.
The result of the study shows that this system can accurately and automatically learn acoustic events (e.g., 97% precision, 87% recall), while adhering to users' preferences for nonintrusive interactive behavior.
After looking at how to use machine learning with audio data, let's look into another type of input, that is, body tracking.
In this section, we are going to use data from body movements via the webcam using three different TensorFlow.js models.
The first model we are going to experiment with is called Facemesh. It is a machine learning model focused on face recognition that predicts the position of 468 3D facial landmarks on a user's face, returning points with their x, y, and z coordinates.
The main difference between this face recognition model and other face tracking JavaScript libraries like face-tracking.js is that the TensorFlow.js model intends to approximate the surface geometry of a human face and not only the 2D position of some key points.
This model provides coordinates in a 3D environment, which allows us to approximate the depth of facial features as well as track the position of key points even when the user is rotating their face in three dimensions.
To start using the model, we need to load it using the two following lines in your HTML file.
Listing 5-25. Importing TensorFlow.js and Facemesh in an HTML file
As we are going to use the video feed from the webcam to detect faces, we also need to add a video element to our file.
Altogether, the very minimum HTML you need for this is as follows.
Listing 5-26. Core HTML code needed

Then, in your JavaScript code, you need to load the model and the webcam feed using the following code.

As we can see in the preceding two screenshots, the predictions returned contain an important amount of information.
The annotations are organized by landmark areas, in alphabetical order and containing arrays of x, y, and z coordinates.
The bounding box contains two main keys, bottomRight and topLeft, to indicate the boundaries of the position of the detected face in the video stream. These two properties contain an array of only two coordinates, x and y, as the z axis is not useful in this case.
Finally, the mesh and scaledMesh properties contain all coordinates of the landmarks and are useful to render all points in 3D space on the screen.
Altogether, the JavaScript code to set up the model, the video feed, and start predicting the position of landmarks should look like the following.
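A minimal sketch of that setup could look like the following, assuming a single video element on the page and the facemesh global provided by the script import; the helper names are illustrative.

let model;

const setupCamera = async () => {
  const video = document.querySelector("video");
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = stream;
  return new Promise((resolve) => {
    video.onloadedmetadata = () => {
      video.play();
      resolve(video);
    };
  });
};

async function main() {
  const predictions = await model.estimateFaces(document.querySelector("video"));
  if (predictions.length > 0) {
    console.log(predictions);
  }
  requestAnimationFrame(main);
}

const init = async () => {
  await setupCamera();
  model = await facemesh.load();
  main();
};

init();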
To put this code sample into practice, let's build a quick prototype to allow users to scroll down a page by tilting their head back and forth.
We are going to be able to reuse most of the code written previously and make some small modifications to trigger a scroll using some of the landmarks detected.
The specific landmark we are going to use to detect the movement of the head is the lipsLowerOuter and more precisely its z axis.
Looking at all the properties available in the annotations object, using the lipsLowerOuter one is the closest to the chin, so we can look at the predicted changes of z coordinate for this area to determine if the head is tilting backward (chin moving forward) or forward (chin moving backward).
To do this, in our main function, once we get predictions, we can add the following lines of code. In this code sample, I declare a variable that I call zAxis to store the value of the z coordinate I want to track. To get this value, I look into the array of coordinates contained in the lipsLowerOuter property of the annotations object.
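These lines could look like the following sketch; the index used on lipsLowerOuter and the threshold of 5 are explained just below, while the scroll step and the sign convention are illustrative and may need adjusting.

// Inside main(), once predictions have been returned.
// scrollPosition is assumed to be declared earlier and initialized to 0.
const zAxis = predictions[0].annotations.lipsLowerOuter[9][2];

if (zAxis > 5) {
  // Head tilted backward: scroll up (flip the signs if your setup behaves the other way).
  scrollPosition -= 10;
} else if (zAxis < -5) {
  // Head tilted forward: scroll down.
  scrollPosition += 10;
}

window.scrollTo({
  top: scrollPosition,
  left: 0,
  behavior: "smooth",
});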
Based on the annotation objects returned, we can see that the lipsLowerOuter property contains 10 arrays of 3 values each.
This is why the code sample shown just earlier was accessing the z coordinate using predictions[0].annotations.lipsLowerOuter[9][2].
I decided to access the last element ([9]) of the lipsLowerOuter property and its third value ([2]), the z coordinate of the section.
The value 5 was selected after trial and error and seeing what threshold would work for this particular project. It is not a standard value that you will need to use every time you use the Facemesh model. Instead, I decided it was the correct threshold for me to use after logging the variable zAxis and seeing its value change in the browser's console as I was tilting my head back and forth.
Then, assuming that you declared scrollPosition earlier in the code and set it to a value (I personally set it to 0), a "scroll up" event will happen when you tilt your head backward and "scroll down" when you tilt your head forward.
Finally, I set the property behavior to "smooth" so we have some smooth scrolling happening, which, in my opinion, creates a better experience.
If you did not add any content to your HTML file, you won't see anything happen yet though, so don't forget to add enough text or images to be able to test that everything is working!

In less than 75 lines of JavaScript, we loaded a face recognition model, set up the video stream, ran predictions to get the 3D coordinates of facial landmarks, and wrote some logic to trigger a scroll up or down when tilting your head backward or forward!

This model is specialized in detecting face landmarks. Next, we're going to look into another one, to detect keypoints in a user's hands.
The second model we are going to experiment with is called Handpose. This model specializes in recognizing the position of 21 3D keypoints in the user's hands.
The following is an example of the output of this model, once visualized on the screen using the Canvas API.
To implement this, the lines of code will look very familiar if you have read the previous section.
We need to start by requiring TensorFlow.js and the Handpose model:
Listing 5-32. Import TensorFlow.js and the Handpose model
Similarly to the way the Facemesh model works, we are going to use the video stream as input so we also need to add a video element in your main HTML file.
Then, in your JavaScript file, we can use the same functions we wrote before to set up the camera and load the model. The only line we will need to change is the line where we call the load method on the model.
As we are using Handpose instead of Facemesh, we need to replace facemesh.load() with handpose.load().
So, overall the base of your JavaScript file should have the following code.
Once the model is loaded and the webcam feed is set up, we can run predictions and detect keypoints when a hand is placed in front of the webcam.
To do this, we can copy the main() function we created when using Facemesh, but replace the expression model.estimateFaces with model.estimateHands.
As a result, the main function should be as follows.
Listing 5-34. Run predictions and log the output

async function main() {
  const predictions = await model.estimateHands(
    document.querySelector("video")
  );

  if (predictions.length > 0) {
    console.log(predictions);
  }

  requestAnimationFrame(main);
}
The output of this code will log the following data in the browser's console. We can see that the format of this data is very similar to the one when using the Facemesh model! This makes it easier and faster to experiment as you can reuse code samples you have written in other projects. It allows developers to get set up quickly to focus on experimenting with the possibilities of what can be built with such models, without spending too much time in configuration.
The main differences that can be noticed are the properties defined in annotations, the additional handInViewConfidence property, and the lack of mesh and scaledMesh data.
The handInViewConfidence property represents the probability of a hand being present. It is a floating value between 0 and 1. The closer it is to 1, the more confident the model is that a hand is found in the video stream.
At the moment of writing this book, this model is able to detect only one hand at a time. As a result, you cannot build applications that would require a user to use both hands at once as a way of interacting with the interface.
To check that everything is working properly, here is the full JavaScript code sample needed to test your setup.
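A minimal sketch of that full setup follows, reusing the same structure as the Facemesh example; the helper names are illustrative.

let model;

const setupCamera = async () => {
  const video = document.querySelector("video");
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = stream;
  return new Promise((resolve) => {
    video.onloadedmetadata = () => {
      video.play();
      resolve(video);
    };
  });
};

async function main() {
  const predictions = await model.estimateHands(document.querySelector("video"));
  if (predictions.length > 0) {
    console.log(predictions);
  }
  requestAnimationFrame(main);
}

const init = async () => {
  await setupCamera();
  model = await handpose.load();
  main();
};

init();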
To experiment with the kind of applications that can be built with this model, we're going to build a small "Rock Paper Scissors" game.
To understand how we are going to recognize the three gestures, let's have a look at the following visualizations to understand the position of the keypoints per gesture.
The preceding screenshot represents the "rock" gesture. As we can see, all fingers are folded so the tips of the fingers should be further in their z axis than the keypoint at the end of the first phalanx bone for each finger.
Otherwise, we can also consider that the y coordinate of the finger tips should be higher than the one of the major knuckles, keeping in mind that the top of the screen is equal to 0 and the lower the keypoint, the higher the y value.
We'll be able to play around with the data returned in the annotations object to see if this is accurate and can be used to detect the "rock" gesture.

In the "paper" gesture, all fingers are straight, so we can use mainly the y coordinates of different fingers. For example, we could check if the y value of the last point of each finger (at the tips) is less than the y value of the palm or the base of each finger.
Finally, the "scissors" gesture could be recognized by looking at the space in x axis between the index finger and the middle finger, as well as the y coordinate of the other fingers.
If the y value of the tip of the ring finger and little finger is lower than their base, they are probably folded.
Reusing the code samples we have gone through in the previous sections, let's look into how we can write the logic to recognize and differentiate these gestures.
If we start with the "rock" gesture, here is how we could check if the y coordinate of each finger is higher than the one of the base knuckle. if (indexTip > indexBase) { console.log("index finger folded"); } We can start by declaring two variables, one to store the y position of the base of the index finger and one for the tip of the same finger.
Looking back at the data from the annotations object when a finger is present on screen, we can see that, for the index finger, we get an array of 4 arrays representing the x, y, and z coordinates of each key point.
The y coordinate in the first array has a value of about 352.27 and the y coordinate in the last array has a value of about 126.62, which is lower, so we can deduce that the first array represents the position of the base of the index finger, and the last array represents the keypoint at the tip of that finger.
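Based on that reasoning, the two variables could be declared as in the following sketch, assuming the annotations property for the index finger is named indexFinger and that the y coordinate sits at index 1 of each keypoint array.

// y coordinate of the base knuckle and of the fingertip of the index finger.
const indexBase = predictions[0].annotations.indexFinger[0][1];
const indexTip = predictions[0].annotations.indexFinger[3][1];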
We can test that this information is correct by writing the if statement shown earlier that logs the message "index finger folded" if the value of indexTip is higher than the one of indexBase.
And it works! If you test this code by placing your hand in front of the camera and switch from holding your index finger straight to folding it, you should see the message being logged in the console!

If we wanted to keep it really quick and simple, we could stop here and decide that this single check determines the "rock" gesture. However, if we would like to have more confidence in our gesture, we could repeat the same process for the middle finger, ring finger, and little finger.
The thumb would be a little different as we would check the difference in x coordinate rather than y, because of the way this finger folds.
For the "paper" gesture, as all fingers are extended, we could check that the tip of each finger has a smaller y coordinate than the base.
Here's what the code could look like to verify that.
Listing 5-37. Check the y coordinate of each finger for the "paper" gesture

We start by storing the coordinates we are interested in into variables and then compare their values to set the extended states to true or false.
If all fingers are extended, we log the message "paper gesture!". If everything is working fine, you should be able to place your hand in front of the camera with all fingers extended and see the logs in the browser's console.
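A sketch of this check follows, assuming the handpose annotation names (indexFinger, middleFinger, ringFinger, pinky) and the y coordinate at index 1; the variable names are illustrative.

const annotations = predictions[0].annotations;

// A finger is considered extended if its tip is higher on screen (smaller y) than its base.
const indexExtended = annotations.indexFinger[3][1] < annotations.indexFinger[0][1];
const middleExtended = annotations.middleFinger[3][1] < annotations.middleFinger[0][1];
const ringExtended = annotations.ringFinger[3][1] < annotations.ringFinger[0][1];
const pinkyExtended = annotations.pinky[3][1] < annotations.pinky[0][1];

if (indexExtended && middleExtended && ringExtended && pinkyExtended) {
  console.log("paper gesture!");
} else {
  console.log("other gesture");
}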
If you change to another gesture, the message "other gesture" should be logged. The following are two screenshots of the data we get back with this code sample.
We can see that when we do the "scissors" gesture, the value of the diffFingersX variable is much higher than when the two fingers are close together.
Looking at this data, we could decide that our threshold could be 100. If the value of diffFingersX is more than 100 and the ring and little fingers are folded, the likelihood of the gesture being "scissors" is very high.
So, altogether, we could check this gesture with the following code sample.
Listing 5-39. Detect the "scissors" gesture

If everything works properly, you should see the correct message being logged in the console when you do each gesture! Once you have verified that the logic works, you can move away from console.log statements and use these gestures to build a game, control your interface, and so on.
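As a reference for the check described above, here is a minimal sketch, assuming diffFingersX is the horizontal distance between the index and middle fingertips, the threshold of 100 discussed earlier, and the same annotation names as before.

const annotations = predictions[0].annotations;

// Horizontal distance between the tips of the index and middle fingers.
const diffFingersX = Math.abs(
  annotations.indexFinger[3][0] - annotations.middleFinger[3][0]
);

// Ring and little fingers are folded if their tips are lower on screen (larger y) than their bases.
const ringFolded = annotations.ringFinger[3][1] > annotations.ringFinger[0][1];
const pinkyFolded = annotations.pinky[3][1] > annotations.pinky[0][1];

if (diffFingersX > 100 && ringFolded && pinkyFolded) {
  console.log("scissors gesture!");
} else {
  console.log("other gesture");
}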
The most important thing is to understand how the model works, get familiar with building logic using coordinates so you can explore the opportunities, and be conscious of some of the limits.
Finally, the last body tracking model we are going to talk about is called PoseNet.
PoseNet is a pose detection model that can estimate a single pose or multiple poses in an image or video.
Similarly to the Facemesh and Handpose models, PoseNet tracks the position of keypoints in a user's body.
The following is an example of these key points visualized.
Even though this model is also specialized in tracking a person's body using the webcam feed, using it in your code is a little bit different from the two models we covered in the previous sections.
Importing and loading the model follows the same standard as most of the code samples in this book.
Listing 5-41. Import TensorFlow.js and the PoseNet model in HTML

If you want to experiment with the parameters, you can also load the model with an explicit configuration, as in the sketch below.

If you feel a bit confused by the different parameters, don't worry; as you get started, using the default ones provided is completely fine. If you want to learn more about them, you can find more information in the official TensorFlow documentation.
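A sketch of both ways of loading the model follows; the configuration values shown are examples taken from common usage of the library, not required settings.

// Load PoseNet with its default settings.
const loadDefaultModel = async () => await posenet.load();

// Or load it with an explicit configuration.
const loadTunedModel = async () =>
  await posenet.load({
    architecture: "MobileNetV1",
    outputStride: 16,
    inputResolution: { width: 640, height: 480 },
    multiplier: 0.75,
  });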
Once the model is loaded, you can focus on predicting poses.
To get predictions from the model, you mainly need to call the estimateSinglePose method on the model. The image parameter can either be some imageData, an HTML image element, an HTML canvas element, or an HTML video element. It represents the input image you want to get predictions on.
The flipHorizontal parameter indicates if you would like to flip/mirror the pose horizontally. By default, its value is set to false.
If you are using videos, it should be set to true if the video is by default flipped horizontally (e.g., when using a webcam).
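A sketch of this call follows, assuming the model variable from the loading step and a video element set up as in the previous sections.

const detectPose = async () => {
  const video = document.querySelector("video");
  const pose = await model.estimateSinglePose(video, {
    flipHorizontal: true,
  });
  console.log(pose);
};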
The preceding code sample will set the variable pose to a single pose object that will contain a confidence score and an array of keypoints detected, with their 2D coordinates, the name of the body part, and a probability score.
The following is an example of the object that will be returned.
Listing 5-46. Complete object returned as predictions

To detect multiple poses, we can see that some additional parameters are passed in (a sketch follows the list below).
• maxDetections indicates the maximum number of poses we'd like to detect. The value 5 is the default but you can change it to more or less.
• scoreThreshold indicates that you only want instances to be returned if the score value at the root of the object is higher than the value set. 0.5 is the default value.
• nmsRadius stands for nonmaximum suppression and indicates the amount of pixels that should separate multiple poses detected. The value needs to be strictly positive and defaults to 20.
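A minimal sketch of that call could look like this, assuming the estimateMultiplePoses method with the parameters listed above and the same video element as before.

const detectPoses = async () => {
  const video = document.querySelector("video");
  const poses = await model.estimateMultiplePoses(video, {
    flipHorizontal: true,
    maxDetections: 5,
    scoreThreshold: 0.5,
    nmsRadius: 20,
  });
  console.log(poses);
};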
Using this method will set the value of the variable poses to an array of pose objects, like the following.
Altogether, the code sample to set up the prediction of poses in an image is as follows.
Listing 5-49. HTML code to detect poses in an image