key: cord-0467999-dplsanaj authors: Yaqub, Waheeb; Mohanty, Manoranjan; Suleiman, Basem title: Image-Hashing-Based Anomaly Detection for Privacy-Preserving Online Proctoring date: 2021-07-20 journal: nan DOI: nan sha: 012368da620e0011584b5a4b500851684d4695f9 doc_id: 467999 cord_uid: dplsanaj

Online proctoring has become a necessity in online teaching. Video-based crowd-sourced online proctoring solutions are in use, in which an exam-taking student's video is monitored by third parties, leading to privacy concerns. In this paper, we propose a privacy-preserving online proctoring system. The proposed image-hashing-based system can detect the student's excessive face and body movement (i.e., anomalies) that results when the student tries to cheat in the exam. The detection can be done even if the student's face is blurred or masked in the video frames. Experiments with an in-house dataset show the usability of the proposed system.

Online teaching has increased during the COVID-19 pandemic. Many universities (and other educational organizations) are planning to keep the online or blended teaching mode even after the pandemic. Online teaching can involve online exams, which can require live online proctoring. In online proctoring, a student can appear for the exam by switching on the front camera (e.g., web camera) of her device (which will capture the live video), and a proctor can identify any cheating attempt by monitoring the student's live exam-taking video. Proctoring a class of hundreds of students on a computer screen by a few university staff members, however, is both tedious and error-prone (unlike in-class proctoring). Involving a large number of university staff members also seems infeasible, as there can be multiple parallel exams. Recruiting a large number of permanent staff members is expensive for the university, especially when an exam occurs only once or twice in a semester. A possible solution is to outsource the proctoring task to a third-party company.
The company (which can be in a different country) can employ an adequate number of people (e.g., by crowd-sourcing the proctoring task) for monitoring the students (as shown in Figure 1). Although such an arrangement can address the above issues to some extent, privacy is a major concern [2] [3]. The availability of the students' videos (which contain faces and other personal information) to a number of casual individuals can lead to serious privacy consequences (e.g., leaking a video on a public social media platform, impersonating young students, etc.). Note that this privacy issue does not go away even if the university hires a number of casual employees or crowd-sources the proctoring task instead of outsourcing it to the company [1] [3].

Fig. 1. ProctorU, one of the widely used proctoring systems, where proctors monitor online exam takers. The image has been taken from [1].

In this paper, we propose a scheme that can help in addressing the privacy issue in online proctoring. In this scheme, a student's face is either blurred or masked in her video to protect privacy. The blurring or masking is done in such a way that the student's excessive face and body movements can still be detected. We believe that such excessive face and body movement results when the student looks away from the computer screen, potentially to get help in the exam (i.e., trying to cheat). We call such excessive face and body movements anomalies (with respect to normal exam-taking behaviour). For detecting these anomalies in a video containing blurred or masked faces, image hashing is used. Frames of the exam-taking video are hashed and compared to each other. When the difference between two hash values is above a threshold, it is assumed that the student has deviated from her normal exam-taking pose in an attempt to cheat. Such detected anomalies are then reported to humans, who reconfirm any suspicious activities by looking at the blurred or masked videos.
Both the tool-based anomaly detection and the human reconfirmation happen at the third-party company's end. The company will then report such suspicious activities to the university, which can take further action. The university can access the plaintext video of the suspicious exam-taking activity.

One of the major challenges of the proposed idea is the lack of a public exam-taking video dataset. In this paper, we use an in-house dataset of five exam-taking and cheating-attempting videos. Experimental results show that the proposed scheme has high precision and a low computation cost. The proposed work can be considered an initial work (which, to the best of our knowledge, is the first such work) addressing a new and practical research problem (i.e., privacy-preserving online proctoring). Further research is required to improve the results, e.g., using a large exam-taking video dataset.

The rest of this paper is organized as follows. Section II discusses related work. In Section III, we provide an overview of the proposed method. Section IV explains face hiding, and Section V explains anomaly detection. In Section VI, we provide experimental results. Section VII concludes and discusses future work.

In the past, various cheating detection methods for online exams have been proposed. Some older approaches to cheating detection analyse students' answer sheets to find similarities [4] [5] [6] or consider the exam taker's previous academic performance [7]. Although these methods are sometimes useful [7], they can be easily challenged as they do not video record the students. Recently, machine-learning-based online proctoring has been explored [8] - [10]. These methods can detect cheating by monitoring a student's video from the other side of the machine. Online proctoring methods generally fall under three categories: live proctoring, recorded proctoring, and automated proctoring [11].
Live proctoring is a real-time proctoring system where a human proctor monitors the exam-taking student through live video, similar to the one shown in Figure 1. In live proctoring, minimal information about the exam-taking student is recorded. Recorded proctoring, on the other hand, records the video and other log data of the exam taker. Later, this information is manually analysed to detect cheating. Finally, automated proctoring aims to minimize human involvement by replacing the human with an automated system. Automated proctoring is based on complex machine learning algorithms that can detect anomalies with the help of cloud-based high-performance computers. Humans simply rely on the outcomes of the machine learning algorithms [2]. Automated proctoring is touted as the future for its cost-effectiveness and scalability.

The list of private information that an exam taker gives up can be unanticipated and intrusive. Most of the above solutions have been built without considering students' privacy. Some examples of student information that can be gathered during online exams are: audio, video (including 360° panning video), shared screen, keyboard strokes, etc. [11]. To the best of our knowledge, students' privacy has not yet been addressed in an online proctoring system. In this paper, we show that proctoring is possible in a privacy-preserving manner.

Figure 2 shows the overall architecture of the proposed system. There are four main players: a student, a trusted entity, an honest-but-curious third-party proctor, and a trusted university staff member. We assume that the trusted entity can access the student's information, such as exam videos, photos, and ID cards, in plaintext. This entity can be present either at the student's end (such as a trustzone in the student's computing device) or at the university's end (such as a highly secure dedicated machine). The communication between the student and the trusted entity is assumed to be secured.
The third-party proctor is tasked with the heavyweight proctoring work. It is assumed that this entity will perform its task honestly but can be curious about the student's information, hence leading to privacy concerns. This entity can be either a third-party company hired by the university or an ad hoc university department that has mostly crowd-sourced the proctoring activities. The trusted university staff member can be a permanent employee who can access the student's information in plaintext. This staff member can be part of a small team (e.g., the team dealing with cheating) that will access the information of only those students flagged by the third-party proctor.

The system works as follows. A student will be asked to switch on her webcam or selfie camera when she is taking the exam. Her live exam-taking video will be sent to the trusted entity in plaintext (Step 1). At the beginning, the trusted entity will verify the student's ID. We assume that ID verification can be done by implementing an appropriate technique, such as OCR [12] [13]. The main job of the trusted entity is to hide the facial information of the student in order to minimize privacy leaks. The face-hidden video will then be sent to the third-party proctor (Step 2). The third-party proctor will run anomaly detection on the face-hidden video to detect potential cheating. This anomaly detection tool will serve as a triaging tool because of its lightweight nature. Students flagged by this tool will then go through another round of manual checking at the third-party proctor's end (by viewing the clipped anonymised video, as shown in Figure 7). Those confirmed by the third-party proctor will be reported to the trusted university staff (Step 3). The university staff will finally get the clipped plaintext video from the trusted entity (Step 4) and take any further action. In the following sections, the details of the face hiding module and the anomaly detection module are discussed.
These modules were developed using the following datasets.

1) In-house dataset: The lack of a public exam-taking and cheating-attempting video dataset was one of the challenges of this project. Thus, we created an in-house dataset of three videos taken from five different participants of Asian origin, aged 25 to 30 years. Each video is roughly two to three minutes of a mock-up exam, with a resolution of 1280 × 720 and 25 to 30 FPS (frames per second). Cheating was attempted by creating an anomaly as shown in Figure 3. We assume that in a normal exam-taking condition, a student will look at the computer screen in front of her or at the keyboard. Any other movements or gestures of the upper body can be an anomaly. Figure 3 shows various poses a student can take when sitting the exam. The Front Face pose represents the normal exam-taking behaviour. The other poses show typical student behaviour when she is trying to cheat, e.g., when peeking at a cheat sheet or flipping through a book.

2) Public datasets: Because the in-house dataset is small, we reconfirmed the results in Section IV using the following public image datasets: HELEN [14], UTKFace [15], CelebA [16], RF [17], and LFW [18]. The public datasets helped to assess the privacy leak from anonymised faces.

The Face Hiding module hides the student's facial information in a video to minimize privacy leaks. We have used blurring or masking to hide the face (Figure 4). In the hidden face, the eyes are not hidden, in order to facilitate anomaly detection (i.e., cheating detection), as shown in Figure 7. First, the face and eyes are detected, and then blurring or masking is applied. A machine-learning-based approach has been used to find the face and eyes in a given exam-taking video sequentially (frame by frame). A number of pre-trained machine-learning models are available for detecting a face and an eye in a frame (i.e., an image).
We compared the Dlib and MediaPipe models, as they do not require high-performance computers and give the highest detection rate (based on our experiment with the in-house dataset, as discussed below). They were run on a local Windows 10 computer with 16 GB RAM and an i7-10710U CPU. Face detection and eye detection results were compared individually. The in-house dataset was used first. Then the HELEN dataset was used, as its images have a resolution similar to that of the video frames.

1) Results using in-house dataset:

Face detection result: The performance of all the models was compared using the metric

$$\text{detection rate} = \frac{\text{Number of faces detected}}{\text{Actual number of faces}} \times 100.$$

For this, the frames of the test videos were first manually labeled as face and no face to obtain the ground truth (which is called the Actual number of faces in the formula).

Eye detection result: The eye detection rates were also obtained by first labeling the videos and then running the models. The detection rate was computed as

$$\text{detection rate} = \frac{\text{Number of frames with correct eyes detection}}{\text{Number of frames labeled as ``face''}}.$$

The detection results of the models are given in Table I. We then performed our experiment using the HELEN dataset. Note that for the MediaPipe model, the eye detection rate differs from the face detection rate (unlike Dlib). This is because the MediaPipe model detects the face and the eyes independently.

2) Results using HELEN dataset: Table I shows the detection results of both models individually and also when they are combined (hybrid mode). Based on the results, the hybrid mode was chosen. First, MediaPipe is used because of its higher FPS rate. If MediaPipe fails, then Dlib is used. Based on our experiment, we found that this arrangement can detect the face and eyes with very high precision. Thus, if no face and eyes were found, we assumed that the input frame did not contain a face and eyes. In that case, the face and eyes of the previous frame were reused.
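The hybrid arrangement and the face detection rate described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `HybridDetector`, the `Box` and `Detection` types, and the idea of wrapping each model as a plain callable that returns `None` on failure are assumptions made only for illustration. The actual MediaPipe and Dlib calls are deliberately left out.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical types: a box is (x, y, width, height); a detection is
# one face box plus the eye boxes found inside it.
Box = Tuple[int, int, int, int]
Detection = Tuple[Box, Sequence[Box]]
Detector = Callable[[object], Optional[Detection]]


class HybridDetector:
    """Try a fast detector first, then a fallback, then reuse the last result."""

    def __init__(self, primary: Detector, fallback: Detector) -> None:
        self.primary = primary    # e.g. a wrapper around MediaPipe (assumed)
        self.fallback = fallback  # e.g. a wrapper around Dlib (assumed)
        self.previous: Optional[Detection] = None

    def detect(self, frame: object) -> Optional[Detection]:
        result = self.primary(frame)
        if result is None:
            result = self.fallback(frame)
        if result is None:
            # Neither model found a face: reuse the previous frame's boxes.
            return self.previous
        self.previous = result
        return result


def detection_rate(num_detected: int, num_actual: int) -> float:
    """Face detection rate in percent, as in the first formula above."""
    if num_actual <= 0:
        raise ValueError("num_actual must be positive")
    return 100.0 * num_detected / num_actual
```

Keeping the detectors behind a common callable interface means the fallback order can be changed without touching the rest of the pipeline. Until a first face has been found, `detect` returns `None`, which a caller would have to handle.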
As shown in Figure 5, bounding boxes representing the detected face and eyes were output to the next step. After the final schema of the privacy-preserving part and the blur level are decided, the effect of the privacy-preserving methods will be evaluated using public datasets.

Either blurring or masking is used to hide the facial information except the eyes. Blurring is a widely used method [19]. Therefore, although we found that masking preserves privacy better than blurring (as shown in Table II) and that the computation cost of blurring was over a hundred times that of masking (63 and 28,852 FPS, respectively), we have also conducted experiments with blurring for the sake of completeness. Gaussian blurring with blur level 30 has been used in this paper. A single white mask (Mask without Eyes in Figure 6) has been considered for masking the frames of all videos. For this, 3D face landmarks (key facial features) are first detected (MediaPipe returns 468 3D face landmarks and Dlib returns 68 face landmarks), and then these landmarks are blocked using point size 26. Note that different masks can also be used for different students, and the mask can be personalised based on user preferences (e.g., using a Superman mask for one student and a Hulk mask for another student, as shown in Figure 6).

The objective of the anomaly detection module is to detect anomalies even when the student's face is blurred or masked. An image-hashing-based anomaly detection method has been used. Traditionally, image hashing [20], or perceptual hashing, is used for finding similar images. When the difference between the hash values of two images is less than a threshold, the images are considered similar. In this paper, we have used this property to find whether all the frames of an exam-taking video are similar or some frames differ from the others. When the frames are similar, they supposedly have the same pose position.
If different, they show at least two different pose positions, and hence the possibility of an anomaly. The hash comparison is done by comparing each subsequent frame with an anchor frame. The Hamming distance [21] is used to measure the difference between the hash values. A previous study has found that the dHash hashing method is faster and has a better detection rate than the other two popular hashing methods, aHash and pHash [22]. Our experiment with public datasets also confirmed this claim. We have therefore used the dHash hashing method with hash size 12.

1) Determining threshold: Setting the threshold is tricky, as the hash difference depends on the exam-taking environment. For example, if a student is closer to the camera, the hash difference will be larger than the hash difference obtained when the student is far from the camera. Thus, we use a tailored threshold for each exam-taking session, which is computed from the student's sitting pattern at the beginning of the exam. At the beginning, the student's sitting pattern is recorded through her sitting position and eye interaction with the computer. The correct sitting position is set by showing a real-time camera view to the student and asking the student to adjust the sitting position accordingly (Figure 8). A method similar to the one proposed by Krafka et al. [23] is then used to record the eye interaction reliably. The core idea of Krafka et al.'s method is to show a random symbol at several screen positions and ensure that the student gazed at them. For this, we show the student a crosshair and circle pair at random screen positions (Figure 9). The circle shrinks to the center of the crosshair to guide the student where to look. Once the circle has shrunk to the minimum size, a random arrow out of (↑, ↓, →, ←) shows up for 0.5 seconds. During these 0.5 seconds, the computer camera takes a photo of the student.
Then the student is asked to press the corresponding direction key on the keyboard to save the photo (after the 0.5 seconds). If the student responds correctly, the photo and the corresponding symbol coordinates are saved, and the crosshair and circle pair is shown at a new screen position. Otherwise, the pair is shown at the same screen position. We assume that the above sitting pattern covers all interactions a student will have with the computer when sitting in her normal exam-taking pose. The collected photos, therefore, are used to obtain the threshold. The difference between the hash value of one photo and the hash values of all other photos is obtained, and the maximum hash difference is set as the threshold.

2) Selecting the anchor frame: A simple approach is to set the first frame of the exam-taking video as the anchor frame and compare all other frames with it. As illustrated in Figure 10, however, this approach does not work well when the student changes her sitting position during the exam. In this example, the student has two sitting positions: close to the camera and far from the camera (both shown using yellow arrows). Initially, the student was closer to the camera. The hash difference in this case is a bit lower. But when the student sits a bit farther from the camera, the hash difference becomes significantly higher (highlighted using a yellow box in the graph). Such a significant change in hash values can lead to poor performance. This issue can be addressed by either adjusting the threshold or changing the anchor frame when the student changes her position. Adjusting the threshold would require the student to go through the above calibration process once again in the middle of the exam (which is clearly very user-unfriendly). Changing the anchor frame has no such requirement. Thus, we used this method. In this case, the threshold found from the initial calibration is used throughout the exam session.
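The hashing, Hamming-distance comparison, and calibration-based threshold described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' code: frames are assumed to be grayscale images given as rows of intensities, the block-averaging `shrink` helper stands in for a library resize, and the threshold is read as the maximum over all pairs of calibration photos (the text could also be read as one reference photo against the rest). All function names are hypothetical.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # grayscale image as rows of intensities


def shrink(image: Gray, width: int, height: int) -> List[List[float]]:
    """Downsample by averaging blocks (a stand-in for a library resize)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(height):
        r0 = r * rows // height
        r1 = max((r + 1) * rows // height, r0 + 1)
        line = []
        for c in range(width):
            c0 = c * cols // width
            c1 = max((c + 1) * cols // width, c0 + 1)
            block = [image[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            line.append(sum(block) / len(block))
        out.append(line)
    return out


def dhash(image: Gray, hash_size: int = 12) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    small = shrink(image, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits  # hash_size * hash_size bits (144 for hash size 12)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def calibration_threshold(hashes: Sequence[int]) -> int:
    """Maximum pairwise Hamming distance among the calibration photos."""
    if len(hashes) < 2:
        raise ValueError("need at least two calibration photos")
    return max(hamming(a, b)
               for i, a in enumerate(hashes) for b in hashes[i + 1:])


def is_anomaly(frame_hash: int, anchor_hash: int, threshold: int) -> bool:
    """Flag a frame whose distance to the anchor exceeds the threshold."""
    return hamming(frame_hash, anchor_hash) > threshold
```

In use, each calibration photo would be hashed once with `dhash`, the hashes passed to `calibration_threshold`, and every later frame checked against the current anchor with `is_anomaly`.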
Ideally, the anchor frame can be changed when the student changes her sitting position. A changed position can be identified by analysing the student's face in consecutive frames of the video. When a student sits closer to the screen, her face will be bigger in the frame than when she is far from the screen. In this paper, however, we use a different approach for changing the anchor frame. We change the anchor frame whenever potential normal behaviour is found after potential anomalous behaviour. The normal behaviour can be found by analysing the plot in Figure 10. In this plot, the valley points represent the potential normal behaviours. This is because the student will look at the computer screen (for typing the result) even after looking away from it for some time (e.g., when reading a book), therefore creating the valley points. As part of the analysis, we first apply the widely used Savitzky-Golay filter [24] to smooth the signal presented in the plot (i.e., to remove noise). The smoothed plot is then inverted (as shown in Figure 11 (a)). The frames corresponding to the peak points of the inverted plot (shown as green dots in the plot) are considered as the anchor frames. Each time a new green dot is found, the anchor frame is reset to the frame corresponding to this dot, and all future hash difference calculations are done using the new anchor frame. Figure 11 (b) shows the plot using such anchor frames.

The experiment was done using an in-house test dataset of three exam-taking videos collected from three different participants. Each video was two to three minutes long. Detailed instructions on how to emulate an exam and attempt cheating were provided to each participant. Each frame of a video was then manually labeled as either a normal pose or an anomaly pose. MediaPipe and Dlib-based face and eye detection, Gaussian blurring and single-white masking, and dHashing-based image hashing with hash size 12 were used.
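The anchor-frame update of Section V (smooth the hash-difference signal, invert it, and take the peaks of the inverted signal as new anchors) can be sketched as below. This is an illustrative reading, not the authors' implementation. It assumes a fixed 5-point quadratic Savitzky-Golay kernel with coefficients (-3, 12, 17, 12, -3)/35 and edge padding by repetition; the window length and polynomial order actually used are not stated in the paper. A SciPy-based version would more likely rely on `scipy.signal.savgol_filter` and `scipy.signal.find_peaks`. All function names are hypothetical.

```python
from typing import List, Sequence


def savgol5(signal: Sequence[float]) -> List[float]:
    """5-point quadratic Savitzky-Golay smoothing with repeated-edge padding."""
    if not signal:
        return []
    coeffs = (-3, 12, 17, 12, -3)  # standard 5-point quadratic kernel, sum 35
    padded = [signal[0]] * 2 + list(signal) + [signal[-1]] * 2
    return [sum(c * padded[i + k] for k, c in enumerate(coeffs)) / 35
            for i in range(len(signal))]


def anchor_candidates(hash_diffs: Sequence[float]) -> List[int]:
    """Frame indices at peaks of the inverted, smoothed hash-difference signal.

    Peaks of the inverted signal are valleys of the original one, i.e.
    moments where the student has probably returned to a normal pose.
    """
    inverted = [-v for v in savgol5(hash_diffs)]
    return [i for i in range(1, len(inverted) - 1)
            if inverted[i - 1] < inverted[i] >= inverted[i + 1]]


def diffs_with_moving_anchor(hashes: Sequence[int],
                             anchors: Sequence[int]) -> List[int]:
    """Hamming distance of each frame hash to the most recent anchor frame."""
    anchor_set = set(anchors)
    anchor = hashes[0]  # start from the first frame, as in the simple approach
    out = []
    for i, h in enumerate(hashes):
        out.append(bin(h ^ anchor).count("1"))
        if i in anchor_set:
            anchor = h  # later frames are compared against this frame
    return out
```

The sketch processes a finished signal in one pass. A live system would have to detect each valley with a short delay, since a local extremum can only be confirmed once the following frames are available.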
The experiment was done on a local Windows 10 computer with 16 GB RAM and an i7-10710U CPU. Each video frame was first processed by the MediaPipe and Dlib module (for detecting the face and eyes), then by the blurring- or masking-based face hiding module, and finally by the dHashing-based image hashing module. The obtained anomaly results were compared with the ground truth. Table III shows the experimental results. The results differ across participants, as they acted differently when creating the same anomaly behaviour. For example, when asked to look left, one looked a bit further left than the other. The low recall rate is due to the fact that only looking at the screen was considered normal behaviour, and even looking at the edge of the screen was labeled as an anomaly.

The proposed model has low computation overhead. The average running speeds of the blurring-based and masking-based methods are 31 and 35 FPS, respectively. As expected, the masking approach performed better, although both approaches can easily be run on normal PCs. Speed could be further improved by employing a cropping and simple scaling-based approach [25]. Moreover, the proposed module could easily be adopted in existing proctoring systems as well as in Zoom-based proctoring [26] with an added level of security [27].

VII. CONCLUSION AND FUTURE WORK

Online student proctoring is a reality in online exams. In this paper, we proposed a privacy-preserving online proctoring system using image-hashing-based anomaly detection. Experiments showed promising results. There are several ways in which this preliminary work can be further improved. The first and foremost requirement is creating a large dataset of exam-taking students. Secondly, the proposed method can be improved by exploring other privacy-preserving measures and by considering other anomalies.
[1] Online exam proctoring catches cheaters, raises concerns
[2] A systematic review on AI-based proctoring systems: Past, present and future
[3] Are you implementing a remote proctor solution this fall? Recommendations from NLN testing services
[4] Who's cheating? Mining patterns of collusion from text and events in online exams
[5] Prototype of online examination on molearn applications using text similarity to detect plagiarism
[6] Cheat-resistant multiple-choice examinations using personalization
[7] A conceptual framework for detecting cheating in online and take-home exams
[8] Cues of being watched enhance cooperation in a real-world setting
[9] Examining the effect of proctoring on online test scores
[10] Are online exams an invitation to cheat?
[11] An evaluation of online proctoring tools
[12] What is wrong with scene text recognition model comparisons? Dataset and model analysis
[13] GitHub - jaidedai/easyocr: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
[14] Interactive facial feature localization
[15] Age progression/regression by conditional adversarial autoencoder
[16] From facial parts responses to face detection: A deep learning approach
[17] Real and fake face detection
[18] Labeled faces in the wild: A database for studying face recognition in unconstrained environments
[19] Facial information protection by mosaicking faces in video dissemination
[20] Looks like it
[21] Hamming distance metric learning
[22] Kind of like that
[23] Eye tracking for everyone
[24] Numerical methods of curve fitting
[25] Towards camera identification from cropped query images
[26] Novel cost-effective internal live online proctoring solution
[27] Seamless authentication for online teaching and meeting

We would like to thank Jason, Yang, Mill, Doris, and Vivi for the implementation and testing of the algorithms on their dataset.