The documented code is available here: Kinect Body Language Analysis
Since the release of the Kinect, there have been many efforts from Microsoft, as well as various open source projects, to track human motion. Many frameworks abstract away the tracking of human movement and instead provide the developer with higher-level parameters such as hand gestures. We aim to take this a step further. Using such frameworks, we analyzed hand and leg movements, combined with the motion of the whole body, to derive a score for the user's emotions. From the results of this analysis we developed a framework which other developers can use. With this framework they have not only hand and body movement data but also the user's emotions, which they can utilize in their applications. Applications include the creation of music and art for advertisement, interactive installations, and games which either use emotions as an input event or study emotional changes as a reaction to a certain event.
The second part of the project focuses on developing an application which demonstrates the full potential of this framework. We created a desktop application which generates real-time music and art based on the user's performance, whether intentional or random. The musical scale and the saturation of the colors used are based upon the emotions of the performer.
- Go to OpenNI
- Select OpenNI Packages
- Select Stable
- Select PrimeSense Package Stable Build for Windows x86 Development Edition
- While installing, select OpenNI and NITE middleware. DO NOT check PrimeSense hardware as that driver is not for Microsoft Kinect
- Download Kinect driver from Kinect (make sure that neither Microsoft's nor any other driver for Kinect is installed on your computer) and install it
- To run the samples included with NITE, copy all .xml files from "[PrimeSense root directory]/NITE/Data" to "[PrimeSense root directory]/SensorKinect/Data"
- Create a new or open an existing Visual Studio 2010 project
- Open project properties
- Go to C/C++ -> General -> Additional Include Directories and add “[OpenNI root directory]/Include”
- Go to Linker -> General -> Additional Library Directories and add “[OpenNI root directory]/Lib”
- Go to Linker -> Input -> Additional Dependencies and add OpenNI.lib
- Your code should include XnOpenNI.h if you are using the C interface, or XnCppWrapper.h if you are using the C++ interface
- Optionally, you can use the namespace "xn", or you can reference objects using the scope operator (for example, "xn::Context context")
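As a quick sanity check of the setup above, here is a minimal sketch (assuming OpenNI 1.x and the C++ interface; this is not the project's own code) that only initializes a context and reports the status:

```cpp
// Build as a console application after adding the include/library paths
// and OpenNI.lib as described above.
#include <cstdio>
#include <XnCppWrapper.h>

int main()
{
    xn::Context context;
    XnStatus rc = context.Init();
    if (rc != XN_STATUS_OK)
    {
        printf("OpenNI initialization failed: %s\n", xnGetStatusString(rc));
        return 1;
    }
    printf("OpenNI context initialized successfully\n");
    // (explicit cleanup omitted for brevity)
    return 0;
}
```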
Installation and configuration in Visual Studio 2012 are exactly the same as in Visual Studio 2010. However, OpenNI does not let you use its library with a compiler version newer than VS 2010. This check can be overridden using the following steps:
- Within the OpenNI libraries directory, locate the file XnPlatform.h
- At the top of the file you will find the code "#if defined(_WIN32)". Beneath this you will find another condition which checks the compiler version
- Comment out that condition and you will be able to compile the project
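For illustration only, the edit looks roughly like the sketch below; this is a reconstruction, not a verbatim copy of XnPlatform.h, and the exact condition and error message may differ between OpenNI releases:

```cpp
// Illustration only - not a verbatim copy of XnPlatform.h.
// The stock header rejects compilers newer than Visual Studio 2010
// (_MSC_VER 1600); commenting that check out lets VS 2012 builds proceed.
#if defined(_WIN32)
    // #if (_MSC_VER > 1600)
    //     #error "Unsupported compiler version"   // hypothetical wording
    // #endif
#endif
```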
- Create a new project in Eclipse or Netbeans
- Add "[OpenNI root directory]/Bin/org.OpenNI.jar" and "[NITE root directory]/Bin/com.primesense.NITE.jar" to "additional libraries"
- Microsoft Windows 7 (x86)
- PrimeSense’s SensorKinect driver for Kinect
- NITE middleware for OpenNI
- OpenNI
Human beings are the most complex living organisms. Despite belonging to the mammal group, they are capable of adapting at a remarkably fast pace and of redefining aspects of their lives through various means. Over the decades, human motion has shown such diversity that many scientists are trying to analyze it and use this knowledge to benefit the human race. In this regard, there has been research into reading patterns in human motion and using them to generate something useful, such as art. This is a very promising avenue and can open doorways to more research and development.
The goals of this project are to:
- Develop a framework that gathers mood and motion data. This means that our module will:
  - Capture human motion
  - Carry out emotional analysis on that motion
  - Present the results in a well-formed, consistent manner so that the module is a breeze to use
- Write a demo application that shows the full potential of this framework
The purpose of this research paper was to analyze and recognize various human motions such as walking, jumping, etc. It uses a Hidden Markov Model to recognize motion. This method was successful at recognizing different motions from a scene, as well as recognizing the gender and mood of the actor, with 100% accuracy.
In this research paper, human motion data was gathered by infrared sensors placed at strategic locations on the human body, but I chose not to describe the details of the data gathering process, as we are using the Kinect and will already have human motion data in the form of joints.
This paper also addressed the problem of transforming one type of motion into another. They used two different approaches to implement this and both were successful in transforming a male walk into a female walk.
3D Human Action Recognition and Style Transformation Using Resilient Backpropagation Neural Networks
This paper was published by the same authors as above, but it uses Resilient Backpropagation Neural Networks instead of a Hidden Markov Model to implement the same principles.
I have read about both HMMs and neural networks at a very abstract level, but both are fairly complex, so a comparison at this time is not possible. I think we can decide on the algorithm during the implementation phase, once we know the exact form of the data that is to be analyzed.
As far as re-synthesis is concerned, I don't think we need the re-synthesis described in these two research papers. We are creating a totally different form of artifacts from our motion analysis, but gathering mood and gender can come in very handy.
Various research papers focus on generating a final color palette from which an artist chooses colors. This paper instead solves the problem of generating an optimized color scheme based on certain input colors. It relies on the Moon and Spencer color harmony model (P. Moon, D.E. Spencer. "Aesthetic measure applied to color harmony." Journal of the Optical Society of America, vol. 34, Apr. 1944, pp. 234-242), which builds on Birkhoff's aesthetic measure (G.D. Birkhoff. Aesthetic Measure. Harvard University Press, Cambridge, MA, USA, 1933) and is based on psychological experiments. The authors argue that Genetic Algorithms are the best method to solve this kind of problem.
In 1928, Birkhoff formalized the notion of beauty by introducing the aesthetic measure, defined as the ratio between order and complexity. Based on this measure, Moon and Spencer proposed a quantitative model of color harmony, using color differences and an area factor based on psychological factors.
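In symbols, following the definition just given, Birkhoff's measure is simply:

```latex
M = \frac{O}{C}
```

where O is the order and C is the complexity of the composition being judged.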
Implementation is carried out in three phases. In the first phase, they read and evaluate the color image and initialize the genetic algorithm parameters. The program reads the size of the image, the number of colors and color pairs, and the area of each color (measured in pixels). The genetic algorithm parameters for this phase include string size, number of generations, population size, and mutation and crossover rates.
In the second phase, the aesthetic score of each possible solution is evaluated. This determines the possibility of survival and reproduction of each solution in the following generations.
Phase 3 is population generation. For each generation, three populations (parent, child and combined) are created. The best solutions in the combined population, regardless of their origin, are retained and passed to the following generation as the parent population.
In the experiment conducted, it took them 55 seconds to read an image and search for 6 unique optimized solutions.
There are also basic rules available for creating color combinations. For example, we could hard-code a certain color palette into our program and round off each color read in the frame to the nearest color in that palette. This can easily be done in real time. While the solution described in the research paper above is optimal, I think we are going to have a problem implementing it in real time.
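As a rough illustration of that simpler alternative (a sketch only, not the project's actual implementation; the palette values and function names are made up), rounding each pixel to the nearest entry of a hard-coded palette is a plain nearest-neighbour search:

```cpp
struct Rgb { unsigned char r, g, b; };

// Hypothetical hard-coded palette; a real application would tune these colors.
static const Rgb kPalette[] = {
    {230, 25, 75}, {60, 180, 75}, {255, 225, 25},
    {0, 130, 200}, {145, 30, 180}, {128, 128, 128}
};
static const int kPaletteSize = sizeof(kPalette) / sizeof(kPalette[0]);

// Squared Euclidean distance in RGB space (a perceptual space such as CIELAB
// would match the eye better, but this is enough for a rough rounding).
static int distanceSq(const Rgb& a, const Rgb& b)
{
    int dr = (int)a.r - b.r, dg = (int)a.g - b.g, db = (int)a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

// Round a frame color to the closest palette entry.
Rgb roundToPalette(const Rgb& c)
{
    int bestIndex = 0;
    int bestDist = distanceSq(c, kPalette[0]);
    for (int i = 1; i < kPaletteSize; ++i)
    {
        int d = distanceSq(c, kPalette[i]);
        if (d < bestDist) { bestDist = d; bestIndex = i; }
    }
    return kPalette[bestIndex];
}
```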
Affect is described as feeling or emotion. Affective states refer to the different states of feelings or emotions.
Emotion and mood are both types of affective states, but emotion is focused whereas mood is unfocused or diffused.
Arousal is defined as the level of energy a person possesses while displaying a certain emotion. Valence describes how positive or negative the stimulus causing the emotion is. Stance describes how approachable the stimulus is.
Together, these three terms form a model for the quantitative analysis of emotion depicted by body language.
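One way to carry this model around in code (a sketch only; the struct name and the [-1, 1] ranges are our own illustrative convention, not taken from a specific paper) is a small record per observation:

```cpp
// Quantitative emotion descriptor following the arousal/valence/stance
// model described above.
struct AffectiveState
{
    float arousal;  // energy level: -1 = calm ... +1 = highly energetic
    float valence;  // -1 = negative stimulus ... +1 = positive stimulus
    float stance;   // -1 = unapproachable ... +1 = approachable
};
```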
A gesture is a movement of the body that contains information.
The Geneva Multimodal Emotion Portrayals (GEMEP[1]) is a collection of audio and video recordings featuring 10 actors portraying 18 affective states, with different verbal contents and different modes of expression. It was created in Geneva by Klaus Scherer and Tanja Bänziger, in the framework of a project funded by the Swiss National Science Foundation (FNRS 101411-100367) and with support of the European Network of Excellence "Humaine" (IST-2002-2.3.1.6 Multimodal Interfaces, Contract no. 507422). Rating studies and objective behavioral analyses are also currently funded by Project 2 and the Methods module of the Swiss Affective Science Center Grant (FNRS).
There has been thorough research on emotion analysis from facial expressions, but little or no research on body language analysis. It is known that body expressions are as powerful as facial expressions when it comes to emotion analysis. In any social interaction, body language, along with facial features, communicates the mood and emotions of the person.
In chronic pain rehabilitation, specific movements and postural patterns (called "guarding behavior") inform clinicians about the emotional conflict experienced by patients, which affects their ability to relax. If doctors have a way of knowing that a person's emotional state is holding back his or her therapy, they can formulate better ways to treat such patients.
Students lose motivation when high levels of affective states such as frustration, anxiety or fear are experienced. If systems are developed which can read the body language of all students present in a class, they can point out when the teacher needs to change his or her tactics.
The whole point of this research is to answer two questions:
- What bodily information is necessary for recognizing the emotional state of a person?
- Can specific features of the body be identified which contribute to specific emotions in a person?
The Role of Spatial and Temporal Information in Biological Motion Perception.pdf
In the experiment conducted, 9 human walkers were fitted with point-lights on all major joints. Their movements, both to the left and to the right, were recorded. From these movements, 100 static images were extracted based on the following four configurations:
- Normal spatial and temporal
- Scrambled spatial and normal temporal
- Normal spatial and scrambled temporal
- Scrambled spatial and temporal
The experiment was conducted by both an algorithm and a human subject, in two further stages. Stage 1 analyzed the spatial structure of the frame by matching it against templates of body shapes. Stage 2 analyzed the temporal arrangement. The task was to find the facing direction (form) of the point-light body and the movement (motion) direction of the body.
The results show that form can be recognized when the temporal data is scrambled and the spatial data is intact, but movement cannot be analyzed when either type of data is scrambled.
Evidence for Distinct Contributions of Form and Motion Information to the Recognition of Emotions from Body Gestures.pdf
The research concluded that motion signals alone are sufficient for recognizing basic emotions.
This paper studied the recognition of a musician's emotional intentions from their body movements. The results indicate that:
- Happiness, sadness and anger are well communicated but fear was not
- Anger is indicated by large, fairly fast and jerky movement
- Sadness by fluid and slow movements
- But expressions of the same emotion varied greatly depending upon the instrument played
While playing piano, movement is related to both the musical score that is being played as well as emotional intention conveyed. In the experiment conducted the pianist was asked to play the same musical score with different emotional intentions. Two motion cues were studied using an automated system:
- Quantity of motion of the upper body
- Velocity of head movement
The paper states that a comprehensive account of emotional communication should consider the entire path from sender to receiver. On the sender side, emotions are expressed through appearance and behavior by means of cues which can be objectively measured. On the receiver side, these cues are processed based upon the perception of the receiver. The receiver's perception can be affected by many things, such as culture and his or her own mood. So, although the emotions perceived by the receiver are based on the emotions expressed by the sender, they are not necessarily equal.
This implies that a comprehensive account of emotion communication requires the inclusion of both expression and perception.
There are some distinctive patterns of movements and postural behavior associated with some of the emotions studied:
- Lifting shoulders seemed to be typical for joy and anger
- Moving shoulders forward is typical for disgust, despair and fear
A survey of various research papers concluded that head movement plays an important role in the communication, as well as the perception, of emotion.
For dance performances, the following features showed differences across four emotions (anger, fear, grief and joy):
- Overall duration
- Contraction index
- Quantity of motion
- Motion fluency
Another study indicated that the quantity of motion and the contraction index of the upper body played a major role in discriminating between different emotions.
No emotion except sadness had any impact on the quantity of motion (but this is because of the lack of movement space at the piano).
Another study indicates that quantitative analysis of body expressions is also possible. For example, it was concluded that the arm was raised 17 degrees higher for angry movements than for other emotions, and that expanded limbs and torso signify contentment and joy.
- 12 emotions expressed by 10 actors
- Visual tracking of the trajectories of the head and hands was performed from frontal and lateral views
- Postural and dynamic expressive gesture features were identified and analyzed
- Overall amount of motion captured
- The degree of contraction and expansion of the body, computed using its bounding region
- Motion fluency, computed on the basis of the magnitude of changes in the overall amount of motion over time
- Module 1 computes low-level motion features, i.e., the 3D positions and kinematics of the head and hands
- Module 2 computes a vector of higher-level expressive gesture features, including the following five sets of features:
  - Energy (passive vs. animated)
  - Spatial extent (expanded vs. contracted)
  - Smoothness and continuity of movement (gradual vs. jerky)
  - Forward-backward leaning of the head
  - Spatial symmetry and asymmetry of the hands with respect to the horizontal and vertical axes
- Module 3 reduces the dimensionality of the data, while highlighting the salient patterns in the data set
The paper also contains the details of how each of the features was computed as well as the Dimension Reduction Model.
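To make the first two feature sets concrete, here is a rough sketch of how a quantity of motion and a contraction index could be computed from tracked 3D joint positions. This is our own simplification, not the paper's exact formulas; Joint3D, the function names, and the maxReachMm normalization are illustrative assumptions.

```cpp
#include <cmath>
#include <vector>

struct Joint3D { float x, y, z; };

// Quantity of motion: total displacement of all tracked joints between two
// consecutive frames (a simplification of the silhouette-based measure used
// in the literature).
float quantityOfMotion(const std::vector<Joint3D>& prev,
                       const std::vector<Joint3D>& curr)
{
    float sum = 0.0f;
    for (size_t i = 0; i < prev.size() && i < curr.size(); ++i)
    {
        float dx = curr[i].x - prev[i].x;
        float dy = curr[i].y - prev[i].y;
        float dz = curr[i].z - prev[i].z;
        sum += std::sqrt(dx * dx + dy * dy + dz * dz);
    }
    return sum;
}

// Contraction index: how tightly the joints are gathered around their
// centroid, normalised by an assumed maximum reach (illustrative constant).
float contractionIndex(const std::vector<Joint3D>& joints, float maxReachMm)
{
    if (joints.empty()) return 0.0f;
    Joint3D c = {0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < joints.size(); ++i)
    {
        c.x += joints[i].x; c.y += joints[i].y; c.z += joints[i].z;
    }
    c.x /= joints.size(); c.y /= joints.size(); c.z /= joints.size();

    float meanDist = 0.0f;
    for (size_t i = 0; i < joints.size(); ++i)
    {
        float dx = joints[i].x - c.x;
        float dy = joints[i].y - c.y;
        float dz = joints[i].z - c.z;
        meanDist += std::sqrt(dx * dx + dy * dy + dz * dz);
    }
    meanDist /= joints.size();
    return 1.0f - meanDist / maxReachMm;  // 1 = fully contracted, 0 = spread out
}
```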
The results suggest that using the upper body only would be sufficient for classifying a large amount of affective behavior.
This portion of the document describes the Kinect hardware and the various software frameworks to be used with the Kinect. Microsoft has a well-documented Kinect SDK (for Windows only), but third-party SDKs, drivers and frameworks are also available. I propose that we use OpenNI coupled with NITE (both are explained below) instead of Microsoft's SDK, as they are open source (you need to purchase a license to use Microsoft's Kinect SDK for commercial purposes) and can easily be ported to Mac and Linux.
The Kinect sensor includes:
- RGB camera
- Depth sensor
- Multi-array microphones
- Tilt motor
- Three-axis accelerometer
The Kinect’s depth sensor consists of an infrared light source, a laser that projects a pattern of dots which are read back by a monochrome CMOS IR sensor. The sensor detects reflected segments of the dot pattern and converts their intensities into distances. The resolution of the depth dimension (z-axis) is about one centimeter, while the spatial resolution (x- and y-axes) is in millimeters. Each frame generated by the depth sensor is at VGA resolution (640 x 480 pixels), containing 11-bit depth values, which provides 2,048 levels of sensitivity. The output stream runs at a frame rate of 30 Hz.
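For illustration (a sketch assuming OpenNI 1.x with the SensorKinect driver, not tied to any particular application), a single 640 x 480 frame of depth values can be read like this:

```cpp
#include <cstdio>
#include <XnCppWrapper.h>

int main()
{
    xn::Context context;
    if (context.Init() != XN_STATUS_OK) return 1;

    xn::DepthGenerator depth;
    if (depth.Create(context) != XN_STATUS_OK) return 1;
    if (context.StartGeneratingAll() != XN_STATUS_OK) return 1;

    // Block until a new depth frame arrives, then inspect the center pixel.
    context.WaitOneUpdateAll(depth);

    xn::DepthMetaData md;
    depth.GetMetaData(md);
    XnDepthPixel center = md(md.XRes() / 2, md.YRes() / 2);
    printf("%ux%u frame, distance at center: %u mm\n",
           (unsigned)md.XRes(), (unsigned)md.YRes(), (unsigned)center);

    context.StopGeneratingAll();
    return 0;
}
```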
The RGB video stream also utilizes VGA resolution and a 30 Hz frame rate.
The audio array consists of four microphones, with each channel processing 16-bit audio at a sampling rate of 16 kHz. The hardware includes ambient noise suppression.
Microsoft suggests that you allow about 6 feet of empty space between yourself and the sensor; otherwise you can confuse the sensor.
There are four main Kinect development libraries:
- OpenKinect’s libfreenect
- CLNUI
- OpenNI
- Microsoft’s Kinect for Windows
Libfreenect is derived from a reverse-engineered Kinect driver and works across all OS platforms. The OpenKinect Analysis library communicates with the OpenKinect API and turns the raw information into more useful abstractions. They aim to implement the following features, but most of them have not been implemented yet:
- Hand tracking
- Skeleton tracking
- Other depth processing
- 3D audio isolation coning?
- Audio cancellation (perhaps a driver feature?)
- Point cloud generation
- Inertial movement tracking with the built-in accelerometer or an attached WiiMote
- 3D reconstruction
- GPU acceleration for any of the above
CLNUI is aimed at Windows only but allows multiple Kinects to work together.
OpenNI is a software framework and API that provides support for:
- Voice and voice command recognition
- Hand gestures
- Body Motion Tracking
OpenNI is a multi-language, cross-platform framework that defines APIs for writing applications utilizing Natural Interaction. The main advantage of this framework is that you write software independent of the hardware. For example, we can write a human motion analysis program which analyzes motion using the Kinect if it is available, and falls back to a regular camera if it is not.
Sensor modules that are currently supported are:
- 3D sensor
- RGB camera
- IR camera
- A microphone or an array of microphones
Middleware components that are supported are:
- Full body analysis middleware: a software component that processes sensory data and generates body-related information (typically a data structure that describes joints, orientation, center of mass, and so on; see the sketch after this list)
- Hand point analysis middleware: a software component that processes sensory data and generates the location of a hand point
- Gesture detection middleware: a software component that identifies predefined gestures (for example, a waving hand) and alerts the application
- Scene analyzer middleware: a software component that analyzes the image of the scene in order to produce information such as:
  - The separation between the foreground of the scene and the background
  - The coordinates of the floor plane
  - The individual identification of figures in the scene (and output of the current location and orientation of the joints of each figure)
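As a rough sketch of how the full body analysis middleware is reached through OpenNI's C++ interface (assuming NITE is installed and a user is already calibrated and tracked), the polling function below is illustrative; the calibration and callback wiring (SetSkeletonProfile, RequestCalibration, StartTracking) is omitted for brevity:

```cpp
#include <cstdio>
#include <XnCppWrapper.h>

// Print the head position of every currently tracked user. Real code would
// first register new-user/calibration callbacks on the UserGenerator and
// start tracking each user once calibration succeeds.
void printHeadPosition(xn::UserGenerator& userGen)
{
    XnUserID users[8];
    XnUInt16 count = 8;
    userGen.GetUsers(users, count);

    for (XnUInt16 i = 0; i < count; ++i)
    {
        if (!userGen.GetSkeletonCap().IsTracking(users[i]))
            continue;

        XnSkeletonJointPosition head;
        userGen.GetSkeletonCap().GetSkeletonJointPosition(
            users[i], XN_SKEL_HEAD, head);
        if (head.fConfidence > 0.5f)
        {
            printf("User %u head at (%.0f, %.0f, %.0f) mm\n", (unsigned)users[i],
                   head.position.X, head.position.Y, head.position.Z);
        }
    }
}
```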
An important reason for using OpenNI is its support for middleware. The NITE library interprets different hand movements as gesture types based on how hand points change over time (a small usage sketch follows the list below). NITE gestures include:
- Pushing
- Swiping
- Holding steady
- Waving
- Hand circling
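For example, OpenNI's gesture generator can be asked to watch for one of these NITE gestures. This is a sketch only; the function and callback names are our own, and the callback bodies would be application-specific:

```cpp
#include <cstdio>
#include <XnCppWrapper.h>

// Called when NITE recognizes the requested gesture.
static void XN_CALLBACK_TYPE onGestureRecognized(
    xn::GestureGenerator& /*generator*/, const XnChar* strGesture,
    const XnPoint3D* /*pIDPosition*/, const XnPoint3D* pEndPosition,
    void* /*pCookie*/)
{
    printf("Gesture '%s' at (%.0f, %.0f, %.0f)\n", strGesture,
           pEndPosition->X, pEndPosition->Y, pEndPosition->Z);
}

// Called while a gesture is still in progress; unused in this sketch.
static void XN_CALLBACK_TYPE onGestureProgress(
    xn::GestureGenerator& /*generator*/, const XnChar* /*strGesture*/,
    const XnPoint3D* /*pPosition*/, XnFloat /*fProgress*/, void* /*pCookie*/)
{
}

XnStatus watchForWave(xn::Context& context, xn::GestureGenerator& gestureGen)
{
    XnStatus rc = gestureGen.Create(context);
    if (rc != XN_STATUS_OK) return rc;

    XnCallbackHandle handle;
    gestureGen.RegisterGestureCallbacks(onGestureRecognized,
                                        onGestureProgress, NULL, handle);
    // "Wave" is one of the gesture names exposed by the NITE middleware.
    return gestureGen.AddGesture("Wave", NULL);
}
```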
PrimeSense’s NITE middleware allows computers or digital devices to perceive the world in 3D. NITE comprehends your movements and interactions within its view, translates them into application inputs, and responds to them without any wearable input.
Incorporating computer vision algorithms, NITE identifies users and tracks their movements, and provides the framework API for implementing Natural-Interaction UI controls based on gestures. NITE can detect when you want to control things with hand gestures only, or when to get your whole body involved.
Hand control: allows you to control digital devices with your bare hands and as long as you’re in control, NITE ignores what others are doing.
Full body control: lets you have a total immersive, full body video game experience. NITE middleware supports multiple users, and is designed for all types of action.
Microsoft’s Kinect SDK covers much the same ground as OpenNI. The low-level API gives access to the depth sensor, image sensor, and microphone array, while higher-level features include skeletal tracking, audio processing, and integration with the Windows speech recognition API.
The main area where the SDK wins over OpenNI is audio. Other pluses for Microsoft’s Kinect SDK are its extensive documentation and ease of installation on Windows 7. The main drawback of Microsoft’s SDK is that it only works on Windows 7, not even Windows XP. The SDK is free but limited to non-commercial purposes.
Two major emotions, happiness and anger, were taken into consideration. These two emotions were distinguished using only the upper body data. Music was generated using the motion of the user’s hands as input. The emotional data was used to map the generated music onto a musical scale so that it sounded aesthetically pleasing.
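A toy sketch of that mapping idea follows; it is our own illustration rather than the application's actual code, and the scale choices, hand-height range, and function name are assumptions:

```cpp
// Detected emotion (only the two the demo distinguishes).
enum Emotion { EMOTION_HAPPY, EMOTION_ANGRY };

// Map a hand height in millimetres (assumed range 0..2000 above the floor)
// and the current emotion to a MIDI note, snapped to a scale so that the
// output stays aesthetically pleasing.
int handToMidiNote(float handHeightMm, Emotion emotion)
{
    // Illustrative choice: a major scale when happy, natural minor when angry.
    static const int kMajor[] = {0, 2, 4, 5, 7, 9, 11};
    static const int kMinor[] = {0, 2, 3, 5, 7, 8, 10};
    const int* scale = (emotion == EMOTION_HAPPY) ? kMajor : kMinor;
    const int scaleLen = 7;

    // Normalise the height to two octaves' worth of scale degrees.
    float t = handHeightMm / 2000.0f;
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    int degree = (int)(t * (2 * scaleLen - 1));

    const int base = 60;  // middle C
    return base + 12 * (degree / scaleLen) + scale[degree % scaleLen];
}
```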
While body language is a crucial part of emotion detection, whether by another human or by a computer, it is not complete on its own. The current state of body language depends largely on context. For example, people playing tennis would express emotions through body language differently than people playing the piano, because each situation places different constraints on the body and the subject can only move in certain directions.
Secondly, real-time generation of music that matches the emotion, and that also sounds that way to an untrained ear, still lacks accuracy.
We conclude that body language alone is not sufficient for accurately determining a person’s emotion. However, coupled with facial expression analysis and vocal analysis, these three complete the way in which emotions are perceived by human beings, and therefore body language does have the potential to improve computer-aided emotion detection.
[1] http://www.affective-sciences.org/gemep