Presenting an efficient model based on the subcombinations extracted from the feature to recognize human physical activities

Number of pages: 140 | File format: Word | File code: 31009
Year: 2014 | Degree: Master's | Category: Computer Engineering
  • Summary

    Doctoral thesis in the field of computer engineering (artificial intelligence)

    Abstract

    Understanding and extracting information from images and videos is the common thread of most machine vision problems. Finding the main, informative parts of a video and modeling the relationships between those parts is one of the central goals of video analysis. Over the last decade, human activity recognition from video has emerged as a challenging topic in machine vision, with applications in surveillance and security, medicine, and human-computer interaction. Extracting the main components of an activity and summarizing it is difficult and complex because of the great diversity in how an activity can be performed. If we take the starting point of video analysis to be the brightness of image pixels in successive frames, and the final goal to be recognizing human activity, there is a wide gap between the level of analysis and that goal, and an urgent need for meaningful, higher-level features is felt. The main challenge, in fact, is to bridge the deep gap between low-level descriptors and an expression of the type of activity and its summary. In recent decades, researchers have not been very successful in providing effective summarization methods using vision and machine learning techniques, even at the level of still images. In this regard, discriminative methods [1] have been proposed, which model the decision boundary between different classes. Despite their success, these models require a great deal of labeled data and are limited to a specific context; in addition, the risk of overfitting [2] threatens them. Generative models [3], on the other hand, mitigate this problem by adding extra constraints to the model using the large amount of available unlabeled data. As an example, we can point to unsupervised feature learning methods, which reduce the distance between low-level descriptors and the final model by injecting some basic knowledge about the overall structure of the data.
In this thesis, the problem of human activity recognition is addressed by presenting five different frameworks, with the approach of summarizing the video and extracting higher-level features. The main steps of the work fall into three parts: 1- feature extraction, 2- quantization of the features, and 3- classification. In this research, shape and motion features are extracted from the two-dimensional images of the video frames. In the second part, which is essentially the core of this research, instead of common methods such as K-means we use sparse coding methods, and some improved versions of them, which count as unsupervised feature learning methods; the aim is to reduce the quantization error, raise the level of the features (by exploiting the basic knowledge hidden in the data), and simplify classification in the later stages. In such methods, the goal is to find higher-level basis functions and describe the video as a linear combination of them. We have also used the very useful method of group sparse coding to extract the useful information in the temporal sequence. Then, to avoid overfitting the model, spatial and temporal pooling of the coefficients is proposed. Finally, activity recognition is completed using two different algorithms from the general families of generative and discriminative classifiers.
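To make the second step concrete, the sketch below shows, in plain Python with a hypothetical toy dictionary and made-up frame descriptors (not the thesis code), the general idea of replacing K-means quantization with greedy sparse coding: each frame descriptor is expressed as a sparse linear combination of dictionary atoms, and the per-frame coefficients are then pooled over time into a single video descriptor.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(x, dictionary, n_nonzero=2):
    """Greedy sparse coding: approximate x as a sparse linear
    combination of (assumed unit-norm) dictionary atoms."""
    residual = list(x)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        scores = [dot(residual, d) for d in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        coeffs[k] += scores[k]
        residual = [r - scores[k] * d for r, d in zip(residual, dictionary[k])]
    return coeffs


def max_pool(codes):
    """Temporal pooling: keep the strongest activation of each atom
    across all frames, giving one fixed-length video descriptor."""
    return [max(abs(c[j]) for c in codes) for j in range(len(codes[0]))]


# toy 3-D descriptors for three video frames (hypothetical values)
frames = [(1.0, 0.1, 0.0), (0.0, 0.9, 0.1), (0.1, 0.0, 1.0)]
# hand-built unit-norm dictionary; in the real pipeline this would be
# learned from unlabeled data by an unsupervised feature learning method
D = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

codes = [matching_pursuit(f, D) for f in frames]
video_descriptor = max_pool(codes)
```

In the actual pipeline the descriptors would be shape and motion features of the frames, and pooling over space and time (rather than simple max pooling over an identity dictionary, as here) is what reduces the dimensionality and the risk of overfitting.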

    Among the highlights of this thesis are the combination of several features with different modalities; the extraction of the meaningful components of an activity and the modeling of their relationships while taking the temporal structure of the data into account; the reduction of the quantization error; and a significant reduction in spatial and temporal complexity. The presented methods have been evaluated on several activity recognition databases, consisting of synthetic and real data with different challenges, and good results have been obtained.

    Keywords: human activity recognition, basic knowledge, data structure, multi-class system, sparse coding, group sparse coding, unsupervised feature learning.

    Chapter 1

    Introduction

    Introduction

    Understanding and analyzing images is the common thread of most machine vision problems. In this regard, and with the advancement of various machine vision techniques, scene analysis has risen above the level of single images and now analyzes video (a sequence of frames) while taking the temporal relationships between frames into account. This provides a better and more accurate understanding of the scene. Today, human activity recognition is one of the most important and interesting research topics in machine vision. The purpose of this recognition is to analyze the activities of the humans in an unknown video. In general, the analysis of human movement can be divided into three categories: 1- human activity recognition [1], 2- tracking of human movements [2], and 3- analysis of the movements of the different parts of the human body [3]. Each of these analyses can be performed on two- or three-dimensional frames. In many practical problems, after finding people in the images and tracking them, we seek to categorize their activities. Activity recognition is the process of labeling human activities, and it can be done using various sensors such as vision and sound. In this research, we use only visual observations, which can be taken from one or more cameras. The label of a specific activity is a name that most ordinary people, upon hearing it, associate with the same activity and could perform in the same way. In other words, the activity label is the best descriptor of an instance of an activity performed by different people under different conditions.

    Looking more deeply at the problem of activity recognition, it can be considered similar, from different perspectives, to some other fields of artificial intelligence such as natural language processing, text processing, and speech recognition, and it is useful to analyze it from these perspectives. For example, we can use the concepts of natural language and human speech to define and recognize an activity more precisely. Humans use sentences in their daily conversations, and each simple sentence consists of a subject, an object, and a verb. There is almost the same structure for expressing the visual concepts in a video. From this point of view, the subject, or performer of the activity, is usually a human. The object can be other people, objects, or the environment on which the subject acts. Finally, the verb indicates the type of activity or interaction between the subject and the objects. From the point of view of speech processing, just as components such as phonemes, letters, and words form a sentence, the sequence and ordering of movements together form a meaningful activity. Given these similarities, it seems that by examining the methods used in the fields mentioned, we can reach a more efficient solution to our problem.

    There are different types of human activities. Following [1], we divide activities into four levels according to their complexity:

    1- Gestures [4]: the atomic, basic movements of body parts, used to describe meaningful human motion; for example, extending the arm from the elbow, folding it, or making a fist.

    2- Human activities [5]: simple activities that may string several first-level movements together in time; in other words, a combination of atomic human movements constitutes an activity. Examples are walking or waving.

    3- Interactions [6]: in this category, two or more people, or people and objects, are involved; for example, two people fighting, or one person stealing another person's bag, which is an example of two people interacting with the same object.

    4- Group activities [7]: activities carried out by a group of people with each other or with objects; for example, a group of soldiers marching, or a group holding a meeting.

    For example, a tennis match is a human interaction. This interaction includes several activities such as serving, returning the ball, or calling a time-out. Each of these activities in turn consists of basic movements; serving, for instance, includes tossing the ball upward, drawing the racket back, swinging the racket, and striking the ball. It should be noted that the choice of primitive movements is an important and influential issue for the rest of the recognition process. For example, arm movement may not be an informative enough primitive for part of the activity of playing tennis, while it may be sufficient for the activity of drinking. Therefore, the extraction of the basic movements of an activity depends to some extent on the type of activity, and a precise definition is not completely possible.

    Applications

    The ability to recognize complex human activities has various applications, including automatic monitoring systems in public places such as airports and highways, which require detecting abnormal and suspicious movements and activities against a background of normal activities [1]. For example, in airports, detecting activities such as a person leaving a bag behind, or throwing a handbag into the trash, can be flagged as suspicious.
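The tennis example above can be sketched as a small data structure. The Python snippet below is purely illustrative (the labels and dictionary layout are hypothetical, not from the thesis): a list of gestures composes an activity, and a list of activities composes an interaction, mirroring the four-level taxonomy.

```python
# Hypothetical encoding of the four-level taxonomy: gestures compose
# activities, which in turn compose interactions and group activities.
LEVELS = ["gesture", "activity", "interaction", "group activity"]

serve = {
    "level": "activity",
    "label": "serve",
    # atomic gestures making up the serve, as described in the text
    "parts": ["toss ball upward", "draw racket back",
              "swing racket", "strike ball"],
}

tennis_match = {
    "level": "interaction",
    "label": "tennis match",
    # activities making up the interaction between the two players
    "parts": ["serve", "return ball", "time-out"],
}


def complexity(unit):
    """Rank of a unit in the taxonomy: gesture=1, ..., group activity=4."""
    return LEVELS.index(unit["level"]) + 1
```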

  • Contents & References

    List:

    1- Introduction
    1-1- Introduction
    1-2- Applications
    1-3- Challenges and features of the environment
    1-4- General definition of the problem
    2- Review of past research
    2-1- Introduction
    2-2- Single-layer methods
    2-2-1- Introduction of various space-time methods
    2-2-2- Summary and comparison of space-time methods
    2-2-3- Sequential methods
    2-2-4- Summary and comparison of sequential methods
    2-3- Multilayer (hierarchical) methods
    2-3-1- Statistical methods
    2-3-2- Syntactic methods
    2-3-3- Descriptive methods
    2-3-4- Summary and comparison of hierarchical methods
    3- Study of the tools used
    3-1- Introduction
    3-2- Tools used in feature extraction
    3-2-1- Histogram of oriented gradients
    3-2-2- Optical flow
    3-3- Tools used in learning higher-level features
    3-3-1- General pattern in unsupervised feature learning
    3-3-2- Common methods in unsupervised feature learning
    3-3-3- Empirical analysis
    3-4- Tools used in classification
    3-4-1- Hidden Markov model
    3-4-2- Support vector machine
    4- Proposed method
    4-1- Introduction
    4-2- Defining the main framework
    4-3- Steps of the work
    4-3-1- Video representation
    4-3-2- Feature extraction
    4-3-3- Quantizing words and creating a dictionary
    4-3-4- Pooling
    4-3-5- Classification
    4-4- Proposed frameworks
    4-4-1- First framework
    4-4-2- Second framework
    4-4-3- Third framework
    4-4-4- Fourth framework
    4-4-5- Fifth framework
    5- Results
    5-1- Available databases
    5-2- Setting the parameters of the problem
    5-3- Results
    6- Discussion
    6-1- Innovations and their advantages and disadvantages
    6-2- Comparison of the proposed frameworks
    6-3- Proposed future works
    6-4- Summary
    7- List of sources

     

    Source:

     

    1. J. K. Aggarwal, and M. S. Ryoo, "Human Activity Analysis: A Review", ACM Computing Surveys (CSUR), Vol. 43, No. 3, pp. 1-47, 2011.

    2. R. Poppe, "A survey on vision-based human action recognition", Image and Vision Computing, Vol. 28, pp. 976-990, 2010.

    3. M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 29, No. 12, pp. 2247-2253, 2007.

    4. A. Bobick, and J. Davis, "The recognition of human movement using temporal templates", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 23, No. 3, pp. 257-267, 2001.

    5. E. Shechtman, and M. Irani, "Space-time behavior based correlation", CVPR, 2005.

    6. Y. Ke, R. Sukthankar, and M. Hebert, "Spatio-temporal shape and flow correlation for action recognition", CVPR, 2007.

    7. M. D. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition", CVPR, 2008.

    8. Z. Li, Y. Fu, T. Huang, and S. Yan, "Real-time human action recognition by luminance field trajectory analysis", ACM International Conference on Multimedia, 2008.

    9. Y. Sheikh, M. Sheikh, and M. Shah, "Exploring the space of a human action", ICCV, 2005.

    10. A. Yilmaz, and M. Shah, "Recognizing human actions in videos acquired by uncalibrated moving cameras", ICCV, 2005.

    11. G. Johansson, "Visual perception of biological motion and a model for its analysis", Perception & Psychophysics, Vol. 14, pp. 201-211, 1973.

    12. I. Laptev, and T. Lindeberg, "On Space-Time Interest Points", International Journal of Computer Vision, Vol. 64, pp. 107-123, 2005.

    13. P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features", IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2005.

    14. A. Oikonomopoulos, I. Patras, and M. Pantic, "Spatiotemporal salient points for visual recognition of human actions", IEEE Trans. on Systems, Man and Cybernetics (SMC) - Part B: Cybernetics, Vol. 36, No. 3, pp. 710-719, 2006.

    15. S. F. Wong, and R. Cipolla, "Extracting spatiotemporal interest points using global information", ICCV, 2007.

    16. T. K. Kim, S. F. Wong, and R. Cipolla, "Tensor canonical correlation analysis for action classification", CVPR, 2007.

    17. G. Willems, T. Tuytelaars, and L. Van Gool, "An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector", ECCV, 2008.

    18. I. Laptev, and P. Perez, "Retrieving actions in movies", ICCV, 2007.

    19. W. L. Lu, and J. J. Little, "Simultaneous tracking and action recognition using the PCA-HOG descriptor", Canadian Conference on Computer and Robot Vision, 2006.

    20. P. Scovanner, S. Ali, and M. Shah, "A 3-dimensional SIFT descriptor and its application to action recognition", ACM International Conference on Multimedia, 2007.

    21. J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model", CVPR, 1992.

    22. A. Veeraraghavan, R. Chellappa, and A. Roy-Chowdhury, "The function space of an activity", CVPR, 2006.

    23. R. Lublinerman, N. Ozay, D. Zarpalas, and O. Camps, "Activity recognition from silhouettes using linear systems and model (in)validation techniques", ICPR, 2006.

    24. F. Lv, and R. Nevatia, "Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost", ECCV, 2006.

    25. B. Chakraborty, O. Rudovic, and J. Gonzalez, "View-invariant human-body detection with extension to human action recognition using component-wise HMM of body parts", International Conference on Automatic Face and Gesture Recognition, 2008.

    26. N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian computer vision system for modeling human interactions", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, pp. 831-843, 2000.

    27. S. Park, and J. K. Aggarwal, "A hierarchical Bayesian network for event recognition of human actions and interactions", Multimedia Systems, Vol. 10, No. 2, pp. 164-179, 2004.

    28. E. Yu, and J. K. Aggarwal, "Detection of fence climbing from monocular video", ICPR, 2006.

    29. Y. Shi, Y. Huang, D. Minnen, A. F. Bobick, and I. A. Essa, "Propagation networks for recognition of partially ordered sequential action", CVPR, 2006.

    30. Y. A. Ivanov, and A. F. Bobick, "Recognition of visual activities and interactions by stochastic parsing", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, pp. 852-872, 2000.

    31. D. Moore, and I. Essa, "Recognizing multitasked activities from video using stochastic context-free grammar", AAAI, 2002.

    32. M. S. Ryoo, and J. K. Aggarwal, "Recognition of composite human activities through context-free grammar based representation", CVPR, 2006.

    33. A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis, "Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos", CVPR, 2009.

    34. T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping", ECCV, 2004.

    35. A. Coates, "Demystifying Unsupervised Feature Learning", PhD thesis, Stanford University, 2012.

    36. F. Bach, "Consistency of the group Lasso and multiple kernel learning", Journal of Machine Learning Research, Vol. 9, pp. 1179-1225, 2008.

    37. G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints", Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
