    
Master's thesis
            
         
Published 2016
                        
                           
                     
               
The proposed method consists of three stages: feature extraction, bag-of-words encoding, and classification. In the first stage, we use the STIP descriptor for the intensity channel, the HOG descriptor for the depth channel, and MFCC and spectrogram features for the audio channel. In the second stage, the bag-of-words approach is applied to each type of information separately, with the K-means algorithm generating the dictionary. Finally, an SVM classifier labels the visual word histograms. For the experiments, we manually segmented the videos into clips containing a single action, achieving a recognition rate of 94.4% on Kitchen-UCSP, our own dataset, and 88% on HMA videos.
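The bag-of-words stage described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`kmeans`, `bow_histogram`), the toy two-dimensional descriptors, the dictionary size, and the histogram normalization are all assumptions made for illustration. Real STIP, HOG, or MFCC descriptors would be much higher-dimensional, and the resulting histograms would then be passed to an SVM, which is omitted here.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centers (the 'dictionary').

    Initial centers are k randomly sampled points (an assumption; the
    source does not say how the dictionary was initialized).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def bow_histogram(descriptors, centers):
    """Count how often each dictionary word is the nearest one, then
    normalize so the histogram sums to 1 (all zeros if there is no input)."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest(d, centers)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy descriptors for two hypothetical clips (made-up values).
clip_a = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1]]
clip_b = [[5.1, 4.9], [4.9, 5.0], [0.05, 0.05]]

# One dictionary per information type, built from all training descriptors.
dictionary = kmeans(clip_a + clip_b, k=2)
hist_a = bow_histogram(clip_a, dictionary)
hist_b = bow_histogram(clip_b, dictionary)
```

Each clip ends up as a fixed-length vector regardless of how many descriptors it produced, which is what lets a standard classifier such as an SVM operate on clips of different durations.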