To improve the recognition performance of video human actions,an approach that models the video actions in a hierarchical way is proposed. This hierarchical model summarizes the action contents with different spatio-temporal domains according to the properties of human body movement.First,the temporal gradient combined with the constraint of coherent motion pattern is utilized to extract stable and dense motion features that are viewed as point features,then the mean-shift clustering algorithm with the adaptive scale kernel is used to label these features.After pooling the features with the same label to generate part-based representation,the visual word responses within one large scale volume are collected as video object representation.On the benchmark KTH(Kungliga Tekniska H?gskolan)and UCF (University of Central Florida)-sports action datasets,the experimental results show that the proposed method enhances the representative and discriminative power of action features, and improves recognition rates.Compared with other related literature,the proposed method obtains superior performance.