Walking in place for moving through virtual environments has attracted noticeable attention recently. Recent attempts focused on training a classifier to recognize certain patterns of gestures (e.g., standing, walking, etc) with the use of neural networks like CNN or LSTM. Nevertheless, they often consider very few types of gestures and/or induce less desired latency in virtual environments. In this paper, we propose a novel framework for accurate and efficient classification of in-place gestures. Our key idea is to treat several consecutive frames as a "point cloud". The HMD and two VIVE trackers provide three points in each frame, with each point consisting of 12-dimensional features (i.e., three-dimensional position coordinates, velocity, rotation, angular velocity). We create a dataset consisting of 9 gesture classes for virtual in-place locomotion. In addition to the supervised point-based network, we also take unsupervised domain adaptation into account due to inter-person variations. To this end, we develop an end-to-end joint framework involving both a supervised loss for supervised point learning and an unsupervised loss for unsupervised domain adaptation. Experiments demonstrate that our approach generates very promising outcomes, in terms of high overall classification accuracy (95.0%) and real-time performance (192ms latency). Our code will be publicly available at: https://github.com/ZhaoLizz/PCT-MCD.