Inferring human mental state (e.g., emotion, depression, engagement) with sensing technology is one of the most valuable challenges in the affective computing area, which has a profound impact in all industries interacting with humans. The self-report survey is the most common way to quantify how people think, but prone to subjectivity and various responses bias. It is usually used as the ground truth for human mental state prediction. In recent years, many data-driven machine learning models are built based on self-report annotations as the target value. In this research, we investigate the reliability of self-report surveys in the wild by studying the confidence level of responses and survey completion time. We conduct a case study (i.e., student engagement inference) by recruiting 23 students in a high school setting over a period of 4 weeks. Our participants volunteered 488 self-reported responses and data from their wearable sensors. We also find the physiologically measured student engagement and perceived student engagement are not always consistent. The findings from this research have great potential to benefit future studies in predicting engagement, depression, stress, and other emotion-related states in the field of affective computing and sensing technologies.