Social media platforms such as Twitter and Weibo have created a venue for people to disseminate sentiments, ideas, and opinions [1]. User-generated content takes various forms, such as text, images, and videos. Though a single status update carries little informational value, the aggregation of millions of messages can reveal useful information about a population [2]. Social media content is usually unstructured, noisy, dynamic, and voluminous, and processing it requires big data technologies [3]. In recent years, several studies have examined big data and social media analytics. Barbosa and Feng [4] measured public opinion by classifying users’ posts by sentiment. Sakaki et al. [5] treated each Twitter user as a social sensor and constructed an earthquake reporting system in Japan by monitoring and mining tweets. Other applications include marketing campaigns [6], political polling [7], and financial prediction [8]. These successes have drawn interest from the public health community, whose goal is to study the health of a population.
Traditionally, studying population health requires costly, time-consuming monitoring mechanisms, primarily surveys and data collection from clinical encounters [9]. Even high-priority government projects are slow because they depend on data aggregation [9]. Cheaper public health surveillance tools that operate in real time or near real time are therefore in high demand. Social media users often share personal information publicly. For example, messages like “OMG, I got flu” and “I need pain killers for my headache but I also need to take cold and flu” are common on Twitter. Knowing that one individual is sick may reveal little, but millions of such tweets can be revealing. This suggests that social media carries a strong health “signal”.
With massive volumes of real-time health-related data available, social media combined with big data analytics appears to be a promising public health surveillance tool, as it can reduce costs and provide timely public health statistics. Studies have demonstrated that Twitter postings can be used to track the influenza rate [10] and to detect mental illnesses such as seasonal affective disorder and depression [11]. Moreover, data extracted from social media contain information that traditional health surveillance systems cannot uncover. For example, most people may take over-the-counter medicines or search online for self-treatment methods to deal with sudden symptoms or minor ailments rather than turn to health services. Some of them may post their feelings or symptoms on social media sites, making the ailments detectable, whereas traditional surveillance systems cannot acquire such information. Similarly, some people may lie to their doctors and families about socially undesirable behaviors such as excessive drinking but may disclose them anonymously online.
Mining those self-reported data and online queries provides the opportunity to understand the population’s health status, which helps augment the traditional notification channels about a disease outbreak. However, most work to date has focused on English Twitter messages, emphasizing health topics of major concern in the United States, with little work concerning health issues in other countries, especially developing ones like China.
This research aims to investigate how social media and Internet data can benefit public health in China. It will address the following questions: what health content can be extracted from Chinese social media such as Weibo; how content differs across health-related social media; what causes certain health issues; how methods for detecting specific diseases can be improved; how Chinese text can be processed more accurately; and how the population’s health status can be reflected in real time. Novel techniques and algorithms to gather, integrate, process, and mine huge volumes of data will be investigated. Using techniques from statistical analysis, machine learning, and natural language processing, a research tool will be constructed to detect and track illnesses over time and space, as well as population attitudes towards health and environment-related policies. With advanced data visualization technology, a China health map will be modelled to reveal health trends and the state of health in real time. The established system can be used to complement and verify traditional disease surveillance systems. Moreover, the developed technology and algorithms can be used to mine other valuable information about people’s livelihood.
Recent work in machine learning and computational linguistics has studied the health content of social media messages and shown the potential for extracting public health information from their aggregation. In [12], new computational models are developed to explore health-related tweets and topics on Twitter. Specifically, because many public health activities are disease-oriented, the model discovers ailments from raw Twitter postings to guide exploration. Supervised learning is used to filter the messages and find health-related tweets. Similarly, over 570 million tweets from an eight-month period are analyzed using supervised learning in [13]; the results show strong correlations between influenza-related Twitter messages and data from the U.S. Centers for Disease Control and Prevention. Lampos and Cristianini [14] analyze tweets to track influenza rates in the UK. A “flu-score” is assigned to each document by learning a weight for each word according to its predictive power on held-out data. The flu-score is compared with data from the Health Protection Agency, and the result again indicates a strong correlation. Because some social media services provide location information for some messages, public health surveillance can be geographically localized. For instance, Sadilek et al. [15] modelled the spread of infectious diseases by analyzing 2.5 million geo-tagged tweets. Their research shows that the intensity of recent co-location may increase people’s likelihood of contracting an illness in the near future.
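To make the idea of a word-weighted flu-score and its comparison with official surveillance data concrete, the following is a minimal illustrative sketch and not the method of [13] or [14]. The word weights in ASSUMED_WEIGHTS are invented for illustration (in [14] they are learned from data), the normalisation by message length is an assumption, and the function names flu_score, period_signal, and pearson are hypothetical.

```python
import math
import re

# Hypothetical word weights, invented for illustration only.
# In the cited work the weights are learned, not hand-set.
ASSUMED_WEIGHTS = {"flu": 1.0, "fever": 0.6, "cough": 0.4, "headache": 0.3}


def flu_score(text, weights=ASSUMED_WEIGHTS):
    """Score one message: summed word weights divided by token count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)


def period_signal(messages_by_period):
    """Average the message scores within each period (e.g. each week)."""
    return [
        sum(flu_score(m) for m in msgs) / len(msgs) if msgs else 0.0
        for msgs in messages_by_period
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Made-up messages and made-up official rates, for demonstration only.
    weeks = [
        ["nice weather today", "I got a headache"],
        ["OMG, I got flu", "fever and cough all night"],
        ["everyone at work has flu", "flu again", "bad cough"],
    ]
    official_rates = [1.2, 2.9, 4.1]
    print(pearson(period_signal(weeks), official_rates))
```

A real system would replace the hand-set weights with weights fitted against surveillance data and evaluated on held-out periods, and would need language-specific tokenisation for Chinese text rather than the simple regular expression used here.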
Of the numerous health topics to which social media analytics has been applied, mental health has drawn less attention. This is partly owing to the complexity of the underlying causes of mental health issues and partly owing to the longstanding societal stigma that makes the subject all but taboo [16]. Coppersmith et al. [16] developed a novel data collection method for several mental illnesses, including depression, post-traumatic stress disorder (PTSD), bipolar disorder, and seasonal affective disorder, and demonstrated quantifiable signals in tweets relevant to those conditions. In [17], PTSD is measured with Twitter data. The results demonstrate that some users with PTSD can be identified easily and automatically because their language usage patterns differ from those of random users. Using statistical models, Choudhury et al. [18] quantify postpartum changes in new mothers to identify those at risk of postpartum depression. Changes in emotion, social engagement, linguistic style, and social network are considered.
This work was carried out at the International Doctoral Innovation Centre (IDIC). The authors acknowledge the financial support from Ningbo Education Bureau, Ningbo Science and Technology Bureau, China's MOST, and the University of Nottingham. The work is also partially supported by EPSRC grant no EP/G037574/1.