Horizon CDT Research Highlights

Public Health Surveillance using Social Media and Internet Data

  Mengdi Li (2013 cohort)

1. Introduction

Social media platforms such as Twitter and Weibo give people a place to share sentiments, ideas, and opinions [1]. Content generated on social media takes various forms, including text, images, and videos. Although a single status update carries little informational value, the aggregation of millions of messages can reveal useful information about a population [2]. Social media content is usually unstructured, noisy, dynamic, and voluminous, so processing it requires big data technologies [3]. In recent years, several studies have examined big data and social media analytics. Barbosa and Feng [4] measured public opinion by classifying users’ posts by sentiment. Sakaki et al. [5] treated each Twitter user as a social sensor and built an earthquake reporting system in Japan by monitoring and mining tweets. Other applications include marketing campaigns [6], political polling [7], and financial prediction [8]. These successes have drawn interest from the public health community, whose goal is to study the health of a population.

Traditionally, studying population health requires costly, time-consuming monitoring mechanisms, primarily surveys and data collected from clinical encounters [9]. Even high-priority government projects are slow because they depend on data aggregation [9]. The field therefore has strong demand for cheaper public health surveillance tools that operate in real time or near real time. Social media users often share personal information publicly. For example, messages like “OMG, I got flu” and “I need pain killers for my headache but I also need to take cold and flu” are common on Twitter. Knowing that one individual is sick is of limited interest, but millions of such tweets may be revealing. This suggests that social media carries a strong health “signal”.

With massive amounts of real-time health-related data available, social media combined with big data analytics is a promising public health surveillance tool, because it can reduce cost and provide timely public health statistics. Studies have demonstrated that Twitter postings can be used to track the influenza rate [10] and to detect mental illnesses such as seasonal affective disorder and depression [11]. Moreover, data extracted from social media contains information that traditional health surveillance systems cannot uncover. For example, many people take over-the-counter medicines or search online for self-treatment methods to deal with sudden symptoms or minor ailments, rather than turning to health services. Some of them post their feelings or symptoms on social media sites, which makes these ailments detectable, whereas traditional surveillance systems cannot acquire such information. Similarly, some people may lie to their doctors and families about socially undesirable behaviors such as excessive drinking, yet discuss those behaviors anonymously online.

Mining these self-reported data and online queries provides an opportunity to understand a population’s health status, which can augment traditional notification channels for disease outbreaks. However, most work to date has focused on English Twitter messages and on health topics of major concern in the United States, with little work on health issues in other countries, especially developing ones such as China.

2. Research Aim and Objectives

This research aims to investigate how social media and Internet data can benefit public health in China. It will address the following questions: what health content can be extracted from Chinese social media such as Weibo; how content differs across health-related social media; what causes certain health issues; how methods for detecting specific diseases can be improved; how Chinese text can be processed more accurately; and how the population’s health status can be reflected in real time. Novel techniques and algorithms to gather, integrate, process, and mine huge volumes of data will be investigated. Using techniques from statistical analysis, machine learning, and natural language processing, a research tool will be constructed to detect and track illnesses over time and space, as well as population attitudes towards health and environment-related policies. With advanced data visualization technology, a China health map will be built to reveal health trends and the state of health in real time. The resulting system can be used to complement and verify traditional disease surveillance systems. Moreover, the developed technology and algorithms can be used to mine other valuable information about people’s livelihood.

3. Related Work

Recent work in machine learning and computational linguistics has studied the health content of social media messages and shown the potential for extracting public health information from their aggregation. In [12], new computational models are developed to explore health-related tweets and topics on Twitter. Specifically, because many public health activities are disease-oriented, the model discovers ailments from raw Twitter postings to support guided exploration. In that work, supervised learning is used to filter messages and find health-related tweets. Similarly, over 570 million tweets from an eight-month period are analyzed using supervised learning in [13]; the results show strong correlations between influenza-related Twitter messages and data from the U.S. Centers for Disease Control and Prevention. Lampos and Cristianini [14] analyze tweets to track influenza rates in the UK. A “flu-score” is assigned to each document using weights learned for each word according to its predictive power on held-out data. The flu-score is compared with data from the Health Protection Agency, and the result also indicates a strong correlation. Because some social media services provide location information for some messages, public health surveillance can be geographically localized. For instance, Sadilek et al. [15] modelled the spread of infectious diseases by analyzing 2.5 million geo-tagged tweets. Their research suggests that the intensity of recent co-location may increase a person’s likelihood of contracting an illness in the near future.
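The flu-score idea described above can be illustrated with a minimal sketch. It is not the method of [14]: the word weights below are hand-picked for illustration rather than learned from held-out data, and the function names (flu_score, daily_score, pearson) and sample values are assumptions introduced here. The sketch scores each message by the weighted frequency of marker words, averages the scores per day, and measures how well the daily series tracks an official rate using the Pearson correlation coefficient.

```python
from collections import Counter
from math import sqrt

# Hypothetical marker-word weights (assumption: hand-picked for illustration;
# the cited work learns such weights from data).
WEIGHTS = {"flu": 1.0, "fever": 0.6, "cough": 0.4}


def flu_score(text, weights=WEIGHTS):
    """Weighted count of marker words, normalized by the number of tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(w * counts[word] for word, w in weights.items()) / len(tokens)


def daily_score(messages, weights=WEIGHTS):
    """Mean flu-score over all messages collected on one day."""
    if not messages:
        return 0.0
    return sum(flu_score(m, weights) for m in messages) / len(messages)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Made-up messages and official rates, for demonstration only.
    days = [
        ["nice weather today", "i got flu"],
        ["flu and fever all week", "bad cough again"],
        ["everyone has flu flu flu", "fever and cough"],
    ]
    official_rate = [1.2, 2.5, 4.0]
    scores = [daily_score(day) for day in days]
    print(pearson(scores, official_rate))
```

In practice the weights would be fitted against official surveillance data, and Chinese text would need word segmentation rather than whitespace splitting.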

Of the numerous health topics to which social media analytics has been applied, mental health has drawn less attention. This is partly because the underlying causes of mental health issues are complex and partly because longstanding societal stigma makes the subject all but taboo [16]. Coppersmith et al. [16] developed a novel data collection method for several mental illnesses, namely depression, post-traumatic stress disorder (PTSD), bipolar disorder, and seasonal affective disorder, and demonstrated quantifiable signals in tweets relevant to those conditions. In [17], PTSD is measured with Twitter data. The results show that some users with PTSD can be identified automatically because their language usage patterns differ from those of random users. Using statistical models, De Choudhury et al. [18] quantify postpartum changes in new mothers to identify those at risk of postpartum depression. Changes in emotion, social engagement, linguistic style, and social network are considered.


  1. Dredze, M. (2012). How social media will change public health. Intelligent Systems, IEEE, 27(4), 81-84.
  2. Paul, M. J., & Dredze, M. (2011). You Are What You Tweet: Analyzing Twitter for Public Health.
  3. Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–144. doi:10.1016/j.ijinfomgt.2014.10.007
  4. Barbosa, L., & Feng, J. (2010, August). Robust sentiment detection on twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 36-44). Association for Computational Linguistics.
  5. Sakaki, T. (2009). Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors.
  6. Chen, W., Wang, C., & Wang, Y. (2010, July). Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1029-1038). ACM.
  7. O'Connor, B., Balasubramanyan, R., Routledge, B. R., & Smith, N. A. (2010). From tweets to polls: Linking text sentiment to public opinion time series. ICWSM, 11, 122-129.
  8. Asur, S., & Huberman, B. A. (2010, August). Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on (Vol. 1, pp. 492-499). IEEE.
  9. Dredze, M. (2012). How social media will change public health. Intelligent Systems, IEEE, 27(4), 81-84.
  10. Lampos, V., & Cristianini, N. (2010, June). Tracking the flu pandemic by monitoring the social web. In Cognitive Information Processing (CIP), 2010 2nd International Workshop on (pp. 411-416). IEEE.
  11. Coppersmith, G., Dredze, M., & Harman, C. (n.d.). Quantifying Mental Health Signals in Twitter.
  12. Paul, M. J., & Dredze, M. (2012). A model for mining public health topics from Twitter. Health, 11, 16-6.
  13. Culotta, A. (2013). Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Language resources and evaluation, 47(1), 217-238.
  14. Lampos, V., & Cristianini, N. (2010, June). Tracking the flu pandemic by monitoring the social web. In Cognitive Information Processing (CIP), 2010 2nd International Workshop on (pp. 411-416). IEEE.
  15. Sadilek, A., Kautz, H., & Silenzio, V. (n.d.). Modeling Spread of Disease from Social Interactions.
  16. Coppersmith, G., Dredze, M., & Harman, C. (n.d.). Quantifying Mental Health Signals in Twitter.
  17. Coppersmith, G., Harman, C., & Dredze, M. (2014). Measuring post traumatic stress disorder in Twitter.
  18. De Choudhury, M., Counts, S., & Horvitz, E. (2013). Predicting postpartum changes in emotion and behavior via social media. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems - CHI ’13, 3267. doi:10.1145/2470654.2466447.

This work was carried out at the International Doctoral Innovation Centre (IDIC). The authors acknowledge the financial support from Ningbo Education Bureau, Ningbo Science and Technology Bureau, China's MOST, and the University of Nottingham. The work is also partially supported by EPSRC grant no EP/G037574/1.