How can social scientists collect and analyze web logs – records of individuals’ browsing behavior – for their own research? In this Methods Bites Instructional Blog Post, Ruben Bach summarizes some key insights of his talk in the MZES Social Sciences Data Lab in December 2019. The blog post discusses how to obtain and extract information from web logs and related data, shows how they can be used for social research, and concludes with a short discussion of how to handle big data extracted from web logs.
How to use web log and related data for social research
Web logs (“browsing histories”), records of app use, and search queries are highly interesting data sources for research in the social sciences, as they offer detailed insights into human behavior. However, only a few social science studies have used such data so far.
Examples of such studies include Stephens-Davidowitz (2014), who studied racial animus in the 2012 U.S. presidential election; Peterson, Goel, and Iyengar (2018) and Flaxman, Goel, and Rao (2016), who analyzed filter bubbles, echo chambers, and partisan polarization in the U.S.; and Guess, Nyhan, and Reifler (2020), who studied the spread of fake news in the U.S. In the Netherlands, Möller et al. (2019) analyzed online news engagement based on three different modes of news use. In Israel, Dvir-Gvirsman (2017) documented that audience homophily is higher among individuals with more extreme ideology and that it is associated with ideological polarization and intolerance. Bach et al. (2019) showed for Germany that online and mobile device activities predict voting behavior and political preferences only to a limited degree. Chancellor and Counts (2018) showed that internet search data can be used to estimate employment demand in the U.S.
In other disciplines, similar data have been used, for example, to predict influenza activity (Ginsberg et al. 2009; but see Lazer et al. 2014) and to show that users who search for relatively harmless symptoms easily end up searching for serious diseases (White and Horvitz 2009). Another study demonstrates how concerns about pregnancy and childbirth change over the course of pregnancy (Fourney, White, and Horvitz 2015). Furthermore, several papers show how a variety of user attributes such as socio-demographics can be inferred from web logs, search queries, and app records (see Hinds and Joinson 2018 for a recent overview). However, many of those studies have been conducted by researchers from computer science. While they often rely on search query data obtained from search engine providers like Bing and Google, they typically focus more on the technical aspects and on evaluating the performance of the underlying algorithms. As social scientists, however, we often focus on the theory-driven development of models and the testing of hypotheses about human and societal behavior. This blog post will therefore focus on the latter.
One challenge that researchers face when designing studies that rely on web logs, records of app use, and search queries is how to get access to such data. Several of the studies mentioned above use large amounts of search query data from search engines like Google or Bing (Stephens-Davidowitz 2014; Chancellor and Counts 2018; White and Horvitz 2009; Fourney, White, and Horvitz 2015). These data are, however, usually only available if one teams up with researchers from the respective companies. Another way to obtain data is through commercial providers who maintain opt-in panels of users who occasionally answer survey questions in exchange for money. Researchers can pay these vendors to get access to their panels and ask participants survey questions. In addition, several of these providers also offer web log and mobile device use records from users who (in exchange for additional pay) agreed to having their online and mobile activities monitored. In this blog post, we will mainly focus on this latter way of obtaining data. Before we discuss this topic in more detail, we briefly summarize what we mean when we speak of web logs, records of app use, and search queries.
To get a better understanding of such data, the table below shows a collection of a few artificial web logs made up for this blog post. Typically, we observe a person identifier (first column) for the person whose records we observe. Second, we have a URL (Uniform Resource Locator) column which tells us which URL this person visited, when (column “Timestamp”) and for how long (“Duration of use”; here, in seconds). The most interesting information in this table is the URL column. Even without a detailed understanding of the specific form of a URL, we can easily see that, in the first row, Person 1 visited a web address that seems to inform her about the weather in Mannheim. In addition, we observe when she visited this address and how much time she spent there. The second row tells us that this person likely sent (or received) a message to (from) Peter Mustermann. We cannot, however, observe the content of the message (which we also should not, given obvious privacy reasons). The third row shows that Person 1 then visited the Facebook page of the CDU. The fourth row shows that she watched a video on YouTube. If we accessed the web address (or programmed a scraping tool), we could also learn what the video was about. From the visit in the fifth row, we learn that this person read an article about the state of the German economy on DIE ZEIT, a German newspaper.
With respect to Person 2, we can also observe which items they searched for on Amazon (row 6) or which tweet they saw (row 7). We also learn that they searched for the voting advice application “Wahl-O-Mat” (row 9). From the URL in row 9, we can extract the exact words this person entered into a search engine (the “search queries”). From row 10, we also know that Person 2 then actually used this tool. Thus, we already see that we can learn a lot about users’ behavior, their interests, and their preferences.
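To illustrate how such search terms can be recovered at scale, the snippet below parses a single, made-up search URL with the httr package in R. Both the URL and the assumption that the search terms sit in a `q=` parameter are illustrative only; the relevant parameter name should be checked for each search engine that appears in the logs.

```r
# Minimal sketch: extracting the search terms from a (made-up) search-engine URL.
# Real web logs contain the URLs actually visited; the q= parameter is an
# assumption that holds for common search engines but should be verified.
library(httr)

url <- "https://www.google.com/search?q=wahl-o-mat"

parsed <- parse_url(url)   # splits the URL into scheme, hostname, path, query, ...
parsed$hostname            # "www.google.com"
parsed$query$q             # "wahl-o-mat" -- the term entered into the search engine
```

Applied to every URL in a log file, the same logic yields a column of search queries that can then be analyzed like any other text data.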
We observe similar records regarding app use on mobile devices, as shown in the next table. Here, we observe the name of an app instead of a URL. Unfortunately, we cannot observe what users do inside the apps they use. That is, we do not know which articles they read when they open a newspaper app or which videos they watch on YouTube or Netflix. Yet, if somebody frequently opens newspaper apps, we might infer that this person may be interested in politics. Moreover, users who use an app called “Period Tracker” are likely female and we may even learn when they have their period by observing temporal variations in usage of this app. Similarly, somebody who uses an app that informs them about prayer times is likely religious. These are just a few examples that show how observing users’ web logs or app use can reveal information that users may perceive as sensitive personal information (see Bach et al. 2019 for a study on this topic).
| Person | App | Timestamp | Duration of use (seconds) |
|---|---|---|---|
| 1 | Facebook | 2020-01-21 10:43:45 | 24 |
| 1 | Kleiderkreisel | 2020-01-22 11:43:45 | 32 |
| 1 | WhatsApp Messenger | 2020-01-23 23:43:01 | 45 |
| 1 | Youtube | 2020-01-24 08:21:45 | 3625 |
| 1 | GMX Mail | 2020-01-25 16:14:07 | 67 |
| 2 | Period tracker | 2020-01-26 23:54:45 | 23 |
| 2 | Pedometer | 2020-01-27 07:43:45 | 245 |
| 2 | WhatsApp Messenger | 2020-01-28 01:01:45 | 56 |
| 2 | WhatsApp Messenger | 2020-01-29 17:14:45 | 32 |
| 2 | Youtube | 2020-01-30 09:09:45 | 578 |
| 2 | Netflix | 2020-01-30 18:56:45 | 456 |
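To give a sense of how such records are typically processed for analysis, the following R sketch aggregates a few of the artificial app records from the table above into per-person, per-app usage summaries with dplyr. The column names are chosen for this example and will differ across data providers.

```r
# Minimal sketch: summarising the artificial app-use records shown above.
# Column names are chosen for this example; real exports will look different.
library(dplyr)
library(tibble)

app_logs <- tribble(
  ~person, ~app,                 ~timestamp,            ~duration,
  1,       "Facebook",           "2020-01-21 10:43:45", 24,
  1,       "Youtube",            "2020-01-24 08:21:45", 3625,
  2,       "Period tracker",     "2020-01-26 23:54:45", 23,
  2,       "WhatsApp Messenger", "2020-01-28 01:01:45", 56,
  2,       "WhatsApp Messenger", "2020-01-29 17:14:45", 32
)

# Number of sessions and total usage time (in seconds) per person and app --
# simple features that could later be linked to survey responses
app_logs %>%
  group_by(person, app) %>%
  summarise(n_sessions    = n(),
            total_seconds = sum(duration),
            .groups = "drop")
```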
It is important to note that participants of commercial panels can usually pause data collection on their devices temporarily. This might happen if they do not feel comfortable having their activities recorded, e.g. when they do online banking or watch movies from illegal streams. However, our own data collections show large amounts of potentially sensitive information (such as adult content, gambling, and illegal streaming), which suggests that users do not make use of this possibility very often.
How to obtain web log and related data
As mentioned earlier, the easiest way to obtain web log and app use data is through commercial vendors who operate online access panels (“non-probability panels”). So far, there seem to be only a handful of providers, such as respondi AG (Germany, the UK, and France, for example), YouGov (the UK and US, amongst others), and netquest (mostly Spain and Latin America). For an overview of providers, see, for example, Wakoopa Hub. Since such data have not been available for long, we do not know much about their quality (for an exception, see Revilla, Ochoa, and Loewe 2017). In addition to accessing web logs and app use data, researchers can collect survey data for the same individuals. That is, by asking users questions we can enrich the web log data with important information about users’ socio-demographic characteristics, their voting behavior, or their political preferences. This information is usually not directly observable from the web logs and app use records, but it likely makes the data much more valuable for social scientific research purposes. However, we should always keep in mind that, due to the opt-in nature of these access panels, we need to be careful when making statements about the representativeness of our findings. That is, because most of our statistical estimators (e.g., for variances) rely on true probability sampling, we need to adjust our models and estimators, e.g., by using weights, as sketched below (for more information on non-probability sampling and non-probability panels see, e.g., Mercer et al. 2017; Cornesse et al. 2020).
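To make the last point concrete, the sketch below shows one common (though by itself not sufficient) adjustment: raking an opt-in sample to known population margins with the survey package. All variables, sample values, and population figures are invented for illustration.

```r
# Minimal sketch: raking an opt-in sample to assumed population margins.
# Variables, sample values and population counts are made up for illustration.
library(survey)

panel <- data.frame(
  gender      = c("f", "m", "m", "f", "f", "m"),
  agegrp      = c("18-39", "40-64", "65+", "18-39", "40-64", "65+"),
  visits_news = c(12, 3, 0, 25, 7, 1)   # e.g., news-site visits from the web logs
)

des <- svydesign(ids = ~1, weights = ~1, data = panel)

# Assumed population margins (counts), e.g. taken from official statistics
pop_gender <- data.frame(gender = c("f", "m"), Freq = c(42e6, 40e6))
pop_age    <- data.frame(agegrp = c("18-39", "40-64", "65+"),
                         Freq   = c(25e6, 35e6, 22e6))

raked <- rake(des,
              sample.margins     = list(~gender, ~agegrp),
              population.margins = list(pop_gender, pop_age))

svymean(~visits_news, raked)   # weighted estimate of mean news-site visits
```

Raking only adjusts for the variables included in the margins; selection on unobserved characteristics remains a concern with opt-in panels.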
Another way to obtain data on users’ online activities is through Google Trends. Briefly speaking, this service allows the estimation of the popularity of search queries in Google Search across various regions and languages. Several R packages allow users to automatically extract data from Google Trends – see, for instance, gtrendsR. Further details on Google Trends can be found here. A major drawback of these data is that they do not come with additional information about users and only provide information at aggregate levels. That is, one can learn only about the popularity of a specific search query compared to the popularity of other search queries, which is arguably much less informative than individual-level web logs.
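As a quick illustration of this workflow, the snippet below retrieves the relative popularity of a single query with gtrendsR; the keyword, region, and time window are illustrative choices.

```r
# Minimal sketch: querying Google Trends from R via gtrendsR.
# Keyword, region and time window are illustrative choices.
library(gtrendsR)

res <- gtrends(keyword = "Wahl-O-Mat",
               geo     = "DE",
               time    = "2017-01-01 2017-12-31")

# Relative search interest over time (values are scaled 0-100 by Google)
head(res$interest_over_time)
```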
A third way to obtain web log, app, and search query data is through developing one’s own research app. A few projects using this approach were launched in recent years. One of the most prominent examples is the IAB-SMART project (for details, see Kreuter et al. 2019). Researchers of the Institute for Employment Research (IAB) in Nuremberg, Germany, developed an app that has since been downloaded by several hundred respondents of the IAB’s panel survey “Labour Market and Social Security”. The app collects information on respondents’ smartphone activities and can access even more information than the data sources described above (including geolocation and address books). While this approach is likely the most elaborate, it is also the most difficult and most costly one to implement: researchers need to program their own tools and recruit participants on their own or piggyback on an existing study.
How to handle “big data”
Finally, a note on useful tools and prerequisites for analyzing web logs and records of smartphone use. First of all, the amount of data can quickly exceed the computational power of a standard desktop computer. Four months of web log data used in Bach et al. (2019), for example, contained about 38 million observations. Working with data of this size, researchers may have to consider using remote computing services like Digital Ocean, Amazon AWS or Microsoft Azure, which offer computational resources for little money through virtual servers. Second, understanding URL contents by observing single URLs is straightforward. Analyzing thousands of URLs, however, requires text mining and natural language processing (NLP) techniques if one wants, for example, to select only those URLs that point to news articles. Moreover, in addition to analyzing the title of a news article (which can often be observed from the URL alone), one might also want to analyze the whole content of the article. In such cases, in addition to being able to automatically extract the topic of an article through NLP techniques, knowing how to scrape website contents will likely also be helpful. Some useful materials are linked below.
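To give a flavour of the scraping step, the sketch below retrieves the text of a single article with the rvest package. The URL and the CSS selector are placeholders that would have to be adapted to the site one actually wants to scrape, and scraping should, of course, respect the site’s robots.txt and terms of service.

```r
# Minimal sketch: downloading and extracting article text from a news URL.
# The URL and the CSS selector ("article p") are placeholders; real sites
# differ, and robots.txt / terms of service should be respected.
library(rvest)

url  <- "https://www.example.com/politik/some-article"   # hypothetical article URL
page <- read_html(url)

paragraphs <- page %>%
  html_elements("article p") %>%   # selector must be adapted to the site's HTML
  html_text2()

article_text <- paste(paragraphs, collapse = " ")
```

The extracted text can then be passed to standard NLP tools, for example to identify the topics of the articles a person has read.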
About the presenter
Ruben Bach is a postdoctoral researcher at the University of Mannheim, focusing on quantitative research methods in the social sciences. His interests include topics related to big data in the social sciences, machine learning, causal inference, and survey research.
References
Bach, R. L., C. Kern, A. Amaya, F. Keusch, F. Kreuter, J. Heinemann, and J. Hecht. 2019. “Predicting Voting Behavior Using Digital Trace Data.” Social Science Computer Review. https://doi.org/10.1177/0894439319882896.
Chancellor, S., and S. Counts. 2018. “Measuring Employment Demand Using Internet Search Data.” In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–14. CHI ’18. New York, NY, USA: ACM.
Cornesse, C., A. G. Blom, D. Dutwin, J. A. Krosnick, E. D. De Leeuw, S. Legleye, J. Pasek, et al. 2020. “A Review of Conceptual Approaches and Empirical Evidence on Probability and Nonprobability Sample Survey Research.” Journal of Survey Statistics and Methodology. https://doi.org/10.1093/jssam/smz041.
Dvir-Gvirsman, S. 2017. “Media Audience Homophily: Partisan Websites, Audience Identity and Polarization Processes.” New Media & Society 19 (7): 1072–91.
Flaxman, Seth, Sharad Goel, and Justin M Rao. 2016. “Filter Bubbles, Echo Chambers, and Online News Consumption.” Public Opinion Quarterly 80 (S1): 298–320.
Fourney, Adam, Ryen W. White, and Eric Horvitz. 2015. “Exploring Time-Dependent Concerns About Pregnancy and Childbirth from Search Logs.” In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 737–46. CHI ’15. New York, NY, USA: ACM. https://doi.org/10.1145/2702123.2702427.
Ginsberg, Jeremy, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457 (7232): 1012–4.
Guess, Andrew M, Brendan Nyhan, and Jason Reifler. 2020. “Exposure to Untrustworthy Websites in the 2016 US Election.” Nature Human Behaviour, 1–9.
Hinds, J., and A. N. Joinson. 2018. “What Demographic Attributes Do Our Digital Footprints Reveal? A Systematic Review.” PLoS One 13: 1–40.
Kreuter, Frauke, Georg-Christoph Haas, Florian Keusch, Sebastian Bähr, and Mark Trappmann. 2019. “Collecting Survey and Smartphone Sensor Data with an App: Opportunities and Challenges Around Privacy and Informed Consent.” Social Science Computer Review. https://doi.org/10.1177/0894439318816389.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–5.
Mercer, Andrew W., Frauke Kreuter, Scott Keeter, and Elizabeth A. Stuart. 2017. “Theory and Practice in Nonprobability Surveys: Parallels between Causal Inference and Survey Inference.” Public Opinion Quarterly 81 (S1): 250–71. https://doi.org/10.1093/poq/nfw060.
Möller, Judith, Robbert Nicolai van de Velde, Lisa Merten, and Cornelius Puschmann. 2019. “Explaining Online News Engagement Based on Browsing Behavior: Creatures of Habit?” Social Science Computer Review. https://doi.org/10.1177/0894439319828012.
Peterson, Erik, Sharad Goel, and Shanto Iyengar. 2018. “Echo Chambers and Partisan Polarization: Evidence from the 2016 Presidential Campaign.”
Revilla, Melanie, Carlos Ochoa, and Germán Loewe. 2017. “Using Passive Data from a Meter to Complement Survey Data in Order to Study Online Behavior.” Social Science Computer Review 35 (4): 521–36.
Stephens-Davidowitz, Seth. 2014. “The Cost of Racial Animus on a Black Candidate: Evidence Using Google Search Data.” Journal of Public Economics 118: 26–40.