data analysis on Methods Bites

Multiverse analysis

Tue, 30 May 2023 00:00:01 +0100

Abstract

Data analysis involves many decisions, including study design, data preparation, and statistical model selection. However, a single analysis represents only one of many possible outcomes, raising questions about the impact of undocumented and at times arbitrary choices. Multiverse analysis addresses this issue by conducting all, or a large set of, meaningful analyses and presenting the results in summary form to assess the robustness of conclusions to alternative modeling decisions. The approach addresses two fundamental problems in research: the lack of transparency and the dependence of analysis results on data-analytic decisions. We will also discuss how to implement the approach, its advantages over more traditional analysis approaches, as well as limitations and open challenges, including statistical inference and computational requirements.
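To make the idea concrete, here is a minimal sketch of a multiverse analysis in base R. It is an illustration only, not the presenters' implementation: it uses R's built-in mtcars data, and the three analytic choices (two optional control variables and one hypothetical exclusion rule, hp <= 250) are assumptions made up for this example.

```r
# Each row of `specs` is one "universe": a combination of analytic choices
specs <- expand.grid(
  control_hp   = c(FALSE, TRUE),  # control for horsepower?
  control_cyl  = c(FALSE, TRUE),  # control for number of cylinders?
  drop_high_hp = c(FALSE, TRUE)   # exclude very powerful cars? (hypothetical rule)
)

# Fit one specification and return the coefficient of interest (wt)
fit_one <- function(control_hp, control_cyl, drop_high_hp) {
  dat <- if (drop_high_hp) subset(mtcars, hp <= 250) else mtcars
  rhs <- c("wt", if (control_hp) "hp", if (control_cyl) "cyl")
  fit <- lm(reformulate(rhs, response = "mpg"), data = dat)
  unname(coef(fit)["wt"])
}

# Run all specifications and store the estimates next to the choices
specs$estimate <- mapply(
  fit_one, specs$control_hp, specs$control_cyl, specs$drop_high_hp
)

# Summarize the results across the multiverse
print(specs)
print(range(specs$estimate))
```

The range of estimates across rows gives a first impression of how strongly the conclusion depends on the analytic choices; a real application would involve many more decisions and a more careful summary.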

Presenters

Reinhard Schunck is Professor of Sociology at the University of Wuppertal. He works primarily in the field of social stratification and inequality, concentrating on migration and family related processes, and has a focus on quantitative methods.

Nora Huth-Stöckle is a doctoral student and works at the University of Wuppertal. Her research interests comprise intergroup relations, educational inequality, and quantitative methods.

Using Web Logs and Smartphone Records for Social Research

Tue, 14 Apr 2020 01:00:00 +0100

How can social scientists collect and analyze web logs – records of individuals’ browsing behavior – for their own research? In this Methods Bites Instructional Blog Post, Ruben Bach summarizes some key insights of his talk in the MZES Social Sciences Data Lab in December 2019. The blog post discusses how to obtain and extract information from web logs and related data, shows how they can be used for social research, and concludes with a short discussion of how to handle big data extracted from web logs.

How to use web log and related data for social research
How to obtain web log and related data
How to handle “big data”
About the presenter
Further reading
References

How to use web log and related data for social research

Web logs (“browsing histories”), records of app use, and search queries are highly interesting data sources for research in the social sciences, as they offer detailed insights into human behavior. However, only a few studies have used such data in the social sciences so far. Examples of such studies include Stephens-Davidowitz (2014), who studied racial animus in the 2012 U.S. presidential election, Peterson, Goel, and Iyengar (2018) and Flaxman, Goel, and Rao (2016), who analyzed filter bubbles, echo chambers and partisan polarization in the U.S., and Guess, Nyhan, and Reifler (2020), who studied the spread of fake news in the U.S. In the Netherlands, Möller et al. (2019) analyzed online news engagement based on three different modes of news use. In Israel, Dvir-Girsman (2017) documented that audience homophily is higher among individuals with more extreme ideology and that it is associated with ideological polarization and intolerance. Bach et al. (2019) showed for Germany that online and mobile device activities predict voting behavior and political preferences to a limited degree only. Chancellor and Counts (2018) showed that internet search data can be used to estimate employment demand in the U.S.

In other disciplines, similar data have been used, for example, to predict influenza activity (Ginsberg et al. 2009, but see Lazer et al. 2014) and to show that users who search for relatively harmless symptoms easily end up searching for serious diseases (White and Horvitz 2009). Another study demonstrates how concerns about pregnancy and childbirth change over the course of pregnancy (Fourney, White, and Horvitz 2015). Furthermore, several papers show how a variety of user attributes such as socio-demographics can be inferred from web logs, search queries and app records (see Hinds and Joinson 2018 for a recent overview). However, many of those studies have been conducted by researchers from computer science. While they often rely on search query data obtained from search engine providers like Bing and Google, they typically focus more on the technical aspects and on the evaluation of the performance of the underlying algorithms. As social scientists, however, we often focus on the theory-driven development of models and the testing of hypotheses about human and societal behavior. Thus, this blog post will focus on the latter.

One challenge that researchers face when designing studies that rely on web logs, records of app use, and search queries is how to get access to such data. Several of the studies mentioned above use large amounts of search query data from search engines like Google or Bing (Stephens-Davidowitz 2014; Chancellor and Counts 2018; White and Horvitz 2009; Fourney, White, and Horvitz 2015). These data are, however, usually only available if one teams up with researchers from the respective companies. Another way to obtain data is through commercial providers who keep opt-in panels of users who occasionally answer survey questions in exchange for money. Researchers can pay those vendors in order to get access to their panels and ask participants survey questions. In addition, several of these providers also offer web log and mobile device use records from users who (in exchange for additional pay) agreed to having their online and mobile activities monitored. In this blog post, we will mainly focus on this latter way of obtaining data. Before we talk about this topic in more detail, we will briefly summarize what we mean when we speak of web logs, records of app use, and search queries.

To get a better understanding of such data, the table below shows a collection of a few artificial web logs made up for this blog post. Typically, we observe a person identifier (first column) for the person whose records we observe. Second, we have a URL (Uniform Resource Locator) column which tells us which URL this person visited, when (column “Timestamp”) and for how long (“Duration of use”; here, in seconds). The most interesting information in this table is the URL column. Even without a detailed understanding of the specific form of a URL, we can easily see that, in the first row, Person 1 visited a web address that seems to inform her about the weather in Mannheim. In addition, we observe when she visited this address and how much time she spent there. The second row tells us that this person likely sent (or received) a message to (from) Peter Mustermann. We cannot, however, observe the content of the message (which we also should not, given obvious privacy reasons). The third row shows that Person 1 then visited the Facebook page of the CDU. The fourth row shows that she watched a video on YouTube. If we accessed the web address (or programmed a scraping tool), we could also learn what the video was about. From the visit in the fifth row, we learn that this person read an article about the state of the German economy on DIE ZEIT, a German newspaper.

With respect to Person 2, we can also observe what items they searched on Amazon (sixth row) or which tweet they saw (eighth row). We also learn that they searched for the voting advice application “Wahl-O-Mat” (ninth row). From the URL in row 9, we can extract the exact words a person entered into a search engine (the “search queries”). From row 10, we also know that Person 2 then actually used this tool. Thus, we already see that we can learn a lot about users’ behavior, their interests and preferences.

Person ID	URL	Timestamp	Duration of use
1	https://www.wetter.de/deutschland/wetter-mannheim-18224779.html?q=mannheim	2020-01-21 10:43:45	24
1	https://www.facebook.com/messages/t/Peter.Mustermann	2020-01-22 11:43:45	32
1	https://www.facebook.com/CDU/	2020-01-23 23:43:01	45
1	https://www.youtube.com/watch?v=SpoXEEdsNfE	2020-01-24 08:21:45	3625
1	https://www.zeit.de/wirtschaft/unternehmen/2020-01/ifo-index-geschaeftsklima-deutsche-wirtschaft-konjunktur	2020-01-25 16:14:07	67
2	https://www.amazon.de/s?k=mostly+harmless+econometrics	2020-01-26 23:54:45	23
2	https://www.sueddeutsche.de/politik/bundeswehr-wehrbeauftragter-bericht-1.4774621	2020-01-27 07:43:45	245
2	https://twitter.com/realDonaldTrump/status/1222008772102705152	2020-01-28 01:01:45	56
2	https://www.google.com/search?q=wahl+o+mat+hamburg+2020	2020-01-29 17:14:45	32
2	https://www.wahl-o-mat.de/hamburg2020/	2020-01-30 09:09:45	578
2	https://www.notebookcheck.com/Top-10-Ultrabooks-im-Test-bei-Notebookcheck.125873.0.html	2020-01-30 11:49:45	243
2	https://www.google.com/search?q=flug+frankfurt+new+york	2020-01-30 12:09:45	67
2	https://www.google.com/flights?lite=0#flt=/m/02z0j./m/02_286.2020-01-30*/m/02_286./m/02z0j.2020-02-12;c:EUR;e:1;sd:1;t:f	2020-01-30 18:56:45	456
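As a minimal sketch of how search queries could be pulled out of URLs like those in rows 6 and 9 of the table above, the base R snippet below extracts and decodes a single query parameter. The helper name get_query_param is hypothetical, and the parameter names q (Google) and k (Amazon) are simply read off the example URLs; real web logs will likely need more robust parsing.

```r
urls <- c(
  "https://www.google.com/search?q=wahl+o+mat+hamburg+2020",
  "https://www.amazon.de/s?k=mostly+harmless+econometrics",
  "https://www.facebook.com/CDU/"
)

# Return the decoded value of one query parameter, or NA if it is absent.
# `param` is inserted into a regular expression, so it should be a plain name.
get_query_param <- function(url, param) {
  pattern <- paste0("[?&]", param, "=([^&#]*)")
  matches <- regmatches(url, regexec(pattern, url))
  vapply(matches, function(m) {
    if (length(m) < 2) return(NA_character_)
    # "+" encodes a space in query strings; URLdecode handles %XX escapes
    utils::URLdecode(gsub("+", " ", m[2], fixed = TRUE))
  }, character(1), USE.NAMES = FALSE)
}

google_queries <- get_query_param(urls, "q")
amazon_queries <- get_query_param(urls, "k")
print(google_queries)
print(amazon_queries)
```

URLs without the requested parameter, such as the Facebook page in the third element, yield NA, which makes it easy to filter the log down to actual searches.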

We observe similar records regarding app use on mobile devices, as shown in the next table. Here, we observe the name of an app instead of a URL. Unfortunately, we cannot observe what users do inside the apps they use. That is, we do not know which articles they read when they open a newspaper app or which videos they watch on YouTube or Netflix. Yet, if somebody frequently opens newspaper apps, we might infer that this person may be interested in politics. Moreover, users who use an app called “Period Tracker” are likely female and we may even learn when they have their period by observing temporal variations in usage of this app. Similarly, somebody who uses an app that informs them about prayer times is likely religious. These are just a few examples that show how observing users’ web logs or app use can reveal information that users may perceive as sensitive personal information (see Bach et al. 2019 for a study on this topic).

Person ID	App	Timestamp	Duration of use
1	Facebook	2020-01-21 10:43:45	24
1	Kleiderkreisel	2020-01-22 11:43:45	32
1	WhatsApp Messenger	2020-01-23 23:43:01	45
1	Youtube	2020-01-24 08:21:45	3625
1	GMX Mail	2020-01-25 16:14:07	67
2	Period tracker	2020-01-26 23:54:45	23
2	Pedometer	2020-01-27 07:43:45	245
2	WhatsApp Messenger	2020-01-28 01:01:45	56
2	WhatsApp Messenger	2020-01-29 17:14:45	32
2	Youtube	2020-01-30 09:09:45	578
2	Netflix	2020-01-30 18:56:45	456

It is important to note that participants of commercial panels can usually pause data collection on their devices temporarily. This might happen if they do not feel comfortable having their activities recorded, e.g. when they do online banking or watch movies from illegal streams. However, our own data collections show large amounts of potentially sensitive information (such as adult content, gambling, and illegal streaming), which suggests that users do not make use of this possibility very often.

How to obtain web log and related data

As mentioned earlier, the easiest way to obtain web log and app use data is through commercial vendors who operate online access panels (“non-probability panels”). So far, there seem to be only a handful of providers, such as respondi AG (Germany, UK and France, for example), YouGov (UK and US, amongst others) and netquest (mostly Spain and Latin America). For an overview of providers, see, for example, Wakoopa Hub. Since such data have not been available for long, we do not know much about their quality (for an exception see Revilla, Ochoa, and Loewe 2017). In addition to accessing web logs and app use data, researchers can collect survey data for the same individuals. That is, by asking users questions we can enrich the web log data with important information about users’ socio-demographic characteristics, their voting behavior, or their political preferences. This information is usually not directly observable from the web logs and app use records, but likely makes the data much more valuable for social scientific research purposes. However, we should always keep in mind that due to the opt-in nature of these access panels, we need to be careful when making statements about the representativeness of our findings. That is, because most of our statistical estimators (e.g., for variances) rely on true probability sampling, we need to adjust our models and estimators, e.g. by using weights (for more information on non-probability sampling and non-probability panels see, e.g., Mercer et al. 2017; Cornesse et al. 2020).
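One simple way to construct such weights is post-stratification, sketched below in base R. This is only one of several possible adjustments and not necessarily the one used in the studies cited above. The panel data and the population shares are entirely artificial and made up for this illustration.

```r
# Artificial opt-in panel: younger users are over-represented
panel <- data.frame(
  age_group      = rep(c("18-39", "40+"), times = c(6, 4)),
  minutes_online = rep(c(100, 50), times = c(6, 4))
)

# Assumed (hypothetical) population shares of the two age groups
pop_share <- c("18-39" = 0.4, "40+" = 0.6)

# Share of each group in the panel
sample_share <- prop.table(table(panel$age_group))

# Post-stratification weight: population share divided by sample share
panel$weight <- unname(
  pop_share[panel$age_group] / as.numeric(sample_share[panel$age_group])
)

unweighted <- mean(panel$minutes_online)
weighted   <- weighted.mean(panel$minutes_online, panel$weight)
print(c(unweighted = unweighted, weighted = weighted))
```

Because the over-represented younger group spends more time online in this toy data, the weighted mean is lower than the unweighted one. Weights of this kind only correct for the variables used to build them, so they do not turn an opt-in panel into a probability sample.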

Another way to obtain data on users’ online activities is through Google Trends. Briefly speaking, this service allows the estimation of the popularity of search queries in Google Search across various regions and languages. Several R packages allow users to automatically extract data from Google Trends; see, for instance, gtrendsR. Further details on Google Trends can be found here. A major drawback of these data is that they do not come with additional information about users and only provide information at aggregate levels. That is, one can learn only about the popularity of a specific search query compared to the popularity of other search queries, which is arguably much less informative than individual-level web logs.

A third way to obtain web log, app, and search query data is through developing one’s own research app. A few projects using this approach were launched in recent years. One of the most prominent examples is the IAB-SMART project (for details, see Kreuter et al. 2019). Researchers of the Institute for Employment Research (IAB) in Nuremberg, Germany, developed an app that has since been downloaded by several hundred respondents of the IAB’s panel survey “Labour Market and Social Security”. The app collects information on respondents’ smartphone activities and can access even more information than the data sources described above (including geolocation and address books). While this approach is likely the most elaborate, it is also the most difficult and most costly one to implement: Researchers need to program their own tools and recruit participants on their own or piggyback on an existing study.

How to handle “big data”

Finally, a note on useful tools and prerequisites for analyzing web logs and records of smartphone use. First of all, the amount of data can quickly exceed the computational power of a standard desktop computer. Four months of web log data used in Bach et al. (2019), for example, contained about 38 million observations. Working with data of this size, researchers may have to consider using remote computing services like Digital Ocean, Amazon AWS or Microsoft Azure, which offer computational resources for little money through virtual servers. Second, understanding URL contents by observing single URLs is straightforward. Analyzing thousands of URLs, however, requires text mining and natural language processing (NLP) techniques if one wants, for example, to select only those URLs that point to news articles. Moreover, in addition to analyzing the title of a news article (which can often be observed from the URL alone), one might also want to analyze the whole content of the article. In such cases, in addition to being able to automatically extract the topic of an article through NLP techniques, knowing how to scrape website contents will likely also be helpful. Some useful materials are linked below.
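As a first step before any text mining, one might simply flag news visits by their domain. The sketch below does this in base R on four URLs taken from the artificial web logs above. Note that this is a plain domain lookup, not an NLP technique; the helper get_host and the short list news_domains are hypothetical and would have to be replaced by a curated list in a real project.

```r
logs <- data.frame(
  person_id = c(1, 1, 2, 2),
  url = c(
    "https://www.wetter.de/deutschland/wetter-mannheim-18224779.html?q=mannheim",
    "https://www.zeit.de/wirtschaft/unternehmen/2020-01/ifo-index-geschaeftsklima-deutsche-wirtschaft-konjunktur",
    "https://www.sueddeutsche.de/politik/bundeswehr-wehrbeauftragter-bericht-1.4774621",
    "https://www.wahl-o-mat.de/hamburg2020/"
  ),
  duration = c(24, 67, 245, 578),
  stringsAsFactors = FALSE
)

# Reduce a URL to its host name without scheme, path, and leading "www."
get_host <- function(url) {
  host <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                    # drop path, query, fragment
  sub("^www\\.", "", tolower(host))
}

# Hypothetical list of news domains, for illustration only
news_domains <- c("zeit.de", "sueddeutsche.de")

logs$host    <- get_host(logs$url)
logs$is_news <- logs$host %in% news_domains

# Total time (in seconds) each person spent on news sites
news_time <- aggregate(duration ~ person_id, data = logs[logs$is_news, ], FUN = sum)
print(news_time)
```

With millions of rows, the same logic would typically be run on a remote server or with a more memory-efficient data structure, but the steps stay the same.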

About the presenter

Ruben Bach is a postdoctoral researcher at the University of Mannheim, focusing on social science quantitative research methods. His interests include topics related to big data in the social sciences, machine learning, causal inference, and survey research.

References

Bach, R. L., C. Kern, A. Amaya, F. Keusch, F. Kreuter, J. Heinemann, and J. Hecht. 2019. “Predicting Voting Behavior Using Digital Trace Data.” Social Science Computer Review. https://doi.org/10.1177/0894439319882896.

Chancellor, S., and S. Counts. 2018. “Measuring Employment Demand Using Internet Search Data.” In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–14. CHI ’18. New York, NY, USA: ACM.

Cornesse, C., A. G. Blom, D. Dutwin, J. A. Krosnick, E. D. De Leeuw, S. Legleye, J. Pasek, et al. 2020. “A Review of Conceptual Approaches and Empirical Evidence on Probability and Nonprobability Sample Survey Research.” Journal of Survey Statistics and Methodology. https://doi.org/10.1093/jssam/smz041.

Dvir-Girsman, S. 2017. “Media Audience Homophily: Partisan Websites, Audience Identity and Polarization Processes.” New Media & Society 19 (7): 1072–91.

Flaxman, Seth, Sharad Goel, and Justin M Rao. 2016. “Filter Bubbles, Echo Chambers, and Online News Consumption.” Public Opinion Quarterly 80 (S1): 298–320.

Fourney, Adam, Ryen W. White, and Eric Horvitz. 2015. “Exploring Time-Dependent Concerns About Pregnancy and Childbirth from Search Logs.” In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 737–46. CHI ’15. New York, NY, USA: ACM. https://doi.org/10.1145/2702123.2702427.

Ginsberg, Jeremy, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457 (7232): 1012–4.

Guess, Andrew M, Brendan Nyhan, and Jason Reifler. 2020. “Exposure to Untrustworthy Websites in the 2016 US Election.” Nature Human Behaviour, 1–9.

Hinds, J., and A. N. Joinson. 2018. “What Demographic Attributes Do Our Digital Footprints Reveal? A Systematic Review.” PLoS One 13: 1–40.

Kreuter, Frauke, Georg-Christoph Haas, Florian Keusch, Sebastian Bähr, and Mark Trappmann. 2019. “Collecting Survey and Smartphone Sensor Data with an App: Opportunities and Challenges Around Privacy and Informed Consent.” Social Science Computer Review. https://doi.org/10.1177/0894439318816389.

Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–5.

Mercer, Andrew W., Frauke Kreuter, Scott Keeter, and Elizabeth A. Stuart. 2017. “Theory and Practice in Nonprobability Surveys: Parallels between Causal Inference and Survey Inference.” Public Opinion Quarterly 81 (S1): 250–71. https://doi.org/10.1093/poq/nfw060.

Möller, Judith, Robbert Nicolai van de Velde, Lisa Merten, and Cornelius Puschmann. 2019. “Explaining Online News Engagement Based on Browsing Behavior: Creatures of Habit?” Social Science Computer Review, 0894439319828012. https://doi.org/10.1177/0894439319828012.

Peterson, Erik, Sharad Goel, and Shanto Iyengar. 2018. “Echo Chambers and Partisan Polarization: Evidence from the 2016 Presidential Campaign.”

Revilla, Melanie, Carlos Ochoa, and Germán Loewe. 2017. “Using Passive Data from a Meter to Complement Survey Data in Order to Study Online Behavior.” Social Science Computer Review 35 (4): 521–36.

Stephens-Davidowitz, Seth. 2014. “The Cost of Racial Animus on a Black Candidate: Evidence Using Google Search Data.” Journal of Public Economics 118: 26–40.

White, Ryen W., and Eric Horvitz. 2009. “Cyberchondria: Studies of the Escalation of Medical Concerns in Web Search.” ACM Trans. Inf. Syst. 27 (4): 23:1–23:37. https://doi.org/10.1145/1629096.1629101.

Shiny Apps: Development and Deployment

Tue, 17 Dec 2019 01:00:00 +0100

Shiny Apps allow developers and researchers to easily build interactive web applications within the environment of the statistical software R. Using these apps, R users can interactively communicate their work to a broader audience. In this Methods Bites Tutorial, Konstantin Gavras and Nick Baumann present a comprehensive recap of Konstantin Gavras’ (University of Mannheim) workshop materials to illustrate how Shiny Apps enable vivid data presentation and how useful they are as an analytical tool.

After reading this blog post and engaging with the applied examples, readers should:

be able to retrace the logic behind the structure of Shiny Apps,
be able to build their own Shiny App, and
be able to deploy their Shiny Apps and run them on the World Wide Web.

Note: This blog post provides a summary of Konstantin’s workshop in the MZES Social Science Data Lab with some adaptations. Konstantin’s original workshop materials, including slides and scripts, are available from our GitHub. A recording of the workshop is available on Youtube.

Shiny Apps and Data Presentation
The Structure of Shiny Apps
Developing Shiny Apps
Deployment
1. Deployment using shinyapps.io
2. Deployment using Shiny Server
Conclusion

Shiny Apps and Data Presentation

Data are usually collected in a raw format and, no matter how well processed and analyzed, results should be presented in a plain and simple way in order to make them accessible for everyone. Unfortunately, the potential of effective data presentation is all too often not fully exploited. Yet, presentation is crucial as it illustrates the actual outcome of research. If results are not effectively visualized, we waste a great opportunity to build credibility, attract and sustain the interest of readers, and make large amounts of information easily accessible. Here, we introduce Shiny Apps as a powerful and highly flexible tool for interactive data presentation on the World Wide Web.

Shiny is an R package that offers cost-free tools for building web applications directly in R. It was developed by Joe Cheng to serve as a reactive web framework for R that allows calculations, display of R objects, and the presentation of results. Since Shiny Apps come with an extensive back-end setup, users do not need extensive web development skills to build and host standalone apps on a homepage. However, for those keen on bringing their apps to perfection, Shiny Apps allow for CSS, HTML and JavaScript extensions. Shiny Apps can be used either for data presentation, as a communication tool for results, or even as an interactive analytical tool.

In what follows, we introduce the Shiny environment and guide readers through the development of Shiny Apps. Using the famous Kaggle titanic data set, we draw the distinction between the front-end ui.R and back-end server.R, which are required to build Shiny apps. Following this, we introduce important concepts and features to build an interactive app, including control widgets, reactivity, and rendering.

The Structure of Shiny Apps

Shiny applications have two components. The front-end builds the webpage that is actually shown to the user. As already mentioned, the HTML page is written by Shiny itself and includes layout, appearance, and design features. In Shiny terminology, this is called the ui, which stands for user interface. The ui file contains R functions that are then translated into an HTML file. The other component is the back-end, which includes the code for producing the app’s contents (e.g. functions or data import, management, and analysis). Here, we create the objects that are later shown on the front-end. In Shiny terminology this is called the server.

Ways of Setting Up a Shiny App

Shiny Apps can be set up in two different ways.

1. Single-file App: In a single-file app, ui and server are stored in one script. In this case, we create a file named app.R that contains both the server and UI components. This technique is used when developing very simple Shiny Apps and lacks some advantages of the alternative two-files App method.

2. Two-files App: Here, ui and server are stored in two separate scripts, which implies a clear separation between front-end and back-end. This method is preferable when developing more advanced Shiny Apps. It is important that the files are named ui.R and server.R and are stored together in a folder of their own. In this tutorial, we are going to develop Shiny Apps using the two-files method.
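The division of labor between the two components can be sketched with a deliberately tiny app. The example below is not taken from the workshop materials: the input ID n, the output ID hist, and the histogram of random numbers are placeholders chosen for illustration. For brevity, both components are defined in one script; in a two-files app, the ui object would be the content of ui.R and the server function the content of server.R.

```r
library(shiny)

# Front-end: layout and widgets (the content of ui.R in a two-files app)
ui <- fluidPage(
  titlePanel("Minimal Shiny sketch"),
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Back-end: creates the objects shown on the front-end
# (the content of server.R in a two-files app)
server <- function(input, output) {
  output$hist <- renderPlot({
    # Re-runs automatically whenever the slider value input$n changes
    hist(rnorm(input$n), main = "Random draws", xlab = "Value")
  })
}

# Combine both components into an app object
app <- shinyApp(ui = ui, server = server)

# To launch the app in an interactive R session:
# runApp(app)
```

The slider is a control widget, renderPlot() is the rendering function, and the link between input$n and output$hist is what Shiny calls reactivity.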

Developing Shiny Apps

Building a Shiny App from Scratch

To set up, make sure that all required packages are installed and subsequently load the shiny, tidyverse, and plotly packages along with R’s example data set titanic.

Code: R packages used in this tutorial

Advancing Text Mining with R and quanteda

Thu, 17 Oct 2019 00:00:00 +0100

Everyone is talking about text analysis. Is it puzzling that this data source is so popular right now? Actually no. Most of our datasets rely on (hand-coded) textual information. Extracting, processing, and analyzing this oasis of information becomes increasingly relevant for a large variety of research fields. This Methods Bites Tutorial by Cosima Meyer summarizes Cornelius Puschmann’s workshop in the MZES Social Science Data Lab in January 2019 on advancing text mining with R and the package quanteda. The workshop offered guidance through the use of quanteda and covered various classification methods, including classification with known categories (dictionaries and supervised machine learning) and with unknown categories (unsupervised machine learning).

This post was updated in December 2020 to be consistent with quanteda’s version 2.1.2. For more information on differences between quanteda versions, have a look at this excellent overview.

Overview

What is quanteda?
How do we use quanteda?
Classification
1. Known categories
  1. Dictionaries
  2. Supervised machine learning
    1. Naive Bayes (NB)
2. Unknown categories
  1. Unsupervised machine learning
Further readings

This blog post is based on this report and on Cornelius’ post on topic models in R.

What is quanteda?

In order to analyze text data, R has several packages available. In this blog post we focus on quanteda. quanteda is one of the most popular R packages for the quantitative analysis of textual data. It is fully featured and allows the user to easily perform natural language processing tasks. It was originally developed by Ken Benoit and other contributors. It offers extensive documentation and is regularly updated. quanteda is most useful for preparing data that can then be further analyzed using unsupervised/supervised machine learning or other techniques. A combination with tidyverse leads to a more transparent code structure and opens up a variety of further application areas that could not be addressed within the limited time of the workshop (e.g., scaling models, part-of-speech (POS) tagging, named entities, word embeddings, etc.).

There are also similar R packages such as tm, tidytext, and koRpus. tm has a simpler grammar but slightly fewer features, tidytext is very closely integrated with dplyr and well documented, and koRpus is good for tasks such as part-of-speech (POS) tagging.

How do we use quanteda?

Most analyses in quanteda require three steps:

1. Import the data

The data that we usually use for text analysis is available in text formats (e.g., .txt or .csv files).

2. Build a corpus

After reading in the data, we need to generate a corpus. A corpus is a type of dataset that is used in text analysis. It contains “a collection of text or speech material that has been brought together according to a certain set of predetermined criteria” (Shmelova et al. 2019, p. 33). These criteria are usually set by the researchers and are in concordance with the guiding question. For instance, if you are interested in analyzing speeches in the UN General Debate, these predetermined criteria are the time and scope conditions of these debates (speeches by countries at different points in time).

3. Calculate a document-feature matrix (DFM)

Another essential component for text analysis is a document-feature matrix (DFM), also called a document-term matrix (DTM). The two terms are synonyms, but quanteda refers to a DFM whereas others refer to a DTM. A DFM describes how frequently terms occur in the corpus by counting single terms. To generate a DFM, we first split the text into its single terms (tokens). We then count how frequently each term (token) occurs in each document.

The following graphic describes visually how we turn raw text into a vector-space representation that is easily accessible and analyzable with quantitative statistical tools. It also visualizes how we can think of a DFM. The rows represent the documents that are part of the corpus and the columns show the different terms (tokens). The values in the cells indicate how frequently these terms (tokens) are used across the documents.

Figure 1: Model of a DFM

Important things to remember about DFMs:

A corpus is positional (string of words) and a DFM is non-positional (bag of words). Put differently, the order of the words matters in a corpus whereas a DFM does not have information on the position of words.
A token is each individual word in a text (but it could also be a sentence, paragraph, or character). This is why creating a “bag of words” is also called tokenizing text. In a nutshell, a DFM is a very efficient way of organizing the frequency of features/tokens but does not contain any information on their position. In our example, the features of a text are represented by the columns of a DFM and aggregate the frequency of each token.
In most projects you want one corpus to contain all your data and generate many DFMs from that.
The rows of a DFM can contain any unit on which you can aggregate documents. In the example above, we used the single documents as the unit. It may also well be more fine-grained with sub-documents or more aggregated with a larger collection of documents.
The columns of a DFM are any unit on which you can aggregate features. Features are extracted from the texts and quantitatively measurable. Features can be words, parts of the text, content categories, word counts, etc. In the example above, we used single words such as “united”, “nations”, and “peace”.
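The three steps and the bag-of-words idea can be sketched on a toy example. The three one-sentence documents below are made up for illustration and are not taken from the UN General Debate data; the code assumes a quanteda version in which tokens are created with tokens() before calling dfm().

```r
library(quanteda)

# Step 1: "import" the data; here, three made-up one-sentence documents
texts <- c(
  doc1 = "The United Nations promotes peace.",
  doc2 = "Peace requires united action.",
  doc3 = "Nations debate peace and security."
)

# Step 2: build a corpus (the word order within each text is still preserved)
corp <- corpus(texts)

# Step 3: split the texts into tokens, then count tokens per document
toks    <- tokens(corp, remove_punct = TRUE)
dfm_toy <- dfm(toks)  # lowercases tokens by default

# Rows are documents, columns are features, cells are counts
print(dfm_toy)
counts <- as.matrix(dfm_toy)
print(counts[, c("united", "nations", "peace")])
```

The printed matrix no longer contains any information about word order, which is exactly the difference between the positional corpus and the non-positional DFM described above.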

To showcase the three steps introduced above, we are using the UN General Debate data by Mikhaylov, Baturo, and Dasandi. There is also a pre-processed version of the dataset accessible with quanteda.corpora.

How to access the UNGD data with quanteda.corpora

Studying Politics on and with Wikipedia

Mon, 26 Aug 2019 01:00:00 +0100

The online encyclopedia Wikipedia, together with its sibling, the collaboratively edited knowledge base Wikidata, provides incredibly rich yet largely untapped sources for political research. In this Methods Bites Tutorial, Denis Cohen and Nick Baumann offer a hands-on recap of Simon Munzert’s (Hertie School of Governance) workshop materials to show how these platforms can inform research on public attention dynamics, policies, political and other events, political elites, and parties, among other things.

After reading this blog post and engaging with the applied exercises, readers should:

be able to collect Wikipedia data and Wikidata items using R
be able to conduct explorative analyses of Wikipedia data using R
have a basic intuition of the potentials and limitations of using Wikipedia data in research projects

Note: This blog post provides a summary of Simon’s workshop in the MZES Social Science Data Lab with some adaptations. Simon’s original workshop materials, including slides and scripts, are available from our GitHub.

Wikipedia for Political Research
Collecting and Analyzing Wikipedia Data
Collecting Data via Wikidata Queries
legislatoR
1. Application 1: Social Media Adoption Rates
2. Application 2: Public Attention to Members of the German Bundestag
Conclusion
About the Presenter
References

Wikipedia for Political Research

According to its website, “Wikipedia […] is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content”. As of July 2019, it comprises more than 48 million articles and is ranked sixth in the list of the most frequently visited websites.

Wikipedia harbors numerous types of data. These include both article contents as well as meta information such as pageviews, clickstreams, links and backlinks, or edits and revision histories. Additionally, Wikipedia’s sibling, the collaboratively edited document-oriented data base Wikidata, provides access to over 58 million data items (as of July 2019). Given the broad collection of articles on politicians and institutions from all over the world, Wikipedia offers tremendous potential for (comparative) political research.

In what follows, we will introduce the functionalities of various R packages, including WikipediR, WikidataR, and pageviews. In doing so, we will showcase how to connect to Wikipedia and Wikidata APIs, how to efficiently access and parse content, and how to process the retrieved data in order to address various questions of substantive interest. We will also provide an overview of the legislatoR package, a fully relational individual-level data package that comprises political, sociodemographic, and Wikipedia-related data on elected politicians from various consolidated democracies.

Code: R packages used in this tutorial

Quantitative Analysis of Political Text

Mon, 22 Jul 2019 00:00:00 +0100

How can we infer actors’ positions, substantive topics, or sentiments from (political) texts? This Methods Bites Tutorial by Julian Bernauer summarizes Denise Traber’s workshop in the MZES Social Science Data Lab in Spring 2018. Using exemplary sets of political documents (election manifestos and coalition agreements), it showcases tools of QTA for a variety of analytical objectives and demonstrates how to create, process, and analyse a text corpus through a series of hands-on applications.

After reading this blog post and engaging with the applied exercises, readers should:

be able to perform some basic preprocessing of text
be able to estimate the sentiment of texts
be able to find topics in texts
be able to estimate (scale) positions of texts

You can use these links to navigate across the main sections of this tutorial:

A tour of Quantitative Text Analysis
(Pre-)processing text
A small coalition corpus
Sentiment analysis using a dictionary
LDA topic modeling
Wordfish scaling
Estimating intra-party preferences: Comparing speeches to votes
Further readings

Note: This blog post presents Denise’s workshop materials in condensed form. The complete workshop materials, including slides and scripts, are available from our GitHub.

A tour of Quantitative Text Analysis

The workshop started with a few basics: While QTA can be efficient and cheap, it always relies on an incorrect model of language. It does not free us from reading texts, and validation is key. We learned about the basic distinction between classification (organizing text into categories) and scaling (estimating positions of actors), and their supervised (where hand-coded or other external data is available) and unsupervised (without such data) variants.

(Pre-)processing text

We relied on the R package quanteda developed by Ken Benoit and collaborators, which has taken QTA by storm, at least for those working in R. Together with the readtext package, it makes it easy to get your text data into R, create a so-called “corpus” of texts with the actual content as well as meta-information, and perform various tasks of corpus and text processing (subsetting a corpus, creating a document-feature matrix (dfm), stopword removal) as well as analysis (scaling, classification). A large and increasing number of extras is also available, such as ways to assess text similarity (function textstat_simil()) and lexical diversity (textstat_lexdiv()). Some of these features are demonstrated in an example below. Also see this overview by quanteda for a full list of functions and the preText package for advice on evaluating pre-processing specifications.

A small coalition corpus

For a few examples from the workshop, consider a small set of three documents: The coalition agreement between the CDU/CSU and the SPD as well as the respective election manifestos from the 2017 Bundestag election. The corpus is created by:

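# Load packages for reading text files and for text analysis
library(readtext)
library(quanteda)
# Read all .txt files; wd must already hold the path to the parent folder
text <- readtext(paste0(wd, "coalition/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = "Party")
# Replace line breaks with spaces
text$text <- gsub("\n", " ", text$text)
# Turn the data frame into a corpus and add a source note
coalitioncorpus <- corpus(text, docid_field = "doc_id")
coalitioncorpus$metadata$source <- "[directory] on [system] by [user]"
summary(coalitioncorpus)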

## Corpus consisting of 3 documents:
## 
##           Text Types Tokens Sentences     Party
##     cducsu.txt  4738  26004      1288    cducsu
##  coalition.txt 11660  93214      3763 coalition
##        spd.txt  7650  50298      2402       spd
## 
## Source: [directory] on [system] by [user]
## Created: Wed Nov 11 15:06:41 2020
## Notes:

The code relies on two packages, readtext and quanteda, to create a data frame from the text files, using the file names for a document-level variable called “Party”. The gsub() command replaces line breaks with spaces, and corpus() turns the data frame into a corpus, an object that holds the texts together with meta-information and document-level variables and is optimized for a variety of quantitative text analysis operations in quanteda.

Further document-level variables are added via:

docvars(coalitioncorpus, "Year") <- 2017
docvars(coalitioncorpus, "Party_regex") <- 
  sub("[\\.].*", "", names(texts(coalitioncorpus)))
docvars(coalitioncorpus)

##                   Party Year Party_regex
## cducsu.txt       cducsu 2017      cducsu
## coalition.txt coalition 2017   coalition
## spd.txt             spd 2017         spd
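
As a quick check of the pattern behind the Party_regex column, the following minimal sketch applies the same sub() call to a hand-typed vector of file names. The vector filenames is an assumption made for illustration; it simply mirrors the document names shown in the output above.

```r
# File names typed by hand for illustration (they mirror the corpus document names)
filenames <- c("cducsu.txt", "coalition.txt", "spd.txt")

# "[\\.].*" matches the first literal dot and everything that follows it;
# sub() replaces that match with an empty string, leaving only the party name
party <- sub("[\\.].*", "", filenames)
party
```

Because the pattern only looks for the first dot, it works regardless of the file extension.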

Note that this uses a regular expression (regex) as an alternative way to retrieve the party names from the filenames after creating the corpus. For specific analyses, we want to know the distribution of words across documents and create a document-feature matrix (dfm):

dfm_coal <- dfm(
  coalitioncorpus,
  remove = c(stopwords("german"),
             "dass",
             "sowie",
             "insbesondere"),
  remove_punct = TRUE,
  stem = FALSE
)
dfm_coal[, 1:8]

## Document-feature matrix of: 3 documents, 8 features (16.7% sparse).
## 3 x 8 sparse Matrix of class "dfm"
##                features
## docs            gutes land zeit deutschland liebens lebenswertes gut
##   cducsu.txt        6   48   11         147       1            1  16
##   coalition.txt     1   39   14         195       0            0  13
##   spd.txt           6   44   33          97       0            0  17
##                features
## docs            wohnen
##   cducsu.txt         1
##   coalition.txt     10
##   spd.txt            6

Creating a dfm induces a bag-of-words assumption. This means that the order in which words appear is ignored. A dfm is a means of information reduction and the most efficient way of storing text as data, but allows only analyses under this assumption. We quickly glance at the similarity (function textstat_simil()) and lexical diversity (function textstat_lexdiv()) of texts:

simil <- textstat_simil(dfm_coal,
                        margin = "documents",
                        method = "correlation")
simil

## textstat_simil object; method = "correlation"
##               cducsu.txt coalition.txt spd.txt
## cducsu.txt         1.000         0.968   0.975
## coalition.txt      0.968         1.000   0.986
## spd.txt            0.975         0.986   1.000

textstat_lexdiv(dfm_coal)[, 1:2]

##        document       TTR
## 1    cducsu.txt 0.3302084
## 2 coalition.txt 0.2487993
## 3       spd.txt 0.2802713

From this, we learn that the SPD manifesto is more similar to the coalition agreement (0.986) than the CDU/CSU manifesto is (0.968), a finding that somewhat resembles assessments of the 2017 German coalition. Also, the lexical diversity of the manifestos, measured as types (different words) per token (total words), is higher than that of the coalition agreement, especially for the CDU/CSU.
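
To make the bag-of-words assumption and the two statistics more tangible, here is a minimal base R sketch on three invented toy sentences (they are not taken from the corpus). It builds a small count matrix by hand, shows that word order is lost, and computes a type-token ratio and a correlation-based similarity in the spirit of textstat_lexdiv() and textstat_simil(). All object names are hypothetical.

```r
# Three toy "documents" (invented for illustration)
docs <- c(a = "wir wollen gute arbeit und gute bildung",
          b = "gute bildung und gute arbeit wollen wir",
          c = "sicherheit und ordnung und wohlstand")

# Tokenize on single spaces
tokens <- strsplit(docs, " ", fixed = TRUE)

# Build a documents-by-features matrix of raw word counts
vocab <- sort(unique(unlist(tokens)))
dfm_toy <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))

# Bag of words: a and b differ only in word order, so their rows are identical
identical(dfm_toy["a", ], dfm_toy["b", ])

# Type-token ratio: number of distinct words divided by total number of words
ttr <- rowSums(dfm_toy > 0) / rowSums(dfm_toy)
ttr

# Pearson correlation between the count vectors of each pair of documents
simil_toy <- cor(t(dfm_toy))
round(simil_toy, 3)
```

Since documents a and b contain the same words in a different order, the count matrix cannot tell them apart, which is exactly what the bag-of-words assumption implies.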

Sentiment analysis using a dictionary

For sentiment analysis, existing dictionaries are available. It is important to note that these do not necessarily fit the research question at hand. In this example, the German “LIWC” (linguistic inquiry and word count) dictionary is used, but alternatives exist, such as “Lexicoder” for political text. LIWC features the categories “anger”, “posemo” (positive emotion) and “religion”. After obtaining the dictionary and applying it while creating a dfm from the corpus, the share of the texts in the respective categories is displayed. The results indicate that the coalition agreement features fewer positive emotions than the manifestos and that the SPD manifesto is the most “angry” text, while the CDU/CSU speaks most about religion.
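
The logic of a dictionary lookup can be sketched in base R without any dictionary file. The example below is only an illustration: toy_dict is a tiny invented word list (not the LIWC dictionary), and the two toy texts are made up. It computes, for each text, the percentage of tokens that fall into each category.

```r
# Hypothetical mini-dictionary (NOT the LIWC dictionary; entries are invented)
toy_dict <- list(
  posemo   = c("gut", "freude", "erfolg"),
  anger    = c("wut", "zorn"),
  religion = c("kirche", "glaube")
)

# Toy texts, already lower-cased and free of punctuation
toy_texts <- c(doc1 = "gut gut erfolg kirche und wut",
               doc2 = "freude und glaube und kirche")

# Tokenize on single spaces
tokens <- strsplit(toy_texts, " ", fixed = TRUE)

# Percentage of tokens per document that belong to each dictionary category
shares <- sapply(toy_dict, function(words) {
  sapply(tokens, function(tok) 100 * mean(tok %in% words))
})
round(shares, 1)
```

Real dictionaries often contain wildcard entries and multi-word expressions, which this exact-match sketch deliberately ignores.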

Code: Using a Dictionary

Collecting and Analyzing Twitter Data Using R

Mon, 15 Jul 2019 01:01:01 +0100

How do you access Twitter’s API, collect a stream of tweets, and analyze the retrieved data? Which potentials, challenges, and limitations for social scientific research come along with using Twitter data? This Methods Bites Tutorial by Denis Cohen, based on a workshop by Simon Kühne (Bielefeld University) in the MZES Social Science Data Lab in Spring 2019, aims to tackle these questions.

After reading this blog post and engaging with the applied exercises, readers should:

be able to collect Twitter data using R
be able to perform explorative analyses of the data using R
have a better understanding of Twitter data, and thus, the potentials and limitations of using it in research projects

You can use these links to navigate across the four main sections of this tutorial:

About Twitter
Collecting Twitter Data
Analyzing Twitter Data
Potential Issues and Challenges

Note: This blog post presents Simon’s workshop materials in condensed form. The complete workshop materials, including slides and scripts, are available from our GitHub.

About Twitter

Twitter is an online news and social networking service, also used for micro-blogging. In everyday use, it mostly serves as a platform for publicly sharing short texts – often along with media content and/or links – in the form of so-called “tweets”. Twitter has approx. 326 million monthly active users who send about 500 million tweets each day (see this fact sheet).

The Basics

Each user has a profile (page) and can add a photo and information about themselves
Users can follow each other
Users can tweet, i.e., publicly share a text/photo/link
Each Tweet is restricted to a maximum of 280 characters
Users can interact with a Tweet via comments (replies), likes, and shares (retweets)
Users can interact with other users via direct messaging
Users can create a thread: A series of connected tweets
Users use hashtags (e.g., #mannheim) in order to associate their tweets with certain topics and to make them easier to find
Users can search for keywords/hashtags in order to find relevant tweets and users

Collecting Twitter Data

The Twitter API Platform

An API (Application Programming Interface) allows users to access (real-time) Twitter data. Twitter offers a variety of API services – some for free, others not. Using these services, one can search for tweets published in the past, stream tweets in realtime, and manage Twitter accounts and ads. The following exercises will focus on the free-of-charge API service, which is used in the vast majority of research projects: The Realtime Streaming API.

The Streaming API - Collecting Tweets in Realtime

“Establishing a connection to the streaming APIs means making a very long lived HTTP request, and parsing the response incrementally. Conceptually, you can think of it as downloading an infinitely long file over HTTP”. This way, we can receive up to a maximum of 1% of all tweets worldwide. As a query is usually specified by selected keywords or geographic areas, you will be able to collect (almost) all relevant tweets for your research interest. There are three filter parameters that you can use:

‘Follow’: Receive tweets of up to 5,000 users
‘Track’: Receive tweets that contain up to 400 keywords
‘Location’: Receive tweets from within a set of up to 25 geographic bounding boxes
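
Before opening a stream, it can help to check a planned query against the three limits listed above. The following is a minimal sketch only: check_stream_filter() is a hypothetical helper written for this illustration (it is not part of rtweet or the Twitter API), and it assumes that each bounding box is given as four coordinates.

```r
# Limits as stated in the list above
stream_limits <- c(follow = 5000, track = 400, locations = 25)

# Hypothetical helper: compares a planned filter with the limits
check_stream_filter <- function(follow = character(), track = character(),
                                locations = list()) {
  # Assumption: each box is c(west_lon, south_lat, east_lon, north_lat)
  stopifnot(all(lengths(locations) == 4))
  used <- c(follow = length(follow), track = length(track),
            locations = length(locations))
  data.frame(parameter = names(used),
             used = unname(used),
             limit = unname(stream_limits[names(used)]),
             ok = unname(used <= stream_limits[names(used)]))
}

# Example: two keywords and one rough bounding box around Germany
check_stream_filter(
  track = c("#mannheim", "bundestag"),
  locations = list(c(6, 47.3, 15, 55))
)
```

The returned data frame has one row per filter parameter, so a query that exceeds a limit shows up as ok = FALSE in the corresponding row.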

API Authentication

You need to authenticate yourself when making requests to the Twitter API. Twitter uses the OAuth protocol, an “open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications”. This involves the following steps:

creating a Twitter account
logging in to your Twitter account via https://developer.twitter.com/
creating an app
creating keys, access token & secret

Data Collection

There are a number of ways to collect Twitter data, including writing your own script to make continuous HTTP requests, Python’s tweepy package, and R’s rtweet package. The following demonstrates how to collect Twitter data using different Streaming API endpoints and the rtweet package.

Collecting Tweets Using the rtweet Package

As usual, we start with a little housekeeping: Installing required packages via install.packages() and specifying a working directory using setwd().

Code: Setup

Visual Inference with R

Sun, 14 Jul 2019 03:03:03 +0100

How can we use data visualization for hypothesis testing? This question lies at the heart of this Methods Bites Tutorial by Cosima Meyer, which is based on Richard Traunmüller’s workshop in the MZES Social Science Data Lab in Fall 2017. We already covered the basic idea of visual inference in our blog post on Data visualization with R.

Note: This blog post presents Richard’s workshop materials in condensed form. The complete workshop materials are available from our GitHub.

Overview

What is visual inference?
Potential challenges and how to overcome them
Practical applications: How do we reveal the “true” data graphically? A step-by-step guide
Further readings

What is visual inference?

Visual inference uses our ability to detect graphical anomalies. The idea of formal testing remains the same in visual inference – with one exception: The test statistic is now a graphical display which is compared to a “reference distribution” of plots showing the null. Put differently, we plot both the “true pattern of the data” and additional random plots of our data. By comparing both, we should be able to identify the true data – if the pattern is not based on randomness. This approach can be applied to various (research) situations – some of them are described in the “Practical applications” section.

Potential challenges and how to overcome them

Major concerns related to exploratory data analysis are its seemingly informal approach to data analysis and the potential over-interpretation of patterns. Richard provides a line-up protocol for overcoming these concerns:

1. Identify the question the plot is trying to answer or the pattern it is intended to show.
2. Formulate a null hypothesis (usually this will be \(H_0\): “There is no pattern in the plot.”).
3. Generate and visualize null datasets (e.g., permutations of variable values, random simulations).

The following examples illustrate this procedure and explain the steps in detail.

Practical applications: How do we reveal the “true” data graphically? A step-by-step guide

To reveal the “true” data, we may use several visual approaches. In the following, we present three different examples: 1) maps, 2) scatter plots, and 3) group comparisons. The underlying logic follows the line-up protocol described above. To produce the visual inference, we always apply the following steps:

1. Identify the question: ‘Is there a visual pattern?’
2. Formulate a null hypothesis: ‘There is no visual pattern.’
3. Generate null datasets: Just randomly permute one variable column and plot the data.
4. Add the “true” data: Add the true data to the null datasets.
5. Visual inference: Is there a visual difference between the randomly permuted data and the “true” data?
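
The five steps can be sketched end to end with simulated data. The example below is a minimal illustration under stated assumptions: x and y are simulated (they stand in for any two variables), and make_lineup() is a hypothetical helper written for this sketch. Unlike the workshop code further down, which places plots via layout(), it stores the position of the true data explicitly.

```r
set.seed(42)  # only for reproducibility of this illustration

# Simulated data with a real association between x and y
n <- 100
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)

# Hypothetical helper: returns 20 versions of y (19 permuted, 1 true)
# together with the randomly drawn position of the true data
make_lineup <- function(y, n_plots = 20) {
  true_pos <- sample(n_plots, 1)
  ys <- lapply(seq_len(n_plots), function(i) {
    if (i == true_pos) y else sample(y)  # sample(y) randomly permutes y
  })
  list(ys = ys, true_pos = true_pos)
}

lineup <- make_lineup(y)

# Draw the line-up in a 4x5 grid without marking the true plot
old_par <- par(mfrow = c(4, 5), mar = c(1, 1, 1, 1))
for (i in seq_along(lineup$ys)) {
  plot(x, lineup$ys[[i]], pch = 19, cex = .3,
       axes = FALSE, xlab = "", ylab = "")
  box()
}
par(old_par)

# Look at the plots first, then reveal the position of the true data
lineup$true_pos
```

Permuting y keeps its distribution intact but breaks any link to x, so each null plot shows what the scatter would look like if there were no association.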

1) Maps

This map provides an intuitive understanding of how to apply the line-up protocol to a real-world example. Richard uses data from the GLES (German Longitudinal Election Survey) as an example to analyze interviewer selection effects. These biases arise if interviewers selectively contact certain households and fail to reach others. One reason might be that interviewers try to avoid less comfortable areas.

As a first step, we need to load the required packages, read in the data, and code the interviewer behavior by color.

# Read all required packages
library(maps)
library(mapdata)
library(RColorBrewer)

# Read data
data <- readRDS("sub_data.rds")

# Code interviewer behavior by color
data$col <-
  ifelse(data$status == "No Contact", "maroon3", "darkolivegreen2")

Following the line-up protocol described above, we seek to answer the question of whether there is a visual pattern. Our null hypothesis assumes that there is no visual pattern. To generate the null datasets, we randomly permute one variable column and plot the data.

# Generate random plot placement
placement <- sample((1:20), 20)
layout(matrix(placement, 4, 5))

# Generate 19 null plots
par(mar = c(.01, .01, .01, .01), oma = c(0, 0, 0, 0))
for(i in 1:19) {
  # Randomize the order
  random <- sample(c(1:15591), 15591)
  # Plot
  map(
    # Refer to dataset
    database = "worldHires",
    fill = F,
    col = "darkgrey",
    # Range of x-axis
    xlim = c(6, 15),
    # Range of y-axis
    ylim = c(47.3, 55)
    ) 
  points(
    # Refer to data
    data$g_lon,
    data$g_lat,
    cex = .1,
    # Type of plotting symbol
    pch = 19,
    col = data$col[random]
    )
}

We then proceed and add the true data to the null datasets.

# Add the true plot
map(
  database = "worldHires",
  fill = F,
  col = "darkgrey",
  # Range of x-axis
  xlim = c(6, 15),
  # Range of y-axis
  ylim = c(47.3, 55)
  ) 
points(
  # Refer to data
  data$g_lon,
  data$g_lat,
  cex = .1,
  # Type of plotting symbol
  pch = 19,
  col = data$col
  )

# Reveal the true plot
box(col = "red", # Draw a box in red
    lty = 2, # Defines line type
    lwd = 2) # Defines line width
# Print the position of the true plot within the layout matrix
which(placement == 20)

Using the code above, we obtain twenty maps of Germany. In a last step, we ask if these plots are substantially different from one another. If yes, can you tell which one is the odd one out? Just wait for a few seconds to let the image reveal the answer.

2) Scatter plot

Mimicking the approach for the maps, we proceed in a similar way with scatter plots.

Assume we have two variables and want to plot their correlation with a scatter plot. To assess whether their relation is merely random, we can make use of visual inference. To do so, we first need to load all required packages and read in the data.

# Read required package
library(foreign) # Necessary to load datasets in other formats (such as .dta)

# Read the data
slop <- read.dta("slop_2009_agg_example.dta")

We then proceed and randomly place 20 plots within a 4x5 grid.

# Generate a random plot placement
placement <- sample((1:20), 20)
layout(matrix(placement, 4, 5))
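
To see how the random placement works, the following minimal sketch inspects the layout matrix without drawing anything. The seed and the object names grid and true_cell are assumptions added for illustration.

```r
set.seed(1)  # only for reproducibility of this illustration

placement <- sample(1:20, 20)    # random permutation of the plot numbers 1..20
grid <- matrix(placement, 4, 5)  # 4 rows, 5 columns, filled column by column
grid

# layout() draws the k-th plot in the cell whose entry equals k.
# The "true" plot is drawn last (k = 20), so it ends up in this row and column:
true_cell <- which(grid == 20, arr.ind = TRUE)
true_cell
```

Because the permutation is random, the viewer cannot infer the position of the true plot from the order in which the plots were generated.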

We want to position 19 out of 20 random plots and leave one grid cell empty for the “true” plot.

Code: Plotting nineteen random scatter plots
