<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>data analysis on Methods Bites</title>
    <link>https://socialsciencedatalab.mzes.uni-mannheim.de/tags/data-analysis/</link>
    <description>Recent content in data analysis on Methods Bites</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Tue, 30 May 2023 00:00:01 +0100</lastBuildDate>
    
        <atom:link href="https://socialsciencedatalab.mzes.uni-mannheim.de/tags/data-analysis/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Multiverse analysis</title>
      <link>https://socialsciencedatalab.mzes.uni-mannheim.de/video/multiverse-analysis/</link>
      <pubDate>Tue, 30 May 2023 00:00:01 +0100</pubDate>
      
      <guid>https://socialsciencedatalab.mzes.uni-mannheim.de/video/multiverse-analysis/</guid>
      <description><![CDATA[
        <div id="abstract" class="section level5">
<h5>Abstract</h5>
<p>Data analysis involves many decisions, including study design, data preparation, and statistical model selection. However, a single analysis represents only one of many possible outcomes, raising questions about the impact of undocumented and at times arbitrary choices. Multiverse analysis addresses this issue by conducting all (or a large set of) meaningful analyses and presenting the results in summary form to assess the robustness of conclusions to alternative modeling decisions. The approach addresses two fundamental problems in research: the lack of transparency and the dependence of analysis results on data-analytic decisions. We will also discuss how to implement the approach, its advantages over more traditional analysis approaches, as well as limitations and open challenges, including statistical inference and computational requirements.</p>
</div>
<div id="presenters" class="section level5">
<h5>Presenters</h5>
<p>Reinhard Schunck <a href="mailto:schunck@uni-wuppertal.de"><i class="fa fa-envelope"></i> </a> is Professor of Sociology at the University of Wuppertal. He works primarily in the field of social stratification and inequality, concentrating on migration- and family-related processes, and has a focus on quantitative methods.</p>
<p>Nora Huth-Stöckle <a href="mailto:nhuth@uni-wuppertal.de"><i class="fa fa-envelope"></i> </a> is a doctoral student and works at the University of Wuppertal. Her research interests comprise intergroup relations, educational inequality, and quantitative methods.</p>
</div>
]]>
      </description>
    </item>
    
    <item>
      <title>Using Web Logs and Smartphone Records for Social Research</title>
      <link>https://socialsciencedatalab.mzes.uni-mannheim.de/article/using-web-logs/</link>
      <pubDate>Tue, 14 Apr 2020 01:00:00 +0100</pubDate>
      
      <guid>https://socialsciencedatalab.mzes.uni-mannheim.de/article/using-web-logs/</guid>
      <description><![CDATA[
<p>How can social scientists collect and analyze web logs – records of individuals’ browsing behavior – for their own research? In this <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/categories/instructionals/">Methods Bites Instructional Blog Post</a>, <a href="https://twitter.com/rub3n_luc">Ruben Bach</a> summarizes some key insights of his talk in the <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/">MZES Social Science Data Lab</a> in December 2019. The blog post discusses how to obtain and extract information from web logs and related data, shows how they can be used for social research, and concludes with a short discussion of how to handle big data extracted from web logs.</p>
<div id="contents" class="section level5">
<h5>Contents</h5>
<ol style="list-style-type: decimal">
<li><a href="#how-to-use-web-log-and-related-data-for-social-research">How to use web log and related data for social research</a></li>
<li><a href="#how-to-obtain-web-log-and-related-data">How to obtain web log and related data</a></li>
<li><a href="#how-to-handle-big-data">How to handle “big data”</a></li>
<li><a href="#about-the-presenter">About the presenter</a></li>
<li><a href="#further-reading">Further reading</a></li>
<li><a href="#references">References</a></li>
</ol>
</div>
<div id="how-to-use-web-log-and-related-data-for-social-research" class="section level3">
<h3>How to use web log and related data for social research</h3>
<p>Web logs (“browsing histories”), records of app use, and search queries are highly interesting data sources for research in the social sciences, as they offer detailed insights into human behavior. However, only a few studies in the social sciences have used such data so far.
Examples include <span class="citation">Stephens-Davidowitz (2014)</span>, who studied racial animus in the 2012 U.S. presidential election, <span class="citation">Peterson, Goel, and Iyengar (2018)</span> and <span class="citation">Flaxman, Goel, and Rao (2016)</span>, who analyzed filter bubbles, echo chambers and partisan polarization in the U.S., and <span class="citation">Guess, Nyhan, and Reifler (2020)</span>, who studied the spread of fake news in the U.S. In the Netherlands, <span class="citation">Möller et al. (2019)</span> analyzed online news engagement based on three different modes of news use. In Israel, <span class="citation">Dvir-Girsman (2017)</span> documented that audience homophily is higher among individuals with more extreme ideology and that it is associated with ideological polarization and intolerance. <span class="citation">Bach et al. (2019)</span> showed for Germany that online and mobile device activities predict voting behavior and political preferences to a limited degree only. <span class="citation">Chancellor and Counts (2018)</span> showed that internet search data can be used to estimate employment demand in the U.S.</p>
<p>In other disciplines, similar data have been used, for example, to predict influenza activity <span class="citation">(Ginsberg et al. 2009; but see Lazer et al. 2014)</span> and to show that users who search for relatively harmless symptoms easily end up searching for serious diseases <span class="citation">(White and Horvitz 2009)</span>. Another study demonstrates how concerns about pregnancy and childbirth change over the course of pregnancy <span class="citation">(Fourney, White, and Horvitz 2015)</span>. Furthermore, several papers show how a variety of user attributes such as socio-demographics can be inferred from web logs, search queries and app records <span class="citation">(see Hinds and Joinson 2018 for a recent overview)</span>. However, many of those studies have been conducted by researchers from computer science. While they often rely on search query data obtained from search engine providers like Bing and Google, they typically focus more on the technical aspects and on the evaluation of the performance of the underlying algorithms. As social scientists, however, we often focus on the theory-driven development of models and the testing of hypotheses about human and societal behavior. Thus, this blog post will focus on the latter.</p>
<p>One challenge that researchers face when designing studies that rely on web logs, records of app use, and search queries is how to get access to such data. Several of the studies mentioned above use large amounts of search query data from search engines like Google or Bing <span class="citation">(Stephens-Davidowitz 2014; Chancellor and Counts 2018; White and Horvitz 2009; Fourney, White, and Horvitz 2015)</span>. These data are, however, usually only available if one teams up with researchers from the respective companies. Another way to obtain data is through commercial providers who maintain opt-in panels of users who occasionally answer survey questions in exchange for money. Researchers can pay those vendors in order to get access to their panels and ask participants survey questions. In addition, several of these providers also offer web log and mobile device use records from users who (in exchange for additional pay) agreed to having their online and mobile activities monitored. In this blog post, we will mainly focus on this latter way of obtaining data. Before we talk about this topic in more detail, we will briefly summarize what we mean when we speak of web logs, records of app use, and search queries.</p>
<p>To get a better understanding of such data, the table below shows a collection of a few artificial web logs made up for this blog post. Typically, we observe a person identifier (first column) for the person whose records we observe. Second, we have a URL (Uniform Resource Locator) column which tells us which URL this person visited, when (column “Timestamp”) and for how long (“Duration of use”; here, in seconds). The most interesting information in this table is the URL column. Even without a detailed understanding of the specific form of a URL, we can easily see that, in the first row, Person 1 <a href="https://www.wetter.de/deutschland/wetter-mannheim-18224779.html?q=mannheim">visited a web address that seems to inform her about the weather in Mannheim</a>. In addition, we observe when she visited this address and how much time she spent there. The second row tells us that this person likely sent (or received) a message to (from) Peter Mustermann. We cannot, however, observe the content of the message (which we also should not, given obvious privacy reasons). The third row shows that Person 1 then visited the <a href="https://www.facebook.com/CDU/">Facebook page of the CDU</a>. The fourth row shows that she watched a <a href="https://www.youtube.com/watch?v=SpoXEEdsNfE">video on YouTube</a>. If we accessed the web address (or programmed a scraping tool), we could also learn what the video was about. From the visit in the fifth row, we learn that this person read an article about the state of the German economy on <a href="https://www.zeit.de/wirtschaft/unternehmen/2020-01/ifo-index-geschaeftsklima-deutsche-wirtschaft-konjunktur">DIE ZEIT</a>, a German newspaper.</p>
<p>With respect to Person 2, we can also observe what items they <a href="https://www.amazon.de/s?k=mostly+harmless+econometrics">searched on Amazon</a> (sixth row) or <a href="https://twitter.com/realDonaldTrump/status/1222008772102705152">which tweet they saw</a> (eighth row). We also learn that they searched for the voting advice application <a href="https://www.google.com/search?q=wahl+o+mat+hamburg+2020">“Wahl-O-Mat”</a> (ninth row). From the URL in row 9, we can extract the exact words a person entered into a search engine (the “search queries”). From row 10, we also know that Person 2 <a href="https://www.wahl-o-mat.de/hamburg2020/">then actually used this tool</a>. Thus, we already see that we can learn a lot about users’ behavior, their interests and preferences.</p>
<table>
<colgroup>
<col width="8%" />
<col width="65%" />
<col width="13%" />
<col width="11%" />
</colgroup>
<thead>
<tr class="header">
<th>Person ID</th>
<th>URL</th>
<th>Timestamp</th>
<th>Duration of use</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td><a href="https://www.wetter.de/deutschland/wetter-mannheim-18224779.html?q=mannheim" class="uri">https://www.wetter.de/deutschland/wetter-mannheim-18224779.html?q=mannheim</a></td>
<td>2020-01-21 10:43:45</td>
<td>24</td>
</tr>
<tr class="even">
<td>1</td>
<td><a href="https://www.facebook.com/messages/t/Peter.Mustermann" class="uri">https://www.facebook.com/messages/t/Peter.Mustermann</a></td>
<td>2020-01-22 11:43:45</td>
<td>32</td>
</tr>
<tr class="odd">
<td>1</td>
<td><a href="https://www.facebook.com/CDU/" class="uri">https://www.facebook.com/CDU/</a></td>
<td>2020-01-23 23:43:01</td>
<td>45</td>
</tr>
<tr class="even">
<td>1</td>
<td><a href="https://www.youtube.com/watch?v=SpoXEEdsNfE" class="uri">https://www.youtube.com/watch?v=SpoXEEdsNfE</a></td>
<td>2020-01-24 08:21:45</td>
<td>3625</td>
</tr>
<tr class="odd">
<td>1</td>
<td><a href="https://www.zeit.de/wirtschaft/unternehmen/2020-01/ifo-index-geschaeftsklima-deutsche-wirtschaft-konjunktur" class="uri">https://www.zeit.de/wirtschaft/unternehmen/2020-01/ifo-index-geschaeftsklima-deutsche-wirtschaft-konjunktur</a></td>
<td>2020-01-25 16:14:07</td>
<td>67</td>
</tr>
<tr class="even">
<td>2</td>
<td><a href="https://www.amazon.de/s?k=mostly+harmless+econometrics" class="uri">https://www.amazon.de/s?k=mostly+harmless+econometrics</a></td>
<td>2020-01-26 23:54:45</td>
<td>23</td>
</tr>
<tr class="odd">
<td>2</td>
<td><a href="https://www.sueddeutsche.de/politik/bundeswehr-wehrbeauftragter-bericht-1.4774621" class="uri">https://www.sueddeutsche.de/politik/bundeswehr-wehrbeauftragter-bericht-1.4774621</a></td>
<td>2020-01-27 07:43:45</td>
<td>245</td>
</tr>
<tr class="even">
<td>2</td>
<td><a href="https://twitter.com/realDonaldTrump/status/1222008772102705152" class="uri">https://twitter.com/realDonaldTrump/status/1222008772102705152</a></td>
<td>2020-01-28 01:01:45</td>
<td>56</td>
</tr>
<tr class="odd">
<td>2</td>
<td><a href="https://www.google.com/search?q=wahl+o+mat+hamburg+2020" class="uri">https://www.google.com/search?q=wahl+o+mat+hamburg+2020</a></td>
<td>2020-01-29 17:14:45</td>
<td>32</td>
</tr>
<tr class="even">
<td>2</td>
<td><a href="https://www.wahl-o-mat.de/hamburg2020/" class="uri">https://www.wahl-o-mat.de/hamburg2020/</a></td>
<td>2020-01-30 09:09:45</td>
<td>578</td>
</tr>
<tr class="odd">
<td>2</td>
<td><a href="https://www.notebookcheck.com/Top-10-Ultrabooks-im-Test-bei-Notebookcheck.125873.0.html" class="uri">https://www.notebookcheck.com/Top-10-Ultrabooks-im-Test-bei-Notebookcheck.125873.0.html</a></td>
<td>2020-01-30 11:49:45</td>
<td>243</td>
</tr>
<tr class="even">
<td>2</td>
<td><a href="https://www.google.com/search?q=flug+frankfurt+new+york" class="uri">https://www.google.com/search?q=flug+frankfurt+new+york</a></td>
<td>2020-01-30 12:09:45</td>
<td>67</td>
</tr>
<tr class="odd">
<td>2</td>
<td><a href="https://www.google.com/flights?lite=0#flt=/m/02z0j./m/02_286.2020-01-30*/m/02_286./m/02z0j.2020-02-12;c:EUR;e:1;sd:1;t:f" class="uri">https://www.google.com/flights?lite=0#flt=/m/02z0j./m/02_286.2020-01-30*/m/02_286./m/02z0j.2020-02-12;c:EUR;e:1;sd:1;t:f</a></td>
<td>2020-01-30 18:56:45</td>
<td>456</td>
</tr>
</tbody>
</table>
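<p>To illustrate how host names and search queries could be pulled out of URLs like those in the table above, here is a minimal sketch in base R. The helper functions <code>extract_host()</code> and <code>extract_param()</code> and the column names are hypothetical and written for this example only; real web logs will likely need more robust URL parsing.</p>

```r
# Hypothetical helper: return the host name of each URL (sub() is vectorized).
extract_host <- function(url) {
  sub("^[a-z]+://([^/?#]+).*$", "\\1", url)
}

# Hypothetical helper: return the decoded value of one query parameter,
# or NA if the URL does not contain that parameter.
extract_param <- function(url, param) {
  if (!grepl("?", url, fixed = TRUE)) return(NA_character_)
  query <- sub("^[^?]*[?]([^#]*).*$", "\\1", url)
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  hit <- pairs[startsWith(pairs, paste0(param, "="))]
  if (length(hit) == 0) return(NA_character_)
  value <- sub("^[^=]*=", "", hit[1])
  # "+" encodes a space in query strings; URLdecode() handles %-escapes.
  utils::URLdecode(gsub("+", " ", value, fixed = TRUE))
}

# Three of the artificial records from the table above.
logs <- data.frame(
  person_id = c(1, 2, 2),
  url = c(
    "https://www.zeit.de/wirtschaft/unternehmen/2020-01/ifo-index-geschaeftsklima-deutsche-wirtschaft-konjunktur",
    "https://www.google.com/search?q=wahl+o+mat+hamburg+2020",
    "https://www.wahl-o-mat.de/hamburg2020/"
  ),
  stringsAsFactors = FALSE
)

logs$host <- extract_host(logs$url)
logs$search_query <- vapply(logs$url, extract_param, character(1),
                            param = "q", USE.NAMES = FALSE)
print(logs[, c("person_id", "host", "search_query")])
```

<p>Note that the name of the query parameter differs between websites (Google uses <code>q</code>, whereas the Amazon URL above uses <code>k</code>), so the parameter has to be chosen per host.</p>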
<p>We observe similar records regarding app use on mobile devices, as shown in the next table. Here, we observe the name of an app instead of a URL. Unfortunately, we cannot observe what users do inside the apps they use. That is, we do not know which articles they read when they open a newspaper app or which videos they watch on YouTube or Netflix. Yet, if somebody frequently opens newspaper apps, we might infer that this person may be interested in politics. Moreover, users who use an app called “Period Tracker” are likely female and we may even learn when they have their period by observing temporal variations in usage of this app. Similarly, somebody who uses an app that informs them about prayer times is likely religious. These are just a few examples that show how observing users’ web logs or app use can reveal information that users may perceive as sensitive personal information <span class="citation">(see Bach et al. 2019 for a study on this topic)</span>.</p>
<table>
<thead>
<tr class="header">
<th>Person ID</th>
<th>App</th>
<th>Timestamp</th>
<th>Duration of use</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>Facebook</td>
<td>2020-01-21 10:43:45</td>
<td>24</td>
</tr>
<tr class="even">
<td>1</td>
<td>Kleiderkreisel</td>
<td>2020-01-22 11:43:45</td>
<td>32</td>
</tr>
<tr class="odd">
<td>1</td>
<td>WhatsApp Messenger</td>
<td>2020-01-23 23:43:01</td>
<td>45</td>
</tr>
<tr class="even">
<td>1</td>
<td>Youtube</td>
<td>2020-01-24 08:21:45</td>
<td>3625</td>
</tr>
<tr class="odd">
<td>1</td>
<td>GMX Mail</td>
<td>2020-01-25 16:14:07</td>
<td>67</td>
</tr>
<tr class="even">
<td>2</td>
<td>Period tracker</td>
<td>2020-01-26 23:54:45</td>
<td>23</td>
</tr>
<tr class="odd">
<td>2</td>
<td>Pedometer</td>
<td>2020-01-27 07:43:45</td>
<td>245</td>
</tr>
<tr class="even">
<td>2</td>
<td>WhatsApp Messenger</td>
<td>2020-01-28 01:01:45</td>
<td>56</td>
</tr>
<tr class="odd">
<td>2</td>
<td>WhatsApp Messenger</td>
<td>2020-01-29 17:14:45</td>
<td>32</td>
</tr>
<tr class="even">
<td>2</td>
<td>Youtube</td>
<td>2020-01-30 09:09:45</td>
<td>578</td>
</tr>
<tr class="odd">
<td>2</td>
<td>Netflix</td>
<td>2020-01-30 18:56:45</td>
<td>456</td>
</tr>
</tbody>
</table>
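<p>Records like these are usually aggregated before analysis. The following sketch, using a subset of the artificial app records above and hypothetical column names, computes the total duration and the number of sessions per person and app with base R.</p>

```r
# A subset of the artificial app records from the table above.
apps <- data.frame(
  person_id = c(1, 1, 2, 2, 2, 2),
  app = c("Facebook", "Youtube", "Period tracker",
          "WhatsApp Messenger", "WhatsApp Messenger", "Netflix"),
  duration = c(24, 3625, 23, 56, 32, 456),  # seconds
  stringsAsFactors = FALSE
)

# Total time and number of sessions per person and app.
usage <- aggregate(duration ~ person_id + app, data = apps, FUN = sum)
sessions <- aggregate(duration ~ person_id + app, data = apps, FUN = length)
usage$sessions <- sessions$duration  # both results share the same row order

# Sort by person and, within person, by descending total duration.
usage <- usage[order(usage$person_id, -usage$duration), ]
print(usage)
```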
<p>It is important to note that participants of commercial panels can usually pause data collection on their devices temporarily. This might happen if they do not feel comfortable having their activities recorded, e.g. when they do online banking or watch movies from illegal streams. However, our own data collections show large amounts of potentially sensitive information (such as adult content, gambling, and illegal streaming), which suggests that users do not make use of this possibility very often.</p>
</div>
<div id="how-to-obtain-web-log-and-related-data" class="section level3">
<h3>How to obtain web log and related data</h3>
<p>As mentioned earlier, the easiest way to obtain web log and app use data is through commercial vendors who operate online access panels (“non-probability panels”). So far, there seem to be only a handful of providers, such as <a href="https://www.respondi.com/EN/">respondi AG</a> (Germany, UK and France, for example), <a href="https://yougov.co.uk">YouGov</a> (UK and US, amongst others) and <a href="https://www.netquest.com/en/online-surveys-investigation">netquest</a> (mostly Spain and Latin America). For an overview of providers, see, for example, <a href="https://wakoopa.com/get-data/">Wakoopa Hub</a>. Since such data have not been available for long, we do not know much about their quality <span class="citation">(for an exception see Revilla, Ochoa, and Loewe 2017)</span>. In addition to accessing web logs and app use data, researchers can collect survey data for the same individuals. That is, by asking users questions we can enrich the web log data with important information about users’ socio-demographic characteristics, their voting behavior, or their political preferences. This information is usually not directly observable from the web logs and app use records, but likely makes the data much more valuable for social scientific research purposes. However, we should always keep in mind that due to the opt-in nature of these access panels, we need to be careful when making statements about the representativeness of our findings. That is, because most of our statistical estimators (e.g., for variances) rely on true probability sampling, we need to adjust our models and estimators, e.g. by using weights <span class="citation">(for more information on non-probability sampling and non-probability panels see, e.g., Mercer et al. 2017; Cornesse et al. 2020)</span>.</p>
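<p>As a rough illustration of what “using weights” can mean, the sketch below applies simple post-stratification on a single variable. This is only one of several possible adjustment strategies, and all numbers (including the population shares) are made up for this example.</p>

```r
# Artificial opt-in panel in which women are overrepresented.
panel <- data.frame(
  gender = c("female", "female", "female", "male"),
  reads_news = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Made-up population benchmark (in practice taken from official statistics).
population_share <- c(female = 0.5, male = 0.5)

# Share of each group in the sample.
sample_share <- table(panel$gender) / nrow(panel)

# Post-stratification weight: population share divided by sample share.
panel$weight <- unname(population_share[panel$gender]) /
  as.numeric(sample_share[panel$gender])

unweighted <- mean(panel$reads_news)
weighted <- weighted.mean(panel$reads_news, w = panel$weight)
print(c(unweighted = unweighted, weighted = weighted))
```

<p>Weights of this kind only correct for imbalances in the variables used to construct them; they do not guarantee representativeness with respect to unobserved characteristics.</p>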
<p>Another way to obtain data on users’ online activities is through <a href="https://trends.google.com">Google Trends</a>. Briefly speaking, this service allows the estimation of the popularity of search queries in Google Search across various regions and languages. Several R packages allow users to automatically extract data from Google Trends – see, for instance, <a href="https://cran.r-project.org/web/packages/gtrendsR/gtrendsR.pdf"><code>gtrendsR</code></a>. Further details on Google Trends can be found <a href="https://www.aeaweb.org/conference/2016/retrieve.php?pdfid=772">here</a>. A major drawback of these data is that they do not come with additional information about users and only provide information at aggregate levels. That is, one can learn only about the popularity of a specific search query compared to the popularity of other search queries, which is arguably much less informative than individual-level web logs.</p>
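<p>A query to Google Trends via <code>gtrendsR</code> might look roughly like the sketch below. The wrapper <code>fetch_trend()</code>, the keyword, the region, and the time window are assumptions chosen for illustration; the call requires an internet connection and may fail if Google limits the number of requests.</p>

```r
# Hypothetical wrapper around gtrendsR::gtrends(); returns the
# "interest over time" data frame for one keyword.
fetch_trend <- function(keyword, geo = "DE", time = "today 3-m") {
  if (!requireNamespace("gtrendsR", quietly = TRUE)) {
    stop("Please install the 'gtrendsR' package first.")
  }
  result <- gtrendsR::gtrends(keyword = keyword, geo = geo, time = time)
  result$interest_over_time
}

# Example call (commented out because it needs network access):
# trend <- fetch_trend("wahl-o-mat")
# head(trend)
```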
<p>A third way to obtain web log, app, and search query data is through developing one’s own research app. A few projects using this approach were launched in recent years. One of the most prominent examples is the <a href="https://www.iab.de/751/section.aspx/1470">IAB-SMART project</a> <span class="citation">(for details, see Kreuter et al. 2019)</span>. Researchers of the Institute for Employment Research (IAB) in Nuremberg, Germany, developed an app that has since been downloaded by several hundred respondents of the IAB’s panel survey “Labour Market and Social Security”. The app collects information on respondents’ smartphone activities and can access even more information than those described above (including geolocation and address books). While this approach is likely the most elaborate, it is also the most difficult and most costly one to implement: Researchers need to program their own tools and recruit participants on their own or piggyback on an existing study.</p>
</div>
<div id="how-to-handle-big-data" class="section level3">
<h3>How to handle “big data”</h3>
<p>Finally, a note on useful tools and prerequisites for analyzing web logs and records of smartphone use. First of all, the amount of data can quickly exceed the computational power of a standard desktop computer. Four months of web log data used in <span class="citation">Bach et al. (2019)</span>, for example, contained about 38 million observations. Working with data of this size, researchers may have to consider using remote computing services like <a href="https://www.digitalocean.com">Digital Ocean</a>, <a href="https://aws.amazon.com">Amazon AWS</a> or <a href="https://azure.microsoft.com/en-us/">Microsoft Azure</a>, which offer computational resources for little money through virtual servers. Second, understanding URL contents by observing single URLs is straightforward. Analyzing thousands of URLs, however, requires text mining and natural language processing (NLP) techniques if one wants, for example, to select only those URLs that point to news articles. Moreover, in addition to analyzing the title of a news article (which can often be observed from the URL alone), one might also want to analyze the whole content of the article. In such cases, in addition to being able to automatically extract the topic of an article through NLP techniques, knowing how to scrape website contents will likely also be helpful. Some useful materials are linked <a href="#further-reading">below</a>.</p>
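<p>Before turning to full-fledged NLP, a much simpler first step is to match host names against a hand-made list of news outlets and to split the last part of the URL path into words. The sketch below shows this idea in base R; the list of news hosts is an illustrative assumption, and this is a crude heuristic rather than a text mining approach.</p>

```r
# Illustrative, hand-made list of news hosts (an assumption for this example).
news_hosts <- c("www.zeit.de", "www.sueddeutsche.de")

urls <- c(
  "https://www.zeit.de/wirtschaft/unternehmen/2020-01/ifo-index-geschaeftsklima-deutsche-wirtschaft-konjunktur",
  "https://www.sueddeutsche.de/politik/bundeswehr-wehrbeauftragter-bericht-1.4774621",
  "https://www.amazon.de/s?k=mostly+harmless+econometrics"
)

# Keep only URLs whose host is on the list.
host <- sub("^[a-z]+://([^/?#]+).*$", "\\1", urls)
is_news <- host %in% news_hosts

# Drop query strings and fragments, then take the last path segment,
# which often contains the (shortened) article title.
news_urls <- sub("[?#].*$", "", urls[is_news])
slugs <- basename(news_urls)

# Split the slug into rough keywords.
keywords <- strsplit(slugs, "-", fixed = TRUE)
print(keywords)
```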
</div>
<div id="about-the-presenter" class="section level3">
<h3>About the presenter</h3>
<p>Ruben Bach <a href="mailto:r.bach@uni-mannheim.de"><i class="fa fa-envelope"></i> </a>
<a href="https://ruben-bach.com/"><i class="fa fa-globe"></i> </a>
<a href="https://twitter.com/rub3n_luc"><i class="fa fa-twitter"></i></a> is a postdoctoral researcher at the University of Mannheim, focusing on social science quantitative research methods. His interests include topics related to big data in the social sciences, machine learning, causal inference, and survey research.</p>
</div>
<div id="further-reading" class="section level3">
<h3>Further reading</h3>
<ul>
<li><a href="https://socialsciencedatalab.mzes.uni-mannheim.de/article/advancing-text-mining/">Meyer, Cosima and Cornelius Puschmann. 2019. <em>Advancing Text Mining with R and quanteda</em>. Methods Bites: Blog of the MZES Social Science Data Lab. Blog Post Tutorial.</a></li>
<li><a href="https://github.com/SocialScienceDataLab/Intro-to-web-scraping-with-R">Munzert, Simon. 2016. <em>Three easy-to-learn tools to scrape data from the Web with R</em>. MZES Social Science Data Lab. Workshop Materials.</a></li>
</ul>
</div>
<div id="references" class="section level3 unnumbered">
<h3 class="unnumbered">References</h3>
<div id="refs" class="references hanging-indent">
<div id="ref-bach2019">
<p>Bach, R. L., C. Kern, A. Amaya, F. Keusch, F. Kreuter, J. Heinemann, and J. Hecht. 2019. “Predicting Voting Behavior Using Digital Trace Data.” <em>Social Science Computer Review</em>. <a href="https://doi.org/10.1177/0894439319882896">https://doi.org/10.1177/0894439319882896</a>.</p>
</div>
<div id="ref-chancellor2018">
<p>Chancellor, S., and S. Counts. 2018. “Measuring Employment Demand Using Internet Search Data.” In <em>Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems</em>, 1–14. CHI ’18. New York, NY, USA: ACM.</p>
</div>
<div id="ref-cornesse2020">
<p>Cornesse, C., A. G. Blom, D. Dutwin, J. A. Krosnick, E. D. De Leeuw, S. Legleye, J. Pasek, et al. 2020. “A Review of Conceptual Approaches and Empirical Evidence on Probability and Nonprobability Sample Survey Research.” <em>Journal of Survey Statistics and Methodology</em>. <a href="https://doi.org/10.1093/jssam/smz041">https://doi.org/10.1093/jssam/smz041</a>.</p>
</div>
<div id="ref-dvir2017">
<p>Dvir-Girsman, S. 2017. “Media Audience Homophily: Partisan Websites, Audience Identity and Polarization Processes.” <em>New Media &amp; Society</em> 19 (7): 1072–91.</p>
</div>
<div id="ref-flaxman2016filter">
<p>Flaxman, Seth, Sharad Goel, and Justin M Rao. 2016. “Filter Bubbles, Echo Chambers, and Online News Consumption.” <em>Public Opinion Quarterly</em> 80 (S1): 298–320.</p>
</div>
<div id="ref-Fourney2015">
<p>Fourney, Adam, Ryen W. White, and Eric Horvitz. 2015. “Exploring Time-Dependent Concerns About Pregnancy and Childbirth from Search Logs.” In <em>Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems</em>, 737–46. CHI ’15. New York, NY, USA: ACM. <a href="https://doi.org/10.1145/2702123.2702427">https://doi.org/10.1145/2702123.2702427</a>.</p>
</div>
<div id="ref-ginsberg2009detecting">
<p>Ginsberg, Jeremy, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” <em>Nature</em> 457 (7232): 1012–4.</p>
</div>
<div id="ref-guess2020exposure">
<p>Guess, Andrew M, Brendan Nyhan, and Jason Reifler. 2020. “Exposure to Untrustworthy Websites in the 2016 US Election.” <em>Nature Human Behaviour</em>, 1–9.</p>
</div>
<div id="ref-hinds">
<p>Hinds, J., and A. N. Joinson. 2018. “What Demographic Attributes Do Our Digital Footprints Reveal? A Systematic Review.” <em>PLoS One</em> 13: 1–40.</p>
</div>
<div id="ref-KreuterSMART">
<p>Kreuter, Frauke, Georg-Christoph Haas, Florian Keusch, Sebastian Bähr, and Mark Trappmann. 2019. “Collecting Survey and Smartphone Sensor Data with an App: Opportunities and Challenges Around Privacy and Informed Consent.” <em>Social Science Computer Review</em>. <a href="https://doi.org/10.1177/0894439318816389">https://doi.org/10.1177/0894439318816389</a>.</p>
</div>
<div id="ref-lazer2014parable">
<p>Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” <em>Science</em> 343 (6176): 1203–5.</p>
</div>
<div id="ref-AMercer">
<p>Mercer, Andrew W., Frauke Kreuter, Scott Keeter, and Elizabeth A. Stuart. 2017. “Theory and Practice in Nonprobability Surveys: Parallels between Causal Inference and Survey Inference.” <em>Public Opinion Quarterly</em> 81 (S1): 250–71. <a href="https://doi.org/10.1093/poq/nfw060">https://doi.org/10.1093/poq/nfw060</a>.</p>
</div>
<div id="ref-moller2019explaining">
<p>Möller, Judith, Robbert Nicolai van de Velde, Lisa Merten, and Cornelius Puschmann. 2019. “Explaining Online News Engagement Based on Browsing Behavior: Creatures of Habit?” <em>Social Science Computer Review</em>, 0894439319828012. <a href="https://doi.org/10.1177/0894439319828012">https://doi.org/10.1177/0894439319828012</a>.</p>
</div>
<div id="ref-peterson2018echo">
<p>Peterson, Erik, Sharad Goel, and Shanto Iyengar. 2018. “Echo Chambers and Partisan Polarization: Evidence from the 2016 Presidential Campaign.”</p>
</div>
<div id="ref-revilla2017using">
<p>Revilla, Melanie, Carlos Ochoa, and Germán Loewe. 2017. “Using Passive Data from a Meter to Complement Survey Data in Order to Study Online Behavior.” <em>Social Science Computer Review</em> 35 (4): 521–36.</p>
</div>
<div id="ref-stephens2014cost">
<p>Stephens-Davidowitz, Seth. 2014. “The Cost of Racial Animus on a Black Candidate: Evidence Using Google Search Data.” <em>Journal of Public Economics</em> 118: 26–40.</p>
</div>
<div id="ref-white2009">
<p>White, Ryen W., and Eric Horvitz. 2009. “Cyberchondria: Studies of the Escalation of Medical Concerns in Web Search.” <em>ACM Trans. Inf. Syst.</em> 27 (4): 23:1–23:37. <a href="https://doi.org/10.1145/1629096.1629101">https://doi.org/10.1145/1629096.1629101</a>.</p>
</div>
</div>
</div>
]]>
      </description>
    </item>
    
    <item>
      <title>Shiny Apps: Development and Deployment</title>
      <link>https://socialsciencedatalab.mzes.uni-mannheim.de/article/shiny-apps/</link>
      <pubDate>Tue, 17 Dec 2019 01:00:00 +0100</pubDate>
      
      <guid>https://socialsciencedatalab.mzes.uni-mannheim.de/article/shiny-apps/</guid>
      <description><![CDATA[
<p>Shiny Apps allow developers and researchers to easily build interactive web applications within the environment of the statistical software R. Using these apps, R users can interactively communicate their work to a broader audience. In this <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/categories/tutorials/">Methods Bites Tutorial</a>, <a href="https://twitter.com/kongavras">Konstantin Gavras</a> and <a href="https://twitter.com/Nick_Baumann97">Nick Baumann</a> present a comprehensive recap of Konstantin Gavras’ (University of Mannheim) workshop materials to illustrate how Shiny Apps enable vivid data presentation as well as their usefulness as an analytical tool.</p>
<p>After reading this blog post and engaging with the applied examples, readers should:</p>
<ul>
<li>be able to retrace the logic behind the structure of Shiny Apps,</li>
<li>be able to build their own Shiny App, and</li>
<li>be able to deploy their Shiny App and run it on the World Wide Web.</li>
</ul>
<p><em>Note:</em> This blog post provides a summary of Konstantin’s workshop in the MZES Social Science Data Lab with some adaptations. Konstantin’s original workshop materials, including slides and scripts, are available from our <a href="https://github.com/SocialScienceDataLab/shiny-development-deployment">GitHub</a>. A recording of the workshop is available on <a href="https://www.youtube.com/watch?v=QT3WUQu99pM">YouTube</a>.</p>
<div id="contents" class="section level5">
<h5>Contents</h5>
<ol style="list-style-type: decimal">
<li><a href="#shiny-apps-and-data-presentation">Shiny Apps and Data Presentation</a></li>
<li><a href="#the-structure-of-shiny-apps">The Structure of Shiny Apps</a></li>
<li><a href="#developing-shiny-apps">Developing Shiny Apps</a>
<ol style="list-style-type: decimal">
<li><a href="#building-a-shiny-app-from-scratch">Building a Shiny App from Scratch</a></li>
<li><a href="#building-the-ui">Building the UI</a></li>
<li><a href="#implementing-the-server-logic">Implementing the Server Logic</a></li>
<li><a href="#rendering-plots">Rendering Plots</a></li>
</ol></li>
<li><a href="#deployment">Deployment</a>
<ol style="list-style-type: decimal">
<li><a href="#deployment-using-shinyapps.io">Deployment using shinyapps.io</a></li>
<li><a href="#deployment-using-shiny-server">Deployment using Shiny Server</a></li>
</ol></li>
<li><a href="#conclusion">Conclusion</a></li>
</ol>
</div>
<div id="shiny-apps-and-data-presentation" class="section level3">
<h3>Shiny Apps and Data Presentation</h3>
<p>Data are usually collected in a raw format and, no matter how well processed and analyzed, results should be presented in a plain and simple way in order to make them accessible for everyone. Unfortunately, the potential of effective data presentation is all too often not fully exploited. Yet, presentation is crucial as it illustrates the actual outcome of research. If results are not effectively visualized, we waste a great opportunity to build credibility, attract and sustain the interest of readers, and make large amounts of information easily accessible. Here, we introduce Shiny Apps as a powerful and highly flexible tool for interactive data presentation on the World Wide Web.</p>
<p>Shiny is an R package that offers cost- and programming-free tools for building web applications using R. It was developed by <a href="https://github.com/jcheng5">Joe Chang</a> to serve as a reactive web framework for R that allows calculations, display of R objects, and the presentation of results. Since Shiny Apps come with an extensive back-end setup, users do not need extensive web development skills to build and host standalone apps on a homepage. However, for those keen on bringing their apps to perfection, Shiny Apps allows for CSS, HTML and JavaScript extensions.Shiny Apps can be used either for data presentation, as a communication tool for results, or even as an interactive analytical tool.</p>
<p>In what follows, we introduce the Shiny environment and guide readers through the development of Shiny Apps. Using the famous Kaggle <code>titanic</code> data set, we draw the distinction between the front-end <code>ui.R</code> and back-end <code>server.R</code>, which are required to build Shiny apps. Following this, we introduce important concepts and features to build an interactive app, including control widgets, reactivity, and rendering.</p>
</div>
<div id="the-structure-of-shiny-apps" class="section level3">
<h3>The Structure of Shiny Apps</h3>
<p>Shiny applications have two components. The front-end builds the webpage that is actually shown to the user. As already mentioned, the HTML page is written by Shiny itself and includes layout, appearance, and design features. In Shiny terminology, this is called the <code>ui</code>, which stands for user interface. The <code>ui</code> file contains R functions that are then translated into an HTML file.
The other component is the back-end, which includes the code for producing the app’s contents (e.g. functions or data import, management, and analysis). Here, we create the objects that are later shown on the front-end. In Shiny terminology this is called the <code>server</code>.</p>
<div id="ways-of-setting-up-a-shiny-app" class="section level5">
<h5>Ways of Setting Up a Shiny App</h5>
<p>Shiny Apps can be set up in two different ways.</p>
<p><strong>1. Single-file App:</strong>
In a single-file app, <code>ui</code> and <code>server</code> are stored in one script. In this case, we create a file named <code>app.R</code> that contains both the server and UI components. This technique is used when developing very simple Shiny Apps and lacks some advantages of the alternative two-files App method.</p>
<p><strong>2. Two-files App:</strong>
Here, <code>ui</code> and <code>server</code> are stored in two separate scripts, which implies a clear separation between front-end and back-end. This method is preferable when developing more advanced Shiny Apps. It is <strong>important that the files are named <code>ui.R</code> and <code>server.R</code> and stored together in a folder of their own</strong>. In this tutorial, we are going to develop Shiny Apps using the two-files method.</p>
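<p>For comparison, a single-file app can be sketched as follows. This is a minimal illustration rather than part of the workshop materials: the input ID <code>n</code>, the output ID <code>hist</code>, and all labels are assumptions chosen for the example.</p>
<pre class="r"><code># app.R: a minimal single-file sketch (IDs and labels are illustrative)
library(shiny)

# Front-end: one slider and one plot placeholder
ui &lt;- fluidPage(
  titlePanel(&quot;Single-file sketch&quot;),
  sliderInput(&quot;n&quot;, &quot;Number of draws&quot;, min = 10, max = 100, value = 50),
  plotOutput(&quot;hist&quot;)
)

# Back-end: draw a histogram of input$n random numbers
server &lt;- function(input, output) {
  output$hist &lt;- renderPlot({
    hist(rnorm(input$n), main = &quot;Random draws&quot;)
  })
}

# Combine both components into one app object
app &lt;- shinyApp(ui = ui, server = server)
# In an interactive session, runApp(app) launches the app</code></pre>
<p>In the two-files layout, the <code>ui</code> and <code>server</code> objects above would simply move into <code>ui.R</code> and <code>server.R</code>, and the call to <code>shinyApp()</code> would no longer be needed.</p>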
</div>
</div>
<div id="developing-shiny-apps" class="section level3">
<h3>Developing Shiny Apps</h3>
<div id="building-a-shiny-app-from-scratch" class="section level5">
<h5>Building a Shiny App from Scratch</h5>
<p>To set up, make sure that all required packages are installed and subsequently load the <code>shiny</code>, <code>tidyverse</code>, and <code>plotly</code> packages along with the <code>titanic</code> package, which provides the example data set <code>titanic_train</code>.</p>
<details>
<p><summary> Code: R packages used in this tutorial</summary></p>
<pre class="r"><code>## Packages
pkgs &lt;- c(&quot;shiny&quot;,
          &quot;tidyverse&quot;,
          &quot;titanic&quot;,
          &quot;plotly&quot;)

## Install uninstalled packages
lapply(pkgs[!(pkgs %in% rownames(installed.packages()))], install.packages)

## Load all packages to library
lapply(pkgs, library, character.only = TRUE)</code></pre>
<pre><code>## Warning: package &#39;shiny&#39; was built under R version 3.6.3</code></pre>
<pre><code>## Warning: package &#39;ggplot2&#39; was built under R version 3.6.3</code></pre>
<pre><code>## Warning: package &#39;tidyr&#39; was built under R version 3.6.3</code></pre>
<pre><code>## Warning: package &#39;plotly&#39; was built under R version 3.6.3</code></pre>
</details>
<p><br />
Next, create a new folder with two R scripts, <code>ui.R</code> and <code>server.R</code>:</p>
<details>
<p><summary>Code: Load packages and create folder with two R scripts</summary></p>
<pre class="r"><code>ui &lt;- fluidPage()
server &lt;- function(input, output) {
}</code></pre>
</details>
<p></br>
Subsequently, launch the Shiny App by pressing the “Run App” button in the top right corner of the Source pane. Also, have a look at the code chunk below, which provides an example of a Shiny App.</p>
<details>
<p><summary>Code: Example </summary></p>
<pre class="r"><code>runExample(&quot;01_hello&quot;)
# To show other Apps, please type runExample(NA) and choose another example</code></pre>
</details>
<p></br></p>
</div>
<div id="building-the-ui" class="section level5">
<h5>Building the UI</h5>
<p>When building a Shiny App, one should have in mind what the app should look like. Hence, we build the UI first. In simple Shiny Apps, the whole UI fits in the <code>fluidPage</code>: every new object is passed comma-separated, and text can be passed to the UI simply by entering strings. In order to format text, Shiny uses HTML wrappers, i.e., functions that take text as arguments (plus further style options).</p>
<ul>
<li><code>h1()</code>: top-level header</li>
<li><code>h2()</code>: secondary header</li>
<li><code>strong()</code>: make text bold</li>
<li><code>em()</code>: make text italicized</li>
<li><code>br()</code>: add line break</li>
<li><code>titlePanel()</code>: add the app title as a header</li>
</ul>
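<p>As a brief illustration of these wrappers, the sketch below builds a small formatted text block and places it on a page. The wording of the text and the object name <code>intro</code> are made up for this example.</p>
<pre class="r"><code>library(shiny)

# Each wrapper is an R function that returns the corresponding HTML tag
intro &lt;- tagList(
  h2(&quot;About this app&quot;),
  p(&quot;Explore the &quot;, strong(&quot;Titanic&quot;), &quot; data set &quot;, em(&quot;interactively&quot;), &quot;.&quot;),
  br()
)

# Objects are passed to fluidPage() separated by commas
ui &lt;- fluidPage(titlePanel(&quot;Visualizing the Titanic data set&quot;), intro)

# Printing the tags as text shows the HTML that Shiny generates
cat(as.character(intro))</code></pre>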
<p>Until now, we only have a plain white page. We need a proper layout to make our app look nicer. <code>sidebarLayout</code> is the simplest layout format. It splits your page into two parts: a <code>sidebarPanel</code> and a <code>mainPanel</code>. Inside <code>sidebarLayout</code>, you use <code>sidebarPanel</code> to specify the content and appearance of the side panel of your page. The <code>mainPanel</code> contains the results of the <code>server</code>.</p>
<details>
<p><summary>Code: Building the plain UI </summary></p>
<pre class="r"><code>ui &lt;- fluidPage(titlePanel(&quot;Title of my Shiny App&quot;),
                sidebarLayout(
                  sidebarPanel(&quot;My input goes here&quot;),
                  mainPanel(&quot;The results go here&quot;)
                ))</code></pre>
</details>
<p></br>
In order to interact with Shiny Apps, we need control widgets, such as buttons, select boxes, or sliders. These allow us to navigate the app. The type of widget you choose depends on the design of your app. You can specify inputs, enter text, or select specific dates to create results. All input functions share their first two arguments: <code>inputId</code> and <code>label</code>.</p>
<p><code>inputId</code> tells Shiny which input to refer to when retrieving values for the back-end. <strong>It is important that the ID names you use are unique</strong>. If two inputs share the same ID, you should not rely on a warning or error message. Consequently, the app may use the wrong input in the wrong place without notification. An overview of all types of widgets you can work with is provided <a href="https://shiny.rstudio.com/tutorial/written-tutorial/lesson3/">here</a>. The <code>label</code> option specifies the text displayed as the label of the control widget. The control widgets go in the <code>sidebarPanel</code>. Here, we specify the possible values, range, and the appearance of each control widget.</p>
<p>Having added the input and the control widgets, we finally need to specify the output elements of our app. These outputs might be plots, tables, text, images, or maps. These will be built in the <code>server</code> file of our two-files app. In the UI, we only build the placeholder, which means that every output function has an <code>outputId</code> argument to identify the output created in <code>server.R</code>. Here, we use <code>plotlyOutput</code> to refer to a plot that will be generated later in <code>server.R</code>. Output elements should always be added to the <code>mainPanel()</code> function. Other types of output elements are:</p>
<ul>
<li><code>dataTableOutput()</code>: data table</li>
<li><code>imageOutput()</code>: image</li>
<li><code>plotOutput()</code>: plot</li>
<li><code>tableOutput()</code>: table</li>
<li><code>verbatimTextOutput()</code>: text</li>
<li><code>textOutput()</code>: text</li>
<li><code>uiOutput()</code>: raw HTML</li>
<li><code>htmlOutput()</code>: raw HTML</li>
</ul>
<p>The code chunk below demonstrates how widget and output elements can be combined.</p>
<details>
<p><summary>Code: Finishing the UI </summary></p>
<pre class="r"><code># The most simple structure of the UI:
ui &lt;- fluidPage(
  titlePanel(&quot;Visualizing the Titanic data set&quot;),
  sidebarLayout(
    sidebarPanel(
      checkboxInput(&quot;checkbox&quot;,
                    &quot;Activate all filters&quot;,
                    FALSE),
      radioButtons(
        &quot;buttons&quot;,
        &quot;Did the passenger survive?&quot;,
        choices = c(&quot;did not survive&quot; = 0, &quot;did survive&quot; = 1),
        selected = 0
      ),
      selectInput(&quot;selector&quot;,
                  &quot;Select the class of the passenger&quot;,
                  choices = c(1, 2, 3)),
      sliderInput(
        &quot;slider&quot;,
        &quot;Pick a range of fare (in $)&quot;,
        min = 0,
        max = 550,
        value = c(10, 50),
        pre = &quot;$&quot;
      ),
      selectInput(
        &quot;selectVars&quot;,
        &quot;Select the variable to be displayed&quot;,
        choices = c(
          &quot;Age&quot; = &quot;Age&quot;,
          &quot;# of Siblings/Spouses on board&quot; = &quot;SibSp&quot;,
          &quot;# of Parents/Children on board&quot; = &quot;Parch&quot;
        )
      )
    ),
    mainPanel(plotlyOutput(&quot;greatPlot&quot;),
              textOutput(&quot;goodText&quot;))
  )
)</code></pre>
</details>
<p></br></p>
</div>
<div id="implementing-the-server-logic" class="section level5">
<h5>Implementing the Server Logic</h5>
<p>Having finished the front-end, we now turn to building the <code>server</code> or the back-end. The <code>server</code> logic in Shiny Apps uses an <code>input</code> argument, which contains the values of the input given by the users, as well as an <code>output</code> argument, which contains the plots and tables created as a function of the input arguments. Output objects can be created without any input specification, but always need to be connected with the UI by an <code>outputId</code>. To do so, save the output object into the <code>output</code> list. We then build the object using one of the available render functions. For every object type, there is a unique render function, e.g. <code>renderPlot()</code>. The table below provides a brief overview of which <code>render</code> functions work with which output formats.</p>
<table class="table table-striped table-hover table-condensed table-responsive" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
Render function
</th>
<th style="text-align:left;">
Output function
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
<code>renderDataTable()</code>
</td>
<td style="text-align:left;">
<code>dataTableOutput()</code>
</td>
</tr>
<tr>
<td style="text-align:left;">
<code>renderImage()</code>
</td>
<td style="text-align:left;">
<code>imageOutput()</code>
</td>
</tr>
<tr>
<td style="text-align:left;">
<code>renderPlot()</code>
</td>
<td style="text-align:left;">
<code>plotOutput()</code>
</td>
</tr>
<tr>
<td style="text-align:left;">
<code>renderPrint()</code>
</td>
<td style="text-align:left;">
<code>verbatimTextOutput()</code>
</td>
</tr>
<tr>
<td style="text-align:left;">
<code>renderTable()</code>
</td>
<td style="text-align:left;">
<code>tableOutput()</code>
</td>
</tr>
<tr>
<td style="text-align:left;">
<code>renderText()</code>
</td>
<td style="text-align:left;">
<code>textOutput()</code>
</td>
</tr>
<tr>
<td style="text-align:left;">
<code>renderUI()</code>
</td>
<td style="text-align:left;">
<code>uiOutput()</code> and <code>htmlOutput()</code>
</td>
</tr>
</tbody>
</table>
<p>We have to make sure that we accurately assign our IDs to their corresponding render and output functions. Since we named our plot output <code>greatPlot</code> in the UI, we have to assign the rendered object to this ID. Render and output functions also have to match: <code>renderPlot</code> pairs with <code>plotOutput</code>, so the simple example below assumes a <code>plotOutput(&quot;greatPlot&quot;)</code> placeholder in the UI, whereas the <code>plotlyOutput</code> used in the finished UI above pairs with <code>renderPlotly</code>. Furthermore, our plots need a connection to the input and the control widgets. The logic behind the input list is the same as for the output list: as we used the <code>sliderInput</code> function in the UI and named it <code>slider</code>, we access the respective object as <code>input$slider</code>. Because this slider returns a range, <code>input$slider</code> is a vector of length two, and we pick the element we need.</p>
<details>
<p><summary>Code: Fitting IDs and output/render functions </summary></p>
<pre class="r"><code>output$greatPlot &lt;- renderPlot({
  plot(rnorm(input$slider[1]))
})</code></pre>
</details>
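<p>The pairing of IDs and of render and output functions can also be checked without opening a browser. The sketch below uses <code>testServer()</code> from the <code>shiny</code> package, which requires the server function to declare a <code>session</code> argument. The output ID <code>fareText</code> is an assumption made for this example and does not appear in the workshop app.</p>
<pre class="r"><code>library(shiny)

# UI side: a range slider and a text placeholder with the ID &quot;fareText&quot;
ui &lt;- fluidPage(
  sliderInput(&quot;slider&quot;, &quot;Pick a range of fare (in $)&quot;,
              min = 0, max = 550, value = c(10, 50)),
  textOutput(&quot;fareText&quot;)
)

# Server side: renderText() pairs with textOutput(), and the IDs must match
server &lt;- function(input, output, session) {
  output$fareText &lt;- renderText({
    paste0(&quot;Fares between $&quot;, input$slider[1], &quot; and $&quot;, input$slider[2])
  })
}

# Simulate a user moving the slider and inspect the resulting output
testServer(server, {
  session$setInputs(slider = c(10, 50))
  print(output$fareText)
})</code></pre>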
<p></br></p>
</div>
<div id="rendering-plots" class="section level5">
<h5>Rendering Plots</h5>
<p>There are several ways of increasing the interactivity of plots. For example, one can simply add code to the <code>renderPlot()</code> function. Another very simple way of increasing interactivity is filtering. However, when the filtering is done inside each render function, the same filtering code gets re-executed in every output whenever the user changes an input, which becomes repetitive and computationally expensive when building several plots. We therefore use reactive variables to move the filtering out of the render functions. A reactive variable can then be shared by several output objects, which all update when the input changes. Additionally, we use <code>plotly</code> plots to make the app even more interactive. You simply need to replace <code>plotOutput()</code> with <code>plotlyOutput()</code> in the UI and <code>renderPlot()</code> with <code>renderPlotly()</code> in the server.</p>
<details>
<p><summary>Code: Make plots reactive and interactive </summary></p>
<pre class="r"><code># Back-end
# The server function takes input and output as arguments
# input: Input provided by the user to specify a certain appearance, data 
#        or result
# output: The output created by the functions specified in the back end

server &lt;- function(input, output) {
  # Create a reactive variable to use it for different objects interactively
  filtered &lt;- reactive({
    titanic_train %&gt;%
      filter(
        Fare &gt;= input$slider[1],
        Fare &lt;= input$slider[2],
        Pclass == input$selector,
        Survived == input$buttons
      )
  })
  
  # Plot the variables of the titanic_train data set
  # Using if-conditions to specify the variables being plotted
  output$greatPlot &lt;- renderPlotly({
    if (input$checkbox == TRUE) {
      x &lt;- filtered()[, input$selectVars]
      plotHist &lt;- ggplot(filtered(), aes(x)) +
        geom_histogram()
    } else {
      x &lt;- titanic_train[, input$selectVars]
      plotHist &lt;- ggplot(titanic_train, aes(x)) +
        geom_histogram()
    }
    
    ggplotly(plotHist)
  })
  
  output$goodText &lt;- renderText({
    if (input$checkbox == TRUE) {
      x &lt;- nrow(filtered()[filtered()$Survived == 1, ])
      text &lt;-
        paste0(&quot;Using your filter &quot;, x, &quot; passengers have survived&quot;)
      return(text)
    } else {
      x &lt;- nrow(titanic_train[titanic_train$Survived == 1, ])
      text &lt;- paste0(&quot;In total &quot;, x, &quot; passengers have survived&quot;)
      return(text)
    }
  })
}</code></pre>
</details>
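<p>To isolate the idea of a shared reactive variable, the following sketch replaces the Titanic data with a four-row toy data frame and uses base R subsetting. The data values and the output IDs <code>nRows</code> and <code>nSurvived</code> are invented for illustration. Both outputs read from the same <code>filtered()</code> object, so the filter is written only once.</p>
<pre class="r"><code>library(shiny)

# Toy data standing in for titanic_train (values are made up)
passengers &lt;- data.frame(Fare = c(5, 20, 40, 300),
                         Survived = c(0, 1, 1, 0))

server &lt;- function(input, output, session) {
  # The filter lives in one reactive expression ...
  filtered &lt;- reactive({
    in_range &lt;- passengers$Fare &gt;= input$slider[1] &amp;
      passengers$Fare &lt;= input$slider[2]
    passengers[in_range, ]
  })

  # ... and is reused by several outputs
  output$nRows &lt;- renderText(nrow(filtered()))
  output$nSurvived &lt;- renderText(sum(filtered()$Survived == 1))
}

# Check the reactive logic without launching the app
testServer(server, {
  session$setInputs(slider = c(10, 50))
  print(filtered())
})</code></pre>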
<p></br>
To finally visualize our plot locally, we simply run <code>shinyApp(ui = ui, server = server)</code>. This allows us to see the final Shiny App, including some additional components, which increase the functionality of the app even further.</p>
<div style="position:relative;padding-top:56.25%;">
<p><iframe src="https://kgavras.shinyapps.io/Shiny_Development_Deployment/" frameborder="0" allowfullscreen marginheight="100"
    style="position:absolute;top:0;left:0;width:100%;height:100%;" scrolling="yes" onload="resizeIframe(this)"></iframe></p>
</div>
</div>
</div>
<div id="deployment" class="section level3">
<h3>Deployment</h3>
<div id="deployment-using-shinyapps.io" class="section level5">
<h5>Deployment using shinyapps.io</h5>
<p>Until now, we only ran our apps locally on our machine. In order to present an app to a broader audience, we need to deploy it on the web. We have to make sure that our app is protected by a firewall and that we have a stable URL. For simple apps, the easiest option is to deploy the app using a free account on <a href="https://www.shinyapps.io/">shinyapps.io</a>. Once you have created an account, go back to RStudio, press the deploy button in the top right corner of the Source pane, enter your credentials, select the correct files, and let the magic happen!</p>
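<p>Deployment can also be scripted with the <code>rsconnect</code> package instead of the deploy button. The helper below is only a sketch: the function name <code>deploy_titanic_app</code> and all argument values are placeholders, and the account name, token, and secret would have to be copied from your own shinyapps.io dashboard.</p>
<pre class="r"><code># A hypothetical helper wrapping two rsconnect calls
deploy_titanic_app &lt;- function(app_dir, account, token, secret) {
  # Register the shinyapps.io credentials on this machine
  rsconnect::setAccountInfo(name = account, token = token, secret = secret)
  # Upload the folder that contains ui.R and server.R
  rsconnect::deployApp(appDir = app_dir, account = account)
}

# Example call with placeholder values (not run):
# deploy_titanic_app(&quot;path/to/app-folder&quot;, &quot;your-account&quot;,
#                    &quot;YOUR_TOKEN&quot;, &quot;YOUR_SECRET&quot;)</code></pre>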
</div>
<div id="deployment-using-shiny-server" class="section level5">
<h5>Deployment using Shiny Server</h5>
<p>For certain apps you might not want to use <a href="https://www.shinyapps.io/">shinyapps.io</a>, for instance if:</p>
<ul>
<li>the size of the app would require a (very) costly plan, or</li>
<li>you want full control over the app and host it yourself, or</li>
<li>you love playing around with Unix code.</li>
</ul>
<p>A free alternative is Shiny Server, which allows you to host your own app in a controlled environment (e.g. inside your organization). The professional version of Shiny Server (Shiny Server Pro) allows you to deploy password-protected apps and use an administrative dashboard that provides you with usage statistics. However, you need your own server to host the app (e.g. on Digital Ocean or Amazon Web Services), which may still be costly.</p>
<p>Another alternative is hosting the app on university servers. For instance, members of the University of Mannheim can use the web services provided by the universities of Baden-Württemberg, the <em>bwCloud</em>, for free.</p>
<p>When using Shiny Server, the required steps are a bit more complicated:</p>
<ol style="list-style-type: decimal">
<li><a href="https://www.bw-cloud.org/de/erste_schritte">Register</a> for the bwCloud</li>
<li>Log into the bwCloud Dashboard</li>
<li>Create an SSH-key pair to connect to your server (find a short intro <a href="https://www.bw-cloud.org/de/bwcloud_scope/nutzen">here</a>)</li>
<li>Install <a href="https://www.putty.org/">PuTTY</a> (Windows) or start a remote connection in the shell (Mac) as well as <a href="https://filezilla-project.org/">FileZilla</a></li>
<li>Set up the SSH-client to access remote connections</li>
<li>Build a virtual machine (the server) using your bwCloud dashboard and connect to your virtual machine using SSH with PuTTY</li>
</ol>
<p>After setting up the server, we need to log into our Unix system and install R, all relevant packages, and Shiny Server using Unix commands:</p>
<ol style="list-style-type: decimal">
<li>Install R on your machine</li>
</ol>
<pre class="r"><code>sudo apt-get install r-base</code></pre>
<p>If you are lucky, this will install the correct version. Otherwise, this <a href="https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-18-04">tutorial</a> might be of help.</p>
<ol start="2" style="list-style-type: decimal">
<li>Install dependencies to install R packages</li>
</ol>
<pre class="r"><code>sudo apt-get -y install libcurl4-gnutls-dev 
sudo apt-get -y install libxml2-dev libssl-dev</code></pre>
<ol start="3" style="list-style-type: decimal">
<li>Install R packages within R, including <code>shiny</code> (easy!)</li>
</ol>
<pre class="r"><code>install.packages(&quot;shiny&quot;)</code></pre>
<ol start="4" style="list-style-type: decimal">
<li>Install Shiny Server (check for latest version!)</li>
</ol>
<pre class="r"><code>sudo apt-get install gdebi-core
wget https://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.5.6.875-amd64.deb
sudo gdebi shiny-server-1.5.6.875-amd64.deb</code></pre>
<ol start="5" style="list-style-type: decimal">
<li>Open port 3838 and check whether your firewall is set up</li>
</ol>
<pre class="r"><code>sudo ufw allow 3838</code></pre>
<ol start="6" style="list-style-type: decimal">
<li>Test whether Shiny Server runs correctly. Open this in your browser: <code>http://134.155.108.111:3838/</code> (replace with your IP address)</li>
<li>Use FileZilla to access your server and upload your app files to the <code>/srv/shiny-server/</code> folder. The app should run when you enter the URL associated with it: <code>http://134.155.108.111:3838/foldername/projectname/</code></li>
<li>The above will likely not work immediately. In this case, you need to go back and troubleshoot… (click <a href="https://www.digitalocean.com/community/tutorials/how-to-set-up-shiny-server-on-ubuntu-16-04">here</a> for a tutorial)</li>
</ol>
</div>
</div>
<div id="conclusion" class="section level3">
<h3>Conclusion</h3>
<p>Data presentation is crucial for accessible scientific research. Even with accurate and comprehensive data that supports relevant findings, failure to present the data in an easily accessible way can result in wasted opportunities. Shiny Apps are a powerful and free-of-cost tool that allows scholars to provide readers with easy-to-use, interactive, and comprehensive insights into their research. For instance, scholars can use Shiny Apps to program an interactive online appendix and thereby offer readers full control to compare findings under different specifications of measurement and modeling.</p>
<p>However, creating a functional web application exclusively with R has some limitations. Regarding the design and the placement of inputs (such as sliders) or outputs (such as graphics and tables), Shiny alone is limited. When combined with HTML, CSS, and JavaScript, however, it offers a good way to program professional web apps. Shiny is particularly great for fast prototyping and fairly easy to use with little experience in programming. It comes with many different charting libraries and captures feedback and comments in a structured manner. It therefore offers an exciting toolbox that can be a valuable addition to scientific research.</p>
</div>
<div id="about-the-presenter" class="section level3">
<h3>About the Presenter</h3>
<p>Konstantin Gavras
<a href="mailto:konstantin@gavras.de"><i class="fa
              fa-envelope"></i> </a>
<a href="http://konstantin.gavras.de/"><i class="fa
              fa-globe"></i> </a>
<a href="https://twitter.com/kongavras"><i class="fa
              fa-twitter"></i></a> is a Ph.D. candidate at the Graduate School of Economic and Social Sciences in Political Science, research associate at the Chair of Political Psychology at the University of Mannheim and doctoral researcher for the MZES project “Fighting together, moving apart? European common defence and shared security in an age of Brexit and Trump”. His research interests comprise the intersection of Social Psychology and Political Behavior, focusing on the behavioral consequences and conditions underlying political attitudes regarding both domestic and foreign policies.</p>
</div>
]]>
      </description>
    </item>
    
    <item>
      <title>Advancing Text Mining with R and quanteda</title>
      <link>https://socialsciencedatalab.mzes.uni-mannheim.de/article/advancing-text-mining/</link>
      <pubDate>Thu, 17 Oct 2019 00:00:00 +0100</pubDate>
      
      <guid>https://socialsciencedatalab.mzes.uni-mannheim.de/article/advancing-text-mining/</guid>
      <description><![CDATA[
        
<script src="/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<script src="/rmarkdown-libs/kePrint/kePrint.js"></script>


<p>Everyone is talking about text analysis. Is it puzzling that this data source is so popular right now? Actually no. Most of our datasets rely on (hand-coded) textual information. Extracting, processing, and analyzing this oasis of information becomes increasingly relevant for a large variety of research fields. This Methods Bites Tutorial by <a href="https://twitter.com/cosima_meyer">Cosima Meyer</a> summarizes <a href="http://cbpuschmann.net">Cornelius Puschmann</a>’s workshop in the <a href="https://www.mzes.uni-mannheim.de/d7/">MZES</a> Social Science Data Lab in January 2019 on <strong>advancing text mining with R and the package <code>quanteda</code></strong>. The workshop offered guidance through the use of <code>quanteda</code> and covered various classification methods, including classification with <a href="#knowncategories">known categories (dictionaries and supervised machine learning)</a> and with <a href="#unknowncategories">unknown categories (unsupervised machine learning)</a>.
<!-- This post does not cover cross-validation. --></p>
<blockquote>
<p><em>This post was updated in December 2020 to be consistent with quanteda’s version 2.1.2. For more information on differences between quanteda versions, have a look at <a href="https://blog.quanteda.org/2020/02/27/whats-new-in-quanteda-version-2.0/">this excellent overview</a>.</em></p>
</blockquote>
<div id="overview" class="section level3">
<h3>Overview</h3>
<ol style="list-style-type: decimal">
<li><a href="#whyquanteda"><strong>What is quanteda?</strong></a></li>
<li><a href="#usequanteda"><strong>How do we use quanteda?</strong></a>
<!-- 1. [Calculate the DFM](#dfm) --></li>
<li><a href="#classification"><strong>Classification</strong></a>
<ol style="list-style-type: decimal">
<li><a href="#knowncategories"><strong>Known categories</strong></a>
<ol style="list-style-type: decimal">
<li><a href="#dictionaries">Dictionaries</a></li>
<li><a href="#supervised">Supervised machine learning</a>
<ol style="list-style-type: decimal">
<li><a href="#nb">Naive Bayes (NB)</a></li>
</ol></li>
</ol></li>
<li><a href="#unknowncategories"><strong>Unknown categories</strong></a>
<ol style="list-style-type: decimal">
<li><a href="#unsupervised">Unsupervised machine learning</a>
<ol style="list-style-type: decimal">
<li><a href="#lsa">Latent semantic analysis (LSA)</a></li>
<li><a href="#lda">Latent Dirichlet Allocation (LDA)</a></li>
<li><a href="#stm">Structural topic models (STM)</a></li>
</ol></li>
</ol></li>
</ol></li>
<li><a href="#furtherreadings"><strong>Further readings</strong></a></li>
</ol>
<p>This blog post is based on <a href="http://cbpuschmann.net/quanteda_mzes/">this report</a> and on <a href="http://inhaltsanalyse-mit-r.de/themenmodelle.html">Cornelius’ post on topic models in R</a>.</p>
</div>
<div id="what-is-quanteda" class="section level3">
<h3>What is quanteda? <a name="whyquanteda"></a></h3>
<p>In order to analyze text data, R has several packages available. In this blog post we focus on <code>quanteda</code>. <code>quanteda</code> is one of the most popular R packages for the <strong>qu</strong>antitative <strong>an</strong>alysis of <strong>te</strong>xtual <strong>da</strong>ta that is <a href="http://quanteda.io">fully-featured and allows the user to easily perform natural language processing tasks. It was originally developed by Ken Benoit and other contributors</a>. It offers extensive documentation and is regularly updated. <code>quanteda</code> is most useful for preparing data that can then be further analyzed using unsupervised/supervised machine learning or other techniques. A combination with <code>tidyverse</code> leads to a more transparent code structure, and <code>quanteda</code> covers a wide variety of further areas that could not be addressed within the limited time of the workshop (e.g., scaling models, part-of-speech (POS) tagging, named entities, word embeddings, etc.).</p>
<p>There are also similar R packages such as <code>tm</code>, <code>tidytext</code>, and <code>koRpus</code>. <a href="https://cran.r-project.org/web/packages/tm/index.html"><code>tm</code></a> has a simpler grammar but slightly fewer features, <a href="https://cran.r-project.org/web/packages/tidytext/index.html"><code>tidytext</code></a> is very closely integrated with <code>dplyr</code> and well-documented, and <a href="https://cran.r-project.org/web/packages/koRpus/index.html"><code>koRpus</code></a> is good for tasks such as <a href="https://reaktanz.de/R/pckg/koRpus/koRpus_vignette.html">part-of-speech (POS) tagging</a>.</p>
</div>
<div id="how-do-we-use-quanteda" class="section level3">
<h3>How do we use quanteda? <a name="usequanteda"></a></h3>
<p>Most analyses in quanteda require three steps:</p>
<blockquote>
<p><sub><b> 1. Import the data </b></sub></br></p>
</blockquote>
<p>The data that we usually use for text analysis is available in text formats (e.g., .txt or .csv files).</p>
<blockquote>
<p><sub><b> 2. Build a corpus </b></sub></br></p>
</blockquote>
<p>After reading in the data, we need to generate a <strong>corpus</strong>. A corpus is a type of dataset that is used in text analysis. It contains “a collection of text or speech material that has been brought together according to a certain set of predetermined criteria” <a href="https://www.igi-global.com/book/automated-systems-aviation-aerospace-industries/209468">(Shmelova et al. 2019, p. 33)</a>. These criteria are usually set by the researchers and are in concordance with the guiding question. For instance, if you are interested in analyzing speeches in the UN General Debate, these predetermined criteria are the time and scope conditions of these debates (speeches by countries at different points in time).</p>
<blockquote>
<p><sub><b> 3. Calculate a document-feature matrix (DFM) </b></sub></p>
</blockquote>
<p>Another essential component for text analysis is a <strong>document-feature matrix (DFM)</strong>, also called a <strong>document-term matrix (DTM)</strong>. The two terms are synonyms, but <code>quanteda</code> refers to a DFM whereas others refer to a DTM. It describes how frequently terms occur in the corpus by counting single terms.
To generate a DFM, we first split the text into its single terms (tokens). We then count how frequently each term (token) occurs in each document.</p>
<p>The following graphic describes visually how we turn raw text into a vector-space representation that is easily accessible and analyzable with quantitative statistical tools. It also visualizes how we can think of a DFM. The rows represent the documents that are part of the corpus and the columns show the different terms (tokens). The values in the cells indicate how frequently these terms (tokens) are used across the documents.</p>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-3"></span>
<img src="/../../../../article/advancing-text-mining/figures/dfm.png" alt="Model of a DFM" width="70%" />
<p class="caption">
Figure 1: Model of a DFM
</p>
</div>
<p>       </p>
<p><strong>Important things to remember about DFMs:</strong></p>
<ul>
<li>A <strong>corpus is positional (string of words)</strong> and a <strong>DFM is non-positional (bag of words)</strong>. Put differently, the order of the words matters in a corpus whereas a DFM does not have information on the position of words.</li>
<li>A <strong>token</strong> is each individual word in a text (but it could also be a sentence, paragraph, or character). This is why we call creating a “bag of words” also <strong>tokenizing text</strong>. In a nutshell, a DFM is a very efficient way of organizing the frequency of features/tokens but does not contain any information on their position. In our example, the <strong>features</strong> of a text are represented by the columns of a DFM and aggregate the frequency of each <strong>token</strong>.</li>
<li>In most projects you want <strong>one corpus to contain all your data</strong> and <strong>generate many DFMs</strong> from that.</li>
<li>The <strong>rows of a DFM</strong> can contain <strong>any unit</strong> on which you can <strong>aggregate documents</strong>. In the example above, we used the single documents as the unit. It may also well be more fine-grained with sub-documents or more aggregated with a larger collection of documents.</li>
<li>The <strong>columns of a DFM</strong> are <strong>any unit</strong> on which you can <strong>aggregate features</strong>. <strong>Features</strong> are extracted from the texts and quantitatively measurable. Features can be words, parts of the text, content categories, word counts, etc. In the example above, we used single words such as “united”, “nations”, and “peace”.</li>
</ul>
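<p>The idea behind a DFM can be illustrated in a few lines of base R. The following is only a minimal sketch and not how <code>quanteda</code> builds a DFM internally; the two toy documents and all object names (<code>toy_docs</code>, <code>toy_tokens</code>, <code>vocab</code>, <code>toy_dfm</code>) are made up for illustration.</p>
<pre class="r"><code># Two hypothetical toy documents
toy_docs &lt;- c(doc1 = &quot;united nations peace&quot;,
              doc2 = &quot;peace peace nations&quot;)

# Tokenize: split each document on whitespace
toy_tokens &lt;- strsplit(toy_docs, &quot;\\s+&quot;)

# Collect the vocabulary (these become the columns of the DFM)
vocab &lt;- sort(unique(unlist(toy_tokens)))

# Count how often each term occurs in each document
toy_dfm &lt;- t(vapply(
  toy_tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dfm) &lt;- vocab

toy_dfm</code></pre>
<p>The rows of the resulting matrix are the documents and the columns are the terms. The word order of the original texts is no longer recoverable, which is exactly the “bag of words” property described above.</p>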
<p>       </p>
<p>To showcase the three steps introduced above, we use the <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0TJX8Y">UN General Debate dataset by Mikhaylov, Baturo, and Dasandi</a>. There is also a <a href="https://rdrr.io/github/quanteda/quanteda.corpora/man/data_corpus_ungd2017.html">pre-processed version of the dataset</a> accessible with <code>quanteda.corpora</code>.</p>
<br/>
<details>
<p><summary>How to access the UNGD data with <code>quanteda.corpora</code></summary></p>
<pre class="r"><code># Install package quanteda.corpora
devtools::install_github(&quot;quanteda/quanteda.corpora&quot;)

# Load the dataset
quanteda.corpora::data_corpus_ungd2017</code></pre>
</details>
<p><br/></p>
<p>We will, however, mainly rely on the original dataset throughout the following explanations to match closely the regular workflow of textual data in R. If you want to replicate the steps, please download the data <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0TJX8Y">here</a> and unzip the zip file. Your <code>global_path</code> should direct you to the text file folders.</p>
<p>In a first step, we need to load the necessary packages and read in the data.</p>
<pre class="r"><code># Load all required packages
library(tidyverse)        # Attaches dplyr, ggplot2, and tidyr; haven is installed with it but not attached
library(quanteda)         # For NLP
library(readtext)         # To read .txt files
library(stm)              # For structural topic models
library(stminsights)      # For visual exploration of STM
library(wordcloud)        # To generate wordclouds
library(gsl)              # Required for the topicmodels package
library(topicmodels)      # For topicmodels
library(caret)            # For machine learning

# Download data here: 
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0TJX8Y 
# and unzip the zip file

# Read in data (.txt files)
global_path &lt;- &quot;path/to/folder/UN-data/&quot;

# We load the data (.txt files) from all subfolders (readtext can handle 
# this without specification) and store them in the main UNGDspeeches 
# dataframe. Beyond the speech text, this data also includes the 
# meta-data from the text filenames, which adds variables for the country, 
# UN session, and year.
# The code is based on https://github.com/quanteda/quanteda.corpora/issues/6
# and https://github.com/sjankin/UnitedNations/blob/master/files/UNGD_analysis_example.Rmd 

# For the purpose of this blog post, we use the data from all sessions.
UNGDspeeches &lt;- readtext(
  paste0(global_path, &quot;*/*.txt&quot;),
  docvarsfrom = &quot;filenames&quot;,
  docvarnames = c(&quot;country&quot;, &quot;session&quot;, &quot;year&quot;),
  dvsep = &quot;_&quot;,
  encoding = &quot;UTF-8&quot;
)</code></pre>
<p>We can then proceed and generate a corpus.</p>
<pre class="r"><code>mycorpus &lt;- corpus(UNGDspeeches)

# Assigns a unique identifier to each text
docvars(mycorpus, &quot;Textno&quot;) &lt;-
  sprintf(&quot;%02d&quot;, 1:ndoc(mycorpus)) </code></pre>
<p>As we can see (by calling the object <code>mycorpus</code>), the corpus consists of 8,093 documents.</p>
<details>
<p><summary>Output: <code>mycorpus</code></summary></p>
<pre class="r"><code>mycorpus</code></pre>
<pre><code>Corpus consisting of 8,093 documents and 4 docvars.</code></pre>
</details>
<p><br/></p>
<p>With this data, we can already generate first descriptive statistics.</p>
<pre class="r"><code># Save statistics in &quot;mycorpus.stats&quot;
mycorpus.stats &lt;- summary(mycorpus)

# And print the statistics of the first 10 observations
head(mycorpus.stats, n = 10)</code></pre>
<pre><code>#               Text Types Tokens Sentences country session year Textno
# 1  ALB_25_1970.txt  1727   9077       256     ALB      25 1970     01
# 2  ARG_25_1970.txt  1425   5192       218     ARG      25 1970     02
# 3  AUS_25_1970.txt  1611   5688       270     AUS      25 1970     03
# 4  AUT_25_1970.txt  1340   4717       164     AUT      25 1970     04
# 5  BEL_25_1970.txt  1289   4783       207     BEL      25 1970     05
# 6  BLR_25_1970.txt  1427   6138       204     BLR      25 1970     06
# 7  BOL_25_1970.txt  1559   5612       225     BOL      25 1970     07
# 8  BRA_25_1970.txt  1333   4422       154     BRA      25 1970     08
# 9  CAN_25_1970.txt   728   1887        97     CAN      25 1970     09
# 10 CMR_25_1970.txt   928   3144       106     CMR      25 1970     10</code></pre>
<!-- ##### Calculate the DFM <a name="dfm"></a> -->
<p>In a next step, we can also calculate the <strong>document-feature matrix</strong>. To do so, we first need to generate tokens (<code>tokens()</code>) and can also already <a href="https://github.com/sjankin/UnitedNations/blob/master/files/UNGD_analysis_example.Rmd"><strong>pre-process the data</strong></a>. This includes removing numbers (<code>remove_numbers</code>), punctuation (<code>remove_punct</code>), symbols (<code>remove_symbols</code>), and URLs beginning with http(s) (<code>remove_url</code>).</p>
<p>An earlier version of this blog post used <code>remove_hyphens</code> to remove hyphens as well as <code>remove_twitter</code> to remove symbols such as @ and #. Both arguments are now deprecated or defunct. To remove hyphens, it is recommended to use <code>split_hyphens</code> instead. For Twitter symbols there is no new argument. Quanteda’s manual recommends using <a href="https://quanteda.io/reference/tokens.html">“an alternative tokenizer, including non-quanteda options”</a>.</p>
<p>We further include the docvars from our corpus (<code>include_docvars</code>).</p>
<pre class="r"><code># Preprocess the text

# Create tokens
token &lt;-
  tokens(
    mycorpus,
    split_hyphens = TRUE,
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_url = TRUE,
    include_docvars = TRUE
  )</code></pre>
<p>Since the pre-1994 documents were scanned with OCR scanners, several tokens with combinations of digits and characters were introduced. We clean them manually following <a href="https://github.com/sjankin/UnitedNations/blob/master/files/UNGD_analysis_example.Rmd">this guideline</a>.</p>
<pre class="r"><code># Clean tokens created by OCR
token_ungd &lt;- tokens_select(
  token,
  c(&quot;[\\d-]&quot;, &quot;[[:punct:]]&quot;, &quot;^.{1,2}$&quot;),
  selection = &quot;remove&quot;,
  valuetype = &quot;regex&quot;,
  verbose = TRUE
)</code></pre>
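<p>To see what the three patterns do, we can apply them to a handful of made-up tokens in base R. This is only a sketch: the tokens are hypothetical, and base R’s <code>grepl()</code> (here with <code>perl = TRUE</code>) uses a different regex engine than <code>quanteda</code>, so edge cases may behave differently.</p>
<pre class="r"><code># Hypothetical tokens, including typical OCR debris
toy_tokens &lt;- c(&quot;peace&quot;, &quot;n0tion&quot;, &quot;co-operation&quot;, &quot;of&quot;, &quot;U.N&quot;, &quot;nations&quot;)

# The three patterns used in tokens_select() above:
# digits or hyphens, punctuation, and tokens of one or two characters
patterns &lt;- c(&quot;[\\d-]&quot;, &quot;[[:punct:]]&quot;, &quot;^.{1,2}$&quot;)

# Flag a token if it matches at least one pattern
# (perl = TRUE so that \d is understood inside the character class)
flagged &lt;- rep(FALSE, length(toy_tokens))
for (p in patterns) {
  flagged &lt;- flagged | grepl(p, toy_tokens, perl = TRUE)
}

# Tokens that survive the cleaning
toy_tokens[!flagged]</code></pre>
<p>In this toy example, only “peace” and “nations” are kept: the other tokens contain a digit, a hyphen, or a punctuation character, or are at most two characters long.</p>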
<p>In the next step, we then create the <strong>document-feature matrix</strong>. We lowercase and stem the words (<code>tolower</code> and <code>stem</code>) and remove common stop words (<code>remove=stopwords()</code>). Stopwords are words that appear frequently in texts but carry little substantive meaning (e.g., “the”, “a”, or “for”). Since the language of all documents is English, we only remove English stopwords here. <code>quanteda</code> can also deal with stopwords from other languages (for more information see <a href="https://quanteda.io/reference/stopwords.html">here</a>).</p>
<pre class="r"><code>mydfm &lt;- dfm(token_ungd,
             tolower = TRUE,
             stem = TRUE,
             remove = stopwords(&quot;english&quot;)
             )</code></pre>
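<p>In more recent versions of <code>quanteda</code> (version 3 and later, as far as we are aware), the arguments <code>stem</code> and <code>remove</code> of <code>dfm()</code> are deprecated. The same steps can be carried out on the tokens before the DFM is created. The following sketch uses two made-up sentences instead of the UNGD corpus; the object names <code>toy_tokens</code> and <code>toy_dfm</code> are hypothetical.</p>
<pre class="r"><code>library(quanteda)

# Hypothetical toy input instead of the UNGD corpus
toy_tokens &lt;- tokens(c(d1 = &quot;The nations are working for peace&quot;,
                       d2 = &quot;A peaceful world for the united nations&quot;))

# Lowercase, remove English stopwords, then stem
toy_tokens &lt;- tokens_tolower(toy_tokens)
toy_tokens &lt;- tokens_remove(toy_tokens, pattern = stopwords(&quot;english&quot;))
toy_tokens &lt;- tokens_wordstem(toy_tokens)

# Create the DFM from the pre-processed tokens
toy_dfm &lt;- dfm(toy_tokens)

featnames(toy_dfm)</code></pre>
<p>Stopwords are removed before stemming in this sketch because the stopword list contains unstemmed words, so a stemmed stopword might no longer match it.</p>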
<p>We can also trim the DFM with <code>dfm_trim</code>. With the specifications below, we drop features that appear in fewer than 7.5% or in more than 90% of the documents. This rather conservative approach is possible because we have a sufficiently large corpus.</p>
<pre class="r"><code>mydfm.trim &lt;-
  dfm_trim(
    mydfm,
    min_docfreq = 0.075,  # min 7.5%
    max_docfreq = 0.90,   # max 90%
    docfreq_type = &quot;prop&quot;
  ) </code></pre>
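<p>The logic of trimming by document frequency can be sketched in base R. The count matrix below is made up, the bounds (30% and 90%) are chosen only to make the toy example informative, and we treat both bounds as inclusive, which is an assumption of this sketch.</p>
<pre class="r"><code># Hypothetical counts: 4 documents (rows) x 3 features (columns)
toy_counts &lt;- matrix(c(2, 0, 1,
                       1, 0, 3,
                       4, 0, 0,
                       1, 5, 0),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0(&quot;doc&quot;, 1:4),
                                     c(&quot;nation&quot;, &quot;rare&quot;, &quot;peac&quot;)))

# Share of documents in which each feature occurs at least once
doc_share &lt;- colMeans(toy_counts &gt; 0)

# Keep only features whose document share lies within the bounds
keep &lt;- doc_share &gt;= 0.30 &amp; doc_share &lt;= 0.90
toy_trimmed &lt;- toy_counts[, keep, drop = FALSE]

colnames(toy_trimmed)</code></pre>
<p>Here, “nation” occurs in every document and “rare” in only one of four, so both are dropped and only “peac” remains. Note that the total count of a feature does not matter for this filter, only the number of documents in which it appears.</p>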
<p>To get a look at the DFM, we now print its first 5 documents and first 10 features:</p>
<pre class="r"><code># Print the first 5 documents and first 10 features of the sorted DFM
head(dfm_sort(mydfm.trim, decreasing = TRUE, margin = &quot;both&quot;),
     n = 5,
     nf = 10) </code></pre>
<pre><code>Document-feature matrix of: 5 documents, 10 features (4.0% sparse) and 4 docvars.
                 features
docs              problem session conflict council africa global resolut hope south situat
  CUB_34_1979.txt      36       8        1       0     13      3      10    8    10     23
  IRL_39_1984.txt      41       9       18      11     14      5      16   21    19      9
  PAN_37_1982.txt      14      12       12       8     11      2       6   11    20     10
  BFA_29_1974.txt      25      17        1       4     15      0      10   20     6      9
  GRC_43_1988.txt      27       9       13      14     10      2      10   10    11      5</code></pre>
<p>The sparsity gives us information about the proportion of cells that have zero counts.</p>
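<p>We can verify the reported sparsity by hand. The following sketch simply retypes the 5 x 10 counts printed above into a base R matrix (the object name <code>shown_dfm</code> is made up) and computes the share of cells that are zero.</p>
<pre class="r"><code># Counts copied from the printed DFM above (5 documents x 10 features)
shown_dfm &lt;- matrix(
  c(36,  8,  1,  0, 13, 3, 10,  8, 10, 23,
    41,  9, 18, 11, 14, 5, 16, 21, 19,  9,
    14, 12, 12,  8, 11, 2,  6, 11, 20, 10,
    25, 17,  1,  4, 15, 0, 10, 20,  6,  9,
    27,  9, 13, 14, 10, 2, 10, 10, 11,  5),
  nrow = 5, byrow = TRUE
)

# Sparsity: percentage of cells with a zero count
sparsity &lt;- mean(shown_dfm == 0) * 100
sparsity</code></pre>
<p>Two of the 50 cells are zero, which gives the 4.0% reported in the header of the output.</p>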
</div>
<div id="classification" class="section level3">
<h3>Classification <a name="classification"></a></h3>
<p>A next step can involve the classification of the text. The article by <a href="https://www.cambridge.org/core/journals/political-analysis/article/text-as-data-the-promise-and-pitfalls-of-automatic-content-analysis-methods-for-political-texts/F7AAC8B2909441603FEB25C156448F20">Grimmer and Stewart (2013)</a> provides a good overview for this step. The upcoming section follows their structure. Classification sorts texts into categories. The following figure is based on the one by <a href="https://www.cambridge.org/core/journals/political-analysis/article/text-as-data-the-promise-and-pitfalls-of-automatic-content-analysis-methods-for-political-texts/F7AAC8B2909441603FEB25C156448F20">Grimmer and Stewart (2013, 268)</a> and illustrates a possible structure of classification.</p>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-14"></span>
<img src="/../../../../article/advancing-text-mining/figures/overview.png" alt="Overview of classification (own illustration, based on Grimmer and Stewart (2013, 268))" width="70%" />
<p class="caption">
Figure 2: Overview of classification (own illustration, based on <a href="https://www.cambridge.org/core/journals/political-analysis/article/text-as-data-the-promise-and-pitfalls-of-automatic-content-analysis-methods-for-political-texts/F7AAC8B2909441603FEB25C156448F20">Grimmer and Stewart (2013, 268)</a>)
</p>
</div>
<p>       </p>
<p>A researcher usually faces one of the following situations: <strong>The categories are known beforehand</strong> or <strong>the categories are unknown</strong>. If the researcher knows the categories, s/he can use automated methods to minimize the workload that is associated with the categorization of the texts. Throughout the workshop, two methods were presented: a <strong>dictionary method</strong> and a <strong>supervised method</strong>. If the researcher does not know the categories, s/he is likely to resort to <strong>unsupervised machine learning</strong>. The following section provides illustrative examples for these methods.</p>
<div id="known-categories" class="section level4">
<h4>Known categories <a name="knowncategories"></a></h4>
<div id="known-categories-dictionaries" class="section level5">
<h5>Known categories: Dictionaries <a name="dictionaries"></a></h5>
<p><strong>Dictionaries</strong> contain lists of words that correspond to different categories. If we apply a dictionary approach, we count how often words that are associated with different categories are represented in each document. These dictionaries help us to classify (or categorize) the speeches based on the frequency of the words that they contain. Popular dictionaries are sentiment dictionaries (such as <a href="https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html">Bing</a>, <a href="http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010">Afinn</a> or <a href="http://liwc.wpengine.com">LIWC</a>) or <a href="http://www.lexicoder.com/index.html">LexiCoder</a>.
<!-- [You may also want to normalize the length of the document.](http://pablobarbera.com/big-data-upf/slides/05-text.pdf) --></p>
<p>We use the “LexiCoder Policy Agenda” dictionary that can be accessed <a href="http://www.lexicoder.com/download.html">here</a> in a .lcd format.
The “LexiCoder Policy Agenda” dictionary captures major topics from the <a href="https://www.comparativeagendas.net">comparative Policy Agenda project</a> and is currently available in Dutch and English.</p>
<p>To read in the dictionary, we use <code>quanteda</code>’s built-in function <code>dictionary()</code>.</p>
<pre class="r"><code># load the dictionary with quanteda&#39;s built-in function
dict &lt;- dictionary(file = &quot;policy_agendas_english.lcd&quot;)</code></pre>
<p>We apply this dictionary to compute the share of each country’s speeches devoted to immigration, international affairs, and defence.</p>
<pre class="r"><code>mydfm.un &lt;- dfm(mydfm.trim, groups = &quot;country&quot;, dictionary = dict)

un.topics.pa &lt;- convert(mydfm.un, &quot;data.frame&quot;) %&gt;%
  dplyr::rename(country = doc_id) %&gt;%
  select(country, immigration, intl_affairs, defence) %&gt;%
  tidyr::gather(immigration:defence, key = &quot;Topic&quot;, value = &quot;Share&quot;) %&gt;%
  group_by(country) %&gt;%
  mutate(Share = Share / sum(Share)) %&gt;%
  mutate(Topic = haven::as_factor(Topic))</code></pre>
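<p>The call above passes <code>groups</code> and <code>dictionary</code> directly to <code>dfm()</code>, which, as far as we are aware, is deprecated in more recent versions of <code>quanteda</code>. A hedged sketch of the same two steps with <code>dfm_group()</code> and <code>dfm_lookup()</code> follows. The three toy texts, the country codes, and the mini dictionary are made up and are not part of the LexiCoder dictionary.</p>
<pre class="r"><code>library(quanteda)

# Hypothetical toy data: three short texts from two made-up countries
toy_corpus &lt;- corpus(
  c(&quot;Migrants and refugees need protection&quot;,
    &quot;Our army and our diplomats work together&quot;,
    &quot;Diplomacy prevents war&quot;),
  docvars = data.frame(country = c(&quot;AAA&quot;, &quot;AAA&quot;, &quot;BBB&quot;))
)

# Hypothetical mini dictionary (not the LexiCoder dictionary)
toy_dict &lt;- dictionary(list(
  immigration  = c(&quot;migrant*&quot;, &quot;refugee*&quot;),
  intl_affairs = c(&quot;diplom*&quot;),
  defence      = c(&quot;army&quot;, &quot;war&quot;)
))

toy_dfm &lt;- dfm(tokens(toy_corpus))

# Step 1: aggregate the documents by country
toy_grouped &lt;- dfm_group(toy_dfm,
                         groups = as.factor(docvars(toy_dfm, &quot;country&quot;)))

# Step 2: replace the features by the dictionary categories they match
toy_topics &lt;- dfm_lookup(toy_grouped, dictionary = toy_dict)

toy_topics</code></pre>
<p>The result has one row per country and one column per dictionary category, and can be passed to <code>convert()</code> in the same way as <code>mydfm.un</code> above.</p>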
<p>In a next step, we can visualize the results with <code>ggplot</code>. This gives us a first impression of the distribution of the topics in the UN General Debate corpus across countries.</p>
<pre class="r"><code>un.topics.pa %&gt;%
  ggplot(aes(country, Share, colour = Topic, fill = Topic)) +
  geom_bar(stat = &quot;identity&quot;) +
  scale_colour_brewer(palette = &quot;Set1&quot;) +
  scale_fill_brewer(palette = &quot;Pastel1&quot;) +
  ggtitle(&quot;Distribution of PA topics in the UN General Debate corpus&quot;) +
  xlab(&quot;&quot;) +
  ylab(&quot;Topic share (%)&quot;) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())</code></pre>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-18"></span>
<img src="/../../../../article/advancing-text-mining/figures/distribution_topics_un.png" alt="Distribution of PA topics in the UN General Debate corpus" width="70%" />
<p class="caption">
Figure 3: Distribution of PA topics in the UN General Debate corpus
</p>
</div>
<p><br/></p>
<p>We observe a relatively high share for both defence and international affairs whereas immigration receives less attention in the speeches.</p>
</div>
<div id="known-categories-supervised-machine-learning---naive-bayes-nb" class="section level5">
<h5>Known categories: Supervised machine learning - Naive Bayes (NB) <a name="nb"></a></h5>
<p>We now turn to <strong>supervised machine learning</strong>. Similar to the dictionary approach explained above, <strong>this method also requires some pre-existing classifications</strong>. But in contrast to a dictionary, we now divide the data into a <strong>training</strong> and a <strong>test dataset</strong>. This follows the general logic of machine learning algorithms. The training data already contains the classifications and <em>trains</em> the algorithm (e.g., our Naive Bayes classifier) to predict the class of a text based on the features that are given. A Naive Bayes classifier calculates the probability of each class based on the features and assigns the class with the highest probability. It is <strong>based on Bayes’ theorem for conditional probability</strong>, which can be formally written as:
<span class="math display">\[ P(A | B) = \frac{P(A) * P(B | A)}{P(B)}\]</span>
In plain words, the probability of A given B equals the probability of A, multiplied by the probability of B given A, and divided by the probability of B.</p>
<br/>
<details>
<p><summary>Bayes’ theorem</summary></p>
<ul>
<li><span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> are events</li>
<li><span class="math inline">\(P(A)\)</span> and <span class="math inline">\(P(B)\)</span> are the probabilities of observing <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>, respectively, without regard to each other</li>
<li><span class="math inline">\(P(A) \neq 0\)</span> and <span class="math inline">\(P(B) \neq 0\)</span></li>
<li><span class="math inline">\(P(A|B)\)</span> is the <strong>conditional probability</strong> that <span class="math inline">\(A\)</span> occurs when <span class="math inline">\(B\)</span> is true
<span class="math display">\[P(A|B) = \frac{P(A \cap B)}{P(B)}, \text{ if } P(B) \neq 0\]</span></li>
<li><span class="math inline">\(P(B|A)\)</span> is the <strong>conditional probability</strong> that <span class="math inline">\(B\)</span> occurs when <span class="math inline">\(A\)</span> is true
<span class="math display">\[P(B|A) = \frac{P(B \cap A)}{P(A)}, \text{ if } P(A) \neq 0\]</span></li>
<li>And we also have the <strong>joint probability</strong> <span class="math inline">\(P(A \cap B) = P(B \cap A)\)</span>, from which Bayes’ theorem follows:</li>
</ul>
<span class="math display">\[ \Longrightarrow P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)\]</span>
<span class="math display">\[ \Longrightarrow P(A | B) = \frac{P(A \cap B)}{P(B)}\]</span>
<span class="math display">\[ \Longrightarrow P(A | B) = \frac{\frac{P(A \cap B)}{P(A)}*P(A)}{P(B)}\]</span>
<span class="math display">\[ \Longrightarrow P(A | B) = \frac{P(B|A)*P(A)}{P(B)}\]</span>
</details>
<p><br/></p>
<p><strong>Why is Naive Bayes “naive”?</strong> Naive Bayes is “naive” because of its <strong>strong independence assumptions</strong>. It assumes that all features are equally important and that all features are independent. If you think of n-grams and compare unigrams and bigrams, you can intuitively understand why the last assumption is a strong assumption. A unigram counts each word as a gram (“I” “like” “walking” “in” “the” “sun”) whereas a bigram counts two words as a gram (“I like” “like walking” “walking in” “in the” “the sun”).</p>
<p>However, even when the assumptions are not fully met, Naive Bayes still performs well.</p>
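<p>The unigram and bigram example from above can be reproduced in base R. This is only an illustration of the concept; within <code>quanteda</code>, the function <code>tokens_ngrams()</code> serves this purpose. The object names are made up.</p>
<pre class="r"><code># Unigrams: each word is one token
tokens_uni &lt;- c(&quot;I&quot;, &quot;like&quot;, &quot;walking&quot;, &quot;in&quot;, &quot;the&quot;, &quot;sun&quot;)

# Bigrams: paste each word together with its successor
tokens_bi &lt;- paste(head(tokens_uni, -1), tail(tokens_uni, -1))

tokens_bi</code></pre>
<p>Each bigram shares a word with its neighbor, which makes it intuitive that such features are not independent of each other.</p>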
<p>Naive Bayes is a <strong>relatively simple classification algorithm</strong> that requires little computing time and memory. To use a Naive Bayes classifier, we rely on the function <a href="http://quanteda.io/reference/textmodel_nb.html"><code>textmodel_nb</code></a>, which was built into <code>quanteda</code> when this post was written and is, in more recent versions, provided by the companion package <code>quanteda.textmodels</code>.</p>
<p>To perform the Naive Bayes estimation, we proceed with the following steps:</p>
<blockquote>
<p><sub><strong>1. We set up training and test data based on the corpus.</strong></sub><br/>
<sub><strong>2. Based on these two datasets, we generate a DFM each.</strong></sub><br/>
<sub><strong>3. We train the algorithm by feeding in the training data and eventually use the test data to evaluate it.</strong></sub><br/>
<sub><strong>4. We then check the performance (accuracy) of our results.</strong></sub><br/>
<sub><strong>5. And compare it with a random prediction.</strong></sub><br/></p>
</blockquote>
<p>For this example, we use the pre-labeled dataset that is used for the algorithm <a href="https://github.com/koheiw/Newsmap">newsmap</a> by <a href="https://koheiw.net/?p=293">Kohei Watanabe</a>. The dataset contains information on the geographical location of newspaper articles. We introduce this new dataset as Naive Bayes – a supervised machine learning algorithm – requires pre-labeled data.</p>
<p>We first load the dataset. To do so, we follow Kohei Watanabe’s description <a href="https://github.com/koheiw/Newsmap">here</a>, download the <a href="https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1">corpus of Yahoo News from 2014</a>, and follow the subsequent processing steps he describes.</p>
<pre class="r"><code># load data
load(&quot;../newspaper.RData&quot;)

# transform variables
pred_data$text &lt;- as.character(pred_data$text)
pred_data$country &lt;- as.character(pred_data$country)</code></pre>
<p>For simplicity, we keep only the USA, Great Britain, France, Brazil, and Japan.</p>
<pre class="r"><code>pred_data &lt;- pred_data %&gt;%
  dplyr::filter(country %in% c(&quot;us&quot;, &quot;gb&quot;, &quot;fr&quot;, &quot;br&quot;, &quot;jp&quot;)) %&gt;%
  dplyr::select(text, country)</code></pre>
<pre class="r"><code>head(pred_data)</code></pre>
<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;">
row
</th>
<th style="text-align:left;">
text
</th>
<th style="text-align:left;">
country
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;">
1
</td>
<td style="text-align:left;">
’08 French champ Ivanovic loses to Safarova in 3rd. PARIS (AP) - Former French Open champion Ana Ivanovic lost in the third round Saturday, beaten 6-3, 6-3 by 23rd-seeded Lucie Safarova of the Czech Republic.
</td>
<td style="text-align:left;">
fr
</td>
</tr>
<tr>
<td style="text-align:right;">
2
</td>
<td style="text-align:left;">
Up to USD1,000 a day to care for child migrants. More than 57,000 unaccompanied children, mostly from Central America, have been caught entering the country illegally since last October, and President Barack Obama has asked for USD3.7 billion in emergency funding to address what he has called an ‘urgent humanitarian solution.’ ‘One of the figures that sticks in everybody’s mind is we’re paying about USD250 to USD1,000 per child,’ Senator Jeff Flake told reporters, citing figures presented at a closed-door briefing by Homeland Security Secretary Jeh Johnson. Federal authorities are struggling to find more cost-effective housing, medical care, counseling and legal services for the undocumented minors. The base cost per bed was USD250 per day, including other services, Senator Dianne Feinstein said, without providing details.
</td>
<td style="text-align:left;">
us
</td>
</tr>
<tr>
<td style="text-align:right;">
3
</td>
<td style="text-align:left;">
1,400 gay weddings in England, Wales in first three months. Just over 1,400 gay couples tied the knot in the three months after same-sex marriage was allowed in England and Wales, figures out Thursday showed. The Office for National Statistics said 1,409 marriages took place between March 29 and June 30. ‘The novelty and significance of marriage becoming available led to an initial rush among same-sex couples wanting to be among the very first to assume the same rights and protection afforded to heterosexual couples,’ said James Brown, a partner at law firm JMW Solicitors. The figures will likely surge from December once civil partnerships can be converted into marriages.
</td>
<td style="text-align:left;">
gb
</td>
</tr>
<tr>
<td style="text-align:right;">
4
</td>
<td style="text-align:left;">
1 dead after fan fighting in Brazil. SAO PAULO (AP) - Police say a 21-year-old man died after a confrontation between rival football fan groups in Brazil on Sunday.
</td>
<td style="text-align:left;">
br
</td>
</tr>
<tr>
<td style="text-align:right;">
5
</td>
<td style="text-align:left;">
1 dead as plane with French tourists crashes in US. PAGE, Arizona (AP) - Authorities say a small plane carrying French tourists crashed while trying to land at an airport in Arizona, and one person was killed and another hospitalized.
</td>
<td style="text-align:left;">
fr
</td>
</tr>
<tr>
<td style="text-align:right;">
6
</td>
<td style="text-align:left;">
1 US theory is someone diverted missing plane. WASHINGTON (AP) - A U.S. official says investigators are examining the possibility that someone caused the disappearance of a Malaysia Airlines jet with 239 people on board, and that it may have been ‘an act of piracy.’
</td>
<td style="text-align:left;">
us
</td>
</tr>
</tbody>
</table>
<p>We then convert the data into a corpus. Our final corpus thus includes the news texts together with their country labels.</p>
<pre class="r"><code>data_corpus &lt;- corpus(pred_data, text_field = &quot;text&quot;)</code></pre>
<p>In a first step, we need to define our training and our test dataset. Based on these two datasets, we generate a DFM each, applying similar general pre-processing steps as discussed above. The code is based on <a href="http://cbpuschmann.net/quanteda_mzes/">Cornelius’ code</a> and <a href="https://tutorials.quanteda.io/machine-learning/nb/">quanteda’s example</a>.</p>
<pre class="r"><code># Set a seed for replication purposes
set.seed(68159)

# Draw 10,000 random document IDs without replacement
training_id &lt;- sample(1:29542, 10000, replace = FALSE)

# Create docvar with ID
docvars(data_corpus, &quot;id_numeric&quot;) &lt;- 1:ndoc(data_corpus)

# Get training set
dfmat_training &lt;-
  corpus_subset(data_corpus, id_numeric %in% training_id) %&gt;%
  dfm(stem = TRUE)

# Get test set (documents not in training_id)
dfmat_test &lt;-
  corpus_subset(data_corpus, !id_numeric %in% training_id) %&gt;%
  dfm(stem = TRUE)</code></pre>
<p>We can now check the distribution of the countries across the two DFMs:</p>
<pre class="r"><code>print(prop.table(table(docvars(
  dfmat_training, &quot;country&quot;
))) * 100)</code></pre>
<pre><code>   br    fr    gb    jp    us 
13.34 14.37 33.65 11.10 27.54 </code></pre>
<pre class="r"><code>print(prop.table(table(docvars(
  dfmat_test, &quot;country&quot;
))) * 100)</code></pre>
<pre><code>   br       fr       gb       jp       us 
13.70893 15.13151 33.02630 10.93542 27.19783 </code></pre>
<p>As we can see, the countries are distributed similarly across both DFMs.</p>
<p>In a next step, we train the Naive Bayes classifier. Going back to the formula stated above, we are interested in the probability of A conditional on B.</p>
<p><span class="math display">\[ P(A | B) = \frac{P(A) * P(B | A)}{P(B)}\]</span></p>
<p><strong>A is what we want to know</strong> (the country that is mainly addressed in each text) and <strong>B is what we see</strong> (the text). We can now proceed and replace A and B with the respective terms. This leads us to the next equation:</p>
<p><span class="math display">\[ P(Country | Text) = \frac{P(Country) * P(Text | Country)}{P(Text)}\]</span></p>
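<p>To make this equation concrete, the following base R sketch computes the posterior probabilities for a text that consists of only two words. The priors are the class shares of the training data reported above. The word probabilities <code>lik</code> are made up for illustration and are not estimates from our data.</p>
<pre class="r"><code># P(Country): class shares of the training data shown above
prior &lt;- c(br = 0.1334, fr = 0.1437, gb = 0.3365, jp = 0.1110, us = 0.2754)

# Hypothetical P(word | Country) for two words; these numbers are made up
lik &lt;- rbind(
  paris = c(br = 0.001, fr = 0.020, gb = 0.002, jp = 0.001, us = 0.002),
  open  = c(br = 0.004, fr = 0.005, gb = 0.006, jp = 0.003, us = 0.005)
)

# &quot;Naive&quot; step: P(Text | Country) is the product of the word probabilities
lik_text &lt;- apply(lik, 2, prod)

# Bayes&#39; theorem: P(Text) is the sum of the numerator over all countries
posterior &lt;- prior * lik_text / sum(prior * lik_text)

round(posterior, 3)

# The predicted class is the one with the highest posterior probability
names(which.max(posterior))</code></pre>
<p>Although Great Britain has the highest prior, the text is assigned to France in this toy example because the word “paris” is assumed to be far more likely in French articles. With real data, the products are usually computed as sums of logarithms to avoid numerical underflow.</p>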
<p>We can then proceed and train our algorithm using quanteda’s built-in function <code>textmodel_nb</code>.</p>
<pre class="r"><code># Train naive Bayes
# The function takes a DFM as the first argument 
model.NB &lt;-
  textmodel_nb(dfmat_training, docvars(dfmat_training, &quot;country&quot;), prior = &quot;docfreq&quot;)

# The prior indicates an assumed distribution. 
# Here we choose how frequently the categories occur in our data.</code></pre>
<pre class="r"><code># Make the test DFM use exactly the same features as the training DFM
dfmat_matched &lt;-
  dfm_match(dfmat_test, features = featnames(dfmat_training))</code></pre>
<p>The command <code>summary(model.NB)</code> shows the fitted model, that is, the class priors and the estimated feature scores. Click to unfold the results.</p>
<br/>
<details>
<p><summary>Code: <code>summary(model.NB)</code></summary></p>
<pre class="r"><code>summary(model.NB)</code></pre>
<pre><code>Call:
textmodel_nb.dfm(x = dfmat_training, y = docvars(dfmat_training, 
    &quot;country&quot;), prior = &quot;docfreq&quot;)

Class Priors:
(showing first 5 elements)
    br     fr     gb     jp     us 
0.1334 0.1437 0.3365 0.1110 0.2754 

Estimated Feature Scores:
         &#39;     08   french   champ  ivanov    lose      to safarova     in     3rd      .     pari
br 0.08949 0.1679 0.007791 0.07636 0.08195 0.08707 0.09989   0.1474 0.1214 0.34322 0.1176 0.008863
fr 0.14212 0.2778 0.940061 0.37912 0.20345 0.16469 0.13803   0.3659 0.1473 0.24345 0.1296 0.947706
gb 0.43332 0.1532 0.032008 0.24397 0.14963 0.49395 0.35845   0.1346 0.3540 0.17905 0.3321 0.032363
jp 0.06507 0.1184 0.004396 0.10771 0.28900 0.06580 0.10287   0.1040 0.1007 0.06916 0.1059 0.002778
us 0.27000 0.2827 0.015743 0.19285 0.27598 0.18850 0.30077   0.2482 0.2766 0.16512 0.3148 0.008290
         (      ap       )      -  former    open champion     ana    lost    the   third   round
br 0.16080 0.24091 0.16100 0.1611 0.09034 0.17116  0.12708 0.08142 0.08045 0.1101 0.08964 0.30833
fr 0.14221 0.14820 0.14219 0.1363 0.17312 0.18636  0.24110 0.13475 0.16643 0.1350 0.17178 0.19535
gb 0.34860 0.28303 0.34870 0.3403 0.45720 0.36654  0.51497 0.07433 0.38923 0.3603 0.39194 0.24189
jp 0.09123 0.08125 0.09126 0.0922 0.04583 0.07443  0.05684 0.22969 0.07943 0.1005 0.07653 0.04757
us 0.25715 0.24661 0.25685 0.2702 0.23351 0.20152  0.06002 0.47982 0.28446 0.2940 0.27010 0.20686
   saturday      ,  beaten     6-3      by 23rd-seed
br  0.14287 0.1078 0.19323 0.21226 0.09731    0.1955
fr  0.19921 0.1384 0.20558 0.40985 0.12900    0.3236
gb  0.44845 0.3437 0.45359 0.21530 0.33256    0.1785
jp  0.09248 0.1099 0.07787 0.08317 0.11636    0.1379
us  0.11699 0.3001 0.06972 0.07942 0.32477    0.1646</code></pre>
</details>
<p><br/></p>
<p>To better understand how well we did, we can tabulate the shares of wrong and right predictions. Note that <code>predict(model.NB)</code> without new data returns predictions for the training data.</p>
<pre class="r"><code>prop.table(table(predict(model.NB) == docvars(dfmat_training, &quot;country&quot;))) * 100</code></pre>
<pre><code>FALSE  TRUE 
 2.71 97.29  </code></pre>
<p>To check whether this result indicates good performance, we compare it with a random baseline. We shuffle the predicted labels, which keeps their overall frequency distribution constant, so that the random assignment has a fair chance of producing correct classifications.</p>
<pre class="r"><code>prop.table(table(sample(predict(model.NB)) == docvars(dfmat_training, &quot;country&quot;))) * 100</code></pre>
<pre><code>FALSE  TRUE 
76.63 23.37  </code></pre>
<p>As the summary table below shows, our Naive Bayes classifier clearly outperforms the random baseline.</p>
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
</th>
<th style="text-align:right;">
Naive Bayes
</th>
<th style="text-align:right;">
Random
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
False
</td>
<td style="text-align:right;">
2.71
</td>
<td style="text-align:right;">
76.63
</td>
</tr>
<tr>
<td style="text-align:left;">
True
</td>
<td style="text-align:right;">
97.29
</td>
<td style="text-align:right;">
23.37
</td>
</tr>
</tbody>
</table>
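<p>The logic of the random baseline can be sketched on toy data. The example below is only an illustration under assumed values: the class sizes, the three misclassified positions and all object names are made up and are not taken from the news corpus. It also shows that the expected accuracy of a shuffled prediction is the sum, over classes, of the product of the actual and predicted class shares.</p>
<pre class="r"><code># Toy data (hypothetical): 100 documents in five classes
set.seed(123)
classes &lt;- c(&quot;br&quot;, &quot;fr&quot;, &quot;gb&quot;, &quot;jp&quot;, &quot;us&quot;)
actual &lt;- factor(rep(classes, times = c(13, 14, 34, 11, 28)), levels = classes)

# A hypothetical classifier that gets three documents wrong
predicted &lt;- actual
predicted[c(1, 20, 70)] &lt;- c(&quot;us&quot;, &quot;gb&quot;, &quot;fr&quot;)

# Accuracy of the classifier and of one random shuffle of its predictions
acc_model &lt;- mean(predicted == actual)
acc_shuffled &lt;- mean(sample(predicted) == actual)

# Expected accuracy of a shuffle: sum over classes of share(actual) * share(predicted)
acc_expected &lt;- sum(prop.table(table(actual)) * prop.table(table(predicted)))

round(c(model = acc_model, shuffled = acc_shuffled, expected = acc_expected), 3)</code></pre>
<p>Because the shuffle keeps the class frequencies, the baseline depends on how unbalanced the classes are: the more dominant the largest class, the higher the accuracy that chance alone can reach.</p>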
<p>We are likely to increase our accuracy even more by pre-processing our text data.</p>
<p>A confusion matrix helps us assess how well our algorithm performed. It contrasts the predictions for all five countries with the actual classes given by our data. For example, we predict 2580 articles as belonging to Brazil that actually belong to Brazil. However, we also predict 39 articles as Brazilian while they are actually French. Overall, when we look at the diagonal, we see that most predictions classify the articles correctly and that our algorithm performs well.</p>
<pre class="r"><code>actual_class &lt;- docvars(dfmat_matched, &quot;country&quot;)
predicted_class &lt;- predict(model.NB, newdata = dfmat_matched)
tab_class &lt;- table(actual_class, predicted_class)
tab_class</code></pre>
<pre><code>             predicted_class
actual_class   br   fr   gb   jp   us
          br 2580    6   50    6   37
          fr   39 2697  124   11   86
          gb   24   52 6072   24  282
          jp   33    9   31 1945  119
          us   56   32  194   66 4967</code></pre>
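<p>We store the output of <code>confusionMatrix()</code> in an object because we need it later to visualize the results. Note that <code>confusionMatrix()</code> treats the rows of a table as predictions and the columns as the reference. Since <code>tab_class</code> has the actual classes in its rows, the per-class values labelled as sensitivity/recall and as precision in the output below are swapped relative to their usual definitions; the overall accuracy is unaffected.</p>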
<pre class="r"><code>confusion &lt;- confusionMatrix(tab_class, mode = &quot;everything&quot;)</code></pre>
<pre><code>Confusion Matrix and Statistics

            predicted_class
actual_class   br   fr   gb   jp   us
          br 2580    6   50    6   37
          fr   39 2697  124   11   86
          gb   24   52 6072   24  282
          jp   33    9   31 1945  119
          us   56   32  194   66 4967

Overall Statistics
                                          
               Accuracy : 0.9344          
                 95% CI : (0.9309, 0.9379)
    No Information Rate : 0.3311          
    P-Value [Acc &gt; NIR] : &lt; 2.2e-16       
                                          
                  Kappa : 0.914           
                                          
 Mcnemar&#39;s Test P-Value : &lt; 2.2e-16       

Statistics by Class:

                     Class: br Class: fr Class: gb Class: jp Class: us
Sensitivity             0.9444    0.9646    0.9383   0.94786    0.9046
Specificity             0.9941    0.9845    0.9708   0.98902    0.9752
Pos Pred Value          0.9630    0.9121    0.9408   0.91015    0.9345
Neg Pred Value          0.9910    0.9940    0.9695   0.99385    0.9632
Precision               0.9630    0.9121    0.9408   0.91015    0.9345
Recall                  0.9444    0.9646    0.9383   0.94786    0.9046
F1                      0.9536    0.9376    0.9396   0.92862    0.9193
Prevalence              0.1398    0.1431    0.3311   0.10500    0.2810
Detection Rate          0.1320    0.1380    0.3107   0.09953    0.2542
Detection Prevalence    0.1371    0.1513    0.3303   0.10935    0.2720
Balanced Accuracy       0.9692    0.9745    0.9546   0.96844    0.9399</code></pre>
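<p>To see where these numbers come from, the main statistics can be recomputed by hand. The sketch below re-enters the confusion table printed above and reads its rows as actual classes and its columns as predicted classes; the object names are chosen for this illustration only.</p>
<pre class="r"><code># Confusion table from above: rows = actual class, columns = predicted class
cls &lt;- c(&quot;br&quot;, &quot;fr&quot;, &quot;gb&quot;, &quot;jp&quot;, &quot;us&quot;)
tab &lt;- matrix(
  c(2580,    6,   50,    6,   37,
      39, 2697,  124,   11,   86,
      24,   52, 6072,   24,  282,
      33,    9,   31, 1945,  119,
      56,   32,  194,   66, 4967),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = cls, predicted = cls)
)

# Share of all articles on the diagonal
accuracy &lt;- sum(diag(tab)) / sum(tab)

# Recall: correctly predicted articles divided by all articles of that actual class
recall &lt;- diag(tab) / rowSums(tab)

# Precision: correctly predicted articles divided by all articles predicted as that class
precision &lt;- diag(tab) / colSums(tab)

# F1: harmonic mean of precision and recall
f1 &lt;- 2 * precision * recall / (precision + recall)

round(accuracy, 4)
round(rbind(precision, recall, f1), 4)</code></pre>
<p>The F1 score is symmetric in precision and recall, so it does not depend on which margin of the table is read as the reference.</p>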
<p>To display our confusion matrix visually, we can produce either a heatmap or a bubble plot.</p>
<p>We first plot a heatmap using the code below.</p>
<details>
<p><summary>Code: Heatmap using <code>ggplot2</code></summary></p>
<pre class="r"><code># Save confusion matrix as data frame
confusion.data &lt;- as.data.frame(confusion[[&quot;table&quot;]])

# Reverse the order of the levels for the y-axis
level_order_y &lt;-
  factor(confusion.data$actual_class,
         levels = c(&#39;us&#39;, &#39;jp&#39;, &#39;gb&#39;, &#39;fr&#39;, &#39;br&#39;))

ggplot(confusion.data,
       aes(x = predicted_class, y = level_order_y, fill = Freq)) +
  xlab(&quot;Predicted class&quot;) +
  ylab(&quot;Actual class&quot;) +
  geom_tile() + theme_bw() + coord_equal() +
  scale_fill_distiller(palette = &quot;Blues&quot;, direction = 1) +
  scale_x_discrete(labels = c(&quot;Brazil&quot;, &quot;France&quot;, &quot;Great \n Britain&quot;, &quot;Japan&quot;, &quot;USA&quot;)) +
  scale_y_discrete(labels = c(&quot;USA&quot;, &quot;Japan&quot;, &quot;Great \n Britain&quot;, &quot;France&quot;, &quot;Brazil&quot;))</code></pre>
</details>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-37"></span>
<img src="/../../../../article/advancing-text-mining/figures/confusion.png" alt="Contingency table" width="80%" />
<p class="caption">
Figure 4: Contingency table
</p>
</div>
<p><br/></p>
<p>To plot the following confusion matrix, we need to slightly adjust the code from <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/article/datavis/">this post on data visualization</a> by <a href="http://richardtraunmueller.com">Richard Traunmüller</a>.</p>
<details>
<p><summary>Code: Confusion matrix</summary></p>
<pre class="r"><code># Generate a data matrix from the confusion table
dat &lt;- data.matrix(confusion$table)

# Reverse the order of the rows
order.columns &lt;- c(5, 4, 3, 2, 1) 
dat &lt;- dat[order.columns,]

par(mgp = c(1.5, .3, 0))

# Plot
plot(
  0,
  0,
  # Type of plotting symbol
  pch = &quot;&quot;,
  # Range of x-axis
  xlim = c(0.5, 5.5),
  # Range of y-axis
  ylim = c(0.5, 6.5),
  # Suppresses both x and y axes
  axes = FALSE,
  # Label of x-axis
  xlab = &quot;Predicted class&quot;,
  # Label of y-axis
  ylab = &quot;Actual class&quot;
)

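# brewer.pal() expects a single number of colors, so we first create
# nine shades of blue and then map each frequency to one of them
blues &lt;- brewer.pal(9, &quot;Blues&quot;)

# Write a for-loop that adds the bubbles to the plot
for (i in 1:dim(dat)[1]) {
  # Darker shade = higher frequency
  shade &lt;- blues[pmax(1, ceiling(9 * dat[i,] / max(dat)))]
  symbols(
    c(1:dim(dat)[2]),
    rep(i, dim(dat)[2]),
    circles = sqrt(dat[i,] / 9000 / pi),
    add = TRUE,
    inches = FALSE,
    fg = shade,
    bg = shade
  )
}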

axis(
  1,
  col = &quot;white&quot;,
  col.axis = &quot;black&quot;,
  at = c(1:5),
  label = colnames(dat)
)
axis(
  2,
  at = c(1:5),
  label = rownames(dat),
  las = 1,
  col.axis = &quot;black&quot;,
  col = &quot;white&quot;
)

# Add numbers to plot
for (i in 1:5) {
  text(c(1:5), rep(i, 5), dat[i,], cex = 0.8)
}</code></pre>
</details>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-39"></span>
<img src="/../../../../article/advancing-text-mining/figures/cont-plot.png" alt="Contingency table" width="80%" />
<p class="caption">
Figure 5: Contingency table
</p>
</div>
<p><br/></p>
<p>Both figures show that most articles fall on the diagonal for all countries, which means that our predictions perform well. The diagonal cells for Great Britain and the USA stand out most, which also reflects that these are the two largest classes in our data. Darker colors show a higher frequency in both plots. The bubble plot additionally indicates a greater frequency with the size of the bubbles.</p>
</div>
</div>
<div id="unknown-categories" class="section level4">
<h4>Unknown categories <a name="unknowncategories"></a></h4>
<!-- ##### Unknown categories: Unsupervised machine learning <a name="unsupervised"></a> -->
<div id="unknown-categories-unsupervised-machine-learning---latent-semantic-analysis-lsa" class="section level5">
<h5>Unknown categories: Unsupervised machine learning - Latent semantic analysis (LSA) <a name="lsa"></a></h5>
<p>The next section addresses how to analyze texts with unknown categories. <strong>Latent Semantic Analysis (LSA)</strong> evaluates documents and seeks to find their underlying meaning or concepts. If each word had only one meaning, LSA would have an easy job. However, words are often ambiguous, have multiple meanings, or are synonyms of other words. One example from our corpus is “may”: it could be a verb, the name of a month, or a personal name. To overcome this problem, LSA essentially looks at how often words appear together in one document and then compares these co-occurrence patterns across all other documents. By grouping words with other words, we try to identify words that are semantically related and eventually also recover the intended meaning of ambiguous words.
More technically, LSA projects the feature distributions onto a low-dimensional space with n dimensions. This is achieved via <a href="https://blog.statsbot.co/singular-value-decomposition-tutorial-52c695315254">singular value decomposition (SVD)</a>, which can be applied to square as well as rectangular matrices.
LSA can (among other things) be used to compare the similarity of documents or of documents grouped by some variable.</p>
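<p>The role of the SVD can be illustrated with base R alone. The following sketch uses a small made-up document-feature matrix (the counts and feature names are assumptions for illustration, not data from the corpus), decomposes it with <code>svd()</code> and keeps only the first two dimensions.</p>
<pre class="r"><code># Toy document-feature matrix (hypothetical counts): 5 documents, 4 features
X &lt;- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 0, 1, 3,
    1, 1, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0(&quot;doc&quot;, 1:5), c(&quot;peace&quot;, &quot;war&quot;, &quot;trade&quot;, &quot;market&quot;))
)

# Singular value decomposition: X = U diag(d) V&#39;
s &lt;- svd(X)

# Keep only the k largest singular values
k &lt;- 2
X_k &lt;- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
dimnames(X_k) &lt;- dimnames(X)

# Coordinates of the documents in the k-dimensional space
doc_coords &lt;- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(doc_coords) &lt;- rownames(X)

round(X_k, 2)
round(doc_coords, 2)</code></pre>
<p>The matrix <code>X_k</code> is the best rank-2 approximation of <code>X</code> in the least-squares sense. Documents that use similar words end up close to each other in <code>doc_coords</code>, even if they do not share every single term.</p>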
<p><a href="https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/">What are the major assumptions and simplifications that LSA has?</a></p>
<ol style="list-style-type: decimal">
<li><strong>Documents are non-positional</strong> (“bag of words”). The “bag of words” approach assumes that the order of the words does not matter. What matters is only the frequency of the single words.</li>
<li><strong>Concepts are</strong> understood as <strong>patterns of words</strong> where certain words often go together in similar documents.</li>
<li><strong>Words only have one meaning</strong> given the contexts surrounding the patterns of words.</li>
</ol>
<p>For the next example, we go back to the UN General Assembly speech data set.</p>
<pre class="r"><code>corpus.un.sample &lt;- corpus_sample(mycorpus, size = 500)</code></pre>
<p>We again follow the cleaning steps described above.</p>
<br/>
<details>
<p><summary>Code: Data pre-processing steps</summary></p>
<pre class="r"><code># Create tokens
token_sample &lt;-
  tokens(
    corpus.un.sample,
    split_hyphens = TRUE,
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_url = TRUE,
    include_docvars = TRUE
  )

# Clean tokens created by OCR
token_ungd_sample &lt;- tokens_select(
  token_sample,
  c(&quot;[\\d-]&quot;, &quot;[[:punct:]]&quot;, &quot;^.{1,2}$&quot;),
  selection = &quot;remove&quot;,
  valuetype = &quot;regex&quot;,
  verbose = TRUE
)

dfmat &lt;- dfm(token_ungd_sample,
             tolower = TRUE,
             stem = TRUE,
             remove = stopwords(&quot;english&quot;)
             )</code></pre>
</details>
<p><br/></p>
<p>When an earlier version of this post was written, the function <code>textmodel_lsa</code> was still part of <code>quanteda</code>. It is now part of the package <a href="https://cran.r-project.org/web/packages/quanteda.textmodels/quanteda.textmodels.pdf"><code>quanteda.textmodels</code></a>. After loading the package, we can run the <code>textmodel_lsa</code> command.</p>
<pre class="r"><code># Load the package
library(quanteda.textmodels)

# Run the textmodel_lsa command
mylsa &lt;- quanteda.textmodels::textmodel_lsa(dfmat, nd = 10)</code></pre>
<p>One interesting question would be: How similar are the USA and Russia? Each dot represents a country-year observation. The USA are colored blue, Russia is colored red, and all other countries are grey.</p>
<pre class="r"><code># We need the &quot;stringr&quot; package for the following command
sources &lt;-
  str_remove_all(rownames(mylsa$docs), &quot;[0-9///&#39;._txt]&quot;) 

sources.color &lt;- rep(&quot;gray&quot;, times = length(sources))
sources.color[sources %in% &quot;USA&quot;] &lt;- &quot;blue&quot;
sources.color[sources %in% &quot;RUS&quot;] &lt;- &quot;red&quot;

plot(
  mylsa$docs[, 4:5],
  col = alpha(sources.color, 0.3),
  pch = 19,
  xlab = &quot;Dimension 4&quot;,
  ylab = &quot;Dimension 5&quot;,
  main = &quot;LSA dimensions by subcorpus&quot;
)</code></pre>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-44"></span>
<img src="/../../../../article/advancing-text-mining/figures/lsa.png" alt="Distribution of PA topics in the UN General Debate corpus" width="80%" />
<p class="caption">
Figure 6: Distribution of PA topics in the UN General Debate corpus
</p>
</div>
<p><br/></p>
<p>On Dimension 5 we do not really observe a difference between documents from the US and Russia while we do see a topical divide on Dimension 4.
<!-- Documents from Russia clearly cluster on Dimension 2. --></p>
<pre class="r"><code># create an LSA space; return its truncated representation in the low-rank space
tmod &lt;- quanteda.textmodels::textmodel_lsa(dfmat[1:10, ])</code></pre>
<pre class="r"><code># matrix in low_rank LSA space
tmod$matrix_low_rank[, 1:5]</code></pre>
<pre><code>                        forti         third session general assembl
GNQ_43_1988.txt  4.000000e+00  6.000000e+00      10      12       9
UZB_64_2009.txt -6.306067e-14 -1.748601e-15       2       1       1
PRT_50_1995.txt  2.000000e+00 -2.962543e-13       5       8       4
VCT_70_2015.txt -4.979003e-14 -1.879096e-13       3       1       6
GUY_67_2012.txt -7.959952e-14 -1.302430e-14       3       6       2
MDG_34_1979.txt -1.627830e-13  7.000000e+00      14      13       6
CHL_60_2005.txt -1.064461e-13  2.000000e+00       5       3       4
GTM_68_2013.txt -1.319652e-13  5.002986e-14       6       7      11
GIN_64_2009.txt -1.256287e-13  1.000000e+00       5       5       2
LBN_59_2004.txt -9.672818e-14  1.000000e+00       2       4       5</code></pre>
<p>We now fold the queries into the space generated by <code>dfmat[1:10,]</code> and return their truncated representation in the new low-rank space. For more information on this, see <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9?casa_token=04042MH098kAAAAA:fScvmoc2WWrFxM4w2XTOkg1hAmBfaNuulZ3WKEnwjCpH727SAdVmphzv29VIvcAtcKkutMcKKVhaiZQ_">Deerwester et al. (1990)</a> and <a href="http://www.cse.msu.edu/~cse960/Papers/LSI/LSI.pdf">Rosario (2000)</a>.</p>
<pre class="r"><code>pred &lt;- predict(tmod, newdata = dfmat[1:10, ])
pred$docs_newspace</code></pre>
<pre><code>10 x 10 Matrix of class &quot;dgeMatrix&quot;
                      [,1]        [,2]        [,3]        [,4]         [,5]        [,6]
GNQ_43_1988.txt -0.3587033  0.04626331 -0.40268736  0.75368464 -0.198343099  0.15890426
UZB_64_2009.txt -0.1304653  0.04983523 -0.06010142 -0.07259054  0.057245087 -0.03112114
PRT_50_1995.txt -0.5120486  0.45389699  0.69697128  0.03247785 -0.071841805  0.18464906
VCT_70_2015.txt -0.2241820  0.09965473 -0.36666157 -0.45680821 -0.462552853  0.28818827
GUY_67_2012.txt -0.2905706  0.05927260 -0.29307648 -0.35382563  0.366086659  0.47269539
MDG_34_1979.txt -0.4919563 -0.83849245  0.20130813 -0.06146108 -0.038858129 -0.05927530
CHL_60_2005.txt -0.2663441  0.20190586 -0.13824724 -0.23099427 -0.006964881 -0.67210030
GTM_68_2013.txt -0.2074862  0.05127196 -0.13858784 -0.11930716  0.021618854 -0.33022555
GIN_64_2009.txt -0.2738299  0.12184387 -0.19025526  0.13644592  0.659835350 -0.13044787
LBN_59_2004.txt -0.1625876  0.12082658 -0.11439537  0.04220718 -0.408458883 -0.22780319
                       [,7]        [,8]         [,9]        [,10]
GNQ_43_1988.txt  0.13239767  0.10535490  0.203828010 -0.065029344
UZB_64_2009.txt -0.21573385 -0.08282906 -0.130474714 -0.947070625
PRT_50_1995.txt  0.07183401  0.00858728 -0.005558773  0.020944555
VCT_70_2015.txt  0.35837271 -0.38370263 -0.152115117  0.029859727
GUY_67_2012.txt -0.34813601  0.38234056  0.254633069  0.106243837
MDG_34_1979.txt -0.01837620 -0.06738449  0.012211891  0.023580112
CHL_60_2005.txt  0.04450401 -0.07554420  0.596888043  0.009695998
GTM_68_2013.txt  0.35028911  0.66063039 -0.502174774 -0.007009133
GIN_64_2009.txt  0.03757466 -0.49043679 -0.361551806  0.174062011
LBN_59_2004.txt -0.74478679 -0.03668237 -0.337785293  0.234975826</code></pre>
</div>
<div id="unknown-categories-unsupervised-machine-learning---latent-dirichlet-allocation-lda" class="section level5">
<h5>Unknown categories: Unsupervised machine learning - Latent Dirichlet Allocation (LDA) <a name="lda"></a></h5>
<p>Both Latent Dirichlet Allocation (LDA) and Structural Topic Modeling (STM) are forms of <strong>topic modelling</strong>. Topic models find patterns of words that appear together and group them into topics. The researcher decides on the number of topics and the algorithms then discover the main topics of the texts without prior information, training sets or human annotations.</p>
<p><a href="http://pablobarbera.com/ECPR-SC105/slides/07-slides-text-unsupervised.pdf">LDA is a Bayesian mixture model for discrete data where topics are assumed to be uncorrelated.</a> It is a generative model that describes how the documents in a dataset were created. We choose a number of topics (K), where each topic is a distribution over a fixed vocabulary. Each document is treated as a mixture of these K topics, and each word in it is drawn from one of the topics. LDA also follows the “bag of words” approach, which ignores the order of the words in a document.</p>
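<p>The generative story behind LDA can be made concrete with a small simulation in base R. This is only a sketch under assumed values: the vocabulary, the number of topics, the Dirichlet parameters and the helper <code>rdirichlet1()</code> are all hypothetical and chosen for illustration.</p>
<pre class="r"><code>set.seed(1)

# Hypothetical setup: 3 topics, a vocabulary of 6 stemmed words, 20 words per document
K &lt;- 3
vocab &lt;- c(&quot;nuclear&quot;, &quot;weapon&quot;, &quot;trade&quot;, &quot;economi&quot;, &quot;climat&quot;, &quot;sustain&quot;)
n_words &lt;- 20

# Helper (not part of any package): one draw from a Dirichlet distribution,
# obtained by normalizing independent gamma draws
rdirichlet1 &lt;- function(alpha) {
  g &lt;- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# Each topic is a distribution over the vocabulary (K x V matrix, rows sum to 1)
topic_word &lt;- t(sapply(1:K, function(k) rdirichlet1(rep(0.5, length(vocab)))))
colnames(topic_word) &lt;- vocab

# Each document is a mixture of the K topics
doc_topic &lt;- rdirichlet1(rep(0.5, K))

# For every word position: first draw a topic, then draw a word from that topic
z &lt;- sample(1:K, n_words, replace = TRUE, prob = doc_topic)
w &lt;- sapply(z, function(k) sample(vocab, 1, prob = topic_word[k, ]))

table(w)</code></pre>
<p>Estimating an LDA model reverses this process: given only the observed words, it infers plausible values for the topic-word and document-topic distributions.</p>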
<p>To calculate the LDA models, we need to load the package <code>topicmodels</code>. If you use this package for the first time on your machine, you need to execute a specific sequence of commands, detailed in the code below.</p>
<br/>
<details>
<p><summary>Code: Installing the package <code>topicmodels</code></summary></p>
<pre class="r"><code># 1) Install GSL
# We first need to make sure that GSL (e.g. &#39;brew install gsl&#39; in the terminal) is installed on our machine

# 2) Install gsl
# Then we proceed and choose either of the following commands:
# A)
install.packages(&quot;gsl&quot;)
# or B)
install.packages(
  &quot;https://cran.rstudio.com/src/contrib/gsl_2.1-6.tar.gz&quot;,
  repos = NULL,
  method = &quot;libcurl&quot;
)

# 3) Load the gsl package
library(gsl)

# 4) Install topicmodels
# Here we choose again either of the following commands:
# A)
install.packages(&quot;topicmodels&quot;)
# or B)
install.packages(
  &quot;https://cran.r-project.org/src/contrib/topicmodels_0.2-8.tar.gz&quot;,
  repos = NULL,
  method = &quot;libcurl&quot;
)

# 5) Load the topicmodels package
library(topicmodels)

# Now you are all set for the following models.</code></pre>
</details>
<p><br/></p>
<p>For the LDA, we again first trim our DFM. As above, the command <code>dfm_trim</code> trims the DFM. Here it removes features that appear in fewer than 7.5% or in more than 90% of the documents.</p>
<pre class="r"><code>mydfm.un.trim &lt;-
  dfm_trim(
    mydfm,
    min_docfreq = 0.075,
    # min 7.5%
    max_docfreq = 0.90,
    # max 90%
    docfreq_type = &quot;prop&quot;
  ) </code></pre>
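<p>What trimming by proportional document frequency does can be sketched in base R on a toy matrix. The counts, feature names and the thresholds of 30% and 90% are assumptions chosen so that the effect is visible with only four documents.</p>
<pre class="r"><code># Toy document-feature matrix (hypothetical counts): 4 documents, 4 features
m &lt;- matrix(
  c(1, 0, 2, 1,
    1, 0, 0, 1,
    3, 0, 1, 1,
    1, 1, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0(&quot;doc&quot;, 1:4), c(&quot;unit&quot;, &quot;rare&quot;, &quot;trade&quot;, &quot;nation&quot;))
)

# Share of documents in which each feature occurs at least once
doc_prop &lt;- colMeans(m &gt; 0)

# Keep features that occur in at least 30% and at most 90% of the documents
keep &lt;- doc_prop &gt;= 0.30 &amp; doc_prop &lt;= 0.90
m_trim &lt;- m[, keep, drop = FALSE]

doc_prop
colnames(m_trim)</code></pre>
<p>Features that occur in every document carry little information for distinguishing topics, while very rare features mostly add noise and computation time.</p>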
<p>We then assign an arbitrary topic number and convert the trimmed DFM to a topicmodels object.</p>
<pre class="r"><code># Assign an arbitrary number of topics
topic.count &lt;- 15

# Convert the trimmed DFM to a topicmodels object
dfm2topicmodels &lt;- convert(mydfm.un.trim, to = &quot;topicmodels&quot;)</code></pre>
<p>Finally, we can calculate the LDA model with the <code>LDA()</code> command from the <code>topicmodels</code> package.</p>
<pre class="r"><code>lda.model &lt;- LDA(dfm2topicmodels, topic.count)</code></pre>
<br/>
<details>
<p><summary>Output for <code>lda.model</code></summary></p>
<pre class="r"><code>lda.model</code></pre>
<pre><code>A LDA_VEM topic model with 15 topics.</code></pre>
<pre class="r"><code>as.data.frame(terms(lda.model, 6))</code></pre>
<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
topic1
</th>
<th style="text-align:left;">
topic2
</th>
<th style="text-align:left;">
topic3
</th>
<th style="text-align:left;">
topic4
</th>
<th style="text-align:left;">
topic5
</th>
<th style="text-align:left;">
topic6
</th>
<th style="text-align:left;">
topic7
</th>
<th style="text-align:left;">
topic8
</th>
<th style="text-align:left;">
topic9
</th>
<th style="text-align:left;">
topic10
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
nuclear
</td>
<td style="text-align:left;">
cooper
</td>
<td style="text-align:left;">
problem
</td>
<td style="text-align:left;">
trade
</td>
<td style="text-align:left;">
america
</td>
<td style="text-align:left;">
arab
</td>
<td style="text-align:left;">
arab
</td>
<td style="text-align:left;">
global
</td>
<td style="text-align:left;">
oper
</td>
<td style="text-align:left;">
war
</td>
</tr>
<tr>
<td style="text-align:left;">
weapon
</td>
<td style="text-align:left;">
council
</td>
<td style="text-align:left;">
interest
</td>
<td style="text-align:left;">
problem
</td>
<td style="text-align:left;">
american
</td>
<td style="text-align:left;">
small
</td>
<td style="text-align:left;">
palestinian
</td>
<td style="text-align:left;">
cooper
</td>
<td style="text-align:left;">
negoti
</td>
<td style="text-align:left;">
today
</td>
</tr>
<tr>
<td style="text-align:left;">
republ
</td>
<td style="text-align:left;">
reform
</td>
<td style="text-align:left;">
hope
</td>
<td style="text-align:left;">
product
</td>
<td style="text-align:left;">
latin
</td>
<td style="text-align:left;">
nuclear
</td>
<td style="text-align:left;">
israel
</td>
<td style="text-align:left;">
social
</td>
<td style="text-align:left;">
conflict
</td>
<td style="text-align:left;">
now
</td>
</tr>
<tr>
<td style="text-align:left;">
soviet
</td>
<td style="text-align:left;">
global
</td>
<td style="text-align:left;">
possibl
</td>
<td style="text-align:left;">
per
</td>
<td style="text-align:left;">
respect
</td>
<td style="text-align:left;">
issu
</td>
<td style="text-align:left;">
territori
</td>
<td style="text-align:left;">
order
</td>
<td style="text-align:left;">
problem
</td>
<td style="text-align:left;">
power
</td>
</tr>
<tr>
<td style="text-align:left;">
relat
</td>
<td style="text-align:left;">
effect
</td>
<td style="text-align:left;">
now
</td>
<td style="text-align:left;">
economi
</td>
<td style="text-align:left;">
law
</td>
<td style="text-align:left;">
pacif
</td>
<td style="text-align:left;">
isra
</td>
<td style="text-align:left;">
futur
</td>
<td style="text-align:left;">
confer
</td>
<td style="text-align:left;">
live
</td>
</tr>
<tr>
<td style="text-align:left;">
disarma
</td>
<td style="text-align:left;">
process
</td>
<td style="text-align:left;">
even
</td>
<td style="text-align:left;">
increas
</td>
<td style="text-align:left;">
principl
</td>
<td style="text-align:left;">
weapon
</td>
<td style="text-align:left;">
resolut
</td>
<td style="text-align:left;">
common
</td>
<td style="text-align:left;">
south
</td>
<td style="text-align:left;">
mani
</td>
</tr>
</tbody>
</table>
<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
topic11
</th>
<th style="text-align:left;">
topic12
</th>
<th style="text-align:left;">
topic13
</th>
<th style="text-align:left;">
topic14
</th>
<th style="text-align:left;">
topic15
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
global
</td>
<td style="text-align:left;">
terror
</td>
<td style="text-align:left;">
african
</td>
<td style="text-align:left;">
deleg
</td>
<td style="text-align:left;">
africa
</td>
</tr>
<tr>
<td style="text-align:left;">
sustain
</td>
<td style="text-align:left;">
council
</td>
<td style="text-align:left;">
africa
</td>
<td style="text-align:left;">
session
</td>
<td style="text-align:left;">
south
</td>
</tr>
<tr>
<td style="text-align:left;">
climat
</td>
<td style="text-align:left;">
terrorist
</td>
<td style="text-align:left;">
republ
</td>
<td style="text-align:left;">
problem
</td>
<td style="text-align:left;">
independ
</td>
</tr>
<tr>
<td style="text-align:left;">
chang
</td>
<td style="text-align:left;">
iraq
</td>
<td style="text-align:left;">
conflict
</td>
<td style="text-align:left;">
concern
</td>
<td style="text-align:left;">
african
</td>
</tr>
<tr>
<td style="text-align:left;">
challeng
</td>
<td style="text-align:left;">
resolut
</td>
<td style="text-align:left;">
democrat
</td>
<td style="text-align:left;">
great
</td>
<td style="text-align:left;">
namibia
</td>
</tr>
<tr>
<td style="text-align:left;">
goal
</td>
<td style="text-align:left;">
law
</td>
<td style="text-align:left;">
elect
</td>
<td style="text-align:left;">
republ
</td>
<td style="text-align:left;">
struggl
</td>
</tr>
</tbody>
</table>
</details>
<p><br/></p>
<p>How similar are the fifteen topics? This question is particularly interesting because it allows us to (possibly) cluster homogeneous topics. To get a better idea of our LDA model and of the similarity among the different topics, we can plot our results using the following chunk of code. <code>dist</code> and <code>hclust</code> are standard R commands: <code>dist</code> computes the distances between the topics and <code>hclust</code> clusters the topics hierarchically based on these distances.</p>
<pre class="r"><code>lda.similarity &lt;- as.data.frame(lda.model@beta) %&gt;%
  scale() %&gt;%
  dist(method = &quot;euclidean&quot;) %&gt;%
  hclust(method = &quot;ward.D2&quot;)

par(mar = c(0, 4, 4, 2))
plot(lda.similarity,
     main = &quot;LDA topic similarity by features&quot;,
     xlab = &quot;&quot;,
     sub = &quot;&quot;)</code></pre>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-57"></span>
<img src="/../../../../article/advancing-text-mining/figures/lda2.png" alt="LDA topic similarity by features"  />
<p class="caption">
Figure 7: LDA topic similarity by features
</p>
</div>
<p><br/></p>
<p>The plot is called a dendrogram and visualizes a hierarchical clustering. The x-axis gives you the topics and the clusters of these topics. Put differently, it gives you information on the similarity of the topics. On the y-axis, we see the dissimilarity (or distance) between our fifteen topics.</p>
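<p>How <code>dist</code> and <code>hclust</code> work together is easiest to see on a tiny example. The matrix below is made up for illustration: it describes four hypothetical topics by four feature weights, with two pairs of similar topics.</p>
<pre class="r"><code># Toy topic-by-feature matrix (hypothetical weights)
topics &lt;- rbind(
  t1 = c(5, 4, 0, 0),
  t2 = c(4, 5, 0, 1),
  t3 = c(0, 0, 5, 4),
  t4 = c(0, 1, 4, 5)
)

# Pairwise Euclidean distances between the topics
d &lt;- dist(topics, method = &quot;euclidean&quot;)

# Hierarchical clustering with Ward&#39;s method
hc &lt;- hclust(d, method = &quot;ward.D2&quot;)

# Cut the tree into two clusters
groups &lt;- cutree(hc, k = 2)
groups

# plot(hc) would draw the corresponding dendrogram</code></pre>
<p>Topics that are merged at a low height in the dendrogram are close to each other, while branches that join only near the top are far apart.</p>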
</div>
<div id="unknown-categories-unsupervised-machine-learning---structural-topic-models-stm" class="section level5">
<h5>Unknown categories: Unsupervised machine learning - Structural topic models (STM) <a name="stm"></a></h5>
<p>The <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/ajps.12103">structural topic models (STM) are a popular extension of the standard LDA models</a>. The STM allows us to include metadata (information about each document) in the topic model and offers an alternative initialization mechanism (“Spectral”). In STMs, the covariates can be used in the priors. The <a href="https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf">stm vignette</a> provides a good overview of how to use an STM. The <a href="https://www.structuraltopicmodel.com">package includes estimation algorithms and tools for every stage of the workflow</a>. A particularly strong emphasis lies on the diagnostic functions that are integrated into the R package.</p>
<p>The <a href="https://github.com/cschwem2er/stminsights">package <code>stminsights</code></a> is very useful for visual exploration. It supports <strong>interactive validation, interpretation and visualization</strong> of one or several structural topic models.</p>
<p>We again trim our dfm with the command <code>dfm_trim</code>.</p>
<pre class="r"><code>mydfm.un.trim &lt;-
  dfm_trim(
    mydfm,
    min_docfreq = 0.075,
    # min 7.5%
    max_docfreq = 0.90,
    # max 90%
    docfreq_type = &quot;prop&quot;
  ) </code></pre>
<p>We then assign the number of topics arbitrarily.</p>
<pre class="r"><code>topic.count &lt;- 25 # Assigns the number of topics</code></pre>
<p>Finally, we convert the DFM (with <code>convert()</code>) and estimate the STM (with <code>stm()</code>).</p>
<pre class="r"><code># Calculate the STM 
dfm2stm &lt;- convert(mydfm.un.trim, to = &quot;stm&quot;)

model.stm &lt;- stm(
  dfm2stm$documents,
  dfm2stm$vocab,
  K = topic.count,
  data = dfm2stm$meta,
  init.type = &quot;Spectral&quot;
)</code></pre>
<p>To get a first insight, we print the ten most probable terms for each topic.</p>
<pre class="r"><code>as.data.frame(t(labelTopics(model.stm, n = 10)$prob))</code></pre>
<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
V1
</th>
<th style="text-align:left;">
V2
</th>
<th style="text-align:left;">
V3
</th>
<th style="text-align:left;">
V4
</th>
<th style="text-align:left;">
V5
</th>
<th style="text-align:left;">
V6
</th>
<th style="text-align:left;">
V7
</th>
<th style="text-align:left;">
V8
</th>
<th style="text-align:left;">
V9
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
council
</td>
<td style="text-align:left;">
lebanon
</td>
<td style="text-align:left;">
soviet
</td>
<td style="text-align:left;">
global
</td>
<td style="text-align:left;">
african
</td>
<td style="text-align:left;">
nuclear
</td>
<td style="text-align:left;">
european
</td>
<td style="text-align:left;">
trade
</td>
<td style="text-align:left;">
african
</td>
</tr>
<tr>
<td style="text-align:left;">
reform
</td>
<td style="text-align:left;">
arab
</td>
<td style="text-align:left;">
union
</td>
<td style="text-align:left;">
sustain
</td>
<td style="text-align:left;">
africa
</td>
<td style="text-align:left;">
weapon
</td>
<td style="text-align:left;">
europ
</td>
<td style="text-align:left;">
economi
</td>
<td style="text-align:left;">
situat
</td>
</tr>
<tr>
<td style="text-align:left;">
effect
</td>
<td style="text-align:left;">
resolut
</td>
<td style="text-align:left;">
relat
</td>
<td style="text-align:left;">
challeng
</td>
<td style="text-align:left;">
republ
</td>
<td style="text-align:left;">
treati
</td>
<td style="text-align:left;">
cooper
</td>
<td style="text-align:left;">
product
</td>
<td style="text-align:left;">
guinea
</td>
</tr>
<tr>
<td style="text-align:left;">
activ
</td>
<td style="text-align:left;">
problem
</td>
<td style="text-align:left;">
militari
</td>
<td style="text-align:left;">
chang
</td>
<td style="text-align:left;">
democrat
</td>
<td style="text-align:left;">
disarma
</td>
<td style="text-align:left;">
union
</td>
<td style="text-align:left;">
industri
</td>
<td style="text-align:left;">
africa
</td>
</tr>
<tr>
<td style="text-align:left;">
cooper
</td>
<td style="text-align:left;">
territori
</td>
<td style="text-align:left;">
forc
</td>
<td style="text-align:left;">
climat
</td>
<td style="text-align:left;">
millennium
</td>
<td style="text-align:left;">
arm
</td>
<td style="text-align:left;">
conflict
</td>
<td style="text-align:left;">
resourc
</td>
<td style="text-align:left;">
particular
</td>
</tr>
<tr>
<td style="text-align:left;">
strengthen
</td>
<td style="text-align:left;">
palestinian
</td>
<td style="text-align:left;">
republ
</td>
<td style="text-align:left;">
goal
</td>
<td style="text-align:left;">
commit
</td>
<td style="text-align:left;">
test
</td>
<td style="text-align:left;">
process
</td>
<td style="text-align:left;">
increas
</td>
<td style="text-align:left;">
hope
</td>
</tr>
<tr>
<td style="text-align:left;">
convent
</td>
<td style="text-align:left;">
withdraw
</td>
<td style="text-align:left;">
arm
</td>
<td style="text-align:left;">
commit
</td>
<td style="text-align:left;">
poverti
</td>
<td style="text-align:left;">
non
</td>
<td style="text-align:left;">
stabil
</td>
<td style="text-align:left;">
market
</td>
<td style="text-align:left;">
concern
</td>
</tr>
<tr>
<td style="text-align:left;">
role
</td>
<td style="text-align:left;">
principl
</td>
<td style="text-align:left;">
war
</td>
<td style="text-align:left;">
agenda
</td>
<td style="text-align:left;">
particular
</td>
<td style="text-align:left;">
prolifer
</td>
<td style="text-align:left;">
contribut
</td>
<td style="text-align:left;">
price
</td>
<td style="text-align:left;">
solut
</td>
</tr>
<tr>
<td style="text-align:left;">
confer
</td>
<td style="text-align:left;">
solut
</td>
<td style="text-align:left;">
socialist
</td>
<td style="text-align:left;">
respons
</td>
<td style="text-align:left;">
like
</td>
<td style="text-align:left;">
pakistan
</td>
<td style="text-align:left;">
bosnia
</td>
<td style="text-align:left;">
global
</td>
<td style="text-align:left;">
session
</td>
</tr>
<tr>
<td style="text-align:left;">
contribut
</td>
<td style="text-align:left;">
war
</td>
<td style="text-align:left;">
polici
</td>
<td style="text-align:left;">
address
</td>
<td style="text-align:left;">
summit
</td>
<td style="text-align:left;">
india
</td>
<td style="text-align:left;">
respect
</td>
<td style="text-align:left;">
financi
</td>
<td style="text-align:left;">
problem
</td>
</tr>
</tbody>
</table>
<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
V10
</th>
<th style="text-align:left;">
V11
</th>
<th style="text-align:left;">
V12
</th>
<th style="text-align:left;">
V13
</th>
<th style="text-align:left;">
V14
</th>
<th style="text-align:left;">
V15
</th>
<th style="text-align:left;">
V16
</th>
<th style="text-align:left;">
V17
</th>
<th style="text-align:left;">
V18
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
council
</td>
<td style="text-align:left;">
problem
</td>
<td style="text-align:left;">
cooper
</td>
<td style="text-align:left;">
republ
</td>
<td style="text-align:left;">
power
</td>
<td style="text-align:left;">
africa
</td>
<td style="text-align:left;">
order
</td>
<td style="text-align:left;">
problem
</td>
<td style="text-align:left;">
israel
</td>
</tr>
<tr>
<td style="text-align:left;">
iraq
</td>
<td style="text-align:left;">
now
</td>
<td style="text-align:left;">
stabil
</td>
<td style="text-align:left;">
korea
</td>
<td style="text-align:left;">
great
</td>
<td style="text-align:left;">
south
</td>
<td style="text-align:left;">
principl
</td>
<td style="text-align:left;">
session
</td>
<td style="text-align:left;">
arab
</td>
</tr>
<tr>
<td style="text-align:left;">
resolut
</td>
<td style="text-align:left;">
oper
</td>
<td style="text-align:left;">
dialogu
</td>
<td style="text-align:left;">
democrat
</td>
<td style="text-align:left;">
viet
</td>
<td style="text-align:left;">
debt
</td>
<td style="text-align:left;">
social
</td>
<td style="text-align:left;">
oper
</td>
<td style="text-align:left;">
palestinian
</td>
</tr>
<tr>
<td style="text-align:left;">
law
</td>
<td style="text-align:left;">
negoti
</td>
<td style="text-align:left;">
terror
</td>
<td style="text-align:left;">
china
</td>
<td style="text-align:left;">
deleg
</td>
<td style="text-align:left;">
deleg
</td>
<td style="text-align:left;">
univers
</td>
<td style="text-align:left;">
confer
</td>
<td style="text-align:left;">
isra
</td>
</tr>
<tr>
<td style="text-align:left;">
aggress
</td>
<td style="text-align:left;">
hope
</td>
<td style="text-align:left;">
sudan
</td>
<td style="text-align:left;">
south
</td>
<td style="text-align:left;">
coloni
</td>
<td style="text-align:left;">
hope
</td>
<td style="text-align:left;">
cultur
</td>
<td style="text-align:left;">
solut
</td>
<td style="text-align:left;">
palestin
</td>
</tr>
<tr>
<td style="text-align:left;">
charter
</td>
<td style="text-align:left;">
agreement
</td>
<td style="text-align:left;">
call
</td>
<td style="text-align:left;">
korean
</td>
<td style="text-align:left;">
territori
</td>
<td style="text-align:left;">
process
</td>
<td style="text-align:left;">
today
</td>
<td style="text-align:left;">
resolut
</td>
<td style="text-align:left;">
east
</td>
</tr>
<tr>
<td style="text-align:left;">
violat
</td>
<td style="text-align:left;">
mani
</td>
<td style="text-align:left;">
process
</td>
<td style="text-align:left;">
asia
</td>
<td style="text-align:left;">
nam
</td>
<td style="text-align:left;">
environ
</td>
<td style="text-align:left;">
life
</td>
<td style="text-align:left;">
cyprus
</td>
<td style="text-align:left;">
middl
</td>
</tr>
<tr>
<td style="text-align:left;">
iraqi
</td>
<td style="text-align:left;">
last
</td>
<td style="text-align:left;">
promot
</td>
<td style="text-align:left;">
north
</td>
<td style="text-align:left;">
republ
</td>
<td style="text-align:left;">
confer
</td>
<td style="text-align:left;">
respect
</td>
<td style="text-align:left;">
concern
</td>
<td style="text-align:left;">
resolut
</td>
</tr>
<tr>
<td style="text-align:left;">
islam
</td>
<td style="text-align:left;">
way
</td>
<td style="text-align:left;">
commit
</td>
<td style="text-align:left;">
east
</td>
<td style="text-align:left;">
sea
</td>
<td style="text-align:left;">
welcom
</td>
<td style="text-align:left;">
societi
</td>
<td style="text-align:left;">
negoti
</td>
<td style="text-align:left;">
occupi
</td>
</tr>
<tr>
<td style="text-align:left;">
iran
</td>
<td style="text-align:left;">
possibl
</td>
<td style="text-align:left;">
issu
</td>
<td style="text-align:left;">
cooper
</td>
<td style="text-align:left;">
charter
</td>
<td style="text-align:left;">
conflict
</td>
<td style="text-align:left;">
becom
</td>
<td style="text-align:left;">
disarma
</td>
<td style="text-align:left;">
territori
</td>
</tr>
</tbody>
</table>
<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
V19
</th>
<th style="text-align:left;">
V20
</th>
<th style="text-align:left;">
V21
</th>
<th style="text-align:left;">
V22
</th>
<th style="text-align:left;">
V23
</th>
<th style="text-align:left;">
V24
</th>
<th style="text-align:left;">
V25
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
island
</td>
<td style="text-align:left;">
today
</td>
<td style="text-align:left;">
per
</td>
<td style="text-align:left;">
terror
</td>
<td style="text-align:left;">
america
</td>
<td style="text-align:left;">
conflict
</td>
<td style="text-align:left;">
africa
</td>
</tr>
<tr>
<td style="text-align:left;">
small
</td>
<td style="text-align:left;">
war
</td>
<td style="text-align:left;">
cent
</td>
<td style="text-align:left;">
afghanistan
</td>
<td style="text-align:left;">
american
</td>
<td style="text-align:left;">
refuge
</td>
<td style="text-align:left;">
south
</td>
</tr>
<tr>
<td style="text-align:left;">
pacif
</td>
<td style="text-align:left;">
live
</td>
<td style="text-align:left;">
social
</td>
<td style="text-align:left;">
afghan
</td>
<td style="text-align:left;">
latin
</td>
<td style="text-align:left;">
war
</td>
<td style="text-align:left;">
independ
</td>
</tr>
<tr>
<td style="text-align:left;">
caribbean
</td>
<td style="text-align:left;">
want
</td>
<td style="text-align:left;">
educ
</td>
<td style="text-align:left;">
cooper
</td>
<td style="text-align:left;">
central
</td>
<td style="text-align:left;">
somalia
</td>
<td style="text-align:left;">
african
</td>
</tr>
<tr>
<td style="text-align:left;">
chang
</td>
<td style="text-align:left;">
terror
</td>
<td style="text-align:left;">
poverti
</td>
<td style="text-align:left;">
terrorist
</td>
<td style="text-align:left;">
respect
</td>
<td style="text-align:left;">
assist
</td>
<td style="text-align:left;">
struggl
</td>
</tr>
<tr>
<td style="text-align:left;">
ocean
</td>
<td style="text-align:left;">
mani
</td>
<td style="text-align:left;">
health
</td>
<td style="text-align:left;">
problem
</td>
<td style="text-align:left;">
social
</td>
<td style="text-align:left;">
humanitarian
</td>
<td style="text-align:left;">
regim
</td>
</tr>
<tr>
<td style="text-align:left;">
issu
</td>
<td style="text-align:left;">
let
</td>
<td style="text-align:left;">
drug
</td>
<td style="text-align:left;">
global
</td>
<td style="text-align:left;">
process
</td>
<td style="text-align:left;">
mani
</td>
<td style="text-align:left;">
namibia
</td>
</tr>
<tr>
<td style="text-align:left;">
sea
</td>
<td style="text-align:left;">
now
</td>
<td style="text-align:left;">
democraci
</td>
<td style="text-align:left;">
asia
</td>
<td style="text-align:left;">
solut
</td>
<td style="text-align:left;">
elect
</td>
<td style="text-align:left;">
apartheid
</td>
</tr>
<tr>
<td style="text-align:left;">
climat
</td>
<td style="text-align:left;">
everi
</td>
<td style="text-align:left;">
programm
</td>
<td style="text-align:left;">
central
</td>
<td style="text-align:left;">
possibl
</td>
<td style="text-align:left;">
situat
</td>
<td style="text-align:left;">
coloni
</td>
</tr>
<tr>
<td style="text-align:left;">
call
</td>
<td style="text-align:left;">
know
</td>
<td style="text-align:left;">
million
</td>
<td style="text-align:left;">
issu
</td>
<td style="text-align:left;">
express
</td>
<td style="text-align:left;">
forc
</td>
<td style="text-align:left;">
racist
</td>
</tr>
</tbody>
</table>
<p>The following plot gives an intuitive overview of the share of each topic in the overall corpus.</p>
<pre class="r"><code>plot(
  model.stm,
  type = &quot;summary&quot;,
  text.cex = 0.5,
  main = &quot;STM topic shares&quot;,
  xlab = &quot;Share estimation&quot;
)</code></pre>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-66"></span>
<img src="/../../../../article/advancing-text-mining/figures/stm.png" alt="STM topic shares"  />
<p class="caption">
Figure 8: STM topic shares
</p>
</div>
<p><br/></p>
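<p>To our understanding, the shares displayed in the summary plot correspond to the column means of the document-topic matrix that <code>stm</code> stores in <code>model.stm$theta</code>. The following minimal sketch illustrates this computation on a small made-up matrix; <code>theta_toy</code> and its values are hypothetical and only stand in for the output of a fitted model.</p>
<pre class="r"><code># Toy document-topic matrix: 4 documents (rows) x 3 topics (columns).
# Each row sums to 1, as in the theta matrix of a fitted topic model.
# All values are hypothetical.
theta_toy &lt;- matrix(
  c(0.7, 0.2, 0.1,
    0.5, 0.3, 0.2,
    0.6, 0.1, 0.3,
    0.2, 0.4, 0.4),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, paste0(&quot;Topic&quot;, 1:3))
)

# Expected topic shares: average each topic&#39;s proportion across documents
topic_shares &lt;- colMeans(theta_toy)

# Sort from largest to smallest share, as in the summary plot
sort(topic_shares, decreasing = TRUE)</code></pre>
<p>With a fitted model, the same idea would apply to <code>colMeans(model.stm$theta)</code>.</p>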
<p>Using the package <code>stm</code>, we can now visualize the different words of a topic with a wordcloud. Since topic 4 has the highest share, we use it for the next visualization. The location of the words is randomized and changes each time we plot the wordcloud while the size of the words is relative to their frequency and remains the same.</p>
<pre class="r"><code>stm::cloud(model.stm,
           topic = 4,
           scale = c(2.25, .5))</code></pre>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-68"></span>
<img src="/../../../../article/advancing-text-mining/figures/cloudstm.png" alt="Wordcloud with `stm`"  />
<p class="caption">
Figure 9: Wordcloud with <code>stm</code>
</p>
</div>
<p><br/></p>
<p>We can also put two topics into perspective visually, using the following lines of code:</p>
<pre class="r"><code>plot(model.stm,
     type = &quot;perspectives&quot;,
     topics = c(4, 5),
     main = &quot;Putting two different topics in perspective&quot;)</code></pre>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-70"></span>
<img src="/../../../../article/advancing-text-mining/figures/cloudstmperspective.png" alt="Wordcloud using `stm` -- Perspective plots"  />
<p class="caption">
Figure 10: Wordcloud using <code>stm</code> – Perspective plots
</p>
</div>
<p><br/></p>
<p>The <a href="https://www.rdocumentation.org/packages/stm/versions/1.3.3/topics/plot.STM">perspective plot</a> visualizes the combination of two topics (here topic 4 and topic 5). The size of the words is again relative to their frequency (within the combination of the two topics). The x-axis shows the degree to which specific words align with Topic 4 or Topic 5. <em>Global</em> is closely aligned with Topic 4, whereas <em>commit</em> lies closer to the center and thus features in both topics.
       </p>
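<p>The logic behind such a plot can be sketched with a few made-up numbers. This is a simplified illustration of the idea only, not the exact computation performed by <code>plot.STM</code>; the objects <code>beta_t4</code> and <code>beta_t5</code> and their values are hypothetical word probabilities for the two topics.</p>
<pre class="r"><code># Hypothetical word probabilities for two topics (not taken from the model)
words   &lt;- c(&quot;global&quot;, &quot;commit&quot;, &quot;poverti&quot;)
beta_t4 &lt;- c(0.08, 0.03, 0.01)
beta_t5 &lt;- c(0.01, 0.03, 0.06)

# Size: combined weight of a word across both topics
size &lt;- beta_t4 + beta_t5

# Position: negative values lean towards topic 4, positive values towards
# topic 5; values near zero indicate words that both topics share
position &lt;- (beta_t5 - beta_t4) / size

data.frame(words, size, position)</code></pre>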
</div>
</div>
</div>
<div id="further-readings" class="section level3">
<h3>Further readings <a name="furtherreadings"></a></h3>
<ul>
<li><a href="http://quanteda.io">quanteda: Quantitative Analysis of Textual Data</a></li>
<li><a href="https://CRAN.R-project.org/package=quanteda">Benoit, K., &amp; Nulty, P. (2016). quanteda: Quantitative Analysis of Textual Data.</a></li>
<li><a href="https://www.theoj.org/joss-papers/joss.00774/10.21105.joss.00774.pdf">Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., &amp; Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774.</a></li>
<li><a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9?casa_token=04042MH098kAAAAA:fScvmoc2WWrFxM4w2XTOkg1hAmBfaNuulZ3WKEnwjCpH727SAdVmphzv29VIvcAtcKkutMcKKVhaiZQ_">Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., &amp; Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.</a></li>
<li><a href="https://www.cambridge.org/core/journals/political-analysis/article/text-as-data-the-promise-and-pitfalls-of-automatic-content-analysis-methods-for-political-texts/F7AAC8B2909441603FEB25C156448F20">Grimmer, J., &amp; Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267-297.</a>
<!-- * [Jurka, T., Collingwood, L.,Boydstun, A. E., Grossman, E., & van Atteveldt, W. H. (2013). RTextTools: A supervised learning package for text classification.](http://rjournal.github.io/archive/2013-1/collingwood-jurka-boydstun-etal.pdf) --></li>
<li><a href="https://muellerstefan.net/files/quanteda-cheatsheet.pdf">Mueller, S. Quanteda Cheat Sheet.</a></li>
<li><a href="http://inhaltsanalyse-mit-r.de/">Puschmann, C. Inhaltsanalyse mit R.</a></li>
<li><a href="http://inhaltsanalyse-mit-r.de/0_einleitung.html">Puschmann, C. Automatisierte Inhaltsanalyse mit R.</a></li>
<li><a href="https://api.rpubs.com/cbpuschmann/AIR6">Puschmann, C. Automatisierte Inhaltsanalyse mit R. Überwachtes maschinelles Lernen.</a></li>
<li><a href="https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf">Roberts, M. E., Stewart, B. M., &amp; Tingley, D. (2014). stm: R package for structural topic models. R package, 1, 12.</a></li>
<li><a href="https://pdfs.semanticscholar.org/9598/1f057cb76a24329fcf2b572f75d8c2b1613e.pdf">Rosario, B. (2000). Latent semantic indexing: An overview. Technical Report. INFOSYS, 240, 1-16.</a></li>
<li><a href="https://www.tandfonline.com/doi/abs/10.1080/19312458.2017.1387238">Welbers, K., Van Atteveldt, W., &amp; Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245-265.</a></li>
</ul>
<p>       </p>
</div>
<div id="about-the-presenter" class="section level3">
<h3>About the presenter</h3>
<p><a href="http://cbpuschmann.net">Cornelius Puschmann</a> is professor of media and communication at <a href="https://uni-bremen.de/en/">ZeMKI, University of Bremen</a> and an affiliate researcher at the <a href="https://leibniz-hbi.de/en">Leibniz Institute for Media Research</a> in Hamburg. His research interests center on online hate speech, the role of algorithms for the selection of media content, and methodological aspects of computational social science.</p>
</div>
]]>
      </description>
    </item>
    
    <item>
      <title>Studying Politics on and with Wikipedia</title>
      <link>https://socialsciencedatalab.mzes.uni-mannheim.de/article/studying-politics-wikipedia/</link>
      <pubDate>Mon, 26 Aug 2019 01:00:00 +0100</pubDate>
      
      <guid>https://socialsciencedatalab.mzes.uni-mannheim.de/article/studying-politics-wikipedia/</guid>
      <description><![CDATA[
        </p>
<p>The online encyclopedia Wikipedia, together with its sibling, the collaboratively edited knowledge base Wikidata, provides incredibly rich yet largely untapped sources for political research. In this <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/categories/tutorials/">Methods Bites Tutorial</a>, <a href="https://twitter.com/denis_cohen">Denis Cohen</a> and <a href="https://twitter.com/Nick_Baumann97">Nick Baumann</a> offer a hands-on recap of <a href="https://twitter.com/simonsaysnothin">Simon Munzert</a>’s (Hertie School of Governance) workshop materials to show how these platforms can inform research on public attention dynamics, policies, political and other events, political elites, and parties, among other things.</p>
<p>After reading this blog post and engaging with the applied exercises, readers should:</p>
<ul>
<li>be able to collect Wikipedia data and Wikidata items using <strong>R</strong></li>
<li>be able to conduct explorative analyses of Wikipedia data using <strong>R</strong></li>
<li>have a basic intuition of the potentials and limitations of using Wikipedia data in research projects</li>
</ul>
<p><em>Note:</em> This blog post provides a summary of Simon’s workshop in the <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/page/events/index.html#munzert-201905">MZES Social Science Data Lab</a> with some adaptations. Simon’s original workshop materials, including slides and scripts, are available from our <a href="https://github.com/SocialScienceDataLab/political-wikipedia-workshop">GitHub</a>.</p>
<div id="contents" class="section level5">
<h5>Contents</h5>
<ol style="list-style-type: decimal">
<li><a href="#wikipedia-for-political-research">Wikipedia for Political Research</a></li>
<li><a href="#collecting-and-analyzing-wikipedia-data">Collecting and Analyzing Wikipedia Data</a>
<ol style="list-style-type: decimal">
<li><a href="#application-1-using-pageviews-to-measure-public-attention">Application 1: Using Pageviews to Measure Public Attention</a></li>
<li><a href="#application-2-using-article-links-to-create-a-network-graph-of-german-mps">Application 2: Using Article Links to Create a Network Graph of German MPs</a></li>
<li><a href="#application-3-using-clickstream-data-to-analyze-referral-patterns">Application 3: Using Clickstream Data to Analyze Referral Patterns</a></li>
</ol></li>
<li><a href="#collecting-data-via-wikidata-queries">Collecting Data via Wikidata Queries</a></li>
<li><a href="#legislator">legislatoR</a>
<ol style="list-style-type: decimal">
<li><a href="#application-1-social-media-adoption-rates">Application 1: Social Media Adoption Rates</a></li>
<li><a href="#application-2-public-attention-to-members-of-the-german-bundestag">Application 2: Public Attention to Members of the German Bundestag</a></li>
</ol></li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#about-the-presenter">About the Presenter</a></li>
<li><a href="#references">References</a></li>
</ol>
</div>
<div id="wikipedia-for-political-research" class="section level3">
<h3>Wikipedia for Political Research</h3>
<p>According to its website, <a href="https://en.wikipedia.org/wiki/Wikipedia:About"><em>“Wikipedia […] is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content”</em></a>. As of July 2019, it comprises <a href="https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia">more than 48 million articles</a> and is ranked <a href="https://www.alexa.com/topsites">sixth in the list of the most frequently visited websites</a>.</p>
<p>Wikipedia harbors numerous types of data. These include both article contents and meta information such as pageviews, clickstreams, links and backlinks, or edits and revision histories. Additionally, Wikipedia’s sibling, the collaboratively edited document-oriented database <a href="https://en.wikipedia.org/wiki/Wikidata">Wikidata</a>, provides access to over 58 million data items (as of July 2019). Given the broad collection of articles on politicians and institutions from all over the world, Wikipedia offers tremendous potential for (comparative) political research.</p>
<p>In what follows, we will introduce the functionalities of various <strong>R</strong> packages, including <a href="https://cran.r-project.org/web/packages/WikipediR/index.html"><code>WikipediR</code></a>, <a href="https://cran.r-project.org/web/packages/WikidataR/index.html"><code>WikidataR</code></a>, and <a href="https://cran.rstudio.com/web/packages/pageviews/index.html"><code>pageviews</code></a>. In doing so, we will showcase how to connect to Wikipedia and Wikidata APIs, how to efficiently access and parse content, and how to process the retrieved data in order to address various questions of substantive interest. We will also provide an overview of the <code>legislatoR</code> package, a fully relational individual-level data package that comprises political, sociodemographic, and Wikipedia-related data on elected politicians from various consolidated democracies.</p>
<details>
<p><summary> Code: <strong>R</strong> packages used in this tutorial</summary></p>
<pre class="r"><code>## Packages
pkgs &lt;- c(
  &quot;devtools&quot;,
  &quot;ggnetwork&quot;,
  &quot;igraph&quot;,
  &quot;intergraph&quot;,
  &quot;tidyverse&quot;,
  &quot;rvest&quot;,
  &quot;magrittr&quot;,
  &quot;plotly&quot;,
  &quot;RColorBrewer&quot;,
  &quot;colorspace&quot;,
  &quot;lubridate&quot;,
  &quot;networkD3&quot;,
  &quot;pageviews&quot;,
  &quot;readr&quot;,
  &quot;wikipediatrend&quot;,
  &quot;WikipediR&quot;,
  &quot;WikidataR&quot;
)

## Install uninstalled packages
lapply(pkgs[!(pkgs %in% rownames(installed.packages()))], install.packages)

## Load all packages to library
lapply(pkgs, library, character.only = TRUE)

## legislatoR
devtools::install_github(&quot;saschagobel/legislatoR&quot;)
library(legislatoR)</code></pre>
</details>
<p><br /></p>
</div>
<div id="collecting-and-analyzing-wikipedia-data" class="section level3">
<h3>Collecting and Analyzing Wikipedia Data</h3>
<div id="application-1-using-pageviews-to-measure-public-attention" class="section level5">
<h5>Application 1: Using Pageviews to Measure Public Attention</h5>
<p>Pageviews measure the aggregate number of clicks for a given Wikipedia article. Data on pageviews can be collected from different sources. First, this <a href="https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&amp;platform=all-access&amp;agent=user&amp;range=latest-20&amp;pages=Cat%7CDog">interactive tool</a> provides summary data which allows users to compare various search items’ popularity in a specified period. Secondly, <a href="https://dumps.wikimedia.org/">Wikimedia Downloads</a>, a collection of archived Wikimedia wikis, offers <a href="https://dumps.wikimedia.org/other/pagecounts-raw/">pageviews data through August 2016</a> as well as <a href="https://dumps.wikimedia.org/other/pageviews/">data using a new pageviews definition from May 2015 onward</a>.</p>
<p>The code chunk below demonstrates how to collect and graphically display pageviews data using the <code>pageviews</code> package. We use the command <code>article_pageviews()</code>, where the argument <code>project = "en.wikipedia"</code> specifies that we want to collect pageviews of <code>article = "Donald Trump"</code> from the English Wikipedia. We can only restrict our query to a given language edition; it is not possible to limit queries to pageviews from a specific country. We also specify the argument <code>user_type = "user"</code>, which ensures that we exclude pageviews generated by bots and spiders. Finally, <code>start</code> and <code>end</code> define the period for which we want to collect pageviews data: July 2015 to May 2017. We proceed analogously for <code>article = "Hillary Clinton"</code>.</p>
<details>
<p><summary> Code: Pageviews Data Collection </summary></p>
<pre class="r"><code># get pageviews
trump_views &lt;-
  article_pageviews(
    project = &quot;en.wikipedia&quot;,
    article = &quot;Donald Trump&quot;,
    user_type = &quot;user&quot;,
    start = &quot;2015070100&quot;,
    end = &quot;2017050100&quot;
  )
head(trump_views)

clinton_views &lt;-
  article_pageviews(
    project = &quot;en.wikipedia&quot;,
    article = &quot;Hillary Clinton&quot;,
    user_type = &quot;user&quot;,
    start = &quot;2015070100&quot;,
    end = &quot;2017050100&quot;
  )</code></pre>
</details>
<p><br />
This query allows us to retrieve the pageviews for both Trump’s and Clinton’s Wikipedia articles by date. We can then plot the frequencies of pageviews over time to identify trends in search behaviour. As we can see, the data indicate that Trump attracted considerably more attention than Clinton throughout the 2016 election campaign.</p>
<details>
<p><summary> Code: Plotting Pageviews </summary></p>
<pre class="r"><code># Plot pageviews
plot(ymd(trump_views$date), trump_views$views, col = &quot;red&quot;, type = &quot;l&quot;, xlab=&quot;Time&quot;, ylab=&quot;Pageviews&quot;)
lines(ymd(clinton_views$date), clinton_views$views, col = &quot;blue&quot;)
legend(&quot;topleft&quot;, legend=c(&quot;Trump&quot;,&quot;Clinton&quot;), cex=.8,col=c(&quot;red&quot;,&quot;blue&quot;), lty=1) </code></pre>
</details>
<p><img src="/../../../../../article/studying-politics-wikipedia_files/figure-html/code%203b-1.png" width="672" style="display: block; margin: auto;" /></p>
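<p>Daily counts can be noisy, so it may help to aggregate them before comparing articles. The following sketch sums daily pageviews by month in base <strong>R</strong>. The data frame <code>views_toy</code> is a hypothetical stand-in for objects like <code>trump_views</code>; we assume a <code>date</code> and a <code>views</code> column, and the date column of the real data may first need to be converted with <code>as.Date()</code>.</p>
<pre class="r"><code># Hypothetical daily pageviews (stand-in for an object like trump_views)
views_toy &lt;- data.frame(
  date  = as.Date(c(&quot;2016-10-30&quot;, &quot;2016-10-31&quot;, &quot;2016-11-08&quot;, &quot;2016-11-09&quot;)),
  views = c(100, 150, 900, 1200)
)

# Derive a year-month key and sum the views within each month
views_toy$month &lt;- format(views_toy$date, &quot;%Y-%m&quot;)
monthly_views &lt;- aggregate(views ~ month, data = views_toy, FUN = sum)
monthly_views</code></pre>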
</div>
<div id="application-2-using-article-links-to-create-a-network-graph-of-german-mps" class="section level5">
<h5>Application 2: Using Article Links to Create a Network Graph of German MPs</h5>
<p>The <a href="https://cran.r-project.org/web/packages/WikipediR/index.html"><code>WikipediR</code></a> package is a wrapper for the MediaWiki API that can be used to retrieve page contents as well as metadata for articles and categories, e.g. information about users or page edit histories. The functionality of the package includes:</p>
<ul>
<li><code>page_content()</code>: Retrieve current article versions (HTML and wikitext as possible output formats)</li>
<li><code>revision_content()</code>: Retrieve older versions of the article; this also includes metadata about the revision history</li>
<li><code>page_links()</code>: Retrieve outgoing links from the page’s content (which Wikipedia articles does the page link to?)</li>
<li><code>page_backlinks()</code>: Retrieve incoming links (which Wikipedia articles link to the page?)</li>
<li><code>page_external_links()</code>: Retrieve outgoing links to external sites</li>
<li><code>page_info()</code>: Page metadata</li>
<li><code>categories_in_page()</code>: What categories is a given page in?</li>
<li><code>pages_in_category()</code>: What pages are in a given category?</li>
</ul>
<p>For our application, we use the <code>page_links()</code> function to extract mutual referrals between the articles on members of the 2017-2021 German Bundestag. We can then use this information to create a network graph of current German MPs. First, we use the <a href="#legislator"><code>legislatoR</code></a> package to retrieve a list of all German MPs of the 2017-2021 German Bundestag, including information on their page IDs and page titles in the German Wikipedia. Using this information, we then extract all <code>page_links()</code> in every MP’s Wikipedia article. The third step identifies the subset of links for every MP that link to the Wikipedia article of another current MP.</p>
<p>This allows us to finally plot an interactive network using the <code>forceNetwork()</code> command from the <a href="https://cran.r-project.org/web/packages/networkD3/index.html"><code>networkD3</code></a> package. We can save the interactive network graph as an HTML widget, which is included below.</p>
<details>
<p><summary> Code: Creating an Interactive Network Graph Based on Article Links</summary></p>
<pre class="r"><code>## step 1: get info about legislators
dat &lt;- semi_join(
  x = get_core(legislature = &quot;deu&quot;),
  y = filter(get_political(legislature = &quot;deu&quot;), session == 19),
  by = &quot;pageid&quot;
)

## step 2: get page links (max 500 links)
if (!file.exists(&quot;studying-politics-wikipedia/data/wikipediR/mdb_links_list.RData&quot;)) {
  links_list &lt;- list()
  for (i in seq_len(nrow(dat))) {
    links &lt;-
      page_links(
        &quot;de&quot;,
        &quot;wikipedia&quot;,
        page = dat$wikititle[i],
        clean_response = TRUE,
        limit = 500,
        namespaces = 0
      )
    links_list[[i]] &lt;- lapply(links[[1]]$links, &quot;[&quot;, 2) %&gt;% unlist
  }
  save(links_list, file = &quot;studying-politics-wikipedia/data/wikipediR/mdb_links_list.RData&quot;)
} else{
  load(&quot;studying-politics-wikipedia/data/wikipediR/mdb_links_list.RData&quot;)
}

## step 3: identify links between MPs
# loop preparation
connections &lt;- data.frame(from = NULL, to = NULL)
# loop
for (i in seq_along(dat$wikititle)) {
  links_in_pslinks &lt;-
    seq_along(dat$wikititle)[str_replace_all(dat$wikititle, &quot;_&quot;, &quot; &quot;) %in%
                               links_list[[i]]]
  links_in_pslinks &lt;- links_in_pslinks[links_in_pslinks != i]
  connections &lt;-
    rbind(connections,
          data.frame(
            from = rep(i - 1, length(links_in_pslinks)), # -1 for zero-indexing
            to = links_in_pslinks - 1 # here too
            )
          )
}

# results
names(connections) &lt;- c(&quot;from&quot;, &quot;to&quot;)

# make symmetrical
connections &lt;- rbind(connections,
                     data.frame(from = connections$to,
                                to = connections$from))
connections &lt;- connections[!duplicated(connections), ]


## step 4: visualize connections
connections$value &lt;- 1
nodesDF &lt;- data.frame(name = dat$name, group = 1)

network_out &lt;-
  forceNetwork(
    Links = connections,
    Nodes = nodesDF,
    Source = &quot;from&quot;,
    Target = &quot;to&quot;,
    Value = &quot;value&quot;,
    NodeID = &quot;name&quot;,
    Group = &quot;group&quot;,
    zoom = TRUE,
    opacityNoHover = 3,
    height = 360,
    width = 636
  )</code></pre>
</details>
<div style="position:relative;padding-top:56.25%;">
<p><iframe src="/socialsciencedatalab/studying-politics-wikipedia/network_out.html" frameborder="0" allowfullscreen
    style="position:absolute;top:0;left:0;width:100%;height:100%;" scrolling="no" onload="resizeIframe(this)"></iframe></p>
</div>
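<p>The matching logic of step 3 may be easier to follow on a toy example. The sketch below uses three hypothetical articles (<code>titles</code>) and a made-up list of outgoing links (<code>links_toy</code>). It keeps only those links that point to another article in the set and shifts the indices by one, because <code>networkD3</code> expects zero-based node IDs.</p>
<pre class="r"><code># Hypothetical articles and the titles each article links to
titles &lt;- c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;)
links_toy &lt;- list(
  c(&quot;B&quot;, &quot;Berlin&quot;),       # A links to B
  c(&quot;A&quot;, &quot;C&quot;, &quot;CDU&quot;),     # B links to A and C
  character(0)            # C links to no other article in the set
)

# For each article, keep links that point to another article in the set;
# indices are shifted by -1 to obtain zero-based node IDs
edges &lt;- do.call(rbind, lapply(seq_along(titles), function(i) {
  targets &lt;- setdiff(which(titles %in% links_toy[[i]]), i)
  data.frame(from = rep(i - 1, length(targets)), to = targets - 1)
}))
edges</code></pre>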
<p>Using the underlying <code>connections</code> data set, we can also identify which members of the German parliament have the most connections to others. Perhaps unsurprisingly, we see the German chancellor Angela Merkel on top of the list, followed by current and former federal ministers and (deputy) party leaders.</p>
<details>
<p><summary> Code: Top 10 MPs by Connections Counts</summary></p>
<pre class="r"><code>nodesDF$id &lt;- as.numeric(rownames(nodesDF)) - 1
connections_df &lt;-
  merge(connections,
        nodesDF,
        by.x = &quot;to&quot;,
        by.y = &quot;id&quot;,
        all = TRUE)
to_count_df &lt;- count(connections_df, name)
arrange(to_count_df, desc(n))</code></pre>
</details>
<pre><code>## # A tibble: 712 x 2
##    name                     n
##    &lt;fct&gt;                &lt;int&gt;
##  1 Angela Merkel           59
##  2 Andrea Nahles           40
##  3 Heiko Maas              38
##  4 Katarina Barley         38
##  5 Peter Altmaier          38
##  6 Wolfgang Schäuble       38
##  7 Wolfgang Kubicki        37
##  8 Hans-Peter Friedrich    34
##  9 Hermann Gröhe           34
## 10 Ursula von der Leyen    33
## # ... with 702 more rows</code></pre>
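<p>To see how such counts come about, the following minimal sketch reproduces the two key steps (symmetrizing the edge list and counting connections per node) on a toy network. The three node names and the edge list are made up for illustration, and the sketch uses base R only rather than the functions from the code above.</p>
<details>
<p><summary> Code: Toy Example of Symmetrizing and Counting Connections </summary></p>
<pre class="r"><code># toy nodes and edge list with zero-based indices (hypothetical data)
toy_nodes &lt;- data.frame(id = 0:2, name = c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;))
toy_edges &lt;- data.frame(from = c(0, 0, 1), to = c(1, 2, 0))

# make symmetrical and drop duplicated dyads
toy_edges &lt;- rbind(toy_edges,
                   data.frame(from = toy_edges$to, to = toy_edges$from))
toy_edges &lt;- toy_edges[!duplicated(toy_edges), ]

# count connections per node; using all node ids as factor levels
# means that isolated nodes would show up with a count of zero
toy_counts &lt;- table(factor(toy_edges$to,
                           levels = toy_nodes$id,
                           labels = toy_nodes$name))
toy_counts &lt;- sort(toy_counts, decreasing = TRUE)</code></pre>
</details>
<p>In this toy network, node A is linked to both B and C, so it ends up at the top of the sorted table.</p>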
</div>
<div id="application-3-using-clickstream-data-to-analyze-referral-patterns" class="section level5">
<h5>Application 3: Using Clickstream Data to Analyze Referral Patterns</h5>
<p>Wikipedia articles usually <a href="https://en.wikipedia.org/wiki/Wikipedia:About"><em>“provide links designed to guide the user to related pages with additional information”</em></a>. This allows us to collect <a href="https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream">clickstream data</a>. Clickstreams yield information on the incoming and outgoing traffic of articles. They capture the articles that refer users to a given article as well as the links within a given article that users click to navigate to other articles. Clickstream data are inherently dyadic: Observations represent referral patterns for article-pairs (previous site → current site). Thus, our quantity of interest is the cumulated number of times this pattern was observed in a given period of time.</p>
<p>Clickstream data are offered as monthly aggregate counts for the major Wikipedia language editions. To obtain the data, we first have to download the raw clickstream data from <a href="https://dumps.wikimedia.org/other/clickstream/">this page</a>, where they are offered as compressed files. After extracting the files, we can load them into <strong>R</strong>.</p>
<p>In the example below, we focus on two party groups of the 8th (2014-2019) European Parliament: the euroskeptic EFDD (Europe of Freedom and Direct Democracy) and the far right ENF (Europe of Nations and Freedom). In particular, we are interested in clickstreams between the two party groups, between the party groups and their member parties, and between the individual member parties.</p>
<p>Toward this end, we download clickstream data from the English Wikipedia for May 2019, the month of the 2019 European Parliament elections. We identify 19 articles of interest and store them in the object <code>articles</code>. Having retrieved and extracted the clickstream data from May 2019, we import the TSV file into <strong>R</strong> using <code>read.table()</code>. Lastly, we subset the data to observations that involve referrals between all available article-pairs of the 19 articles.</p>
<details>
<p><summary> Code: Collecting and Processing Clickstream Data </summary></p>
<pre class="r"><code># retrieve article titles of interest
enf &lt;- &quot;Europe_of_Nations_and_Freedom&quot;
efdd &lt;- &quot;Europe_of_Freedom_and_Direct_Democracy&quot;

enf_parties &lt;- c(
  &quot;Freedom_Party_of_Austria&quot;,
  &quot;Vlaams_Belang&quot;,
  &quot;National_Rally_(France)&quot;,
  &quot;The_Blue_Party_(Germany)&quot;,
  &quot;Lega_Nord&quot;,
  &quot;Party_for_Freedom&quot;,
  &quot;Congress_of_the_New_Right&quot;
)

efdd_parties &lt;- c(
  &quot;Svobodní&quot;,
  &quot;The_Patriots_(France)&quot;,
  &quot;Debout_la_France&quot;,
  &quot;Alternative_for_Germany&quot;,
  &quot;Five_Star_Movement&quot;,
  &quot;Order_and_Justice&quot;,
  &quot;Liberty_(Poland)&quot;,
  &quot;Brexit_Party&quot;,
  &quot;Social_Democratic_Party_(UK,_1990–present)&quot;,
  &quot;Libertarian_Party_(UK)&quot;
)

articles &lt;- c(enf, efdd, enf_parties, efdd_parties)

# import raw clickstream data
cs &lt;-
  read.table(
    &quot;clickstream-enwiki-2019-05.tsv&quot;,
    header = FALSE,
    col.names = c(&quot;prev&quot;, &quot;curr&quot;, &quot;type&quot;, &quot;n&quot;),
    fill = TRUE,
    stringsAsFactors = FALSE
  )
cs$n &lt;- as.integer(cs$n)

# subset
cs &lt;- subset(cs, prev %in% articles &amp; curr %in% articles)</code></pre>
</details>
<p><br />
Next, we aim to analyze aggregate referral patterns. We first assign both previous (<code>prev</code>) and current (<code>curr</code>) articles to one of four categories: Articles on the EFDD and ENF parliamentary groups (one article each), articles on ENF member parties (7 articles), and articles on EFDD member parties (10 articles). We then summarize the data to obtain aggregate referral counts between all category pairs. Lastly, we display these in an interactive Sankey diagram using the <a href="https://cran.r-project.org/web/packages/plotly/index.html"><code>plotly</code></a> package.</p>
<details>
<p><summary> Code: Analyzing and Plotting Clickstream Data </summary></p>
<pre class="r"><code># assign categories
cs &lt;- cs %&gt;%
  mutate(
    curr_cat = ifelse(
      curr == enf,
      &quot;ENF Group&quot;,
      ifelse(
        curr == efdd,
        &quot;EFDD Group&quot;,
        ifelse(curr %in% enf_parties, &quot;ENF Parties&quot;,
               &quot;EFDD Parties&quot;)
      )
    ),
    prev_cat = ifelse(
      prev == enf,
      &quot;ENF Group&quot;,
      ifelse(
        prev == efdd,
        &quot;EFDD Group&quot;,
        ifelse(prev %in% enf_parties, &quot;ENF Parties&quot;,
               &quot;EFDD Parties&quot;)
      )
    )
  )

# summarize data
cs_sum &lt;-  cs %&gt;%
  group_by(curr_cat, prev_cat) %&gt;%
  summarize(n = sum(n)) %&gt;%
  arrange(prev_cat)

# Sankey diagram using plotly
# (labels follow the factor level order that defines the link indices below)
labels &lt;- c(levels(as.factor(cs_sum$prev_cat)),
            levels(as.factor(cs_sum$curr_cat)))
colors &lt;- ifelse(grepl(&quot;EFDD&quot;, labels), &quot;#24B9B9&quot;, &quot;#2B3856&quot;)
sankey_plot &lt;- plot_ly(
  type = &quot;sankey&quot;,
  orientation = &quot;h&quot;,
  
  node = list(
    label = labels,
    color = colors,
    pad = 15,
    thickness = 15,
    line = list(color = &quot;black&quot;,
                width = 0.5)
  ),
  
  link = list(
    source = as.numeric(as.factor(cs_sum$prev_cat)) - 1L,
    target = as.numeric(as.factor(cs_sum$curr_cat)) + 3L,
    value =  cs_sum$n
  ),
  
  height = 340,
  width = 600
) %&gt;%
  layout(font = list(size = 10))</code></pre>
</details>
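<p>The Sankey links above refer to nodes by zero-based position in the vector of node labels, with source nodes listed first and target nodes second. The following sketch, which uses made-up category counts and base R only, shows one way to derive these indices without hard-coding the number of source categories. All object names starting with <code>toy_</code> are hypothetical.</p>
<details>
<p><summary> Code: Toy Example of Sankey Node Indices </summary></p>
<pre class="r"><code># toy aggregate referral counts (hypothetical data)
toy_sum &lt;- data.frame(
  prev_cat = c(&quot;EFDD Group&quot;, &quot;ENF Group&quot;, &quot;ENF Group&quot;),
  curr_cat = c(&quot;ENF Group&quot;, &quot;EFDD Group&quot;, &quot;ENF Parties&quot;),
  n = c(5, 7, 3),
  stringsAsFactors = FALSE
)

# node labels: all source categories first, then all target categories
toy_src &lt;- sort(unique(toy_sum$prev_cat))
toy_tgt &lt;- sort(unique(toy_sum$curr_cat))
toy_labels &lt;- c(toy_src, toy_tgt)

# zero-based positions of each link&#39;s source and target in toy_labels
toy_source &lt;- match(toy_sum$prev_cat, toy_src) - 1
toy_target &lt;- match(toy_sum$curr_cat, toy_tgt) - 1 + length(toy_src)</code></pre>
</details>
<p>Looking up <code>toy_labels[toy_source + 1]</code> and <code>toy_labels[toy_target + 1]</code> returns the original category names, which is a quick way to check that labels and links are aligned.</p>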
<div style="position:relative;padding-top:56.25%;">
<iframe src="/studying-politics-wikipedia/sankey_plot.html" frameborder="0" allowfullscreen style="position:absolute;top:0;left:0;width:100%;height:100%;" scrolling="no" onload="resizeIframe(this)">
</iframe>
</div>
<p>     </p>
<p>The diagram shows that in our data, clickstream dyads involving the articles on the EFDD and ENF parliamentary groups are much more numerous than dyads involving the member parties. Much of this can be attributed to clickstreams between the two party groups, EFDD ↔︎ ENF. Whereas clickstreams between members of the same parliamentary group are also fairly frequent, clickstreams between a member of one group and a member of the other group are rare.</p>
<p>Moving beyond clickstreams between the four categories, we can also visualize the full network structure of <em>all</em> individual articles in our data. The code below starts with some preparatory data management and then uses the <a href="https://cran.r-project.org/web/packages/igraph/index.html"><code>igraph</code></a> package to create the network and to customize its graphical display.</p>
<p>In the final section of the code, we use the <a href="https://cran.r-project.org/web/packages/intergraph/index.html"><code>intergraph</code></a>, <a href="https://cran.r-project.org/web/packages/ggnetwork/index.html"><code>ggnetwork</code></a> and <a href="https://cran.r-project.org/web/packages/plotly/index.html"><code>plotly</code></a> packages to produce an interactive HTML5-compatible figure for this blog post. On your own machine, you may skip this section and simply use <code>plot.igraph()</code> on <code>cs_net</code> without transforming the object to a <code>ggplot</code> friendly format.</p>
<details>
<p><summary> Code: Interactive Network Graph </summary></p>
<pre class="r"><code># construct edges
cs_edge &lt;-
  cs %&gt;%
  group_by(prev, curr, prev_cat, curr_cat) %&gt;%
  dplyr::summarise(weight = sum(n)) %&gt;%
  arrange(curr)

# get list of unique articles to construct as nodes
cs_node &lt;- 
  gather(cs_edge,
         `prev`,
         `curr`,
         key = &quot;where&quot;,
         value = &quot;article&quot;) %&gt;%
  ungroup() %&gt;%
  select(article) %&gt;%
  distinct(article)
names(cs_node) &lt;- c(&quot;node&quot;)
cs_node$category &lt;-
  ifelse(cs_node$node == enf,
         &quot;ENF Group&quot;,
         ifelse(
           cs_node$node == efdd,
           &quot;EFDD Group&quot;,
           ifelse(cs_node$node %in% enf_parties, &quot;ENF Parties&quot;,
                  &quot;EFDD Parties&quot;)
         )
  )

# generate graph
set.seed(3)
cs_net &lt;-
  graph_from_data_frame(cs_edge,
                        vertices = cs_node,
                        directed = FALSE)
cs_net &lt;-
  igraph::simplify(cs_net, remove.multiple = TRUE, remove.loops = TRUE)

# generate colors based on category
V(cs_net)$color &lt;- 
  ifelse(grepl(&quot;EFDD&quot;, V(cs_net)$category), &quot;#24B9B9&quot;, &quot;#2B3856&quot;)

# compute node degrees (#links) and use that to set node size
deg &lt;- igraph::degree(cs_net, mode = &quot;all&quot;)
V(cs_net)$size &lt;- deg / 10

# set labels
V(cs_net)$label &lt;- NA
V(cs_net)$label.cex = 0.5
V(cs_net)$label = ifelse(igraph::degree(cs_net) &gt; 5, V(cs_net)$label, NA)
cs_hc_labels &lt;- as.vector(cs_node$node)

# set edge width based on weight
E(cs_net)$width &lt;- log(E(cs_net)$weight) / 5
E(cs_net)$edge.color &lt;- &quot;gray80&quot;

# transform the network to a ggplot friendly format
# (required to generate interactive graph embedded in blog post)
gg_cs_net &lt;-
  ggnetwork(
    cs_net,
    layout = &quot;fruchtermanreingold&quot;,
    weights = &quot;weight&quot;,
    niter = 50000,
    arrow.gap = 0
  )

cs_plot &lt;- ggplot(gg_cs_net, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(aes(color = edge.color), size = 0.4, alpha = 0.25) +
  geom_nodes(aes(color = color, size = size)) +
  geom_nodetext(aes(color = color, label = vertex.names, cex = 0.6)) +
  guides(size = &quot;none&quot;) +
  theme_blank() +
  theme(legend.position = &quot;none&quot;)</code></pre>
</details>
<div style="position:relative;padding-top:56.25%;">
<p><iframe src="/socialsciencedatalab/studying-politics-wikipedia/cs_plot.html" frameborder="0" allowfullscreen
      style="position:absolute;top:0;left:0;width:100%;height:100%;" scrolling="no" onload="resizeIframe(this)"></iframe></p>
</div>
<p> </p>
<p>In the graph above, node diameters indicate the relative weight (total counts) of each article; node colors indicate whether an article belongs to the EFDD or ENF. We see that members of the same party group tend to share more clickstreams. The Alternative for Germany (AfD), however, shares many connections with members of the ENF. This makes sense when we consider that the AfD has sought closer cooperation with numerous ENF member parties since 2016, with which it eventually formed the new far right EP group, Identity and Democracy, in June 2019.</p>
<p>Lastly, a word of caution: Clickstream counts depend heavily on how prominently (if at all) outgoing links are placed in a given Wikipedia article. Furthermore, raw counts from an isolated subset of clickstreams (as in the examples above) say nothing about how important a given referral pattern is relative to all outgoing referrals of an article. Users should thus ensure that they use clickstream data in a way that adequately addresses their substantive inquiries.</p>
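<p>One possible way to address the second caveat is to express each referral as a share of all outgoing referrals of its source article. The sketch below illustrates the idea on made-up counts using base R; the article names and numbers are hypothetical. In a real application, the shares would have to be computed on the full clickstream file <em>before</em> subsetting to the articles of interest.</p>
<details>
<p><summary> Code: Toy Example of Relative Referral Shares </summary></p>
<pre class="r"><code># toy clickstream data (hypothetical articles and counts)
toy_cs &lt;- data.frame(
  prev = c(&quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;B&quot;, &quot;B&quot;),
  curr = c(&quot;B&quot;, &quot;C&quot;, &quot;D&quot;, &quot;A&quot;, &quot;C&quot;),
  n = c(50, 30, 20, 10, 90),
  stringsAsFactors = FALSE
)

# total outgoing referrals per source article
toy_cs$n_out &lt;- ave(toy_cs$n, toy_cs$prev, FUN = sum)

# share of each referral among all outgoing referrals of its source
toy_cs$share &lt;- toy_cs$n / toy_cs$n_out

# subset to the dyads of interest only after computing the shares
toy_sub &lt;- subset(toy_cs, prev %in% c(&quot;A&quot;, &quot;B&quot;) &amp; curr %in% c(&quot;A&quot;, &quot;B&quot;))</code></pre>
</details>
<p>In this toy example, the referral A → B accounts for half of all clicks leaving A, whereas B → A accounts for only a tenth of all clicks leaving B, a difference that the raw counts alone would not put into context.</p>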
</div>
</div>
<div id="collecting-data-via-wikidata-queries" class="section level3">
<h3>Collecting Data via Wikidata Queries</h3>
<p>Wikidata is a collaboratively edited knowledge base with <a href="https://www.wikidata.org/wiki/Wikidata:Statistics">over 58 million entries as of July 2019</a>. It harbors various types of database items, including text, numerical quantities, coordinates, and images. There are no language editions, but individual entries can have values in different languages.</p>
<p>Wikidata allows users to submit queries using SPARQL, a query language for data stored in RDF (Resource Description Framework) format (see <a href="https://query.wikidata.org/">this link</a>). Click <a href="https://towardsdatascience.com/a-brief-introduction-to-wikidata-bb4e66395eb1">here</a> for a brief introduction to SPARQL. While basic queries can be used to answer mundane questions (e.g. <em>“what is the capital city of every member of the European Union, and how many inhabitants live there?”</em>), a targeted combination of related queries can be used for systematic data collection.</p>
<p>Instead of submitting explicit SPARQL queries, the example below uses the <code>WikidataR</code> package to combine various queries in order to collect data on the candidates in the <a href="https://en.wikipedia.org/wiki/2019_Conservative_Party_(UK)_leadership_election">2019 leadership election of the UK Conservative Party</a>. Suppose we want to retrieve the following information on each candidate:</p>
<ul>
<li>name</li>
<li>sex</li>
<li>date of birth</li>
<li>political experience</li>
<li>education</li>
<li>official website URL</li>
<li>Twitter account</li>
<li>Facebook account</li>
</ul>
<p>In Wikidata, entries are stored as <em>items</em> with a unique item ID that starts with “Q”. For instance, the item <a href="https://www.wikidata.org/wiki/Q30325756">2019 Conservative Party (UK) leadership election</a> is stored as “Q30325756”. Items are characterized by a number of <em>statements</em> or <em>claims</em>. Claims start with “P” and detail an item’s properties. For instance, the claim “candidate” is stored as “P726”. Claims have values, which may once again be items. For example, the values of claim “P726” (candidate) of item “Q30325756” (2019 UK Conservative Party leadership election) are 10 items: one entry for each of the 10 candidates running in the leadership election. Take, for example, winning candidate Boris Johnson, who is listed as a candidate under claim “P726”. In turn, the entry on Boris Johnson is stored as item “Q180589”. This item is characterized by numerous claims, including “P1559” (name in native language), “P21” (sex or gender), “P569” (date of birth), “P39” (positions held), “P69” (educated at), “P856” (official website), “P2002” (Twitter username), and “P2013” (Facebook ID).</p>
<p>In order to collect the data for all 10 candidates in the 2019 Conservative Party leadership election, the code chunk below implements the following steps:</p>
<ol style="list-style-type: decimal">
<li>We retrieve <em>item</em> “Q30325756”, i.e., the entry for 2019 Conservative Party (UK) leadership election</li>
<li>We extract <em>claims</em> “P726” of the above item to retrieve the item IDs of all 10 candidates, which we store in the object <code>candidates</code></li>
<li>We save the IDs of the claims of interest, stored in the object <code>claims</code></li>
<li>We then use some nested <code>sapply</code> commands to do the following:
<ul>
<li>Retrieve the <em>item</em> (entry) for each candidate</li>
<li>Extract the eight <em>claims</em> from each candidate <em>item</em></li>
<li>Process the informational value of each extracted claim, depending on whether the claim value is
<ul>
<li>an atomic object (such as web site URLs)</li>
<li>a textual object with auxiliary information (such as names, which come with language information)</li>
<li>a time/date (such as date of birth)</li>
<li>yet another item (such as previous positions, where each position has its own database entry)</li>
</ul></li>
</ul></li>
</ol>
<details>
<p><summary> Code: Retrieving Items and Claims from Wikidata</summary></p>
<pre class="r"><code># get item based on item id
uk_item &lt;- get_item(&quot;Q30325756&quot;, language = &quot;en&quot;)

# extract candidates
candidates &lt;- extract_claims(uk_item, claims = &quot;P726&quot;)
candidates &lt;- candidates[[1]][[1]]$mainsnak$datavalue$value$id

# collect the following attributes (&quot;claims&quot;) for each candidate
claims &lt;- c(&quot;P1559&quot;, &quot;P21&quot;, &quot;P569&quot;, &quot;P39&quot;, &quot;P69&quot;, &quot;P856&quot;, &quot;P2002&quot;, &quot;P2013&quot;)
names(claims) &lt;- c(&quot;nam&quot;, &quot;sex&quot;, &quot;dob&quot;, &quot;exp&quot;, &quot;edu&quot;, &quot;web&quot;, &quot;twi&quot;, &quot;fbk&quot;)
claims</code></pre>
<pre><code>##     nam     sex     dob     exp     edu     web     twi     fbk 
## &quot;P1559&quot;   &quot;P21&quot;  &quot;P569&quot;   &quot;P39&quot;   &quot;P69&quot;  &quot;P856&quot; &quot;P2002&quot; &quot;P2013&quot;</code></pre>
<pre class="r"><code># retrieve data
uk_data &lt;-
  sapply(candidates,
         function (item) {
           tmp_item &lt;- get_item(item, language = &quot;en&quot;)
           sapply(claims,
                  function(claim) {
                    tmp_claim &lt;- extract_claims(tmp_item, claim)[[1]][[1]]
                    if (any(is.na(tmp_claim))) {
                      return(NA)
                    } else {
                      tmp_claim &lt;- tmp_claim$mainsnak$datavalue$value
                      if (is.atomic(tmp_claim)) {
                        return(tmp_claim)
                      } else if (&quot;text&quot; %in% names(tmp_claim)) {
                        return(tmp_claim$text)
                      } else if (&quot;time&quot; %in% names(tmp_claim)) {
                        tmp_claim &lt;- as.Date(substr(tmp_claim$time, 2, 11))
                        return(tmp_claim)
                      } else if (&quot;id&quot; %in% names(tmp_claim)) {
                        tmp_claim &lt;- tmp_claim$id
                        tmp_claim &lt;- 
                          sapply(tmp_claim, 
                                 get_item, 
                                 language = &quot;en&quot;,
                                 simplify = FALSE,
                                 USE.NAMES = TRUE)
                        tmp_claim &lt;-
                          sapply(tmp_claim,
                                 function (x) {
                                   x[[1]]$labels$en$value
                                 })
                        return(tmp_claim)
                      }
                    }
                  },
                  simplify = FALSE,
                  USE.NAMES = TRUE)
         },
         simplify = FALSE,
         USE.NAMES = TRUE
  )</code></pre>
</details>
<p><br />
The retrieved data are stored in a nested list. At the upper level of the list, we have the ten candidates, named with their respective item IDs. Nested within each of the ten upper-level elements, we have the values of the eight claims, named with the labels we specified above. Claim values may either be single values (such as date of birth) or vectors with multiple entries (such as “positions held”). Below, we can see the retrieved data for the first candidate on the list, winning candidate Boris Johnson.</p>
<details>
<p><summary> Output: Retrieved Data for Boris Johnson</summary></p>
<pre><code>## $nam
## [1] &quot;Boris Johnson&quot;
## 
## $sex
## Q6581097 
##   &quot;male&quot; 
## 
## $dob
## [1] &quot;1964-06-19&quot;
## 
## $exp
##                                                    Q38931 
##                                         &quot;Mayor of London&quot; 
##                                                  Q1371091 
## &quot;Secretary of State for Foreign and Commonwealth Affairs&quot; 
##                                                 Q28841847 
##       &quot;Member of the Privy Council of the United Kingdom&quot; 
##                                                 Q30524710 
##     &quot;Member of the 57th Parliament of the United Kingdom&quot; 
##                                                 Q30524718 
##     &quot;Member of the 56th Parliament of the United Kingdom&quot; 
##                                                 Q35647955 
##     &quot;Member of the 54th Parliament of the United Kingdom&quot; 
##                                                 Q35921591 
##     &quot;Member of the 53rd Parliament of the United Kingdom&quot; 
##                                                    Q14211 
##                    &quot;Prime Minister of the United Kingdom&quot; 
##                                                  Q3303456 
##                        &quot;Leader of the Conservative Party&quot; 
##                                                 Q77685926 
##     &quot;Member of the 58th Parliament of the United Kingdom&quot; 
##                                                   Q609884 
##                              &quot;First Lord of the Treasury&quot; 
##                                                  Q3315116 
##                          &quot;Minister for the Civil Service&quot; 
##                                                 Q65988624 
##                                  &quot;Minister for the Union&quot; 
## 
## $edu
##                       Q192088                       Q805285 
##                &quot;Eton College&quot;             &quot;Balliol College&quot; 
##                      Q4804780                      Q5413121 
##        &quot;Ashdown House School&quot; &quot;European School, Brussels I&quot; 
## 
## $web
## [1] &quot;http://www.boris-johnson.com&quot;
## 
## $twi
## [1] &quot;BorisJohnson&quot;
## 
## $fbk
## [1] &quot;borisjohnson&quot;</code></pre>
</details>
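<p>For many analyses, a rectangular data set with one row per candidate is more convenient than a nested list. The sketch below shows one possible way to flatten such a list in base R by collapsing multi-valued claims into a single string. It runs on a small mock list with made-up values that only mimics the structure of the retrieved data; the object names <code>toy_data</code> and <code>collapse_claim()</code> are hypothetical and not part of <code>WikidataR</code>.</p>
<details>
<p><summary> Code: Toy Example of Flattening the Nested List </summary></p>
<pre class="r"><code># mock of the nested list structure (hypothetical values, two candidates)
toy_data &lt;- list(
  Q1 = list(nam = &quot;Candidate One&quot;,
            dob = as.Date(&quot;1960-01-01&quot;),
            edu = c(Q10 = &quot;College X&quot;, Q11 = &quot;College Y&quot;),
            web = NA),
  Q2 = list(nam = &quot;Candidate Two&quot;,
            dob = as.Date(&quot;1970-12-31&quot;),
            edu = c(Q12 = &quot;College Z&quot;),
            web = &quot;http://www.example.com&quot;)
)

# helper: collapse a claim value to one string, keeping missing values as NA
collapse_claim &lt;- function(value) {
  if (all(is.na(value))) return(NA_character_)
  paste(as.character(value), collapse = &quot;; &quot;)
}

# one single-row data frame per candidate, then bind them row-wise
toy_rows &lt;- lapply(toy_data, function(candidate) {
  as.data.frame(lapply(candidate, collapse_claim), stringsAsFactors = FALSE)
})
toy_df &lt;- do.call(rbind, toy_rows)

# store the item IDs in a column instead of the row names
rownames(toy_df) &lt;- NULL
toy_df$id &lt;- names(toy_data)</code></pre>
</details>
<p>Collapsing to strings discards the item IDs of multi-valued claims and turns dates into text, so this format is best suited for inspection and export rather than further processing of individual claim values.</p>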
<p><br /></p>
</div>
<div id="legislator" class="section level3">
<h3>legislatoR</h3>
<p><a href="https://github.com/saschagobel/legislatoR"><code>legislatoR</code></a> is a joint project of <a href="https://github.com/saschagobel">Sascha Göbel</a> and <a href="https://simonmunzert.github.io/">Simon Munzert</a>. It offers a comprehensive relational individual-level database that provides political, sociodemographic, and other Wikipedia-related data on members of various national parliaments, including all sessions of the Austrian Nationalrat, the German Bundestag, the Irish Dáil, the French Assemblée, and the United States Congress (House and Senate). It currently comprises data on 42,534 elected representatives and covers a wide variety of variables, including:</p>
<ul>
<li>sociodemographics (<em>Core</em>)</li>
<li>basic political variables (<em>Political</em>)</li>
<li>records of individual Wikipedia data, including full revision histories (<em>History</em>)</li>
<li>daily user traffic on individual Wikipedia biographies (<em>Traffic</em>)</li>
<li>social media handles and website URLs (<em>Social</em>)</li>
<li>URLs to individual Wikipedia portraits (<em>Portraits</em>)</li>
<li>information on public offices held by MPs (<em>Offices</em>)</li>
<li>MPs’ occupations (<em>Professions</em>)</li>
<li>IDs that link politicians to other files, databases and websites (<em>IDs</em>)</li>
</ul>
<p>The figure below, taken from <span class="citation">Göbel and Munzert (2019)</span>, illustrates the data structure:</p>
<p><img src="/../../../../../article/studying-politics-wikipedia/img/data-structure.png" width="60%" style="display: block; margin: auto;" /></p>
<p>The package provides a relational database. This means that all data sets can be joined with the core data set via one of two keys: the Wikipedia page ID or the Wikidata ID, which uniquely identify individual politicians.</p>
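<p>The following sketch illustrates this key-based logic with two made-up tables and base R&#39;s <code>merge()</code>. The tables and their values are hypothetical; only the key name <code>pageid</code> and the variable names <code>session</code>, <code>party</code>, and <code>name</code> mirror those used in the applications below.</p>
<details>
<p><summary> Code: Toy Example of Joining Tables via a Shared Key </summary></p>
<pre class="r"><code># toy core table: one row per politician (hypothetical values)
toy_core &lt;- data.frame(
  pageid = c(1, 2, 3),
  name = c(&quot;MP A&quot;, &quot;MP B&quot;, &quot;MP C&quot;),
  stringsAsFactors = FALSE
)

# toy political table: one row per politician and session
toy_political &lt;- data.frame(
  pageid = c(1, 1, 2),
  session = c(18, 19, 19),
  party = c(&quot;X&quot;, &quot;X&quot;, &quot;Y&quot;),
  stringsAsFactors = FALSE
)

# keep all politician-session rows and add the core variables via the key
toy_joined &lt;- merge(toy_political, toy_core, by = &quot;pageid&quot;, all.x = TRUE)</code></pre>
</details>
<p>Because the political table holds one row per politician and session, the core variables are repeated for politicians who served in several sessions, while politicians without a matching row in the political table are dropped.</p>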
<p><code>legislatoR</code> services the increasing demand for micro-level data on political elites among political scientists, political analysts, and journalists and offers an accessible and rich collection of data on past and present politicians. Drawing on Wikipedia and other web data allows it to include detailed information on politicians’ biographies.</p>
<p>To install the current development version from <a href="https://github.com/saschagobel/legislatoR">GitHub</a>, we use the <code>devtools</code> package. After installing and loading <code>legislatoR</code>, we can use the <code>ls()</code> command to list the functions and objects that the package provides.</p>
<details>
<p><summary> Code: Installing <code>legislatoR</code> </summary></p>
<pre class="r"><code>## Install from GitHub
devtools::install_github(&quot;saschagobel/legislatoR&quot;)
library(legislatoR)

## View functionality
ls(&quot;package:legislatoR&quot;)</code></pre>
</details>
<p><br /></p>
<div id="application-1-social-media-adoption-rates" class="section level5">
<h5>Application 1: Social Media Adoption Rates</h5>
<p>To retrieve <code>legislatoR</code> data, we first load the entire core data set of a given national parliament using the <code>get_core()</code> command. In the example below, we focus on the German Bundestag. We immediately <code>right_join()</code> the core data set with the <em>political</em> component using <code>get_political()</code>. We then <code>filter()</code> the data to retain the legislative <code>session</code> of interest (here, the most recent session of the Bundestag, 2017-2021).</p>
<p>In the next step, we <code>left_join()</code> this data set with the <em>social</em> component using <code>get_social()</code>. This gives us full information on the social media accounts of all MPs of the 2017-2021 Bundestag. Whenever MPs do not have an account, this is stored as missing information (<code>NA</code>). Using this information, we can calculate the social media adoption rates in the German parliament.</p>
<details>
<p><summary> Code: Retrieving <code>legislatoR</code> Data and Calculating Social Media Adoption Rates </summary></p>
<pre class="r"><code>## Get social media adoption rates
# get data: Germany
dat_ger &lt;- right_join(
  x = get_core(legislature = &quot;deu&quot;),
  y = filter(
    get_political(legislature = &quot;deu&quot;),
    as.numeric(session) == max(as.numeric(session))
  ),
  by = &quot;pageid&quot;
)
dat_ger &lt;- left_join(x = dat_ger,
                     y = get_social(legislature = &quot;deu&quot;),
                     by = &quot;wikidataid&quot;)
dat_ger$legislature &lt;- &quot;Germany&quot;
dat_ger_sum &lt;- dat_ger %&gt;%
  dplyr::summarize(
    twitter = mean(not(is.na(twitter)), na.rm = TRUE),
    facebook = mean(not(is.na(facebook)), na.rm = TRUE),
    website = mean(not(is.na(website)), na.rm = TRUE),
    session_start = ymd(first(session_start)),
    session_end = ymd(first(session_end)),
    legislature = first(legislature)
  )</code></pre>
</details>
<pre><code>##     twitter  facebook  website session_start session_end legislature
## 1 0.7591036 0.6918768 0.640056    2017-10-24  2021-10-24     Germany</code></pre>
<p>Given the inclusion of <em>political</em> variables in our data set, we could think of numerous feasible extensions. For instance, we could look at social media adoption rates by <code>party</code>. Alternatively, we could use the <code>constituency</code> identifiers to add external data on the rurality/urbanity of German electoral districts and analyze whether politicians competing in urban districts are more likely to maintain social media profiles.</p>
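<p>As a minimal sketch of the first extension, the code below computes Twitter adoption rates by party on a made-up data set in base R. The parties and account handles are hypothetical; with the actual data, the same logic could be applied to <code>dat_ger</code>, for instance by grouping on <code>party</code> before summarizing.</p>
<details>
<p><summary> Code: Toy Example of Adoption Rates by Party </summary></p>
<pre class="r"><code># toy data: one row per MP, NA indicates no account (hypothetical values)
toy_mps &lt;- data.frame(
  party = c(&quot;X&quot;, &quot;X&quot;, &quot;X&quot;, &quot;Y&quot;, &quot;Y&quot;),
  twitter = c(&quot;mp_a&quot;, NA, &quot;mp_c&quot;, NA, NA),
  stringsAsFactors = FALSE
)

# share of MPs with a non-missing Twitter handle, by party
toy_rates &lt;- tapply(!is.na(toy_mps$twitter), toy_mps$party, mean)</code></pre>
</details>
<p>Here, two of the three MPs of party X have an account, whereas none of the MPs of party Y do.</p>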
</div>
<div id="application-2-public-attention-to-members-of-the-german-bundestag" class="section level5">
<h5>Application 2: Public Attention to Members of the German Bundestag</h5>
<p>In the second application, we use pageviews to identify peaks in public attention for MPs over time. This is particularly interesting in the context of politically significant events. For instance, we may want to know about public attention to parliamentarians following scandals, around elections, or during election campaigns. The code below illustrates this logic by averaging daily pageviews across all Wikipedia articles on members of the German Bundestag between July 2015 and December 2017.</p>
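<p>The code further below flags a day as a peak when average pageviews are at least 1.8 times as high as on the previous day and exceed 180 views. The following sketch applies this rule to a short made-up series, using base R instead of <code>dplyr</code> and <code>lubridate</code>; the dates and values are hypothetical.</p>
<details>
<p><summary> Code: Toy Example of the Peak Detection Rule </summary></p>
<pre class="r"><code># toy series of average daily pageviews (hypothetical values)
toy_traffic &lt;- data.frame(
  date = as.Date(&quot;2017-01-01&quot;) + 0:5,
  mean = c(100, 110, 400, 390, 150, 300)
)

# previous day&#39;s value (base R equivalent of a one-day lag)
toy_traffic$mean_l1 &lt;- c(NA, head(toy_traffic$mean, -1))

# flag peaks: at least 1.8 times the previous day and above 180 views
toy_traffic$peak &lt;- toy_traffic$mean &gt;= 1.8 * toy_traffic$mean_l1 &amp;
  toy_traffic$mean &gt; 180</code></pre>
</details>
<p>Only the third and sixth day qualify as peaks. The fourth day stays well above 180 views but is not flagged because it does not exceed the previous day by the required factor, which keeps multi-day spikes from being counted repeatedly.</p>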
<details>
<p><summary> Code: Plotting Average Daily Pageviews for German MPs </summary></p>
<pre class="r"><code>## Visualize average pageviews data of German MPs
# get data
ger_traffic &lt;- right_join(
  x = get_traffic(legislature = &quot;deu&quot;),
  y = filter(
    get_political(legislature = &quot;deu&quot;),
    session_end &gt;= as.Date(&quot;2015-07-01&quot;)
  ),
  by = &quot;pageid&quot;
)
ger_traffic &lt;- left_join(x = ger_traffic,
                         y = get_core(legislature = &quot;deu&quot;),
                         by = &quot;pageid&quot;)
ger_traffic &lt;-
  dplyr::select(ger_traffic, pageid, date, traffic, session, party, name)

# aggregate data
ger_traffic$date &lt;- ymd(ger_traffic$date)
ger_traffic_date &lt;- group_by(ger_traffic, date)
ger_traffic_legislators &lt;- group_by(ger_traffic, pageid)
ger_traffic_sum &lt;-
  summarize(ger_traffic_date, mean = mean(traffic, na.rm = TRUE))
ger_traffic_sum &lt;- mutate(
  ger_traffic_sum,
  mean_l1 = lag(mean, 1),
  mean_f1 = lead(mean, 1),
  peak = (mean &gt;= 1.8 * mean_l1 &amp;
            mean &gt; 180)
)

# identify peaks
ger_traffic_peaks &lt;- filter(ger_traffic_sum, peak == TRUE)
ger_traffic_peaks_df &lt;-
  filter(ger_traffic, date %in% ger_traffic_peaks$date)
ger_traffic_peaks_group &lt;-
  group_by(ger_traffic_peaks_df, date) %&gt;% 
  dplyr::arrange(desc(traffic)) %&gt;% 
  filter(row_number() == 1)
ger_traffic_peaks_group &lt;- arrange(ger_traffic_peaks_group, date)
events_vec &lt;-
  c(
    &quot;deceased&quot;,
    &quot;drug affair&quot;,
    &quot;bullying affair&quot;,
    &quot;candidacy for presidency&quot;,
    &quot;deceased&quot;,
    &quot;???&quot;,
    &quot;chancellorship announcement&quot;,
    &quot;elected president&quot;,
    &quot;???&quot;,
    &quot;policy success&quot;,
    &quot;TV debate&quot;,
    &quot;general election&quot;,
    &quot;threat to resign&quot;,
    &quot;elected speaker of parliament&quot;
  )

# plot
par(oma = c(0, 0, 0, 0))
par(mar = c(0, 4, 0, .5))
par(yaxs = &quot;i&quot;, xaxs = &quot;i&quot;, bty = &quot;n&quot;)
layout(matrix(c(1, 1, 3, 2, 2, 3), 2, 3, byrow = TRUE),
       heights = c(1, 2, 3),
       widths = c(5, 5, 1.8))
# names labels
plot(
  ymd(ger_traffic_sum$date),
  rep(0, length(ger_traffic_sum$date)),
  xlim = c(ymd(&quot;2015-07-01&quot;), ymd(&quot;2018-01-01&quot;)),
  xaxt = &quot;n&quot;,
  ylim = c(0, 8),
  yaxt = &quot;n&quot;,
  xlab = &quot;&quot;,
  ylab = &quot;&quot;,
  cex = 0
)
text(
  ger_traffic_peaks_group$date,
  0,
  ger_traffic_peaks_group$name,
  cex = .75,
  srt = 90,
  adj = c(0, 0)
)
# pageviews time series
par(mar = c(2, 4, 0, .5))
plot(
  ymd(ger_traffic_sum$date),
  ger_traffic_sum$mean,
  type = &quot;l&quot;,
  ylim = c(0, 1.25 * max(ger_traffic_sum$mean)),
  xlim = c(ymd(&quot;2015-07-01&quot;), ymd(&quot;2018-01-01&quot;)),
  xaxt = &quot;n&quot;,
  yaxt = &quot;n&quot;,
  xlab = &quot;&quot;,
  ylab = &quot;mean(pageviews)&quot;,
  col = &quot;white&quot;
)
abline(h = seq(0, 1.5 * max(ger_traffic_sum$mean), 250), col = &quot;lightgrey&quot;)
lines(ymd(ger_traffic_sum$date), ger_traffic_sum$mean, lwd = .5)
dates &lt;- seq(ymd(&quot;2015-07-01&quot;), ymd(&quot;2018-01-01&quot;), by = 1)
axis(1, dates[day(dates) == 1 &amp;
                month(dates) %in% c(1, 4, 7, 10)], labels = FALSE)
axis(1,
     dates[day(dates) == 1 &amp;
             month(dates) %in% c(1)],
     lwd = 0,
     lwd.ticks = 3,
     labels = FALSE)
axis(1,
     dates[day(dates) == 1 &amp;
             month(dates) %in% c(7)],
     labels = as.character(year(dates[day(dates) == 15 &amp;
                                        month(dates) %in% c(7)])),
     tick = FALSE,
     lwd = 0)
axis(2, seq(0, 1.5 * max(ger_traffic_sum$mean), 250), las = 2)
# events labels in time series
for (i in seq_along(events_vec)) {
  text(ger_traffic_peaks_group$date[i],
       ger_traffic_sum$mean[ger_traffic_sum$peak == TRUE][i] + 80,
       i,
       cex = .8)
  points(
    ger_traffic_peaks_group$date[i],
    ger_traffic_sum$mean[ger_traffic_sum$peak == TRUE][i] + 80,
    pch = 1,
    cex = 2.2
  )
}
# election date
# events labels explained
par(mar = c(0, 0, 0, 0))
plot(
  0,
  0,
  xlim = c(0, 5),
  ylim = c(0, 10),
  xaxt = &quot;n&quot;,
  yaxt = &quot;n&quot;,
  xlab = &quot;&quot;,
  ylab = &quot;&quot;,
  cex = 0
)
positions &lt;-
  data.frame(
    events_xpos = 0.45,
    events_ypos = seq(6.5, (6.5 - .5 * length(events_vec)),-.5),
    text_xpos = .5
  )
text(0, 7, &quot;Events&quot;, pos = 4, cex = .75)
for (i in seq_along(events_vec)) {
  text(positions$events_xpos[i], positions$events_ypos[i], i, cex = .8)
  points(positions$events_xpos[i],
         positions$events_ypos[i],
         pch = 1,
         cex = 2.2)
  text(
    positions$text_xpos[i],
    positions$events_ypos[i],
    events_vec[i],
    pos = 4,
    cex = .75
  )
}</code></pre>
</details>
<img src="/../../../../../article/studying-politics-wikipedia_files/figure-html/Code%2013b-1.png" width="960" style="display: block; margin: auto;" />
</details>
<p><br/>
The plot shows 14 notable spikes in daily pageviews. For each spike, we identify the politician whose article generated the most traffic on that day. We also add a legend listing the salient political events that likely caused these spikes in attention to MPs’ Wikipedia entries. For example, spike 14 marks the day on which Wolfgang Schäuble (CDU) was elected speaker of the parliament.</p>
</div>
</div>
<div id="conclusion" class="section level3">
<h3>Conclusion</h3>
<p>Collecting and analyzing Wikipedia data is relatively easy and entirely free. It gives researchers access to an enormous body of data that offers valuable information for research in political science and beyond. Tools that facilitate the collection, processing, and analysis of Wikipedia data are advancing rapidly, broadening the realm of possibilities for scientific research. Political science research is increasingly picking up on these developments, as is evident in recent contributions <span class="citation">(Munzert 2015; Göbel and Munzert 2018; Shi et al. 2019)</span> and software (such as <a href="https://github.com/saschagobel/legislatoR"><code>legislatoR</code></a>).</p>
<p>However, using Wikipedia data also comes with limitations and pitfalls. As entries can be read and edited by both humans and machines, the accuracy of contents and the validity of metadata are not guaranteed. With respect to the latter, Wikidata adds provenance information to all the data, which can be used to evaluate the validity of the data in question for applied research.
Researchers should also keep in mind that Wikipedia data depends heavily on user-driven creation, editing, and use of contents. This may not only lead to systematic selection bias due to data availability but also raise problems of equivalence between data points (e.g., articles on historical political figures likely receive fewer views and edits than articles on active politicians for reasons unrelated to their legislative activity or real-world importance).</p>
<p>These caveats are, however, by no means exclusive to Wikipedia data. They merely underline that Wikipedia data is no exception when it comes to the general necessity of thoroughly scrutinizing and critically assessing the suitability of any given data for addressing substantive research questions.</p>
</div>
<div id="about-the-presenter" class="section level3">
<h3>About the Presenter</h3>
<p>Simon Munzert <a href="mailto:munzert@hertie-school.org"><i class="fa
              fa-envelope"></i> </a>
<a href="https://simonmunzert.github.io/"><i class="fa
              fa-globe"></i> </a>
<a href="https://twitter.com/simonsaysnothin"><i class="fa
              fa-twitter"></i></a> is a lecturer in Political Data Science at the Hertie School of Governance in Berlin, Germany. A former member of the MZES Data and Methods Unit, Simon founded the Social Science Data Lab in 2016. His research focuses on public opinion, political representation, and the role of new media for political processes.</p>
</div>
<div id="references" class="section level3 unnumbered">
<h3 class="unnumbered">References</h3>
<div id="refs" class="references hanging-indent">
<div id="ref-Gobel2018">
<p>Göbel, Sascha, and Simon Munzert. 2018. “Political Advertising on the Wikipedia Marketplace of Information.” <em>Social Science Computer Review</em> 36 (2): 157–75. <a href="https://doi.org/10.1177/0894439317703579">https://doi.org/10.1177/0894439317703579</a>.</p>
</div>
<div id="ref-Gobel2019">
<p>———. 2019. “legislatoR: Political, sociodemographic, and Wikipedia-related data on political elites.” <a href="https://github.com/saschagobel">https://github.com/saschagobel</a>.</p>
</div>
<div id="ref-Munzert2015a">
<p>Munzert, Simon. 2015. “Using Wikipedia Article Traffic Volume to Measure Public Issue Attention.” <em>Working Paper</em>. <a href="https://github.com/simonmunzert/workingPapers/blob/master/wikipedia-salience-v3.pdf">https://github.com/simonmunzert/workingPapers/blob/master/wikipedia-salience-v3.pdf</a>.</p>
</div>
<div id="ref-Shi2019">
<p>Shi, Feng, Misha Teplitskiy, Eamon Duede, and James A. Evans. 2019. “The wisdom of polarized crowds.” <em>Nature Human Behaviour</em> 3 (4): 329–36. <a href="https://doi.org/10.1038/s41562-019-0541-6">https://doi.org/10.1038/s41562-019-0541-6</a>.</p>
</div>
</div>
</div>
]]>
      </description>
    </item>
    
    <item>
      <title>Quantitative Analysis of Political Text</title>
      <link>https://socialsciencedatalab.mzes.uni-mannheim.de/article/quantitative-analysis-of-political-text/</link>
      <pubDate>Mon, 22 Jul 2019 00:00:00 +0100</pubDate>
      
      <guid>https://socialsciencedatalab.mzes.uni-mannheim.de/article/quantitative-analysis-of-political-text/</guid>
      <description><![CDATA[
        </p>
<p>How can we infer actors’ positions, substantive topics, or sentiments from (political) texts? This <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/categories/tutorials/">Methods Bites Tutorial</a> by <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/page/team/">Julian Bernauer</a> summarizes <a href="https://denisetraber.net/">Denise Traber</a>’s workshop in the <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/page/events/">MZES Social Science Data Lab</a> in Spring 2018. Using exemplary sets of political documents (election manifestos and coalition agreements), it showcases tools of quantitative text analysis (QTA) for a variety of analytical objectives and demonstrates how to create, process, and analyse a text corpus through a series of hands-on applications.</p>
<p>After reading this blog post and engaging with the applied exercises, readers should:</p>
<ul>
<li>be able to perform some basic preprocessing of text</li>
<li>be able to estimate the sentiment of texts</li>
<li>be able to find topics in texts</li>
<li>be able to estimate (scale) positions of texts</li>
</ul>
<p>You can use these links to navigate across the main sections of this tutorial:</p>
<ol style="list-style-type: decimal">
<li><a href="#tour"><strong>A tour of Quantitative Text Analysis</strong></a></li>
<li><a href="#preprocessing"><strong>(Pre-)processing text</strong></a></li>
<li><a href="#smallcoalition"><strong>A small coalition corpus</strong></a></li>
<li><a href="#sentimentanalysis"><strong>Sentiment analysis using a dictionary</strong></a></li>
<li><a href="#lda"><strong>LDA topic modeling</strong></a></li>
<li><a href="#wordfish"><strong>Wordfish scaling</strong></a></li>
<li><a href="#intraparty"><strong>Estimating intra-party preferences: Comparing speeches to votes</strong></a></li>
<li><a href="#furtherreadings"><strong>Further readings</strong></a></li>
</ol>
<p><em>Note:</em> This blog post presents Denise’s workshop materials in condensed form. The complete workshop materials, including slides and scripts, are available from our GitHub.</p>
<div id="a-tour-of-quantitative-text-analysis" class="section level3">
<h3>A tour of Quantitative Text Analysis <a name="tour"></a></h3>
<p>The workshop started with a few basics: While QTA can be efficient and cheap, it never rests on a correct model of language. It does not free us from reading texts, and validation is key. We learned about the basic distinction between classification (organizing text into categories) and scaling (estimating positions of actors), and about their supervised (where hand-coded or other external data is available) and unsupervised (without such data) variants.</p>
</div>
<div id="pre-processing-text" class="section level3">
<h3>(Pre-)processing text <a name="preprocessing"></a></h3>
<p>We relied on the R package <a href="http://quanteda.io"><strong>quanteda</strong></a> developed by Ken Benoit and collaborators, which has taken QTA by storm, at least for those working in R. Together with the <a href="https://cran.r-project.org/web/packages/readtext/index.html"><strong>readtext</strong></a> package, it makes it easy to get your text data into R, create a so-called “corpus” of texts with the actual content as well as meta-information, and perform various tasks of corpus and text processing (subsetting a corpus, creating a document-feature matrix (dfm), stopword removal) as well as analysis (scaling, classification). A large and increasing number of extras is also available, such as ways to assess text similarity (function <code>textstat_simil()</code>) and lexical diversity (<code>textstat_lexdiv()</code>). Some of these features are demonstrated in an example below. Also see <a href="http://quanteda.io/reference/index.html">this overview</a> by quanteda for a full list of functions and the <a href="https://cran.r-project.org/web/packages/preText/vignettes/getting_started_with_preText.html"><strong>preText</strong></a> package for advice on evaluating pre-processing specifications.</p>
</div>
<div id="a-small-coalition-corpus" class="section level3">
<h3>A small coalition corpus <a name="smallcoalition"></a></h3>
<p>For a few examples from the workshop, consider a small set of three documents: The coalition agreement between the CDU/CSU and the SPD as well as the respective election manifestos from the 2017 Bundestag election. The corpus is created by:</p>
<pre class="r"><code>library(readtext)
library(quanteda)
text &lt;- readtext(paste0(wd, &quot;coalition/*.txt&quot;),
                 docvarsfrom = &quot;filenames&quot;,
                 docvarnames = &quot;Party&quot;)
text$text &lt;- gsub(&quot;\n&quot;, &quot; &quot;, text$text)
coalitioncorpus &lt;- corpus(text, docid_field = &quot;doc_id&quot;)
coalitioncorpus$metadata$source &lt;- &quot;[directory] on [system] by [user]&quot;
summary(coalitioncorpus)</code></pre>
<pre><code>## Corpus consisting of 3 documents:
## 
##           Text Types Tokens Sentences     Party
##     cducsu.txt  4738  26004      1288    cducsu
##  coalition.txt 11660  93214      3763 coalition
##        spd.txt  7650  50298      2402       spd
## 
## Source: [directory] on [system] by [user]
## Created: Wed Nov 11 15:06:41 2020
## Notes:</code></pre>
<p>The code relies on the two packages, <strong>readtext</strong> and <strong>quanteda</strong>, to create a data frame with the text files, using their names for a document-level variable called “Party”. The <code>gsub()</code> command replaces line breaks with spaces, and <code>corpus()</code> turns the data frame into a corpus, a dedicated object that holds the texts together with some meta-information and document-level variables, all optimized to perform a variety of quantitative text analysis operations using quanteda.</p>
<p>Further document-level variables are added via:</p>
<pre class="r"><code>docvars(coalitioncorpus, &quot;Year&quot;) &lt;- 2017
docvars(coalitioncorpus, &quot;Party_regex&quot;) &lt;- 
  sub(&quot;[\\.].*&quot;, &quot;&quot;, names(texts(coalitioncorpus)))
docvars(coalitioncorpus)</code></pre>
<pre><code>##                   Party Year Party_regex
## cducsu.txt       cducsu 2017      cducsu
## coalition.txt coalition 2017   coalition
## spd.txt             spd 2017         spd</code></pre>
<p>Note that this uses a regular expression (regex) as an alternative way to retrieve the party names from the filenames after creating the corpus. For specific analyses, we want to know the distribution of words across documents and create a document-feature matrix (dfm):</p>
<pre class="r"><code>dfm_coal &lt;- dfm(
  coalitioncorpus,
  remove = c(stopwords(&quot;german&quot;),
             &quot;dass&quot;,
             &quot;sowie&quot;,
             &quot;insbesondere&quot;),
  remove_punct = TRUE,
  stem = FALSE
)
dfm_coal[, 1:8]</code></pre>
<pre><code>## Document-feature matrix of: 3 documents, 8 features (16.7% sparse).
## 3 x 8 sparse Matrix of class &quot;dfm&quot;
##                features
## docs            gutes land zeit deutschland liebens lebenswertes gut
##   cducsu.txt        6   48   11         147       1            1  16
##   coalition.txt     1   39   14         195       0            0  13
##   spd.txt           6   44   33          97       0            0  17
##                features
## docs            wohnen
##   cducsu.txt         1
##   coalition.txt     10
##   spd.txt            6</code></pre>
<p>Creating a dfm induces a bag-of-words assumption. This means that the order in which words appear is ignored. A dfm is a means of information reduction and the most efficient way of storing text as data, but allows only analyses under this assumption. We quickly glance at the similarity (function <code>textstat_simil()</code>) and lexical diversity (function <code>textstat_lexdiv()</code>) of texts:</p>
<pre class="r"><code>simil &lt;- textstat_simil(dfm_coal,
                        margin = &quot;documents&quot;,
                        method = &quot;correlation&quot;)
simil </code></pre>
<pre><code>## textstat_simil object; method = &quot;correlation&quot;
##               cducsu.txt coalition.txt spd.txt
## cducsu.txt         1.000         0.968   0.975
## coalition.txt      0.968         1.000   0.986
## spd.txt            0.975         0.986   1.000</code></pre>
<pre class="r"><code>textstat_lexdiv(dfm_coal)[, 1:2]</code></pre>
<pre><code>##        document       TTR
## 1    cducsu.txt 0.3302084
## 2 coalition.txt 0.2487993
## 3       spd.txt 0.2802713</code></pre>
<p>From this, we learn that the SPD manifesto is more similar to the coalition agreement than the CDU/CSU manifesto is, an impression that somewhat resembles assessments of the 2017 German coalition. Also, the lexical diversity of the manifestos, measured as the type-token ratio (TTR), i.e., types (different words) per token (total words), appears to be higher than that of the coalition agreement, especially for the CDU/CSU.</p>
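<p>To make the bag-of-words idea and the type-token ratio concrete, here is a minimal base R sketch on a made-up sentence. The sentence and the object names <code>toy_text</code>, <code>toy_tokens</code>, <code>toy_counts</code> and <code>toy_ttr</code> are purely illustrative, and splitting on spaces is only a stand-in for quanteda’s tokenizer, so the numbers are not comparable to the output above.</p>
<pre class="r"><code># Toy example (hypothetical sentence, not taken from the corpus)
toy_text &lt;- &quot;wir wollen ein gutes land und ein gutes leben&quot;
toy_tokens &lt;- strsplit(toy_text, &quot; &quot;, fixed = TRUE)[[1]]

# Bag of words: only the count per word type is kept, word order is discarded
toy_counts &lt;- table(toy_tokens)

# Type-token ratio: number of distinct words divided by total number of words
toy_ttr &lt;- length(toy_counts) / length(toy_tokens)
toy_ttr</code></pre>
<p>Because repeated words add tokens but no new types, the ratio tends to fall as texts get longer. This may be one reason why the longest of the three documents, the coalition agreement, shows the lowest value.</p>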
</div>
<div id="sentiment-analysis-using-a-dictionary" class="section level3">
<h3>Sentiment analysis using a dictionary <a name="sentimentanalysis"></a></h3>
<p>For sentiment analysis, existing dictionaries are available. It is important to note that these do not necessarily fit the research question at hand. In this example, the German “LIWC” (linguistic inquiry and word count) dictionary is used, but alternatives exist, such as “Lexicoder” for political text. LIWC features, among others, the categories “anger”, “posemo” (positive emotion) and “religion”. After obtaining the dictionary and applying it while creating a dfm from the corpus, the share of each text falling into the respective categories is displayed. The results indicate that the coalition agreement features fewer positive emotions than the manifestos and that the SPD manifesto is the most “angry” text, while the CDU/CSU speaks most about religion.</p>
<details>
<p><summary>Code: Using a Dictionary</summary></p>
<pre class="r"><code># Create dictionary
liwcdict &lt;- dictionary(file = paste0(wd, &quot;German_LIWC2001_Dictionary.dic&quot;),
                       format = &quot;LIWC&quot;)

# Create dfm
liwcdfm &lt;- dfm(
  coalitioncorpus,
  remove = c(stopwords(&quot;german&quot;)),
  remove_punct = TRUE,
  stem = FALSE,
  dictionary = liwcdict
)

# Subset and calculate percentage
liwcsub &lt;-
  dfm_select(liwcdfm,
             pattern = c(&quot;Anger&quot;, &quot;Posemo&quot;, &quot;Relig&quot;),
             selection = &quot;keep&quot;)

liwcsub &lt;- convert(liwcsub, to = &quot;data.frame&quot;)
liwcsub$sum &lt;- rowSums(dfm_coal) # total feature count per document
liwcparties &lt;- data.frame(
  docs = liwcsub$document,
  ShareAnger = liwcsub$Anger / liwcsub$sum,
  SharePosemo = liwcsub$Posemo / liwcsub$sum,
  ShareRelig = liwcsub$Relig / liwcsub$sum
)

liwcparties </code></pre>
</details>
<pre><code>##            docs  ShareAnger SharePosemo  ShareRelig
## 1    cducsu.txt 0.004314995  0.06414959 0.005609493
## 2 coalition.txt 0.004666188  0.05050462 0.004074997
## 3       spd.txt 0.005550042  0.05697063 0.003454993</code></pre>
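<p>The logic behind these shares can be sketched in a few lines of base R. The mini-dictionary <code>toy_dict</code> and the token vector <code>dict_tokens</code> below are hypothetical and only serve as an illustration; the real LIWC categories contain far more entries, and the code above divides by the number of features left in <code>dfm_coal</code> after stopword removal.</p>
<pre class="r"><code># Hypothetical mini-dictionary with two categories
toy_dict &lt;- list(posemo = c(&quot;gut&quot;, &quot;freude&quot;),
                 anger = c(&quot;wut&quot;, &quot;hass&quot;))

# Hypothetical tokens of a single document
dict_tokens &lt;- c(&quot;gut&quot;, &quot;land&quot;, &quot;freude&quot;, &quot;wut&quot;,
                 &quot;zeit&quot;, &quot;gut&quot;, &quot;arbeit&quot;, &quot;zukunft&quot;)

# Share of tokens that match an entry of each category
toy_shares &lt;- sapply(toy_dict, function(words) mean(dict_tokens %in% words))
toy_shares</code></pre>
<p>A dictionary approach of this kind only counts matches. It does not take negation or context into account, which is one reason why the fit between dictionary and research question matters.</p>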
</div>
<div id="lda-topic-modelling" class="section level3">
<h3>LDA topic modelling <a name="lda"></a></h3>
<p>LDA stands for Latent Dirichlet Allocation. In a nutshell, the method represents texts as mixtures of topics and, simultaneously, topics as mixtures of words. Fixing the number of topics to <span class="math inline">\(k = 5\)</span> and using the <strong>topicmodels</strong> package, the function <code>LDA()</code> fits the model and <code>posterior()</code> delivers posterior probabilities of the topics for each document.</p>
<details>
<p><summary>Code: LDA Topic Model</summary></p>
<pre class="r"><code>library(topicmodels)

# Preparation
dfm_coal &lt;-
  dfm(
    coalitioncorpus,
    remove = c(
      stopwords(&quot;german&quot;),
      &quot;dass&quot;,
      &quot;sowie&quot;,
      &quot;insbesondere&quot;,
      &quot;b&quot;,
      &quot;z&quot;,
      &quot;a&quot;,
      &quot;u&quot;
    ),
    remove_punct = TRUE,
    remove_numbers = TRUE,
    stem = FALSE
  )

dfm_coal &lt;- dfm_wordstem(dfm_coal, language = &quot;german&quot;)

# Define parameters
burnin &lt;- 1000
iter &lt;- 500
keep &lt;- 50
seed &lt;- 2010
ntopics &lt;- 5

# Run LDA with 5 topics
ldaOut &lt;- LDA(
  dfm_coal,
  k = ntopics,
  method = &quot;Gibbs&quot;,
  control = list(
    burnin = burnin,
    iter = iter,
    keep = keep,
    seed = seed,
    verbose = FALSE
  )
)
# Posterior probabilities of the topics for each document
k &lt;- posterior(ldaOut)</code></pre>
</details>
<p><br />
Interpretation is the difficult part. Each document can be expressed as a mixture of topics, and notwithstanding the precise meaning of the topics, we learn that all texts share content referring to topic 1, while the CDU/CSU manifesto also features topic 3 and the SPD manifesto topic 2 to some extent.</p>
<pre class="r"><code>k$topics</code></pre>
<pre><code>##                       1          2          3          4          5
## cducsu.txt    0.5815482 0.02729874 0.36122496 0.01144536 0.01848272
## coalition.txt 0.6950774 0.03700023 0.06163624 0.10570834 0.10057777
## spd.txt       0.6755292 0.17646590 0.11404313 0.01837605 0.01558576</code></pre>
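<p>One simple way to read such a matrix is to rank the topics within each document. The following base R sketch re-enters the values printed above by hand, so it runs without refitting the model; the object names <code>topic_shares</code> and <code>top_two</code> are illustrative. With the fitted model at hand, <code>k$topics</code> could be used in place of the hand-entered matrix.</p>
<pre class="r"><code># Posterior topic shares, copied from the output above
topic_shares &lt;- rbind(
  cducsu.txt    = c(0.5815482, 0.02729874, 0.36122496, 0.01144536, 0.01848272),
  coalition.txt = c(0.6950774, 0.03700023, 0.06163624, 0.10570834, 0.10057777),
  spd.txt       = c(0.6755292, 0.17646590, 0.11404313, 0.01837605, 0.01558576)
)
colnames(topic_shares) &lt;- 1:5

# Each row is a probability distribution over the five topics
rowSums(topic_shares)

# Most and second most prominent topic per document
top_two &lt;- t(apply(topic_shares, 1,
                   function(p) order(p, decreasing = TRUE)[1:2]))
colnames(top_two) &lt;- c(&quot;first&quot;, &quot;second&quot;)
top_two</code></pre>
<p>Topic numbers are arbitrary labels, so they can change between runs with different seeds. What the topics mean still has to be judged from their most probable words.</p>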
</div>
<div id="wordfish-scaling" class="section level3">
<h3>Wordfish Scaling <a name="wordfish"></a></h3>
<p>Wordfish scaling derives latent positions from texts based on a bag-of-words assumption. Here is an example relying on a set of Swiss manifestos, using only the sections on immigration. In preparation, a dfm is created while removing stopwords, stemming the remaining words and removing punctuation.</p>
<details>
<p><summary>Code: Preparing Corpus for Wordfish</summary></p>
<pre class="r"><code>manifestos &lt;- readtext(paste0(wd, &quot;manifestos/*.txt&quot;))
manifestocorpus &lt;- corpus(manifestos)
dfm_manifesto &lt;-
  dfm(
    manifestocorpus,
    remove = c(
      &quot;gruen*&quot;,
      &quot;sp&quot;,
      &quot;sozialdemokrat*&quot;,
      &quot;cvp&quot;,
      &quot;fdp&quot;,
      &quot;svp&quot;,
      &quot;fuer&quot;,
      &quot;dass&quot;,
      &quot;koennen&quot;,
      &quot;koennte&quot;,
      &quot;ueber&quot;,
      &quot;waehrend&quot;,
      &quot;wuerde&quot;,
      &quot;wuerden&quot;,
      &quot;schweiz*&quot;,
      &quot;partei*&quot;,
      stopwords(&quot;german&quot;)
    ),
    valuetype = &quot;glob&quot;,
    stem = FALSE,
    remove_punct = TRUE
  )
dfm_manifesto &lt;- dfm_wordstem(dfm_manifesto, language = &quot;german&quot;)</code></pre>
</details>
<p><br />
The function <code>textmodel_wordfish()</code> estimates the Wordfish model, originally described in an <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2008.00338.x">AJPS article</a> by Slapin and Proksch (2008). It assumes that word counts across texts follow a Poisson distribution whose rate can be modeled by document and word fixed effects as well as word-specific weights and document positions. The model is a variant of unsupervised scaling that only requires fixing the relative location of two texts on the latent dimension. Here, a text of the Swiss People’s Party (SVP) is assumed to be to the right of a text of the Social Democratic Party of Switzerland (SPS). The results make some sense, with the other manifestos aligning as expected on what could be interpreted as an anti-immigration dimension.</p>
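<p>As a sketch, the model just described can be written in the notation commonly used for Wordfish. The symbols are not part of the workshop materials and are introduced here only for illustration: <span class="math inline">\(y_{ij}\)</span> is the count of word <span class="math inline">\(j\)</span> in document <span class="math inline">\(i\)</span>.</p>
<p><span class="math display">\[
y_{ij} \sim \text{Poisson}(\lambda_{ij}), \qquad \lambda_{ij} = \exp(\alpha_i + \psi_j + \beta_j \omega_i)
\]</span></p>
<p>Here <span class="math inline">\(\alpha_i\)</span> is the document fixed effect (capturing document length), <span class="math inline">\(\psi_j\)</span> the word fixed effect (capturing overall word frequency), <span class="math inline">\(\beta_j\)</span> the word-specific weight, and <span class="math inline">\(\omega_i\)</span> the position of document <span class="math inline">\(i\)</span>. Since the sign of the latent dimension is not identified by the data, the argument <code>dir = c(13, 19)</code> in the call below fixes its direction by requiring the estimated position of document 13 to be smaller than that of document 19, presumably the SPS and SVP texts mentioned above.</p>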
<pre class="r"><code>wf &lt;- textmodel_wordfish(dfm_manifesto,
                         dir = c(13, 19),
                         dispersion = &quot;poisson&quot;)
textplot_scale1d(wf)</code></pre>
<p><img src="/../../../../../article/quantitative-analysis-of-political-text_files/figure-html/fish-1.png" width="672" /></p>
<p>Or, with some improvements to the plot:</p>
<details>
<p><summary>Code: Improved Plot of Party Positions</summary></p>
<pre class="r"><code>library(ggplot2)

# Save document scores and confidence intervals in data frame
wfdata &lt;- as.data.frame(predict(wf, interval = &quot;confidence&quot;))

# Add document variables
wfdata$docs &lt;- rownames(wfdata)
wfdata$electionyear &lt;- substr(wfdata$docs, 5, 8)
wfdata$party &lt;- as.factor(substr(wfdata$docs, 1, 3))
wfdata$party &lt;-
  factor(wfdata$party, levels = c(&quot;gps&quot;, &quot;sps&quot;, &quot;cvp&quot;, &quot;fdp&quot;, &quot;svp&quot;))

ggplot(wfdata) +
  geom_pointrange(
    aes(
      x = electionyear,
      y = fit.fit,
      ymin = fit.lwr,
      ymax = fit.upr,
      group = party,
      color = party
    ),
    size = 0.5
  ) +
  geom_line(aes(
    x = electionyear,
    y = fit.fit,
    group = party,
    color = party
  )) +
  theme_bw() +
  labs(title = &quot;Wordfish analysis&quot;,
       y = &quot;Document position&quot;,
       x = &quot;Electionyear&quot;) +
  scale_color_manual(values = c(&quot;green3&quot;,
                                &quot;red1&quot;,
                                &quot;darkorange1&quot;,
                                &quot;dodgerblue4&quot;,
                                &quot;springgreen4&quot;))</code></pre>
</details>
<p><br />
<img src="/../../../../../article/quantitative-analysis-of-political-text_files/figure-html/fish3-1.png" width="672" /></p>
</div>
<div id="further-readings" class="section level3">
<h3>Further readings <a name="furtherreadings"></a></h3>
<ul>
<li>An introductory article to QTA in R, especially relying on <strong>quanteda</strong>: <a href="https://www.tandfonline.com/doi/abs/10.1080/19312458.2017.1387238">Welbers, Kasper, Wouter Van Atteveldt and Kenneth Benoit (2017): Text Analysis in R, <em>Communication Methods and Measures</em> 11(4): 245-65.</a></li>
<li>An application of the methods by the workshop host and co-authors: <a href="https://www.cambridge.org/core/journals/political-science-research-and-methods/article/estimating-intraparty-preferences-comparing-speeches-to-votes/D5812B196E0945B1341AFCD050F24858">Schwarz, Daniel, Denise Traber and Kenneth Benoit (2017): Estimating Intra-Party Preferences: Comparing Speeches to Votes. <em>Political Science Research and Methods</em> 5(2): 379-396.</a></li>
</ul>
</div>
<div id="about-the-presenter" class="section level3">
<h3>About the presenter</h3>
<p><a href="https://denisetraber.net">Denise Traber</a> is a Senior Research Fellow at the University of Lucerne, Switzerland, where she heads an Ambizione research grant project on “The divided people: polarization of political attitudes in Europe” funded by the Swiss National Science Foundation. She has a strong interest in quantitative text analysis, co-organizes the “Zurich Summer School for Women in Political Methodology” and has published the article “Estimating Intra-Party Preferences: Comparing Speeches to Votes” in Political Science Research and Methods in 2017, jointly with Daniel Schwarz and Ken Benoit.</p>
</div>
]]>
      </description>
    </item>
    
    <item>
      <title>Collecting and Analyzing Twitter Data Using R</title>
      <link>https://socialsciencedatalab.mzes.uni-mannheim.de/article/collecting-and-analyzing-twitter-using-r/</link>
      <pubDate>Mon, 15 Jul 2019 01:01:01 +0100</pubDate>
      
      <guid>https://socialsciencedatalab.mzes.uni-mannheim.de/article/collecting-and-analyzing-twitter-using-r/</guid>
      <description><![CDATA[
        </p>
<p>How do you access Twitter’s API, collect a stream of tweets, and analyze the retrieved data? Which potentials, challenges, and limitations for social scientific research come along with using Twitter data? This <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/categories/tutorials/">Methods Bites Tutorial</a> by <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/page/team/">Denis Cohen</a>, based on a workshop by <a href="https://www.simon-kuehne.de">Simon Kühne</a> (Bielefeld University) in the <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/page/events/">MZES Social Science Data Lab</a> in Spring 2019, aims to tackle these questions.</p>
<p>After reading this blog post and engaging with the applied exercises, readers should:</p>
<ul>
<li>be able to collect Twitter data using R</li>
<li>be able to perform exploratory analyses of the data using R</li>
<li>have a better understanding of Twitter data, and thus, the potentials and limitations of using it in research projects</li>
</ul>
<p>You can use these links to navigate across the four main sections of this tutorial:</p>
<ol style="list-style-type: decimal">
<li><a href="#about-twitter">About Twitter</a></li>
<li><a href="#collecting-twitter-data">Collecting Twitter Data</a></li>
<li><a href="#analyzing-twitter-data">Analyzing Twitter Data</a></li>
<li><a href="#potential-issues-and-challenges">Potential Issues and Challenges</a></li>
</ol>
<p><em>Note:</em> This blog post presents Simon’s workshop materials in condensed form. The complete workshop materials, including slides and scripts, are available from our <a href="https://github.com/SocialScienceDataLab/twitter">GitHub</a>.</p>
<div id="about-twitter" class="section level3">
<h3>About Twitter</h3>
<p>Twitter is an online news and social networking service, also used for micro-blogging. In everyday use, it mostly serves as a platform for publicly sharing short texts – often along with media content and/or links – in the form of so-called “tweets”. Twitter has approx. 326 million monthly active users who send about 500 million tweets each day (see this <a href="https://s22.q4cdn.com/826641620/files/doc_financials/2018/q3/TWTR-Q3_18_InvestorFactSheet.pdf">fact sheet</a>).</p>
<div id="the-basics" class="section level5">
<h5>The Basics</h5>
<ul>
<li>Each user has a profile (page) and can add a photo and information about themselves</li>
<li>Users can <em>follow</em> each other</li>
<li>Users can <em>tweet</em>, i.e., publicly share a text/photo/link</li>
<li>Each tweet is restricted to a maximum of 280 characters</li>
<li>Users can interact with a tweet via <em>comments (replies), likes,</em> and <em>shares (retweets)</em></li>
<li>Users can interact with other users via <em>direct messaging</em></li>
<li>Users can create a <em>thread:</em> A series of connected tweets</li>
<li>Users use <em>hashtags</em> (e.g., #mannheim) in order to associate their tweets with certain topics and to make them easier to find</li>
<li>Users can search for keywords/hashtags in order to find relevant tweets and users</li>
</ul>
</div>
<div id="twitter-in-social-science-research" class="section level5">
<h5>Twitter in Social Science Research</h5>
<p>Analyzing tweets and social interaction on Twitter can help to answer social science research questions, especially in communication research and political science. In contrast to Facebook (API deprecation/shut-down in April 2018) and Instagram (API deprecation/shut-down in December 2018), Twitter data is (easily) accessible for researchers.</p>
<p>According to the <a href="https://login.webofknowledge.com/">Web of Science</a> database, there are 2,598 journal articles published since 2009 with the word “Twitter” in their titles.</p>
<div class="figure" style="text-align: center"><span id="fig:img1"></span>
<img src="/../../../../../article/collecting-and-analyzing-twitter-using-r/img/twitter_wos_1.PNG" alt="Number of articles by year"  />
<p class="caption">
Figure 1: Number of articles by year
</p>
</div>
<div class="figure" style="text-align: center"><span id="fig:img2"></span>
<img src="/../../../../../article/collecting-and-analyzing-twitter-using-r/img/twitter_wos_2.PNG" alt="Number of articles by discipline"  />
<p class="caption">
Figure 2: Number of articles by discipline
</p>
</div>
</div>
</div>
<div id="collecting-twitter-data" class="section level3">
<h3>Collecting Twitter Data</h3>
<div id="the-twitter-api-platform" class="section level5">
<h5>The Twitter API Platform</h5>
<p>An API (Application Programming Interface) allows users to access (real-time) Twitter data. Twitter offers <a href="https://developer.twitter.com/en/docs.html">a variety of API services</a> – some for free, others not. Using these services, one can search for tweets published in the past, stream tweets in realtime, and manage Twitter accounts and ads. The following exercises will focus on the free-of-charge API service, which is used in the vast majority of research projects: <a href="https://developer.twitter.com/en/docs/tweets/filter-realtime/overview">The Realtime Streaming API</a>.</p>
</div>
<div id="the-streaming-api---collecting-tweets-in-realtime" class="section level5">
<h5>The Streaming API - Collecting Tweets in Realtime</h5>
<p><em>“Establishing a connection to the streaming APIs means making a very long lived HTTP request, and parsing the response incrementally. Conceptually, you can think of it as downloading an infinitely long file over HTTP”</em>. This way, we can receive up to 1% of all tweets worldwide. As a query is usually restricted to selected keywords or geographic areas, you will typically be able to collect (almost) all tweets relevant to your research interest. There are three filter parameters that you can use:</p>
<ul>
<li>‘Follow’: Receive tweets of up to 5,000 users</li>
<li>‘Track’: Receive tweets that contain up to 400 keywords</li>
<li>‘Location’: Receive tweets from within a set of up to 25 geographic bounding boxes</li>
</ul>
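<p>As a minimal sketch of how such a limit might be checked before opening a stream, the helper <code>build_track_query()</code> below joins keywords into a single comma-separated string, similar to the one passed as <code>q</code> in the examples further down. The helper and its name are hypothetical and not part of <code>rtweet</code>; the limit of 400 keywords is taken from the list above.</p>
<pre class="r"><code># Hypothetical helper (not part of rtweet): build a comma-separated
# keyword query and enforce the 400-keyword limit listed above
build_track_query &lt;- function(keywords, max_keywords = 400) {
  # Trim whitespace, drop empty strings and duplicates
  keywords &lt;- unique(trimws(keywords))
  keywords &lt;- keywords[nzchar(keywords)]
  if (length(keywords) &gt; max_keywords) {
    stop(&quot;Too many keywords: &quot;, length(keywords), &quot; &gt; &quot;, max_keywords)
  }
  paste(keywords, collapse = &quot;,&quot;)
}

track_query &lt;- build_track_query(c(&quot;trump&quot;, &quot;donald trump&quot;))</code></pre>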
</div>
<div id="api-authentification" class="section level5">
<h5>API Authentication</h5>
<p>You need to authenticate yourself when making requests to the Twitter API. Twitter uses the <a href="https://oauth.net/">OAuth protocol</a>, an <a href="https://oauth.net/"><em>“open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications”</em></a>. This involves four necessary steps:</p>
<ul>
<li>creating a Twitter account</li>
<li>logging in to your Twitter account via <a href="https://developer.twitter.com/" class="uri">https://developer.twitter.com/</a></li>
<li>creating an app</li>
<li>creating keys, access token &amp; secret</li>
</ul>
</div>
<div id="data-collection" class="section level5">
<h5>Data Collection</h5>
<p>There are a number of ways to collect Twitter data, including writing your own script to make continuous HTTP requests, Python’s <code>tweepy</code> package, and R’s <code>rtweet</code> package. The following demonstrates how to collect Twitter data using different Streaming API endpoints and the <a href="https://rtweet.info/reference/stream_tweets.html"><code>rtweet</code></a> package.</p>
</div>
<div id="collecting-tweets-using-the-rtweet-package" class="section level5">
<h5>Collecting Tweets Using the rtweet Package</h5>
<p>As usual, we start with a little housekeeping: installing the required packages with <code>install.packages()</code> and specifying a working directory using <code>setwd()</code>.</p>
<details>
<p><summary>Code: Setup</summary></p>
<pre class="r"><code># Install Packages
install.packages(&quot;rtweet&quot;)
install.packages(&quot;ggmap&quot;)
install.packages(&quot;igraph&quot;)
install.packages(&quot;ggraph&quot;)
install.packages(&quot;tidytext&quot;)
install.packages(&quot;ggplot2&quot;)
install.packages(&quot;dplyr&quot;)
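install.packages(&quot;readr&quot;)
install.packages(&quot;tidyr&quot;)    # used below via tidyr::separate()
install.packages(&quot;stringr&quot;)  # used below via stringr::str_replace_all()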

# Set Working Directory
setwd(&quot;/PATH&quot;)</code></pre>
</details>
<p><br />
The code chunk below then illustrates three examples of collecting Twitter data using the <code>rtweet</code> package. After loading the required package with <code>library()</code>, we specify our authentication token with <code>create_token()</code> (see <a href="#api-authentification">API Authentication</a> above). After this, we are live: Using the function <code>stream_tweets()</code>, we collect:</p>
<ol style="list-style-type: decimal">
<li>a sample of current tweets for <code>timeout = 10</code> seconds.</li>
<li>a sample of current tweets containing either of the keywords <code>q = "trump, donald trump"</code> for <code>timeout = 30</code> seconds.</li>
<li>a sample of current tweets for <code>timeout = 180</code> seconds from a specific location (in this instance, Berlin), restricting our search to a rectangular area defined by coordinates for longitude and latitude. Use, for instance, <a href="https://boundingbox.klokantech.com/" class="uri">https://boundingbox.klokantech.com/</a> to retrieve the coordinates of your choosing.</li>
</ol>
<details>
<p><summary>Code: Data Collection</summary></p>
<pre class="r"><code># Open Libraries
library(rtweet)

# Specify the authentication tokens provided in your Twitter app
create_token(
  app = &quot;APPNAME&quot;,
  consumer_key = &quot;consumer_key&quot;,
  consumer_secret = &quot;consumer_secret&quot;,
  access_token = &quot;access_token&quot;,
  access_secret = &quot;access_secret&quot;
)

# Collect a &#39;random&#39; sample of Tweets for 10 seconds
stream_tweets(
  q = &quot;&quot;,
  timeout = 10,
  file_name = &quot;sample.json&quot;,
  parse = FALSE
)
sample &lt;- parse_stream(&quot;sample.json&quot;)
save(sample, file = &quot;sample_live.Rda&quot;)

# Collect Tweets that contain specific keywords for 30 seconds
stream_tweets(
  q = &quot;trump, donald trump&quot;,
  timeout = 30,
  file_name = &quot;trump.json&quot;,
  parse = FALSE
)
trump &lt;- parse_stream(&quot;trump.json&quot;)
save(trump, file = &quot;trump_live.Rda&quot;)

# Collect Tweets from a specific location
stream_tweets(
  c(13.0883,52.3383,13.7612,52.6755), 
  timeout = 180,
  file_name = &quot;berlin.json&quot;,
  parse = FALSE
)
berlin &lt;- parse_stream(&quot;berlin.json&quot;)
save(berlin, file = &quot;berlin_live.Rda&quot;)</code></pre>
</details>
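<p>Hard-coding keys in a script makes it easy to share them by accident. One possible alternative, sketched below, reads them from environment variables (set, for instance, in <code>~/.Renviron</code>). The helper <code>read_twitter_keys()</code> and the variable names such as <code>TWITTER_CONSUMER_KEY</code> are assumptions made for illustration, not an <code>rtweet</code> convention.</p>
<details>
<p><summary>Code: Reading Keys from Environment Variables (Sketch)</summary></p>
<pre class="r"><code># Hypothetical helper: read API credentials from environment variables
# (the variable names are illustrative assumptions)
read_twitter_keys &lt;- function() {
  vars &lt;- c(
    consumer_key = &quot;TWITTER_CONSUMER_KEY&quot;,
    consumer_secret = &quot;TWITTER_CONSUMER_SECRET&quot;,
    access_token = &quot;TWITTER_ACCESS_TOKEN&quot;,
    access_secret = &quot;TWITTER_ACCESS_SECRET&quot;
  )
  # Unset variables are returned as NA
  keys &lt;- Sys.getenv(vars, unset = NA)
  names(keys) &lt;- names(vars)
  if (anyNA(keys)) {
    stop(&quot;Missing environment variables: &quot;,
         paste(vars[is.na(keys)], collapse = &quot;, &quot;))
  }
  as.list(keys)
}</code></pre>
</details>
<p>The elements of the returned list can then be supplied to the corresponding arguments of <code>create_token()</code>.</p>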
<p><br />
By running the <code>stream_tweets()</code> function, we receive Tweets and related meta-information from the Twitter API. The data is stored in .json format (JavaScript Object Notation), though we can store these files as data frames after parsing them with <code>parse_stream()</code>. Each row in the data frame represents a Tweet or Retweet and contains, among other things, the following information:</p>
<ul>
<li>The content of a Tweet + Tweet-URL + Tweet-ID</li>
<li>User-name + User-ID</li>
<li>Time-stamp</li>
<li>Place, country, geocodes (rarely)</li>
<li>User self-description, residence, no. of followers, no. of friends</li>
<li>URLs to images, videos</li>
</ul>
<p>For illustration, we take the example of the data frame <code>sample</code>, the result of our first query. As we can see, our 10-second sample without specified keywords contains 88 variables (columns) and 359 tweets (rows). You can see the full list of variables below, along with a small anonymized portion of five English-language tweets (you can only see the first few characters of each tweet, stored in the variable <code>sample$text</code>).</p>
<details>
<p><summary>Code: Viewing the Data</summary></p>
<pre class="r"><code># Dimensions of the data frame
dim(sample)</code></pre>
<pre><code>## [1] 359  88</code></pre>
<pre class="r"><code># Variables in the data frame
names(sample)</code></pre>
<pre><code>##  [1] &quot;user_id&quot;                 &quot;status_id&quot;              
##  [3] &quot;created_at&quot;              &quot;screen_name&quot;            
##  [5] &quot;text&quot;                    &quot;source&quot;                 
##  [7] &quot;display_text_width&quot;      &quot;reply_to_status_id&quot;     
##  [9] &quot;reply_to_user_id&quot;        &quot;reply_to_screen_name&quot;   
## [11] &quot;is_quote&quot;                &quot;is_retweet&quot;             
## [13] &quot;favorite_count&quot;          &quot;retweet_count&quot;          
## [15] &quot;hashtags&quot;                &quot;symbols&quot;                
## [17] &quot;urls_url&quot;                &quot;urls_t.co&quot;              
## [19] &quot;urls_expanded_url&quot;       &quot;media_url&quot;              
## [21] &quot;media_t.co&quot;              &quot;media_expanded_url&quot;     
## [23] &quot;media_type&quot;              &quot;ext_media_url&quot;          
## [25] &quot;ext_media_t.co&quot;          &quot;ext_media_expanded_url&quot; 
## [27] &quot;ext_media_type&quot;          &quot;mentions_user_id&quot;       
## [29] &quot;mentions_screen_name&quot;    &quot;lang&quot;                   
## [31] &quot;quoted_status_id&quot;        &quot;quoted_text&quot;            
## [33] &quot;quoted_created_at&quot;       &quot;quoted_source&quot;          
## [35] &quot;quoted_favorite_count&quot;   &quot;quoted_retweet_count&quot;   
## [37] &quot;quoted_user_id&quot;          &quot;quoted_screen_name&quot;     
## [39] &quot;quoted_name&quot;             &quot;quoted_followers_count&quot; 
## [41] &quot;quoted_friends_count&quot;    &quot;quoted_statuses_count&quot;  
## [43] &quot;quoted_location&quot;         &quot;quoted_description&quot;     
## [45] &quot;quoted_verified&quot;         &quot;retweet_status_id&quot;      
## [47] &quot;retweet_text&quot;            &quot;retweet_created_at&quot;     
## [49] &quot;retweet_source&quot;          &quot;retweet_favorite_count&quot; 
## [51] &quot;retweet_retweet_count&quot;   &quot;retweet_user_id&quot;        
## [53] &quot;retweet_screen_name&quot;     &quot;retweet_name&quot;           
## [55] &quot;retweet_followers_count&quot; &quot;retweet_friends_count&quot;  
## [57] &quot;retweet_statuses_count&quot;  &quot;retweet_location&quot;       
## [59] &quot;retweet_description&quot;     &quot;retweet_verified&quot;       
## [61] &quot;place_url&quot;               &quot;place_name&quot;             
## [63] &quot;place_full_name&quot;         &quot;place_type&quot;             
## [65] &quot;country&quot;                 &quot;country_code&quot;           
## [67] &quot;geo_coords&quot;              &quot;coords_coords&quot;          
## [69] &quot;bbox_coords&quot;             &quot;status_url&quot;             
## [71] &quot;name&quot;                    &quot;location&quot;               
## [73] &quot;description&quot;             &quot;url&quot;                    
## [75] &quot;protected&quot;               &quot;followers_count&quot;        
## [77] &quot;friends_count&quot;           &quot;listed_count&quot;           
## [79] &quot;statuses_count&quot;          &quot;favourites_count&quot;       
## [81] &quot;account_created_at&quot;      &quot;verified&quot;               
## [83] &quot;profile_url&quot;             &quot;profile_expanded_url&quot;   
## [85] &quot;account_lang&quot;            &quot;profile_banner_url&quot;     
## [87] &quot;profile_background_url&quot;  &quot;profile_image_url&quot;</code></pre>
<pre class="r"><code># An anonymized portion of the data frame, only tweets in English
sample$user_id &lt;- seq(1, nrow(sample), 1)
sample$screen_name &lt;- paste(&quot;name&quot;, seq(1, nrow(sample), 1), sep = &quot;_&quot;)
sample &lt;- subset(sample, lang == &quot;en&quot;)
sample[1:5, c(&quot;user_id&quot;, &quot;created_at&quot;, &quot;screen_name&quot;, &quot;text&quot;, &quot;is_quote&quot;, 
              &quot;is_retweet&quot;)]</code></pre>
<pre><code>## # A tibble: 5 x 6
##   user_id created_at          screen_name text          is_quote is_retweet
##     &lt;dbl&gt; &lt;dttm&gt;              &lt;chr&gt;       &lt;chr&gt;         &lt;lgl&gt;    &lt;lgl&gt;     
## 1       1 2019-04-03 10:22:32 name_1      just see..❤️~ FALSE    FALSE     
## 2       4 2019-04-03 10:22:33 name_4      I’m slowy gi~ FALSE    TRUE      
## 3      15 2019-04-03 10:22:33 name_15     &quot;\&quot;I didn&#39;t ~ TRUE     FALSE     
## 4      20 2019-04-03 10:22:33 name_20     Hmm. I’ma de~ FALSE    TRUE      
## 5      28 2019-04-03 10:22:33 name_28     I still hear~ TRUE     TRUE</code></pre>
</details>
<p><br /></p>
</div>
</div>
<div id="analyzing-twitter-data" class="section level3">
<h3>Analyzing Twitter Data</h3>
<p>We have now collected Tweets and meta-information based on selected keywords and/or regional parameters. This raises the question of what to do with the raw data. Some common research interests include:</p>
<ul>
<li><em>Content Analysis:</em> What kind of topics are users talking about?</li>
<li><em>Sentiment Analysis:</em> What kind of opinions, attitudes, and emotions towards objects are users communicating?</li>
<li><em>Network Analysis:</em> Who is related to whom? Who are important users?</li>
<li><em>Geospatial Analysis:</em> Where are users/Tweets coming from?</li>
</ul>
<p>In the following, we present two quick examples that showcase possible avenues for the analysis of Twitter data.</p>
<div id="example-1-prepping-tweets-for-text-analysis" class="section level5">
<h5>Example 1: Prepping Tweets for Text Analysis</h5>
<p>Quantitative Text Analysis (QTA) typically relies on pre-processed textual data (see <a href="/../../../../../article/quantitative-analysis-of-political-text/">this blog post</a> for our SSDL workshop on QTA by <a href="https://denisetraber.net/">Denise Traber</a>). The code chunk below illustrates how a collection of Tweets can easily be prepared for QTA techniques such as content or sentiment analysis.</p>
<p>In this example, we use our 30-second sample of Tweets containing the keywords “donald trump” and/or “trump”, which we stored as <code>trump_live.Rda</code> (see <a href="#collecting-tweets-using-the-rtweet-package">Collecting Tweets Using the rtweet Package</a> above). Tweet contents are stored in the variable <code>trump$text</code>. Using the <code>tidytext</code> and <code>dplyr</code> packages, we then process the Tweets into a text corpus as follows:</p>
<ol style="list-style-type: decimal">
<li>Remove URLs from all tweets using <code>gsub()</code></li>
<li>Remove punctuation, convert to lowercase, and separate all words using <code>unnest_tokens()</code></li>
<li>Remove <a href="https://en.wikipedia.org/wiki/Stop_words">stop words</a> by first loading a list of stop words from the <code>tidytext</code> package via <code>data("stop_words")</code> and then removing these words from our tweets via <code>anti_join(stop_words)</code></li>
</ol>
<details>
<p><summary>Code: Preparing Data for Content Analysis</summary></p>
<pre class="r"><code># Open Libraries
library(tidytext)
library(dplyr)

# Data Cleaning
# Delete links in the Tweets (each URL up to the next whitespace,
# so that any text following a link is kept)
trump$text &lt;- gsub(&quot;http\\S+&quot;, &quot;&quot;, trump$text)
# Restore ampersands that arrive HTML-escaped
trump$text &lt;- gsub(&quot;&amp;amp;&quot;, &quot;&amp;&quot;, trump$text)

# Remove punctuation, convert to lowercase, separate all words
trump_clean &lt;- trump %&gt;%
  dplyr::select(text) %&gt;%
  unnest_tokens(word, text)

# Load list of stop words - from the tidytext package
data(&quot;stop_words&quot;)

# Remove stop words from your list of words
cleaned_tweet_words &lt;- trump_clean %&gt;%
  anti_join(stop_words, by = &quot;word&quot;)</code></pre>
</details>
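<p>To see what these cleaning steps do, the following sketch applies them to two made-up tweets using base R only. The example tweets and the tiny stop word list are invented for illustration, and <code>strsplit()</code> merely stands in for <code>unnest_tokens()</code>; it will not reproduce that function's tokenization exactly.</p>
<details>
<p><summary>Code: Cleaning Steps on Toy Data (Sketch)</summary></p>
<pre class="r"><code># Two made-up tweets (illustrative data only)
toy &lt;- c(&quot;Read this https://t.co/abc123 now!&quot;, &quot;Trump &amp;amp; the press&quot;)

# Remove URLs (only the URL itself, up to the next whitespace)
toy &lt;- gsub(&quot;http\\S+&quot;, &quot;&quot;, toy)
# Restore HTML-escaped ampersands
toy &lt;- gsub(&quot;&amp;amp;&quot;, &quot;&amp;&quot;, toy)

# Strip punctuation, convert to lowercase, split into words
words &lt;- unlist(strsplit(tolower(gsub(&quot;[[:punct:]]&quot;, &quot; &quot;, toy)), &quot;\\s+&quot;))
words &lt;- words[nzchar(words)]

# Remove stop words (tiny invented list, not the tidytext list)
toy_stop_words &lt;- c(&quot;this&quot;, &quot;the&quot;, &quot;now&quot;)
toy_words &lt;- words[!words %in% toy_stop_words]
toy_words</code></pre>
</details>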
<p><br />
Following this, we can calculate the word counts in our Tweet collection and visualize the 15 most frequent words:</p>
<details>
<p><summary>Code: Plotting the 15 Most Frequent Words</summary></p>
<pre class="r"><code># Open Libraries
library(ggplot2)

# Plot the top 15 words
cleaned_tweet_words %&gt;%
  count(word, sort = TRUE) %&gt;%
  top_n(15) %&gt;%
  mutate(word = reorder(word, n)) %&gt;%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = &quot;Count&quot;,
       x = &quot;Unique words&quot;,
       title = &quot;Count of unique words found in tweets&quot;,
       subtitle = &quot;Stop words removed from the list&quot;)</code></pre>
</details>
<p><br /></p>
<pre><code>## Selecting by n</code></pre>
<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-3"></span>
<img src="/../../../../../article/collecting-and-analyzing-twitter-using-r_files/figure-html/unnamed-chunk-3-1.png" alt="Top 15 Most Frequent Words" width="672" />
<p class="caption">
Figure 3: Top 15 Most Frequent Words
</p>
</div>
<p><br /></p>
</div>
<div id="example-2-analyzing-locations" class="section level5">
<h5>Example 2: Analyzing Locations</h5>
<p>Provided that users share their exact geo-information, we can locate their tweets on geographical maps. For example, in our 3 minute sample of Tweets from <code>berlin</code> (see <a href="#collecting-tweets-using-the-rtweet-package">Collecting Tweets Using the rtweet Package</a> above), this information was available for only two Tweets.</p>
<p>The code below demonstrates how one can view these Tweets mapped onto their exact geographical locations. We first pre-process the data, extracting numerical geo-coordinates for longitude and latitude where this information is available. We then subset the data frame to observations where this information exists. Lastly, using the same geo-coordinates as in the data collection, we set up an empty rectangular map of Berlin. Note that the last step involves accessing the <a href="https://cloud.google.com/maps-platform/#get-started">Google Maps API</a>, for which you will have to register separately (see, e.g., <a href="https://www.r-bloggers.com/geocoding-with-ggmap-and-the-google-api/">this post</a> at R-bloggers).</p>
<details>
<p><summary>Code: Analyzing Locations - Setup</summary></p>
<pre class="r"><code>library(ggmap)
library(dplyr)

# Separate Geo-Information (Lat/Long) Into Two Variables
berlin &lt;- tidyr::separate(data = berlin,
                          col = geo_coords,
                          into = c(&quot;Latitude&quot;, &quot;Longitude&quot;),
                          sep = &quot;,&quot;,
                          remove = FALSE)

# Remove Parentheses
berlin$Latitude &lt;- stringr::str_replace_all(berlin$Latitude, &quot;[c(]&quot;, &quot;&quot;)
berlin$Longitude &lt;- stringr::str_replace_all(berlin$Longitude, &quot;[)]&quot;, &quot;&quot;)

# Store as numeric
berlin$Latitude &lt;- as.numeric(berlin$Latitude)
berlin$Longitude &lt;- as.numeric(berlin$Longitude)

# Keep only those tweets where geo information is available
berlin &lt;- subset(berlin, !is.na(Latitude) &amp; !is.na(Longitude))

# Set up empty map
berlin_map &lt;- get_map(location = c(lon = mean(c(13.0883, 13.7612)),
                                   lat = mean(c(52.3383, 52.6755))),
                      zoom = 10,
                      maptype = &quot;terrain&quot;,
                      source = &quot;google&quot;)</code></pre>
</details>
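<p>The string handling above is easier to follow on a single toy value. The sketch below assumes that a coordinate pair arrives as text of the form <code>c(52.52, 13.405)</code> with latitude first, which is what the <code>separate()</code> call above relies on; the value itself is made up.</p>
<details>
<p><summary>Code: Parsing a Coordinate String (Sketch)</summary></p>
<pre class="r"><code># A made-up coordinate string in the assumed &quot;c(lat, lon)&quot; format
geo &lt;- &quot;c(52.52, 13.405)&quot;

# Drop &quot;c&quot;, &quot;(&quot; and &quot;)&quot;, then split at the comma
parts &lt;- strsplit(gsub(&quot;[c()]&quot;, &quot;&quot;, geo), &quot;,&quot;)[[1]]
lat &lt;- as.numeric(parts[1])
lon &lt;- as.numeric(parts[2])

# Check whether the point lies inside the Berlin bounding box
# used for data collection (lon_min, lat_min, lon_max, lat_max)
bbox &lt;- c(13.0883, 52.3383, 13.7612, 52.6755)
in_bbox &lt;- lon &gt;= bbox[1] &amp;&amp; lon &lt;= bbox[3] &amp;&amp;
  lat &gt;= bbox[2] &amp;&amp; lat &lt;= bbox[4]
in_bbox</code></pre>
</details>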
<p><br />
Building on this, we can then map our collected tweets to see where they were sent from: Berlin-Mitte and Potsdam.</p>
<details>
<p><summary>Code: Analyzing Locations - Map</summary></p>
<pre class="r"><code>tweet_map &lt;- ggmap(berlin_map)
tweet_map + geom_point(data = berlin,
                       aes(x = Longitude,
                           y = Latitude),
                       color = &quot;red&quot;,
                       size = 4,
                       alpha = .5)</code></pre>
</details>
<br />
<div class="figure" style="text-align: center"><span id="fig:img3"></span>
<img src="/../../../../../article/collecting-and-analyzing-twitter-using-r/img/tweets_map.png" alt="GPS Coordinates of 2 Tweets"  />
<p class="caption">
Figure 4: GPS Coordinates of 2 Tweets
</p>
</div>
</div>
</div>
<div id="potential-issues-and-challenges" class="section level3">
<h3>Potential Issues and Challenges</h3>
<div id="bias-and-representativity" class="section level5">
<h5>Bias and Representativity</h5>
<p>Twitter users do not represent a random sample from a given population. This is not only due to the presence of bots and company or institutional accounts, but also to the manifold self-selection processes that using Twitter entails:</p>
<p>Population → Internet Users → Twitter Users → Active Twitter Users → Users sharing geo-information</p>
<p>Below are some references that analyze the magnitude and severity of these (and related) problems:</p>
<ul>
<li><a href="https://journals.sagepub.com/doi/10.1177/0894439314558836">Barberá, P. &amp; G. Rivero, 2015: Understanding the Political Representativeness of Twitter Users. Social Science Computer Review 33(6) 712-729.</a></li>
<li><a href="https://journals.sagepub.com/doi/full/10.1177/2053168017720008">Mellon, J. &amp; C. Prosser, 2017: Twitter and Facebook are not representative of the general population: Political attitudes and demographics of British social media users. Research and Politics 2017: 1-9.</a></li>
<li><a href="https://www.nomos-elibrary.de/10.5771/1615-634X-2018-2-140/eine-meinungsstarke-minderheit-als-stimmungsbarometer-ueber-die-persoenlichkeitseigenschaften-aktiver-twitterer-jahrgang-66-2018-heft-2?page=1">Hölig, S., 2018: Eine meinungsstarke Minderheit als Stimmungsbarometer?! Über die Persönlichkeitseigenschaften aktiver Twitterer. M&amp;K Medien- und Kommunikationswissenschaft 66: 140-169.</a></li>
</ul>
</div>
<div id="replicability-and-black-box-twitter" class="section level5">
<h5>Replicability and Black-Box Twitter</h5>
<p>Real-time Twitter data collection is not reproducible for a given query. Furthermore, you can only hope that Twitter will provide you with a true random sample of Tweets. If completeness is crucial for your research interest, you will have to pay for complete access to all Tweets ever tweeted: <a href="https://developer.twitter.com/en/docs/tutorials/choosing-historical-api.html"><em>“Both Historical PowerTrack and Full-Archive Search provide access to any publicly available Tweet, starting with the first Tweet from March 2006”</em></a>.</p>
</div>
<div id="data-privacy-and-research-ethics" class="section level5">
<h5>Data Privacy and Research Ethics</h5>
<p>Tweets on public Twitter profiles are generally available. There are no measures in place that prevent the collection and analysis of the data, and users’ consent for the collection and processing of their tweets and profile information is usually not required. This practice is also consistent with certain guidelines for academic and commercial social media research (e.g., <a href="http://rat-marktforschung.de/fileadmin/user_upload/pdf/R11_RDMS_D.pdf">DGOF Richtlinien zur Social Media Forschung</a>: <em>“In open social media or the corresponding areas, participants’ personal data may, as a rule, be processed and used for the purposes of market and social research without explicit consent, on the basis of the statutory permission.”</em>, translated from the German).</p>
<p>However, Twitter’s terms of service are not necessarily consistent with German or EU data protection regulations (e.g., the GDPR, in German: DSGVO).
Ultimately, this leaves the ethical and legal questions of how to ensure data privacy to us as researchers. Should we, for instance, further anonymize data, e.g. by separating user IDs from Tweet content and meta-information? What are the implications for open science? Should the full data, including user IDs and geo locations, be publicly and permanently shared (e.g., as part of replication materials)? Should the answer to these questions be the same for regular users as opposed to public figures (e.g., politicians)? These questions highlight the urgent need for ongoing discussion about these topics.</p>
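<p>One way of separating user IDs from Tweet content could look like the sketch below, which replaces IDs with arbitrary pseudonyms and keeps the lookup table apart from the data to be shared. The data and object names are made up for illustration. Note that pseudonymized IDs alone do not guarantee anonymity, since the text of a public Tweet can often be traced back to its author.</p>
<details>
<p><summary>Code: Pseudonymizing User IDs (Sketch)</summary></p>
<pre class="r"><code># Made-up tweets (illustrative data only)
tweets &lt;- data.frame(
  user_id = c(&quot;1001&quot;, &quot;1002&quot;, &quot;1001&quot;),
  text = c(&quot;first tweet&quot;, &quot;second tweet&quot;, &quot;third tweet&quot;),
  stringsAsFactors = FALSE
)

# Lookup table linking real user IDs to arbitrary pseudonyms;
# this table would be stored separately from the shared data
users &lt;- unique(tweets$user_id)
lookup &lt;- data.frame(
  user_id = users,
  pseudonym = paste0(&quot;user_&quot;, sample(seq_along(users))),
  stringsAsFactors = FALSE
)

# Shareable data: pseudonyms instead of user IDs
shared &lt;- tweets
shared$user_id &lt;- lookup$pseudonym[match(shared$user_id, lookup$user_id)]</code></pre>
</details>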
</div>
<div id="uncertainty-of-data-access" class="section level5">
<h5>Uncertainty of Data Access</h5>
<p>One should always keep in mind that data access depends entirely on Twitter’s willingness to share the data, and thereby also on the legislation by which Twitter must abide (think, for instance, about Article 13). Data access for research projects through Facebook’s and Instagram’s APIs has previously been shut down completely with only a few weeks’ notice. Given that, research projects relying on Twitter data are always risky. This applies particularly to research projects that depend on a constant influx of Twitter data over a long period of time (e.g., PhD projects).</p>
</div>
<div id="data-storage" class="section level5">
<h5>Data Storage</h5>
<p>Data storage can be an issue when tweets are collected over a long period of time. In many applications, data collections can easily amount to 100-200 GB per month. The use of powerful servers and storage in a relational database (e.g., an SQL database) is therefore recommended.</p>
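<p>As a rough sketch of what database storage might look like from R, the example below appends a batch of parsed tweets to a table and then queries it. It assumes the <code>DBI</code> and <code>RSQLite</code> packages are installed (neither is part of the setup above) and uses an in-memory SQLite database with made-up data; a long-running collection would more likely write to a database server or at least to a file.</p>
<details>
<p><summary>Code: Storing Tweets in a Database (Sketch)</summary></p>
<pre class="r"><code>library(DBI)

# Made-up tweets (illustrative data only)
batch &lt;- data.frame(
  status_id = c(&quot;1&quot;, &quot;2&quot;),
  created_at = c(&quot;2019-04-03 10:22:32&quot;, &quot;2019-04-03 10:22:33&quot;),
  text = c(&quot;first tweet&quot;, &quot;second tweet&quot;),
  stringsAsFactors = FALSE
)

# Connect to a database (in-memory here; use a file path for real data)
con &lt;- dbConnect(RSQLite::SQLite(), &quot;:memory:&quot;)

# Append each newly parsed batch to the same table
dbWriteTable(con, &quot;tweets&quot;, batch, append = TRUE)

# Query only what is needed instead of loading everything into memory
n_tweets &lt;- dbGetQuery(con, &quot;SELECT COUNT(*) AS n FROM tweets&quot;)$n

dbDisconnect(con)</code></pre>
</details>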
</div>
</div>
<div id="conclusion-twitter-in-the-social-sciences" class="section level3">
<h3>Conclusion: Twitter in the Social Sciences</h3>
<p>Collecting Twitter data is comparatively easy and cheap. However, we usually know close to nothing about the users whom we collect tweets from. Thus, the research potential for social science projects is very limited when we are interested in questions that address ‘outside-social-media’ phenomena.</p>
<p>Arguably, we benefit most from Twitter data when we treat it as auxiliary or proxy information, or when we use it in combination with other data sources. Potential applications along those lines include:</p>
<ul>
<li>Variation in fear of crime across different regions</li>
<li>Health monitoring over time</li>
<li>Monitoring highly discussed topics in realtime</li>
<li>Measuring existing stereotypes towards minorities</li>
</ul>
</div>
<div id="further-reading" class="section level3">
<h3>Further Reading</h3>
<ul>
<li><a href="https://mkearney.github.io/nicar_tworkshop/#1">Workshop on using rtweet by its developer Michael W. Kearney</a></li>
<li><a href="https://rtweet.info/index.html">rtweet documentation</a></li>
<li><a href="https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/text-mining-twitter-data-intro-r/">Text mining of tweets</a></li>
<li><a href="https://www.datascience.com/blog/beginners-guide-to-shiny-and-leaflet-for-interactive-mapping">Using Shiny and Leaflet</a></li>
<li><a href="https://www.halem-verlag.de/geospatial-analysis-of-social-media-data%e2%80%84-a-practical-framework-and-applications-using-twitter/">Rieder, Y. &amp; S. Kühne, 2018: Geospatial Analysis of Social Media Data - A Practical Framework and Applications. In: Stuetzer, C.M., Welker, M. &amp; M. Egger (Eds.), Computational Social Science in the Age of Big Data. Concepts, Methodologies, Tools, and Applications. DGOF Schriftenreihe, Köln: Herbert von Halem Verlag. URL: http://www.halem-verlag.de/computational-social-science-in-the-age-of-big-data/.</a></li>
</ul>
</div>
<div id="about-the-presenter" class="section level3">
<h3>About the Presenter</h3>
<p>Simon Kühne <a href="mailto:simon.kuehne@uni-bielefeld.de"><i class="fa fa-envelope"></i> </a> <a href="http://simon-kuehne.de/"><i class="fa fa-globe"></i> </a> <a href="https://twitter.com/SimonKuehne"><i class="fa fa-twitter"></i></a> is a post-doc at Bielefeld University. He holds a BA in Sociology and an MA in Survey Methodology from the University of Duisburg-Essen and a PhD in Sociology from Humboldt University of Berlin. His research focuses on survey methodology, social media and online data, and social inequality.</p>
</div>
]]>
      </description>
    </item>
    
    <item>
      <title>Visual Inference with R</title>
      <link>https://socialsciencedatalab.mzes.uni-mannheim.de/article/visinference/</link>
      <pubDate>Sun, 14 Jul 2019 03:03:03 +0100</pubDate>
      
      <guid>https://socialsciencedatalab.mzes.uni-mannheim.de/article/visinference/</guid>
      <description><![CDATA[
        </p>
<p>How can we use data visualization for hypothesis testing? This question lies at the heart of this <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/categories/tutorials/">Methods Bites Tutorial</a> by <a href="https://twitter.com/cosima_meyer">Cosima Meyer</a>, which is based on <a href="https://www.richardtraunmueller.com">Richard Traunmüller</a>’s workshop in the <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/page/events/">MZES Social Science Data Lab</a> in Fall 2017. We already covered the basic idea of visual inference in our blog post on <a href="https://socialsciencedatalab.mzes.uni-mannheim.de/article/datavis/">Data visualization with R</a>.</p>
<p>Note: This blog post presents Richard’s workshop materials in condensed form. The complete workshop materials are available from our <a href="https://github.com/SocialScienceDataLab/Visual_Inference">GitHub</a>.</p>
<div id="overview" class="section level3">
<h3>Overview</h3>
<ol style="list-style-type: decimal">
<li><a href="#what"><strong>What is visual inference?</strong></a></li>
<li><a href="#challenges"><strong>Potential challenges and how to overcome them</strong></a></li>
<li><a href="#stepbystep"><strong>Practical applications: How do we reveal the “true” data graphically? A step-by-step guide</strong></a>
<ol style="list-style-type: decimal">
<li><a href="#maps">Maps</a></li>
<li><a href="#scatterplot">Scatter plot</a></li>
<li><a href="#groupcomparison">Group comparisons</a></li>
</ol></li>
<li><a href="#furtherreadings"><strong>Further readings</strong></a></li>
</ol>
</div>
<div id="what-is-visual-inference" class="section level3">
<h3>What is visual inference? <a name="what"></a></h3>
<p>Visual inference uses our ability to detect graphical anomalies. The idea of formal testing remains the same in visual inference – with one exception: The test statistic is now a graphical display which is compared to a “reference distribution” of plots showing the null. Put differently, we plot both the “true pattern of the data” and additional random plots of our data. By comparing both, we should be able to identify the true data – if the pattern is not based on randomness. This approach can be applied to various (research) situations – some of them are described in the “Practical applications” section.</p>
<!-- ## Why do we need visual inference?  -->
</div>
<div id="potential-challenges-and-how-to-overcome-them" class="section level3">
<h3>Potential challenges and how to overcome them <a name="challenges"></a></h3>
<!-- > “Humans’ pattern recognition skills are amazing and the source of great insights, but sometimes they’re too good. We are so adept at finding patterns that we sometimes detect ones that aren’t really there” (Few 2009: 139). -->
<p><strong>Major concerns</strong> related to exploratory data analysis are its seemingly <strong>informal approach</strong> to data analysis and the potential <strong>over-interpretation of patterns</strong>. Richard provides a <strong>line-up protocol for overcoming these concerns</strong>:</p>
<blockquote>
<p><sub> 1. <strong>Identify the question</strong> the plot is trying to answer or the pattern it is intended to show.</sub><br />
<sub> 2. <strong>Formulate a null hypothesis</strong> (usually this will be <span class="math inline">\(H_0\)</span>: “There is no pattern in the plot.”)</sub><br />
<sub> 3. <strong>Generate and visualize null datasets</strong> (e.g., permutations of variable values, random simulations) </sub></p>
</blockquote>
<p>The following examples illustrate this procedure and explain the steps in detail.</p>
</div>
<div id="practical-applications-how-do-we-reveal-the-true-data-graphically-a-step-by-step-guide" class="section level3">
<h3>Practical applications: How do we reveal the “true” data graphically? A step-by-step guide <a name="stepbystep"></a></h3>
<p>To reveal the “true” data, we may use several visual approaches. In the following, we present three different examples: <strong>1) maps</strong>, <strong>2) scatter plots</strong>, and <strong>3) group comparisons</strong>. The underlying logic follows the line-up protocol described above. To produce the visual inference, we always apply the following steps:</p>
<blockquote>
<p><sub> 1. <strong>Identify the question</strong>: ‘Is there a visual pattern?’</sub><br />
<sub> 2. <strong>Formulate a null hypothesis</strong>: ‘There is no visual pattern.’ </sub><br />
<sub> 3. <strong>Generate null datasets</strong>: Just randomly permute one variable column and plot the data. </sub><br />
<sub> 4. <strong>Add the “true” data</strong>: Add the true data to the null datasets. </sub><br />
<sub> 5. <strong>Visual inference</strong>: Is there a visual difference between the randomly permuted data and the “true” data? </sub></p>
</blockquote>
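<p>Before turning to real data, the five steps can be sketched on simulated data with base R graphics. Everything in this block is an illustrative assumption: the data are simulated, the seed is arbitrary, and a plain scatter plot stands in for whatever display is of interest. The 19 null panels permute <code>y</code>, and the true data are drawn last as panel 20, whose position in the grid is random.</p>
<pre class="r"><code># Simulated data with a built-in association (illustrative only)
set.seed(123)
x &lt;- rnorm(100)
y &lt;- 0.5 * x + rnorm(100)

# Randomly assign the 20 panels of a 4 x 5 grid;
# the 20th plot drawn will hold the true data
placement &lt;- sample(1:20, 20)
layout(matrix(placement, 4, 5))
par(mar = c(.5, .5, .5, .5))

# Plots 1 to 19: null plots, y permuted to break any association with x
for (i in 1:19) {
  plot(x, sample(y), axes = FALSE, xlab = &quot;&quot;, ylab = &quot;&quot;, pch = 19, cex = .3)
  box()
}

# Plot 20: the true data
plot(x, y, axes = FALSE, xlab = &quot;&quot;, ylab = &quot;&quot;, pch = 19, cex = .3)
box()</code></pre>
<p>If a viewer who does not know the placement can pick out the true panel, this speaks against the null hypothesis of no pattern.</p>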
<div id="maps" class="section level5">
<h5>1) Maps <a name="maps"></a></h5>
<p>This map provides an intuitive understanding of how to apply the line-up protocol to a real-world example. Richard uses data from the GLES (German Longitudinal Election Study) as an example to <strong>analyze interviewer selection effects</strong>. These biases arise if interviewers selectively contact certain households and fail to reach out to others. One reason might be that interviewers try to avoid less comfortable areas.</p>
<p>As a first step, we need to load the required packages, read in the data, and code the interviewer behavior by color.</p>
<pre class="r"><code># Read all required packages
library(maps)
library(mapdata)
library(RColorBrewer)

# Read data
data &lt;- readRDS(&quot;sub_data.rds&quot;)

# Code interviewer behavior by color
data$col &lt;-
  ifelse(data$status == &quot;No Contact&quot;, &quot;maroon3&quot;, &quot;darkolivegreen2&quot;)</code></pre>
<p>Following the line-up protocol described above, we seek to answer the question <strong>whether there is a visual pattern</strong>. Our <strong>null hypothesis</strong> assumes that there is <strong>no visual pattern</strong>. To generate the <strong>null datasets</strong>, we <strong>randomly permute one variable column</strong> and <strong>plot the data</strong>.</p>
<pre class="r"><code># Generate random plot placement
placement &lt;- sample((1:20), 20)
layout(matrix(placement, 4, 5))

# Generate 19 null plots
par(mar = c(.01, .01, .01, .01), oma = c(0, 0, 0, 0))
for(i in 1:19) {
  # Randomly permute the row order (one index per observation)
  random &lt;- sample(nrow(data))
  # Plot the base map
  map(
    # Map database from the mapdata package
    database = &quot;worldHires&quot;,
    fill = FALSE,
    col = &quot;darkgrey&quot;,
    # Range of x-axis (longitude)
    xlim = c(6, 15),
    # Range of y-axis (latitude)
    ylim = c(47.3, 55)
    ) 
  points(
    # Coordinates stay fixed
    data$g_lon,
    data$g_lat,
    cex = .1,
    # Type of plotting symbol
    pch = 19,
    # Colors (contact status) are permuted
    col = data$col[random]
    )
}</code></pre>
<p>We then proceed and <strong>add the true data to the null datasets</strong>.</p>
<pre class="r"><code># Add the true plot
map(
  database = &quot;worldHires&quot;,
  fill = F,
  col = &quot;darkgrey&quot;,
  # Range of x-axis
  xlim = c(6, 15),
  # Range of y-axis
  ylim = c(47.3, 55)
  ) 
points(
  # Refer to data
  data$g_lon,
  data$g_lat,
  cex = .1,
  # Type of plotting symbol
  pch = 19,
  col = data$col
  )

# Reveal the true plot
box(col = &quot;red&quot;, # Draw a box in red
    lty = 2, # Defines line type
    lwd = 2) # Defines line width
which(placement == 20) # Returns the grid position (counted column-wise) of the true plot</code></pre>
<p>Using the code above, we obtain <strong>twenty maps of Germany</strong>. In a last step, we ask whether these plots are substantially different from one another. If so, can you tell <strong>which one is the odd one out?</strong> Just wait for a few seconds to let the image reveal the answer.</p>
<p><img src="/../../../../../article/visinference/figures/maps.gif" style="display: block; margin: auto;" />
<!-- <center><b>Visual inference -- Can you find the odd-one-out?</b></center> --></p>
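<p>The logic of the line-up can also be expressed numerically. Under the null hypothesis, the “true” plot is indistinguishable from the null plots, so a single observer picks it with probability 1/20. The following sketch computes this probability and, under the additional assumption of several independent observers, the probability that at least a given number of them pick the “true” plot by chance. The number of observers and the number of correct picks are hypothetical values chosen for illustration.</p>
<pre class="r"><code># Number of plots in the line-up
m &lt;- 20

# Probability that one observer picks the &quot;true&quot; plot by chance
p_single &lt;- 1 / m

# Hypothetical values: 10 independent observers, 3 of whom pick the &quot;true&quot; plot
n_observers &lt;- 10
n_correct &lt;- 3

# Probability of at least n_correct correct picks under the null hypothesis
p_multiple &lt;- pbinom(n_correct - 1,
                     size = n_observers,
                     prob = 1 / m,
                     lower.tail = FALSE)</code></pre>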
</div>
<div id="scatter-plot" class="section level5">
<h5>2) Scatter plot <a name="scatterplot"></a></h5>
<p>Mimicking the approach for the maps, we proceed in a similar way with scatter plots.</p>
<p>Assume we have two variables and want to examine their correlation with a scatter plot. To assess whether their relationship could be due to chance alone, we can make use of visual inference. To do so, we first need to load all required packages and read in the data.</p>
<pre class="r"><code># Read required package
library(foreign) # Necessary to load datasets in other formats (such as .dta)

# Read the data
slop &lt;- read.dta(&quot;slop_2009_agg_example.dta&quot;)</code></pre>
<p>We then randomly assign the 20 plots to the cells of a 4x5 grid.</p>
<pre class="r"><code># Generate a random plot placement
placement &lt;- sample((1:20), 20)
layout(matrix(placement, 4, 5))</code></pre>
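<p>To see how this placement works, the following sketch builds the layout matrix for a fixed seed. <code>layout()</code> draws the k-th plot in the cell that contains the value k, so the 20th plot ends up in the cell where the matrix equals 20. The seed, the object names <code>cells</code>, <code>true_pos</code>, and <code>true_idx</code>, and the use of <code>arr.ind = TRUE</code> are additions for illustration.</p>
<pre class="r"><code># Fix the seed so that the placement is reproducible
set.seed(42)
placement &lt;- sample(1:20, 20)

# The layout matrix is filled column by column
cells &lt;- matrix(placement, 4, 5)
print(cells)

# Row and column of the cell that receives the 20th plot
true_pos &lt;- which(cells == 20, arr.ind = TRUE)

# The same cell as a single index, counted column-wise
true_idx &lt;- which(placement == 20)</code></pre>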
<p>We first draw the 19 null plots and leave one grid cell empty for the “true” plot.</p>
<!-- The following code first plots the 19 random scatter plots. -->
<details>
<p><summary>Code: Plotting nineteen random scatter plots</summary></p>
<pre class="r"><code># Plot 19 null plots
par(mar = c(.1, .1, .1, .1))
for(i in 1:19) {
  # Randomly permute the row order
  random &lt;- sample(nrow(slop))
  # Plot the permuted data as a scatter plot
  plot(slop$mkath[random],
       slop$cdu,
       axes = F,
       ann = F,
       cex = .4)
  # Plot a box with grey lines
  box(bty = &quot;l&quot;, 
      col = &quot;grey&quot;)
}</code></pre>
</details>
<p><img src="/../../../../../article/visinference/figures/scatter19.png" style="display: block; margin: auto;" /></p>
<p>As we can see, we get a 4x5 grid with 19 randomly permuted scatter plots and one empty cell. We now fill this empty cell with the “true” data and plot a box around it.</p>
<details>
<p><summary>Code: Adding and revealing the true data</summary></p>
<pre class="r"><code># Add true plot
plot(slop$mkath,
     slop$cdu,
     axes = F,
     ann = F,
     cex = .4)
box(bty = &quot;l&quot;, # Plot a box with grey lines
    col = &quot;grey&quot;)

# Reveal true plot
box(col = &quot;red&quot;, # Plot a box with red dashed lines
    lty = 2,
    lwd = 2)
which(placement == 20) # Return the grid position (counted column-wise) of the true plot</code></pre>
</details>
<p><img src="/../../../../../article/visinference/figures/scatter20.png" style="display: block; margin: auto;" /></p>
<p>We can even go one step further by adding a fitted regression line to the plots with abline(). To do this, we need to include the following line of code:</p>
<pre class="r"><code>abline(lm(slop$cdu ~ slop$mkath[random]))</code></pre>
<details>
<p><summary>Code: Adding an abline</summary></p>
<pre class="r"><code># Generate a random plot placement
placement &lt;- sample((1:20), 20)
layout(matrix(placement, 4, 5))

# Plot 19 null plots
par(mar = c(.1, .1, .1, .1))
for(i in 1:19) {
  # Randomly permute the row order
  random &lt;- sample(nrow(slop))
  # Plot the permuted data as a scatter plot
  plot(slop$mkath[random],
       slop$cdu,
       axes = F,
       ann = F,
       cex = .4)
  # Add the regression line to the plots
  abline(lm(slop$cdu ~ slop$mkath[random]))
  # Plot a box with grey lines
  box(bty = &quot;l&quot;,
      col = &quot;grey&quot;)
}
# Add true plot
plot(slop$mkath,
     slop$cdu,
     axes = F,
     ann = F,
     cex = .4)
abline(lm(slop$cdu ~ slop$mkath)) # Add the abline to the plot
box(bty = &quot;l&quot;, # Plot a box with grey lines
    col = &quot;grey&quot;)

# Reveal true plot
box(col = &quot;red&quot;, # Plot a box with red dashed lines
    lty = 2,
    lwd = 2)
which(placement == 20) # Return the grid position (counted column-wise) of the true plot</code></pre>
</details>
<p><img src="/../../../../../article/visinference/figures/scatter20abline.png" style="display: block; margin: auto;" /></p>
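<p>The visual comparison of regression lines has a simple numerical counterpart: we can compare the slope in the “true” data with the slopes obtained after permuting one variable. The following sketch does this with simulated data. The variables <code>x</code> and <code>y</code>, the sample size, and the number of permutations are assumptions for illustration and do not reproduce the data used above.</p>
<pre class="r"><code># Simulated (hypothetical) data with a positive relationship
set.seed(2023)
n &lt;- 200
x &lt;- rnorm(n)
y &lt;- 0.3 * x + rnorm(n)

# Slope in the &quot;true&quot; data
obs_slope &lt;- unname(coef(lm(y ~ x))[2])

# Slopes in the null datasets: permute x and refit the model
n_perm &lt;- 999
null_slopes &lt;- replicate(n_perm, {
  x_perm &lt;- sample(x)
  unname(coef(lm(y ~ x_perm))[2])
})

# Share of slopes at least as extreme as the observed one (two-sided),
# counting the observed slope itself as one of the permutations
p_value &lt;- (sum(abs(null_slopes) &gt;= abs(obs_slope)) + 1) / (n_perm + 1)</code></pre>
<p>A small share indicates that the slope in the “true” data would rarely arise from permuted data, which corresponds to a regression line that stands out in the line-up.</p>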
</div>
<div id="group-comparisons" class="section level5">
<h5>3) Group comparisons <a name="groupcomparison"></a></h5>
<p>This example lets us visually compare two groups. The dataset contains the vote share for the CDU as well as a dummy variable that indicates whether the constituency is in Bavaria. The following plot compares the vote share for the CDU in constituencies within Bavaria (purple) and outside of Bavaria (green).</p>
<p>As before, we generate a 4x5 grid with 19 null plots and the “true” plot. The following code first draws the 19 null plots, in which the Bavaria dummy is randomly permuted. It then fills the remaining cell with the “true” data and plots a box around it.</p>
<details>
<p><summary>Code: Create group comparison</summary></p>
<pre class="r"><code># Generate random plot placement
placement &lt;- sample((1:20), 20)
layout(matrix(placement, 4, 5))

# Plot 19 Null Plots
par(mar = c(.1, .1, .1, .1))
for (i in 1:19) {
  random &lt;- sample(c(1:dim(slop)[1]), dim(slop)[1])
  
  plot(
    slop$bayern[random],
    slop$cdu,
    axes = F,
    ann = F,
    cex = .4,
    xlim = c(-1, 2)
    )
  points(1,
         mean(slop$cdu[slop$bayern[random] == 1]),
         pch = &quot;-&quot;,
         col = &quot;purple4&quot;,
         cex = 3)
  points(0,
         mean(slop$cdu[slop$bayern[random] == 0]),
         pch = &quot;-&quot;,
         col = &quot;darkolivegreen2&quot;,
         cex = 3)
  box(bty = &quot;l&quot;, col = &quot;grey&quot;)
  
}

# Add true plot
plot(
  slop$bayern,
  slop$cdu,
  axes = F,
  ann = F,
  cex = .4,
  xlim = c(-1, 2)
  )
points(1,
       mean(slop$cdu[slop$bayern == 1]),
       pch = &quot;-&quot;,
       col = &quot;purple4&quot;,
       cex = 3)
points(0,
       mean(slop$cdu[slop$bayern == 0]),
       pch = &quot;-&quot;,
       col = &quot;darkolivegreen2&quot;,
       cex = 3)
box(bty = &quot;l&quot;, col = &quot;grey&quot;)

# Reveal True Plot
box(col = &quot;red&quot;, lty = 2, lwd = 2)
which(placement == 20)</code></pre>
</details>
<p><img src="/../../../../../article/visinference_files/figure-html/visual%20inference7-1.png" width="672" style="display: block; margin: auto;" /></p>
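<p>The three examples share the same structure, so the repeated code can be wrapped in a small helper function. The following is only a sketch: the function <code>lineup()</code> and its arguments are hypothetical, and the vectors <code>bayern</code> and <code>cdu</code> are simulated stand-ins for the variables used above.</p>
<pre class="r"><code># Hypothetical helper: draws m - 1 null plots and the &quot;true&quot; plot
# n_obs: number of observations; draw: function that plots the data for a
# given vector of row indices
lineup &lt;- function(n_obs, draw, m = 20, rows = 4, cols = 5) {
  stopifnot(rows * cols == m)
  # Generate random plot placement
  placement &lt;- sample(1:m, m)
  layout(matrix(placement, rows, cols))
  old_par &lt;- par(mar = c(.1, .1, .1, .1))
  on.exit(par(old_par))
  # Null plots: randomly permuted row indices
  for (i in 1:(m - 1)) {
    draw(sample(n_obs))
  }
  # True plot: original row order
  draw(seq_len(n_obs))
  # Return the grid position (counted column-wise) of the true plot
  invisible(which(placement == m))
}

# Simulated (hypothetical) data for a group comparison
set.seed(1)
bayern &lt;- rbinom(150, 1, 0.2)
cdu &lt;- 30 + 10 * bayern + rnorm(150, sd = 5)

true_cell &lt;- lineup(length(cdu), function(idx) {
  group &lt;- bayern[idx]
  plot(group, cdu, axes = FALSE, ann = FALSE, cex = .4, xlim = c(-1, 2))
  points(1, mean(cdu[group == 1]), pch = &quot;-&quot;, col = &quot;purple4&quot;, cex = 3)
  points(0, mean(cdu[group == 0]), pch = &quot;-&quot;, col = &quot;darkolivegreen2&quot;, cex = 3)
  box(bty = &quot;l&quot;, col = &quot;grey&quot;)
})</code></pre>
<p>Passing the plotting code as a function keeps the permutation logic in one place, so the same helper could be reused for maps or scatter plots by changing only the <code>draw</code> argument.</p>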
</div>
</div>
<div id="further-readings" class="section level3">
<h3>Further readings <a name="furtherreadings"></a></h3>
<ul>
<li><a href="https://royalsocietypublishing.org/doi/full/10.1098/rsta.2009.0120">Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E. K., Swayne, D. F., &amp; Wickham, H. (2009). Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906), 4361-4383.</a></li>
<li><a href="http://www.stephen-few.com/nysi.php">Few, S. (2009). Now you see it: simple visualization techniques for quantitative analysis. Analytics Press.</a></li>
<li><a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1751-5823.2003.tb00203.x">Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness‐of‐fit testing. International Statistical Review, 71(2), 369-382.</a></li>
<li><a href="https://www.tandfonline.com/doi/abs/10.1198/106186004X11435">Gelman, A. (2004). Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics, 13(4), 755-779.</a></li>
<li><a href="https://www.richardtraunmueller.com/wp-content/uploads/2019/01/Traunmueller-Visual-Inference-CCCP.pdf">Traunmüller, R. Visual statistical inference for political research.</a></li>
<li><a href="https://ieeexplore.ieee.org/abstract/document/5613434">Wickham, H., Cook, D., Hofmann, H., &amp; Buja, A. (2010). Graphical inference for infovis. IEEE Transactions on Visualization and Computer Graphics, 16(6), 973-979.</a></li>
</ul>
</div>
<div id="about-the-presenter" class="section level3">
<h3>About the presenter</h3>
<p>Richard Traunmüller <a href="mailto:traunmueller@soz.uni-frankfurt.de"><i class="fa fa-envelope"></i> </a> <a href="https://www.richardtraunmueller.com/"><i class="fa fa-globe"></i> </a> is a Visiting Associate Professor of Political Science at the University of Mannheim and currently on leave from Goethe University Frankfurt, where he is an Assistant Professor of Empirical Democracy Research. He has a strong interest in Bayesian analysis, data visualization, and survey experiments. He studies challenges that arise from deep-seated societal change: global migration and religious diversity, free speech in the digital age, as well as the legacies of civil war and sexual violence.</p>
</div>
]]>
      </description>
    </item>
    
  </channel>
</rss>