- Political scientists mine news stories for information about global events
- TERRIER dataset extracts event data from 300 million news articles
- Easy access to text-derived event data can provide early warnings of civil conflict
Journalists provide an invaluable service, sharing information about global events to which many of us would not otherwise have access. They send missives directly from event sites, recording what’s happening during protests, summits, speeches, and violent actions.
For political scientists, these articles offer a rich mine of data. Jill Irvine, Presidential Professor of International and Area Studies at the University of Oklahoma (OU), Christan Grant, OU assistant professor of computer science, MIT political science PhD candidate Andrew Halterman, and the rest of their team want to make this data accessible.
Enter the Temporally Extended, Regular, Reproducible International Event Records (TERRIER) dataset, which extracts event data from roughly 300 million news articles and puts it into a form researchers can use.
In determining the kinds of actions that qualify as an event, the team uses a framework established within political science that events consist of actors, actions, and targets. The process of coding events consists of three steps.
First, software searches every sentence in the corpus of text and parses the grammar. A second piece of software finds “candidate spans” (i.e., words that look like they belong to an actor or action), then checks a custom dictionary to see if they match a known actor or action.
If a match occurs, the software assigns each actor or action to a predefined category, such as “military” actors and “protest” actions. For instance, the term “George W. Bush,” during the timespan of his presidency, would result in the actor category of “USA government.” Importantly, the dictionaries are coded to account for different ways people might speak of actions such as protests through words like “demonstrated,” “chanted,” or “carried placards.”
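The dictionary-matching step can be illustrated with a minimal Python sketch. It is not the TERRIER team's code: the dictionary layout, the `ActorEntry` class, and the `code_actor` and `code_action` functions are hypothetical names chosen for illustration, and the entries are limited to the examples mentioned above. It assumes a candidate span has already been found by the earlier steps.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ActorEntry:
    """One dictionary entry: a category that applies during a date range."""
    category: str
    start: date
    end: date


# Hypothetical dictionaries holding only the examples from the article.
ACTOR_DICTIONARY = {
    "george w. bush": [
        # Dates of the presidency: 20 January 2001 to 20 January 2009.
        ActorEntry("USA government", date(2001, 1, 20), date(2009, 1, 20)),
    ],
}

ACTION_DICTIONARY = {
    "demonstrated": "protest",
    "chanted": "protest",
    "carried placards": "protest",
}


def code_actor(span: str, published: date) -> Optional[str]:
    """Return the actor category valid on the article date, or None."""
    for entry in ACTOR_DICTIONARY.get(span.strip().lower(), []):
        if entry.start <= published <= entry.end:
            return entry.category
    return None


def code_action(span: str) -> Optional[str]:
    """Return the action category for a candidate span, or None."""
    return ACTION_DICTIONARY.get(span.strip().lower())
```

In this sketch, `code_actor("George W. Bush", date(2005, 6, 1))` yields the “USA government” category, while the same span in an article dated after the presidency yields no match. Several different phrasings map to the single “protest” category.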
The project not only produces data for researchers, but also aims to improve the tools available to other researchers to generate their own datasets. To accomplish this, Grant built several open-source tools to speed the natural language processing (NLP) of the large corpus of documents.
NLP is the first (and most time-intensive) step of the event coding process, but without the grammatical information it provides, the event coder can't process the sentences. Once that step is complete, the events still need to be extracted and geolocated, which are also time-intensive tasks. The scale of the news corpus, hundreds of millions of documents, complicated the effort.
According to Grant, the initial projected timeline for extraction and geolocation on a single machine was many years; thus, it became clear that the team needed more resources. So they called on Jetstream, a cloud-based, on-demand computing and data analysis resource, which gave the team the large-scale computation and storage capabilities it needed in order to expedite the process.
Grant built a distributed container system, and Jetstream provided the storage and structure to launch the pipeline and process the many documents. Grant speaks highly of the Jetstream team, calling them “responsive, helpful, and accommodating to researchers,” and a “lifesaver” for the project.
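The general idea of splitting a large corpus into batches that independent workers can process is sketched below in Python. This is a single-machine stand-in, not the team's distributed container system: a thread pool plays the role of the workers, and `make_batches`, `process_batch`, and `run_pipeline` are hypothetical names. The placeholder `process_batch` only counts documents where a real worker would parse, code, and geolocate them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence


def make_batches(doc_ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Split document IDs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(doc_ids), batch_size):
        yield list(doc_ids[start:start + batch_size])


def process_batch(batch: List[str]) -> int:
    """Placeholder for the real work; returns how many documents it handled."""
    return len(batch)


def run_pipeline(doc_ids: Sequence[str], batch_size: int, workers: int = 4) -> int:
    """Process all batches with a pool of workers and return the total handled."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_batch, make_batches(doc_ids, batch_size)))
```

Because each batch is independent, adding workers shortens the total run time roughly in proportion, which is why moving from one machine to a larger pool of cloud resources matters at this scale.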
The TERRIER team’s work will serve future researchers in at least two ways. The complete event coding pipeline software is available to other researchers in NLP and political science, and the dataset is available to political scientists.
Thus far, the dataset has been used to gain a better sense of the causes and dynamics related to conflict and levels of violence. Now that the process works, the team is excited to see what it will look like in application.
The team is currently working on a paper investigating the relationship between the locations of the 2011 protests in Syria and where the government used violence once the civil war began.
Researchers can potentially use protests reported in news media as a way of tracking prewar anti-regime mobilization and thereby measuring how mobilization affects later violence. More broadly, the data will contribute to the growing set of applied models using text-derived event data to provide early warning of civil conflict.