Open access

Tweets from the Campaign Trail

Researching Candidates’ Use of Twitter During the European Parliamentary Elections

Edited By Alex Frame, Arnaud Mercier, Gilles Brachotte and Caja Thimm

Hailed by many as a game-changer in political communication, Twitter has made its way into election campaigns all around the world. The European Parliamentary elections, taking place simultaneously in 28 countries, give us a unique comparative vision of the way the tool is used by candidates in different national contexts. This volume is the fruit of a research project bringing together scholars from 6 countries, specialised in communication science, media studies, linguistics and computer science. It seeks to characterise the way Twitter was used during the 2014 European election campaign, providing insights into communication styles and strategies observed in different languages and outlining methodological solutions for collecting and analysing political tweets in an electoral context.

1. SNFreezer: a Platform for Harvesting and Storing Tweets in a Big Data Context (Leclercq, Éric / Savonnet, Marinette / Grison, Thierry / Kirgizov, Sergey / Basaille, Ian)

Éric Leclercq, Marinette Savonnet, Thierry Grison, Sergey Kirgizov & Ian Basaille, Laboratoire LE2I – UMR6306 – CNRS – ENSAM, Univ. Bourgogne Franche-Comté

Abstract

In this chapter we show how a multi-paradigm platform can fulfill the requirements of building a corpus of tweets and can reduce the waiting time for researchers to perform analysis on data. We highlight major issues such as the scalability of the architecture that is collecting tweets, as well as its failover mechanism.

1.1 Introduction and Objectives

In general, analysing complex interaction networks, including Twitter data, requires different types of algorithms with different theoretical foundations, such as graph theory, linear algebra or statistical models. In tweet analysis, the intrinsic links built by operators (i.e. hashtags denoted by #, user mentions denoted by @, replies and retweets) have a strong impact on the data model and on the performance of the algorithms being used.

Addressing a scientific question often requires mixing different classes of algorithms, using different data models, that retrieve data from different storage structures. For instance, graph-based algorithms using an adjacency matrix representation are useful for discovering community structure; a Laplacian matrix is useful for evaluating centrality. In general, graph-based algorithms are near-sighted: they do not take contextual information into account. Linear algebra algorithms can be used to identify large-scale structures; for instance, clusters can be found using Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). Machine learning algorithms and statistical models are used to predict links or behaviors and to detect anomalies or events. With massive datasets, the most efficient storage structure should be chosen according to the selected algorithms, bearing in mind that several kinds of algorithms are usually needed.
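The matrix representations mentioned above can be illustrated with a minimal sketch. It assumes a small, hypothetical set of mention links between four invented accounts, builds the adjacency matrix A, and derives the Laplacian L = D - A, whose diagonal holds each node's degree (the simplest centrality measure). The function names are illustrative and are not part of SNFreezer.

```python
def adjacency_matrix(edges, nodes):
    """Return a symmetric 0/1 adjacency matrix as a list of lists."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        i, j = index[src], index[dst]
        if i != j:  # ignore self-loops
            matrix[i][j] = 1
            matrix[j][i] = 1
    return matrix


def laplacian(adjacency):
    """Return L = D - A, where D is the diagonal degree matrix."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [
        [(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical mention links between four invented accounts.
nodes = ["alice", "bob", "carol", "dave"]
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"), ("carol", "dave")]

A = adjacency_matrix(edges, nodes)
L = laplacian(A)

# Degree centrality can be read from the diagonal of the Laplacian.
degree = {name: L[i][i] for i, name in enumerate(nodes)}
```

For a real corpus, a sparse representation would be preferable to nested lists; the dense form is used here only to keep the example readable.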

Analysis of tweets can be performed at different levels of granularity: at the individual level, for example through influence assessment, sentiment analysis or the extraction of features that are not explicit; and at the corpus level, through the emergence of groups of users exhibiting similar behavior. Possible outcomes of the analysis are thus the discovery of social structures and of social positions, i.e. the roles of individuals.

Our contribution is an open source platform named SNFreezer (https://github.com/SNFreezer) that supports the management and the analysis of social data with different paradigms. We have developed a polyglot storage system to store and retrieve tweets in different structures that are able to scale up to the data flow requirements.

1.2 New Paradigms for Data Management Systems

Most enterprise business applications rely on relational database management systems (RDBMS). This technology is mature, widely understood and widely adopted. However, some issues have recently emerged:

RDBMS may not have adequate performance for massive datasets;

RDBMS cannot provide the scalability required by online social network applications;

The structure of the relational model can be too rigid or not relevant to deal with the variability of complex data networks;

SQL was not designed to perform explanatory data analysis queries, which do not provide exact results.

Explicit and implicit links between social data can also be a hindrance to the use of RDBMS if they are combined with massive data. Explicit links are usually used by path queries that require joining tables, but when the length of the path is not known, SQL often needs to be embedded in a programming language. Implicit links can be discovered by data analysis but the schema of the relational database cannot be quickly and easily updated according to newly discovered relationships.
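The difficulty with path queries of unknown length can be sketched as follows. The example assumes a hypothetical table mention(source, target) held in an in-memory SQLite database, and it shows the pattern described above: a fixed SQL query embedded in a host-language loop that repeats until no new account is reached. Some SQL dialects also offer recursive queries; the loop is shown here because it is the approach the text refers to.

```python
import sqlite3


def reachable_accounts(conn, start):
    """Follow mention links from `start` until no new account is found."""
    seen = {start}
    frontier = {start}
    while frontier:
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT DISTINCT target FROM mention WHERE source IN ({placeholders})",
            sorted(frontier),
        ).fetchall()
        # Keep only accounts not visited yet, so cycles terminate.
        frontier = {target for (target,) in rows} - seen
        seen |= frontier
    seen.discard(start)
    return seen


# Hypothetical data: a chain a -> b -> c -> d with a cycle back to a,
# plus an unrelated link x -> y.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mention (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO mention VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("x", "y")],
)

reached = reachable_accounts(conn, "a")
```

Each iteration issues one join-like lookup, so the number of round trips grows with the length of the longest path, which is the cost that graph databases aim to avoid.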

Considering these drawbacks, a number of systems, not following the relational model paradigms, have recently emerged. They are often denoted under the umbrella term of NoSQL databases (Moniruzzaman 2013).

In general, NoSQL databases rely on schema-less data models and scale horizontally. Their common features are scalability and flexibility in the structure of data. NoSQL database management systems provide different solutions for specific problems: the volume of datasets is addressed by column-oriented or key-value stores (HBase1, Cassandra2); document and link management is supported by document databases (CouchDB3, MongoDB4); a high density of links, nodes and properties is handled by graph database management systems (GDBMS), which are also well suited to queries that walk down hierarchical relationships (Neo4j5, HypergraphDB6). XML-oriented databases provide a highly extensible data model but lack scalability in the context of social networks.

NoSQL databases are accessed through different APIs and query languages, so Atzeni et al. (2014) propose a common programming interface to NoSQL systems that hides the specific details of the various systems from application developers. The TinkerPop project7 adopts a similar approach for graph databases. It introduces a graph query language, Gremlin, a domain-specific language based on Groovy8 and supported by most GDBMS. Unlike most query languages, which are declarative, Gremlin is an imperative language focusing on graph traversals.

The multi-paradigm principle tends to generalize these different approaches. In modelling, multi-paradigm approaches address the need to use multiple modelling paradigms to design complex systems (Hodge et al. 2011). Indeed, complex systems require the use of multiple modelling languages to: 1) cope with the inherent heterogeneity of such systems; 2) offer different points of view on all their relevant aspects; 3) cover different activities of the design cycle; 4) allow reasoning at different levels of detail during the design process (Hardebolle and Boulanger 2009). As a result, multi-paradigm modelling addresses three orthogonal directions of research: 1) multi-formalism modelling, concerned with the coupling of and transformation between models described in different formalisms; 2) model abstraction, concerned with the relationship between models at different levels of abstraction; 3) meta-modelling, concerned with the description of classes of models dedicated to particular domains or applications, called Domain Specific Languages (DSL). Multi-paradigm data storage, or polyglot persistence, uses multiple data storage technologies, chosen according to the way data is used by applications and/or algorithms (Sharp et al. 2013). As Ghosh (2010) states, storing data the way it is used in an application simplifies programming and makes it easier to decentralize data processing. ExSchema (Castrejon et al. 2013) is a tool that enables automatic discovery of the data schema of a system that relies on multiple document, graph, relational and column-family data stores.

1.3 SNFreezer Architecture

In order to collect tweets during the political campaign, we started by analysing existing solutions and selected YourTwapperKeeper9 (YTK), an open source project that claims to provide users with a tool that archives data from Twitter directly on a server. After a period of testing and code review, we identified some major drawbacks: YTK was not able to collect tweets in various languages; its choice of database engine limits the volume of datasets; and it retrieves neither account information such as the lists of following/followers nor the timelines of users. We therefore chose to enhance YTK with a real storage layer and to add database connectors so that analysis tools such as R can retrieve data directly from SNFreezer repositories.

1.3.1 The Storage System

To address the problem of tweet storage, both in terms of performance and interoperability (i.e. easy connection of third party tools) we have specified and developed a storage layer. The proposed polyglot persistence storage layer includes relational databases (RDBMS), a graph data store (GDBMS), and a scalable document database management system (DDBMS) that can be used simultaneously (figure 1).

Figure 1. SNFreezer architecture

The configuration of the storage layer allows information to be duplicated in different storage systems according to the planned analysis, so that ready-to-use Neo4j or PostgreSQL databases are available while collection is in progress. Developers can add their own drivers for other storage systems. Data harvesting is provided by the data ingestion module, which is connected to the Streaming and Search APIs of Twitter through two main processes that run continuously. Two other processes collect the followers/following information at a defined frequency and retrieve all the available tweets from the timelines of the users. Depending on the needs and on the type of information to be analyzed, one can choose between various storage structures:

For computing global metrics (e.g. number of tweets/retweets/mentions per user), a relational schema in PostgreSQL can be used.

In the case of high traffic, it is preferable to store information in a non-normalized database schema (one table for tweets and a few others for followers), in JSON files or in MongoDB.

In the case of the study of linked information (for example social network type data structure), a Neo4j graph database is suitable.

The choice of storage backend can be cumulative (both relational normalized databases and JSON files for example). We also propose a set of tools that implement model transformation to asynchronously transform data from one storage system to another.
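The cumulative choice of backends can be sketched as a small dispatch layer. This is an illustration only, not the SNFreezer code: the class names are hypothetical, a Python list stands in for JSON files or a MongoDB collection, and an in-memory SQLite database stands in for PostgreSQL.

```python
import json
import sqlite3


class JsonLinesBackend:
    """Keeps each tweet as one raw JSON string (stand-in for files or MongoDB)."""

    def __init__(self):
        self.lines = []

    def store(self, tweet):
        self.lines.append(json.dumps(tweet, ensure_ascii=False))


class RelationalBackend:
    """Keeps a few extracted attributes in a table (stand-in for PostgreSQL)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE tweet (id INTEGER PRIMARY KEY, from_user TEXT, text TEXT)"
        )

    def store(self, tweet):
        self.conn.execute(
            "INSERT OR IGNORE INTO tweet VALUES (?, ?, ?)",
            (tweet["id"], tweet["user"]["screen_name"], tweet["text"]),
        )


class StorageLayer:
    """Forwards every incoming tweet to all configured backends."""

    def __init__(self, backends):
        self.backends = list(backends)

    def store(self, tweet):
        for backend in self.backends:
            backend.store(tweet)


# Hypothetical tweet, reduced to the fields used above.
sample = {"id": 1, "text": "Bonjour #EP2014", "user": {"screen_name": "someone"}}

json_backend = JsonLinesBackend()
sql_backend = RelationalBackend()
layer = StorageLayer([json_backend, sql_backend])
layer.store(sample)
```

Adding a driver for another storage system then amounts to writing one more class with a store method and listing it in the configuration.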

1.3.2 Storing Tweets

Tweets coming through the Twitter APIs are JSON strings10 that contain information on the tweet itself, but also on the user who sent or retweeted it (table 1). JSON strings can be stored in MongoDB or directly in the filesystem.

Table 1: Example of a tweet in JSON

[created_at] => Mon May 05 16:58:20 +0000 2014
[id] => 463362068553674752
[id_str] => 463362068553674752
[text] => RT @JeunesavecBLM: .@Bruno_LeMaire est ce soir à Thionville pour un meeting avec @Nadine__Morano @AnneGrommerch @ArnaudDanjean #EP2014 http…
[source] => <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
[truncated] =>
[in_reply_to_status_id] =>
[in_reply_to_status_id_str] =>
[in_reply_to_user_id] =>
[in_reply_to_user_id_str] =>
[in_reply_to_screen_name] =>
[user] => stdClass Object
(
    [id] => 439463212
    [id_str] => 439463212
    [name] => Arnaud Danjean
    [screen_name] => ArnaudDanjean
    [location] =>
    [description] => Député Européen – Membre de la commission Affaires Etrangères & sous-commission Défense du Parlement Européen – Conseiller Régional de Bourgogne
    [url] => http://t.co/tOt2Mb2YUb
    [entities] => stdClass Object …….

JSON strings are not stored directly (as a single attribute) in the RDBMS. Each string is first converted into a tuple with separate attributes (one tuple = one tweet). The attributes extracted from the JSON string are stored in a table named gtweets. The gtweets table contains more than 40 attributes, the most significant of which are:

the id of the tweet (id), its content (text), its sending date (time) and its creation time (created_at);

the id of the Twitter account that sent the tweet (from_user_id) and its screen name (from_user);

the language code of the tweet (ISO_language_code), which gives an indication of the language in which it is written;

the source of the tweet, i.e. the criterion that the tweet matches (hashtag, keyword or account);

optional information on the geolocation of the tweet: point, latitude, longitude (geo_type, geo_coordinates_0, geo_coordinates_1);

the link to the user’s profile image (profile_image_url);

the id of the user the tweet replies to, if any (to_user_id);

the id of the tweet to which it replies (in_reply_to_status_id);

the id of the initial tweet, in the case of a retweet (initial_tweet_id);

the text of the initial tweet, in the case of a retweet (initial_tweet_text);

the user who sent the initial tweet (initial_tweet_user);

the list of mentioned accounts (user_mentions);

the list of hashtags (hashtags);

the list of URLs (urls).
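The conversion from a JSON string to a flat tuple can be sketched as below. The target attribute names follow the list above, and the JSON keys (user, lang, entities, retweeted_status) are those of the Twitter REST API tweet object. The mapping itself, the comma-separated encoding of lists and the sample tweet are assumptions made for illustration, not the actual SNFreezer implementation.

```python
import json


def flatten_tweet(raw_json):
    """Turn one tweet (JSON string) into a flat dict, one key per attribute."""
    tweet = json.loads(raw_json)
    entities = tweet.get("entities", {})
    original = tweet.get("retweeted_status")  # present only for retweets
    return {
        "id": tweet["id"],
        "text": tweet["text"],
        "created_at": tweet["created_at"],
        "from_user_id": tweet["user"]["id"],
        "from_user": tweet["user"]["screen_name"],
        "iso_language_code": tweet.get("lang"),
        "to_user_id": tweet.get("in_reply_to_user_id"),
        "in_reply_to_status_id": tweet.get("in_reply_to_status_id"),
        "initial_tweet_id": original["id"] if original else None,
        "initial_tweet_text": original["text"] if original else None,
        "initial_tweet_user": original["user"]["screen_name"] if original else None,
        # Lists are joined with commas here; this encoding is an assumption.
        "user_mentions": ",".join(
            m["screen_name"] for m in entities.get("user_mentions", [])
        ),
        "hashtags": ",".join(h["text"] for h in entities.get("hashtags", [])),
        "urls": ",".join(u["expanded_url"] for u in entities.get("urls", [])),
    }


# Hypothetical retweet with invented identifiers and accounts.
raw = json.dumps({
    "id": 1002,
    "created_at": "Mon May 05 16:58:20 +0000 2014",
    "text": "RT @original_account: hello @someone #EP2014",
    "lang": "fr",
    "user": {"id": 42, "screen_name": "retweeter"},
    "in_reply_to_user_id": None,
    "in_reply_to_status_id": None,
    "entities": {
        "hashtags": [{"text": "EP2014"}],
        "user_mentions": [{"screen_name": "original_account"}, {"screen_name": "someone"}],
        "urls": [],
    },
    "retweeted_status": {
        "id": 1001,
        "text": "hello @someone #EP2014",
        "user": {"id": 7, "screen_name": "original_account"},
    },
})

row = flatten_tweet(raw)
```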

Table 2: An entry in gtweets according to JSON based on the example given in Table 1

Tweet Relational Database Schema

The gtweets table is complex (many attributes) and voluminous, so queries can become difficult to write. For obvious reasons of querying performance, interesting objects for data analysis are transformed and stored in separate tables (figure 2):

The Tweet table contains tweets and retweeted tweets. The Tweet table is connected (by foreign keys) with hashtags (Tweet_Hashtag table), URLs (Tweets_URL table), symbols (Tweet_Symbol table) and query sources, i.e. presence of a keyword, a hashtag or an account in the tweet (Tweet_Source table).

Retweet and Mention are tables that represent relationships between users and tweets. The Retweet table contains the user that retweeted, the id of the retweeted tweet and the date. The Mention table connects a tweet with the mentioned users.

The User and Identity tables represent information about users. The User table holds information that cannot change in a Twitter account (id, created_at); other information can be updated by the user and is represented as an Identity.
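A reduced version of this normalized schema, together with one global metric, can be sketched with an in-memory SQLite database. The column names are assumptions, the user table is renamed tw_user to avoid a keyword that is reserved in some database engines, and the hashtag, URL, symbol and source tables are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tw_user  (id INTEGER PRIMARY KEY, created_at TEXT);
CREATE TABLE identity (user_id INTEGER REFERENCES tw_user(id), screen_name TEXT);
CREATE TABLE tweet    (id INTEGER PRIMARY KEY,
                       user_id INTEGER REFERENCES tw_user(id), text TEXT);
CREATE TABLE retweet  (user_id INTEGER REFERENCES tw_user(id),
                       tweet_id INTEGER REFERENCES tweet(id), date TEXT);
CREATE TABLE mention  (tweet_id INTEGER REFERENCES tweet(id),
                       user_id INTEGER REFERENCES tw_user(id));
""")

# Hypothetical data: user 1 sends two tweets, users 2 and 3 retweet them.
conn.executemany("INSERT INTO tw_user VALUES (?, ?)",
                 [(1, "2010-01-01"), (2, "2011-01-01"), (3, "2012-01-01")])
conn.executemany("INSERT INTO identity VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO tweet VALUES (?, ?, ?)",
                 [(10, 1, "first"), (11, 1, "second")])
conn.executemany("INSERT INTO retweet VALUES (?, ?, ?)",
                 [(2, 10, "2014-05-05"), (2, 11, "2014-05-06"), (3, 10, "2014-05-06")])

# Global metric: number of retweets made by each account.
retweets_per_user = conn.execute("""
    SELECT i.screen_name, COUNT(*)
    FROM retweet r JOIN identity i ON i.user_id = r.user_id
    GROUP BY i.screen_name
    ORDER BY i.screen_name
""").fetchall()
```

Because retweets and mentions live in their own narrow tables, such counts do not need to scan the wide gtweets table.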

Figure 2. Storing tweets in multiple relational tables

1.3.3 Tweet Graph Database

In a graph database, objects (tweets, users, hashtags, etc.) are nodes with properties, and relationships are described by edges with properties (figure 3). The relationship use is not materialized; it can be deduced by composition. The schema has been implemented with Neo4j.

Figure 3. Storage of tweets in GDBMS

Cypher is a query language for Neo4j that works by matching patterns against the graph database. Its operators are optimized for queries requiring graph traversal, and notably allow relationships to be found between nodes that are separated by several links (which would require costly joins in SQL). The query in table 3 shows a simple example of what can be expressed using Cypher: here we wish to find the identifiers of the users who retweeted ‘@Europarl_FR’, together with the number of retweets.

Table 3. A query example using Cypher

match (u:user)-[:USER_HAVE_IDENTITY]->(i:identity)
where i.screen_name = '@Europarl_FR'
with distinct u as europarl_fr
match (europarl_fr)-[:USER_SEND_TWEET]->(t:tweet)<-[:USER_RETWEET_TWEET]-(urt:user)
return urt.id, count(t);

The first three lines of the query bind the user node that corresponds to the identity with the screen name ‘@Europarl_FR’ to the variable europarl_fr. The second match clause retrieves the users (urt) who retweeted a tweet sent by ‘@Europarl_FR’. Finally, the return clause groups the results by user and counts, for each user, the number of retweets.
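For readers unfamiliar with Cypher, the same computation can be sketched over plain in-memory collections. The three inputs stand for the USER_HAVE_IDENTITY, USER_SEND_TWEET and USER_RETWEET_TWEET relationships; the function name and the sample identifiers are hypothetical.

```python
from collections import Counter


def retweeters_of(screen_name, identities, sent, retweeted):
    """Count, per retweeting user, the retweets of tweets sent by `screen_name`.

    identities: dict mapping user id -> set of screen names
    sent:       list of (user id, tweet id) pairs
    retweeted:  list of (user id, tweet id) pairs
    """
    # Step 1: users owning an identity with the requested screen name.
    authors = {uid for uid, names in identities.items() if screen_name in names}
    # Step 2: tweets sent by those users.
    tweets = {tid for uid, tid in sent if uid in authors}
    # Step 3: group the retweets of those tweets by retweeting user.
    return Counter(uid for uid, tid in retweeted if tid in tweets)


# Hypothetical data.
identities = {1: {"@Europarl_FR"}, 2: {"@x"}, 3: {"@y"}}
sent = [(1, 100), (1, 101), (2, 200)]
retweeted = [(2, 100), (2, 101), (3, 100), (3, 200)]

counts = retweeters_of("@Europarl_FR", identities, sent, retweeted)
```

The three steps mirror the three pattern hops of the query; in Neo4j they are resolved by following stored edges rather than by scanning lists.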

1.3.4 Cluster Mode and Fail Over

Cluster and failover modes have been developed to overcome some limitations of the Streaming and Search APIs of Twitter. For instance, one application can only retrieve tweets matching one of a set of at most 400 conditions, called source queries (attribute querysource in the database). Each Twitter account can define applications, but the previous limitations apply to each IP address hosting an application. However, some projects require more than 400 source queries. To fulfill this requirement we have developed a cluster management module.

In cluster mode, several virtual machines, on each of which instances of the data harvesting processes are deployed, can be used in a project to collect tweets matching criteria drawn from a set of several thousand source queries. Source queries can be defined on each virtual machine using a web interface, or can be imported automatically using scripts and import files submitted to one of the virtual machines, which acts as a master. In cluster mode, a high volume of tweets is assumed, so the storage layer can use a non-normalized schema on PostgreSQL (the gtweets table described in section 1.3.2), or can store data in MongoDB or in JSON files. Figure 4 gives an example of the cluster deployed to collect tweets during the European election campaign of 2014.
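One way to spread source queries over the virtual machines is to split them into groups of at most 400. The sketch below is an assumption about how such an assignment could be computed; it does not describe the actual cluster management module, and the function name and query strings are hypothetical.

```python
def assign_source_queries(source_queries, max_per_instance=400):
    """Split source queries into groups of at most `max_per_instance`.

    Duplicates are removed first (order preserved) so that the same
    hashtag, keyword or account is not harvested by two instances.
    """
    unique = list(dict.fromkeys(source_queries))
    return [
        unique[i:i + max_per_instance]
        for i in range(0, len(unique), max_per_instance)
    ]


# Hypothetical project with 1,000 distinct source queries and one duplicate.
queries = [f"#tag{i}" for i in range(1000)] + ["#tag0"]
groups = assign_source_queries(queries)
```

Each group would then be loaded on one virtual machine, so this hypothetical project needs three harvesting instances.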

Figure 4. Cluster mode

If tweet harvesting is interrupted for any reason, the process in charge of the Search API can retrieve tweets going back seven days. An automatic email notifies the administrator of the failure (storage, network, etc.), and when the harvesting process is restarted, the latest tweet ID is checked and the process in charge of the Search API starts collecting any lost tweets. Moreover, it is possible to include a local database on each virtual machine, using MySQL as a temporary backup, alongside a global database using PostgreSQL.
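The recovery step can be sketched as follows. The search argument is a hypothetical callable standing in for a Search API client that returns the tweets newer than a given ID; only the bookkeeping around the latest stored ID is illustrated, not the real harvesting process.

```python
def recover_missing(last_stored_id, search, stored_ids):
    """Collect tweets published after `last_stored_id` that are not stored yet.

    search:     hypothetical callable, search(since_id) -> iterable of tweets
                (dicts with at least an "id" key)
    stored_ids: set of tweet ids already in the database; updated in place
    """
    recovered = []
    for tweet in search(last_stored_id):
        if tweet["id"] > last_stored_id and tweet["id"] not in stored_ids:
            recovered.append(tweet)
            stored_ids.add(tweet["id"])
    return sorted(recovered, key=lambda t: t["id"])


# Hypothetical archive standing in for what the Search API can still return.
archive = [{"id": i, "text": f"tweet {i}"} for i in range(3, 9)]


def fake_search(since_id):
    return [t for t in archive if t["id"] > since_id]


# Tweet 7 was stored by another process; 6 and 8 were lost during the outage.
stored = {3, 4, 5, 7}
lost = recover_missing(5, fake_search, stored)
```

Checking membership in stored_ids keeps the recovery idempotent, so restarting the process twice does not duplicate tweets.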

Tools were also developed in order to monitor tweet harvesting and the instances of the different local storage in each virtual machine of the cluster. They indicate, amongst other things, the number of tweets for a given period of time, the most used hashtags or the most active users and can send alerts by email to researchers or administrators. In the case of a high volume of tweets, a replica database is used to calculate general statistics for the monitoring tools.

During the TEE2014 project, this architecture allowed us to harvest more than 50M tweets; the resulting database, with a non-normalized schema, is about 50GB without indexes. For the French corpus, experiments have been performed using Neo4j: starting from a normalized schema of 5M tweets and a database size of 7GB, we obtain a graph database of 20GB.

1.4 Connection with Analysis Tools and Web Applications

A specific data exchange layer is dedicated to application services, and third party tools are plugged in according to the expected analysis. Connectors have been developed to analyse data with third party software such as R and igraph11 and to display results with D3.js12. If a connector is not available, the layer provides export files for the different classes of algorithms, such as files containing an adjacency matrix, graph triples or a multidimensional array, or CSV (Comma-Separated Values) files for spreadsheets. Below, we give some key elements for starting analysis.

In order to perform explanatory analysis on the corpus, we have developed some web tools. For example, figure 5 shows a component that displays the timeline of a user and the frequency of tweets or retweets over a period. By clicking on a point, users can display the corresponding tweet or retweet. Figure 6 depicts the interface of an algorithm that detects and characterizes events. We propose a simple method that finds the local minima and maxima of the temporal density of tweets, that is, the moments after which the number of published tweets begins to increase (or decrease). The method has a single parameter (bandwidth) that lets us smoothly change the mode of event detection: when the bandwidth is small, micro-events can be detected, and when it is large, only macro-events appear. By itself, this method gives no description of an event other than its start and end times. Thus, in order to describe an event, we split the data using hashtag decomposition: for each hashtag of interest, we draw a temporal density plot, and we combine all these plots in the same image, which allows us to compare them. The difference in hashtag frequencies can be regarded as a first approximation of a semantic description of the event. Part of the program source code (detection of local minima and maxima of temporal density) is available on GitHub13.
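The role of the bandwidth can be illustrated with a minimal sketch of density-based extrema detection. It assumes a Gaussian kernel, a regular evaluation grid and invented timestamps; the published code may use a different estimator, so this is not a description of the authors' implementation.

```python
import math


def temporal_density(timestamps, grid, bandwidth):
    """Gaussian kernel density estimate of tweet times, evaluated on `grid`."""
    norm = len(timestamps) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - t) / bandwidth) ** 2) for t in timestamps) / norm
        for g in grid
    ]


def local_extrema(values):
    """Return (minima, maxima) as lists of indices of strict local extrema."""
    minima, maxima = [], []
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            minima.append(i)
        elif values[i] > values[i - 1] and values[i] > values[i + 1]:
            maxima.append(i)
    return minima, maxima


# Hypothetical timestamps in minutes: two bursts, around t=10 and t=30.
timestamps = [9, 10, 10, 11, 29, 30, 30, 31]
grid = list(range(0, 41))  # grid index equals the minute

micro = local_extrema(temporal_density(timestamps, grid, bandwidth=2.0))
macro = local_extrema(temporal_density(timestamps, grid, bandwidth=15.0))
```

With the small bandwidth the two bursts appear as separate maxima with a minimum between them, whereas the large bandwidth merges them into a single peak, which corresponds to the micro-event and macro-event modes described above.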

Figure 5. A web component to display user’s timeline

Figure 6. A web component to detect and analyse events

1.5 Conclusion

In this chapter, we have presented an open framework integrating a scalable polyglot storage system that supports different analysis algorithms. Different analysis tools have been implemented on top of the platform to perform iterative and incremental analysis through a dedicated web application. We plan to add new web components, such as community analysis or heat map clustering representations, that can connect to the different data repositories in order to build an observatory that can be used to perform explanatory analysis in real time.

References

Atzeni, Paolo / Bugiotti, Francesca / Rossi, Luca: “Uniform access to NoSQL systems”, Information Systems, 43, 2014, pp. 117–133.

Castrejon, J. / Vargas-Solar, G. / Collet, C. / Lozano, R.: “ExSchema: Discovering and Maintaining Schemas from Polyglot Persistence Applications”, 29th IEEE International Conference on Software Maintenance (ICSM), 2013, pp. 496–499.

Ghosh, Debasish: “Multiparadigm Data Storage for Enterprise Applications”, IEEE Software, 27(5), 2010, pp. 57–60.

Hardebolle, Cécile / Boulanger, Frédéric: “Exploring Multi-Paradigm Modeling Techniques”, Simulation, 85(11–12), 2009, pp. 688–708.

Hodge, Bri-Mathias S. / Huang, Shisheng / Siirola, John D. / Pekny, Joseph F. / Reklaitis, Gintaras V.: “A multi-paradigm modeling framework for energy systems simulation and analysis”, Computers & Chemical Engineering, 35(9), 2011, pp. 1725–1737.

Moniruzzaman, ABM / Hossain, Syed Akhter: “NoSQL database: New era of databases for big data analytics-classification, characteristics and comparison”, International Journal of Database Theory and Application, 6(4), 2013.

Sharp, John / McMurtry, Douglas / Oakley, Andrew / Subramanian, Mani / Zhang, Hanzhong: “Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence”, In: Microsoft patterns & practices, 2013.

Acknowledgements

The authors gratefully acknowledge Arnaud Da Costa, engineer in charge of the server infrastructure which was used during the TEE2014 project, and Jonathan Norblin, who developed the major parts of SNFreezer during the final internship of his master's degree in computer science.


1 http://hbase.apache.org/

2 http://cassandra.apache.org/

3 http://couchdb.apache.org/

4 https://www.mongodb.org/

5 http://neo4j.com/

6 http://hypergraphdb.org/

7 http://www.tinkerpop.com

8 http://groovy.codehaus.org/

9 https://github.com/540co/yourTwapperKeeper

10 JSON (JavaScript Object Notation) is a lightweight data-interchange format.

11 http://igraph.org/

12 http://d3js.org/

13 https://github.com/kerzol/changepoint-pelt-vs-timedensity