Visualizing and investigating semantically open data, by Antoine Vion, Aix-Marseille University
“A piece of data is open if anyone is free to use, reuse, and redistribute it – subject only, at most, to the requirement to attribute and/or share-alike.” Investigating this kind of data has been a growing matter of interest in the social sciences for fifteen years. A first wide range of studies was conducted from a social network analysis (SNA) background, to study quantitatively socialisation through the web (Papacharissi, 2004, Latapy et al., 2005, King 2014, Viard et al. 2016), online social movements and emotional contagion (Sano et al. 2013, Kramer et al. 2014, Eggert & Pavan 2014), political campaigns (Adamic & Glance 2005), knowledge networks (Zhang et al. 2007 ; Roth & Cointet 2010 ; Rogers & Marres 2016), etc. In quali-quantitative research implementing content analysis, the main kinds of stored data investigated have been blogs (Adar et al. 2004 ; Drezner & Farrell, 2004 ; Adamic et al. 2005 ; Hookway 2008), social networks (Adamic et al. 2008 ; Gerlitz & Helmond 2013 ; Bakshy et al. 2015) and mailing lists (Calderaro 2007, Nguyen et al. 2011). Most of this second range of studies has been conducted through graph-mining and text-mining techniques. Providing a deeper, clearer understanding of such data (Lazer et al. 2014) is now on the agenda of what is sometimes called web social science (Ackland 2013). In parallel, a new kind of journalism, “data journalism”, now aims to analyse online content and to report methodically on related facts and debates, as well as on sets of practices and opinion flows (Gray et al. 2012).
Efficient tools for data exploration
Here, the quality of the data strongly conditions the relevance of the investigation. A good example is provided by the Panama Papers. Theoretically, this kind of data provides an adequate sample to conduct studies of tax evasion on a large scale. The data make it possible to reconstruct relational chains, in line with the small-world research program outlined by Travers and Milgram (1967) and Watts (1999). This should help to go beyond the usual analyses developed in sociology or management in terms of deviance, white-collar crime, or organizational wrongdoing. Advancing this kind of structural analysis should be a way to measure social mechanisms that tax studies have explored for a long time, but only qualitatively.
Unfortunately, the data are made up of aggregated uploads and contain many duplicates. The way the data were initially extracted is never questioned. The fact that the International Consortium of Investigative Journalists (ICIJ) won the Pulitzer Prize is not surprising, as the work done by the consortium was huge. But ICIJ reporters, when they processed the online data, constantly missed the fact that concealing data is common in the professional practices of tax evasion (Falciani 2015), and that semantic approximation methods are needed to retrieve the concealed occurrences of addresses allocated to multiple entities in tax havens. To solve this kind of problem, investigators need efficient tools to explore the database, define ontologies, retrieve similarities, etc. Usual quantitative research through numeric graph mining simply crushes the complexity of the data and provides partial or even false results. In short, conducting social science inquiries into such data requires going beyond the usual graph mining and text mining tools and shaping adequate tools to query graphs, specify ontologies and eventually apply machine learning within knowledge graphs. This is a way to reach the objectives of enriching data for social science treatment and reassembling social science methods in web data investigations (Ruppert et al. 2013).
The Suite 102 case
A good example is provided by a case that ICIJ has published online. ICIJ composed a simple query to retrieve the companies (defined as Entities in the online database) related to an Address in a Seychelles hotel, the [Suite 102, Aarti Chamber, Mont Fleuri, Victoria, Mahe, Seychelles].
Figure 1. Dyadic links related to Suite 102, Aarti Chamber – ICIJ, https://offshoreleaks.icij.org/nodes/233584
Let us first observe that, as duplicates have not been removed, two of the three companies tied together are the same company under a full and an abbreviated name: Green Apple Systems Limited and Green Apple Systems Ltd. This is noted by ICIJ.
Unfortunately, without any model of semantic approximation, manual queries have to be multiplied to look for other possible occurrences of the address. Using such a model makes it possible to retrieve twelve different occurrences of the [Suite 102, Aarti Chamber, Mont Fleuri, Victoria, Mahe, Seychelles].
Once the multiple occurrences are retrieved, it becomes possible to bring out hidden clusters of companies linked to the same registered address. This leads to the list presented in Figure 2.
Figure 2. List of companies linked to [Suite 102, Aarti Chambers, Mont Fleuri], including the variant spellings of this address (N=85)
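To make the idea of semantic approximation more concrete, here is a minimal sketch in Python of how variant spellings of an address might be retrieved. It is not the model used in the project: it simply normalizes strings and compares them with the standard library's difflib. The abbreviation table, the 0.8 threshold and the candidate spellings are all assumptions invented for illustration; only the reference address comes from the case above.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real project would derive it from a domain ontology.
ABBREVIATIONS = {"ste": "suite", "chambers": "chamber", "mt": "mont"}


def normalize(address: str) -> str:
    """Strip accents, case and punctuation, then expand known abbreviations."""
    ascii_text = unicodedata.normalize("NFKD", address).encode("ascii", "ignore").decode("ascii")
    tokens = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_variants(reference: str, candidates: list[str], threshold: float = 0.8) -> list[str]:
    """Return the candidates whose similarity to the reference reaches the threshold."""
    return [c for c in candidates if similarity(reference, c) >= threshold]


REFERENCE = "Suite 102, Aarti Chamber, Mont Fleuri, Victoria, Mahe, Seychelles"

# Invented spellings, for illustration only; they are not taken from the ICIJ database.
CANDIDATES = [
    "SUITE 102; AARTI CHAMBERS; MONT FLEURI; VICTORIA; MAHE; SEYCHELLES",
    "Ste 102, Aarti Chamber, Mt Fleuri, Victoria, Mahé, Seychelles",
    "Suite 102 Arti Chamber Mont Fleuri Victoria Mahe Seychelles",
    "12 Example Street, Sample Town, Nowhere",
]

if __name__ == "__main__":
    for variant in find_variants(REFERENCE, CANDIDATES):
        print(variant)
```

With these toy inputs, the first three candidates are retained and the unrelated address is discarded. A purely string-based measure of this kind is only a starting point: it would not, for instance, recognize that two differently worded descriptions designate the same building, which is where ontologies come in.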
Social scientists thus have to cope with this kind of constraint when conducting their inquiries into online data. In the research project we are managing, we have begun to define and develop specific tools to support the investigation of Open Data in the human and social sciences. This requires an original combination of visual analytics and knowledge graph data mining.
Visual analytics can be defined as the science of analytical reasoning facilitated by interactive visual interfaces. It can attack certain problems whose size, complexity, and need for closely coupled human and machine analysis may make them otherwise intractable (Kosara 2007, Ribarsky et al. 2009, Von Landesberger et al. 2011, Dill 2015). Visual analytics is a multidisciplinary approach, which takes advantage of various related research areas such as visualization, data mining, data management, data fusion, statistics, and cognitive science (Kielman & Thomas 2009). From the users' point of view, data visualization is indeed a preliminary challenge to allow data-driven discoveries (Min et al. 2009, Riche 2015). Graphic visual interfaces uncover the way knowledge may be brought out of the data, tested, refined and shared (Pike et al. 2009) and help reflexively control the provenance of entities (Chen et al. 2014).
In this project, we use the GraphScale and SemSpect tools developed by the DERIVO Company.
Knowledge Graph Data Mining
Graph-based data mining, or graph mining, is the extraction of novel and useful knowledge from a graph representation of data. Graph mining uses the natural structure of the application domain and mines directly over that structure. The most natural form of knowledge that can be extracted from graphs is also a graph, for instance a sub-graph (Kavitha et al. 2011). Graph mining can use specific graph theory-based algorithms or specific machine-learning techniques. In this project, the graphs concerned are Knowledge Graphs (RDF Graphs). For several reasons, we have chosen to use machine-learning techniques.
In this context, graph mining relies on relational data mining (RDM). Unlike traditional data mining algorithms, which look for patterns in a single table (propositional patterns), RDM algorithms look for patterns among multiple tables (relational patterns). Note that for most types of propositional patterns, there are corresponding relational patterns. For example, there are relational classification rules, relational regression trees, and relational association rules.
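The difference between a propositional and a relational pattern can be illustrated with a small Python sketch. It is not an RDM algorithm, only a hand-written query over three toy tables (entities, addresses, and the links between them), showing a pattern that cannot be expressed by looking at any one of the tables alone. All names, identifiers and the function `shared_addresses` are invented for the example.

```python
from collections import defaultdict

# Toy relational data; every name and identifier below is invented for illustration.
ENTITIES = {
    1: "Alpha Holdings Ltd",
    2: "Beta Trading Inc",
    3: "Gamma Services SA",
    4: "Delta Ventures Ltd",
}
ADDRESSES = {10: "Address A", 11: "Address B"}
REGISTERED_AT = [(1, 10), (2, 10), (3, 10), (4, 11)]  # (entity_id, address_id)


def shared_addresses(min_entities: int = 2) -> dict[str, list[str]]:
    """Relational pattern: addresses to which at least `min_entities` entities are linked."""
    entities_by_address = defaultdict(set)
    for entity_id, address_id in REGISTERED_AT:
        entities_by_address[address_id].add(entity_id)
    return {
        ADDRESSES[address_id]: sorted(ENTITIES[e] for e in entity_ids)
        for address_id, entity_ids in entities_by_address.items()
        if len(entity_ids) >= min_entities
    }


if __name__ == "__main__":
    for address, names in shared_addresses().items():
        print(address, "->", names)
```

Here the pattern "an address shared by several entities" only emerges once the three tables are joined, which is the situation RDM algorithms are designed to handle automatically and at scale.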
The main objective of the project is to allow all kinds of investigators to conduct inductive analysis within open data. To do so, the project will mobilize extended social science skills to design new intelligent software for semantic investigation. The main practical stake is to ensure continuity between data exploration through visual analytics, case-building and similarity querying, and machine learning in comparative and/or statistical research. Since the potential complexity of the underlying patterns is high in symbolic and parametric terms, it is imperative that social scientists co-design their tools with computer scientists, so as to prioritize functionalities over formal models.
Practically speaking, at the end of the project, investigators should be able to:
- Enrich open data on the basis of domain ontologies
- Select patterns of relations which draw a ‘good’ case (case-based reasoning querying)
- Benefit from an identification of the relevant variables of the case and of the related ontologies
- Select variables among the ones identified to look for similarities
- Gear constraints towards flexibility or rigidity
- Be guided by machine-learning to deepen investigation
The challenge of qualitative-quantitative social science research in open data
Improving these semantic techniques should help to manage many kinds of inquiries into open data and to apply some of the most classical methods. In this project, we started from the idea of retrieving relational chains and small worlds of tax evasion. But our method will help to implement diverse kinds of sociological research based on more qualitative reasoning, such as analytic induction, qualitative comparative analysis (QCA) or multiple correspondence analysis (MCA).
Analytic induction (Znaniecki 1934; Robinson 1951) is a social science tradition in which researchers begin by studying a small number of cases of the phenomenon, searching for similarities that could point to common factors, and draw hypotheses before testing them on other cases. If any one of the new cases does not verify the hypothesis, either the hypothesis is reformulated so as to match the features of all the cases studied so far, or the original definition of the type of phenomenon to be explained is redefined, on the grounds that it does not represent a causally homogeneous category (Hammersley 2004). Further cases are then investigated until no more irregularities appear. Analytic induction poses complex challenges for graph mining and semantic analysis. It underlines the potential need to refine and develop the initial categorisation of the social phenomenon to be explained. Computationally speaking, this corresponds to ontology refinement and iterative semantic processing. The main challenge in analytic induction is to draw cases (Ragin & Becker 1992) that can be used for comparative research. In terms of query processing in web data, this requires new techniques such as case-based reasoning applied to the search for similarities.
In the data we explore, case-based investigation involves identifying a relevant pattern and defining the semantic rules under which similar cases can be retrieved. To go back to our first example, if an address in a hotel or a residence located in a tax haven corresponds to diverse spellings and relates entities and intermediaries, this can constitute an elementary pattern. To retrieve comparable elements, one will need 1/ ontologies that subsume all the occurrences of countries defined as tax havens and of addresses defined as addresses in a hotel or a residence, 2/ automatic approximation processes through which a diversity of spellings can be brought out, 3/ machine learning functionalities to capitalize on advances and open new paths for query processing.
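The role of subsumption in such a pattern can be sketched as follows. This is a toy illustration in plain Python, not an RDF or OWL implementation and not the project's tooling: the class hierarchy, the record fields and the function names are all hypothetical, and treating a jurisdiction as a class is a simplification made for brevity.

```python
# Toy class hierarchy (child -> parent). All class names are hypothetical.
SUBCLASS_OF = {
    "Seychelles": "TaxHaven",
    "TaxHaven": "Jurisdiction",
    "OtherJurisdiction": "Jurisdiction",
    "HotelAddress": "HotelOrResidenceAddress",
    "ResidenceAddress": "HotelOrResidenceAddress",
    "HotelOrResidenceAddress": "Address",
    "OfficeAddress": "Address",
}


def subsumes(general: str, specific: str) -> bool:
    """True if `general` is `specific` itself or one of its ancestors in the hierarchy."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = SUBCLASS_OF.get(current)
    return False


def matches_pattern(record: dict) -> bool:
    """Elementary pattern: hotel or residence address in a tax haven,
    written in several spellings and linked to several entities."""
    return (
        subsumes("TaxHaven", record["jurisdiction"])
        and subsumes("HotelOrResidenceAddress", record["address_type"])
        and record["spellings"] > 1
        and record["entities"] > 1
    )


# Invented records, for illustration only.
RECORDS = [
    {"jurisdiction": "Seychelles", "address_type": "HotelAddress", "spellings": 12, "entities": 85},
    {"jurisdiction": "OtherJurisdiction", "address_type": "HotelAddress", "spellings": 3, "entities": 5},
    {"jurisdiction": "Seychelles", "address_type": "OfficeAddress", "spellings": 1, "entities": 1},
]

if __name__ == "__main__":
    print([matches_pattern(r) for r in RECORDS])
```

The point of the sketch is that the query is written once against general categories, and the ontology decides which concrete jurisdictions and address types fall under them. Refining the ontology then changes which cases are retrieved without rewriting the query.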
QCA (Ragin 1987 ; Rihoux 2006) pursues this comparative ambition but differs somewhat from analytic induction (Hammersley & Cooper 2012). As a data analysis technique, it consists in determining, through Boolean calculation, the combinations of variables a dataset logically supports (Romme 1995), and then applying logical rules of inference to determine which descriptive inferences or implications the data support. QCA is less inductive than knowledge graph (KG) mining, as it delimits a priori the set of relevant variables contained in the data, but it provides a strong basis for comparative research. The main question it raises for KG processing concerns the background knowledge on the basis of which the number of relevant variables is delimited. This requires multidisciplinary reflection on the domain ontologies used to logically restrict the variables that should be taken into account to support the calculation.
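The Boolean calculation at the heart of crisp-set QCA can be illustrated with a short Python sketch. It builds a truth table from binary cases, keeps the configurations that are always associated with the outcome, and applies a single Boolean reduction step. The conditions, the cases and the function names are invented for illustration, and a full analysis would rely on a complete minimization procedure rather than this single pairwise step.

```python
from collections import defaultdict

# Hypothetical binary conditions (1 = present, 0 = absent).
CONDITIONS = ("tax_haven", "hotel_address", "variant_spellings")

# Invented cases: (condition values, outcome).
CASES = [
    ((1, 1, 1), 1),
    ((1, 1, 0), 1),
    ((1, 1, 1), 1),
    ((1, 0, 0), 0),
    ((0, 1, 1), 0),
]


def truth_table(cases):
    """Map each observed configuration to the share of its cases showing the outcome."""
    outcomes = defaultdict(list)
    for configuration, outcome in cases:
        outcomes[configuration].append(outcome)
    return {c: sum(o) / len(o) for c, o in outcomes.items()}


def sufficient_configurations(cases):
    """Configurations whose cases all show the outcome (consistency of 1.0)."""
    return sorted(c for c, consistency in truth_table(cases).items() if consistency == 1.0)


def merge(a, b):
    """Boolean reduction step: if a and b differ on exactly one condition, drop it (None)."""
    differing = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(differing) != 1:
        return None
    return tuple(None if i == differing[0] else x for i, x in enumerate(a))


def to_expression(configuration):
    """Upper case marks presence, lower case absence; dropped conditions are omitted."""
    return "*".join(
        name.upper() if value else name.lower()
        for name, value in zip(CONDITIONS, configuration)
        if value is not None
    )


if __name__ == "__main__":
    first, second = sufficient_configurations(CASES)
    print(to_expression(merge(first, second)))
```

In this toy dataset the two sufficient configurations differ only on the third condition, so they reduce to the presence of the first two. The choice of which conditions enter the table in the first place is made before any calculation, which is exactly the a priori delimitation discussed above.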
Categories that correspond to semantic data, an infinite field of possibilities, are also used as variables in statistical social science methods such as MCA, which measures geometric distances between the projected qualitative variables of a dataset (Le Roux & Rouanet 2004 ; Greenacre & Blasius 2006). MCA practitioners most often use datasets they produce themselves. Carrying out this kind of analysis on online datasets requires semantic techniques to set up a disjunctive table, which cannot be constructed without a sharp understanding of the way raw data may be logically and iteratively reframed into analytical categories by the investigator.
In these three traditions of quali-quantitative analysis, new semantic techniques to visualize and investigate open data are more than needed. Improving investigative methods for open data requires advanced research across subfields of the social sciences and subfields of computer science. Conducting smart inquiries based on smart data is a big challenge for contemporary social research.
A disjunctive table results from the breakdown of a table defined by n observations and q categorical variables into a table defined by n observations and p indicators, where p is the sum of the numbers of categories of the q variables. Each variable is decomposed into a sub-table with as many columns as the variable has categories, where column k contains 1 for the observations corresponding to the k-th category and 0 for the other observations.
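As an illustration of this definition, the following Python sketch builds a disjunctive table from a handful of observations. The observations, variable names and categories are invented, and the function `disjunctive_table` is a hypothetical helper written for this example only.

```python
def disjunctive_table(observations, variables):
    """Return (columns, rows): one 0/1 indicator column per (variable, category) pair."""
    # Categories are sorted so that the column order is deterministic.
    categories = {v: sorted({obs[v] for obs in observations}) for v in variables}
    columns = [(v, c) for v in variables for c in categories[v]]
    rows = [[1 if obs[v] == c else 0 for v, c in columns] for obs in observations]
    return columns, rows


# Invented observations: n = 3 observations, q = 2 categorical variables.
OBSERVATIONS = [
    {"jurisdiction": "J1", "address_type": "hotel"},
    {"jurisdiction": "J2", "address_type": "residence"},
    {"jurisdiction": "J1", "address_type": "office"},
]
VARIABLES = ["jurisdiction", "address_type"]

if __name__ == "__main__":
    columns, rows = disjunctive_table(OBSERVATIONS, VARIABLES)
    print(columns)
    for row in rows:
        print(row)
```

Here p = 2 + 3 = 5 indicator columns, and each row sums to q = 2, since every observation takes exactly one category per variable. The delicate part in practice is not this mechanical step but the prior decision about which raw values count as the same category.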
Ackland, R. (2013). Web social science: Concepts, data and tools for social scientists in the digital age. Sage.
Adamic L.A., Glance N. (2005) The political blogosphere and the 2004 US elections: divided they blog, In Proceedings of the 3rd International Workshop on Link Discovery, pp 36‐43, 2005.
Adamic, L. A., Zhang, J., Bakshy, E., & Ackerman, M. S. (2008, April). Knowledge sharing and yahoo answers: everyone knows something. In Proceedings of the 17th international conference on World Wide Web (pp. 665‐674). ACM.
Adar, E., Zhang, L., Adamic, L. A., & Lukose, R. M. (2004, May). Implicit structure and the dynamics of blogspace. In Workshop on the weblogging ecosystem (Vol. 13, No. 1, pp. 16989‐16995).
Bakshy, E., Messing, S., & Adamic, L. A. (2015). Exposure to ideologically diverse news and opinion on Facebook. Science, 348(6239), 1130‐1132.
Calderaro, A. (2007) Empirical analysis of political spaces on the Internet. The role of mailing‐lists Communication in the 1992 Presidential Campaign: A content analysis of the Bush, Clinton and Perot Computer Lists, Communication Research Reports 13, 1996, 138‐146.
Chen, Y. V., Qian, Z. C., Woodbury, R., Dill, J., & Shaw, C. D. (2014) ‘Employing a parametric model for analytic provenance’ ACM Transactions on Interactive Intelligent Systems (TiiS), 4(1), 6.
Dill J. (2015) ‘Future Directions in Computer Graphics and Visualization: From CG&A’s Editorial Board’, Computer Graphics and applications, 35, 1: 20 – 32
Drezner D.W., Farrell H., The Power and Politics of Blogs, In American Political Science Association, Chicago, USA, 2004
Eggert, N., Pavan, E. (2014). Researching Collective Action Through Networks: Taking Stock and Looking Forward. Mobilization: An International Quarterly, 19(4), 363‐368.
Falciani H. (2015) Séisme sur la planète finance. Au cœur du scandale HSBC, Paris, Cahiers Libres, La Découverte.
Gerlitz, C., & Helmond, A. (2013). The like economy: Social buttons and the data‐intensive web. New Media & Society, 15(8), 1348‐1365.
Gray, J., Chambers, L. & Bounegru, L. (2012). The data journalism handbook. O’Reilly Media, Inc.
Greenacre M., Blasius J.(editors) (2006). Multiple Correspondence Analysis and Related Methods. London: Chapman & Hall/CRC.
Hammersley M. (2004) ‘Analytic induction’, in Lewis‐Beck, M. et al. (eds) The Sage Encyclopedia of Social Science Research Methods, Thousand Oaks CA, Sage.
Hookway, N. (2008). ‘Entering the blogosphere’: some strategies for using blogs in social research. Qualitative Research, 8(1), 91‐113.
Kavitha D., Manikyala Rao B.V., Kishore Babu V. (2011). A Survey on Assorted Approaches to Graph Data Mining. International Journal of Computer Applications (0975 – 8887) Vol. 14, N°1, January 2011.
Kielman, J. and Thomas, J. (Guest Eds.) (2009). “Special Issue: Foundations and Frontiers of Visual Analytics”. in: Information Visualization, Volume 8, Number 4, Winter 2009 Page(s): 239‐314.
King, G. (2014). Restructuring the social sciences: reflections from Harvard’s Institute for Quantitative Social Science. PS: Political Science & Politics, 47(01), 165‐172.
Kosara R. (2007). Visual Analytics. Fall 2007. (http://www.viscenter.uncc.edu/courses/visanalytics.html). ITCS 4122/5122.
Kramer, A. D., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive‐scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24), 8788‐8790.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: traps in big data analysis. Science, 343(6176), 1203-1205.
Le Roux B., Rouanet H. (2004) Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis. Dordrecht, Kluwer.
Min C., Ebert, D., Hagen, H.; Laramee, R.S., van Liere, R., Ma, K.‐L., Ribarsky, W., Scheuermann, G. and Silver, D., “Data, Information, and Knowledge in Visualization”, Computer Graphics and Applications, IEEE, 1:12‐19, 2009
Nguyen, B., Vion, A., Dudouet, F. X., Colazzo, D., Manolescu, I., & Senellart, P. (2011). XML content warehousing: Improving sociological studies of mailing lists and web data. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 112(1), 5‐31.
Papacharissi Z., Democracy online : civility, politeness, and the democratic potential of online political discussion groups, New Media and Society, Vol 6(2), pp254‐283, 2004.
Pike, W. A., Stasko, J., Chang, R., O’connell, T. A. (2009) ‘The science of interaction’, Information Visualization, 8(4), 263‐274.
Ragin C.C., Becker H.S. (1992) What is a case ? Exploring the foundations of social inquiry, Cambridge, Cambridge University Press.
Ribarsky W., Fisher B., Pottenger W. (2009) ‘Science of Analytical Reasoning’, Information Visualization, December 1, 2009 8: 254‐262.
Riche N.H. (2015) ‘Data‐Driven Discoveries: Pushing Visualization Research Further’, Computer Graphics and Applications, IEEE, 1: 42 – 43
Rihoux, Benoît (2006), “Qualitative Comparative Analysis (QCA) and Related Systematic Comparative Methods: Recent Advances and Remaining Challenges for Social Science Research”, International Sociology, 21 (5): 679, doi:10.1177/0268580906067836
Robinson, W. S. (1951). The logical structure of analytic induction. American Sociological Review, Vol 16, no 6, pgs 812‐818
Rogers, R., & Marres, N. (2016). Landscaping climate change: A mapping technique for understanding science and technology debates on the World Wide Web. Public Understanding of Science.
Romme, A.G.L. (1995), Self‐organizing Processes in Top Management Teams: A Boolean Comparative Approach. Journal of Business Research 34 (1): 11‐34.
Roth, C., & Cointet, J. P. (2010). Social and semantic coevolution in knowledge networks. Social Networks, 32(1), 16‐29.
Ruppert, E., Law, J. & Savage, M. (2013). ‘Reassembling social science methods: The challenge of digital devices’, Theory, Culture & Society, 30(4).
Sano, Y., Yamada, K., Watanabe, H., Takayasu, H., & Takayasu, M. (2013). Empirical analysis of collective human behavior for extraordinary events in the blogosphere. Physical Review E, 87(1), 012805.
Travers, J., & Milgram, S. (1967). The small world problem. Psychology Today, 1, 61-67.
Viard, T., Latapy, M., & Magnien, C. (2016). Computing maximal cliques in link streams. Theoretical Computer Science, 609, 245‐252.
Von Landesberger, T., Kuijper, A., Schreck, T., Kohlhammer, J., van Wijk, J. J., Fekete, J. D., & Fellner, D. W. (2011, September). Visual analysis of large graphs: state‐of‐the‐art and future research challenges. In Computer graphics forum (Vol. 30, No. 6, pp. 1719‐1749). Blackwell Publishing Ltd.
Watts, D. J. (1999). Networks, dynamics, and the small-world phenomenon. American Journal of sociology, 105(2), 493-527.
Zhang, J., Ackerman, M. S., & Adamic, L. (2007, May). Expertise networks in online communities: structure and algorithms. In Proceedings of the 16th international conference on World Wide Web (pp. 221‐230). ACM.
Znaniecki F. (1934). The method of sociology, New York, Farrar and Rinehart.
Antoine Vion (email@example.com) is a senior lecturer in social sciences at Aix-Marseille University and a member of the LEST, Institute of Labour Economics and Industrial Sociology (CNRS). After publishing studies on the Europeanization of cities, standards and business elites, he now specializes in corporate networks from the perspective of transnational business networks and tax evasion networks. He has long-standing experience of cooperation with computer scientists (Webstand project) and coordinates the VISO project with the help of the LSIS (an Aix-Marseille University-CNRS computer science research centre) and Derivo GmbH.