DATA ANALYSIS: SOCIAL NETWORKS

Only connect

Claim-specific citation network, from Greenberg, on distortions and unfounded authority in medical literature

Felix Grant looks at the application of data analysis software to social networks

Scientific Computing World: June/July 2010

Once upon a time I spent several weeks, completely alone (long story; don’t ask), on top of a mesa with a precarious microecology, surrounded by desert. My only company was the local wildlife: mostly lizards, small rodents, and an unlikely colony of feral cats at the top of the food chain.

Those cats, as the only visible social species, became the focus of my attention for much of the time – particularly the complex network of behavioural relations that bound their tribe together. There were mentor/mentee relations. There were sibling and parent/offspring relations. There were hunting partners. There were bonded sexual pairings and temporary liaisons. There was a power hierarchy. And there was a rich web of what I could only describe as ‘best friend’ relations. Then there was also a small number of maverick individuals who appeared to be entirely outside all of this, part of the tribe, but operating alone, connected to the main only by apparently random acts of altruism or brigandism.

Studying systems of inter-relational linkage was, at that time, an important part of my day job. But computing resources were scarce in those days, and sheer volume of content and association limited what could realistically be done; things are very different today. While analysis of social networks stretches back at least to the 1940s (arguably to the late 19th century), and social network analysis (SNA) as a formal field has been well established in sociology (to which I shall return, shortly) for decades, ‘it has only recently been discovered by behavioural biologists as a useful tool in the study of animal behaviour’, to quote Amelia Coleing in Bioscience Horizons just over a year ago. Coleing goes on to observe that ‘...methods devised to measure social complexity in studies of animal behaviour... often reflect the social relationships between individuals indirectly... social network analysis provides formal descriptors... and by providing quantitative measures... allows testing of statistical models about relationships and structure’.

A network map for a small subset of the feral mesa cat tribe is begun in Simile

Coleing emphasises the use of SNA as a tool particularly suitable for studying small, captive populations of obligate social species in restricted territory, her own work focussing on 10 elephants in a zoo. Each of those specifications makes for better control of data capture and, therefore, better data quality. Interconnection complexity increases rapidly with population size, populations that are wild or spatially unconfined may be impossible to observe continuously in toto, and social relations that are not obligate introduce numerous uncertainties.
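How quickly interconnection grows can be put in rough numbers. The sketch below is my own illustration, not anything from Coleing's paper: it assumes the simplest case of one undirected relation type and counts the possible pairings among n actors, n(n-1)/2. The function name is purely illustrative.

```python
def possible_edges(actors):
    """Number of distinct undirected pairings among a given number of actors."""
    return actors * (actors - 1) // 2

# Ten actors, as in the zoo elephant group.
small_group = possible_edges(10)

# 110 actors, the size of the wild meerkat population discussed in this article.
large_group = possible_edges(110)

print(small_group, large_group)
```

An elevenfold increase in population gives well over a hundredfold increase in potential edges, before any allowance is made for direction or for multiple interaction types.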

This doesn’t, however, mean that SNA methods are not more widely applicable; their widest application, in fact, has been to less controlled circumstances. One study by Julian Drew, for example, seeks to learn lessons about tuberculosis transmission by using SNA techniques to examine the relation between infection and social interaction among wild meerkats. The population here contains five interacting tribes comprising a total of 110 ‘actors’ (for a short glossary of terms such as actor, see box: Lexicon), the network recording five types of interaction both within and between groups. From the study emerge indicators suggesting that: ‘Contrary to predictions, the most socially interactive animals were not at highest risk... type and direction of interactions must be considered.’ In particular, ‘meerkats that groomed others most were more likely to become infected than individuals who received high levels of grooming... receiving, but not initiating, aggression was associated with M. bovis infection’.
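The point about type and direction is easy to make concrete. The following is a minimal sketch under my own assumptions: the observation records are invented for illustration (they are not data from the meerkat study), and the function name is hypothetical. It simply tallies, for one interaction type at a time, how often each actor initiates and how often each receives.

```python
from collections import Counter

# Hypothetical observation records: (initiator, receiver, interaction type).
# Invented for illustration; not data from the meerkat study.
observations = [
    ("A", "B", "groom"),
    ("A", "C", "groom"),
    ("B", "A", "aggression"),
    ("C", "A", "aggression"),
    ("C", "B", "groom"),
]

def directed_counts(records, kind):
    """Count how often each actor initiates and receives one interaction type."""
    given = Counter()
    received = Counter()
    for initiator, receiver, interaction in records:
        if interaction == kind:
            given[initiator] += 1
            received[receiver] += 1
    return given, received

groom_given, groom_received = directed_counts(observations, "groom")
aggr_given, aggr_received = directed_counts(observations, "aggression")
```

An undirected tally would show actor A simply as the busiest animal; keeping direction and type apart shows that A grooms others most and receives all of the aggression, which is the kind of distinction the study's conclusions rest on.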

Epidemiology is, in general, a fruitful area of application for SNA. One group in Eire is in the early stages of applying methods similar to Drew’s across the boundary between a wild population and a captive one, to investigate the hypothesised transmission of TB from badgers to farmed cattle. Martínez-López et al discuss the use of SNA in preventative veterinary medicine. And epidemiology brings me neatly back to human populations, where SNA is applied to exploration of specific geometries of disease distribution and propagation.

Two different illustrations of network relations among elephants at Chester Zoo, studied by Amelia Coleing. On the left, a family tree showing two separate consanguineal subsets. On the right, a weighted mapping of social play

SNA originated as a means for application of scientific computing to quantitative sociological analysis, and remains of major importance there. It sees heavy use in studies of business models and strategies, but is also applied intensively to military, criminological and antiterrorist studies. It also sees service across a remarkable range of other fields too numerous to even summarise – including areas of human/animal interaction such as farming and, alongside other scientific computing-based methods such as agent-based modelling and fractal analysis, animal welfare.

The hardcore social network analysis fraternity have particular software tools that they favour. ‘UCINet is the dominant app, with a lot written in R,’ says Barry Wellman (see section: Barry Wellman and INSNA), while ‘ORA is used heavily by the intelligence and law enforcement community’. Further out, however, the same methods are being implemented using a wide variety of other scientific computing software. Other network study products, not specifically aimed at SNA, are often used. Generic computer algebra suites are popular too. Among high-level programming languages, several have their adherents and LISP seems to be particularly favoured. Statistics packages, especially the heavyweights, can do the analysis although they may well not have built-in tools for graphical network display. All of these alternatives are also widely used in tandem with specific SNA software – Madden et al, for example, describe use of both UCINet and GenStat to study different aspects of interaction within a meerkat population.

UCINet and ORA offer all the obvious advantages of dedicated and integrated study environments for those for whom the majority of study time is spent in SNA-related activity. Both handle large numbers of nodes, though the theoretical maxima are in practice limited by rapidly slowing execution and memory consumption as complexity increases – expect to invest in hardware power if you are planning to use either to the full.

Interdisciplinary analysis, by Bellanca, of coauthorship by research focus at the University of York

The applicability of generic network analysis products is obvious, especially for anyone coming to SNA in collaboration with colleagues having a previous network background in a different area – and much the same is true of many modelling systems. This is the case with one group of medical researchers who, exploring impacts of consanguinity on a range of health issues in a small isolated community, have recruited a passing agronomist who is building virtual worlds for them in Simile, the modelling program from Simulistics. Simile is not designed for such work, but does provide building blocks for the core concepts together with powerful tools for rapidly controlling and investigating complex scenarios with varying edge weights.

Wellman himself coauthored a paper on the application of SPSS, just over a decade ago, and in a recent issue of INSNA journal Connections Alan Ellis deals with a particular use of SAS for the calculation of betweenness centrality. Betweenness centrality is a measure used to represent the degree of potential significance to network interaction (in various forms) of an actor who occupies a position on geodesics connecting other actors. The algorithm Ellis presents could, as he points out, be generalised to a number of other measures, but examination of its structure shows that it could also be implemented in other language-based statistical software environments. Two colleagues have, by way of illustration, sketched out in principle how it might be ported to GenStat, Statistica or Systat.
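To show what the measure involves, here is a minimal sketch in Python. It is not Ellis's SAS algorithm, which I have not reproduced; it follows the well-known breadth-first approach usually credited to Brandes, and assumes an undirected, unweighted network held as a dictionary of neighbour sets. The five-cat chain is a hypothetical example.

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalised betweenness centrality for an undirected, unweighted graph.

    adjacency maps each actor to the set of actors it shares an edge with.
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search, counting geodesics from the source.
        visit_order = []
        predecessors = {v: [] for v in adjacency}
        paths = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Accumulate dependencies, working back from the farthest actors.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(visit_order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair of actors has been counted from both ends.
    return {v: s / 2.0 for v, s in score.items()}

# Hypothetical chain of five cats: A - B - C - D - E
chain = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
centrality = betweenness(chain)
```

In the chain, the middle cat C lies on the geodesics between four pairs of other cats, B and D on three each, and the two end cats on none, which is what the scores report.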

A straw poll of mathematical epidemiologists who happened to cross my path suggests that Maple is widely used for their SNA purposes, while military and police intelligence analysts tend to use Mathematica. Those are only generalisations, however; there is a lot of overlap, and both groups contain a lot of MATLAB fans as well.

The best-known police use of social network data is the data mining of ANPR (automatic number plate recognition) from heterogeneous CCTV sources in the UK for associational links. As one officer put it to me: ‘We are usually less interested in who is going where than who visits whom and who consistently travels with whom.’ Vehicles of interest are nodes. Two such vehicles that travel a particular route in tandem, or visit the same locations, are connected by an edge. Geodesics formed from the edges, or cliques that share edges, are flagged and analysed for what they may reveal of underlying associations.
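The construction of such a network from raw sightings can be sketched in a few lines. What follows is an illustration under my own assumptions, not a description of any real police system: the sighting records, the two-minute window, the minimum count and the function name are all hypothetical. Two vehicles gain an edge when they pass the same camera within the window often enough.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sightings: (vehicle id, camera id, time in minutes).
sightings = [
    ("V1", "cam1", 0), ("V2", "cam1", 1),
    ("V1", "cam2", 10), ("V2", "cam2", 11),
    ("V3", "cam2", 45),
    ("V1", "cam3", 20), ("V2", "cam3", 22),
]

def co_travel_edges(records, window=2, min_count=2):
    """Weight each vehicle pair by how often both pass one camera within the window."""
    by_camera = {}
    for vehicle, camera, minute in records:
        by_camera.setdefault(camera, []).append((vehicle, minute))
    weights = Counter()
    for passes in by_camera.values():
        for (v1, t1), (v2, t2) in combinations(passes, 2):
            if v1 != v2 and abs(t1 - t2) <= window:
                weights[frozenset((v1, v2))] += 1
    # Keep only pairs seen together often enough to suggest association.
    return {pair: n for pair, n in weights.items() if n >= min_count}

edges = co_travel_edges(sightings)
```

Here V1 and V2 are linked by an edge of weight three, while V3, which merely used one of the same cameras much later, remains unconnected. The threshold is what separates consistent travelling companions from coincidence.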

There is increasing talk in the police (and the security services, and the military, both of which have access to ANPR databases for counterterrorism purposes) of extending the same system to embrace other types of node. A vehicle, whether identified by number or chip, can be driven by anyone, so nodes that correspond to individuals represent far higher data quality and precision. Facial recognition systems that can scan crowds are not at the capacity level of ANPR as yet, but are progressing. Biometric passports provide data at a much lower spatial resolution, biometric identity cards potentially much higher. Devices other than vehicles, such as mobile phones and bank cards, fill out the picture.

A Wolfram demonstration, by Phillip Bonacich, of clique location in networks

Fingerprint and other biometric logging or access systems provide further input, and not only for policing. One head teacher described to me how SNA methods applied to an automated register system led to a ‘protection racket’ run within a school by bullies being identified and broken up. The same head also mentioned discipline of a staff member for ethically questionable analysis of complex friendship patterns among pupils.

The technological facilitation of social interaction makes some types of SNA particularly easy to tackle. Mobile phone monitoring was mentioned above, but less Orwellian uses for them can be found. One academic is currently monitoring dozens of voluntary GPS-tracked mobile phones and their itemised bills belonging to students who have signed up to a collective study of their own cohort. Another is conducting experiments using unique meme transmission through a large (greater than 10,000 nodes) student body as a network revelation tracer mechanism. A quick and cursory analysis of spam and chain emails turning up on my system reveals strong associational networks among thousands of people (or, more accurately, their email and IP addresses) of whom I have never heard. Social network systems of the Facebook or Twitter variety reveal nodes, edges and all the rest to anyone who cares to look.

One of the networks most easily visible to any researcher is that of interrelated citations, and they have attracted numerous studies with concerns as wide as analysis of differences between Chinese and Western academic systems or as narrow as collaboration within a faculty or even seminar group, with exploration of impacts by a specific field of research (for example, human information behaviour) falling somewhere between the two. Systems such as Delicious or Elsevier’s 2collab provide ready material for this sort of investigation.

From origination in sociology to application in widely disparate fields, social network analysis underpinned by ever growing, ever cheaper and ever more ubiquitous computing infrastructures is remarkably versatile in providing scientific means of investigation for a whole raft of disciplines. That range is already impossible to summarise, and looks set to grow exponentially. SNA as a tool of metaresearch into research process is also expanding; an interesting study would be the social network analysis of the cross-disciplinary links wrought by SNA itself.

Barry Wellman and INSNA

Barry Wellman is a sociologist by background, and a faculty member of more than 30 years’ standing in that discipline at the University of Toronto, but works intensively with information and communication technology (ICT) specialists. He played a central rôle in the development of SNA methods, and has focused his interest on those social networks that are facilitated by technological ones such as the internet. He has, as one commentator put it, ‘devoted an entire career to exploring and documenting natural social worlds in network terms’. A prominent feature of his research is the ways in which ICT mediation is changing the relation of social networks to physical loci.

Wellman founded the field’s main professional association, the International Network for Social Network Analysis (INSNA) which, among other things, serves as a focus for SNA work and workers, maintains a ListServ discussion forum, and publishes Connections, a peer-reviewed online journal whose archive is available for public access.

References and sources

For a full list of the sources and references cited in this article, please visit www.scientific-computing.com/features/referencesjun10.php

Lexicon

Networks look slightly different from their various fields of application, with nomenclature varying considerably across the literatures. My own background inclines me to nodes, arcs and regions, but SNA usually talks of actors and edges, with regions usually ignored. Here, with feline illustrations, is a quick tour of the most common terms.

Actor (node, vertex): basic network element, drawn as a dot or as a hollow shape. In my feral colony, each cat would be an actor. Sometimes coloured or otherwise coded to represent attributes, for example red for male, blue for female, and/or square for mentor and circular for mentee.

Edge (arc): connection, drawn as a line, between two actors. A partnered hunting pair of cats would be represented as two actors connected by one edge. A weighted edge has a numerical value associated with it – perhaps the food yield productivity of their partnership. A directed edge indicates the sense of a relation – a mentor or parent relation would be directed, but a sibling relation would not. Multiplicity is more than one edge connecting the same pair of actors, such as a sexually bonded pair, which also hunts together. A navigable consecutive sequence of actors and edges, beginning and ending with actors, is a walk. For example, if Cat A is the mentor of Cat B, and Cat B is the mother of Cat C, there is a walk from A to C comprising two edges (the length of the walk is said to be two) and three actors. In this example, the walk is also a trail, since no edge is traversed more than once. The shortest walk connecting two specified nodes is a geodesic.
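The walk and geodesic definitions above translate directly into a breadth-first search. This is a minimal sketch, assuming directed, unweighted edges held as (source, target) pairs; the function name is illustrative, and the cats are those of the mentor and mother example.

```python
from collections import deque

def geodesic(edges, start, goal):
    """Return one shortest walk from start to goal as a list of actors, or None."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
    previous = {start: None}
    queue = deque([start])
    while queue:
        actor = queue.popleft()
        if actor == goal:
            # Rebuild the walk by following the chain of predecessors.
            walk = []
            while actor is not None:
                walk.append(actor)
                actor = previous[actor]
            return walk[::-1]
        for neighbour in successors.get(actor, []):
            if neighbour not in previous:
                previous[neighbour] = actor
                queue.append(neighbour)
    return None

# Cat A mentors Cat B; Cat B is the mother of Cat C.
relations = [("A", "B"), ("B", "C")]
walk = geodesic(relations, "A", "C")
length = len(walk) - 1  # length counts edges, not actors
```

Because the edges are directed, there is a geodesic of length two from A to C but none from C back to A.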

Order is the number of actors in a network or clique, while size is the number of edges. A clique is a group of actors in which every actor has a connection to every other. Every kitten in a litter, for example, is an actor with an identical (sibling relation) edge linking it to every other kitten in that same litter. The order of a clique is the number of actors connected by it, a k-clique being a clique of order k.
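The litter example can be checked mechanically. The sketch below is illustrative only, with a hypothetical litter of four kittens and undirected sibling edges; it tests the clique condition as defined here (every actor connected to every other) and reports order and size.

```python
from itertools import combinations

def is_clique(actors, edges):
    """True if every pair of the given actors is joined by an edge."""
    edge_set = {frozenset(edge) for edge in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(actors, 2))

# Hypothetical litter of four kittens, each a sibling of every other.
litter = ["K1", "K2", "K3", "K4"]
sibling_edges = list(combinations(litter, 2))

order = len(litter)        # number of actors
size = len(sibling_edges)  # number of edges

print(is_clique(litter, sibling_edges), order, size)
```

Remove any one sibling edge and the group is no longer a clique, though its order is unchanged.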