Facebook network analysis

This example demonstrates how to use the worker as a workflow system to load graph data, perform analyses and transformations of the data using NetworkX, and then visualize the result using d3.js.

In this example we will:
  1. Obtain a set of Facebook data
  2. Find the most “popular” person in our data
  3. Find the subgraph of the most popular person’s neighborhood
  4. Visualize this neighborhood using d3

Obtain the dataset

The dataset is a small sample of Facebook links representing friendships, which can be obtained here [1].

The data we’ll be using is in a format commonly used when dealing with graphs, referred to as an adjacency list. The worker supports using adjacency lists with graphs out of the box.

Note

A full list of the supported types and formats is documented in Types and formats.

Here is a sample of what the data looks like:

86      127
303     325
356     367
373     404
475     484

Each integer represents an anonymized Facebook user. Two users appearing on the same line of the adjacency list share a symmetric relationship (a friendship) in our undirected graph.
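To make the symmetry concrete, here is a minimal sketch that parses the five sample lines with NetworkX’s parse_adjlist. It runs outside the worker, which performs this conversion on its own; the variable names are illustrative only.

```python
import networkx as nx

# The five sample lines shown above, whitespace-separated.
sample = """86      127
303     325
356     367
373     404
475     484"""

# parse_adjlist builds an undirected Graph by default; node labels stay strings.
G = nx.parse_adjlist(sample.splitlines())

print(G.number_of_nodes(), G.number_of_edges())
# In an undirected graph an edge can be looked up from either endpoint.
print(G.has_edge("86", "127"), G.has_edge("127", "86"))
```

Each line contributes one edge here because every line lists a user followed by a single neighbor.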

Build a workflow

Create a file named workflow.py; this is the file we’ll use to build our workflow.
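The definition of the most_popular task, which later steps refer to, is not reproduced in this section. As a sketch reconstructed from the "most_popular" step of the JSON specification further down, it might look like the following. The dict() wrapper is an addition that does not appear in that specification; it is included on the assumption that newer NetworkX releases return a view rather than a dict from degree(). The small graph is a stand-in for the Facebook data.

```python
import networkx as nx

# Tiny stand-in graph: "a" has three friends, everyone else has one.
G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d")])

# dict() keeps this working whether degree() returns a dict (older NetworkX)
# or a degree view (newer NetworkX). This wrapper is an assumption, see above.
degrees = dict(nx.degree(G))
most_popular_person = max(degrees, key=degrees.get)

# Task spec mirroring the "most_popular" step of the JSON specification.
# G is passed through as an output so the next step can consume it.
most_popular_task = {
    'inputs': [
        {'name': 'G', 'type': 'graph', 'format': 'networkx'}
    ],
    'outputs': [
        {'name': 'most_popular_person', 'type': 'string', 'format': 'text'},
        {'name': 'G', 'type': 'graph', 'format': 'networkx'}
    ],
    'script': """
from networkx import degree

degrees = dict(degree(G))
most_popular_person = max(degrees, key=degrees.get)
"""
}
```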

Find the neighborhood

Now that we have the most popular node in the graph, we can take the subgraph that includes only this person and all of their neighbors. Such subgraphs are sometimes referred to as ego networks.

from networkx import ego_graph

subgraph = ego_graph(G, most_popular_person)
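As a small illustration of what ego_graph keeps and discards, the sketch below uses a hypothetical four-person graph rather than the Facebook data.

```python
import networkx as nx

# Hypothetical friendship graph: "a" is friends with "b" and "c";
# "c" is also friends with "d", who does not know "a".
G = nx.Graph([("a", "b"), ("a", "c"), ("c", "d")])

# The default radius of 1 keeps "a" and its direct neighbors only.
subgraph = nx.ego_graph(G, "a")

print(sorted(subgraph.nodes()))
print(sorted(sorted(edge) for edge in subgraph.edges()))
```

Here "d" is two hops from "a", so it is dropped along with the edge between "c" and "d".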

Again, we can create a task using our new script, like so:

Note

Since these steps are going to be connected, our inputs are going to be the same as the last step’s outputs.

find_neighborhood_task = {
    'inputs': [
        {'name': 'G',
         'type': 'graph',
         'format': 'networkx'},
        {'name': 'most_popular_person',
         'type': 'string',
         'format': 'text'}
    ],
    'outputs': [
        {'name': 'subgraph',
         'type': 'graph',
         'format': 'networkx'}
    ],
    'script':
    """
from networkx import ego_graph

subgraph = ego_graph(G, most_popular_person)
    """
}

Put it together

Conceptually, this is what our workflow will look like:

[Figure: Visualize Facebook Data workflow diagram]

* The format changes because of Girder Worker’s auto-conversion functionality.

The entire rectangle is our workflow, and the blue rectangles are our tasks. The black arrows represent inputs and outputs, and the red arrows represent connections, which we’ll see shortly.

To make this happen, since we’ve written the tasks already, we just need to format this in a way the worker understands.

To start, let’s create our workflow from a high level, starting with just its inputs and outputs (the black arrows):

workflow = {
    'mode': 'workflow',
    'inputs': [
        {'name': 'G',
         'type': 'graph',
         'format': 'adjacencylist'}
    ],
    'outputs': [
        {'name': 'result_graph',
         'type': 'graph',
         'format': 'networkx'}
    ]
}

Now we need to add our tasks to the workflow, which is pretty straightforward since we’ve defined them in the previous steps.

workflow['steps'] = [{'name': 'most_popular',
                      'task': most_popular_task},
                     {'name': 'find_neighborhood',
                      'task': find_neighborhood_task}]

Finally, we need to add the red arrows within the workflow, telling the worker how the inputs and outputs are going to flow from each task. These are called connections in Girder Worker parlance.

workflow['connections'] = [
    {'name': 'G',
     'input_step': 'most_popular',
     'input': 'G'},
    {'output_step': 'most_popular',
     'output': 'G',
     'input_step': 'find_neighborhood',
     'input': 'G'},
    {'output_step': 'most_popular',
     'output': 'most_popular_person',
     'input_step': 'find_neighborhood',
     'input': 'most_popular_person'},
    {'name': 'result_graph',
     'output': 'subgraph',
     'output_step': 'find_neighborhood'}
]
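Because connections refer to steps, inputs, and outputs purely by name, a typo is easy to make. The following is a sketch of a hypothetical helper, check_connections, that is not part of Girder Worker; it only verifies that every name used in a connection is declared somewhere. The task dicts are abbreviated to their names for brevity.

```python
def check_connections(workflow):
    """Return a list of problems found in workflow['connections']."""
    steps = {s['name']: s['task'] for s in workflow['steps']}
    wf_inputs = {i['name'] for i in workflow['inputs']}
    wf_outputs = {o['name'] for o in workflow['outputs']}
    problems = []
    for conn in workflow['connections']:
        # Without 'output_step', the data comes from a workflow-level input.
        if 'output_step' in conn:
            task = steps.get(conn['output_step'], {'outputs': []})
            if conn['output'] not in {o['name'] for o in task['outputs']}:
                problems.append('unknown output in %r' % (conn,))
        elif conn.get('name') not in wf_inputs:
            problems.append('unknown workflow input in %r' % (conn,))
        # Without 'input_step', the data goes to a workflow-level output.
        if 'input_step' in conn:
            task = steps.get(conn['input_step'], {'inputs': []})
            if conn['input'] not in {i['name'] for i in task['inputs']}:
                problems.append('unknown input in %r' % (conn,))
        elif conn.get('name') not in wf_outputs:
            problems.append('unknown workflow output in %r' % (conn,))
    return problems


def io(*names):
    """Build an abbreviated input/output list containing names only."""
    return [{'name': n} for n in names]


workflow = {
    'inputs': io('G'),
    'outputs': io('result_graph'),
    'steps': [
        {'name': 'most_popular',
         'task': {'inputs': io('G'),
                  'outputs': io('most_popular_person', 'G')}},
        {'name': 'find_neighborhood',
         'task': {'inputs': io('G', 'most_popular_person'),
                  'outputs': io('subgraph')}},
    ],
    'connections': [
        {'name': 'G', 'input_step': 'most_popular', 'input': 'G'},
        {'output_step': 'most_popular', 'output': 'G',
         'input_step': 'find_neighborhood', 'input': 'G'},
        {'output_step': 'most_popular', 'output': 'most_popular_person',
         'input_step': 'find_neighborhood', 'input': 'most_popular_person'},
        {'name': 'result_graph', 'output': 'subgraph',
         'output_step': 'find_neighborhood'},
    ],
}

problems = check_connections(workflow)
print(problems)
```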

We now have a complete workflow! Let’s run this, and write the final data to a file.

import girder_worker.tasks

with open('docs/static/facebook-sample-data.txt') as infile:
    output = girder_worker.tasks.run(
        workflow,
        inputs={'G': {'format': 'adjacencylist',
                      'data': infile.read()}},
        outputs={'result_graph': {'format': 'networkx.json'}})

# Text mode, on the assumption that the converted data is a JSON string.
with open('data.json', 'w') as outfile:
    outfile.write(output['result_graph']['data'])

Running workflow.py will produce the JSON in a file called data.json, which we’ll pass to d3.js in the next step.
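One detail worth checking before visualizing: node-link JSON can refer to a link’s endpoints either by position in the nodes list or by node id, and which one NetworkX emits may depend on its version. The force layout in d3 version 3 accepts positions. The sketch below assumes the id-based form and converts it; links_as_indices is a hypothetical helper, and the sample document is made up for illustration.

```python
import json


def links_as_indices(graph):
    """Return a copy of a node-link dict whose links use node positions.

    Assumes every link's 'source' and 'target' currently hold node ids.
    """
    index = {node['id']: i for i, node in enumerate(graph['nodes'])}
    links = []
    for link in graph['links']:
        fixed = dict(link)  # copy so the input is left untouched
        fixed['source'] = index[link['source']]
        fixed['target'] = index[link['target']]
        links.append(fixed)
    return dict(graph, links=links)


# Made-up node-link document with id-based links.
sample = json.loads(
    '{"nodes": [{"id": "86"}, {"id": "127"}],'
    ' "links": [{"source": "86", "target": "127"}]}'
)
converted = links_as_indices(sample)
print(json.dumps(converted))
```

If the links in data.json already hold integer positions, no conversion is needed.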

For completeness, here is the complete workflow specification as pure JSON:

{
  "mode": "workflow",
  "inputs": [
    {
      "type": "graph",
      "name": "G",
      "format": "adjacencylist"
    }
  ],
  "outputs": [
    {
      "type": "graph",
      "name": "result_graph",
      "format": "networkx"
    }
  ],
  "connections": [
    {
      "input": "G",
      "input_step": "most_popular",
      "name": "G"
    },
    {
      "output": "G",
      "input_step": "find_neighborhood",
      "input": "G",
      "output_step": "most_popular"
    },
    {
      "output": "most_popular_person",
      "input_step": "find_neighborhood",
      "input": "most_popular_person",
      "output_step": "most_popular"
    },
    {
      "output": "subgraph",
      "name": "result_graph",
      "output_step": "find_neighborhood"
    }
  ],
  "steps": [
    {
      "name": "most_popular",
      "task": {
        "inputs": [
          {
            "type": "graph",
            "name": "G",
            "format": "networkx"
          }
        ],
        "script": "\nfrom networkx import degree\n\ndegrees = degree(G)\nmost_popular_person = max(degrees, key=degrees.get)\n",
        "outputs": [
          {
            "type": "string",
            "name": "most_popular_person",
            "format": "text"
          },
          {
            "type": "graph",
            "name": "G",
            "format": "networkx"
          }
        ]
      }
    },
    {
      "name": "find_neighborhood",
      "task": {
        "inputs": [
          {
            "type": "graph",
            "name": "G",
            "format": "networkx"
          },
          {
            "type": "string",
            "name": "most_popular_person",
            "format": "text"
          }
        ],
        "script": "\nfrom networkx import ego_graph\n\nsubgraph = ego_graph(G, most_popular_person)\n",
        "outputs": [
          {
            "type": "graph",
            "name": "subgraph",
            "format": "networkx"
          }
        ]
      }
    }
  ]
}

This file can be loaded with Python’s json module and passed directly to girder_worker.tasks.run():

import json

import girder_worker.tasks

with open('docs/static/facebook-example-spec.json') as spec:
    workflow = json.load(spec)

with open('docs/static/facebook-sample-data.txt') as infile:
    output = girder_worker.tasks.run(
        workflow,
        inputs={'G': {'format': 'adjacencylist',
                      'data': infile.read()}},
        outputs={'result_graph': {'format': 'networkx.json'}})

# Text mode, on the assumption that the converted data is a JSON string.
with open('data.json', 'w') as outfile:
    outfile.write(output['result_graph']['data'])
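Going the other way is just as direct: because a workflow is built only from dicts, lists, and strings, it can be serialized with json.dumps. The sketch below uses an abbreviated workflow with empty steps and connections to show that the round trip is lossless.

```python
import json

# Abbreviated workflow: steps and connections are left empty for brevity.
workflow = {
    'mode': 'workflow',
    'inputs': [{'name': 'G', 'type': 'graph', 'format': 'adjacencylist'}],
    'outputs': [{'name': 'result_graph', 'type': 'graph',
                 'format': 'networkx'}],
    'steps': [],
    'connections': [],
}

# Serialize to JSON text, then parse it back into Python objects.
spec_text = json.dumps(workflow, indent=2, sort_keys=True)
restored = json.loads(spec_text)

print(restored == workflow)
```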

Note

More information on Girder Worker tasks and workflows can be found in API documentation.

Visualize the results

Using JavaScript similar to this d3.js example, we’re going to add the following to our index.html file:

<!DOCTYPE html>
<meta charset="utf-8">
<style>
 .node {
     stroke: #fff;
     stroke-width: 1.5px;
 }

 .link {
     stroke: #999;
     stroke-opacity: .6;
 }
</style>
<body>
    <div id="popularity-graph"></div>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>
    <script>
    var width = 700,
    height = 400;

    var force = d3.layout.force()
            .charge(-120)
            .linkDistance(30)
            .size([width, height]);

    var svg = d3.select("#popularity-graph").append("svg")
            .attr("width", width)
            .attr("height", height);

    d3.json("/data.json", function(error, graph) {
        if (error) throw error;

        force
            .nodes(graph.nodes)
            .links(graph.links)
            .start();

        var link = svg.selectAll(".link")
                .data(graph.links)
                .enter().append("line")
                .attr("class", "link")
                .style("stroke-width", function(d) { return 1; });

        var node = svg.selectAll(".node")
                .data(graph.nodes)
                .enter().append("circle")
                .attr("class", "node")
                .attr("r", 5)
                .style("fill", "#1f77b4")
                .call(force.drag);

        node.append("title")
            .text(function(d) { return d.id; });

        force.on("tick", function() {
            link.attr("x1", function(d) { return d.source.x; })
                .attr("y1", function(d) { return d.source.y; })
                .attr("x2", function(d) { return d.target.x; })
                .attr("y2", function(d) { return d.target.y; });

            node.attr("cx", function(d) { return d.x; })
                .attr("cy", function(d) { return d.y; });
        });
    });
    </script>
</body>

This should leave us with a force-directed visualization of the most popular person’s neighborhood.

For the purposes of demonstration, this workflow is of course more verbose than necessary; it could easily have been done with one task. However, by following it you should have learned how to do the following with Girder Worker:

  • Create tasks which consume and produce multiple inputs and outputs
  • Run tasks as part of a multi-step workflow
  • Use the worker’s converter system to serialize the graph in a format JavaScript can read
  • Visualize the data using d3.js
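For comparison, the single-task alternative mentioned above could be sketched as one function that performs both steps. This is plain NetworkX outside the worker; the function name and the small graph are hypothetical, and dict() is used on the assumption that degree() may return a view rather than a dict.

```python
import networkx as nx


def most_popular_neighborhood(G):
    """Return the ego network of the highest-degree node in G."""
    degrees = dict(nx.degree(G))
    most_popular_person = max(degrees, key=degrees.get)
    return nx.ego_graph(G, most_popular_person)


# Hypothetical graph: "a" has three friends, and "d" has one extra friend "e".
G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")])
subgraph = most_popular_neighborhood(G)

print(sorted(subgraph.nodes()))
```

Splitting the work into two tasks costs some verbosity but makes each step reusable and lets the worker handle the format conversions between them.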
[1] For attribution refer here.