Characterizing the Marvel Comic Universe

This project aims to give comic book enthusiasts, along with a significant portion of the general public, a platform for exploring the vast array of Marvel characters. The breadth of that audience is suggested by the popularity of movies such as The Avengers (among the highest-grossing films ever), Spider-Man (5 made since 2002), and X-Men (7 made since 2000)*. Marvel is one of the most popular and prolific comic book publishers in the US: over 42,000 characters are documented in one of our data sources, the Marvel Wikia. Our primary aim is to provide entertainment and enjoyment (as opposed to supporting learning or understanding tasks). Media critics may also be interested in aggregate patterns in the portrayal of different types of characters.

We aim to support two primary user tasks:

  1. Character Lookup: exploration of individual characters and their networks, e.g. Who is this Jessica Jones and why does she have her own Netflix series? When was she introduced in the comics? Who are the characters she appears with most frequently in the comics? How strong and of what type are those connections?
  2. Aggregate Analysis: exploration of trends in the overall distribution of characters (critical media analysis), e.g. Of the top characters by number of appearances, how many are women? Was there a period when more female characters were introduced? Are any of the top-appearing characters from more recent times, or does time in existence always win out?

Visualization Demo

The Visualization


The vis has two modes: circular bar chart and circular network. The bar chart provides a circular axis for reference which is dynamically mapped and created based on the range of the entire selected data set (which may vary due to filtering). Selections are maintained across mode changes to make it easier to compare characters both on their connections and on the attribute charted in the bars.

Captain America in bar chart mode, sorted by gender and then appearances

Network Connections

We wanted to convey more than the fact that two characters are “linked,” so our vis supports a richer model for mapping network connections. Connection strength (number of co-appearances) is represented by line thickness on a linear scale, and connection type by color. The only special type at present is core, which currently covers familial relationships and will be expanded in the future. Because we wanted users to be able to explore both types of connections, there are two types of selection: selected, where all of a character’s connections are shown, and core selected, where only core connections are shown, in greater detail (i.e. with the connected character’s bubble and the number of co-appearances). Because both types of selection can also be multi-selected, the event handling that determines which chords to display is complex.
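The linear mapping from connection strength to line thickness can be sketched as below. This is only an illustration written in Python, not the code behind the vis: the function name `line_thickness` and the pixel range of 1 to 12 are assumptions made for the example.

```python
def line_thickness(co_appearances, max_co_appearances, min_px=1.0, max_px=12.0):
    """Map a co-appearance count linearly onto a stroke width in pixels.

    The pixel range is an assumed default, not a value from the project.
    """
    if max_co_appearances <= 1:
        # Degenerate range: every connection gets the thinnest line.
        return min_px
    # Clamp so out-of-range counts cannot yield negative or oversized widths.
    count = min(max(co_appearances, 1), max_co_appearances)
    fraction = (count - 1) / (max_co_appearances - 1)
    return min_px + fraction * (max_px - min_px)
```

With these assumed defaults, a single co-appearance draws the thinnest line and the strongest connection in the data set draws the thickest, with everything else spaced evenly in between.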

Captain America with all his connections

A connections threshold slider on the left control panel lets users raise the threshold for how strong a connection must be to be displayed in the diagram. After it took 10.5 hours to process the 45,666 inter-connections in the database, we realized the network could quickly turn into a “ball of string” without an additional control, even with core selection. This way users can control the degree of connectedness they would like to see. Core connections are always displayed.
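The filtering rule described here (hide weak connections, but always keep core ones) might look roughly like the following. It is a minimal sketch under assumptions: the `Connection` class and `visible_connections` function are hypothetical names, and treating the threshold as inclusive is a guess rather than something the vis is known to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """Hypothetical record for one link between two characters."""
    source: str
    target: str
    co_appearances: int
    is_core: bool = False


def visible_connections(connections, threshold):
    """Keep core connections plus any connection at or above the threshold."""
    return [
        c for c in connections
        if c.is_core or c.co_appearances >= threshold
    ]


# Made-up sample data, purely for illustration.
sample = [
    Connection("Hero A", "Hero B", 120),
    Connection("Hero A", "Sibling C", 3, is_core=True),
    Connection("Hero A", "Hero D", 8),
]
shown = visible_connections(sample, threshold=10)
```

In this made-up sample, the link to "Hero D" falls below the threshold and is hidden, while the weak but core link to "Sibling C" stays visible.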

Captain America with only core (familial) connections.



I created an HTML parser in Python that pulled character data from three sources:

  1. Official Marvel API
    • Great imagery
    • JSON formatted
  2. Marvel Wikia site
    • 150,000 pages of structured text hand-crafted by enthusiasts
    • Extremely rich but much harder to process
  3. Marvel’s Official site & Wikia as viewer resources
    • Easy to parse information about specific characters
    • Much harder to get a sense of the overall connections + trends

We managed to parse data for 1,402 characters, 1,060 organizations/affiliations (e.g. the Avengers, S.H.I.E.L.D., X-Men, etc.) and 30,179 comics.

Data Processing Complexity

We found the Marvel API to be unreliable in its current beta form: the same query would return varying results over time. To account for this, we created a web scraper for the fan-maintained Wikia site that queried each of the 1,402 characters we had shortlisted for their individual attributes. We created several many-to-many linked tables in SQLite to store characters, their affiliations (e.g. The Avengers Initiative, X-Men, etc.), their relationships with each other (e.g. Family, Other), and the comics they appeared in.
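A many-to-many layout of this kind could be set up along the following lines. This is a sketch only: every table and column name here is an assumption chosen for illustration, and the project's actual schema may differ.

```python
import sqlite3

# Hypothetical schema: three entity tables plus link tables that express
# the many-to-many relationships between them.
SCHEMA = """
CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE affiliations (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE comics (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE character_affiliation (
    character_id INTEGER NOT NULL REFERENCES characters(id),
    affiliation_id INTEGER NOT NULL REFERENCES affiliations(id),
    PRIMARY KEY (character_id, affiliation_id)
);
CREATE TABLE character_comic (
    character_id INTEGER NOT NULL REFERENCES characters(id),
    comic_id INTEGER NOT NULL REFERENCES comics(id),
    PRIMARY KEY (character_id, comic_id)
);
CREATE TABLE relationships (
    character_a INTEGER NOT NULL REFERENCES characters(id),
    character_b INTEGER NOT NULL REFERENCES characters(id),
    kind TEXT NOT NULL,  -- e.g. 'Family' or 'Other'
    PRIMARY KEY (character_a, character_b)
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
conn.execute("INSERT INTO characters (name) VALUES ('Captain America')")
conn.execute("INSERT INTO affiliations (name) VALUES ('Avengers')")
conn.execute("INSERT INTO character_affiliation VALUES (1, 1)")
members = conn.execute(
    "SELECT c.name, a.name FROM characters c "
    "JOIN character_affiliation ca ON ca.character_id = c.id "
    "JOIN affiliations a ON a.id = ca.affiliation_id"
).fetchall()
```

The composite primary keys on the link tables keep the same character from being attached to the same affiliation or comic twice.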

The major challenge with both the Marvel API and the fan Wikia was that all attributes were listed in descriptive paragraphs. We ran the data through several iterations of text parsers to obtain clean attributes. The initial Wikia data was then augmented by calling the Marvel Developer API over 2-3 iterations to account for the API’s erratic responses. Creating the connections between characters on the basis of their appearing together in comic books was an O(n²) operation, and the initial processing took 10.5 hours.
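One way to derive co-appearance counts is to walk through the comics and count every pair of characters within each one, as sketched below. This is not necessarily how our processing worked: the function name and the input shape (a mapping from comic identifier to its cast) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def count_co_appearances(comic_casts):
    """Count, for every pair of characters, the comics in which both appear.

    comic_casts maps a comic identifier to the characters appearing in it.
    """
    pair_counts = Counter()
    for cast in comic_casts.values():
        # Sort and de-duplicate so each pair has one canonical (a, b) key.
        for pair in combinations(sorted(set(cast)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Made-up sample data, purely for illustration.
casts = {
    "comic-1": ["A", "B", "C"],
    "comic-2": ["B", "A"],
    "comic-3": ["C"],
}
counts = count_co_appearances(casts)
```

Counting this way is quadratic in the size of each comic's cast rather than in the total number of characters, which may be cheaper when casts are small, though we have not measured it against the original run.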


  1. Live Project (Webpage)
  2. Marvel Wiki Parser (GitHub)
  3. APIs created in Python’s Django framework (GitHub)
  4. Live API Link (Heroku)

Earlier Design Iterations

The team iterated through several design options before settling on the circular visualization depicted above. Each design is featured below, with the reasoning behind it and the user interaction it enabled. We weighed the pros and cons of each design idea and decision in creating the final visualization.

Rich Set Analysis

Timeline of Event Arcs


Geographic – Sunburst

Circular Relationship Network

Circular Bar Chart