On “A Survey of Scholarly Data Visualization”

Scholarly data are databases containing academic resources such as papers, books, patents, and scientific reports. Now that academics and professionals can store their work and publications in clean, organized databases, there is a growing need to represent the structure of these data sets and reveal the patterns hidden in them. The paper “A Survey of Scholarly Data Visualization”, published by the Institute of Electrical and Electronics Engineers (IEEE), aims to educate fellow academics on how to collect scholarly data and which visualization tools and techniques to use, and it closes with some open issues for readers to research themselves. While the article was written with academics in mind, many of its points and lessons can be confidently applied to data science as well.

As data scientists, we usually view data visualization as a way of portraying data, but it works a little differently with scholarly data. Scholarly data contains not just the usual project data but also a vast amount of metadata: titles, authors, citations, figures, tables, journals, key terms, abstracts, languages, publication dates, organizations, networks, editors, and so on. All of this feeds into what is known as the scholarly network, a network of academics and researchers linked by relations such as co-authoring a paper or citing one another. A major part of the visualization process therefore comes from properly portraying that network, which lets us see how researchers interact with each other, find relationships between researchers hidden in citation networks, observe the impact of funding on an institution, or decide how to allocate resources to different departments.

Framework of scholarly data visualization

Here is a framework that the article produced to showcase the typical workflow of scholarly data visualization. It starts with data collection, moves on to data processing, and ends with data visualization. As one may notice, these are the same steps a typical data scientist goes through when processing data.

The article begins with the scholarly data extraction process. In the age of big data and online storage, more and more documents can be pulled straight from the internet. There are two main choices when collecting data: extracting the scholarly data yourself or using a pre-made dataset. First we will go over data extraction, the methods used to pull information out of raw data. Raw data can be found through online digital libraries or academic search engines, and for each document we want to start by collecting the author, title, abstract, keywords, venue, publisher, and page numbers. A very important detail that is often overlooked is the citations. If a paper cites another paper, the two are linked in a relationship, and in terms of the scholarly network they form an edge. The same goes for authors: when two or more authors co-author a paper, a similar edge is formed between each pair of them.
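The edge-building idea can be sketched in a few lines of Python. This is only a minimal illustration, not the survey's own method: the `papers` records and their field names (`id`, `authors`, `cites`) are hypothetical, and a real extraction step would have to produce something like them first.

```python
from itertools import combinations

# Hypothetical paper records; the field names are assumptions for illustration.
papers = [
    {"id": "P1", "authors": ["Ada", "Ben"], "cites": []},
    {"id": "P2", "authors": ["Ben", "Cy", "Dee"], "cites": ["P1"]},
    {"id": "P3", "authors": ["Ada"], "cites": ["P1", "P2"]},
]

def citation_edges(papers):
    """Directed edges of the form (citing paper, cited paper)."""
    return {(p["id"], cited) for p in papers for cited in p["cites"]}

def coauthor_edges(papers):
    """Undirected edges between every pair of authors who share a paper."""
    edges = set()
    for p in papers:
        # Sorting gives each pair one canonical order, so (a, b) == (b, a).
        for a, b in combinations(sorted(set(p["authors"])), 2):
            edges.add((a, b))
    return edges

cite = citation_edges(papers)
coauth = coauthor_edges(papers)
print(sorted(cite))
print(sorted(coauth))
```

Citation edges are kept directed because citing is a one-way relation, while co-authorship edges are stored as sorted pairs so the same collaboration is never counted in both directions.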

Pre-made datasets, on the other hand, are quite widespread: as online data servers have grown, scholars have become more willing to share their work. Many academic search engines, digital libraries, and research institutions have made their data public and available to anyone who wants it. Some popular datasets that the article recommends are DBLP (Digital Bibliography and Library Project), APS (American Physical Society), MAG (Microsoft Academic Graph), and ORC (part of the Semantic Scholar project). These datasets should come with all the information listed in the previous paragraph and often more, as they usually contain metadata not just on the document itself but also on the publications and journals it is a part of.

Scholarly Visualization on Popular Datasets

Once we have our data, we can move on to our visualization tools. These tools let users transform the elements of the data into expressive charts and pictures, which makes analysis more effective than working from raw data alone. The article splits its tools into two simple categories: tools that don’t require a programming language and tools that do. We will not go over every tool in depth, but the ones the article lists that do not require programming are Tableau, ICharts, Infogram, Raw Graphs, and Visualize Free. These are recommended for academics who do not have a background in programming, and they are more UI-friendly as a result.

For tools that do require programming, the authors provide two lists. The first consists of JavaScript libraries: D3.js, Chart.js, FusionCharts, FlotCharts, and ZingCharts. The second consists of non-JavaScript tools: Gephi, Nodebox, ggplot2, Processing, and JPGraph.

Using D3.js to visualize the relationships between entities in a scholarly network

While readers of this blog post are free to look further into each tool, the one I will briefly go over is the library D3.js, or D3 for short. D3 is an open source JavaScript library that builds on HTML and CSS and creates graphs in the SVG format. The visual above was created using D3 to show the relationships between different entities in a scholarly network. The visual below showcases its geographic capability by plotting the locations of the institutions (in red) that attended the KDD conference in 2012 (in green).

Geographic Visualization using D3.js

Now that we have both our data and our tool of choice, we can begin the visualization process. Here we will go over each section and provide some information on its importance for visualization.

Visualization of Academic Entities

As stated earlier, when it comes to scholarly data the metadata is just as significant as the data in the document itself. The title, authors, keywords, algorithms, figures, and tables are all vital for scholarly services, and thus important to visualize so that users can understand them.

Relationship between authors

The relationships between authors are important because they make up the building blocks of a scholarly network. Authors’ names, affiliations, research grants, citations, and works are important for building profiles that others can use to form connections and link people together. The visualization above shows the relationships between authors; a thicker line between two authors symbolizes a stronger relationship, which can come from things like co-authoring a piece of work or citing each other’s work.
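One way to turn "stronger relationship" into a line thickness is to count how often two authors appear on the same paper and scale that count to a stroke width. The sketch below is an assumption-laden illustration rather than the method used for the figure: the `author_lists` data is made up, only co-authorship is counted (citations are ignored), and the linear scaling with its pixel range is an arbitrary choice.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per paper.
author_lists = [
    ["Ada", "Ben"],
    ["Ada", "Ben", "Cy"],
    ["Ben", "Cy"],
    ["Ada", "Ben"],
]

def coauthor_weights(author_lists):
    """Count how many papers each pair of authors has written together."""
    weights = Counter()
    for authors in author_lists:
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights

def line_width(weight, max_weight, min_px=1.0, max_px=6.0):
    """Linearly scale a co-authorship count to a stroke width in pixels."""
    if max_weight <= 1:
        return min_px
    return min_px + (max_px - min_px) * (weight - 1) / (max_weight - 1)

weights = coauthor_weights(author_lists)
max_w = max(weights.values())
widths = {pair: line_width(w, max_w) for pair, w in weights.items()}
for pair, px in sorted(widths.items()):
    print(pair, weights[pair], px)
```

A log or square-root scale may read better when a few pairs collaborate far more often than everyone else; the linear version is used here only because it is the simplest to follow.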

While authors may be the building blocks of scholarly networks, papers are an even more basic unit. Their importance comes from being the way that authors showcase and share their knowledge and have it spread. Papers are how academics keep up to date with the latest research, so visuals are important for getting the point across. These are the kinds of visualizations that data scientists usually make for data sets.

Map showing the locations of different institutions

If papers are a step down from authors, then institutions are a step up. Each institution is a large group that encompasses many authors and provides more information, such as its name, ranking, members, and locations. This greatly aids the formation of scholarly networks, as authors in an institution can easily form connections with one another simply by working in close proximity, or with external authors when institutions choose to work together.

Some more visualizations

Showcases the publications done by two researchers

This visual is an art piece showcasing the activity of two researchers. Each branch on the tree symbolizes the work a researcher did over the course of two years, with every leaf being a publication that they submitted. The patterns are meant to help scientists examine their personal careers and focus on self-development.

Network between authors in the Harvard University Institution

This image shows a scholarly network at Harvard University. Each node represents an author, and the edges between nodes symbolize collaborations. The color of each node indicates the research field of its author.
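To make the node, edge, and color encoding concrete, here is a minimal Python sketch that prepares such a network for drawing. Everything in it is hypothetical: the authors, fields, collaborations, and color palette are invented, and the `nodes`/`links` layout is just one common shape for handing a graph to a plotting tool, not the format used for the figure.

```python
# Hypothetical author -> research field mapping and collaboration pairs.
author_field = {
    "Ada": "Biology",
    "Ben": "Physics",
    "Cy": "Biology",
    "Dee": "Economics",
}
collaborations = [("Ada", "Ben"), ("Ben", "Cy"), ("Cy", "Dee")]

# Arbitrary hex colors chosen for illustration.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

def field_colors(author_field, palette):
    """Assign one color per research field, reusing colors if fields outnumber them."""
    fields = sorted(set(author_field.values()))
    return {field: palette[i % len(palette)] for i, field in enumerate(fields)}

colors = field_colors(author_field, PALETTE)

# One node per author, colored by field; one link per collaboration.
nodes = [
    {"id": author, "field": field, "color": colors[field]}
    for author, field in sorted(author_field.items())
]
links = [{"source": s, "target": t} for s, t in collaborations]

for node in nodes:
    print(node)
```

Authors in the same field end up with the same color, which is what lets a reader spot field-level clusters in the drawn network.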

Showing the top 20 productive regions in publications for a journal

Here is a bibliographic coupling of the 20 most productive regions for the TFS publications. As China has the largest circle, we can see that it is the greatest contributor to the journal.
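The ranking behind a chart like this can be sketched as a simple count of publications per region. The snippet below uses made-up data, not figures from the survey, and the choice to scale circle area (rather than radius) with the count is an assumption about how such a chart might be drawn.

```python
import math
from collections import Counter

# Hypothetical region of origin for each publication (invented data).
pub_regions = ["China", "USA", "China", "UK", "USA", "China", "China"]

def top_regions(pub_regions, n=20):
    """Return the n regions with the most publications, most productive first."""
    return Counter(pub_regions).most_common(n)

def circle_radius(count, max_count, max_radius=40.0):
    """Radius chosen so that circle area is proportional to the count."""
    return max_radius * math.sqrt(count / max_count)

ranked = top_regions(pub_regions, n=2)
max_count = ranked[0][1]
for region, count in ranked:
    print(region, count, round(circle_radius(count, max_count), 1))
```

Scaling the area instead of the radius keeps a region with twice the publications from looking four times as large.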

By now you hopefully have an idea of how scholarly data can be visualized to portray ideas that go beyond the data or research a scientist is working on. These visuals greatly help scholarly data analysis and help address the problems arising from the rapid growth in academic data over the past decade, a problem that will only get worse from here on out. The article ends with several open issues, such as subjects that remain tricky to visualize properly (institutions, for example) and whom to show when the number of authors is simply too large. While aimed at fellow academics, the paper offers many points that our fellow data scientists can take away if they ever need to work with scholarly data.

Sources: (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8314667)

All visuals were taken from the article linked above.
