You’ve located the journal data you want in the Data for Research (DfR) service; made the request; completed the license agreements; and downloaded all of the data. Now what?
This page gives an overview of the process we developed and what we learned along the way. Much of it is generalizable to any DfR dataset, but some of it is not. All the code can be found in our GitHub repository.
File Organization
We wrote the code to make things as simple as possible, although that choice has some costs. The code is all in Python 3, in Jupyter Notebooks. We unzipped the full data package that we got from DfR into a directory (folder) called “jstor_data”, which should contain several subdirectories. The Notebooks are designed to run from one directory, “js_sotf,” which also contains jstor_data, and they write all of their output files back into that directory. You can, of course, change the output locations in the programs (or move the files manually), but we thought that this approach was most likely to run smoothly, particularly for those with less programming experience.
Creation of a CSV Reference File from the XML Metadata Files
- Using the “csv_creation” notebook, we turned the XML files in the “jstor_data/metadata” directory into one CSV file saved in the main directory, called “articles_raw.csv”.
- We then used the “csv_prep” notebook to remove miscellaneous articles and infer the authors’ genders. The output of this notebook is called “articles_gender.csv”.
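As a rough illustration of the first step, the sketch below flattens a directory of XML metadata files into one CSV. It is not the code from the “csv_creation” notebook: the element names (`article-title`, `contrib`, `given-names`, `surname`, `pub-date/year`) assume a JATS-style record, and the column names and the helpers `parse_article` and `build_csv` are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical column names; the real notebook may record other fields.
FIELDS = ["file", "article_type", "title", "year", "authors"]


def parse_article(root, file_name=""):
    """Return one CSV row (a dict) for a single JATS-style <article> element."""
    authors = []
    for contrib in root.iter("contrib"):
        given = contrib.findtext(".//given-names", default="").strip()
        surname = contrib.findtext(".//surname", default="").strip()
        name = f"{given} {surname}".strip()
        if name:
            authors.append(name)
    title = root.find(".//article-title")
    return {
        "file": file_name,
        "article_type": root.get("article-type", ""),
        # itertext() keeps words wrapped in inline tags such as <italic>.
        "title": "".join(title.itertext()).strip() if title is not None else "",
        "year": root.findtext(".//pub-date/year", default="").strip(),
        "authors": "; ".join(authors),
    }


def build_csv(metadata_dir, out_path):
    """Parse every *.xml file in metadata_dir and write the rows to out_path."""
    rows = [
        parse_article(ET.parse(path).getroot(), path.name)
        for path in sorted(Path(metadata_dir).glob("*.xml"))
    ]
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With the directory layout described above, a call such as `build_csv("jstor_data/metadata", "articles_raw.csv")` would produce the reference file.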
Author Gender Analysis
Unigram Analysis
The purpose of this analysis is to track the most popular significant words, by year, that appear in book reviews and research articles in AJS Review.
For the unigram analysis, the first step was to separate the unigram files according to the type of article they summarize. This meant that book review unigram files were sorted into one directory and research article unigram files into another. This was accomplished with the “data_separator” notebook. Please refer to the notebook for more details on this process.
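For readers who want the general shape of this step before opening the notebook, here is a minimal sketch. It assumes a reference CSV with hypothetical `file` and `article_type` columns, and it assumes each unigram file is named after its metadata file with a `-ngram1.txt` suffix; check both against your own download. The function `separate_unigrams` is illustrative, not the notebook's code.

```python
import csv
import shutil
from pathlib import Path

SUFFIX = "-ngram1.txt"  # assumed naming convention for unigram files


def separate_unigrams(csv_path, ngram_dir, out_dir):
    """Copy each unigram file into a subdirectory named after its article type."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        type_by_id = {
            Path(row["file"]).stem: row["article_type"]
            for row in csv.DictReader(fh)
        }
    counts = {}
    for src in sorted(Path(ngram_dir).glob("*" + SUFFIX)):
        article_id = src.name[: -len(SUFFIX)]
        article_type = type_by_id.get(article_id)
        if article_type is None:
            continue  # no metadata row for this file, so leave it unsorted
        dest = Path(out_dir) / article_type
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest / src.name)
        counts[article_type] = counts.get(article_type, 0) + 1
    return counts
```

Copying rather than moving leaves the original download untouched, which makes it easy to rerun the step after changing the article categories.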
A second notebook, “data_cleaning”, was used to read all of the unigrams in each category (book review or research article), strip numbers that are not dates, filter out the stopwords provided by the NLTK package along with those in our own stopword file, and sort the unigrams by frequency. We ran the notebook, manually examined the output, identified additional stopwords that we then added to the “custom_stopwords.txt” file, and repeated until we had a stopword list that we thought did a good job of filtering the noise from our data.
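The cleaning loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the “data_cleaning” notebook itself: it assumes each unigram line is `word<TAB>count`, it treats four-digit numbers from 1000 to 2099 as dates, and it uses a tiny stand-in set (`DEMO_STOPWORDS`) where the real workflow would combine the NLTK English stopword list with the custom file.

```python
import re
from collections import Counter

# Stand-in for the NLTK stopword list plus the custom stopword file.
DEMO_STOPWORDS = {"the", "of", "and", "in", "a"}

# Assumption: a number counts as a date only if it looks like a year, 1000-2099.
YEAR = re.compile(r"^(1[0-9]{3}|20[0-9]{2})$")


def clean_unigrams(lines, stopwords):
    """Total the counts per word, dropping stopwords and non-date numbers."""
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip malformed lines
        word, count = parts[0].lower(), int(parts[1])
        if word in stopwords:
            continue
        if any(ch.isdigit() for ch in word) and not YEAR.match(word):
            continue  # contains digits but is not a plausible year
        totals[word] += count
    # most_common() returns (word, total) pairs sorted by descending frequency.
    return totals.most_common()
```

Inspecting the top of the returned list is what drives the manual step: any noise word found there goes into the stopword set before the next run.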
To visualize the unigrams with sparklines, we used the “sparklines” notebook. Given an article type, time period, bin size, and the number of words to use, it creates a sparkline visualization showing the changes in unigram frequency. The visualization is quite simple; it is created with matplotlib, which allows for extensive customization. These visualizations can be seen here.
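To make the binning and plotting concrete, here is a minimal sketch. It is not the “sparklines” notebook: the function names, the input shape (a `{year: count}` dictionary per word), and the styling choices are all assumptions made for illustration.

```python
def bin_counts(yearly_counts, start, end, bin_size):
    """Sum {year: count} into consecutive bins of bin_size years over [start, end]."""
    n_bins = (end - start) // bin_size + 1
    bins = [0] * n_bins
    for year, count in yearly_counts.items():
        if start <= year <= end:
            bins[(year - start) // bin_size] += count
    return bins


def plot_sparklines(series_by_word, out_path):
    """Draw one small, axis-free line per word and save the figure to out_path."""
    # Imported here so that bin_counts can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(
        len(series_by_word), 1,
        figsize=(4, 0.5 * len(series_by_word)),
        squeeze=False,
    )
    for ax, (word, series) in zip(axes[:, 0], series_by_word.items()):
        ax.plot(series, linewidth=1)
        ax.axis("off")  # sparklines carry no ticks or frame
        ax.text(-0.02, 0.5, word, transform=ax.transAxes,
                ha="right", va="center", fontsize=8)
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```

A typical use would bin each of the top words with `bin_counts` and pass the resulting dictionary of lists to `plot_sparklines`. Note that when the period does not divide evenly by the bin size, the last bin covers fewer years than the others.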
Topic Modeling
Citation Web
Citation analyses are relatively easy to produce from websites devoted to journals in the sciences but much trickier for the kind of data found in DfR. Our approach had three steps. First, we extracted from the DfR metadata the names of the authors of articles and of the authors of the works that they cite. Second, we cleaned that list (it needed a lot of cleaning!) and assigned a unique id to each author, giving the same id to authors whose names appear with different spellings in the metadata. Third, we used that list to search the metadata and compile a list, by id, of edges, that is, who is citing whom. We used these two lists, one of authors and their ids and the other of the connections between them, to visualize the network in NetworkX (for the static image) and Gephi (for the dynamic one). We then exported the dynamic network to a sigma.js template and rendered it on our WordPress site.
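The id assignment and the who-cites-whom list can be sketched in a few lines. This is a simplified illustration, not our cleaning code: the normalization rule, the optional `aliases` table for spellings that normalization alone cannot merge, and the function names are all hypothetical, and the names in the usage example are invented.

```python
import re
import unicodedata


def normalize(name):
    """Crude matching key: strip accents and punctuation, lowercase, collapse spaces."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z ]", " ", text.lower())
    return " ".join(text.split())


def assign_ids(names, aliases=None):
    """Map each raw name to an integer id; names that share a key share an id."""
    aliases = aliases or {}
    ids_by_key, ids_by_name = {}, {}
    for name in names:
        key = normalize(name)
        key = aliases.get(key, key)  # hand-made merges, e.g. inverted name order
        ids_by_name[name] = ids_by_key.setdefault(key, len(ids_by_key))
    return ids_by_name


def build_edges(citations, ids_by_name):
    """Turn (citing name, cited name) pairs into sorted, unique (id, id) edges."""
    return sorted({(ids_by_name[a], ids_by_name[b]) for a, b in citations})


names = ["Ada Cohen", "Cohen, Ada", "Ben Lévi", "Ben Levi"]
ids = assign_ids(names, aliases={"cohen ada": "ada cohen"})
edges = build_edges([("Ben Lévi", "Ada Cohen"), ("Ben Levi", "Cohen, Ada")], ids)
```

An edge list in this form can be loaded into a NetworkX directed graph with `DiGraph.add_edges_from`, or written to a CSV for import into Gephi.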
We had hoped to use the structured metadata for compiling the author lists, but we discovered that these fields were incomplete and often ambiguous. All of our extractions were thus done on the field <mixed-citation>, bypassing the structured data completely.
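Working from `<mixed-citation>` means treating each citation as free text. The sketch below shows one way to pull that text out of a metadata record; the lead-author guess is a deliberately crude heuristic (text before the first comma) that stands in for the much heavier cleaning we actually needed, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def mixed_citation_texts(root):
    """Yield the flattened, whitespace-normalized text of every <mixed-citation>."""
    for node in root.iter("mixed-citation"):
        # itertext() ignores any nested markup and keeps only the text itself.
        yield " ".join("".join(node.itertext()).split())


def guess_lead_author(citation):
    """Crude guess: whatever precedes the first comma; None if nothing does."""
    head = citation.split(",", 1)[0].strip()
    return head or None
```

A heuristic this simple will mislabel citations that open with a title or an editor, which is one reason the resulting list needs manual review.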