You’ve located the journal data you want in the Data for Research (DfR) service; made the request; completed the license agreements; and downloaded all of the data.  Now what?

This page is meant to give you an overview of the process we developed and what we learned along the way.  Much of it is generalizable to any DfR dataset, but some of it is not.  All the code can be found in our GitHub repository.

File Organization

We wrote the code to make things as simple as possible, although that choice has some costs.  The code is all in Python 3, in Jupyter Notebooks.  We unzipped the full data package that we got from DfR into a directory (folder) called “jstor_data”, which should contain several subdirectories.  The Notebooks are designed to run from one directory, “js_sotf”, which also contains jstor_data, and they write all of their output files back into that directory.  You can, of course, change the file outputs in the programs (or separate the files manually), but we thought that this approach was most likely to run smoothly, particularly for those with less programming experience.

Creation of a CSV Reference File from the XML Metadata Files
Our first step was to create a clean version of a CSV reference file from the XML files in the “metadata” folder. We did this in the following steps:
  1. Using the “csv_creation” notebook, we turned the XML files in the “jstor_data/metadata” directory into one CSV file saved in the main directory, called “articles_raw.csv”.
  2. We then used the “csv_prep” notebook to remove miscellaneous articles and infer the authors’ genders. The output of this notebook is called “articles_gender.csv”.
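The first step can be sketched roughly as follows. This is a minimal illustration rather than the code in the “csv_creation” notebook: it assumes the DfR metadata files use a JATS-style layout (an `article` root with `article-title`, `pub-date/year`, and `contrib` elements), and the function names and column names are hypothetical.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET

# Hypothetical column names for the reference file.
FIELDS = ["file", "article_type", "title", "year", "authors"]


def parse_article(root, file_name=""):
    """Pull a few fields out of one metadata record (assumed JATS-style layout)."""
    authors = []
    for contrib in root.iter("contrib"):
        given = contrib.findtext(".//given-names", default="").strip()
        surname = contrib.findtext(".//surname", default="").strip()
        name = " ".join(part for part in (given, surname) if part)
        if name:
            authors.append(name)
    title_el = root.find(".//article-title")
    # itertext() keeps text nested inside inline markup such as <italic>.
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    return {
        "file": file_name,
        "article_type": root.get("article-type", ""),
        "title": title,
        "year": (root.findtext(".//pub-date/year") or "").strip(),
        "authors": "; ".join(authors),
    }


def write_csv(metadata_dir, out_path):
    """Parse every XML file in metadata_dir and write one CSV row per file."""
    rows = []
    for path in sorted(glob.glob(os.path.join(metadata_dir, "*.xml"))):
        root = ET.parse(path).getroot()
        rows.append(parse_article(root, os.path.basename(path)))
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Under these assumptions, `write_csv("jstor_data/metadata", "articles_raw.csv")` would produce the raw reference file described above.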
Author Gender Analysis
Working from “articles_raw.csv”, the “csv_prep” notebook uses the `gender_guesser` library to look up the forenames in the author field and record its estimate for each name: male, female, most likely male, most likely female, equally likely to be either, or unknown. 141 articles were labeled with unknown names (some of these were in Hebrew); 6 were most likely male; 8 were most likely female; 5 were equally likely to be either (returning the value “andy”); and there were three stray files. The notebook’s uncorrected output is “articles_gender.csv”. We manually fixed those designations (and made some corrections to the gender identifications) and saved the result to “articles_gender_validated.csv”.
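A rough sketch of the name lookup is below. The helper names are hypothetical, and the forename extraction assumes that each author is stored as a “Given Surname” string, which may not match the notebook. The only library call used is `gender_guesser.detector.Detector().get_gender(name)`, which returns one of "male", "female", "mostly_male", "mostly_female", "andy", or "unknown".

```python
from collections import Counter


def first_forename(author):
    """Return the first given name from a 'Given Middle Surname' string."""
    parts = author.replace(".", " ").split()
    # Skip bare initials so that "J. David Bleich" yields "David".
    for part in parts[:-1]:
        if len(part) > 1:
            return part
    return ""


def label_authors(authors, guess):
    """Map each author string to the label `guess` returns for its forename."""
    labels = {}
    for author in authors:
        forename = first_forename(author)
        labels[author] = guess(forename) if forename else "unknown"
    return labels


def tally(labels):
    """Count how many authors received each label."""
    return Counter(labels.values())


def make_guesser():
    """Build the real lookup; requires `pip install gender-guesser`."""
    import gender_guesser.detector as gender

    return gender.Detector().get_gender
```

With the library installed, `tally(label_authors(authors, make_guesser()))` gives the kind of breakdown reported above. Names labeled "unknown" or "andy" still need the manual review described here.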
The “gender_visualization” notebook produces visualizations of author counts by gender and by year, conditioned on article type. It takes in the file “articles_gender_validated.csv” and produces these figures.
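The counting behind such figures might look like the sketch below. It assumes one row per author with hypothetical columns named `year`, `gender`, and `article_type`; the actual column names in “articles_gender_validated.csv” may differ, and the plotting function is only one plausible way to draw the result with matplotlib.

```python
from collections import Counter, defaultdict


def counts_by_year(rows, article_type):
    """Return {gender: {year: n}} for rows of the requested article type."""
    counts = defaultdict(Counter)
    for row in rows:
        if row["article_type"] == article_type:
            counts[row["gender"]][int(row["year"])] += 1
    return {gender: dict(by_year) for gender, by_year in counts.items()}


def plot_counts(counts, title):
    """Draw one line per gender; requires matplotlib."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for gender, by_year in sorted(counts.items()):
        years = sorted(by_year)
        ax.plot(years, [by_year[y] for y in years], marker="o", label=gender)
    ax.set_xlabel("Year")
    ax.set_ylabel("Authors")
    ax.set_title(title)
    ax.legend()
    return fig
```

Rows can be read with `csv.DictReader` and passed straight to `counts_by_year`.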
Unigram Analysis

The purpose of this analysis is to track the most popular significant words, by year, that appear in book reviews and research articles in AJS Review.

For the unigram analysis, the first step was to separate the unigram files according to the type of article they summarize. This meant that book review unigram files were sorted into one directory and research article unigram files into another. This was accomplished with the “data_separator” notebook. Please refer to the notebook for more details on this process.
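As an illustration only, the separation could be done along these lines. The sketch assumes you already have a dictionary mapping each unigram file’s stem (its name without the extension) to an article type, for example built from the CSV reference file; the function names and the example file names are hypothetical, and the “data_separator” notebook may work differently.

```python
import os
import shutil


def plan_moves(file_names, article_types):
    """Pair each file that has a known article type with that type."""
    plan = []
    for name in sorted(file_names):
        kind = article_types.get(os.path.splitext(name)[0])
        if kind is not None:  # files without a metadata match are left alone
            plan.append((name, kind))
    return plan


def separate_unigrams(source_dir, dest_root, article_types):
    """Copy each unigram file into a subdirectory named after its article type."""
    for name, kind in plan_moves(os.listdir(source_dir), article_types):
        target_dir = os.path.join(dest_root, kind)
        os.makedirs(target_dir, exist_ok=True)
        shutil.copy2(os.path.join(source_dir, name), os.path.join(target_dir, name))
```

Copying rather than moving keeps the original download intact, so the step can be rerun safely.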

A second notebook, “data_cleaning”, was used to read all of the unigrams in each category (book review or research article), strip numbers that are not dates, filter out the stopwords provided by the NLTK package along with those in our own stopword file, “custom_stopwords.txt”, and sort the unigrams by frequency. We ran the notebook, manually examined the output, identified additional stopwords that we then added to the “custom_stopwords.txt” file, and repeated until we had a stopword list that we thought did a good job of filtering the noise from our data.
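A minimal sketch of that cleaning pass follows. It assumes each unigram file holds one `word<TAB>count` pair per line and treats four-digit numbers from 1500 to 2099 as dates; both are assumptions for illustration, as are the function names. The NLTK call `stopwords.words("english")` requires a one-time `nltk.download("stopwords")`.

```python
import re
from collections import Counter

# Assumption: four-digit numbers between 1500 and 2099 count as dates.
YEAR = re.compile(r"^(1[5-9]\d\d|20\d\d)$")


def clean_unigrams(lines, stopwords):
    """Sum counts from 'word<TAB>count' lines, dropping stopwords and non-date numbers."""
    totals = Counter()
    for line in lines:
        word, _, count = line.strip().partition("\t")
        word = word.lower()
        if not word or not count.isdigit():
            continue  # malformed line
        if word in stopwords:
            continue
        if any(ch.isdigit() for ch in word) and not YEAR.match(word):
            continue  # token contains digits but is not a plausible year
        totals[word] += int(count)
    return totals.most_common()  # sorted by frequency, highest first


def load_stopwords(custom_path):
    """Combine NLTK's English stopwords with one-word-per-line custom entries."""
    from nltk.corpus import stopwords

    words = set(stopwords.words("english"))
    with open(custom_path, encoding="utf-8") as handle:
        words.update(line.strip().lower() for line in handle if line.strip())
    return words
```

Rerunning `clean_unigrams` after each edit to “custom_stopwords.txt” reproduces the examine-and-repeat loop described above.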

To visualize the unigrams with sparklines, we used the “sparklines” notebook. Given an article type, time period, bin size, and the number of words to use, it creates a sparkline visualization showing the changes in unigram frequency. The visualization is quite simple and is created with matplotlib, which allows for extensive customization of the visualization. These visualizations can be seen here.
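The binning and drawing can be sketched as follows. This is a simplified illustration, not the “sparklines” notebook itself: the function names are hypothetical, the input is assumed to be a `{year: count}` dictionary per word, and the plot uses raw counts rather than any normalized frequency.

```python
def bin_counts(yearly, start, end, bin_size):
    """Sum {year: count} into consecutive bins of bin_size years, start..end inclusive."""
    n_bins = (end - start) // bin_size + 1
    bins = [0] * n_bins
    for year, count in yearly.items():
        if start <= year <= end:
            bins[(year - start) // bin_size] += count
    return bins


def draw_sparklines(series, path):
    """Draw one small axis-free line per word; series maps word -> binned counts."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(
        len(series), 1, figsize=(4, 0.6 * len(series)), squeeze=False
    )
    for ax, (word, values) in zip(axes[:, 0], series.items()):
        ax.plot(values, linewidth=1)
        ax.set_axis_off()
        # Label each line with its word, just left of the plot area.
        ax.text(-0.02, 0.5, word, transform=ax.transAxes, ha="right", va="center")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
```

Because each sparkline is an ordinary matplotlib axes object, line width, color, and labels can all be adjusted in the usual way.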

Topic Modeling
Topic modeling begins with a dataframe that contains the texts of the articles. The general strategy is (1) to clean the text and put it in a form that can be modeled; (2) to determine the best number of topics to use and the best number of passes; (3) to choose a particular model and assign topics to it; and (4) to produce the files needed to visualize the topics through time.
Once we had determined the best number of topics and passes, we ran a different program that fits the models (assuming you want the same number of topics and passes for each) on multiple chunks of date ranges and types of articles. You might, for example, want to create pyLDAvis files for research articles in five-year chunks.
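The sketch below shows one way to fit a model per date range. It assumes the models are gensim LDA models (an inference from the mention of “passes” and pyLDAvis, not something stated here), uses topic coherence as one possible criterion for comparing topic counts, and takes already-tokenized texts; all function names are hypothetical.

```python
def date_chunks(start, end, size):
    """Return inclusive (first_year, last_year) ranges covering start..end."""
    return [(lo, min(lo + size - 1, end)) for lo in range(start, end + 1, size)]


def fit_lda(token_lists, num_topics, passes, seed=0):
    """Fit a gensim LDA model on tokenized documents; requires gensim."""
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=seed,
    )
    return model, corpus, dictionary


def coherence(model, token_lists, dictionary):
    """Score a fitted model; higher c_v coherence is usually read as better."""
    from gensim.models import CoherenceModel

    scorer = CoherenceModel(
        model=model, texts=token_lists, dictionary=dictionary, coherence="c_v"
    )
    return scorer.get_coherence()


def models_by_chunk(docs, start, end, size, num_topics, passes):
    """docs is a list of (year, tokens); fit one model per range that has documents."""
    models = {}
    for lo, hi in date_chunks(start, end, size):
        texts = [tokens for year, tokens in docs if lo <= year <= hi]
        if texts:
            models[(lo, hi)] = fit_lda(texts, num_topics, passes)
    return models
```

Each `(model, corpus, dictionary)` triple returned by `models_by_chunk` holds what a visualization tool such as pyLDAvis needs as input.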
Citation Web

Citation analyses are relatively easy to produce from websites devoted to journals in the sciences but much trickier for the kind of data found in DfR.  Our approach has been to extract the names of authors of articles and authors of the works that they cite from the DfR metadata; clean that list (it needed a lot of cleaning!) and assign unique ids to each author (and the same id to authors whose name appears with different spellings in the metadata); and then to use that list to search the metadata in order to compile a list, by id, of edges (that is, who is citing whom).  We used these two lists, one of the authors and their ids and the other of the connections between them, to visualize the network in NetworkX (for the static image) and Gephi (for the dynamic one).  We then exported the dynamic network in a sigma.js template, which we rendered in our WordPress site.
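The id assignment and the who-cites-whom list can be sketched as below. This is a simplified illustration with hypothetical function names: the normalization only handles accents, punctuation, and case, so real spelling variants still need a hand-built `aliases` table of the kind our manual cleaning produced.

```python
import unicodedata


def normalize(name):
    """Crude matching key: strip accents and punctuation, fold case, squeeze spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_marks)
    return " ".join(kept.casefold().split())


def assign_ids(names, aliases=None):
    """Give each distinct author an integer id; aliases maps variant -> canonical name."""
    alias_keys = {normalize(k): normalize(v) for k, v in (aliases or {}).items()}
    ids, by_name = {}, {}
    for name in names:
        key = normalize(name)
        key = alias_keys.get(key, key)
        if key not in ids:
            ids[key] = len(ids)
        by_name[name] = ids[key]
    return by_name


def build_edges(citations, by_name):
    """Turn (citing_author, cited_author) pairs into sorted (id, id, weight) triples."""
    weights = {}
    for citing, cited in citations:
        pair = (by_name[citing], by_name[cited])
        weights[pair] = weights.get(pair, 0) + 1
    return sorted((a, b, w) for (a, b), w in weights.items())


def to_graph(edges):
    """Build a directed, weighted NetworkX graph; requires networkx."""
    import networkx as nx

    graph = nx.DiGraph()
    graph.add_weighted_edges_from(edges)
    return graph
```

A graph built this way can be written out with `networkx.write_gexf` and opened in Gephi.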

We had hoped to use the structured metadata for compiling the author lists, but we discovered that these fields were incomplete and often ambiguous.  All of our extractions were thus done on the field <mixed-citation>, bypassing the structured data completely.
