In this post, I want to offer a brief guide to Voyant Tools and my review of using these tools.
In my Digital Public Humanities (DPH) course, we are learning to work with digital tools that help us conduct research and analyze our work effectively as humanists. Voyant Tools is a web-accessible program that offers a wide range of tools for conducting text analysis and building visual presentations of the data those tools gather.
To begin, opening the main webpage for this program presents a box into which the user enters links. These links can point to PDF files, text files, or URLs of textual items. Once these files are submitted, the program analyzes them and presents five default tools the user can operate to investigate the documents (which are collectively referred to as a “corpus”). For my assignment, I used 17 documents consisting of interviews with former African-American slaves from different states, conducted in the 1930s.
The first of the five default tools, known as “Cirrus,” creates a word cloud of the most commonly used words (note: there is an option to filter out the most common words such as “a,” “they,” and “the,” known as “stop words,” which is available in the toolbar ribbon of the tool itself). Sometimes it is necessary to modify the words presented in the word cloud, as with the documents I was analyzing: adding the dialect spellings of the stop words clarified the actual body of most commonly used words across the corpus. The word cloud can also be generated for individual documents of the corpus, allowing for a narrower inspection.
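Under the hood, a word cloud like Cirrus rests on a simple word-frequency count with a stop-word filter. Here is a minimal sketch of that idea in Python; the sample sentence and the dialect stop words (“dey,” “de”) are hypothetical stand-ins, not taken from the actual corpus, and Voyant's real stop-word list is far longer:

```python
from collections import Counter
import re

# Hypothetical snippet of dialect-transcribed text (not from the corpus).
text = "Dey said de war was over and dey was free after de war."

# A tiny stand-in for Voyant's default stop-word list...
stop_words = {"a", "the", "they", "and", "was", "after", "over", "said"}
# ...plus the dialect spellings that had to be added by hand, as in my assignment.
stop_words |= {"dey", "de"}

# Lowercase, tokenize, drop stop words, then count what remains.
words = re.findall(r"[a-z']+", text.lower())
freqs = Counter(w for w in words if w not in stop_words)
print(freqs.most_common(5))
```

Without the dialect additions, “dey” and “de” would dominate the counts, which is exactly the skew I had to correct in Cirrus.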
The second of the five default tools, known as the “Reader,” provides a full-text reading of selected documents and selected keywords. Within the provided text, the user can select any word, including those that have been filtered out, and it will then be loaded into the other available tools. The Reader also provides a colored graph of all the documents in the corpus, with a horizontal line plotting the frequency of the word across all documents and a vertical line showing where the user is within the full body of text.
The third of the five default tools, known as “Trends,” provides a line graph and the ordered frequencies of selected words, either across the corpus, within individual documents, or between different segments of one document (in the case of my documents, different interviews within the same file). This helps the user visualize the usage of particular words in comparison to other areas of the corpus. By selecting the dots on the graph, the user can drill down into a graph displaying the frequency of the target word across individual documents.
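Conceptually, what Trends plots is a word's relative frequency document by document. A rough sketch of that calculation, using made-up miniature “documents” in place of the 17 interview files:

```python
def relative_freq(doc_words, target):
    """Occurrences of target per word in the document."""
    return doc_words.count(target) / len(doc_words)

# Made-up stand-ins for the real interview documents.
docs = {
    "interview_a": "the war came and the war went".split(),
    "interview_b": "freedom came after war".split(),
}

# One point per document: this is the shape of a Trends line.
trend = {name: relative_freq(words, "war") for name, words in docs.items()}
print(trend)
```

Using relative rather than raw counts matters because the interview documents vary in length; Voyant's graphs offer both views.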
The fourth of the five default tools, known as “Summary,” creates a summary of the corpus according to several preset categories. These categories are: document length, vocabulary density, average words per sentence, most frequent words in the corpus, and the most distinctive words of each document. This is primarily descriptive information, but the user may select any of the words in the summary to become the target of other tools on the tool page.
The fifth of the five default tools, known as “Context,” makes a range of context available for each keyword. For example, it can show the five words before and the five words after the keyword selected by the user (a sliding scale at the bottom of the tool adds or subtracts context words). This can be applied across the corpus or to an individual document, selected at the bottom of the tool under “Scale,” and it can also combine multiple individual documents rather than covering the entire body of works.
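The Context tool is essentially a keyword-in-context (KWIC) concordance, a standard text-analysis technique. A minimal sketch of the idea, with a hypothetical sentence standing in for an interview passage (the function name and sample text are my own, not Voyant's):

```python
def kwic(words, keyword, window=5):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, w in enumerate(words):
        if w.lower() == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, w, right))
    return hits

# Hypothetical snippet, not an actual interview excerpt.
tokens = "before the war we was slaves but after the war we was free".split()
for left, kw, right in kwic(tokens, "war", window=3):
    print(f"{left} | {kw} | {right}")
```

Widening or narrowing the `window` parameter corresponds to dragging the tool's context slider.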
My Reflection on the Tools
Throughout my time using these tools, I found them quite useful for my assigned task. These tools do not provide full interpretations of the data; rather, they do the heavy lifting for the researcher by breaking down large sets of data and presenting them in a manageable way. The tools provide enough information to spark questions and offer clues to answering those questions. They allow for a dynamic search in which users can explore the tools and essentially “find,” or stumble upon, interesting points of information without searching too hard.
One of the most useful options is the ability to restructure the information in different formats. For example, all five of the default tools have two or three options to change the appearance of the data. Some provide different visualizations, such as the Context tool's “bubblelines” chart, and others change the visualization into a quantified format, such as the “terms” view in Cirrus, which presents the words as a list accompanied by small line graphs.
Another good feature of this program is the ability of the tools to work with each other. As mentioned before, selecting words from the Reader tool changes how the other tools, such as Trends and Context, present the data, recentering them on the selected word. This allows the user to see the target information in other forms and to identify further trends or investigate areas of interest.
A final benefit of the program is that the tools are developed enough to make clear distinctions between the bulk of the information (the corpus) and the individual documents. This means the researcher does not need to waste time running multiple programs to analyze different sections of the same corpus, but can make targeted comparisons from within the same tool windows.
One drawback I want to mention, however, is that while the tools are not terribly difficult to learn, it strikes me as important to know how they operate and interact with each other in order to use them effectively. I became familiar with the tools because instruction was provided for me. I can easily see how the lack of up-front tutorials makes the program less friendly for new users, and this, of course, can impact its effectiveness for researchers who are more accustomed to analog styles of research.
What I Learned
Through this assignment, I learned some very interesting information. For example, by examining the usage of distinct words across the interviews tied to specific states, it became clear that word choice was highly influenced by the interviewees' circumstances. States at varying proximities to areas of the Civil War seemed to reflect different degrees of mention of the war in the interviews. Interviewees from states closer to American Indian Tribes, such as the Cherokee, mentioned those Tribes more often than interviewees from other states. And areas with higher concentrations of plantations mentioned them more frequently than areas that might have had different industries that utilized slave labor.
Going even deeper, there were interpretive temporal and social boundaries that could be identified from the trends in the data. Using the word “war,” the Reader and Context views revealed that the Civil War was seen as a temporal marker for the change in the interviewees' social status on a national level, in which they transitioned from being slaves “before the war” to being freed peoples “after the war.” Yet, along the same lines, some words made it clear that certain relevant terminology persisted, describing, I would argue, the same fundamental issues. The word “white” was still commonly used both for describing the period of enslavement and for describing the conditions the interviewees were living under at the time of the interviews.
But there is definitely a word of caution that I took from my background reading on these tools and that I could see for myself when using them: the output of text analysis tools cannot be taken literally or at face value. Though the information was insightful, it merely presented clues for me to investigate further. The tools did not, by any means, offer a full picture of a data set of over 2 million words. Moreover, the trends that are constructed, and that I interpret, depend on my reading of the analysis, and I do not have a strong enough foundation of knowledge about slavery or the American Civil War to draw fully accurate conclusions from this information. Additionally, as the need to modify the stop-word filtering made clear, the data is easily skewed by rogue factors that inexperienced users can miss.
In conclusion, the tools were very cool. They can definitely provide a novel and even nuanced view of the material in question. I plan to use these tools both for my future work with my DPH course and research outside the course, but the lessons for utility also make clear the caution that needs to accompany these tools.