Text Analysis on a Native American Corpus

Eustina Kim, Michelle Lee, Priyana Patel, Vicki Truong

Overview

Corpus Background

The corpus used for analysis included documents from the American State Papers and notes from treaty councils with Native Americans. 

The American State Papers is a collection of legislative and executive documents from Congress spanning 1789 to 1838. The documents are arranged in the following ten classes: Foreign Relations, Indian Affairs, Finances, Commerce and Navigation, Military Affairs, Naval Affairs, Post Office Department, Public Lands, Claims, and Miscellaneous. We used texts from the Indian Affairs class, which has two volumes: Volume 1 covers 1789 to 1814 and Volume 2 covers 1815 to 1827.

The treaty council papers include transcription notes from treaty councils held during the period 1784-1814. Initially, treaties between the United States and Native American leaders focused on peace and friendship: the U.S., as a young country, needed to establish a preeminent relationship with tribes that were also trading with the British and the French, and it was too costly to fight them. The objective behind these peace treaties was to foster mutual respect between the parties involved because “federal officials learned of the importance of kinship and symbolic bonds in tribal communities” (Fixico 5). As this relationship solidified, land acquisition became the central issue in U.S.-Indian treaties. The main goal of the U.S. was to expand its territory for new settlers, while Native Americans were concerned with maintaining sovereignty. These conflicting goals produced further treaties, leading to the cession of tribal lands and the systematic creation of new reservation boundaries. By the end of the 19th century, “American Indians held less than 2 percent of the total land that they had once possessed” (Fixico 6).

Tools for Analysis 

The tools used to analyze the corpus are Voyant Tools and Python. The first two text analysis techniques, word frequency and collocation analysis, are explored with Voyant Tools, an interactive web-based application that gives users a simple and easy way to analyze a corpus or document. The last section, topic modeling, employs several Python packages. First, the NLTK package facilitates text pre-processing steps like tokenizing, classification, and stemming, which break the text down so the computer can better analyze it. As each file is opened and read into the working directory, stop words, punctuation, and non-alphabetical tokens are removed, and the processed text is appended to a list, creating a collection of processed documents. The gensim package is used for topic modeling and identifying document similarity. Lastly, the pyLDAvis package enables us to create visualizations of the model that illustrate the results of our analysis and allow readers to explore the model themselves.
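As a rough sketch, the pre-processing steps described above might look like the following. The sample sentence and the fallback stop-word list are ours, not from the original code, and capitalization is deliberately preserved (a choice discussed in the topic modeling section):

```python
import nltk
from nltk.tokenize import RegexpTokenizer

# Stop-word list from NLTK; fall back to a small built-in list if the
# corpus data cannot be downloaded in the current environment.
try:
    nltk.download("stopwords", quiet=True)
    from nltk.corpus import stopwords
    STOP_WORDS = set(stopwords.words("english"))
except Exception:
    STOP_WORDS = {"the", "of", "and", "to", "in", "a", "that", "for"}

# Keeping only alphabetic runs drops punctuation and non-alphabetic tokens.
tokenizer = RegexpTokenizer(r"[A-Za-z]+")

def preprocess(text):
    """Tokenize, then drop stop words; capitalization is kept, since it
    can matter for terms of address in this corpus."""
    return [t for t in tokenizer.tokenize(text)
            if t.lower() not in STOP_WORDS]

# Hypothetical sample document standing in for one corpus file.
processed_docs = [preprocess(doc) for doc in
                  ["The commissioners of the United States met the chiefs in 1795."]]
```

Each processed document is then appended to the list, building the collection of processed documents used throughout the rest of the analysis.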

Word Frequency/Dispersion

Word frequency tools analyze and organize the most frequently occurring words in a given text. Visualization tools for word frequency analysis provide an overview of a particular set of data while also highlighting some surprises. One example of a word frequency visualization is a word cloud, which represents text data in terms of text size: the bigger the word, the more frequently it occurs.


**100 most common terms**

Using the wordcloud package in Python, we generated a word cloud of the preprocessed corpus. Some of the most frequently occurring terms are ‘United States’, ‘Indians’, ‘would’, ‘people’, ‘us’, ‘government’, ‘present’, ‘nation’, ‘country’ and ‘time’. Among them are also the pronouns ‘we’, ‘I’, and ‘you’, commonly used in the context of being “Indian” and notions of difference. Of the three, ‘I’ was the most frequently occurring, ‘we’ second, and ‘you’ the least. Based on these results, we can infer that Native Americans referred to themselves with ‘we’: they also refer to themselves as ‘the Six Nations’, a term that signifies community and unity between the tribes of the Iroquois Confederacy. Words such as ‘us’, ‘nation’, ‘others’, ‘he’, ‘chiefs’, and ‘men’ follow this common theme of differentiating between Native people and Europeans. In contrast, non-Natives use the pronoun ‘I’ because they speak as individuals rather than on behalf of a group or community.
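The counts that drive such a word cloud come from a simple frequency table. In this sketch the token list is a hypothetical stand-in for the preprocessed corpus:

```python
from collections import Counter

# Hypothetical processed tokens; the real input is the full corpus.
tokens = ["United", "States", "Indians", "we", "I", "you", "I",
          "nation", "United", "States", "I", "we", "government"]

freqs = Counter(tokens)
top_100 = freqs.most_common(100)  # the 100 most common terms

# Pronoun comparison reported in the text: 'I' > 'we' > 'you'.
pronoun_counts = {p: freqs[p] for p in ("I", "we", "you")}

# With the wordcloud package installed, the same counts drive the cloud:
# WordCloud(max_words=100).generate_from_frequencies(freqs).to_file("cloud.png")
```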

Collocation Analysis

In any language, certain words tend to appear together in a statistically significant pattern to form a unique meaning, for example “day and night” or “fast food”. These word pairings are known as collocations, and they often form the expressions or idioms of a language.

This section will discuss the application of collocation analysis to the Native American corpus. Using the Natural Language Toolkit (NLTK) with Python, we explore the list of most frequently occurring collocates. Based on this list and the visualizations generated, we can understand certain language patterns unique to Native American speakers and what certain collocates could mean with respect to the topics in the corpus.

Bigrams

We can compute bigram collocations (pairs of adjacent words) and rank the 15 highest-scoring pairs by likelihood ratio using the following code:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# tokens is the list of processed tokens from the corpus
bigram_collocation = BigramCollocationFinder.from_words(tokens)
bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 15)

Results:


**100 most common bigrams**

We find common bigrams from the entire corpus that refer to nations, cities, and Native groups, such as the United States, New York, Six Nations, and Creek Nation. Terms such as “white people” and “obedient servant” speak to how Native people voiced the notion of difference. Verb phrases and prepositional phrases such as “I honor”, “communicated to”, “within limits”, and “OF WAR” suggest possible themes of respect and boundaries. Terms such as “Secretary War”, “President United” and “chiefs warriors” all pay respect to government titles and hierarchies of power and control.

To see which words commonly occur with a specific pronoun such as “I”, “we”, or “you”, we can apply an ngram filter (shown here for “you”):

# apply_ngram_filter removes every ngram for which the function returns
# True, so this keeps only the bigrams that contain 'you'
pronoun_filter = lambda *w: 'you' not in w
bigram_collocation.apply_ngram_filter(pronoun_filter)

Results:

The pronouns “I” and “we” are commonly paired with verbs. Exceptions include “Sir I”, “us we”, “I we” and “When we”. Common verbs between the two pronouns include “hope”, “shall”, “know” and “wish”.
*Inconclusive for “you”.

Trigrams

We follow the same process for trigram collocations (three adjacent words):

from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

trigram_collocation = TrigramCollocationFinder.from_words(tokens)
trigram_collocation.nbest(TrigramAssocMeasures.likelihood_ratio, 15)

Results:

**100 most common trigrams**

In stark contrast to the bigram results for the entire corpus, we find the bigram “United States” embedded within all 15 of the most commonly occurring trigrams. “United States” is paired with words such as “President”, “Senate”, “Government” and “commissioners”, again alluding to notions of influence and control. We also find the word “limit(s)” as a common word between bigrams and trigrams, indicating that land boundaries (or limits) were a common topic of conversation. The trigram “citizen United States” emphasizes the broad topic of a nation and its people. This poses the question of whether the context of this term within the corpus was generalized to those that inhabited the land at this time, or just Americans, excluding Native people. Common terms such as “part” and “behalf” bring up concepts such as responsibility for one’s actions. We also find words including “cede” and the past tense “ceded”, posing the question of whether this act of surrendering land was a statement, remark, or direct request. Lastly, the word “friendship” is a critical term in the exploration of the relationship between Americans and Native people, questioning whether both sides sought friendly relations, as leaders considered the values and well-being of their own people. 

Pronoun filter:

pronoun_filter = lambda *w: 'you' not in w
trigram_collocation.apply_ngram_filter(pronoun_filter)

Results:

With the pronoun filters for “I”, “you”, and “we”, we again find the bigrams “United States”, “thousand dollars” and “chiefs warriors” within the commonly occurring trigrams. We also find an overlap with other common bigrams such as “Secretary War”, “Six Nations” and “New York”. A common bigram for the pronoun “I” in our trigram results is “I honor”, paired with the adjective “obedient” and the verbs “transmit” and “acknowledge”. A new finding specific to these results is the instance of “Creek nation”, referring to the Muscogee indigenous people. The mention of “Great Spirit” highlights the supernatural being and creator deity of many Native American peoples. Additionally, the title in “Governor Blount we” identifies William Blount, the first territorial governor of Tennessee.

**The results for the pronoun “you” were inconclusive for bigram and trigram collocation results for the entire corpus.**

Topic Modeling

With a corpus of many documents, we may wonder what the dominant themes are and which documents are most strongly associated with each theme. Topic modeling, a type of statistical modeling for discovering such themes or topics, can help us with this. We use Latent Dirichlet Allocation (LDA) for this corpus, because LDA assigns the text in each document to a mixture of topics. We use gensim to create a dictionary of words from the documents, filter out tokens that appear in fewer than 3 documents, and then convert the tokenized documents to bag-of-words vectors.

To determine how many topics to use for our model, we compare C_v coherence scores across different numbers of topics. Coherence scores measure the degree of semantic similarity between the highest-scoring words in a topic. We create several models with the same parameters, varying only the number of topics from 2 to 15. We find that 11 topics gives the highest coherence score (≈0.557), with scores tapering off afterwards, so we use 11 topics.

Parameter tuning will lead to different models. For this model, we use our corpus and dictionary, a random state of 100 (akin to setting a seed, for reproducibility), 11 topics (the optimal number according to coherence scores), 25 passes (the number of times the corpus is passed through during training), and a chunk size of 100 (the number of documents used in each training chunk). These parameters are the same as those used to determine coherence scores, except for the number of topics.

The Intertopic Distance Map is a visualization of the topics and how they differ from each other. It also allows for further inspection of the terms most highly associated with each topic.

Below, we have attempted to label each topic given its relevant terms and their weights. Python counts starting from 0, so in some of the code, Topic 0 really refers to Topic 1, and so forth. The top 5 words from each topic are given in parentheses; the full list of words for each topic can be found in the Intertopic Distance Map.

***List of words for each topic***


Topic 1: war related to the Creeks tribe (Creeks, war, towns, killed, sent)

Topic 2: commissioning related to the Creeks tribe in Georgia (Georgia, agent, commissioners, We, Creek)

Topic 3: treaty conditions (line, thence, miles, dollars, tribe)

Topic 4: correspondence about commissioners by the Creek (commissioners, We, Georgia, Creek, war)

Topic 5: trading (amount, trade, per, goods, Do)

Topic 6: commissioning related to tribes in Florida (commissioners, agent, rations, father, Florida)

Topic 7: correspondence between nations, possibly people in Six Nations (We, Nations, Father, Brothers, Six)

Topic 8: interactions with non-Natives (troops, Christian, society, Major, Ohio)

Topic 9: Native American communication (You, brothers, Brother, We, Wyandots)

Topic 10: trade in Missouri (tribe, trade, Missouri, intercourse, traders)

Topic 11: currency (For, Do, cents, pounds, John)

Overall, we can gather that the topics relate to Native American tribes (some specifically addressed) and their interactions with non-Natives, which range from transactions to war. 

In the Intertopic Distance Map, each bubble represents a topic; larger bubbles represent topics that account for a larger share of the corpus, and topics that are nearer to each other are more similar. Topics are ranked from the largest marginal topic distribution to the smallest (Topic 1 is largest, Topic 11 is smallest). Topics 1, 2, 3, 4, 6 and 8 are relatively close together; topics 7 and 9 are close together; topics 5, 10, and 11 are not as close to any other topics. This makes sense: topics 1, 2, 3, 4, 6 and 8 seem to be related to commissioning and/or the Creeks, while topics 7 and 9 are heavy in words that Native Americans use to address others.

An interesting observation is how capitalization affects the interpretation of the results. In this analysis, we did not change the capitalization of the words because capitalization can indicate terms of address, which is important when examining correspondence. Native Americans have terms that are relatively specific to them, so removing capitalization can take away possible interpretations. For example, ‘Creeks’ might refer to streams of water to most people; in the context of our corpus, however, ‘Creeks’ likely means the Muscogee indigenous people. Of course, our analysis hinges on how tokenization is performed in Python and what further processing steps we take. These steps are subject to error, which could leave gaps in our work, and different model parameters can produce different models, leading to varying interpretations. Experimenting with other pre-processing steps and model parameters would be a natural direction for further research.

Using a heat map, we also found the proportion of topics for each document. Each row is a document, and each column is a topic; the proportions across each row sum to 1. Brighter colors indicate a larger proportion of the document is about a topic.

For each topic, we can also gather the 20 documents with the highest proportions for that topic. In the image below, we can see the top 20 documents for Topic 7 (labelled as Topic 6 in Python).
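Given a document-topic matrix like the heat map's, ranking documents per topic is a sort over one column. The proportions below are hypothetical:

```python
import numpy as np

# Hypothetical document-topic proportions (rows: documents, columns: topics).
doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.1, 0.3],
])

def top_docs(matrix, topic, n=20):
    """Indices and proportions of the n documents most about `topic`."""
    order = np.argsort(matrix[:, topic])[::-1][:n]
    return [(int(i), float(matrix[i, topic])) for i in order]

print(top_docs(doc_topic, topic=0, n=2))  # [(0, 0.7), (3, 0.6)]

# How often each topic is a document's dominant topic (the "topic
# representation frequency" discussed next).
dominant_counts = np.bincount(doc_topic.argmax(axis=1),
                              minlength=doc_topic.shape[1])
```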

A corresponding visualization of topic representation frequency, i.e. how many documents have each topic as their dominant topic, can be seen below. Topic 8 is the dominant topic for the greatest number of documents, while Topic 7 is dominant for the fewest.

Conclusion 

Overall, this analysis gives a glimpse into the Native American corpus using three different text analysis techniques: word frequency and dispersion, collocation analysis, and topic modeling. From this analysis, we’ve gathered that Native Americans most commonly refer to themselves with familial and unifying terms such as ‘we’, ‘us’, ‘nation’, ‘brother’, and ‘father’ because their representatives are usually speaking on behalf of the entire community. Additionally, the most commonly occurring collocations give context to the locations the corpus references: “United States”, “Six Nations”, “Lake Erie” and “Western Territory”. The events that occur in these locations, including trade transactions and war, are further illuminated by the topics we gather about Native Americans and their relations with non-Natives. So far, we’ve done a fair amount of analysis, yet there is still much more to be explored in the corpus.

Our analysis is limited in that it applies only to how we processed the text and tuned the model parameters. Our text processing could have been taken a step further by creating our own stop-word list and by stemming. This requires greater knowledge of the corpus, so that we know which words to include or exclude: words may carry little lexical meaning in other contexts but be important in ours.

Further steps may include using other methods of text analysis. For example, it could be interesting to compare sentiment between the American State Papers and the treaty council notes to compare the connotations of the two sets of documents. We could also analyze the council notes by speaker to compare the language used by U.S. officials and by Native American tribes. These steps would help us better understand the corpus and allow us to apply this knowledge to other areas.


References:
