Background and Basic Statistics

Introduction and Background

The Southern Baptist Convention (SBC) is the largest and perhaps most influential Protestant organization in America today. In a recent study, I surveyed the resolutions adopted by the convention, using machine learning to identify topics and track trends in language over time. Here, I will present a similar study using the words of the SBC's leaders.

The Southern Baptist Historical Library and Archives (SBHLA) has been digitizing relevant documentation and recently made available two rather interesting sets of documents: sermons preached at the convention, and SBC presidential addresses. Taken together, these represent the viewpoints of respected spiritual and political leaders within the SBC and provide insight into the priorities and focus of the leadership of the convention.

In this work, following the techniques used in the previous analysis, non-negative matrix factorization was applied to these documents in order to identify topics. Because of the nature of the texts, a few additions to the process were needed, and these will be described. Finally, an analysis of the identified topics will be presented, exploring how the language in these documents changes over time.
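For readers unfamiliar with the technique, the following is a minimal sketch of non-negative matrix factorization using the standard multiplicative update rules on a tiny document-term matrix. It is not the code used in this study: the matrix `v`, the function names, and parameters such as `n_topics` and `n_iter` are illustrative assumptions, and a real analysis would more likely rely on an established library.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, i.e. how well the factors reconstruct V."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (documents x terms) into W (documents x
    topics) and H (topics x terms) with multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [[h[k][j] * numer[k][j] / (denom[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [[w[i][k] * numer[i][k] / (denom[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Hypothetical toy matrix: rows are documents, columns are term weights.
v = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 6]]
w, h = nmf(v, n_topics=2)
```

Each row of `h` can be read as a topic (a weighting over terms), and each row of `w` gives how strongly a document draws on each topic.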

Data Acquisition and Methodology

According to the SBHLA, a sermon has been preached at every yearly gathering of the SBC since its founding in 1845. Initially these sermons focused on foreign missions and were preached on the first night of the meeting. While the focus of the sermons has varied over the years, the practice of appointing a minister to preach to the gathered body of Baptists has been maintained to the present. The record of sermons is fairly complete, beginning in 1890 and continuing through 2019.

Presidential addresses, given by the incumbent, are also available online from the SBHLA. This platform provides the president with an opportunity to address concerns and encourage the convention at large. Unlike the sermons, however, there are significant gaps in the historical record. While a few addresses are available from 1919 and 1923, a mostly continuous record exists only from 1950 to 1985 and from 1990 to 2019.

Before analyzing the topics in these documents, it is interesting to note that convention sermons and presidential addresses have remained fairly consistent in length over the years. This can be seen in the figure here, which displays the word count of these documents. While there is certainly a large amount of variation year over year, the overall trend line is relatively flat. Despite the fact that resolutions have been getting longer in recent years, it would appear that the pastors and presidents speaking to the SBC have not felt the need to become more verbose.

Right: the word count of sermons preached at the SBC's yearly convention. Left: the word count of the presidential addresses given at the SBC's yearly convention. The dots represent raw values, whereas the solid line represents a 10-year moving average.
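The word counts and the moving average in the figure could be computed along the following lines. This is a sketch only: it assumes a trailing window measured in calendar years (the figure may instead use a centered or fixed-length window), and the function names and sample numbers are hypothetical.

```python
def word_count(text):
    """Count whitespace-separated tokens in a document."""
    return len(text.split())


def moving_average(counts_by_year, window=10):
    """Trailing moving average keyed by year.

    For each year that has a document, average the counts of all documents
    dated within the `window` years ending at that year. Years with no
    document are skipped, which matters for the gaps in the address record.
    """
    years = sorted(counts_by_year)
    smoothed = {}
    for year in years:
        in_window = [counts_by_year[y] for y in years if year - window < y <= year]
        smoothed[year] = sum(in_window) / len(in_window)
    return smoothed


# Hypothetical word counts, not taken from the actual corpus.
example = {1950: 100, 1951: 200, 1965: 400}
trend = moving_average(example)
```

Windowing by year rather than by position keeps a sparse stretch of the record from being averaged together with documents from decades earlier.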

Building the corpus

While the documents for this study were available online, they were in PDF format and much of the extractable text contained significant errors. In an attempt to alleviate this, the scanned images were re-processed through the optical character recognition (OCR) program Tesseract. This open-source software was originally developed by HP and maintained by Google up through 2019.

To prepare the documents for Tesseract, they were first converted from PDF into JPEG images. Next, using OpenCV (an open-source computer vision library), the images were cleaned up by converting them to greyscale, applying appropriate thresholds, and applying a median blur. After this, characters were detected using Tesseract. Finally, a simple spell-check algorithm was applied to attempt to correct single-character errors.
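The steps above can be sketched roughly as follows, using the `cv2` and `pytesseract` Python packages. The details are assumptions rather than the settings actually used: Otsu thresholding and a 3-pixel blur kernel stand in for the "appropriate thresholds", the spell check is a simple edit-distance-one lookup against a word-frequency table, and the function names and sample vocabulary are hypothetical.

```python
import string


def ocr_page(jpeg_path):
    """Clean up one scanned page and run Tesseract on it."""
    # Imported lazily so the spell-check helpers below work without the OCR stack.
    import cv2
    import pytesseract

    image = cv2.imread(jpeg_path)
    if image is None:
        raise FileNotFoundError(jpeg_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Assumption: Otsu's method picks the black/white threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Assumption: a 3x3 median blur to remove speckle noise.
    denoised = cv2.medianBlur(binary, 3)
    return pytesseract.image_to_string(denoised)


def single_edits(word):
    """All strings one deletion, substitution, or insertion away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + replaces + inserts)


def correct_word(word, vocab_counts):
    """Return `word` if known, else the most frequent known word one edit away."""
    if word in vocab_counts:
        return word
    candidates = [w for w in single_edits(word) if w in vocab_counts]
    if not candidates:
        return word  # leave unrecognized words untouched
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (vocab_counts[w], w))


# Hypothetical word-frequency table standing in for a real dictionary.
vocab = {"gospel": 50, "church": 40, "mission": 30}
fixed = [correct_word(w, vocab) for w in ["gospcl", "chuch", "misssion", "xyzzy"]]
```

A lookup of this kind can only repair words that are a single character away from a known word, so heavily garbled words pass through unchanged.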

Unfortunately, significant errors remained in the final dataset, such as misspelled words or missing characters. This appears to be largely due to the quality of the original documents and the limitations of the software employed. Despite this, as will be shown, a sufficient number of words were processed correctly so as to produce meaningful topics.


Analysis: 1880-1960