Downloading Data from millercenter.org

Introduction

The Miller Center of Public Affairs often receives requests for bulk downloads of our data. This site is intended to help satisfy these requests. If you are interested in downloading data from millercenter.org, please read this document. If you have questions, contact Miles Efron (mefron@virginia.edu).

Our most commonly requested data come from the Presidential Speeches collection, a corpus of speeches given by U.S. presidents from George Washington's time to the contemporary presidency. The collection is not exhaustive; inclusion is an editorial decision by Miller Center staff. However, more than 1,000 speeches are available, and many NLP and computational humanities and social science researchers find the collection useful.

With this in mind, we have made the Presidential Speeches collection available via a REST API. At this point the API is simple, but it should meet most researchers' needs.

Terms of service

These data are offered as-is, with no warranty or support, for the use of the research and academic communities. At this time, we offer access to the data without authentication. However, rate limiting does apply to the API described in this document, so please use it prudently.

The speeches are in the public domain. But please cite the data like so:

For the impatient: how to get the data

If you just want to download the speech data, the fastest way is to download this simple Python program. Running it on your computer will download the full speech corpus:

$ python download_mc_speeches.py

(This sends the results to a JSON-encoded file called speeches.json.)
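
Once the download finishes, the file can be read back with Python's standard json module. The sketch below assumes that speeches.json holds a single JSON array with one object per speech. That layout is an assumption about the program's output, not something this page documents, so inspect the file if the type check fails. The function name load_speeches is illustrative only.

```python
import json


def load_speeches(path="speeches.json"):
    """Load the downloaded corpus and return it as a list of speech records."""
    with open(path, encoding="utf-8") as handle:
        speeches = json.load(handle)
    # Assumption: the top-level JSON value is an array of speech objects.
    if not isinstance(speeches, list):
        raise ValueError("expected a JSON array of speeches")
    return speeches
```

Calling load_speeches() from the directory that contains speeches.json returns the records as ordinary Python objects, and len() of the result gives the number of speeches retrieved.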

NB: This is a trivial and emphatically non-optimized program. For the sake of simplicity, it holds all returned speeches in memory while paginating through the API. Since the corpus is small, this shouldn't present a problem for most computers. Our goal in providing it is only to offer (a) a simple way to download the data and (b) a clear example of how to interact with the API.
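
For readers who want to see the general shape of an in-memory pagination loop like the one described above, here is a minimal sketch. It is not the downloadable program itself. The request is delegated to a caller-supplied function, fetch_page, which is hypothetical, and the field names "items" and "next_key" are placeholders rather than the API's real response fields; consult the API documentation for the actual endpoint and fields. The pause between requests is one simple way to stay within the rate limit.

```python
import json
import time


def download_all(fetch_page, pause_seconds=1.0, sleep=time.sleep):
    """Collect every record by following pagination keys until none remain.

    fetch_page(key) is a hypothetical callable supplied by the caller. It
    performs one API request (key is None for the first page) and returns
    the decoded JSON response as a dict.
    """
    speeches = []
    key = None
    while True:
        page = fetch_page(key)
        # "items" and "next_key" are placeholder field names, not the API's.
        speeches.extend(page.get("items", []))
        key = page.get("next_key")
        if key is None:
            break
        # Pause between requests, since the API is rate-limited.
        sleep(pause_seconds)
    return speeches


def save_speeches(speeches, path="speeches.json"):
    """Write the accumulated records to a single JSON-encoded file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(speeches, handle)
```

Because every page is appended to one list before anything is written, memory use grows with the size of the corpus. That is acceptable for a collection of this size, but a larger corpus would call for writing each page to disk as it arrives.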

Other Details

For a fuller discussion of how to interact with our API, please see the API documentation. At this time the API only permits downloading the speech data. In the future we plan to add to the API in two ways:

  1. More data: We hope to expand our data offerings to include other corpora.
  2. Summary statistics: Soon the API will expose statistics such as term counts for the speech data to allow for shareable model preprocessing.
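
To illustrate what "term counts" means here, the following sketch computes them locally from speech texts that have already been downloaded. It is not the planned API feature, and its tokenizer (lowercased runs of letters and apostrophes) is a deliberately simple assumption; statistics eventually exposed by the API may be computed differently. The function name term_counts is illustrative only.

```python
import re
from collections import Counter


def term_counts(texts):
    """Count lowercase word tokens across an iterable of speech texts."""
    counts = Counter()
    for text in texts:
        # A deliberately simple tokenizer: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts
```

Passing a list of transcript strings returns a Counter, so counts.most_common(10) lists the ten most frequent terms.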