Corpora Developed

Listed below are the corpora I have been developing, each with a download link and a brief summary. Data extracted from PDFs may contain errors caused by encoding issues or by text embedded in images that could not be extracted. I have therefore also made the original PDFs available in compressed format, which may be especially useful for studies that rely on the images.

Corpus of al-Qa’ida English-Language Periodicals (2017) — This corpus contains over 380,000 words from sixteen issues of Inspire, the magazine produced by al-Qa’ida. All content from the magazines is provided in text format, both as individual files and as a single conjoined file. Basic statistical data are included, viz. character, word, and line counts for each file, and word frequencies for each file and for the entire corpus. Stopwords and punctuation were removed before the frequency statistics were calculated. Additional relevant information can be found in the included README. The original PDFs can also be downloaded in compressed format here.

Corpus of Da’esh English-Language Periodicals (2016) — This corpus contains over 515,000 words from fifteen issues of Dabiq and seven issues of Rumiyah, the two magazines produced by Da’esh (i.e. the Islamic State). All content from the magazines is provided in text format, both as individual files and as a single conjoined file. Basic statistical data are included, viz. character, word, and line counts for each file, and word frequencies for each file and for the entire corpus. Stopwords and punctuation were removed before the frequency statistics were calculated. Additional relevant information can be found in the included README. The original PDFs can also be downloaded in compressed format here.
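
For readers who want to reproduce or extend the basic statistics described for both corpora, the following is a minimal Python sketch of one way to compute per-file character, word, and line counts and stopword-filtered word frequencies. It is not the script used to build the corpora: the function names, the tokenization rule, and the very short stopword list are all assumptions made for illustration, so results may differ from the figures shipped with each corpus. The README for each corpus remains the reference for how the published statistics were produced.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# the list actually used for the published statistics may differ.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def basic_counts(text):
    """Return (characters, words, lines) for the text of one file."""
    return len(text), len(text.split()), len(text.splitlines())


def word_frequencies(text, stopwords=STOPWORDS):
    """Count words in one text after dropping punctuation and stopwords."""
    # Assumed tokenization: lowercase runs of letters or digits,
    # allowing internal apostrophes (e.g. "don't").
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


def corpus_frequencies(texts):
    """Sum the per-file word frequencies over an iterable of texts."""
    total = Counter()
    for text in texts:
        total.update(word_frequencies(text))
    return total
```

To apply the sketch to a downloaded corpus, read each individual text file into a string (for example with pathlib.Path.read_text and an explicit encoding), pass it to basic_counts and word_frequencies, and pass the full list of strings to corpus_frequencies for corpus-wide totals.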