Towards politically relevant corpora that persist
The focus of PolMine is on texts published by public institutions in Germany. Corpora of parliamentary protocols are at the heart of the project: parliamentary proceedings are available for long stretches of time, cover a broad set of public policies and are in the public domain, which makes them a valuable text resource for political science. In the course of the project, we have worked with TXT, PDF and HTML documents as raw material. We establish a framework for processing plenary protocols.
Corpora need to be sustainable if we want to share and expand our experience with text mining techniques. This is an advantage of texts produced by public institutions: we do not face licenses that require researchers to delete corpora once they have been prepared. To facilitate the growth of sustainable corpora, we also aim to develop a solid codebase that ensures the corpora can be reproduced and expanded on a continuous basis.
To bring the texts into a machine-readable format suitable for a corpus, we convert them to XML ("XMLification"). All metadata contained in the original document is preserved and remains available for further analysis. To avoid a proliferation of document formats, we adhere to the standards suggested by the Text Encoding Initiative (TEI). An XML document conforming to TEI is sufficiently flexible to be turned into almost any other data format required.
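The conversion step described above could be sketched roughly as follows. This is a minimal illustration in Python, not the PolMine codebase: the function name protocol_to_tei, the metadata keys and the choice of the TEI elements sp, speaker and p for speeches are assumptions made for the example, and the output is not validated against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def protocol_to_tei(metadata, speeches):
    """Build a small TEI-style tree (hypothetical layout, not PolMine's schema).

    metadata: dict with the keys "title", "date" and "source" (assumed names).
    speeches: list of (speaker_name, [paragraph, ...]) tuples.
    """
    # Serialize TEI as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", TEI_NS)

    def q(tag):
        # Qualify a tag name with the TEI namespace.
        return f"{{{TEI_NS}}}{tag}"

    root = ET.Element(q("TEI"))

    # Header: metadata taken from the original document is kept here.
    header = ET.SubElement(root, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = metadata["title"]
    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub_stmt, q("p")).text = "Illustrative example, not published."
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    bibl = ET.SubElement(source_desc, q("bibl"))
    ET.SubElement(bibl, q("title")).text = metadata["source"]
    ET.SubElement(bibl, q("date"), {"when": metadata["date"]})

    # Body: one <sp> element per speech, with its speaker and paragraphs.
    text = ET.SubElement(root, q("text"))
    body = ET.SubElement(text, q("body"))
    div = ET.SubElement(body, q("div"))
    for speaker, paragraphs in speeches:
        sp = ET.SubElement(div, q("sp"))
        ET.SubElement(sp, q("speaker")).text = speaker
        for paragraph in paragraphs:
            ET.SubElement(sp, q("p")).text = paragraph
    return root


if __name__ == "__main__":
    # Invented sample input, for illustration only.
    tree = protocol_to_tei(
        {"title": "Sample plenary protocol", "date": "2015-01-01",
         "source": "Sample source document"},
        [("Speaker A", ["First paragraph.", "Second paragraph."]),
         ("Speaker B", ["A reply."])],
    )
    print(ET.tostring(tree, encoding="unicode"))
```

Because the metadata lives in the header and the speeches in a regular element structure, such a document can later be transformed into other formats, for example a table with one row per speech, without returning to the raw TXT, PDF or HTML files.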
The first release of the PolMine corpus of parliamentary protocols of the German Bundestag is scheduled for spring 2016. The corpus will be available to registered academic users. We ask for registration for two reasons. First, we want to support users. Second, although we make considerable efforts to prepare clean data, we rely on automated procedures, and it is during the actual use of the data that remaining errors are recognized. We want to involve users in a process to improve the quality of the data.