In recent years, the Heidelberg Research Architecture project Early Chinese Periodicals Online (ECPO) has evolved from a data silo into an open-access research platform. In the first decade of its existence, the project’s focus was on the systematization of digitized early Chinese press material. This resulted in a searchable database for image scans and bilingual metadata with over 435,000 entries, including 300,000 scans, 85,000 records and 50,000 agent names from Republican-era magazines and newspapers.
The ECPO platform was implemented in collaboration with the Institute of Modern History, Academia Sinica, Taiwan, and was made possible with funding from the Chiang Ching-kuo Foundation for International Scholarly Exchange. The platform has since been developed with further support from various institutions, such as the Centre for Asian and Transcultural Studies (CATS) Library, the Heidelberg Centre for Transcultural Studies (HCTS), the Institute of Chinese Studies and the Research Council Cultural Dynamics in Globalized Worlds at the University of Heidelberg, along with the Konfuzius-Institut Heidelberg and the University of Erlangen-Nürnberg as affiliated partners.
The ECPO database combines what we call “extensive” and “intensive” approaches to China’s early periodicals.
The extensive approach comprises a comprehensive catalog and record of Republican-era art and literary periodicals, including such basic data as title, editor, publisher, location and dates of publication, periodicity, format, prominent contributors, etc.
The intensive approach involves archiving digital cover-to-cover copies of entire runs of periodicals, analyzing their complete contents, and tagging them with structured metadata.
So far, six journals (four women’s journals and two entertainment periodicals) have been included in the database using the intensive approach, including the four magazines analysed in depth in the WoMag database. In the extensive section of the database, we have so far been able to work on a selection of some 150 journals in addition to those contained in the Xiaobao database.
We will continue to upgrade the database’s keyword metadata by mapping them, as far as possible, to established structured thesauri such as the Getty Art and Architecture Thesaurus and its Chinese version, AAT Taiwan, in cooperation with the ASDC. This mapping in turn opens up the possibility of linking ECPO directly to other digital collections.
ECPO is an Open Access resource developed by independent programmers and the Heidelberg Research Architecture (HRA). All bibliographic descriptions can be accessed through the ECPO API in MODS XML format. This allows broad sharing and exchange of the project’s research results in the future.
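To give a rough idea of how MODS XML records of this kind can be consumed, the following is a minimal sketch in Python. It does not call the ECPO API, whose endpoints are not described here. Instead it parses a small, hypothetical sample record with the standard library. The sample record, the contributor placeholder and the function name `summarize_mods` are assumptions made for illustration; only the MODS v3 namespace is taken from the published standard.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace as published by the Library of Congress.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical sample record for illustration only; actual ECPO records
# are likely to carry many more fields and a different structure.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Jing bao</title>
  </titleInfo>
  <titleInfo type="translated">
    <title>The Crystal</title>
  </titleInfo>
  <name type="personal">
    <namePart>Example Contributor</namePart>
  </name>
</mods>
"""


def summarize_mods(xml_text):
    """Collect titles and name parts from a single MODS record."""
    root = ET.fromstring(xml_text)
    titles = [t.text for t in root.findall("mods:titleInfo/mods:title", MODS_NS)]
    names = [n.text for n in root.findall("mods:name/mods:namePart", MODS_NS)]
    return {"titles": titles, "names": names}


if __name__ == "__main__":
    print(summarize_mods(SAMPLE_MODS))
```

Because MODS elements live in a namespace, the lookups pass a prefix mapping to `findall` rather than matching bare tag names.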
We are continuously working on improving our agent records. ECPO comprises more than 50,000 different names of persons, groups, or institutions. These names are recorded as they occur in the original publications. We developed a cross-database agent service which allows us to manage name records, assign them to individual agent records, or split similar names into separate agents. We use international authority files, like VIAF or GND, and larger knowledge bases like Wikidata and DBpedia, as well as the Chinese encyclopedia Baidu Baike, to uniquely identify our agent records and provide users with links to additional information on the respective agent. For an example, see the agent record of Bao Tianxiao 包天笑.
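The relationship between name records and agent records can be pictured with a small data model. The sketch below is not the actual agent service. The class names, method names and identifiers are all hypothetical, and the authority identifiers are left as an empty placeholder field. It only illustrates the two operations mentioned above: assigning name records to an agent, and splitting identical-looking names off into a separate agent.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One person, group, or institution (hypothetical model)."""
    agent_id: str
    name_record_ids: set = field(default_factory=set)
    # Placeholder for links such as {"viaf": "...", "wikidata": "..."}.
    authority_ids: dict = field(default_factory=dict)


class AgentService:
    """Toy in-memory stand-in for a name-to-agent assignment service."""

    def __init__(self):
        self.names = {}        # name record id -> name as printed
        self.agents = {}       # agent id -> Agent
        self.assignment = {}   # name record id -> agent id

    def add_name(self, record_id, text):
        self.names[record_id] = text

    def assign(self, record_id, agent_id):
        """Attach a name record to an agent, detaching it from any previous one."""
        agent = self.agents.setdefault(agent_id, Agent(agent_id))
        previous = self.assignment.get(record_id)
        if previous is not None:
            self.agents[previous].name_record_ids.discard(record_id)
        agent.name_record_ids.add(record_id)
        self.assignment[record_id] = agent_id

    def split(self, record_ids, new_agent_id):
        """Move selected name records to a new, separate agent."""
        for record_id in record_ids:
            self.assign(record_id, new_agent_id)

    def names_of(self, agent_id):
        return sorted(self.names[r] for r in self.agents[agent_id].name_record_ids)


if __name__ == "__main__":
    service = AgentService()
    service.add_name("n1", "Name A")
    service.add_name("n2", "N. A.")
    service.add_name("n3", "N. A.")   # same spelling, different person
    for rid in ("n1", "n2", "n3"):
        service.assign(rid, "agent-1")
    service.split(["n3"], "agent-2")
    print(service.names_of("agent-1"), service.names_of("agent-2"))
```

Keeping name records separate from agents means that two occurrences with the same spelling can still point to different agents, which is what makes a later split possible.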
Recently, ECPO started to work with neural networks with a focus on document layout recognition of Republican newspapers and OCRing individual text segments. Our aim is to advance the processing of Republican China newspapers and provide the content as full text. To learn more about our results, please follow our project presentations and have a look at the bibliography.
For an introduction to the features of the database and the recent experiments with neural networks, please have a look at the video Ground Truth, Neural Networks, OCR: Towards Full Text of Republican China Newspapers presented at AAS2021. We have created a short tutorial on using the database to help you find your way through it. In that section we also offer lists of the periodicals included in ECPO. We also provide a special section about the technical background, where you will find information about the systems we use and about the APIs we currently provide.
Update 2022:
As the material basis of the database consists mostly of image scans, the project has been running experiments on one Republican newspaper to explore approaches towards full-text generation. Computer-aided processing of image scans of historical periodicals is still challenging with the current state of technology, in particular because processing standards for Latin-script newspapers are not applicable to the Chinese context. Only with new approaches in machine learning has it become possible to process material that was inaccessible just a few years ago. However, many challenges remain. Extremely complex layouts make reliable automatic page segmentation difficult and have so far prevented full-text generation for these newspapers, even within China.
The application of artificial intelligence requires a ground truth data set. This error-free, manually corrected text with structural information is used both for evaluation and the training of software models for text and layout recognition. In fall of 2021, the project successfully implemented OCR on a sample from the newspaper 晶報 Jing bao (The Crystal), with a character error rate below 3% (Henke 2021). On that basis, the project is now expanding and generalizing its approach. With additional funding recently received from the Research Council Cultural Dynamics in Globalized Worlds for the first half of 2022, the project is currently producing a new data set. The project’s aim is to offer a solution to automatically produce full text from Republican newspapers using neural networks and machine learning.
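The character error rate mentioned above is commonly defined as the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch of that common definition in Python. It is not the project's own evaluation code, and the function names are chosen for illustration.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ground_truth = "上海晶報"
    ocr_output = "上海品報"   # one character misrecognised
    print(character_error_rate(ground_truth, ocr_output))
```

Under this definition, a rate below 3% means fewer than three character edits per hundred ground-truth characters. Python strings are sequences of Unicode code points, so each Chinese character counts as one unit here.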
The project’s current work will not only further develop its original aims, but will also contribute to the field of research as a whole. With the disclosure of the project’s network models and data sets, its results can be reproduced and evaluated, and its approaches can be adopted by others in the field. Although processing non-Latin scripts is still a challenge in many cases, the project hopes that its work may serve as a good-practice example for such initiatives.