How to train ParsCit for scientific journals
In the project group knowAAN our goal is to analyze the data of scientific publications. These publications are usually available as PDF files. Since we need a data format that is accepted as input by most tools, our first step is to convert the PDFs into a suitable format. This format is plain text, which the majority of extraction tools can process.
For one branch of our data analysis, we chose the extraction tool ParsCit. ParsCit generates useful output: metadata, the logical structure of a publication, and the individual fields of a publication's reference section (e.g. authors, title, date). We use this extracted data for further analysis.
There are different approaches to extracting references. ParsCit "is architected as a supervised machine learning procedure that uses Conditional Random Fields as its learning mechanism". You can train ParsCit on a data set to build a model for extracting metadata from similar inputs. This means that the quality of the extraction results depends heavily on the training data used.
We are interested in results from specific research fields, which is how the idea of analyzing whole journals came up. A journal that has existed for 10 years, with 3 issues per year and 10 publications per issue, yields a set of 300 publications. For the special case of analyzing a journal, it makes sense to train ParsCit specifically on the publications of that journal.
After testing several approaches, we developed heuristics for training ParsCit to parse journals. These heuristics consist of a gradual, iterative process of creating input data and performing a training run.
The first step is to create an empty file for tagged data. This file is ParsCit's human-readable training set. We add data to it gradually, so that we get good parsing results while keeping the file small. First, we create an initial training set. After a first training run, this set is extended with additional references that were parsed incorrectly. Once the initial set is complete, further publications can optionally be added.
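The workflow above can be sketched as a small loop. All names below are our own illustration: `pick_initial()`, `pick_corrections()`, `train()`, and `all_correct()` stand in for selecting references by hand, running ParsCit's training, and inspecting its output.

```python
# A minimal sketch of the iterative training workflow described above.
# The four callables are stubs for the manual and ParsCit-side steps,
# so only the control flow is shown here.

def build_training_set(publications, pick_initial, pick_corrections,
                       train, all_correct):
    tagged_data = []                                 # the (initially empty) tagged data file
    tagged_data += pick_initial(publications)        # initial set of references
    model = train(tagged_data)
    while not all_correct(model, publications):      # repeat until results are good
        tagged_data += pick_corrections(model, publications)
        model = train(tagged_data)
    return tagged_data, model
```

In practice each `train()` call means re-running ParsCit's model training on the grown tagged data file.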
The creation of the initial training set starts with selecting some publications from the journal. In most cases, the publications of a journal use the same citation style. A set of 7 to 10 publications is a good start for a homogeneous citation style; if some publications use different citation styles, using up to 13 publications yields better quality. From each of the chosen publications, we pick 3 to 5 references and add them to the tagged data file. Some publications have very heterogeneous references; for those, we got better results by adding up to 10 references. The initial set should consist of at least 21 lines of tagged data. The last sub-step is to train ParsCit with the generated tagged data.
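For orientation, one line of tagged data marks up one complete reference, with each field wrapped in a tag. The reference below is invented, and the exact tag set may vary between ParsCit versions, so check the tagged reference file shipped with your checkout:

```
<author> J. Doe and M. Mustermann , </author> <title> A study of reference parsing in domain journals . </title> <journal> Journal of Example Studies , </journal> <volume> 12 </volume> <pages> 34 -- 56 , </pages> <date> 2010 . </date>
```

ParsCit ships scripts to convert this tagged format into CRF++ training data and to train a new model; the script names and paths depend on the version you use.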
After the creation of the first model, we look at the parsing results for the initially chosen publications. If the results are satisfactory, this step can be skipped. If not, we choose 5 publications from the initial set that contain incorrectly parsed references. Next, we pick 3 to 4 incorrectly parsed references from each of these publications, add the corrected values to the tagged data file, and train ParsCit again. These steps are repeated until all files of the initial set of publications are parsed correctly.
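Finding the references to add back is a simple comparison of parsed fields against hand-corrected ones. The helper below is our own illustration (not part of ParsCit), assuming both sides have been loaded into dictionaries of field values per reference:

```python
# Hypothetical helper: given ParsCit's parsed fields and hand-corrected
# ("gold") fields for the same references, pick up to a few mismatching
# references per publication to add to the tagged data file.

def select_for_retraining(parsed, gold, per_publication=4):
    """parsed/gold: {publication_id: [{field: value, ...}, ...]}.
    Returns {publication_id: [reference_index, ...]} for references whose
    parsed fields differ from the corrected ones."""
    to_add = {}
    for pub_id, gold_refs in gold.items():
        mismatches = [
            i for i, (got, want) in enumerate(zip(parsed.get(pub_id, []), gold_refs))
            if got != want
        ]
        if mismatches:
            to_add[pub_id] = mismatches[:per_publication]
    return to_add
```

The cap of 3 to 4 references per publication from the step above maps to the `per_publication` parameter.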
The last step is the same as the previous one, except that publications of the journal that were not in the initial training set are used. We collect a set of publications, pick incorrectly parsed references, add the corrected data to the tagged data file, and train ParsCit.
Following these steps, you will end up with a useful model. It is also worth mentioning that we achieved good results by combining the training sets of different journals.
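Since each training set is just a file of tagged reference lines, combining them can be as simple as concatenating the files. A minimal sketch (our own helper, not part of ParsCit) that also drops duplicate lines:

```python
# Merge the tagged data files of several journals into one combined
# training set, skipping blank lines and exact duplicates.

def merge_tagged_data(paths, out_path):
    seen = set()
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line and line not in seen:
                    seen.add(line)
                    merged.append(line)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")
    return len(merged)   # number of distinct tagged references written
```

The combined file can then be used for one training run, just like a single journal's tagged data.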