An Academic Catalogue

An openly licensed collection of metadata about all of humanity’s scholarship would be extremely useful to UK and international Higher Education. Libraries, developers and even researchers and authors would be able to find the data they need in seconds.

The UKRISS project is investigating various options for building a UK national research information infrastructure. Right now, university administrators, managers and even scholars themselves need to report data to multiple funding organisations, each time sending a very similar dataset which nonetheless takes a lot of time to compile, again and again.

In order to help UKRISS fix this situation, Cottage Labs is looking at how information from different data sources can be mingled, enhanced and validated so that it’s more accurate and useful and, ultimately, so that end-users have to type a lot less to meet reporting requirements.

But how does one detect a misspelling in a journal name? A non-existent DOI? We discuss this to some extent in our previous post on UKRISS Model Validation. While that post describes the process of validation, the process needs a lot of data (research information) to become a high-calibre time saver for multiple users.

And this is where the Academic Catalogue, an index of all scholarship, comes into play. Initially we are focussing on journal-level data only, but will most probably refine this to article-level data at some point.
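To illustrate why a catalogue helps with the misspelling question, here is a minimal sketch of checking a typed journal title against titles the catalogue already knows. It uses fuzzy string matching from Python's standard library. The title list, the function name `suggest_title` and the 0.8 similarity cutoff are all assumptions made for this example, not part of the UKRISS validation work.

```python
import difflib

# Hypothetical sample of journal titles already held in the catalogue.
KNOWN_TITLES = [
    "Journal of Stuff",
    "International Journal of Things",
    "Annals of Examples",
]


def suggest_title(candidate, known_titles=KNOWN_TITLES, cutoff=0.8):
    """Return the closest known title, or None if nothing is similar enough."""
    # Compare case-insensitively, but hand back the title as catalogued.
    lowered = {title.lower(): title for title in known_titles}
    matches = difflib.get_close_matches(
        candidate.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(suggest_title("Jurnal of Stuf"))
print(suggest_title("Quarterly Review of Nothing"))
```

The more titles the catalogue holds, the more likely a misspelt entry has a close match to suggest, which is the reason the validation process needs a lot of data.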


Data Sources

The word “catalogue” suggests some original source of data which has been included or described in this catalogue. We will be extracting journal titles, journal title abbreviations, ISSN-s and publisher names from:

  • NCBI Entrez E-Utilities
    We are interested in their PubMed database, which contains more than 20 million bibliographic records in the medical field. It is also well known for the MEDLINE dataset, which is a subset of the whole database. (For a nice distinction between the different subsets and bits of PubMed see WHAT DATASET ARE WE TALKING ABOUT at the related OpenBiblio project blog post.)

    PubMed contains article-level information; we will be harvesting journal-level data from it.

  • DOAJ – Directory of Open Access Journals
    This is a journal-level resource. It describes only a smallish subset of all academic journals (some of the open access ones). Nonetheless, they have kindly provided us with a data dump and are keen to share their information openly, for example through the building of a suitable lightweight API. This is very welcome in a world in which institutions try to hold on to their data as much as possible.

We are looking into SHERPA RoMEO, Mendeley, ORCID and the British National Bibliography for extracting more journal-level data. These links will take you to the API / Developer documentation for each of these services.
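As a rough sketch of what harvesting journal-level data from article-level PubMed records could look like, the example below collapses article records into one entry per ISSN. The XML is a hand-written fragment with placeholder values, shaped the way we understand PubMed's XML output to be; the function name `harvest_journals` and the output layout are assumptions for illustration only. It deliberately makes no network calls.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape of PubMed article XML;
# the values are placeholders, not real records.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <Journal>
          <ISSN IssnType="Print">1234-5678</ISSN>
          <Title>Journal of Stuff</Title>
          <ISOAbbreviation>J Stuff</ISOAbbreviation>
        </Journal>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>23456789</PMID>
      <Article>
        <Journal>
          <ISSN IssnType="Print">1234-5678</ISSN>
          <Title>Journal of Stuff</Title>
          <ISOAbbreviation>J Stuff</ISOAbbreviation>
        </Journal>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def harvest_journals(xml_text):
    """Collapse article-level records into one entry per ISSN."""
    journals = {}
    root = ET.fromstring(xml_text.strip())
    for citation in root.iter("MedlineCitation"):
        journal = citation.find("Article/Journal")
        if journal is None:
            continue
        issn = journal.findtext("ISSN")
        if not issn:
            continue  # this sketch simply skips records without an ISSN
        entry = journals.setdefault(
            issn, {"titles": set(), "abbreviations": set(), "sources": []}
        )
        title = journal.findtext("Title")
        abbreviation = journal.findtext("ISOAbbreviation")
        if title:
            entry["titles"].add(title)
        if abbreviation:
            entry["abbreviations"].add(abbreviation)
        # Remember which article each observation came from.
        pmid = citation.findtext("PMID", default="unknown")
        entry["sources"].append("pubmed/" + pmid)
    return journals


journals = harvest_journals(SAMPLE_XML)
print(journals)
```

Both sample articles belong to the same journal, so the result holds a single journal entry with two article-level sources behind it.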


Reconciling the different sources

Ultimately, we plan to provide records which look like this:

    "canonical_journal_title" : "Journal of Stuff", # <-- calculate these later
    "journal_title" : ["Journal of Stuff", "International Journal of Things", ...],
    "canonical_issn" : "1234-5678", # <-- calculate these later
    "issn" : ["1234-5678", "9876-5432", ...],
    "canonical_publisher_name" : "Elsevier", # <-- calculate these later
    "publisher_name" : ["Elsevier", "Elsevier GmbH", ...],
    "provenance" : [
        {"issn" : "1234-5678", "source" : "pubmed/12345678", "date" : ""},
        {"issn" : "9876-5432", "source" : "repo/item/456", ...},
        {"journal_title" : "Journal of Stuff", "source" : "pubmed/12345678", ...},
        {"journal_title" : "Journal of Stuff", "source" : "repo/item/456", ...},
        {"journal_title" : "International Journal of Things", "source" : "repo/item/456", ...},
        {"publisher_name" : "Elsevier", "source" : "repo/item/456"},
        {"publisher_name" : "Elsevier GmbH", "source" : "pubmed/12345678"}
    ],
    # other fields we've chosen to pick up from some data source or another
    "electronic_issn" : ["1555-2101"],
    "print_issn" : [],

In essence, when we collect a piece of information (say, an ISSN) from a data source like DOAJ, we add it to that journal’s record but do not change the canonical fields. We only add the new ISSN to the journal record’s list of known ISSN-s.
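A minimal sketch of that collection step, assuming records are plain dictionaries shaped like the example above. The function name `add_observation` and the `doaj/example` source label are hypothetical.

```python
def add_observation(record, field, value, source):
    """Add a newly seen value to a journal record.

    The value goes into the list of known values and into the provenance
    trail; the canonical_* fields are deliberately left untouched.
    """
    values = record.setdefault(field, [])
    if value not in values:
        values.append(value)
    record.setdefault("provenance", []).append({field: value, "source": source})
    return record


record = {
    "canonical_issn": "1234-5678",
    "issn": ["1234-5678"],
    "provenance": [{"issn": "1234-5678", "source": "pubmed/12345678"}],
}
add_observation(record, "issn", "9876-5432", "doaj/example")
print(record["issn"])
print(record["canonical_issn"])
```

Keeping a provenance entry even when the value is already known means later steps can count how many sources agree on it.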

“Later”, as the comments above helpfully state, we run a script over our dataset. This script trusts some data sources more than others – e.g. it may trust the DOAJ more than information from a single institutional repository. However, if multiple less-trusted sources confirm the same information, then that overrides the more-trusted source and becomes the canonical value. We have not implemented this in practice yet, but plan to run it every day, possibly several times per day.
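Since this script does not exist yet, the following is only one possible sketch of the idea: each provenance entry votes for its value with a weight that depends on how much we trust the source, and the value with the highest total becomes canonical. The trust weights, the use of the part of `source` before the first slash as the source type, and the name `pick_canonical` are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical trust weights per source type; real values would need tuning.
SOURCE_TRUST = {"doaj": 3.0, "pubmed": 2.0, "repo": 1.0}


def pick_canonical(record, field, trust=SOURCE_TRUST, default_weight=1.0):
    """Pick the value of `field` with the highest total trust score.

    Returns None if no provenance entry mentions the field. On a tie,
    the value seen first in the provenance list wins.
    """
    scores = defaultdict(float)
    for entry in record.get("provenance", []):
        if field not in entry:
            continue
        # Assumption: the source type is the part before the first slash.
        source_type = entry.get("source", "").split("/", 1)[0]
        scores[entry[field]] += trust.get(source_type, default_weight)
    if not scores:
        return None
    return max(scores, key=scores.get)


record = {
    "provenance": [
        {"publisher_name": "Elsevier GmbH", "source": "doaj/example"},
        {"publisher_name": "Elsevier", "source": "repo/item/1"},
        {"publisher_name": "Elsevier", "source": "repo/item/2"},
        {"publisher_name": "Elsevier", "source": "repo/item/3"},
        {"publisher_name": "Elsevier", "source": "repo/item/4"},
    ]
}
record["canonical_publisher_name"] = pick_canonical(record, "publisher_name")
print(record["canonical_publisher_name"])
```

With these weights, four repository records (4.0 in total) outvote a single DOAJ record (3.0), which is the "multiple less-trusted sources override a more-trusted one" behaviour described above. With only two repository records, the DOAJ value would remain canonical.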

We still provide the lists of possible ISSN-s, or possible titles, as not all applications (including within UKRISS) will need the canonical fields.


More Data Sources

In order to create an Open index of all scholarship, even if it’s just journal information, we need to look at a lot more data sources than we have listed above. We don’t really know how many journals we need to have information on in order to cover a substantial amount of the world’s scholarship, but a previous attempt to build this kind of catalogue left us in the 50 – 60 thousand range, so we will reevaluate the situation once we get there.

Please tell us in the comments about data sources which might have journal information!
