An Academic Catalogue

An openly licensed collection of metadata about all of humanity’s scholarship would be extremely useful to UK and international Higher Education. Libraries, developers and even researchers and authors would be able to find the data they need in seconds.

The UKRISS project is investigating various options for building a UK national research information infrastructure. Right now, university administrators, managers and even scholars themselves need to report data to multiple funding organisations, each time sending a very similar dataset which nonetheless takes a lot of time to compile, again and again.

In order to help UKRISS fix this situation, Cottage Labs is looking at how information from different data sources can be mingled, enhanced and validated so that it’s more accurate and useful and, ultimately, so that end-users have to type a lot less to meet reporting requirements.

But how does one detect a misspelling in a journal name? A non-existent DOI? We discuss this to some extent in our previous post on UKRISS Model Validation. While that post describes the process of validation, that process needs a lot of data – research information – to become a high-calibre time saver for multiple users.
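
As one illustration of how a catalogue of known titles could help catch a misspelled journal name, here is a minimal sketch using fuzzy string matching from Python's standard library. The list of titles, the function name suggest_title and the 0.8 cutoff are assumptions made for this example, not part of the UKRISS tooling.

```python
import difflib

# Hypothetical catalogue of known journal titles (illustration only).
KNOWN_TITLES = [
    "Journal of Stuff",
    "International Journal of Things",
    "Annals of Examples",
]


def suggest_title(candidate, known_titles=KNOWN_TITLES, cutoff=0.8):
    """Return the closest known title for a possibly misspelled one, or None."""
    # Compare case-insensitively, but return the title as catalogued.
    lowered = {title.lower(): title for title in known_titles}
    matches = difflib.get_close_matches(
        candidate.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(suggest_title("Jounral of Stuff"))
print(suggest_title("zzzz"))
```

A larger catalogue makes this kind of check more useful, which is the motivation for collecting as many titles as possible.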

And this is where the Academic Catalogue, an index of all scholarship, comes into play. Initially we are focussing on journal-level data only, but will most probably refine this to article-level data at some point.


Data Sources

The word “catalogue” suggests some original source of data which has been included or described in this catalogue. We will be extracting journal titles, journal title abbreviations, ISSN-s and publisher names from:

  • NCBI Entrez E-Utilities
    We are interested in their PubMed database, which contains more than 20 million bibliographic records in the medical field. It is also known for the MEDLINE dataset, which is a subset of the whole database. (For a nice distinction between the different subsets and bits of PubMed see WHAT DATASET ARE WE TALKING ABOUT at the related OpenBiblio project blog post.)

    PubMed contains article-level information – we will be harvesting journal-level data from that.

  • DOAJ – Directory of Open Access Journals
    This is a journal-level resource. It only describes a smallish subset of all academic journals (some open access ones). Nonetheless, they have kindly provided us with a data dump and are keen to share their information openly, for example through the building of a suitable lightweight API. This is very welcome in a world in which institutions try to hold on to their data as much as possible.

We are looking into SHERPA RoMEO, Mendeley, ORCID and the British National Bibliography for extracting more journal-level data. These links will take you to the API / Developer documentation for each of these services.


Reconciling the different sources

We plan to ultimately provide records which look like this:

    "canonical_journal_title" : "Journal of Stuff", # <-- calculate these later
    "journal_title" : ["Journal of Stuff", "International Journal of Things", ...],
    "canonical_issn" : "1234-5678", # <-- calculate these later
    "issn" : ["1234-5678", "9876-5432", ...],
    "canonical_publisher_name" : "Elsevier", # <-- calculate these later
    "publisher_name" : ["Elsevier", "Elsevier GmbH", ...],
    "provenance" : [
        {"issn" : "1234-5678", "source" : "pubmed/12345678", "date" : ""},
        {"issn" : "9876-5432", "source" : "repo/item/456", ...},
        {"journal_title" : "Journal of Stuff", "source" : "pubmed/12345678", ...},
        {"journal_title" : "Journal of Stuff", "source" : "repo/item/456", ...},
        {"journal_title" : "International Journal of Things", "source" : "repo/item/456", ...},
        {"publisher_name" : "Elsevier", "source" : "repo/item/456"},
        {"publisher_name" : "Elsevier GmbH", "source" : "pubmed/12345678"}
    ],
    # other fields we've chosen to pick up from some data source or another
    "electronic_issn" : ["1555-2101"],
    "print_issn" : [],

In essence, when we collect a piece of information (say, an ISSN) from a data source like DOAJ, we add it to that journal’s record, but do not change the canonical fields. We would only add a new ISSN to a journal record’s list of known ISSN-s.
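
The behaviour described above could be sketched roughly as follows. The function name add_observation and its parameters are hypothetical; the field names follow the example record.

```python
def add_observation(record, field, value, source, date=""):
    """Record that `source` reported `value` for `field`.

    The value is appended to the list of known values if it is new, and a
    provenance entry is always added. The canonical_* fields are never
    modified here; they are left for a later pass over the dataset.
    """
    known_values = record.setdefault(field, [])
    if value not in known_values:
        known_values.append(value)
    record.setdefault("provenance", []).append(
        {field: value, "source": source, "date": date}
    )
    return record


record = {
    "canonical_issn": "1234-5678",
    "issn": ["1234-5678"],
    "provenance": [],
}
# "doaj/example" is a made-up source identifier.
add_observation(record, "issn", "9876-5432", "doaj/example")
print(record["issn"])
print(record["canonical_issn"])
```

Keeping every observation in the provenance list, even a repeated one, is what later allows several sources to be counted as confirming the same value.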

“Later”, as the comments above helpfully state, we run a script over our dataset. This script trusts some data sources more than others – e.g. it may trust the DOAJ more than information from a single institutional repository. However, if multiple less-trusted sources confirm the same information, then that overrides the more-trusted source and becomes the canonical value. We have not implemented this in practice yet, but plan to run it every day, possibly several times per day.
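
Since this script has not been written yet, the following is only a sketch of one way the rule could work: each source type carries a trust weight, the weights of all sources asserting a value are added up, and the highest total wins. The weights, the DEFAULT_TRUST fallback and the function name canonical_value are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical trust weights, keyed by the part of "source" before the first "/".
TRUST = {"doaj": 2.0, "pubmed": 2.0, "repo": 1.0}
DEFAULT_TRUST = 0.5  # assumed weight for source types we know nothing about


def canonical_value(provenance, field):
    """Pick the value of `field` with the highest summed trust, or None."""
    scores = defaultdict(float)
    for entry in provenance:
        if field not in entry:
            continue
        source_type = entry.get("source", "").split("/", 1)[0]
        scores[entry[field]] += TRUST.get(source_type, DEFAULT_TRUST)
    if not scores:
        return None
    # On a tie, max() keeps the value that was seen first.
    return max(scores, key=scores.get)


provenance = [
    {"publisher_name": "Elsevier GmbH", "source": "doaj/example"},
    {"publisher_name": "Elsevier", "source": "repo/item/1"},
    {"publisher_name": "Elsevier", "source": "repo/item/2"},
    {"publisher_name": "Elsevier", "source": "repo/item/3"},
]
print(canonical_value(provenance, "publisher_name"))
```

In this example three repository records (1.0 each) outweigh a single DOAJ record (2.0). A real implementation would probably also need to decide whether several items from the same repository count as independent confirmations; this sketch simply counts every provenance entry.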

We still provide the lists of possible ISSN-s, or possible titles, as not all applications (including within UKRISS) will need the canonical fields.


More Data Sources

In order to create an Open index of all scholarship, even if it’s just journal information, we need to look at a lot more data sources than we have listed above. We don’t really know how many journals we need to have information on in order to cover a substantial amount of the world’s scholarship, but a previous attempt to build this kind of catalogue left us in the 50 – 60 thousand range, so we will reevaluate the situation once we get there.

Please tell us about data sources which might have journal information in the comments!


UKRISS Model Validation

The UKRISS project is working on representing information about research projects and their outcomes in CERIF, an industry standard in the description of educational activities and assets. The output of this work is a proposal for a harmonised reporting model – a common way of representing research outputs and related pieces of information important to multiple stakeholders in Higher Education.

UKRISS Model Validation introduces the topic and describes the technicalities of understanding, validating and enhancing data which conforms to the proposed model.


UKRISS interactions with HEDIIP and CERIF in Action

UKRISS recently contributed to the draft HEDIIP report “Creating and Managing a Data Model”, which investigated the feasibility of defining a data model which maps, across the HE lifecycle, the main entities, their relationships and their use in the various sector-level data systems. This contribution involved several in-depth discussions between John Milnor, the main author, and UKRISS team members. The report is currently in the review phase and is expected to be released in 2013. Its focus is on teaching, whereas UKRISS is primarily concerned with research outcomes. However, a number of areas of intersection were identified, such as information concerning postgraduate students. Also, many entities within the research and teaching domains would benefit from the use of shared vocabularies and identifiers. We also thought that CERIF may be a useful language for modelling some of the concepts identified in the HEDIIP report. As a result of these interactions, the CERIF task group has already given some thought to the linkage with the Education domain, which builds upon entities such as service and event.

HEDIIP (Higher Education Data and Information Improvement Programme) is an activity set up within the framework of the Regulatory Partnership Group, whose aim is to enhance the arrangements for the collection, sharing and dissemination of data and information about the HE system. The remit is in line with the recommendations of section 6.22 of the Government white paper “Students at the Heart of the System”.

The UKRISS team have also been discussing the mapping of core profile fields with the Jisc-funded CERIF in Action project, which recently received an extension. One of the main use cases is the bulk upload of research reporting information about RCUK research projects from institutional CRIS systems to the Research Outcomes System (ROS). This covers only a subset of the outcomes reported to funders by PIs. It includes in particular information about publications. The aim is to map the fields of the core reporting profile to CERIF. The objective is to provide these mappings to CRIS vendors so that they can update their systems for future ROS data gathering. In addition, UKRISS is developing schema validation tools as part of its phase 2 activities, which enable implementers of the core profile to validate the XML generated as well as the content of the information fields themselves. We also see CRIS vendors as a potential landing zone for this work.


CERIF elements and vocabularies landscape survey

Landscape survey spreadsheet

Brief notes 

 As part of WP4, particularly in preparation for its core component of CERIF mapping, it was decided to conduct a ‘landscape’ survey of existing practices and technologies. Specifically, this study examines the current element sets in use in a number of major projects (particularly any use they make of CERIF elements) and also what dictionaries or other vocabularies they use to support the semantics of their CERIF applications. This document provides brief contextual information on the results of this survey: there is also a spreadsheet (available here) which represents the results of the survey of data elements and fields.

Element sets

The spreadsheet documents the data elements used in the following projects, products and resources:-

  • RIOXX
  • Gateway to Research
  • Pure
  • CERIF for Datasets
  • IRIOS
  • CASRAI

In addition, we looked at Symplectic, although no information on its CERIF-XML export was publicly available.

Column A in the spreadsheet provides the project or product name, and the next five the hierarchical arrangement of elements, starting with the highest level in column B. Column G provides element definitions when available and column H any notes.

Of these, Gateway to Research, CERIF for Datasets and IRIOS use CERIF, RIOXX uses Simple Dublin Core supplemented by a small number of extra elements (RIOXX Terms), and CASRAI uses its own element set. In addition, Pure use CERIF (although the exact elements used are not published).

These resources vary considerably in the level of detail they record and the manner in which they are structured. The simplest is the flat-level representation used by RIOXX, which is based on Simple Dublin Core fields and DC Terms for audience, issue date and references, supplemented by two RIOXX elements for project ID and funder.

All of the other projects use more complex, multi-granular levels of description.

Of the three CERIF-based projects, Gateway to Research has the most extensive element set. Element sets are defined for:-

  • projects (cfProj)
  • persons (cfPers)
  • research publications (cfResPubl)
  • organisational units (cfOrgUnit)
  • funding (cfFund)
  • measures (cfMeas)
  • postal addresses (cfPAddr)
  • research patents (cfResPat)
  • research products (cfResProd)

    in addition to CERIF class definition elements (cfClassScheme and cfClass). A standard set of sub-elements and linking elements is used for all of these (designated Second- and Third-level elements in the spreadsheet respectively).

CERIF for Datasets do not have such an extensive element set. They concentrate more on research outputs and also incorporate more Dublin Core elements into their descriptions. Beyond standard description and identification information for outputs, they also include geographic bounding information and spatial and temporal coverage metadata. A number of Dublin Core elements are also used for rights information.

IRIOS use a smaller element set, providing basic information on the project, funding, persons, organisational units, postal and email addresses. Relatively limited sub-elements are deployed within each of these.

Pure state in their documentation that they use CERIF elements internally, although the public documentation does not detail the exact implementation of these. The spreadsheet records the elements listed in this documentation, arranged into the broad categories given. Most categories include statements that more elements are available in addition to the main ones listed. All of these elements would map neatly into CERIF.

CASRAI has an extensive element set arranged over three nested levels, which is detailed in the spreadsheet. All of these should map into CERIF given the use of appropriate semantics.


 In addition to compiling this element set survey, we also examined what dictionaries (if any) are being used to support CERIF data infrastructures.

Research Fish employs a small dictionary, its controlled terms mainly limited to types of publication. Content rules (for instance on the format of author names) are employed for a number of other fields, such as Title. PubMed is used as the primary source of publication IDs. Controlled lists are used for staff roles, sectors, qualifications, engagement activities, audience for engagement activities, types of influence on policy etc., impact types, types of research methodology, types of product output, product development stages, and types of award/recognition.

 ROS employs more detailed dictionaries, covering publications, other research outputs, collaboration and partners, further funding, staff development, dissemination, intellectual property and exploitation, awards/recognition and impact. Publication types and other research outputs are the most detailed (the latter divided into biological, creative, electronic, physical, research materials and other).

Further dictionaries are employed for languages, roles, beneficiaries of outcomes, broadcast media and coverage, and venue types for events (along with descriptions of their coverage (national/international etc) and audience size). Dictionaries for information on collaboration include organisation types and sectors. Funding has lists for funding organisations; staff development for project roles, destination for roles and destination sectors; dissemination for types of activity, nature of briefings to government advisers and target audience; intellectual property for other sectors’ involvement, stage of disclosure, exploitation types and employee numbers in spin-out companies; impact for type of impact, the ways in which influence is brought about, target audience and impact type; and key findings for sector.

Gateway to Research publishes a CERIF class dictionary with 236 terms including terms covering the status of a grant (active, closed etc), types of grant, funding schemes and so on.

euroCRIS have published a CERIF vocabulary, currently in version 1.5, which contains 450 terms, each of which is associated with a classification scheme (Organisation Types, Funder Types, Output Types, Person Contact Details, Electronic Address Types, Organisation Contact Details, Media Relations, Research Infrastructure Types, Identifier Types, Person Names, Education Domain Terms, Activity Funding Types, Activity Finance Categories, Activity Finance Category Amounts, Publication Statuses, Peer Reviews, Output Quality Levels, Open Science Costs and Verification Statuses). This provides a useful core vocabulary for any CERIF application and should be employed as the primary scheme.

RIOXX have produced a controlled vocabulary for international and UK funders. This could be very useful, although it is not clear whether it will continue to be updated.


Landscape survey

We have now completed a survey of current use of CERIF in a number of projects and products. Those we examined were:-

Gateway to Research
CERIF for Datasets

Of these, four (Pure, Gateway to Research, CERIF for Datasets and IRIOS) employ CERIF, although Pure do not make their CERIF data structures publicly available. The projects we looked at have very different approaches to the elements used and the levels of granularity they employ. Details can be found in the landscape survey spreadsheet and accompanying notes. The notes are reproduced in the next post.

In addition, we looked at the use of vocabularies by these projects, and at what others are available. euroCRIS themselves publish a list of about 450 terms in a number of classification schemes. Otherwise, the most extensive scheme available is that published by ROS, which covers publications, other research outputs, collaboration and partners, further funding, staff development, dissemination, intellectual property and exploitation, awards/recognition and impact. Information on these dictionaries and vocabularies can be found in the notes accompanying the landscape survey.


Phase 2 Technical Work

At the review held after the first phase of UKRISS, it was agreed with Jisc that the focus of phase 2 would be on modelling to support the reporting of research outcomes. The main objectives are to identify opportunities for harmonisation across the data collected by Research Outcomes System (ROS), Research Fish and the HEFCE HE-BCI survey, to document the differences between the various information requests and to define a core information profile of common fields together with supporting CERIF mappings and dictionaries of terms.

In order to demonstrate the validity of the models (including CERIF mappings and dictionaries) developed by the project and the feasibility of their being taken forward by the community, the team at Cottage Labs and Exeter will develop pre-beta quality software which will aim to validate the models by answering the following questions:

  • Can the models be suitably populated from institutional systems? This will allow us to address whether the model is complete and sufficiently rich to represent the real data held and managed by institutions.
  • Is the data in the model consistently applied and correct? That is, can the data, when held in the model, be validated as correct? Validation here means more than simple schema validation: it also means checking that the content of the fields themselves is appropriate. This will allow us to assess whether the model can be used consistently across many organisations, enabling the interchange of data.
  • Can aggregates of the data in the model formats provide extra value? In aggregating the data can we enable more advanced usages of it? This would allow us to determine if the model has the right degree of granularity and ease of interpretation for uses outside of the organisations that create it.
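
To illustrate what validating field content (rather than just the schema) can mean, here is a small sketch that checks the check digit of an ISSN using the standard ISSN modulus-11 rule. The function name is_valid_issn is our own choice for this example and this is not the project's validator.

```python
import re

# Four digits, a hyphen, three digits and a final check character (digit or X).
ISSN_PATTERN = re.compile(r"^(\d{4})-(\d{3})([\dXx])$")


def is_valid_issn(issn):
    """Return True if `issn` is well formed and its check digit is correct."""
    match = ISSN_PATTERN.match(issn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # The first seven digits are weighted 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected


print(is_valid_issn("0378-5955"))
print(is_valid_issn("1234-5678"))
```

A string such as "1234-5678" would pass a schema check that only looks at the pattern, but fails this content check because its last digit does not match the computed check digit.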

In addition, the project will produce documentation which will assist potential adopters of the models in implementing and using them. This documentation will also provide support for the justifications that the project will make for its modelling decisions. You can read a more detailed outline of the technical work here.


Regulatory Partnership Group spring conference

The Regulatory Partnership Group (RPG) was asked by the Government in June 2012 to develop a new operating framework for higher education in England to take account of the new funding arrangements introduced in September 2012. A vision was set out in the 2011 White Paper “Students at the Heart of the System”.

A draft of the operating framework was circulated prior to the conference and was discussed during the presentations and breakout sessions. The overall feedback was that the document represented a step forward. However the current version is complex and hard to understand, and not accessible to students in particular. The main new feature in the report seemed to be a registry of providers of HE services. Overall it was hard to judge from the document what was a description of existing structures and what was new. Further iterations of the operating framework will be circulated in the next few months, in particular including market research with students.

Part of the RPG activity is on redesigning the information landscape. Andy Youell from HESA gave a very clear summary of the work carried out in the HESA Information Landscape Study of June 2012 and subsequent work. A new body called HEDIIP has been formed and is currently defining a programme of work, supported by a programme office. Tom Wilson will chair the programme board. The overall objectives are:

1. Collective oversight to enable effective governance of data collection.

2. Definition of a common data language.

3. An inventory of data collection and collectors.

4. Specific data standards work.

From the UKRISS perspective, there is likely to be overlap between the UKRISS work on research information reporting and HEDIIP activities, in areas such as reporting on research students, subject coding etc. Hence we plan to engage with HEDIIP during the next few months to identify common information fields, definitions and dictionaries.


UKRISS Phase 2: Data Gathering

Over the last few weeks, the project team have been working on requirements and data gathering for Phase 2. At KCL, the team has been working on gathering information about the reporting information collected by the RCUK systems ROS (Research Outcomes System) and Research Fish. Research Fish is used by MRC and STFC, whilst the remaining five councils use ROS. Meetings have been held with MRC to discuss in detail the information collected by Research Fish. This has been combined with scraping questions and help information from the user interface (assisted by Cottage Labs) and an analysis of the KCL MRC dataset. A similar process has been undertaken for ROS, based on an analysis of information provided by NERC as well as technical specifications of the ROS system. Research Fish collects information from PIs (or delegated staff) only. ROS reporting is carried out by PIs (or delegated representatives), but in part can be done by bulk upload of data in Excel format. The next step in the analysis will be to perform a detailed comparison of the information fields collected by the two systems in order to define a core information profile for reporting, and to identify where there are significant conflicts in reporting requirements.

In parallel, a survey of CERIF mappings performed by related RIM projects has been carried out. This includes RIOXX, CiA, IRIOS (1 & 2) and Gateway to Research. This is in preparation for the next step of mapping the core information profile to CERIF, and will help us to reuse mappings already carried out by previous projects.

The team at Brunel University has been working on gathering data on institutional research information reporting. So far, we have drafted an initial set of questions, which have been refined into a two-pronged data gathering exercise, after detailed discussion with the rest of the project team on the issues around institutional reporting and the type of data we need for the project.

Some of the issues raised centre on the varied systems in use to record and report research information, what reporting they are used for, and how difficult it is for these systems to talk to one another. Underpinning all this is the fact that there may not be any person or department with a complete overview of all the research information reporting going on, and certainly no one who is familiar with all systems. It’s likely that researchers are individually responsible for reporting (or delegating this) to funders, while other departments record research outputs. Reporting to funders can therefore take place directly without any further reference to the institution; so no complete records of what’s reported may exist centrally.

To tackle these issues, we felt that the best way to capture information in the first data gathering phase is to document the research information reporting process with an active researcher for each RCUK funded project with access to relevant reporting systems, to see what information is asked for, and to harvest data fields and guidance to set the context, identify ambiguities, and compare data across funders for commonalities and differences. This data can then be used to inform the technical mapping of vocabularies to fulfil some of the project’s ultimate goals: the simplification, standardisation and efficiency of institutional research information reporting.

We are also planning a second data gathering phase: this time looking for information on time and FTE cost of reporting for researchers, targeting a broader sample of researchers to identify patterns, and areas where efficiencies can be made to make processes simpler for researchers, institutions and funding councils, while meeting reporting requirements.


Appendices to the UKRISS Feasibility Study

As part of the first phase of this project, we published a feasibility and scoping study with the aim of presenting an overview of the research reporting landscape, and defining the options for phase 2. The background research for the report included carrying out interviews with key stakeholders, and technical reviews of commercial CRIS systems and existing national research information systems. While the results of this research are summarised in the report, some readers may find it useful to view the datasets on which the report was based. These have been included in separate appendices which can be viewed by clicking on the links below.

Appendix A: UKRISS Project Background

Appendix B: Stakeholder Matrix

Appendix C: Landscape Study

Appendix D: Technology Review

Appendix E: Question Set

Appendix F: Participant Consent Form

Appendix G: Drivers and Driver Aggregation

Appendix H: Full set of requirements

Appendix I: Classification of Stakeholders and Requirements

Appendix L: Data Management Guidelines


UKRISS Phase 2: focus on modelling

Following the recent publication of the Feasibility Study from phase 1 of the UKRISS project, we are pleased to announce that JISC have approved the funding for Phase 2 of the project.

We previously provided JISC with 3 options for how phase 2 could progress:

Option 1: Focus on modelling. This option focuses on the development of a core information profile and serialisation in CERIF. Extensions to include CERIF modelling of organisational structures and HR data can be made to support internal collation of research information.

Option 2: Focus on benchmarking. This option involves the development of a benchmarking tool that exploits shared information to carry out cross-institutional information analysis. A subset of the modelling work in 1 will be carried out to support this.

Option 3: Focus on reporting infrastructure. This option has a focus on the development of a reporting service based on connecting funders and institutions to a single cloud-based connector (Enterprise Service Bus). Smaller CERIF modelling and benchmarking tasks are included as well as work on sustainability of such a service.

After deliberating on the pros and cons of each option, consulting with the Steering Group, and with other members of the community, Jisc has selected Option 1: Focus on modelling.

It was felt that the work required for Option 2: Focus on Benchmarking was being partially addressed already elsewhere, for example by the Gateway-to-Research for HE and Snowball projects, and that the aim was too long term for the time and resources available to UKRISS. Therefore, the Steering Group didn’t feel that this was worth pursuing further at this point. Meanwhile, Option 3: Focus on Reporting, wasn’t felt to have a sufficiently strong business justification, given the size and complexity of the task, and more work was needed on the technical options appraisal. The direction signalled by Option 1, though, was felt to be central to significant future work in this sector, and was therefore the way forward.

The next stage for us is to make a more concrete plan for this second phase, bearing in mind the following things:

  1. Existing support for the sector in the use of CERIF via the CERIF support post at UKOLN and through close collaboration with euroCRIS.
  2. Sustainability, and the relationship of the project outputs to the Regulatory Partnership Group, and other community efforts
  3. Support of the models by relevant technical development – code libraries for supporting the models, validators, demonstrators to show value – to ensure that the only output is not just documentation which may not have sufficient impact
  4. Further development on appraising technical options to deliver value to universities, based on this modelling work and work elsewhere.
  5. Incorporate a business model for both the value of the project outputs and the value of future investment in this area

In the next few weeks, the project will then be working on a plan which will aim to provide outputs including:

  • Documentation of a set of models/profiles appropriate for both funder information and institutional data (HR, org structure, etc) with CERIF serialisations
  • Supporting code libraries/services (e.g. validators)/demonstrators as necessary
  • Implementation guides (i.e. how to apply the models to your data), for institutions, funders and vendors.
  • Potential further work on modelling and supporting gap analysis
  • An institution case study for adopting the models, based on one of the project partners
  • Business case to justify further investment in this area

You’ll hear from us on this blog as soon as a plan has taken shape. In the meantime, your comments on this way forward are welcome.
