Semantic Web and data model

Summary

  • Resource description framework (RDF) and the semantic web in the data.bnf.fr project
  • The FRBR construction and author, work and subject concepts
  • Simple schema
  • Embedded data: schema.org and Opengraph Protocol
  • Vocabularies
  • Presentation of the ontology: bnf-onto
  • Bibliothèque nationale de France Vocabularies
  • Mappings between the Intermarc format and the RDF language we use

    Resource description framework (RDF) and the semantic web in the data.bnf.fr project

    The data.bnf.fr project is part of our move towards open data. It follows the approach defined by the W3C under the names “semantic web” and “linked data”.
    The aim is to structure resources so that machines can reuse them more easily. The data.bnf.fr project uses data created in various formats, such as Intermarc for the main catalogue, XML-EAD for archive inventories and Dublin Core for the digital library.
    This data is automatically gathered, modelled and enriched, then published in RDF, the semantic web language. The result is available on the website in several RDF syntaxes: RDF/XML, N3 and N-Triples (NT).
    Part of the data is matched with external value vocabularies: id.loc.gov for languages and nationalities, dewey.info for subjects, DCMI Type for document types.
    It is also matched with data sets registered in CKAN: DBpedia and VIAF.

    The Bibliothèque nationale de France provides:

  • URIs for resources: all resources have permanent identifiers, assigned through the ARK scheme, which makes it possible to find any resource of the library.
  • a display of data in RDF as “linked open data”. It is available for every page and for the whole database.

    How to retrieve data.bnf.fr data:

  • by clicking on the RDF icon, at the bottom of the pages
  • by adding one of the following suffixes to the page URL, according to the format needed: rdf.xml (RDF/XML), rdf.nt (N-Triples) or rdf.n3 (N3). For instance:
    http://data.bnf.fr/11928016/jules_verne/rdf.xml,
    http://data.bnf.fr/11928016/jules_verne/rdf.nt,
    http://data.bnf.fr/11928016/jules_verne/rdf.n3.
  • via content negotiation on the page URL, using an RDF-aware client.
  • by downloading the complete dump, either by FTP:

    host: echanges.bnf.fr, port: 21
    login: databnf, password: databnf

    or by HTTP: full RDF dump (RDF/XML)
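
    The retrieval options above can be sketched in a few lines of Python. This is only an illustration: the suffixes come from the example URLs given earlier, while the media types are the standard ones for each syntax and are an assumption about what the data.bnf.fr server accepts. The helper names (rdf_url, negotiated_request) are hypothetical, and the request is built but never sent.

```python
from urllib.request import Request

# Suffixes taken from the example URLs in the text.
SUFFIXES = {"xml": "rdf.xml", "nt": "rdf.nt", "n3": "rdf.n3"}

# Assumption: standard media types for each syntax; not confirmed
# for the data.bnf.fr server.
MEDIA_TYPES = {
    "xml": "application/rdf+xml",
    "nt": "application/n-triples",
    "n3": "text/n3",
}


def rdf_url(page_url: str, fmt: str) -> str:
    """Append the RDF suffix for `fmt` to a data.bnf.fr page URL."""
    return page_url.rstrip("/") + "/" + SUFFIXES[fmt]


def negotiated_request(page_url: str, fmt: str) -> Request:
    """Build (without sending) a request that asks for RDF via the Accept header."""
    return Request(page_url, headers={"Accept": MEDIA_TYPES[fmt]})


page = "http://data.bnf.fr/11928016/jules_verne/"
print(rdf_url(page, "nt"))
print(negotiated_request(page, "xml").get_header("Accept"))
```

    Passing the Request object to urllib.request.urlopen would perform the actual download.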

    The licence to use our data is available here.

    The software used is CubicWeb.

    CubicWeb is an open source platform for semantic web applications, released under the LGPL licence.

    The FRBR construction and author, work and subject concepts

    The FRBR model

    Data.bnf.fr is being developed in the context of recent changes in bibliographic description: it experiments with and adapts the FRBR (Functional Requirements for Bibliographic Records) model, developed by IFLA (International Federation of Library Associations and Institutions).
    The model has three groups of entities, linked together by relationships: information about documents, persons and organisations, and subjects.

    • “Work” pages

    The first group of the FRBR model describes the different aspects of an intellectual or artistic creation and distinguishes four levels: work, expression, manifestation and item.

    The work level is about the intellectual and artistic creation. For instance: Le colonel Chabert by Honoré de Balzac. “Work” pages are created using the related authority records from the BnF Main Catalogue.

    The expression level (different versions of this work such as a translation, an adaptation or an abridgment) does not appear in the html pages but can be seen in the corresponding RDF pages.

    The manifestation level is the physical embodiment of a work. For instance an edition of Les Misérables like “Nouvelle impression illustrée. 1879-1882. Paris. E. Hugues”. The manifestations are listed in the documentary unit and gathered in the section entitled “Vie et éditions de l’œuvre” (Life and editions of the work). This level corresponds to the bibliographic record in the BnF Main catalogue, or to a manuscript that is identified by a label in the Archives and Manuscript Catalogue (BnF archives et manuscrits).

    There can be a part-whole relationship between:
  • a work and another work. For example: Le Père Goriot (Honoré de Balzac) is part of the work Scènes de la vie privée, by the same author; both are considered works and have a page in data.bnf.fr.
  • a manifestation and another one. For example: a specific edition of Le Père Goriot (Honoré de Balzac) is part of the manifestation Études de mœurs, an edition gathering several texts by Balzac.

    • “Author” pages

    A person or an organisation can be either the “author” of a work (in which case the “author” page links to the related “work” page) or a “contributor” to an expression (translator, preface writer, librettist…).
    However, since the HTML pages of data.bnf.fr do not distinguish the expression level from the manifestation level, contributors appear only at the manifestation level. The different creation and contribution roles are listed in a BnF repository, in the Intermarc format, and in the Library of Congress repository, in MARC. This data enriches the RDF of the pages.

    Link to the Intermarc code list for relators and creators (BnF).

    Link to the Marc code list for relators of the Library of Congress.

    • “Subject” pages

    The retrievable data includes subject records from the Bibliothèque nationale de France (RAMEAU, the French subject indexing language). They were converted into SKOS (Simple Knowledge Organization System), an RDF vocabulary, as part of the European project TELplus. This repository is now updated on data.bnf.fr from the whole current database of the Bibliothèque nationale de France.
    To provide dereferenceable URIs on our website, URIs from the initial project, such as http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb12650268p, have to be converted to simple, uniform URIs made of the root http://data.bnf.fr and the ARK identifier of the subject authority record.
    For instance:
    The URI http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb12650268p, for the subject “ornithologie”, is replaced by http://data.bnf.fr/ark:/12148/cb12650268p.
    Manifestations which have a RAMEAU term as a subject are brought together in the appropriate “subject” page.
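
    The URI conversion described above amounts to swapping the root while keeping the ARK identifier. A minimal sketch, in which the function name to_databnf_uri is hypothetical and the two roots are the ones quoted in the text:

```python
OLD_ROOT = "http://stitch.cs.vu.nl/vocabularies/rameau/"
NEW_ROOT = "http://data.bnf.fr/"


def to_databnf_uri(uri: str) -> str:
    """Rewrite a TELplus-era RAMEAU URI as a data.bnf.fr URI, keeping the ARK."""
    if not uri.startswith(OLD_ROOT):
        raise ValueError(f"not a TELplus RAMEAU URI: {uri}")
    ark = uri[len(OLD_ROOT):]
    if not ark.startswith("ark:/"):
        raise ValueError(f"no ARK identifier found in: {uri}")
    return NEW_ROOT + ark


print(to_databnf_uri(
    "http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb12650268p"
))
```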

    In addition, the site holds pages that gather works and manifestations about a work or an author. These pages are not indexed by search engines and can be reached from the “work” or “author” pages.
    For instance: on the page “Napoleon”, there is a link to a page presenting documents about Napoleon, such as Vie de Napoléon Buonaparte, 1827.

    Alignments and clustering by work

    In “work” and “author” pages, all manifestations by a single author are gathered around that author's works, thanks to the explicit link to the title authority record (Titre Conventionnel, or TIC, in French) inside the original bibliographic record.

    However, some manifestations are not linked to a title authority record and remain “orphans”. To improve the way our data is translated into FRBR and to provide a better service to the public, it is important to align these orphan manifestations, that is, to bring them together around the corresponding work.

    Example:
    Bibliographic record (BnF) with a link to the title authority record “Fables” and the author authority record “Jean de La Fontaine”.
    Bibliographic record (BnF) without any link to the title authority record “La cigale et la fourmi” but with a link to the author authority record “Jean de La Fontaine”.

    That is why we have already produced simple alignments in data.bnf.fr. When the bibliographic record of a manifestation is explicitly linked to an author authority record, and the title string of the manifestation is exactly the same as the work’s title, the manifestation is aligned with the work.
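
    The simple alignment rule can be illustrated as follows. This is a sketch under stated assumptions, not the production code of data.bnf.fr: the Work and Manifestation classes, their fields and the identifier "la_fontaine" are hypothetical stand-ins for the authority and bibliographic records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Work:
    author_id: str  # hypothetical identifier of the author authority record
    title: str      # title carried by the title authority record


@dataclass(frozen=True)
class Manifestation:
    author_id: Optional[str]  # None when the record has no author link
    title: str


def simple_align(manifestation: Manifestation, works: List[Work]) -> Optional[Work]:
    """Return the work with the same author and exactly the same title, if any."""
    if manifestation.author_id is None:
        return None  # the author link is required for alignment
    for work in works:
        if (work.author_id == manifestation.author_id
                and work.title == manifestation.title):
            return work
    return None


works = [Work("la_fontaine", "Fables")]
print(simple_align(Manifestation("la_fontaine", "Fables"), works))
print(simple_align(Manifestation("la_fontaine", "La cigale et la fourmi"), works))
```

    The second manifestation stays orphaned because no work carries exactly that title, which is the situation the next paragraphs address.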

    Yet, after this simple alignment, many manifestations remain orphans. In the long term two solutions are possible:

  • alignment: attaching manifestations to a work which has its own title authority record and, thus, its own page. These manifestations have no link to the title authority record: they come from bibliographic records of the Main Catalogue or from descriptions in “BnF Archives et Manuscrits”.
    We use alignment algorithms ranging from simple to advanced (word prefix, exact match, words within a distance X, Levenshtein distance, matching algorithm) to determine whether two character strings correspond to the same work. The link to the author authority record remains essential to align works.
  • clustering: if there is no title authority record, some manifestations are gathered around a new documentary unit.
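
    Some of the string comparisons named in the alignment item (exact match, prefix, Levenshtein distance) can be sketched as below. The combination of rules, the case folding and the threshold max_distance=2 are assumptions made for the example; the source does not specify how the actual algorithm combines them.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def titles_match(candidate: str, work_title: str, max_distance: int = 2) -> bool:
    """Heuristic comparison of a manifestation title with a work title."""
    a = candidate.casefold().strip()
    b = work_title.casefold().strip()
    if a == b:
        return True                           # exact match
    if b and a.startswith(b):
        return True                           # candidate begins with the work title
    return levenshtein(a, b) <= max_distance  # small spelling differences


print(titles_match("Fables choisies", "Fables"))
print(titles_match("Le Pere Goriot", "Fables"))
```

    In line with the text, such a comparison would only be applied to manifestations and works that share the same author authority record, since short titles match too easily on their own.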

    Simple schema

    The data model is presented here:

    Figure: ontology schema

    Example 1: Victor Hugo, author of Les Contemplations.

    Example 2: Charles Baudelaire, writer of the preface of an edition of "The Gold-Bug" by Edgar Allan Poe.

    The full data model is available here.

    Embedded data: schema.org and Opengraph Protocol

    “Author”, “work” and “subject” pages are open on the Web and can be reached by search engines.
    This is why, in addition to the traditional methods used for indexing the homepage, we have chosen to embed two kinds of data to structure these pages:

  • Schema.org provides a vocabulary for adding information to the HTML content, in the microdata format, to improve indexing by search engines.
    The following elements are used:

    itemtype="http://schema.org/Person"
    itemprop="description" itemprop="birthdate" itemprop="deathdate" itemprop="nationality" itemprop="memberOf"

    itemtype="http://schema.org/Book"
    itemprop="description" itemprop="inLanguage" itemprop="datePublished" itemprop="genre"

    itemtype="http://schema.org/Organization"
    itemprop="description" itemprop="image" itemprop="name" itemprop="url" itemprop="members" itemprop="founding date" itemprop="founders"

    And for subgroups of organizations:
    itemscope itemtype="http://schema.org/PerformingGroup"
    itemscope itemtype="http://schema.org/DanceGroup"
    itemscope itemtype="http://schema.org/TheaterGroup"
    itemscope itemtype="http://schema.org/MusicGroup"
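
    To show how such attributes sit in a page, here is a small sketch that renders a microdata fragment. The function microdata_block, the div/span layout and the sample values are assumptions for illustration, not the markup actually produced by data.bnf.fr. The example uses the camelCase spelling birthDate and deathDate, which is how schema.org itself spells these properties.

```python
from html import escape


def microdata_block(item_type: str, properties: dict) -> str:
    """Render an HTML fragment carrying schema.org microdata for one item."""
    lines = [f'<div itemscope itemtype="http://schema.org/{escape(item_type)}">']
    for prop, value in properties.items():
        lines.append(f'  <span itemprop="{escape(prop)}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


# Sample values for illustration only.
print(microdata_block("Person", {
    "description": "French novelist",
    "birthDate": "1828-02-08",
    "deathDate": "1905-03-24",
}))
```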

  • Opengraph Protocol (OG), so that the pages can be represented in social networks.
    It is a very simple vocabulary, encoded in RDFa, whose metadata is retrieved when users add the resource to their Facebook profile. The following metadata is embedded in the HTML header, using META tags:

    og:title (title of the page)
    og:description (description of the page content)
    og:type (type of resource)
    og:url (page URL)
    og:image (URL of the image that illustrates the page)
    og:author (name of the author in the “work” page)
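
    As an illustration of the META tags listed above, the sketch below builds them from a dictionary. The helper og_meta_tags and the sample values (including the og:type value "profile") are assumptions, not the values actually served by data.bnf.fr.

```python
from html import escape


def og_meta_tags(metadata: dict) -> str:
    """Build one <meta property="og:..."> tag per entry, for the HTML header."""
    return "\n".join(
        f'<meta property="og:{escape(key)}" content="{escape(value)}" />'
        for key, value in metadata.items()
    )


# Sample values for illustration only.
print(og_meta_tags({
    "title": "Jules Verne",
    "type": "profile",
    "url": "http://data.bnf.fr/11928016/jules_verne/",
}))
```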

    Vocabularies

    We preferred to reuse existing vocabularies in order to foster interoperability:
    rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
    rdfs: http://www.w3.org/2000/01/rdf-schema#
    skos: http://www.w3.org/2004/02/skos/core#
    dc: http://purl.org/dc/dcmitype/
    foaf: http://xmlns.com/foaf/0.1/
    RDAgroup2elements: http://RDVocab.info/ElementsGr2/
    rdvocab: http://RDVocab.info/Elements

    Nevertheless some properties and classes have to be expressed by an ontology specific to the BnF: bnf-onto. To publish the ontology, the BnF has chosen the harmonized namespace http://data.bnf.fr/ontology/.

    Presentation of the BnF ontology: bnf-onto

    The ontology “bnf-onto” can be seen at this address: http://data.bnf.fr/ontology/bnf-onto-en/
    List of properties:

  • isbn = International Standard Book Number. http://data.bnf.fr/ontology/bnf-onto/isbn
  • imgAlt = alternative text for images. http://data.bnf.fr/ontology/bnf-onto/imgAlt
  • ean = EAN, European Article Numbering (bar code). http://data.bnf.fr/ontology/bnf-onto/ean
  • cote = shelfmark of an archival document: unique number identifying the item kept in the collections. http://data.bnf.fr/ontology/bnf-onto/cote
  • depiction = the preferred thumbnail for a page, chosen manually. http://data.bnf.fr/ontology/bnf-onto/depiction
  • issn = International Standard Serial Number. http://data.bnf.fr/ontology/bnf-onto/issn
  • ismn = International Standard Music Number (printed music). http://data.bnf.fr/ontology/bnf-onto/ismn
  • expositionVirtuelle = URL of a BnF virtual exhibition. http://data.bnf.fr/ontology/bnf-onto/expositionVirtuelle
  • anl = an ANL is a sub-record that gives access to a specific part of a document (illustration of a book, text embedded in a book…). http://data.bnf.fr/ontology/bnf-onto/anl
  • ouvrageJeunesse = an edition of a work adapted for younger readers. Used to sort editions, whose content often differs even though the title is the same. http://data.bnf.fr/ontology/bnf-onto/ouvrageJeunesse
  • code_role = coded role describing the contribution of a person or organisation to a work.
    Numeric values are used to describe relators.
    Complemented by the markup from the Library of Congress code list for contributors and creators. http://data.bnf.fr/ontology/bnf-onto/code_role
  • role = the French label of contribution roles. http://data.bnf.fr/ontology/bnf-onto/role
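
    The prefixes from the Vocabularies section and the property URIs above combine in the usual way: a prefixed name such as skos:prefLabel stands for the namespace URI followed by the local name. A minimal sketch, assuming the W3C namespaces are written with their trailing "#" and that the bnf-onto namespace is the common root of the property URIs listed above; the function expand is hypothetical:

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    # Assumption: common root of the property URIs in the list above.
    "bnf-onto": "http://data.bnf.fr/ontology/bnf-onto/",
}


def expand(prefixed_name: str) -> str:
    """Turn a prefixed name such as 'skos:prefLabel' into a full URI."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in: {prefixed_name}")
    return PREFIXES[prefix] + local_name


print(expand("skos:prefLabel"))
print(expand("bnf-onto:isbn"))
```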

    Bibliothèque nationale de France vocabularies

    BnF-specific vocabularies are displayed at this address: http://data.bnf.fr/vocabulary-en
    List of vocabularies:

  • Country codes list: http://data.bnf.fr/vocabulary/countrycodes
  • Relator codes list: http://data.bnf.fr/vocabulary/roles
  • Types of RAMEAU subject headings: http://data.bnf.fr/vocabulary/scheme

    Mappings between the Intermarc format and the RDF language we use

    Person | RDF | Intermarc field for authors
    name | skos:prefLabel @in_lang | 100 400
    other name | skos:altLabel foaf:familyName foaf:givenName dc:date |
    nationality | foaf:nationality | 008 position 12-13
    language | RDAgroup2elements:languageOfThePerson | 008 position 14-16
    gender | foaf:gender | 008 position 17
    date of birth | RDAgroup2elements:dateOfBirth | 008 position 27-36
    date of death | RDAgroup2elements:dateOfDeath | 008 position 37-46
    place of birth | RDAgroup2elements:placeOfBirth | 603 $a
    place of death | RDAgroup2elements:placeOfDeath | 603 $b
    beginning of activity | RDAgroup2elements:periodOfActivityOfThePerson | 008 position 47-51
    end of activity | RDAgroup2elements:periodOfActivityOfThePerson | 008 position 52-55
    sources (note about the record's sources) | skos:editorialNote | 610
    summary, note | RDAgroup2elements:biographicalInformation | 600
    domains | RDAgroup2elements:fieldOfActivityOfThePerson | 624
    link to the DBpedia resource | owl:sameAs |
    code for relators | marcrel: [from the Library of Congress, http://id.loc.gov] |
    image of the author from Gallica | foaf:depiction |
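
    The fixed-position rows of the Person mapping can be read as slices of the 008 field. The sketch below is illustrative only: it assumes that positions are 0-based and inclusive (the source does not state the convention), the function map_person_008 is hypothetical, and the sample field is synthetic rather than a real BnF record.

```python
# Positions copied from the Person table. Assumption: 0-based, inclusive.
PERSON_008_FIELDS = {
    "foaf:nationality": (12, 13),
    "RDAgroup2elements:languageOfThePerson": (14, 16),
    "foaf:gender": (17, 17),
    "RDAgroup2elements:dateOfBirth": (27, 36),
    "RDAgroup2elements:dateOfDeath": (37, 46),
}


def map_person_008(subject: str, field_008: str) -> list:
    """Return (subject, property, value) triples for the non-blank 008 slices."""
    triples = []
    for prop, (start, end) in PERSON_008_FIELDS.items():
        value = field_008[start:end + 1].strip()
        if value:
            triples.append((subject, prop, value))
    return triples


# Synthetic 008 field: 56 blanks with a few made-up values filled in.
chars = [" "] * 56
chars[12:14] = "fr"
chars[14:17] = "fre"
chars[17:18] = "m"
chars[27:35] = "18280208"
sample_008 = "".join(chars)

for triple in map_person_008("http://example.org/person/1", sample_008):
    print(triple)
```

    Blank slices, such as the date of death here, produce no triple.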

    Organisation | RDF | Intermarc field for organisations
    name | skos:prefLabel @in_lang | 100 400
    nationality | foaf:nationality | 008 position 12-13
    language | RDAgroup2elements:languageOfThePerson | 008 position 14-16
    beginning | RDAgroup2Elements:dateAssociatedWithTheCorporateBody | 008 pos 27-36
    end | stop_date_info RDAgroup2Elements:dateAssociatedWithTheCorporateBody | 008 pos 37-46
    beginning of activity | dc:date | 008 pos 47-51
    end of activity | RDAgroup2elements:periodOfActivityOfTheCorporateBody | 008 pos 52-55
    website | foaf:homepage | 606
    sources | skos:editorialNote | 610
    summary/note | RDAgroup2elements:corporateHistory | 600
    domain | RDAgroup2elements:fieldOfActivityOfTheCorporateBody | 624
    link to the DBpedia resource | owl:sameAs |

    RAMEAU subject headings | RDF | Intermarc field
    original title | skos:prefLabel | 16X 46X
    other title | skos:altLabel | 16X 46X
    source (thesaurus RAMEAU) | skos:inScheme |
    source (note about the record's sources) | skos:editorialNote | 610-612
    other note | skos:scopeNote | 600
    broader concepts | skos:broader | 3XX, 5XX
    narrower concepts | skos:narrower | 3XX, 5XX
    related concepts | skos:related | 3XX, 5XX
    alignment with external datasets | skos:closeMatch | 620
    alignment with external datasets | skos:exactMatch |

    Work | RDF | Intermarc field for titles
    title (main title) | dc:title skos:prefLabel, rdfs:label @in_lang | 145 415
    other title | skos:altLabel @in_lang |
    language | dc:language | 008 position 14-16
    dates | dc:date | 008 position 27-26
    source | skos:editorialNote | 610
    summary/note | dc:description | 600
    domain | dc:subject | 624
    link to the authority record in the BnF catalogue | owl:sameAs |
    part of | dc:isPartOf |
    Relations
    main author | dc:creator | 100 101 110 110
    relators | dc:contributor bnf_onto:coderole | 711/702/700/701/710/712
    relator's code | dc:contributor bnf_onto:coderole | free code 321 322
    image for the digitised work in Gallica | foaf:depiction |

    Manifestation | RDF | Intermarc field (bibliographic record)
    manifestation of a work | rdarelationships:workManifested |
    title | dc:title | 245
    has part | dc:hasPart |
    publishing date | dc:date | 260
    publishing place | rdvocab:placeOfPublication | 250
    publisher's name | rdvocab:publishersName | 260
    physical description | dc:description |
    ISBN | bnf-onto:ISBN | 20
    type of document | dc:type |
    language | dc:language | 41
    adaptation for young readers | bnf-onto:ouvrageJeunesse |

    Expression | RDF | Intermarc field
    relators | marcrel: [relator's role from the Library of Congress, http://id.loc.gov] |
    relator's code | bnf-onto:coderole | sub-field $4
    contribution | bnf-onto:role |
    type of document | dc:type |