The Dutch Ships and Sailors Project

Abstract

In this document, we present the Dutch Ships and Sailors project, which brings together four Dutch maritime historical datasets. We use Semantic Web technologies to represent and interlink the resulting data into one interoperable yet heterogeneous data cloud, and the dataset is available as five-star linked data. The individual datasets use separate data models, designed in close collaboration with maritime historical researchers. We present the project goals, technologies and current status, and discuss ongoing work in a second phase of the project. We show ways of accessing the data and present a number of examples of how the dataset can be used for historical research. The Dutch Ships and Sailors Linked Data Cloud is a prime example of the benefits of Linked Data for the field of historical research.

1. Introduction

In historical research, data-driven methodologies are gaining ground. Integrating datasets that are curated by researchers with different research goals can prove very valuable here, allowing for new types of research questions and analyses. In the digital history domain, it has been recognized that this integration of data has great potential [Cohen et al. 2008]. However, datasets created in individual projects are often not made available in a shareable and reusable format, if they are made available at all.

In the Dutch Ships and Sailors project (DSS), computer scientists and historians are collaborating to bring together multiple Dutch maritime historical data sources. The maritime industry in the Netherlands has been central to regional and global economic, social and cultural exchange. It is also one of the best historically documented sectors of human activity. In the past few decades, much of the data in the preserved historical source material has been digitized. Among the most interesting data are those on shipping movements and crew members (cf. [van Rossum 2011]). However, much of the digitized historical source material remains scattered across many databases and archives, even though it refers to common 'places', 'ships', 'persons' and 'events'. The DSS data cloud brings together the rich maritime historical data preserved in four of these databases.

By linking the different available databases, the datasets can complement and amplify each other, and new research possibilities open up. Toward this end, we employ open (Semantic) Web standards, specifically the Resource Description Framework (RDF) for representing the data and Linked Data principles [Berners-Lee 2006] for linking and providing access to the data. Uniform Resource Identifiers (URIs) are assigned to resources, acting as persistent (and Web-dereferenceable) identifiers for 'things' (ships, places, persons, etc.). The use of Semantic Web standards and principles allows for the light-weight integration of data sources. Specifically, these principles do not impose a single database schema, but rather enable ad-hoc integration through the mapping of fields and values. This has made earlier successful data integration efforts such as MultimediaN E-Culture [Schreiber et al. 2008] possible. This project shows how Linked Data principles and technologies serve to integrate different datasets in a flexible way. In the case of these relatively "small" datasets, close collaboration between the data experts and the converting party ensures that the richness of the original data is not lost, while interoperability is gained up to a level where the data can be used for further historical research. It is an example of how Linked Data can benefit humanities research, and digital history more specifically. The data cloud can serve as a hub dataset for international maritime historical datasets as well as for other (Dutch) historical datasets.

2. Importance of the DSS project for the historical field

The project is important for the historical field both from the perspective of the historical discipline itself, especially maritime research, and from the perspective of methodologies in which computer science and historical heuristic needs meet. We elaborate on both aspects below.

The project itself was born from a historians' workshop investigating possibilities to connect the considerable number of existing datasets about ships, ship movements and shipping crews. While much of the available historical material was digitized by historians in separate projects over the last decades, many of these data are not readily available, and combining them was previously either impossible or extremely time consuming. The sets combined in the DSS data cloud are just a few of the available sets, chosen because they complement each other in the field of crew recruitment in the 18th and 19th centuries and because they make it possible to investigate changes in patterns of Dutch shipping and crew recruitment. In the project, they are combined with shipping data that were text mined from 19th century newspapers available in the Dutch historical newspaper archives, which contain information that was previously not accessible to researchers in a systematic way. In a way, this was experimental, as it was not known whether the newspaper information could be mined and whether its reliability would be satisfactory for historical research purposes. The results in this respect were surprisingly good.

The resulting Dutch Ships and Sailors data cloud is already a useful data source that may be used for research that transcends the possibilities of the original, unconnected datasets, but its significance also lies in the open nature of the linked data cloud. By design, new sets may be incorporated into the cloud using the same linked data methodology. Thus, they contribute to the data cloud and to the research possibilities for maritime research, but also for wider social and economic historical investigations, of which shipping was always an important part in the Netherlands.

The combination of datasets into a single resource is not just convenient and time saving. It also enables new types of research, such as quickly testing a historical researcher's conjectures by querying the data cloud. By exploring the combined sets, for example using visualisations of trends or statistical analysis, researchers may discover previously unsuspected patterns and developments that are only visible in the cumulative data.

For an extensible dataset, it is essential that the data are combined into the data cloud in the proper way. All modelling of real-life data requires interpretation, but this is even more the case for historical data, which are often derived from sources that were compiled for purposes quite different from those of the historian who wants to use them. When datasets are combined, the data usually need to be abstracted once more to make them comparable to the other datasets they are combined with. Creating a combined dataset is not a new effort, but in the past rigorous data modelling and hidden assumptions in modelling often resulted in altered data and inextricable conglomerates of data from different and unequal sources. For a historian, it is important to always be able to see from what context the data were taken and to which manipulations by man or machine (or both) they were subjected when they were combined into the data cloud. The methodology used in the Dutch Ships and Sailors project makes this possible. For all data the provenance was kept, all modelling was made explicit and all adaptations of the data were stored as separate annotations in the resulting linked data cloud. In this way, historians may investigate the original context of the data and include or exclude any source or processing step when querying the data. The data cloud has thus become a transparent resource that caters very well to the historian's needs.

3. Importance of the DSS project for the field of Semantic Web research

This work also presents advances in the field of Knowledge Representation, a subdiscipline of Computer Science, more specifically in the context of the Semantic Web and Linked Data [Bizer et al. 2009]. Many cultural heritage and history datasets are being converted into RDF and published as Linked Data. Often, this is done either in the context of larger projects, such as Europeana, or in individual efforts. In the first case, data is mostly converted to a single data model (e.g. the Europeana Data Model [Isaac 2010]), where nuances of the individual datasets can get lost. In the second case, a specific data model for the data is used, and integrating it with other datasets is a difficult task.

This project presents an option which lies between converting to one data model and having multiple unrelated data models, by presenting individual conversion strategies for the four datasets with mappings of classes and properties to a domain-specific integration layer. The conversion strategies for the individual datasets have been developed in close collaboration with the historical researchers to ensure that no intended meaning is lost and no unnecessary normalisation steps are taken. We show how, in this way, the individual nuances and intentions of the datasets are retained, while still making the datasets reusable in a larger context. We furthermore explicitly model the provenance of the data and its different conversion steps, as detailed in Section 6.4.

The project results in a methodology for converting and integrating different (cultural) historical datasets, as well as tools for the different steps in the methodology, thereby making it easier to successfully convert and integrate new datasets in the future. These tools are open and described in a number of project deliverables. For a more detailed account of the specific advances this project brings to linked data, we refer the reader to [de Boer et al. 2014].

4. Related Work

This work builds on previous research into the benefits of Linked Data for Digital History, also done in close collaboration with historians [de Boer et al. 2013], and some tools and methods are re-used for this project. Our work has a similar relation to other efforts that attempt to link historical data to the Web of Data [Hyvönen et al. 2012] [van Erp et al. 2011]. In fact, there are multiple examples of datasets that are the result of collaborations between computer scientists and historians [Meroño-Peñuela et al. 2012b]. However, in most cases, this concerns a single dataset, published using a single metadata model. In our approach, we work with historians from different backgrounds, who are responsible for their own data and data model. This results in a data cloud of multiple datasets rather than one monolithic dataset. In the related cultural heritage domain, publishing metadata as linked data is gaining ground. Examples include Europeana [Isaac 2012], which uses the Linked Data architecture to provide access to Europe's cultural heritage metadata from multiple collection metadata providers.

5. The Datasets

In a pilot study, over 25 Dutch maritime historical datasets were identified. One of the subgoals of the DSS project was to present a list of these datasets with information about their availability. Of these 25, four were selected to be further enriched and linked into an integrated linked data cloud. The first two were modeled and converted in close collaboration with the historical researchers responsible for the source datasets, and we describe them in more detail. The third and fourth datasets are conversions of previously published historical datasets and are described less elaborately. They were, however, converted with the help of the historians. Figure 1 gives an overview of the entire DSS data cloud and its internal and external links.


Figure 1. The Dutch Ships and Sailors Linked Data cloud. The individual datasets are represented by ovals in the bottom half of the image. Internal links are represented by arrows. External links are represented by dotted arrows.


5.1 GZMVOC

The "Generale Zeemonsterrollen VOC" (GZMVOC) (en: "General sea muster rolls VOC'') is a dataset describing the crews of all ships of the Dutch East India Company (VOC) from 1691--1791. The data was gathered by a Dutch social historian Matthias van Rossum (co-author of this paper) in the course of his research on labor situations for European and Asiatic crews on Dutch VOC ships. The data is based on archival records from the VOC itself and lists data of all ships that sailed between Europe and Asia. The data consists of the size of the captain and crew as well as its composition (number of European and Asiatic sailors, soldiers and passengers), geographical location of the counting as well as data on the name and type of ship. Details on the Asiatic crew members are listed, including wages, job descriptions, place of origin, categorization and hierarchical structure. References to the Dutch Asiatic Shipping (DAS) records were also present (see Section 5.4).

5.2 MDB

The "Noordelijke Monsterollen Databases" (MDB) (en: "Northern muster rolls databases'') is a dataset describing mustering information found in mustering archives in the three northern Dutch provinces (Groningen, Friesland, Drenthe) in the period 1803--1937. The original Noordelijke Monsterollen Databases (MDB) was provided as a SQL dump file by the original maker of the data, historian Jurjen Leinenga (also co-author of this paper). The database consists of two tables, one with records of ship muster rolls and one with records of person-contracts related to those muster rolls. Figure 2 shows an original low-resolution scan of a muster roll.

5.3 VOCOPV

The original dataset "VOC Opvarenden" [van Velzen 2010] is the result of a manual digitization of the personnel data of the VOC in the 18th century. The original data consists of three separate parts ('voyagers', 'salary books' and 'beneficiaries') and was downloaded from the DANS EASY website.

5.4 DAS

The Dutch Asiatic Shipping (DAS) dataset contains data on the outward and homeward voyages of more than 4,700 ships that sailed under the flag of the Dutch East India Company (VOC) and its predecessors (before 1602) between 1595 and 1795; more than 3,400 of these ships made the return voyage home. The dataset is a conversion of a previously digitized DAS dataset hosted at Huygens ING [Bruijn et al. 1979] at http://resources.huygens.knaw.nl/das/index_html_en. The reference work Dutch-Asiatic Shipping provides a systematic survey of these voyages, on which Dutch trade between Europe and Asia was founded.


Figure 2. Example scan of an MDB muster roll, listing legal information as well as handwritten information on the names, ranks and wages of the crew members.


6. General approach

As described in [de Boer et al. 2012], the use of Linked Data technology allows for the use of heterogeneous data models, while interoperability is still achieved through the linking of instances, classes and properties. Since the individual datasets were created with specific goals in mind, they have different data models. Rather than normalizing everything into a single monolithic database schema, Semantic Web technologies allow these different models to be represented in the same framework (RDF). This light-weight integration also allows for later updating of datasets and easy integration of new datasets. For each of the four datasets, RDF representations were created following the methodology described in [de Boer et al. 2012]. For more details on this conversion and its results, we refer the reader to [de Boer et al. 2014].

The four datasets were initially exported to an XML representation. For converting this XML to RDF, we use the ClioPatria semantic server. ClioPatria includes an RDF triple store that, through a web interface, provides feedback on the (intermediate) RDF produced. It has extensions for converting RDF using declarative rewrite rules. These two features make the conversion and modeling interactive, since at each step of the conversion process the results can be inspected by the experts. The RDF data models were designed in close cooperation with the historians, based on the original data models. Each dataset has its own data model and conversion rules.

6.1 Linked Data for Multilayered enrichment

For many literal values in the data ("strings"), we create RDF resources with URIs during the conversion, allowing them to be linked internally or externally. Specifically, we do this for persons, places, ships, ship types and ranks. However, we always retain the original 'flat' structure to ensure that the original, un-normalized data can be reproduced; it exists alongside the normalized or enriched data. This corresponds to an important requirement put forward by the historical researchers.
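
To illustrate the idea, the sketch below shows, in SPARQL INSERT DATA notation (which embeds Turtle), how an original string value could be kept next to its enriched counterpart. The record identifier and the property names (mdb:shipName, mdb:ship) are illustrative assumptions and do not necessarily match the actual DSS schema.

    PREFIX mdb:  <http://purl.org/collections/nl/dss/mdb/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    INSERT DATA {
      # the original 'flat' string value, kept verbatim from the source database
      mdb:record-12345  mdb:shipName  "Alberdina" .
      # the enriched counterpart: the same value is also promoted to a resource
      # with its own URI, which can later be linked internally or externally
      mdb:record-12345  mdb:ship      mdb:ship-12345 .
      mdb:ship-12345    rdfs:label    "Alberdina" .
    }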

By default, we assume that data items are unique and map them to separate URIs, even when they have a number of metadata fields in common. For example, two records (say from 1850 and 1851) might both refer to a person "Piet Janssen" who sailed on the ship "Alberdina". By default, these are not mapped to the same URI. This was an explicit modeling decision taken in collaboration with the historians, since many Dutch names are common and fathers and sons with the same first and last name often sailed on the same ships. Therefore, in the basic data, we assume that all persons and ships are unique and assign them separate URIs. At a later stage, automatic or manual methods are used to establish identity links (see also below).
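
A minimal sketch of this modeling decision, again with illustrative URIs, property names and graph name: each record gets its own person resource, and a possible identity link is only added later, in a separate named graph, rather than by merging the resources.

    PREFIX mdb:  <http://purl.org/collections/nl/dss/mdb/>
    PREFIX dss:  <http://purl.org/collections/nl/dss/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>

    INSERT DATA {
      # the 1850 and 1851 records each get a distinct person URI,
      # even though both mention a "Piet Janssen" on the ship "Alberdina"
      mdb:person-1850-017  foaf:name  "Piet Janssen" .
      mdb:person-1851-042  foaf:name  "Piet Janssen" .

      # if reconciliation later concludes these are the same individual, the
      # identity link is asserted in its own named graph, so that it can be
      # included or excluded at query time
      GRAPH dss:mdb_person_links {
        mdb:person-1850-017  owl:sameAs  mdb:person-1851-042 .
      }
    }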

6.2 Interoperability

After the datasets have been converted to RDF, interoperability is achieved through the linking of instances, properties and classes. We use subproperty and subclass relations to map our classes and properties to common ones. This way we can retain the specificity of each dataset and the intended semantics of its model, while still allowing for reasoning and querying at the interoperability level. For example, the notion of a ship name differs slightly among the datasets even though they use the same field name. In some cases, a normalization step has already taken place in the original archival data and in other cases it has not. These (sometimes subtle) differences are regarded as crucial by the historians and need to be maintained in the converted datasets to ensure trust and usage.

At this interoperability level we have defined commonly appearing classes and properties such as 'Ship', 'Person', etc. The DSS schema itself is mapped to widely used schemas: SKOS to describe concept schemes (ranks, ship types, ...), FOAF to describe person information, and Dublin Core terms to describe record information (description, identifier, ...). We also provide mappings to the ISOcat registry.
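
A sketch of what such mappings can look like, using illustrative class and property names for the MDB dataset and the DSS interoperability layer (the actual DSS schema terms may differ):

    PREFIX mdb:  <http://purl.org/collections/nl/dss/mdb/>
    PREFIX dss:  <http://purl.org/collections/nl/dss/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    INSERT DATA {
      # dataset-specific terms are mapped upward to the DSS interoperability layer...
      mdb:Ship      rdfs:subClassOf     dss:Ship .
      mdb:shipName  rdfs:subPropertyOf  dss:shipName .
      # ...and the DSS layer is in turn mapped to widely used vocabularies
      dss:Person    rdfs:subClassOf     foaf:Person .
      dss:Rank      rdfs:subClassOf     skos:Concept .
    }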

6.3 Establishing links

We establish links to (external) resources in four different ways:

  • Where the original data contains explicit and unambiguous references to other records, in either the same or another dataset, we generate an RDF mapping triple in the conversion step (see the sketch after this list).
  • When linking requires more complex techniques, we employ the ClioPatria package Amalgame. Amalgame is an iterative alignment platform that allows a user to mix and match multiple label- and structure-matching algorithms as well as filtering operations into an alignment workflow. We use this tool to establish links to external data sources such as DBpedia, the Getty Art and Architecture Thesaurus (AAT) and GeoNames.
  • We also link to non-Linked Data resources, more specifically to digital historical newspaper articles from the Dutch Royal Library (KB). For this, a separate linking algorithm was developed [Balado 2014]. The linking algorithm uses a number of features such as ship names, captain names, time constraints and automatically derived indicator phrases for maritime events (such as "left port", "sailing for" etc.) to establish likely links between MDB records and KB articles.
  • To establish identity links between resources (for example ships) within one dataset, as described in Section 6.1, we developed an entity reconciliation algorithm. This is described in more detail in [Ponstein 2014].
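
As a rough illustration of the first two cases, the sketch below shows a mapping triple generated from an explicit DAS reference in a GZMVOC record, and an Amalgame-style alignment link to GeoNames stored in its own named graph. The record URIs, the dss:dasVoyage property and the name of the link graph are hypothetical; the GeoNames URI is only an example.

    PREFIX gzmvoc: <http://purl.org/collections/nl/dss/gzmvoc/>
    PREFIX das:    <http://purl.org/collections/nl/dss/das/>
    PREFIX dss:    <http://purl.org/collections/nl/dss/>
    PREFIX skos:   <http://www.w3.org/2004/02/skos/core#>

    INSERT DATA {
      # case 1: an explicit, unambiguous DAS reference in the source record
      # becomes a direct mapping triple during conversion
      gzmvoc:record-0815  dss:dasVoyage  das:voyage-2345 .

      # case 2: a link produced with Amalgame, stored in a separate named graph,
      # aligning a local place resource with a GeoNames entity
      GRAPH dss:places_geonames_links {
        dss:place-amsterdam  skos:exactMatch  <http://sws.geonames.org/2759794/> .
      }
    }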

6.4 Provenance and content confidence

Provenance plays an important role in historical research and specifically in archival research. The origin and history of archival data are crucial for estimating the scientific value of data [Ockeloen et al. 2013]. This holds even more for digital data, where in many cases the provenance is unknown or lost. The W3C recommendation PROV-O [Groth 2013] allows this provenance to be modeled as Linked Data. In the DSS cloud we model provenance at the named graph level. Each named graph is a separate set of triples that come from one source; this can be either an original data source or the result of an enrichment or linking process. Provenance triples describe for each named graph a) the process from which it originates, b) the (software) actors involved in that process, and c) the datasets used as input.
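
A minimal sketch of such provenance triples for one named graph, using PROV-O terms; the graph, activity and agent URIs are illustrative:

    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX dss:  <http://purl.org/collections/nl/dss/>

    INSERT DATA {
      # a) the process from which the named graph originates
      dss:mdb_rdf  prov:wasGeneratedBy  dss:mdb_conversion .
      # b) the (software) actor involved in that process
      dss:mdb_conversion    prov:wasAssociatedWith  dss:xmlrdf_converter .
      dss:xmlrdf_converter  a  prov:SoftwareAgent .
      # c) the dataset used as input
      dss:mdb_conversion  prov:used  dss:mdb_source_xml .
    }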

In addition to provenance information, for automatically derived data we list the content confidence [De Nies et al. 2013]. This information allows for SPARQL queries that include or exclude triples from specific named graphs, for example because they are the result of an operation by a software agent or because their content confidence value is too low. For a total of four link sets, the domain expert performed a structured manual evaluation of random samples. For these named graphs, we assign confidence levels based on the evaluation results.
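
The following query sketch shows how this provenance can be used to restrict results to named graphs that were not generated by a software agent; a similar filter could be applied to content confidence values. The dss:shipName property and the overall graph layout are assumptions rather than the exact DSS schema.

    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX dss:  <http://purl.org/collections/nl/dss/>

    # retrieve ship names, but only from named graphs that were
    # not generated by a software agent
    SELECT ?ship ?name
    WHERE {
      GRAPH ?g { ?ship dss:shipName ?name . }
      FILTER NOT EXISTS {
        ?g        prov:wasGeneratedBy    ?activity .
        ?activity prov:wasAssociatedWith ?agent .
        ?agent    a                      prov:SoftwareAgent .
      }
    }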

7. Current status

The first stage of the Dutch Ships and Sailors project was funded by CLARIN-NL and ended in March 2014. The result of this stage is the converted data, made available online as five-star linked data as well as sustainably preserved in a digital archive. Here we describe the resulting dataset, data access and preservation, and examples of usage. In Section 8, we describe the next stage and current work of the project.

7.1 Data converted

Details on the converted data can be found on the project website as well as in [de Boer et al. 2014]. The conversion scripts, as well as the input and output files, are available online. In total, the DSS data cloud consists of more than 25 million RDF triples, divided over 33 named graphs.

Around 1.5 million internal and external links connect the cloud internally as well as to external sources. For example, 180,000 links to external newspaper articles have been established and 2,500 geographical entities have been matched to GeoNames entities.

7.2 Data access and preservation

Web interface

The data is accessible through two live ClioPatria triple store instances. A stable version hosted by the Huygens ING institute for historical research is available at http://dutchshipsandsailors.nl/data and a development version is online at http://semanticweb.cs.vu.nl/dss. A web interface allows for browsing the data. The graphs can be browsed or downloaded, and basic statistics are provided. Detail pages of resources are also provided; see, for example, http://www.dutchshipsandsailors.nl/data/browse/list_resource?r=http://purl.org/collections/nl/dss/vocopv/opvarenden-344716. Search functionality with autocompletion is also available.

The provenance can be visualized using the PROV-O-Viz tool, which is integrated with the triple store at http://dutchshipsandsailors.nl/data/provoviz. Figure 3 shows two screenshots of the web interface.


Figure 3. Screenshots of the ClioPatria Web interface. The left image shows a local view of a single resource, while the right shows an example of the provenance dataflow visualization.


Linked Data and SPARQL endpoint

To comply with Linked Data principles [Berners-Lee 2006], when a URI is requested through HTTP, relevant data is returned. Based on the request parameters, the triple store responds with an RDF/XML, N-Triples, Turtle or JSON-LD serialization (or with the human-readable detail page). A SPARQL endpoint is provided at http://dutchshipsandsailors.nl/data/sparql/, with a number of interactive interfaces provided, such as the YASGUI interface at http://dutchshipsandsailors.nl/data/dss/yasgui/.

Persistent Data Archiving

The data is also archived in the EASY online archiving system of Data Archiving and Networked Services (DANS). There, the four datasets as well as the interoperability layer are available as RDF/XML files with persistent identifiers (these PIDs will be published in the coming months). This ensures their sustainability beyond the life expectancy of the live versions. The data cloud is also registered at the Datahub registry at http://datahub.io/dataset/dutch-ships-and-sailors, through which it will be discoverable by others.

7.3 Examples of Historical Research Questions

The datasets were curated with the intent of providing digital access to them for historical researchers. Using the web interface, the SPARQL endpoint or the raw datasets, researchers can retrieve qualitative and quantitative information from each of the datasets separately or from the combined datasets. Furthermore, researchers can inspect provenance information about the retrieved results.

As part of current work, we are investigating the use of the data cloud for historical research. In collaboration with the historical researchers, a number of example use cases and queries have been developed. A number of editable example SPARQL queries are presented at http://www.dutchshipsandsailors.nl/data/dss_queries. Here we describe a number of examples.

Cross-dataset search

Because many dataset-specific properties are mapped to higher-level properties, we can search for resources across the different datasets. It is straightforward to define a query that retrieves all ships with the ship name "Johanna", or all ships carrying a person with the rank of captain whose last name is "Veldman"; a sketch of such a query is given below. This allows for search and comparison between the datasets and, for example, for researching correlations between variables (such as ranks and wages) using data from more than one dataset.
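
A sketch of such a cross-dataset query; the dss: property names and the rank label are illustrative assumptions, while the published example queries at the URL above use the actual schema.

    PREFIX dss:  <http://purl.org/collections/nl/dss/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # ships named "Johanna" that carried a captain with last name "Veldman",
    # regardless of which of the four datasets the records come from
    SELECT ?ship ?person ?rankLabel
    WHERE {
      ?ship    a            dss:Ship ;
               dss:shipName "Johanna" .
      ?person  dss:sailsOn  ?ship ;
               dss:lastName "Veldman" ;
               dss:hasRank  ?rank .
      ?rank    rdfs:label   ?rankLabel .
      # "kapitein" (Dutch for captain) is an illustrative rank label
      FILTER(CONTAINS(LCASE(?rankLabel), "kapitein"))
    }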

Exploiting the links to GeoNames

Analysis of the origins of the persons who sailed on the VOC ships can give insight into the socio-economic realities of the 18th century. Through the links with GeoNames, we can for example obtain geo-coordinates to plot information on a map. Figure 4a) shows such a plot. We can also use the GeoNames geographical hierarchy to, for example, analyze the provinces of origin of the voyagers, giving insight at an aggregated level. Figure 4b) visualizes the birth provinces of sailors for one year (1750) and Figure 4c) shows a stream plot of the birth provinces of sailors over multiple years. These visualizations are made possible through the links with an external dataset; they can easily be produced for one or multiple DSS datasets and give insight into the geographical origins of sailors. They can be used to detect anomalies, formulate hypotheses and make the work of the quantitative historian more effective and efficient.
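
A sketch of the kind of query behind such a plot, assuming the GeoNames coordinates (or a relevant extract) are loaded alongside the links; dss:birthPlace and the use of skos:exactMatch for the GeoNames links are assumptions rather than the exact DSS modeling.

    PREFIX dss:  <http://purl.org/collections/nl/dss/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX wgs:  <http://www.w3.org/2003/01/geo/wgs84_pos#>

    # birth places of voyagers together with GeoNames coordinates,
    # ready to be exported and plotted on a map (e.g. with R, as in Figure 4)
    SELECT ?person ?place ?lat ?long
    WHERE {
      ?person dss:birthPlace  ?place .
      ?place  skos:exactMatch ?geonamesPlace .
      ?geonamesPlace wgs:lat  ?lat ;
                     wgs:long ?long .
    }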

Using the ship type hierarchy

Through the links with AAT and DBpedia, we can use formalized common-sense and expert knowledge to automatically analyze the data. For example, the ship type hierarchy from AAT can be used to analyze features of specific ship types. One of the example queries lists persons who embarked on coastal ships (a type which has a number of subtypes, such as "kof" or "tjalk"); a sketch is given below. Without the explicit links, a very complex conjunctive query would have to be formulated.
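
A sketch of such a query, walking up the SKOS hierarchy instead of enumerating every subtype by hand; the dss: properties and the exact AAT concept label are illustrative and would need to be checked against the actual vocabularies.

    PREFIX dss:  <http://purl.org/collections/nl/dss/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # persons who embarked on any (sub)type of coastal ship, e.g. "kof" or "tjalk"
    SELECT ?person ?ship ?type
    WHERE {
      ?person  dss:sailsOn    ?ship .
      ?ship    dss:shipType   ?type .
      ?type    skos:broader*  ?coastal .
      ?coastal skos:prefLabel ?label .
      # the label of the AAT concept for coastal ships (exact label illustrative)
      FILTER(LCASE(STR(?label)) = "coastal ships")
    }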


Figure 4. Three visualizations of VOC data made possible through the GeoNames links: a) a plot of birth places on a map; b) aggregation by province of the sailors in one year (1750); and c) a stream plot of the sailors per province over all the years for which we have data. These visualizations are made by running a simple SPARQL query on the data cloud and visualizing the results using R.


8. Current work

The first phase of the project is finished and the datasets are published as linked data. In a current second phase of the project, we are expanding this data cloud and improving access and usage of the data.

  • Links to more related datasets are currently being established. For example, part of the Dutch historical census data made available through the CEDAR project [Meroño-Peñuela et al. 2012a] is already partly linked and available in the development version. This presents opportunities for even more elaborate types of analysis beyond the maritime context.
  • We are experimenting with user interfaces for historical research. This involves a 'data science' interface, where users can experiment with different types of queries, visualizations and manipulations of the data, resulting in analyses such as those described in the previous section.
  • Currently, the provenance of the datasets can be traced back to source digital representations of the data. However, in some cases, the original data comes from transcribed archival records. A follow-up research grant allows us to make digital scans of these source records also available and link them to the extracted data as published in the DSS data cloud. For the MDB dataset, we will make digital scans of muster rolls (see Figure 2) available and link these to the MDB records, deepening the provenance information. This enables tracing results of (SPARQL) queries back to the original data even more than is currently possible, ensuring further trust and usability in the historical research context. Currently, we are clearing publishing rights for scans of records from 13 Dutch archives.

Acknowledgments

This work was partly supported by CLARIN-NL under the project name DSS. It was also partly supported by a Small Data Grant from Data Archiving and Networked Services (DANS). The authors would like to thank Robin Ponstein and Andrea Bravo Balado.

References

  • [Balado 2014] Balado, A. B. Linking historical ship records to newspaper archives. M.Sc. thesis VU University Amsterdam (2014).
  • [Berners-Lee 2006] Berners-Lee, T. Linked data - design issues. http://www.w3.org/DesignIssues/LinkedData.html (2006).
  • [Bizer et al. 2009] Bizer, C. et al. Linked Data - the story so far. International Journal on Semantic Web and Information Systems, 5(3) (2009):1–22.
  • [Bruijn et al. 1979] Bruijn, J. R. et al. Dutch-Asiatic Shipping in the 17th and 18th Centuries, I, II and III. Rijks Geschiedkundige Publicatiën, Grote Serie 165, 166, 167. Den Haag: Martinus Nijhoff, 1987, 1979 and 1979.
  • [Cohen et al. 2008] Cohen, D. J. et al. Interchange: The promise of digital history. Special issue, Journal of American History, 95, no.2 (2008).
  • [de Boer et al. 2013] de Boer, V. et al. Linking the kingdom: Enriched access to a historiographical text. In Proceedings of K-CAP 2013, Banff, Canada, 23-26 June 2013 (2013).
  • [de Boer et al. 2012] de Boer, V. et al. Supporting Linked Data Production For Cultural Heritage Institutes: The Amsterdam Museum Case Study. In Proceedings of European Semantic Web Conference (ESWC) (2012).
  • [de Boer et al. 2014] de Boer, V. et al. Dutch ships and sailors linked data cloud. In Proceedings of the International Semantic Web Conference (ISWC 2014), 19-23 October 2014, Riva del Garda, Italy (2014).
  • [De Nies et al. 2013] De Nies, T. et al. Modeling uncertain provenance and provenance of uncertainty in W3C PROV. In Proceedings of WWW 2013, pages 167–168 (2013).
  • [Groth 2013] Groth, P. and Moreau, L. (eds.). PROV-Overview. An Overview of the PROV Family of Documents. W3C Working Group Note NOTE-prov-overview-20130430, World Wide Web Consortium, April 2013.
  • [Hyvönen et al. 2012] Hyvönen, E. et al. History on the Semantic Web as linked data – an event gazetteer and timeline for World War I. In Proc. of CIDOC 2012 – Enriching Cultural Heritage, Helsinki, Finland, June (2012).
  • [Isaac 2012] Isaac, A. and Haslhofer, B. Linked open data – data.europeana.eu. Semantic Web 4(3): 291–297 (2013).
  • [Isaac 2010] Isaac, A. Europeana data model primer. http://pro.europeana.eu/edmdocumentation (2010).
  • [Meroño-Peñuela et al. 2012a] Meroño-Peñuela, A. et al. Linked humanities data: The next frontier? A case study in historical census data. In Proceedings of the 2nd International Workshop on Linked Science 2012, vol. 951 (2012).
  • [Meroño-Peñuela et al. 2012b] Meroño-Peñuela, A. et al. Semantic technologies for historical research: A survey. Semantic Web Journal [to appear] (2012).
  • [Ockeloen et al. 2013] Ockeloen, N. et al. Biographynet: Managing provenance at multiple levels and from different perspectives. In Proceedings of the Workshop on Linked Science (LiSC) at ISWC 2013, Sydney, Australia, October 2013 (2013).
  • [Ponstein 2014] Ponstein, R. Reconciling dutch ships. M.Sc. thesis VU University Amsterdam (2014).
  • [Schreiber et al. 2008] Schreiber, G. et al. Semantic annotation and search of cultural-heritage collections: The MultimediaN E-Culture demonstrator. J. Web Sem., 6(4):243–249 (2008).
  • [van Erp et al. 2011] van Erp, M. et al. Automatic Heritage Metadata Enrichment With Historic Events. In Proc. of the Int. Conference for Culture and Heritage On-line, Museums and the Web 2011. Archimuse, April (2011).
  • [van Rossum 2011] van Rossum, M. De intra-Aziatische vaart. Schepen, de Aziatische zeeman en ondergang van de VOC. Tijdschrift voor Sociale en Economische Geschiedenis, 8, nr. 3: 32–69 (2011).
  • [van Velzen 2010] van Velzen, A.J.M. and Gaastra, F.S. Thematische collectie: VOC Opvarenden; VOC sea voyagers. urn:nbn:nl:ui:13-v73-sq8 (2000–2010).

Contribution

1) How does the project advance contemporary discussions within its particular subject area?

The Dutch Ships and Sailors Project is representative of ongoing developments in the digital humanities community in many respects:

  • It elaborates upon existing digital resources with the objective to provide a seamless integration of the various sources in order to facilitate historical research;
  • Its highly specific purpose, namely recording the components of the Dutch navy in the 17th-19th Centuries, is clearly articulated with the expectation of a community of researchers working on the period and the topic;
  • It explores the timely topic of linked open data (LOD) as a technical means to pursue its general objective, which is an interesting point of comparison for projects with a similar perspective. Moreover, the project offers a good example for readers who are not LOD specialists but want to take a look at the corresponding methodology.

2) Does the project fully engage with current scholarship in the field?

The project is clearly based upon a strong comprehension of the needs of historians looking to better understand the networks of people and vessels that characterize the merchant navigation of that time. It places specific emphasis upon providing identification and linking mechanisms for the corresponding elementary components.

3) Do the digital methods employed offer unique insights into the project’s key questions?

The project, as described in the statement, focuses a bit too much on the LOD technical setting without clearly describing the actual issues of data modelling, interoperability, and the management of heterogeneous digital surrogates. It may be superfluous, for instance, to refer to RDF triples. An example of this bias is the reference to the notion of "5-star linked data," which may be unclear to many readers.

Presentation

1) Does the interface effectively communicate and facilitate the goals, purpose, and argument of the project?

The current setting of the project, as exemplified by the available online search interface, presents a paradox: whereas the project is clearly based upon a well identified research domain, the search interface provides an output where fields appear as they are encoded in the RDF network, which makes the results rather illegible. This issue probably reflects a strong technical orientation, which would be nicely balanced if more input from the visualisation field were included.

Preservation

1) Have relevant best practices and standards been followed for markup and metadata?

Since the project has chosen a very specific technical orientation, the reader more interested in the general methodology of managing multiple data sources may be left unsatisfied by the lack of details about the data modelling issues. In particular, the statement mentions that the various textual sources were initially encoded in "XML" without saying what kind of format this actually was and/or whether the resource (for instance if encoded according to the TEI guidelines) has the potential for other types of exploration. Independently of the project itself, I could imagine that the corpus of 19th century newspapers may have a variety of other potential usages.

2) Is documentation available about the project? Is information provided about who, why and when and how different responsibilities were assigned?

The document mainly differentiates between the two groups of historians and technicians in charge of the implementation. It could have been useful to have an overview of the manpower within the project in terms of responsibilities, competence and number (maybe as a blog entry related to the paper).

3) How is the project hosted? Through a university server? A commercial host? A non-profit organization? Is there evidence of ongoing commitment to support of the project at the level of hosting? Is there similar evidence of ongoing support from project personnel?

4) Is there a preservation and maintenance plan for the interface, software, and associated databases (multiple copies, mirror sites, collaboration with data archives, etc.)? Is the project fully exportable/transferable?

[3+4] No real statement is made concerning hosting or the long-term maintenance of the resource. Reading between the lines, it seems that the presence of the Huygens institute as a partner, as well as the explicit reference to the European CLARIN infrastructure, may give some hints as to the hosting and maintenance of the resource. Still, it would be very helpful for projects of a similar nature to have an overview of the issues concerning the sustainable preservation of the resource.

5) Is the software being used proprietary, open-source, or editable by multiple programs? Are there clear plans for future accessibility? Will researchers have access to project material and/or metadata outside of a web-based interface?

By definition of linked open data, all elementary information units are completely available online. In the same way, all the pieces of software used for the project seem to be open source. As mentioned about the query interface, it could be valuable to plug in additional components for data visualisation, but such components could also probably be taken from the OSS domain.

As a whole, a project such as DSS presents the opportunity to better understand some fundamental questions in data integration, which are not always raised or elaborated upon within the available document. Among those which could deserve a blog entry as an answer, I would consider:

  • Multilingualism: how is the variety of languages present in the data sources dealt with in order to provide researchers with an integrated data space where all content can be equally queried and visualised?
  • Open access: beyond the use of LOD technologies, how does the project intend to make sure that all components of the research process are openly accessible? This question covers issues related to open access to associated publications, the status of possible enrichments by scholars, and licensing policy.
  • Semantics: how can the long-term semantic usability of the resource be ensured, for instance from the perspective of creating a large-coverage prosopographical resource? What kind of action is needed to allow for fine-grained comparison of similar components across projects? How does this relate to large standardisation initiatives such as the TEI or archival standards?

Contribution

Overall, the project statement situates the project's contribution as primarily methodological, describing a process by which four diverse data sets have been assembled—in some cases by laborious hand-translation of 17th and 18th century ship's muster rolls—and abstracted into a common ‘data cloud’ for historical research. The main focus of the project is on the method of this integration, but the authors have also provided some robust web-based querying tools for scholars interested in using the combined DSS data set. The project's website contains a host of supporting material including numerous data query tools, project documents and specifications, linked data exports, and the submitted project statement. These materials elaborate upon the contribution of the project’s methodology in some detail. A screenshot in the paper of a ‘provenance flow’ of a set of records hints at a very rich capability to visualize this information.

Looking at a promising project in mid-stage production inevitably leaves one hungry for more elaboration of the actual and the potential. For example, I found myself wanting to learn more about the project's fascinating entity reconciliation algorithm, used to forge identity links within what is otherwise treated as a set of unique entities. Likewise, after all the effort put into linking this data using Semantic Web technologies, providing a little more visibility, at a more granular level, into the 'field mapping' process that allows the four different data sources to be abstracted into an integrated data layer could really benefit the project. Even a relatively simple "schema"-type diagram representing the process, one that illustrates the metadata relationships between one or two of the fields mentioned in the statement (e.g. ship name, captain, etc.), could be of use to readers and users of the site's data.

Presentation

The “data cloud” produced by the Dutch Ships and Sailors project has real scholarly value: the abstracted data sets interrelate using Semantic Web technologies (RDF/linked data), permitting a host of complex analyses. The project's goals are primarily methodological, and the statement and site do not foreground any specific contributions to the various conversations in Dutch maritime historiography, nor do they elaborate the methodology at an approachable level of detail. Instead, they point to the data itself as a primary project output. The statement speaks to the value of the “surprisingly good” results from the experimental mining of these disparate data sources, but I am left wanting to learn more about the kinds of historical insights the site and/or project may have already offered historical researchers using the DSS data sets. While the statement does detail a few potential historical research questions, these only describe specific applications of interrogating the data and are not clearly tied to any underlying debates in the existing (non-DH) historiography. This is not to say that such insights have not been obtained; despite not being a historian of this specific subject matter, my sense is that this project is just beginning to contribute to the field conversations and historiography of 17th and 18th century Dutch maritime history.

Preservation

The project has clearly devoted significant thought and effort to establishing a preservation and sustainability strategy for the data it has generated, and has largely accomplished this goal. The use of Semantic Web technologies and RDF 'tagging' ensures that the data will be usable by a wide and growing community of data scientists and archivists. However, the project statement does not detail any specific archiving or preservation strategies and/or repositories.

The project's focus on maintaining provenance information for the sources is particularly impressive and should serve as a model moving forward. However, the data also needs to be approachable for historians without significant RDF/Linked Data experience, and it is here that the one vulnerability of the data management/user interface strategy may manifest. For historians interested in the subject matter but uninterested in archival maintenance and practice, the lack of an interface in which they can interrogate the data in a way they understand may cause them to leave empty-handed, or to leave with a set of downloads in tow to be imported, modified, and left to an unknown end. If the project can improve the web site's user interface, it will increase the likelihood of its data being enhanced and ultimately preserved. To this end, I would suggest a comprehensive review and update of the site's interface. Beyond the interface issues and the impact on preservation they may pose, the project's general focus on data preservation has been notable and successful, with a range of involved institutions and project participants.