Thursday, April 21, 2016

Linked Data in Sweden 2016

It's time for the 5th "Linked Data in Sweden" event, on Tuesday 26 April. Last year I organized the meeting in Gothenburg together with Fredrik Landqvist. This year we are back in Stockholm, this time at the Royal Armoury. I just learned that it is the oldest museum in Sweden. It was established by King Gustav II Adolph in 1628.

There are several interesting presentations on the agenda from, e.g., Scania, Nobel Media, Wikimedia, Findwise and the National Library of Sweden. I will give a short update on Linked Data efforts for data standards in biopharma and healthcare. So, I have started to think about the things I would like to cover and will tweet one item per day. Below is the emerging list of links, with a video presentation per item. I do not have much spare time, so I will shape them into a couple of slides on the train up to Stockholm; see the slides at the end of this blog post.

Standards represented as Linked Data

The first items on my list are examples of when the authoritative sources of the content, in this case traditional standard organisations, publish linked data versions of their own content. This is very much what I was hoping for in my keynote at the Semantic Web Applications and Tools for Life Sciences (SWAT4LS) workshop in late 2013: Pushing back, standards and standard organizations in a Semantic Web enabled world.
  • CDISC in RDF
  • HL7 FHIR in RDF
  • MeSH in RDF
  • ICD-11 in OWL
  • Other standards e.g. ATC, WHO Drug and MedDRA


In 2011 I presented Linking Clinical Data Standards at the CDISC (Clinical Data Interchange Standards Consortium) EU conference in Brussels. A year later, in Stockholm, Frederik Malfait (IMOS Consulting and consultant at Roche) and I together presented Semantic models for CDISC based standard and metadata management. At the 2nd Linked Data in Sweden meeting in 2013 I presented Länkade kliniska data standards (Linked clinical data standards).

The same spring the CTO of CDISC, Wayne Kubick, agreed to make this a task for the PhUSE organisation (Pharmaceutical Users Software Exchange). The PhUSE Semantic Technology project started later that year.

Overview of PhUSE Semantic Technology Project
by Frederik Malfait (21:16 - 37:00)

In the summer of 2015 CDISC published their standards in RDF. In the future, representation of CDISC standards in RDF will be one of the outputs of CDISC's metadata registry (SHARE).


The Fast Healthcare Interoperability Resources (FHIR, pronounced "fire") is a proposed standard describing data formats and elements (known as "resources"). It also defines an Application Programming Interface (API) for exchanging electronic health records. The standard was created by the Health Level Seven International (HL7) health-care standards organization. And it is hot! I recently attended a FHIR workshop organised by HL7 Sweden at the Swedish eHealth conference Vitalis (see my Storify Vitalis2016).

The HL7 FHIR project and the W3C Semantic Web Health Care and Life Sciences Interest Group (HCLS) both work on RDF representations of FHIR. The HL7 work, led by Grahame Grieve, one of the creators of FHIR, and the W3C HCLS work, led by David Booth, the initiator of the so-called Yosemite project, will be aligned.


The Medical Subject Headings (MeSH) is the National Library of Medicine's controlled vocabulary thesaurus. It is used to index biomedical journals. The rationale and design of MeSH in RDF are described in a good article: Desiderata for an authoritative Representation of MeSH in RDF

ICD-11 in OWL

The 11th revision of the International Classification of Diseases (ICD-11) is based on a content model encoded in OWL that takes it beyond the long list of terms in ICD-10. Mark Musen gives an excellent introduction both to ICD-11 and to how the ontology tool called iCAT, based on WebProtege, has been used to represent ICD-11, while most editors want to stick to Excel spreadsheets. This is a shared experience for all data standards mentioned here.

Other standards e.g. ATC, WHO Drug, MedDRA

There are several other standards I would like to see RDF/OWL versions of, to make our use of them in biopharma more robust. Examples are ATC (Anatomical Therapeutic Chemical Classification System), WHO Drug Dictionary and MedDRA (Medical Dictionary for Regulatory Activities). In early 2015 I was invited to WHO Uppsala Monitoring Center to talk about the value of this.

It took CDISC almost 5 years to go from early ideas on using semantic web standards and linked data principles to actually applying them. In the same way, I think it will take some more years before we have "Standardized the Standards", to quote David Booth, who leads the Yosemite project (see below).

New initiatives outside the traditional standard organisations

Here are a couple of interesting initiatives I also wanted to cover but will probably not have the time for.

See my Storify LDSV2016 with notes and links from the event.

And here are the slides for my presentation in the afternoon, which I put together on the train from Gothenburg to Stockholm this morning.

Wednesday, December 9, 2015

SWAT4LS 2015 Industry stream

It's been a great first day at SWAT4LS and I have been buying a few books in the lovely Cambridge University bookstore and had a nice conference dinner.

I'm now preparing for tomorrow's task to be the chair for the industry stream in SWAT4LS (see my previous blog post for more information about this event). So, here's a list of the 6 abstracts, companies and projects/tools that I'll introduce tomorrow morning:
  1. The BioHub Knowledge Base: Ontology and Repository for Sustainable Biosourcing
     Text Mining/NLP research group within the School of Computer Science at the University of Manchester together with Unilever, BioHub Knowledge Base (BioHubKB)
  2. Customizing “General SPARQL” for visualisation of in-house data in Cytoscape
     General Bioinformatics, General SPARQL
  3. GraphScope – smart data access for the life sciences
     SearchHaus, GraphScope
  4. Semantic Technologies Make Sense for Life Sciences
     SmartLogic
  5. Advancing Knowledge Discovery for Alzheimer’s Disease: The Alzforum Experience
     Alzforum
  6. Everybody a Translational Data Scientist
     Ontoforce, DISQOVER

Tuesday, December 1, 2015

SWAT4LS 2015

Monday, 7th December, I will fly to Cambridge to attend the Semantic Web Applications and Tools for Life Sciences (SWAT4LS) conference and also visit colleagues at the new AstraZeneca site. The conference programme looks interesting and the venue, Clare College, fantastic ("Harry-Potter-land" was my husband's comment when he saw the pictures :-).

I am very glad to be the chair for the Industry session on Wednesday morning. Here are a few items on the programme I find extra interesting, from my clinical and RWE data perspective:
It will be great fun to meet friends and colleagues in the Semantic Web community.

Check out my Storify: SWAT4LS2015

Thursday, June 25, 2015

Jupyter Notebooks

Last week I followed the feed from the Spark Summit 2015 event and several tweets talked about using Notebooks. Two tweets especially:
So I got curious about Jupyter, the lab notebooks used in the edX/DataBricks MOOC I'm following (Introduction to Big Data with Apache Spark). And yes, I do agree with Paco Nathan (@pacoid) and Edd Dumbill (@edd): Notebooks do look like a real game changer:
  • VisiCalc and Lotus 1-2-3 in the early 80ies. 
  • Mosaic and Netscape in the mid 90ies. 
  • I get a similar feeling now, in the mid 2010ies, when I see Jupyter Notebooks.
    (Yes, I know it's old news for all Mathematica users :)

In the first 20 minutes of this great video, Min Ragan-Kelley (@minrk), one of the core contributors to IPython and now to Jupyter, gives a nice intro, and in the following 30 minutes he describes several cool examples of Notebooks, e.g. the CodeNeuro notebooks using Thunder based on Spark.

Excellent podcast interview with two other key contributors to IPython/Jupyter: Brian Granger (@ellisonbg) and Fernando Perez (@fperez_org)

Hmm, I need to think more about the combinations of Notebooks (reproducible research) and Linked Data (processable data) ... ...

Wednesday, April 22, 2015

CSVW for Tabular Clinical Trial Data and Metadata

W3C has developed a set of working drafts for tabular data and metadata called CSV on the Web (CSVW) and is now seeking comments and implementations.

The drafts describe:
  • Metadata vocabulary for tabular data
    A JSON-based format for expressing metadata about tabular data to inform validation, conversion, display and data entry for tabular data
  • Model for tabular data and metadata
    An abstract model for tabular data, and how to locate metadata that enables users to better understand what the data holds; this specification also contains non-normative guidance on how to parse CSV files.
  • Procedures and rules to be applied when converting tabular data into JSON and RDF 
These are based on a series of use cases and recommendations including for example Publication of National Statistics and Analyzing Scientific Spreadsheets. I can see some interesting opportunities in this for tabular Clinical Trial Datasets.

A small example

Check out Ed Summers' (@edsu) very nice, small CSVW example mentioning one of the authors of the drafts, Dan Brickley (@danbri, Developer Advocate at Google). Below are the CSV example, the related metadata and the annotated, linked data.

0470402377,"Bricklin on Technology","Dan Bricklin"

{
  "@context": {
    "@vocab": "",
    "dc": ""
  },
  "@type": "Table",
  "url": "example.csv",
  "dc:creator": "Dan Bricklin",
  "dc:title": "My Spreadsheet",
  "dc:modified": "2014-05-09T15:44:58Z",
  "dc:publisher": "My Books",
  "tableSchema": {
    "aboutUrl": "{isbn}",
    "primaryKey": "isbn",
    "columns": [
      {
        "name": "isbn",
        "titles": "ISBN-10",
        "datatype": "string",
        "unique": true,
        "propertyUrl": ""
      },
      {
        "name": "title",
        "titles": "Book Title",
        "datatype": "string",
        "propertyUrl": ""
      },
      {
        "name": "author",
        "titles": "Book Author",
        "datatype": "string",
        "propertyUrl": ""
      }
    ]
  }
}

Annotated, linked data
(RDF model serialized in JSON-LD)
{
  "@context": {
    "csvw": "",
    "dc": "",
    "prov": "",
    "xsd": ""
  },
  "@graph": [
    {
      "@id": "_:g69960879269460",
      "@type": "prov:Usage",
      "prov:entity": {
        "@id": "example.csv-metadata.json"
      },
      "prov:hadRole": {
        "@id": "csvw:tabularMetadata"
      }
    },
    {
      "@id": "_:g69960879270660",
      "@type": "prov:Usage",
      "prov:entity": {
        "@id": "example.csv"
      },
      "prov:hadRole": {
        "@id": "csvw:csvEncodedTabularData"
      }
    },
    {
      "@id": "_:g69960879273280",
      "@type": "prov:Activity",
      "prov:endedAtTime": {
        "@value": "2015-04-22T20:21:11Z",
        "@type": "xsd:dateTime"
      },
      "prov:qualifiedUsage": [
        {
          "@id": "_:g69960879270660"
        },
        {
          "@id": "_:g69960879269460"
        }
      ],
      "prov:startedAtTime": {
        "@value": "2015-04-22T20:21:10Z",
        "@type": "xsd:dateTime"
      },
      "prov:wasAssociatedWith": {
        "@id": ""
      }
    },
    {
      "@id": "_:g69960879277480",
      "@type": "csvw:Row",
      "csvw:describes": {
        "@id": ""
      },
      "csvw:rownum": {
        "@value": "1",
        "@type": "xsd:integer"
      },
      "csvw:url": {
        "@id": "#row=2"
      }
    },
    {
      "@id": "_:g69960879413940",
      "@type": "csvw:Table",
      "csvw:row": {
        "@id": "_:g69960879277480"
      },
      "csvw:url": {
        "@id": "example.csv"
      },
      "dc:creator": "Dan Bricklin",
      "dc:modified": "2014-05-09T15:44:58Z",
      "dc:publisher": "My Books",
      "dc:title": "My Spreadsheet"
    },
    {
      "@id": "_:g69960879425260",
      "@type": "csvw:TableGroup",
      "csvw:table": {
        "@id": "_:g69960879413940"
      },
      "prov:wasGeneratedBy": {
        "@id": "_:g69960879273280"
      }
    },
    {
      "@id": "",
      "dc:creator": "Dan Bricklin",
      "dc:identifier": "0470402377",
      "dc:title": "Bricklin on Technology"
    }
  ]
}

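To get a feeling for how the metadata drives the annotation, here is a minimal sketch in Python. It is not the conversion procedure from the drafts: it only pairs each cell with its column name and fills in the aboutUrl template, and it treats the template as a plain Python format string, which is a simplification. The function name annotate_rows and the trimmed-down METADATA dictionary are my own assumptions for illustration.

```python
import csv
import io

# Trimmed-down copy of the metadata above; only the fields this sketch uses.
METADATA = {
    "url": "example.csv",
    "tableSchema": {
        "aboutUrl": "{isbn}",
        "columns": [
            {"name": "isbn", "titles": "ISBN-10"},
            {"name": "title", "titles": "Book Title"},
            {"name": "author", "titles": "Book Author"},
        ],
    },
}

CSV_TEXT = '0470402377,"Bricklin on Technology","Dan Bricklin"\n'


def annotate_rows(csv_text, metadata, has_header=False):
    """Pair each cell with its column name and expand the aboutUrl template.

    Hypothetical helper: a simplification, not the CSVW conversion algorithm.
    """
    schema = metadata["tableSchema"]
    names = [col["name"] for col in schema["columns"]]
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    rows = []
    for number, cells in enumerate(reader, start=1):
        record = dict(zip(names, cells))
        rows.append({
            "rownum": number,
            # Assumption: the template only uses {column_name} placeholders.
            "about": schema["aboutUrl"].format(**record),
            "cells": record,
        })
    return rows


rows = annotate_rows(CSV_TEXT, METADATA)
```

Each entry in rows then carries the row number, the identifier built from the template, and the cells keyed by column name, which is roughly the information the csvw:Row and book description in the JSON-LD above express.
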
A clinical trial data example?

Tabular data has been the traditional way to organize how clinical trial data is captured, stored and submitted. So, I think it would be very interesting to explore how to bind data to its metadata in a similar way. That is, to make things like variable labels, date/time formats etc. explicit.
  • What could the metadata for a small example of, e.g., demographic data look like?
  • What would the annotated, linked data look like for such a small example?
I would love to see some early ideas on how this could be implemented in the two main languages/environments we use today for clinical data: SAS and R. This would be similar to the early implementation of CSVW in Ruby described in a nice blog post from Gregg Kellogg (@Gkellogg, one of the authors of the drafts).
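As one possible starting point for the first question, here is a minimal sketch, written in Python only to keep it language-neutral, that builds CSVW-style metadata for a tiny demographics table. Everything specific in it is an assumption for illustration: the base URI, the file name dm.csv, the helper column(), and the choice of a few variable names borrowed from the CDISC SDTM demographics domain. The "@context" is left out on purpose.

```python
import json

# Hypothetical base URI for one study; not taken from any published schema.
BASE = "http://example.org/study/XYZ123/"


def column(name, title, datatype):
    """Describe one column with the CSVW properties used in the book example."""
    return {
        "name": name,
        "titles": title,
        "datatype": datatype,
        # Assumption: one URI per variable, built from the variable name.
        "propertyUrl": BASE + "variable/" + name,
    }


demographics_metadata = {
    "@type": "Table",
    "url": "dm.csv",  # hypothetical file name
    "dc:title": "Demographics",
    "tableSchema": {
        "aboutUrl": BASE + "subject/{USUBJID}",
        "primaryKey": "USUBJID",
        "columns": [
            column("USUBJID", "Unique Subject Identifier", "string"),
            # The date format is made explicit instead of being implied.
            column("BRTHDTC", "Date/Time of Birth",
                   {"base": "date", "format": "yyyy-MM-dd"}),
            column("AGE", "Age", "integer"),
            column("AGEU", "Age Units", "string"),
            column("SEX", "Sex", "string"),
        ],
    },
}

metadata_json = json.dumps(demographics_metadata, indent=2)
print(metadata_json)
```

The point of the sketch is only that variable labels and date formats, which are usually implied, end up as explicit, machine-readable metadata next to the data file.
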

I think such a first example would trigger interesting ideas for best practices and potential extensions to the metadata vocabulary and model, and also to the procedures and rules to create annotated JSON and RDF representations, such as:
  • Templates for the URIs to be assigned to each captured and derived data point?
  • Representing implied formats in varchar fields such as dates and precision.
  • Making explicit the implied metadata from the actual data such as encoded labtest codes and units.
  • How to leverage the RDF schemas representing CDISC standards?
  • How to best use W3C's Provenance ontology to capture the life cycle of a data point in a clinical trial?
I think questions such as these are important to address, especially in the context of transparency and reuse of clinical trial data; see also an earlier blog post: Clinical Trial Data Transparency and Linked Data.
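To make the question about implied formats a bit more concrete, here is a small sketch in Python of how the precision hidden in character fields could be made explicit. It assumes dates are stored as partial ISO 8601 strings (year, year-month or year-month-day) and results as plain decimal strings; the function names date_precision and decimal_places are hypothetical.

```python
import re

# Matches YYYY, YYYY-MM or YYYY-MM-DD (assumption: no time part, no other variants).
_PARTIAL_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def date_precision(value):
    """Return 'year', 'month' or 'day' for a partial ISO 8601 date, else None."""
    match = _PARTIAL_DATE.match(value)
    if match is None:
        return None
    _year, month, day = match.groups()
    if day is not None:
        return "day"
    if month is not None:
        return "month"
    return "year"


def decimal_places(value):
    """Return the number of digits after the decimal point in a numeric string."""
    if "." not in value:
        return 0
    return len(value.split(".", 1)[1])
```

For example, date_precision("2014-05") gives "month" and decimal_places("7.50") gives 2; such derived facts could then be recorded as metadata rather than left for each user to guess.
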

So, I hope this blog post will spark some interesting responses from the SAS and R communities, and discussions in groups like CDISC and PhUSE Semantic Technology project.

Monday, February 16, 2015

Clinical Trial Data Transparency and Linked Data

I have been following the discussions about clinical trial transparency and sharing of clinical trial data with great interest for the last three years. More precisely, my first tweet about this is from early 2012:

There has been a lot of debate over these years about how much of clinical trial results are being published: is it 50% or much more? Journal article publications vs. trial registries? There are also a lot of issues around summary level data vs. patient level data, and around de-identification of data, redaction of documents etc.

These are all interesting topics, but my interest in all of this is the opportunity to make data in, about and related to clinical trials useful by using semantic web standards and linked data principles. In the spring of 2013 I wrote a post about this on my blog, Talking to Machines, after listening to Ben Goldacre, one of the key people behind the AllTrials initiative, where he also acknowledged this:

Here are a couple of recent events, early 2015, related to Clinical Trial Data Transparency and Linked Data:
  • AAAS Panel on Innovations in Clinical Trial Registry
  • Public consultation EMA Clinical trial database
  • IoM report: Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk

AAAS Panel on Innovations in Clinical Trial Registry

So, I really liked what I saw in the program for a session yesterday evening (15 February, 2015) at the American Association for the Advancement of Science annual meeting in San Jose (#AAASmtg), a panel on Innovations in Clinical Trial Registers:
Documents relating to trials -- protocols, regulatory summaries of results, clinical study reports, consent forms, and patient information sheets -- are scattered in different places. It is difficult to track the information that is available, in order to audit for gaps in information and for doctors and regulators to be sure they have all the information they need to make decisions about medicines. There is an unprecedented opportunity to refine how clinical trial data are shared and linked.

Public consultation EMA Clinical trial database

This is similar to what I wrote last week when I tried to "act courageously" and responded to "the public consultation on how the transparency rules of the European Clinical Trial Regulation will be applied in the new clinical trial database is launched by the European Medicines Agency (EMA)."
Make use of modern data standards and access methods to make the access to the clinical trial database developer-friendly, data machine-processable and the trials and their components linkable. Leverage initiatives and use principles, such as CDISC Standards in RDF (under review), that uses modern data standards from W3C stack of semantic web standards, openFDA that uses developer-friendly REST APIs JSON (openFDA API reference), and the linked data principles.

IoM report: Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk

A couple of weeks ago the Institute of Medicine (IOM) released an excellent report: Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk.

Short summary, as I interpret the core message of the report: Instead of just designing and planning a study, scientists need to plan and document how they're going to share the data from that study so that it's usable to others who may want to re-analyze it.

The report has a well written section on “legacy trials” and an interesting listing of challenges:

Infrastructure challenges—Currently there are insufficient platforms to store and manage clinical trial data under a variety of access models. 
Technological challenges—Current data sharing platforms are not consistently discoverable, searchable, and interoperable. Special attention is needed to the development and adoption of common protocol data models and common data elements to ensure meaningful computation across disparate trials and databases. A federated query system of “bringing the data to the question” may offer effective ways of achieving the benefits of sharing clinical trial data while mitigating its risks. 
Workforce challenges—A sufficient workforce with the skills and knowledge to manage the operational and technical aspects of data sharing needs to be developed. 
Sustainability challenges—Currently the costs of data sharing are borne by a small subset of sponsors, funders, and clinical trialists; for data sharing to be sustainable, costs will need to be distributed equitably across both data generators and users.

And for a "clinical trial data and metadata nerd" like me this is like music :-)

Just because data are accessible does not mean they are usable. Data are usable only if an investigator can search and retrieve them, can make sense of them, and can analyze them within a single trial or combine them across multiple trials. Given the large volume of data anticipated from the sharing of clinical trial data, the data must be in a computable form amenable to automated methods of search, analysis, and visualization.

To ensure such computability, data cannot be shared only as document files (e.g., PDF, Word). Rather, data must be in electronic databases that clearly specify the meaning of the data so that the database can respond correctly to queries. If data are spread over more than one database, the meaning of the data must be compatible across databases; otherwise, queries cannot be executed at all, or are executable but elicit incorrect answers. In general, such compatibility requires the adoption of common data models that all results databases would either use or be compatible with.

Wednesday, November 5, 2014

ISWC2014 Trip Report

A few highlights from five intensive days at the International Semantic Web Conference (ISWC2014) in lovely Riva del Garda. See also my previous blog post Preparing for ISWC2014 and my live blog from all five days using Storify.

ISWC2014 Storify

Strong industry presence

ISWC is a research-focused conference. However, this year it had a strong industry presence with a full-day Industry track and a Semantic Developer workshop, and many of the Lightning Talks came from industry. It was great to meet Business Analysts and Information Architects from large companies such as Roche, Genentech and NXP Semiconductors and also from small companies such as the Danish StatGroup.
  • All five Information Architects in the Data Standards Office at Roche / Genentech attended all five days to learn more about the latest in semantic web research, especially traceability and provenance. Frederik Malfait, working for Roche and FDA/PhUSE, described how their RDF implementations of clinical trial data standards are the basis for a model-driven architecture enabling computable protocols, component-based authoring and automation of setting up clinical trial databases and generating submission datasets.
  • Marc Andersen, one of the two founders of StatGroup, presented the experience of the Pharmaceutical Users Software Exchange (PhUSE) developing a semantic representation of statistical results based on RDF and OWL. Providing clinical trial results as linked data will facilitate traceability, data sharing and integration, data mining and meta-analysis benefiting industry, regulatory authorities and the general public.
  • A business analyst described how NXP Semiconductors is making use of Semantic Web technology such as RDF and SPARQL to manage a product taxonomy for marketing purposes that forms the key navigation of the NXP website.

Hot topics: Developer friendly, Linked Data Fragments, Provenance and Semantics for Sensors

  • The Semantic Developer Workshop and the conference program included many examples of RDF and SPARQL support in traditional programming languages, such as Java, Perl, C# and JavaScript, as well as in data science languages, such as Python and R. The Semantic Developer of the Year, Kjetil Kjernsmo, from Oslo University, presented RDF/Linked Data for Perl. JSON-LD was referred to as the developer-friendly serialization of RDF.
  • Many of the presentations described how they applied the Provenance standard from W3C for "information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness." One example was how the standard had been used for event-based traceability in pharmaceutical supply chains via automated generation of linked pedigrees.
  • Semantics for Sensors was applied to everything from smart building diagnostics, traceability in pharmaceutical supply chains and traffic diagnosis to predicting frost in vineyards in Tasmania.
  • "Everyone" talked about the work presented in the poster that received the best poster award: Linked Data Fragments "so light-weight that even a Raspberry Pi can publish DBpedia (Wikipedia structured content) with high availability"

Best workshop paper award

It was very nice to present our joint EHR4CR, Open PHACTS, SALUS and W3C HCLS paper. It got a best paper award in the pre-conference workshop: Context, Interpretation and Meaning for the Semantic Web.

Other ISWC2014 reports