Sunday, May 22, 2016

Awesome graphic as Graphs

The classic continuum from Data via Information to Knowledge is nicely visualized in a three-part graphic. I've seen it shared many times over the last couple of years on Twitter and LinkedIn. Today I saw it extended with Insight and Wisdom, which made it even more awesome.

Original graphic by Hugh MacLeod @hughcards
extended by David Sommerville @smrvl  

It was my friend and former colleague Martin Börjesson @futuramb who retweeted a tweet from John Hagel @jhagel, management consultant and author. That took me to the creator of the original graphic, Hugh MacLeod @hughcards, cartoonist and co-founder of @gapingvoid. The extension was done by David Sommerville @smrvl, Digital Design Director for @TheAtlantic.

So, I started to think about representing the five pieces as executable and queryable graphs:

  • 1 DataPoint class
  • 21 DataPoints
  • 2 InfoClasses (represented by the green and lilac labels) 
  • 21 Classifications 
  • 1 type of Relationship
  • 18 Relationships
  • 1 new InfoClass (yellow) 
  • 2 new Classifications
  • 1 Relationship Query

RDF triples, RDF Schema and SPARQL would be one option.

Neo4j Property Graph and Cypher, another option.
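Before picking RDF or Neo4j, the pieces in the list above can be roughed out in plain Python. This is only a sketch under assumptions: the counts come from the list, but the identifiers, which point gets which colour, the chain-shaped relationships and the mapping of each step to Data, Information, Knowledge, Insight and Wisdom are my own illustration, not read off the graphic.

```python
from collections import defaultdict

# Data: 21 data points. The ids are placeholders, not taken from the graphic.
data_points = [f"dp{i}" for i in range(1, 22)]

# Information: 21 classifications into 2 info classes.
# Which point is green or lilac is an arbitrary assumption.
classification = {
    dp: ("green" if i % 2 == 0 else "lilac")
    for i, dp in enumerate(data_points)
}

# Knowledge: 18 relationships of one type. A simple chain is assumed here.
related_to = [(data_points[i], data_points[i + 1]) for i in range(18)]

# Insight: 1 new info class (yellow) with 2 new classifications.
insight = {"dp3": "yellow", "dp15": "yellow"}


def find_path(start, goal):
    """Wisdom: one relationship query, a breadth-first search between two points."""
    neighbours = defaultdict(set)
    for a, b in related_to:
        neighbours[a].add(b)
        neighbours[b].add(a)
    queue, seen = [[start]], {start}
    while queue:
        path = queue.pop(0)
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours[path[-1]] - seen):
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # the two points are not connected


yellow_points = sorted(insight, key=lambda dp: int(dp[2:]))
path = find_path(*yellow_points)
```

The same structure should carry over fairly directly: the dictionaries become rdf:type triples or node labels, the tuples become a single predicate or relationship type, and find_path becomes a SPARQL property path or a Cypher path query.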

Well, we'll see if I can find the time to do it, or convince some graph and linked data friends to have a go at it :-)



Thursday, May 19, 2016

Global, persistent and resolvable identifiers for clinical data

Yesterday two thought leaders in clinical data standards published great blog posts: Dave Iberson-Hurst (@Assero_UK) and Armando Oliva (@nomini). Dave's post has the title Wear Sunscreen but it's really about "CDISC 2.0". Armando's post has the title Improving the Study Data Tabulation Model.

Discussion threads on Twitter and LinkedIn today made me write this post about one of the many great proposals in the two blog posts: 1. SDTM should incorporate unique identifiers for each record in each domain.

In today's clinical data standards for 2-dimensional/tabular data exchange, e.g. CDISC SDTM, keys are either natural keys, e.g. STUDYID, USUBJID, LBTESTCD in a dataset of lab data according to SDTM, or surrogate keys, e.g. LBSEQ. A define.xml file should be the source for study-specific Key Variables for each dataset. For more details about SDTM keys and their challenges, see Duplicate records - it may be a good time to contact your data management team, PharmaSUG 2016, Sergiy Sirichenko and Max Kanevsky (@pinnacle_21).

Armando details the proposal in his blog post, saying that the identifiers should be "globally unique".
This is a discussion I have looked forward to since I urged CDISC to consider semantic web standards and linked data principles in my presentation at the CDISC EU conference in 2011.
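To make the natural key problem concrete, here is a minimal sketch of a duplicate check. The records and their values are invented, LBORRES is included only to give each row some content, and the three-variable key is the simplified example from the paragraph above; a real study would take its Key Variables from define.xml.

```python
from collections import Counter

# Hypothetical lab records, invented for illustration only.
records = [
    {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "ALT", "LBSEQ": 1, "LBORRES": "32"},
    {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "AST", "LBSEQ": 2, "LBORRES": "28"},
    {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "ALT", "LBSEQ": 3, "LBORRES": "35"},
]

# Assumed natural key, simplified from the example in the text.
key_variables = ["STUDYID", "USUBJID", "LBTESTCD"]


def duplicate_keys(rows, keys):
    """Return the natural-key tuples that occur in more than one row."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [key for key, n in counts.items() if n > 1]


duplicates = duplicate_keys(records, key_variables)
```

Here the two ALT rows collide on the natural key even though LBSEQ tells them apart, which is the kind of ambiguity a unique record identifier would remove.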

Linking Clinical Data Standards
My presentation at CDISC EU Interchange 2011
I now see how smart programmers and informaticians use checksums as record identifiers as a practical way to get around this problem and simplify the integration and reviewing of clinical data.
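A checksum used as a record identifier could look something like the sketch below. It assumes a record is a flat dictionary of variable names and values, and it picks SHA-256 over a canonical JSON form; both choices are mine, not a description of what any particular team does.

```python
import hashlib
import json


def record_checksum(record):
    """Derive an identifier from a record's content.

    The record is serialised with sorted keys so that the same content
    always gives the same checksum, whatever the column order.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical records: same content, different column order.
row_a = {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "ALT", "LBORRES": "32"}
row_b = {"LBORRES": "32", "LBTESTCD": "ALT", "USUBJID": "S1-001", "STUDYID": "S1"}

same_id = record_checksum(row_a) == record_checksum(row_b)
```

The identifier changes as soon as any value changes, which is handy for spotting updates during review but also means it identifies a version of a record rather than the record itself.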

A phrase we often use when talking about linking data and semantic web standards is: "global, persistent and resolvable identifiers".

  • An HTTP URI scheme makes identifiers possible to resolve. An example of a URI that has a resolver service is http://data.ordnancesurvey.co.uk/id/postcodeunit/SO160AS, the URI for the UK postcode SO160AS 1).
  • The URIs assigned to CDISC standard items, such as http://rdf.cdisc.org/std/sdtmig-3-1-3#Column.LB.LBSTRES for the standard lab result variable in CDISC SDTM, do not (yet) resolve 2).

So what would a URI look like for a single data point in a clinical study? HL7 FHIR uses so-called UUIDs. Trusty URIs use hash values: "URIs that contain a certain kind of hash value that can be used to verify the respective resource" (http://trustyuri.net/).
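The two flavours can be compared in a few lines of Python. This is a sketch only: the example.org namespace is hypothetical, and hash_uri just illustrates the general idea of putting a content hash in a URI. It is not the Trusty URI algorithm, which defines its own modules and encoding.

```python
import base64
import hashlib
import uuid

# Hypothetical namespace for data points in one study.
BASE = "http://example.org/study/S1/datapoint/"


def uuid_uri():
    """A random identifier in URN form; unique, but says nothing about the content."""
    return uuid.uuid4().urn


def hash_uri(content: bytes):
    """A URI whose last segment is a URL-safe SHA-256 hash of the content."""
    digest = hashlib.sha256(content).digest()
    token = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return BASE + token


def verify(uri, content: bytes):
    """Check that the content still matches the hash carried in the URI."""
    return uri == hash_uri(content)


data_point = b"S1|S1-001|ALT|32"  # invented serialisation of one lab result
data_point_uri = hash_uri(data_point)
```

A UUID stays the same when the value is corrected, while a hash-based URI changes with the content and lets anyone holding the data verify it. Which of the two behaviours is wanted for clinical data points is exactly the kind of question the proposal raises.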

I am eager to learn more about the potential of using URIs in combination with blockchains. This presentation on using blockchain technology and semantic standards for provenance across the supply chain made me think ...



... about Semantic blockchains in the Clinical Data Supply Chain. With identifiers assigned to each data point through the supply chain of clinical data captured in EHRs and smartphones, fed into clinical trial records, aggregated into summary level TLFs and later on included in secondary use analyses.
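To make the idea a little more tangible, here is a toy hash chain in Python where each step in the supply chain points back to the identifier of the previous step. It is deliberately far less than a blockchain (no distribution, no consensus, no signatures), and the stage names and payloads are invented.

```python
import hashlib
import json


def _digest(body):
    """SHA-256 over a canonical JSON form of one chain entry."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def add_step(chain, stage, payload):
    """Append a step whose identifier covers its content and its predecessor."""
    body = {
        "stage": stage,
        "payload": payload,
        "previous": chain[-1]["id"] if chain else None,
    }
    chain.append({**body, "id": _digest(body)})
    return chain


def is_intact(chain):
    """Recompute every identifier and check each link to the previous step."""
    previous = None
    for entry in chain:
        body = {k: entry[k] for k in ("stage", "payload", "previous")}
        if entry["previous"] != previous or entry["id"] != _digest(body):
            return False
        previous = entry["id"]
    return True


# Hypothetical journey of one value through the supply chain.
chain = []
add_step(chain, "EHR", {"test": "ALT", "value": "32"})
add_step(chain, "clinical trial record", {"LBTESTCD": "ALT", "LBORRES": "32"})
add_step(chain, "TLF", {"summary": "mean ALT", "n": 1})
```

Changing any earlier payload makes is_intact(chain) return False, which is the provenance property that makes the blockchain idea interesting for data that passes through many hands.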

Thoughts?

1) https://www.ordnancesurvey.co.uk/education-research/research/linked-data-web.html 
2) CDISC2RDF see https://github.com/phuse-org/rdf.cdisc.org

Friday, May 6, 2016

Twitter Feeds and Blog posts from Conferences

Conferences are a great way to meet interesting people and learn new things. It's always nicest when you can attend IRL, but it's also interesting to follow remotely via Twitter feeds, live blogging and blog posts with reports and presentations.

Conference Live Blogging

When I can attend conferences IRL I like to take notes using Twitter, and I try to gather links and tweets using Storify as a kind of live blogging. Check out Storify/kerfors from events such as the recent Linked Data in Sweden, 2016 (ldsv2016) and HL7 FHIR workshops at the Vitalis eHealth conference (Vitalis2016).

Me in action live blogging

When I cannot attend, I like to follow conferences from a distance and read people's blog reports.

This week I've been following the great #csvconf feed from "a data conference that's not literally about CSV file format but rather what CSV represents to our community: data interoperability, hackability, simplicity, etc." It is the most interesting Twitter feed from a conference I've seen so far.
Many thanks to some of the people tweeting from the event: @_inunddata, @EmilyGarfield (Emily also posted some very nice drawings from the event.)


Conference Reports as blog posts

The recent CDISC Europe conference in Vienna (#CDISCEurope) had a pretty thin feed, but with some great tweets from Magnus Wallberg (@CMWallberg), Technology Evangelist at WHO Uppsala Monitoring Center.
Magnus also wrote an excellent report as a blog post: A great mix of standards and great visions when CDISC met in Vienna.

Update: Just after I published this blog post I saw a blog post, HL7’s FHIR and BioPharma, and an article in Applied Clinical Trials, Building on FHIR for Pharmaceutical Research, by Wayne Kubick (@WayneKubick), CTO for HL7 and former CTO for CDISC. Both come from an HL7 event I recently followed: the Partners in Interoperability workshop in Washington DC.

Conference Presentations accompanying blog posts 

I also very much like it when presenters quickly post their conference presentations on e.g. Slideshare. And it's also very nice to see accompanying blog posts with the speaker's notes and additional material. I very much liked Dave Iberson-Hurst's (@assero_UK) blog post with his CDISC Europe presentation this year. It is a post in his Semantic Web & Metadata series: CDISC Standards: Assessing the Impact of Change

I tried something similar when I wrote a blog post to prepare for my presentation "Linked Data efforts for data standards in biopharma and healthcare" at the Linked Data in Sweden, 2016 meeting a week ago: Linked Data in Sweden 2016


Thursday, April 21, 2016

Linked Data in Sweden 2016

It's time for the 5th "Linked Data in Sweden" event, Tuesday 26 April. Last year I organized the meeting in Gothenburg together with Fredrik Landqvist. This year we are back in Stockholm, this time at the Royal Armoury. I just learned that it is the oldest museum in Sweden. It was established by King Gustav II Adolph in 1628.

There are several interesting presentations on the agenda from e.g. Scania, Nobel Media, Wikimedia, Findwise and the National Library of Sweden. I will give a short update on Linked Data efforts for data standards in biopharma and healthcare. So, I have started to think about things I would like to cover and will tweet one item per day about things I find interesting. Below is the emerging list of links and a video presentation per item. I don't have much spare time, so I will shape them into a couple of slides on the train up to Stockholm; see the slides at the end of this blog post.

Standards represented as Linked Data

The first items on my list are examples of when the authoritative sources of the content, in this case traditional standard organisations, publish linked data versions of their own content. This is very much what I was hoping for in my keynote at the Semantic Web Applications and Tools for Life Sciences (SWAT4LS) workshop in late 2013: Pushing back, standards and standard organizations in a Semantic Web enabled world.
  • CDISC in RDF
  • HL7 FHIR in RDF
  • MeSH in RDF
  • ICD-11 in OWL
  • Other standards e.g. ATC, WHO Drug and MedDRA

CDISC in RDF

In 2011 I presented Linking Clinical Data Standards at the CDISC (Clinical Data Interchange Standards Consortium) EU conference in Brussels. A year later, in Stockholm, Frederik Malfait (IMOS Consulting and consultant at Roche) and I together presented Semantic models for CDISC based standard and metadata management. At the 2nd Linked Data in Sweden meeting in 2013 I presented Länkade kliniska data standards (Linked clinical data standards).

The same spring the CTO of CDISC, Wayne Kubick, agreed to make this a task for the PhUSE organisation (PhUSE Association Programming Pharmaceutical Users Software Exchange). The PhUSE Semantic Technology project started later that year.


Overview of PhUSE Semantic Technology Project
by Frederik Malfait (21:16 - 37:00)

In the summer of 2015 CDISC published their standards in RDF. In the future, representation of CDISC standards in RDF will be one of the outputs of CDISC's metadata registry (SHARE).

HL7 FHIR in RDF

Fast Healthcare Interoperability Resources (FHIR, pronounced "fire") is a proposed standard describing data formats and elements (known as "resources") and an Application Programming Interface (API) for exchanging electronic health records. The standard was created by the Health Level Seven International (HL7) health-care standards organization. And it is hot! I recently attended a FHIR workshop organised by HL7 Sweden at the Swedish eHealth conference Vitalis (see my Storify Vitalis2016).


The HL7 FHIR project and the W3C Semantic Web Health Care and Life Sciences Interest Group both work on RDF representations of FHIR. The HL7 work, led by Grahame Grieve, one of the creators of FHIR, and the W3C HCLS group's work, led by David Booth, the initiator of the so-called Yosemite project, will be aligned.


MeSH in RDF

The Medical Subject Headings (MeSH) is the National Library of Medicine's controlled vocabulary thesaurus. It is used to index biomedical journals. The rationale and design of MeSH in RDF is described in a good article: Desiderata for an authoritative Representation of MeSH in RDF



ICD-11 in OWL

The 11th revision of the International Classification of Diseases (ICD-11) is based on a content model encoded in OWL that takes it beyond the long list of terms in ICD-10. Below is an excellent introduction by Mark Musen to both ICD-11 and to how the ontology tool called iCAT, based on WebProtege, has been used to represent ICD-11. Meanwhile, most editors want to stick to Excel spreadsheets, which is a shared experience for all data standards mentioned here.



Other standards e.g. ATC, WHO Drug, MedDRA

There are several other standards I would like to see RDF/OWL versions of, to make our use of them in biopharma more robust. For example ATC (Anatomical Therapeutic Chemical Classification System), WHO Drug Dictionary and MedDRA (Medical Dictionary for Regulatory Activities). In early 2015 I was invited to WHO Uppsala Monitoring Center to talk about the value of this.




In the same way as it took CDISC almost 5 years to go from early ideas on using semantic web standards and linked data principles to actually applying them, I think it will take some more years before we have "Standardized the Standards", to quote David Booth, who leads the Yosemite project (see below).

New initiatives outside the traditional standard organisations

Here are a couple of interesting initiatives I also wanted to cover but will probably not have the time for.

See my Storify LDSV2016 with notes and links from the event.

And here are the slides for my presentation in the afternoon, which I put together on the train from Gothenburg to Stockholm this morning.



Wednesday, December 9, 2015

SWAT4LS 2015 Industry stream

It's been a great first day at SWAT4LS. I bought a few books in the lovely Cambridge University bookstore and had a nice conference dinner.

I'm now preparing for tomorrow's task to be the chair for the industry stream in SWAT4LS (see my previous blog post for more information about this event). So, here's a list of the 6 abstracts, companies and projects/tools that I'll introduce tomorrow morning:
  1. The BioHub Knowledge Base: Ontology and Repository for Sustainable Biosourcing. Text Mining/NLP research group within the School of Computer Science at the University of Manchester together with Unilever; BioHub Knowledge Base (BioHubKB)
  2. Customizing “General SPARQL” for visualisation of in-house data in Cytoscape. General Bioinformatics; General SPARQL
  3. GraphScope – smart data access for the life sciences. SearchHaus; GraphScope
  4. Semantic Technologies Make Sense for Life Sciences. SmartLogic
  5. Advancing Knowledge Discovery for Alzheimer’s Disease: The Alzforum Experience. Alzforum
  6. Everybody a Translational Data Scientist. Ontoforce; DISQOVER

Tuesday, December 1, 2015

SWAT4LS 2015

Monday, 7th December, I will fly to Cambridge to attend the Semantic Web Applications and Tools for Life Sciences (SWAT4LS) conference and also visit colleagues at the new AstraZeneca site. The conference programme looks interesting and the venue, Clare College, fantastic ("Harry-Potter-land" was my husband's comment when he saw the pictures :-).



I am very glad to be the chair for the Industry session on Wednesday morning. Here are a few items on the programme I find extra interesting, from my clinical and RWE data perspective:
It will be great fun to meet friends and colleagues in the Semantic Web community.

Check out my Storify: SWAT4LS2015

Thursday, June 25, 2015

Jupyter Notebooks


Last week I followed the feed from the Spark Summit 2015 event and several tweets talked about using Notebooks. Two tweets especially:
So I got curious about Jupyter, the lab notebooks used in the edX/DataBricks MOOC I'm following (Introduction to Big Data with Apache Spark). And yes, I do agree with Paco Nathan (@pacoid) and Edd Dumbill (@edd): Notebooks do look like a real game changer:
  • VisiCalc and Lotus 1-2-3 in the early 80s.
  • Mosaic and Netscape in the mid 90s.
  • I get a similar feeling now, in the mid 2010s, when I see Jupyter Notebooks.
    (Yes, I know it's old news for all Mathematica users :)

In the first 20 minutes of this great video, Min Ragan-Kelley (@minrk), one of the core contributors to IPython and now to Jupyter, gives a nice intro, and in the following 30 minutes he describes several cool examples of Notebooks, e.g. the CodeNeuro notebooks using Thunder based on Spark.



Excellent podcast interview with two other key contributors to IPython/Jupyter: Brian Granger (@ellisonbg) and Fernando Perez (@fperez_org)


Hmm, I need to think more about the combinations of Notebooks (reproducible research) and Linked Data (processable data) ... ...