The LinkedUp Data Catalogue provides Web Data directly related to education. In order to make it homogeneous and manageable, there are a couple of simple constraints that make a dataset eligible for inclusion: 1) it has to be directly relevant to education; and 2) it has to be available in RDF, through a SPARQL endpoint. Of course, the second constraint is a bit restrictive, as many of the datasets that are interesting and useful are not available in RDF. As part of the LinkedUp project, we put mechanisms in place so that these datasets can be translated into RDF automatically.

Here we describe a simple process to transform metadata from a DSpace repository available on the Web (without necessarily having admin access to it) into a simple RDF representation. These are basic scripts that might not be entirely robust and may give approximate results, but they can get one started with this kind of scraping.

The first step is to download the relevant HTML code from the DSpace interface. wget is a good tool for that, but it needs to be used with care. Indeed, if applied recursively (as we want to), it might just start trying to download a whole bunch of irrelevant things, more or less forever. The command below tells it to go from dspace.ou.nl only up to 5 levels down, and to ignore files with extensions we don't need (images, archives, etc.):

wget -R war,zip,pdf,jpg,jpeg,png,doc,docx -r --level=5 http://dspace.ou.nl/

Of course, we still obtain a lot of irrelevant things, but that's what we are going to clean up now. The resulting files are placed in a directory corresponding to the address of the website you are scraping (here dspace.ou.nl). The next step is to find the files that contain metadata about resources. The following script starts from one directory and tries to recursively identify the files that have Dublin Core information embedded. I started it on dspace.ou.nl/handle, which is where the files of interest are.

Script: find-files

#! /bin/bash

# Recursively walk the directory given as first argument and print the
# names of the files that embed Dublin Core metadata (schema.DC)
list=`ls "$1"`

for file in $list
do
  if [ -d "$1/$file" ]
  then
     # Recurse into sub-directories
     ./find-files "$1/$file"
  else
      # Print the file name if it contains Dublin Core meta elements
      grep -l schema.DC "$1/$file"
  fi
done

Once the files have been identified, the next step is to extract the metadata they embed and express it as RDF. The following script does this in a rather crude way, taking a list of files as input and outputting N-Triples that represent them. To connect the two scripts, they can simply be piped, as follows:

./find-files dspace.ou.nl/handle | ./extract > results.nt

Script: extract

#! /bin/bash

# Read file paths from standard input (as produced by find-files)
while read file
do
  # Escape the slashes in the path so that it can be used inside the sed patterns below
  file2=`echo $file | sed 's/\//\\\\\//g'`
  # Type the document (the unescaped path is used to build the URI)
  echo '<http://'$file'> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .';
  # Turn each embedded <meta> element into an N-Triples statement, mapping the
  # DSpace metadata prefixes (DC., DCTERMS., citation_) to vocabulary URIs
  grep '^<meta name=".*" content=".*" />' $file | \
  sed 's/^<meta name="\(.*\)" content="\(.*\)" xml:lang="\(.*\)" scheme=".*" \/>/<http:\/\/'$file2'> <\1> "\2"@\3./g' |
  sed 's/^<meta name="\(.*\)" content="\(.*\)" xml:lang="\(.*\)" \/>/<http:\/\/'$file2'> <\1> "\2"@\3./g' |
  sed 's/^<meta name="\(.*\)" content="\(.*\)" \/>/<http:\/\/'$file2'> <\1> "\2"./g' |
  sed 's/^<meta name="\(.*\)" content="\(.*\)" *scheme=".*" *\/>/<http:\/\/'$file2'> <\1> "\2"./g' |
  sed 's/DC\./http:\/\/purl.org\/dc\/elements\/1.1\//g' | \
  sed 's/citation_/http:\/\/data.linkedu.eu\/onto\//g' | \
  sed 's/scheme=".*"//g' | \
  sed 's/DCTERMS\./http:\/\/purl.org\/dc\/terms\//g'
done

The results are of course a bit crude and could be improved in a lot of different ways. For example, we could try to create a URI for each keyword and create a triple for each of them (see the sketch at the end of this post)… but that's a start. A sample result is shown below:

<http://dspace.ou.nl/handle//1820/1141> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/creator> "Kalz, Marco".
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/terms/dateAccepted> "2007-12-18T11:49:14Z" .
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/terms/available> "2007-12-18T11:49:14Z" .
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/terms/issued> "2007-12-18T11:49:14Z" .
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/identifier> "http://hdl.handle.net/1820/1141" .
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/description> "Presentation provided during the conference "Networks, Communities & Learning: Show that you share!" organized by the Bazaar project and the Institute of Education of Utrecht University on the 14th of December.
To embed this presentation in your blog, please visit http://www.slideshare.net/mkalz"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/description> "The work on this publication has been sponsored by the TENCompetence Integrated Project that is funded by the European Commission's 6th Framework Programme, priority IST/Technology Enhanced Learning. Contract 027087 [http://www.tencompetence.org]"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/terms/extent> "182226 bytes".
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/format> "application/pdf".
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/language> "en"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/subject> "open educational resources"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/subject> "authoring"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/subject> "licensing"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/subject> "creativecommons"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/title> "Developing Open Educational Resources"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://purl.org/dc/elements/1.1/type> "Presentation"@en.
<http://dspace.ou.nl/handle//1820/1141> <http://data.linkedu.eu/onto/pdf_url> "http://dspace.ou.nl/bitstream/1820/1141/1/oer_authoring_kalz_bazaar07.pdf".
<http://dspace.ou.nl/handle//1820/1141> <http://data.linkedu.eu/onto/authors> "Kalz, Marco".
<http://dspace.ou.nl/handle//1820/1141> <http://data.linkedu.eu/onto/abstract_html_url> "http://dspace.ou.nl/handle/1820/1141".
<http://dspace.ou.nl/handle//1820/1141> <http://data.linkedu.eu/onto/language> "en".
<http://dspace.ou.nl/handle//1820/1141> <http://data.linkedu.eu/onto/title> "Developing Open Educational Resources".
<http://dspace.ou.nl/handle//1820/1141> <http://data.linkedu.eu/onto/keywords> "open educational resources; authoring; licensing; creativecommons; Presentation".
<http://dspace.ou.nl/handle//1820/1141> <http://data.linkedu.eu/onto/date> "2007-12-18T11:49:14Z".
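To illustrate that last improvement, a rough sketch is given below (in Python; both the per-keyword predicate and the keyword URI namespace are made up for the example). It post-processes the output of extract and adds one triple per keyword:

import re
import sys
import urllib.parse

# Assumptions for this example: the predicate holding the "; "-separated
# keywords (as in the sample output above) and a made-up namespace and
# predicate under which per-keyword URIs are minted.
KEYWORDS_PRED = "<http://data.linkedu.eu/onto/keywords>"
KEYWORD_PRED = "<http://data.linkedu.eu/onto/keyword>"
KEYWORD_NS = "http://data.linkedu.eu/onto/keyword/"

# Pass the crude N-Triples through, and add one extra triple per keyword
for line in sys.stdin:
    sys.stdout.write(line)
    match = re.match(r'^(<[^>]+>) ' + re.escape(KEYWORDS_PRED) + r' "(.*)"\s*\.?\s*$', line)
    if match:
        subject, keywords = match.group(1), match.group(2)
        for keyword in keywords.split(";"):
            uri = KEYWORD_NS + urllib.parse.quote(keyword.strip().replace(" ", "_"))
            print(f'{subject} {KEYWORD_PRED} <{uri}> .')

It would simply be added at the end of the pipe, for example as ./find-files dspace.ou.nl/handle | ./extract | python3 keyword-uris.py > results.nt (keyword-uris.py being whatever we call the script above).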

In the two competitions held within the LinkedUp project, several interesting applications were submitted. Apart from critically reviewing these applications for their ingenuity and usefulness, we also analysed them from an ethical perspective. What do we mean by this? The linked-data applications were scrutinised for their copyright and privacy compliance. In this blog post we describe our findings primarily from a copyright perspective, more specifically on an important and often neglected topic called 'attribution'. Our analysis showed that developers paid little or no attention to attribution, not necessarily deliberately but mostly out of a lack of awareness of the subject. In this post, we first explain what we mean by attribution, then describe its components and its usefulness to developers, and finally we highlight what developers can and should do to attribute their sources properly.

While the Linked Data applications were submitted either as Web or mobile software systems, they all exploited one or several external data sources. These data sources were of multiple media types: for example, images of artworks from museums, videos of course lectures, map data, or medical data from public databases. However, making use of such multi-media external data sources often requires careful consideration of the legal constraints attached to each data source. Before we discuss specific issues, we must understand the legal concepts that underpin them.

Attribution – Author’s right to be credited

Copyright is a form of protection provided to the authors of 'original works of authorship', including literary, dramatic, musical, artistic, and certain other intellectual works, both published and unpublished (see The United States Patent and Trademark Office, General Information Concerning Patents). Attribution is 'the act of establishing a particular person as the creator of a work of art'. An attribution statement identifies the name of the creator (among other details), acknowledges the source 'appropriately', and is attached to a software application. When attribution statements can be identified and accessed easily, it is more likely that others will want to reuse the data source.

Attribution is very similar to how research papers are cited, where authors/creators are given their due credit, especially when their creation involves time and resources. Thus, attribution is not just an author’s right but it is also the ‘right’ thing to do; software engineers should acknowledge their data sources properly.

Several popular licences which cover open data, such as the Open Government Licence (OGL), Open Data Commons (ODC) and later versions of Creative Commons (CC), have attribution as one of their key conditions. Although their attribution formats are not identical, they contain similar information fields. Bespoke licences may or may not require attribution, and this information will be found under the 'terms and conditions' or 'copyright' notice relating to the data source.

Attribution fields

In the LinkedUp project competitions, we found that the software systems made use of data of various types: text, photos, audio, videos and databases. This is important because the format of attribution depends on the type of media. For example, a database may be attributed differently compared to a photo. In addition to varying according to the media type, attribution also varies according to the licence attached to the data source. For example, CC attribution differs from OGL attribution.

We now look at data fields that are relevant for proper attribution. Here, we combine guidance taken from CC and OGL to produce a uniform list of attribution fields.

Title - What is the name of the material? If the data source has a title, this should be included in the attribution statement. If a title is not provided, there is no obligation to fill this field.

Author - Who owns the material? This field captures the name of the author or authors of the data source. Sometimes, the author/licensor may want you to give credit to some other entity, like a company or a pseudonym. For public sector data released under the OGL in the UK, this field refers to the department/institution which produced the data. In some exceptional cases, the licensor or author may not want to be attributed at all. In any case, the attribution requirements specified by the author should be met.

Source - Where can I find it? Provide access details for the data source, so that others can also use it. This is usually a URL or hyperlink to where the data resides.

Year - When was it published? The year in which the data source was published; this is particularly important when attributing data sources from public sector organisations in the UK.

License - How can I use it? Make a note of the type of license that is attached to the use of the data source, along with any additional information included by the author/licensor. It is also recommended to provide a link to the full text of the license. If a data source comes with any copyright notices, then they should also be attached. Here, a notice refers to a disclaimer of warranties, or a notice of previous modifications, which may be quite important to potential users of the data source. Regarding modifications, it is important to record any modifications you may have carried out on the data source and cite them accordingly.

Attribution statements

Now that we have described the most important attribution fields, we look at how they may be used in actually making an attribution statement. While the content of attribution statements generally does not vary, their arrangement and format can vary depending on the media and the licence used. In this section we provide examples of attribution formats from the CC, OGL and ODC licensing schemes, which cover most open data sources.

CC uses a straightforward attribution statement; in an abstracted form, it has the following format:

“<Title with source URL>” by <Author, linked to profile page> is licensed under <license type linked to license deed>

An example of this attribution for a photo is shown below.

“Creative Commons 10th Birthday Celebration San Francisco” by tvol is licensed under CC BY 2.0

This is a proper attribution because it has the following attribution fields:

Title: “Creative Commons 10th Birthday Celebration San Francisco”
Author: “tvol” – linked to his profile page
Source: “Creative Commons 10th Birthday Celebration San Francisco” – linked to original Flickr page
License: “CC BY 2.0” – linked to license deed

Modified or derived data

It may sometimes be necessary to modify the original work to create new derivatives. In such cases, the nature of the modifications should be explicitly stated when making an attribution. A suggested format for attributions covering derived or modified work would be:

This work, “<Title of modified work with source URL>”, is a derivative of “<Title of original work with source URL>” by <Author of original work, linked to profile page>, used under <original license type linked to license deed>. “<Title of modified work with source URL>” is licensed under <license type linked to license deed> by <Author of modified work, linked to profile page>

 Given below is an example of an attribution statement for a derived work of the earlier example:

This work, "90fied", is a derivative of "Creative Commons 10th Birthday Celebration San Francisco" by tvol, used under CC BY. "90fied" is licensed under CC BY by Alice and Bob.

 

Note that this attribution contains fields for both the original work and the derived work.

Attribution for multiple sources

When an application uses multiple data sources which are licensed under heterogeneous licensing schemes, each part of the application must individually attribute the original work and its associated licence, as shown below. Assume a software system contains two sub-systems, 1 and 2, which use two separate data sources; their attribution statements could then be:

Sub-system 1 (under terms of use)
This <title of sub-system> uses <title of source> which is licensed under <license type linked to license deed> by <Author linked to profile page>
Sub–system 2 (under terms of use)
This <title of sub-system> uses <title of source> which is licensed under <license type linked to license deed> by <Author linked to profile page>

As shown above, each sub-system individually attributes its data source and the associated licence. If, however, a single software system uses multiple data sources, authored by the same entity and covered by the same licence, then the format can be much simpler, as shown below:

This <title of work> uses:
<title of source -1 > which is licensed under <license type linked to license deed> by <Author linked to profile page>
<title of source -2 > which is licensed under <license type linked to license deed> by <Author linked to profile page>
.
.
<title of source -n> which is licensed under <license type linked to license deed> by <Author linked to profile page>

Datasets from public sector bodies in the UK

In the UK, thousands of public sector datasets have been released under the Open Government Licence (OGL). OGL is an open licensing model and tool for public sector bodies (in the UK) to license the re-use of their data easily. Use of datasets under the OGL is free and allows data to be used and re-used for commercial and/or non-commercial purposes. When using this data, an attribution statement using the following format must be attached:

<Title>, <author department/organisation>, <year of publication>, <applicable copyright or database right notice>. This information is licensed under the terms of the Open Government Licence [http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2].

If your application uses multiple sources under the OGL and it is impractical to attribute them individually, then the following attribution statement can be used:

Contains public sector information licensed under the Open Government Licence v2.0.

However, OGL recommends maintaining a record or list of sources and attributions in another file or location, if it is not practical to include these prominently within your product.

Attribution within media

Generally, attribution statements can be found in ‘Terms of use’ or ‘About us’ pages on a Web or mobile application. However, in some cases this may not be adequate. For example, when an application makes use of mapping services such as Google Maps, the attribution in such cases must appear on the medium or content itself as shown below:

google-map-attribution

Although Google automatically generates such attribution information for its mapping service, users of the service are discouraged from disabling this feature. Others, such as Foursquare, recommend a similar approach when using their data in mobile applications, and it might be a good practice to adopt.

 

 

With the growing amount of Open Educational Resources available online, finding the right resource for a user depending on their interests is becoming a challenge. The Open University (UK) was facing this challenge internally for its thousands of extracts from course material, podcasts and other multimedia resources, for which basic keyword-based search did not work properly. Part of the reason is that the user's need can rarely be expressed as a set of keywords that would match the right resources. DiscOU therefore started with the goal of automatically answering some of the common requests received from prospective students, including "I've seen this programme yesterday on the BBC. It's cool! What can I learn about?"

So this is exactly what DiscOU does. Starting from a BBC programme page (such as this programme) or a page of a programme on iPlayer, a small bookmarklet basically adds on top of the page a list of 10 pieces of open content (audio, video, text) from the Open University that are about the topics covered by the programme.

DiscOU screenshot

This is achieved first thanks to the fact that both the Open University and the BBC expose their content as Linked Data: the information about the programme can be obtained directly from the page, and information about the available open educational content can be indexed from the Open University's linked data platform.
DiscOU topic selection

But the recommendation itself also uses linked data. Indeed, simply relying on the description of the programme would not give very meaningful results. Here, we use DBpedia Spotlight to connect the programmes to the topics they cover in DBpedia, characterising each of them with a topic profile where each topic is a Linked Data URI. We also index open educational resources from the Open University using the same process. Recommending resources therefore becomes a task of connecting the programme's profile to the profiles of open content covering similar topics.
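The exact matching machinery behind DiscOU is not detailed here; as a rough illustration of the profile-matching idea only (cosine similarity over weighted topic profiles is an assumed, deliberately simple measure, not the actual DiscOU implementation), a sketch could look like this:

from math import sqrt

# A topic profile maps Linked Data topic URIs (e.g. DBpedia resources) to weights.
def cosine(profile_a, profile_b):
    common = set(profile_a) & set(profile_b)
    dot = sum(profile_a[t] * profile_b[t] for t in common)
    norm_a = sqrt(sum(w * w for w in profile_a.values()))
    norm_b = sqrt(sum(w * w for w in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(programme_profile, resource_profiles, n=10):
    # resource_profiles: {resource_uri: topic_profile}
    scored = [(cosine(programme_profile, p), uri) for uri, p in resource_profiles.items()]
    return [uri for score, uri in sorted(scored, reverse=True)[:n] if score > 0]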

One of the key advantages of this approach is that the "query" for content can be customised. Clicking on the small "gear" on the left opens a panel showing the detected profile of the programme, where the user can decide which of the topics are more meaningful to them than others, updating the recommendation accordingly.

Another advantage of the approach is that, being based on standard linked data technologies, it can be adapted to all sorts of other resources and starting points. DiscOU Alfa can already achieve the same results, taking a piece of text as a starting point. Other versions have also been made that work on closed educational material, for the purpose of supporting course creation from legacy internal resources.

 

In a previous post, Besnik explained how various techniques were used to extract the most important topics covered by a dataset. The results of this process have now been made available in a SPARQL endpoint for many datasets of the Linked Data Cloud, including the ones in the LinkedUp Catalogue. Here, we quickly describe how we can use this SPARQL endpoint and a bit of PHP/HTML/CSS to create a visualisation of these topics as a 'topic cloud'. The same code is used in the interface of the LinkedUp catalogue, for each endpoint (see for example the one of data.open.ac.uk).

Topic cloud of data.open.ac.uk
Topic cloud of data.open.ac.uk in the LinkedUp catalogue

The first thing to think about here is how to get a list of topics from the SPARQL endpoint containing them (i.e., http://meco.l3s.uni-hannover.de:8890/sparql), together with the associated score reflecting the strength/relevance of each topic for each dataset. The following query returns the categories and scores of the 60 most relevant topics of data.open.ac.uk.

SELECT DISTINCT ?category ?score WHERE { 
GRAPH <http://data-observatory.org/lod-profiles/linked-education-profile> {
  ?dataset owl:sameAs <http://data.linkededucation.org/linkedup/dataset/data-open-ac-uk>.
  ?linkset void:target ?dataset.
  ?linkset vol:hasLink ?link.
  ?link vol:linksResource ?category.
  ?link vol:hasScore ?score.
}} ORDER BY DESC(?score) LIMIT 60

Now, let's get into PHP. First, we will not detail here the way to run the query (with SPARQL, everything is just an HTTP request anyway; see the sketch below) and assume that the $data variable contains the bindings of the SPARQL query results, structured as PHP objects.
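For completeness, a minimal sketch of such a request is shown here in Python rather than PHP, purely for illustration; it assumes the endpoint accepts a format=json parameter and understands the prefixes used in the query above:

import requests

ENDPOINT = "http://meco.l3s.uni-hannover.de:8890/sparql"
QUERY = """
SELECT DISTINCT ?category ?score WHERE {
GRAPH <http://data-observatory.org/lod-profiles/linked-education-profile> {
  ?dataset owl:sameAs <http://data.linkededucation.org/linkedup/dataset/data-open-ac-uk>.
  ?linkset void:target ?dataset.
  ?linkset vol:hasLink ?link.
  ?link vol:linksResource ?category.
  ?link vol:hasScore ?score.
}} ORDER BY DESC(?score) LIMIT 60
"""

# Ask the endpoint for JSON results; the bindings play the role of $data below:
# data[i]["category"]["value"] and data[i]["score"]["value"] correspond to
# $b->category->value and $b->score->value in the PHP code
response = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"})
data = response.json()["results"]["bindings"]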

We now want to display these results with different sizes and colours depending on the score. We decide to use eleven different possible combinations of sizes and colours, one for each discretised score between 0 and 10. We don't need to fix these now, but only to organise the topics into these sets, which will then be styled with CSS. The first thing to do is therefore to normalise the scores and discretise them into numbers from 0 to 10.

$maxscore = 0;
$ar = array();
// find biggest score and create array
foreach($data as $b){
  $ar[$b->category->value] = $b->score->value;
  if ($b->score->value > $maxscore) $maxscore = $b->score->value;
}
// reduce scores in array into numbers from 0 to 10
foreach ($ar as $cat=>$s){
  $ar[$cat] = round((($s*10)/$maxscore));
}

The result of this is an array ($ar) which associates each topic with a normalised, discrete score. We then reorder it alphabetically by topic (just a cosmetic choice):

ksort($ar);

As quickly mentioned above, the idea is that we can then display each topic with a different style depending on the normalised score. Here we use basic HTML/CSS; i.e., we display each topic as an HTML element associated with a class that corresponds to its score: the class name is 'tcloudcat' concatenated with the score's number between 0 and 10.

echo '<div class="tcloud">';
foreach($ar as $cat=>$score){
  // decode the category name for display
  $fcat = urldecode($cat);
  echo '<span class="tclouditem tcloudcat'.$score.'">'.
     $fcat.'</span><span class="tcloudsep"> </span> ';
}
echo '</div>';

What this code does is generate a div element (tcloud) containing a set of span elements with classes such as tcloudcat2 or tcloudcat8. The only thing left to do is to include styling information for all these classes in the CSS of the page.

.tcloud{
  padding: 10px 10px 10px 10px;
  text-align: center;
}
.tcloudcat10{
  font-size: 150%;
  color: #000;
}
.tcloudcat9{
  font-size: 140%;
  color: #000;
}
.tcloudcat8{
  font-size: 130%;
  color: #000;
}
.tcloudcat7{
  font-size: 120%;
  color: #000;
}
.tcloudcat6{
  font-size: 110%;
  color: #222;
}
.tcloudcat5{
  font-size: 100%;
  color: #444;
}
.tcloudcat4{
  font-size: 90%;
  color: #666;
}
.tcloudcat3{
  font-size: 80%;
  color: #888;
}
.tcloudcat2{
  font-size: 70%;
  color: #aaa;
}
.tcloudcat1{
  font-size: 60%;
  color: #aaa;
}
.tcloudcat0{
  font-size: 50%;
  color: #aaa;
}

This, very simply, makes topics with high scores bigger and darker, and topics with lower scores smaller and lighter, generating the topic cloud shown in the picture above.

The LinkedUp Data Catalogue is currently expanding a lot. One of the datasets I am personally quite excited about is the Key Information Set about UK Universities, as collected and made available by Unistats. Indeed, this gives you, as open data, information about what students have done after a certain degree at a certain university: whether they went on to further studies, what sort of jobs they got, etc. This has very strong potential, especially for the PathFinder track of the LinkedUp Vidi competition.

However, the data is currently available only as a set of XML files which you need to download (zipped) and process yourself. In other words, writing an application with this data, even if it is open, would be a major pain. Well, not anymore! We have transformed this data into RDF/Linked Data and created a SPARQL endpoint for it.

And, just to make my point clear that this makes building things on top of the data much easier, I wrote a small application that does something simple with it: tell it the kind of job you want to do, and it tells you what degrees at what universities tend to lead to this kind of job. You start by typing something (e.g. "tech", "health", "sci") and it will auto-complete into the jobs that the dataset knows about; once one is selected, it gives you a list of degrees (with links), including the percentage of students who went into employment and took up this type of job.

kis-app

It is not the most sophisticated app ever, but my point here is: look at the source of the page, it is less than 100 lines of HTML/JavaScript! That's it… nothing else. The autocomplete feature is based on jQuery UI, which is fed with the results of a very simple SPARQL query to get the URIs and labels of jobs (see the sketch further below). Once one is selected (say, "Protective service officers", i.e. http://data.linkedu.eu/kis/job/117), then it only takes the following simple SPARQL query to get what we need out of the data, already ordered, ready to be displayed:

select distinct ?course ?label ?link ?perc ?uni where {
  ?o <http://purl.org/linked-data/cube#dataSet> <http://data.linkedu.eu/kis/dataset/commonJobs>.
  ?o <http://data.linkedu.eu/kis/ontology/job> <http://data.linkedu.eu/kis/job/117>.
  ?o <http://data.linkedu.eu/kis/ontology/course> ?course.
  ?course <http://purl.org/dc/terms/title> ?label.
  ?course <http://data.linkedu.eu/kis/ontology/courseUrl> ?link.
  ?o <http://data.linkedu.eu/kis/ontology/percentage> ?perc.
  ?course <http://courseware.rkbexplorer.com/ontologies/courseware#taught-at> ?i.
  ?i <http://www.w3.org/2000/01/rdf-schema#label> ?uni.
  filter ( ?perc > 0 )
} order by desc(?perc)

(if you don’t believe me that this is simple, just look at it a bit more, there is no sophistication to this).
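For reference, the query feeding the autocomplete can be equally simple. The sketch below is server-side Python purely for illustration (in the real app this happens client-side in JavaScript); the endpoint URL placeholder and the use of rdfs:label on jobs are assumptions:

import requests

ENDPOINT = "http://example.org/kis/sparql"  # placeholder, not the real endpoint URL
QUERY = """
select distinct ?job ?label where {
  ?o <http://data.linkedu.eu/kis/ontology/job> ?job .
  ?job <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
"""

response = requests.get(ENDPOINT, params={"query": QUERY},
                        headers={"Accept": "application/sparql-results+json"})
bindings = response.json()["results"]["bindings"]

# jQuery UI autocomplete accepts a list of {label, value} objects as its source
jobs = [{"label": b["label"]["value"], "value": b["job"]["value"]} for b in bindings]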

A bit of HTML/CSS and JavaScript to show the whole thing, and "voila": an application. It took me about 2 hours to write. It requires almost no resources (most of the work is done client-side and by our SPARQL endpoint). This is garage-coding. So if that's the kind of thing I can do out of boredom on a Wednesday evening when there was nothing on TV, imagine what you could do with that kind of data… and a whole web of other data!

This blog is all about showing bits and pieces of code, processes and applications that demonstrate how to use linked data in education. In addition to these bits and pieces, several initiatives, including the LinkedUp project, the EUCLID project and the SSSW summer school, have recently published general resources providing background information about linked data technologies and their use. We have therefore started collecting such resources on the LinkedUp Devtalk Resource Page, as references for us to rely on when discussing and demonstrating specific technical issues.

We hope you will find these useful, and of course, please let us know if something is missing ;-)

As part of the current activities of the LinkedUp project, we have shown in previous work ways of automatically generating dataset profiles showing the most prominent topics covered. The first tool is the dataset explorer (see [1] for more details), an interactive user interface from which datasets can be queried based on particular topics they cover. Furthermore, the underlying data in the dataset explorer is extracted from the automatically generated metadata about dataset profiles, accessible via the SPARQL endpoint.

In addition, the generated dataset profiles make the process of finding datasets of interest very easy, by simply issuing SPARQL queries to the respective endpoint. As an example, we show below a query listing all dataset names that cover the topic Technology:

SELECT ?datasetname ?link ?score
WHERE
{
?dataset a void:Dataset.
?dataset a void:Linkset.
?dataset void:target ?datasettarget.
?datasettarget dcterms:title ?datasetname.
?dataset vol:hasLink ?link.
?link vol:linksResource <http://dbpedia.org/resource/Category:Technology>.
?link vol:hasScore ?score.

FILTER (?score > 0.5)
} LIMIT 10

In the remainder of this blog post, we describe in detail the individual steps of automatically generating the dataset profiles, a task which proves to be cumbersome to carry out manually due to the large number of resource instances the datasets contain.

In our case, we generate dataset profiles focusing on the linked-education group in DataHub, with profiles containing detailed information about the topics covered and how representative they actually are for a given dataset. A topic is a DBpedia category, i.e. a value assigned to DBpedia entities through the dcterms:subject property from the Dublin Core metadata terms. Hence, in order to obtain the topics, we first need to perform named entity recognition on textual resources for a subset of instances from a specific dataset.

An important step is capturing such extracted topics and the details of the generated links, describing in detail how each link is established between a dataset and a topic (DBpedia category). For this purpose, we developed a general-purpose vocabulary, the Vocabulary of Links (VoL), and in combination with the Vocabulary of Interlinked Datasets (VoID) we capture the generated profile as follows.

A link between a dataset and a topic is captured through the vol:hasLink property, where the link is represented as a resource of type vol:Link, with vol:hasScore describing how representative the topic is for the dataset, while vol:linksResource refers to the actual DBpedia category.

Finally, the set of constructed links is captured using void:Linkset, which defines the set of links connecting two datasets, in our case DBpedia (or rather its categories) with a dataset from the linked-education group.

The details of the individual steps to obtain the dataset profiles are presented below; they require a basic understanding of Java or JavaScript, which are used to perform HTTP requests to the different web services and APIs.

The main steps are as follows:

  1. Dataset metadata extraction from DataHub
  2. Extraction of resource types information
  3. Indexing of resources
  4. Annotation of indexed resources with structured information

Using the public CKAN API with a dataset id as given in DataHub, the RESTful API of CKAN returns, in JSON format, the metadata about the dataset. In this post we will use the "lak-dataset" as our example.

HttpPost post = new HttpPost("http://datahub.io/api/action/package_show");
post.setHeader("X-CKAN-API-Key", "YOUR_CKAN_API_KEY");
StringEntity input = new StringEntity("{\"id\":\"lak-dataset\"}");
input.setContentType("application/json");
post.setEntity(input);

After issuing the POST request (for which it is necessary to have a CKAN API key), the result is retrieved in JSON format and can be used to access further information, like the URL of the SPARQL endpoint: http://data.linkededucation.org/request/lak-conference/sparql?query=
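For readers not using Java, the same request and the subsequent lookup of the endpoint URL can be sketched as follows (the assumption here being that the SPARQL endpoint appears among the dataset's resources in the CKAN metadata):

import requests

# Ask the CKAN API on DataHub for the metadata of the lak-dataset
response = requests.post("http://datahub.io/api/action/package_show",
                         json={"id": "lak-dataset"},
                         headers={"X-CKAN-API-Key": "YOUR_CKAN_API_KEY"})
metadata = response.json()["result"]

# Look for a resource whose format or URL suggests a SPARQL endpoint
endpoints = [r["url"] for r in metadata.get("resources", [])
             if "sparql" in (r.get("format", "") + r.get("url", "")).lower()]
print(endpoints)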

The second step is necessary in order to continue with the analysis of the topics covered in a dataset; it starts by extracting the resource types with the following SPARQL query:

SELECT DISTINCT ?type
WHERE
{
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type
}

With respect to the resource types, an extra step can be taken to retrieve the number of instances that exist for the different resource types. For instance, assume we want to know the number of resource instances of type "http://swrc.ontoware.org/ontology#InProceedings" (publications in conference proceedings, 315 in our case) in the "lak-dataset"; this number is extracted via the following SPARQL query:

SELECT (COUNT (DISTINCT ?x) as ?count)
WHERE
{
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swrc.ontoware.org/ontology#InProceedings>
}

The third step builds upon the results from the previous steps. From the "lak-dataset", which represents conference publications, we analyse the resource content by indexing a specific number of instances (100 in this example) from the extracted resource instance URIs. The SPARQL query for indexing the instances is the following:

SELECT ?resource ?property ?value
WHERE
{
?resource <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swrc.ontoware.org/ontology#InProceedings>.
?resource ?property ?value
}

The fourth and final step analyses the indexed resources by extracting the literal values assigned to datatype properties. For example, a particular resource instance, http://data.linkededucation.org/resource/lak/conference/edm2012/paper/45, has the following properties we are interested in: dcterms:subject, dcterms:title, swrc:abstract, led:body. From the extracted values we perform Named Entity Recognition using DBpedia Spotlight, to extract matching entities from the DBpedia knowledge base, by issuing an HTTP POST request to its web service:

http://spotlight.dbpedia.org/rest/annotate/?confidence=0.25&support=20&text=TEXT

The service returns the matching DBpedia entities, for example:

http://dbpedia.org/resource/Data_mining
http://dbpedia.org/resource/Learning
http://dbpedia.org/resource/Education
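A minimal sketch of such an annotation request, assuming the public Spotlight service above is reachable and asked for JSON, could be the following; it extracts the entity URIs from the response:

import requests

TEXT = "Educational data mining applies data mining techniques to learning data."

# Call the DBpedia Spotlight annotation service mentioned above; Spotlight
# reports the matched entities in the "Resources" list, under "@URI"
response = requests.post("http://spotlight.dbpedia.org/rest/annotate/",
                         data={"text": TEXT, "confidence": 0.25, "support": 20},
                         headers={"Accept": "application/json"})
entities = [r["@URI"] for r in response.json().get("Resources", [])]
print(entities)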

In addition, from the extracted entities we retrieve information about the topics covered by each entity, and consequently by the resource instance from which the entity was extracted. The topics covered by an entity are taken from its dcterms:subject property, which contains DBpedia categories; examples of extracted categories are:

http://dbpedia.org/resource/Category:Systems_science
http://dbpedia.org/resource/Category:Intelligence
http://dbpedia.org/resource/Category:Learning
http://dbpedia.org/resource/Category:Educational_psychology

Furthermore, since we want a broader representation of the covered topics, we leverage the hierarchical organisation of DBpedia categories through the skos:broader property, and from the categories directly associated with an entity we expand to additional ones. For instance, for the category http://dbpedia.org/resource/Category:Learning, going up to 4 levels in the category hierarchy, the following extra categories can be extracted (a sketch of this expansion is shown after the list):

L1: http://dbpedia.org/resource/Category:Behavior
L2: http://dbpedia.org/resource/Category:Sociobiology
L3: http://dbpedia.org/resource/Category:Subfields_and_areas_of_study_related_to_evolutionary_biology
L4: http://dbpedia.org/resource/Category:Evolutionary_biology
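The sketch below queries the public DBpedia SPARQL endpoint level by level; the endpoint URL and the use of a plain skos:broader lookup are the only assumptions made here:

import requests

DBPEDIA = "http://dbpedia.org/sparql"

def broader_categories(category_uri, levels=4):
    """Walk up the DBpedia category hierarchy via skos:broader, one level at a time."""
    current, collected = {category_uri}, set()
    for _ in range(levels):
        next_level = set()
        for cat in current:
            query = ("select ?broader where { <%s> "
                     "<http://www.w3.org/2004/02/skos/core#broader> ?broader }" % cat)
            results = requests.get(DBPEDIA, params={"query": query, "format": "json"}).json()
            next_level |= {b["broader"]["value"] for b in results["results"]["bindings"]}
        collected |= next_level
        current = next_level
    return collected

print(broader_categories("http://dbpedia.org/resource/Category:Learning"))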

Finally, after completing the four steps, we measure which topic is covered the most by computing a normalisation score. It simply counts the number of associations a topic has in a dataset or, in case more datasets are analysed, takes that into account by checking which topics are most representative and distinctive for a dataset. The output produced by these steps is returned in JSON format, which can be further used to generate, for instance, VoID metadata, or for additional analysis.
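In its simplest form (the plain counting variant mentioned above; the scoring actually used in [1] is more elaborate and also accounts for distinctiveness across datasets), the score can be sketched as:

from collections import Counter

def topic_scores(topic_associations):
    """topic_associations: a list of DBpedia category URIs, one per association
    found in the dataset. Returns a 0-1 score per topic, normalised by the
    count of the most frequent topic (single-dataset variant only)."""
    counts = Counter(topic_associations)
    if not counts:
        return {}
    top = max(counts.values())
    return {topic: count / top for topic, count in counts.items()}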

For more information and detailed reasoning about the individual steps, have a look at our paper [1]. Additionally, the whole procedure described above is available via a public web service offered by the LinkedUp project, returning the results in the JSON format shown below. It can be accessed under the following link: http://data.linkededucation.org/AnnotateService-1.0/RestAnnotateService/RestAnnotateService/analysedatasetR?datasetid=lak-dataset&resourceno=20&append=false, with parameters datasetid (the DataHub dataset id), resourceno (the number of resource instances to be analysed), and append (whether the previous results should be overwritten or built upon).

{
  "Datasets": [{
    "Dataset": {
      "URI": "",
      "Name": "",
      "Description": "",
      "Annotations": {
        "Entities": [{
          "URI": "",
          "Resources": [{ "Resource": "", "Frequency": "", "Categories": [{ "Category": "" }] }],
          "OverallFrequency": ""
        }]
      }
    }
  }],
  "Categories": [{ "Category": { "Level": "", "ParentCategory": "", "URI": "" } }]
}

As quickly mentioned in our previous post on the LinkedUp Catalogue of Datasets (a.k.a. the "Linked Education Cloud"), this catalogue is a tiny bit more than a list of pointers to datasets. It integrates with the Datahub, but more importantly, it includes a SPARQL endpoint with a rich description of the datasets and of their content, including mappings between the classes of objects represented in each dataset.

In this post, we show in more detail how this representation can be used to find datasets that contain information about a particular type of thing, and how we can actually find these things through SPARQL query federation. That sounds scary… but really, it is quite simple once you get the idea that all the LinkedUp Catalogue does is give you a meta-description of the datasets, which can be accessed and queried in the same way as the datasets themselves.

So, here we are going to use the example of schools – i.e., we are going to write a query that returns all the schools in all the datasets of the catalogue. In the LinkedUp catalogue, the chosen type for schools is the aiiso:School class. That means that every dataset either uses this class, or there will be a mapping between the class it uses for schools and this one.

The first thing to do is therefore to find all the SPARQL endpoints in the LinkedUp catalogue that have objects of this class. The VoID representation of the LinkedUp catalogue is a set of "datasets", which might have subsets. A particular type of sub-dataset is called a class partition, and represents the sub-part of the dataset that concerns a certain type of objects (i.e., a certain class). So we can start with the query:

prefix void: <http://rdfs.org/ns/void#> 
prefix aiiso: <http://purl.org/vocab/aiiso/schema#>

select distinct ?endpoint where {
   ?ds void:sparqlEndpoint ?endpoint.
   ?ds void:classPartition [void:class aiiso:School] 
}

If tried on the LinkedUp Catalogue’s SPARQL endpoint, that should give us the URIs of all the SPARQL endpoints that contain objects of the type aiiso:School… However, right now, it returns nothing. That’s because the objects of this class might be in a subset of a dataset attached to a SPARQL endpoint. So, basically we need to also look at subsets:

prefix void: <http://rdfs.org/ns/void#>
prefix aiiso: <http://purl.org/vocab/aiiso/schema#>

select distinct ?endpoint where {
   ?ds void:sparqlEndpoint ?endpoint.
   {{?ds void:classPartition [void:class aiiso:School]}
       union
   {?ds void:subset [void:classPartition [void:class aiiso:School]]}}
}

The union operator can be seen a bit like an OR. So this new query asks for endpoints that have objects of the type aiiso:School, or that have subsets with objects of the type aiiso:School… and nicely enough, it now returns something: two URIs of endpoints.

Now, the thing is, not all endpoints in the catalogue represent schools as instances of the class aiiso:School. As part of building the LinkedUp catalogue, we also create class mappings – i.e., relationships indicating whether a class found in a dataset is equivalent to, or a subclass of, another one. In order to find endpoints that talk about schools but might use other classes mapped to aiiso:School, we can therefore use the following query:

prefix void: <http://rdfs.org/ns/void#>
prefix aiiso: <http://purl.org/vocab/aiiso/schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?endpoint ?cl where {
   ?ds void:sparqlEndpoint ?endpoint.
   {{?ds void:classPartition [ void:class ?cl]} 
        UNION
   {?ds void:subset [ void:classPartition [ void:class ?cl] ]}}
   {{?cl owl:equivalentClass aiiso:School} 
        UNION
   {?cl rdfs:subClassOf aiiso:School}
        UNION 
   {FILTER ( str(?cl) = str(aiiso:School) ) }}
}

This query, with the additional UNION clause, asks for endpoints that contain (possibly in a subset) objects of a class which is either aiiso:School, a class equivalent to aiiso:School or a subclass of aiiso:School. The result we get now is five different endpoints, two of them using aiiso:School (the same as before), and three using different classes, sometimes more specific than aiiso:School.

{
  "head": {
    "vars": [ "endpoint" , "cl" ]
  } ,
  "results": {
    "bindings": [
      {
        "endpoint": { "type": "uri" , "value": "http://www.auth.gr/sparql" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://kent.zpr.fer.hr:8080/educationalProgram/sparql" } ,
        "cl": { "type": "uri" , "value": "http://kent.zpr.fer.hr:8080/educationalProgram/vocab/sisvu.rdf#Academy" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/TrainingSchool" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://data.linkedu.eu/hud/query" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://sparql.linkedopendata.it/scuole" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/net7/vocab/scuole/v1#Scuola" }
      }
    ]
  }
}

And that's where the magic of query federation can happen: we now have a list of SPARQL endpoints that talk about schools, with the classes they use to talk about schools. The "service" clause in SPARQL 1.1 can be used to delegate a sub-part of a query to an external/remote SPARQL endpoint. Its implementation is still very sketchy in many cases, and does not always work the way we want it to, but Fuseki, the triple store we use for the LinkedUp Catalogue, does a reasonably good job at it. Look closely at the three lines added at the end of the query (plus the additional variable in the select clause):

prefix void: <http://rdfs.org/ns/void#>
prefix aiiso: <http://purl.org/vocab/aiiso/schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?endpoint ?school ?cl where {
   ?ds void:sparqlEndpoint ?endpoint.
   {{?ds void:classPartition [ void:class ?cl]} 
       UNION
   {?ds void:subset [ void:classPartition [ void:class ?cl] ]}}
   {{?cl owl:equivalentClass aiiso:School} 
       UNION
   {?cl rdfs:subClassOf aiiso:School}
      UNION 
   {FILTER ( str(?cl) = str(aiiso:School) ) }}
   service silent ?endpoint {
      ?school a ?cl
   }
}

What these three lines actually say is “and now give me all the objects of these classes in the endpoints where they appear”. The result, if you try it, takes a bit of time to come back, but is quite impressive: a list of schools described in five different, completely independent datasets distributed over the web!

{
  "head": {
    "vars": [ "endpoint" , "school" , "cl" ]
  } ,
  "results": {
    "bindings": [
      {
        "endpoint": { "type": "uri" , "value": "http://www.auth.gr/sparql" } ,
        "school": { "type": "uri" , "value": "https://www.auth.gr/bio" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://www.auth.gr/sparql" } ,
        "school": { "type": "uri" , "value": "https://www.auth.gr/itl" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://www.auth.gr/sparql" } ,
        "school": { "type": "uri" , "value": "https://www.auth.gr/theo" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
     
      ...
   
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "school": { "type": "uri" , "value": "http://education.data.gov.uk/id/school/100869" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "school": { "type": "uri" , "value": "http://education.data.gov.uk/id/school/100868" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "school": { "type": "uri" , "value": "http://education.data.gov.uk/id/school/100867" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/School" }
      } ,

      ...

      {
        "endpoint": { "type": "uri" , "value": "http://sparql.linkedopendata.it/scuole" } ,
        "school": { "type": "uri" , "value": "http://data.linkedopendata.it/scuole/resource/CTTD00601P" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/net7/vocab/scuole/v1#Scuola" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://sparql.linkedopendata.it/scuole" } ,
        "school": { "type": "uri" , "value": "http://data.linkedopendata.it/scuole/resource/CTIS00600C" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/net7/vocab/scuole/v1#Scuola" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://sparql.linkedopendata.it/scuole" } ,
        "school": { "type": "uri" , "value": "http://data.linkedopendata.it/scuole/resource/CTTD01401N" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/net7/vocab/scuole/v1#Scuola" }
      } ,

      ...      

The goal of the LinkedUp Dataset Catalog (or Linked Education Cloud) is to collect and make available, ideally in an easily usable way, all sorts of data sources of relevance to education. The aim is not only to support participants in the LinkedUp Challenge in identifying and conjointly using Web Data in their applications, but also to be a general, evolving resource for the community interested in Web data for education. As we will see here, the LinkedUp Dataset Catalog is actually more than one thing, and can be used in more than one way.

A community group on CKAN/Datahub.io

The LinkedUp Dataset Catalog is first and foremost a registry of datasets. Datahub.io is probably one of the most popular and most used global catalogs of datasets, and is in particular at the basis of the Linked Open Data cloud. In the interest of integrating with other ongoing open data efforts, rather than developing ours in isolation, the LinkedUp Dataset Catalog is created as part of Datahub.io. It takes the form of a community group in which any dataset can be included. In other words, any dataset in Datahub.io can be included in our Linked Education Cloud group (as long as it is relevant), and the datasets in this group are also visible globally on the Datahub.io portal.

On this portal, every dataset is described with a set of basic metadata, with all sorts of resources attached to it. This makes it possible to search for datasets, including faceted browsing of the results, globally or specifically in the Linked Education Cloud. For example, one can search for the word "University" in the Linked Education Cloud and obtain datasets that explicitly mention "university" in their metadata. These results can be further reduced with filters, for example to include only the ones that provide an example resource in the RDF/XML format.

lec-ckan

One of the great advantages of relying on Datahub.io is that the catalog is not only accessible through the web portal, but also programmatically through the CKAN API. Thanks to this API, it is possible to build applications that search datasets, retrieve their metadata and obtain the associated resources automatically (see the sketch below).
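As an illustration, the sketch below searches the catalog programmatically; the group name "linked-education" is the one used on Datahub for the Linked Education Cloud, but treat it and the exact filter syntax as assumptions:

import requests

# Search the Datahub CKAN API for datasets mentioning "university"
# within the Linked Education Cloud group
response = requests.get("http://datahub.io/api/action/package_search",
                        params={"q": "university", "fq": "groups:linked-education"})
result = response.json()["result"]

print(result["count"], "datasets found")
for dataset in result["results"]:
    print(dataset["name"], "-", dataset.get("title", ""))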

A Linked Data based catalog

Going a step beyond the CKAN-based catalog above, the descriptions of the same set of data sources are also made available in a machine-readable format, following the principles of Linked Data. Here, we use the VoID vocabulary to describe the datasets, the data endpoints they use, the sub-datasets, as well as the types of data objects and relationships present in their content. This is represented in RDF, and made available using a dedicated SPARQL endpoint.

lec-sparql

Through this representation and the associated SPARQL endpoint, it is possible to query and find datasets using criteria that are more fine-grained than the ones of the CKAN API. For example, the query below returns (by default in a dedicated XML format, but others such as JSON are also supported) the list of data endpoints in the catalog that contain sub-datasets providing objects of the type foaf:Document, ranked according to the number of such sub-datasets in the endpoint.

select distinct ?endpoint (count(distinct ?d1) as ?count) where {
    ?dataset <http://rdfs.org/ns/void#sparqlEndpoint> ?endpoint.
    ?d1 <http://rdfs.org/ns/void#subset> ?dataset.
    ?d1 <http://rdfs.org/ns/void#classPartition> ?cp.
    ?cp <http://rdfs.org/ns/void#class> <http://xmlns.com/foaf/0.1/Document>
} group by ?endpoint order by desc(?count)

Other resources and interfaces

Besides the APIs and data endpoints, several other resources are provided to support the use of the LinkedUp Dataset Catalog. This blog is one of them, whose aim is to show how the different datasets can be used in various situations and with various tools. Another interface provides a way to browse through the datasets and the types of data objects they provide, in order to identify interesting ones. Also, mappings between these different types are included in the VoID description, so that different datasets can be more easily used together. Such a homogeneous view also helps us better understand what the datasets are about and what they can provide, as in the graph below showing common types of data objects in the catalog and how they co-occur in the currently included datasets (more details on this can be found in this paper).

This post is a summary of one of the activities of the LAK 2013 “Using Linked Data in Learning Analytics” tutorial.

OpenRefine (formerly known as Google Refine) is a rather simple but very convenient tool to manipulate data in a tabular format, providing features for filtering, creating facets, clustering values, reconciliation, etc. It makes it relatively easy to "play" with a bunch of data and obtain it in a form that is convenient for a specific purpose. Here, we show how we can easily load the results of a SPARQL query about UK schools into OpenRefine, to explore and view this data through simple analyses.

The first thing to know about OpenRefine is that, although it can import data from a large variety of formats, and the RDF extension allows one to, among other things, export the results in RDF, it cannot load RDF or the results of a SPARQL query straight away from the standard XML and RDF formats. It can however load simple tabular formats such as CSV from their Web URLs.

We therefore first employ a SPARQL proxy that allows us to execute a SPARQL query on any endpoint and obtain the results in a chosen format, in our case CSV. As shown in the figure below, we will use the endpoint of the education.data.gov.uk UK government dataset about education, http://education.data.gov.uk/sparql/education/query, and choose the CSV format to obtain the results of the following query:
select distinct ?school ?label ?status ?type ?cap where {
?school a <http://education.data.gov.uk/def/school/School>.
?school <http://www.w3.org/2000/01/rdf-schema#label> ?label.
?school <http://education.data.gov.uk/def/school/establishmentStatus> ?s.
?s <http://www.w3.org/2000/01/rdf-schema#label> ?status.
?school <http://education.data.gov.uk/def/school/typeOfEstablishment> ?t.
?t <http://www.w3.org/2000/01/rdf-schema#label> ?type.
?school <http://education.data.gov.uk/def/school/schoolCapacity> ?cap.
}

which gives basic information (name, status, type, capacity) about UK Schools.

SPARQL Proxy

Executing the query (clicking on the Query button) should display a simple text-based table in the CSV format. This is what we will load into OpenRefine. In OpenRefine, we can create a new project from an online data source, using the Create Project and Web Addresses (URLs) options, and copy-pasting the URL of the CSV results previously obtained from the SPARQL proxy into the dedicated field, as shown below.

Importing CSV from SPARQL proxy into open refine

Once that's done and Next has been clicked, some options related to the import will be shown. At this stage, it is important to choose the CSV option, to indicate to OpenRefine that the data values are separated by commas. The preview should show the data nicely organised in a table. Once ready, going ahead will present the results in a large table, ready to be analysed. We can for example create facets to browse through and filter open schools of a certain type, with a threshold on their capacity, as shown below.

Exploring the data in OpenRefine