With the growing amount of Open Educational Resources available online, finding the right resource for a given user's interests is becoming a challenge. The Open University (UK) was facing this challenge internally for its thousands of extracts from course material, podcasts and other multimedia resources, for which basic keyword-based search did not work properly. Part of the problem is that the user's need can rarely be expressed as a set of keywords that would match the right resources. DiscOU therefore started with the goal of automatically answering some of the common requests received from prospective students, such as: "I've seen this programme yesterday on the BBC. It's cool! What can I learn about it?"

So this is exactly what DiscOU does: starting from a BBC programme page (such as this programme) or a programme page on iPlayer, a small bookmarklet adds, on top of the page, a list of 10 pieces of open content (audio, video, text) from the Open University that are about the topics covered by the programme.

DiscOU screenshot

This is achieved first thanks to the fact that both the Open University and the BBC expose their content as Linked Data – the information about the programme can be obtained directly from the page, and information about the available open educational content can be indexed from the Open University's linked data platform.
DiscOU topic selection
But the recommendation itself also uses linked data. Indeed, simply relying on the description of the programme would not give very meaningful results. Here, we use DBpedia Spotlight to connect the programmes to the topics they cover in DBpedia, characterising each of them with a topic profile where each topic is a Linked Data URI. We also index the open educational resources from the Open University using the same process. Recommending resources therefore becomes a task of connecting the programme's profile to the profiles of open content covering similar topics.
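To give a rough idea of what this matching amounts to, here is a deliberately naive sketch of how it could be expressed as a single SPARQL query over indexed topic profiles: given a few topics detected in a programme, it counts how many of them each open resource shares and ranks the resources accordingly. This is only an illustration: the property linking a resource to its topics (ex:hasTopic) is hypothetical, and the actual DiscOU matching is more sophisticated than a plain overlap count.

PREFIX ex: <http://example.org/vocab#>

SELECT ?resource (COUNT(DISTINCT ?topic) AS ?shared) WHERE {
  # topics from the programme's profile (hypothetical examples)
  VALUES ?topic { <http://dbpedia.org/resource/Climate_change>
                  <http://dbpedia.org/resource/Renewable_energy> }
  # open educational resources indexed with the same kind of topic profile
  ?resource ex:hasTopic ?topic .
}
GROUP BY ?resource
ORDER BY DESC(?shared)
LIMIT 10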

One of the key advantages of this approach is that the "query" for content can be customised. Clicking on the small "gear" icon on the left opens a panel showing the detected profile of the programme, where the user can decide which topics are more meaningful to them than others, updating the recommendation accordingly.

Another advantage of the approach is that, being based on standard linked data technologies, it can be adapted to all sorts of other resources and starting points. DiscOU Alfa can already achieve the same results, but taking a piece of text as a starting point. Other versions have also been built that work on closed educational material, for the purpose of supporting course creation from legacy internal resources.

 

In a previous post, Besnik explained the various techniques used to extract the most important topics covered by a dataset. The results of this process have now been made available in a SPARQL endpoint for many datasets of the Linked Data Cloud, including the ones of the LinkedUp Catalogue. Here, we quickly describe how we can use this SPARQL endpoint and a bit of PHP/HTML/CSS to visualise these topics as a 'topic cloud'. This same code is used in the interface of the LinkedUp catalogue, for each endpoint (see, for example, the one for data.open.ac.uk).

Topic cloud of data.open.ac.uk
Topic cloud of data.open.ac.uk in the LinkedUp catalogue

The first thing to think about here is how to get a list of topics from the SPARQL endpoint containing them (i.e.,
http://meco.l3s.uni-hannover.de:8890/sparql), together with the associated score reflecting the strength/relevance of each topic for each dataset. The following query returns the topics (categories) and scores of the 60 most relevant topics of data.open.ac.uk.

SELECT DISTINCT ?category ?score WHERE { 
GRAPH <http://data-observatory.org/lod-profiles/linked-education-profile> {
  ?dataset owl:sameAs <http://data.linkededucation.org/linkedup/dataset/data-open-ac-uk>.
  ?linkset void:target ?dataset.
  ?linkset vol:hasLink ?link.
  ?link vol:linksResource ?category.
  ?link vol:hasScore ?score.
}} ORDER BY DESC(?score) LIMIT 60

Now, let's get into PHP. We will not detail here how to run the query (with SPARQL, everything is just an HTTP request anyway) and will assume that the $data variable contains the bindings of the SPARQL query results, structured as PHP objects.

We now want to display these results with different sizes and colours depending on the score. We decide to use 11 different combinations of sizes and colours, one for each discrete score from 0 to 10. We don't need to fix these now; we only need to organise the topics into these sets, which will then be styled with CSS. The first thing to do is therefore to normalise the scores and discretise them into numbers from 0 to 10.

$maxscore = 0;
$ar = array();
// find biggest score and create array
foreach($data as $b){
  $ar[$b->category->value] = $b->score->value;
  if ($b->score->value > $maxscore) $maxscore = $b->score->value;
}
// reduce scores in array into numbers from 0 to 10
foreach ($ar as $cat=>$s){
  $ar[$cat] = round((($s*10)/$maxscore));
}

The result of this is an array ($ar) which associates each topic with a normalised, discrete score. We then reorder it alphabetically by topic (just a cosmetic choice):

ksort($ar);

As quickly mentioned above, the idea is that we can then display each topic with a different style depending on the normalised score. Here we use basic HTML/CSS; i.e. we display each topic as an HTML element associated with a class that corresponds to its score: the class name is 'tcloudcat' concatenated with the score's number between 0 and 10.

echo '<div class="tcloud">';
foreach($ar as $cat=>$score){
  $fcat = urldecode($cat); // use the decoded topic value as the display text
  echo '<span class="tclouditem tcloudcat'.$score.'">'.
     $fcat.'</span><span class="tcloudsep"> </span> ';
}
echo '</div>';

What this code will do is generate a div element (tcloud) containing a set of span elements with classes such as tcloudcat2 or tcloudcat8. The only thing left to do is to include in the CSS of the page styling information for all these classes.

.tcloud{
  padding: 10px 10px 10px 10px;
  text-align: center;
}
.tcloudcat10{
  font-size: 150%;
  color: #000;
}
.tcloudcat9{
  font-size: 140%;
  color: #000;
}
.tcloudcat8{
  font-size: 130%;
  color: #000;
}
.tcloudcat7{
  font-size: 120%;
  color: #000;
}
.tcloudcat6{
  font-size: 110%;
  color: #222;
}
.tcloudcat5{
  font-size: 100%;
  color: #444;
}
.tcloudcat4{
  font-size: 90%;
  color: #666;
}
.tcloudcat3{
  font-size: 80%;
  color: #888;
}
.tcloudcat2{
  font-size: 70%;
  color: #aaa;
}
.tcloudcat1{
  font-size: 60%;
  color: #aaa;
}
.tcloudcat0{
  font-size: 50%;
  color: #aaa;
}

This, very simply, makes topics with high scores bigger and darker, and those with lower scores smaller and lighter, generating the topic cloud shown in the picture above.

The LinkedUp Data Catalogue is currently expanding a lot. One of the datasets I am personally quite excited about is the Key Information Set about UK universities, as collected and made available by Unistats. Indeed, this gives you, as open data, information about what students have done after a certain degree at a certain university: whether they went on to further studies, what sort of jobs they got, etc. This has very strong potential, especially for the PathFinder track of the LinkedUp Vidi competition.

However, the data is currently available only as a set of XML files which you need to download (zipped) and process yourself. In other words, writing an application with this data, even if it is open, would be a major pain. Well, not anymore! We have transformed this data into RDF/Linked Data and created a SPARQL endpoint for it.

And, just to make my point clear that this makes building stuff on top of the data much easier, I wrote a small application that does something simple with it: tell it the kind of job you want to do, and it tells you which degrees at which universities tend to lead to this kind of job. You start by typing something (e.g. "tech", "health", "sci") and it will auto-complete into the jobs that the dataset knows; once one is selected, it will give you a list of degrees (with links), including the percentage of students who have gone into employment and taken up this type of job.

kis-app

It is not the most sophisticated app ever, but my point here is, look at the source of the page: it is less than 100 lines of HTML/JavaScript! That's it… nothing else. The autocomplete feature is based on jQuery UI, which is fed with the results of a very simple SPARQL query to get the URIs and labels of jobs (a sketch of it is shown further below). Once one is selected (say, "Protective service officers", i.e. http://data.linkedu.eu/kis/job/117), then it only takes the following simple SPARQL query to get what we need out of the data, already ordered and ready to be displayed:

select distinct ?course ?label ?link ?perc ?uni where {
  ?o <http://purl.org/linked-data/cube#dataSet> <http://data.linkedu.eu/kis/dataset/commonJobs>.
  ?o <http://data.linkedu.eu/kis/ontology/job> <http://data.linkedu.eu/kis/job/117>.
  ?o <http://data.linkedu.eu/kis/ontology/course> ?course.
  ?course <http://purl.org/dc/terms/title> ?label.
  ?course <http://data.linkedu.eu/kis/ontology/courseUrl> ?link.
  ?o <http://data.linkedu.eu/kis/ontology/percentage> ?perc.
  ?course <http://courseware.rkbexplorer.com/ontologies/courseware#taught-at> ?i.
  ?i <http://www.w3.org/2000/01/rdf-schema#label> ?uni.
  filter ( ?perc > 0 )
} order by desc(?perc)

(if you don’t believe me that this is simple, just look at it a bit more, there is no sophistication to this).
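For reference, the query feeding the autocomplete box is even simpler: it only needs the URIs and labels of the jobs the dataset knows about. A sketch of it could look like the following; note that the job class URI used here is an assumption, while rdfs:label is used for job names the same way it is used for university names in the query above.

select distinct ?job ?joblabel where {
  ?job a <http://data.linkedu.eu/kis/ontology/Job>.
  ?job <http://www.w3.org/2000/01/rdf-schema#label> ?joblabel.
} order by ?joblabel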

A bit of HTML/CSS and JavaScript to show the whole thing, and "voilà": an application. It took me about 2 hours to write it. It requires almost no resources (most of it is done client-side and by our SPARQL endpoint). This is garage-coding. So if that's the kind of thing I can do out of boredom on a Wednesday evening when there was nothing on TV, imagine what you could do with that kind of data… and a whole web of other data!

This blog is all about showing bits and pieces of code, processes and applications that demonstrate how to use linked data in education. In addition to these bits and pieces, several initiatives, including the LinkedUp project, the EUCLID project and the SSSW summer school, have recently published general resources providing background information about linked data technologies and their use. We have therefore started collecting such resources on the LinkedUp Devtalk Resource Page, as references for us to rely on when discussing and demonstrating specific technical issues.

We hope you will find these useful, and of course, please let us know if something is missing ;-)

As part of the current activities of the LinkedUp project, we have shown in previous work ways of automatically generating dataset profiles showing the most prominent topics covered. The first tool is the dataset explorer (see [1] for more details), an interactive user interface from which datasets can be queried based on the particular topics they cover. Furthermore, the underlying data in the dataset explorer is extracted from the automatically generated metadata about dataset profiles, accessible via the SPARQL endpoint.

In addition, the generated dataset profiles make the process of finding datasets of interest very easy: it becomes a matter of issuing SPARQL queries to the respective endpoint. As an example, we show below a listing of all dataset names that cover the topic Technology:

SELECT ?datasetname ?link ?score
WHERE
{
?dataset a void:Dataset.
?dataset a void:Linkset.
?dataset void:target ?datasettarget.
?datasettarget dcterms:title ?datasetname.
?dataset vol:hasLink ?link.
?link vol:linksResource <http://dbpedia.org/resource/Category:Technology>.
?link vol:hasScore ?score.

FILTER (?score > 0.5)
} LIMIT 10

In the remainder of this blog post, we describe in detail the individual steps of automatically generating the dataset profiles, a task which proves cumbersome when done through manual assessment of datasets, due to the large number of resource instances they contain.

In our case we generate dataset profiles focusing on the linked-education group on DataHub, with profiles containing detailed information about the topics covered and how representative they actually are for a given dataset. A topic is a DBpedia category, i.e. a value assigned to DBpedia entities through the property dcterms:subject from the Dublin Core Metadata Terms. Hence, in order to obtain the topics, we first need to perform named entity recognition on the textual resources of a subset of instances from a specific dataset.

An important step is capturing such extracted topics, together with the details of the generated links, describing how each link is established between a dataset and a topic (DBpedia category). For this purpose, we developed a general-purpose vocabulary, the Vocabulary of Links (VoL), and in combination with the Vocabulary of Interlinked Datasets (VoID) we capture the generated profile as follows.

A link between a dataset and a topic is captured under the property vol:hasLink; the link itself is a resource of type vol:Link, with vol:hasScore describing how representative the topic is for the dataset, while vol:linksResource refers to the actual DBpedia category.

Finally, the set of constructed links is captured using void:Linkset, which defines the set of links connecting two datasets, in our case DBpedia (or rather, its categories) and a dataset from the linked-education group.
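To make this concrete, the profile captured this way can be read back from the SPARQL endpoint with a query of roughly the following shape, which simply walks the structure just described (a sketch along the lines of the queries shown earlier; as before, the void:, vol: and dcterms: prefixes are assumed to be known to the endpoint):

SELECT ?datasetname ?topic ?score
WHERE
{
?linkset a void:Linkset.
?linkset void:target ?dataset.
?dataset dcterms:title ?datasetname.
?linkset vol:hasLink ?link.
?link vol:linksResource ?topic.
?link vol:hasScore ?score.
} ORDER BY ?datasetname DESC(?score)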

The details of the individual steps for obtaining the dataset profiles are presented below; they require a basic understanding of Java or JavaScript, which is necessary to perform HTTP requests to the different web services and APIs.

The main steps are as follows:

  1. Dataset metadata extraction from DataHub
  2. Extraction of resource types information
  3. Indexing of resources
  4. Annotation of indexed resources with structured information

Using the public CKAN API with a dataset id as given on DataHub, the RESTful API of CKAN returns the metadata about the dataset in JSON format. In this post we will use the "lak-dataset" as our example.

// Apache HttpComponents (HttpClient 4.x) classes used below
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;

// POST request to the CKAN package_show action, asking for the "lak-dataset" metadata
HttpPost post = new HttpPost("http://datahub.io/api/action/package_show");
post.setHeader("X-CKAN-API-Key", "YOUR_CKAN_API_KEY");
StringEntity input = new StringEntity("{\"id\":\"lak-dataset\"}");
input.setContentType("application/json");
post.setEntity(input);

After issuing the POST request (for which it is necessary to have a CKAN API key), the result is retrieved in JSON format and can be used to access further information, like the URL of the SPARQL endpoint: http://data.linkededucation.org/request/lak-conference/sparql?query=

The second step is necessary in order to continue with the analysis of the topics covered by a dataset; as an initial sub-step, it extracts the resource types by issuing the following SPARQL query:

SELECT DISTINCT ?type
WHERE
{
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type
}

With respect to the resource types, an extra step can be taken to retrieve the number of instances that exist for the different resource types. For instance, assume we want to know the number of resource instances of type "http://swrc.ontoware.org/ontology#InProceedings" (publications in conference proceedings, 315 in our case) in the "lak-dataset"; this number is extracted via the following SPARQL query:

SELECT (COUNT (DISTINCT ?x) as ?count)
WHERE
{
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swrc.ontoware.org/ontology#InProceedings>
}

The third step builds upon the results of the previous steps. From the "lak-dataset", which represents conference publications, and from the extracted resource instance URIs, we analyse the resource content by indexing a specific number of instances, 100 in this example. The SPARQL query for retrieving the instances to index is the following:

SELECT ?resource ?property ?value
WHERE
{
?resource <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swrc.ontoware.org/ontology#InProceedings>.
?resource ?property ?value
}

The fourth and final step is analysing the indexed resources by extracting the literal values assigned to datatype properties. For example, a particular resource instance, http://data.linkededucation.org/resource/lak/conference/edm2012/paper/45, has the following properties we are interested in: dcterms:subject, dcterms:title, swrc:abstract and led:body. From the extracted values we perform Named Entity Recognition using DBpedia Spotlight, to extract matching entities from the DBpedia knowledge base, by issuing an HTTP POST request to the web service offered by DBpedia Spotlight:

http://spotlight.dbpedia.org/rest/annotate/?confidence=0.25&support=20&text=TEXT

Entities extracted this way include, for example:

http://dbpedia.org/resource/Data_mining
http://dbpedia.org/resource/Learning
http://dbpedia.org/resource/Education

In addition, from the extracted entities we retrieve information about the topics covered by each entity, and consequently by the resource instance from which the entity was extracted. The topics covered by an entity are extracted from its dcterms:subject property, which contains DBpedia categories; examples of extracted categories:

http://dbpedia.org/resource/Category:Systems_science
http://dbpedia.org/resource/Category:Intelligence
http://dbpedia.org/resource/Category:Learning
http://dbpedia.org/resource/Category:Educational_psychology
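As an illustration, the categories attached to an entity can also be looked up directly on the public DBpedia SPARQL endpoint with a one-pattern query such as the sketch below (the pipeline itself obtains them as part of the annotation step rather than through a separate query):

PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?category WHERE {
  <http://dbpedia.org/resource/Learning> dcterms:subject ?category
}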

Furthermore, since we want a broader representation of the covered topics, we leverage the hierarchical organisation of DBpedia categories based on the skos:broader property, and from the categories directly associated with an entity we expand to additional categories. For instance, for the category http://dbpedia.org/resource/Category:Learning, going up to 4 levels in the category hierarchy, the following extra categories can be extracted:

L1: http://dbpedia.org/resource/Category:Behavior
L2: http://dbpedia.org/resource/Category:Sociobiology
L3: http://dbpedia.org/resource/Category:Subfields_and_areas_of_study_related_to_evolutionary_biology
L4: http://dbpedia.org/resource/Category:Evolutionary_biology
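This expansion can be sketched directly in SPARQL against DBpedia by following skos:broader links for a fixed number of levels with nested OPTIONAL patterns; again, this is just an illustration of the idea, not the code used by the pipeline:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?level1 ?level2 ?level3 ?level4 WHERE {
  <http://dbpedia.org/resource/Category:Learning> skos:broader ?level1 .
  OPTIONAL { ?level1 skos:broader ?level2 .
    OPTIONAL { ?level2 skos:broader ?level3 .
      OPTIONAL { ?level3 skos:broader ?level4 . } } }
}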

Finally, after completing the four steps, we measure which topics are covered the most by computing a normalised score. In the simplest case, it counts the number of associations a topic has within a dataset; when several datasets are analysed, we take that into account by checking which topics are the most representative and distinctive for each dataset. The output produced by these steps is returned in JSON format, which can be further used to generate, for instance, VoID metadata, or for additional analysis.

For more information and the detailed reasoning behind the individual steps, have a look at our paper [1]. Additionally, the whole procedure described above can be accessed via the public web service offered by the LinkedUp project, returning the results in JSON format (shown below). It can be accessed under the following link http://data.linkededucation.org/AnnotateService-1.0/RestAnnotateService/RestAnnotateService/analysedatasetR?datasetid=lak-dataset&resourceno=20&append=false, with the parameters datasetid (the DataHub dataset id), resourceno (the number of resource instances to be analysed) and append (whether previous results should be overwritten or built upon).
{
  "Datasets": [{
    "Dataset": {
      "URI": "",
      "Name": "",
      "Description": "",
      "Annotations": {
        "Entities": [{
          "URI": "",
          "Resources": [{"Resource": "", "Frequency": "", "Categories": [{"Category": ""}]}],
          "OverallFrequency": ""
        }]
      }
    }
  }],
  "Categories": [{"Category": {"Level": "", "ParentCategory": "", "URI": ""}}]
}

As quickly mentioned in our previous post on the LinkedUp Catalogue of Datasets (a.k.a. the "Linked Education Cloud"), this catalogue is a tiny bit more than a list of pointers to datasets. It integrates with the Datahub, but more importantly, it includes a SPARQL endpoint with a rich description of the datasets and of their content, including mappings between the classes of objects represented in each dataset.

In this post, we show in more detail how this representation can be used to find datasets that contain information about a particular type of thing, and how we can actually find these things through SPARQL query federation. That sounds scary… but really, it is quite simple once you get the idea that all the LinkedUp Catalogue does is give you a meta-description of the datasets, which can be accessed and queried in the same way as the datasets themselves.

So, here we are going to use the example of schools – i.e., we are going to write a query that returns all the schools in all the datasets of the catalogue. In the LinkedUp catalogue, the chosen type for schools is the aiiso:School class. That means that every dataset either uses this class, or there will be a mapping between the class it uses for schools and this one.

The first thing to do is therefore to find all the SPARQL endpoints in the LinkedUp catalogue that have objects of this class. The VoID representation of the LinkedUp catalogue is a set of "datasets", which might have subsets. A particular type of sub-dataset is called a class-partition, and represents the sub-part of the dataset that concerns a certain type of objects (i.e., a certain class). So we can start with the query:

prefix void: <http://rdfs.org/ns/void#> 
prefix aiiso: <http://purl.org/vocab/aiiso/schema#>

select distinct ?endpoint where {
   ?ds void:sparqlEndpoint ?endpoint.
   ?ds void:classPartition [void:class aiiso:School] 
}

If tried on the LinkedUp Catalogue’s SPARQL endpoint, that should give us the URIs of all the SPARQL endpoints that contain objects of the type aiiso:School… However, right now, it returns nothing. That’s because the objects of this class might be in a subset of a dataset attached to a SPARQL endpoint. So, basically we need to also look at subsets:

prefix void: <http://rdfs.org/ns/void#>
prefix aiiso: <http://purl.org/vocab/aiiso/schema#>

select distinct ?endpoint where {
   ?ds void:sparqlEndpoint ?endpoint.
   {{?ds void:classPartition [void:class aiiso:School]}
       union
   {?ds void:subset [void:classPartition [void:class aiiso:School]]}}
}

The union operator can be seen a bit like an OR. So this new query asks for endpoints that have objects of the type aiiso:School, or that have subsets with objects of the type aiiso:School… and nicely enough, it returns something now: two URIs of endpoints.

Now, the thing is, not all endpoints in the catalogue represent schools as instances of the class aiiso:School. As part of building the LinkedUp catalogue, we also create class mappings – i.e., relationships indicating whether a class found in a dataset is equivalent to, or a subclass of, another one. In order to find endpoints that talk about schools but might use other classes mapped to aiiso:School, we can therefore use the following query:

prefix void: <http://rdfs.org/ns/void#>
prefix aiiso: <http://purl.org/vocab/aiiso/schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?endpoint ?cl where {
   ?ds void:sparqlEndpoint ?endpoint.
   {{?ds void:classPartition [ void:class ?cl]} 
        UNION
   {?ds void:subset [ void:classPartition [ void:class ?cl] ]}}
   {{?cl owl:equivalentClass aiiso:School} 
        UNION
   {?cl rdfs:subClassOf aiiso:School}
        UNION 
   {FILTER ( str(?cl) = str(aiiso:School) ) }}
}

This query, with the additional UNION clause, asks for endpoints that contain (possibly in a subset) objects of a class which is either aiiso:School, a class equivalent to aiiso:School or a subclass of aiiso:School. The result we get now is five different endpoints, two of them using aiiso:School (the same as before), and three using different classes, sometimes more specific than aiiso:School.

{
  "head": {
    "vars": [ "endpoint" , "cl" ]
  } ,
  "results": {
    "bindings": [
      {
        "endpoint": { "type": "uri" , "value": "http://www.auth.gr/sparql" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://kent.zpr.fer.hr:8080/educationalProgram/sparql" } ,
        "cl": { "type": "uri" , "value": "http://kent.zpr.fer.hr:8080/educationalProgram/vocab/sisvu.rdf#Academy" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/TrainingSchool" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://data.linkedu.eu/hud/query" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://sparql.linkedopendata.it/scuole" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/net7/vocab/scuole/v1#Scuola" }
      }
    ]
  }
}

And that's where the magic of query federation can happen: we now have a list of SPARQL endpoints that talk about schools, with the classes they use to talk about schools. The "service" clause in SPARQL 1.1 can be used to delegate a sub-part of a query to an external/remote SPARQL endpoint. Its implementation is still very sketchy in many cases, and does not always work the way we want it to, but Fuseki, the triple store we use for the LinkedUp Catalogue, does a reasonably good job at it. Look closely at the three lines added at the end of the query (plus the additional variable in the select clause):

prefix void: <http://rdfs.org/ns/void#>
prefix aiiso: <http://purl.org/vocab/aiiso/schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?endpoint ?school ?cl where {
   ?ds void:sparqlEndpoint ?endpoint.
   {{?ds void:classPartition [ void:class ?cl]} 
       UNION
   {?ds void:subset [ void:classPartition [ void:class ?cl] ]}}
   {{?cl owl:equivalentClass aiiso:School} 
       UNION
   {?cl rdfs:subClassOf aiiso:School}
      UNION 
   {FILTER ( str(?cl) = str(aiiso:School) ) }}
   service silent ?endpoint {
      ?school a ?cl
   }
}

What these three lines actually say is “and now give me all the objects of these classes in the endpoints where they appear”. The result, if you try it, takes a bit of time to come back, but is quite impressive: a list of schools described in five different, completely independent datasets distributed over the web!

{
  "head": {
    "vars": [ "endpoint" , "school" , "cl" ]
  } ,
  "results": {
    "bindings": [
      {
        "endpoint": { "type": "uri" , "value": "http://www.auth.gr/sparql" } ,
        "school": { "type": "uri" , "value": "https://www.auth.gr/bio" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://www.auth.gr/sparql" } ,
        "school": { "type": "uri" , "value": "https://www.auth.gr/itl" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://www.auth.gr/sparql" } ,
        "school": { "type": "uri" , "value": "https://www.auth.gr/theo" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/vocab/aiiso/schema#School" }
      } ,
     
      ...
   
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "school": { "type": "uri" , "value": "http://education.data.gov.uk/id/school/100869" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "school": { "type": "uri" , "value": "http://education.data.gov.uk/id/school/100868" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/School" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://services.data.gov.uk/education/sparql" } ,
        "school": { "type": "uri" , "value": "http://education.data.gov.uk/id/school/100867" } ,
        "cl": { "type": "uri" , "value": "http://education.data.gov.uk/def/school/School" }
      } ,

      ...

      {
        "endpoint": { "type": "uri" , "value": "http://sparql.linkedopendata.it/scuole" } ,
        "school": { "type": "uri" , "value": "http://data.linkedopendata.it/scuole/resource/CTTD00601P" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/net7/vocab/scuole/v1#Scuola" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://sparql.linkedopendata.it/scuole" } ,
        "school": { "type": "uri" , "value": "http://data.linkedopendata.it/scuole/resource/CTIS00600C" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/net7/vocab/scuole/v1#Scuola" }
      } ,
      {
        "endpoint": { "type": "uri" , "value": "http://sparql.linkedopendata.it/scuole" } ,
        "school": { "type": "uri" , "value": "http://data.linkedopendata.it/scuole/resource/CTTD01401N" } ,
        "cl": { "type": "uri" , "value": "http://purl.org/net7/vocab/scuole/v1#Scuola" }
      } ,

      ...      

The goal of the LinkedUp Dataset Catalog (or Linked Education Cloud) is to collect and make available, ideally in an easily usable way, all sorts of data sources of relevance to education. The aim is not only to support participants in the LinkedUp Challenge in identifying and conjointly using Web Data in their applications, but also to be a general, evolving resource for the community interested in Web data for education. As we will see here, the LinkedUp Dataset Catalog is actually more than one thing, and can be used in more than one way.

A community group on CKAN/Datahub.io

The LinkedUp Dataset Catalog is first and foremost a registry of datasets. Datahub.io is probably one of the most popular and most used global catalogs of datasets, and is in particular at the basis of the Linked Open Data cloud. In the interest of integrating with other ongoing open data efforts, rather than developing ours in isolation, the LinkedUp Dataset Catalog is created as part of Datahub.io. It takes the form of a community group in which any dataset can be included. In other words, any dataset in Datahub.io can be included in our Linked Education Cloud group (as long as it is relevant), and the datasets in this group are also visible globally on the Datahub.io portal.

On this portal, every dataset is described with a set of basic metadata, with all sorts of resources attached to it. This makes it possible to search for datasets, including faceted browsing of the results, globally or specifically in the Linked Education Cloud. For example, one can search for the word "University" in the Linked Education Cloud, and obtain datasets that explicitly mention "university" in their metadata. These results can be further reduced with filters, for example to include only the ones that provide an example resource in the RDF/XML format.

lec-ckan

One of the great advantages of relying on Datahub.io is that the catalog is not only accessible through the web portal, but also in a programmatic way through the CKAN API. Thanks to this API, it is possible to build applications that search datasets, retrieve their metadata and obtain the associated resources automatically.

A Linked Data based catalog

Going a step beyond the CKAN-based catalog above, the descriptions of the same set of data sources are also made available in a machine-readable format, following the principles of Linked Data. Here, we use the VoID vocabulary to describe the datasets, the data endpoints they use, the sub-datasets, as well as the types of data objects and relationships present in their content. This is represented in RDF, and made available using a dedicated SPARQL endpoint.

lec-sparql

Through this representation and the associated SPARQL endpoint, it is possible to query and find datasets using criteria which are more fine-grained than the ones of the CKAN API. For example, the query below would return (by default in a dedicated XML format, but others such as JSON are also supported) the list of data endpoints (in the catalog) that contain sub-datasets providing objects of the type foaf:Document, ranked according to the number of such sub-datasets in the endpoint.

select distinct ?endpoint (count(distinct ?d1) as ?count) where {
    ?dataset <http://rdfs.org/ns/void#sparqlEndpoint> ?endpoint.
    ?d1 <http://rdfs.org/ns/void#subset> ?dataset.
    ?d1 <http://rdfs.org/ns/void#classPartition> ?cp.
    ?cp <http://rdfs.org/ns/void#class> <http://xmlns.com/foaf/0.1/Document>
} group by ?endpoint order by desc(?count)

Other resources and interfaces

Besides the APIs and data endpoints, several other resources are provided to support the use of the LinkedUp Dataset Catalog. This blog is one of them, whose aim is to show how the different datasets can be used in various situations and with various tools. Another interface provides a way to browse through the datasets and the types of data objects they provide, in order to identify interesting ones. Also, mappings between these different types are included in the VoID description, so that different datasets can be more easily used together. Such a homogeneous view also helps us better understand what the datasets are about and what they can provide, as in the graph below showing common types of data objects in the catalog, and how they co-occur in the currently included datasets (more details on this can be found in this paper).
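As an illustration, those class mappings can themselves be listed from the catalog's SPARQL endpoint with a query along the following lines (a sketch, using the same mapping properties as elsewhere on this blog):

prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?class ?mappedTo where {
   {?class owl:equivalentClass ?mappedTo}
       UNION
   {?class rdfs:subClassOf ?mappedTo}
}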

This post is a summary of one of the activities of the LAK 2013 “Using Linked Data in Learning Analytics” tutorial.

OpenRefine (formerly known as Google Refine) is a rather simple but very convenient tool to manipulate data in a tabular format, providing features for filtering, creating facets, clustering values, reconciliation, etc. It makes it relatively easy to "play" with a bunch of data and obtain it in a form that is convenient for a specific purpose. Here, we show how we can easily load the results of a SPARQL query about UK schools into OpenRefine, to explore and view this data through simple analyses.

The first thing with OpenRefine is that, although it can import data from a large variety of formats and the RDF extension allows one to, among other things, export the results in RDF, it cannot load RDF or the results of a SPARQL query straight away from the standard XML and RDF formats. It can however load simple tabular formats such as CSV from their Web URLs.

We therefore first employ the SPARQL proxy, which allows us to execute a SPARQL query on any endpoint and obtain the results in a chosen format, in our case CSV. As shown in the figure below, we will use here the endpoint of education.data.gov.uk, the UK government dataset about education, http://education.data.gov.uk/sparql/education/query, and choose the CSV format to obtain the results of the following query:
select distinct ?school ?label ?status ?type ?cap where {
?school a <http://education.data.gov.uk/def/school/School>.
?school <http://www.w3.org/2000/01/rdf-schema#label> ?label.
?school <http://education.data.gov.uk/def/school/establishmentStatus> ?s.
?s <http://www.w3.org/2000/01/rdf-schema#label> ?status.
?school <http://education.data.gov.uk/def/school/typeOfEstablishment> ?t.
?t <http://www.w3.org/2000/01/rdf-schema#label> ?type.
?school <http://education.data.gov.uk/def/school/schoolCapacity> ?cap.
}

which gives basic information (name, status, type, capacity) about UK Schools.

SPARQL Proxy

Executing the query (clicking on the Query button) should display a simple text-based table in the CSV format. This is what we will load into OpenRefine. In OpenRefine, we can create a new project from an online data source, using the Create Project and Web Addresses (URLs) options, and copy-pasting the URL of the CSV results previously obtained from the SPARQL proxy into the dedicated field, as shown below.

Importing CSV from SPARQL proxy into open refine

Once that's done and Next has been clicked, some options will be shown related to the import. At this stage, it is important to choose the CSV option to indicate to OpenRefine that data values are separated by commas. The preview should then show the data nicely organised in a table. Once ready, going ahead will present the results in a large table, ready to be analysed. We can for example create facets to browse through and filter open schools of a certain type, with a threshold on their capacity, as shown below.

Exploring the data in OpenRefine

 

This post is a summary of one of the activities of the LAK 2013 “Using Linked Data in Learning Analytics” tutorial.

Understanding learning and research communities has a lot to do with understanding the network of relationships between the members of these communities. A powerful tool to visualise and analyse these networks is Gephi. Gephi provides, out of the box, the ability to create potentially very large graphs, to render them beautifully and to lay them out automatically using force-directed layouts that help understand the topology of the networks. Here, we show how, using the Semantic Web Import plugin for Gephi, we can quickly and easily build a network of co-authors with a SPARQL query to a linked data endpoint, and display it to see groups of co-authors forming in the graph.

First, installing the Semantic Web Import plugin in Gephi is a reasonably simple task. Go to the Tools menu, choose Plugins, then the Available Plugins tab, where you should find and select Semantic Web Import. Once installed, the Semantic Web Import plugin should appear in the interface, as a tab of the main view in the Overview tab (see below).

Gephi Semantic Web Import plugin configuration

Creating a network with this tool requires two things: first, configuring the endpoint, and second, providing a query whose results should be interpreted as a network by Gephi. There are several different options that can be used as data endpoints or data sources in the plugin. Here we will only use the Remote – REST endpoint option. For the purpose of the example, we will use the SPARQL endpoint of the University of Southampton – http://sparql.data.southampton.ac.uk/ – which should be entered into the Endpoint URL field as in the figure above.

The second thing to do is to enter the SPARQL query in the query tab. The query has to be a CONSTRUCT query, as Gephi will interpret the resulting RDF graph as the network to visualise and process. Since we are here looking at the network of co-authorship, we will use the following query:
construct {
?author1 <http://myonto.com/coauthor> ?author2.
}
where {
?pub <http://purl.org/dc/terms/creator> ?author1.
?pub <http://purl.org/dc/terms/creator> ?author2.
filter ( ?author1 != ?author2 )
}
limit 10000

(Note: we put a limit of 10,000 triples in this query as, at the time of writing, the SPARQL endpoint of the University of Southampton has a limitation by which it cannot return more than a certain amount of data. This will impact the completeness of the results, but should not affect the overall process of generating the graph.)

If we enter this query into the query tab below, it will create a graph with URIs of authors connected through being co-authors of at least one paper in the repository. Now, this is a bit difficult to read, as it would be much more convenient to see the names of the authors instead. The Semantic Web Import plugin provides the ability to customise the network through specialised properties in the resulting RDF graph. Here in particular, we want to use the http://gephi.org/label property to provide the nodes with more friendly labels, transforming the query into:
construct {
?author1 <http://myonto.com/coauthor> ?author2.
?author1 <http://gephi.org/label> ?name1.
?author2 <http://gephi.org/label> ?name2.
}
where {
?pub <http://purl.org/dc/terms/creator> ?author1.
?pub <http://purl.org/dc/terms/creator> ?author2.
?author1 <http://xmlns.com/foaf/0.1/name> ?name1.
?author2 <http://xmlns.com/foaf/0.1/name> ?name2.
filter ( ?author1 != ?author2 )
}
limit 10000

Putting this query into the query tab and running it will generate the initial network. Using the force atlas layout algorithm will then gather together groups of co-authors, as shown in the following screenshots.

Putting the SPARQL query into Gephi.

 

The resulting network graph, without a layout applied.
The network graph after applying the force atlas layout algorithm. Separate groups of close collaborators are clearly forming.

This post is a summary of one of the activities of the LAK 2013 “Using Linked Data in Learning Analytics” tutorial.

R is probably one of the most used and most popular tools for data analysis. It is a powerful statistics engine and a programming language allowing one to achieve all sorts of complex tasks, from visualisation to clustering. Here, we show, through the very simple task of displaying a pie chart, how data from a SPARQL endpoint can be loaded, manipulated and processed in R.

To achieve this, the first thing to do in R is to install and load the SPARQL package. The nice thing about R is that this can be achieved with a few rather simple commands, first installing the rJava package, then the SPARQL one, and finally loading it:
install.packages("rJava")
install.packages("SPARQL")
library(SPARQL)

This might trigger errors if other required libraries have not been installed. Installing them can be achieved in exactly the same way with the install.packages command.

Once the library is installed, the environment is ready to execute any SPARQL select query on any endpoint of your choice. Here, we will use the SPARQL endpoint of the Open University – http://data.open.ac.uk/query – with a query that collects information about courses:
select distinct ?subjectlabel ( count(distinct ?course) as ?nbcourse ) 
( avg(?creds) as ?avgcredits) ( avg(?price) as ?avgprice) where {
?course a <http://courseware.rkbexplorer.com/ontologies/courseware#Course>.
?course <http://purl.org/dc/terms/subject> ?subject.
<http://data.open.ac.uk/topic> <http://www.w3.org/2004/02/skos/core#hasTopConcept> ?subject.
?subject <http://www.w3.org/2000/01/rdf-schema#label> ?subjectlabel.
?course <http://data.open.ac.uk/saou/ontology#eu-number-of-credits> ?creds.
?course <http://purl.org/net/mlo/specifies> ?presentation.
?offer <http://purl.org/goodrelations/v1#includes> ?course.
?offer <http://purl.org/goodrelations/v1#availableAtOrFrom> <http://sws.geonames.org/2802361/>.
?offer <http://purl.org/goodrelations/v1#hasPriceSpecification> ?pricespec.
?pricespec <http://purl.org/goodrelations/v1#hasCurrencyValue> ?price.
} group by ?subjectlabel

This query in principle collects more information than strictly needed for this exercise but, broadly, the results obtained are the list of top-level topics of Open University courses, together with the number of courses in each topic, the average number of credits these courses give and the average price of a course (for a student located in Belgium).

Executing this SPARQL query is achieved in R with the SPARQL package, using the SPARQL command, which takes as parameters the URL of the endpoint and the query string, as follows:
results <- SPARQL("http://data.open.ac.uk/query", "select distinct ?subjectlabel
( count(distinct ?course) as ?nbcourse ) ( avg (?creds) as ?avgcredits) ( avg(?price) as ?avgprice)
where {?course a <http://courseware.rkbexplorer.com/ontologies/courseware#Course>.
?course <http://purl.org/dc/terms/subject> ?subject.
<http://data.open.ac.uk/topic> <http://www.w3.org/2004/02/skos/core#hasTopConcept> ?subject.
?subject <http://www.w3.org/2000/01/rdf-schema#label> ?subjectlabel.
?course <http://data.open.ac.uk/saou/ontology#eu-number-of-credits> ?creds.
?course <http://purl.org/net/mlo/specifies> ?presentation.
?offer <http://purl.org/goodrelations/v1#includes> ?course.
?offer <http://purl.org/goodrelations/v1#availableAtOrFrom> <http://sws.geonames.org/2802361/>.
?offer <http://purl.org/goodrelations/v1#hasPriceSpecification> ?pricespec.
?pricespec <http://purl.org/goodrelations/v1#hasCurrencyValue> ?price.} group by ?subjectlabel")

After executing this command, the results variable contains a table that corresponds to the results of the SPARQL query, i.e.

print(results)

gives

$results
subjectlabel nbcourse avgcredits avgprice
1 "Mathematics and Statistics"@en 32 14.85714 1255.857
2 "Education"@en 26 28.26087 2149.681
3 "Business and Management"@en 40 14.11017 1506.797
4 "Environment, Development and International Studies"@en 41 24.20000 2197.533
5 "Childhood and Youth"@en 27 28.02632 2235.342
6 "Law"@en 17 16.95652 1625.652
7 "Health and Social Care"@en 45 22.05263 1852.579
8 "Science"@en 81 13.33333 1115.463
9 "Engineering and Technology"@en 44 14.67949 1598.192
10 "Computing and ICT"@en 44 14.04762 1330.488
11 "Languages"@en 24 20.20408 1712.143
12 "Social Sciences"@en 25 21.04478 1779.522
13 "Arts and Humanities"@en 51 27.26667 2302.173
14 "Psychology"@en 13 17.94643 1519.393

$namespaces
NULL

The only thing left to do then is to display the topics and the number of courses in a pie chart. We first extract the actual table from the results, then display a pie chart using the values of the column nbcourse, with the labels from the column subjectlabel:

restable <- results[[1]]
pie(restable$nbcourse, restable$subjectlabel, col=rainbow(length(restable$nbcourse)))

This opens a new window with our pie chart (tada!).

Pie chart of open university courses in R