Lab: JSON-LD

Topics

JSON-LD @context and processing in the JSON-LD Playground.

Using a Web API to retrieve JSON-LD data from ConceptNet, parsing it programmatically, and using JSON-LD to turn it into RDF.

Useful Reading

Imports and packages:

  • import json
  • import rdflib
  • import requests
  • rdflib-jsonld (installed with pip; used via format="json-ld", not imported directly)

Tasks

Part 1: Basic JSON-LD

Task: In a web browser, go to https://conceptnet.io/ and search for a term you are interested in. (A concept related to the Mueller investigation works well, for example 'indictment'.)

The same URL, but with https://api.conceptnet.io/ instead of just https://conceptnet.io/, returns the data as JSON-LD. It looks awfully detailed, but it will be easy to simplify!

Task: In another web browser tab, go to https://json-ld.org/ and continue to the Playground. Copy your JSON-LD data from the ConceptNet tab into the JSON-LD Input form.

In the Expanded form below the input form you will see a processed version of the JSON-LD input. Compare the first few lines of the Input and the Expanded output. Many of the keys and values in the output have been expanded according to mappings defined in the @context file specified at the beginning of the JSON-LD input:

  "@context": [
      "http://api.conceptnet.io/ld/conceptnet5.7/context.ld.json"
  ],

Task: Instead of pointing to a context file, we will write our own, simpler @context object directly into the JSON-LD Input. It should look like this:

  "@context": {
       "current_key": "url_we_want_the_key_mapped_to",
       ...
  },

We are interested in these keys: edges, start, rel, end. Map them to simple URLs, like http://ex.org/t (for triple), http://ex.org/s, http://ex.org/p and http://ex.org/o. These are the basic triples we are most interested in!
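
For example, a context along these lines should do (the http://ex.org/... URLs are just placeholders; any URLs will work):

  "@context": {
       "edges": "http://ex.org/t",
       "start": "http://ex.org/s",
       "rel": "http://ex.org/p",
       "end": "http://ex.org/o"
  },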

Look at the Expanded version again. It is much simpler now: the JSON-LD processor ignores regular keys that are not mapped (but the special keys starting with @ are still there).

Task: Remove the line that maps the edges key. What happens and why? Put the edges mapping back in again.

Task: In addition to the Expanded tab, the Playground can show Compacted and Flattened versions of the JSON-LD Input. They are different ways of processing the same data, each useful for different purposes.

Which one do you prefer for reading? Which one would be easiest to program as JSON?

Task: With our simple @context, we have lost the labels!

Map label to http://www.w3.org/2000/01/rdf-schema#label and see what happens.
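
If you kept the placeholder context from above, this is just one more line in the @context object:

  "label": "http://www.w3.org/2000/01/rdf-schema#label"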


Part 2: Programming JSON-LD in Python

Task: Install the rdflib-jsonld package in the same environment where you have rdflib installed. (Recent versions of rdflib have JSON-LD support built in, in which case the extra package is not needed.)

Create a graph object and parse the https://api.conceptnet.io/... URL you used to download JSON-LD data earlier. You need to add the argument format="json-ld" when you call parse(...), but you should not need to import more than rdflib as before.
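
A minimal sketch, assuming the JSON-LD support is in place and using the 'indictment' example (replace the URL with the one you used):

import rdflib

g = rdflib.Graph()
# ConceptNet returns JSON-LD, which rdflib can parse directly from the URL
g.parse('https://api.conceptnet.io/c/en/indictment', format='json-ld')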

Task: Inspect the graph object using simple SPARQL queries to find the distinct predicates and types used.
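
A minimal sketch of such queries, assuming the graph object is called g:

# List the distinct predicates used in the graph
for row in g.query('SELECT DISTINCT ?p WHERE { ?s ?p ?o }'):
    print(row.p)

# List the distinct types used in the graph
for row in g.query('SELECT DISTINCT ?t WHERE { ?s a ?t }'):
    print(row.t)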

You can also count the number of triples in the graph:

print(len(g))

or iterate through all the triples:

for s, p, o in g:
    print(s, p, o)

Task: Unfortunately, the graph is much more complex than we need and it is not easy to pick out the triples we want. We want to add our own context object like we did in the Playground. Instead of parsing a graph directly from a URL, we first download it as a JSON object, for example:

import json
import requests

CN_BASE = 'http://api.conceptnet.io/c/en/'

json_obj = requests.get(CN_BASE+'indictment').json()

Now, json_obj['@context'] contains the @context object. Define your own context object in Python, similar to the one you used in the Playground, and assign it to json_obj['@context'].

First convert the modified JSON object into a JSON string (import json and use json.dumps(...)). Then create another graph object and parse the JSON string. You need to add the argument data=... in addition to format="json-ld" when you call parse(...), because you are no longer parsing from a file or URL, but from a string.

Save the JSON string for later, so you do not have to retrieve the same data over and over from ConceptNet.
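
A sketch that continues from the requests example above; the http://ex.org/... URLs are placeholders and the file name is just an example:

import json
import rdflib

# Replace ConceptNet's context with our own, simpler one
json_obj['@context'] = {
    'edges': 'http://ex.org/t',
    'start': 'http://ex.org/s',
    'rel': 'http://ex.org/p',
    'end': 'http://ex.org/o',
    'label': 'http://www.w3.org/2000/01/rdf-schema#label'
}

# Convert to a JSON string and save it so we do not hit the API again
json_str = json.dumps(json_obj)
with open('indictment.json', 'w') as f:
    f.write(json_str)

# Parse the string into a new graph
g = rdflib.Graph()
g.parse(data=json_str, format='json-ld')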

Task: Create a new SPARQL SELECT query that lists all the (s, p, o) triples in your graph.

The URLs for predicates should be fine now, but the URLs of subjects and objects can be improved by mapping the special @base key in the @context object to a simple URL like http://ex.org/.
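
One way to shape the query is to follow the edge structure we mapped earlier. A sketch, assuming "@base": "http://ex.org/" has been added to the context object and using the placeholder predicates:

query = '''
SELECT ?s ?p ?o WHERE {
    ?edge <http://ex.org/s> ?s ;
          <http://ex.org/p> ?p ;
          <http://ex.org/o> ?o .
}
'''
for row in g.query(query):
    print(row.s, row.p, row.o)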

Task: Extend the SELECT query so that it also lists all the labels of subjects and objects.
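
A possible extension, assuming the rdfs:label mapping from earlier; OPTIONAL keeps rows even when a label is missing:

query = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?slabel ?p ?o ?olabel WHERE {
    ?edge <http://ex.org/s> ?s ;
          <http://ex.org/p> ?p ;
          <http://ex.org/o> ?o .
    OPTIONAL { ?s rdfs:label ?slabel }
    OPTIONAL { ?o rdfs:label ?olabel }
}
'''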

Task: Change the SELECT query into a CONSTRUCT query that returns a new graph of all the basic triples in the original JSON-LD data. Save it to a file and look at it in a visualiser you like.
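
A sketch along the same lines; the output file name and format are just examples:

query = '''
CONSTRUCT { ?s ?p ?o } WHERE {
    ?edge <http://ex.org/s> ?s ;
          <http://ex.org/p> ?p ;
          <http://ex.org/o> ?o .
}
'''
new_graph = g.query(query).graph
new_graph.serialize(destination='concept.ttl', format='turtle')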

If you have more time...

Task: Merge the new triples with your existing graph if they fit there.

Task: Wrap the code you have written into a function describe_concept(...) that takes a concept name as an argument (e.g., 'indictment') and returns a ConceptNet subgraph that describes the concept.

Task: The original JSON-LD data from https://api.conceptnet.io/... contains a view object at the end. Check it out!

By default, the API only returns 20 edges at a time. You can modify that by adding a ?limit=... argument to your URL.

Modify your describe_concept(...) function to take an extra argument that controls how many edges are downloaded.
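
A sketch of how the function could look, combining the steps above; the parameter names and the http://ex.org/... URLs are assumptions:

import json
import requests
import rdflib

CN_BASE = 'http://api.conceptnet.io/c/en/'

CONTEXT = {
    'edges': 'http://ex.org/t',
    'start': 'http://ex.org/s',
    'rel': 'http://ex.org/p',
    'end': 'http://ex.org/o',
    'label': 'http://www.w3.org/2000/01/rdf-schema#label',
    '@base': 'http://ex.org/'
}

def describe_concept(concept, limit=20):
    # Download the ConceptNet data as JSON; ?limit=... controls the number of edges
    json_obj = requests.get(CN_BASE + concept, params={'limit': limit}).json()
    # Swap in our own simpler context
    json_obj['@context'] = CONTEXT
    # Parse the modified JSON into an RDF graph
    g = rdflib.Graph()
    g.parse(data=json.dumps(json_obj), format='json-ld')
    # Keep only the basic (s, p, o) triples
    return g.query('''
        CONSTRUCT { ?s ?p ?o } WHERE {
            ?edge <http://ex.org/s> ?s ;
                  <http://ex.org/p> ?p ;
                  <http://ex.org/o> ?o .
        }
    ''').graph

For example, describe_concept('indictment', limit=50) should return a graph with up to 50 basic triples.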

Task: You still have way too many triples. Use FILTER and STRENDS to ignore some predicates, like Synonym and the very general RelatedTo.
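
A sketch of the kind of FILTER this could use, assuming the placeholder predicates and ConceptNet's /r/Synonym and /r/RelatedTo relation names:

CONSTRUCT { ?s ?p ?o } WHERE {
    ?edge <http://ex.org/s> ?s ;
          <http://ex.org/p> ?p ;
          <http://ex.org/o> ?o .
    FILTER (!STRENDS(STR(?p), "Synonym") && !STRENDS(STR(?p), "RelatedTo"))
}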

Task: Modify your @context and query so you can remove triples with concepts that are not in English (en).
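
One possible approach, assuming that ConceptNet's start and end objects carry a language key (the http://ex.org/lang URL is a placeholder): map language in the @context,

  "language": "http://ex.org/lang"

and require English in the query:

    ?s <http://ex.org/lang> "en" .
    ?o <http://ex.org/lang> "en" .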