Lab: Semantic Lifting - CSV
Topic
- Reading non-semantic data tables into semantic knowledge graphs
- Specifically, reading data in CSV format via Pandas dataframes into RDF graphs
Useful materials
The textbook (Allemang, Hendler & Gandon):
- chapter on RDF (section on Distributing Data across the Web)
- Information about the dataset
Pandas:
- Article about working with pandas.DataFrames and CSV
- class: DataFrame (methods: read_csv, set_index, apply, iterrows, astype)
rdflib:
- classes/interfaces from earlier (such as Graph, Namespace, URIRef, Literal, perhaps BNode)
- also vocabulary classes like RDF (e.g., type), RDFS (e.g., label) and XSD (for various datatypes)
Tasks
We will be working with the same dataset as in the SPARQL exercise: FiveThirtyEight's Russia Investigation. It contains data about special investigations conducted by the United States government, from the Watergate investigation up until May 2017. This page explains the Russia Investigation dataset a bit more.
Task: In the SPARQL exercise, you downloaded the data as a Turtle file (File:Russia investigation kg.txt, which you renamed to .ttl). This time you will download the data as a CSV file from GitHub.
Task: Install Pandas in your virtual environment, for example
pip install pandas
Write a Python program that imports the pandas API and uses Pandas' read_csv function to load the russia-investigation.csv dataset into a Pandas dataframe.
Task: (Pandas basics) Inspect the Pandas dataframe. If you have called your dataframe df, you can check out the expressions below. Use the documentation to understand what each of them does.
df.shape
df.index  # ...and list(df.index)
df.columns
df['name']
df.name
df.loc[3]
df.loc[3]['president']
(Pandas offers many ways of picking out rows, columns, and values. These are just examples to get started.)
Task: (Pandas basics) Pandas' apply method offers a compact way to process all the rows in a dataframe. This line lists all the rows in your dataframe as a Python dict():
df.apply(lambda row: print(dict(row)), axis=1)
What happens if you drop the axis argument, or set axis=0?
Task: Instead of the lambda function, you can use a named function. Write a function that prints out only the name and indictment-days in a row, and use it to print out the name and indictment-days for all rows in the dataframe.
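A minimal sketch of what such a named function could look like, assuming your dataframe is called df (here with made-up sample rows that use the same column names as the lab dataset):

```python
import pandas as pd

# Hypothetical sample data with the same column names as the lab dataset
df = pd.DataFrame({
    "name": ["Elliot Richardson", "George Beall"],
    "indictment-days": [274, 186],
})

def print_name_and_days(row):
    # row is a pandas Series; cells are accessed by column label
    print(row["name"], row["indictment-days"])

# axis=1 passes each row (rather than each column) to the function
df.apply(print_name_and_days, axis=1)
```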
Alternative to df.apply(): Pandas offers several ways to iterate through data. You can also use the itertuples method in a simple for-loop to iterate through rows.
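For example, a small sketch with made-up sample rows. Note that itertuples yields namedtuples, and column names that are not valid Python identifiers (such as indictment-days) are renamed to positional fields like _2:

```python
import pandas as pd

# Hypothetical sample data with the same column names as the lab dataset
df = pd.DataFrame({
    "name": ["Elliot Richardson", "George Beall"],
    "indictment-days": [274, 186],
})

# itertuples yields one namedtuple per row; the row index is in the Index field
for row in df.itertuples():
    # "indictment-days" is not a valid identifier, so it becomes _2
    print(row.Index, row.name, row._2)
```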
Task: Modify your function so it adds name and indictment-days triples to a global rdflib Graph for each row in the dataframe. The subject in each triple could be the numeric index of the row.
You can use standard terms from RDF, RDFS, XSD, and other vocabularies when you see fit. Otherwise, just use an example-prefix.
Things may be easier if you copy df.index into an ordinary column of the dataframe:
df['id'] = df.index
You can use this index, along with a prefix, as the subject in your triples.
Task: Continue to extend your function to convert the non-semantic CSV dataset into a semantic RDF one.
name = row["investigation"]
investigation = URIRef(ex + name)
g.add((investigation, RDF.type, sem.Event))
Next, we will create relations between the investigation and its associated columns. For when the investigation started, we will use the "investigation-start" column together with the property sem:hasBeginTimeStamp:
investigation_start = row["investigation-start"]
g.add((investigation, sem.hasBeginTimeStamp, Literal(investigation_start, datatype=XSD.date)))
To represent the result of the investigation, if it has one, we can create another entity and connect it to the investigation using the property sem:hasSubEvent. In that case, the following columns can be attributed to the sub-event:
- type
- indictment-days
- overturned
- pardon
- cp_date
- cp_days
- name (the name of the investigatee, not the name of the investigation)
Code to get you started
import pandas as pd
import rdflib
from rdflib import Graph, Namespace, URIRef, Literal, BNode
from rdflib.namespace import RDF, RDFS, XSD
ex = Namespace("http://example.org/")
dbr = Namespace("http://dbpedia.org/resource/")
sem = Namespace("http://semanticweb.cs.vu.nl/2009/11/sem/")
tl = Namespace("http://purl.org/NET/c4dm/timeline.owl#")
g = Graph()
g.bind("ex", ex)
g.bind("dbr", dbr)
g.bind("sem", sem)
g.bind("tl", tl)
df = pd.read_csv("data/investigations.csv")
# Pandas may infer the wrong type for some columns when it reads the file.
# We use .astype("str") to make sure these columns are treated as strings.
df["name"] = df["name"].astype("str")
df["type"] = df["type"].astype("str")
# iterrows yields one (index, row) pair for each row in the dataframe
for index, row in df.iterrows():
    # Do something here to add the content of the row to the graph
    pass
g.serialize("output.ttl", format="ttl")
If you have more time
Task: If you have not done so already, you should include checks to ensure that you do not add empty columns to your graph.
Task: If you have more time, you can use DBpedia Spotlight to try to link the people (and other "named entities") mentioned in the dataset to DBpedia resources. You can start with the code example below, but you will need exception-handling when DBpedia is unable to find a match. For instance:
import spotlight
from spotlight import SpotlightException

# public DBpedia Spotlight endpoint; adjust if you run your own server
SERVER = "https://api.dbpedia-spotlight.org/en/annotate"
# parameter given to Spotlight to filter out results with confidence lower than this value
CONFIDENCE = 0.5

def annotate_entity(entity, filters={'types': 'DBpedia:Person'}):
    annotations = []
    try:
        annotations = spotlight.annotate(SERVER, entity, confidence=CONFIDENCE, filters=filters)
    # this catches errors thrown from Spotlight, including when no resource is found in DBpedia
    except SpotlightException as e:
        print(e)
        # handle exceptions here
    return annotations
Here we use the types filter with DBpedia:Person, since we only want matches for people. You can choose to add only the URIs from the response to your graph, or the types as well.
Useful materials: