Lab: Semantic Lifting - CSV
Revision as of 06:59, 13 February 2023
Topic
- Reading non-semantic data tables into semantic knowledge graphs
- Specifically, reading data in CSV format via Pandas dataframes into RDF graphs
Useful materials
The textbook (Allemang, Hendler & Gandon):
- chapter on RDF (section on Distributing Data across the Web)
Pandas:
- class: DataFrame (methods: apply, iterrows, astype) and the read_csv function
rdflib:
- classes/interfaces from earlier (such as Graph, Namespace, URIRef, Literal, perhaps BNode)
- also vocabulary classes like RDF and XSD (for datatypes)
Tasks
We will be working with the same dataset as in the SPARQL exercise: FiveThirtyEight's Russia Investigation (https://projects.fivethirtyeight.com/russia-investigation/). It contains data about special investigations conducted by the United States from the Watergate investigation until May 2017. The wiki page "Russian investigation KG" explains the dataset a bit more.
Task: In the SPARQL exercise, you downloaded the data as a Turtle file (File:Russia investigation kg.txt, which you renamed to .ttl). This time you will download the data as a CSV file from GitHub.
Task: Install Pandas in your virtual environment, for example
pip install pandas
Write a Python program that imports the pandas API and uses Pandas' read_csv function to load the russia-investigation.csv dataset into a Pandas dataframe.
Task: Inspect the Pandas dataframe. If you have called your dataframe df, you can start with the expressions below. Use the documentation to understand what each of them does.
df.shape
df.index
df.columns
df.name
df['name']
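As a minimal sketch of what these expressions return, the block below builds a tiny dataframe from an inline CSV string instead of the real file. The two rows are invented placeholders (not taken from the dataset), and only three of the columns are included.

```python
import io

import pandas as pd

# Made-up stand-in for russia-investigation.csv: placeholder rows, three columns only.
SAMPLE_CSV = """investigation,name,type
example-investigation,Person A,indictment
example-investigation,Person B,guilty-plea
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)          # (number of rows, number of columns)
print(df.index)          # row labels, 0..n-1 by default
print(list(df.columns))  # column names from the CSV header
print(df["name"])        # one column as a pandas Series

# df.name is attribute-style access to the same column. It only works when
# the column name is a valid Python identifier that does not clash with a
# DataFrame attribute, so df["investigation-start"] needs the bracket form.
same_column = df.name.equals(df["name"])
```

With the real file, replacing the `io.StringIO(...)` argument with the path to the downloaded CSV should be the only change needed.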
The CSV file contains the following columns:
- investigation
- investigation-start
- investigation-end
- investigation-days
- name
- indictment-days
- type
- cp-date
- cp-days
- overturned
- pardoned
- american
- president
More information about the columns and the dataset is available here: https://github.com/fivethirtyeight/data/tree/master/russia-investigation
Our goal is to convert this non-semantic dataset into a semantic one. To do this we will go through the dataset row by row and extract the content of each column. An investigation may have multiple rows in the dataset if it investigates multiple people. You can choose to represent these as one or as multiple entities in the graph. Each investigation may also have a sub-event representing the result of the investigation, for instance an indictment or a guilty plea.
Semantic Vocabularies
You do not have to use the same vocabularies, but these should be well suited:
- RDF: type
- RDFS: label
- Simple Event Ontology (sem): Event, eventType, Actor, hasActor, hasActorType, hasBeginTimeStamp, hasEndTimeStamp, hasTime, hasSubEvent
- TimeLine Ontology (tl): durationInt
- An example namespace to represent terms not found elsewhere (ex): IndictmentDays, Overturned, Pardoned
- DBpedia
For each row, we start by creating a resource that represents the investigation. In this example we treat all investigations with the same name as the same entity, and therefore use the name of the investigation (the "investigation" column) to create the URI:
name = row["investigation"]
investigation = URIRef(ex + name)
g.add((investigation, RDF.type, sem.Event))
Next we relate the investigation to the values in its associated columns. For the start of the investigation we use the "investigation-start" column together with the property sem:hasBeginTimeStamp:
investigation_start = row["investigation-start"]
g.add((investigation, sem.hasBeginTimeStamp, Literal(investigation_start, datatype=XSD.date)))
To represent the result of the investigation, if it has one, we can create another entity and connect it to the investigation using sem:hasSubEvent. The following columns can then be attributed to the sub-event:
- type
- indictment-days
- overturned
- pardoned
- cp-date
- cp-days
- name (the name of the investigatee, not the name of the investigation)
Code to get you started
import pandas as pd
import rdflib
from rdflib import Graph, Namespace, URIRef, Literal, BNode
from rdflib.namespace import RDF, RDFS, XSD
ex = Namespace("http://example.org/")
dbr = Namespace("http://dbpedia.org/resource/")
sem = Namespace("http://semanticweb.cs.vu.nl/2009/11/sem/")
tl = Namespace("http://purl.org/NET/c4dm/timeline.owl#")
g = Graph()
g.bind("ex", ex)
g.bind("dbr", dbr)
g.bind("sem", sem)
g.bind("tl", tl)
# Adjust the path to wherever you saved the downloaded CSV file.
df = pd.read_csv("data/investigations.csv")
# Pandas may infer an unsuitable type for some columns when it reads the file,
# so we use .astype("str") to convert the content of these columns to strings.
df["name"] = df["name"].astype("str")
df["type"] = df["type"].astype("str")
# iterrows yields one (index, row) pair for each row in the dataframe
for index, row in df.iterrows():
    # Do something here to add the content of the row to the graph
    pass
g.serialize("output.ttl", format="ttl")
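An alternative to converting columns after loading is to ask read_csv to treat every column as text from the start. The sketch below assumes that approach suits your mapping. It uses an inline CSV string with invented rows in place of the real file.

```python
import io

import pandas as pd

# Made-up stand-in for the real file; the second row has an empty cp-date.
SAMPLE_CSV = """name,type,cp-date,cp-days
Person A,indictment,2001-02-01,17
Person B,indictment,,
"""

# dtype=str keeps every cell as text, and keep_default_na=False stops pandas
# from turning empty cells into NaN, so they stay empty strings.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype=str, keep_default_na=False)

for index, row in df.iterrows():
    if row["cp-date"] == "":
        print(f"row {index}: no cp-date, skipping that triple")
    else:
        print(f"row {index}: cp-date is {row['cp-date']}")
```

The trade-off is that numeric columns such as cp-days then need an explicit `int(...)` before they are turned into typed literals.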
If you have more time
If you have not already done so, you should include some checks to ensure that you do not add values from empty cells to your graph.
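A check of this kind could look like the sketch below. The helper name `is_missing` is hypothetical, and treating the text "nan" as missing is an assumption that fits columns converted with astype("str"). It would misfire on a cell whose real value is "nan".

```python
import pandas as pd


def is_missing(value):
    """Return True for NaN/None and for blank or 'nan' strings."""
    if isinstance(value, str):
        # astype("str") can turn a missing cell into the literal text "nan".
        return value.strip() == "" or value.strip().lower() == "nan"
    return bool(pd.isna(value))


# Placeholder row with one filled and two empty cells.
row = {"name": "Person A", "cp-date": float("nan"), "cp-days": ""}

filled = [column for column, value in row.items() if not is_missing(value)]
print(filled)
```

Inside the iterrows loop, you would call the helper on each cell and only call `g.add(...)` when it returns False.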
If you have more time, you can use DBpedia Spotlight to link the people mentioned in the dataset to DBpedia resources. You can reuse the code example from the last lab, but you will need some error handling for cases where DBpedia Spotlight is unable to find a match. For instance:
import spotlight
from spotlight import SpotlightException

# Assumption: the public Spotlight endpoint, as in the previous lab
SERVER = "https://api.dbpedia-spotlight.org/en/annotate"
# Parameter given to Spotlight to filter out results with confidence lower than this value
CONFIDENCE = 0.5

def annotate_entity(entity, filters={"types": "DBpedia:Person"}):
    annotations = []
    try:
        annotations = spotlight.annotate(SERVER, entity, confidence=CONFIDENCE, filters=filters)
    # This catches errors thrown from Spotlight, including when no resource is found in DBpedia
    except SpotlightException as e:
        print(e)
        # Implement some error handling here
    return annotations
Here we use the types filter with DBpedia:Person, as we only want matches that are people. You can choose to use only the URIs in the response, or the types as well. An issue here is that