Lab: Semantic Lifting - CSV: Difference between revisions

From info216
Revision as of 21:56, 12 March 2020

Lab 9: Semantic Lifting - CSV

Topics

Today's topic is lifting data in CSV format into RDF. The goal is to work through an example of how non-semantic data can be converted into RDF.

CSV stands for Comma-Separated Values, meaning that the values in each line are separated by commas.

Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.
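
To illustrate why this structure maps well onto triples, here is a minimal sketch that uses only Python's standard csv module and plain tuples in place of an RDF graph. The sample is one row from the task data further down, shortened to three columns. The name column supplies the subject, each remaining column header supplies a predicate, and each cell supplies an object.

```python
import csv
import io

# A header line and one data row, shortened from the task data.
sample = '"Name","Country","Town"\n"Achille Blaise","France","Nancy"\n'

reader = csv.DictReader(io.StringIO(sample))

triples = []
for row in reader:
    subject = row["Name"]
    for column, value in row.items():
        if column == "Name":
            continue  # the name is the subject, not a property of itself
        # column header -> predicate, cell value -> object
        triples.append((subject, column, value))

for triple in triples:
    print(triple)
```

This is only the shape of the mapping. Turning the strings into valid URIs and adding them to an rdflib graph is left to the task.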

Relevant Libraries

  • Pandas
  • Python string methods: split(), replace()
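
As a quick reminder of what these two string methods do, here is a small sketch using values from the task data. The variable names are illustrative only. strip() is not listed above, but it is handy for removing the space that follows each comma.

```python
name = "Regina Catherine Hall"
# replace() returns a new string with every space swapped for an underscore
local_name = name.replace(" ", "_")

cell = "Internet, mathematics, logistics"
# split(",") cuts the cell at each comma; strip() trims surrounding whitespace
parts = [part.strip() for part in cell.split(",")]

print(local_name)
print(parts)
```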

Tasks

Task 1

Below are four rows of CSV data, preceded by a header line, that could have been saved from a spreadsheet. Copy all five lines into a file in your project folder and write a program with a loop that reads each row from that file (skipping the initial header line) and adds it to your graph as triples:

"Name","Gender","Country","Town","Expertise","Interests"
"Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music travelling"
"Achille Blaise","M","France","Nancy","","Chess, computer games"
"Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks","Hiking, botany"
"Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"

When solving the task take note of the following:

  • The subject of the triples will be the names of the people. The header (first line) contains the column names, which should act as the predicates of the triples.
  • Some columns, like Expertise, have multiple values for one person. You should create a separate triple for each of these values.
  • Spaces should be replaced with underscores to form a valid URI. E.g. Regina Catherine should be Regina_Catherine.
  • Any case with missing data should not form a triple.
  • For consistency, make sure all resources start with a capital letter.
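
One possible way to combine the rules above is sketched here, again with plain tuples rather than an rdflib graph. The helper names is_missing, to_resource and row_to_triples are hypothetical and not part of the lab's starter code. The sketch assumes an empty cell arrives as NaN, which is how pandas.read_csv represents it by default, and it treats every object as a resource name. Whether some values should be literals instead is a modelling choice this sketch does not settle.

```python
import math


def is_missing(value):
    """True for None, NaN (an empty cell read by pandas) or blank text."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def to_resource(text):
    """Trim, capitalise the first letter and replace spaces with underscores."""
    text = text.strip()
    return (text[:1].upper() + text[1:]).replace(" ", "_")


def row_to_triples(row, subject_column="Name"):
    """Turn one row (a dict or pandas Series) into a list of string triples."""
    subject = to_resource(row[subject_column])
    triples = []
    for column, cell in row.items():
        if column == subject_column or is_missing(cell):
            continue  # no triple for the subject itself or for missing data
        for value in str(cell).split(","):
            if not is_missing(value):
                triples.append((subject, column, to_resource(value)))
    return triples


# The second person from the task data; the empty Expertise cell is NaN.
row = {"Name": "Achille Blaise", "Gender": "M", "Country": "France",
       "Town": "Nancy", "Expertise": float("nan"),
       "Interests": "Chess, computer games"}
result = row_to_triples(row)
for triple in result:
    print(triple)
```

Because a pandas Series also provides items(), the same function should accept the row objects yielded by iterrows(). Each resulting (s, p, o) tuple could then be added to the graph with something like g.add((ex[s], ex[p], ex[o])), assuming the ex namespace from the starter code.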


Code to Get Started

from rdflib import Graph, Literal, Namespace, URIRef

import pandas as pd

csv_data = pd.read_csv("task1.csv")

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)

# Iterate through each row. First, select the subject of the triples, which will be the person's name.
for index, row in csv_data.iterrows():
    subject = row['Name'].replace(" ", "_")

    # Continue code here:



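# serialize() returns a str in rdflib 6.0 and later; on older versions it returns bytes, so append .decode() there.
print(g.serialize(format="turtle"))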

Useful Readings