Latest revision as of 12:10, 17 March 2022
Lab 9: Semantic Lifting - XML
Topics
The first task for today is to finish the CSV lab (https://wiki.uib.no/info216/index.php/Lab:_Semantic_Lifting_-_CSV): complete parsing and lifting the file, and implement DBpedia Spotlight on at least one of the columns.
If you have completed that, you can start lifting XML data in the tasks below. XML stands for Extensible Markup Language and is still widely used for data storage and transfer, especially on the web.
XML has a tree structure similar to HTML, consisting of a root element, children and parent elements, attributes and so on. The goal is for you to learn an example of how we can convert unsemantic data into RDF.
Relevant Libraries/Functions
import requests
import xml.etree.ElementTree as ET
- ET.ElementTree()
- ET.parse('xmlfile.xml')
- ET.fromstring("XML_data_as_string")
All parts of the XML tree are considered Elements.
- ElementTree.getroot()
- Element.findall("path_in_tree")
- Element.find("name_of_tag")
- Element.text
- Element.attrib["name_of_attribute"] (or Element.get("name_of_attribute"))
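A minimal sketch of how these functions fit together, using a small made-up XML snippet (the element and attribute names here are only for illustration):

```python
import xml.etree.ElementTree as ET

# A small, made-up XML document as a string
xml_data = """
<catalog>
    <book id="1"><title>RDF Basics</title></book>
    <book id="2"><title>XML in Practice</title></book>
</catalog>
"""

# fromstring() builds the root Element directly from a string;
# ET.parse('file.xml') would build an ElementTree from a file instead
root = ET.fromstring(xml_data)

# findall() returns all matching child elements
books = []
for book in root.findall("book"):
    # .attrib is a dictionary of the element's attributes
    book_id = book.attrib["id"]
    # .find() returns the first matching child; .text gives its text content
    title = book.find("title").text
    books.append((book_id, title))

print(books)
```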
Tasks
Task 1
Lift the XML data from http://feeds.bbci.co.uk/news/rss.xml about news articles by BBC_News into RDF triples.
You can view the actual XML structure of the data by pressing Ctrl+U once you have opened the link in a browser.
The actual data about the news articles is stored under the <item></item> tags.
For instance, a triple could be of the form: news_paper_id - hasTitle - titleValue.
Do this by parsing the XML using ElementTree (see import above).
I recommend starting with the code at the bottom of the page and building on it. This code retrieves the XML with an HTTP request and saves it to an XML file, so that you can view and parse it easily.
You can use this regex (string matcher) to get only the IDs from the full URL in the <guid> data:
news_id = re.findall(r'\d+$', news_id)[0]
Task 2
Parse the fictional XML data below and add the correct journalists as the writers of the news articles from earlier. This means that, e.g., if a news article was written on a Tuesday, Thomas Smith is the one who wrote it. One way to do this is to check whether any of the days in the "whenWriting" attribute is contained in the news article's "pubDate".
<data>
<news_publisher name="BBC News">
<journalist whenWriting="Mon, Tue, Wed" >
<firstname>Thomas</firstname>
<lastname>Smith</lastname>
</journalist>
<journalist whenWriting="Thu, Fri" >
<firstname>Joseph</firstname>
<lastname>Olson</lastname>
</journalist>
<journalist whenWriting="Sat, Sun" >
<firstname>Sophia</firstname>
<lastname>Cruise</lastname>
</journalist>
</news_publisher>
</data>
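One possible sketch of the matching logic, using the XML above embedded as a string (pubDate strings in RSS feeds start with the weekday abbreviation, e.g. "Tue, 15 Mar 2022 ..."):

```python
import xml.etree.ElementTree as ET

journalist_xml = """
<data>
    <news_publisher name="BBC News">
        <journalist whenWriting="Mon, Tue, Wed">
            <firstname>Thomas</firstname>
            <lastname>Smith</lastname>
        </journalist>
        <journalist whenWriting="Thu, Fri">
            <firstname>Joseph</firstname>
            <lastname>Olson</lastname>
        </journalist>
        <journalist whenWriting="Sat, Sun">
            <firstname>Sophia</firstname>
            <lastname>Cruise</lastname>
        </journalist>
    </news_publisher>
</data>
"""

root = ET.fromstring(journalist_xml)

def writer_for(pub_date):
    # Return the full name of the journalist whose writing days match pubDate
    for journalist in root.findall("news_publisher/journalist"):
        days = journalist.attrib["whenWriting"].split(", ")
        if any(day in pub_date for day in days):
            first = journalist.find("firstname").text
            last = journalist.find("lastname").text
            return first + " " + last
    return None

print(writer_for("Tue, 15 Mar 2022 12:00:00 GMT"))  # Thomas Smith
```

From there you would add a triple such as article - writtenBy - journalist for each article in your graph.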
If You have more Time
Extend the graph using the PROV vocabulary to describe Agents and Entities. For instance, we want to say that the news articles originate from BBC, and that the journalists act on behalf of BBC.
Code to Get Started
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD
import xml.etree.ElementTree as ET
import requests
import re
g = Graph()
ex = Namespace("http://example.org/")
prov = Namespace("http://www.w3.org/ns/prov#")
g.bind("ex", ex)
g.bind("prov", prov)
# URL of xml data
url = 'http://feeds.bbci.co.uk/news/rss.xml'
# Retrieve the xml data from the web-url.
resp = requests.get(url)
# Creating an ElementTree from the response content
tree = ET.ElementTree(ET.fromstring(resp.content))
# Or save the xml data to a .xml file and parse the tree from that file instead
with open('news.xml', 'wb') as f:
    f.write(resp.content)
# tree = ET.parse('news.xml')
Useful Reading

- XML-parsing-python by geeksforgeeks.org: https://www.geeksforgeeks.org/xml-parsing-python/
- XML information by w3schools.com: https://www.w3schools.com/xml/xml_whatis.asp
- PROV vocabulary: https://www.w3.org/TR/prov-o/#description