=Lab 11: Semantic Lifting - HTML=
==Topics==
Today's topic involves lifting data in HTML format into RDF. HTML stands for HyperText Markup Language and is used to describe the structure and content of websites.

HTML has a tree structure, consisting of a root element, child and parent elements, attributes and so on. The goal is for you to learn, through an example, how we can convert non-semantic data into RDF.

To parse the HTML, we will use the Python library BeautifulSoup.
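As a minimal illustration of this tree structure, here is a tiny made-up HTML document (its class names mirror the ones this lab uses later) parsed with BeautifulSoup:
<syntaxhighlight>
from bs4 import BeautifulSoup as bs

# A tiny HTML tree: <html> is the root, <body> its child, and the
# class="..." parts are attributes.
doc = """
<html>
  <body>
    <h1 class="entity-name">Knowledge Graph</h1>
    <div class="flex-container">
      <div class="timeline-paper-title">Some paper title</div>
    </div>
  </body>
</html>
"""

html = bs(doc, features="html.parser")
print(html.find('h1', attrs={'class': 'entity-name'}).text)  # -> Knowledge Graph
</syntaxhighlight>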
==Relevant Libraries/Functions==
*from bs4 import BeautifulSoup as bs
*import requests
*import re
*beautifulsoup.find()
*beautifulsoup.find_all()
*string.replace(), string.split()
*re.findall()
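The string and regex helpers are the workhorses for cleaning scraped text. A quick sketch (the example string is made up):
<syntaxhighlight>
import re

s = "Corpus ID: 123456"
print(s.replace("Corpus ID: ", ""))  # -> '123456'
print(s.split(": "))                 # -> ['Corpus ID', '123456']
print(re.findall(r'\d+$', s))        # -> ['123456']
</syntaxhighlight>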
==Tasks==
'''Task 1'''

First, install BeautifulSoup:
 pip install beautifulsoup4
'''Lift the HTML information about research articles found at this link into triples: "https://www.semanticscholar.org/topic/Knowledge-Graph/159858"'''

Each paper will be represented by its Corpus ID (the subject of its triples). For example, a paper has a title, a year, authors and so on.

I recommend right-clicking on the web page itself and clicking 'inspect' in order to get a readable version of the HTML.
Now you can hover over the HTML tags on the right side to easily find information like the ID of the paper.

For example, we can see that the main topic of the page, "Knowledge Graph", is under an 'h1' tag with the class attribute "entity-name".

Knowing this, we can use BeautifulSoup to find it in Python code, e.g.:
<syntaxhighlight>
topic = html.find('h1', attrs={'class': 'entity-name'}).text
</syntaxhighlight>
Similarly, to find multiple values at once, we use find_all instead. For example, here I am selecting all the papers, which I can then iterate through:
<syntaxhighlight>
papers = html.find_all('div', attrs={'class': 'flex-container'})

for paper in papers:
    # e.g. selecting the title.
    title = paper.find('div', attrs={'class': 'timeline-paper-title'})
    print(title.text)
</syntaxhighlight>
You can use this regex to extract the numeric ID from the Corpus ID string, or the topic ID (which is at the end of the URL):
<syntaxhighlight>
id = re.findall(r'\d+$', id)[0]
</syntaxhighlight>
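For instance, the extracted number can be used to mint the paper's URI. A minimal sketch, assuming the http://example.org/ namespace from the starter code below and a made-up Corpus ID string:
<syntaxhighlight>
import re
from rdflib import Namespace

ex = Namespace("http://example.org/")

corpus_id = "Corpus ID: 123456"               # placeholder for the scraped string
paper_id = re.findall(r'\d+$', corpus_id)[0]  # -> '123456'
paper = ex[paper_id]                          # URIRef to use as the paper's subject
</syntaxhighlight>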
'''Task 2'''
Create triples for the topic of the page ("Knowledge Graph").

For example, a topic has related topics (on the top-right of the page). It also has "Known as" values and a description.

This is a good opportunity to use the SKOS vocabulary to describe Concepts.
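A minimal sketch of what such triples could look like, assuming the http://example.org/ namespace from the starter code below and made-up label values:
<syntaxhighlight>
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

g = Graph()
ex = Namespace("http://example.org/")
g.bind("skos", SKOS)

topic = ex['159858']  # the topic ID taken from the end of the URL
g.add((topic, RDF.type, SKOS.Concept))
g.add((topic, SKOS.prefLabel, Literal("Knowledge Graph")))
g.add((topic, SKOS.altLabel, Literal("a 'Known as' value goes here")))
g.add((topic, SKOS.definition, Literal("the scraped description goes here")))
g.add((topic, SKOS.related, ex['some-related-topic-id']))  # one per related topic
</syntaxhighlight>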
==If You have more Time==
If you look at the web page, you can see that there are buttons for expanding the description, the related topics and more.

This is a problem, as BeautifulSoup won't find this additional information until those buttons have been pressed.

Use the Python library '''selenium''' to simulate a user pressing the 'expand' buttons, so that you get all the triples you should get.
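A minimal sketch of that approach, assuming Firefox with geckodriver installed; the CSS selector is a placeholder that you must replace with whatever the real expand buttons use (find it with 'inspect'):
<syntaxhighlight>
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("https://www.semanticscholar.org/topic/Knowledge-Graph/159858")

# Placeholder selector: click every expand-style button on the page.
for button in driver.find_elements(By.CSS_SELECTOR, "button.expand"):
    button.click()

# Hand the now-expanded page over to BeautifulSoup as before.
html = bs(driver.page_source, features="html.parser")
driver.quit()
</syntaxhighlight>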
==Code to Get Started (Make sure you understand it)==
<syntaxhighlight>
from bs4 import BeautifulSoup as bs
from rdflib import Graph, Literal, URIRef, Namespace
from rdflib.namespace import RDF, OWL, SKOS, RDFS, XSD
import requests
import re

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)

# Download html from URL and parse it with BeautifulSoup.
url = "https://www.semanticscholar.org/topic/Knowledge-Graph/159858"
page = requests.get(url)
html = bs(page.content, features="html.parser")
# print(html.prettify())

# Find the html that surrounds all the papers.
papers = html.find_all('div', attrs={'class': 'flex-container'})

# Iterate through each paper to make triples:
for paper in papers:
    # e.g. selecting the title.
    title = paper.find('div', attrs={'class': 'timeline-paper-title'}).text
    print(title)
</syntaxhighlight>
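To sketch where Task 1 is heading (the ex.title property and the Corpus ID string below are made-up placeholders; locate the real Corpus ID element on the page yourself with 'inspect'), the loop above could be extended roughly like this:
<syntaxhighlight>
# Inside the loop above, after scraping the title:
corpus_id = "Corpus ID: 123456"                # placeholder for the scraped string
paper = ex[re.findall(r'\d+$', corpus_id)[0]]  # the paper's URI
g.add((paper, ex.title, Literal(title)))

# After the loop, inspect the result (on older rdflib versions,
# serialize() returns bytes and needs .decode()).
print(g.serialize(format="turtle"))
</syntaxhighlight>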
==Useful Reading==
* [https://www.dataquest.io/blog/web-scraping-tutorial-python/ Dataquest.io - Web-scraping with Python]