Lab: Semantic Lifting - HTML
From info216
Revision as of 07:11, 27 March 2020
Lab 10: Semantic Lifting - HTML
Topics
Today's topic is lifting data in HTML format into RDF. HTML (HyperText Markup Language) describes the structure and content of web pages. An HTML document forms a tree, consisting of a root element, parent and child elements, attributes, and text. The goal is for you to learn, by example, how to convert non-semantic data into RDF.
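The tree structure mentioned above is what BeautifulSoup exposes for navigation. A minimal sketch, using an invented HTML snippet (the class names mirror those used later in the lab code, but the content here is made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny invented page standing in for a real download.
doc = """
<html>
  <body>
    <h1 class="entity-name">Knowledge Graph</h1>
    <div class="flex-container">
      <div class="timeline-paper-title">Paper A</div>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(doc, "html.parser")

# find() returns the first matching element in the tree.
h1 = soup.body.find("h1", attrs={"class": "entity-name"})
print(h1.text)         # the element's text content
print(h1.parent.name)  # its parent element in the tree: "body"
```

Every element is a node in the tree, so you can walk up (`.parent`), down (`.find`, `.find_all`), and sideways (`.next_sibling`) from it.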
Relevant Libraries/Functions
from bs4 import BeautifulSoup
Tasks
Task 1
pip install beautifulsoup4
Task 2
If you have more time
Code to Get Started
from bs4 import BeautifulSoup as bs
from rdflib import Graph, Literal, URIRef, Namespace
from rdflib.namespace import RDF, OWL, SKOS
import requests
from selenium import webdriver  # not used in this snippet
g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)
# Download html from URL and parse it with BeautifulSoup.
url = "https://www.semanticscholar.org/topic/Knowledge-Graph/159858"
page = requests.get(url)
html = bs(page.content, features="html.parser")
# print(html.prettify())
# This is the topic of the webpage: "Knowledge graph".
topic = html.body.find('h1', attrs={'class': 'entity-name'}).text
print(topic)
# Find the HTML elements that surround the individual papers.
papers = html.body.find_all('div', attrs={'class': 'flex-container'})
# Iterate through each paper to make triples:
for paper in papers:
    # e.g. selecting the title.
title = paper.find('div', attrs={'class': 'timeline-paper-title'})
print(title.text)
Hints