Lab: Using Graph Embeddings: Difference between revisions

From info216
(14 intermediate revisions by 2 users not shown)
Line 1: Line 1:
=Lab 13: Using Graph Embeddings=


==Topics==
==Topics==
Using knowledge graph embeddings with TorchKGE.
Using knowledge graph embeddings with TorchKGE.


<!-- ==Tutorial== -->
==Useful readings==
 
* [https://torchkge.readthedocs.io/en/latest/ Welcome to TorchKGE’ s documentation!]
 
* The following TorchKGE classes are central:
==Classes and methods==
** ''KnowledgeGraph'' - contains the knowledge graph (KG)
The following TorchKGE classes are central:
** ''Model'' - contains the embeddings (entity and relation vectors) for some KG
* '''KnowledgeGraph''' - contains the knowledge graph (KG)
* [https://pytorch.org/docs/stable/tensors.html PyTorch Tensor Documentation]
* '''Model''' - contains the embeddings (entity and relation vectors) for some KG
 
<!--
<syntaxhighlight>
</syntaxhighlight>
-->


==Tasks==
==Tasks==
 
'''Task: knowledge graph''':
'''Knowledge Graph''':
* Use a [https://torchkge.readthedocs.io/en/latest/reference/utils.html#pre-trained-models dataset loader] to load a KG you want to work with. Freebase FB15k237 is a good choice. (You will need a pre-trained model for your KG later, to choose one of FB15k, FB15k237, WDV5, WN18RR, or Yago3-10. This lab has mostly been tested on FB15k.)
* Use a [https://torchkge.readthedocs.io/en/latest/reference/utils.html#pre-trained-models dataset loader] to load a KG you want to work with. Freebase FB15k is a good choice. (You will need a pre-trained model for your KG later, to choose one of FB15k, FB15k237, WDV5, WN18RR, or Yago3-10. This lab has mostly been tested on FB15k.)
* Use the methods provided by the [https://torchkge.readthedocs.io/en/latest/reference/data.html#knowledge-graph KnowledgeGraph class] to inspect the KG.  
* Use the methods provided by the [https://torchkge.readthedocs.io/en/latest/reference/data.html#knowledge-graph KnolwedgeGraph class] to inspect the KG.  
** Print out the numbers of entities, relations, and facts in the training, validation, and testing sets.  
** Print out the numbers of entities, relations, and facts in the training, validation, and testing sets.  
** Print the identifiers for the first 10 entities and relations (''tip:'' ent2ix and rel2ix).
** Print the identifiers for the first 10 entities and relations (''tip:'' ent2ix and rel2ix).


'''External identifiers''':
'''Task: external identifiers''':
* Download a dataset that provides more understandable labels for the entities (and perhaps relations) in your KnowledgeGraph
* Download a dataset that provides more understandable labels for the entities (and perhaps relations) in your KnowledgeGraph
** If you use FB15k, the relation names are not so bad, but the entity identifiers do not give much meaning. Same with WordNet. [https://github.com/villmow/datasets_knowledge_embedding This repository] contains mappings for the Freebase and WordNet datasets.
** If you use FB15k, the relation names are not so bad, but the entity identifiers do not give much meaning. Same with WordNet. [https://github.com/villmow/datasets_knowledge_embedding This repository] contains mappings for the Freebase and WordNet datasets.
** If you use a Wikidata graph, the entities and relations are all P- and Q-codes. To get labels, you can try a combination of [https://query.wikidata.org/ SPARQL queries] and [https://pypi.org/project/Wikidata/ this API].
** If you use a Wikidata graph, the entities and relations are all P- and Q-codes. To get labels, you can try a combination of [https://query.wikidata.org/ SPARQL queries] and [https://pypi.org/project/Wikidata/ this API].
* Create mappings from external label to entity (and perhaps relation) ids in the KnowledgeGraph. Also create the inverse mappings.
* Create mappings from external labels to entity ids (and perhaps relation ids) in the KnowledgeGraph. Also create the inverse mappings.


'''Test entities and relations''':
'''Task: test entities and relations''':
* Get the KG indexes for a few entities and relations. If you use the Freebase or Wikidata graphs, you can try 'J. K. Rowling' and 'WALL·E' as entities (''note'' that the dot in 'WALL·E' is not a hyphen or usual period.) For relations you can try 'influenced by' and 'genre'.
* Get the KG indexes for a few entities and relations. If you use the Freebase or Wikidata graphs, you can try 'J. K. Rowling' and 'WALL·E' as entities (''note'' that the dot in 'WALL·E' is not a hyphen or usual period.) For relations you can try 'influenced by' and 'genre'. (''tip'': to check names of entites and relations, open the train.txt file you cloned)


'''Model''':
'''Task: model''':
* Load a [https://torchkge.readthedocs.io/en/latest/reference/utils.html#pre-trained-models pre-trained TransE model] that matches your KnowledgeGraph.
* Load a [https://torchkge.readthedocs.io/en/latest/reference/utils.html#pre-trained-models pre-trained TransE model] that matches your KnowledgeGraph.
** Print out the numbers of entities, relations, and the dimensions of the entity and relation vectors. Do they match your KnowledgeGraph.  
** Print out the numbers of entities, relations, and [https://torchkge.readthedocs.io/en/latest/reference/models.html#transe the dimensions] of the entity and relation vectors. Do they match your KnowledgeGraph.  
* Get the vectors for your test entities and relations (for example, 'J. K. Rowling' and 'influenced by').
* Get the vectors for your test entities and relations (for example, 'J. K. Rowling' and 'influenced by').
* Find vectors for a few more entities (both unrelated and related ones, e.g., 'J. R. R. Tolkien', 'C. S. Lewis', ...). Use the [https://torchkge.readthedocs.io/en/latest/reference/models.html#translationalmodels model.dissimilarity()-method] to estimate how semantically close your entities are. Do the distances make sense?
* Find vectors for a few more entities (both unrelated and related ones, e.g., 'J. R. R. Tolkien', 'C. S. Lewis', ...). Use the [https://torchkge.readthedocs.io/en/latest/reference/models.html#translationalmodels model.dissimilarity()-method] to estimate how semantically close your entities are. Do the distances make sense?


'''K-nearest neighbours''':
'''Task: K-nearest neighbours''':
* Find the indexes of the 10 entity vectors that are nearest neighbours to your entity of choice. You can use [https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html sciKit-learn's sklearn.neighbors.NearestNeighbors.kneighbors()-method] for this.
* Find the indexes of the 10 entity vectors that are nearest neighbours to your entity of choice. You can use [https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html sciKit-learn's sklearn.neighbors.NearestNeighbors.kneighbors()-method] for this.
* Map the indexes of the 10-nearest neighbouring entities back into human-understandable labels. Does this make sense? Try the same thing with another entity (e.g., 'WALL·E').
* Map the indexes of the 10-nearest neighbouring entities back into human-understandable labels. Does this make sense? Try the same thing with another entity (e.g., 'WALL·E').


'''Translation''':
'''Task: translation''':
* Add together the vectors for an entity and a relation that that gives meaning for the entity (e.g., 'J. K. Rowling' - 'influenced by', 'WALL·E' - 'genre'). Find the 10-nearest neighbouring entities for the vector sum. Does this make sense? Try more entities and relations. Try to find examples that work and that do not work well.
* Add together the vectors for an entity and a relation that that gives meaning for the entity (e.g., 'J. K. Rowling' - 'influenced by', 'WALL·E' - 'genre'). Find the 10-nearest neighbouring entities for the vector sum. Does this make sense? Try more entities and relations. Try to find examples that work and that do not work well.


==Code to get started==
==Code to get started==
With graph embeddings, we ideally want to work with ipynb files. The code below is prepared in the following link: https://colab.research.google.com/drive/1gS2D1XYSviAmhkS8moJIpY0N8ltJFM3C


<syntaxhighlight>
<syntaxhighlight>
Line 61: Line 53:
print(list(kg_train.ent2ix.keys())[-10:])
print(list(kg_train.ent2ix.keys())[-10:])
print(list(kg_train.rel2ix.keys())[-10:])
print(list(kg_train.rel2ix.keys())[-10:])




Line 78: Line 69:
}
}
lbl2ent = {lbl: ent for ent, lbl in ent2lbl.items()}
lbl2ent = {lbl: ent for ent, lbl in ent2lbl.items()}


print([
print([
Line 84: Line 74:
     for ent in kg_train.ent2ix.keys()
     for ent in kg_train.ent2ix.keys()
     if ent in ent2lbl][-10:])
     if ent in ent2lbl][-10:])
</syntaxhighlight>
</syntaxhighlight>


==If You Have More Time==
==If You Have More Time==
* Try it out with different datasets, for example one you create youreself using SPARQL queries on an open KG.
* Try it out with different datasets, for example one you create youreself using SPARQL queries on an open KG.
==Useful readings==
* [https://torchkge.readthedocs.io/en/latest/ Welcome to TorchKGE’ s documentation!]

Latest revision as of 12:03, 2 May 2024

Topics

Using knowledge graph embeddings with TorchKGE.

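Useful readings

  • Welcome to TorchKGE's documentation! (https://torchkge.readthedocs.io/en/latest/)
  • The following TorchKGE classes are central:
    • KnowledgeGraph - contains the knowledge graph (KG)
    • Model - contains the embeddings (entity and relation vectors) for some KG
  • PyTorch Tensor Documentation (https://pytorch.org/docs/stable/tensors.html)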

Tasks

Task: knowledge graph:

  • Use a dataset loader to load a KG you want to work with. Freebase FB15k237 is a good choice. (You will need a pre-trained model for your KG later, so choose one of FB15k, FB15k237, WDV5, WN18RR, or Yago3-10. This lab has mostly been tested on FB15k.)
  • Use the methods provided by the KnowledgeGraph class to inspect the KG.
    • Print out the numbers of entities, relations, and facts in the training, validation, and testing sets.
    • Print the identifiers for the first 10 entities and relations (tip: ent2ix and rel2ix).
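
The inspection step above can be sketched as a small helper. This is only a sketch: it assumes the loaded object exposes the attributes n_ent, n_rel, n_facts, ent2ix and rel2ix, as a TorchKGE KnowledgeGraph is expected to. The name describe_split and the toy stand-in object are made up for illustration so that the code runs without downloading a dataset.

```python
from types import SimpleNamespace


def describe_split(name, kg, n_preview=10):
    """Summarise one split of a knowledge graph.

    Assumes `kg` exposes n_ent, n_rel, n_facts, ent2ix and rel2ix.
    """
    return {
        "split": name,
        "entities": kg.n_ent,
        "relations": kg.n_rel,
        "facts": kg.n_facts,
        # Dicts keep insertion order, so slicing the keys gives the first ones.
        "first_entities": list(kg.ent2ix)[:n_preview],
        "first_relations": list(kg.rel2ix)[:n_preview],
    }


# Toy stand-in with invented identifiers (not real Freebase ids).
toy_kg = SimpleNamespace(
    n_ent=3,
    n_rel=2,
    n_facts=4,
    ent2ix={"/m/aaa": 0, "/m/bbb": 1, "/m/ccc": 2},
    rel2ix={"/rel/one": 0, "/rel/two": 1},
)

summary = describe_split("train", toy_kg, n_preview=2)
print(summary)
```

In the lab you would call the helper once per split, for example describe_split('train', kg_train), and compare the three summaries. Note that the starter code further down slices with [-10:], which prints the last ten identifiers rather than the first ten.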

Task: external identifiers:

  • Download a dataset that provides more understandable labels for the entities (and perhaps relations) in your KnowledgeGraph.
    • If you use FB15k, the relation names are not so bad, but the entity identifiers do not give much meaning. Same with WordNet. This repository contains mappings for the Freebase and WordNet datasets.
    • If you use a Wikidata graph, the entities and relations are all P- and Q-codes. To get labels, you can try a combination of SPARQL queries and this API.
  • Create mappings from external labels to entity ids (and perhaps relation ids) in the KnowledgeGraph. Also create the inverse mappings.
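
One possible way to build these mappings is sketched below. It assumes you already have ent2ix (entity identifier to KG index, as in the KnowledgeGraph) and ent2lbl (entity identifier to label, as built in the starter code). The function name build_label_index and the toy identifiers are hypothetical. Labels are not guaranteed to be unique, so the sketch keeps the first index for a repeated label and reports the duplicates.

```python
def build_label_index(ent2ix, ent2lbl):
    """Return (lbl2ix, ix2lbl, duplicates) for entities that have a label."""
    lbl2ix = {}
    ix2lbl = {}
    duplicates = set()
    for ent, ix in ent2ix.items():
        lbl = ent2lbl.get(ent)
        if lbl is None:
            continue  # no human-readable label for this entity
        ix2lbl[ix] = lbl  # index -> label is always unambiguous
        if lbl in lbl2ix:
            duplicates.add(lbl)  # keep the first index seen for this label
        else:
            lbl2ix[lbl] = ix
    return lbl2ix, ix2lbl, duplicates


# Toy data with invented identifiers and labels.
toy_ent2ix = {"/m/aaa": 0, "/m/bbb": 1, "/m/ccc": 2, "/m/ddd": 3}
toy_ent2lbl = {"/m/aaa": "Alpha", "/m/bbb": "Beta", "/m/ddd": "Alpha"}

lbl2ix, ix2lbl, duplicates = build_label_index(toy_ent2ix, toy_ent2lbl)
print(lbl2ix)
print(ix2lbl)
print(duplicates)
```

The same pattern should work for relations if you have a label file for them. With FB15k-style data the relation names are already readable, so rel2ix can probably be used directly.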

Task: test entities and relations:

  • Get the KG indexes for a few entities and relations. If you use the Freebase or Wikidata graphs, you can try 'J. K. Rowling' and 'WALL·E' as entities (note that the dot in 'WALL·E' is neither a hyphen nor an ordinary period). For relations you can try 'influenced by' and 'genre'. (tip: to check the names of entities and relations, open the train.txt file in the repository you cloned)
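
When an exact label lookup fails, a case-insensitive substring search over the labels can help you find the spelling used in the dataset. The sketch below is illustrative only: find_labels is a hypothetical helper, and the dot in 'WALL·E' is assumed to be U+00B7 (MIDDLE DOT), which you should verify against your own label file.

```python
def find_labels(query, labels):
    """Case-insensitive substring search over known labels."""
    needle = query.casefold()
    return sorted(lbl for lbl in labels if needle in lbl.casefold())


# A few labels mentioned in this lab; "\u00b7" is the assumed middle dot.
labels = ["J. K. Rowling", "J. R. R. Tolkien", "WALL\u00b7E"]

print(find_labels("wall", labels))
print(find_labels("j. ", labels))

# Look-alike spellings are different strings, so an exact lookup would miss them.
print("WALL\u00b7E" == "WALL.E")
print("WALL\u00b7E" == "WALL-E")
```

In the lab you could pass the keys of your label-to-index mapping as labels and then use the exact match that comes back for the index lookup.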

Task: model:

  • Load a pre-trained TransE model that matches your KnowledgeGraph.
    • Print out the numbers of entities, relations, and the dimensions of the entity and relation vectors. Do they match your KnowledgeGraph?
  • Get the vectors for your test entities and relations (for example, 'J. K. Rowling' and 'influenced by').
  • Find vectors for a few more entities (both unrelated and related ones, e.g., 'J. R. R. Tolkien', 'C. S. Lewis', ...). Use the model.dissimilarity()-method to estimate how semantically close your entities are. Do the distances make sense?
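
To compare several entities at once, it can help to tabulate the dissimilarity for every pair. The sketch below uses plain Python lists and a squared Euclidean distance as a stand-in. It does not call TorchKGE, and whether this distance matches what model.dissimilarity() computes depends on how the model was configured. The vectors and the names author_1, author_2 and film_1 are invented toy values, not real embeddings.

```python
import math
from itertools import combinations


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_dissimilarity(vectors, dissimilarity=squared_l2):
    """Map each unordered pair of labels to the dissimilarity of their vectors."""
    return {
        (a, b): dissimilarity(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


# Toy 3-dimensional vectors: two "related" entities and one "unrelated" entity.
toy_vectors = {
    "author_1": [0.1, 0.9, 0.0],
    "author_2": [0.2, 0.8, 0.1],
    "film_1": [0.9, 0.1, 0.5],
}

table = pairwise_dissimilarity(toy_vectors)
for pair, value in sorted(table.items(), key=lambda item: item[1]):
    print(pair, round(value, 3))
```

In the lab you would replace toy_vectors with the vectors you extracted from the model and pass a wrapper around model.dissimilarity() as the dissimilarity argument, so that the table reflects the model's own measure.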

Task: K-nearest neighbours:

  • Find the indexes of the 10 entity vectors that are nearest neighbours to your entity of choice. You can use scikit-learn's sklearn.neighbors.NearestNeighbors.kneighbors()-method for this.
  • Map the indexes of the 10-nearest neighbouring entities back into human-understandable labels. Does this make sense? Try the same thing with another entity (e.g., 'WALL·E').
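
To see what the nearest-neighbour search does, here is a brute-force version in plain Python. It is a stand-in for scikit-learn's NearestNeighbors, not a use of it, and it assumes Euclidean distance. The helper name nearest_neighbours and the toy vectors are made up for illustration.

```python
import heapq
import math


def nearest_neighbours(query, vectors, k=10):
    """Return [(index, distance)] for the k rows of `vectors` closest to `query`."""
    distances = ((ix, math.dist(query, vec)) for ix, vec in enumerate(vectors))
    return heapq.nsmallest(k, distances, key=lambda pair: pair[1])


# Toy 2-dimensional "entity vectors"; the row position plays the role of the KG index.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]

neighbours = nearest_neighbours(toy_vectors[0], toy_vectors, k=3)
print(neighbours)
```

When the query is itself one of the entity vectors, it comes back as its own nearest neighbour at distance 0, so you may want to ask for one extra neighbour and drop the first. With scikit-learn, the corresponding steps would be to fit NearestNeighbors(n_neighbors=10) on the entity embedding matrix and call kneighbors() with a two-dimensional query array, which returns distances and indexes.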

Task: translation:

  • Add together the vectors for an entity and a relation that is meaningful for that entity (e.g., 'J. K. Rowling' + 'influenced by', 'WALL·E' + 'genre'). Find the 10-nearest neighbouring entities for the vector sum. Does this make sense? Try more entities and relations. Try to find examples that work well and examples that do not.
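
The idea behind this task is that a translational model such as TransE tries to place the tail entity of a fact near head + relation. A minimal sketch of that step follows, again in plain Python with invented toy vectors rather than real embeddings. The function names translate and rank_tails are hypothetical, and Euclidean distance is an assumption.

```python
import math


def translate(head, relation):
    """Element-wise sum of a head-entity vector and a relation vector."""
    return [h + r for h, r in zip(head, relation)]


def rank_tails(head, relation, ent_vectors, k=10):
    """Return the k entity names whose vectors are closest to head + relation."""
    target = translate(head, relation)
    ranked = sorted(
        ent_vectors,
        key=lambda name: math.dist(target, ent_vectors[name]),
    )
    return ranked[:k]


# Toy 2-dimensional vectors; "tail" is placed close to head + relation on purpose.
toy_entities = {
    "head": [0.0, 0.0],
    "tail": [1.0, 1.0],
    "other": [-2.0, 3.0],
}
toy_relation = [0.9, 1.1]

candidates = rank_tails(toy_entities["head"], toy_relation, toy_entities, k=2)
print(candidates)
```

With real embeddings the head entity itself often appears among the nearest candidates, and the ranking can be poor for relations with many valid tails, which is one way to find the examples that do not work well.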

Code to get started

With graph embeddings, we ideally want to work with ipynb files. The code below is prepared in the following link: https://colab.research.google.com/drive/1gS2D1XYSviAmhkS8moJIpY0N8ltJFM3C

!pip install torchkge
!pip install scikit-learn
!git clone https://github.com/villmow/datasets_knowledge_embedding.git

from torchkge.utils.datasets import load_fb15k237

kg_train, kg_val, kg_test = load_fb15k237()

print(list(kg_train.ent2ix.keys())[-10:])
print(list(kg_train.rel2ix.keys())[-10:])


"""Download files with human-readable labels for (most) Freebase entities used in the dataset. 
Labels seem to be missing for some entities used in FB15k-237."""

import json

TEXT_TRIPLES_DIR = 'datasets_knowledge_embedding/FB15k-237/'
with open(TEXT_TRIPLES_DIR+'entity2wikidata.json') as file:
    _entity2wikidata = json.load(file)

ent2lbl = {
    ent: wd['label']
    for ent, wd in _entity2wikidata.items()
}
lbl2ent = {lbl: ent for ent, lbl in ent2lbl.items()}

print([
    ent2lbl[ent] 
    for ent in kg_train.ent2ix.keys()
    if ent in ent2lbl][-10:])

If You Have More Time

  • Try it out with different datasets, for example one you create yourself using SPARQL queries on an open KG.