Lab: Using Graph Embeddings

From info216
Revision as of 06:34, 18 April 2023

Topics

Using knowledge graph embeddings with TorchKGE.

Useful readings

  • Welcome to TorchKGE’s documentation!
  • The following TorchKGE classes are central:
    • KnowledgeGraph - contains the knowledge graph (KG)
    • Model - contains the embeddings (entity and relation vectors) for some KG

Tasks

Task: knowledge graph:

  • Use a dataset loader to load a KG you want to work with. Freebase FB15k237 is a good choice. (You will need a pre-trained model for your KG later, so choose one of FB15k, FB15k237, WDV5, WN18RR, or Yago3-10. This lab has mostly been tested on FB15k.)
  • Use the methods provided by the KnowledgeGraph class to inspect the KG.
    • Print out the numbers of entities, relations, and facts in the training, validation, and testing sets.
    • Print the identifiers for the first 10 entities and relations (tip: ent2ix and rel2ix).
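The inspection step above can be sketched on a toy graph. This is not TorchKGE itself, just a minimal stand-in showing what the ent2ix/rel2ix indexes and the entity, relation, and fact counts mean; the three triples and all names are made up, and TorchKGE's actual index-assignment order may differ:

```python
import pandas as pd

# Toy triple table; the real lab loads FB15k-237 with load_fb15k237().
triples = pd.DataFrame({
    'from': ['JKR', 'JRRT', 'JKR'],
    'rel':  ['influenced_by', 'influenced_by', 'occupation'],
    'to':   ['JRRT', 'CSL', 'writer'],
})

# Build ent2ix / rel2ix indexes the way a KnowledgeGraph exposes them:
# every distinct entity and relation gets a small integer index.
entities = pd.concat([triples['from'], triples['to']]).unique()
ent2ix = {e: i for i, e in enumerate(entities)}
rel2ix = {r: i for i, r in enumerate(triples['rel'].unique())}

print(len(ent2ix), len(rel2ix), len(triples))  # n_ent, n_rel, n_facts
print(list(ent2ix)[:10])                       # first entity identifiers
```

On the real FB15k-237 training set, `kg_train.n_ent`, `kg_train.n_rel`, and `kg_train.n_facts` give the corresponding counts directly.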

Task: external identifiers:

  • Download a dataset that provides more understandable labels for the entities (and perhaps relations) in your KnowledgeGraph.
    • If you use FB15k, the relation names are not so bad, but the entity identifiers do not give much meaning. Same with WordNet. This repository contains mappings for the Freebase and WordNet datasets.
    • If you use a Wikidata graph, the entities and relations are all P- and Q-codes. To get labels, you can try a combination of SPARQL queries and this API.
  • Create mappings from external labels to entity ids (and perhaps relation ids) in the KnowledgeGraph. Also create the inverse mappings.
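Composing the two mappings looks like this in a minimal sketch. The Freebase MIDs below are placeholders, not real identifiers; in the lab, ent2lbl comes from entity2wikidata.json and ent2ix from your KnowledgeGraph:

```python
# Placeholder MID -> KG index mapping (ent2ix in the real lab).
ent2ix = {'/m/000001': 0, '/m/000002': 1}

# Placeholder MID -> human-readable label mapping (from the label dataset).
ent2lbl = {'/m/000001': 'J. K. Rowling',
           '/m/000002': 'WALL·E'}

# Inverse mapping: label -> MID.
lbl2ent = {lbl: ent for ent, lbl in ent2lbl.items()}

def label_to_index(label):
    """Look up the KG index of an entity from its human-readable label."""
    return ent2ix[lbl2ent[label]]

print(label_to_index('J. K. Rowling'))
```

The same composition in the other direction (index → MID → label) is what the k-nearest-neighbours task below needs to make its results readable.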

Task: test entities and relations:

  • Get the KG indexes for a few entities and relations. If you use the Freebase or Wikidata graphs, you can try 'J. K. Rowling' and 'WALL·E' as entities (note that the dot in 'WALL·E' is not a hyphen or usual period.) For relations you can try 'influenced by' and 'genre'.

Task: model:

  • Load a pre-trained TransE model that matches your KnowledgeGraph.
    • Print out the numbers of entities, relations, and the dimensions of the entity and relation vectors. Do they match your KnowledgeGraph?
  • Get the vectors for your test entities and relations (for example, 'J. K. Rowling' and 'influenced by').
  • Find vectors for a few more entities (both unrelated and related ones, e.g., 'J. R. R. Tolkien', 'C. S. Lewis', ...). Use the model.dissimilarity()-method to estimate how semantically close your entities are. Do the distances make sense?
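The idea behind dissimilarity can be shown without a trained model: for TransE it is essentially a vector distance (L2 in this sketch). The 4-dimensional vectors below are invented stand-ins for real pre-trained embeddings, constructed so that the two authors land close together:

```python
import numpy as np

# Made-up "embeddings"; real ones come from the pre-trained model.
rowling = np.array([1.0, 0.0, 0.5, 0.0])
tolkien = np.array([0.9, 0.1, 0.5, 0.0])
walle   = np.array([-1.0, 2.0, 0.0, 1.0])

def dissimilarity(a, b):
    """L2 distance: smaller means more semantically similar."""
    return np.linalg.norm(a - b)

print(dissimilarity(rowling, tolkien))  # small: related authors
print(dissimilarity(rowling, walle))    # large: unrelated entities
```

With a real model you would call model.dissimilarity() on batches of head and tail vectors instead; whether it uses L1 or L2 depends on how the model was trained.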

Task: K-nearest neighbours:

  • Find the indexes of the 10 entity vectors that are nearest neighbours to your entity of choice. You can use scikit-learn's sklearn.neighbors.NearestNeighbors.kneighbors()-method for this.
  • Map the indexes of the 10 nearest neighbouring entities back to human-understandable labels. Does this make sense? Try the same thing with another entity (e.g., 'WALL·E').
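A minimal sketch of the kneighbors() call, using random vectors in place of the model's entity embedding matrix (entity 42 standing in for your entity of choice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))   # stand-in for the entity embedding matrix

# Fit on all entity vectors, then query with one entity's vector.
nn = NearestNeighbors(n_neighbors=10).fit(emb)
dist, idx = nn.kneighbors(emb[42].reshape(1, -1))  # query must be 2-D

print(idx[0])   # indexes of the 10 nearest entities; the first is 42 itself
```

Note that the queried entity is always its own nearest neighbour (distance 0), so in practice you may want to ask for 11 neighbours and drop the first. The returned indexes are what you map back to labels via ent2ix's inverse.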

Task: translation:

  • Add together the vectors for an entity and a relation that is meaningful for that entity (e.g., 'J. K. Rowling' + 'influenced by', 'WALL·E' + 'genre'). Find the 10 nearest neighbouring entities for the vector sum. Does this make sense? Try more entities and relations. Try to find examples that work well and examples that do not.
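The translation step above can be sketched in a hand-made 2-D space where head + relation ≈ tail holds by construction (TransE's training objective). All vectors and the entity/relation assignments are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy entity embeddings, constructed so that entity 0 + relation 0 = entity 1.
ent_emb = np.array([[0.0, 0.0],    # 0: e.g. 'J. K. Rowling'
                    [1.0, 1.0],    # 1: e.g. 'J. R. R. Tolkien'
                    [5.0, -3.0]])  # 2: e.g. 'WALL·E'
rel_emb = np.array([[1.0, 1.0]])   # 0: e.g. 'influenced by'

# Translate: head vector + relation vector, then look up the nearest entity.
query = (ent_emb[0] + rel_emb[0]).reshape(1, -1)
nn = NearestNeighbors(n_neighbors=1).fit(ent_emb)
_, idx = nn.kneighbors(query)

print(idx[0][0])   # the nearest entity is the expected tail
```

With real embeddings the sum rarely lands exactly on a tail vector, which is why the task asks for the 10 nearest neighbours rather than just one.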

Code to get started

!pip install torchkge
!pip install scikit-learn
!git clone https://github.com/villmow/datasets_knowledge_embedding.git

from torchkge.utils.datasets import load_fb15k237

kg_train, kg_val, kg_test = load_fb15k237()

print(list(kg_train.ent2ix.keys())[-10:])
print(list(kg_train.rel2ix.keys())[-10:])


"""Download files with human-readable labels for (most) Freebase entities used in the dataset. 
Labels seem to be missing for some entities used in FB15k-237."""

import json

TEXT_TRIPLES_DIR = 'datasets_knowledge_embedding/FB15k-237/'
with open(TEXT_TRIPLES_DIR+'entity2wikidata.json') as file:
    _entity2wikidata = json.load(file)

ent2lbl = {
    ent: wd['label']
    for ent, wd in _entity2wikidata.items()
}
lbl2ent = {lbl: ent for ent, lbl in ent2lbl.items()}

print([
    ent2lbl[ent] 
    for ent in kg_train.ent2ix.keys()
    if ent in ent2lbl][-10:])

If You Have More Time

  • Try it out with different datasets, for example one you create yourself using SPARQL queries on an open KG.