Knowledge graphs are cool :)
Intro
A knowledge graph, at a high level, is a network of real-world entities (people, places, events) and the relationships between them. This information is stored in a graph structure, where each entity can be linked to other entities through its relationships and then visualised.
A knowledge graph has three main components: nodes, edges and labels. Nodes are the entities (Dog, Pug), edges are the links between them ("Breed Of"), and labels are the data we attach to nodes and edges.
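To make that concrete, here's a tiny sketch of those three components in Python. I'm using networkx here purely as an illustration; the library choice is mine and nothing in the rest of this post depends on it.
# A minimal sketch of nodes, edges and labels, using networkx as an example library
import networkx as nx

G = nx.DiGraph()

# Nodes are the entities; labels are the data attached to them
G.add_node("Pug")
G.add_node("Dog", Age=4, Colour="Red")

# Edges are the links, with the relationship stored as an edge label
G.add_edge("Pug", "Dog", relation="Breed Of")

print(list(G.nodes(data=True)))   # [('Pug', {}), ('Dog', {'Age': 4, 'Colour': 'Red'})]
print(list(G.edges(data=True)))   # [('Pug', 'Dog', {'relation': 'Breed Of'})]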
Why are Knowledge Graphs Exciting?
In my experience, knowledge graphs store knowledge the way humans retrieve it, and that's what I think is so exciting about them. You can graph ANYTHING.
A common use case that comes to mind is people: if I store the relationships between different people, locations and businesses, and visualise that as a graph, the insights are a lot easier to find than if the same data sits in a row-and-column structure, or God forbid, a spreadsheet.
The Project
- We will explore using LLMs to create knowledge graphs.
- We will scale this up to using news articles to knowledge graph the world.
- We will use graph embeddings to do entity resolution (finding the same entities in graphs)
- We will compare using Elasticsearch as a graph DB vs Neo4j.
The project will be an end-to-end pipeline: creating a knowledge graph from unstructured text, resolving entities in the graph, and storage/retrieval.
Using LLMs to Create Knowledge Graphs
The problem I've had with knowledge graphs is making them; nobody wants to sit and write out all the different relationships between entities.
There are some mature libraries for extracting entities, like spaCy, but they don't do the next step: finding the relationships between those entities.
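As a quick illustration of that gap, here's what spaCy gives you (assuming the en_core_web_sm model is installed): it happily finds the entities, but says nothing about how they relate.
import spacy

# Assumes the en_core_web_sm model has been downloaded
nlp = spacy.load("en_core_web_sm")
doc = nlp("John told Sally that she should come watch him play the violin")

# spaCy labels the entities...
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "John PERSON", "Sally PERSON"
# ...but there is nothing here linking John and Sally together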
So what technology can? Enter LLMs. LLMs understand language and therefore, understand relationships when given the right information. Instead of us writing all of these relationships, let’s get an LLM to do the hard work for us.
Putting the LLM to Work: The Co-Reference Problem
The first job is to sort out this pesky co-reference problem. When we read the sentence "John told Sally that she should come watch him play the violin", we instinctively know "she" refers to Sally and "him" refers to John. LLMs can sometimes get this slightly wrong. There is a test for this sort of thing called the Winograd Schema Challenge, which has some harder examples like "Dan took the rear seat while Bill claimed the front because his 'Dibs!' was slow."
Whose ‘dibs’ was slow? Bill’s or Dan’s?
LLMs are better at smaller tasks, so instead of asking these models to straight away create our graph, let’s first ask them to re-write the sentence and get rid of any of these pronouns.
The sentence we're going to use for the first part of this is: "John told Sally that she should come watch him play the violin".
I’ve used Ollama to spin up a Phi-3 instance (I am GPU-Poor) to test this out with. Using the code below, let’s re-write our sentence.
from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:11434/v1"
)

content = "John told Sally that she should come watch him play the violin"

prompt = f"""
You are a coreference resolution expert and will use your knowledge to resolve the co-reference problems that I give to you.
Please can you re-write the following sentence: {content}.
Please only answer with the new sentence with no pronouns and instead using the entity name.
"""

response = client.chat.completions.create(
    model="phi3",
    messages=[{"role": "user", "content": prompt}]
)
msg = response.choices[0].message.content
The LLM responds with: "John informed Sally that Sally should attend to witness John performing on the violin." Great, it's done exactly what we've asked!
Let’s try the harder one just to see how it performs.
"Dan occupied the rear seat as Bill claimed the front due to Bill's delayed assertion of priority, expressed by the word 'Dibs!'."
Close, but the LLM didn't understand how "Dibs" works. I played around a little with the prompt, but this one was just too tough for the smaller Phi-3 model. I also gave the same problem to Llama 3 70B to see if a bigger model could replace the pronoun correctly, but it could not.
I've not tested whether this actually affects the quality of the graph outputs.
Getting the Graph
Now we’ve got our new, high quality, no pronoun sentence, let’s start making a graph.
prompt = f"""
You are amazing at taking in text data and mapping that text to tuples of relationships in the format "node1, connection type, node2"
Given the following sentence, can you please extract the entities to a list of tuples. Here are some examples (John , Friend Of, Sally), (Fred, Type of, Dog).
Please can you extract tuples given the following sentence: {resolved_content}.
"""
response = client.chat.completions.create(
model= "phi3",
messages = [{"role": "user", "content": prompt}]
)
msg = response.choices[0].message.content
print(msg)
To extract relationships from the given sentence "John informed Sally that Sally should attend as a witness when John performs on the violin," we need to identify entities and their connections based on the context. Here are the identified tuples:
1. ("John", "knows", "Sally") - Implied by the fact that John is informing Sally about something; this represents their relationship or connection type in terms of communication.
2. ("Sally", "plans_to_become", "Witness") - Since she is being informed to attend as a witness, it suggests an action planned for her role.
3. ("John", "performs", "Violin") - This details the specific performance John will be giving, with 'violin' being the instrument. However, since we need tuples in the format "node1, connection type, node2," let's interpret this as a relationship where John is associated with an event or activity related to his skill or action (performing), and 'Violin' can be considered as the subject/object of his performance.
4. ("Sally", "participates_as", "Witness") - Reflects Sally's role in relation to John's event, with her being an attendee (witness).
5. ("John", "performs_on", "Violin") - A more specific relationship connecting John to the act of playing a violin during his performance.
6. ("Witness", "observes", "Performance") - This relationship implies Sally's role during John's performance
Damn, that's a lot of content! It did create the tuples, but there are two main problems with this.
- This is not processable by a follow-on system; the LLM has just created more unstructured text that we can't add to a graph.
- The relationships can be whatever the LLM wants, and these could be useless, for example "Witness Observes Performance".
An interesting thing to keep in mind here, though, is that the LLM has explained why it's made the connections. While this isn't processable, if we create a knowledge graph that people analyse and draw conclusions from, you might want to refer back to why the LLM made those connections. We will come back to this later.
Tweaking the Prompts
A lot of playing with LLMs is tweaking the prompts. LLMs are general purpose but need precise instructions to ensure they're doing the right things. Let's see this in action.
prompt = f"""
You are amazing at taking in text data and mapping that text to tuples of relationships in the format "node1, connection type, node2"
Given the following sentence, can you please extract the entities to a list of tuples. Here are some examples (John, Friend Of, Sally), (Fred, Type of, Dog).
Please can you extract tuples given the following sentence: {resolved_content}.
You can only use the following connection types when connecting nodes ["Known to", "Location", "Event Type", "Type Of"].
"""
response = client.chat.completions.create(
model= "phi3",
messages = [{"role": "user", "content": prompt}]
)
msg = response.choices[0].message.content
print(msg)
1. (John, Known To, Sally) - This tuple signifies that John is known to Sally since he informed her about something related to him.
2. (Sally, Location, Unspecified) - In this context, the location isn't explicitly mentioned; however, it can be inferred as a place where witnessing or receiving information takes place. This could relate indirectly to an event type but is not directly stated in the text with an event type connection.
3. (Sally, Event Type, Witness) - Sally might be involved as a witness in an event which is hinted by "attend to witness." However, it's important to note that we are constrained to use specific connection types and directly relating Sally to the act of being a witness through direct information isn't possible within those constraints. Instead, we acknowledge her role indirectly as related to an event (which is not specified by type in our given connections).
4. (John, Type Of, Violin Player) - This interpretation connects John with "Type Of" based on his action of performing on the violin, implying that he could be a violin player or artist showcasing their skill in an event type context. However, it's crucial to acknowledge that we are constrained and might not perfectly align with our given connection types directly but can infer them based on available information.
Please note the limitations of strictly adhering to the provided connection types ("Known to", "Location", "Event Type", "Type Of") when interpreting relationships from the sentence, as some connections are more indirectly implied or not explicitly stated in terms of those categories.
The LLM is now starting to use the connection types we've defined in the prompt. Now let's get some processable information.
prompt = f"""
You are amazing at taking in text data and mapping that text to tuples of relationships in the format "node1, connection type, node2"
Given the following sentence, can you please extract the entities to a list of tuples. Here are some examples (John , Friend Of, Sally), (Fred, Type of, Dog).
Please can you extract tuples given the following sentence: {resolved_content}.
You can only use the following connection types when connecting nodes ["Known to", "Location", "Event Type", "Type Of"].
Please you can only respond with the tuples in a list, no other text. If you do a good job, I will tip you $100.
"""
response = client.chat.completions.create(
model= "phi3",
messages = [{"role": "user", "content": prompt}]
)
msg = response.choices[0].message.content
print(msg)
[('John', 'Known To', 'Sally'), ('Sally', 'Location', None), ('Sally', 'Event Type', 'witness John performing on the violin')]
Okay great, we've now got a processable list. We will have to check it, as LLMs are not always predictable, but this is the basis of the work we wanted to do.
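Because the model can still go off-script, I'd put a small parse-and-check step between the LLM and the graph. This is just a sketch of my own (the function and variable names are mine), assuming the response is a Python-style list of 3-tuples like the one above.
import ast

# The connection types we allowed in the prompt, lower-cased for comparison
ALLOWED_CONNECTIONS = {"known to", "location", "event type", "type of"}

def parse_tuples(raw: str):
    # Parse the LLM's answer and keep only well-formed triples with an allowed connection type
    try:
        data = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return []   # not a Python literal at all; worth retrying the prompt
    if not isinstance(data, list):
        return []
    return [
        t for t in data
        if isinstance(t, tuple) and len(t) == 3
        and isinstance(t[1], str) and t[1].lower() in ALLOWED_CONNECTIONS
    ]

print(parse_tuples(msg))
# [('John', 'Known To', 'Sally'), ('Sally', 'Location', None), ('Sally', 'Event Type', 'witness John performing on the violin')]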
Explainability
Whenever you're working with LLM-processed data, being able to explain why something has happened is very important, especially if it's deployed in a production environment. With that in mind, let's make sure we keep some of the reasoning we've prompted out, while retaining a processable output. Here is an example of how I did that.
prompt = f"""
You are amazing at taking in text data and mapping that text to tuples of relationships in the format "node1, connection type, node2"
Given the following sentence, can you please extract the entities to a list of tuples. Here are some examples (John , Friend Of, Sally), (Fred, Type of, Dog).
Please can you extract tuples given the following sentence: {resolved_content}.
You can only use the following connection types when connecting nodes ["Known to", "Location", "Event Type", "Type Of"].
Please you can only respond with the tuples in a list, no other text. If you do a good job, I will tip you $100.
"""
response = client.chat.completions.create(
model= "phi3",
messages = [{"role": "user", "content": prompt}]
)
msg = response.choices[0].message.content
print(msg)
## Add that message to the chat
prompt2 = "Please explain why you've picked each of the nodes and their relationships in the question above."
chat = [
{"role": "user", "content": prompt },
{"role": "assistant", "content": msg},
{"role": "user", "content": prompt2}
]
response = client.chat.completions.create(
model= "phi3",
messages = chat
)
msg = response.choices[0].message.content
print(msg)
[(John, Known To, Sally), (Sally, Attend, John's Violin Performance)]
---------------------------------------------------------------------------
1. (John, Known To, Sally) - This relationship was established because John informed Sally in the sentence, implying a direct interaction or familiarity between them. The connection type "Known to" reflects this informational context where two individuals have an existing knowledge of each other.
2. (Sally, Attend, John's Violin Performance) - In the given sentence, Sally is advised to attend a specific event involving John performing on the violin. Since no explicit location or direct type of event was mentioned that matches "Location" or "Event Type", I inferred an "Event Type" for this interaction due to its nature as something Sally might be invited to witness. Here, "John's Violin Performance" is treated as a specific instance or example of the broader category of performances (event type), even though it doesn't fit perfectly within the defined categories. If allowed more flexibility in interpretation, I would have preferred using an event-related phrase like "(Sally, Invited To, Music Event)".
Please note that due to limitations imposed by the provided connection types and without additional context specifying a location or explicitly mentioned performance type, the solution required some creative inference. In real-world applications, more detailed guidelines would allow for precise mapping according to available information.
Perfect, we can now store that second response somewhere, and refer back to it if the links ever become important.
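As a sketch of what that storage could look like, here's one way to keep the tuples and the reasoning together. I've renamed the two responses tuples_msg and reasoning_msg for clarity (both were called msg in the code above), and the field names and output file are just my own choices.
import json
from datetime import datetime, timezone

record = {
    "source_sentence": resolved_content,
    "tuples": tuples_msg,          # the tuple list from the first call
    "reasoning": reasoning_msg,    # the free-text explanation from the second call
    "model": "phi3",
    "extracted_at": datetime.now(timezone.utc).isoformat(),
}

# Append one JSON record per extraction so the reasoning can be looked up later
with open("graph_extractions.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")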
Knowledge Graph Data
This brings me onto another point about knowledge graphs. It's all well and good having information stored in the format of entity, relationship, entity, but what if I want to know something about entity1 that isn't worth storing in the graph, or doesn't conform to one of the relationship types I've specified, like "Age"?
Well, nodes and edges can have labels. These labels can be simple strings or whole JSON objects, for example:
Node: Dog
Label: { "Age": 4, "Colour": "Red"}
Can an LLM help us with that as well? Let's change the input sentence.
testing = "John, Age 42, informed Sally that Sally should attend to witness John's performance on the violin."
priming_prompt = """
You are amazing at taking in text data and mapping that text to tuples of relationships in the format "node1 {node1 attribute: node1 value}, connection type, node2 {node2 attribute: node2 value }"
Given the following sentence, can you please extract the entities to a list of tuples. Here are some examples (John {"Age": 32} , Friend Of, Sally {"Age": 31}), (Fred {"Breed": "Pug"}, Type of, Dog).
"""
prompt = f"""
{priming_prompt}
Please can you extract tuples given the following sentence: {testing}.
Please only respond with the tuples in a list, and then end response. I will ask a following up question about your reasoning.
"""
response = client.chat.completions.create(
model= "phi3",
messages = [{"role": "user", "content": prompt}]
)
msg = response.choices[0].message.content
print(msg)
You can see I've updated the prompt to include examples with JSON labels, and I've updated the sentence to include John's age.
[(John {"Age": 42}, Informed, Sally {"Action": "attend to witness"}), (Sally {"Action": "attend to witness"}, Purpose, Witness John\'s performance on the violin)]'
It pulled out John's age and some other stuff, but it has understood that it now has the ability to add information to entities. Let's play around with this some more.
Now let's update the sentence to include some attributes which were not in the prompt examples:
testing = "John, Age 42, who lives in Sommerset informed Sally who has blonde hair that Sally should attend to witness John's performance on the violin."
priming_prompt = """
You are amazing at taking in text data and mapping that text to tuples of relationships in the format "node1 {node1 attribute: node1 value}, connection type, node2 {node2 attribute: node2 value }"
Given the following sentence, can you please extract the entities to a list of tuples. Here are some examples (John {"Age": 32} , Friend Of, Sally {"Age": 31}), (Fred {"Breed": "Pug"}, Type of, Dog).
"""
prompt = f"""
{priming_prompt}
Please can you extract tuples given the following sentence: {testing}.
Please only respond wih the tuples in a list, and then end response. I will ask a following up question about your reasoning.
"""
response = client.chat.completions.create(
model= "phi3",
messages = [{"role": "user", "content": prompt}]
)
msg = response.choices[0].message.content
print(msg)
[(John {"Age": 42}, "resides_in", "Sommerset"), (Sally {"HairColor": "Blonde"}, "communicated_with", John), ("John", "performs_on", "Violin")]
Amazing, we’re making progress! We’ve now got a graph with labels/attributes. I’ve not added any attributes to edges here but you could also do that if you wanted to.
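To round things off, here's a rough sketch of those tuples loaded into networkx and drawn with matplotlib. I've copied the nodes and attributes out by hand rather than parsing the LLM's attribute syntax automatically, so treat this as an illustration of the end result rather than part of the pipeline.
import networkx as nx
import matplotlib.pyplot as plt

# Hand-transcribed from the LLM output above
triples = [
    ("John", {"Age": 42}, "resides_in", "Sommerset", {}),
    ("Sally", {"HairColor": "Blonde"}, "communicated_with", "John", {}),
    ("John", {}, "performs_on", "Violin", {}),
]

G = nx.DiGraph()
for node1, attrs1, relation, node2, attrs2 in triples:
    G.add_node(node1, **attrs1)
    G.add_node(node2, **attrs2)
    G.add_edge(node1, node2, relation=relation)

# Draw the graph with the relationship shown on each edge
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=2000)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "relation"))
plt.show()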
Part 2
In Part 2, we're going to scale this up to news articles to start making a rich knowledge graph of real-world events, and we're going to visualise them to see if we can find any insights.
Cheers!