Discover AuraDB Free: Week 19 β The Wordle Graph
I know Iβm late to the game as Wordle was sold to the New York Times last week. I havenβt played it much myself just saw it popping up on my twitter stream a lot.
If you havenβt played Wordle, it is a guessing game similar to mastermind, only with (5-letter) English words (there are now many clones in other languages). Correctly guessed letters in the right position appear green, wrong position yellow and incorrect letters gray. The goal is to guess the word in 6 attempts, i.e. make all 5 letters green.
Our creative folks even created a Neo4j version of it, see if you can spot the mistake.
So I thought one late night that it would be fun to represent the Wordle world as a graph. And here we go.
βββ@mesirii
If you missed our live-stream here is the recording, we had a lot of fun both playing the game but also exploring the graph version of it:
https://medium.com/media/1994d4957289e654f33d977ee723cee5/hrefThat Wordle of the day was βelderβ, which we got to in attempt 3, after someone from the stream hinting at it.
You can find past live sessions and deep dive blog write-ups.
All the Cypher statements and the source CSV can be found in this GitHub repository:
GitHub - jexp/wordle-graph: Wordle Solver as Neo4j graph database
But letβs create our database instance first, so that you can follow the modeling and import.
Create a Neo4j AuraDB Free Instance
Go to https://dev.neo4j.com/neo4j-aura to register or log into the service (you might need to verify your email address).
After clicking Create Database you can create a new Neo4j AuraDB Free instance. Select a Region close to you and give it a name, e.g. {db-name}.
Choose the βblank databaseβ option as we want to import our data ourselves.
On the Credentials popup, make sure to save the password somewhere safe. The default username is always neo4j.
Then wait 2β3 minutes for your instance to be created.
Afterwards you can connect via the Open Button with Neo4j Browser (youβll need the password).
Then also the connection URL: neo4j+s://xxx.databases.neo4j.io is available and you can copy it to your credentials as you might need it later.
If you want to see examples for programmatically connecting to the database go to the βConnectβ tab of your instance and pick the language of your choice
Dataset
I found a scraped Wordle list in this repository of an R solver, which I turned into a CSV with 12k entries.
Here are the first few words, none of which I would have associated with English :)
aahed
aalii
aargh
aarti
abaca
The CSV is available on GitHub, it doesnβt have a header and only a single column.
Import
We can load the CSV data directly with LOAD CSV into Word nodes. First we create a constraint to make this and future lookups fast and the import repeatable.
// create constraint
CREATE CONSTRAINT word_name IF NOT EXISTS ON (w:Word) ASSERT w.name IS UNIQUE;
// load csv and turn each row into a Word node
LOAD CSV FROM "https://raw.githubusercontent.com/jexp/wordle-graph/main/wordle.csv" AS row
MERGE (:Word {name:row[0]});
Thatβs a lot of 5-letter words (12k) nodes.
I also put a dump file on s3: https://data.neo4j.com/wordle.dump that you can load into your AuraDB or Neo4j Desktop.
Tale of 2 Models
The idea is to split the words into their constituent letters and represent those βcharacter positionsβ in the graph, sharing letters and positions as needed.
That will allow us to later have fun with things like:
- Character frequency
- Follow frequencies
- Top-starter words
- Possible solution words if we have already some positive and negative clues
- Playing the game
Model 1 β Letter Positions as Nodes
Initially I represented characters at positions with dedicated nodes labeled CharAtPos with a char and an idx property connected to the word via HAS and to each other with an NEXT relationship and the first node (idx=0) having an extra STARTS relationship.
As it turns out the model might have been a bit overengineered :)
Here is the code to split the words into characters and then create the first CharAt pos and subsequent (1..4) nodes and connect them.
To make it work with memory-constrained environments I wrapped it in a CALL {} IN TRANSACTIONS subquery, but even in AuraDB Free that was actually not necessary.
MATCH (w:Word)
CALL { WITH w
WITH w, split(w.name,"") AS chars
MERGE (start:CharAtPos {idx:0, char:chars[0]})
MERGE (w)-[:STARTS]->(start)
MERGE (w)-[:HAS]->(start)
WITH *
UNWIND range(1,size(chars)-1) AS idx
MERGE (next:CharAtPos {idx:idx, char:chars[idx]})
MERGE (w)-[:HAS]->(next)
WITH *
MATCH (prev:CharAtPos {idx:idx-1, char:chars[idx-1]})
MERGE (prev)-[:NEXT]->(next)
} IN TRANSACTIONS OF 1000 ROWS;
IF we query a word like crash now based on our model:
match p=(:Word {name:"crash"})--()
return p
Wordle Solver v1
To solve a word, you pass in the letters you know with their positions and the letters that you donβt have the right position for and match any words that fit this pattern.
MATCH (c1:CharAtPos {idx:0, char:'c'}),
(c5:CharAtPos {idx:4, char:'h'}),
(c:CharAtPos {char:'a'})
match (w:Word)-[:HAS]->(c1),
(w)-[:HAS]->(c5),
(w)-[:HAS]->(c)
return w.name;
ββββββββββ
β"w.name"β
ββββββββββ‘
β"clach" β
ββββββββββ€
β"clash" β
ββββββββββ€
β"caneh" β
ββββββββββ€
β"coach" β
ββββββββββ€
β"catch" β
ββββββββββ€
β"crash" β
ββββββββββ
If we have more information, then we can extend the query by excluding letters or positions and get a smaller result set. It takes a bit longer to query due to the exclusions.
match (c1:CharAtPos {idx:0, char:'c'}), // correct
(c2:CharAtPos {idx:1, char:'a'}), // wrong pos
(c3:CharAtPos {char:'l'}), // incorrect
(c4:CharAtPos {char:'i'}), // incorrect
(c5:CharAtPos {idx:4, char:'h'}), // correct
(c:CharAtPos {char:'a'})
match (w:Word)-[h1:HAS]->(c1),
(w)-[h2:HAS]->(c5), (w)-[h3:HAS]->(c)
WHERE not exists { (w)-[:HAS]->(c2) } and not exists { (w)-[:HAS]->(c3) } and not exists { (w)-[:HAS]->(c4) }
return *
Model 2 β Positions in Relationships
An alternative model represents just the 26 characters and puts the position onto the relationship either as a property or as the rel-type.
Because we know we have 5 letters we can just spell it out.
MATCH (w:Word)
WITH w, split(w.name,"") AS chars
MERGE (c0:Char {char:chars[0]})
MERGE (w)-[p0:POS0]->(c0) SET p0.idx = 0
MERGE (c1:Char {char:chars[1]})
MERGE (w)-[p1:POS1]->(c1) SET p1.idx = 1
MERGE (c2:Char {char:chars[2]})
MERGE (w)-[p2:POS2]->(c2) SET p2.idx = 2
MERGE (c3:Char {char:chars[3]})
MERGE (w)-[p3:POS3]->(c3) SET p3.idx = 3
MERGE (c4:Char {char:chars[4]})
MERGE (w)-[p4:POS4]->(c4) SET p4.idx = 4;
Model 2 Exploration
We can first look at the representation of a word (βdiverβ) in this model.
match (w:Word {name:"diver"})-[r]->(c:Char)
return *;
Then see how shared characters between two words (βdiverβ and βelderβ) look like, here c are the shared letters and c1, c2 the other letters respectively.
match path = (c1:Char)<--(:Word {name:"diver"})-->(c:Char)
match path2 = (c:Char)<--(:Word {name:"elder"})-->(c2:Char)
return path, path2;
Looking at that frequent suffix of er, we wanted to see what the letter frequencies look like.
Letter Frequencies
With this model we can also easy look at character frequencies and follow probabilities.
For the letter frequencies we just count the number of relationships pointing to a character node (aka the in-degree).
MATCH (c:Char)
RETURN c.char, size((c)<--()) as deg
ORDER BY deg DESC;
As expected for the English language the vowels, and R,S,T are pretty high up.
ββββββββββ€ββββββ
β"c.char"β"deg"β
ββββββββββͺββββββ‘
β"s" β6665 β
ββββββββββΌββββββ€
β"e" β6662 β
ββββββββββΌββββββ€
β"a" β5990 β
ββββββββββΌββββββ€
β"o" β4438 β
ββββββββββΌββββββ€
β"r" β4158 β
ββββββββββΌββββββ€
β"i" β3759 β
ββββββββββΌββββββ€
β"l" β3371 β
ββββββββββΌββββββ€
β"t" β3295 β
ββββββββββΌββββββ€
β"n" β2952 β
ββββββββββΌββββββ€
β"u" β2511 β
ββββββββββΌββββββ€
β"d" β2453 β
ββββββββββΌββββββ€
β"y" β2074 β
ββββββββββΌββββββ€
β"c" β2028 β
ββββββββββΌββββββ€
β"p" β2019 β
ββββββββββΌββββββ€
β"m" β1976 β
ββββββββββΌββββββ€
β"h" β1760 β
ββββββββββΌββββββ€
β"g" β1644 β
ββββββββββΌββββββ€
β"b" β1627 β
ββββββββββΌββββββ€
β"k" β1505 β
ββββββββββΌββββββ€
β"f" β1115 β
ββββββββββΌββββββ€
β"w" β1039 β
ββββββββββΌββββββ€
β"v" β694 β
ββββββββββΌββββββ€
β"z" β434 β
ββββββββββΌββββββ€
β"j" β291 β
ββββββββββΌββββββ€
β"x" β288 β
ββββββββββΌββββββ€
β"q" β112 β
ββββββββββ΄ββββββ
We can now use that information to find good starting words, by summing up the character frequencies for each word.
MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN w.name, sum(size((c)<--())) as total
ORDER BY total DESC LIMIT 5;
We see there are quite a lot of βcheaterβ words which contain the high frequency characters multiple times.
ββββββββββ€ββββββββ
β"w.name"β"total"β
ββββββββββͺββββββββ‘
β"esses" β33319 β
ββββββββββΌββββββββ€
β"sasse" β32647 β
ββββββββββΌββββββββ€
β"sessa" β32647 β
ββββββββββΌββββββββ€
β"asses" β32647 β
ββββββββββΌββββββββ€
β"eases" β32644 β
ββββββββββ΄ββββββββ
If we want to avoid that we can state, that we want to only look at words with 5 distinct characters.
MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN w.name, sum(size((c)<--())) as total, count(distinct c) = 5 as uniques
ORDER BY uniques DESC, total DESC LIMIT 10;
ββββββββββ€ββββββββ€ββββββββββ
β"w.name"β"total"β"uniques"β
ββββββββββͺββββββββͺββββββββββ‘
β"arose" β27913 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"soare" β27913 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"aeros" β27913 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"serai" β27234 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"arise" β27234 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"reais" β27234 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"aesir" β27234 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"raise" β27234 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"aloes" β27126 βtrue β
ββββββββββΌββββββββΌββββββββββ€
β"stoae" β27050 βtrue β
ββββββββββ΄ββββββββ΄ββββββββββ
Still not perfect, most of these are just variations of the same set of letters. Letβs group them by the sorted set of letters and show the first few.
MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN apoc.coll.sort(split(w.name,'')) as letters, sum(size((c)<--())) as total, count(distinct c) = 5 as uniques, collect(w.name)[0..2] as words
ORDER BY uniques DESC, total DESC LIMIT 10;
βββββββββββββββββββββββ€ββββββββ€ββββββββββ€ββββββββββββββββββ
β"letters" β"total"β"uniques"β"words" β
βββββββββββββββββββββββͺββββββββͺββββββββββͺββββββββββββββββββ‘
β["a","e","r","s","t"]β348010 βtrue β["aster","arets"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["a","e","p","r","s"]β331422 βtrue β["apres","apers"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["a","e","l","s","t"]β311796 βtrue β["salet","taels"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["a","e","l","p","s"]β247070 βtrue β["pales","lapse"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["d","e","o","r","s"]β243760 βtrue β["doser","deros"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["a","e","l","r","s"]β241614 βtrue β["arles","earls"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["a","c","e","r","s"]β229527 βtrue β["acres","acers"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["d","e","i","l","s"]β229100 βtrue β["delis","diels"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["a","i","r","s","t"]β214803 βtrue β["artis","astir"]β
βββββββββββββββββββββββΌββββββββΌββββββββββΌββββββββββββββββββ€
β["a","e","l","s","v"]β210438 βtrue β["avels","valse"]β
βββββββββββββββββββββββ΄ββββββββ΄ββββββββββ΄ββββββββββββββββββ
Follow Frequencies
For example, in position 1 we can use this for the follow frequency
MATCH (c1:Char)<-[:POS0]-(w)-[:POS1]->(c2:Char)
RETURN c1.char, c2.char, count(*) as freq
ORDER BY freq DESC LIMIT 5;
βββββββββββ€ββββββββββ€βββββββ
β"c1.char"β"c2.char"β"freq"β
βββββββββββͺββββββββββͺβββββββ‘
β"c" β"o" β220 β
βββββββββββΌββββββββββΌβββββββ€
β"m" β"a" β193 β
βββββββββββΌββββββββββΌβββββββ€
β"r" β"e" β187 β
βββββββββββΌββββββββββΌβββββββ€
β"s" β"t" β183 β
βββββββββββΌββββββββββΌβββββββ€
β"b" β"o" β175 β
βββββββββββ΄ββββββββββ΄βββββββ
Globally we can either do a big union and sum them up as we did on the stream. Or with the positions on the relationships, we can also generalize our statement from above.
MATCH (c1:Char)<-[p1]-(w)-[p2]->(c2:Char)
WHERE p1.idx +1 = p2.idx
RETURN c1.char, c2.char, count(*) as freq
ORDER BY freq DESC LIMIT 10;
βββββββββββ€ββββββββββ€βββββββ
β"c1.char"β"c2.char"β"freq"β
βββββββββββͺββββββββββͺβββββββ‘
β"e" β"s" β867 β
βββββββββββΌββββββββββΌβββββββ€
β"e" β"r" β765 β
βββββββββββΌββββββββββΌβββββββ€
β"a" β"r" β659 β
βββββββββββΌββββββββββΌβββββββ€
β"r" β"e" β646 β
βββββββββββΌββββββββββΌβββββββ€
β"e" β"d" β642 β
βββββββββββΌββββββββββΌβββββββ€
β"r" β"a" β562 β
βββββββββββΌββββββββββΌβββββββ€
β"i" β"n" β556 β
βββββββββββΌββββββββββΌβββββββ€
β"a" β"n" β544 β
βββββββββββΌββββββββββΌβββββββ€
β"a" β"l" β536 β
βββββββββββΌββββββββββΌβββββββ€
β"a" β"s" β532 β
βββββββββββ΄ββββββββββ΄βββββββ
These can help us finding that missing letter, but I guess you have internalized it already from your language knowledge, except for some rare combinations.
One interesting question we wanted to answer is, which letters appear frequently in succession as double-letters.
MATCH (c:Char)<-[r1]-(w:Word)-[r2]->(c)
WHERE r1.idx = r2.idx +1
RETURN c.char, count(*) as freq
ORDER BY freq DESC LIMIT 8
As expected o and e lead the pack, l was a bit surprising to me.
ββββββββββ€βββββββ
β"c.char"β"freq"β
ββββββββββͺβββββββ‘
β"o" β328 β
ββββββββββΌβββββββ€
β"e" β308 β
ββββββββββΌβββββββ€
β"l" β209 β
ββββββββββΌβββββββ€
β"t" β114 β
ββββββββββΌβββββββ€
β"f" β111 β
ββββββββββΌβββββββ€
β"r" β108 β
ββββββββββΌβββββββ€
β"s" β105 β
ββββββββββΌβββββββ€
β"n" β91 β
ββββββββββ΄βββββββ
For generic, repeats and re-occurrences one could generalize that to
Wordle Solver v2
For resolving our wordle puzzle (v2) we could use this Cypher using this time the relationships as structuring means.
MATCH (c:Char {char:'c'}),
(h:Char {char:'h'}),
(a:Char {char:'a'})
MATCH (wordle:Word)-[p0:POS0]->(c),
(wordle)-[p4:POS4]->(h),
(wordle)-[px]->(a)
WHERE not exists { (wordle)-[:POS1]->(a) }
AND not exists { (wordle)-[:POS2]->(:Char {char:'l'}) }
AND not exists { (wordle)-[:POS3]->(:Char {char:'i'}) }
RETURN *;
If we have more information, then we can extend the query by excluding letters or positions and get a smaller result set.
MATCH (c:Char {char:'c'}),
(h:Char {char:'h'}),
(a:Char {char:'a'})
MATCH (wordle:Word)-[p0:POS0]->(c),
(wordle)-[p4:POS4]->(h),
(wordle)-[px]->(a)
WHERE not exists { (wordle)-[:POS1]->(a) }
AND not exists { (wordle)-[:POS2]->(:Char {char:'l'}) }
AND not exists { (wordle)-[:POS3]->(:Char {char:'i'}) }
RETURN *;
We can even go so far as to implement a generic solver, that takes an structured input (we could also split an parse a marked up word as shown in the 2nd statement) and for each position, includes or excludes the letter (at position) as needed.
// Laser π©π¨β¬π©β¬
WITH [{char:'l',match:true},{char:'a'},{char:'s',match:false},{char:'e',match:true},{char:'r',match:false}] as input
MATCH (w:Word)
CALL {
WITH w, input
UNWIND range(0,size(input)-1) as idx
WITH size(input) as total, idx, w, input[idx].match as m, input[idx].char as char
// for matching must match position
WHERE (m AND exists { (w)-[{idx:idx}]->(:Char {char:char}) })
// for non-matching must not contain
OR (m = false AND NOT exists { (w)-->(:Char {char:char}) })
// existing must contain somewhere
OR (m IS NULL AND exists { (w)-->(:Char {char:char}) })
// all conditions need to match for this word
WITH total, count(*) as count WHERE count = total
RETURN true AS found
}
RETURN w.name;
For βlaserβ this returns 9 suggestions for the next word:
label, laced, lacet, lacey, laded, laden, laked, lamed, lapel, lated, laten, latex, laved, lawed, laxed, layed, lazed, lutea, lycea.
We could now rank them by the information gain, i.e. which of those have high frequency new characters or reduce most of the uncertainty.
I leave that for you valued reader for now, so that we can have a look at (the much simpler) wordle game implemented using Cypher.
Playing Wordle in Your Terminal
If you just want to play, run ./wordle-neo4j.sh in your terminal, it sends a Cypher query to a wordle database in demo.neo4j.labs.com (username, password, database = wordle) to see if your guesses were right.
https://medium.com/media/9165d67097601f14e45d193aba698df5/hrefThe statement that itβs running with the guesses in a loop (via http but could also do via cypher-shell) is:
match (w:Word)
with w skip $word limit 1
// turn guess and word into lists
with split($guess,'') as guessed, split(w.name,'') as letters, w.name as name
// iterate and combine for each position
return reduce(res='', idx in range(0,size(letters)-1) | res +
// when correct
case when guessed[idx] = letters[idx] then 'π©'
// when contained
when name contains guessed[idx] then 'π¨'
// otherwise
else 'β¬' end) as res
Happy guessing and playing with the Wordle data!
Let us know if you have more ideas / suggestions on how to have more fun with this or other datasets on AuraDB Free.
Discover AuraDB Free: Week 19 β The Wordle Graph was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.