<html> <div class=„section-divider“><hr class=„section-divider“/></div><div class=„section-content“><div class=„section-inner sectionLayout–insetColumn“><h1 name=„b262“ id=„b262“ class=„graf graf–h3 graf–leading graf–title“>Machine Learning is for 10th-Graders</h1><p name=„e301“ id=„e301“ class=„graf graf–p graf-after–h3“><em class=„markup–em markup–p-em“>Write some Python that learns to solve a maze… using 10th-grade math.</em></p><p name=„b3c4“ id=„b3c4“ class=„graf graf–p graf-after–p“>How can a single algorithm solve a maze, play tic-tac-toe, or teach a robot to walk, without knowing anything about mazes, tic-tac-toe, or walking?</p><p name=„07d0“ id=„07d0“ class=„graf graf–p graf-after–p“>The algorithm is called <strong class=„markup–strong markup–p-strong“>Q-Learning: </strong>a simple, brilliant piece of artificial intelligence that we’re going to build from scratch, using introductory Python and some 10th-grade math.</p><p name=„3a3d“ id=„3a3d“ class=„graf graf–p graf-after–p“>If I asked you to write three programs —</p><pre name=„7fa1“ id=„7fa1“ class=„graf graf–pre graf-after–p“>(1) solve a maze; (2) play tic-tac-toe; (3) make a robot walk</pre><p name=„ceeb“ id=„ceeb“ class=„graf graf–p graf-after–pre“>— you might ask yourself—</p><pre name=„6a58“ id=„6a58“ class=„graf graf–pre graf-after–p“>(1) solve a maze: how do I avoid getting lost or stuck?<br/>(2) play tic-tac-toe: how do I win, or at least not lose?<br/>(3) make a robot walk: how do I walk without falling?</pre><p name=„3c0c“ id=„3c0c“ class=„graf graf–p graf-after–pre“>You might try to write three different programs, each designed to avoid mistakes, like getting stuck, or losing a game, or falling down.</p><p name=„bd9a“ id=„bd9a“ class=„graf graf–p graf-after–p“><strong class=„markup–strong markup–p-strong“><em class=„markup–em markup–p-em“>Q-Learning</em></strong><em class=„markup–em markup–p-em“> turns that approach inside-out.</em><strong class=„markup–strong markup–p-strong“><em 
class=„markup–em markup–p-em“> QL</em></strong><em class=„markup–em markup–p-em“> is a type of </em><strong class=„markup–strong markup–p-strong“><em class=„markup–em markup–p-em“>reinforcement learning</em></strong><em class=„markup–em markup–p-em“> that is perfectly happy to learn from mistakes. Once we learn how to learn, we won’t need to know details about mazes, games, or robots.</em></p><p name=„06cd“ id=„06cd“ class=„graf graf–p graf-after–p“>If you’re in high school and aren’t afraid of making a few mistakes, keep reading. Machine learning is all about programs that learn from making lots of mistakes. How hard can it be to make lots of mistakes?</p><p name=„a7b3“ id=„a7b3“ class=„graf graf–p graf-after–p“>It’s not hard. I do it all the time.</p><h4 name=„c9a0“ id=„c9a0“ class=„graf graf–h4 graf-after–p“><strong class=„markup–strong markup–h4-strong“>Examples 1, 2, 3: can you learn from these mistakes?</strong></h4><p name=„739f“ id=„739f“ class=„graf graf–p graf-after–h4“>(1) I flip a coin and you guess heads, but I say, “no, it’s tails.”</p><p name=„7d23“ id=„7d23“ class=„graf graf–p graf-after–p“>(2) I choose a number from 1 to 10 and you guess 6, but I say, “no, it’s not 6.”</p><p name=„c6c9“ id=„c6c9“ class=„graf graf–p graf-after–p“>(3) I choose a number from 1 to 10 and you guess 6, but I say, “6 is too high.”</p><p name=„e97f“ id=„e97f“ class=„graf graf–p graf-after–p“>What can you learn from those mistakes? Does calling a coin toss give you any information about the next coin toss? 
(hint: no).</p><p name=„1127“ id=„1127“ class=„graf graf–p graf-after–p“>What about guessing my secret number?</p><pre name=„2bbb“ id=„2bbb“ class=„graf graf–pre graf-after–p“><a href=„https://github.com/jvon-challenges/guessing-games“ data-href=„https://github.com/jvon-challenges/guessing-games“ class=„markup–anchor markup–pre-anchor“ rel=„nofollow noopener nofollow noopener“ target=„_blank“>https://github.com/jvon-challenges/guessing-games</a></pre><pre name=„b2d6“ id=„b2d6“ class=„graf graf–pre graf-after–pre“>(2.1) I am thinking of a number (1 in 10 = 10.0%)<br/>(2.2) guess of 6 is <strong class=„markup–strong markup–pre-strong“>incorrect</strong>!<br/>(2.3) remaining values: [1,2,3,4,5,7,8,9,10] (1 in 9 = 11.1%)</pre><pre name=„31c6“ id=„31c6“ class=„graf graf–pre graf-after–pre“>(3.1) I am thinking of a number (1 in 10 = 10.0%)<br/>(3.2) guess of 6 is <strong class=„markup–strong markup–pre-strong“>too high</strong>!<br/>(3.3) remaining values: [1,2,3,4,5] (1 in 5 = 20.0%)</pre><p name=„6013“ id=„6013“ class=„graf graf–p graf-after–pre“>Except in trivial cases: a guess that ends in failure will often clarify the path to success. 
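</p><p>The arithmetic in games (2) and (3) can be sketched in a few lines of Python. This is only an illustration, not the code in the linked repository; the function names remaining_after_guess and chance are invented for this example.</p>

```python
def remaining_after_guess(candidates, guess, feedback):
    """Return the numbers that are still possible after one wrong guess."""
    if feedback == "incorrect":
        # We only learn that the guess itself is wrong.
        return [n for n in candidates if n != guess]
    if feedback == "too high":
        # We learn that the guess and everything above it are wrong.
        return [n for n in candidates if n < guess]
    if feedback == "too low":
        return [n for n in candidates if n > guess]
    raise ValueError(f"unknown feedback: {feedback!r}")


def chance(candidates):
    """Chance (in percent) that a random pick from candidates is right."""
    return 100.0 / len(candidates)


numbers = list(range(1, 11))  # 1 to 10
after_incorrect = remaining_after_guess(numbers, 6, "incorrect")
after_too_high = remaining_after_guess(numbers, 6, "too high")

print(after_incorrect, f"{chance(after_incorrect):.1f}%")
print(after_too_high, f"{chance(after_too_high):.1f}%")
```

<p>An “incorrect” reply removes one candidate. A “too high” reply removes the guess and everything above it, which is why that mistake teaches more.</p><p>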
<strong class=„markup–strong markup–p-strong“>Don’t ever waste a mistake.</strong> Each mistake contains an important lesson of what not to do next (except for flipping coins — which is why football games don’t start with a ref saying <em class=„markup–em markup–p-em“>I’m thinking of a number…</em>).</p><h4 name=„1aa0“ id=„1aa0“ class=„graf graf–h4 graf-after–p“>So you get trapped in a maze by an evil alien…</h4><p name=„1ed1“ id=„1ed1“ class=„graf graf–p graf-after–h4“>Here is the maze (the alien will be along shortly):</p><pre name=„9490“ id=„9490“ class=„graf graf–pre graf-after–p“><a href=„https://github.com/jvon-challenges/guessing-games“ data-href=„https://github.com/jvon-challenges/guessing-games“ class=„markup–anchor markup–pre-anchor“ rel=„nofollow noopener noopener nofollow noopener“ target=„_blank“>https://github.com/jvon-challenges/guessing-games</a></pre><pre name=„7812“ id=„7812“ class=„graf graf–pre graf-after–pre“>===================================<br/> … … … +++ <br/>enter-> (1) … … +++ <br/> … … … +++ <p>… +++ … … <br/> … +++ … … <br/> … +++ … … </p><p>… … +++ … <br/> … … +++ … <br/> … … +++ … </p><p>+++ +++ … … <br/> +++ +++ … … <-exit<br/> +++ +++ … … </p><p>===================================</p></pre><p name=„e166“ id=„e166“ class=„graf graf–p graf-after–pre“>If I asked you to solve the maze by moving North, South, East, or West, you might take the steps E→E→S→E→S→S, resulting in the solution:</p><pre name=„2f89“ id=„2f89“ class=„graf graf–pre graf-after–p“><a href=„https://github.com/jvon-challenges/guessing-games“ data-href=„https://github.com/jvon-challenges/guessing-games“ class=„markup–anchor markup–pre-anchor“ rel=„nofollow noopener noopener noopener nofollow noopener“ target=„_blank“>https://github.com/jvon-challenges/guessing-games</a></pre><pre name=„636d“ id=„636d“ class=„graf graf–pre graf-after–pre“>===================================<br/> … … … +++ <br/>enter-> (1) (2) (3) +++ <br/> … … … +++ <p>… +++ … … <br/> … +++ (4) (5) 
<br/> … +++ … … </p><p>… … +++ … <br/> … … +++ (6) <br/> … … +++ … </p><p>+++ +++ … … <br/> +++ +++ … (7) <-exit<br/> +++ +++ … … </p><p>===================================</p></pre><p name=„d788“ id=„d788“ class=„graf graf–p graf-after–pre“>Easy. That’s because you know a lot about mazes. You know about walls, boundaries, and blocks. You recognize “enter” and “exit”. You know the meaning of moving N, S, E, or W. You have a goal: <em class=„markup–em markup–p-em“>reach the exit.</em> You have a strategy: <em class=„markup–em markup–p-em“>avoid dead ends. </em>You are a total maze expert.</p><p name=„a0b4“ id=„a0b4“ class=„graf graf–p graf-after–p“>Let’s turn that inside-out. Let’s say you’re trapped in the maze by an evil alien. But: the evil alien doesn’t tell you it’s a maze — just that you are trapped.</p><p name=„2929“ id=„2929“ class=„graf graf–p graf-after–p“>In fact, the alien just says, “you are in position [0,0].”</p><p name=„1bf3“ id=„1bf3“ class=„graf graf–p graf-after–p“>Then you wait, and the alien waits.</p><p name=„5487“ id=„5487“ class=„graf graf–p graf-after–p“>Finally, you summon your courage and ask the alien, “what can I do?”</p><p name=„a848“ id=„a848“ class=„graf graf–p graf-after–p“>The alien says, “you can choose <em class=„markup–em markup–p-em“>N,S,E, or W</em>”.</p><p name=„4520“ id=„4520“ class=„graf graf–p graf-after–p“>Not knowing what else to do, you take a deep breath and say, “I choose <em class=„markup–em markup–p-em“>N.</em>”</p><p name=„b7cb“ id=„b7cb“ class=„graf graf–p graf-after–p“>The alien says, “that went badly.”</p><p name=„611a“ id=„611a“ class=„graf graf–p graf-after–p“>And then you wait, and the alien waits, but nothing happens.</p><p name=„363c“ id=„363c“ class=„graf graf–p graf-after–p“>Finally, you get bored and ask, “can I try again?”</p><p name=„fd07“ id=„fd07“ class=„graf graf–p graf-after–p“>The alien responds, “you are in position [0,0].”</p><p name=„e258“ id=„e258“ class=„graf graf–p graf-after–p“>So what do 
you choose, this time? Probably not <em class=„markup–em markup–p-em“>N…</em> <em class=„markup–em markup–p-em“>N</em> went badly.</p><p name=„ab78“ id=„ab78“ class=„graf graf–p graf-after–p“>So you say, “I choose <em class=„markup–em markup–p-em“>E</em>”.</p><p name=„c23a“ id=„c23a“ class=„graf graf–p graf-after–p“>And the alien says, “you are in position [0,1].”</p><p name=„af1c“ id=„af1c“ class=„graf graf–p graf-after–p“>Hey! We’re getting somewhere.</p><h4 name=„4900“ id=„4900“ class=„graf graf–h4 graf-after–p“>Write a program to defeat the alien (and its annoying friends)</h4><p name=„4955“ id=„4955“ class=„graf graf–p graf-after–h4“>Before long, you tire of talking to the alien. You write some Python code to deal with not only this alien, but any future aliens who exhibit similar annoying tendencies.</p><p name=„414d“ id=„414d“ class=„graf graf–p graf-after–p“>Turns out these aliens have a few things in common:</p><ul class=„postList“><li name=„db3e“ id=„db3e“ class=„graf graf–li graf-after–p“>aliens tell you what <em class=„markup–em markup–li-em“>state</em> you are in (e.g., position [0,0])</li><li name=„5a1e“ id=„5a1e“ class=„graf graf–li graf-after–li“>aliens allow you to choose among <em class=„markup–em markup–li-em“>actions</em> (e.g., N,S,W,E)</li><li name=„6fa2“ id=„6fa2“ class=„graf graf–li graf-after–li“>aliens reveal the <em class=„markup–em markup–li-em“>dimensions</em> of all possible states (e.g., 4×4)</li></ul><p name=„2367“ id=„2367“ class=„graf graf–p graf-after–li“>Here is the first annoying alien, in Python. 
Note that whenever you choose an action, the alien responds with three things: (1) your new state, (2) a reward or penalty, and (3) true/false indicating whether your attempt has ended:</p><figure name=„caaa“ id=„caaa“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/82a7580bdb78a985f8b9bb2f3bab0229?postId=d4f9ac7a798“ data-media-id=„82a7580bdb78a985f8b9bb2f3bab0229“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><p name=„c5fa“ id=„c5fa“ class=„graf graf–p graf-after–figure“>You need your own Python code, to defeat the alien. 
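</p><p>It helps to pin down what the alien does first. The embedded alien is not reproduced in this text, so what follows is only a sketch of an alien with the same three-part response. The class name MazeAlien, its method names, and the reward values (-1 for a bad move, 0 for an ordinary move, +1 for the exit) are assumptions, chosen to agree with the maze drawing above and the sample output later in this article.</p>

```python
class MazeAlien:
    """Hypothetical stand-in for the article's alien (all names are illustrative)."""

    # '.' is open floor, '+' is a block; copied from the maze drawing.
    LAYOUT = ["...+",
              ".+..",
              "..+.",
              "++.."]
    MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
    START, EXIT = (0, 0), (3, 3)

    def __init__(self):
        self.state = self.START

    def reset(self):
        """Start a new attempt and report the starting state."""
        self.state = self.START
        return list(self.state)

    def dimensions(self):
        return (len(self.LAYOUT), len(self.LAYOUT[0]))

    def actions(self):
        return list(self.MOVES)

    def step(self, action):
        """Return (new_state, reward, done) for one chosen action."""
        d_row, d_col = self.MOVES[action]
        row, col = self.state[0] + d_row, self.state[1] + d_col
        rows, cols = self.dimensions()
        out_of_bounds = not (0 <= row < rows and 0 <= col < cols)
        if out_of_bounds or self.LAYOUT[row][col] == "+":
            return list(self.state), -1, True   # "that went badly"
        self.state = (row, col)
        if self.state == self.EXIT:
            return list(self.state), 1, True    # found the exit
        return list(self.state), 0, False       # neither good nor bad


alien = MazeAlien()
print(alien.reset())     # [0, 0]
print(alien.step("N"))   # ([0, 0], -1, True)
print(alien.reset())     # [0, 0]
print(alien.step("E"))   # ([0, 1], 0, False)
```

<p>The four printed lines replay the conversation above: N from [0,0] goes badly and ends the attempt, while E moves you to [0,1].</p><p>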
After thinking it through, you realize that your code needs to remember what <em class=„markup–em markup–p-em“>state</em> you are in, and whether taking an <em class=„markup–em markup–p-em“>action</em> from that state goes well or badly.</p><p name=„8f3c“ id=„8f3c“ class=„graf graf–p graf-after–p“>So from your first encounter:</p><ul class=„postList“><li name=„f732“ id=„f732“ class=„graf graf–li graf-after–p“>state<strong class=„markup–strong markup–li-strong“>[0,0]</strong>→<strong class=„markup–strong markup–li-strong“> </strong>action<strong class=„markup–strong markup–li-strong“> N</strong>→ goes badly, game over!</li><li name=„cae2“ id=„cae2“ class=„graf graf–li graf-after–li“>state<strong class=„markup–strong markup–li-strong“>[0,0]</strong>→ action <strong class=„markup–strong markup–li-strong“>E</strong>→ neither good nor bad.</li></ul><p name=„f645“ id=„f645“ class=„graf graf–p graf-after–li“><em class=„markup–em markup–p-em“>(you may guess that “0,0” means row 0, column 0, and “0,1” means row 0, column 1, N is North, going north from (0,0) is out of bounds and therefore ‘bad’, E is East… but don’t bother with details! — that will just make it harder to defeat the annoying tic-tac-toe playing alien that lies in your not-too-distant future)</em></p><p name=„45c6“ id=„45c6“ class=„graf graf–p graf-after–p“>For starters, you will need some code that can take note of things at every possible state, all 4×4=16 of them. 
Here is some Python that makes room to store at least one thing about every state in a 4×4 universe:</p><figure name=„f5ad“ id=„f5ad“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/8d9d0994b75fa466c09aff6e4ddf5a61?postId=d4f9ac7a798“ data-media-id=„8d9d0994b75fa466c09aff6e4ddf5a61“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><p name=„e48e“ id=„e48e“ class=„graf graf–p graf-after–figure“>Let’s see… you need to note your first mistake:</p><ul class=„postList“><li name=„8290“ id=„8290“ class=„graf graf–li graf-after–p“>state<strong class=„markup–strong markup–li-strong“>[0,0]</strong>→<strong class=„markup–strong markup–li-strong“> </strong>action<strong class=„markup–strong markup–li-strong“> N</strong>→ goes badly, game over!</li></ul><p name=„0a58“ id=„0a58“ class=„graf graf–p graf-after–li“>You could put something in position (0,0) of our array, but then you might want to make a note of your next attempt:</p><ul class=„postList“><li name=„f3b6“ id=„f3b6“ class=„graf graf–li graf-after–p“>state<strong class=„markup–strong markup–li-strong“>[0,0]</strong>→ action <strong class=„markup–strong markup–li-strong“>E</strong>→ neither good nor bad.</li></ul><p name=„812d“ id=„812d“ class=„graf graf–p graf-after–li“>Which would also go in position (0,0) of your array, wiping out your first entry (and forgetting your mistake, dooming you possibly to repeat it).</p><p name=„2ad5“ id=„2ad5“ class=„graf graf–p graf-after–p“>Turns out 
your array needs a space to make note of the result of every action taken from every state. That’s called a <em class=„markup–em markup–p-em“>state transition, </em>because it represents the change in state caused by an action, and the resulting array is a <em class=„markup–em markup–p-em“>q-table,</em> because it tracks the quality of each transition. To track the effect of state transitions, every state needs four entries, one per available action (N,S,E,W):</p><figure name=„b3fd“ id=„b3fd“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/596bdfe63bcfe5194b58a71e091989c6?postId=d4f9ac7a798“ data-media-id=„596bdfe63bcfe5194b58a71e091989c6“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><p name=„0436“ id=„0436“ class=„graf graf–p graf-after–figure“>Now that you have a place to keep track of things, just go ahead and make a bunch of mistakes, by moving at random. 
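</p><p>The place to keep track of things can be pictured as one number per state and action. The sketch below uses plain nested lists so that it runs without extra packages. It is not the embedded code above, and the helper names make_q_table and note are invented here.</p>

```python
ACTIONS = ["N", "S", "E", "W"]   # same column order as the sample output
ROWS, COLS = 4, 4


def make_q_table(rows, cols, n_actions):
    """One entry per (row, col, action), all starting at zero."""
    return [[[0.0] * n_actions for _ in range(cols)] for _ in range(rows)]


def note(q_table, state, action, reward):
    """Add a reward (or penalty) to the entry for this state and action."""
    row, col = state
    q_table[row][col][ACTIONS.index(action)] += reward


q_table = make_q_table(ROWS, COLS, len(ACTIONS))

# The two encounters with the alien so far:
note(q_table, [0, 0], "N", -1)   # went badly
note(q_table, [0, 0], "E", 0)    # neither good nor bad

print(q_table[0][0])   # [-1.0, 0.0, 0.0, 0.0]
```

<p>Because N and E have their own slots, noting the second attempt no longer wipes out the first.</p><p>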
If you make a thousand random moves, what would your q-table look like?</p><figure name=„5244“ id=„5244“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/78c53913c9827791bd3486ec8a72483c?postId=d4f9ac7a798“ data-media-id=„78c53913c9827791bd3486ec8a72483c“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><pre name=„6ce5“ id=„6ce5“ class=„graf graf–pre graf-after–figure“>Here is sample output… in my session, I found the exit… once!</pre><pre name=„8162“ id=„8162“ class=„graf graf–pre graf-after–pre“> North South  East  West</pre><pre name=„417e“ id=„417e“ class=„graf graf–pre graf-after–pre“>[-273.    0.    0. -269.] row = 0, col = 0<br/>[ -95.  -80.    0.    0.] row = 0, col = 1<br/>[ -20.    0.  -16.    0.] row = 0, col = 2<br/>[   0.    0.    0.    0.] ...<br/><br/>[   0.    0.  -74.  -88.] row = 1, col = 0<br/>[   0.    0.    0.    0.] ...<br/>[   0.  -13.    0.   -5.]<br/>[  -1.    0.   -2.    0.]<br/><br/>[   0.  -15.    0.  -26.] row = 2, col = 0<br/>[  -5.  -11.   -5.    0.] ...<br/>[   0.    0.    0.    0.]<br/>[   0.    1.   -1.    0.] <- see the +1? you went S from (2,3) and found the exit.<br/><br/>[   0.    0.    0.    0.]<br/>[   0.    0.    0.    0.]<br/>[   0.    0.    0.    0.]<br/>[   0.    0.    0.    0.] row = 3, col = 3 (aka: the exit)</pre><p name=„089f“ id=„089f“ class=„graf graf–p graf-after–pre“>What does the sample output, above, say?</p><p name=„152d“ id=„152d“ class=„graf graf–p graf-after–p“>The first row represents actions from state (0,0); it says: [-273, 0, 0, -269]. 
That means the <em class=„markup–em markup–p-em“>cumulative rewards</em> for taking actions from (0,0) were:</p><ul class=„postList“><li name=„8ff4“ id=„8ff4“ class=„graf graf–li graf-after–p“>state<strong class=„markup–strong markup–li-strong“>[0,0]</strong>→action <strong class=„markup–strong markup–li-strong“>N</strong>→<strong class=„markup–strong markup–li-strong“> -273</strong> (don’t go north! big penalty)</li><li name=„1cde“ id=„1cde“ class=„graf graf–li graf-after–li“>state<strong class=„markup–strong markup–li-strong“>[0,0]</strong>→action <strong class=„markup–strong markup–li-strong“>S</strong>→<strong class=„markup–strong markup–li-strong“> 0</strong> (nothing happens when you go south)</li><li name=„ad60“ id=„ad60“ class=„graf graf–li graf-after–li“>state<strong class=„markup–strong markup–li-strong“>[0,0]</strong>→action <strong class=„markup–strong markup–li-strong“>E</strong>→<strong class=„markup–strong markup–li-strong“> 0</strong> (nothing happens when you go east)</li><li name=„b01f“ id=„b01f“ class=„graf graf–li graf-after–li“>state<strong class=„markup–strong markup–li-strong“>[0,0]</strong>→action <strong class=„markup–strong markup–li-strong“>W</strong>→<strong class=„markup–strong markup–li-strong“> -269 </strong>(don’t go west! big penalty)</li></ul><p name=„bd9f“ id=„bd9f“ class=„graf graf–p graf-after–li“>Good news: your q-table clearly says “don’t start by going north or west,” which is good advice. Bad news: you wasted 273+269 = 542 out of 1,000 attempts making those two mistakes.</p><p name=„f358“ id=„f358“ class=„graf graf–p graf-after–p“>Humph.</p><p name=„04f3“ id=„04f3“ class=„graf graf–p graf-after–p“>Maybe exploring the problem shouldn’t be an entirely random exercise? Maybe sometimes you should <em class=„markup–em markup–p-em“>explore</em> the problem, and sometimes you should <em class=„markup–em markup–p-em“>exploit</em> what you’ve learned. 
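</p><p>One way to picture that trade-off is a coin flip before every move: heads, pick any action at random; tails, pick the action with the best q-value so far. The sketch below is an assumption about how such a split might be written, not the embedded code; the function name choose_action and the parameter explore_rate are invented for this example.</p>

```python
import random

ACTIONS = ["N", "S", "E", "W"]


def choose_action(q_values, explore_rate=0.5, rng=random):
    """Return the index of an action: sometimes random, sometimes the best known."""
    if rng.random() < explore_rate:
        # Explore: any action, even one that went badly before.
        return rng.randrange(len(q_values))
    # Exploit: the highest q-value so far; ties are broken at random.
    best = max(q_values)
    best_actions = [i for i, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)


# q-values for state (0,0), taken from the sample output above
q_at_start = [-273.0, 0.0, 0.0, -269.0]
picks = [ACTIONS[choose_action(q_at_start)] for _ in range(1000)]
print({action: picks.count(action) for action in ACTIONS})
```

<p>With a 50/50 split, N and W are only ever chosen while exploring, so each should come up in roughly one move out of eight instead of one out of four.</p><p>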
Take a look at the source code, to see how to split your actions 50/50 between exploring and exploiting:</p><figure name=„567d“ id=„567d“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/33248fd61246577413e1a4393561288d?postId=d4f9ac7a798“ data-media-id=„33248fd61246577413e1a4393561288d“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><pre name=„3397“ id=„3397“ class=„graf graf–pre graf-after–figure“>(my session produced this output… what about yours?)<br/><strong class=„markup–strong markup–pre-strong“> <br/></strong>the annoying alien says that your starting state is = [0, 0]</pre><pre name=„6d83“ id=„6d83“ class=„graf graf–pre graf-after–pre“>- I failed 237+229 = 466 times<br/>[ -29.  -30.    0.    0.]<br/>[ -41.    0.  -26.    0.]<br/>[   0.    0.    0.    0.]<br/><br/>[   0.    0. -157. -136.]<br/>[   0.    0.    0.    0.]<br/>[   0.  -29.    0.  -12.]<br/>[  -3.    0.   -6.    0.]<br/><br/>[   0.  -24.    0.  -18.]<br/>[  -4.   -4.   -2.    0.]<br/>[   0.    0.    0.    0.]<br/>[   0.   10.   -1.   -2.] <- hey! I found the exit 10 times<br/><br/>[   0.    0.    0.    0.]<br/>[   0.    0.    0.    0.]<br/>[   0.    0.    0.    0.]<br/>[   0.    0.    0.    0.]</pre><p name=„51cb“ id=„51cb“ class=„graf graf–p graf-after–pre“>Almost done. 
One more thing.</p><p name=„d297“ id=„d297“ class=„graf graf–p graf-after–p“>Would it help if you could find the exit, starting from the entrance?</p><p name=„5ce7“ id=„5ce7“ class=„graf graf–p graf-after–p“>Well, you can — else I wouldn’t have asked — but how?</p><p name=„8e68“ id=„8e68“ class=„graf graf–p graf-after–p“>Or: what is special about state<strong class=„markup–strong markup–p-strong“>[2,3]</strong>→action <strong class=„markup–strong markup–p-strong“>S</strong>? In my session, that transition had a q-value of +10 (lots of rewards!). So here’s the real question: if you were adjacent to state<strong class=„markup–strong markup–p-strong“>[2,3]</strong>, and you knew from prior mistakes that transitioning to state<strong class=„markup–strong markup–p-strong“>[2,3]</strong> offered a reward, would you try to take advantage of that? If you were right above state<strong class=„markup–strong markup–p-strong“>[2,3]</strong>, in state<strong class=„markup–strong markup–p-strong“>[1,3]</strong>, and trying to exploit prior mistakes… what would help you to choose <strong class=„markup–strong markup–p-strong“>action=S</strong>, and transition to state<strong class=„markup–strong markup–p-strong“>[2,3]</strong>? 
<em class=„markup–em markup–p-em“>How can you move toward a positive reward?</em></p><p name=„457d“ id=„457d“ class=„graf graf–p graf-after–p“>Or: what if the positive reward had a way to move backwards, toward you?</p><p name=„6f4d“ id=„6f4d“ class=„graf graf–p graf-after–p“>(I warned you that Q-learning does things inside-out, remember?)</p><p name=„7608“ id=„7608“ class=„graf graf–p graf-after–p“>Take a really good look at this code, and run it over and over:</p><figure name=„97a7“ id=„97a7“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/904fa44c7b9853799753e1a8e0750170?postId=d4f9ac7a798“ data-media-id=„904fa44c7b9853799753e1a8e0750170“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><pre name=„c3f4“ id=„c3f4“ class=„graf graf–pre graf-after–figure“>Here are the q-values upon finding the exit for the first time:</pre><pre name=„c13d“ id=„c13d“ class=„graf graf–pre graf-after–pre“>[-1.  0.  0. -1.]<br/>[-1. -1.  0.  0.]<br/>[-1.  0. -1.  0.]<br/>[ 0.  0.  0.  0.]<br/><br/>[ 0.  0. -1. -1.]<br/>[ 0.  0.  0.  0.]<br/>[ 0.  0.  0. -1.]<br/>[-1.  0. -1.  0.]<br/><br/>[ 0. -1.  0. -1.]<br/>[-1. -1.  0.  0.]<br/>[ 0.  0.  0.  0.]<br/>[ 0.  1.  0.  0.] <- see the +1? Found the exit!<br/><br/>[ 0.  0.  0.  0.]<br/>[ 0.  0.  0.  0.]<br/>[ 0.  0.  0.  0.]<br/>[ 0.  0.  0.  0.]</pre><pre name=„5adb“ id=„5adb“ class=„graf graf–pre graf-after–pre“><br/>Here are the q-values upon noticing a future reward (from row=1 to row=2):</pre><pre name=„137c“ id=„137c“ class=„graf graf–pre graf-after–pre“>[-1.   0.   0.  -1. ]<br/>[-1.  -1.   0.   0. ]<br/>[-1.   0.  -1.   0. ]<br/>[ 0.   0.   0.   0. ]<br/><br/>[ 0.   0.  -1.  -1. ]<br/>[ 0.   0.   0.   0. ]<br/>[ 0.  -1.   0.  -1. ]<br/>[-1.   0.9 -1.   0. ] <- see the +0.9? Looked ahead, noticed that a reward had been found previously<br/><br/>[ 0.  -1.   0.  -1. ]<br/>[-1.  -1.  -1.   0. ]<br/>[ 0.   0.   0.   0. ]<br/>[ 0.   1.   0.   0. ]<br/><br/>[ 0.   0.   0.   0. ]<br/>[ 0.   0.   0.   0. ]<br/>[ 0.   0.   0.   0. ]<br/>[ 0.   0.   0.   0. ]</pre><pre name=„09b1“ id=„09b1“ class=„graf graf–pre graf-after–pre“><br/>Here are the q-values when the reward works its way backward, to the start:</pre><pre name=„3248“ id=„3248“ class=„graf graf–pre graf-after–pre“>- the reward made its way to (0,0)E!<br/>[-1.   -1.    0.66  0.  ]<br/>[-1.    0.73 -1.    0.  ]<br/>[ 0.    0.    0.    0.  ]<br/><br/>[ 0.    0.   -1.   -1.  ]<br/>[ 0.    0.    0.    0.  ]<br/>[ 0.   -1.    0.81 -1.  ]<br/>[-1.    0.9  -1.    0.  ]<br/><br/>[ 0.   -1.    0.   -1.  ]<br/>[-1.   -1.   -1.    0.  ]<br/>[ 0.    0.    0.    0.  ]<br/>[ 0.    1.    0.   -1.  ]<br/><br/>[ 0.    0.    0.    0.  ]<br/>[ 0.    0.    0.    0.  ]<br/>[ 0.    0.    0.    0.  ]<br/>[ 0.    0.    0.    0.  ]</pre><p name=„0abd“ id=„0abd“ class=„graf graf–p graf-after–pre“>The reward eventually works its way backward, one step at a time, from the end of the problem all the way to the beginning. 
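</p><p>The values in the tables above (1, 0.9, 0.81, 0.73, 0.66) are consistent with each step being worth 0.9 times the best value available from the state it leads to. The sketch below replays the solution path under that assumption. It is not the embedded code: the discount of 0.9 is inferred from the tables, and the names update, PATH, and DISCOUNT are invented here.</p>

```python
ACTIONS = ["N", "S", "E", "W"]
DISCOUNT = 0.9   # assumed: inferred from 1 -> 0.9 -> 0.81 in the tables
EXIT = (3, 3)

# (state, action, next_state, reward) along the solution E, E, S, E, S, S
PATH = [
    ((0, 0), "E", (0, 1), 0),
    ((0, 1), "E", (0, 2), 0),
    ((0, 2), "S", (1, 2), 0),
    ((1, 2), "E", (1, 3), 0),
    ((1, 3), "S", (2, 3), 0),
    ((2, 3), "S", (3, 3), 1),
]

# One list of four q-values per state, all starting at zero.
q_table = {(row, col): [0.0] * len(ACTIONS)
           for row in range(4) for col in range(4)}


def update(q_table, state, action, reward, next_state):
    """Set Q(state, action) to the reward plus the discounted best next value."""
    done = next_state == EXIT
    best_next = 0.0 if done else max(q_table[next_state])
    q_table[state][ACTIONS.index(action)] = reward + DISCOUNT * best_next


# Walk the same path several times; the reward creeps back one step per walk.
for attempt in range(1, len(PATH) + 1):
    for state, action, next_state, reward in PATH:
        update(q_table, state, action, reward, next_state)
    start_value = q_table[(0, 0)][ACTIONS.index("E")]
    print(f"after attempt {attempt}: Q((0,0), E) = {start_value:.2f}")
```

<p>In this replay the value at the start stays at 0 for five walks and only becomes 0.59 on the sixth, because each walk can pull the reward back by just one step.</p><p>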
Once you know the direction of a positive reward from the very first step, it isn’t hard to defeat the annoying alien:</p><figure name=„688c“ id=„688c“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/3ded6ff3b8142e35c4cf2184675b23f4?postId=d4f9ac7a798“ data-media-id=„3ded6ff3b8142e35c4cf2184675b23f4“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><p name=„d47e“ id=„d47e“ class=„graf graf–p graf-after–figure“>That’s it: a complete solution to a problem in the field of reinforcement learning, using Q-learning. It’s not quite as complete as some implementations, and the code is not too compact, but you did that on purpose, so that other readers can follow the meaning without getting lost in the math.</p><p name=„fdf4“ id=„fdf4“ class=„graf graf–p graf-after–p graf–trailing“>Next time: tic-tac-toe.</p></div></div> </html>