Qgelm

Machine Learning for 10th-Graders – Michael Megliola – Medium

<html> <div class=„section-divider“><hr class=„section-divider“/></div><div class=„section-content“><div class=„section-inner sectionLayout–insetColumn“><h1 name=„b262“ id=„b262“ class=„graf graf–h3 graf–leading graf–title“>Machine Learning is for 10th-Graders</h1>Write some Python that learns to solve a maze… using 10th-grade math.How can a single algorithm solve a maze, play tic-tac-toe, or teach a robot to walk, without knowing anything about mazes, tic-tac-toe, or walking?The algorithm is called Q-Learning: a simple, brilliant piece of artificial intelligence that we’re going to build from scratch, using introductory Python and some 10th-grade math.If I asked you to write three programs —<pre name=„7fa1“ id=„7fa1“ class=„graf graf–pre graf-after–p“>(1) solve a maze; (2) play tic-tac-toe; (3) make a robot walk</pre>— you might ask yourself—<pre name=„6a58“ id=„6a58“ class=„graf graf–pre graf-after–p“>(1) solve a maze: how do I avoid getting lost or stuck? (2) play tic-tac-toe: how do I win, or at least not lose? (3) make a robot walk: how do I walk without falling?</pre>You might try to write three different programs, each designed to avoid mistakes, like getting stuck, or losing a game, or falling down.Q-Learning turns that approach inside-out. QL is a type of reinforcement learning that is perfectly happy to learn from mistakes. Once we learn how to learn, we won’t need to know details about mazes, games, or robots.If you’re in high school and aren’t afraid of making a few mistakes, keep reading. Machine learning is all about programs that learn from making lots mistakes. How hard can it be to make lots of mistakes?It’s not hard. I do it all the time.<h4 name=„c9a0“ id=„c9a0“ class=„graf graf–h4 graf-after–p“>Examples 1,2,3: can you learn from these mistakes?</h4>(1) I flip a coin and you guess heads, but I say, “no, it’s tails.”(2) I choose a number from 1 to 10 and you guess 6, but I say, “no, it’s not 6.”(3) I choose a number from 1 to 10 and you guess 6, but I say, “6 is too high.”What can you learn from those mistakes? Does calling a coin toss give you any information about the next coin toss? (hint: no).What about guessing my secret number?<pre name=„2bbb“ id=„2bbb“ class=„graf graf–pre graf-after–p“><a href=„https://github.com/jvon-challenges/guessing-games“ data-href=„https://github.com/jvon-challenges/guessing-games“ class=„markup–anchor markup–pre-anchor“ rel=„nofollow noopener nofollow noopener“ target=„_blank“>https://github.com/jvon-challenges/guessing-games</a></pre><pre name=„b2d6“ id=„b2d6“ class=„graf graf–pre graf-after–pre“>(2.1) I am thinking of a number (1 in 10 = 10.0%) (2.2) guess of 6 is incorrect! (2.3) remaining values: [1,2,3,4,5,7,8,9,10] (1 in 9 = 11.1%)</pre><pre name=„31c6“ id=„31c6“ class=„graf graf–pre graf-after–pre“>(3.1) I am thinking of a number (1 in 10 = 10.0%) (3.2) guess of 6 is too high! (3.3) remaining values: [1,2,3,4,5] (1 in 5 = 20.0%)</pre>Except in trivial cases: a guess that ends in failure will often clarify the path to success. Don’t ever waste a mistake. Each mistake contains an important lesson of what not to do next (except for flipping coins — which is why football games don’t start with a ref saying I’m thinking of a number…).<h4 name=„1aa0“ id=„1aa0“ class=„graf graf–h4 graf-after–p“>So you get trapped in a maze by an evil alien…</h4>Here is the maze (the alien will be along shortly):<pre name=„9490“ id=„9490“ class=„graf graf–pre graf-after–p“><a href=„https://github.com/jvon-challenges/guessing-games“ data-href=„https://github.com/jvon-challenges/guessing-games“ class=„markup–anchor markup–pre-anchor“ rel=„nofollow noopener noopener nofollow noopener“ target=„_blank“>https://github.com/jvon-challenges/guessing-games</a></pre><pre name=„7812“ id=„7812“ class=„graf graf–pre graf-after–pre“>=================================== … … … +++ enter-> (1) … … +++ … … … +++ … +++ … … … +++ … … … +++ … … … … +++ … … … +++ … … … +++ … +++ +++ … … +++ +++ … … <-exit +++ +++ … … ===================================</pre>If I asked you to solve the maze by moving North, South, East, or West, you might take the steps E→E→S→E→S→S, resulting in the solution:<pre name=„2f89“ id=„2f89“ class=„graf graf–pre graf-after–p“><a href=„https://github.com/jvon-challenges/guessing-games“ data-href=„https://github.com/jvon-challenges/guessing-games“ class=„markup–anchor markup–pre-anchor“ rel=„nofollow noopener noopener noopener nofollow noopener“ target=„_blank“>https://github.com/jvon-challenges/guessing-games</a></pre><pre name=„636d“ id=„636d“ class=„graf graf–pre graf-after–pre“>=================================== … … … +++ enter-> (1) (2) (3) +++ … … … +++ … +++ … … … +++ (4) (5) … +++ … … … … +++ … … … +++ (6) … … +++ … +++ +++ … … +++ +++ … (7) <-exit +++ +++ … … ===================================</pre>Easy. That’s because you know a lot about mazes. You know about walls, boundaries, and blocks. You recognize “enter” and “exit”. You know the meaning of moving N, S, E, or W. You have a goal: reach the exit. You have a strategy: avoid dead ends. You are a total maze expert.Let’s turn that inside-out. Let’s say you’re trapped in the maze by an evil alien. But: the evil alien doesn’t tell you it’s a maze — just that you are trapped.In fact, the alien just says, “you are in position [0,0].”Then you wait, and the alien waits.Finally, you summon your courage and ask the alien, “what can I do?”The alien says, “you can choose N,S,E, or W”.Not knowing what else to do, you take a deep breath and say, “I choose N.”The alien says, “that went badly.”And then you wait, and the alien waits, but nothing happens.Finally, you get bored and ask, “can I try again?”The alien responds, “you are in position [0,0].”So what do you choose, this time? Probably not N… N went badly.So you say, “I choose E”.And the alien says, “you are in position [0,1].”Hey! We’re getting somewhere.<h4 name=„4900“ id=„4900“ class=„graf graf–h4 graf-after–p“>Write a program to defeat the alien (and its annoying friends)</h4>Before long, you tire of talking to the alien. You write some Python code to deal with not only this alien, but any future aliens who exhibit similar annoying tendencies.Turns out these aliens have a few things in common:<ul class=„postList“><li name=„db3e“ id=„db3e“ class=„graf graf–li graf-after–p“>aliens tell you what state you are in (e.g., position [0,0])</li><li name=„5a1e“ id=„5a1e“ class=„graf graf–li graf-after–li“>aliens allow you to choose among actions (e.g., N,S,W,E)</li><li name=„6fa2“ id=„6fa2“ class=„graf graf–li graf-after–li“>aliens reveal the dimensions of all possible states (e.g., 4×4)</li></ul>Here is the first annoying alien, in Python. Note that whenever you choose an action, the alien responds with three things: (1) your new state, (2) a reward or penalty, and (3) true/false indicating whether your attempt has ended:<figure name=„caaa“ id=„caaa“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/82a7580bdb78a985f8b9bb2f3bab0229?postId=d4f9ac7a798“ data-media-id=„82a7580bdb78a985f8b9bb2f3bab0229“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure>You need your own Python code, to defeat the alien. After thinking it through, you realize that your code needs to remember what state you are in, and whether taking an action from that state goes well or badly.So from your first encounter:<ul class=„postList“><li name=„f732“ id=„f732“ class=„graf graf–li graf-after–p“>state[0,0]→ action N→ goes badly, game over!</li><li name=„cae2“ id=„cae2“ class=„graf graf–li graf-after–li“>state[0,0]→ action E→ neither good nor bad.</li></ul>(you may guess that “0,0” means row 0, column 0, and “0,1” means row 0, column 1, N is North, going north from (0,0) is out of bounds and therefore ‘bad’, E is East… but don’t bother with details! — that will just make it harder to defeat the annoying tic-tac-toe playing alien that lies in your not-too-distant future)For starters, you will need some code that can take note of things at every possible state, all 4×4=16 of them. Here is some Python that makes room to store at least one thing about every state in a 4×4 universe:<figure name=„f5ad“ id=„f5ad“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/8d9d0994b75fa466c09aff6e4ddf5a61?postId=d4f9ac7a798“ data-media-id=„8d9d0994b75fa466c09aff6e4ddf5a61“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure>Let’s see… you need to note your first mistake:<ul class=„postList“><li name=„8290“ id=„8290“ class=„graf graf–li graf-after–p“>state[0,0]→ action N→ goes badly, game over!</li></ul>You could put something in position (0,0) of our array, but then you might want to make a note of your next attempt:<ul class=„postList“><li name=„f3b6“ id=„f3b6“ class=„graf graf–li graf-after–p“>state[0,0]→ action E→ neither good nor bad.</li></ul>Which would also go in position (0,0) of your array, wiping out your first entry (and forgetting your mistake, dooming you possibly to repeat it).Turns out your array needs a space to make note of the result of every action taken from every state. That’s called a state transition, because it represents the change in state caused by an action, and the resulting array is a q-table, because it tracks the quality of each transition. To track the effect of state transitions, every state needs four entries, one per available action (N,S,E,W):<figure name=„b3fd“ id=„b3fd“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/596bdfe63bcfe5194b58a71e091989c6?postId=d4f9ac7a798“ data-media-id=„596bdfe63bcfe5194b58a71e091989c6“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure>Now that you have a place to keep track of things, just go ahead and make a bunch of mistakes, by moving at random. If you make a thousand random moves, what would your q-table look like?<figure name=„5244“ id=„5244“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/78c53913c9827791bd3486ec8a72483c?postId=d4f9ac7a798“ data-media-id=„78c53913c9827791bd3486ec8a72483c“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><pre name=„6ce5“ id=„6ce5“ class=„graf graf–pre graf-after–figure“>Here is sample output… in my session, I found the exit… once!</pre><pre name=„8162“ id=„8162“ class=„graf graf–pre graf-after–pre“> North South East West</pre><pre name=„417e“ id=„417e“ class=„graf graf–pre graf-after–pre“>[-273. 0. 0. -269.] row = 0, col = 0 [ -95. -80. 0. 0.] row = 0, col = 1 [ -20. 0. -16. 0.] row = 0, col = 2 [ 0. 0. 0. 0. …0. 0. -74. -88.] row = 1, col = 0 [ 0. 0. 0. 0.] ... [ 0. -13. 0. -5.] [ -1. 0. -2. 0.0. -15. 0. -26.] row = 2, col = 0 [ -5. -11. -5. 0.] ... [ 0. 0. 0. 0.] [ 0. 1. -1. 0. <- see the +1? you went S from (2,3) and found the exit. 0. 0. 0. 0.] [ 0. 0. 0. 0.] [ 0. 0. 0. 0.] [ 0. 0. 0. 0.] row = 3, col = 3 (aka: the exit)</pre>What does the sample output, above, say?The first row represents actions from state (0,0); it says: [-273, 0, 0, -269]. That means the cumulative reward for taking actions from (0,0) were:<ul class=„postList“><li name=„8ff4“ id=„8ff4“ class=„graf graf–li graf-after–p“>state[0,0]→action N→ -273 (don’t got north! big penalty)</li><li name=„1cde“ id=„1cde“ class=„graf graf–li graf-after–li“>state[0,0]→action S→ 0 (noting happens when you go south)</li><li name=„ad60“ id=„ad60“ class=„graf graf–li graf-after–li“>state[0,0]→action E→ 0 (noting happens when you go east)</li><li name=„b01f“ id=„b01f“ class=„graf graf–li graf-after–li“>state[0,0]→action W→ -269 (don’t got west! big penalty)</li></ul>Good news: your q-table clearly says “don’t start by going north or west,” which is good advice. Bad news: you wasted 273+269 = 542 out of 1,000 attempts making those two mistakes.Humph.Maybe exploring the problem isn’t entirely a random exercise? Maybe sometimes you should explore the problem, and sometimes you should exploit what you’ve learned. Take a look at the source code, to see how to split your actions 50/50 between exploring and exploiting:<figure name=„567d“ id=„567d“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/33248fd61246577413e1a4393561288d?postId=d4f9ac7a798“ data-media-id=„33248fd61246577413e1a4393561288d“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><pre name=„3397“ id=„3397“ class=„graf graf–pre graf-after–figure“>(my session produced this output… what about yours?) the annoying alien says that your starting state is = [0, 0]</pre><pre name=„6d83“ id=„6d83“ class=„graf graf–pre graf-after–pre“>- I failed 237+229 = 466 times [ -29. -30. 0. 0.] [ -41. 0. -26. 0.] [ 0. 0. 0. 0.0. 0. -157. -136.] [ 0. 0. 0. 0.] [ 0. -29. 0. -12.] [ -3. 0. -6. 0.0. -24. 0. -18.] [ -4. -4. -2. 0.] [ 0. 0. 0. 0.] [ 0. 10. -1. -2. <- hey! I found the exit 10 times0. 0. 0. 0.] [ 0. 0. 0. 0.] [ 0. 0. 0. 0.] [ 0. 0. 0. 0.]</pre>Almost done. One more thing.Would it help if you could find the exit, starting from the entrance?Well, you can — else I wouldn’t have asked — but how?Or: what is special about state[2,3]→action S? In my session, that transition had a q-value of +10 (lots of rewards!). So here’s the real question: if you were adjacent to state[2,3], and you knew from prior mistakes that transitioning to state[2,3] offered a reward, would you try to take advantage of that? If you were right above state[2,3], in state[1,3], and trying to exploit prior mistakes… what would help you to choose action=S, and transition to state[2,3]? How can you move toward a positive reward?Or: what if the positive reward had a way to move backwards, toward you?(I warned you that Q-learning does things inside-out, remember?)Take a really good look at this code, and run it over and over:<figure name=„97a7“ id=„97a7“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/904fa44c7b9853799753e1a8e0750170?postId=d4f9ac7a798“ data-media-id=„904fa44c7b9853799753e1a8e0750170“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure><pre name=„c3f4“ id=„c3f4“ class=„graf graf–pre graf-after–figure“>Here are the q-values upon finding the exit for the first time:</pre><pre name=„c13d“ id=„c13d“ class=„graf graf–pre graf-after–pre“>[-1. 0. 0. -1.] [-1. -1. 0. 0.] [-1. 0. -1. 0.] [ 0. 0. 0. 0.0. 0. -1. -1.] [ 0. 0. 0. 0.] [ 0. 0. 0. -1.] [-1. 0. -1. 0.0. -1. 0. -1.] [-1. -1. 0. 0.] [ 0. 0. 0. 0.] [ 0. 1. 0. 0. <- see the +1? Found the exit!0. 0. 0. 0.] [ 0. 0. 0. 0.] [ 0. 0. 0. 0.] [ 0. 0. 0. 0.]</pre><pre name=„5adb“ id=„5adb“ class=„graf graf–pre graf-after–pre“> Here are the q-values upon noticing a future reward (from row=1 to row=2):</pre><pre name=„137c“ id=„137c“ class=„graf graf–pre graf-after–pre“>[-1. 0. 0. -1. ] [-1. -1. 0. 0. ] [-1. 0. -1. 0. ] [ 0. 0. 0. 0.0. 0. -1. -1. ] [ 0. 0. 0. 0. ] [ 0. -1. 0. -1. ] [-1. 0.9 -1. 0. <- see the +0.9? Looked ahead, noticed that a reward had been found previously 0. -1. 0. -1. ] [-1. -1. -1. 0. ] [ 0. 0. 0. 0. ] [ 0. 1. 0. 0.0. 0. 0. 0. ] [ 0. 0. 0. 0. ] [ 0. 0. 0. 0. ] [ 0. 0. 0. 0.]</pre><pre name=„09b1“ id=„09b1“ class=„graf graf–pre graf-after–pre“> Here are the q-values when the reward works its way backward, to the start:</pre><pre name=„3248“ id=„3248“ class=„graf graf–pre graf-after–pre“>- the reward made its way to (0,0)E! [-1. -1. 0.66 0. ] [-1. 0.73 -1. 0. ] [ 0. 0. 0. 0.0. 0. -1. -1. ] [ 0. 0. 0. 0. ] [ 0. -1. 0.81 -1. ] [-1. 0.9 -1. 0.0. -1. 0. -1. ] [-1. -1. -1. 0. ] [ 0. 0. 0. 0. ] [ 0. 1. 0. -1.0. 0. 0. 0. ] [ 0. 0. 0. 0. ] [ 0. 0. 0. 0. ] [ 0. 0. 0. 0.]</pre>The reward eventually works its way backward, one step at a time, from the end of the problem all the way to the beginning. Once you know the direction of a positive reward from the very first step, it isn’t hard to defeat the annoying alien:<figure name=„688c“ id=„688c“ class=„graf graf–figure graf–iframe graf-after–p“><div class=„aspectRatioPlaceholder is-locked“><div class=„aspectRatioPlaceholder-fill“ style=„padding-bottom: 75%;“/><div class=„iframeContainer“><iframe data-width=„800“ data-height=„600“ width=„700“ height=„525“ src=„https://medium.com/media/3ded6ff3b8142e35c4cf2184675b23f4?postId=d4f9ac7a798“ data-media-id=„3ded6ff3b8142e35c4cf2184675b23f4“ data-thumbnail=„https://i.embed.ly/1/image?url=https%3A%2F%2Frepl.it%2Fpublic%2Fimages%2Freplit-logo-800x600.png&key=a19fcc184b9711e1b4764040d3dc5c07“ allowfullscreen=„“ frameborder=„0“>[embedded content]</iframe></div></div><figcaption class=„imageCaption“>press the run button (top, center) — or open in repl.it (top, right)</figcaption></figure>That’s it: a complete solution to a problem in the field of reinforcement learning, using Q-learning. It’s not quite as complete as some implementations, and the code is not too compact, but you did that on purpose, so that other readers can follow the meaning without getting lost in the math.Next time: tic-tac-toe.</div></div> </html>

Qgelm

Seiten-Werkzeuge

Webseiten-Werkzeuge

Benutzer-Werkzeuge

Machine Learning for 10th-Graders – Michael Megliola – Medium