[{"categories":[],"content":"If you\u0026rsquo;re reading this, you probably know some SQL and are aware of some of its building blocks (in order of operation):\n from (including joins) where group by having select order by limit But now you\u0026rsquo;re doing some analysis in Pandas and you need to be able to query your dataframe much like you usually do in SQL. This article attempts to provide a mapping between common operations in SQL and their counterpart in Pandas using a fictional housing dataset.\nSelect, order by, and limit Say you want to select all columns in your dataset and limit the number of rows coming back. In SQL,\nselect * from housing limit 10 In Pandas,\nhousing_df.head(10) Want to order by number of bedrooms? In SQL,\nselect * from housing order by num_bedrooms limit 10 In Pandas,\nhousing_df.sort_values(by=[\u0026#39;num_bedrooms\u0026#39;]).head(10) Just want to select a few columns? In SQL,\nselect num_bedrooms, price from housing In Pandas,\nhousing_df[[\u0026#39;num_bedrooms\u0026#39;, \u0026#39;price\u0026#39;]] Filtering with where Only care about houses with three or more bedrooms? In SQL,\nselect * from housing where num_bedrooms \u0026gt;= 3 In Pandas,\nhousing_df[housing_df.num_bedrooms \u0026gt;= 3] Want multiple filters? 
In SQL,\nselect * from housing where num_bedrooms \u0026gt;= 3 and price \u0026lt; 450000 In Pandas,\nhousing_df[(housing_df.num_bedrooms \u0026gt;= 3) \u0026amp; (housing_df.price \u0026lt; 450000)] Or what if you want to filter rows by values in a column?\nselect * from housing where building_material in (\u0026#39;brick\u0026#39;, \u0026#39;wood\u0026#39;) In Pandas,\nhousing_df[housing_df.building_material.isin([\u0026#39;brick\u0026#39;, \u0026#39;wood\u0026#39;])] Want rows not in the list?\nhousing_df[~housing_df.building_material.isin([\u0026#39;brick\u0026#39;, \u0026#39;wood\u0026#39;])] Joins I\u0026rsquo;ve often found myself wanting to join two tables together on a particular column so that I have as many features as possible available for later analysis. In SQL, this has been relatively easy:\nselect * from housing left outer join more_housing on housing.house_id = more_housing.house_id In Pandas,\npd.merge(housing_df, more_housing_df, on=\u0026#39;house_id\u0026#39;, how=\u0026#39;left\u0026#39;) Joining on multiple columns between both tables? In SQL,\nselect * from housing left outer join more_housing on housing.house_id = more_housing.house_id and housing.price = more_housing.price In Pandas,\npd.merge(housing_df, more_housing_df, on=[\u0026#39;house_id\u0026#39;, \u0026#39;price\u0026#39;], how=\u0026#39;left\u0026#39;) Want to stack or union two tables? 
Let\u0026rsquo;s get a list of the distinct housing prices from both tables.\nselect price from housing union select price from more_housing In Pandas (dropping duplicates, since union returns only distinct rows),\npd.concat([housing_df, more_housing_df]).price.drop_duplicates() Group by and having Say we want to count how many houses there are when grouped by num_bedrooms and is_two_story.\nselect num_bedrooms, is_two_story, count(*) from housing group by num_bedrooms, is_two_story In Pandas,\nhousing_df.groupby([\u0026#39;num_bedrooms\u0026#39;, \u0026#39;is_two_story\u0026#39;]).size() Want to sort those values now?\nhousing_df.groupby([\u0026#39;num_bedrooms\u0026#39;, \u0026#39;is_two_story\u0026#39;]).size().to_frame(\u0026#39;size\u0026#39;).reset_index().sort_values([\u0026#39;num_bedrooms\u0026#39;, \u0026#39;size\u0026#39;]) Or what if you just want to consider groups where the count is greater than 5?\nhousing_df.groupby([\u0026#39;num_bedrooms\u0026#39;, \u0026#39;is_two_story\u0026#39;]).filter(lambda x: len(x) \u0026gt; 5).groupby([\u0026#39;num_bedrooms\u0026#39;, \u0026#39;is_two_story\u0026#39;]).size().to_frame(\u0026#39;size\u0026#39;).reset_index().sort_values([\u0026#39;num_bedrooms\u0026#39;, \u0026#39;size\u0026#39;]) A note about multiple indices Say you\u0026rsquo;re doing some quick analysis on COVID-19 data and want to get the total number of confirmed cases over time for the US. You\u0026rsquo;d start with:\nus_df = df[df[\u0026#39;Country\u0026#39;] == \u0026#39;US\u0026#39;].groupby([\u0026#39;Country\u0026#39;, \u0026#39;Last Updated\u0026#39;])[[\u0026#39;Confirmed\u0026#39;]].sum() The problem is that the dataframe now has a two-level index (Country and Last Updated), which can make it hard to plot the data. 
We can unstack the first column in the groupby, Country, to leave us with the Last Updated column as an index:\nus_df = us_df.unstack(level=0) Now we just rename the only column we have, add the index as a new column, then just replace the index with the row count.\nus_df.columns = [\u0026#39;Confirmed\u0026#39;] us_df[\u0026#39;Last Updated\u0026#39;] = us_df.index us_df.index = np.arange(us_df.shape[0]) Hopefully this helps! For more details about implementation, make sure to check out Pandas\u0026rsquo; documentation. There are also some other good posts about SQL to Pandas usage like this one.\n","date":"2020-04-01T11:26:50-05:00","title":"How to Rewrite SQL Queries in Pandas","uri":"https://www.bobbywlindsey.com/2020/04/01/how-to-rewrite-sql-queries-in-pandas/"},{"categories":null,"content":"","date":"2020-04-01T11:26:50-05:00","title":"Posts","uri":"https://www.bobbywlindsey.com/posts/"},{"categories":[],"content":"In his recent podcast, Sam Harris talks with WordPress creator Matt Mullenweg about distributed work and the benefits of working from home, topics that are particularly relevant during this COVID-19 pandemic we find ourselves in. I thought this episode was great so I wanted to share some notes that I took away from this conversation.\nThe levels of autonomy Matt makes an argument for a kind of remote work on steroids, which he calls distributed work. His company, Automattic, works in this way with ~1,200 employees in 75 different countries. He explains many companies land somewhere on the scale of distributed work with tech companies usually leading the way while more legacy companies are trailing behind. He explains this scale in a little more detail and calls it the \u0026ldquo;5 levels of autonomy\u0026rdquo;:\nLevel 1: You can get by for a day without being in the office, but you\u0026rsquo;ll probably be less effective and put things off\nLevel 2: You try to recreate what you do in the office but just do it online. 
Everything is still synchronous - you still work fixed hours, attend the deluge of meetings, etc\u0026hellip;\nLevel 3: You start to really take advantage of remote-work-enabling tools. You might:\n Share your screen Use a shared Google Doc for meeting notes so everyone can see notes being written live and ensure the notes accurately reflect the shared understanding of what was agreed to, thereby preventing drama and conflict over the confusion down the line Invest in better equipment for audio, lighting, etc\u0026hellip; Matt mentions krisp.ai which uses machine learning to remove background noise from incoming and outgoing audio Invest in the quality of written communication since it plays a more valuable role in these conditions Make information more transparent internally so that information is not locked up in private email boxes or other siloed software Level 4: You start working asynchronously. This means you don\u0026rsquo;t have to be on your computer at the same time as your team. You can design your day. Your boss judges \u0026ldquo;work\u0026rdquo; based on output, not time spent in office.\nThe effect of this is easy to see - what would normally take an organization 3 days to do using synchronous work patterns could take just 1 day with asynchronous work. On a technical level, this makes sense - synchronous patterns have dependencies; each operation must wait for the previous one to complete.\nLevel 5: You\u0026rsquo;re doing better work than any in-person organization could do. You can design your environment and your day around health and well-being, like doing squats and pushups after a meeting, using a treadmill desk, lighting a candle at your desk, etc\u0026hellip; People can bring their best selves to their work.\nYou might observe that as these levels of autonomy increase, there\u0026rsquo;s more emphasis on using technology that enables work to be done in a more asynchronous way. 
For example, if communication and progress is transparent through tools like Slack and Jira, then their advancement is much like a baton being quickly and easily passed from runner to runner. Blockers are minimized and work is more easily distributed.\nHow to work distributively Fortunately, Matt provided some tips on how many of us can move more toward this distributed way of work.\nHe mentions software like Zoom, Slack, email (just for private stuff like HR things), a blogging system like Discourse for discussion threads, and something like Google Alerts for your company\u0026rsquo;s internal content so that you can set alerts for content you care about or when someone mentions you (far better than those long CC email chains). He also throws in the idea of asynchronous audio where you send short audio clips (like in WhatsApp or Signal) instead of synching up for a call.\nOf course, we\u0026rsquo;ve all seen the difference between a moderated thread and an unmoderated one. He suggests to start a thread with what you want the outcome to be and when you need it by. Then after everyone provides their arguments, summarize the best arguments on every side and then what the decision was. In essence, each thread will be a self-contained decision making artifact that can be used for later reflection.\nIn a similar vein, messages in general should be specific and contain as much of the context as possible. Each message should be self-contained so that a person has everything they need to respond and so there\u0026rsquo;s less chance for misinterpretation.\nWhen reading others\u0026rsquo; messages, assume positive intent (consider Hanlon\u0026rsquo;s razor). 
It\u0026rsquo;s also helpful if you remove ambiguity in the intent of your own messages by perhaps throwing in some extra fluffy language or using an emoji or gif.\nMy thoughts I\u0026rsquo;m a big fan of working remotely and fortunately I\u0026rsquo;ve been able to work with teams who share similar ideas of how to best work more distributively. That being said, I am aware that this comes more easily to those who work in a technical field as I do.\nI also think there\u0026rsquo;s a place for on-site meetups, and Matt makes mention of this a bit on the podcast as well. However, for many companies, the balance of on-site meetings vs. remote work can afford to be adjusted a bit more in favor of remote work. This not only decreases operational overhead but also makes your team more antifragile to unforeseen circumstances.\n","date":"2020-03-28T17:51:25-05:00","title":"The New Future of Work","uri":"https://www.bobbywlindsey.com/2020/03/28/the-new-future-of-work/"},{"categories":["Math"],"content":"You want to design an experiment like an A/B test. You\u0026rsquo;re always told to randomly assign members into two different groups, treatment and control, in order to minimize systematic differences between the two groups. Why do you care about minimizing differences? Because this helps you minimize the number of confounding variables in your experiment so that if you see an effect, you can be more confident it was due to the treatment itself.\nAwesome, this sounds great. But how do you really know that random assignment minimizes systematic differences between two groups?\nRandomly splitting into two groups Say you have some finite population from which to choose members of both your treatment and control groups:\n$$ x_1, \\dots, x_{2N} $$\nYou do what you\u0026rsquo;re told and randomly assign members into two different groups. 
The first group:\n$$ u_1, \\dots, u_N $$\nAnd the second group:\n$$ v_1, \\dots, v_N $$\nSo the sum over our finite population is just the sum over these two groups:\n$$ x_1 + \\dots + x_{2N} = (u_1 + \\dots + u_N) + (v_1 + \\dots + v_N) $$\nIf we divide both sides by $2N$, we get:\n$$ \\begin{aligned} \\mu \u0026amp;= \\frac{1}{2}\\bar{u} + \\frac{1}{2}\\bar{v} \\\\\\\n\u0026amp;= \\frac{1}{2}(\\bar{u} + \\bar{v}) \\end{aligned} $$\nSo the average of the two sample means is equal to the population mean. That\u0026rsquo;s interesting.\nDiscovering the differences But what do we know about the difference between the two groups? Well,\n$$ \\begin{aligned} \\bar{u} - \\bar{v} \u0026amp;= \\bar{u} - (2\\mu - \\bar{u}) \\\\\\\n\u0026amp;= \\bar{u} - 2\\mu + \\bar{u} \\\\\\\n\u0026amp;= 2\\bar{u} - 2\\mu \\\\\\\n\u0026amp;= 2(\\bar{u} - \\mu) \\end{aligned} $$\nSo the difference between the two groups is just twice the distance from $\\bar{u}$ to $\\mu$. And since $u_1, \\dots, u_N$ is a random sample from our population, then:\n$$ \\E(\\bar{u}) = \\mu $$\nand\n$$ \\begin{aligned} \\var(\\bar{u}) \u0026amp;= \\frac{\\sigma^2}{N} \\cdot \\frac{2N - N}{2N - 1} \\\\\\\n\u0026amp;\\approx \\frac{\\sigma^2}{2N} \\text{ since 2N-1 is approx 2N as N gets larger} \\end{aligned} $$\nwhere $\\frac{2N - N}{2N - 1}$ is the finite population correction factor for the variance (its square root is what applies to the standard error), needed since the size of our groups is greater than 5% of the finite population size. 
For a further detailed derivation of $\\var(\\bar{u})$, you should read my post about the Central Limit Theorem.\nNow that we have these two results, we can now answer the question \u0026ldquo;what is the expected difference between both groups?\u0026quot;:\n$$ \\begin{aligned} \\E(\\bar{u} - \\bar{v}) \u0026amp;= 2\\E(\\bar{u} - \\mu) \\\\\\\n\u0026amp;= 2 \\cdot 0 \\\\\\\n\u0026amp;= 0 \\end{aligned} $$\nAnd what about the variance of the difference between both groups?\n$$ \\begin{aligned} \\var(\\bar{u} - \\bar{v}) \u0026amp;= 2^2 \\var(\\bar{u} - \\mu) \\\\\\\n\u0026amp;= 4 \\var(\\bar{u}) \\\\\n\u0026amp;= \\frac{2\\sigma^2}{N} \\end{aligned} $$\nSo on average, the difference between $\\bar{u}$ and $\\bar{v}$ is 0 which tells us that the two groups will be about the same! But what about random volatility? In practice, random volatility is unlikely since the variance of $\\bar{u} - \\bar{v}$ converges to 0 as the population size approaches infinity. This, of course, is nothing new. Sample variance decreases as the sample size increases.\nConfidence about the differences We just mentioned that random volatility is unlikely and variance converges to 0 as we keep increasing the population size. In practice however, increasing your population size might be impractical. So let\u0026rsquo;s say you just want to be 95% confident that the difference between the means of the two randomly selected groups, with each group having $N$ members, is less than some small number. In mathematical terms, this is asking for us to solve:\n$$ 2\\sigma(\\bar{u} - \\bar{v}) = \\epsilon $$\nWe write $2 \\sigma$ since any normal random variable is within two standard deviations of the mean about 95% of the time. 
And we know $\\bar{u} - \\bar{v} = 2(\\bar{u} - \\mu)$ is approximately normal since $\\bar{u}$ is approximately normal by the Central Limit Theorem.\nSolving for $N$, and using the fact that scaling a random variable by 2 scales its standard deviation by 2, we get:\n$$ \\begin{aligned} 2 \\sigma(2(\\bar{u} - \\mu)) \u0026amp;= \\epsilon \\\\\\\n4 \\sigma(\\bar{u} - \\mu) \u0026amp;= \\epsilon \\\\\\\n\\frac{4 \\sigma}{\\sqrt{2N}} \u0026amp;= \\epsilon \\\\\\\n\\frac{8 \\sigma^2}{\\epsilon^2} \u0026amp;= N \\end{aligned} $$\nOf course this further reinforces the fact that by increasing $N$, the two sample means become arbitrarily close.\nConclusion And there you have it! We\u0026rsquo;ve shown how random assignment ensures that any differences between and within the groups are not systematic at the outset of the experiment. This means that any observed differences between the groups at the end of the experiment can be more confidently attributed to the effects of the experiment itself, rather than underlying differences between groups.\n","date":"2020-02-23T06:00:30-06:00","title":"How Random Assignment Minimizes Systematic Differences Between Groups","uri":"https://www.bobbywlindsey.com/2020/02/23/how-random-assignment-minimizes-systematic-differences-between-groups/"},{"categories":["Self Improvement"],"content":"Cognitive biases are mental shortcuts hardwired in your brain that, in the past, helped your ancestors to survive. Although these mental heuristics might have been useful then, today they lead to many errors, sometimes very grave errors. This post is a non-exhaustive list of some biases I think are useful to know. I also try to provide an example of the bias and some questions you might ask yourself to (hopefully) increase your likelihood of dodging it.\nKeep in mind, these aren\u0026rsquo;t blanket statements. Biases are tendencies we have, not something that happens 100% of the time. Also, some of the names of these biases might not be what you call them. Some of them might not even have \u0026ldquo;bias\u0026rdquo; in the name. 
That\u0026rsquo;s ok. I\u0026rsquo;ve likely renamed a few just for my own personal taste.\nSurvivorship bias You draw conclusions from things that made it past some filter and ignore the things that didn\u0026rsquo;t\n In WWII, the US wanted to figure out which parts of the plane to reinforce so that more survived air fights. They looked at the distribution of bullet holes on all the planes that came back and tried to armor the areas with a higher concentration of bullet holes. Abraham Wald caught this instance of survivorship bias and told the US to think about the planes that didn\u0026rsquo;t come back: the areas with few holes on the returning planes, such as the engines, were where hits were fatal. So they armored up the engines helping to turn the war.\nAsk yourself: Do I have the whole dataset or am I drawing conclusions just based on what is visible? Who is missing from the sample population? What could be making this sample population nonrandom relative to the underlying population?\nAssociation bias You automatically associate things with painful or pleasurable experiences including liking or disliking something associated with something good or bad\n Your assistant is always telling you something\u0026rsquo;s gone wrong. Due to association bias, you start to associate your assistant with bad feelings, even though the problems aren\u0026rsquo;t ones they created. Don\u0026rsquo;t shoot the messenger.\nAsk yourself: Am I evaluating this thing, situation, or person based on their track record? Am I encouraging people to tell me bad news immediately? Am I separating the event from the association? Do I like their personality just because they\u0026rsquo;re attractive? Am I mistaking the person\u0026rsquo;s appearance for reality?\nDisconfirmation bias You require more proof for ideas or evidence you don\u0026rsquo;t want to believe\n John isn\u0026rsquo;t convinced by the theory of evolution by natural selection. 
You keep showing him incremental stages of fossils but he keeps asking you to find fossils at a stage in between. You do this and he continues to ask.\nReciprocation bias You reciprocate what others have done for or to you\n You got coffee with Jerry today and he paid for it. After a few minutes of talking, he invited you to a rally he was going to. You feel an obligation to go even though you don\u0026rsquo;t want to. After all, he did just buy you a coffee.\nAsk yourself: If I\u0026rsquo;m about to make a concession, what do I really want to achieve?\nMemory bias You remember selectively and incorrectly including being influenced by different word phrasings (i.e. the framing effect)\n David\u0026rsquo;s upset because his girlfriend seems to only remember his mistakes. In a twist, David tends to only remember the negative things his girlfriend says. They both are guilty of memory bias.\nYou must choose one of two options:\n A 33% chance of saving all 600 people, 66% possibility of saving no one. A 33% chance that no people will die, 66% probability that all 600 will die. People have a tendency to choose the first since it was positively framed. But in reality, they\u0026rsquo;re both the same scenario. This framing effect was demonstrated by Amos Tversky and Daniel Kahneman in 1981.\nAsk yourself: Am I depending on my memory or someone else\u0026rsquo;s memory or testimony? Am I keeping records of important events?\nAbsurdity bias You give a zero probability to events you can\u0026rsquo;t remember or have never happened\n The Turkey is fed by the farmer every day, each day giving him more confidence that he\u0026rsquo;ll be fed tomorrow with high probability. Then Thanksgiving comes along. 
This is similar to Bertrand Russell\u0026rsquo;s chicken problem.\nOmission and abstract blindness You only see things that grab your attention and neglect important missing information or the abstract\n Ask yourself: Although I see the available information, what other information could I be missing? Are there other explanations for this? Have I used inversion to turn the situation upside down? Did I compare both positive and negative characteristics? Does my judgement change as a result?\nIllusion of transparency You expect others to know what you mean by your words since you know what you mean by your words\n Daniel thought he had explained it to his classmate perfectly. He thought to himself, \u0026ldquo;why doesn\u0026rsquo;t he get it?\u0026rdquo;\nAsk yourself: Am I actively culling ambiguity in my words (both spoken and written) since they\u0026rsquo;re probably more ambiguous than I think?\nPlanning fallacy You make plans that are unrealistically close to best-case scenarios\n You planned your trip only to find out you didn\u0026rsquo;t have enough time to do everything you wanted.\nNote that the planning fallacy has asymmetric behavior since it skews your predictions to best-case scenarios.\nAsk yourself: Am I looking at the base rates (the statistics) for how often this kind of plan or forecast has succeeded in the past to create a baseline prediction? Am I using specific information about the case to adjust the baseline prediction if there are particular reasons to expect the optimistic bias to be more or less pronounced in this plan or forecast than in others of the same type? Did I perform a Reference Forecast?\nSocial proof You imitate the behavior of a crowd\n John walked by a man crying for help and observed that no one around the man seemed to care. 
So John continued walking.\nThe Bystander Effect is a corollary of social proof.\nAsk Yourself: Am I just relying on others\u0026rsquo; decisions because I\u0026rsquo;m in an unfamiliar environment or crowd, lack knowledge, am stressed, or have low self-esteem? Do I want to agree with the group because it\u0026rsquo;s more comfortable? Am I being pluralistically ignorant because I think nothing is wrong since nobody else thinks anything is wrong? Am I diffusing responsibility in my decision by siding with the majority because the more people there are, the less personal responsibility I feel?\nReason-respecting bias You comply with requests merely because you\u0026rsquo;ve been given a reason\n Ask Yourself: Is the reason I\u0026rsquo;m being given actually a good reason?\nContrast-comparison bias You judge something by comparing it with something else, instead of judging it on its own\n Jessica saw a shirt she kinda liked. She looked at the tag and it was 70% off at a final price of $60. With such a price difference, she thought it was a great bargain and bought the shirt.\nAsk Yourself: Am I evaluating this person or thing on its own rather than by contrast with something else?\nYou can of course leverage this bias (which marketers do all the time). You can:\n Show a thing you want to sell next to an inferior version Sell the expensive thing first, then sell the cheaper stuff afterward since it\u0026rsquo;ll seem like a bargain Self-deception and denial bias You distort reality to reduce pain or increase pleasure\n Joe saw that he visibly irritated his friends with the way he phrased some of his comments. He thought to himself, \u0026ldquo;they\u0026rsquo;re just stuck-up, they know it\u0026rsquo;s true. I don\u0026rsquo;t know why they have a problem with me, I\u0026rsquo;m just fine\u0026rdquo;.\nAsk Yourself: Am I telling myself a narrative that saves me pain or gives me pleasure? Is this narrative actually wishful thinking? 
Am I in denial?\nYou can, of course, use this bias in good ways. Stoics distort reality by reframing a situation that happened to them in more positive terms. The difference here is that they know it\u0026rsquo;s a reframing or distortion; you get in trouble if you mistake the distortion for reality.\nDeprivation syndrome You strongly react when something you like or have is taken away or is at risk of being taken away. You also value more what you can\u0026rsquo;t have (this includes scarcity)\n Ask Yourself: Do I want this for emotional or rational reasons? Why do I want this? Am I placing a higher value on it because I almost have it and am afraid to lose it? Do I want it just because I can’t have it? Do I want it just because it’s rare or hard to obtain? Do I want it because other people want it?\nEndowment effect You value the things you own more than those same things if you didn\u0026rsquo;t own them\n Ask Yourself: Am I holding onto this thing when it\u0026rsquo;d be cheaper to just let it go and buy it again in the future if required? Am I placing more value on this thing than if I didn\u0026rsquo;t have it?\nConsistency bias You are consistent with your prior commitments and ideas even when acting against your best interest or in the face of disconfirming evidence\n Sarah agreed to take her and Susie\u0026rsquo;s kids to school. After a couple of days, Susie asked if Sarah could do it again. Since Sarah did it before, she more readily agreed to do it again.\nAsk Yourself: Do I already have a public, effortful, or voluntary commitment? Is this commitment in line with what I want to be doing? Do I really just need to cut my losses? Am I falling for the low-ball or the foot-in-the-door technique?\nOf course, you can use this bias in more sinister ways by getting someone to commit to something, even if small, so that they\u0026rsquo;re more likely to agree to do it again in the future and even like you more as a result. 
Ben Franklin did something similar with a man who didn\u0026rsquo;t particularly like him. He asked the man to borrow a book and he agreed. After a few times doing this, the man found himself in a more giving attitude toward Ben which helped smooth their relationship.\nSelf-serving bias You have an overly positive view of your abilities and future. This is over-optimism.\n Ask Yourself: Am I overestimating my ability to predict or plan for the future? How can I be wrong? Who could be qualified to tell me I’m wrong? Have I looked at the track record instead of trusting my first impressions? Am I only looking at the successes in the track record - confirmation bias?\nConfirmation bias You only consider evidence that confirms your beliefs\n Ask Yourself: Do I already have a belief about the subject? Am I looking for reasons that support my belief?\nFamiliarity bias You prefer things that are more familiar\n Ask Yourself: Am I wanting to do this thing only because I’m more familiar with it and this makes me comfortable?\nLoss aversion You find losses more painful than gains are pleasant\n Ask Yourself: Am I worried about failing even though the upside of the risk is greater and more likely?\nStatus-quo bias Loss aversion + familiarity bias\n Anchoring effect You give too much weight to initial information and this influences your decisions\n Ask Yourself: Did I already have a value in mind? Am I being careful that this value doesn’t affect my estimation of the quantity? Am I making choices from a zero base level and remembering what I want to achieve? Am I adjusting the information I have to reality? 
Is the question/situation being framed for me and influencing the information I pay attention to or conclusions I make - bias from omission and abstract blindness?\nAvailability bias You want to estimate the frequency or size of an event but instead give an impression of the ease with which instances come to mind\n Ask Yourself: Was it easy to recall how often this thing happens or how big this thing is? Is this thing representative evidence? Is this thing a random event?\nAffect bias You make decisions or judgements by consulting your emotions\n Ask Yourself: Am I making a decision while angry, upset, sad, depressed, or excited?\nSolution aversion You deny problems and any scientific evidence supporting the existence of the problems when you don\u0026rsquo;t like the solution\n Hindsight bias You change the history of your beliefs after observing the outcome\n Even though he was a little shaky about his predictions during the game, after the game he shouted \u0026ldquo;I told you I knew they\u0026rsquo;d win!\u0026rdquo;.\nAsk Yourself: Did I think the outcome was obvious? If so, can I show it was obvious by referencing a Decision Journal entry that contained my prediction of the outcome? Am I giving too little credit to people who made good decisions because the right choice appeared obvious after the fact (outcome bias)?\nOutcome bias You blame decision makers for good decisions when the outcome is bad and give them too little credit for successful decisions that appear obvious only after the fact (hindsight bias)\n Ask Yourself: Did I blame them for a bad outcome even though the decision was good?\nSimpson\u0026rsquo;s paradox You see a trend in several groups of data, but the trend disappears or reverses when the groups are combined\n A great example of this was the UC Berkeley gender bias case. Data from a high level suggested there might be bias in the admission rates in favor of men. 
But when you look at how male and female applicants apply to each program, the assumed bias actually flips. In short, men were applying to programs that had a high overall acceptance rate while women were applying to departments with lower overall acceptance rates. This guaranteed an overall lower acceptance rate of women even though they might have been favored across multiple departments.\nBase rate fallacy You make decisions with specific information and forget to include the base rate (i.e. general) information\n In 1995, the UK Committee on Safety of Medicines issued a letter to doctors that said there were many next-gen oral contraceptives that doubled the chance of blood clots. Consequently, many women stopped taking birth control and unwanted pregnancies increased significantly. But the media forgot to mention base rates - the likelihood of getting blood clots in the first place was 1 out of 7,000. Third-gen contraceptives doubled this to 2 out of 7,000 - still a very small number.\nConditional fallacy P(A|B) != P(B|A)\n The probability that an animal is a dog given it has fur is not equal to the probability that an animal has fur given it\u0026rsquo;s a dog.\nThe Prosecutor\u0026rsquo;s fallacy is relevant here as there have been many times in court where probability was misapplied\nConjunction fallacy P(A and B) \u0026lt;= P(A) and P(A and B) \u0026lt;= P(B)\n Classical example from aforementioned Amos Tversky and Daniel Kahneman:\nLinda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.\nWhich is more probable?\n Linda is a bank teller. Linda is a bank teller and is active in the feminist movement. 
Most people choose number 2 but probability theory guarantees that option 2 can never be more probable than option 1.\nRecommended resources Seeking Wisdom by Peter Bevelin Poor Charlie\u0026rsquo;s Almanack Thinking, Fast and Slow by Daniel Kahneman How Not To Be Wrong by Jordan Ellenberg [1] Feature image provided by xkcd\n","date":"2020-01-20T22:55:46-06:00","title":"The Cognitive Bias Checklist","uri":"https://www.bobbywlindsey.com/2020/01/20/cognitive-biases-checklist/"},{"categories":["Dev"],"content":"If you want to rename a database in Redshift, you might have tried something like this:\nALTER DATABASE oldname RENAME TO newname; But you find you get an error because people or processes are still connected to it! Fear not, there are two methods to handle this:\n Forcefully terminate the existing connections Reboot the cluster Terminating existing connections If you want to terminate the existing connections, you first need to list them:\nSELECT * FROM STV_SESSIONS WHERE user_name = \u0026#39;rdsdb\u0026#39;; Next, you need to kill each process id:\nSELECT pg_terminate_backend(\u0026lt;process\u0026gt;) FROM pg_stat_activity WHERE -- Don\u0026#39;t kill your own connection procpid \u0026lt;\u0026gt; pg_backend_pid() -- And don\u0026#39;t kill connections to other databases AND datname = \u0026#39;oldname\u0026#39;; Now you should be able to rename the database:\nALTER DATABASE oldname RENAME TO newname; Reboot the cluster If the above doesn\u0026rsquo;t work and you\u0026rsquo;ve got the time, you can always just reboot the Redshift cluster. 
You can do this by:\n Selecting your cluster in the AWS Console Clicking the \u0026ldquo;Actions\u0026rdquo; dropdown box Selecting \u0026ldquo;Reboot cluster\u0026rdquo; ","date":"2020-01-18T18:06:30-06:00","title":"Renaming a Database in Redshift","uri":"https://www.bobbywlindsey.com/2020/01/18/renaming-database-in-redshift/"},{"categories":["Self Improvement"],"content":"Stoicism is an extremely practical philosophy for how to live a better life. It helps you become more resilient, overcome negative emotions, and provides crucial perspective.\nI\u0026rsquo;ve done my best to try and practice mindfulness-based meditation every day for a while now and I\u0026rsquo;ve found that Stoic practices not only complement mindfulness practice but can sometimes be the preferred choice. Even Sam Harris, a fervent supporter of mindfulness practice, mentions in his Waking Up app that Stoic practices can at times be more effective than mindfulness:\n In many situations, reframing [a stoic practice] is actually more powerful than merely being mindful, even if you can practice mindfulness at a very high level. Because if the way you\u0026rsquo;re thinking about a situation is making you angry, say, the anger will keep coming back every time you get lost in thought. Of course you can be mindful of the anger, and it will dissipate; but it will come back again the moment you\u0026rsquo;re no longer mindful. But if you can find a fundamentally different way of thinking about the situation that you\u0026rsquo;re in, one that actually makes you happy, or at least not angry, you\u0026rsquo;ve solved your internal problem in a much more comprehensive way. And this is really what Stoicism is good for.\n With that, let\u0026rsquo;s get into some of the ways you can practice Stoicism.\nJournal A myriad of notable people in history have journaled - the Stoics were no different. Try to keep a journal where you write about your setbacks, their source, significance, and your response. 
A general overview of your day can be helpful as well.\nReframe situations Choose not to be harmed and you won’t feel harmed. Don’t feel harmed and you haven’t been. ~ Marcus Aurelius, Meditations\n We\u0026rsquo;ve all had some sort of setback whether it\u0026rsquo;s forgetting your keys or losing your entire savings in a recession. The Stoics viewed setbacks as a test of resilience and an opportunity to learn something new. In essence, they reframed the situation (i.e. told themselves another story) to be something positive and it\u0026rsquo;s something we can do as well.\nHere are some ways to reframe a situation:\n Find the humor in it Make it an interesting story to tell later See it as a game or a test Make it something you can learn from. As Marcus Aurelius said, The impediment to action advances action. What stands in the way becomes the way.\n Lastly, you can practice prospective retrospection where you imagine that at some point in the future, you will have wished you could have gone back to this very moment. It\u0026rsquo;s a nice way to provide a shift in perspective and a helpful reminder of the bigger picture.\nPractice negative visualization Stoics also practiced Negative Visualization. This is where you imagine what could go wrong or be taken away from you, a bit like a premortem one could say. The algorithm is such:\n Imagine what could go wrong or be taken away from you (these are negative outcomes). You can also imagine the worst that could happen. Be mindful the worst rarely happens. But even if it did, would it really be that bad? You could probably cope with most things that come your way and life would go on. If possible, integrate negative outcomes into your plan by preparing for them For negative outcomes where nothing can be done, you accept that it\u0026rsquo;s something out of your control and move on Practice misfortune A lot of anxiety and fear come from perception and are not based in reality. 
Practicing misfortune is a way to experimentally verify that in most cases, your anxiety or fear about something is misplaced.\nTake Hedonic Adaptation for example; as a contrived example, say you get a big raise at work and so you buy a fancy car that has windows that automatically roll up and you can now afford to get your groceries delivered. Over time, you get used to convenience and it becomes your new norm, so much so that you start to dread the day where you might have to downgrade your car to one where the windows manually roll up or you have to actually drive to the grocery store to get your food. You become a slave to luxury and it starts to create anxiety.\nEnter practicing misfortune. Instead of letting anxiety grow, you decide to trade cars for a week with your friend who has a functional but outdated car - you get to crank that lever to roll up the window. You also use the old car to go grocery shopping where you have to spend time finding the things you need. All the while you ask yourself, \u0026ldquo;is this what I feared?\u0026quot;. More often than not, it wasn\u0026rsquo;t as bad as you thought it\u0026rsquo;d be and you\u0026rsquo;ve thickened your hide against a perceived unfortunate turn of events. You know you\u0026rsquo;ll be just fine.\nHere are some ways to practice misfortune:\n Go on more \u0026ldquo;adventures\u0026rdquo; where setbacks are more likely to occur then use those setbacks as tests Create a setback for yourself like practicing poverty. You can do things like: Missing a day of food Not having a car to drive Not being able to use your phone for an entire day Wearing your worst clothes For setbacks that would potentially be irreversible, you can instead simulate the setback in your head and respond appropriately Everything is ephemeral Alexander the Great and his mule driver both died and the same thing happened to both. ~ Marcus Aurelius\n Remember that everything comes and goes; death is the only constant. 
Which leads me to the last reflection\u0026hellip;\nMemento mori Translated as \u0026ldquo;remember that you must die\u0026rdquo;. It\u0026rsquo;s a meditation on your own mortality and offers perspective (e.g. not worrying about the small things) which can influence your decisions and responses to what life throws your way.\nSo the algorithm is simple for this one: every now and then, remind yourself you\u0026rsquo;re going to die.\n","date":"2020-01-07T10:36:38-06:00","title":"The Stoic Practice Handbook","uri":"https://www.bobbywlindsey.com/2020/01/07/stoic-practice-handbook/"},{"categories":["Data Science"],"content":"Sometimes when a hypothesis doesn’t yield a statistically significant result, there is temptation to tweak the hypothesis a bit or consider different selections of data. But this is a trap called p-hacking and it will increase the chance of false positives.\nHere\u0026rsquo;s how it happens. You\u0026rsquo;re performing a hypothesis test and if you\u0026rsquo;re like most publications, you\u0026rsquo;re looking for a statistically significant result with only a 5% likelihood (i.e. a p-value \u0026lt; 5%) that the result was just due to chance. Because your hypothesis didn\u0026rsquo;t work out, you do a post hoc analysis and determine you were considering a wrong variable or two. Thus, you tweak your hypothesis and run the statistical test again. You continue to do this over and over until you achieve significant results as demonstrated in this xkcd comic.\nHowever, every time you attempted another hypothesis test, you subsequently increased the chance of getting a false positive. To see this in action, let\u0026rsquo;s say you repeated your hypothesis test 21 times (like the comic). With only one test, your chance of a false positive was just 5% (since we require a p-value of 5%). 
But by the time your 21st iteration came around, this rate had increased to 66%!\nLet\u0026rsquo;s verify this experimentally with a bit of code:\nimport random num_of_1s = 0 num_of_experiments = 100000 for i in range(num_of_experiments): # 21 hypothesis tests each with 1/20 chance of a fluke result sequence = [random.randint(1, 20) for _ in range(21)] if 1 in sequence: num_of_1s += 1 # show percentage of experiments with at least one fluke result print(f\u0026#39;{num_of_1s / num_of_experiments * 100:.3f}%\u0026#39;) 65.973% In the code above, we made sure that each hypothesis test performed has a 1/20 chance of having an erroneous result by using random.randint(1, 20). We then kept doing hypothesis tests for a total of 21 times. And since we\u0026rsquo;re interested in the approximate probability of at least one false positive, we ran this scenario many times (100,000 times in this case) for a final result of 65.97%. Of course, if you shy away from approximations and fancy a little more rigor, we can use some basic probability theory to confirm our results.\nSo, the probability of not having a false positive in a single test is 19/20, which means the probability of not having a false positive in 21 independent trials is $\\left(\\frac{19}{20}\\right)^{21}$, which is approximately equal to 0.34. Finally, we can determine the probability of having at least one false positive in 21 trials by subtracting 0.34 from 1 giving us 0.66.\nBut what if you\u0026rsquo;re working for a genomics company where there\u0026rsquo;s a need to perform multiple hypothesis tests without a strong basis for expecting the result to be statistically significant? Well, there are ways to control the increase in false positives, a couple of which I\u0026rsquo;ve written about here. However, you should keep in mind that each method will have its strengths and weaknesses.\nIn the end, you need to be careful when performing hypothesis tests and drawing conclusions. 
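To get a feel for how quickly this grows, here is a minimal sketch that evaluates the closed-form probability of at least one false positive for a few test counts. It assumes, like the simulation, that the tests are independent and that each one has a 5% false-positive rate; the helper name `prob_at_least_one_false_positive` is made up for this illustration.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance of one or more false positives across independent tests."""
    # (1 - alpha) is the chance that a single test avoids a false positive
    return 1 - (1 - alpha) ** num_tests


# Compare a single test with repeated testing (21 matches the comic)
for num_tests in (1, 5, 21, 50):
    rate = prob_at_least_one_false_positive(num_tests)
    print(f'{num_tests:>2} tests: {rate:.1%}')
```

The 21-test row should land close to the simulated figure, and the rows after it show the rate continuing to climb.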
And remember that the more hypothesis tests you perform, the more likely it\u0026rsquo;ll be that you get a false positive.\n","date":"2019-12-28T09:30:30-06:00","title":"How P-Hacking Increases False Positives","uri":"https://www.bobbywlindsey.com/2019/12/28/how-p-hacking-increases-false-positives/"},{"categories":["Dev"],"content":"Just passed the AWS Certified Solutions Architect exam today with a score of 895 and thought I\u0026rsquo;d share how I prepared for those who are looking to tackle the exam as well.\nI started with little cloud experience. I had only created a super small EC2 instance as part of a project I did for grad school awhile back, but other than that I had zero knowledge of what I was doing.\nA Cloud Guru was my first pass at the exam\u0026rsquo;s content. A Cloud Guru was really helpful to understand the core concepts that the exam expected you to know, but was lacking in the finer details and breadth of the AWS services.\nAs such, it was nowhere near enough to prepare me for the certification exam so I started to look around for practice exams that gave me a good idea on what topics I needed to study more. That\u0026rsquo;s when a good friend pointed me toward a set of practice exams created by Jon Bonso and team.\nThese practice exams were great. They weren\u0026rsquo;t a brain dump of exam questions, but they helped me to identify weaknesses I had and where I needed to study more. After every practice exam I took, I would review both the questions I got right and the questions I got wrong.\nFor the questions I got right, I would make sure the mental model of how I got there was correct. 
This was easily verified since each question in the practice exams has a thorough explanation of why a particular answer was right or wrong.\nFor the questions I got wrong, I would review the explanation and additionally look up the FAQs and developer docs in AWS to learn more about the service or services related to the question.\nOverall, I did almost two complete passes through the Jon Bonso practice exams. Below were my scores:\nI should also mention that I\u0026rsquo;m a big fan of active recall. So as I was going through the A Cloud Guru videos and practice exams, I created flash cards in Anki to help me remember the details of particular services, as well as give me quick scenarios that forced my mind to think about the pros and cons of a particular architecture. This helped a lot. In the end, I created a total of 307 cards.\nLastly, I made sure to actually try to implement what I learned. I ended up creating a script that automatically trained a machine learning model on some data, then deployed that model via a CloudFormation template that automatically spun up a new endpoint in API Gateway with Lambda executing the predictive model. It was a good exercise in putting together all the services, roles, and policies involved in doing such a project.\nOverall I think the exam was at the right level of difficulty and what I did above really helped prepare me for it.\n","date":"2019-12-12T19:43:58-06:00","title":"How I Passed the AWS CSAA Exam","uri":"https://www.bobbywlindsey.com/2019/12/12/how-i-passed-the-aws-csaa-exam/"},{"categories":["Data Science"],"content":"Let\u0026rsquo;s say you collect some data from some distribution. As you might know, each distribution is just a function with some inputs. If you change the value of these inputs, the outputs will change (which you can clearly see if you plot the distribution with various sets of inputs).\nIt so happens that the data you collected were outputs from a distribution with a specific set of inputs. 
The goal of Maximum Likelihood Estimation (MLE) is to estimate which input values produced your data. It\u0026rsquo;s a bit like reverse engineering where your data came from.\nIn reality, you don\u0026rsquo;t actually sample data to estimate the parameter but rather solve for it theoretically; each parameter of the distribution will have its own function which will be the estimated value for the parameter.\nHow it\u0026rsquo;s done First, assume the distribution of your data. For example, if you\u0026rsquo;re watching YouTube and tracking which videos have a clickbaity title and which don\u0026rsquo;t, you might assume a Binomial distribution.\nNext, \u0026ldquo;sample\u0026rdquo; data from this distribution whose inputs you still don\u0026rsquo;t know. Remember, you\u0026rsquo;re solving this theoretically so don\u0026rsquo;t need to actually get data as the values of your sample data won\u0026rsquo;t matter in the following derivation.\nNow ask, what is the likelihood of getting the sample you got? Well, the likelihood would be the probability of getting your sample. And assuming each sample is independent from each other, we can define the likelihood function as:\n$$ \\begin{aligned} L(\\theta_0, \\theta_1, \u0026hellip;; \\text{sample 1}, \\text{sample 2}, \u0026hellip;) \u0026amp;= \\P(X_1 = \\text{sample 1}, X_2 = \\text{sample 2}, \u0026hellip;) \\\\\\\n\u0026amp;= \\product{}{} \\pmf(X_i) \\end{aligned} $$\nNow that you have your likelihood function, you want to find the value of the distribution\u0026rsquo;s parameter that maximizes the likelihood. It might help to think about the problem like this:\nIf you\u0026rsquo;re familiar with calculus, finding the maximum of a function involves differentiating it and setting it equal to zero. And if you actually differentiate the log of the function, it\u0026rsquo;ll make differentiation easier and you\u0026rsquo;ll get the same maximum.\nOnce you differentiate the log likelihood, just solve for the parameter. 
If you\u0026rsquo;re looking at the Bernoulli, Binomial, or Poisson distributions, you\u0026rsquo;ll only have one parameter to solve for. A Gaussian distribution will have two, etc\u0026hellip;\nModeling YouTube views Say you started a YouTube channel about a year ago. You\u0026rsquo;ve done quite well so far and have collected some data. You want to know the probability of at least $x$ visitors to your channel given some time period. The obvious choice in distributions is the Poisson distribution which depends only on one parameter, $\\lambda$, which is the average number of occurrences per interval. We want to estimate this parameter using Maximum Likelihood Estimation.\nWe start with the likelihood function for the Poisson distribution:\n$$ L(\\lambda; x_1, \u0026hellip;, x_n) = \\product{i=1}{n} \\frac{e^{-\\lambda} \\lambda^{x_i}}{x_i!} $$\nNow take its log:\n$$ \\begin{aligned} \\ln\\bigg(\\product{i=1}{n} \\frac{e^{-\\lambda} \\lambda^{x_i}}{x_i!}\\bigg) \u0026amp;= \\summation{i=1}{n} \\ln\\bigg(\\frac{e^{-\\lambda} \\lambda^{x_i}}{x_i!}\\bigg) \\\\\\\n\u0026amp;= \\summation{i=1}{n} [\\ln(e^{- \\lambda}) - \\ln(x_i!) + \\ln(\\lambda^{x_i})] \\\\\\\n\u0026amp;= \\summation{i=1}{n} [- \\lambda - \\ln(x_i!) + x_i \\ln(\\lambda)] \\\\\\\n\u0026amp;= -n\\lambda - \\summation{i=1}{n} \\ln(x_i!) + \\ln(\\lambda) \\summation{i=1}{n} x_i \\end{aligned} $$\nThen differentiate it and set the whole thing equal to zero:\n$$ \\begin{aligned} \\frac{d}{d\\lambda} \\bigg(-n\\lambda - \\summation{i=1}{n} \\ln(x_i!) + \\ln(\\lambda) \\summation{i=1}{n} x_i \\bigg) \u0026amp;= 0 \\\\\\\n-n + \\frac{1}{\\lambda} \\summation{i=1}{n} x_i \u0026amp;= 0 \\\\\\\n\\lambda \u0026amp;= \\frac{1}{n} \\summation{i=1}{n} x_i \\end{aligned} $$\nNow that you have a function for $\\lambda$, just plug in your data and you\u0026rsquo;ll get an actual value. 
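Here is a minimal sketch of that plug-in step. The daily view counts are invented for illustration (they are not real channel data), and `poisson_log_likelihood` is a hypothetical helper name; it simply evaluates the log likelihood from the derivation so you can check that the sample mean sits at the maximum.

```python
import math

# Hypothetical daily view counts, invented for this example
views = [12, 15, 9, 11, 14, 10, 13]


def poisson_log_likelihood(lam, samples):
    """Log likelihood: -n*lam - sum(ln(x_i!)) + ln(lam) * sum(x_i)."""
    n = len(samples)
    # math.lgamma(x + 1) equals ln(x!) for non-negative integers
    log_factorials = sum(math.lgamma(x + 1) for x in samples)
    return -n * lam - log_factorials + math.log(lam) * sum(samples)


# The maximum likelihood estimate is just the sample mean
lam_hat = sum(views) / len(views)
print(f'Estimated lambda: {lam_hat}')

# Nudging lambda in either direction should lower the log likelihood
for lam in (lam_hat - 1, lam_hat, lam_hat + 1):
    value = poisson_log_likelihood(lam, views)
    print(f'lambda = {lam:.1f}, log likelihood = {value:.4f}')
```

The middle row should have the largest log likelihood, which is what the calculus promised.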
You can then use this value of $\\lambda$ as input to the Poisson distribution in order to model your viewership over an interval of time. Cool, huh?\nMaximizing the positive is the same as minimizing the negative Now mathematically, maximizing the log likelihood is the same as minimizing the negative log likelihood. We can show this with a derivation similar to the one above:\nTake the negative log likelihood:\n$$ \\begin{aligned} -\\ln\\bigg(\\product{i=1}{n} \\frac{e^{-\\lambda} \\lambda^{x_i}}{x_i!}\\bigg) \u0026amp;= - \\summation{i=1}{n} \\ln\\bigg(\\frac{e^{-\\lambda} \\lambda^{x_i}}{x_i!}\\bigg) \\\\\\\n\u0026amp;= - \\summation{i=1}{n} [\\ln(e^{- \\lambda}) - \\ln(x_i!) + \\ln(\\lambda^{x_i})] \\\\\\\n\u0026amp;= - \\summation{i=1}{n} [- \\lambda - \\ln(x_i!) + x_i \\ln(\\lambda)] \\\\\\\n\u0026amp;= n\\lambda + \\summation{i=1}{n} \\ln(x_i!) - \\ln(\\lambda) \\summation{i=1}{n} x_i \\end{aligned} $$\nThen differentiate it and set the whole thing equal to zero:\n$$ \\begin{aligned} \\frac{d}{d\\lambda} \\bigg(n\\lambda + \\summation{i=1}{n} \\ln(x_i!) - \\ln(\\lambda) \\summation{i=1}{n} x_i \\bigg) \u0026amp;= 0 \\\\\\\nn - \\frac{1}{\\lambda} \\summation{i=1}{n} x_i \u0026amp;= 0 \\\\\\\n\\lambda \u0026amp;= \\frac{1}{n} \\summation{i=1}{n} x_i \\end{aligned} $$\nNow whether you maximize the log likelihood or minimize the negative log likelihood is up to you. But generally you\u0026rsquo;ll find maximization of the log likelihood more common.\nConclusion Now you know how to use Maximum Likelihood Estimation! To recap, you just need to:\n Find the log likelihood Differentiate it Set the result equal to zero Then solve for your parameter ","date":"2019-11-06T09:10:58-06:00","title":"Understanding Maximum Likelihood Estimation","uri":"https://www.bobbywlindsey.com/2019/11/06/understanding-maximum-likelihood-estimation/"},{"categories":["Data Science"],"content":"The perceptron model is a binary classifier whose classifications are based on a linear model. 
So, if your data is linearly separable, this model will find the hyperplane that separates it. The model works as such:\nEssentially, for a given sample, you multiply each feature by its own weight and sum everything up - $\\summation{j=1}{n} w_jx_j$. Then take this sum and apply the activation function. This will be your prediction. But which activation function do you use?\nActivation functions Binary classification labels generally appear as ${-1, 1}$ or ${0, 1}$. The activation function used in the perceptron model will depend on which set of binary labels you choose. If you choose ${0, 1}$, you will need to use the Heaviside step function as your activation function since it takes any real number and outputs either a 0 or a 1. Otherwise, you will use the sign function.\ndef sign(x): if x \u0026gt; 0: return 1.0 elif x \u0026lt; 0: return -1.0 else: return 0.0 def step(x): if x \u0026gt;= 0: return 1.0 else: return 0.0 Get the data Say your binary labels are ${0, 1}$. The perceptron model prediction will be $\\step\\bigg(\\summation{j=1}{n} w_jx_j\\bigg)$, producing either a 0 or 1. 
Let\u0026rsquo;s take a look at a quick example with some data kindly pulled from Jason Brownlee\u0026rsquo;s blog Machine Learning Master.\nimport pandas as pd import numpy as np data = [[2.7810836,2.550537003,0], [1.465489372,2.362125076,0], [3.396561688,4.400293529,0], [1.38807019,1.850220317,0], [3.06407232,3.005305973,0], [7.627531214,2.759262235,1], [5.332441248,2.088626775,1], [6.922596716,1.77106367,1], [8.675418651,-0.242068655,1], [7.673756466,3.508563011,1]] pd.DataFrame(data = data, index=list(range(len(data))), columns=[\u0026#39;X_1\u0026#39;, \u0026#39;X_2\u0026#39;, \u0026#39;label\u0026#39;]) Let\u0026rsquo;s split the dataframe up into training data and labels.\ntrain_data = pd.DataFrame(data = data, index=list(range(len(data))), columns=[\u0026#39;X_1\u0026#39;, \u0026#39;X_2\u0026#39;, \u0026#39;label\u0026#39;]) # Don\u0026#39;t forget to add a vector of 1s for the bias weight train_data[\u0026#39;bias\u0026#39;] = np.repeat(1.0, train_data.shape[0]) train_label = train_data.label train_data = train_data[[each for each in train_data.columns if each != \u0026#39;label\u0026#39;]] train_data.head() Perceptron inference To get a prediction from the perceptron model, you need to implement $\\step\\bigg(\\summation{j=1}{n} w_jx_j\\bigg)$. 
Recall that the vectorized equivalent of $\\step\\bigg(\\summation{j=1}{n} w_jx_j\\bigg)$ is just $\\step(w \\cdot x)$, the dot product of the weights vector $w$ and the features vector $x$.\ndef activation(x, function_name): if function_name == \u0026#39;step\u0026#39;: return step(x) elif function_name == \u0026#39;sign\u0026#39;: return sign(x) else: raise NotImplementedError def initialize_weights(num_columns): return np.zeros(num_columns) def predict(vector, weights, activation_function): linear_sum = np.dot(weights, vector) output = activation(linear_sum, activation_function) return output So you have $x$, which represents one sample where $x_i$ is some feature for the sample (like has_scales or has_fur if you\u0026rsquo;re trying to predict mammals vs. reptiles). But where do you get the weight $w_i$? This is what the perceptron model needs to learn from your labeled samples. At the start, you don\u0026rsquo;t know what these values should be, so you can just let them be all zeros.\nTry predicting the first sample:\nweights = initialize_weights(train_data.shape[1]) vector = train_data.loc[0].values label = train_label.loc[0] prediction = predict(vector, weights, \u0026#39;step\u0026#39;) error = label - prediction print(f\u0026#39;Prediction: {prediction}, Label: {label}, Error: {error}\u0026#39;) Prediction: 1.0, Label: 0, Error: -1.0 Unsurprisingly, the model doesn\u0026rsquo;t do such a great job. Let\u0026rsquo;s see if you can come up with a way for the perceptron model to learn what weights it needs in order to output the expected label.\nTeaching the perceptron to learn To begin, you need to specify a loss function which tells you how bad your model is doing. The lower the loss, the better. For our example, you can use the sum of squared errors as the loss function, $\\summation{i=1}{m} (y_i - \\yhat_i)^2$, where $\\yhat_i$ is the perceptron model\u0026rsquo;s prediction and $y$ is what the prediction should have been (i.e. the label). 
This function simply determines the squared distance between the prediction and the true value and sums all these distances up.\nAs with most modern machine learning methods, you might now be tempted to use gradient descent whereby you take the gradient of the loss function, $\\summation{i=1}{m} (y_i - \\yhat_i)^2$, with respect to the weights and use that gradient to update the weights in a direction that minimizes loss. The resulting gradient would be $-2 \\summation{i=1}{m} (y_i - \\yhat_i) \\frac{\\partial \\step(w \\cdot x_i)}{\\partial w}$, but do you see the problem here? The derivative of the step function is 0 everywhere except at $x = 0$ where it\u0026rsquo;s undefined. This would force the entire gradient to be 0 and the weights would never be updated. The perceptron model would never learn. The same problem also plagues the sign function.\nSo how do you update the weights? Well, it turns out that if your data is linearly separable, then the following weight-update rule is guaranteed (via a convergence theorem proof) to converge to a set of weights in a finite number of steps that will linearly separate the data into two different classes. This update rule is defined as:\n$$ w = w + y \\cdot x $$\nand only applied if $x$ was misclassified by the perceptron model.\nBut this update rule was derived under the assumption that the binary labels were ${-1, 1}$, not ${0, 1}$. If your labels were ${-1, 1}$, then $y$ in the update rule would be either -1 or 1, thus changing the direction in which the weights update.\nBut since your binary labels are ${0, 1}$, this presents a problem since $y$ could be 0. 
This would mean that if an $x$ got misclassified and its true value was 0, then $w = w + 0 \\cdot x = w$ and the weights would never be updated.\nFortunately, you can account for this by amending the update rule that still guarantees convergence but is suited to both ${-1, 1}$ and ${0, 1}$ as the binary labels:\n$$ w = w + (y - \\yhat) \\cdot x $$\nNote that if your binary labels are ${0, 1}$, $(y - \\yhat)$ is 0 if the perceptron model predicted correctly (thereby leaving the weights unchanged) and 1 or -1 if predicted incorrectly (which will ensure that the weights are updated in the right direction). If your binary labels are ${-1, 1}$, $(y - \\yhat)$ is 0 if the perceptron model predicted correctly and 2 or -2 if predicted incorrectly. This amended weight-update rule ensures the correct directional change no matter which set of binary labels you choose.\nNow that you know how to update the weights, try taking a sample, predicting its label, then updating the weights. Repeat this for each sample you have.\nweights = initialize_weights(train_data.shape[1]) sum_of_squared_errors = 0.0 for sample in range(train_data.shape[0]): vector = train_data.loc[sample].values label = train_label.loc[sample] prediction = predict(vector, weights, \u0026#39;step\u0026#39;) error = label - prediction sum_of_squared_errors += error**2 weights = weights + error * vector print(f\u0026#39;SSE: {sum_of_squared_errors}\u0026#39;) print(f\u0026#39;Weights: {weights}\u0026#39;) SSE: 2.0 Weights: [4.84644761, 0.20872523, 0.] Looks like the perceptron model didn\u0026rsquo;t find weights to perfectly separate the two classes yet. How about you give it more time to learn by taking multiple passes through the data. 
Let\u0026rsquo;s try 3 passes.\nweights = initialize_weights(train_data.shape[1]) for epoch in range(3): sum_of_squared_errors = 0.0 for sample in range(train_data.shape[0]): vector = train_data.loc[sample].values label = train_label.loc[sample] prediction = predict(vector, weights, \u0026#39;step\u0026#39;) error = label - prediction sum_of_squared_errors += error**2 weights = weights + error * vector print(f\u0026#39;SSE: {sum_of_squared_errors}\u0026#39;) print(f\u0026#39;Weights: {weights}\u0026#39;) SSE: 2.0 SSE: 1.0 SSE: 0.0 Weights: [ 2.06536401, -2.34181177, -1.] Nice! The sum of squared errors is zero which means the perceptron model doesn\u0026rsquo;t make any errors in separating the data.\nPerceptron applied to different binary labels Now say your binary labels are ${-1, 1}$. Using the same data above (replacing 0 with -1 for the label), you can apply the same perceptron algorithm. This time, you\u0026rsquo;ll see that $w = w + y \\cdot x$ and $w = w + (y - \\yhat) \\cdot x$ both find a set of weights to separate the data correctly (even if the weights are different).\nHere\u0026rsquo;s the perceptron model with $w = w + y \\cdot x$ as the update rule:\nweights = initialize_weights(train_data.shape[1]) for epoch in range(3): sum_of_squared_errors = 0.0 for sample in range(train_data.shape[0]): vector = train_data.loc[sample].values label = train_label.loc[sample] if label == 0: label = -1 prediction = predict(vector, weights, \u0026#39;sign\u0026#39;) error = label - prediction sum_of_squared_errors += error**2 if error != 0: weights = weights + label * vector print(f\u0026#39;SSE: {sum_of_squared_errors}\u0026#39;) print(f\u0026#39;Weights: {weights}\u0026#39;) SSE: 5.0 SSE: 4.0 SSE: 0.0 Weights: [ 2.06536401, -2.34181177, -1.] 
And now with $w = w + (y - \\yhat) \\cdot x$ as the update rule:\nweights = initialize_weights(train_data.shape[1]) for epoch in range(3): sum_of_squared_errors = 0.0 for sample in range(train_data.shape[0]): vector = train_data.loc[sample].values label = train_label.loc[sample] if label == 0: label = -1 prediction = predict(vector, weights, \u0026#39;sign\u0026#39;) error = label - prediction sum_of_squared_errors += error**2 weights = weights + error * vector print(f\u0026#39;SSE: {sum_of_squared_errors}\u0026#39;) print(f\u0026#39;Weights: {weights}\u0026#39;) SSE: 5.0 SSE: 8.0 SSE: 0.0 Weights: [ 3.98083288, -6.85733669, -3.] Both converged!\nWrapping up In this post, you\u0026rsquo;ve learned what a perceptron model is, what kind of data it can be applied to, the mathematics behind the model and how it learns, along with implementing all your findings in Python!\nOf course in a realistic setting, you\u0026rsquo;ll want to cross-validate your model and preferably use a tried-and-true implementation of the model available in libraries like scikit-learn. But I hope this peek under the hood of the perceptron model has been helpful and if you have any questions, please feel free to reach out to me at bobbywlindsey.com or follow me on Medium or Twitter.\n","date":"2019-10-06T20:41:46-05:00","title":"Understanding the Perceptron","uri":"https://www.bobbywlindsey.com/2019/10/06/understanding-the-perceptron/"},{"categories":["Book Reviews"],"content":"Conscious, as the title suggests, explores some questions about consciousness. These questions include:\n How do we know other people or things are conscious? Might everything in the universe be conscious to some degree? Do we have free will? How do we know other people or things are conscious? Annaka defines consciousness as experience in its most basic form. But what external evidence is there that we can rely upon to determine if other people or things are conscious? 
Consider patients undergoing surgery who are actually aware of the procedure while under anesthesia. All evidence points to them being unconscious when they\u0026rsquo;re really not.\nOr consider other patients who appear comatose yet are experiencing locked-in syndrome, completely aware of events happening around them.\nWhat about complex behavior in plants and animals? Are they conscious too? How do we know they\u0026rsquo;re not having some sort of conscious experience even if what they experience might be less complex than that of a human?\nAnnaka reminds us of David Chalmers\u0026rsquo; Philosophical Zombie thought experiment where we imagine everyone else doesn\u0026rsquo;t have consciousness but still behaves in the same ways and says the same things. This thought experiment reminds me of the HBO series Westworld where the limits of artificial intelligence are pushed in some distant future. When the guests in the park interact with what seem like real people, why does one not assume the AI is experiencing anything? If the guests didn\u0026rsquo;t know the attractions in the park were \u0026ldquo;not real\u0026rdquo;, would they be able to tell what\u0026rsquo;s real from what isn\u0026rsquo;t? Would they assume consciousness in the android standing in front of them? 
Annaka neatly summarizes the issue:\n The problem is that both conscious and nonconscious states seem to be compatible with any behavior, even those associated with emotion, so a behavior itself doesn\u0026rsquo;t necessarily signal the presence of consciousness\u0026hellip;when we trick ourselves into imagining that people lack consciousness, we can begin to wonder if we\u0026rsquo;re in fact tricking ourselves all the time when we deem other living systems - climbing ivy, say, or stinging sea anemones - to be without it.\n Annaka remarks that perhaps the only evidence of consciousness we have is the fact that we can think about the experience of consciousness.\nMight everything in the universe be conscious to some degree? She also dedicates a portion of the book exploring panpsychism, a hypothesis stating that all matter may have consciousness. There\u0026rsquo;s a lot of debate around this idea and Annaka stresses that although it might not be true, it shouldn\u0026rsquo;t be thrown out just yet. The more prevalent hypothesis today is that consciousness emerges from some collection of matter but Annaka argues that this hypothesis describes matter from the outside and tells us nothing about what matter is like in the inside. She says:\n Calling consciousness an emergent phenomenon doesn’t actually explain anything, because to the observer, matter is behaving as it always does. If some matter has experience and some doesn’t (and some emergent phenomena entail experience and some don’t), the concept of emergence as it is traditionally used in science simply doesn’t explain consciousness.\n Do we have free will? Along with Sam Harris, Annaka argues that free will is an illusion. She provides examples of parasites and bacteria influencing their hosts in ways that almost undoubtedly remove any notion of free will.\n It’s hard to see how our behavior, preferences, and even choices could be under the control of our conscious will in any real sense. 
It seems much more accurate to say that consciousness is along for the ride - watching the show, rather than creating or controlling it.\n She also argues that genetics, your immediate environment, and your life history all have an impact on the decision you thought you made:\n It seems clear that we can’t decide what to think or feel, any more than we can decide what to see or hear. A highly complicated convergence of factors and past events—including our genes, our personal life history, our immediate environment, and the state of our brain - is responsible for each next thought.\n Of course, she also mentions the experiments conducted by Benjamin Libet where researchers used an EEG to detect when a subject was going to move half a second before the subject feels they make the decision to move.\nLastly, Annaka brings up the split-brain experiments where patients who suffer from epilepsy undergo a corpus callosotomy. Essentially, the connection between the right and left hemispheres of the brain is severed and results in what appears to be two different entities each having its own will and intentions and occasionally sabotaging each other. In this scenario, what does free will mean?\nShould you read it? I think so! Annaka\u0026rsquo;s clear writing makes the topic very enjoyable to read about and extremely thought-provoking.\n","date":"2019-07-11T13:51:51-05:00","title":"Conscious","uri":"https://www.bobbywlindsey.com/2019/07/11/conscious/"},{"categories":["Math"],"content":"I remember a feeling of utter confusion when I first learned vector spaces in my first course of Linear Algebra. What\u0026rsquo;s a space? And what are these vectors, really? Lists? Functions? Kittens? And how the hell does that relate to all this data that I analyze for some scientific endeavor? 
In this article, I attempt to explain the topic of vectors, vector spaces, and how they relate to data in a way that my former self would have appreciated.\nThe utility of the vector Imagine a two-by-two graph as you might have seen in school. A vector is an object that\u0026rsquo;s typically seen in the form of an ordered list of elements, like (4, 3), that lives in a vector space and is commonly represented as an arrow whose tail starts at the origin, (0, 0), and ends at some other point, like (4, 3).\nThis arrow has a couple properties that are interesting to note: it has a direction and a length. A vector can represent many real-world objects like the wind, the throwing of a football, and the velocity of your car. It can even represent features of an animal, the quantities of your grocery list, or a row in your spreadsheet of data.\nA space for your vectors But how do you know if your object or ordered list is a vector? Well, it must belong to a set called a vector space. A vector space can be any set of elements (this could be lists, functions, or other objects) but these elements must follow some rules:\n Adding any two elements in the set results in another element that’s already in the set Multiplying any elements in the set by some number (called a scalar) results in another element that’s already in the set All the elements are associative, commutative, and scalars are distributive with respect to element addition There\u0026rsquo;s an element in the set such that adding it to any other element doesn\u0026rsquo;t change its value There\u0026rsquo;s some number (called a scalar) such that multiplying it by any other element doesn\u0026rsquo;t change the element\u0026rsquo;s value Any element in the set has some element that can be added to it which results in an element of 0s (called the zero vector) If all the elements in your set follow the rules above, then congratulations! 
Your elements are called vectors and the set they belong to is called a vector space. But break any one of these rules, and your set is not a vector space and the elements inside your set aren\u0026rsquo;t vectors.\nIn mathematical jargon, these rules translate to the following:\n All elements are closed under addition All elements are closed under scalar multiplication All elements are associative, commutative, and scalars are distributive with respect to element addition An additive identity exists for every element A multiplicative identity exists for every element An additive inverse exists for every element A tangible vector space Consider the set you already know and love, $\\R^2$. $\\R^2$ is just $\\R \\times \\R$ which is the set of all possible two-dimensional lists represented by the following set notation $\\{(a, b) : a, b \\in \\R\\}$. You can visualize $\\R^2$ below (obviously the x- and y-axis limits do not stop at ten but instead extend to infinity):\nIs the set $\\R^2$ a vector space? Well if it were, it would need to follow all the rules of a vector space that were mentioned above. Consider the first rule: adding any two elements in the set results in another element that’s already in the set. Does $\\R^2$ follow this rule? Well, for example, take any two elements in $\\R^2$, like (2, 3) and (-1, -4). When you add them together, do you get an element that\u0026rsquo;s also in $\\R^2$? Yes! In fact, you get (1, -1) which does indeed live in the set $\\R^2$. This is true for any two elements you add in $\\R^2$ - you\u0026rsquo;ll always get something back that also belongs to $\\R^2$.\nSo you\u0026rsquo;ve verified the first rule! But in order for $\\R^2$ to be called a vector space, you must verify that it follows all the rules, which it does.\nIn short, there are many sets that follow the rules above so naturally you give them a name. 
And now when someone talks about some arbitrary set, you know that if the elements of the set follow the rules of a vector space, then the set must be a vector space.\nFields Another note to make is about these scalars with which you can multiply vectors. Scalars are elements that belong to a set called a field which has the same rules as a vector space, but with the bonus of having a multiplicative inverse as well.\nSince a field is a set that has all the rules a vector space has, then a field is also a vector space. You’ve actually been using fields all along, like the set of real numbers or the set of complex numbers. Any time you draw a plot on a two-dimensional grid, you’re actually drawing on a two-dimensional field.\nSubspaces In practice, you might think that it’s tedious to check every rule against a set to see if the set is a vector space. Well, you’re right. So that’s why it’s far easier to identify a set as a subset of a vector space you already know (like the set of real numbers) and prove that this subset is nonempty and that it’s closed under the same operations as the vector space (i.e. under addition and scalar multiplication).\nIf you’re able to do this, then your subset is called a subspace which also happens to be a vector space in and of itself (again, to see for yourself, take a subspace and verify that it has all the properties of a vector space).\nSo if you want to prove that a set is a vector space, try to prove that it\u0026rsquo;s a subspace instead. Since subspaces are vector spaces in their own right, you\u0026rsquo;ll have successfully shown that a set is a vector space.\nReal-world data Now all this theory is not very helpful if you can\u0026rsquo;t apply it. 
So consider the first few rows of the classic Iris dataset, which is a dataset containing samples of three different species of the Iris flower.\n Sepal Length Sepal Width Petal Length Petal Width Species 5.1 3.5 1.4 0.2 Iris-setosa 4.9 3 1.4 0.2 Iris-setosa 7 3.2 4.7 1.4 Iris-versicolor 6.9 3.1 4.9 1.5 Iris-versicolor 6.3 3.3 6 2.5 Iris-virginica 7.1 3 5.9 2.1 Iris-virginica As you can see from the header, the features of each sample are:\n sepal length sepal width petal length petal width Each row of features can be viewed as an ordered list. The first list would be (5.1, 3.5, 1.4, 0.2), the second (4.9 , 3, 1.4, 0.2), and so on. But are these ordered lists vectors? Well, each entry in the lists is a real number and thus belongs to the set of real numbers, $\\R$. And since each of these ordered lists are four-dimensional, then they live in $\\R^4$. Since $\\R^4$ is a vector space, then these ordered lists can be called vectors.\nNotice that each entry in these vectors represents a dimension in $\\R^4$ where each dimension corresponds to a feature in the dataset (like sepal length, sepal width, etc\u0026hellip;). That is, each feature in your data can be considered a random variable. And since they\u0026rsquo;re random variables, we can do some descriptive statistics like find their means and standard deviations.\nAs you\u0026rsquo;ve already represented each row in the table as a vector, you can calculate the means and standard deviations of each random variable in one fell swoop. This is the power of vectors.\nPretend that the rows/vectors (5.1, 3.5, 1.4, 0.2) and (4.9 , 3, 1.4, 0.2) were all the data you had in the table above. 
Using two of the vector space rules, scalar multiplication and addition, we can easily calculate the means of every random variable we have.\n$$ \\frac{1}{2} \\begin{bmatrix} 5.1 \\\\\\\n3.5 \\\\\\\n1.4 \\\\\\\n0.2 \\end{bmatrix}+ \\frac{1}{2}\\begin{bmatrix} 4.9 \\\\\\\n3 \\\\\\\n1.4 \\\\\\\n0.2 \\end{bmatrix}= \\begin{bmatrix} 5 \\\\\\\n3.25 \\\\\\\n1.4 \\\\\\\n0.2 \\end{bmatrix} $$\nSo the mean of sepal length is 5, the mean of sepal width is 3.25, and so on. Representing your data as a set of vectors is not just aesthetically pleasing to look at, but also more performant in calculations since computers are optimized for computations involving vectors (turns out you can replace a lot of for loops with vector and matrix operations instead).\nConclusion In the end, because you can represent your data as vectors which belong to a set with special rules called a vector space, many of the awesome things you do with your data, like linear feature transformations, standardizations, and dimensionality-reduction techniques, can be justified by the rules of vector spaces.\nAnd not only are vector spaces foundational to any work involving data, they constitute the bedrock of linear algebra which is central to almost all areas of mathematics. Even if the phenomenon you\u0026rsquo;re studying is nonlinear, linear algebra is the go-to tool to use as a first-order approximation.\nIn short, vectors are tightly intertwined with your data and are very useful! So next time you\u0026rsquo;re importing data into a dataframe and performing a bunch of operations, remember that your computer is treating your data as a set of vectors and is happily applying transformations in a performant way. 
And all thanks to these little guys called vectors.\n","date":"2019-03-05T05:34:54Z","title":"A Practical Look at Vectors and Your Data","uri":"https://www.bobbywlindsey.com/2019/03/05/practical-look-at-vectors-and-your-data/"},{"categories":["Data Science"],"content":"Doing good science is hard and a lot of experiments fail. Although the scientific method helps to reduce uncertainty and lead to discoveries, its path is full of potholes. In this post, you’ll learn about common p-value misinterpretations, p-hacking, and the problem with performing multiple hypothesis tests. Of course, not only are the problems presented, but their potential solutions as well. By the end of the post, you should have a good idea of some of the pitfalls of hypothesis testing, how to avoid them, and an appreciation for why doing good science is so hard.\nP-Value misinterpretations There are many ways to misinterpret a p-value. By definition, a p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming the null hypothesis is true.\nWhat the p-value is not:\n A measure of the size of the effect or the strength of the evidence The chance that the intervention is effective A statement that the null hypothesis is true or false A statement that the alternative hypothesis is true or false If you want to measure the strength of the evidence or size of effect, then you need to calculate the effect size. This can be done with Pearson’s r correlation, standardized difference of means, or other methods. Reporting the effect size in your research is suggested since p-values will tell you the likelihood that experimental results differ from chance expectations but not the relative magnitude of the experimental treatment or the size of the experimental effect.\nP-values also don’t tell you the chance that the intervention is effective but calculating precision does, and base rates influence this calculation. 
If the base rate for the intervention is low, this opens the door to many opportunities for false positives even if a hypothesis test shows a statistically significant result. For example, if the precision works out to 65%, then a statistically significant result has only a 65% chance of reflecting a truly effective intervention, leaving a false discovery rate of 35%. Neglecting the impact of base rates is known as the base rate fallacy and happens more often than you think.\nLastly, p-values also can’t tell you whether a hypothesis is true or false. Statistics is an inferential framework and there’s no way to know for sure if some hypothesis is true or not. Remember, there’s no such thing as proof in science.\nThe p-hacking problem As a scientist, one of your degrees of freedom when setting up a hypothesis test is deciding which variables to include in the data you test. Your hypothesis will, to a degree, influence which variables you might include in the data and after testing the hypothesis with those variables, you might get a p-value greater than 5%.\nAt this point, you might be tempted to try different variables in your data and retest. But if you try enough combinations of variables and test each scenario, you’re likely to get a p-value of 5% or less as demonstrated by this app in this FiveThirtyEight blog post. This practice is called p-hacking, and it can allow you to achieve a p-value of 5% or less under competing alternative hypotheses.\nThere are at least a few problems with this:\n Since you can get a statistically significant p-value under competing alternative hypotheses as a result of the data you choose to include in testing, p-hacking doesn’t help you get closer to the truth of the thing you’re studying. Even worse, if such results are published and the research makes its way into conventional wisdom, it’ll be difficult to remove. As the number of hypothesis tests performed increases, the rate of false positives (i.e. 
erroneously calling a null finding significant) increases. You might be falling victim to confirmation bias by ignoring the results of other hypothesis tests performed and only considering results of the tests that align with your beliefs. Since many journals require a p-value of 5% or less for publication, it creates an incentive for you to p-hack your way to this 5% threshold creating not only an ethical dilemma but also lower quality research. Addressing p-hacking To help mitigate p-hacking, you should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. If you performed multiple hypothesis tests without a strong basis for expecting the result to be statistically significant, as can happen in genomics where genotypes for millions of genetic markers can be measured and tested, you should verify that there was some sort of control for the family-wise error rate or false discovery rate (as discussed in the next section). Otherwise, the study might not be meaningful.\nIt might also be a good idea to report the power of the hypothesis test. That is, report 1 - the probability of not rejecting the null hypothesis when it’s false. Keep in mind that power can be influenced by the sample size, significance level, variability in the dataset, and if the true parameter is far from the parameter assumed by the null hypothesis. In short, the greater the sample size, the greater the power. The greater the significance level, the greater the power. The lower the variability in the dataset, the greater the power. And the further away the true parameter is from the parameter assumed by the null hypothesis, the greater the power.\nTesting multiple hypotheses with the Bonferroni Correction Since the probability of false positives increases as the number of hypothesis tests performed increases, it is necessary to try and control this. 
As such, you might want to control the probability of one or more false positives out of all hypothesis tests conducted. This is sometimes called the family-wise error rate.\nOne way to control for this is to set the significance level to $\\alpha/n$ where $n$ is the number of hypothesis tests. This kind of correction is called the Bonferroni correction and ensures that the family-wise error rate is less than or equal to $\\alpha$.\nHowever, this correction can be too strict especially if you’re performing many hypothesis tests. The reason is that, because you’re controlling for the family-wise error rate, you might also be missing some true positives that existed at a higher significance level. Clearly there’s a balance to be struck between increasing the power of the hypothesis test (i.e. increasing the probability of rejecting the null hypothesis when the alternative hypothesis is true) and controlling for false positives.\nTesting multiple hypotheses with the Benjamini-Hochberg procedure Instead of trying to control for the family-wise error rate, you can try to control for the false discovery rate, which is the proportion of hypothesis tests identified as statistically significant that are actually false positives. In other words, the false discovery rate is equal to FP/(FP + TP).\nControlling for the false discovery rate should help you identify as many hypothesis tests with statistically significant results as possible, but still try to keep a relatively low proportion of false positives. Like $\\alpha$ which controls the false positive rate, we similarly use another significance level, $\\beta$, which controls the false discovery rate.\nThe procedure you can use to control the false discovery rate is called the Benjamini-Hochberg procedure. You first choose a $\\beta$, the significance level for the false discovery rate. 
Then calculate the p-values for all null hypothesis tests performed and sort from lowest to highest with $i$ being the index of the p-value in the list. Now find the largest index, $k$, such that the p-value at index $k$ is less than or equal to $\\frac{k}{m} \\beta$ where $m$ is the number of null hypothesis tests performed. All null hypothesis tests with p-value index $i \\leq k$ are considered statistically significant by the Benjamini-Hochberg procedure.\nConclusion As you can see, doing good science does not just involve performing a null hypothesis test and publishing your findings when you get a p-value less than or equal to 5%. There are ways to misinterpret p-values, to tweak data to get the right p-value for a hypothesis that you’re convinced of, and to perform enough tests with different samples of data until you get the desired p-value.\nBut now that you’re aware of the potholes and are armed with some ways to avoid them, I hope it helps you improve the quality of your research and gets you closer to the truth.\nReferences 5 Tips For Avoiding p-Value Potholes Definition of Power The American Statistical Association’s Statement on p-Values An Investigation of the False Discovery Rate and the Misinterpretation of p-values False Positive Rate vs False Discovery Rate Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Introduction to Power in Significance Tests - Khan Academy Sensitivity and Specificity - Wikipedia The p-Value and the Base Rate Fallacy — Statistics Done Wrong How to Calculate Effect Sizes from Published Research: A simplified Methodology Multiple Comparisons Problem - Wikipedia Photo by Steve Johnson on Unsplash ","date":"2019-02-25T22:32:33Z","title":"Why Doing Good Science is Hard and How to Do it Better","uri":"https://www.bobbywlindsey.com/2019/02/25/good-science-is-hard/"},{"categories":["Data Science"],"content":"Hypothesis testing is the bedrock of the scientific method and by implication, 
scientific progress. It allows you to investigate a thing you’re interested in and tells you how surprised you should be about the results. It’s the detective that tells you whether you should continue investigating your theory or divert efforts elsewhere. Does that diet pill you’re taking actually work? How much sleep do you really need? Does that HR-mandated team-building exercise really help strengthen your relationship with your coworkers?\nSocial media and the news saturates us with “studies show this” and “studies show that” but how do you know if any of those studies are valid? And what does it even mean for them to be valid? Although studies can definitely be affected by the data collection process, the majority of this article is going to focus on the actual hypothesis test itself and why being familiar with its process will arm you with the necessary set of skills to perform replicable, reliable, and actionable tests that drive you closer to the truth - and to call bullshit on a study.\nIn any hypothesis test, you have a default hypothesis (the null hypothesis) and the theory you’re curious about (the alternative hypothesis). The null hypothesis is the hypothesis that whatever intervention/theory you’re studying has no effect. For example, if you’re testing whether a drug is effective, the null hypothesis would state that the drug has no effect while the alternative hypothesis would posit that it does. Or maybe you’d like to know if a redesign of your company’s website actually made a difference in sales — the null hypothesis is that the redesign had no effect on sales and the alternative hypothesis is that it did.\nHypothesis testing is a bit like playing devil’s advocate with a friend, but instead of just trolling, you both go out and collect data, run repeatable tests, and determine which of you is more likely to be right. 
In essence, having a null hypothesis ensures that the data you’re studying is not only consistent with whatever theory you have, but also inconsistent with the negation of your theory (i.e. the null hypothesis).\nHow a hypothesis test works Once identifying your null and alternative hypotheses, you need to run the test. Skipping over a bunch of math formulas, it goes something like this:\n Perform an experiment (this is where you collect your data). Assume that the null hypothesis is true and let the p-value be the probability of getting the results at least as extreme as the ones you got. If the p-value is quite small (i.e. \u0026lt; 5%), your results are statistically significant which gives you evidence to reject the null hypothesis; otherwise, the null hypothesis can\u0026rsquo;t be ruled out just yet. You might be wondering why a p-value of 5% could mean that your results are statistically significant. Let\u0026rsquo;s say your null hypothesis is that condoms don\u0026rsquo;t have an effect on STD transmission and you assume this to be true. You run your experiment, collect some data, and turns out you get some results that were very unlikely to get (meaning the probability of getting those results was really small). This might cause you to doubt the assumption you made about condoms having no effect. Why? Because you got results that were very rare to get, meaning your results were significant enough to cast doubt on your assumption that the null hypothesis was true.\nJust like with most things in life, you want to minimize your probability of being wrong, including when performing a hypothesis test. 
So consider the ways you could be wrong when interpreting the results of a hypothesis test: you can either reject the null hypothesis when it\u0026rsquo;s actually true (a Type I error), or fail to reject it when it\u0026rsquo;s actually false (a Type II error).\nSince you can\u0026rsquo;t decrease the chance of both types of errors without raising the sample size and you can\u0026rsquo;t control for the Type II error, you require that the probability of a Type I error be 5%, which is a way of requiring that any statistically significant results you get can only have a 5% chance of being a coincidence. It\u0026rsquo;s damage control to make sure you don\u0026rsquo;t make an utter fool of yourself and this restriction leaves you with a 95% confidence when claiming statistically significant results and a 5% margin of error.\nThe statistically significant results in the above condom example could have been a fluke, just coincidence, but it\u0026rsquo;s only 5% likely. Even though there\u0026rsquo;s sound reason for having a small p-value, the actual threshold of 5% happens to be a convention created by Ronald Fisher, considered to be the father of modern statistics. This convention exists so that when a scientist talks about how they achieved statistically significant results, other scientists know that the results in question were significant enough to only be coincidence at most 5% of the time.\nA fuzzy contradiction For the mathematically literate, the null hypothesis test might resemble a fuzzy version of the scaffolding for a proof by contradiction whose steps are as follows:\n Suppose hypothesis, $\\mathrm{H}$, is true. Since $\\mathrm{H}$ is true, some fact, $\\mathrm{F}$, can\u0026rsquo;t be true. But $\\mathrm{F}$ is true. Therefore, $\\mathrm{H}$ is false. Compared to the steps for a hypothesis test:\n Suppose the null hypothesis, $\\mathrm{H_0}$, is true. Since $\\mathrm{H_0}$ is true, it follows that a certain outcome, $\\mathrm{O}$, is very unlikely. 
But $\\mathrm{O}$ was actually observed. Therefore, $\\mathrm{H_0}$ is very unlikely. The difference between the proof by contradiction and the steps involved in performing a hypothesis test? Absolute mathematical certainty versus likelihood. You might be tempted into thinking statistics shares the same certainty enjoyed by mathematics, but it doesn\u0026rsquo;t. Statistics is an inferential framework and as such, it depends on data that might be incomplete or tampered with; not to mention the data could have been derived from an improperly-set-up experiment that left plenty of room for a plethora of confounding variables. Uncertainty abounds in the field and the best answer any statistician can ever give is in terms of likelihood, never certainty.\nActually doing the hypothesis test Now onto the technical details of a hypothesis test. Although statistics classes somehow find a way to muddy the waters, the test itself is fortunately not too complicated. And once you understand these details, you can have a computer do the computations for you. For simplicity, assume you\u0026rsquo;re doing a hypothesis test about the effect of an intervention on a population by looking at its sample mean. But keep in mind that this procedure is very similar for other tests.\nFirst, collect enough data (preferably at least 30 samples) from your population of choice without the intervention and calculate its mean, $\\mu_0$. This is called a sample mean and represents the population mean because the Law of Large Numbers tells you that as you take larger and larger samples, the sample mean approaches the population mean. And since the sample mean is a statistic, it belongs to a particular sampling distribution which the Central Limit Theorem says is approximately the Normal distribution for large enough samples.\nNext, calculate the standard deviation of $\\mu_0$ which is equal to $\\frac{\\sigma}{\\sqrt{n}}$ where $\\sigma$ is the population standard deviation and $n$ is the size of your sample. 
But since you don’t know what $\\sigma$ really is, you can estimate it with the sample standard deviation, $S$, found in the data from the population with the intervention. So, collect over 30 samples from your population with the intervention and calculate its mean, $\\mu$, and sample standard deviation, $S$.\nNow assume the null hypothesis is true and ask yourself, what is the probability of getting $\\mu$? Another way of saying that is, how many standard deviations is $\\mu$ away from $\\mu_0$ and what is the probability of getting a result at least that many standard deviations away from $\\mu_0$? As mentioned earlier, this probability is called the p-value.\nWell, to calculate how many standard deviations $\\mu$ is away from $\\mu_0$, you subtract $\\mu_0$ from $\\mu$ and divide by the standard deviation of $\\mu_0$. The result is called a standard score or Z-score.\n$$\\frac{\\mu - \\mu_0}{S/\\sqrt{n}}$$\nNow what is the probability of getting a standard score that’s at least as extreme as the one you got? This is the same as asking what the probability is of at least $\\frac{\\mu - \\mu_0}{S/\\sqrt{n}}$ deviations from $\\mu_0$. Well, this depends on the form of your alternative hypothesis.\nIf your alternative hypothesis is that $\\mu \\neq \\mu_0$, then the probability of the standard score is just one minus the integral from $-\\frac{\\mu - \\mu_0}{S/\\sqrt{n}}$ to $\\frac{\\mu - \\mu_0}{S/\\sqrt{n}}$ of the probability density function (pdf) for the Normal distribution. For example, if $\\frac{\\mu - \\mu_0}{S/\\sqrt{n}} = 2$, then \u0026ldquo;1 minus the integral above\u0026rdquo; will find the area of the shaded regions under the curve (which is the probability you\u0026rsquo;re looking for):\nIf your alternative hypothesis is $\\mu \u0026gt; \\mu_0$, then the probability of the standard score is the integral from $\\frac{\\mu - \\mu_0}{S/\\sqrt{n}}$ to $\\infty$ of the pdf for the Normal distribution. 
Assuming a standard score of two like in the example above, this equates to trying to find the area in the graph below:\nSimilarly, if your alternative hypothesis is $\\mu \u0026lt; \\mu_0$, then the probability of the standard score is the integral from $-\\infty$ to $\\frac{\\mu - \\mu_0}{S/\\sqrt{n}}$ of the pdf for the Normal distribution. Like in the examples above, this time you\u0026rsquo;re trying to find the following area:\nNow that you’ve found the probability of your results (via the standard score), you can use this probability to decide whether to reject the null hypothesis or not. Say this probability was 3%. So the data you gathered from the population after the intervention was applied had a 3% probability of happening under the assumption that the null hypothesis was true. But it happened anyway! So maybe the null hypothesis wasn’t true after all. You don’t know for sure, but the evidence seems to suggest that you can reject it.\nTuning the microscope As mentioned before, hypothesis testing is a scientific instrument with a degree of precision and as such, you must carefully decide what precision is needed given the experiment. An underpowered hypothesis test would not be powerful enough to detect whatever effect you\u0026rsquo;re trying to observe. It\u0026rsquo;s analogous to using a magnifying glass in order to observe one of your cheek cells. But a magnifying glass is too weak to observe something so small and you might as well have not bothered with the test at all. Typically, a test is underpowered when studying a small population where the difference in the population creates an effect that\u0026rsquo;s just big enough to pass the p-value threshold of 5%.\nIn his book, \u0026ldquo;How Not to Be Wrong\u0026rdquo;, Jordan Ellenberg gives a great example of an underpowered test. 
He mentions a journal article in Psychological Science that found married women in the middle of their ovulatory cycle more likely to vote for the presidential candidate Mitt Romney. A population size of 228 women were polled; of those women polled during their peak fertility period, 40.4% said they\u0026rsquo;d support Romney, while only 23.4% of the other married women who weren\u0026rsquo;t ovulating showed support for Romney. With such a small population size, the difference between the two groups of women was big enough to pass the p-value test and reject the null hypothesis (i.e. ovulation in married women has no effect on supporting Mitt Romney). Ellenberg goes on to say on page 149:\n The difference is too big. Is it really plausible that, among married women who dig Mitt Romney, nearly half spend a large part of each month supporting Barack Obama? Wouldn\u0026rsquo;t anyone notice? If there\u0026rsquo;s really a political swing to the right once ovulation kicks in, it seems likely to be substantially smaller. But the relatively small size of the study means a more realistic assessment of the strength of the effect would have been rejected, paradoxically, by the p-value filter.\n The opposite problem is had with an overpowered study. Say such an overpowered study (i.e. a study with a large population size) is performed and it showed that taking a new blood pressure medication doubled your chance of having a stroke. Now some people might choose to stop taking their blood pressure meds for fear of having a stroke; after all, you\u0026rsquo;re twice as likely. But if the likelihood of having a stroke in the first place was 1 in 8,000, a number very close to zero, then doubling that number, 2 in 8,000, is still really close to zero. Twice a very small number is still a very small number.\nAnd that\u0026rsquo;s the headline - an overpowered study is really sensitive to small effects which might pass as statistically significant but might not even matter. 
What if a patient with heart disease suffered an infarction because he or she decided to stop taking their blood pressure meds after reading the \u0026ldquo;twice as likely to have a stroke\u0026rdquo; headline? The overpowered study took a microscope to observe a golf ball and missed the forest for the trees. Care must be taken when reading or hearing such headlines and questions must be asked. That all being said, in the real world an overpowered study is preferred to an underpowered one. If the test has significant results, you just need to make sure you interpret those results in a practical manner.\nThe importance of replicable research Toward the beginning of this post, you saw that if the null hypothesis were true, you\u0026rsquo;d still have a 5% chance of rejecting it in favor of the alternative. That\u0026rsquo;s why you can only say you\u0026rsquo;re 95% confident in the results you get: when the null hypothesis is true, about 1 out of 20 tests will still come out significant purely due to random chance. Test 20 different jelly beans for a link to acne and it\u0026rsquo;s not surprising that 1 out of 20 shows a link.\nThis should hammer home the importance of replicable research, which entails following the same steps of the research, but with new data. Repeating your research with new data helps to ensure that you\u0026rsquo;re not that one lucky scientist who did the research once and found that green jelly beans had a statistically significant effect on acne.\nClosing remarks Hypothesis testing has been a godsend in scientific investigation. It\u0026rsquo;s allowed researchers to focus their efforts on more promising areas of research and has provided the opportunity to challenge commonly held beliefs and defend against harmful actions. 
Now that you know how to perform a hypothesis test and are aware of the pitfalls, I hope it increases value not only in your profession but also in your personal life.\n","date":"2019-02-19T20:57:00Z","title":"Understanding Hypothesis Testing","uri":"https://www.bobbywlindsey.com/2019/02/19/hypothesis-testing/"},{"categories":["Math","Popular"],"content":"“Prove it” is a phrase thrown around with glee. But rarely are you really challenging a friend to provide a rigorous proof of some statement you disagree with. In science, there is no proof. Instead, the scientific method encourages you to form a hypothesis, collect data, and test that hypothesis. Repeating this process advances you closer and closer to the truth, but nowhere in this process can you say you’ve proven a hypothesis. This would be impossible as you would have to exhaustively collect all data in the known universe to see if your hypothesis still holds and even then, your measurements will always be imprecise. The scientific body of knowledge will always be subject to revision as more data is collected and better measuring instruments are built.\nOn the other hand, mathematics is built from the ground up using a set of simple statements called axioms. These axioms are statements composed of sets, functions, and logical symbols. I’ve always found this a beautiful fact of mathematics; that it’s nothing more than some sets, functions, and logic. Of course this means that now you can build the whole of mathematics and ensure that every new statement is coherent with previous ones. This naturally endows anyone looking to prove a mathematical statement with the ability to:\n prove it with 100% certainty disprove it with 100% certainty prove that it is not provable with 100% certainty (ref. Gödel’s incompleteness theorems) So how do you prove a mathematical statement? 
There are generally four methods:\n Implication Contradiction Contrapositive Induction Proof by implication Proof by implication (also sometimes called proof by direct implication) is a method where you start with one fact, and use that fact to imply other facts. Through this chain of implications, you reach some conclusion. More simply, you’re saying “if A, then B”. This is commonly written as:\n$$ A \\then B $$\nTo drive this home, say you’re trying to prove that if $m$ and $n$ are perfect squares, then $mn$ is a perfect square.\nYou start with the fact you know: $m$ and $n$ are perfect squares. What does this mean? Well, $m = x^2$ and $n = y^2$ where $x$ and $y$ are some integers. Using that fact, you can show that $mn = x^2y^2 = (xy)^2$, thus showing that $mn$ is also a perfect square.\nIn practice, proof by implication is a common first approach to proving some statement. By working forward from A and backward from B, you eventually bridge the gap and create a chain of implications.\nProof by contradiction But sometimes that gap can’t be bridged. In these cases, it might be helpful to consider proving “if A, then B” by reaching a contradiction. The reasoning is as follows: if you’re trying to prove “if A, then B” and this statement is indeed true, then there must be a reason why B cannot be false. Your goal is to find that reason by seeking a contradiction. To do this, you assume A is true and (not B) is true, then work forward by implication. The hope is that you reach some contradiction and if you do, then you’ve successfully proven “if A, then B”.\nYou should try this with an example. Let $n$ be an integer. If $n^2$ is even, then $n$ is even. If you were to try and prove this using implication, you might say that since $n^2$ is even, then there exists some integer $m$ such that $n^2 = 2m$. And to show that $n$ is even, you need to find some integer $k$ such that $n = 2k$. But how do you get $\\sqrt{2m}$ to look like $2k$? 
This is where you can use the contradiction method.\nSuppose $n$ is an integer and $n^2$ is even and $n$ is odd. Then there exists some integer $k$ such that $n = 2k + 1$. This would imply that\n$$ \\begin{aligned} n^2 \u0026amp;= (2k + 1)^2 \\\\\\\n\u0026amp;= 4k^2 + 4k + 1 \\\\\\\n\u0026amp;= 2(2k^2 + 2k) + 1 \\\\\\\n\\end{aligned} $$\nIf you let $z = 2k^2 + 2k$ then $n^2 = 2z + 1$, an odd number. But wait! This is a contradiction since you had assumed $n^2$ was even. So “if $n^2$ is even, then $n$ is even” must be true.\nA proof by contradiction isn’t the only method you can try if implication doesn’t work. Sometimes you can just turn implication on its head.\nProof by contrapositive You’re still trying to prove “if A, then B” but it’s not working out. You might instead try “if not B, then not A”. In fact, this statement is the logical equivalent to “if A, then B” and is called its contrapositive. If you’re not familiar with why it’s logically equivalent, it happens to be a result of boolean logic which is usually represented as a truth table. However, consider the example statement “if it’s raining outside, then the grass is wet”. The contrapositive of this statement would be “if the grass is not wet, then it’s not raining outside”. Both statements are logically equivalent to each other. And once the statement you’re trying to prove is in its contrapositive form, you can then use implication to complete the proof.\nTake the statement we recently proved via contradiction:\n Let $n$ be an integer. If $n^2$ is even, then $n$ is even.\n Now take the contrapositive of this statement:\n Let $n$ be an integer. If $n$ is odd, then $n^2$ is odd.\n If $n$ is odd, then there is some integer $k$ such that $n = 2k + 1$. So by implication,\n$$ \\begin{aligned} n^2 \u0026amp;= (2k + 1)^2 \\\\\\\n\u0026amp;= 4k^2 + 4k + 1 \\\\\\\n\u0026amp;= 2(2k^2 + 2k) + 1 \\\\\\\n\\end{aligned} $$\nIf you let $z = 2k^2 + 2k$ then $n^2 = 2z + 1$ which is an odd number. 
And the proof is complete.\nAnother way to think about a proof by contrapositive is that you suppose A and (not B) are both true and work forward to prove that A is false (i.e. not A). In essence, you’re really doing a proof by contradiction but this time you know what contradiction you’re looking for.\nProof by induction A proof by induction is a bit like falling dominos. The first domino falls. Then when any domino falls, the next one falls. Proof by induction is similar. You first start by proving the base case, $n = 1$. Then you assume the statement is true for $k$ and show that it’s also true for $k + 1$. Once you’ve done that, then you’ve officially proven the statement for all $n$.\nIt’s now mathematical lore that in the late 1700s while in elementary school, Carl Friedrich Gauss was given busy work by his math teacher. He was told to sum the numbers from 1 to 100. After some clever observations, he found that he could easily calculate the total with the following formula, $\\frac{n(n + 1)}{2}$. This is what you’ll prove by induction, that the sum of $n$ positive integers is equal to the function $\\frac{n(n + 1)}{2}$.\nFirst you prove the base case where $n = 1$. Does the sum of 1 equal $f(n) = \\frac{n(n + 1)}{2}$? Plugging in 1 for $n$, it most assuredly does. So the base case is proven. Now assume $f(k) = \\frac{k(k + 1)}{2}$ to be true. You want to show that $f(k + 1)$ is also true. Well $f(k + 1)$ is just the sum of the first $k + 1$ positive integers\n$$ 1 + 2 + \\dots + k + (k + 1) $$\nWell this is equal to the sum of the first $k$ positive integers plus $k + 1$, giving you $\\frac{k(k + 1)}{2} + (k + 1)$. If you use some basic algebra to apply a common denominator and factor out some terms, you’ll end up with $\\frac{(k + 1)((k + 1) + 1)}{2}$. Taking a closer look at that expression, you’ll see that it resembles our initial function, $\\frac{n(n + 1)}{2}$. 
And now the proof is complete!\nTo wrap your head around what you just did, you assumed the formula was true for $k$ and proved it was also true for $k+1$. Since you proved it was true for $f(1)$ (our base case), then you know it’s true for $f(2)$. And since you know it\u0026rsquo;s true for $f(2)$, then it\u0026rsquo;s true for $f(3)$, and so on… So by induction, you’ve proven the formula must be true for all positive integers $n$ (i.e. $f(n) = \\frac{n(n + 1)}{2}$ is true).\nLogical symbols in statements With the above proof methods, you’re ready to tackle most proofs out there. However, in some statements you might see phrases (or their logical symbol equivalents) like “there exists”, “for all”, or “unique”. Statements that contain these phrases are still tackled with the four proof methods, but there are some helpful guidelines for addressing each one. For simplicity, assume you’re still thinking about the statement, “if A, then B”.\nExistential crisis If you see that B has the phrase “there exists”, do the following:\n Suppose A is true Guess or construct the object having the certain property and show that the something happens Conclude the desired object exists Me too If you see that B has the term “for all” or “for every”, do the following:\n Suppose A is true Choose an object with the certain property Use implication to conclude that the something happens The only one If you see that B has the word “unique”, do the following:\n Suppose A is true and that two such objects exist Use implication to conclude that the two objects are equal thus showing there’s really only one This or that If B has the form “C or D”:\n Suppose A is true and (not C) is true Use implication to prove that D is true Or you could also:\n Suppose A is true and (not D) is true Use implication to prove that C is true Conclusion By making iron-clad logical deductions, you now know how to really prove something. Hedging your bets is no longer required. 
If you want to learn more about how to do proofs, Daniel Solow has a really good book called How to Read and Do Proofs. Jeremy Kun also has some good primers on the methods of proof available at his blog.\n","date":"2019-02-13T16:24:23Z","title":"How to Really Prove Something","uri":"https://www.bobbywlindsey.com/2019/02/13/how-to-really-prove-something/"},{"categories":["Math"],"content":"The Central Limit Theorem in the wild On many occasions, you\u0026rsquo;re trying to ask questions about the average of some population of unknown distribution in order to find a confidence interval or test a hypothesis. Imagine you\u0026rsquo;ve taken some relatively large IID random sample, $X_1, \\dots, X_n$, from an unknown distribution but you know that distribution has a finite positive variance and finite mean.\nSo what can you say about the average, $\\bar{X}$, of $X_1, \\dots, X_n$? Well, $\\bar{X}$ is a random variable itself and like any other random variable there\u0026rsquo;s a distribution that describes it. Because $\\bar{X}$ is a random variable that\u0026rsquo;s a statistic, you call the distribution that describes it a sampling distribution. So imagine you take a random sample and calculate its mean, $\\bar{X}$, and you repeat this process over and over so that you have a set of means, $\\bar{X}_1, \\dots, \\bar{X}_n$.\nNotice what happens when you take the value of these means and plot their percentage frequencies; an interesting shape starts to form which you\u0026rsquo;ll quickly identify to be reminiscent of a normal distribution. And if you do this experiment again but with even larger random sample sizes, you\u0026rsquo;ll notice that the plot of percentage frequencies looks even more like a normal distribution. 
Setosa has a great visual example of this experiment in action.\nThe expected value and standard deviation of the sample mean Although you haven\u0026rsquo;t proved that the sampling distribution for $\\bar{X}$ approaches a normal distribution, you\u0026rsquo;ve noticed it experimentally. What other questions can you ask about the statistic $\\bar{X}$? What is its expected value? Well, you know that $\\bar{X} = \\frac{X_1 + \\dots + X_n}{n}$. So it follows that\n$$ \\begin{aligned} \\E(\\bar{X}) \u0026amp;= \\E\\bigg(\\frac{X_1 + \\dots + X_n}{n}\\bigg) \\\\\\\n\u0026amp;= \\frac{1}{n} \\E(X_1 + \\dots + X_n) \\\\\\\n\u0026amp;= \\frac{1}{n} \\big(\\E(X_1) + \\dots + \\E(X_n)\\big) \\\\\\\n\u0026amp;= \\frac{1}{n} (\\mu + \\dots + \\mu) \\\\\\\n\u0026amp;= \\frac{1}{n} \\cdot n \\mu \\\\\\\n\u0026amp;= \\mu \\end{aligned} $$\nNotice the significance of this result; the expected value of the mean from a random sample is equal to the population mean that the random sample comes from.\nCan you deduce the standard deviation of $\\bar{X}$?\n$$ \\begin{aligned} \\var(\\bar{X}) \u0026amp;= \\var\\bigg(\\frac{X_1 + \\dots + X_n}{n}\\bigg) \\\\\\\n\u0026amp;= \\bigg(\\frac{1}{n}\\bigg)^2 \\var(X_1 + \\dots + X_n) \\\\\\\n\u0026amp;= \\frac{1}{n^2} (\\var(X_1) + \\dots + \\var(X_n)) \\\\\\\n\u0026amp;= \\frac{1}{n^2} (\\sigma^2 + \\dots + \\sigma^2) \\\\\\\n\u0026amp;= \\frac{1}{n^2} \\cdot n \\sigma^2 \\\\\\\n\u0026amp;= \\frac{\\sigma^2}{n} \\end{aligned} $$\nSince the variance is $\\frac{\\sigma^2}{n}$, the standard deviation is $\\frac{\\sigma}{\\sqrt{n}}$.\nForming the Central Limit Theorem To recap, you\u0026rsquo;ve observed that the distribution of $\\bar{X}$ seems to be normal and have proved that $\\E(\\bar{X}) = \\mu$ and $\\text{std}(\\bar{X}) = \\frac{\\sigma}{\\sqrt{n}}$. 
You know that an easier way to work with a normally distributed random variable, $X$, is to convert it to its standard normal form (also sometimes called the standard score or z-score), $\\frac{X - \\mu}{\\sigma}$, which follows the standard normal distribution, $N(0, 1)$. This standard score is a random variable that indicates how many standard deviations a random variable, $X$, is from the mean. Since you suspect the distribution of $\\bar{X}$ to be normal, then you might infer its standard normal form to be $\\frac{\\bar{X} - \\mu}{\\sigma / \\sqrt{n}}$ since $\\mu = \\E(\\bar{X})$ and $\\frac{\\sigma}{\\sqrt{n}}$ is the standard deviation of $\\bar{X}$.\nNow all you need to do is prove that the distribution of this standardized $\\bar{X}$ approaches a standard normal distribution, $N(0, 1)$, as the sample size approaches infinity. Mathematically you want to prove:\n$$ \\frac{\\bar{X} - \\mu}{\\sigma / \\sqrt{n}} = \\frac{\\summation{i=1}{n} X_i - n \\mu}{\\sqrt{n} \\sigma} \\approaches N(0, 1) \\text{ as } n \\approaches \\infinity $$\nThe relationship above is famously known as the Central Limit Theorem and indeed has been proven utilizing moment generating functions and a bit of calculus. Fortunately for you, we won\u0026rsquo;t go through that proof here, but you should relish the fact that your experimental results hold up against theoretical rigor.\nAt the end of the day, what does the Central Limit Theorem mean for you? Well, it allows you to use probabilistic and statistical methods that you already know work for the normal distribution for problems involving other types of distributions. That random sample you took at the beginning of this article could have been from any distribution with a finite mean and a finite positive variance. Yet, for a large enough sample, the distribution of its mean is approximately normal. This theorem is indispensable because, among many reasons, it allows you to create confidence intervals for means and perform hypothesis tests - topics that will be covered in later posts. 
But until then, I hope this guide has been a sufficient introduction into the Central Limit Theorem.\n","date":"2019-01-01T09:30:30-06:00","title":"Understanding the Central Limit Theorem","uri":"https://www.bobbywlindsey.com/2019/01/01/the-central-limit-theorem/"},{"categories":["Data Science","Popular"],"content":"Docker is hot in the developer world and although data scientists aren\u0026rsquo;t strictly software developers, Docker has some very useful features for everything from data exploration and modeling to deployment. And since major services like AWS support Docker containers, it\u0026rsquo;s even easier to implement Continuous Integration Continuous Delivery with Docker. In this post, I\u0026rsquo;ll show you how to use Docker as a data scientist.\nWhat is Docker? It\u0026rsquo;s a software container platform that provides an isolated container for us to have everything we need for our experiments to run. Essentially, it\u0026rsquo;s a light-weight VM that\u0026rsquo;s built from a script that can be version controlled; so we can now version control our data science environment! Developers use Docker when collaborating on code with coworkers and they also use it to build agile software delivery pipelines to ship new features faster. 
Any of this sound familiar?\nDocker terminology I have a mathematics background, so I can\u0026rsquo;t avoid definitions.\nContainers: very small user-level virtualization that helps you build, install, and run your code\nImages: a snapshot of your container\nDockerfile: a plain-text script of build instructions that\u0026rsquo;s used to build your image; this is what we can version control\nDockerhub: GitHub for your Docker images; you can set up Dockerhub to automatically build an image anytime you update your Dockerfile in GitHub\nWhy Docker is so awesome for data science Ever heard these comments from your coworkers?\n “Not sure why it’s not working on your computer, it’s working on mine.” “It’s a pain to install everything from scratch for Linux, Windows, and MacOS, and trying to build the same environment for each OS.” “Can’t install the package that you used, can you help me out?” “I need more compute power. I could use AWS but it’ll take so long just to install all those packages and configure settings just like I have it on my machine.” For the most part, these concerns are easily resolved by Docker. The exception at the moment of posting is GPU support for Docker images, which only runs on Linux machines. 
Other than that, you\u0026rsquo;re golden.\nDocker for Python and Jupyter Notebook Take a look at this Dockerfile.\n# reference: https://hub.docker.com/_/ubuntu/\nFROM ubuntu:16.04\n\n# Adds metadata to the image as a key value pair example LABEL version=\u0026#34;1.0\u0026#34;\nLABEL maintainer=\u0026#34;Bobby Lindsey \u0026lt;some_email@domain.com\u0026gt;\u0026#34;\n\n# Set environment variables\nENV LANG=C.UTF-8 LC_ALL=C.UTF-8\n\n# Create empty directory to attach volume\nRUN mkdir ~/GitProjects\n\n# Install Ubuntu packages\nRUN apt-get update \u0026amp;\u0026amp; apt-get install -y \\\n    wget \\\n    bzip2 \\\n    ca-certificates \\\n    build-essential \\\n    curl \\\n    git-core \\\n    htop \\\n    pkg-config \\\n    unzip \\\n    unrar \\\n    tree \\\n    freetds-dev\n\n# Clean up\nRUN apt-get clean \u0026amp;\u0026amp; rm -rf /var/lib/apt/lists/*\n\n# Install Jupyter config\nRUN mkdir ~/.ssh \u0026amp;\u0026amp; touch ~/.ssh/known_hosts\nRUN ssh-keygen -F github.com || ssh-keyscan github.com \u0026gt;\u0026gt; ~/.ssh/known_hosts\nRUN git clone https://github.com/bobbywlindsey/dotfiles.git\nRUN mkdir ~/.jupyter\nRUN cp /dotfiles/jupyter_configs/jupyter_notebook_config.py ~/.jupyter/\nRUN rm -rf /dotfiles\n\n# Install Anaconda\nRUN echo \u0026#39;export PATH=/opt/conda/bin:$PATH\u0026#39; \u0026gt; /etc/profile.d/conda.sh\nRUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh -O ~/anaconda.sh\nRUN /bin/bash ~/anaconda.sh -b -p /opt/conda\nRUN rm ~/anaconda.sh\n\n# Set path to conda\nENV PATH /opt/conda/bin:$PATH\n\n# Update Anaconda\nRUN conda update conda \u0026amp;\u0026amp; conda update anaconda \u0026amp;\u0026amp; conda update --all\n\n# Install Jupyter theme\nRUN pip install msgpack jupyterthemes\nRUN jt -t grade3\n\n# Install other Python packages\nRUN conda install pymssql\nRUN pip install SQLAlchemy \\\n    missingno \\\n    json_tricks \\\n    bcolz \\\n    gensim \\\n    elasticsearch \\\n    psycopg2-binary\n\n# Configure access to Jupyter\nWORKDIR /root/GitProjects\nEXPOSE 8888\nCMD jupyter lab --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token=\u0026#39;data-science\u0026#39;\nIf 
you\u0026rsquo;ve ever installed packages in Ubuntu, this should look very familiar. In short, this Dockerfile is a script to automatically build and set up a light-weight version of Ubuntu with all the necessary Ubuntu packages and Python libraries needed to do [my] data science exploration with Jupyter Notebooks. The best part is that this will run the same way whether I\u0026rsquo;m on MacOS, Linux, or Windows - no need to code separate install scripts and third-party tools to have the same environment in each operating system.\nTo build a Docker image from this Dockerfile, all you need to do is execute\ndocker build -t bobbywlindsey/docker-data-science .\nin the command line and Bob\u0026rsquo;s your uncle. To run the image, you have two options - you can either run the image interactively (which means you\u0026rsquo;ll see the output of your Jupyter Notebook server in real time) or in detached mode (where you can drop into the container\u0026rsquo;s terminal and play around).\nTo run the image interactively on Linux, execute\ndocker run -it -v ~/GitProjects:/root/GitProjects --network=host -i bobbywlindsey/docker-data-science\nOtherwise,\ndocker run -it -v ~/GitProjects:/root/GitProjects -p 8888:8888 -i bobbywlindsey/docker-data-science\nTo run the image in detached mode for Linux:\ndocker run -d --name data-science -v ~/GitProjects:/root/GitProjects --network=host -i bobbywlindsey/docker-data-science\ndocker exec -it data-science bash\nor for MacOS and Windows:\ndocker run -d --name data-science -v ~/GitProjects:/root/GitProjects -p 8888:8888 -i bobbywlindsey/docker-data-science\ndocker exec -it data-science bash\nNot too bad! I realize those run commands might be a bit much to type up, so there are a couple of options that I see. 
You can either alias those commands or you can use a docker-compose file instead.\nUsing multiple containers I won\u0026rsquo;t go into Docker Compose much here, but as an example, this is a docker-compose file I have to run a Docker image used for a Jekyll blog:\nversion: \u0026#39;3\u0026#39;\nservices:\n  site:\n    environment:\n      - JEKYLL_ENV=docker\n    image: bobbywlindsey/docker-jekyll\n    volumes:\n      - ~/Dropbox/me/career/website-and-blog/bobbywlindsey:/root/bobbywlindsey\n    ports:\n      - 4000:4000\n      - 35729:35729\nWith that file, your run command then becomes:\ndocker-compose run --service-ports site\nBut Docker Compose is much more capable than just using it as a substitute for aliasing your run commands. Your docker-compose file can configure multiple images and by using a single command, you create and start all your services at once. For example, let\u0026rsquo;s say you build one Docker image to preprocess your data, another to model the data, and another to deploy your model as an API. You can use docker-compose to manage each image\u0026rsquo;s configurations and run them with a single command.\nConclusion Even though Docker might require a learning curve for some data scientists, I believe it\u0026rsquo;s well worth the effort and it doesn\u0026rsquo;t hurt to brush up those DevOps skills. 
Have you used Docker for your data science efforts?\n","date":"2018-07-16T18:00:45Z","title":"Docker for Data Scientists","uri":"https://www.bobbywlindsey.com/2018/07/16/docker-for-data-scientists/"},{"categories":["Dev"],"content":"Below is a cheatsheet I reference of the keybindings I\u0026rsquo;ve set in my neovim dotfiles.\nSurrounding text If you\u0026rsquo;re using vim-surround:\n ysiw' : surround inner word with ' cs'\u0026quot; : change surrounding single quotes with double quotes yss) : surround line with parentheses dst : delete surrounding tags ds) : delete surrounding parentheses If you\u0026rsquo;re using vim-sandwich:\n saiwtem : surround inner word with \u0026lt;em\u0026gt; saiw\u0026quot; : surround the inner word with double quotes sd\u0026quot; : delete surrounding double quote sdt : delete surrounding tag sr\u0026quot;' : change surrounding double quote to single quote srttstrong : change surrounding tag to \u0026lt;strong\u0026gt; tag sai3w] or v3esa] : put brackets around next 3 words Here\u0026rsquo;s a good article about vim-sandwich.\nBasic text manipulation \u0026lt;Leader\u0026gt;/ : comment/uncomment code ciw : \u0026ldquo;change inner word\u0026rdquo;; use instead of \u0026ldquo;change word\u0026rdquo; cit : change text inside tags \u0026lt;C-u\u0026gt; : upper case word \u0026lt;C-l\u0026gt; : lower case word YY : copy to clipboard \u0026lt;Leader\u0026gt;a = : align text by = Navigating file system \u0026lt;C-n\u0026gt; : toggle NERDTree to navigate filesystem and open files \u0026lt;C-f\u0026gt; : fuzzy search files in your open file\u0026rsquo;s directory Managing buffers \u0026lt;Tab\u0026gt; : go to next buffer Managing tabs and windows \u0026lt;C-w\u0026gt;s : split window horizontally \u0026lt;C-w\u0026gt;v : split window vertically \u0026lt;C-h\u0026gt; : switch to window on the left \u0026lt;C-l\u0026gt; : switch to window on the right \u0026lt;C-k\u0026gt; : switch to window above \u0026lt;C-j\u0026gt; : switch to window below 
Advanced selections : -8,-6co. : copy relative line numbers -8 through -6 to where the cursor is at Editing modes \u0026lt;Leader\u0026gt;z : toggle zenroom (removes all distractions and centers text while editing markdown and text files) ","date":"2018-07-15T05:19:29Z","title":"Vim Cheat Sheet","uri":"https://www.bobbywlindsey.com/2018/07/15/vim-cheatsheet/"},{"categories":["Book Reviews"],"content":"This book decides to focus on four barriers to decision making:\n We define our choices too narrowly We develop a belief about a situation, then seek out information that supports it (i.e. confirmation bias) We have short-term emotions We are overconfident The authors develop a framework for addressing each of the barriers called WRAP:\n Widen your options Reality-test your assumptions Attain distance before deciding Prepare to be wrong Widen your options The phrase \u0026ldquo;whether or not\u0026rdquo; should set off a mental alarm.\nConsider the opportunity cost of making such a decision. What else could you do with the same time and money?\nPerform a Vanishing Options test. That is, imagine you can\u0026rsquo;t choose any of the options you\u0026rsquo;re considering, what else could you do? Necessity is the mother of invention.\nCultivate multiple options at the same time or find someone who\u0026rsquo;s solved your problem.\nReality-test your assumptions Consider the opposite of your assumption or position in a situation. Ask yourself what would have to be true for every option to be the very best choice. Ask disconfirming questions.\nZoom out and zoom in. Factor in statistics, base rates, and averages. If you can\u0026rsquo;t find numbers, consult an expert. Get on the ground floor with the troops.\nPerform small experiments to gather more information. This will help test your assumptions.\nAttain distance before deciding Perform a 10/10/10 analysis. 
That is, ask yourself how you\u0026rsquo;d feel about a particular decision 10 minutes from now, 10 months from now, and 10 years from now.\nWhat would you tell your best friend to do in this situation?\nNote that agonizing decisions are often a sign of conflict among your core priorities.\nPrepare to be wrong Do a premortem - consider you failed, how did it happen?\nPerform a Failure Mode and Effect Analysis (FMEA). This analysis is where team members identify what could go wrong at every step of their plans and for each potential failure they ask “how likely is it?” and “how severe would the consequences be?”. After assigning a score from 1 to 10 for each question, they multiply the two numbers to get a total. The highest totals get the most attention.\nConsider you succeeded and there’s a parade in your honor. How do you ensure that you’re ready for it? Assume you’re being overconfident and give yourself a healthy margin of error.\nUse prospective hindsight in order to work backward from a certain future which will make you better at generating explanations for why the event might happen.\nSet a tripwire like a deadline or partition (like budgeting x dollars for some project). This has the effect of putting an upper bound on risk.\nFinal remarks The authors closed with a few remarks about how procedural justice is critical in determining how people feel about a decision and that stating back the other side\u0026rsquo;s position better than they could have stated it is a valuable skill to develop. As Charlie Munger put it, \u0026ldquo;I’m not entitled to have an opinion on this subject unless I can state the arguments against my position better than the people do who are supporting it. 
I think that only when I reach that stage am I qualified to speak.\u0026rdquo;\n","date":"2018-07-12T17:48:47Z","title":"Decisive","uri":"https://www.bobbywlindsey.com/2018/07/12/decisive/"},{"categories":["Data Science"],"content":"The below is a curated list of public datasets and will be updated over time.\n Kaggle ImageNet Awesome Public Datasets U.S. government’s open data AWS datasets US Census Bureau European Union’s open data UK government’s open data Canadian government’s open data CIA World Factbook U.S. healthcare data Facebook public data UCI machine learning repo chars74k dataset DeepMind Open Source Dataset FastAI Datasets Driving video data Dataquest\u0026rsquo;s 17 places to find data sets for data science projects Google\u0026rsquo;s search engine for searching across 25 million datasets ","date":"2018-06-29T05:00:58Z","title":"Datasets","uri":"https://www.bobbywlindsey.com/2018/06/29/datasets/"},{"categories":["Data Science"],"content":"The Bias-Variance trade-off When building a model, one of the first things we look at are the prediction errors of the dev and test sets. These errors can be decomposed into two components known as bias and variance. Bias is an error that results from incorrect assumptions the model makes about the training data and variance is an error that happens when the model is sensitive to small changes in the training data.\nIf you build a model and observe that it has a hard time predicting data it\u0026rsquo;s already seen (i.e. it has a low training accuracy), then your model doesn\u0026rsquo;t fit the data well and so we say it has high bias. On the other hand, if your model is too sensitive to changes in the training data, then it will try to predict random noise rather than the intended outputs and thus will overfit your data. This usually results in a very high training accuracy and we say the model has high variance.\nSince we want a model that minimizes bias and variance, a trade-off arises. 
We could have a model with a high training accuracy, but performs poorly on the dev and test sets. In this case, it\u0026rsquo;s better to sacrifice some of that training accuracy in exchange for better performance on the dev and test sets. This trade-off is perhaps more intuitively understood by the image below.\nBias-Variance decomposition Although the bias-variance trade-off might feel experimentally familiar, it should be mathematically verified that we can decompose prediction errors in terms of bias and variance (and an unavoidable error term).\nSo the data we have comes from some underlying function, $f$, mixed with some noise, $\\epsilon$. Let\u0026rsquo;s represent this as $y = f + \\epsilon$ and note that we assume $\\epsilon$ to have a normal distribution with a mean of $0$ and variance $\\sigma^2$. The underlying function is what we\u0026rsquo;re trying to approximate with some model, $\\hat{f}$, and so to show that the error of $\\hat{f}$ in predicting $y$ (i.e. the mean squared error) can be seen as a combination of bias, variance, and an irreducible error term (which is an inevitable result of the noise, $\\epsilon$, in the data), we need to show that\n$$ \\E\\big[(y - \\hat{f})^2\\big] = \\bias\\big[\\hat{f}\\big]^2 + \\var\\big[\\hat{f}\\big] + \\sigma^2 $$\nwhere\n$$ \\bias\\big[\\hat{f}\\big] = \\E\\big[\\hat{f} - f\\big] $$\nand\n$$ \\var\\big[\\hat{f}\\big] = \\E\\Big[\\hat{f}^2\\Big] - \\E\\big[\\hat{f}\\big]^2 $$\nSo,\n$$ \\begin{aligned} \\E\\big[(y - \\hat{f})^2\\big] \u0026amp;= \\E\\Big[y^2 - 2y\\hat{f} + \\hat{f}^2\\Big] \\\\\\\n\u0026amp;= \\E\\big[y^2\\big] + \\E\\Big[\\hat{f}^2\\Big] - \\E\\big[2y\\hat{f}\\big] \\end{aligned} $$\nAnd since $\\var[y] = \\E\\big[y^2\\big] - \\E[y]^2$ and $\\var\\big[\\hat{f}\\big] = \\E\\Big[\\hat{f}^2\\Big] - \\E\\big[\\hat{f}\\big]^2$, then with a little rearranging we have,\n$$ \\E\\big[y^2\\big] = \\var[y] + \\E[y]^2 $$\n$$ \\E\\Big[\\hat{f}^2\\Big] = \\var\\big[\\hat{f}\\big] + 
\\E\\big[\\hat{f}\\big]^2 $$\nSo now\n$$ \\begin{aligned} \\E\\big[(y - \\hat{f})^2\\big] \u0026amp;= \\var[y] + \\E[y]^2 + \\var\\big[\\hat{f}\\big] + \\E\\big[\\hat{f}\\big]^2 - 2\\E[y]\\E[\\hat{f}] \\\\\\\n\u0026amp;= \\var[y] + \\E[y]^2 + \\var\\big[\\hat{f}\\big] + \\E\\big[\\hat{f}\\big]^2 - 2f\\E[\\hat{f}] \\\\\\\n\u0026amp;= \\var[y] + \\var\\big[\\hat{f}\\big] + \\Big(f^2 - 2f\\E[\\hat{f}] + \\E[\\hat{f}]^2\\Big) \\\\\\\n\u0026amp;= \\var[y] + \\var[\\hat{f}] + (f - \\E[\\hat{f}])^2 \\end{aligned} $$\nSince $\\bias[\\hat{f}]^2 = \\big(\\E\\big[\\hat{f}\\big] - f\\big)^2 = \\big(f - \\E\\big[\\hat{f}\\big]\\big)^2$, we now have\n$$ \\E\\big[(y - \\hat{f})^2\\big] = \\var[y] + \\var[\\hat{f}] + \\bias\\big[\\hat{f}\\big]^2 $$\nAnd since $\\var[y] = \\sigma^2$, we finally have\n$$ \\E\\big[(y - \\hat{f})^2\\big] = \\bias\\big[\\hat{f}\\big]^2 + \\var\\big[\\hat{f}\\big] + \\sigma^2 $$\nFor more justification in each of these steps, refer to the derivation procedure posted on Berkeley\u0026rsquo;s machine learning blog.\nAddressing high bias and high variance To minimize bias and variance, we need to have a game-plan for how to address high bias and high variance. In the case of high bias, we can:\n Choose a more complex model that can learn from the data Make sure model assumptions about data are verified Train the model for a greater amount of time To address high variance, we have a few more options at our disposal:\n Get more data! Theoretically, variance approaches zero as the number of samples approaches infinity. 
However, collecting more labeled data can be costly, so it\u0026rsquo;s easier in most cases, like for images, to just generate more data via augmentation using techniques like: Mirroring Random cropping Rotation Color shifting (using PCA augmentation as explained in the AlexNet paper) Normalize the data, which has the effect of making the cost function more symmetric so gradient descent takes less time since you can use larger learning rates Handicap your model so that it doesn\u0026rsquo;t become too complex via regularization techniques. In the case of neural networks, this has the effect of encouraging the weights to get close to zero which essentially prunes the network. Some regularization techniques for neural networks include: Dropout, so that the neural network doesn\u0026rsquo;t rely too much on any one feature and is forced to spread out the weights Stop earlier in gradient descent, but the downside of this is that you can no longer decouple optimizing your cost function from reducing overfitting Initialize your weights appropriately using methods like the Xavier or He methods which helps to avoid vanishing or exploding gradients Ensemble multiple models In a separate post, we\u0026rsquo;ll cover cross-validation which can have a desirable effect on both bias and variance.\nConclusion As you can see, the model(s) you choose, the parameters you settle for, and the data you collect all have an effect on bias and variance. 
Ideally, there\u0026rsquo;s a perfect balance out there for any given situation but to find it we\u0026rsquo;ll have to rely on a bit of intuition, more experimentation, and robust methods.\n","date":"2018-06-28T02:58:04Z","title":"Bias and Variance","uri":"https://www.bobbywlindsey.com/2018/06/28/bias-and-variance/"},{"categories":["Data Science"],"content":"Although linear models tend to be considered a classical form of statistical modeling, they remain widely deployed in many branches of science as a foundational tool to better understand data. A lot of online articles tend to either focus on just theory or take the other extreme of just writing a few lines of code to throw data at. As in most things in life, there\u0026rsquo;s a balance which I hope this post achieves.\nConfigure environment from header import * from sympy import * init_printing(use_unicode=True) import matplotlib.pyplot as plt %matplotlib inline plt.style.use(\u0026#39;ggplot\u0026#39;) Get data To understand a more complicated topic, it\u0026rsquo;s best to start with a simple problem whose insights can be scaled to much larger problems. In this spirit, let\u0026rsquo;s say you have a few data points: $(2, 3), (7, 4)$, and $(9, 8)$. We call each point a two-dimensional point since each point has two numbers. Using Python, you can easily generate and plot these points.\n# put points in matrix points = np.array([[2, 3], [7, 4], [9, 8]]) # plot styling plt.title(\u0026#34;Example points\u0026#34;) plt.xlabel(\u0026#34;x\u0026#34;) plt.ylabel(\u0026#34;y\u0026#34;) plt.xlim([0, 10]) plt.ylim([0, 10]) plt.plot(points[:, 0], points[:, 1], \u0026#39;o\u0026#39;, color=\u0026#34;#77DD77\u0026#34;) Solving the system of equations The goal is to model the trend in the data (in order to predict future unseen values) so we might want to draw a straight line that would hit all the points, but that\u0026rsquo;s not possible. 
Although you can easily see that a curved line would fit all the points, there\u0026rsquo;s actually a mathematical way you can prove that a straight line won\u0026rsquo;t cut it by using a system of equations. To create this system, use the equation for a straight line for each data point:\n$$ \\beta_0 + \\beta_1x = y\\ \\mathrm{or}\\ b + mx = y $$\nand for each point you\u0026rsquo;ll get the following system of equations:\n$\\beta_0 + \\beta_1 2 = 3$\n$\\beta_0 + \\beta_1 7 = 4$\n$\\beta_0 + \\beta_1 9 = 8$\nHere $\\beta_0$ is the coefficient for the intercept term (which is implicitly set to 1) so that by adjusting $\\beta_0$, you can move the line up or down.\nSince you might have much more data than the three points we have above, you\u0026rsquo;ll want to represent this system of equations in a way that is more computationally efficient for your computer to crunch. This can be achieved by using linear algebra notation:\n$$ X \\beta = y $$\nwhich expands to\u0026hellip;\n$$ \\left[\\begin{matrix}1 \u0026amp; 2 \\\\ 1 \u0026amp; 7 \\\\ 1 \u0026amp; 9\\end{matrix}\\right] \\left[\\begin{matrix} \\beta_0 \\\\ \\beta_1\\end{matrix}\\right] = \\left[\\begin{matrix} 3 \\\\ 4 \\\\ 8 \\end{matrix}\\right] $$\nWe call $X$ the design matrix and $y$ the observation vector.\nNow solve the system of equations! 
You can easily do this by using the row-reduced echelon (RRE) algorithm which is easily handled in Python.\n# create design matrix X = np.array([[1, 2], [1, 7], [1, 9]]) # observation vector y = np.array([[3, 4, 8]]) augmented_matrix = Matrix(np.concatenate((X, y.T), axis=1)) printltx(r\u0026#34;Augmented Matrix =\u0026#34; + ltxmtx(augmented_matrix)) printltx(r\u0026#34;rref(Augmented Matrix) =\u0026#34; + ltxmtx(augmented_matrix.rref()[0])) Augmented Matrix = $$\\left[\\begin{matrix}1.0 \u0026amp; 2.0 \u0026amp; 3.0 \\\\ 1.0 \u0026amp; 7.0 \u0026amp; 4.0 \\\\ 1.0 \u0026amp; 9.0 \u0026amp; 8.0\\end{matrix}\\right]$$\nrref(Augmented Matrix) = $$\\left[\\begin{matrix}1.0 \u0026amp; 0 \u0026amp; 0 \\\\ 0 \u0026amp; 1.0 \u0026amp; 0 \\\\ 0 \u0026amp; 0 \u0026amp; 1.0\\end{matrix}\\right]$$\nLook at the last row of rref(augmented matrix). This is saying that $0 = 1$ which is a contradiction! That\u0026rsquo;s the algorithm\u0026rsquo;s way of saying that the system of equations you made has no solution. In other words, you can\u0026rsquo;t find a fixed $\\beta_0$ and $\\beta_1$ that you could stick in the linear model you\u0026rsquo;re trying to build ($\\beta_0 + \\beta_1x = y$ which is just the equation for a line) that would satisfy all the x\u0026rsquo;s we have, $(2, 7, 9)$, and output all the $y$'s we have, $(3, 4, 8)$.\nThe below illustration shows that you can choose any $\\beta_0$ and $\\beta_1$, but it will always end up in the range of $X$ while never reaching $y = (3, 4, 8)$.\nEstimate $\\beta$ by projecting $y$ onto the range of $X$ Since $y$ is not in the range of $X$, the best you can do is come up with a close version of $y$ that is in the range of $X$. 
You can do this by \u0026ldquo;projecting\u0026rdquo; $y$ onto the range of $X$ (which will end up being some point in the range of $X$ that is closest to $y$).\nThere are two ways you can do this:\n the hard way orthogonalize your design matrix $X$ by using the Gram-Schmidt Algorithm the projection of $y$ onto the range of $X$ is $\\hat{y} = \\frac{\\langle y, x_1 \\rangle}{\\langle x_1, x_1 \\rangle}x_1 + \\dots + \\frac{\\langle y, x_p \\rangle}{\\langle x_p, x_p \\rangle}x_p$ note that $\\hat{y}$ is a target that can be hit and $\\hat{y}$ is most like $y$ solve $X \\hat{\\beta} = \\hat{y}$ the easy way use the solutions of the normal equation, $(X\u0026rsquo;X) \\hat{\\beta} = X\u0026rsquo;y$, which is the least squares solution to $X \\beta = y$ remember that $X \\beta = y$ is what you were trying to solve earlier with the system of equations but couldn\u0026rsquo;t find a solution; the normal equation adjusts $y$ in the most minimal way as to give you an equation that will be solvable and leave you with $\\hat{\\beta}$s notice that $\\hat{\\beta}$s are not the same as $\\beta$s since $\\beta$s are the true parameters that we couldn\u0026rsquo;t find before using the system of equations and $\\hat{\\beta}$s are estimates of the true parameters There\u0026rsquo;s a couple things to notice here:\n $\\epsilon$, the residual vector, is perpendicular to $\\hat{y}$ meaning that $\\epsilon \\perp \\hat{y}$ $\\hat{y} + \\epsilon = y$ One can also see the use of Pythagorean\u0026rsquo;s theorem at play here.\n$\\epsilon \\perp \\hat{y} \\implies ||y||^2 = ||\\hat{y} + \\epsilon||^2 = ||\\hat{y}||^2 + ||\\epsilon||^2$\n$\\epsilon$ is minimized with $\\hat{\\beta}$ For this regression model, you want to show that the residual vector, $\\epsilon$, is minimized with $\\hat{\\beta}$. To show that, you need to minimize the sum of squared errors which translates to minimizing $||y - X \\beta||^2$. 
But to show $\\hat{\\beta}$ is our best candidate, you need to start with an arbitrary candidate for our sum of squared errors, $|| y - X \\gamma||^2$.\nFirstly, you know that $y - X \\gamma = y - X \\hat{\\beta} + X(\\hat{\\beta} - \\gamma)$ because:\n$$ \\begin{aligned} y - X \\gamma \u0026amp; = y - X \\gamma + X \\hat{\\beta} - X \\hat{\\beta} \\\\\n\u0026amp; = y - X \\hat{\\beta} + X \\hat{\\beta} - X \\gamma \\\\\n\u0026amp; = y - X \\hat{\\beta} + X(\\hat{\\beta} - \\gamma) \\\\\n\\end{aligned} $$\nSince the residual $\\epsilon = y - X \\hat{\\beta}$ is perpendicular to the range of $X$ by construction, and $X(\\hat{\\beta} - \\gamma)$ lies in that range, the Pythagorean theorem lets you say that $|| y - X \\gamma||^2 = ||y - X \\hat{\\beta}||^2 + ||X(\\hat{\\beta} - \\gamma)||^2$.\nSecondly, you use this fact to show that $||y - X \\gamma||^2$ is minimized when $\\gamma = \\hat{\\beta}$.\nIf $\\gamma = \\hat{\\beta}$ then $||y - X \\gamma||^2 = || y - X \\hat{\\beta}||^2$. And if $\\gamma \\ne \\hat{\\beta}$ then $||y - X \\gamma||^2 = ||y - X \\hat{\\beta}||^2 + ||X(\\hat{\\beta} - \\gamma)||^2$ which means $||y - X \\hat{\\beta}||^2 + ||X(\\hat{\\beta} - \\gamma)||^2 \\ge ||y - X \\hat{\\beta}||^2$.\nSo, $||y - X \\gamma||^2$ is minimized when $\\gamma = \\hat{\\beta}$.\nSolving $X \\hat{\\beta} = \\hat{y}$ Now we need to solve $X \\hat{\\beta} = \\hat{y}$.\nWe know $\\epsilon$ is orthogonal to every element in the range of $X$. 
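That orthogonality can be checked numerically on the three example points. The sketch below is illustrative rather than part of the derivation: it assumes NumPy is available, obtains the estimate with np.linalg.lstsq instead of the formula derived next, and the names y_hat and residual are chosen here for illustration.

```python
import numpy as np

# Design matrix and observation vector from the example points (2, 3), (7, 4), (9, 8).
X = np.array([[1, 2], [1, 7], [1, 9]], dtype=float)
y = np.array([3, 4, 8], dtype=float)

# Least squares estimate of the coefficients (intercept first, then slope).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Projection of y onto the range of X, and the leftover residual vector.
y_hat = X @ beta_hat
residual = y - y_hat

# Each column of X should be perpendicular to the residual (values near zero).
print(X.T @ residual)

# The projection itself should also be perpendicular to the residual.
print(y_hat @ residual)
```

Both printed quantities should be zero up to floating point error, and beta_hat should land near (1.3077, 0.6154), the values quoted in the conclusion of this post.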
So, we can say that $\\epsilon \\perp X \\implies X \\perp \\epsilon \\implies X\u0026rsquo; \\epsilon = 0$.\n$X\u0026rsquo; \\epsilon = X\u0026rsquo;(y - X \\hat{\\beta}) = X\u0026rsquo;y - X\u0026rsquo;X \\hat{\\beta}$\nSo,\n$$ \\begin{aligned} X\u0026rsquo;y - X\u0026rsquo;X \\hat{\\beta} \u0026amp; = 0 \\\\\\\nX\u0026rsquo;X \\hat{\\beta} \u0026amp; = X\u0026rsquo;y\\ \\mathrm{(this\\ is\\ the\\ normal\\ equation)} \\\\\\\n\\hat{\\beta} \u0026amp; = (X\u0026rsquo;X)^{-1}X\u0026rsquo;y\\ \\mathrm{(the\\ least\\ squares\\ solution)} \\\\\\\n\\end{aligned} $$\nConclusion Now that we know how to get our $\\hat{\\beta}$'s, we can use the design matrix, X, and our observation vector, y, to find the betas for our example. Plugging in X and y into $\\hat{\\beta} = (X\u0026rsquo;X)^{-1}X\u0026rsquo;y$, we get $\\hat{\\beta} = (1.30769231, 0.61538462)$ as shown in the Python code below (remember to transpose y to make it a column vector):\nbeta_hat = np.linalg.inv((X.T.dot(X))).dot(X.T).dot(y.T) beta_hat And there you have it! $\\hat{\\beta}$ is the vector containing each of the coefficients you were trying to find for your linear model and represents the closest thing we can get to a perfect solution.\n","date":"2017-11-07T04:30:55Z","title":"Understanding Linear Regression","uri":"https://www.bobbywlindsey.com/2017/11/07/understanding-linear-regression/"},{"categories":["Math"],"content":"For those who haven\u0026rsquo;t heard of LaTeX, it\u0026rsquo;s an awesome typesetting language typically used when writing mathematics. As I study mathematics, I\u0026rsquo;ve gotten quite familiar with LaTeX over time as I\u0026rsquo;ve been using it for taking notes and submitting assignments. All that being said, I can never remember what LaTeX command I need to get the symbol I want.\nFor example, to make the summation, $\\summation{n=1}{\\infinity}$, you\u0026rsquo;d have to write \\sum\\limits_{n=1}^{\\infty}. Yeah, you can see why I don\u0026rsquo;t like that approach. 
So instead, I have a macro that allows me to write \\summation{n=1}{\\infinity} to achieve the same result. If you ask me, that\u0026rsquo;s much more readable, and easier to remember.\nSo this spurred an entire set of custom macros I wrote along with modularizing the look of the LaTeX document depending on if the document is to be used for notes or for assignment. The macros are automatically imported from the notes and homework packages and usage is simple. If you\u0026rsquo;re creating a new LaTeX document to take notes, simply put at the top of your LaTeX file:\n\\documentclass[12pt]{article} \\usepackage{../notes} and you can start writing a document that when compiled might look something like this:\nIf you\u0026rsquo;re submitting an assignment, replace \u0026ldquo;notes\u0026rdquo; with \u0026ldquo;homework\u0026rdquo; like so:\n\\documentclass[12pt]{article} \\usepackage{../homework} A simple homework document structure looks like the following:\n\\documentclass[12pt]{article} \\usepackage{../homework} \\begin{document} \\title{Homework X} \\author{Your Name\\\\ MATH 4309 - Awesome Math Class} \\maketitle \\begin{exercise}{1.4.1} Some exercise prompt here. \\end{exercise} \\begin{exercise}{1.4.2} Some other exercise prompt here. \\end{exercise} \\end{document} and voila! Your Real Analysis homework pops out.\nAs you can see, the preamble to these documents are nice and clean without a million references to packages or macros. To use the macros I created along with the notes and homework formatting, grab them at my GitHub.\n","date":"2017-11-02T18:11:16Z","title":"Better LaTeX Than Never","uri":"https://www.bobbywlindsey.com/2017/11/02/better-latex-than-never/"},{"categories":["Dev","Popular"],"content":"Awhile ago, I had AWS set up to provide me a unique URL that I could navigate to and use Jupyter Notebooks. I admired the convenience and the ability to just start a computation and close my laptop knowing full well my computations continued working away. 
However, using an AWS P2 instance can get very costly depending on your usage, which for me would be around $600 per month. So, I figured I could just build a computer with that kind of money which could serve as a deep learning rig along with the occasional video gaming :).\nThis post describes the configuration setup once you have already built your computer and installed a flavor of Linux like Ubuntu. Turns out that the following configuration was way easier for me to setup than AWS, and with the help of aliases, remote computing with Jupyter Notebooks is easier than ever. Let\u0026rsquo;s get started!\nInstallation So first you need to install the following on your Ubuntu server:\n Anaconda, which will provide a lot of the default Python packages you\u0026rsquo;d need openssh-server, which can be installed with the following: sudo apt-get install openssh-server -y tmux, which can be installed with sudo apt-get install tmux -y If you ever need to see the status of openssh-server or restart it, type the below into your terminal:\nsystemctl status ssh sudo service ssh restart Connecting locally to your server To make sure most things are set up correctly, we first need to verify that you can connect to your server on your local network.\nOk! So on your server, open your sshd_config file located at /etc/ssh/sshd_config. To make changes to it, you\u0026rsquo;ll need sudo privileges. Once the file is open, you\u0026rsquo;ll need to specify what port you\u0026rsquo;ll want to use when connecting in. Whatever you choose, I highly advise not using the default port 22. Let\u0026rsquo;s say you decide to use port 22222 instead. 
There\u0026rsquo;s an entry in your sshd_config file called Port and you should edit it as such:\nPort 22222 Under AllowUsers, put the username you use when logging into your server.\nAllowUsers your_username Next, set PasswordAuthentication to yes.\nLastly, to make sure Ubuntu won\u0026rsquo;t block incoming web traffic on port 8888, we need to adjust its iptables:\nsudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 8888 -j ACCEPT sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 443 -j ACCEPT sudo netfilter-persistent save Now, take note of your server\u0026rsquo;s IP address. This can be found by typing ifconfig into your terminal and looking for something like: inet 192.168.1.112. Once you\u0026rsquo;ve identified your server\u0026rsquo;s IP address on your local network, it\u0026rsquo;s time to pick up your laptop and try to log into it:\nssh your_username@192.168.1.112 -p 22222 If you get a terminal prompt, you\u0026rsquo;re in!\nConnecting remotely to your server Now the whole point of setting up remote computing is so that you can leave your house and remote into your server while on someone else\u0026rsquo;s network. To do this only requires a few changes (which require you to still be on your network):\n If you don\u0026rsquo;t already have a set of public and private keys on your laptop, generate them Copy your public key to your server: ssh-copy-id your_username@192.168.1.112 -p 22222 Identify your server\u0026rsquo;s WAN IP address by typing in your server\u0026rsquo;s terminal: curl 'https://api.ipify.org' (note that your ISP changes this address frequently which is why I use Google Wifi which allows me to check my WAN address from anywhere) On your router, turn on port forwarding. Using our example, you need to forward port 22222 to port 22222. Now try to remote into your server using the WAN address you found!\nssh your_username@server_wan_ip -p 22222 If you see a prompt, well done! 
One last thing is to set PasswordAuthentication to no in your sshd_config file since now you\u0026rsquo;re logging in with an SSH key; that way no one can try brute-forcing your password.\nYou can now access your server from outside your network, so go grab yourself that Starbucks coffee :).\nStarting a remote Jupyter Notebook Now that all the hard work is done, you can easily use a remote Jupyter Notebook with the following steps:\n ssh into your server: ssh your_username@server_wan_ip -p 22222 Start a new tmux session that you can easily detach from later: tmux new -s session-name Start jupyter-notebook without a browser: jupyter-notebook --no-browser --port=8889 Now in a new terminal on your laptop, forward your server\u0026rsquo;s port traffic to your laptop\u0026rsquo;s local port: ssh -N -L localhost:8888:localhost:8889 your_username@server_wan_ip -p 22222 In your web browser, navigate to localhost:8888/tree and you should see your Jupyter Notebooks! Now you can easily use Jupyter Notebooks on your laptop while using your server\u0026rsquo;s beefy resources to do the computing.\nOne last thing, after having figured out the steps above, I thought I\u0026rsquo;d make the process simpler by using aliases and functions. 
The below are the relevant lines I added to my laptop\u0026rsquo;s .bashrc file:\nserver-connect() { ssh your_username@$1 -p 22222 } jn-connect() { ssh -N -L localhost:8888:localhost:8889 your_username@$1 -p 22222 } And the lines I added to my server\u0026rsquo;s .bashrc file:\nalias jn-remote=\u0026#34;jupyter-notebook --ip=\u0026#39;*\u0026#39; --no-browser --port=8889\u0026#34; Now the 5 steps above become:\n server-connect server_wan_ip tmux new -s session-name jn-remote open new terminal and type jn-connect server_wan_ip navigate to localhost:8888/tree (Photo by Lukas)\n","date":"2017-08-10T05:26:07Z","title":"Remote Computing with Jupyter Notebooks","uri":"https://www.bobbywlindsey.com/2017/08/10/remote-computing-with-jupyter-notebooks/"},{"categories":["Data Science"],"content":"Principal Components Analysis (PCA for short) is a technique used to reduce the dimensions of a data set. There are a few helpful ways (at least for the math literate) to explain what PCA does:\n PCA computes the most meaningful basis to re-express a noisy, garbled data set. PCA yields a feature subspace that maximizes the variance along the axes. PCA reduces the dimensionality of the original feature space by projecting it onto a smaller subspace, where the eigenvectors (which all have the same unit length of 1) will form the axes. PCA is an extremely useful technique in data science. Hi-resolution images can have thousands of dimensions (MRI scans are huge) and those familiar with \u0026ldquo;the curse of dimensionality\u0026rdquo; know that certain algorithms are only approachable when one\u0026rsquo;s data set is of a manageable size. 
PCA employs a combination of topics like correlation, eigenvalues, and eigenvectors to determine which variables account for most of the variability in the data set.\nEigenvalues and eigenvectors introduction Eigenvalues and eigenvectors describe inherent properties of a linear transformation, much like prime numbers can be used to describe some inherent properties of an integer. An eigenvector is just a vector such that when you apply a linear transformation to it, it only scales. The factor by which it scales is called its eigenvalue which we denote as $\\lambda$. To show how you can solve for these values, let\u0026rsquo;s take a look at a quick example.\nSay you have a linear transformation:\n$$\\T = \\begin{bmatrix} 7 \u0026amp; -1 \\\\\\\n-1 \u0026amp; 7 \\\\\\\n\\end{bmatrix} $$\nTo find the eigenvalues, $\\lambda$s, of the transformation:\n$$ \\T x = \\lambda x $$ $$ \\T x = \\lambda \\mathrm{I} x $$ $$ \\T x - \\lambda \\mathrm{I} x = 0 $$ $$ (\\T - \\lambda \\mathrm{I})x = 0 : x \\ne 0 $$ $$ x \\in \\mathrm{ker}(\\T - \\lambda \\mathrm{I}) $$ $$ \\mathrm{det}(\\T - \\lambda \\mathrm{I}) = 0 $$ $$ \\mathrm{det}(\\lambda \\mathrm{I} - \\T) = 0 $$ $$ \\mathrm{det}\\bigg(\\lambda \\bigg(\\begin{bmatrix} 1 \u0026amp; 0\\\\ 0 \u0026amp; 1 \\end{bmatrix}\\bigg) - \\begin{bmatrix} 7 \u0026amp; -1\\\\ -1 \u0026amp; 7 \\end{bmatrix}\\bigg) = 0 $$ $$ \\mathrm{det}\\bigg(\\begin{bmatrix} \\lambda \u0026amp; 0\\\\ 0 \u0026amp; \\lambda \\end{bmatrix} - \\begin{bmatrix} 7 \u0026amp; -1\\\\ -1 \u0026amp; 7 \\end{bmatrix}\\bigg) = 0 $$ $$ \\mathrm{det}\\bigg(\\begin{bmatrix} \\lambda-7 \u0026amp; 1\\\\ 1 \u0026amp; \\lambda-7 \\end{bmatrix}\\bigg) = 0 $$ $$ (\\lambda - 7)^2 - 1 = 0 $$ $$ \\lambda^2 - 14 \\lambda + 49 - 1 = 0 $$ $$ \\lambda^2 - 14 \\lambda + 48 = 0 $$ $$ (\\lambda - 6)(\\lambda - 8) = 0 $$ $$ \\lambda = 6, \\lambda = 8 $$\nTo find the eigenvectors associated with the eigenvalues, we need only plug each eigenvalue into the original starting point, $\\T x = 
\\lambda x$, and solve for the vector that provides the solution:\n$$ \\lambda = 6 $$ $$ (6 \\mathrm{I} - \\T)x = 0 $$\n$$ \\begin{bmatrix} -1 \u0026amp; 1 \\\\\\\n1 \u0026amp; -1 \\end{bmatrix} \\begin{bmatrix} a \\\\\\\nb \\end{bmatrix} = \\begin{bmatrix} 0 \\\\\\ 0 \\end{bmatrix} $$\nSolve the above to get the following set of vectors as a solution:\n$$ \\begin{bmatrix} a \\\\\\ a \\end{bmatrix} : a \\in \\mathbb{R} = a\\begin{bmatrix} 1 \\\\\\ 1 \\end{bmatrix} : a \\in \\mathbb{R} = \\mathrm{span}\\bigg(\\begin{bmatrix} 1 \\\\\\ 1\\end{bmatrix}\\bigg) $$\nWe can repeat the above steps with $\\lambda = 8$ and get the set of vectors: $$\\mathrm{span}\\bigg(\\begin{bmatrix} 1 \\\\\\ -1\\end{bmatrix}\\bigg)$$\nNow let\u0026rsquo;s verify this in Python.\nimport numpy as np from numpy import linalg as LA import math # create a numpy matrix for our transformation T = np.array([[7, -1], [-1, 7]]) # get eigenvalues and eigenvectors eigenvalues, eigenvectors = LA.eig(T) print(\u0026#34;eigenvalues: %s\\n\u0026#34; %eigenvalues) # eigenvectors will be normalized print(\u0026#34;normalized eigenvectors:\\n%s\\n\u0026#34; %eigenvectors) # un-normalized eigenvectors (rescale each unit-length column by sqrt(2)) unnormalized_eigenvectors = eigenvectors*math.sqrt(2) print(\u0026#34;un-normalized eigenvectors:\\n%s\u0026#34; %unnormalized_eigenvectors) eigenvalues: [ 8. 6.] normalized eigenvectors: [[ 0.70710678 0.70710678] [-0.70710678 0.70710678]] un-normalized eigenvectors: [[ 1. 1.] [-1. 1.]] PCA the hard way Normalize data set (always a good idea to do so that certain algorithms aren\u0026rsquo;t affected by the scale of different variables) Calculate the correlation matrix for your dataset. This matrix will be your transformation and will provide the correlation between each pair of variables. 
(We use correlation instead of covariance since correlation is a normalized version of the covariance matrix; although we\u0026rsquo;re not too concerned about normalization since we already normalized the data set in step 1). Find the eigenvectors and eigenvalues of the correlation matrix using the steps above. For each eigenvalue, take the absolute value and divide it by the sum of all the eigenvalues which will provide the proportion of variance that its associated eigenvector contributes to the data. Sort the eigenvectors (i.e. principal components) by their associated eigenvalues (highest to lowest). Eigenvalues tell you which principal components to keep. The number of principal components to keep will be based on the cumulative explained variance that the eigenvalues account for. The cumulative explained variance to stop at is a threshold that is decided by how much variability you want to explain. Stick each eigenvector you want to keep in a matrix that we call a \u0026ldquo;feature vector\u0026rdquo;. Now project the data set onto the new feature space (i.e. the feature vector) by multiplying your data set by the feature vector. 
import pandas as pd from sklearn.preprocessing import StandardScaler df = pd.read_csv( filepath_or_buffer=\u0026#39;https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\u0026#39;, header=None, sep=\u0026#39;,\u0026#39;) # normalize data set (first four columns; .ix has been removed from pandas, so use .iloc) data_set = df.iloc[:, 0:4].values data_set = StandardScaler().fit_transform(data_set) # calculate correlation matrix correlation_matrix = np.corrcoef(data_set.T) print(\u0026#34;Correlation matrix:\\n%s\\n\u0026#34; %correlation_matrix) # find eigenvalues and eigenvectors eigenvalues, eigenvectors = LA.eig(correlation_matrix) print(\u0026#34;Eigenvectors:\\n%s\\n\u0026#34; %eigenvectors) # make a list of (eigenvalue, eigenvector) tuples eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:,i]) for i in range(len(eigenvalues))] # sort the (eigenvalue, eigenvector) tuples from highest to lowest by eigenvalue eigen_pairs.sort(key=lambda pair: pair[0], reverse=True) print(\u0026#39;Eigenvalues in descending order:\u0026#39;) for pair in eigen_pairs: print(pair[0]) Correlation matrix: [[ 1. -0.10936925 0.87175416 0.81795363] [-0.10936925 1. -0.4205161 -0.35654409] [ 0.87175416 -0.4205161 1. 0.9627571 ] [ 0.81795363 -0.35654409 0.9627571 1. ]] Eigenvectors: [[ 0.52237162 -0.37231836 -0.72101681 0.26199559] [-0.26335492 -0.92555649 0.24203288 -0.12413481] [ 0.58125401 -0.02109478 0.14089226 -0.80115427] [ 0.56561105 -0.06541577 0.6338014 0.52354627]] Eigenvalues in descending order: 2.91081808375 0.921220930707 0.147353278305 0.0206077072356 # find explained variance of eigenvalues eigenvalues_sum = sum(eigenvalues) explained_variance = [(eigenvalue / eigenvalues_sum) for eigenvalue in sorted(eigenvalues, reverse=True)] for i in explained_variance: print(i) 0.727704520938 0.230305232677 0.0368383195763 0.00515192680891 It appears the first two eigenvectors account for the majority of the variance in the data set. 
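To make that concrete, the running total of the proportions printed above can be compared against a cutoff. This is only a sketch in plain Python: the 0.95 threshold is a hypothetical choice made here for illustration, not a value taken from this post, and the proportions are copied by hand from the output above.

```python
from itertools import accumulate

# Proportions of variance copied from the printed output, largest first.
explained_variance = [0.727704520938, 0.230305232677,
                      0.0368383195763, 0.00515192680891]

# Running total of variance explained as each principal component is added.
cumulative = list(accumulate(explained_variance))

# Hypothetical cutoff: keep enough components to explain 95% of the variance.
threshold = 0.95

# Smallest number of components whose cumulative share reaches the cutoff.
n_components = next(i + 1 for i, total in enumerate(cumulative)
                    if total >= threshold)

print(cumulative)
print(n_components)
```

With these numbers the first two components together explain roughly 95.8% of the variance, so a 95% cutoff keeps two of them; a stricter cutoff would keep more.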
Let\u0026rsquo;s just use the two eigenvectors associated with those first two eigenvalues, stick them in a matrix, and apply said matrix to the data set.\n# put eigenvectors in matrix feature_vector = np.hstack((eigen_pairs[0][1].reshape(4,1), eigen_pairs[1][1].reshape(4,1))) # project data set onto feature_vector pca_data_set = pd.DataFrame(data_set.dot(feature_vector)) PCA the easy way Now that we understand the mechanics of Principal Components Analysis, we can use convenience functions in sklearn to do all the heavy lifting for us.\n# scikit-learn convenience command from sklearn.decomposition import PCA as sklearnPCA from sklearn.preprocessing import StandardScaler # standardize data set (sklearn uses covariance matrix instead of correlation matrix) data_set = df.iloc[:, 0:4].values data_set_std = StandardScaler().fit_transform(data_set) sklearn_pca = sklearnPCA(n_components=4) sklearn_pca.fit(data_set_std) # covariance matrix covariance_matrix = sklearn_pca.get_covariance() print(\u0026#34;Covariance matrix:\\n%s\\n\u0026#34; %covariance_matrix) # eigenvectors eigenvectors = sklearn_pca.components_.T print(\u0026#34;Eigenvectors:\\n%s\\n\u0026#34; %eigenvectors) # eigenvalues explained_variance = sklearn_pca.explained_variance_ print(\u0026#34;Explained Variance:\\n%s\\n\u0026#34; %explained_variance) # explained variance shows we really only need the eigenvectors associated with the first two eigenvalues sklearn_pca = sklearnPCA(n_components=2) sklearn_pca.fit(data_set_std) # apply the feature vector to the data set pca_data_set = pd.DataFrame(sklearn_pca.transform(data_set_std)) Covariance matrix: [[ 1. -0.10936925 0.87175416 0.81795363] [-0.10936925 1. -0.4205161 -0.35654409] [ 0.87175416 -0.4205161 1. 0.9627571 ] [ 0.81795363 -0.35654409 0.9627571 1. 
]] Eigenvectors: [[ 0.52237162 0.37231836 -0.72101681 -0.26199559] [-0.26335492 0.92555649 0.24203288 0.12413481] [ 0.58125401 0.02109478 0.14089226 0.80115427] [ 0.56561105 0.06541577 0.6338014 -0.52354627]] Explained Variance: [ 2.91081808 0.92122093 0.14735328 0.02060771] References plot.ly tutorial math behind pca tutorial ","date":"2017-08-09T02:25:44Z","title":"Principal Components Analysis Explained","uri":"https://www.bobbywlindsey.com/2017/08/09/principal-components-analysis/"},{"categories":["Dev"],"content":"Vim is an amazing text editor and over the years has allowed me to be far more efficient when writing code or editing text in general. Although the initial learning curve is a bit steep, it\u0026rsquo;s well worth the time to learn to navigate and edit files without your mouse. But what makes Vim even more powerful is that it\u0026rsquo;s hackable - if you find yourself executing a sequence of keystrokes over and over for certain tasks, you can create your own function that can be used throughout Vim.\nSometimes I get Excel spreadsheets from clients who want me to look at data related to a list of numbers they provided me in an Excel column. If I were to paste those numbers in Vim, I\u0026rsquo;d get something that looks like the following:\n300418944 300404780 300482301 300354016 300295311 300417275 300409184 300480616 300478444 300491475 300478160 300482299 300482959 300154869 If I were to use that list of numbers as a SQL list in the WHERE clause of a SQL query, I\u0026rsquo;d need to surround all of the numbers by quotes and put a comma at the end of each number. Finally, I\u0026rsquo;d need to collapse all the rows into one line and surround that line with parentheses. Essentially, I need to take those numbers and create a tuple. 
So I want something that looks like this:\n('300418944', '300404780', '300482301', '300354016', '300295311', '300417275', '300409184', '300480616', '300478444', '300491475', '300478160', '300482299', '300482959', '300154869') Doing that by hand would take quite a bit of time, especially if given a list of hundreds of numbers. This is where Vim shines - we can create a function in our vimrc file that will handle these steps for us.\n\u0026#34; convert rows of numbers or text (as if pasted from excel column) to a tuple function! ToTupleFunction() range silent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . \u0026#34;s/^/\u0026#39;/\u0026#34; silent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . \u0026#34;s/$/\u0026#39;,/\u0026#34; silent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . \u0026#34;join\u0026#34; silent execute \u0026#34;normal I(\u0026#34; silent execute \u0026#34;normal $xa)\u0026#34; silent execute \u0026#34;normal ggVGYY\u0026#34; endfunction command! -range ToTuple \u0026lt;line1\u0026gt;,\u0026lt;line2\u0026gt; call ToTupleFunction()\nThis function will not only format your text, but also copy the result to your clipboard so you can paste it in whatever SQL query editor you use.\nLet\u0026rsquo;s break down each line of the body of the function.\nsilent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . \u0026#34;s/^/\u0026#39;/\u0026#34;\nFor all visually selected lines, the line above jumps to the beginning of each line and inserts a single quotation mark.\nsilent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . \u0026#34;s/$/\u0026#39;,/\u0026#34;\nThis line goes to the end of each line and inserts a single quotation mark and comma.\nThe next line of code joins all the lines of text we have so far into one line:\nsilent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . 
\u0026#34;join\u0026#34;\nNow we add an open parenthesis at the beginning of the line:\nsilent execute \u0026#34;normal I(\u0026#34;\nAnd then insert the closing one:\nsilent execute \u0026#34;normal $xa)\u0026#34;\nThe last line of the function selects the entire text and copies it to the clipboard (I have a custom mapping for copying to the clipboard: vnoremap YY \u0026quot;*y).\nAt last, here\u0026rsquo;s the function in action:\nIf you\u0026rsquo;d like to have a similar function that creates an array instead, you need only make a small change to the ToTupleFunction and give the function a new name.\n\u0026#34; convert rows of numbers or text (as if pasted from excel column) to an array function! ToArrayFunction() range silent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . \u0026#34;s/^/\u0026#39;/\u0026#34; silent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . \u0026#34;s/$/\u0026#39;,/\u0026#34; silent execute a:firstline . \u0026#34;,\u0026#34; . a:lastline . \u0026#34;join\u0026#34; \u0026#34; these two lines below are different by only one character! silent execute \u0026#34;normal I[\u0026#34; silent execute \u0026#34;normal $xa]\u0026#34; endfunction command! -range ToArray \u0026lt;line1\u0026gt;,\u0026lt;line2\u0026gt; call ToArrayFunction()","date":"2017-07-30T18:37:50Z","title":"Custom Vim Functions to Format Your Text","uri":"https://www.bobbywlindsey.com/2017/07/30/vim-functions/"},{"categories":["Math"],"content":"Markov models are remarkably suited to mimicking the structure of a phenomenon and as a result, I thought it would be interesting to explore that application in the context of textual analysis. 
This post describes the experiment of taking a few books penned by Mark Twain with the goal of building a Markov model that probabilistically generates text that shares Twain\u0026rsquo;s writing style and syntactical structure.\nMethod I start by collecting various writings from Mark Twain and combining them to create one big corpus. I then clean this big corpus by stripping all punctuation and lowercasing all text, allowing the corpus to be more easily parsed later on.\nfunction clean_corpus(text, regex; normalize = true, lower_case = true) if normalize # replace control characters with spaces text = normalize_string(text, stripmark = true, stripignore = true, stripcc = true) end if lower_case text = lowercase(text) end # remove unwanted characters text = replace(text, regex, \u0026#34;\u0026#34;) # remove \u0026#34;\u0026#34; text = split(text) target_index = 1 for i in 1:length(text) target_index = findnext(text, \u0026#34;\u0026#34;, target_index) if target_index == 0 break else splice!(text, target_index) end end text = join(text, \u0026#34; \u0026#34;) end; # import books f = open(\u0026#34;mark_twain_books/adventures_of_tom_sawyer.txt\u0026#34;) ats = readall(f); f = open(\u0026#34;mark_twain_books/huckleberry_finn.txt\u0026#34;) hf = readall(f) f = open(\u0026#34;mark_twain_books/the_prince_and_the_pauper.txt\u0026#34;) tpatp = readall(f) # clean books # create regex object (I prefer whitelisting characters I want to keep) chars_to_remove = r\u0026#34;[^a-z ]\u0026#34; ats_clean = clean_corpus(ats, chars_to_remove); hf_clean = clean_corpus(hf, chars_to_remove) tpatp_clean = clean_corpus(tpatp, chars_to_remove) # combine all books big_corpus_clean = ats_clean * \u0026#34; \u0026#34; * hf_clean * \u0026#34; \u0026#34; * tpatp_clean The next step is to convert each word in the text into a numerical representation which will make building a frequency array, which will be discussed below, from the corpus both convenient and computationally efficient.\nfunction 
text_to_numeric(text, symbols) numeric_text = [] for each in text push!(numeric_text, findfirst(symbols, each)) end numeric_text end; I\u0026rsquo;ll also create a function to map the numerical representation back to the original text.\nfunction numeric_to_text(numeric, symbols) text = [] for num in numeric push!(text, symbols[num]) end text end; Since Markov models do not care about the past and writing style depends on what has already been written, I can group ngram words into each state as a workaround to the model\u0026rsquo;s inherently limited memory. As an aside, it\u0026rsquo;s common to consider ngram values of 1 to 4, and usually as ngram increases, the quality of the text generated by the Markov model does as well.\nIn order to know how to jump from one state to another, I need to generate probabilities for these jumps. For our purposes, I\u0026rsquo;m going to generate an (ngram+1)-dimensional frequency array, P, that contains frequency counts for each sequence of ngram+1 words. I can achieve this through the following:\n Combine all texts into one massive corpus. I then find all unique words in the corpus and assign each word to a unique integer. In the corpus, I extract n = ngram + 1 words at a time, iterating through every word in the text in a sliding-window-like fashion. In each iteration, I associate each word in the ngram+1 words with its corresponding unique number. I then use these numbers as indices in the frequency array, P, and increment that spot in P by one. This loop repeats until I reach the end of the text minus ngram+1 words. 
function get_corpus_frequencies(corpus, ngram; groupby = \u0026#34;words\u0026#34;) # to get frequency of symbol x after ngram symbols ngram = ngram + 1 if groupby == \u0026#34;chars\u0026#34; corpus = split(corpus, \u0026#34;\u0026#34;) else corpus = split(corpus) end # find unique symbols unique_symbols = unique(corpus) # convert text to numbers corpus_numeric = text_to_numeric(corpus, unique_symbols); # create M dimensions = repeat([length(unique_symbols)], outer=[ngram]) M = repeat(zeros(UInt16, 1), outer = dimensions) # get frequencies for ngram of text for i in 1:length(corpus)-ngram+1 M[corpus_numeric[i:i+ngram-1]...] += 1 end M end; M_2 = get_corpus_frequencies(big_corpus_clean, 2) Now P will give me insight into which word, and by inference the state, I should go to next given ngram words and as such, P is a transition matrix which functions as our Markov model. With P, I can determine the next word to jump to by the following:\n Choose ngram words as a starting point. Convert those words to their numerical representations. For x steps, do the following: Look up in P my last ngram words to find probabilities for all possible next words. Create ranges from the above probabilities. Choose a random number, r, from 0 to 1. Find out which range r lands in, and that is the next word to go to. For this, we need a function to choose the next state.\nfunction choose_next_state(distribution, r) # only consider entries that are non-zero nonzero_entries = findn(distribution) distribution_nonzero = distribution[nonzero_entries] ranges = cumsum(distribution_nonzero) for (idx, range) in enumerate(ranges) if r \u0026lt; range return nonzero_entries[idx] end end end Implementation challenges There are a couple of implementation challenges I need to account for. First, I have to handle the fact that there might not be a next state to jump to for some states. Initially, I decided to address this problem by randomly choosing the next state to go to. 
This is better than letting the entire process grind to a halt but still negatively affects sentence structure.\nTo overcome this, I instead use a ngram-dimension version of P to attain a probability for going to another state. If this failed, then I\u0026rsquo;d resort to just randomly choosing the next state. In practice, I believe this \u0026ldquo;trickle-down\u0026rdquo; method of transition matrices is quite effective.\nFor example, if I want to test with trigram words, I\u0026rsquo;ll first create the original 4-dimensional P, P4, but then also a 3-dimensional P, P3, for when there is no probability entry in P4, and create a 2-dimensional P, P2, for when there is no probability entry in P3. This concept can be extended to ngram words and the total number of transition matrices needed to use this method would be ngram.\nBelow is the implementation of this \u0026ldquo;trickle-down\u0026rdquo; logic.\nfunction trickle_down(current_state, M) none_worked = true for (idx, P) in enumerate(M) sigma = convert(Int, sum(P[current_state[idx:end]..., :][:])) if sigma != 0 # avoid division by 0 error distribution = P[current_state[idx:end]..., :][:] / sum(P[current_state[idx:end]..., :][:]) r = rand() next_word_idx = choose_next_state(distribution, r) none_worked = false break end end if none_worked # just choose next state at random next_word_idx = rand(1:length(M[1][current_state..., :][:])) end next_word_idx end Now finally I build the machinery that uses the transition matrix that I generated from the Mark Twain books.\nfunction markov_model(ϕ, num_steps, unique_symbols, ngram, M, groupby) if groupby == \u0026#34;chars\u0026#34; ϕ = split(ϕ, \u0026#34;\u0026#34;) else ϕ = split(ϕ) end # create empty array to store result of Markov jumping from state to state markov_chain_text = [] append!(markov_chain_text, ϕ) current_state = text_to_numeric(ϕ, unique_symbols) # \u0026#34;trickle-down\u0026#34; transition matrices for step in 1:num_steps next_word_idx = 
trickle_down(current_state, M) next_word = numeric_to_text([next_word_idx], unique_symbols)[1] push!(markov_chain_text, next_word) current_state = text_to_numeric(markov_chain_text[end-ngram+1:end], unique_symbols) end markov_chain_text end Now I\u0026rsquo;m just going to write a function to run this Markov model-based text generator.\nfunction run(corpus, M; num_steps = 10, ngram = 2, groupby = \u0026#34;words\u0026#34;) unique_symbols = unique(split(corpus)) # choose random ngram set of symbols from text ϕ = get_phi(corpus, ngram, groupby = groupby) @show ϕ markov_chain_text = markov_model(ϕ, num_steps, unique_symbols, ngram, M, groupby) join(markov_chain_text, \u0026#34; \u0026#34;) end function get_phi(cleaned_corpus, ngram; groupby = \u0026#34;words\u0026#34;) if groupby == \u0026#34;chars\u0026#34; cleaned_corpus_array = split(cleaned_corpus, \u0026#34;\u0026#34;) else cleaned_corpus_array = split(cleaned_corpus) end starting_point = rand(1:length(cleaned_corpus_array)-ngram) ϕ = join(cleaned_corpus_array[starting_point:starting_point+ngram-1], \u0026#34; \u0026#34;) end Now let\u0026rsquo;s run it! 
Hopefully the output sounds a bit like Mark Twain!\nngram2results = run(big_corpus_clean, M_2, num_steps = 200, ngram = 2); Results For ngram = 2 with an initial state of \u0026ldquo;high boardfence\u0026rdquo;, we get promising results from the Markov Twain bot:\n \u0026ldquo;high boardfence and disappeared over it his aunt polly asked him questions that were full of these wonderful things and many a night as he went on and told me all about such things jim would happen in and say hm what you know bout witches and that when he came home emptyhanded at night he knew the model boy of the window on to know just how long he can torment me before i could budge it was all right because she done it herself her sister miss watson a tolerable slim old maid with goggles on had just come to live with her and took a set at me now with a string and said it was minutes and minutes that there warnt a sound that a ghost makes when it was to see with gay banners waving from every balcony and housetop and splendid pageants marching along by night it was to go around all day long with you id made sure youd played hookey and been aswimming but i bet you ill she did not wait for the rest as he lay in the part where tom canty lapped in silks and satins unconscious of all this fuss\u0026rdquo;\n ","date":"2017-05-17T17:58:02Z","title":"Story Time with Markov Twain","uri":"https://www.bobbywlindsey.com/2017/05/17/story-time-with-markov-twain/"},{"categories":["Math"],"content":"Millions of Americans play the lotto each year as an innocent way to gamble their savings on winning some money and if they lose, help to fund government programs like schools. Some studies coupled with an excellent report by John Oliver show that States actually anticipate lottery earnings and instead of adding that extra money on top of the base budget for programs like schools, they just replace the base budget with lotto money and route the rest of the budget towards other Stately concerns. 
But this post is not about public policy but about gambling because, after all, the lotto is definitely a game of chance.\nI have a few friends who play Lotto Texas somewhat religiously and I\u0026rsquo;ve always been skeptical of the game. So, I decided to actually calculate how much money they should expect to win per ticket they buy. The first step, collect data.\nWinnings structure (at the time of writing) The rules of Lotto Texas are quite simple:\n choose 6 numbers each number can be a number from 1 to 54 (inclusive) each number you choose must be unique (i.e. don\u0026rsquo;t choose a number you\u0026rsquo;ve already chosen) order doesn\u0026rsquo;t matter If you play the base game, you spend 1 dollar per ticket. If you play the Extra! game, you pay 2 dollars per ticket.\nThe potential earnings are below:\n Number Correct Prize Amount Total Prize w/Extra! 6 of 6 $12 million $12 million 5 of 6 $2,084 $12,084 4 of 6 $54 $154 3 of 6 $3 $13 2 of 6 N/A $2 Expected value The expected value of something is an idea in mathematics that tells you a value you should expect given enough tries. In our lotto case, we\u0026rsquo;re trying to find the expected value of a ticket (i.e. if we played the lotto enough times and calculated our earnings and losses, what would be the average value of each ticket we invested in?). For example, if you buy a 2 dollar ticket, but the expected value of that ticket is only 1 dollar, then over time, you should expect to lose a dollar every time you play - that\u0026rsquo;s a sucker\u0026rsquo;s bet. So what\u0026rsquo;s the expected value of a Lotto Texas ticket? Let\u0026rsquo;s find out.\nSince there are 54 numbers to choose from and only 6 to choose, we have 25,827,165 ways to choose 6 numbers. Winning 12 million dollars means we get all 6 of those numbers right which gives us the probability $\\frac{1}{25,827,165}$. 
Below are the rest of the probabilities:\n$\\mathrm{P}(\\mathrm{5\\ of\\ 6}) = \\frac{\\binom{6}{5} \\binom{48}{1}}{\\binom{54}{6}} = \\frac{288}{25,827,165} = \\frac{32}{2,869,685}$\n$\\mathrm{P}(\\mathrm{4\\ of\\ 6}) = \\frac{\\binom{6}{4} \\binom{48}{2}}{\\binom{54}{6}} = \\frac{16,920}{25,827,165} = \\frac{376}{573,937}$\n$\\mathrm{P}(\\mathrm{3\\ of\\ 6}) = \\frac{\\binom{6}{3} \\binom{48}{3}}{\\binom{54}{6}} = \\frac{345,920}{25,827,165} = \\frac{69,184}{5,165,433}$\n$\\mathrm{P}(\\mathrm{2\\ of\\ 6}) = \\frac{\\binom{6}{2} \\binom{48}{4}}{\\binom{54}{6}} = \\frac{2,918,700}{25,827,165} = \\frac{64,860}{573,937}$\nMy friends play the base game, so they only spend 1 dollar per ticket. If we want to calculate the expected earnings per ticket they buy (minus the 1 dollar they spend), we need to take the probabilities we calculated above and multiply each of them by their respective prize amount (the 2nd column of the table).\n$$ \\mathrm{E(ticket)} = 11,999,999(\\frac{1}{25,827,165}) + 2,083(\\frac{32}{2,869,685}) + 53(\\frac{376}{573,937}) + 2(\\frac{69,184}{5,165,433}) \\approx 0.55 $$\nYikes, that\u0026rsquo;s an expected earnings of 55 cents per dollar spent. Not much of an earnings, more like a loss. So over time, my friends can expect to lose almost half of the money they spend playing Lotto Texas.\nIf they played the Extra! version of the game, their expected earnings (minus the $2 they\u0026rsquo;d spend) per ticket bought would be below:\n$$ \\mathrm{E(Extra!\\ ticket)} = 11,999,998(\\frac{1}{25,827,165}) + 12,082(\\frac{32}{2,869,685}) + 152(\\frac{376}{573,937}) + 11(\\frac{69,184}{5,165,433}) + 0(\\frac{64,860}{573,937}) \\approx 0.85 $$\nThat\u0026rsquo;s a better expected value if you play Extra! but you\u0026rsquo;re still losing money over time. 
And these calculations are based upon the assumption that you don\u0026rsquo;t have to split any of the prize money because if you do, these expected value calculations only get worse.\nConclusion Playing the Lotto isn\u0026rsquo;t a crime, but it\u0026rsquo;s best to know your odds of winning before you play. And if you still want to bet, then all power to you. But Lotto Texas is a sucker\u0026rsquo;s bet.\n","date":"2017-01-17T18:38:25Z","title":"Lotto Texas: A Sucker’s Bet","uri":"https://www.bobbywlindsey.com/2017/01/17/lotto-texas/"}]