Independence of events in real-life data













Most statistical methods (if not all) rely on independence of events. How do we know that this assumption is valid in real-life problems such as clinical trials or web crawling? What are the consequences of statistically modelling data that violate the independence assumption when we are not aware of the violation?










      machine-learning estimation inference independence bias






edited 3 hours ago by kjetil b halvorsen










asked 4 hours ago by WoofDoggy






















          2 Answers
Often the question "are the events independent?" is the wrong question. The observations we want to analyze are represented in some model as random variables, and whether we should model them as independent is a modeling decision.



A better question to ask is often: are the events exchangeable? This means that the random variables play symmetric roles: a priori (given our state of knowledge) there is no reason to believe that, say, $X_1$ should be larger than $X_2$, or the opposite. Formally, the joint distribution of $(X_1, \dots, X_n)$ is unchanged by any permutation of the indices. This is typically the case in experiments where the variables represent observations on randomly drawn people about whom we know little (certainly not enough to distinguish between them). Simple random sampling without replacement is a typical example that leads to exchangeability but not independence.
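The sampling-without-replacement example can be checked exactly on a tiny case. The sketch below is only an illustration: the three-ball urn and all variable names are assumptions, not part of the original answer. It enumerates every ordered pair of draws and compares the joint distribution with the product of the marginals.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical urn: two balls marked 1 and one ball marked 0.
urn = [1, 1, 0]

# Draw two balls without replacement; every ordered pair of distinct
# balls is equally likely.
draws = list(permutations(range(len(urn)), 2))
prob = Fraction(1, len(draws))

# Joint distribution of (X1, X2), the marks on the first and second draw.
joint = {}
for i, j in draws:
    key = (urn[i], urn[j])
    joint[key] = joint.get(key, Fraction(0)) + prob

# Marginal probabilities that the first / second draw equals 1.
p_x1 = sum(p for (a, _), p in joint.items() if a == 1)
p_x2 = sum(p for (_, b), p in joint.items() if b == 1)

# Exchangeable: swapping the two positions leaves the joint law unchanged.
exchangeable = all(joint.get((a, b), 0) == joint.get((b, a), 0) for (a, b) in joint)

# Independence would require P(X1=1, X2=1) == P(X1=1) * P(X2=1).
independent = joint.get((1, 1), Fraction(0)) == p_x1 * p_x2

print(joint)
print(p_x1, p_x2, exchangeable, independent)
```

Both draws have the same marginal distribution and the joint law is symmetric, yet the joint probability of two ones differs from the product of the marginals, so the draws are exchangeable without being independent.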



The key point is that there is a theorem, de Finetti's representation theorem, which says that exchangeable random variables can be represented as independent (and identically distributed) random variables conditional on a latent variable. You can treat that latent variable as a parameter in some parametric model, which is then a typical IID model.$^\dagger$
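To make the representation concrete, here is a minimal sketch of the direction "IID given a latent variable implies exchangeable". It assumes a hypothetical two-point prior on a Bernoulli success probability; the prior values and function names are illustrative only. The marginal law of the observations is computed exactly by averaging the IID model over the latent variable.

```python
from fractions import Fraction
from itertools import product

# Hypothetical latent variable: the success probability theta takes one of
# two values with equal prior weight (an assumption made for illustration).
prior = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}

def conditional(xs, theta):
    """P(X_1..X_n = xs | theta) for IID Bernoulli(theta) variables."""
    p = Fraction(1)
    for x in xs:
        p *= theta if x == 1 else 1 - theta
    return p

def marginal(xs):
    """P(X_1..X_n = xs), averaging the IID model over the latent theta."""
    return sum(w * conditional(xs, theta) for theta, w in prior.items())

n = 3
# Exchangeable: the marginal probability depends only on the number of ones,
# so reordering a sequence never changes its probability.
exchangeable = all(
    marginal(xs) == marginal(tuple(sorted(xs)))
    for xs in product([0, 1], repeat=n)
)

p1 = marginal((1,))     # P(X_1 = 1)
p11 = marginal((1, 1))  # P(X_1 = 1, X_2 = 1)
print(exchangeable, p1, p11, p1 * p1)
```

Marginally the variables are exchangeable but dependent (observing one success makes the larger value of theta more plausible), while conditional on theta they are IID by construction.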



Now suppose you enlarge the experiment: instead of running it only with students from your class, you also include students from a class at another university. The complete sample is no longer exchangeable, because you might know of demographic differences between the two student bodies. The two subsamples, however, are still separately exchangeable. If you then construct a model containing an indicator variable coding for university, the argument above again leads to an IID model.
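A rough simulation of the two-university situation, under assumed numbers: the group means, the noise level, and the labels "A" and "B" are all hypothetical. It fits the model with a university indicator by least squares, which for a single binary indicator reduces to the two group means, and then looks at the residuals within each group.

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical scores: each university has its own mean (values assumed).
true_mean = {"A": 60.0, "B": 70.0}
data = [(u, rng.gauss(true_mean[u], 5.0)) for u in ("A", "B") for _ in range(500)]

# Least squares for  score = b0 + b1 * [university == "B"] + error
# reduces to group means: b0 = mean of A, b1 = mean of B - mean of A.
mean_a = mean(y for u, y in data if u == "A")
mean_b = mean(y for u, y in data if u == "B")
b0, b1 = mean_a, mean_b - mean_a

# Residuals after removing the university effect.
residuals = [y - (b0 + b1 * (u == "B")) for u, y in data]
res_a = mean(r for (u, _), r in zip(data, residuals) if u == "A")
res_b = mean(r for (u, _), r in zip(data, residuals) if u == "B")

print(round(b0, 2), round(b1, 2))
print(round(res_a, 6), round(res_b, 6))
```

Once the indicator absorbs the known difference between the universities, the residuals are centred at zero in both groups, which is the sense in which the remaining variation can again be modelled as IID.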



Conclusion: it is better to ask "are my random variables exchangeable?" than to ask about independence directly. A book that takes this route to constructing statistical models (within the Bayesian paradigm) is Bernardo & Smith.



$^\dagger$ Some technical points are left out here; for example, the theorem in its standard form applies to infinite exchangeable sequences.







answered 3 hours ago by kjetil b halvorsen

























First, not all methods rely on independence: paired t-tests, repeated-measures ANOVA, multilevel models, generalized estimating equations, and a whole array of time series methods do not. In fact, they are built for data that are not independent.



Second, we usually don't know that events are independent, but it often makes a lot of sense to assume they are, because there is no plausible source of dependence. Suppose, for example, that I am studying the relationship between political preference and various demographics. If I survey a group of people who are at least roughly randomly selected from some population, it is hard to see how there could be dependence: my political preferences (and their relation to my demographics) are not related to those of some other random person.



                          On the other hand, if we were interested in the role of being a husband or being a wife, we might study married couples. Then the data would certainly be dependent and we would need to use methods that account for this.
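As a hedged illustration of what goes wrong when the pairing is ignored, the sketch below simulates hypothetical couples whose responses share a household effect (every number here is an assumption) and compares the standard error of the mean difference computed as if husbands and wives were independent samples with the one a paired analysis uses.

```python
import random
from math import sqrt
from statistics import mean, variance

rng = random.Random(1)

# Hypothetical couples: a shared household effect makes the two partners'
# responses positively correlated (all numbers are assumed for illustration).
n = 200
husbands, wives = [], []
for _ in range(n):
    shared = rng.gauss(0.0, 2.0)  # component common to both partners
    husbands.append(50.0 + shared + rng.gauss(0.0, 1.0))
    wives.append(50.5 + shared + rng.gauss(0.0, 1.0))

diffs = [h - w for h, w in zip(husbands, wives)]
diff_mean = mean(diffs)

# Standard error that wrongly treats the two groups as independent samples.
se_independent = sqrt(variance(husbands) / n + variance(wives) / n)

# Standard error that respects the pairing (as a paired t-test does).
se_paired = sqrt(variance(diffs) / n)

print(round(diff_mean, 3), round(se_independent, 3), round(se_paired, 3))
```

In this setup the independence assumption overstates the uncertainty of the difference, so a real effect could be missed. With other dependence structures, such as positively correlated observations within a cluster that are treated as independent replicates when estimating a mean, the error typically goes the other way and the standard errors come out too small.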







answered 4 hours ago by Peter Flom





























