Monday, April 1, 2019

Tests of Significance: Uses and Limitations

Abstract

Statistical tools are undoubtedly important in decision making. The use of these tools in everyday problems has led to a number of discoveries, conclusions and the enhancement of knowledge, ranging from direct calculations using general statistical formulas to formulas integrated into statistical software to speed up the process of decision making. Statistical tools for testing hypotheses, the significance tests, are powerful, but only if used correctly and with a good understanding of their concepts and limitations. Some researchers have indulged in wrong usage of these tests, leading to wrong conclusions.

This paper looks at the different significance tests (both parametric and non-parametric), their uses, when they should be applied, and their limitations. It also evaluates the use of statistical significance tests in information retrieval and then proceeds to examine the different significance tests used by researchers in the papers submitted to the Special Interest Group on Information Retrieval (SIGIR) in the period 2006, 2007 and 2008. For the combined period 2006-2008, a portion of the papers submitted applied statistical tests, and some of these tests were used wrongly.

Key Words: Significance Test, Information Retrieval, Parametric Tests, Non-parametric Tests, Hypothesis Testing

Chapter One

1.0 Introduction

Statistical methods play a very important role in all aspects of research, ranging from data collection, recording and analysis to making conclusions and inferences. The credibility of the research results and conclusions depends on each and every step mentioned above; any fault made in these steps can render a research carried out over several years, costing millions of shillings, worthless. This does not mean that carrying out just any test and mincing figures shows that statistics has been used in the research; the researcher should be able to justify why he or she used that specific test or method.

Misuse of significance tests is not new in the world of science. According to Campbell (1974), there are different types of statistical misuse:

1.0.1 Discarding unfavourable portions of data

This occurs when the researcher selects only the portion of the data which produces the results that he/she requires while discarding the other portion. After a nearly completed research, the researcher might get values that are not consistent with what he/she was expecting and might decide to ignore this segment of data during the analysis so as to get the expected results. This is a wrong move, since the inconsistent data could give very new insights into that particular field: if these irregularities are checked and it is explained why they occurred, more ideas about that area can be explored.

1.0.2 Overgeneralization

Sometimes the conclusions from a research hold only for that particular research problem, but the researcher might blindly generalize the results obtained to other kinds of research, similar or dissimilar. Overgeneralization is a common mistake in contemporary research activities.
After successfully completing a research in a particular field, a researcher might be tempted to apply the generalizations reached in that research to other fields of study without regard to the different orientations of those communities and the assumptions made in them.

1.0.3 Non-representative sample

This arises when the researcher selects a sample which produces results geared towards his/her liking. The sample selected for a particular study should be one that truly represents the entire population, and the procedure of selecting the sample units to be used in the study should be done in an unbiased manner.

1.0.4 Consciously manipulating data

This occurs when a researcher consciously changes the collected data in order to reach a particular conclusion. It is mainly noticed when the researcher knows exactly what the customer's needs are, so the researcher changes part of the data so that the aim of that research is supported strongly. For example, if a researcher carrying out a regression analysis draws a scatter plot and sees that there are several outliers, the researcher might decide to change several values so that the scatter plot appears as a straight line or something very close to that. This act leads to results which are pleasing to the customer and to the eyes of other users but in the actual sense does not give a clear indication of what is really happening in the population at large.

1.0.5 False correlation

This is found when the researcher claims that one factor causes the other while in the real sense both factors are caused by another hidden factor which was not identified during the study. Correlation studies are common in the social sciences and are sometimes not adequately approached, which leads to wanting results. In a correlation study, say to check whether variable X causes variable Y, there are in fact four possibilities: first, X causes Y; second, Y causes X; third, X and Y are both caused by another unidentified variable, say Z; and lastly, the correlation between X and Y occurred purely by chance. All these possibilities should be checked when doing these kinds of studies to avoid rushing into wrong conclusions.

False causality can be eliminated in studies by using two groups for the same experiment, that is, a control group (the one receiving a placebo) and a treatment group (the one receiving the treatment). Even though this method is effective, implementing it raises many challenges. There are ethical issues, as when one group of patients is given a placebo (an ineffective drug) without their knowledge while the other group is given the real drug; one question that comes to mind is whether it is ethical to do this to the first group. Carrying out the experiment in duplicate for two different groups can also prove to be very expensive.

1.0.6 Overloaded questions

The questions used in a survey can really affect the outcome of the survey. The structure of the questions in a questionnaire and the method of formulating and asking the questions can influence the manner in which the respondent answers them. Long, wordy questions can be too boring for a respondent, who might just fill in the questionnaire in a rush so as to finish it without really caring about the answers provided. The framing of questions can also yield leading questions.
Some questions will simply lead the respondent on what to answer, for example: "The government is not offering security to its citizens, do you agree with this? (Yes or No)"

The use of statistical inference has been with us for more than 300 years (Huberty, 1993). Despite being used for such a long time, this field of decision making is dogged by criticism from all directions, which has led many researchers to write materials digging into the problems of statistical significance testing. Harlow et al. (1997) discussed the controversy in significance testing in depth. Carver (1993) expressed dislike of significance tests and clearly advocated that researchers stop using them. In his book How to Lie with Statistics, Huff (1954) outlined in depth errors, both intentional and unintentional, and misinterpretations made in statistical analyses. Some journals, e.g. those of the American Psychological Association (APA), recommended minimum use of statistical significance tests by researchers submitting papers for publication (APA, 1996), though without revoking the use of the tests.

Amid this heavy criticism, other researchers have not given up on statistical significance testing but have clearly encouraged users of the tests to gain good knowledge of them before drawing conclusions from them. Mohr (1990) discussed these tests and supported their use, but warned researchers to know the limitations of each test and its correct application so as to make correct inferences and conclusions. In his paper, Burr (1960) supported the use of statistical significance tests but requested researchers to make allowances for the existence of statistical errors in the data.

Amidst these controversies, statistical significance testing has been applied to many areas of research and remarkable achievements have been recorded. One such area is information retrieval (IR), where significance tests have been used to compare different retrieval algorithms.

1.1.0 Information retrieval

Information retrieval is defined as the science of searching databases, the World Wide Web and other document collections for information on a particular subject. In order to get information, the user is required to supply keywords to be used for searching; a set of objects containing the keywords is usually returned, from which the user can single out the one that gives him or her the needed information. The user usually refines the search progressively by narrowing it down and using more specific words. Information retrieval has developed as a highly dynamic and experimental discipline, requiring careful and thorough evaluation to show the superior effect of different new techniques on representative document collections.

There are many algorithms for information retrieval. It is usually important to measure the performance of different information retrieval systems so as to know which one gives the required information faster.
In order to measure information retrieval effectiveness, three test components are required:

(i) A collection of documents on which the different retrieval methods will be run and compared.

(ii) A test collection of information needs, expressible in terms of queries.

(iii) A collection of relevance judgments that will distinguish whether the results returned are relevant to the person doing the search or are irrelevant.

A question might arise as to which collection of objects should be used in testing different systems. There are several standard test collections used universally; these include:

(i) Text Retrieval Conference (TREC). This is a standard collection comprising 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data.

(ii) GOV2. This was developed by the U.S. National Institute of Standards and Technology (NIST). It is a 25 million page collection of web pages.

(iii) NII Test Collections for IR Systems (NTCIR). This is also a large test collection, focused mainly on East Asian language and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages.

(iv) Cross Language Evaluation Forum (CLEF). This test collection is mainly focused on European languages and cross-language information retrieval.

(v) 20 Newsgroups. This text collection was collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.

(vi) The Cranfield collection. This is the oldest test collection allowing precise quantitative measures of information retrieval effectiveness, but it is nowadays too small for anything but the most elementary pilot experiments. It was collected in the United Kingdom starting in the late 1950s and contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.

There exist several methods of measuring the performance of retrieval systems, namely Precision, Recall, Fall-out, E-measure and F-measure, to mention a few, since researchers keep coming up with new methods. A brief description of each method will shed some light.
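Before turning to the individual measures, it may help to see the three components above in concrete form. The following minimal Python sketch, with entirely invented documents, queries and judgments, shows one way such a test collection is often represented:

# A toy test collection: documents, information needs (queries)
# and relevance judgments. All contents are invented for illustration.

documents = {
    "d1": "statistical significance tests in information retrieval",
    "d2": "aerodynamics of high speed aircraft",
    "d3": "parametric and non-parametric hypothesis testing",
}

queries = {
    "q1": "significance testing",
    "q2": "aerodynamics",
}

# For each query, the set of documents judged relevant by an assessor.
relevance_judgments = {
    "q1": {"d1", "d3"},
    "q2": {"d2"},
}

Real collections such as TREC distribute the same three components as files in standard formats, but the structure is the same: documents, topics, and judgments keyed to both.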
1.1.1 Recall

Recall in information retrieval is defined as the number of relevant documents returned from a search divided by the total number of relevant documents in the database. Recall can also be looked at as evaluating how well the method being used to retrieve information finds the required information.

Let A be the set of all retrieved objects and B be the set of all relevant objects. Then

Recall = |A ∩ B| / |B|    (1.1)

As an example, suppose a database contains 500 documents, out of which 100 contain relevant information required by a researcher; the complement, the number of documents not required, is 400. If the researcher uses a system to search the documents in this database and it returns 100 documents, all of which are relevant, then the recall is given by

Recall = 100 / 100 = 1.0

Supposing instead that out of 120 returned documents 30 are irrelevant, the recall would be given by

Recall = 90 / 100 = 0.9

1.1.2 Precision

Precision is defined as the number of relevant documents retrieved from the system divided by the total number of documents retrieved in that search. It evaluates how well the method being used to retrieve information filters out the unwanted information.

With A the set of all retrieved objects and B the set of all relevant objects as before,

Precision = |A ∩ B| / |A|    (1.2)

Using the same example, if the system returns 100 documents, all of which are relevant, then the precision is given by

Precision = 100 / 100 = 1.0

Supposing instead that out of 120 returned documents 30 are irrelevant, the precision would be given by

Precision = 90 / 120 = 0.75
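As a minimal sketch of formulas (1.1) and (1.2), the following Python functions compute recall and precision from sets of document identifiers; the example data reproduces the 120-document case worked above:

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved (1.1)."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant (1.2)."""
    return len(retrieved & relevant) / len(retrieved)

# 100 relevant documents in the database; the system returns 120
# documents of which 30 are irrelevant, so 90 relevant ones are found.
relevant = {"d%d" % i for i in range(100)}
retrieved = {"d%d" % i for i in range(90)} | {"x%d" % i for i in range(30)}

print(recall(retrieved, relevant))     # 0.9
print(precision(retrieved, relevant))  # 0.75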
Both precision and recall are anchored on one term: relevance. The Oxford dictionary defines relevance as being connected to the issue being discussed. Yolanda Jones (2004) identified three types of relevance, namely subject relevance, the connection between the subject submitted via a query and the subject covered by the returned texts; situational relevance, the connection between the situation being considered and the texts returned by the database system; and motivational relevance, the connection between the motivations of a researcher and the texts returned by the database system.

There are two measures of relevance:

Novelty ratio: the proportion of items returned from a search and acknowledged by the user as relevant, of which the user was previously unaware.

Coverage ratio: the proportion of items returned from a search out of the total relevant documents that the user was aware of before he/she started the search.

Precision and recall affect each other, i.e. a gain in recall typically decreases precision. If one increases a system's ability to retrieve more documents, thereby increasing recall, there is a drawback: the system will also retrieve more irrelevant documents, reducing its precision. This means that a trade-off is required between these two measures so as to ensure better search results.

Precision and recall make use of the following assumptions:

They assume that a system either returns a document or does not.

They assume that a document is either relevant or not relevant, nothing in between.

New methods are being introduced by researchers which rank the degree of relevance of the documents.

1.1.3 Receiver Operating Characteristic (ROC) Curve

This is the plot of the true positive rate, or sensitivity, against the false positive rate, or (1 - specificity). Sensitivity is just another term for recall. The false positive rate is given by fp / (fp + tn), where fp is the number of irrelevant documents retrieved (false positives) and tn the number of irrelevant documents not retrieved (true negatives). An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unordered result sets, specificity, given by tn / (fp + tn), was not seen as a very helpful idea, because the set of true negatives is always so large that its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0).

1.1.4 F-measure and E-measure

The F-measure is defined as the weighted harmonic mean of the recall R and precision P. Numerically, it is defined as

F = (1 + β²) · P · R / (β² · P + R)    (1.3)

where β is the weight. If β is assumed to be 1, then

F = 2 · P · R / (P + R)    (1.4)

The E-measure is given by

E = 1 - F    (1.5)

The F-measure has a maximum value of 1.0, with 1.0 being the best; the E-measure correspondingly has a minimum of 0, with lower values being better.

1.1.5 Fall-out

This is defined as the proportion of irrelevant documents that are returned in a search out of all the possible irrelevant documents:

Fall-out = |A ∩ B'| / |B'|    (1.6)

where B' is the complement of B, the set of irrelevant documents. Fall-out can also be viewed as the probability of a system retrieving an irrelevant document.
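The following sketch implements formulas (1.3) to (1.6), continuing the set-based representation used above; the collection size passed to fall_out is whatever the total number of documents in the database is:

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (1.3);
    beta = 1 gives the unweighted form (1.4)."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

def e_measure(p, r, beta=1.0):
    """E-measure (1.5): one minus the F-measure, so lower is better."""
    return 1.0 - f_measure(p, r, beta)

def fall_out(retrieved, relevant, collection_size):
    """Retrieved irrelevant documents over all irrelevant documents (1.6)."""
    irrelevant_total = collection_size - len(relevant)
    return len(retrieved - relevant) / irrelevant_total

p, r = 0.75, 0.9            # precision and recall from the example above
print(f_measure(p, r))      # about 0.818
print(e_measure(p, r))      # about 0.182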
These are just a few of the methods for measuring the performance of search systems. After evaluating one system, there arises the problem of comparing two systems or algorithms, that is: is this system better than the other one? To answer this question, scientists in information retrieval use statistical significance tests to make the comparisons, in order to establish that the differences in the systems' performance did not arise by chance. These tests are used to support the conclusion that one system is better than another.
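As an illustration of such a comparison, the sketch below applies a paired t-test, one of the tests examined later in this paper, to invented per-query scores of two hypothetical systems; since both systems are run on the same queries, the observations are paired. It assumes SciPy is available:

from scipy import stats

# Average precision per query for two hypothetical retrieval systems
# evaluated on the same ten queries (scores invented for illustration).
system_a = [0.61, 0.42, 0.75, 0.33, 0.58, 0.49, 0.80, 0.37, 0.66, 0.52]
system_b = [0.55, 0.40, 0.71, 0.30, 0.50, 0.45, 0.78, 0.35, 0.60, 0.47]

# Paired t-test on the per-query differences; its validity rests on
# the differences being roughly normally distributed.
t_stat, p_value = stats.ttest_rel(system_a, system_b)
print(t_stat, p_value)

# The difference is declared significant at the 5% level only if
# p_value < 0.05; otherwise it may well be due to chance.

Which test is appropriate for a given experiment, and what its assumptions imply, is exactly the subject taken up in the chapters that follow.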
Statement of the problem

Statistical inference tools like statistical significance tests are important in decision making, and their use has been on the rise in different areas of research. With this rise, novice users make use of these tools, but sometimes in questionable ways. There are many researchers who have not learned the basic concepts of statistics, leading to misuse of the tools, and any conclusions reached by a research might be termed bogus if the statistical tests used in it are shoddy. More light needs to be shed on this area of research to ensure correct use of these tests. Researchers in information retrieval also use these tests to compare systems and algorithms; are the conclusions from these tests truly correct? Are there other methods of comparison which minimize the use of statistical tests?

Objectives of the study

The objectives of this study are to:

Investigate the use and misuse of statistical significance tests in scientific papers submitted by researchers to SIGIR.

Shed light on the different statistical significance tests: their use, assumptions and limitations.

Identify the most important statistical concepts that can provide solutions to the problems of statistical significance in scientific papers submitted by researchers to SIGIR.

Investigate the reality of the problems of statistical significance in scientific papers submitted by researchers to SIGIR.

Investigate the use of statistical significance tests by researchers in information retrieval.

Examine the availability of statistical concepts and methods that can provide solutions to the problems of statistical significance in scientific papers submitted by researchers to SIGIR.

Chapter Two

This section of the paper is divided into three major parts: the first, on sample selection and sample size, discusses methods of selecting a sample and choosing the size of the sample to be used in a given research; the second deals with statistical analysis methods and procedures, mainly in significance testing; and the third discusses other statistical methods that can be used in place of statistical significance tests.

2.0 Sample Selection and Sample Size

2.0.1 Sample selection

Sampling plays a major role in research. According to Cochran (1977), sampling is the process of selecting a portion of the population and using the information derived from this portion to make inferences about the entire population. Sampling has several advantages, namely:

(i) Reduced cost. It is far more expensive to carry out a census than to collect information from a small portion of the population, since only a small number of measurements will be made and only a few people need be hired to do the job, compared to a complete census which requires a large labor force.

(ii) Greater speed (less time). Since only a few items will be measured, the time for doing the measurement is reduced, and summarization of the data is quick, as opposed to when measurements are taken for the whole population.

(iii) Greater accuracy. Since only a few units are considered in the process, the researchers can be very thorough, whereas covering the entire population would see the researchers get tired in the middle of the process, leading to careless collection of data and shoddy analysis.

The choice of the sampling units in a given research may affect the credibility of the whole research. The researcher must make sure that the sample being used is not biased, that is, that it represents the whole population. There are several methods of selecting samples to be used in a study; a researcher should always make sure that the sample drawn is large enough to be representative of the population as a whole and at the same time manageable. In this section the two major types of sampling, random and non-random, will be examined.

2.0.1.1 Random sampling

In random sampling, all the items or individuals in the population have equal chances of being selected into the sample. This procedure ensures that no bias is introduced during the selection of sample units, since an item's selection is only by chance and does not depend on the person assigned the duty of coming up with the sample. There exist five major random sampling techniques, namely simple random sampling, multi-stage sampling, stratified sampling, cluster sampling and systematic sampling. The following sections discuss each of these.

2.0.1.1.1 Simple random sampling

In simple random sampling, each item in the population has the same and equal chance of being included in the sample. Usually each sampling unit is assigned a unique number, numbers are generated using a random number generator, and a sampling unit is included in the sample if its corresponding number is generated. One advantage attributed to simple random sampling is its simplicity and ease of application when dealing with small populations. However, every entity in the population has to be enlisted and given a unique number and then its respective random number read, which makes this method of sampling very tedious and cumbersome, especially where large populations are involved.
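A minimal sketch of simple random sampling, with an invented population of 500 numbered units; Python's random.sample draws without replacement, so each unit has an equal chance of inclusion and none is picked twice:

import random

random.seed(42)  # fixed seed so the draw can be reproduced

population = list(range(1, 501))        # 500 numbered sampling units
sample = random.sample(population, 20)  # simple random sample of size 20
print(sorted(sample))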
2.0.1.1.2 Stratified sampling

In stratified random sampling, the entire population is first divided into N separate subpopulations, with each sampling unit belonging to one and only one subpopulation. These subpopulations are called strata; they might be of different sizes, they are homogeneous within, and each stratum differs from the others. It is from these strata that samples are drawn for a particular study. Examples of commonly used strata include states, provinces, age, sex, religion, academic ability and marital status.

Stratification is most useful when the stratifying variables are simple to work with, easy to observe and closely related to the topic of the survey (Sheskin, 1997). Stratification can be used to select more of one group than another; this may be done if it is felt that the responses vary more in one group than in another. If the researcher knows that every entity in a group has much the same value, he/she will only need a small sample to get information for that group, whereas in another group the values may differ widely and a bigger sample is needed. If you want to combine group-level information to get an answer for the whole population, you have to take account of what proportion you selected from each group. This method is mainly used when information is required for only a particular subdivision of the population, when administrative convenience is an issue, or when the sampling problems differ greatly in different portions of the population of study.

2.0.1.1.3 Systematic sampling

Systematic sampling is quite different from the other methods of sampling. Supposing the population contains N units and a sample of n units is required, a random number, call it k, is generated using the random number generator, a unit (represented as a number) is drawn as a starting point, and the researcher then picks every kth unit thereafter. For example, if k is 20 and the first unit drawn is 5, the subsequent units will be 25, 45, 65, 85 and so on. The implication of this method is that the selection of the whole sample is determined by only the first item, since the rest are obtained sequentially; this is called an every-kth systematic sample. The technique can also be used when questioning people in a sample survey: a researcher might select every 15th person who enters a particular store, after selecting a person at random as a starting point, or interview the shopkeepers of every 3rd shop in a street, after selecting a starting shop at random.

It may be that a researcher wants to select a fixed-size sample, in which case it is first necessary to know the size of the whole population from which the sample is being selected. The appropriate sampling interval, I, is then calculated by dividing the population size, N, by the required sample size, n. This method is advantageous since it is easy and it is more precise than simple random sampling. It is also simpler in systematic sampling to select one random number and then every kth member on the list than to select as many random numbers as the sample size, and it gives a good spread right across the population. A disadvantage is that the researcher may be forced to have a complete list at the start if he/she wishes to know the sample size and calculate the sampling interval.

2.0.1.1.4 Cluster sampling

The Australian Bureau of Statistics states that cluster sampling divides the population into groups, or clusters. A number of clusters are selected randomly to represent the population, and then all units within the selected clusters are included in the sample; no units from non-selected clusters are included, as they are represented by those from the selected clusters. This differs from stratified sampling, where some units are selected from each group. The clusters are heterogeneous within (that is, the sampling units inside a cluster vary from each other) while the clusters resemble one another. Cluster sampling has several advantages, including reduced costs, simplified field work and more convenient administration: instead of having a sample spread over the entire coverage region, the sample is concentrated in relatively few collection points (clusters). However, cluster sampling provides results that are less accurate compared to stratified random sampling.

2.0.1.1.5 Multi-stage sampling

Multi-stage sampling is like cluster sampling, but involves selecting a sample within each chosen cluster, rather than including all units in the cluster. The Australian Bureau of Statistics notes that multi-stage sampling involves selecting a sample in at least two stages. In the first stage, large groups or clusters are selected; these clusters are designed to contain more population units than are required for the final sample. In the second stage, population units are chosen from the selected clusters to derive the final sample. If more than two stages are used, the process of choosing population units within clusters continues until the final sample is achieved. If two stages are used it is called two-stage sampling, if three stages are used it is called three-stage sampling, and so on.
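The sketch below illustrates stratified and systematic selection as described above, on an invented population; the stratified draw uses proportional allocation (stratum sample sizes proportional to stratum sizes), which is one common choice among several:

import random

random.seed(7)

# Stratified sampling: sample each stratum separately, here in
# proportion to its share of the population (proportional allocation).
strata = {
    "urban": list(range(0, 300)),    # 300 units
    "rural": list(range(300, 500)),  # 200 units
}
total = sum(len(units) for units in strata.values())
sample_size = 50
stratified_sample = []
for units in strata.values():
    k = round(sample_size * len(units) / total)
    stratified_sample.extend(random.sample(units, k))

# Systematic sampling: random start, then every I-th unit,
# where I = N / n is the sampling interval described above.
population = list(range(1, 501))
interval = len(population) // 25     # N = 500, n = 25, so I = 20
start = random.randrange(interval)   # random starting point
systematic_sample = population[start::interval]

print(len(stratified_sample), len(systematic_sample))  # 50 25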
2.0.2 Determination of sample size to be used

2.1 Statistical Analysis

In this section, different statistical tests are discussed in detail in their general form; we then move on to discuss how each of them (the ones used in IR) is applied to information retrieval. Only some of these tests are used to compare systems and/or algorithms. In this paper we look at three areas of statistical analysis, namely:

(i) Summarizing data using a single value.

(ii) Summarizing variability.

(iii) Summarizing data using an interval (no specific value).

In the first case we have the mean, mode, median value, etc.; in the second case we look at variability in the data; and in the third case we look at confidence intervals and parametric and non-parametric tests of hypothesis testing.

2.1.1 Summarizing data using a single value

In this case, the data being analysed is represented by a single value. Examples for this scenario are discussed below.

2.1.1.1 Mean

There are three different kinds of mean:

(i) Arithmetic mean

(ii) Geometric mean

(iii) Harmonic mean

(i) Arithmetic mean. This is computed by summing all the observations and then dividing by the number of observations collected. Let x1, x2, ..., xn be n observations of a random variable X. The arithmetic mean is defined as

Arithmetic mean = (x1 + x2 + ... + xn) / n

The arithmetic mean is used when the collected data consists of numeric observations, when the data has only one mode (uni-modal), when the data is not skewed (i.e. not concentrated at extreme values), and when the data does not have many outliers (very extreme values). It is not used when the data is categorical or extremely skewed.

(ii) Geometric mean. This is defined as the product of the observations raised to the power of 1/n. Let x1, x2, ..., xn be n observations of a random variable X. The geometric mean is defined as

Geometric mean = (x1 · x2 · ... · xn)^(1/n)

The geometric mean is used when the observations are numeric and the item of interest is the product of the observations.

(iii) Harmonic mean. This is defined as the number of observations divided by the sum of the reciprocals of the observations. Let x1, x2, ..., xn be n observations of a random variable X. The harmonic mean is defined as

Harmonic mean = n / (1/x1 + 1/x2 + ... + 1/xn)

The harmonic mean is used when an average can be justified for the reciprocals of the observations.

2.1.1.2 Median

This is defined as the middle value of the observations: the observations are first arranged in ascending or descending order and the middle value is taken as the median. The median is used when the observations are skewed, have a single mode and are numerical. It is not used when we are interested in the total value.

2.1.1.3 Mode

This is defined as the value that has the highest frequency of occurrence in the given dataset. The mode is used when the dataset is categorical, or when the dataset is numeric and multimodal.

2.1.2 Summarizing variability

Variation in data can be summarized using the following measures.

2.1.2.1 Sample variance

Let x1, x2, ..., xn be n observations of a random variable X with arithmetic mean x̄. Then the sample variance s² is given by

s² = ((x1 - x̄)² + (x2 - x̄)² + ... + (xn - x̄)²) / (n - 1)

The standard deviation, the square root of the variance, is used when the data is normally distributed.

2.1.2.2 The C
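Finally, a minimal sketch of the summary measures defined in section 2.1, using Python's standard statistics module (geometric_mean requires Python 3.8 or later) on invented observations:

import statistics

x = [4.0, 5.0, 5.0, 6.0, 10.0]  # invented observations

print(statistics.mean(x))            # arithmetic mean: sum of x over n
print(statistics.geometric_mean(x))  # product of x raised to 1/n
print(statistics.harmonic_mean(x))   # n over the sum of reciprocals
print(statistics.median(x))          # middle value of the sorted data
print(statistics.mode(x))            # most frequent value: 5.0
print(statistics.variance(x))        # sample variance, n - 1 divisor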
