kstest2 (Statistics Toolbox)

Kolmogorov-Smirnov test to compare the distribution of two samples

Syntax

H = kstest2(X1,X2)
H = kstest2(X1,X2,alpha,tail)
[H,P,KSSTAT] = kstest2(X1,X2,alpha,tail)

Description

H = kstest2(X1,X2) performs a two-sample Kolmogorov-Smirnov test to compare the distributions of values in the two data vectors X1 and X2 of length n1 and n2, respectively. The null hypothesis for this test is that X1 and X2 have the same continuous distribution. The alternative hypothesis is that they have different continuous distributions. The result H is 1 if we can reject the hypothesis that the distributions are the same, or 0 if we cannot reject that hypothesis. We reject the hypothesis if the test is significant at the 5% level.

For each potential value x, the Kolmogorov-Smirnov test compares the proportion of X1 values less than x with the proportion of X2 values less than x. The kstest2 function uses the maximum difference over all x values as its test statistic. Mathematically, this can be written as

max(|F1(x) - F2(x)|)

where F1(x) is the proportion of X1 values less than or equal to x and F2(x) is the proportion of X2 values less than or equal to x. Missing observations, indicated by NaNs, are ignored.
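
As a rough illustration of this definition, the two-sided statistic can be computed by hand in base MATLAB. This is a minimal sketch rather than the toolbox implementation; the data and the names pts, F1, F2, and ksstat are hypothetical.

```matlab
% Hypothetical data; any two numeric vectors would do.
x1 = [0.1 0.4 0.7 NaN 1.2];
x2 = [0.3 0.5 0.9 1.5 2.0 2.4];

% Drop missing observations (NaNs), as the text describes.
x1 = x1(~isnan(x1));
x2 = x2(~isnan(x2));

% The empirical proportions only change at observed values, so it is
% enough to evaluate them at the pooled sample points.
pts = sort([x1(:); x2(:)]);
F1 = zeros(size(pts));
F2 = zeros(size(pts));
for i = 1:numel(pts)
    F1(i) = mean(x1 <= pts(i));   % proportion of X1 values <= x
    F2(i) = mean(x2 <= pts(i));   % proportion of X2 values <= x
end

ksstat = max(abs(F1 - F2));       % 0.5 for this data, reached at x = 1.2
```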

H = kstest2(X1,X2,alpha,tail) specifies the significance level alpha and a code tail for the type of alternative hypothesis. Here F1(x) and F2(x) are the proportions of X1 and X2 values less than or equal to x. If tail = 0 (the default), kstest2 performs a two-sided test with the general alternative F1(x) ≠ F2(x). If tail = -1, the alternative is that F1(x) < F2(x). If tail = 1, the alternative is F1(x) > F2(x). The form of the test statistic depends on the value of tail as follows:

```
tail =  0:  max|F1(x) - F2(x)|
tail = -1:  max[F2(x) - F1(x)]
tail =  1:  max[F1(x) - F2(x)]
```
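
The calls below sketch how the three tail codes might be compared on one pair of samples. They use the positional alpha and tail arguments exactly as documented on this page; the data and variable names are hypothetical, and a release with a different calling convention would need adjusting.

```matlab
% Hypothetical samples; x2 is shifted so the one-sided tests differ.
x1 = randn(30,1);
x2 = randn(40,1) + 0.5;

alpha = 0.01;                             % test at the 1% level instead of 5%

[h0,p0,k0] = kstest2(x1,x2,alpha,0);      % two-sided test (the default tail)
[hm,pm,km] = kstest2(x1,x2,alpha,-1);     % one-sided test, tail = -1
[hp,pp,kp] = kstest2(x1,x2,alpha,1);      % one-sided test, tail = 1
```

Because each one-sided statistic is the largest gap in one direction, the two-sided statistic k0 should equal max(km,kp).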

[H,P,KSSTAT] = kstest2(...) also returns the observed p-value P, and the Kolmogorov-Smirnov test statistic KSSTAT defined above for the test type indicated by tail.

The asymptotic p-value becomes very accurate for large sample sizes, and is believed to be reasonably accurate for sample sizes n1 and n2 such that (n1*n2)/(n1 + n2) >= 4.
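
To make the sample-size rule concrete, the sketch below evaluates (n1*n2)/(n1 + n2) for samples of 7 and 20 values, as in the example on this page, and then applies one widely used asymptotic approximation to the two-sided p-value. That kstest2 uses exactly this series and small-sample correction is an assumption; the names n, lambda, and j are hypothetical.

```matlab
n1 = 7;  n2 = 20;      % sample sizes from the example on this page
ksstat = 4/7;          % the statistic reported there (0.5714)

% Effective sample size used by the rule of thumb.
n = (n1*n2)/(n1 + n2);         % 140/27, about 5.19
isReasonable = (n >= 4);       % true here

% Assumed approximation: Kolmogorov series with a corrected lambda.
lambda = (sqrt(n) + 0.12 + 0.11/sqrt(n)) * ksstat;
j = (1:101)';
p = 2 * sum((-1).^(j-1) .* exp(-2 * lambda^2 * j.^2));
p = min(max(p, 0), 1);         % clamp to a valid probability
```

Under these assumptions p comes out at about 0.0403, which agrees with the value shown in the example.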

Examples

Let's compare the distributions of a small evenly-spaced sample and a larger normal sample:

x = -1:1:5;
y = randn(20,1);
[h,p,k] = kstest2(x,y)
h =
     1
p =
    0.0403
k =
    0.5714

The difference between their distributions is significant at the 5% level (p = 4%). To visualize the difference, we can overlay plots of the two empirical cumulative distribution functions. The Kolmogorov-Smirnov statistic is the maximum difference between these functions. After changing the color and line style of one of the two curves, we can see that the maximum difference appears to be near x = 1.9. We can also verify that the difference equals the k value that kstest2 reports:

cdfplot(x)
hold on
cdfplot(y)
h = findobj(gca,'type','line');
set(h(1),'linestyle',':','color','r')
1 - 3/7
ans =
      0.5714
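
Rather than reading the location of the largest gap off the plot, it can also be found numerically. The following is a minimal sketch in base MATLAB; y is a short, hypothetical stand-in for the random sample, so its result differs from the output above, and the names pts, Fx, Fy, and xAtMax are illustrative.

```matlab
x = -1:1:5;
y = [-1.3 -0.8 -0.4 -0.1 0.2 0.5 0.9 1.6];   % hypothetical, not randn output

% Evaluate both empirical proportions at every pooled sample point.
pts = sort([x(:); y(:)]);
Fx = arrayfun(@(t) mean(x <= t), pts);
Fy = arrayfun(@(t) mean(y <= t), pts);

% Largest absolute gap and the point where it occurs.
[k, idx] = max(abs(Fx - Fy));
xAtMax = pts(idx);             % 0.9 for this data, with k = 33/56
```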

See Also

kstest, lillietest
