• Products
  • Statistics and Data Mining Solutions
  • Statistics and Data Mining Services
  • Statistics and Data Mining Resources
  • Support
  • News and Events
  • Company
Home / Support / Downloads / EnvironmentalStats for S-PLUS Example Session

EnvironmentalStats for S-PLUS Example Session

Here is an example of using EnvironmentalStats for S-PLUS.   The courier font represents what S-PLUS displays, the bold courier font represents what the user types are on the command line.


Explanation of TcCB Data

The guidance document Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media (USEPA, 1994, pp.6.22-6.25) contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) concentrations (ppb) from soil samples at a reference site and a "cleanup" area.  There are 47 observations from the reference site and 77 in the cleanup area.

These data are stored in the data frame epa.94b.tccb.df (see the help file Datasets: USEPA (1994b)).  There is one observation coded as "ND" in this data set as presented in the guidance document.  Here, we’ll assume this observation is less than the smallest observed value, which is 0.09 ppb.  For the purposes of this tutorial, we’ll set this one censored observation to the assumed detection limit of 0.09.

> epa.94b.tccb.df
    TcCB.orig   TcCB Censored      Area
  1      0.22   0.22        F Reference
  2      0.23   0.23        F Reference
  3      0.26   0.26        F Reference
 .       .      .           .    .
122     18.40  18.40        F Cleanup
123     51.97  51.97        F Cleanup
124    168.64 168.64        F Cleanup

The column labeled TcCB.orig contains the original observations stored in alpha-numeric form, where the non-detect is coded as "<0.09".  The column labeled TcCB is the same as the first column, except that it contains all numeric data, and the non-detect has been re-coded as 0.09.  The column labeled Censored records whether that observation was censored at a detection limit. The column labeled Area records which area the observation came from.

Back to top


Summary Statistics

The EnvironmentalStats for S-PLUS help file Summary Statistics lists functions for computing summary statistics that are available in EnvironmentalStats for S-PLUS but not built into S-PLUS.   These include functions to compute the geometric mean, standard deviation, interquartile range, skew, kurtosis, and coefficient of variation, as well as a function called full.summary that computes all of these summary statistics and others as well.

> attach(epa.94b.tccb.df)

> full.summary(split(TcCB, Area))
                           Cleanup Reference
              Sample Size  77        47
       # Missing/Infinite   0         0
                     Mean   3.915     0.5985
                   Median   0.43      0.54
         10% Trimmed Mean   0.6846    0.5728
           Geometric Mean   0.5784    0.5382
                     Skew   7.566     0.8729
                 Kurtosis  61.6       2.993
                      Min   0.09      0.22
                      Max 168.6       1.33
                    Range 168.5       1.11
             1st Quartile   0.23      0.39
             3rd Quartile   1.1       0.75
       Standard Deviation  20.02      0.2836
      Interquartile Range   0.87      0.36
Median Absolute Deviation   0.3558    0.2669
 Coefficient of Variation   5.112     0.4739

<>These summary statistics indicate that the observations for the cleanup area are extremely skewed to the right.  The medians for the two areas are about the same, but the mean for the cleanup area is much larger, indicating a few or more "outlying" observations with large values.  This may be indicative of residual contamination that was missed during the cleanup process.

Back to top


Looking at the Data

To compare the observations in the two areas, you can use the built-in S-PLUS functions hist and boxplot, or the trellis functions histogram and bwplot, or (under S-PLUS 2000) the 2D Plots Palette.  Here we'll use hist and boxplot.

> TcCB.cleanup <- TcCB[Area == "Cleanup"]

> TcCB.ref <- TcCB[Area == "Reference"]

> par(mfrow = c(2,1))

> hist(log(TcCB.ref), xlim = range(log(TcCB)),
+ xlab = "log [ TcCB (ppb) ]",
+ ylab = "Number of Observations",
+ main = "Histogram for Reference Area")

> hist(log(TcCB.cleanup), xlim = range(log(TcCB)),
+ nclass = 25,
+ xlab = "log [ TcCB (ppb) ]",
+ ylab = "Number of Observations",
+ main = "Histogram for Cleanup Area")

Click here to enlarge image.

> boxplot(split(log(TcCB), Area),
+ xlab = "Area", ylab = "Log [ TcCB (ppb) ]",
+ main = "Boxplots for TcCB Data")

Click here to enlarge image.

Both the histograms and boxplots show that most of the observations for the cleanup area are comparable to (or even smaller than) the observations for the reference area, but, as we found out from looking at the summary statistics for these data, there are a few very large "outliers" in the cleanup area.  This may indicate a few "hot spots" in the cleanup area that were missed during the remediation process.

Back to top


Empirical and Theoretical Cumulative Distribution Functions

You can use ecdfplot to plot the empirical cumulative distribution function (ecdf) of the observations for either the reference or cleanup area (or both).  The function cdf.compare (modified from S-PLUS) lets you compare an ecdf to a theoretical cdf, or to another ecdf.  First, let’s plot the empirical cdf of the reference area data by itself, then let’s create another plot comparing this ecdf with the cdf of a lognormal distribution.

> ecdfplot(TcCB.ref, xlab = "TcCB (ppb)",
+ main = "Empirical CDF for Reference Area")

Click here to enlarge image.

> cdf.compare(TcCB.ref, dist = "lnorm")

Click on image to enlarge.

The empirical cdf plot shows that the data are right-skewed.  The comparison of the empirical cdf plot with the cdf of a fitted lognormal distribution shows that these data may probably be adequately fit by a lognormal distribution.

Now let's compare the empirical cdf's of the reference and cleanup areas.

> cdf.compare(log(TcCB.ref), log(TcCB.cleanup))

Click on image to enlarge.

As we saw with both the histograms and boxplots, the cleanup area has quite a few extreme values compared to the reference area.

Back to top


Quantile-Quantile (Probability) Plots

The qqplot function has been modified in EnvironmentalStats for S-PLUS to let you specify a theoretical distribution, what kind of line to add to the plot (if any), and whether to estimate the parameters of the theoretical distribution.  Also, both standard and Tukey mean-difference Q-Q plots can be produced.  Let’s create a Q-Q plot for the reference area TcCB data assuming they come from a lognormal distribution.

> qqplot(TcCB.ref, dist = "lnorm",
+ estimate.params = T, add.line = T,
+ qq.line.type = "0-1")

Click on image to enlarge.

As we saw in the figure comparing the ecdf with a theoretical cdf, the lognormal model appears to be a fairly good fit to these data.   Now let’s look at the Tukey mean-difference Q-Q plot.

> qqplot(TcCB.ref, dist = "lnorm",
+ estimate.params = T, plot.type = "Tukey",
+ add.line = T)

Click on image to enlarge.

Tukey mean difference Q-Q plots plot the differences between the observed and fitted quantiles on the y-axis vs. the mean of the observed and fitted quantiles on the x-axis.  These types of plots are useful because it is easier to see deviations from a horizontal line than from a line with a non-zero slope.

Back to top


Estimating Distribution Parameters

EnvironmentalStats for S-PLUS contains several functions for estimating distribution parameters.  The functions elnorm and elnorm.alt let you estimate the parameters of a lognormal distribution, given a set of observations.  The function elnorm estimates the mean and standard deviation of the log-transformed distribution.  The function elnorm.alt (alternative parameterization) estimates the mean and coefficient of variation of the original distribution.  Both of these functions allow you to compute confidence intervals as well.

Here are the results of these two functions using the reference area TcCB data.

> elnorm(TcCB.ref, ci = T)

Results of Distribution Parameter Estimation
--------------------------------------------

Assumed Distribution:        Lognormal

Estimated Parameter(s):      meanlog = -0.6195712
                             sdlog   =  0.467953

Estimation Method:           mvue

Data:                        TcCB.ref

Sample Size:                 47

Confidence Interval for:     meanlog

Confidence Interval Method:  Exact

Confidence Interval Type:    two-sided

Confidence Level:            95%

Confidence Interval:         LCL = -0.7569673
                             UCL = -0.4821751

 

> elnorm.alt(TcCB.ref, ci = T)

Results of Distribution Parameter Estimation
--------------------------------------------

Assumed Distribution:        Lognormal

Estimated Parameter(s):      mean = 0.5989072
                             cv   = 0.4899539

Estimation Method:           mvue

Data:                        TcCB.ref

Sample Size:                 47

Confidence Interval for:     mean

Confidence Interval Method:  Land

Confidence Interval Type:    two-sided

Confidence Level:            95%

Confidence Interval:         LCL = 0.5243786
                             UCL = 0.7016992

Back to top


Plotting Probability Density Functions

The function pdfplot lets you plot probability density functions for all of the built-in probability distributions that come with S-PLUS and EnvironmentalStats for S-PLUS.   For example, we can plot the observed TcCB data for the reference area using a histogram, and then superimpose the fitted distribution based on the estimated parameters.

We know from the section Estimating Distribution Parameters (above) that the estimated mean and standard deviation based on the log-transformed TcCB data in the reference area are -0.62 and 0.47.  Let's plot the histogram of the log-transformed data, then add the fitted normal distribution.

> hist(log(TcCB.ref), prob = T,
+ xlab = "log [ TcCB (ppb) ]",
+ ylab = "Relative Frequency",
+ ylim = c(0, 1),
+ main = "Histogram for Reference Area\nwith Fitted Distribution")

> pdfplot(dist = "norm", add = T,
+ param.list = list(mean = -0.62, sd = 0.47))

Click on image to enlarge.

Now let's plot the histogram of the original data, then add the fitted lognormal distribution, using the estimated mean and cv of 0.60 and 0.49.

> hist(TcCB.ref, prob = T,
+ xlab = "TcCB (ppb)", xlim = c(0, 1.5),
+ ylab = "Relative Frequency",
+ ylim = c(0, 2), nclass = 10,
+ main = "Histogram for Reference Area\nwith Fitted Distribution")

> pdfplot(dist = "lnorm.alt", add = T,
+ param.list = list(mean = 0.6, cv = 0.49))

Click on image to enlarge.

Here is an example of four other distributions available in EnvironmentalStats for S-PLUS:

> par(mfrow = c(2,2))

> pdfplot(dist = "gevd", param.list =
+ list(location = 4, scale = 1, shape = 0))

> pdfplot(dist = "lnorm.mix.alt", param.list =
+ list(mean1 = 5, cv1 = 1, mean2 = 20, cv2 = 0.5, p.mix = 0.5),
+ right.tail.cutoff = 0.02)

> pdfplot(dist = "pareto", param.list =
+ list(location = 1, shape = 1),
+ right.tail.cutoff = 0.1)

> pdfplot(dist = "zmlnorm.alt", param.list =
+ list(mean = 2, cv = 1, p.zero = 0.4),
+ right.tail.cutoff = 0.01)

Click on image to enlarge.

Back to top


Testing for Goodness-of-Fit

EnvironmentalStats for S-PLUS contains several new functions not available in S-PLUS for testing goodness-of-fit.  Here, we’ll use the Shapiro-Wilk test to test the goodness-of-fit of the reference area TcCB data to a lognormal distribution.

> tccb.ref.gof <- sw.gof(log(TcCB.ref))

> tccb.ref.gof

Results of Goodness-of-Fit Test
-------------------------------

Test Method:                Shapiro-Wilk GOF

Hypothesized Distribution:  Normal

Estimated Parameter(s):     mean = -0.6195712
                            sd   =  0.467953

Estimation Method:          mvue

Data:                       log(TcCB.ref)

Sample Size:                47

Test Statistic:             W = 0.9789915

Test Statistic Parameter:   n = 47

P-value:                    0.5512532

Alternative Hypothesis:     True cdf does not equal
                            the Normal Distribution
                            for at least one sample
                            point.

EnvironmentalStats for S-PLUS also contains a plotting method for the results of goodness-of-fit tests, as well as a function called plot.gof.summary, which produces four summary plots on one page.

> plot.gof.summary(tccb.ref.gof, digits = 2)

 

Click on image to enlarge.

Back to top


Estimating Quantiles and Computing Confidence Limits

EnvironmentalStats for S-PLUS contains several functions for estimating quantiles and optionally constructing confidence limits for the quantiles.  Let’s estimate the 90th percentile of the distribution of the reference area TcCB, assuming the true distribution is a lognormal distribution, and compute a 95% confidence interval for this 90th percentile.

> eqlnorm(TcCB.ref, p = 0.9, ci = T)

Results of Distribution Parameter Estimation
--------------------------------------------

Assumed Distribution:        Lognormal

Estimated Parameter(s):      meanlog = -0.6195712
                             sdlog   =  0.467953

Estimation Method:           mvue

Estimated Quantile(s):       90'th %ile = 0.9803307

Quantile Estimation Method:  qmle

Data:                        TcCB.ref

Sample Size:                 47

Confidence Interval for:     90'th %ile

Confidence Interval Method:  Exact

Confidence Interval Type:    two-sided

Confidence Level:            95%

Confidence Interval:         LCL = 0.8358791
                             UCL = 1.215498

Back to top


Nonparametric Two-Sample Tests

EnvironmentalStats for S-PLUS contains functions for performing general two-sample linear rank tests (to test for a shift in location) and a special quantile test that tests for a shift in the tail of one of the distributions.  Here, we’ll perform the usual Wilcoxon Rank Sum test and the quantile test to compare the reference and cleanup area TcCB data (recall the histograms shown in the section Looking at the Data above).

> two.sample.linear.rank.test(TcCB.cleanup,
+ TcCB.ref, alternative = "greater")

Results of Hypothesis Test
--------------------------

Null Hypothesis:         Fx(t) = Fy(t)

Alternative Hypothesis:  Fx(t) > Fy(t) for at
                         least one t

Test Name:               Two-Sample Linear Rank
Test:                    Wilcoxon Rank Sum
                         Test Based on Normal
                         Approximation

Data:                    x = TcCB.cleanup
                         y = TcCB.ref

Sample Sizes:            nx = 77
                         ny = 47

Test Statistic:          z = 1.171872

P-value:                 0.1206242

The Wilcoxon Rank Sum Test is not significant at the 0.10 level.  Now let’s look at the results of the quantile test.

> quantile.test(TcCB.ref, TcCB.cleanup, target.r = 9)

Results of Hypothesis Test
--------------------------

Null Hypothesis:            e = 0

Alternative Hypothesis:     0 < e <= 1, where
                            Fy(t) = (1-e)*Fx(t) + e*Fz(t),
                            Fz(t) <= Fx(t) for all t,
                            and Fx != Fz

Test Name:                  Quantile Test

Data:                       x = TcCB.ref
                            y = TcCB.cleanup

Sample Sizes:               nx = 47
                            ny = 77

Test Statistics:            k (# y obs of r largest) = 9
                            r = 9

Test Statistic Parameters:  m = 47
                            n = 77
                            quantile.ub = 0.928

P-value:                    0.01136926

The quantile test is significant at the 0.011 level.   It picked up the portion of large outlying values in the cleanup area TcCB data.

Back to top


Sample Size and Power Calculations

Environmental statistics is no different from any other field of statistics when it comes to the need to determine the sample size for a study.  EnvironmentalStats for S-PLUS includes functions to compute the relationship between power and sample size, or the length of a confidence interval and sample size, for standard hypothesis tests based on the normal (Gaussian) and binomial distributions.

The plot below shows the relationship between sample size and power for detecting a mean TcCB concentration at a "contaminated" site that is 1.5 times as large as the mean concentration at the reference site, assuming a sample size of 47 at the reference site and a standard deviation of about 0.47 (log-scale).  Note that about 15 observations are required in the "contaminated" site to obtain 90% power.

> plot.t.test.design(delta.over.sigma = log(1.5)/0.47,
+ n2 = 47, sample.type = "two",
+ alternative = "greater", digits = 2)

Click on image to enlarge.