EnvironmentalStats for S-PLUS Example Session

Here is an example of using EnvironmentalStats for S-PLUS. The courier font represents what S-PLUS displays; the bold courier font represents what the user types on the command line.
Explanation of TcCB Data

The guidance document Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media (USEPA, 1994, pp. 6.22-6.25) contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) concentrations (ppb) from soil samples at a reference site and a "cleanup" area. There are 47 observations from the reference site and 77 from the cleanup area. These data are stored in the data frame epa.94b.tccb.df (see the help file Datasets: USEPA (1994b)). One observation is coded as "ND" in this data set as presented in the guidance document. Here, we'll assume this observation is less than the smallest observed value, which is 0.09 ppb. For the purposes of this tutorial, we'll set this one censored observation to the assumed detection limit of 0.09.
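The recoding of a non-detect described above can be sketched outside S-PLUS. The following is a minimal Python illustration, not part of EnvironmentalStats: the function name recode_nondetect is hypothetical, and the three input strings are made-up values used only to show the mechanics, not entries from epa.94b.tccb.df.

```python
def recode_nondetect(raw):
    """Return (value, censored) for a string such as "0.22" or "<0.09".

    A leading "<" marks a non-detect; the value is set to the
    detection limit that follows it, and censored is True.
    """
    raw = raw.strip()
    if raw.startswith("<"):
        return float(raw[1:]), True
    return float(raw), False

# Hypothetical alphanumeric observations (NOT the actual TcCB data).
tccb_orig = ["0.22", "<0.09", "1.33"]

recoded = [recode_nondetect(v) for v in tccb_orig]
tccb = [value for value, _ in recoded]       # all-numeric column
censored = [flag for _, flag in recoded]     # censoring indicator

print(tccb)
print(censored)
```

Keeping the censoring indicator alongside the numeric value preserves the information that the recoded observation is an upper bound rather than a measured concentration.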
The column labeled TcCB.orig contains the original observations stored in alphanumeric form, where the non-detect is coded as "<0.09". The column labeled TcCB is the same as the first column, except that it contains all numeric data, and the non-detect has been recoded as 0.09. The column labeled Censored records whether that observation was censored at a detection limit. The column labeled Area records which area the observation came from.

Summary Statistics

The EnvironmentalStats for S-PLUS help file Summary Statistics lists functions for computing summary statistics that are available in EnvironmentalStats for S-PLUS but not built into S-PLUS. These include functions to compute the geometric mean, standard deviation, interquartile range, skew, kurtosis, and coefficient of variation, as well as a function called full.summary that computes all of these summary statistics and others as well.
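For readers who want to see how a few of these quantities are defined, here is a minimal Python sketch using only the standard library. It is not the full.summary function: the name summarize is hypothetical, the skew uses a simple moment-based formula that may differ from the estimator used by EnvironmentalStats, and the sample is a made-up right-skewed set of values, not the TcCB data.

```python
import statistics

def summarize(x):
    """Compute a few summary statistics for a list of positive numbers."""
    n = len(x)
    mean = statistics.mean(x)
    sd = statistics.stdev(x)                      # sample standard deviation
    m2 = sum((v - mean) ** 2 for v in x) / n      # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n      # third central moment
    return {
        "mean": mean,
        "median": statistics.median(x),
        "geometric_mean": statistics.geometric_mean(x),
        "cv": sd / mean,                          # coefficient of variation
        "skew": m3 / m2 ** 1.5,                   # moment-based skewness
    }

# Hypothetical right-skewed sample (NOT the actual TcCB data).
sample = [0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.1, 9.5]
stats = summarize(sample)

for name, value in stats.items():
    print(name, round(value, 3))
```

In this made-up sample a single large value pulls the mean well above the median and produces a positive skew, which is the general pattern to look for in right-skewed concentration data.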
The summary statistics for the two areas indicate that the observations for the cleanup area are extremely skewed to the right. The medians for the two areas are about the same, but the mean for the cleanup area is much larger, indicating a few "outlying" observations with large values. This may be indicative of residual contamination that was missed during the cleanup process.

Looking at the Data

To compare the observations in the two areas, you can use the built-in S-PLUS functions hist and boxplot, or the trellis functions histogram and bwplot, or (under S-PLUS 2000) the 2D Plots Palette. Here we'll use hist and boxplot.
Both the histograms and boxplots show that most of the observations for the cleanup area are comparable to (or even smaller than) the observations for the reference area, but, as we found out from looking at the summary statistics for these data, there are a few very large "outliers" in the cleanup area. This may indicate a few "hot spots" in the cleanup area that were missed during the remediation process.

Empirical and Theoretical Cumulative Distribution Functions

You can use ecdfplot to plot the empirical cumulative distribution function (ecdf) of the observations for either the reference or cleanup area (or both). The function cdf.compare (modified from S-PLUS) lets you compare an ecdf to a theoretical cdf, or to another ecdf. First, let's plot the empirical cdf of the reference area data by itself, then let's create another plot comparing this ecdf with the cdf of a lognormal distribution.
The empirical cdf plot shows that the data are right-skewed. The comparison of the empirical cdf with the cdf of a fitted lognormal distribution shows that these data are probably adequately fit by a lognormal distribution. Now let's compare the empirical cdfs of the reference and cleanup areas.
As we saw with both the histograms and boxplots, the cleanup area has quite a few extreme values compared to the reference area.

Quantile-Quantile (Probability) Plots

The qqplot function has been modified in EnvironmentalStats for S-PLUS to let you specify a theoretical distribution, what kind of line to add to the plot (if any), and whether to estimate the parameters of the theoretical distribution. Also, both standard and Tukey mean-difference Q-Q plots can be produced. Let's create a Q-Q plot for the reference area TcCB data assuming they come from a lognormal distribution.
As we saw in the figure comparing the ecdf with a theoretical cdf, the lognormal model appears to be a fairly good fit to these data. Now let's look at the Tukey mean-difference Q-Q plot.
Tukey mean-difference Q-Q plots show the differences between the observed and fitted quantiles on the y-axis vs. the mean of the observed and fitted quantiles on the x-axis. These types of plots are useful because it is easier to see deviations from a horizontal line than from a line with a non-zero slope.

Estimating Distribution Parameters

EnvironmentalStats for S-PLUS contains several functions for estimating distribution parameters. The functions elnorm and elnorm.alt let you estimate the parameters of a lognormal distribution, given a set of observations. The function elnorm estimates the mean and standard deviation of the log-transformed distribution. The function elnorm.alt (alternative parameterization) estimates the mean and coefficient of variation of the original distribution. Both of these functions allow you to compute confidence intervals as well. Here are the results of these two functions using the reference area TcCB data.
Plotting Probability Density Functions

The function pdfplot lets you plot probability density functions for all of the built-in probability distributions that come with S-PLUS and EnvironmentalStats for S-PLUS. For example, we can plot the observed TcCB data for the reference area using a histogram, and then superimpose the fitted distribution based on the estimated parameters. We know from the section Estimating Distribution Parameters (above) that the estimated mean and standard deviation based on the log-transformed TcCB data in the reference area are -0.62 and 0.47. Let's plot the histogram of the log-transformed data, then add the fitted normal distribution.
Now let's plot the histogram of the original data, then add the fitted lognormal distribution, using the estimated mean and cv of 0.60 and 0.49.
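The two sets of estimates quoted above are linked by the standard lognormal identities: mean = exp(meanlog + sdlog^2 / 2) and CV = sqrt(exp(sdlog^2) - 1). As a rough cross-check, the following minimal Python sketch (not S-PLUS code; the function name lognormal_mean_cv is hypothetical) applies these identities to the rounded log-scale values -0.62 and 0.47.

```python
import math

def lognormal_mean_cv(meanlog, sdlog):
    """Convert log-scale (meanlog, sdlog) to original-scale (mean, cv)."""
    mean = math.exp(meanlog + sdlog ** 2 / 2.0)
    cv = math.sqrt(math.exp(sdlog ** 2) - 1.0)
    return mean, cv

# Rounded estimates quoted in the text for the reference area.
mean, cv = lognormal_mean_cv(-0.62, 0.47)
print(round(mean, 3), round(cv, 3))
```

The result is close to the quoted mean and CV of 0.60 and 0.49. An exact match is not expected, because the inputs here are rounded and elnorm.alt may use a different estimator than this plug-in conversion.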
Here is an example of four other distributions available in EnvironmentalStats for S-PLUS:
Testing for Goodness-of-Fit

EnvironmentalStats for S-PLUS contains several new functions not available in S-PLUS for testing goodness-of-fit. Here, we'll use the Shapiro-Wilk test to test the goodness-of-fit of the reference area TcCB data to a lognormal distribution.
EnvironmentalStats for S-PLUS also contains a plotting method for the results of goodness-of-fit tests, as well as a function called plot.gof.summary, which produces four summary plots on one page.
Estimating Quantiles and Computing Confidence Limits

EnvironmentalStats for S-PLUS contains several functions for estimating quantiles and optionally constructing confidence limits for the quantiles. Let's estimate the 90th percentile of the distribution of the reference area TcCB, assuming the true distribution is a lognormal distribution, and compute a 95% confidence interval for this 90th percentile.
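The point estimate of a lognormal quantile follows directly from the log-scale parameters: the p-th quantile is exp(meanlog + z_p * sdlog), where z_p is the standard normal quantile. The minimal Python sketch below (not S-PLUS code; the function name lognormal_quantile is hypothetical) plugs in the rounded estimates -0.62 and 0.47 quoted earlier for the reference area. It gives only the point estimate; the confidence interval is not reproduced here, since it requires more than this plug-in formula.

```python
import math
from statistics import NormalDist

def lognormal_quantile(p, meanlog, sdlog):
    """Plug-in estimate of the p-th quantile of a lognormal distribution."""
    z = NormalDist().inv_cdf(p)          # standard normal quantile z_p
    return math.exp(meanlog + z * sdlog)

# Rounded log-scale estimates quoted in the text for the reference area.
q90 = lognormal_quantile(0.90, -0.62, 0.47)
print(round(q90, 2))
```

With these rounded inputs the estimated 90th percentile comes out just under 1 ppb; the value reported by the package may differ slightly because it works from unrounded estimates.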
Nonparametric Two-Sample Tests

EnvironmentalStats for S-PLUS contains functions for performing general two-sample linear rank tests (to test for a shift in location) and a special quantile test that tests for a shift in the tail of one of the distributions. Here, we'll perform the usual Wilcoxon Rank Sum test and the quantile test to compare the reference and cleanup area TcCB data (recall the histograms shown in the section Looking at the Data above).
The Wilcoxon Rank Sum Test is not significant at the 0.10 level. Now let's look at the results of the quantile test.
The quantile test is significant at the 0.011 level. It picked up the portion of large outlying values in the cleanup area TcCB data.

Sample Size and Power Calculations

Environmental statistics is no different from any other field of statistics when it comes to the need to determine the sample size for a study. EnvironmentalStats for S-PLUS includes functions to compute the relationship between power and sample size, or the length of a confidence interval and sample size, for standard hypothesis tests based on the normal (Gaussian) and binomial distributions. The plot below shows the relationship between sample size and power for detecting a mean TcCB concentration at a "contaminated" site that is 1.5 times as large as the mean concentration at the reference site, assuming a sample size of 47 at the reference site and a standard deviation of about 0.47 (log-scale). Note that about 15 observations are required in the "contaminated" site to obtain 90% power.
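The relationship between sample size and power described above can be approximated by hand. The following minimal Python sketch is not the EnvironmentalStats calculation. It assumes a one-sided test at the 0.05 level, a normal (z) approximation in place of the t distribution, equal log-scale standard deviations at both sites, and a log-scale shift of log(1.5) for a 1.5-fold ratio of means. The function name power_two_sample and the choice of alpha are assumptions introduced for illustration.

```python
import math
from statistics import NormalDist

def power_two_sample(n_site, n_ref=47, ratio=1.5, sd_log=0.47, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test on the log scale.

    ASSUMPTIONS (not from the source): one-sided alpha = 0.05 and a
    normal approximation instead of the t distribution.
    """
    delta = math.log(ratio)                               # shift on the log scale
    se = sd_log * math.sqrt(1.0 / n_ref + 1.0 / n_site)   # s.e. of the difference
    z_crit = NormalDist().inv_cdf(1.0 - alpha)            # one-sided critical value
    return NormalDist().cdf(delta / se - z_crit)

for n in (5, 10, 15, 20):
    print(n, round(power_two_sample(n), 3))
```

Under these assumptions the approximate power at 15 observations comes out close to 90%, in line with the figure quoted above. A t-based calculation would be expected to give a slightly lower power for the same sample size.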
Copyright © 2005-2011 SolutionMetrics Pty Ltd