What statistical test would you recommend to compare two animal behavior sampling methods?

I am doing a lab where I am comparing the scanning method and focal animal method. Each method will give me the following:

  • Scanning method: occurrence frequency for each behavior (%)
  • Focal animal method: occurrence frequency (%), as well as the duration, for each behavior

I was thinking a chi-squared test or a t-test somehow but I'm not sure which to do or how to format them. Should I somehow compare the occurrence frequencies?

Not sure if this info is useful, but I normally use R for my statistical tests, although I have never used my own data, only example data. I was going to record my data in Excel and import it into R. Thank you in advance.

Selecting a sample size for studies with repeated measures

Many researchers favor repeated measures designs because they allow the detection of within-person change over time and typically have higher statistical power than cross-sectional designs. However, the plethora of inputs needed for repeated measures designs can make sample size selection, a critical step in designing a successful study, difficult. Using a dental pain study as a driving example, we provide guidance for selecting an appropriate sample size for testing a time by treatment interaction for studies with repeated measures. We describe how to (1) gather the required inputs for the sample size calculation, (2) choose appropriate software to perform the calculation, and (3) address practical considerations such as missing data, multiple aims, and continuous covariates.

Methods of sampling from a population

It would normally be impractical to study a whole population, for example when doing a questionnaire survey. Sampling is a method that allows researchers to infer information about a population based on results from a subset of the population, without having to investigate every individual. Reducing the number of individuals in a study reduces the cost and workload, and may make it easier to obtain high quality information, but this has to be balanced against having a large enough sample size with enough power to detect a true association. (Calculation of sample size is addressed in section 1B (statistics) of the Part A syllabus.)

If a sample is to be used, by whatever method it is chosen, it is important that the individuals selected are representative of the whole population. This may involve specifically targeting hard to reach groups. For example, if the electoral roll for a town was used to identify participants, some people, such as the homeless, would not be registered and therefore excluded from the study by default.

There are several different sampling techniques available, and they can be subdivided into two groups: probability sampling and non-probability sampling. In probability (random) sampling, you start with a complete sampling frame of all eligible individuals from which you select your sample. In this way, all eligible individuals have a chance of being chosen for the sample, and you will be more able to generalise the results from your study. Probability sampling methods tend to be more time-consuming and expensive than non-probability sampling. In non-probability (non-random) sampling, you do not start with a complete sampling frame, so some individuals have no chance of being selected. Consequently, you cannot estimate the effect of sampling error and there is a significant risk of ending up with a non-representative sample which produces non-generalisable results. However, non-probability sampling methods tend to be cheaper and more convenient, and they are useful for exploratory research and hypothesis generation.

Probability Sampling Methods

1. Simple random sampling

In this case each individual is chosen entirely by chance and each member of the population has an equal chance, or probability, of being selected. One way of obtaining a random sample is to give each individual in a population a number, and then use a table of random numbers to decide which individuals to include. 1 For example, if you have a sampling frame of 1000 individuals, labelled 0 to 999, use groups of three digits from the random number table to pick your sample. So, if the first three numbers from the random number table were 094, select the individual labelled “94”, and so on.

As with all probability sampling methods, simple random sampling allows the sampling error to be calculated and reduces selection bias. A specific advantage is that it is the most straightforward method of probability sampling. A disadvantage of simple random sampling is that you may not select enough individuals with your characteristic of interest, especially if that characteristic is uncommon. It may also be difficult to define a complete sampling frame and inconvenient to contact the selected individuals, especially if different forms of contact are required (email, phone, post) and your sample units are scattered over a wide geographical area.
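The procedure can be sketched in Python, with a pseudo-random number generator standing in for the table of random numbers. This is a minimal illustration only: the frame of 1000 labels, the sample size of 50 and the function name `simple_random_sample` are assumptions made for the example.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, each equally likely to be chosen."""
    rng = random.Random(seed)  # seeded so the draw can be reproduced
    return rng.sample(frame, n)  # sampling without replacement

# Hypothetical sampling frame: 1000 individuals labelled 0 to 999.
frame = list(range(1000))
sample = simple_random_sample(frame, n=50, seed=1)
print(sorted(sample))
```

Fixing the seed makes the selection auditable, which helps when the sampling rules have to be documented in advance.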

2. Systematic sampling

Individuals are selected at regular intervals from the sampling frame. The intervals are chosen to ensure an adequate sample size. If you need a sample size n from a population of size x, you should select every (x/n)th individual for the sample. For example, if you wanted a sample size of 100 from a population of 1000, select every 1000/100 = 10th member of the sampling frame.

Systematic sampling is often more convenient than simple random sampling, and it is easy to administer. However, it may also lead to bias, for example if there are underlying patterns in the order of the individuals in the sampling frame, such that the sampling technique coincides with the periodicity of the underlying pattern. As a hypothetical example, if a group of students were being sampled to gain their opinions on college facilities, but the Student Record Department’s central list of all students was arranged such that the sex of students alternated between male and female, choosing an even interval (e.g. every 20th student) would result in a sample of all males or all females. Whilst in this example the bias is obvious and should be easily corrected, this may not always be the case.
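A short Python sketch of the interval rule and of the periodicity problem is given below. The random starting point is an assumption added for the example (the text above does not specify where to start), and the alternating student list is the hypothetical one just described.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th unit, where k = len(frame) // n is the sampling interval."""
    k = len(frame) // n
    start = random.Random(seed).randrange(k)  # assumed: random start within the first interval
    return frame[start::k][:n]

# Sample of 100 from a frame of 1000: interval of 10.
frame = list(range(1000))
sample = systematic_sample(frame, 100, seed=0)

# Hypothetical list in which the sex of students alternates: M, F, M, F, ...
students = ["M" if i % 2 == 0 else "F" for i in range(1000)]
picked = systematic_sample(students, 50, seed=0)  # interval of 20 (an even number)
sexes = set(picked)
print(sexes)  # only one sex appears, whichever the starting point
```

Because the interval of 20 is a multiple of the period of 2, every selected position has the same parity, so the sample contains a single sex regardless of the start.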

3. Stratified sampling

In this method, the population is first divided into subgroups (or strata) that all share a similar characteristic. It is used when we might reasonably expect the measurement of interest to vary between the different subgroups, and we want to ensure representation from all the subgroups. For example, in a study of stroke outcomes, we may stratify the population by sex, to ensure equal representation of men and women. The study sample is then obtained by taking equal sample sizes from each stratum. In stratified sampling, it may also be appropriate to choose non-equal sample sizes from each stratum. For example, in a study of the health outcomes of nursing staff in a county, if there are three hospitals each with different numbers of nursing staff (hospital A has 500 nurses, hospital B has 1000 and hospital C has 2000), then it would be appropriate to choose the sample numbers from each hospital proportionally (e.g. 10 from hospital A, 20 from hospital B and 40 from hospital C). This ensures a more realistic and accurate estimation of the health outcomes of nurses across the county, whereas taking equal sample sizes from each hospital would over-represent nurses from hospitals A and B. The fact that the sample was stratified should be taken into account at the analysis stage.

Stratified sampling improves the accuracy and representativeness of the results by reducing sampling bias. However, it requires knowledge of the appropriate characteristics of the sampling frame (the details of which are not always available), and it can be difficult to decide which characteristic(s) to stratify by.
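The proportional allocation in the hospital example can be reproduced with a few lines of Python. This is a sketch only: the function name `proportional_allocation` is hypothetical, and simple rounding is assumed, which in general may give a total that differs slightly from the target.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    population = sum(stratum_sizes.values())
    return {
        name: round(total_sample * size / population)
        for name, size in stratum_sizes.items()
    }

# Nursing staff per hospital, as in the example above.
hospitals = {"A": 500, "B": 1000, "C": 2000}
allocation = proportional_allocation(hospitals, total_sample=70)
print(allocation)  # {'A': 10, 'B': 20, 'C': 40}
```

Within each hospital, the allocated number of nurses would then be drawn by simple random sampling.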

4. Clustered sampling

In a clustered sample, subgroups of the population are used as the sampling unit, rather than individuals. The population is divided into subgroups, known as clusters, which are randomly selected to be included in the study. Clusters are usually already defined, for example individual GP practices or towns could be identified as clusters. In single-stage cluster sampling, all members of the chosen clusters are then included in the study. In two-stage cluster sampling, a selection of individuals from each cluster is then randomly selected for inclusion. Clustering should be taken into account in the analysis. The General Household survey, which is undertaken annually in England, is a good example of a (one-stage) cluster sample. All members of the selected households (clusters) are included in the survey. 1

Cluster sampling can be more efficient than simple random sampling, especially where a study takes place over a wide geographical region. For instance, it is easier to contact lots of individuals in a few GP practices than a few individuals in many different GP practices. Disadvantages include an increased risk of bias, if the chosen clusters are not representative of the population, resulting in an increased sampling error.
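The difference between single-stage and two-stage cluster sampling can be sketched as follows. The 20 practices of 30 patients each, the identifiers and the function name `cluster_sample` are all invented for the illustration.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly choose clusters, then take all members (single-stage)
    or a random subset of each chosen cluster (two-stage)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample[name] = list(members)  # single-stage: everyone in the cluster
        else:
            size = min(per_cluster, len(members))
            sample[name] = rng.sample(members, size)  # two-stage: random subset
    return sample

# Hypothetical population: 20 GP practices with 30 patients each.
clusters = {
    f"practice_{i}": [f"patient_{i}_{j}" for j in range(30)] for i in range(20)
}
one_stage = cluster_sample(clusters, n_clusters=4, seed=2)
two_stage = cluster_sample(clusters, n_clusters=4, per_cluster=5, seed=2)
print({name: len(members) for name, members in two_stage.items()})
```

In both designs the cluster, not the individual, is the unit selected at the first stage, which is why clustering has to be reflected in the analysis.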

Non-Probability Sampling Methods

1. Convenience sampling

Convenience sampling is perhaps the easiest method of sampling, because participants are selected based on availability and willingness to take part. Useful results can be obtained, but the results are prone to significant bias, because those who volunteer to take part may be different from those who choose not to (volunteer bias), and the sample may not be representative of other characteristics, such as age or sex. Note: volunteer bias is a risk of all non-probability sampling methods.

2. Quota sampling

This method of sampling is often used by market researchers. Interviewers are given a quota of subjects of a specified type to attempt to recruit. For example, an interviewer might be told to go out and select 20 adult men, 20 adult women, 10 teenage girls and 10 teenage boys so that they could interview them about their television viewing. Ideally the quotas chosen would proportionally represent the characteristics of the underlying population.

Whilst this has the advantage of being relatively straightforward and potentially representative, the chosen sample may not be representative of other characteristics that weren’t considered (a consequence of the non-random nature of sampling). 2

3. Judgement (or Purposive) Sampling

Also known as selective, or subjective, sampling, this technique relies on the judgement of the researcher when choosing who to ask to participate. Researchers may thus implicitly choose a “representative” sample to suit their needs, or specifically approach individuals with certain characteristics. This approach is often used by the media when canvassing the public for opinions and in qualitative research.

Judgement sampling has the advantage of being time- and cost-effective to perform whilst resulting in a range of responses (particularly useful in qualitative research). However, in addition to volunteer bias, it is also prone to errors of judgement by the researcher and the findings, whilst being potentially broad, will not necessarily be representative.

4. Snowball sampling

This method is commonly used in social sciences when investigating hard-to-reach groups. Existing subjects are asked to nominate further subjects known to them, so the sample increases in size like a rolling snowball. For example, when carrying out a survey of risk behaviours amongst intravenous drug users, participants may be asked to nominate other users to be interviewed.

Snowball sampling can be effective when a sampling frame is difficult to identify. However, by selecting friends and acquaintances of subjects already investigated, there is a significant risk of selection bias (choosing a large number of people with similar characteristics or views to the initial individual identified).

Bias in sampling

There are five important potential sources of bias that should be considered when selecting a sample, irrespective of the method used. Sampling bias may be introduced when: 1

  1. Any pre-agreed sampling rules are deviated from
  2. People in hard-to-reach groups are omitted
  3. Selected individuals are replaced with others, for example if they are difficult to contact
  4. There are low response rates
  5. An out-of-date list is used as the sample frame (for example, if it excludes people who have recently moved to an area)

Further potential problems with sampling strategies are covered in chapter 8 of this section (“Sources of variation, its measurement and control”).

Guidance for Industry and FDA Staff: Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests

This guidance represents the Food and Drug Administration's (FDA's) current thinking on this topic. It does not create or confer any rights for or on any person and does not operate to bind FDA or the public. You can use an alternative approach if the approach satisfies the requirements of the applicable statutes and regulations. If you want to discuss an alternative approach, contact the FDA staff responsible for implementing this guidance. If you cannot identify the appropriate FDA staff, call the appropriate number listed on the title page of this guidance.

1. Background

This guidance is intended to describe some statistically appropriate practices for reporting results from different studies evaluating diagnostic tests and identify some common inappropriate practices. The recommendations in this guidance pertain to diagnostic tests where the final result is qualitative (even if the underlying measurement is quantitative). We focus special attention on the practice called discrepant resolution and its associated problems.

On February 11, 1998, the Center for Devices and Radiological Health convened a joint meeting of the Microbiology, Hematology/Pathology, Clinical Chemistry/Toxicology, and Immunology Devices Panels. The purpose of the meeting was to obtain recommendations on “appropriate data collection, analysis, and resolution of discrepant results, using sound scientific and statistical analysis to support indications for use of the in vitro diagnostic devices when the new device is compared to another device, a recognized reference method or ‘gold standard,’ or other procedures not commonly used, and/or clinical criteria for diagnosis.” Using the input from that meeting, a draft guidance document was developed discussing some statistically valid approaches to reporting results from evaluation studies for new diagnostic devices. The draft guidance was released for public comment March 12, 2003.

Following publication of the draft guidance, FDA received 11 comments. Overall, the comments were favorable and requested additional information be included in the final guidance. Some respondents requested greater attention to the use of standard terminology.

Correct use of terminology for describing performance results is important to ensure safe and effective use of a diagnostic device. Whenever possible, this guidance uses internationally accepted terminology and definitions as compiled in the Clinical and Laboratory Standards Institute (CLSI) Harmonized Terminology Database. 1 This guidance also uses terms as they are defined in the STARD (STAndards for Reporting of Diagnostic Accuracy) Initiative. 2 The STARD Initiative pertains to studies of diagnostic accuracy. While the STARD Initiative does not specifically address studies designed to demonstrate diagnostic device equivalence, many of the reporting concepts are still applicable.

FDA’s guidance documents, including this guidance, do not establish legally enforceable responsibilities. Instead, guidances describe the Agency’s current thinking on a topic and should be viewed only as recommendations, unless specific regulatory or statutory requirements are cited. The use of the word should in Agency guidances means that something is suggested or recommended, but not required.

We believe we should consider the least burdensome approach in all areas of medical device regulation. This guidance reflects our careful review of the relevant scientific and legal requirements and what we believe is the least burdensome way for you to comply with those requirements. However, if you believe that an alternative approach would be less burdensome, please contact us so we can consider your point of view. You may send your written comments to the contact person listed in the preface to this guidance or to the CDRH Ombudsman. Comprehensive information on CDRH’s Ombudsman, including ways to contact him, can be found on the Internet.

2. Scope

This document provides guidance for the submission of premarket notification (510(k)) and premarket approval (PMA) applications for diagnostic devices (tests). This guidance addresses the reporting of results from different types of studies evaluating diagnostic devices with two possible outcomes (positive or negative) in PMAs and 510(k)s. The guidance is intended for both statisticians and non-statisticians.

This guidance does not address the fundamental statistical issues associated with design and monitoring of clinical studies for diagnostic devices.

3. Introduction

This section provides an explanation of the concepts relevant to this guidance. We note at the outset that evaluation of a new diagnostic test should compare a new product’s outcome (test results) to an appropriate and relevant diagnostic benchmark using subjects/patients from the intended use population, that is, those subjects/patients for whom the test is intended to be used. In STARD, this is called the target population.

Other important concepts and definitions include the following:

Types of test results

The method of comparison depends on the nature of the test results. Diagnostic test results (outcomes) are usually classified as either quantitative or qualitative. A quantitative result is a numerical amount or level, while a qualitative result usually consists of one of only two possible responses, for example, diseased or non-diseased, positive or negative, yes or no. This document pertains to diagnostic tests where the final result is qualitative (even if the underlying measurement is quantitative). Quantitative tests and tests with ordinal outcomes (more than two possible outcomes, but ordered) are not discussed here.

We also assume throughout that your study data do not include multiple samples from single patients.

Purpose of a qualitative diagnostic test

A qualitative diagnostic test (test) is designed to determine whether a target condition is present or absent in a subject from the intended use population. As defined in STARD, the target condition (condition of interest) “can refer to a particular disease, a disease stage, health status, or any other identifiable condition within a patient, such as staging a disease already known to be present, or a health condition that should prompt clinical action, such as the initiation, modification or termination of treatment.”

FDA recommends your labeling characterize diagnostic test performance for use by all intended users (laboratories, health care providers, and/or home users).


FDA recognizes two major categories of benchmarks for assessing diagnostic performance of new qualitative diagnostic tests. These categories are (1) comparison to a reference standard (defined below), or (2) comparison to a method or predicate other than a reference standard (non-reference standard). The choice of comparative method will determine which performance measures may be reported in the label.

Diagnostic accuracy and the reference standard

The diagnostic accuracy of a new test refers to the extent of agreement between the outcome of the new test and the reference standard. We use the term reference standard as defined in STARD. That is, a reference standard is “considered to be the best available method for establishing the presence or absence of the target condition.” It divides the intended use population into only two groups (condition present or absent) and does not consider the outcome of the new test under evaluation.

The reference standard can be a single test or method, or a combination of methods and techniques, including clinical follow-up. If a reference standard is a combination of methods, the algorithm specifying how the different results are combined to make a final positive/negative classification (which may include the choice and ordering of these methods) is part of the standard. Examples of reference standards include the diagnosis of myocardial infarction using the WHO (World Health Organization) standards, the diagnosis of lupus or rheumatoid arthritis using American Rheumatology guidelines, or the diagnosis of H. pylori infections by use of combinations of culture, histology, and urease testing.

The determination of what constitutes the “best available method” and whether that method should be considered a “reference standard” is established by opinion and practice within the medical, laboratory, and regulatory community. Sometimes there are several possible methods that could be considered. Sometimes no consensus reference standard exists. Or, a reference standard may exist, but for a non-negligible percentage of the intended use population, the reference standard is known to be in error. In all these situations, we recommend you consult with FDA on your choice of reference standard before you begin your study.

We point out that some definitions of diagnostic accuracy (see CLSI harmonized terminology database) require that the reference standard and target condition refer only to a well-defined clinical disorder. The definitions used in this document are broader. For example, the target condition could be a well-defined health condition or a condition that prompts a clinical action such as the initiation of a treatment.

Measures that describe diagnostic accuracy

There are different ways to describe diagnostic accuracy. Appropriate measures include estimates of sensitivity and specificity pairs, likelihood ratio of positive and negative result pairs, and ROC (Receiver Operating Characteristic) analysis along with confidence intervals. Refer to the most current edition of CLSI Approved Guidelines EP12-A and GP10-A; the texts by Lang and Secic (1997), Pepe (2003), and Zhou et al. (2002); the references within these texts; and the bibliography at the end of this document. To help interpret these measures, we recommend you provide the definition of condition of interest, the reference standard, the intended use population, and a description of the study population.

Sensitivity and specificity

In studies of diagnostic accuracy, the sensitivity of the new test is estimated as the proportion of subjects with the target condition in whom the test is positive. Similarly, the specificity of the test is estimated as the proportion of subjects without the target condition in whom the test is negative (see the Appendix for an example of this calculation). These are only estimates for sensitivity and specificity because they are based on only a subset of subjects from the intended use population; if another subset of subjects were tested (or even the same subjects tested at a different time), then the estimates of sensitivity and specificity would probably be numerically different. Confidence intervals and significance levels quantify the statistical uncertainty in these estimates due to the subject/sample selection process. This type of uncertainty decreases as the number of subjects in the study increases.
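As an illustration only, the two estimates and a confidence interval for each can be computed as in the Python sketch below. The counts are invented for the example and are not taken from the Appendix, and the Wilson score interval is an assumption made here; it is one of several interval methods and is not prescribed by this guidance.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Two-sided Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts, with condition status set by the reference standard.
tp, fn = 90, 10   # condition present: new test positive / negative
fp, tn = 15, 185  # condition absent:  new test positive / negative

sensitivity = tp / (tp + fn)  # proportion positive among those with the condition
specificity = tn / (tn + fp)  # proportion negative among those without it
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)

print(f"sensitivity {sensitivity:.3f}, interval {sens_ci[0]:.3f} to {sens_ci[1]:.3f}")
print(f"specificity {specificity:.3f}, interval {spec_ci[0]:.3f} to {spec_ci[1]:.3f}")
```

Rerunning the sketch with every count multiplied by ten narrows both intervals while leaving the point estimates unchanged, which is the sense in which this type of uncertainty shrinks as the number of subjects grows.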

Positive and negative predictive value

You may also compute other quantities to help characterize diagnostic accuracy. These quantities include the predictive value of a positive result (sometimes called positive predictive value or PPV) and predictive value of a negative result (sometimes called negative predictive value or NPV) pair. They provide useful insight into how to interpret test results. You may refer to the extensive literature on how to calculate and interpret these measures. (See most current edition of CLSI EP12-A, Lang and Secic (1997), Pepe (2003), Zhou et al. (2002), the references within the texts, and the bibliography at the end of this document.) Further discussion of these measures is beyond the scope of this document.


Sensitivity and specificity estimates (and other estimates of diagnostic performance) can be subject to bias. Biased estimates are systematically too high or too low. Biased sensitivity and specificity estimates will not equal the true sensitivity and specificity, on average. Often the existence, size (magnitude), and direction of the bias cannot be determined. Bias creates inaccurate estimates.

FDA believes it is important to understand the potential sources of bias to avoid or minimize them. Simply increasing the overall number of subjects in the study will do nothing to reduce bias. Alternatively, selecting the “right” subjects, changing study conduct, or data analysis procedures may remove or reduce bias.

Two sources of bias that originally motivated the development of this guidance include error in the reference standard and incorporation of results from the test under evaluation to establish the target condition. This guidance discusses problems arising from these and other sources of bias and describes how to minimize these problems in your study design and data analysis. This guidance does not attempt to discuss all possible sources of bias and how to avoid them. For comprehensive discussions on bias and diagnostic device studies, see Begg (1987), Pepe (2003), Zhou et al. (2002), and the references cited in these texts.

When a non-reference standard is used for comparison

When a new test is evaluated by comparison to a non-reference standard, sensitivity and specificity are not appropriate terms to describe the comparative results. Information on the accuracy or “correctness” of the new test cannot be estimated directly. Instead, when a non-reference standard is used for comparison, FDA recommends you demonstrate the ability of the candidate test to agree sufficiently with the comparative method or predicate. A question addressed in this document is how to report results from a study evaluating a new diagnostic test when the comparative method is not a reference standard.

4. Benchmark and Study Population Recommendations

FDA recommends you carefully plan your study before collecting the first specimen or taking the first measurement. This includes determining whether you want to report diagnostic accuracy or device agreement. If you want to report diagnostic accuracy, FDA recommends your evaluation include the use of a reference standard on at least some of the subjects.

We recommend you contact CDRH early to discuss possible study designs and statistical analyses prior to any data collection for the clinical study. 3 Often there are promising advanced statistical methods that may be appropriate, and new statistical analysis techniques are constantly being developed. The list of references at the end of this document includes a variety of approaches. Discussing your planned study with CDRH before starting may save time and money.

4.1 Comparisons with the Benchmark

The choice of comparative benchmark and the methods of comparison and reporting are influenced by the existence and/or practical applicability of a reference standard. Depending on the availability of a reference standard, FDA makes the following recommendations regarding the choice of comparative benchmark:

  1. If a reference standard is available: use it to estimate sensitivity and specificity
  2. If a reference standard is available, but impractical: use it to the extent possible. Calculate estimates of sensitivity and specificity adjusted to correct for any (verification) bias that may have been introduced by not using the reference standard to its fullest extent.
  3. If a reference standard is not available or unacceptable for your particular intended use and/or intended use population: consider whether one can be constructed. If so, calculate estimated sensitivity and specificity under the constructed standard.
  4. If a reference standard is not available and cannot be constructed: calculate and report measures of agreement (see Appendices).

We now provide more details on these recommendations:

If a reference standard is available

From a purely statistical perspective, FDA believes that the best approach is to designate a reference standard and compare the new test to the designated reference standard, drawing from subjects who are representative of the intended use population. We recommend you consult with FDA prior to planning a study to ensure the designated reference standard will meet Agency needs. In this situation, sensitivity and specificity have meaning, and you can easily calculate the estimates. The Appendices contain a numerical example.

If a reference standard is available, but impractical

If you determine that using a reference standard on all subjects is impractical or not feasible, FDA recommends you obtain estimates of sensitivity and specificity using the new test and a comparative method (other than a reference standard) on all subjects, and use the reference standard on just a subset of subjects (sometimes called partial verification studies or two-stage studies).

For example, if you apply the designated reference standard to a random subset of all subjects, or to all subjects where the new test and the comparative method disagree and to a random sample of subjects where they agree, then it is possible to compute adjusted estimates (and variances) of sensitivity and specificity. In this case FDA recommends you retest a sufficient number of subjects to estimate sensitivity and specificity with reasonable precision.

Note that the simple formulas for calculating sensitivity and specificity described in the Appendix are not correct for this design and such naive calculations would give biased estimates of sensitivity and specificity. This type of bias is an example of verification or work-up bias. For details see Begg (1987), Pepe (2003), or Zhou et al. (2002).
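The direction of this bias can be seen in a small Python sketch. It assumes a hypothetical design in which the reference standard is applied to every subject where the new test and the comparative method disagree and to a random 10% of subjects where they agree, and it uses inverse-probability weighting (each verified subject is weighted by one over the verification fraction) as one simple way to adjust. All counts are invented, the weighting scheme is an assumption of this sketch rather than a method taken from this guidance or the cited texts, and variance estimation is not shown.

```python
# Verified subjects only. Each row is:
# (new test positive?, condition present per reference standard?,
#  number verified, fraction of that group sent for verification)
verified = [
    (True,  True,   9, 0.1),  # tests agree, both positive: 10% verified
    (True,  False,  1, 0.1),
    (True,  True,  10, 1.0),  # tests disagree, new test positive: all verified
    (True,  False, 10, 1.0),
    (False, True,   5, 1.0),  # tests disagree, new test negative: all verified
    (False, False, 15, 1.0),
    (False, True,   1, 0.1),  # tests agree, both negative: 10% verified
    (False, False, 99, 0.1),
]

def sens_spec(rows, weighted):
    """Sensitivity and specificity of the new test, with or without weighting."""
    tp = fn = fp = tn = 0.0
    for test_pos, cond_present, count, fraction in rows:
        w = count / fraction if weighted else count
        if cond_present and test_pos:
            tp += w
        elif cond_present:
            fn += w
        elif test_pos:
            fp += w
        else:
            tn += w
    return tp / (tp + fn), tn / (tn + fp)

naive = sens_spec(verified, weighted=False)
adjusted = sens_spec(verified, weighted=True)
print("naive:   ", naive)
print("adjusted:", adjusted)
```

With these invented counts the naive calculation understates both measures, because the disagreeing subjects, all of whom are verified, are over-represented among the verified group.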

Determining how large a subset to choose, the particular subset to choose, and how to calculate the performance measures is currently an area of active statistical research. See Albert (2006), Albert & Dodd (2004, 2006), Hawkins et al. (2001), Kondratovich (2003), Pepe (2003), Zhou et al. (2002), and references cited within these references. Since this approach can be statistically complicated, FDA recommends you consult with a CDRH statistician before using this approach.

In rare circumstances, it may be possible to estimate sensitivity and specificity without using a reference standard in the study. This may be reasonable, for example, when the sensitivity and specificity of the designated comparative method are well established from previous evaluations against a reference standard in similar subject populations. Further elaboration of this subject is beyond the scope of this document. Here too, FDA recommends you consult with a CDRH statistician before using this approach.

If a reference standard is not available, but might be constructed

An expert panel (FDA advisory panel or other panel) may be able to develop a set of clinical criteria (or a combination of reference tests and confirmatory clinical information) that would serve as a designated reference standard. While this approach may be more time-consuming up front, if successful, you can easily calculate estimates of sensitivity and specificity. In this situation, FDA recommends

  • the test label clearly describe the designated reference standard that was constructed
  • the new reference standard be created independently from the analysis of results of the new diagnostic test (ideally, in advance of collecting any specimens)
  • you consult with CDRH medical officers and statisticians prior to constructing a reference standard.

If a reference standard is not available and cannot be constructed

When a new test is evaluated by comparison to a non-reference standard, you cannot directly calculate unbiased estimates of sensitivity and specificity. Therefore, the terms sensitivity and specificity are not appropriate to describe the comparative results. Instead, the same numerical calculations are made, but the estimates are called positive percent agreement and negative percent agreement, rather than sensitivity and specificity. This reflects that the estimates are not of accuracy but of agreement of the new test with the non-reference standard.

In addition, quantities such as positive predictive value, negative predictive value, and the positive and negative likelihood ratios cannot be computed since the subjects’ condition status (as determined by a reference standard) is unknown.

In this situation, FDA recommends you report

  • the 2x2 table of results comparing the candidate test with the comparative method
  • a description of the comparative method and how it was performed
  • the pair of agreement measures along with their confidence intervals.

The Appendices provide a numerical example.

We adopt the terms “positive percent agreement” and “negative percent agreement” with the following cautionary note. Agreement of a new test with the non-reference standard is numerically different from agreement of the non-reference standard with the new test (contrary to what the term “agreement” implies). Therefore, when using these measures of agreement, FDA recommends you clearly state the calculations being performed.
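As a minimal sketch of these calculations, the following Python computes positive and negative percent agreement of the new test with the non-reference standard, then swaps the roles of the two tests to show that the reverse agreement is numerically different. The cell counts and the function name are hypothetical and are not taken from the Appendices.

```python
def percent_agreement(a, b, c, d):
    """Agreement of the row test with the column test.

    a: both positive          b: row +, column -
    c: row -, column +        d: both negative
    """
    ppa = a / (a + c)  # among column-positive subjects, fraction row-positive
    npa = d / (b + d)  # among column-negative subjects, fraction row-negative
    return ppa, npa

# Hypothetical counts: rows = new test, columns = non-reference standard
a, b, c, d = 40, 5, 10, 145

new_vs_comp = percent_agreement(a, b, c, d)
# Swapping rows and columns exchanges the two off-diagonal cells
comp_vs_new = percent_agreement(a, c, b, d)

print("new test with non-reference standard:", new_vs_comp)
print("non-reference standard with new test:", comp_vs_new)
```

Because the denominators change when the roles are swapped, a report should state which test supplies the column totals.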

One major disadvantage with agreement measures is that agreement is not a measure of “correctness.” Two tests could agree and both be wrong. In fact, two tests could agree well, but both have poor sensitivity and specificity. However, when two tests disagree, that does not mean that the new test is wrong and the comparative method is right.

One should also be aware that measures of overall agreement (including both overall percent agreement and Cohen’s Kappa) can be misleading in this setting. In some situations, overall agreement can be good when either positive or negative percent agreement is very low. For this reason, FDA discourages the stand-alone use of measures of overall agreement to characterize the diagnostic performance of a test.
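The point can be illustrated with a small arithmetic sketch in Python. The counts below are hypothetical and chosen so that almost all subjects are negative by both methods; overall percent agreement is then very high even though positive percent agreement is low.

```python
# Hypothetical 2x2 counts: rows = new test, columns = comparative method
a = 2    # both positive
b = 3    # new test +, comparative method -
c = 5    # new test -, comparative method +
d = 990  # both negative

n = a + b + c + d
overall = (a + d) / n  # overall percent agreement
ppa = a / (a + c)      # positive percent agreement
npa = d / (b + d)      # negative percent agreement

print(f"overall agreement: {100 * overall:.1f}%")
print(f"positive percent agreement: {100 * ppa:.1f}%")
print(f"negative percent agreement: {100 * npa:.1f}%")
```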

There has been much statistical research on how to estimate diagnostic accuracy of a new test when a reference standard is not available or does not exist. Albert and Dodd (2004), Pepe (2003), and Zhou et al. (2002) provide reviews of some of this research, which includes use of latent class models and Bayesian models. These model-based approaches can be problematic for the purpose of estimating sensitivity and specificity because it is often difficult to verify that the model and assumptions used are correct. More troublesome is that different models can fit the data equally well, yet produce very different estimates of sensitivity and specificity. For these types of analyses, FDA recommends reporting a range of results for a variety of models and assumptions. FDA also recommends you consult with a CDRH statistician before using these approaches.

4.2 Selecting the Study Population

In addition to choosing an appropriate comparative benchmark, evaluating a new test also involves choosing an appropriate set of:

  • subjects or specimens to be tested
  • individuals and laboratories to perform the tests
  • conditions under which the tests will be conducted.

Spectrum bias

Estimates of diagnostic accuracy are subject to spectrum bias when the subjects included in the study do not cover the complete spectrum of patient characteristics; that is, important patient subgroups are missing. See Begg (1987), Pepe (2003), or Zhou et al. (2002). For example, there are studies that include only very healthy subjects and subjects with severe disease, omitting the intermediate and typically more difficult cases to diagnose. The accuracy measures reported from these studies are subject to spectrum bias.

Eliminating the difficult cases produces an overly optimistic picture of how the device performs in actual use. Therefore, FDA recommends the set of subjects and specimens to be tested include:

  • subjects/specimens across the entire range of disease states
  • subjects/specimens with relevant confounding medical conditions
  • subjects/specimens across different demographic groups.

If the set of subjects and specimens to be evaluated in the study is not sufficiently representative of the intended use population, the estimates of diagnostic accuracy can be biased.

External validity

A study has high external validity if the results from the study are sufficiently reflective of the “real world” performance of the device in the intended use population. Selection of the appropriate set of subjects and/or specimens is not in itself sufficient to ensure high external validity. Although detailed discussion of external validity is beyond the scope of this document, FDA generally recommends:

  • using the final version of the device according to the final instructions for use
  • using several of these devices in your study
  • including multiple users with relevant training and range of expertise
  • covering a range of expected use and operating conditions.

See Rothwell (2006) for a non-technical discussion in the context of randomized trials.

5. Reporting Recommendations

Similar reporting principles apply to any study evaluating a diagnostic test, regardless of whether the comparative benchmark is a reference standard.

Reporting the context of the study

Performance measures should be interpreted in the context of the study population and study design. Sensitivity and specificity cannot be interpreted by themselves; additional information is needed. For example, estimated sensitivity and specificity of the same test can differ from study to study, depending on the types of subjects included in the study and whether an obsolete reference standard is used versus a reference standard currently accepted by the clinical community.

Before presenting results, FDA recommends you describe or define the:

  • intended use population
  • study population
  • condition of interest (precise definition of condition explaining how those subjects with the condition of interest are distinguished from those without).

FDA also recommends you discuss:

  • the rationale for the choice of designated comparative benchmark
  • the strengths and limitations likely to result from selection of that benchmark.

Defining the conditions of use

FDA recommends you define the conditions of use under which the candidate test and the reference standard or comparative method are performed. These may include:

  • operator experience
  • clinical laboratory facility or other test setting
  • controls applied
  • specimen acceptance criteria.

Descriptions of comparative results and methods

FDA recommends you include in your results a clear description of all methods used and how and what data were collected, such as:

  • subject recruitment procedures
  • subject demographics
  • subject and specimen inclusion and exclusion criteria
  • specimen collection procedures
  • time of specimen collection and testing
  • types of specimens collected
  • number of specimens collected and tested and number discarded
  • number of specimens included in final data analysis
  • specimen collection devices (if applicable)
  • specimen storage and handling procedures.

Reporting study results

FDA recommends you report all results by

  • clinical site or specimen collection site,
  • specimen testing or processing site, and
  • relevant clinical and demographic subgroups.

FDA recommends you report tabular comparisons of the candidate test outcome to the reference standard or comparative method. (For example, we recommend you report the 2x2 table of results such as those in the Appendix.)

FDA recommends you report measures of diagnostic accuracy (sensitivity and specificity pairs, positive and negative likelihood ratio pairs) or measures of agreement (percent positive agreement and percent negative agreement) and their two-sided 95 percent confidence intervals. We recommend reporting these measures both as fractions (e.g., 490/500) and as percentages (e.g., 98.0%). The Appendices contain a numerical example.
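A minimal Python sketch of this reporting format follows. It prints an estimate as a fraction, a percentage, and a two-sided 95 percent confidence interval. The Wilson score interval is used here as one common choice for a binomial proportion; that choice, the helper names, and the label are assumptions of this sketch and are not prescribed by the text above.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion.

    z = 1.959964 is the standard normal quantile for 95% coverage.
    """
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def report(label, successes, n):
    """Format an estimate as fraction, percentage, and 95% interval."""
    lo, hi = wilson_interval(successes, n)
    return (f"{label}: {successes}/{n} = {100 * successes / n:.1f}% "
            f"(95% CI {100 * lo:.1f}% to {100 * hi:.1f}%)")

line = report("Sensitivity", 490, 500)
print(line)
```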

Underlying quantitative result

For qualitative tests derived from an underlying quantitative result, FDA recommends you provide descriptive summaries that include:

  • ranges of results
  • histograms of results by condition status (if known)
  • Receiver Operating Characteristic (ROC) Plots (if condition status is known).

The CLSI document GP10 Assessment of the Clinical Accuracy of Laboratory Tests Using Receiver Operating Characteristic (ROC) Plots provides further guidance on this topic.

Accounting of subjects and test results

FDA recommends you provide a complete accounting of all subjects and test results, including:

  • number of subjects planned to be tested
  • number tested
  • number used in final analysis
  • number omitted from final analysis.

FDA recommends you provide the number of ambiguous results for candidate tests, stratified by reference standard outcome or comparative outcome.

Reporting intended use population results separately

FDA recommends you report results for those subjects in the intended use population separately from other results. It may be useful to report comparative results for subjects who are not part of the intended use population, but we recommend they not be pooled together. For example, if healthy individuals are not part of the intended use population, we recommend those results be reported separately from results for the intended use population. Results from patients outside the intended use population should not be labeled as “specificity.” The term specificity is appropriate to describe how often a test is negative only in subjects from the intended use population for whom the target condition is absent.

Rare condition of interest

When the condition of interest is rare, studies are sometimes enriched with reference standard positive subjects, potentially making the results inappropriate for pooling with other positive results. We recommend you consult with FDA on this issue.

Archived collections

If your test is evaluated using specimens retrospectively obtained from archived collections, sensitivity and specificity claims may or may not be appropriate. These claims may be appropriate if the archived specimens are representative of specimens from subjects in the intended use population, with and without the target condition, including unclear cases. FDA recommends you provide a description of the results, indicating:

  • the nature of the specimens studied
  • how the target condition status was determined
  • the limitations introduced through selective sampling.

6. Statistically Inappropriate Practices

Some common practices for reporting results are statistically inappropriate because they are misleading or can lead to inaccurate estimates of test performance. These practices most often arise when a new test is compared to a comparative method other than a reference standard.

Comparing a new test to a non-reference standard does not yield true performance. If the new test is better than the non-reference standard, the agreement will be poor. Alternatively, the agreement could be poor because the non-reference standard is fairly accurate and the new test is inaccurate. There is no statistical solution to determining which scenario is the true situation.
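The following Python sketch illustrates why the two scenarios cannot be told apart from agreement alone. It computes the expected agreement of a new test with a non-reference standard from assumed accuracies of both tests. All input values are hypothetical, and the calculation assumes the two tests err independently given the true condition status, which may not hold in practice.

```python
def expected_agreement(prev, sens_new, spec_new, sens_comp, spec_comp):
    """Expected agreement of the new test with the comparative method.

    Assumes errors of the two tests are independent given true status.
    """
    both_pos = (prev * sens_new * sens_comp
                + (1 - prev) * (1 - spec_new) * (1 - spec_comp))
    both_neg = (prev * (1 - sens_new) * (1 - sens_comp)
                + (1 - prev) * spec_new * spec_comp)
    comp_pos = prev * sens_comp + (1 - prev) * (1 - spec_comp)
    ppa = both_pos / comp_pos        # positive percent agreement
    npa = both_neg / (1 - comp_pos)  # negative percent agreement
    return ppa, npa

# Scenario 1: perfect new test, imperfect non-reference standard
perfect_new = expected_agreement(0.10, 1.00, 1.00, 0.80, 0.90)
# Scenario 2: inaccurate new test, fairly accurate non-reference standard
poor_new = expected_agreement(0.10, 0.50, 0.95, 0.95, 0.99)

print("perfect new test:", perfect_new)
print("poor new test:   ", poor_new)
```

With these inputs, positive percent agreement is below 50 percent in both scenarios, even though the new test is perfect in the first and inaccurate in the second.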

When comparing a new test to a non-reference standard, FDA makes the following recommendations regarding four common practices that we believe give misleading or incorrect results.

1. Avoid use of the terms “sensitivity” and “specificity” to describe the comparison of a new test to a non-reference standard

When a new test is evaluated by comparison to a non-reference standard, it is impossible to calculate unbiased estimates of sensitivity and specificity. In addition, quantities such as positive predictive value, negative predictive value, and the positive and negative likelihood ratios cannot be computed since the subjects’ condition status (as determined by a reference standard) is unknown.

For this reason, FDA recommends you report

  • the 2x2 table of results comparing the new test with the non-reference standard
  • a description of the non-reference standard
  • measures of agreement and corresponding confidence intervals.

FDA recommends the use of the terms positive percent agreement and negative percent agreement with the non-reference standard to describe these results. Agreement measures are discussed in more detail in the Appendices.

2. Avoid elimination of equivocal results

If a test can (per the test instructions) produce a result which is anything other than positive or negative then it is not technically a qualitative test (since more than two outcomes are possible). In that case the measures described in this guidance do not directly apply. Discarding or ignoring these results and performing the calculations in this guidance will likely result in biased performance estimates.

To address this issue, one option is to report two different sets of performance measures

  • one set of measures based on including the equivocal results with the test positive results
  • a second set of measures based on including the equivocal results with the test negative results.

This may or may not be reasonable for your situation. FDA recommends you consult with FDA statisticians on how to handle these types of results.
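As a sketch of the two-set option, the Python below collapses a hypothetical 3x2 table (new test positive, equivocal, or negative versus the comparative method) in both directions and computes the column-wise measures each time. The counts and function names are illustrative only.

```python
def column_measures(pos_pos, pos_neg, neg_pos, neg_neg):
    """Fraction test-positive among comparator-positive subjects, and
    fraction test-negative among comparator-negative subjects."""
    return pos_pos / (pos_pos + neg_pos), neg_neg / (pos_neg + neg_neg)

# Hypothetical counts: new test result -> (comparator +, comparator -)
counts = {"+": (44, 1), "equivocal": (4, 5), "-": (2, 144)}

def collapse(counts, equivocal_as):
    """Fold the equivocal row into the '+' or the '-' row."""
    pos, neg = list(counts["+"]), list(counts["-"])
    target = pos if equivocal_as == "+" else neg
    target[0] += counts["equivocal"][0]
    target[1] += counts["equivocal"][1]
    return pos[0], pos[1], neg[0], neg[1]

as_positive = column_measures(*collapse(counts, "+"))
as_negative = column_measures(*collapse(counts, "-"))

print("equivocal counted as positive:", as_positive)
print("equivocal counted as negative:", as_negative)
```

Reporting both sets shows how far the estimates can move depending on how equivocal results are treated, without discarding any subject.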

3. Avoid the use of outcomes altered or updated by discrepant resolution

You should not use outcomes that are altered or updated by discrepant resolution to estimate the sensitivity and specificity of a new test or agreement between a new test and a non-reference standard.

When a new test is evaluated by comparison to a non-reference standard, discrepancies (disagreement) between the two methods may arise because of errors in the test method or errors in the non-reference standard. Since the non-reference standard may be wrong, calculations of sensitivity and specificity based on the non-reference standard are statistically biased. A practice called discrepant resolution has been suggested to get around the bias problem.

As the name implies, discrepant resolution focuses on subjects where there is a discrepancy; that is, where the new test and the non-reference standard disagree. In the simplest situation, discrepant resolution can be described as a two-stage testing process:

  • Stage 1: Testing all subjects using the new test and the non-reference standard
  • Stage 2: When the new test and non-reference standard disagree, using a resolver (a reference standard or a second non-reference standard) to see which one is “right.”

A numerical example describing discrepant resolution appears in the Appendix. If the resolver is a reference standard, this process provides the condition status for the subjects re-tested with the resolver, but it does not provide the condition status for subjects when the new test agrees with the non-reference standard (usually most of the subjects). Even when the new test and non-reference standard agree, they may both be wrong.

FDA does not recommend the process used by some investigators whereby the resolver is used to revise the original 2x2 table of results (new test versus non-reference standard). We believe the original 2x2 table is inappropriately “revised” in this method because:

  • when the original two results agree, you assume (without supporting evidence) that they are both correct and do not make any changes to the table
  • when the original results disagree, and the non-reference standard disagrees with the resolver, you reclassify (change) the non-reference standard result to the resolver result.

The revised 2x2 table based on discrepant resolution is misleading because the columns are not clearly defined and do not necessarily represent condition status, as assumed. The assumption that results that agree are correct is not tested and may be far from valid. FDA recommends you do not present such a table in your final analysis because it may be very misleading. Because the calculations of sensitivity and specificity from such a revised 2x2 table are not valid estimates of performance, they should not be reported.
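To see why the revised table is misleading, the following Python sketch applies the reclassification rule just described to a hypothetical table. It is shown only to illustrate the mechanics of the practice FDA advises against; the counts and the function name are made up for illustration and are not the Appendix example.

```python
def revise_by_discrepant_resolution(a, b, c, d, b_resolved_pos, c_resolved_neg):
    """Build the 'revised' 2x2 table (a practice the text advises against).

    a: both positive                  b: new test +, non-reference -
    c: new test -, non-reference +    d: both negative
    b_resolved_pos: subjects in b that the resolver called positive
    c_resolved_neg: subjects in c that the resolver called negative
    Concordant cells (a, d) are never re-examined.
    """
    return (a + b_resolved_pos, b - b_resolved_pos,
            c - c_resolved_neg, d + c_resolved_neg)

original = (40, 5, 4, 151)  # hypothetical counts
revised = revise_by_discrepant_resolution(*original,
                                          b_resolved_pos=3, c_resolved_neg=2)

def apparent(a, b, c, d):
    """Column-wise proportions computed from a 2x2 table."""
    return a / (a + c), d / (b + d)

print("original table:", original, apparent(*original))
print("revised table: ", revised, apparent(*revised))
```

Subjects can only move from discordant cells into concordant cells, so the apparent values computed from the revised table can only rise or stay the same, regardless of how the new test actually performs.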

FDA is not aware of any scientifically valid ways to estimate sensitivity and specificity by resolving only the discrepant results, even when the resolver is a reference standard. To obtain unbiased estimates of sensitivity and specificity, FDA believes

  • the resolver must be a reference standard, and
  • you must resolve at least a subset of the concordant subjects.

Discrepant resolution with a reference standard resolver can tell you whether the new test or the non-reference standard is right more of the time, but you cannot quantify how much more. If the resolver is not a reference standard, the resolver test results can provide little or no useable information about the performance of the new test. Resolving discrepancies using repeat testing by the new test or the non-reference standard also does not provide any useful information about performance.

4. Avoid comparison of the results of a new test to the outcome of a testing algorithm that combines several comparative methods (non-reference standards), if the algorithm uses the outcome of the new test

When evaluating some types of tests, the comparative “procedure” is not a single test, but the outcome of a combination of several comparative methods and possibly clinical information. Often, two or more comparative methods are performed and interpreted according to a pre-specified testing sequence or algorithm to determine condition status.

The decision to use a second or third comparative method may depend on the outcome of the initial comparative method. This approach may be statistically reasonable. However, FDA believes this approach is not valid if the algorithm uses the outcome of the new unproven test. For example, the decision to use an additional comparative method should not be based on whether the new test is positive or negative.

FDA believes it is potentially misleading to establish the performance of a new test by comparing it to a procedure that incorporates the same new test. Any non-reference standard created in this manner will likely be biased in favor of the new test; that is, it will tend to produce overestimates of agreement of the new test with the non-reference standard.

In summary, when reporting results from a study evaluating a diagnostic test, FDA believes it is inappropriate to:

  • use the terms “sensitivity” and “specificity” to describe the comparison of a new test to a non-reference standard
  • discard equivocal new test results when calculating measures of diagnostic accuracy or agreement
  • use outcomes that are altered or updated by discrepant resolution to estimate the sensitivity and specificity of a new test or agreement between a new test and a non-reference standard
  • compare the results of a new test to the outcome of a testing algorithm that combines several comparative methods (non-reference standards), if the algorithm uses the outcome of the new test.

7. Appendices

7.1 Calculating Estimates of Sensitivity and Specificity

Sensitivity and specificity are basic measures of performance for a diagnostic test. Together, they describe how well a test can determine whether a specific condition is present or absent. They each provide distinct and equally important information, and FDA recommends they be presented together:

  • Sensitivity refers to how often the test is positive when the condition of interest is present
  • Specificity refers to how often the test is negative when the condition of interest is absent.

Note that a diagnostic test where sensitivity equals [1– specificity] has no diagnostic value. That is, if the percent of subjects with positive test results when the condition is present (sensitivity) is the same as the percent of subjects with positive test results when the condition is absent (1– specificity), then the new test outcome is unaffected by the condition of interest, and it has no diagnostic value for that condition of interest. However, a test where both sensitivity and specificity are close to 1 has good diagnostic ability.
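A short Python sketch of this point follows, using hypothetical counts. It estimates sensitivity and specificity from a 2x2 table against a reference standard and then computes the positive and negative likelihood ratios using their standard definitions. When sensitivity equals 1 − specificity, both ratios equal 1, so a test result does not change the odds that the condition is present.

```python
def sens_spec(tp, fp, fn, tn):
    """Sensitivity and specificity from counts against a reference standard."""
    return tp / (tp + fn), tn / (tn + fp)

def likelihood_ratios(sens, spec):
    """Positive and negative likelihood ratios (standard definitions)."""
    return sens / (1 - spec), (1 - sens) / spec

# Hypothetical uninformative test: 30% positive whether or not
# the condition is present
sens, spec = sens_spec(tp=30, fp=90, fn=70, tn=210)
lr_pos, lr_neg = likelihood_ratios(sens, spec)

print(f"sensitivity = {sens:.2f}, 1 - specificity = {1 - spec:.2f}")
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```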

Usually, to estimate sensitivity and specificity, the outcome of the new test is compared to the reference standard using subjects who are representative of the intended use (both condition present and condition absent) population.

We assume throughout that your study data do not include multiple samples from single patients. If you do have such data, we recommend that you consult with FDA statisticians on the appropriate calculation methods.

Results are typically reported in a 2x2 table such as Table 1.

The new test has two possible outcomes, positive (+) or negative (−). Subjects with the condition of interest are indicated as reference standard (+), and subjects without the condition of interest are indicated as reference standard (−).


Time to event model

At a randomly placed motion-triggered camera, the time until the target species is detected is a function of abundance, movement rate, and detectability (Jennelle et al. 2002, Parsons et al. 2017). The TTE model uses this fact to estimate abundance from observations of the time (starting from any arbitrary moment) until an animal is first detected. In this framework, we separate the two components of the detection process: availability and perception. Frequently, in camera trap literature, these two processes are combined so that detection is defined as the probability of detecting an animal given it is in a plot that is sampled by a camera (Burton et al. 2015). We do away with the idea of sampling an entire plot with a camera and instead focus only on the area within the camera's viewshed. In this way, we reduce the definition of detection probability to the probability that an animal is captured by a motion-triggered camera, given the animal is in the camera's viewshed. We begin by formulating the TTE model assuming perfect detection, and then, we present an extension to account for variable detectability.

In a TTE framework, we are interested in estimating λ by observing T, the number of sampling periods until the first animal encounters the camera. For a single observation of TTE $T_{ij}$ at camera i = 1, 2, …, M and sampling occasion j = 1, 2, …, J, we record the first sampling period k = 1, 2, …, K in which we observe the species of interest (Fig. 2). For example, at camera 1 on day 1, if we first observe the species of interest in the third sampling period, we record the TTE $T_{11} = 3$. If we do not observe an animal by the end of the Kth sampling period, the TTE must be longer than our observation time, so we right-censor this occasion (Muenchow 1986, Pyke and Thompson 1986, Castro-Santos and Haro 2003, Bischof et al. 2014). An example encounter history at camera 1 with J = 5 and K = 24 for each sampling occasion may look like $T_{1j}$ = <NA, 23, NA, NA, 5>, where a right-censored sampling occasion is represented by NA.
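As a rough illustration of how such an encounter history carries information about λ, the Python sketch below applies the textbook maximum likelihood estimator for exponentially distributed times with right-censoring: the number of observed events divided by the total observed time, where each censored occasion contributes K periods. This is a simplified stand-in under that exponential assumption, not the authors' implementation, and it stops at the per-period rate for a single camera without converting to abundance.

```python
def censored_exponential_rate(times, censor_at):
    """MLE of the rate for exponential times with right-censoring.

    times: observed times to event, with None for right-censored occasions
    censor_at: number of sampling periods K observed before censoring
    """
    events = sum(t is not None for t in times)
    exposure = sum(censor_at if t is None else t for t in times)
    return events / exposure

# Encounter history from the text: J = 5 occasions, K = 24 periods, NA -> None
T_1 = [None, 23, None, None, 5]
rate = censored_exponential_rate(T_1, censor_at=24)

print(f"estimated rate: {rate:.3f} events per sampling period")
```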

We estimate the sampling variance of $\hat{N}$ using the properties of maximum likelihood theory and the delta method (Mood et al. 1974, Oehlert 1992, Williams et al. 2002). R code for implementing the TTE model is given in Data S1 and Appendix S1.

where $\Gamma(z + 1)$ is the gamma function, $\Gamma(z + 1) = z!$, and $\gamma(z + 1, \lambda_i T_{ij})$ is the lower incomplete gamma function, $\gamma(z + 1, \lambda_i T_{ij}) = \int_0^{\lambda_i T_{ij}} t^{z} e^{-t}\,dt$. Further development of the geometric-gamma formulation is needed because there is a near singularity in the Hessian matrix. In this paper, we demonstrate the TTE model formulated under perfect detection only.

Space to event model

Because the TTE model requires estimates of movement rate in order to set the sampling periods appropriately, we developed the STE model that does not require this auxiliary information. The STE model is conceptually similar to the TTE model, but we collapse each sampling occasion to an instantaneous sample. Because of this, the estimates are independent of animal movement rate. As with the TTE model, we begin by modeling the number of animals in view of a camera using the Poisson distribution (Eq. 1). However, instead of observing the time until we observe an animal using the exponential distribution, we can instead collect data on the amount of space S between animals. As with the TTE model, when events of interest are Poisson-distributed, the interval S (of space in this case) between them is exponentially distributed.

In order to estimate the amount of space between animals, we take observations of random areas of the landscape at an instant in time. If an observer were to repeatedly draw random areas of the landscape until they found an animal, the so-called STE would be the total area sampled until that point. Because the sample is instantaneous, the mean STE E[S] = 1/λ depends only on the number of animals, so we can estimate density without any further constraints. When using time-lapse photographs, detection probability is defined as the probability that an animal is captured and correctly identified given it is in the camera's viewshed. As with the TTE model, we develop this model assuming perfect detection, which we address further in the discussion.

To observe S in practice, we randomly deploy time-lapse cameras that take photographs at pre-defined times. As opposed to the TTE model, where we split sampling occasions into sampling periods, we now define sampling occasions as a single instant in time. At each sampling occasion j = 1, 2, …, J (e.g., every 1 h), we observe a snapshot of the number of animals in view of each camera. We record the STE as the total area sampled before an animal is first observed.

As an example, we examine all the photographs taken at a single time (j = 1). We first calculate the area of each camera following Eq. 5. Since we are using time-lapse cameras instead of motion-sensor cameras, the maximum distance r is defined by field landmarks rather than the trigger distance. After randomly ordering the cameras, we look through the photographs until we find the first animal detection. If camera 1 (with area $a_1$) contains at least one animal, we record the space to first event $S_{j=1} = a_1$. If, instead, camera 1 is empty but camera n contains at least one animal, we record $S_{j=1} = a_1 + a_2 + \ldots + a_n$ (Fig. 3). An example encounter history with J = 5 and average camera area $\bar{a}$ = 30 m² may look like $S_j$ = <180 m², 30 m², NA, 300 m², NA>, where a right-censored sampling occasion is represented as NA.
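The procedure for a single sampling occasion can be sketched in Python as follows. The function name, the camera counts, and the fixed camera order are hypothetical; in practice the order would be drawn at random for each occasion (for example with `random.sample`), and it is passed in explicitly here only so the example is reproducible.

```python
def space_to_event(areas, counts, order):
    """Space to first event for one instantaneous sampling occasion.

    areas: viewshed area of each camera (m^2)
    counts: number of animals photographed by each camera at this instant
    order: camera indices in the (randomly drawn) order they are examined
    Returns the cumulative area up to and including the first camera with
    an animal, or None if no camera detected one (right-censored).
    """
    total = 0.0
    for cam in order:
        total += areas[cam]
        if counts[cam] > 0:
            return total
    return None

# Hypothetical occasion: ten cameras of 30 m^2 each, animals at cameras 5 and 7
areas = [30.0] * 10
counts = [0, 0, 0, 0, 0, 2, 0, 1, 0, 0]
order = list(range(10))  # stand-in for a random permutation

s_1 = space_to_event(areas, counts, order)
s_empty = space_to_event(areas, [0] * 10, order)

print(s_1, s_empty)
```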

Instantaneous sampling estimator

We can convert to abundance following Eq. 6.


All three abundance estimators assume demographic and geographic closure of the study area, random camera placement, and independent observations of animals. To meet the closure assumptions, an appropriate sampling frame and time should be chosen during which the population is closed to birth, death, immigration, and emigration. While the models assume demographic and geographic closure on the level of the sampling frame, it is important to note that they do not assume geographic closure at the plot level, as N-mixture models do (Royle 2004). Second, cameras should be deployed randomly across the landscape rather than targeting features such as roads or trails (Rowcliffe et al. 2013). Animals should be neither attracted to nor repelled by the cameras, so sites should be unbaited and minimally disturbed. Next, detections of animals are assumed to be independent in space and time. As long as cameras are randomly deployed, the properties of random sampling mean that animals captured at one camera are not any more or less likely to step in front of the next camera. However, it is slightly more difficult to address independence of animal detections at a single camera. We should consider animal behavior when defining sampling occasions and leave enough time for animals to redistribute across the landscape. We can help address independence of detections by selecting sampling occasions randomly or systematically, but we still may see autocorrelation across observations. In these cases, bootstrapping may help to appropriately estimate the variance.

The TTE and STE models assume that animals follow a Poisson distribution at the spatial scale of the camera. If animals are clumped due to landscape features, we could incorporate covariates on λ to help address extra variance on the landscape. Additionally, the TTE model requires an independent estimate of the average amount of time for an animal to move through the camera area. These estimates can be obtained through auxiliary data like global positioning system (GPS) collars.

All models are currently formulated under the assumption that detection probability is 1. When using time-lapse photographs, as in the STE and IS methods, this may be fairly reasonable. The cameras take photographs at specified intervals regardless of whether they detect an animal. As long as the view in front of the camera is appropriately clear and photograph viewers are consistently trained, the photographs reflect a true capture history of animal presence and absence. On the other hand, motion-sensor cameras pose a larger issue for detection. Because detection probability decreases with distance (Rowcliffe et al. 2011, Howe et al. 2017), the user may want to use only those photographs with animals a short distance from the camera so they can assume perfect detection probability. We encourage future work into extending these models when P < 1.


We performed mechanistic simulations to evaluate the estimates of abundance and the variances of those estimates for all three models. We simulated slow and fast movement rates for populations of 10 animals and 100 animals. Every individual took an independent uncorrelated random walk for 1000 steps with fixed step lengths (length 1 for the slow population and length 3 for the fast population) and random turning angles, bounded within a 30 × 30 unit area. Animals were captured at a given time in any of the 10 randomly placed 1 × 1 unit square cameras if their coordinates fell within the camera's coordinates, inclusive of two borders.
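A simplified Python sketch of this kind of simulation is given below. Several details are assumptions of the sketch rather than the authors' settings: animals reflect off the boundary of the 30 × 30 area, "inclusive of two borders" is read as including the lower and left edges of each camera square, and starting positions and camera locations are drawn uniformly. The function and variable names are hypothetical.

```python
import math
import random

def reflect(v, size):
    """Fold a coordinate back into [0, size] (assumed boundary rule)."""
    if v < 0:
        return -v
    if v > size:
        return 2 * size - v
    return v

def simulate(n_animals, step_len, n_steps=1000, size=30.0, n_cameras=10, seed=0):
    """Uncorrelated random walks; returns per-step camera counts and final positions."""
    rng = random.Random(seed)
    # Lower-left corners of 1 x 1 unit square cameras
    cams = [(rng.uniform(0, size - 1), rng.uniform(0, size - 1))
            for _ in range(n_cameras)]
    animals = [(rng.uniform(0, size), rng.uniform(0, size))
               for _ in range(n_animals)]
    counts = []
    for _ in range(n_steps):
        moved = []
        for x, y in animals:
            angle = rng.uniform(0, 2 * math.pi)  # random turning angle
            moved.append((reflect(x + step_len * math.cos(angle), size),
                          reflect(y + step_len * math.sin(angle), size)))
        animals = moved
        # Count animals inside each camera, lower and left edges inclusive
        counts.append([sum(cx <= x < cx + 1 and cy <= y < cy + 1
                           for x, y in animals)
                       for cx, cy in cams])
    return counts, animals

# Slow population of 10 animals, shortened to 200 steps for a quick run
counts, final_positions = simulate(n_animals=10, step_len=1.0, n_steps=200)
sampled = counts[::10]  # every tenth time step, as for the IS and STE methods
print(len(sampled), "sampled time steps")
```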

For the IS and STE methods, we created encounter histories based on the number of animals in each camera at every tenth time step. For the IS method, we used the count of animals in the cameras at each sampled time. For the STE method, we created a randomly ordered list of the cameras at each sampled time step and recorded the number of the first camera that contained at least one animal during that time. For the TTE model, we sampled animals during 10-step sampling occasions (each step represented a sampling period in the sampling occasion). We left 10 steps between the end of one sampling occasion and the beginning of the next. At each camera and sampling occasion, we recorded the number of steps until the first animal was caught in the camera.

We ran 1000 simulations for each combination of step lengths (fast and slow) and population size (10 and 100). We calculated standard error on each estimate using the analytical standard error formulas and delta method. To verify our estimates of standard error, we also calculated the standard deviation of the abundance estimates from the repeated simulations.

Case study: estimating elk abundance

To demonstrate use of these methods in the field, we used a dataset from 80 remote cameras deployed during February 2016. We deployed these cameras in the Beaverhead Mountains of Idaho on a mix of public and private land, with permission from the landowners. The study area was defined by a 2 km buffer around 3525 GPS locations from 18 December 2014 to 20 March 2015 from 33 calf and female elk. This area was characterized by high desert grass–sagebrush communities and windswept hills, where elk movement was mostly unrestrained by topography or dense vegetation. We compared estimates from our models against abundance estimates from a February 2009 aerial survey that was conducted in this area and corrected for sightability bias (Samuel et al. 1987, 1992). We recognize that this is not true abundance, but it serves as a ballpark figure against which to compare our estimates.

Within the sampling frame, we randomly selected nine plots using generalized random-tessellation stratified (GRTS) sampling (Stevens and Olsen 2004) with the R package spsurvey (Kincaid and Olsen 2017, R Core Team 2015). Generalized random-tessellation stratified sampling allows the user to replace plots in the ordered sample, so we replaced two plots due to lack of accessibility during winter and/or lack of landowner approval (Stevens and Olsen 2004). We divided each 1.5 × 1.5 km plot into nine equal sections and systematically placed one camera in each section. Within the bounds of the 500 × 500 m sub-plot, we attempted to place the camera to maximize capture of elk. However, in this study area, a 500 × 500 m sub-plot is relatively homogeneous, and field observations suggested that elk movement was fairly unrestrained at this scale. Thus, while subjective placement at the sub-plot level does not adhere to perfect random sampling, we do not believe we violated the assumption severely. When placing cameras, we made sure that no two cameras were on the same road, trail, or ridge, in order to reduce autocorrelation across cameras. Due to lack of trees, we placed the cameras on T-posts at an approximate height of 4–5 feet. We pointed cameras north to limit direct sunlight in the frame and cleared any vegetation obstructing the camera's view. The infrared flash, motion-triggered cameras (models HC600, PC800, and PC900; Reconyx, Holmen, Wisconsin, USA) had high trigger sensitivity and took bursts of five pictures with no delay between trigger activations. In addition to the motion trigger, cameras took pictures every five minutes from 06:00 to 18:00.

We calculated the visible camera area by camera specifications (TrailcamPro 2017) using Eq. 5. We based visible camera area on the Reconyx HC600 model, letting θ = 42°. The cameras had long, unimpeded views, so we set r = 50 m to reduce misclassification and miscounting. We only counted elk within that distance, which we identified by flagging placed in the field.
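As a rough check on the viewshed size, the sketch below assumes that Eq. 5 (not reproduced in this section) is the area of a circular sector with angle θ and radius r. The function name is illustrative only.

```python
import math

def sector_area(theta_deg: float, radius_m: float) -> float:
    """Area (m^2) of a circular sector; this is the ASSUMED form of Eq. 5."""
    return math.pi * radius_m ** 2 * theta_deg / 360.0

# Values stated in the text: theta = 42 degrees, r = 50 m.
camera_area_m2 = sector_area(42.0, 50.0)
print(f"Visible area per camera: {camera_area_m2:.1f} m^2")
```

Under that assumption, each camera covers a little over 900 m², which is small relative to a 500 × 500 m sub-plot.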

For the TTE model, we estimated the sampling period length as 1 h. We estimated the distance across a camera as 30 m and calculated elk speed from 1746 locations from 53 GPS collars in the Beaverhead area from January 2015 (IDFG, unpublished data). Median elk speed was approximately 30 m/h (including times of foraging and rest), so we set the sampling period length to one hour. At each camera, we sampled for four hours (four 1-h sampling periods), beginning every eight hours throughout February 2016. On each sampling occasion, we recorded the first period in which an elk was detected. If no elk were detected during a given sampling occasion, we right-censored that occasion.

For the STE model, we created a randomly ordered list of cameras and recorded the first camera that detected elk at each sampling occasion. Although the sampling should be instantaneous, we defined the sampling period as 1 min to ensure we had enough detections. Any photographs of elk during that one minute counted as a detection. We sampled each camera for one minute every hour, from 1 February to 13 February 2016. We selected this time frame to ensure that elk were not migrating on or off winter range. If no cameras observed elk at a given sampling occasion, we right-censored that occasion.

For the IS estimator, we counted all visible elk in a subset of photographs taken between 1 February and 29 February 2016. We used photographs taken on the hour, every hour, so as to reduce autocorrelation between samples. If no photograph was taken on a given occasion, we recorded the count as zero. Ideally, when using repeated fixed-area counts, the spatial replicates should be re-randomized each time. However, with the IS estimator, cameras are not redeployed at each sampling occasion j, so the variance estimator should account for potential correlation among counts at the same camera. Because most analytic variance estimators can be biased low when samples are correlated, we wanted to test the performance of our estimator against the estimate of standard error from a non-parametric bootstrap (Efron and Tibshirani 1993). We created 1000 new datasets by sampling the cameras with replacement and taking all counts at those cameras. We estimated abundance with each dataset. We estimated the standard error of N̂ with the standard deviation of these repeated estimates.
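The camera-level bootstrap described above can be sketched as follows. This is a minimal illustration, not the authors' code: the estimator `estimate_abundance` is a hypothetical stand-in (mean count per camera viewshed scaled to the study area), and the counts, camera area, and study area are made-up values.

```python
import random
import statistics

def estimate_abundance(counts_by_camera, camera_area, study_area):
    """Hypothetical stand-in estimator: mean count per viewshed, scaled to the study area."""
    all_counts = [c for counts in counts_by_camera for c in counts]
    return statistics.mean(all_counts) / camera_area * study_area

def bootstrap_se(counts_by_camera, camera_area, study_area, n_boot=1000, seed=1):
    """Resample whole cameras with replacement, keeping all counts from each camera."""
    rng = random.Random(seed)
    n = len(counts_by_camera)
    estimates = []
    for _ in range(n_boot):
        resampled = [counts_by_camera[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimate_abundance(resampled, camera_area, study_area))
    # The standard deviation of the bootstrap estimates approximates the standard error.
    return statistics.stdev(estimates)

# Made-up example: four cameras, three sampling occasions each.
counts_by_camera = [[0, 0, 1], [0, 2, 0], [0, 0, 0], [3, 0, 1]]
CAMERA_AREA_M2 = 916.3       # illustrative value
STUDY_AREA_M2 = 1_000_000.0  # illustrative value
se = bootstrap_se(counts_by_camera, CAMERA_AREA_M2, STUDY_AREA_M2)
print(f"Bootstrap SE of abundance: {se:.1f}")
```

Resampling cameras rather than individual counts keeps the counts from one camera together, which is what lets the bootstrap reflect correlation among counts at the same camera.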


Comparison of collection methods for native bees

Both the number of specimens collected and their taxonomic richness differed among the collection methods (Table 2). Targeted sweep netting was by far the most effective method for sampling bees with respect to both abundance and taxonomic unit richness, and blue vane traps were the next most effective in terms of absolute numbers (Table 2; Appendix S2: Table S1). However, when standardized to approximate an equal sampling duration to the other methods (3 h), blue vane traps caught a comparable number of bees to pan traps (Table 2).

Method                   Targeted sweep netting  Blue vane     Yellow vane  Blue pan trap  Yellow pan trap  Large yellow pan trap
Individuals caught       1324                    347 (3.86)‡   15 (0.17)‡   8              15               6
Taxonomic units caught†  134                     31 (0.34)‡    10 (0.11)‡   7              6                5
Genera caught            20                      11            7            4              3                2
Families caught          4                       4             4            3              3                2
  • † Given variation in body size between sexes (K. S. Prendergast, unpublished data), and known differences in color preferences between sexes (Heneberg and Bogusch 2014 ), for species where both sexes were collected, these were treated as distinct taxonomic units.
  • ‡ Numbers in parentheses are divided by 90 to standardize results to 3 h in order to quantitatively compare results with the other methods.
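
The standardization in the second footnote is a plain division. The short sketch below reproduces the parenthesized values in Table 2 from the raw vane trap counts; the dictionary keys are labels chosen here for illustration.

```python
# Divisor taken from the table footnote (standardizes vane trap catches to 3 h).
VANE_TRAP_DIVISOR = 90

raw_counts = {
    "blue vane, individuals": 347,
    "yellow vane, individuals": 15,
    "blue vane, taxonomic units": 31,
    "yellow vane, taxonomic units": 10,
}

standardized = {label: round(n / VANE_TRAP_DIVISOR, 2) for label, n in raw_counts.items()}
for label, value in standardized.items():
    print(f"{label}: {value}")
```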

Blue vane traps caught more individuals and taxonomic units than yellow vane traps, whereas yellow pan traps were more effective than blue pan traps. Large (non-UV) yellow pan traps were the least effective (Table 2).

There were significant differences in number of individual native bees caught between the different methods (P < 0.0001; Appendix S2: Table S1). All pairwise comparisons between targeted sweep netting and all passive methods were significantly different (P < 0.0001). All pairwise comparisons between blue vane traps and other methods were significantly different (P < 0.0001; Appendix S2: Table S2), but were not once vane trap data were standardized (P > 0.05; Appendix S5: Table S2). There was a significant method × habitat interaction (P < 0.0001; Appendix S2: Table S1), but the main findings of the superiority of targeted sweep netting were consistent across habitats (Fig. 2).

Taxonomic unit richness also differed between sampling methods (P < 0.0001; Appendix S2: Table S3; Appendix S3: Table S1), following a similar pattern to that for abundances (Appendix S2: Table S4). Targeted sweep netting caught over 90% of all taxonomic units (Table 2). Blue pan traps caught slightly more taxonomic units than yellow pan traps, but the difference was nonsignificant (Table 2; Appendix S2: Table S4). As with abundance, blue vane traps caught more taxonomic units overall than the other passive methods (Table 2; Appendix S2: Table S4), but not when catch rates were standardized to three hours (Table 2; Appendix S5: Table S4). There was no method × habitat interaction (P = 0.376; Appendix S2: Table S3).

Of the 145 taxonomic units (sexes counted separately), 43 had n ≥ 10, and all of these were collected at higher frequencies by targeted sweep netting except for four: Amegilla chlorocyanea (female; 196 blue vane, 17 targeted sweep netting, 2 yellow vane, and 1 blue pan trap); A. chlorocyanea (male; 68 blue vane and 9 targeted sweep netting); the kleptoparasite of Amegilla, Thyreus waroonensis (female; 11 blue vane and 2 targeted sweep netting); and Lasioglossum (Chilalictus) castor (female; 14 blue vane, 12 targeted sweep netting, 9 yellow pan trap, 4 yellow vane, and 2 blue pan trap; Appendix S3: Table S1).

No species were exclusive to large yellow pan traps or UV-blue or UV-yellow pan traps. Only two species, Lasioglossum (Chilalictus) sp.12 (female) and Braunsapis nitida (female), both singletons, were exclusive to yellow vane traps. Five taxonomic units were exclusive to blue vane traps (Lasioglossum (Chilalictus) lanarium [male], Lasioglossum (Chilalictus) inflatum [female], Homalictus (Homalictus) sphecodoides [female], all singletons, Euryglossula fultoni [male, n = 3], and L. (Chilalictus) lanarium [female, n = 4]). By contrast, 98 taxonomic units were captured exclusively by targeted sweep netting (Appendix S3: Table S1).

There was a significant sex × method interaction (P = 0.0002), indicating that the sexes were sampled differently depending on the method used (Appendix S2: Table S5).

Rarefaction curves and Chao estimates followed the same general pattern based on the observed numbers of taxonomic units by sampling method (Table 3; Appendix S4: Fig. S1). While the passive sampling methods followed a shallow incline with increasing sampling effort (Appendix S4: Fig. S1), the netting followed a curvilinear pattern and had yet to plateau (Appendix S4: Fig. S1), indicating that, despite high sampling effort, more taxonomic units would likely be recorded with increased sampling effort. Considering the taxonomic units captured as a percentage of the Chao 1 estimate, netting, large yellow pans, and blue pans had values above 70%, whereas the number collected in the blue vanes was only 55.6% of the estimated value, and for the yellow vane and yellow pan traps, taxonomic unit richness was only 46.7% and 44.5%, respectively, of the estimated value (Table 3). It should be noted that the confidence intervals of the Chao 1 estimates were relatively wide (Table 3).

Method                  Observed  Chao 1 mean  95% CI lower bound  95% CI upper bound  Chao 1 SD  %obs of Chao 1
Large yellow pans       4         5.2          4.12                16.5                2.14       76.9
Yellow pans             6         13.5         6.92                66.7                10.9       44.5
Blue pans               9         12.4         9.58                29.4                3.9        72.3
Yellow vanes            10        21.4         12.1                73.8                12.3       46.7
Blue vanes              32        57.5         39.4                119.9               17.9       55.6
Targeted sweep netting  134       181.5        154.6               243.5               21.2       73.8


  • CI, confidence interval; SD, standard deviation; %obs, the observed number of taxonomic units as a percentage of the number estimated by the Chao 1 analysis.

A Bray–Curtis similarity matrix of species composition revealed that of the five collection methods, pan traps of different colors were the most similar. Both blue and yellow vanes were more similar to blue pan traps than yellow pan traps. The most successful method—targeted sweep netting—had a species composition most similar to blue vane traps, but low similarity to the other methods (Table 4). An NMDS analysis comparing taxonomic composition between the methods had low stress (0.01), indicating a good fit to the data, and depicted that the two small UV-reflective pan traps were most similar to each other (Fig. 3). Taxonomic composition of the bees caught in large yellow pan traps was most dissimilar to all other methods. Targeted sweep netting was also dissimilar to all other methods, but most similar to blue vanes.

Method                  Targeted sweep netting  Blue vane  Yellow vane  Blue pan trap  Yellow pan trap  Large yellow pan trap
Targeted sweep netting
Blue vane               23.75
Yellow vane             5.68                    15.77
Blue pan trap           4.15                    21.36      24.53
Yellow pan trap         4.01                    15.04      22.31        30.58
Large yellow pan trap   2.53                    5.88       0.00         14.11          0.00
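
For readers unfamiliar with the index, the sketch below shows how a Bray–Curtis similarity between two methods can be computed from per-taxon counts. It assumes the matrix reports percentage similarity, 100 × (1 − dissimilarity), on untransformed counts; the section does not state whether the data were transformed first, and the example counts are made up.

```python
def bray_curtis_similarity(counts_a, counts_b):
    """Percentage Bray-Curtis similarity between two equal-length count vectors."""
    if len(counts_a) != len(counts_b):
        raise ValueError("count vectors must list the same taxa in the same order")
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    total = sum(counts_a) + sum(counts_b)
    if total == 0:
        raise ValueError("both samples are empty")
    return 100.0 * 2.0 * shared / total

# Made-up counts for three taxa caught by two hypothetical methods.
method_a = [10, 5, 0]
method_b = [5, 5, 5]
print(f"{bray_curtis_similarity(method_a, method_b):.2f}")

# Two methods that share no taxa have a similarity of 0, like the 0.00 cells in the matrix.
print(bray_curtis_similarity([4, 0], [0, 7]))
```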

Native bees observed vs. targeted sweep netting

Because some bees were inaccessible (out of reach of the entomological sweep net) or belonged to rapid-flying taxa that are difficult to catch, not all bees that were observed were netted. Out of a total 5299 native bees recorded by active sampling, 1324 were netted and 4366 were observed: a ratio of netted to observed bees of approximately 1:3. Across all surveys, a mean of 6.32 ± 1.07 (standard error) bees were netted vs. 17.16 ± 4.01 observed. The proportion of netted bees to observed bees did not differ according to habitat (P = 0.147; Appendix S2: Table S2). There were, however, significant differences between taxa in the proportion of bees netted relative to that of bees observed (P < 0.001; Table 5; Appendix S2: Table S6), with differences in most pairwise comparisons between taxa (Tukey's post hoc test; Appendix S2: Table S7). The greatest differences in netted:observed catch rates were for the genus Amegilla, which included only a single, large-bodied species (A. chlorocyanea), and for Exoneura, a genus of small social bee. For Amegilla, the larger numbers observed relative to netted related to their extremely fast, erratic flight and the short time they spend alighted on flowers. For Exoneura, the high observed:netted ratio was likely due to the large numbers that often forage simultaneously on bushes, which made it easy to net some individuals but impossible to catch all that were foraging. Excluding the rarely encountered taxa, most taxa were observed more frequently than netted, except for Meroglossa, represented by a single species (M. rubricata) that was often observed in trap-nests but seldom foraging, and Lipotriches, mainly represented by L. flavoviridis, a common species present at most sites and foraging on a wide range of flora.

Very rapid, zipping flight, seldom alights long on flowers

In reach of sweep nets, often foraging on vegetation that can be sweep netted

Seldom encountered singly

Flying rapidly around inflorescences often in a cloud

Never on ground-level flora prefer branches of flowering trees but if within reach are relatively easy to capture by sweeping through cloud

Intermediate flight speed

Seldom encountered singly

Prefer shrubs and trees to forage on, never at ground level

Intermediate flight speed

Prefer shrubs and trees to forage on, never at ground level

Seldom encountered singly

Flying rapidly around inflorescences often in a cloud

Never on ground-level flora prefer branches of flowering trees but if within reach are relatively easy to capture by sweeping through cloud

Males may be territorial around flowers

Intermediate flight speed

Forage at multiple heights, including low-lying flora

Intermediate flight speed

Often forage on low-lying flora

Intermediate flight speed

Buzz pollinators—stay on flowers for a longer period of time

Forage at various heights, including ground level

Alight only briefly on flowers

Forage at various heights, including ground level

Intermediate flight speed

Frequently observed just resting inside entrances of trap-nests

Intermediate flight speed

Prefer shrubs and trees to forage on, never at ground level

  • Body size categories: small, 0.48–1.78 mm ITD; medium, 1.79–3.10 mm; large, 3.11–4.41 mm. Categories were based on subtracting the minimum body size, as measured by intertegular distance (ITD), from the maximum and dividing by three.

Observed vs. passive collections

Both native bees and honeybees were surveyed using observational recording and passive collections. For both, observational counts vastly exceeded numbers recorded by all passive sampling methods combined. A total of 572 honeybees were collected across all passive sampling methods, whereas 19,825 were observed, so the numbers observed were 34.7 times greater than the numbers caught by the passive traps. Numbers of native bees observed were 11-fold greater than those caught passively (391 native bee individuals caught by passive traps, compared with 4366 observed), despite there being more passive than active methods employed.


Only a small subset of the potential cavity-nesting bee species used the trap-nests. Of the 34 cavity-nesting megachilids (including the kleptoparasitic Coelioxys) caught, only 10 species used the trap-nests, and of the 17 hylaeine bees, only four species used the trap-nests (Table 6). However, the value of the trap-nests was in being able to confirm males and females belonging to the same species; namely, no males of Megachile (Eutricharaea) chrysopyga, Megachile (Mitchellapis) fabricator, and Hylaeus (Euprosopis) violaceus were collected in the field, but they emerged from bee tubes. Not only did the composition of trap-nesting species represent only a fraction of the diversity of cavity-nesting species, but also the relative abundances did not mirror those caught in the field (Table 6).

Taxon         Species                              No. of tubes  No. of bees emerged  Proportion of tubes  Proportion of bees emerged  No. of cavity-nesting bees collected during surveys  Proportion of cavity-nesting bees collected during surveys
Hylaeinae     Hylaeus (Euprosopis) violaceus       15            68                   0.093                0.133                       3                                                    0.004
              Hylaeus (Gnathoprosopis) amiculus    1             1                    0.006                0.002                       7                                                    0.009
              Hylaeus (Gnathoprosopis) euxanthus   1             1                    0.006                0.002                       14                                                   0.018
              Meroglossa rubricata                 4             8                    0.025                0.016                       19                                                   0.024
Megachilidae  Megachile (Eutricharaea) obtusa      3             14                   0.019                0.028                       27                                                   0.035
              Megachile (Mitchellapis) fabricator  39            145                  0.24                 0.285                       3                                                    0.004
              Megachile apicata                    1             1                    0.006                0.002                       10                                                   0.013
              Megachile aurifrons                  6             37                   0.037                0.070                       25                                                   0.032
              Megachile erythropyga                85            227                  0.525                0.446                       6                                                    0.008
              Megachile fultoni                    1             1                    0.006                0.002                       24                                                   0.031
              Megachile “houstoni” M306/F367†      1             1                    0.006                0.002                       151                                                  0.195
              Megachile ignita                     3             3                    0.019                0.006                       20                                                   0.026
              Megachile (Hackeriapis) tosticauda   2             2                    0.012                0.004                       6                                                    0.008
Totals                                             162           509
  • Number of tubes occupied, the number of bees to emerge, proportion of all tubes occupied by a given species, proportion of all cavity-nesting bees are presented. To compare with survey results, number of a given species collected during the bee surveys and the proportion of all cavity-nesting bees collected during surveys (i.e., No. of sp. collected/No. of all cavity-nesting bees collected) are provided.
  • † Undescribed species, lodged in the WA Museum as M306/F367.

Mobile gardens

The mobile gardens were unsuccessful, despite the plants having a high density of blooms. Throughout the four months (56 sampling days), only S. aemula was visited, and on only five days at three sites. It should be noted that S. aemula was the only plant that flowered throughout the survey season; the other three were restricted to the first month (only D. revoluta had some flowers still present in December). A total of 15 bees visited the mobile garden plants, but only one of these was native (L. (Chilalictus) castor, female); the remainder were honeybees.

Comparison of different passive sampling methods for honeybees and native bees and the influence of habitat type

There was a significant difference in catch rates of native bee individuals by different methods (P < 0.001; Appendix S2: Table S8). Significantly more individuals were caught in blue vane traps than in all other methods (P < 0.001); no other comparisons were significantly different (P > 0.05). There was no significant interaction between method and habitat (P = 0.115; Appendix S2: Table S8). Vane traps caught more bees in bushland than in residential areas, whereas the other methods were comparable between habitats, but the sample size was too small for any valid conclusions (Fig. 2a).

Honeybee catch rates differed significantly by method (P < 0.001; Appendix S2: Table S6). Pairwise comparisons between both colored vane traps and all pan traps were highly significant (P < 0.001). Blue vanes also caught significantly higher numbers of honeybees than yellow vanes (P = 0.001). Comparisons between the pan traps were nonsignificant. There was also a significant method × habitat interaction (P < 0.001; Appendix S2: Table S8): vane traps, which caught more bees overall, had higher catch rates in bushland remnants than in residential habitats, whereas the other methods caught no honeybees in most cases, apart from a few outliers, in both habitat types (Fig. 2b).

When each method was assessed for differences in abundance between native bees and honeybees, the relative abundance of honeybees vs. native bees differed between methods (Appendix S2: Table S9). Abundances of native bees and honeybees were similar for blue vane traps (mean native bees 8.26 ± 1.45 vs. mean honeybees 9.14 ± 1.27, P = 0.171), whereas there was a trend for honeybees to be recorded at higher abundances based on observational counts (mean native bees 94.3 ± 11.0 vs. mean honeybees 360.3 ± 97.1, P = 0.077; Appendix S2: Table S7). Both types of yellow pan traps caught significantly more native bees than honeybees (UV-fluorescent pan traps, mean native bees 0.392 ± 0.116 vs. mean honeybees 0 ± 0, P < 0.001; and large yellow, mean native bees 0.303 ± 0.119 vs. mean honeybees 0.024 ± 0.024, P = 0.001), but the trend was reversed for yellow vanes, which caught sixfold more honeybees than native bees (mean native bees 0.722 ± 0.172 vs. mean honeybees 9.14 ± 2.17, P < 0.001; Appendix S2: Table S9).


Gail F. Dawson MD, MS, FAAEP, in Easy Interpretation of Biostatistics, 2008


Inferential statistics is based on the probability of a certain outcome happening by chance. In probability theory, the word outcome refers to the result observed. It does not necessarily reflect quality-adjusted life-years (QALY) like the outcome variable we see in clinical trials. It is simply the result of an event. The range of probabilities varies between 0 (no probability of the event happening) and 1 (the outcome will always happen). It is rare to find circumstances in nature where the probability of occurrence is equal to 0 or 1. If that were the case, there would be no need to apply probability theory. All events that are studied in medicine have a probability of occurrence between 0 and 1. This is expressed as a decimal, such as 0.35. The simplest, most informative interpretation of probability converts these values to percentages to express the chance of something happening. An outcome with a probability of 0.35 is said to have a 35% chance of occurrence. On average, it will happen 35 times out of 100 opportunities. It follows that an outcome with 100% probability means there is no possibility that the outcome will not happen (but this never happens!).

A p value is really a probability that a given outcome could occur by chance. It is usually expressed as a decimal, such as 0.07. A p value, when multiplied by 100, is a percentage. In the above example, the p value of 0.07 means that there is a 7% probability that the observed outcome could happen by chance alone. (This is based on an underlying assumption that certain conditions have been met, which we will look into later.) Another way of stating this is: If the study were repeated hundreds of times under the same circumstances, using members of the same population, an average of only seven of these studies out of 100 would give the result we observe based on chance alone. The reason why each study does not give identical results in these situations is because different samples are used, which results in different estimates of the parameter. We will discuss this concept again but, for now, just realize that the p value represents a probability, which can be expressed as a percentage.
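
To make the idea concrete, here is a small sketch using a hypothetical coin-tossing study (the example is not from the text). It computes the probability, under chance alone with a fair coin, of seeing a result at least as extreme as 60 heads in 100 tosses, and then expresses that probability as a percentage.

```python
from math import comb

def binomial_upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n trials when each succeeds with probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical study: 60 heads observed in 100 tosses of a coin assumed to be fair.
p_value = binomial_upper_tail(100, 60)
print(f"p = {p_value:.3f}, i.e. about {100 * p_value:.1f}% by chance alone")
```

Here the "certain conditions" are that the coin is fair and the tosses are independent; the p value is only meaningful under those assumptions.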

Experiments on Ecology | Biology

Are you researching experiments on ecology? You are in the right place. The article below includes a collection of nineteen experiments on ecology: 1. Community Structure Study 2. Biomass Study 3. Soil Science 4. Aquatic Ecosystem 5. Physico-Chemical Analysis of Water.

  1. Experiments on Community Structure Study
  2. Experiments on Biomass Study
  3. Experiments on Soil Science
  4. Experiments on Aquatic Ecosystem
  5. Experiments on Physico-Chemical Analysis of Water

1. Experiment on Community Structure Study: (8 Experiments)

1. Aim of the Experiment:

To determine the minimum size of the quadrat by species area-curve method.


Nails, cord or string, metre scale, hammer, pencil, notebook.

i. Prepare an L-shaped structure of 1 × 1 metre size in the given area by using 3 nails and tying them with a cord or string.

ii. Measure 10 cm along one arm of the L and the same along the other arm, and mark out a 10 × 10 cm area using another set of nails and string. Note the number of species in this 10 × 10 cm area.

iii. Increase this area to 20 × 20 cm and note the additional species growing in this area.

iv. Repeat the same procedure for 30 × 30 cm, 40 × 40 cm and so on till the 1 × 1 metre area is covered (Fig. 67) and note the number of additional species every time.

Record your data in the following table:

v. Prepare a graph using the data recorded in the above table. Size of the quadrats is plotted on X- axis and the number of species on Y-axis (Fig. 67 B).

The curve starts flattening or shows only a steady increase (Fig. 67 B) at one point in the graph.

The point of the graph, at which the curve starts flattening or shows only a steady or gradual increase, indicates the minimum size or minimum area of the quadrat suitable for study.
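
The bookkeeping for this exercise can be sketched as below. The readings are made up, and the flattening rule (the smallest quadrat after which no enlargement adds more than `max_gain` species) is only one possible way to formalise "starts flattening"; reading the point off the graph by eye is what the exercise itself asks for.

```python
from itertools import accumulate

def minimum_quadrat_side(sides_cm, cumulative_species, max_gain=0):
    """Smallest side length after which every further enlargement adds at most max_gain species."""
    gains = [b - a for a, b in zip(cumulative_species, cumulative_species[1:])]
    for i, side in enumerate(sides_cm):
        if all(g <= max_gain for g in gains[i:]):
            return side

# Made-up readings: additional species noted at each enlargement (10 x 10 cm up to 100 x 100 cm).
sides_cm = list(range(10, 101, 10))
additional_species = [2, 2, 2, 1, 1, 0, 1, 0, 0, 0]
cumulative_species = list(accumulate(additional_species))

print(cumulative_species)
print(minimum_quadrat_side(sides_cm, cumulative_species))
```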

2. Aim of the Experiment:

To study communities by quadrat method and to determine % Frequency, Density and Abundance.

Metre scale, string, four nails or quadrat, notebook.

Frequency is the number of sampling units or quadrats in which a given species occurs.

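Percentage frequency (%F) can be estimated by the following formula:

%F = (Number of quadrats in which the species occurred / Total number of quadrats studied) × 100

Density is the number of individuals per unit area and can be calculated by the following formula:

Density = Total number of individuals of the species in all quadrats / Total number of quadrats studied

Abundance is described as the number of individuals per quadrat of occurrence.

Abundance for each species can be calculated by the following formula:

Abundance = Total number of individuals of the species in all quadrats / Number of quadrats in which the species occurred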
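
A short sketch of the three calculations follows, assuming the usual definitions: %F is occupied quadrats over total quadrats times 100, density is total individuals over total quadrats, and abundance is total individuals over occupied quadrats. The counts are made up and the function name is illustrative.

```python
def quadrat_metrics(counts_per_quadrat):
    """Return % frequency, density and abundance for one species from its per-quadrat counts."""
    n_quadrats = len(counts_per_quadrat)
    occupied = sum(1 for c in counts_per_quadrat if c > 0)
    total = sum(counts_per_quadrat)
    return {
        "percent_frequency": 100.0 * occupied / n_quadrats,
        "density": total / n_quadrats,  # individuals per quadrat
        "abundance": total / occupied if occupied else 0.0,  # per occupied quadrat
    }

# Made-up counts of one species in 10 quadrats.
counts = [3, 0, 5, 2, 0, 0, 4, 1, 0, 5]
metrics = quadrat_metrics(counts)
print(metrics)
```

Density here is per quadrat; dividing by the quadrat area converts it to individuals per unit area.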

Lay a quadrat (Fig. 68) in the field or specific area to be studied. Note carefully the plants occurring there. Write the names and number of individuals of plant species in the note-book, which are present in the limits of your quadrat. Lay at random at least 10 quadrats (Fig. 69) in the same way and record your data in the form of Table 4.1.

In Table 4.1, % frequency, density and abundance of Cyperus have been determined. Readings for the other six plants that occurred in the quadrats studied are also filled in the table. Calculate the frequency, density and abundance of these six plants for practice. (For the practical class take your own readings. The readings in Table 4.1 are given only to explain the method.)

Calculate the frequency, density and abundance of all the plant species with the help of the formulae given earlier and note the following results:

(i) In terms of % Frequency (F), the field is being dominated by…

(ii) In terms of Density (D), the field is being dominated by…

(iii) In terms of Abundance (A), the field is being dominated by…

Table 4.1: Size of quadrat: 50 cm × 50 cm = 2500 cm²

3. Aim of the Experiment:

To determine minimum number of quadrats required for reliable estimate of biomass in grasslands.

Metre scale, string, four nails (or quadrat), note book, graph paper, herbarium sheet, cello tape.

i. Lay down 20-50 quadrats of definite size at random in the grassland to be studied, make a list of different plant species (e.g., A-J) present in each quadrat and note down their botanical names or hypothetical labels (e.g., A, B, C,…, J) as shown in Table 4.2.

ii. With the help of the data available in Table 4.2, find out the accumulating total of the number of species for each quadrat.

iii. Now take a graph paper sheet and plot the number of quadrats on X-axis and the accumulating total number of species on Y-axis of the graph paper.

Observations and results:

A curve would be obtained. Note carefully that this curve also starts flattening. The point at which this curve starts flattening up would give us the minimum number of quadrats required to be laid down in the grassland.
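
The accumulating total in step ii can be computed as sketched below. The species lists per quadrat are made up, using the hypothetical labels A to E, and the function name is illustrative.

```python
def species_accumulation(quadrats):
    """Running total of distinct species after each successive quadrat."""
    seen = set()
    totals = []
    for species_in_quadrat in quadrats:
        seen.update(species_in_quadrat)
        totals.append(len(seen))
    return totals

# Made-up records: the species found in each of seven quadrats.
quadrats = [{"A", "B"}, {"B", "C"}, {"A", "D"}, {"C"}, {"E"}, {"A", "B"}, {"D"}]
accumulated = species_accumulation(quadrats)
print(accumulated)
```

Plotting these totals against the quadrat number gives the curve described above; in this made-up case no new species appear after the fifth quadrat.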

4. Aim of the Experiment:

To study frequency of herbaceous species in grassland and to compare the frequency distribution with Raunkiaer’s standard frequency diagram.

Quadrat, pencil, note-book, graph paper.

i. Lay 10 quadrats in the given area and calculate the percentage frequency of different plant species by the method and formula given above in Exercise No. 2.

ii. Arrange your data in the form of following Table 4.3:

Raunkiaer (1934) classified the species in a community into following five classes as shown in Table 4.4:

Arrange percentage frequency of different species of the above Table 4.3 in the five frequency classes (A-E) as formulated by Raunkiaer (1934) in Table 4.4.

Draw a histogram (Fig. 70) with the percentage of total number of species plotted on Y-axis and the frequency classes (A-E) on X-axis.

This is the frequency diagram (Fig. 70):

Observations and results:

The histogram takes a “J-shaped” curve as suggested by Raunkiaer (1934), and this shows the normal distribution of frequency percentage. If the vegetation in the area is uniform, class ‘E’ is always larger than class ‘D’. If class ‘E’ is smaller than class ‘D’, the community or vegetation in the area shows considerable disturbance.
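
The classification step can be sketched as follows. Table 4.4 is not reproduced in this section, so the class limits used here are the standard Raunkiaer limits (A up to 20%, B 21–40%, C 41–60%, D 61–80%, E 81–100%), assumed to match it; the species frequencies are made up.

```python
from collections import Counter

# Standard Raunkiaer class limits (upper bound in %, class label); assumed to match Table 4.4.
CLASS_LIMITS = ((20, "A"), (40, "B"), (60, "C"), (80, "D"), (100, "E"))

def raunkiaer_class(percent_frequency):
    """Return the frequency class (A-E) for a percentage frequency in (0, 100]."""
    if not 0 < percent_frequency <= 100:
        raise ValueError("percentage frequency must be greater than 0 and at most 100")
    for upper, label in CLASS_LIMITS:
        if percent_frequency <= upper:
            return label

# Made-up percentage frequencies for eight species.
frequencies = {"sp1": 10, "sp2": 20, "sp3": 30, "sp4": 50, "sp5": 70, "sp6": 90, "sp7": 100, "sp8": 10}
class_counts = Counter(raunkiaer_class(f) for f in frequencies.values())
class_percentages = {label: 100.0 * class_counts[label] / len(frequencies) for label in "ABCDE"}
print(class_percentages)
print("E larger than D:", class_percentages["E"] > class_percentages["D"])
```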

5. Aim of the Experiment:

To estimate Importance Value Index for grassland species on the basis of relative frequency, relative density and relative dominance in protected and grazed grassland.

Wooden quadrat of 1 × 1 metre, pencil, notebook.

What is Importance Value Index?

The Importance Value Index (IVI) shows the complete or overall picture of ecological importance of the species in a community. Community structure study is made by studying frequency, density, abundance and basal cover of species. But these data do not provide an overall picture of importance of a species, e.g., frequency gives us an idea about dispersion of a species in the area but does not give any idea about its number or the area covered.

Density gives the numerical strength and nothing about the spread or cover. A total picture of the ecological importance of a species in a community is obtained by IVI. For finding IVI, the percentage values of relative frequency, relative density and relative dominance are added together, and this value out of 300 is called Importance Value Index or IVI of a species.

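Relative frequency (RF) of a species is calculated by the following formula:

Relative frequency = (Frequency of the species / Sum of the frequencies of all species) × 100

Relative density (RD) of a species is calculated by the following formula:

Relative density = (Density of the species / Sum of the densities of all species) × 100

Relative dominance of a species is calculated by the following formula:

Relative dominance = (Total basal area of the species / Total basal area of all species) × 100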

Basal area of a plant species is calculated by the following formula:

Basal area of a species = π r²

where π = 3.142, and r = radius of the stem

i. Find out the values of relative frequency, relative density and relative dominance by the above-mentioned formulae.

ii. Calculate the IVI by adding these three values:

IVI = relative frequency + relative density + relative dominance.

Arrange the species in order of decreasing importance, i.e., the species having highest IVI is of most ecological importance and the one having the lowest IVI is of least ecological importance.
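
A minimal sketch of the IVI calculation is given below. It assumes the usual definitions of the relative measures (each species' value as a percentage of the total over all species); the three species and their values are made up.

```python
def importance_value_index(species_data):
    """IVI per species = relative frequency + relative density + relative dominance (each in %)."""
    total_frequency = sum(d["frequency"] for d in species_data.values())
    total_density = sum(d["density"] for d in species_data.values())
    total_basal_area = sum(d["basal_area"] for d in species_data.values())
    ivi = {}
    for name, d in species_data.items():
        relative_frequency = 100.0 * d["frequency"] / total_frequency
        relative_density = 100.0 * d["density"] / total_density
        relative_dominance = 100.0 * d["basal_area"] / total_basal_area
        ivi[name] = relative_frequency + relative_density + relative_dominance
    return ivi

# Made-up values for three species.
species_data = {
    "A": {"frequency": 80, "density": 4.0, "basal_area": 30.0},
    "B": {"frequency": 60, "density": 2.0, "basal_area": 50.0},
    "C": {"frequency": 60, "density": 2.0, "basal_area": 20.0},
}
ivi = importance_value_index(species_data)
ranked = sorted(ivi, key=ivi.get, reverse=True)  # decreasing ecological importance
print(ivi, ranked)
```

Because each relative measure sums to 100 across species, the IVI values of all species together add up to 300.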

6. Aim of the Experiment:

To determine the basal cover, or vegetational cover of one herbaceous community by quadrat method.

Wooden quadrat of 1×1 m, Verniercalliper, pencil, notebook.

i. Lay a wooden-framed quadrat of 1 x 1 metre randomly in a selected plot of vegetation and count the total number of individuals of the selected species inside the quadrat.

ii. Cut a few stems of some plants of this individual species and measure the diameter of the stem with the help of Verniercalliper.

iii. Calculate the basal area of the individuals by the formula:

Average basal area = π r², where r is the radius of the stem.

iv. Take 5 readings, arrange them in tabular form and find out the average basal area by the above formula.

v. Lay the quadrat again randomly at another place and note the same observations in the table.

vi. Lay about 10 quadrats in the same fashion and each time note the total number of the species and average basal area of the single individual.

Observations and results:

(a) For finding the average basal area, divide the sum of the average basal areas from all quadrats by the total number of quadrats studied.

(b) For finding the total basal cover of a particular species multiply the average basal area of all observations with the density of that particular species as under:

Basal cover of a particular species = Average basal area x Density (D) of that species.

The basal cover of a particular species is expressed in… sq. cm/sq. metre.
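The calculation in (a) and (b) can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the readings are hypothetical, and density is assumed to mean the number of individuals per square metre sampled, since its definition is not given in this exercise.

```python
import math


def basal_area_sq_cm(stem_diameter_cm):
    """Basal area of one stem from its diameter: pi * r^2, with r = diameter / 2."""
    radius = stem_diameter_cm / 2.0
    return math.pi * radius ** 2


def basal_cover(quadrats, quadrat_area_sq_m=1.0):
    """Basal cover (sq. cm per sq. metre) = average basal area x density.

    `quadrats` is a list of (individual_count, [stem diameters in cm]) pairs,
    one pair per quadrat.
    """
    # Average basal area of a single individual within each quadrat.
    per_quadrat_means = [
        sum(basal_area_sq_cm(d) for d in diameters) / len(diameters)
        for _, diameters in quadrats
    ]
    # Step (a): mean of the per-quadrat averages.
    average_basal_area = sum(per_quadrat_means) / len(per_quadrat_means)
    # Assumed definition of density: individuals per square metre sampled.
    total_individuals = sum(count for count, _ in quadrats)
    density = total_individuals / (len(quadrats) * quadrat_area_sq_m)
    # Step (b): basal cover of the species.
    return average_basal_area * density


# Hypothetical readings: two quadrats, five stem diameters (cm) each.
readings = [
    (12, [0.40, 0.50, 0.45, 0.50, 0.40]),
    (8, [0.50, 0.55, 0.45, 0.50, 0.50]),
]
print(f"Basal cover = {basal_cover(readings):.2f} sq. cm/sq. metre")
```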

7. Aim of the Experiment:

To measure the vegetation cover of grassland through point-frame method.

Point-frame apparatus, graph paper sheet, herbarium sheet, cello tape, note-book.

A point-frame apparatus is a simple wooden frame of about 50 cm long and 50 cm high in which 10 movable pins are inserted at 45° angle. Each movable pin is about 50 cm long.

i. Put the point-frame apparatus (with 10 pins) at a place in the vegetation of grassland (Fig. 71) and note down various plant species hit by one or more of 10 pins of the apparatus. Treat this as one sampling unit.

ii. Now put the apparatus at random at 10-25 or more places and note down each time the various plant species in a similar fashion. If three plants of any species touch three pins in one sampling unit, the numerical strength of that particular species in this sampling unit will be three individuals. Write this value against the species below this sampling unit.

Observations and results:

Note down the details in the form of following Table 4.5:

Now calculate the percentage frequency of each species as already done in Exercise No. 2. Allocate the various species among five frequency classes (A, B, C, D, E) mentioned in Exercise No. 4, find out the percentage value of each frequency class and prepare a frequency diagram as done in Exercise No. 4. Compare the thus-developed frequency diagram with normal frequency diagram.

Find out the three most frequently occurring species in the area studied. Also find out whether the vegetation is homogeneous or heterogeneous. Also try to determine the density values of individual species. Also find out at each place the total number of individuals of each species being hit by 10 pins of the point-frame apparatus.

8. Aim of the Experiment:

To prepare a list of plants occurring in a grassland and also to prepare chart along the line transect.

i. Prepare a 25 feet long line transect in a selected grassland by tying each end of a 25 feet cord to the upper knobs of two nails.

ii. Note down the names of the plant species whose projection touches one edge of the cord along the line transect, and assign all of them a definite number (e.g., 1,2,3,4, …etc.).

iii. Take several such samples at regular or irregular intervals in the grassland along the line transect.

iv. Also record the plant species from different grassland types in the similar fashion.

Record your data in Table 4.6 in the following manner:

Table 4.6 gives the complete list of plants occurring in the selected grassland. Also find out the name of the species represented in maximum number in each locality.

These data will also provide a clear picture of the dominant species of the grassland in a particular area.

2. Experiment on Biomass Study: (2 Experiments)

It is usually expressed as dry weight of all the living materials (plants as well as animals) in an area. Under biomass we include plants (their aboveground and underground parts) as well as animals. Fallen leaves and twigs of the plants are also taken into consideration at the time of studying biomass.

In the forests, the humus is in different stages of decomposition. The floor of the forest remains covered by organic matter which is slightly or not at all decomposed. This is called litter. A partially decomposed matter is present below this layer. It is called duff. Further decomposed matter, which has lost its original form, is present below duff and called leaf mould.

9. Aim of the Experiment:

To measure the above-ground plant biomass in a grassland.

To determine the biomass of a particular area.

Nails (4), metre scale, string, khurpa (a weeding instrument), polythene bags, oven.

i. Make a quadrat of the size of 50 cm × 50 cm in the field by digging the nails and connecting them with the string. Weed out all the above-ground parts of the plants growing in that limit with the help of weeding instrument. Collect all of them in a polythene bag.

ii. Collect the fallen leaves and other parts of the plants in the second polythene bag.

iii. Collect all the animals such as ants, larvae, earthworms, insects, etc., in the third polythene bag.

iv. By digging the soil to about 20 to 25 cm., take out all the underground parts of the plants and collect them in a separate bag after washing.

In the same way lay some more quadrats in the area under study and collect all the materials in polythene bags.

Dry weight of aboveground parts = 15 gm.

Dry weight of underground parts = 4 gm.

Dry weight of animals = 1 gm.

∴ Total dry weight = 20 gm.

50 × 50 cm field area contains = 20 gm. total dry biomass

∴ 100 × 100 cm field area will contain

= 20 × 100 × 100/(50 × 50) = 80 gm. biomass.

80 gm. is the biomass of 100 × 100 cm. field area.

Results of Different Parts:

(i) 50 x 50 cm. field area contains 15 gm. of aboveground parts.

100 × 100 cm. field area will contain

= 15 × 100 × 100/(50 × 50) = 60 gm. biomass.

(ii) 50 x 50 cm. field area contains 4 gm. of underground parts.

100 × 100 cm. field area will contain

= 4 × 100 × 100/(50 × 50) = 16 gm. biomass.

(iii) 50 × 50 cm. field area contains 1 gm. of animals

100 × 100 cm. field area will contain

= 1 × 100 × 100/(50 × 50) = 4 gm. biomass.

One square metre (100×100 cm.) field area contains 80 gm. biomass in terms of dry weight of the total plant and animal parts.
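The scaling from a 50 × 50 cm quadrat to one square metre can be checked with a short Python sketch. It uses the dry weights from the worked example above; the function name is only illustrative.

```python
def per_square_metre(dry_weight_g, quadrat_side_cm=50.0):
    """Scale a dry weight from a square quadrat to a 100 x 100 cm area."""
    return dry_weight_g * (100.0 * 100.0) / (quadrat_side_cm * quadrat_side_cm)


# Dry weights (gm) from the worked example, for one 50 x 50 cm quadrat.
dry_weights = {"aboveground parts": 15.0, "underground parts": 4.0, "animals": 1.0}

scaled = {part: per_square_metre(weight) for part, weight in dry_weights.items()}
for part, grams in scaled.items():
    print(f"{part}: {grams:.0f} gm per sq. metre")
print(f"total: {sum(scaled.values()):.0f} gm per sq. metre")
```

A 50 × 50 cm quadrat is one quarter of a square metre, so every weight is simply multiplied by four. When several quadrats are laid, averaging the per-quadrat weights before scaling would be a reasonable extension.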

10. Aim of the Experiment:

To determine diversity indices (richness, Simpson, Shannon-Wiener) in grazed and protected grassland.

To study species diversity (richness and evenness), Index of dominance, Similarity index, Dissimilarity index and Species diversity index in grazed and protected grassland.

Species diversity is a statistical abstraction with two components.

These two components are:

(i) Richness (or number of species), and

(ii) Evenness or equitability.

In any grassland, to be studied, if there are seventy species in a stand, then its richness is seventy. Pick out individual plants of different species with the help of khurpa, count the number of species in a stand of the area provided, and calculate the richness. On the other hand, if all the species in the grassland have equal number of individuals, then its evenness or equitability is high and if some species have only a few individuals then the evenness is low.

The species which have the strongest control over energy flow and environment in a given habitat are called ecological dominants. According to Simpson (1949), the Index of dominance (C) is calculated by the formula

C = ∑ (ni/N)²

where ∑ (sigma) refers to summation, ni refers to the importance value of the species in terms of number of individuals or biomass or productivity of each species over a unit area, and N refers to the total of corresponding importance values of all the component species in the same unit area and period. Calculate the Index of Dominance by the above-mentioned formula.

Similarity Index between two stands of vegetation can be worked out by the formula S = 2C/(A + B), where S is the Similarity Index, C is the number of species common to both the stands, and A and B are the numbers of species on stand A and stand B. For example, if there are 20 species on site A and 20 on site B and 14 species are common to both sites, the Similarity Index (S) will be S = 2 × 14/(20 + 20) = 0.7.

(d) Dissimilarity index:

The Dissimilarity Index is calculated by the formula D = 1 – S, where D is the dissimilarity index and S is the similarity index. For example, if there are 20 species on site A and 20 species on site B and 14 species are common to both sites, the similarity index (S) comes to 0.7, as calculated above. Therefore, the dissimilarity index (D) can be calculated as D = 1 – 0.7 = 0.3.

Species diversity index (d) is calculated by the following formula given by Menhinick (1964):

d = S/√N

where d = diversity index, S = number of species, and N = total number of individuals of all species in the sample.
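The indices described above can be sketched in Python. The sketch assumes the forms usually attributed to these authors: Simpson's index of dominance as C = ∑(ni/N)² and Menhinick's index as d = S/√N, with N taken as the total number of individuals. Treat both as assumptions if your course text defines them differently. The counts and species sets are hypothetical.

```python
import math


def dominance_index(counts):
    """Simpson's index of dominance, C = sum((ni / N)^2)."""
    total = sum(counts)
    return sum((n / total) ** 2 for n in counts)


def similarity_index(species_a, species_b):
    """S = 2C / (A + B), where C is the number of species common to both stands."""
    common = len(species_a & species_b)
    return 2.0 * common / (len(species_a) + len(species_b))


def dissimilarity_index(species_a, species_b):
    """D = 1 - S."""
    return 1.0 - similarity_index(species_a, species_b)


def menhinick_index(counts):
    """d = S / sqrt(N): species present over the square root of total individuals."""
    species_present = sum(1 for n in counts if n > 0)
    return species_present / math.sqrt(sum(counts))


# Two stands with 20 species each, 14 of them shared (species coded as integers).
site_a = set(range(0, 20))
site_b = set(range(6, 26))
print(f"S = {similarity_index(site_a, site_b):.2f}")
print(f"D = {dissimilarity_index(site_a, site_b):.2f}")

# Hypothetical individual counts per species in one stand.
counts = [25, 25, 50]
print(f"C = {dominance_index(counts):.3f}")
print(f"d = {menhinick_index(counts):.2f}")
```

Running the same functions on counts from the grazed and the protected stand gives directly comparable values; a higher C indicates that fewer species account for most individuals.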

3. Experiment on Soil Science: (1 Experiment)

11. Aim of the Experiment:

To study the characteristics of different types of soils.

Samples of different types of soils (e.g., clay soil, sand or alluvial soil, humus, black soil, yellow soil, red soil, laterite or lateritic soil).

Method and observations:

Different soil samples are taken and studied individually.

Some of their major characteristics are mentioned below:

I. Clay soil:

i. It is a compact and heavy-textured soil.

ii. The size of its particles is less than 0.002 mm.

iii. It has very minute spaces in between its particles.

iv. It is quite sticky when wet but becomes hard and develops cracks on drying.

v. It has higher water-holding capacity and poor aeration.

vi. It gets waterlogged easily.

vii. Its particles are negatively charged and have the ability to absorb cations of Mg, Ca, K, P, Fe and Na.

viii. This soil is made up of hydrated alumino silicate.

ix. It is quite rich in calcium carbonate and magnesium carbonate.

x. The pore space between its particles is greater than that of sand.

xi. This soil has high degree of fertility.

On the basis of above characteristics, the given sample belongs to clay soil.

II. Fine sand or alluvial soil:

i. This soil is loose, light-textured and silver-grey in colour.

ii. The size of its particles is between 0.02 mm to 0.2 mm.

iii. It has poor water-holding capacity.

iv. This soil shows quite rapid rate of water infiltration.

vi. The carbonate content of this soil is very low.

vii. It does not get waterlogged easily.

viii. It shows good aeration.

ix. It is non-sticky and non-plastic when wet.

x. It has very low contents of phosphate, nitrogen and organic matter.

xi. It has shiny particles of aluminium silicate or mica.

xii. Some amounts of iron, magnesium, sodium, aluminium, silicon and calcium are present in this soil.

xiii. Its particles become warm on long exposure to the sun.

The above characteristics show that the given soil sample belongs to fine sand or alluvial soil.

III. Humus:

i. It is decomposed matter of plant and animal remains.

ii. This organic matter is amorphous and dark brown to black in colour.

iii. It is soluble in dilute alkali solution like KOH and NaOH but insoluble in water.

iv. It is actually a layer of organic matter at the top of a soil profile. It is the habitat of most decomposers. The main decomposers are bacteria and fungi.

v. It is made up of nitrogen-rich proteins, lignin and polysaccharides.

vi. A large amount of carbon and small amounts of sulphur, phosphorus and some other elements are also present.

vii. It is colloidal in nature.

On the basis of the above characteristics, it can be concluded that the given material is humus.

IV. Black soil:

i. This is black-coloured soil. The black colour is due to the presence of iron in this soil.

ii. High percentage of iron oxides, calcium carbonates, magnesium carbonates and alumina are present in this soil.

iii. It also contains large amount of nitrogen and organic matter. It, however, contains very low amount of phosphorus.

iv. If made wet by adding some water, this soil is sticky. On drying, it contracts and shows cracks.

v. It has high water-retaining capacity.

vi. It is highly productive and suits most for the crops like cotton.

Based on the above-mentioned characteristics, it can be concluded that the given sample belongs to black soil.

V. Yellow soil:

i. The yellow colour of this soil is due to the enhanced hydration of ferric oxide.

ii. It is a porous soil with nearly neutral pH.

iii. Size of the particles of this soil is between 0.002 mm to 0.02 mm.

iv. It is a granite-derived soil with moderately rich humus.

v. It contains very low amount of oxides of phosphorus, nitrogen and potassium.

vi. It contains large amounts of silicon oxide and aluminosilicate.

The above characteristics show that this is a sample of yellow soil.

VI. Red soil:

i. This is the sample of red-coloured soil.

ii. The red colour is due to the diffusion of large amount of iron compounds such as ferrous oxide and ferric oxide.

iii. It is a slightly acidic type of soil. Its pH varies between 5 and 8.

iv. Some amount of silicon oxide and aluminium oxide are also present in this soil.

v. This soil is not good for agriculture because it is poor in nitrogen, phosphorus and humus.

Because of the above characteristics, the given sample is of red soil.

VII. Laterite soil:

i. This is yellowish or red-coloured soil. On exposure to sun it turns black.

ii. It is produced from aluminium-rich rocks.

iii. It is quite compact type of soil made up of hydrated oxides of iron and aluminium.

iv. It also contains small amounts of compounds like magnesium oxide and titanium oxide.

v. Small amounts of nitrogen, phosphorus, magnesium, potash and lime are also present in this soil.

vi. It is also quite rich in humus.

vii. Because of the above characteristics, this soil is good for the purpose of agriculture.

Above-mentioned characteristics show that the given sample is of laterite or lateritic soil.

4. Experiment on Aquatic Ecosystem: (1 Experiment)

Living organisms are structurally and functionally inter-related with the external world or the environment, and this functional and structural relationship of communities and the environment is called ecological system or ecosystem.

Ecosystem normally contains:

Taking in view the organisms and their habitat conditions, the ecosystem can be classified as follows:

Pond ecosystem can be studied as follows:

12. Aim of the Experiment:

To study the biotic components of a pond. Make diagram of a pond ecosystem.

Hand lens, collection net, meshes of different sizes, collection tubes, iron hook, scissor, forceps, and centrifuge.

Biotic components of a pond can be studied exactly according to the classification of a pond ecosystem given above. Hydrophytes can be picked by hand and collected in polythene bags. Other submerged plants may also be taken out by iron hooks (Fig. 74).

Phytoplankton and zooplankton can be collected in plankton bottles.

With the help of plankton nets, microorganisms can be collected in tubes.

Macro-producers and macro-consumers can be estimated in gm./cubic metre by the quadrat method used in the exercise of biomass.

Micro-producers and micro-consumers can be estimated in gm./litre of water collected as sample from an undisturbed part of the pond. They can be separated by centrifuging a little amount of pond water (containing micro-producers and micro-consumers) in test tube.

On the basis of their trophic position in the ecosystem different organisms may be grouped as follows:

Vallisneria, Ceratophyllum, Hydrilla, Potamogeton, Chara, etc.

(ii) Free-floating:

Azolla, Eichhornia, Lemna, Pistia, Spirodella, Salvinia, etc.

(iii) Rooted floating:

Trapa, Jussiaea, Nymphaea, Potamogeton, Nelumbium, etc.

Marsilea, Typha, Ranunculus, Polygonum, Cyperus, etc.

Algal members of Chlorophyceae, Xanthophyceae, Bacillariophyceae, Myxophyceae, etc.

(i) Consumers of the 1st order (Primary consumers):

e.g., Zooplankton, some insects.

(ii) Consumers of the 2nd order (Secondary consumers):

e.g., Fishes, frogs and some insects.

(iii) Consumers of the 3rd order (Tertiary consumers):

5. Experiment on Physico-Chemical Analysis of Water: (7 Experiments)

13. Aim of the Experiment:

To measure temperature and pH of different water bodies.

Maximum-minimum thermometer or thermometer or thermo flask.

The temperature of the pond can be determined by any of the following apparatuses:

(a) Maximum-minimum thermometer:

It contains two indicators (Fig. 73). With the help of a magnet these indicators are set to the present atmospheric temperature. Quickly lower down the thermometer to the desired depth in the pond. Keep it there for 10 minutes.

Bring out the thermometer quickly and note the readings of both the indicators. Out of the two indicators, one remains at the point and other moves to some extent giving the reading of temperature at that particular depth of the pond.

It is an instrument which gives correct reading of temperature in centigrade. It contains a long cable. At the end of the cable is attached a thermocouple (Fig. 75).

A milliammeter is present which is calibrated in °C and gives direct reading. Quickly lower the thermocouple up to a desired depth and note the temperature.

It is also one of the good apparatuses for measuring the temperature of a pond. After lowering to the desired depth, bring it out when it is filled completely with water. With the help of a good sensitive thermometer, note the temperature of the water.

pH of the pond water can be tested by pH meter, pH paper or B.D.H. Universal Indicator.

14. Aim of the Experiment:

To determine transparency or turbidity of different water bodies.

Transparency (clarity of pond):

It is directly related to and mainly depends upon the presence of microorganisms and microscopic soil particles in the pond water. If the quantity of soil particles and microscopic organisms will be more, the pond water will be less transparent. It also depends upon the depth of the pond water. The turbidity value will be very low in the deep water.

The instrument for knowing the turbidity value is called Secchi disc (Fig. 76). It is a circular disc with black and white or other contrasting colours. The disc is lowered down in the water. Note the depth of the water where there is no colour contrast on the disc.

15. Aim of the Experiment:

To find out the light intensity available to pond.

Light intensity available to the pond is measured with the help of ‘photometer’ (Fig. 77).

A ‘photometer’ consists of a photoelectric cell and a micro-ammeter. Photometer for pond is specially sealed in water-tight containers fitted with a glass window. Photoelectric cell is sensitive to light and generates current when light falls on it. Light intensity is proportional to the current generated in the photoelectric cell by the light falling on it. Readings can be noted in micro-ammeter.

Calculate the light intensity by the following formula:

Light Intensity = r × 100 / a

where r = Reading of lux-meter or photometer,

a = Reflected light from the cardboard.

16. Aim of the Experiment:

To measure amount of dissolved oxygen content in polluted and unpolluted water bodies.

To measure amount of dissolved oxygen in pond water.

Water sample, glass stoppered conical flask, manganous-sulphate, potassium iodide solution (alkaline), pipettes, sulphuric acid (conc.), sodium thiosulphate solution, starch solution, reagent bottles.

Preparation of reagents:

(a) Starch solution:

Add 1 gm. starch in 100 ml distilled water, warm and dissolve it.

(b) Potassium iodide solution (alkaline):

Heat 200 ml of distilled water and dissolve in it 100 gm. KOH and 50 gm. KI.

(c) Manganous sulphate solution:

Add 200 gm MnSO4·4H2O to 200 ml distilled water. Heat it to dissolve maximum salt. Cool it and then filter it.

(d) Sodium thiosulphate solution:

Dissolve 24.82 gm Na2S2O3·5H2O in some amount of distilled water and make up the volume to 1 litre by adding more distilled water. To stabilize the solution add a small pellet of sodium hydroxide. The solution thus prepared is a 0.1 N stock solution. To 250 ml of this stock solution add 750 ml of distilled water to obtain the diluted (0.025 N) sodium thiosulphate solution used for titration.

i. Take 100 ml of water sample in a 250 ml glass-stoppered conical flask and add 1 ml of manganous sulphate solution and 1 ml alkaline potassium iodide solution by separate pipettes. Appearance of brown precipitate indicates the presence of oxygen in the water sample.

ii. Shake it well and then allow the precipitate to settle down.

iii. Add 2 ml sulphuric acid (conc.) and again shake it well. The precipitate will be dissolved.

iv. Decant the liquid and titrate it with sodium thiosulphate solution. Starch solution is used as an indicator. The blue black colour disappears when the end point is reached.

Calculations and results:

Dissolved oxygen in mg/litre is calculated by the following formula:

Dissolved oxygen (mg/litre) = (ml × N of sodium thiosulphate) × 8 × 1000/(V1 – V2)

where V1 = Volume of water sample titrated,

V2 = Volume of MnSO4 and KI solution added.
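The formula above can be sketched in Python. The titrant volume of 3.2 ml is a hypothetical reading. The normality of 0.025 N follows from diluting 250 ml of the 0.1 N stock to 1 litre, and V2 = 2 ml corresponds to the 1 ml of each reagent added in step i.

```python
def dissolved_oxygen_mg_per_l(titrant_ml, titrant_normality, sample_ml, reagent_ml):
    """Dissolved oxygen (mg/litre) = (ml x N of thiosulphate) x 8 x 1000 / (V1 - V2).

    sample_ml is V1 (volume of water sample titrated); reagent_ml is V2
    (volume of MnSO4 and KI solutions added).
    """
    return titrant_ml * titrant_normality * 8.0 * 1000.0 / (sample_ml - reagent_ml)


# Hypothetical titration: 3.2 ml of 0.025 N sodium thiosulphate used for a
# 100 ml sample to which 1 ml MnSO4 and 1 ml alkaline KI were added.
do_value = dissolved_oxygen_mg_per_l(
    titrant_ml=3.2, titrant_normality=0.025, sample_ml=100.0, reagent_ml=2.0
)
print(f"Dissolved oxygen = {do_value:.2f} mg/litre")
```

The factor 8 is the equivalent weight of oxygen, and the factor 1000 converts the result to a per-litre basis.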

17. Aim of the Experiment:

To determine the total dissolved solids (TDS) in water.

Water sample, evaporating dish, Whatman filter paper, oven, desiccator, balance, beakers.

i. Weigh a dry and clean evaporating dish of 200 ml capacity.

ii. Shake the water sample well and filter it through Whatman filter paper.

iii. Take 100 ml of filtrate in a pre-weighed evaporating dish and keep it in an oven at 180°C for some time. The water will be evaporated and the sample will become dry.

iv. Cool it in a desiccator and weigh. Calculate the total dissolved solids using following formula:

Total Dissolved Solids (mg/l) = (a – b) × 10⁶/V

a = Weight of dish and dried filtered sample (in gm.)

b = Weight of empty evaporating dish (in gm).

V= Volume of sample evaporated (in ml).
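The TDS calculation can be sketched as a unit conversion. With weights in grams and volume in millilitres, the residue is converted to milligrams (× 1000) and the sample volume to litres (÷ 1000), which is equivalent to multiplying (a – b)/V by one million. The weights below are hypothetical.

```python
def total_dissolved_solids_mg_per_l(dish_plus_residue_g, empty_dish_g, sample_ml):
    """TDS in mg/litre from weights in grams (a, b) and sample volume in ml (V)."""
    residue_mg = (dish_plus_residue_g - empty_dish_g) * 1000.0  # gm -> mg
    sample_litres = sample_ml / 1000.0                          # ml -> litre
    return residue_mg / sample_litres


# Hypothetical weighing: the dish gains 0.025 gm after evaporating 100 ml of filtrate.
tds = total_dissolved_solids_mg_per_l(
    dish_plus_residue_g=50.025, empty_dish_g=50.000, sample_ml=100.0
)
print(f"TDS = {tds:.0f} mg/litre")
```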

18. Aim of the Experiment:

To count phytoplankton by haemocytometer method.

Haemocytometer, water sample, cover slip, microscope, dropper.

Haemocytometer is a special type of glass slide having more than 500 small grooved chambers or counting chambers (1 × 1 × 0.5 mm) in the middle portion. This specially designed slide is used for counting the microorganisms or plankton present in a water drop.

i. Take a haemocytometer and put a drop of concentrated water sample on its counting chambers.

ii. Put a cover slip, wait for about 2-5 minutes and examine under the high power of microscope. Count the plankton present in each chamber.

19. Aim of the Experiment:

To determine plankton biomass of a pond.

Pond water, shallow water bottle, chemical balance, oven, beaker, funnel, Whatman filter paper.

i. Collect 1000 ml of surface water of pond with the help of a shallow water bottle. This water contains phytoplankton and zooplankton.

ii. Weigh a dry filter paper. Suppose it is A1 gm.

iii. Take another filter paper. Make it wet and weigh it. Suppose it is A2 gm.

iv. Filter the water sample through a Whatman filter paper and weigh this filter paper containing plankton. Suppose it is A3 gm.

v. Now put this plankton-containing filter paper in oven for 24 hours at 85°C. Weigh this dry filter paper with plankton. Suppose it is A4 gm.

Calculations and result:

Calculate the biomass (fresh weight or dry weight of organisms) in mg/litre as follows:

(i) Fresh weight of plankton/1000 ml = (A3 – A2) gm.

(ii) Fresh weight of plankton/ml = (A3 – A2)/1000 gm.
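The weighings A1 to A4 can be combined in a short Python sketch. The fresh-weight expression follows (i) above. The dry-weight expression (A4 – A1) is an assumption, since the text does not state it. All weights shown are hypothetical.

```python
def plankton_biomass_mg_per_l(a1_g, a2_g, a3_g, a4_g, sample_ml=1000.0):
    """Return (fresh, dry) plankton biomass in mg/litre.

    a1: dry filter paper; a2: wet filter paper;
    a3: wet filter paper with plankton; a4: oven-dried filter paper with plankton.
    """
    fresh_g = a3_g - a2_g
    # Assumption: dry weight is the dried paper with plankton minus the dry paper.
    dry_g = a4_g - a1_g
    to_mg_per_litre = 1000.0 * 1000.0 / sample_ml  # gm -> mg and ml -> litre
    return fresh_g * to_mg_per_litre, dry_g * to_mg_per_litre


# Hypothetical weighings (gm) for a 1000 ml surface-water sample.
fresh, dry = plankton_biomass_mg_per_l(a1_g=1.00, a2_g=2.50, a3_g=2.90, a4_g=1.06)
print(f"Fresh weight = {fresh:.0f} mg/litre")
print(f"Dry weight   = {dry:.0f} mg/litre")
```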


6.1 The “methods” section

The rationale for the ordering of elements in ADEMP is that this is usually the appropriate order to report them in a methods section. If the simulation study has been planned and written out before it is executed, the methods section is largely written. This is a particularly helpful ordering for other researchers who might wish to replicate the study. Details should be included to allow reproduction as far as possible, such as the value of nsim, how this was decided on, and any dependence among simulated datasets. Another important element to report is a justification of the chosen targets for particular applied contexts.

6.2 Presentation of results

Some simulation studies can be very small, for example, exploring one or two performance measures under a single data-generating mechanism. These can be reported in text (as in He et al 60 ). In other cases, there are enough results that it becomes necessary to report them in tabular or graphical form. For any tabulation or plot of results, there are four potential dimensions: data generating mechanisms, methods, estimands, and performance measures. This section provides some considerations for presenting these results.

In tabular displays, it is common to divide rows according to data-generating mechanisms and methods as columns (as in Chen et al 61 ), though if there are more methods than data-generating mechanisms it is better to swap these (as in Hsu et al 62 ). Performance measures and estimands may vary across columns or across rows depending on what makes the table easier to digest (see, for example, Alonso et al 63 ).

There are two key considerations in the design of tables. The first is how to place the important comparisons side-by-side. The most important comparisons will typically be of methods, so bias (for example) for different methods should be arranged in adjacent rows or columns.

The second consideration regards presentation of Monte Carlo SEs, and this tends to confound the first. By presenting them next to performance results, for example in parentheses, the table becomes cluttered and hard to digest, obscuring interesting comparisons. For this reason, some authors will report the maximum Monte Carlo SE in the caption of tables. 43, 64 Results should not be presented to a greater accuracy than is justified by the Monte Carlo SE (eg, 3dp for coverage). In our review of Volume 34, seven articles presented Monte Carlo SEs for estimated performance: three in the text, two in a table, one in a graph, and one in a float caption.

The primary advantage of graphical displays of performance is that it is easier to quickly spot patterns, particularly over dimensions that are not compared side-by-side. A second advantage is that it becomes possible to present raw data estimates (for example the θ̂ᵢ) as well as performance results summarizing them (see, for example, figure 3 of Lambert et al 65 ). In our experience, these plots are popular and intuitive ways to summarise the θ̂ᵢ and model SEs. Another example of a plot of estimates data is a histogram given in Kahan 31 (this was particularly important as Bias ≃ 0, but almost no θ̂ᵢ was close to θ). Even if plots of estimates are not planned to be included in publications, we urge their use in exploration of simulation results.

The main disadvantage of graphical displays of results is that plots can be less space-efficient than tables, it is not possible to read the exact numbers, and separate plots will frequently be required for different performance measures.

Compared with tables, it is easier for plots of performance results to accommodate display of Monte Carlo SEs directly, and this should be done, for example as 95% confidence intervals. The considerations about design of plots to facilitate the most relevant comparisons apply as with tables. Methods often have names that are hard to arrange side by side in a legible manner; it is usually preferable to arrange methods in horizontal rows and performance measures across columns.
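As an illustration of reporting a performance estimate with its Monte Carlo SE, the following Python sketch computes the Monte Carlo SE of an estimated coverage and a 95% interval around it. It assumes the usual binomial form, sqrt(p(1 − p)/nsim), which is not stated in this section. The coverage and nsim values are hypothetical.

```python
import math


def coverage_mcse(coverage, n_sim):
    """Monte Carlo SE of an estimated coverage, assuming the binomial form."""
    return math.sqrt(coverage * (1.0 - coverage) / n_sim)


def coverage_interval(coverage, n_sim, z=1.96):
    """Approximate 95% interval for the coverage, based on its Monte Carlo SE."""
    se = coverage_mcse(coverage, n_sim)
    return coverage - z * se, coverage + z * se


# Hypothetical result: 95% estimated coverage from 1000 simulated datasets.
estimated_coverage = 0.95
n_sim = 1000
mcse = coverage_mcse(estimated_coverage, n_sim)
lower, upper = coverage_interval(estimated_coverage, n_sim)
print(f"coverage = {estimated_coverage:.3f} (Monte Carlo SE {mcse:.4f})")
print(f"95% interval: {lower:.3f} to {upper:.3f}")
```

With a Monte Carlo SE of roughly 0.007 in this hypothetical case, reporting coverage beyond three decimal places would suggest more accuracy than the simulation supports.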

As noted previously, full factorial designs can pose problems for presentation of results. One option for presentation is to present data assuming no interaction unless one is obviously present. An alternative approach taken by Rücker and Schwarzer is to present all results of a full factorial simulation study with 4 × 4 × 4 × 4 × 3 = 768 data-generating mechanisms, and comparison of six methods. 66 Their proposal is a “nested-loop plot,” which loops through nested factors– usually data-generating mechanisms – for an estimand, and plots results for different methods on top of each other. 66 This is a useful graphic but will not suit all designs (and makes depiction of Monte Carlo SE difficult).

There is no one correct way to present results, but we encourage careful thought to facilitate readability and clarity, considering the comparisons that need to be made by readers.

Module 2: Methods of Data Collection - Chapters 2

On-line Lesson

Leisure Research Methods

Once a research question has been determined the next step is to identify which method will be appropriate and effective.

The table below describes the basic characteristics of different methodologies.

Qualitative and Quantitative Research Methodologies

Quantitative research methods include:

Qualitative research methods include:

Each research method has its strengths and weaknesses. When designing a research study, it is important to decide what outcome (data) the study will produce, then select the best methodology to produce that desired information.

Data Collection Techniques

There are two sources of data. Primary data collection uses surveys, experiments or direct observations. Secondary data collection may be conducted by collecting information from a diverse source of documents or electronically stored information. U.S. census and market studies are examples of common sources of secondary data. This is also referred to as "data mining."

Key Data Collection Techniques

Writing an Introduction

In any research proposal the researcher should avoid the word "investigation." This word is perceived in a negative sense.

The key components of a good introduction include

Experimental Treatments

Experimental designs are the basis of statistical significance. An example of the fundamentals of an experimental design is shown below.

A researcher is interested in the effect of an outdoor recreation program (the independent variable, experimental treatment, or intervention variable) on behaviors (dependent or outcome variables) of youth-at-risk.

In this example, the independent variable (outdoor recreation program) is expected to effect a change in the dependent variable. Even with a well-designed study, a question remains: how can the researcher be confident that the changes in behavior, if any, were caused by the outdoor recreation program, and not by some other intervening or extraneous variable? An experimental design does not eliminate intervening or extraneous variables, but it attempts to account for their effects.

Experimental control is associated with four primary factors (Huck, Cormier, & Bounds, 1974).

Treatment Group: The portion of a sample or population that is exposed to a manipulation of the independent variable is known as the treatment group. For example, youth who enroll and participate in recreation programs are the treatment group, and the group to which no recreation services are provided constitutes the control group.

There are two primary criteria for evaluating the validity of an experimental design.

Errors: are conditions that may confuse the effect of the independent variable with that of some other variable(s).

True Designs - Five Basic Steps to Experimental Research Design

1. Survey the literature for current research related to your study.
2. Define the problem, formulate a hypothesis, define basic terms and variables, and operationalize variables.
3. Develop a research plan:
a. Identify confounding/mediating variables that may contaminate the experiment, and develop methods to control or minimize them.
b. Select a research design (see Chapter 3).
c. Randomly select subjects and randomly assign them to groups.
d. Validate all instruments used.
e. Develop data collection procedures, conduct a pilot study, and refine the instrument.
f. State the null and alternative hypotheses and set the statistical significance level of the study.
4. Conduct the research experiment(s).
5. Analyze all data, conduct appropriate statistical tests and report results.

The primary difference between true designs and quasi designs is that quasi designs do not use random assignment into treatment or control groups since this design is used in existing naturally occurring settings.

Groups are given pretests, then one group is given a treatment, and then both groups are given a post-test. Because the subjects are self-selected, questions of internal and external validity persist. The steps used in a quasi design are the same as those used in true designs.

Ex Post Facto Designs

An ex post facto design will determine which variables discriminate between subject groups.

Steps in an Ex Post Facto Design

Ex post facto studies cannot prove causation, but they may provide insight into a phenomenon.


Nominal Group Technique (NGT)

The NGT is a group discussion structuring technique. It is useful for providing a focused effort on topics. The NGT provides a method to identify issues of concern to special interest groups or the public at large. Ewert (1990) noted that the NGT is a collective decision-making technique for use in park and recreation planning and management. The NGT is used to obtain insight into group issues, behaviors and future research needs.

Source: (Mitra & Lankford, 1999)

The Delphi method was developed to structure discussions and summarize options from a selected group to:

Although the data may prove to be valuable, the collection process is very time consuming. When time is available and respondents are willing to be queried over a period of time, the technique can be very powerful in identifying trends and predicting future events.

The technique requires a series of questionnaires and feedback reports to a group of individuals. Each series is analyzed and the instrument/statements are revised to reflect the responses of the group. A new questionnaire is prepared that includes the new material, and the process is repeated until a consensus is reached.
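
The repeat-until-consensus loop described above can be sketched as follows. The text does not define how consensus is measured, so this sketch assumes one possible rule: the interquartile range (IQR) of the panel's ratings must fall at or below a threshold. The rule, the threshold, the function names, and the ratings are all assumptions made for illustration.

```python
import statistics

def has_consensus(ratings, max_iqr=1.0):
    """Return True when the interquartile range of the ratings is small enough.

    The IQR rule is an assumed consensus criterion, not one given in the text.
    """
    q1, _, q3 = statistics.quantiles(ratings, n=4)
    return (q3 - q1) <= max_iqr

def rounds_to_consensus(rounds, max_iqr=1.0):
    """Return the 1-based round at which consensus is first reached, else None."""
    for number, ratings in enumerate(rounds, start=1):
        if has_consensus(ratings, max_iqr):
            return number
    return None

# Hypothetical 1-5 importance ratings for one statement across three rounds.
rounds = [
    [1, 2, 5, 3, 5, 1, 4, 2],
    [2, 3, 4, 3, 5, 2, 4, 3],
    [3, 3, 4, 3, 4, 3, 3, 3],
]
print(rounds_to_consensus(rounds))
```

In a real study the ratings for each round only exist after the revised questionnaire has been returned, so the check would be run once per round rather than over a prepared list.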

The reading below is a research study that used the Delphi technique and content analysis to develop a national professional certification program.

Richard Krueger (1988) describes the focus group as a special type of group in terms of purpose, size, composition, and procedures. A focus group is typically composed of seven to twelve participants who are unfamiliar with each other, and it is conducted by a trained interviewer. These participants are selected because they have certain characteristics in common that relate to the topic of the focus group.

The researcher creates a permissive environment in the focus group that nurtures different perceptions and points of view, without pressuring participants to vote, plan, or reach consensus. The group discussion is conducted several times with similar types of participants to identify trends and patterns in perceptions. Careful and systematic analysis of the discussions provides clues and insights as to how a product, service, or opportunity is perceived.

A focus group can be defined as a carefully planned discussion designed to obtain perceptions on a defined area of interest in a permissive, nonthreatening environment. It is conducted with approximately seven to twelve people by a skilled interviewer. The discussion is relaxed, comfortable, and often enjoyable for participants as they share their ideas and perceptions. Group members influence each other by responding to ideas and comments in the discussion.


Focus group interviews typically have four characteristics:

Other types of group processes used in human services (Delphi, nominal, planning, therapeutic, sensitivity, or advisory) may have one or more of these features, but not in the same combination as those of focus group interviews.

Behavior/Cognitive Mapping

Cognitive and spatial mapping information provides a spatial map of:

All types of recreation activities and travel involve some level of environmental cognition because people must identify and locate recreation destinations and attractions.

Cognitive mapping allows recreation resource managers the opportunity to identify where users and visitors perceive the best recreation areas are located. It is important to understand user perceptions in order to manage intensive use areas in terms of maintenance, supervision, budgeting, policy development and planning.

Cognitive maps grid the research site into zones. The zones identify existing geographic, climatic, landscape, marine resources, and recreation sites. The grids allow respondents to indicate primary recreation sites, and then a composite is developed to identify high impact areas. Researchers collect data at recreation areas (beach, campground, marina, trailhead, etc.) by interviewing visitors and recreationists. During the data collection process, random sites, days, times, and respondents (every nth) should be chosen to increase the reliability and generalizability of the data.
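
The two mechanical steps in this paragraph, choosing every nth respondent and building a composite from the marked zones, can be sketched as follows. The zone labels, the responses, and the function names are hypothetical.

```python
from collections import Counter

def every_nth(visitors, n, start=0):
    """Systematic sample: take every nth visitor, beginning at index `start`."""
    return visitors[start::n]

def composite_map(responses):
    """Count how many respondents marked each grid zone as a primary site."""
    counts = Counter()
    for zones in responses:
        counts.update(set(zones))  # count each zone once per respondent
    return counts

# Hypothetical data: each respondent lists the grid zones they marked.
responses = [["A1", "B2"], ["B2"], ["B2", "C3"], ["A1", "B2"], ["C3"], ["B2"]]
composite = composite_map(responses)

# Zones with the highest counts are candidates for "high impact" areas.
print(composite.most_common(1))

# Selecting every 3rd person from a hypothetical stream of ten visitors.
print(every_nth(list(range(10)), n=3))
```

In the field, the `start` index would itself be chosen at random so that the first person interviewed is not always the first to arrive.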

Observational research is used for studying nonverbal behaviors (gestures, activities, social groupings, etc.).

Sommer & Sommer (1986) developed the list shown below to assist in observation research.

Casual observation is normally conducted much like an unstructured interview. During the early stages of a research project, casual observation allows the researcher(s) to observe subjects prior to designing questionnaires and/or interview formats.

Types of Observation Studies

Documents (also called Secondary Data or Data Mining)

Data mining is commonly used in both qualitative and quantitative research. Secondary data provide a framework for the research project, the development of research question(s), and the validation of study findings.

Frequently used sources of secondary data are:

Content analysis systematically describes the form or content of written and/or spoken material. It is used to study mass media quantitatively. The technique uses secondary data and is considered unobtrusive research.

The first step is to select the media to be studied and the research topic. The next is to develop a classification system to record the information. Trained judges or a computer program can be used to sort the data, which increases the reliability of the process.

Content analysis is a tedious process due to the requirement that each data source be analyzed along a number of dimensions. It may also be inductive (identifies themes and patterns) or deductive (quantifies frequencies of data). The results are descriptive, but will also indicate trends or issues of interest.
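
A deductive pass, in which a computer program sorts text into a predefined classification system and counts frequencies, might look like the sketch below. The categories, indicator words, and sample sentence are invented for illustration; a real coding scheme would come from the research question and would normally be checked against trained judges.

```python
import re
from collections import Counter

# Hypothetical classification system: category -> indicator words.
CODING_SCHEME = {
    "safety": {"safety", "risk", "injury"},
    "training": {"training", "course", "certification"},
}

def code_document(text, scheme=CODING_SCHEME):
    """Count how often each category's indicator words occur in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for word in words:
        for category, indicators in scheme.items():
            if word in indicators:
                counts[category] += 1
    return counts

sample = "Certification requires a safety course; training reduces injury risk."
print(code_document(sample))
```

Simple word matching ignores context and synonyms, so counts like these are a starting point for analysis, not a substitute for human coding.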

The reading below is a research study that used the Delphi technique and content analysis to develop a national professional certification program.

Meta-analysis combines the results of studies being reviewed. It utilizes statistical techniques to estimate the strength of a given set of findings across many different studies. This allows the creation of a context from which future research can emerge and determine the reliability of a finding by examining results from many different studies. Researchers analyze the methods used in previous studies, and collectively quantify the findings of the studies. Meta-analysis findings form a basis for establishing new theories, models and concepts.
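
One common way to "collectively quantify" findings is an inverse-variance weighted average of effect sizes, known as the fixed-effect model. The text does not name a specific technique, so this choice is an assumption, and the effect sizes and variances below are hypothetical.

```python
import math

def fixed_effect_summary(effects, variances):
    """Pool effect sizes with inverse-variance weights (fixed-effect model)."""
    # More precise studies (smaller variance) receive larger weights.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Standard error of the pooled estimate.
    std_error = math.sqrt(1.0 / sum(weights))
    return pooled, std_error

# Hypothetical standardized mean differences and their variances.
effects = [0.30, 0.50, 0.10]
variances = [0.04, 0.10, 0.02]
pooled, se = fixed_effect_summary(effects, variances)
print(round(pooled, 3), round(se, 3))
```

When the studies differ substantially in populations or methods, a random-effects model is often preferred; this sketch does not cover it.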

Thomas and Nelson (1990) detail the steps to meta-analysis:

Historical research is also referred to as analytical research. Common methodological characteristics include a research topic that addresses past events, review of primary and secondary data, techniques of criticism for historical searches and evaluation of the information, and synthesis and explanation of findings. Historical studies attempt to provide information and understanding of past historical, legal, and policy events.

Five basic procedures common to the conduct of historical research were identified by McMillan & Schumacher (1984). They provide a systematic approach to the process of historical research.

Step 1: Define the problem, asking pertinent questions such as: Is the historical method appropriate? Are pertinent data available? Will the findings be significant in the leisure services field?

Step 2: Develop the research hypothesis (if necessary) and research objectives to provide a framework for the conduct of the research. Research questions focus on events (who, what, when, where), how an event occurred (descriptive), and why the event happened (interpretive). This contrasts with quantitative studies, in which the researcher is testing hypotheses and trying to determine the significance of differences between scores for experimental and control groups or the relationships between variable x and variable y.

Step 3: Collect the data, which consists of taking copious notes and organizing the data. The researcher should code topics and subtopics in order to arrange and file the data. The kinds of data analysis employed in historical research include (based on McMillan & Schumacher, 1984):

Step 4: Utilizing external and internal criticism, the researcher should evaluate the data. Sources of data include documents (letters, diaries, bills, receipts, newspapers, journals/magazines, films, pictures, recordings, personal and institutional records, and budgets), oral testimonies of participants in the events, and relics (textbooks, buildings, maps, equipment, furniture, and other objects).

Step 5: Report the findings, including a statement of the problem, a review of source material, assumptions, research questions and methods used to obtain findings, the interpretations and conclusions, and a thorough bibliographic referencing system.

The multimethod approach encourages collecting, analyzing and integrating data from several sources and the use of a variety of different types of research methods.

Copyright 2001. Northern Arizona University, ALL RIGHTS RESERVED


We have developed software to create semi-synthetic simulations based on real data to compare the performance of some of the most popular pathway analysis methods. The most powerful methods are sigPathway and GSEA, which have similar performance and detect secondary signals based on pathway overlap. By taking an applied approach to methods comparison, valuable information is gained about the power of each GSA method. With such information, more confidence is carried into any functional follow-up studies.