Data analysis

From Wikipedia, the free encyclopedia

Data analysis is a process of gathering, modeling, and transforming data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, data analysis is sometimes divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling, which is unrelated to the subject of this article.

Nuclear and particle physics

In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. They are then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information. Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad hoc. The resulting data n-tuples are then scrutinized by high-energy physicists, using specialized software tools like ROOT or PAW, who compare the results of the experiment with theory.

The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software like Geant4, which predicts the response of the detector to a given theoretical event and produces simulated events that are then compared to experimental data.
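
The idea of producing simulated events for comparison with data can be illustrated with a toy Monte Carlo. This is only a sketch and does not reflect how Geant4 itself works: a hypothetical detector is assumed to smear the true energy of each event with Gaussian resolution, and the simulated events are binned into a histogram of the kind that would be compared with measured data. The function names, the 5% resolution, and the energy value of 100.0 are all assumptions made for illustration.

```python
import random
import statistics

def simulate_detector(true_energy, resolution, n_events, seed=0):
    """Toy Monte Carlo: smear a true energy with Gaussian detector resolution.

    `resolution` is an assumed fractional resolution (hypothetical value).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    sigma = resolution * true_energy
    return [rng.gauss(true_energy, sigma) for _ in range(n_events)]

def histogram(values, low, high, n_bins):
    """Count values per equal-width bin; values outside [low, high) are dropped."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        if low <= v < high:
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts

simulated = simulate_detector(true_energy=100.0, resolution=0.05, n_events=10_000)
sim_hist = histogram(simulated, low=80.0, high=120.0, n_bins=8)
print("simulated mean:", round(statistics.mean(simulated), 2))
print("simulated histogram:", sim_hist)
```

In a real analysis the same binning would be applied to the experimental n-tuples, and the two histograms would be compared bin by bin.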

See also: Computational physics.

Social sciences

Qualitative data analysis (QDA) or qualitative research is the non-quantitative analysis of data from non-numerical sources, such as words, photographs, and observations.

Phases in data analysis

The statistical analysis of data is a process with several phases, each with its own goal.

Data cleaning

During data cleaning, erroneous entries are inspected and corrected where possible. In some cases, it is easy to substitute suspect data with the correct values. However, when it is unclear what caused the erroneous data or what should replace it, no subjective decisions should be made, so that the quality of the data is preserved. Furthermore, it is important not to throw information away at any stage of the data cleaning phase. When altering variables, the original values should be kept in a duplicate dataset or under a different variable name so that information is always cumulatively retrievable.[1]
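
The rule of keeping original values under a different variable name can be sketched as follows. The record layout, the field names `age` and `age_clean`, and the plausible range of 0 to 120 are hypothetical choices for illustration, not part of the cited source. Values that cannot be interpreted are marked as missing rather than replaced with a guess.

```python
def clean_age(records, low=0, high=120):
    """Return new records with a cleaned `age_clean` field; `age` is left untouched.

    Values that cannot be parsed or fall outside [low, high] become None
    rather than being guessed, so no subjective replacement is made.
    """
    cleaned = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the original record is never modified
        try:
            age = float(rec.get("age"))
        except (TypeError, ValueError):
            age = None
        if age is not None and not (low <= age <= high):
            age = None
        new_rec["age_clean"] = age
        cleaned.append(new_rec)
    return cleaned

# Hypothetical raw entries: one valid, one out of range, one unparseable.
raw = [{"id": 1, "age": "34"}, {"id": 2, "age": "340"}, {"id": 3, "age": "n/a"}]
for row in clean_age(raw):
    print(row)
```

Because the raw `age` field survives next to `age_clean`, any later decision about the suspect entries can still be revisited.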

Initial data analysis

The initial data analysis uses descriptive statistics to answer the following four questions:[1]

  1. What is the quality of the data?
  2. What is the quality of the measurements?
  3. Did the implementation of the study fulfill the intentions of the research design?
  4. What are the characteristics of the data sample?

Each step of the initial data analysis is described below.

The quality of the data

The quality of the data can be assessed in several ways. First, the distribution of the variables before data cleaning is compared with the distribution after data cleaning to see whether cleaning has had unwanted effects on the data. Second, the missing observations are analyzed to see whether they are missing at random and whether some form of imputation is needed. Third, extreme observations are analyzed to see whether they disturb the distribution. If so, robust techniques can be applied.
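
A minimal sketch of these checks is given below, under several assumptions: missing observations are encoded as `None`, the two example lists are hypothetical, and extreme observations are flagged with a modified z-score based on the median and the median absolute deviation (the constant 0.6745 and the cutoff 3.5 are a common convention, not something prescribed by the text). The sketch only counts missing values; it does not test whether they are missing at random.

```python
import statistics

def summarize(values):
    """Summarize the non-missing values and count the missing ones."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }

def flag_extremes(values, threshold=3.5):
    """Return values whose modified z-score (median/MAD based) exceeds `threshold`."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median([abs(v - med) for v in present])
    if mad == 0:
        return []  # no spread around the median, nothing can be flagged
    return [v for v in present if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical variable before and after cleaning (40.0 was set to missing).
before = [4.1, 3.9, None, 4.0, 40.0, 4.2, None, 3.8]
after = [4.1, 3.9, None, 4.0, None, 4.2, None, 3.8]

print("before:", summarize(before))
print("after: ", summarize(after))
print("extreme values before cleaning:", flag_extremes(before))
```

Comparing the two summaries shows how strongly a single extreme entry affects the mean and standard deviation, while the median barely moves, which is the motivation for robust techniques.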

The quality of the measurements

When the quality of the measurement instruments is not the main focus of the research, it can be checked during initial data analysis. One way to assess the quality of a measurement instrument is to perform an analysis of homogeneity (internal consistency). A homogeneity index like Cronbach's α gives an indication of the reliability of a measurement instrument.
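
Cronbach's α for an instrument with k items is commonly computed as k/(k-1) times (1 minus the sum of the item variances divided by the variance of the total scores). The sketch below follows that formula using sample variances; the function name and the questionnaire scores are hypothetical.

```python
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items, each a list of per-respondent scores."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    item_variances = [statistics.variance(item) for item in item_scores]
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical questionnaire: three items answered by five respondents.
items = [
    [2, 3, 4, 4, 5],
    [3, 3, 4, 5, 5],
    [2, 4, 4, 5, 5],
]
print(f"Cronbach's alpha: {cronbach_alpha(items):.3f}")
```

Values closer to 1 indicate that the items vary together, which is what internal consistency means here.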

The implementation of the design

In many cases, a check to see whether the randomization procedure has worked will be the starting point for analyzing the implementation of the design. This can be done by checking whether variables are equally distributed across groups. Other ways of checking the implementation of the design are manipulation checking and the analysis of nonresponse and dropout.
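
One simple way to check whether a variable is equally distributed across groups is the standardized mean difference, sketched below. This is one possible balance diagnostic among several, not necessarily the one the cited source has in mind; the group data are hypothetical, and the cutoff of 0.1 used in the comment is a rule of thumb that should be treated as an assumption.

```python
import math
import statistics

def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical ages in two randomized groups.
treatment_age = [34, 41, 29, 50, 38, 45]
control_age = [36, 39, 31, 48, 40, 44]

smd = standardized_mean_difference(treatment_age, control_age)
# An absolute value below roughly 0.1 is often read as adequate balance.
print(f"standardized mean difference in age: {smd:.3f}")
```

Repeating the calculation for each baseline variable gives a quick overview of whether randomization produced comparable groups.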

Characteristics of the data sample

In this step, the findings of the initial data analysis are documented and possible corrective actions are taken. For instance, when the distribution of a variable is not normal, the data may need to be transformed or categorized. Furthermore, a decision should be made on how to handle missing data and outliers. If the randomization procedure seems to be defective, propensity scores can be calculated and included in the main analyses as a covariate.

References

  1. Adèr, H. J.; Mellenbergh, G. J.; Hand, D. J. (2008), "Chapter 14: Phases and initial steps in data analysis", Advising on Research Methods: A Consultant's Companion, Huizen, the Netherlands: Johannes van Kessel publishing, p. 336, ISBN 978-90-79418-02-2
