Descriptive statistics are an important part of biomedical research and are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Measures of central tendency and dispersion are used to describe quantitative data. For continuous data, testing for normality is an important step in deciding which measures of central tendency and which statistical methods to use for data analysis. When the data follow a normal distribution, parametric tests are used to compare the groups; otherwise, nonparametric methods are used. There are different methods to test the normality of data, including numerical and visual methods, and each has its own advantages and disadvantages.
In the present study, we discuss the summary measures and the methods used to test the normality of the data.

Introduction

A data set is a collection of data on individual cases or subjects. It is usually meaningless to present such data individually, because that will not produce any important conclusions. In place of individual case presentation, we present summary statistics of the data set, with or without an analytical form, which can be easily absorbed by the audience. Statistics, the science of collection, analysis, presentation, and interpretation of data, has two main branches: descriptive statistics and inferential statistics. Summary measures, summary statistics, or descriptive statistics are used to summarize a set of observations in order to communicate the largest amount of information as simply as possible. Descriptive statistics are the kind of information presented in just a few words to describe the basic features of the data in a study, such as the mean and standard deviation (SD). The other branch is inferential statistics, which draws conclusions from data that are subject to random variation (e.g., observational errors and sampling variation). In inferential statistics, most predictions are for the future, and generalizations about a population are made by studying a smaller sample. Statistical methods are used to draw such inferences from the study participants, for example in terms of different groups.
These statistical methods have some assumptions, including normality of the continuous data. There are different methods to test the normality of data, including numerical and visual methods, and each has its own advantages and disadvantages. Descriptive and inferential statistics are both employed in the scientific analysis of data, and both are equally important in statistics.
In the present study, we discuss the summary measures used to describe the data and the methods used to test the normality of the data. To illustrate descriptive statistics and tests of normality, an example data set of 15 patients whose mean arterial pressure (MAP) was measured is given below. Further examples related to the measures of central tendency, dispersion, and tests of normality are discussed based on these data.

Measures of Central Tendency

Data are commonly described by a measure of central tendency (also called a measure of central location), which is used to find the representative value of a data set. The mean, median, and mode are three types of measures of central tendency.
Measures of central tendency give us one value (mean or median) for the distribution, and this value represents the entire distribution. To make comparisons between two or more groups, the representative values of these distributions are compared. This helps in further statistical analysis, because many statistical techniques, such as measures of dispersion, skewness, correlation, the t-test, and ANOVA, are calculated using the value of a measure of central tendency.
That is why measures of central tendency are also called measures of the first order. A representative value (measure of central tendency) is considered good when it is calculated using all observations and is not affected by extreme values, because these values are used in further calculations.

Mean

The mean is the mathematical average of a set of data. It is calculated as the sum of the observations divided by the number of observations.
It is the most popular measure and very easy to calculate. It is a unique value for one group, that is, there is only one answer, which is useful when comparing between the groups.
In the computation of the mean, all the observations are used. One disadvantage of the mean is that it is affected by extreme values (outliers). For example, in the example data, the mean MAP of the patients was 97.47, indicating that the average MAP of the patients was 97.47 mmHg.
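The calculation above can be sketched in a few lines of Python. The MAP values used here are hypothetical stand-ins (the paper's actual 15 observations are not reproduced in this text), chosen only so that the minimum (82 mmHg), maximum (116 mmHg), and median (95 mmHg) match the figures quoted later; the resulting mean therefore differs slightly from the reported 97.47 mmHg.

```python
# Hypothetical MAP values (mmHg) for 15 patients; illustrative only,
# not the study's actual data set.
map_values = [82, 84, 85, 88, 90, 92, 94, 95, 97, 100, 102, 105, 108, 110, 116]

# Mean = sum of observations / number of observations
mean_map = sum(map_values) / len(map_values)
print(round(mean_map, 2))  # 96.53
```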
Median

The median is defined as the middle observation when the data are arranged in either increasing or decreasing order of magnitude. Thus, it is one of the observations that occupies the central place in the distribution (data). It is also called the positional average. Extreme values (outliers) do not affect the median. It is unique, that is, there is only one median per data set, which is useful when comparing between groups.
One disadvantage of the median compared with the mean is that it is not as widely used. For example, in the example data, the median MAP of the patients was 95 mmHg, indicating that 50% of the observations are less than or equal to 95 mmHg and the remaining 50% are equal to or greater than 95 mmHg.

Standard deviation and variance

The SD is a measure of how spread out the values are from the mean value. Its symbol is σ (the Greek letter sigma) or s.
It is called the SD because a standard value (the mean) is taken as the reference when measuring the dispersion. For a sample, SD = √[Σ(xᵢ − x̄)²/(n − 1)], where xᵢ is an individual value, x̄ is the mean value, and n is the number of observations.
The variance is the square of the SD.

Standard error

The standard error is the approximate difference between the sample mean and the population mean. When we draw many samples of the same size from the same population through random sampling, the SD among the sample means is called the standard error. If the sample SD and sample size are given, the standard error for the sample can be calculated using the formula: standard error = sample SD/√sample size. For example, in the example data, the standard error is 2.84 mmHg, which shows that the average difference between the sample means and the population mean is 2.84 mmHg.
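The variance, SD, and standard error can be computed directly from the definitions above. The sketch below uses the same hypothetical MAP values as before (not the study's actual data), so the numbers differ from the paper's reported SD and standard error.

```python
import math

# Hypothetical sample of MAP values (mmHg); illustrative only.
data = [82, 84, 85, 88, 90, 92, 94, 95, 97, 100, 102, 105, 108, 110, 116]
n = len(data)
mean = sum(data) / n

# Sample variance uses n - 1 in the denominator; SD is its square root.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
sd = math.sqrt(variance)

# Standard error = sample SD / sqrt(sample size)
se = sd / math.sqrt(n)
print(round(sd, 2), round(se, 2))  # 10.18 2.63
```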
Quartiles and interquartile range

The quartiles are the three points that divide a data set, arranged in either ascending or descending order, into four equal groups, each comprising a quarter of the data.

Percentile

The percentiles are the 99 points that divide a data set, arranged in either ascending or descending order, into 100 equal groups, each comprising 1% of the data. The 25th percentile is the first quartile, the 50th percentile is the second quartile (also called the median), and the 75th percentile is the third quartile of the data. The ith percentile is the i(n + 1)/100 th observation, where i = 1, 2, 3, …, 99. Example: in the data above, the 10th percentile = 10(n + 1)/100 = 1.6th observation from the start, which falls between the first and second observations = 1st observation + 0.6 × (difference between the second and first observations) = 83.20 mmHg, indicating that 10% of the data are ≤83.20 mmHg and the remaining 90% of the observations are ≥83.20 mmHg.

Range

The difference between the largest and smallest observations is called the range.
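The i(n + 1)/100 positional rule with linear interpolation can be sketched as a short function. The data set is again a hypothetical stand-in whose first two sorted values (82 and 84 mmHg) reproduce the worked 10th-percentile example.

```python
def percentile(sorted_data, i):
    """i-th percentile using the i*(n+1)/100 position with linear interpolation."""
    pos = i * (len(sorted_data) + 1) / 100
    k = int(pos)          # integer part: the k-th ordered observation
    frac = pos - k        # fractional part used for interpolation
    if k == 0:
        return sorted_data[0]
    if k >= len(sorted_data):
        return sorted_data[-1]
    return sorted_data[k - 1] + frac * (sorted_data[k] - sorted_data[k - 1])

# Hypothetical sorted MAP values (mmHg); illustrative only.
data = [82, 84, 85, 88, 90, 92, 94, 95, 97, 100, 102, 105, 108, 110, 116]
p10 = percentile(data, 10)  # 1.6th observation -> 82 + 0.6 * (84 - 82) = 83.2
p50 = percentile(data, 50)  # 8th observation -> the median, 95
```

Note that statistical packages offer several percentile conventions; NumPy's `np.percentile`, for example, defaults to a different interpolation method, so results for small samples can differ slightly.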
If A and B are the smallest and largest observations in a data set, then the range (R) is equal to the difference between them, that is, R = B − A. For example, in the data above, the minimum and maximum observations are 82 mmHg and 116 mmHg; hence, the range of the data is 34 mmHg (which can also be written as 82–116). Descriptive statistics can be calculated in the statistical software SPSS (analyze → descriptive statistics → frequencies or descriptives).

Why to test the normality of data

Various statistical methods used for data analysis, including correlation, regression, t-tests, and analysis of variance, make assumptions about normality. The central limit theorem states that when the sample size has 100 or more observations, violation of normality is not a major issue. Nevertheless, for meaningful conclusions, the assumption of normality should be checked irrespective of the sample size.
If continuous data follow a normal distribution, we present the data as a mean value, and this mean value is used to compare between or among the groups and to calculate the significance level (P value). If the data are not normally distributed, the resulting mean is not a representative value of the data, and a significance level calculated from a wrongly selected representative value may give a wrong interpretation. That is why we first test the normality of the data and then decide whether the mean is applicable as the representative value. If it is applicable, means are compared using parametric tests; otherwise, medians are used to compare the groups, using nonparametric methods.

Methods used for test of normality of data

An assessment of the normality of data is a prerequisite for many statistical tests because normality is an underlying assumption in parametric testing.
There are two main methods of assessing normality: graphical and numerical (including statistical tests). Statistical tests have the advantage of making an objective judgment of normality but the disadvantage of sometimes not being sensitive enough at small sample sizes or being overly sensitive at large sample sizes. Graphical interpretation has the advantage of allowing good judgment in situations where numerical tests might be over- or undersensitive, but it requires a great deal of experience to avoid wrong interpretations. Without such experience, it is best to rely on the numerical methods.
Various methods are available to test the normality of continuous data; the most popular among them are the Shapiro–Wilk test, the Kolmogorov–Smirnov test, skewness, kurtosis, histograms, box plots, P–P plots, Q–Q plots, and the mean with SD. The two best-known tests of normality, the Kolmogorov–Smirnov test and the Shapiro–Wilk test, are the most widely used. Normality tests can be conducted in the statistical software SPSS (analyze → descriptive statistics → explore → plots → normality plots with tests). The Shapiro–Wilk test is the more appropriate method for small sample sizes (n < 50), although it can also handle larger samples, while the Kolmogorov–Smirnov test is used for n ≥ 50. For both tests, the null hypothesis states that the data are taken from a normally distributed population; when P > 0.05, the null hypothesis is accepted and the data are regarded as normally distributed. Skewness is a measure of symmetry, or more precisely, of the lack of symmetry relative to the normal distribution. Kurtosis is a measure of the peakedness of a distribution.
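Outside SPSS, the same numerical checks are available in SciPy. The sketch below runs the Shapiro–Wilk and Kolmogorov–Smirnov tests and computes skewness and excess kurtosis on the hypothetical MAP sample used throughout (so the P values will differ from the paper's).

```python
import numpy as np
from scipy import stats

# Hypothetical MAP sample (mmHg); illustrative only.
data = np.array([82, 84, 85, 88, 90, 92, 94, 95, 97,
                 100, 102, 105, 108, 110, 116])

# Shapiro-Wilk: preferred for small samples; P > 0.05 -> do not reject normality.
w_stat, w_p = stats.shapiro(data)

# Kolmogorov-Smirnov against a normal with the sample's mean and SD.
ks_stat, ks_p = stats.kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))

# Skewness and excess kurtosis (scipy subtracts 3 by default, like SPSS).
skew = stats.skew(data)
kurt = stats.kurtosis(data)
```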
The original kurtosis value is sometimes called kurtosis (proper). Most of the statistical packages such as SPSS provide “excess” kurtosis (also called kurtosis excess) obtained by subtracting 3 from the kurtosis (proper).
A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. If the mean, median, and mode of a distribution coincide, it is called a symmetric distribution, that is, skewness = 0 and kurtosis (excess) = 0. A distribution is called approximately normal if the skewness or kurtosis (excess) of the data is between −1 and +1, although this is a less reliable criterion for small-to-moderate sample sizes (i.e., n < 300). For sample sizes > 300, normality of the data depends on the histograms and the absolute values of skewness and kurtosis: an absolute skewness value ≤2 or an absolute kurtosis (excess) ≤4 may be used as reference values for determining considerable normality. A histogram is an estimate of the probability distribution of a continuous variable.
If the graph is approximately bell-shaped and symmetric about the mean, we can assume normally distributed data. In statistics, a Q–Q plot is a scatterplot created by plotting two sets of quantiles (observed and expected) against one another. For normally distributed data, the observed quantiles approximate the expected ones, that is, they are statistically equal. A P–P plot (probability–probability plot or percent–percent plot) is a graphical technique for assessing how closely two data sets (observed and expected) agree; it forms an approximately straight line when the data are normally distributed.
Departures from this straight line indicate departures from normality. A box plot is another way to assess the normality of data. It shows the median as a horizontal line inside the box and the IQR (the range between the first and third quartiles) as the length of the box. The whiskers (lines extending from the top and bottom of the box) represent the minimum and maximum values when these lie within 1.5 times the IQR from either end of the box (i.e., between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR). Values beyond 1.5 times and 3 times the IQR from the box fall outside the box plot and are considered outliers and extreme outliers, respectively. A box plot that is symmetric, with the median line approximately at the center of the box and with symmetric whiskers, indicates that the data may have come from a normal distribution. If many outliers are present in the data set, either the outliers need to be removed or the data should be treated as nonnormally distributed. Another check of normality is the size of the SD relative to the mean: if the SD is less than half the mean (i.e., CV < 50%), the data are considered normal.
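The box-plot fences and the CV check can be sketched numerically. The quartile function below uses the (n + 1)-position convention from the percentile section, applied to the same hypothetical MAP sample; with these data the fences enclose all observations (no outliers) and the CV is well under 50%.

```python
import statistics

# Hypothetical MAP sample (mmHg); illustrative only.
data = sorted([82, 84, 85, 88, 90, 92, 94, 95, 97, 100, 102, 105, 108, 110, 116])

def quartile(sorted_data, q_num):
    """q_num-th quartile via the q_num*(n+1)/4 position with interpolation."""
    pos = q_num * (len(sorted_data) + 1) / 4
    k, frac = int(pos), pos - int(pos)
    return sorted_data[k - 1] + frac * (sorted_data[k] - sorted_data[k - 1])

q1, q3 = quartile(data, 1), quartile(data, 3)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Coefficient of variation: SD as a percentage of the mean.
cv = 100 * statistics.stdev(data) / statistics.mean(data)
```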
[Figure: Box plot showing distribution of the mean arterial pressure]

For example, the MAP data of the 15 patients are given above, and the normality of these data was assessed.
The results showed that the data were normally distributed, as the skewness (0.398) and kurtosis (−0.825) were each within ±1. The critical ratios (Z values) of the skewness (0.686) and kurtosis (−0.737) were within ±1.96, also evidence of a normal distribution. Similarly, the Shapiro–Wilk test (P = 0.454) and the Kolmogorov–Smirnov test (P = 0.200) were statistically insignificant, that is, the data were considered normally distributed.
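The critical ratios quoted above are the statistic divided by its standard error. A sketch using the SE formulas SPSS applies for skewness and kurtosis, with the skewness and kurtosis values reported in the text, reproduces the quoted Z values to two decimals:

```python
import math

n = 15
skewness, kurtosis = 0.398, -0.825   # values reported in the text

# SPSS standard-error formulas for skewness and (excess) kurtosis.
se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = math.sqrt(4 * (n**2 - 1) * se_skew**2 / ((n - 3) * (n + 5)))

z_skew = skewness / se_skew   # ~0.69, within +/-1.96
z_kurt = kurtosis / se_kurt   # ~-0.74, within +/-1.96
```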
As the sample size is <50, the Shapiro–Wilk test result should be given preference.

Conclusions

Descriptive statistics are a statistical method of summarizing data in a valid and meaningful way. A good and appropriate measure is important not only for describing the data but also for the statistical methods used in hypothesis testing. For continuous data, testing of normality is very important, because the normality status determines the measures of central tendency and dispersion and the selection of parametric or nonparametric tests. Although there are various methods for normality testing, for small sample sizes (n < 50) the Shapiro–Wilk test should be preferred.