Statistical Data Description

## Statistical Data

In this post, we discuss the various types of statistical data and their collection, analysis and presentation.

• Statistics
• Application of Statistics
• Data Collection
• Statistical Error
• Classification of Data
• Data Diagrams
• Frequency Distribution
• Frequency Polygon
• Histogram
• Frequency Curve

## Statistics

The word ‘statistics’ is derived from the Latin word ‘status’, meaning a political state. It also resembles the Italian word ‘statista’, the German word ‘Statistik’ and the French word ‘statistique’, all of which carry the same meaning of a state.

Some people think that the word statistics came from the Italian word ‘statista’. In those days, statistics was analogous to the state, or to data collected by the state. In India, the Arthasastra by Kautilya, written during the reign of Chandragupta, added value to the concept. Apart from that, Abul Fazl contributed to agricultural statistics in his book Ain-i-Akbari.

In general, statistics is the science (a branch of mathematics) that deals with the methods of collecting, classifying, presenting, comparing and interpreting numerical data, in order to help draw conclusions about the sphere of enquiry.

In that view, statistics may be defined as the quantitative and qualitative analysis of data collected for statistical analysis and interpretation.

Application of Statistics in Economics

Modern developments in Economics have their roots in statistics. Economics and Statistics are closely related. Time Series Analysis, Index Numbers, Demand Analysis etc. are some areas common to Economics and Statistics.

Econometrics interacts with statistics in a very positive way. Socio-economic surveys, analysed with the help of different statistical techniques, play a vital role in Economics for making future projections of the demand for goods, sales, prices, quantities etc., which are all part of economic planning. Regression Analysis is a powerful tool for economic planning.

Application of Statistics in Business Management

Due to the increasing complexity of the business and industry environment, most decision-making processes rely upon quantitative techniques, which can be described as a combination of statistical techniques and operations research techniques. Statistical decision theory is another component of statistics, focusing on the analysis of complex business strategies.

Application of Statistics in Commerce and Industry

Modern managers, industrialists and businessmen are relying more and more on statistical procedures. Data on previous sales, raw materials, wages and salaries, products of an identical nature from other factories etc. are collected and analysed.

Measures of central tendency and dispersion, correlation and regression analysis, time series analysis, index numbers, sampling, statistical quality control are some of the statistical methods applied in commerce and industry.

Shortcomings of Statistics

Statistical methods have the following shortcomings:

• Statistics deals with aggregates. Individual observations are considered only as part of the aggregate and are given no importance on their own.
• Statistics is concerned with quantitative data. Qualitative data have to be converted to quantitative data by assigning a numerical description to them.
• Projections (e.g. of sales, production, price and quantity) are possible only under a specific set of conditions. If any of these conditions is violated, the projections are likely to be inaccurate. Statistical projections are based on assumptions, which may not hold in the current situation.
• The theory of statistical inference is built upon random sampling. If the rules for random sampling are not strictly adhered to, conclusions drawn on the basis of unrepresentative samples will be erroneous. Inferences based on sampling are subject to sampling variations and limitations.

## Statistical Data Variables

‘Data’ is quantitative information about some particular characteristic(s) under consideration. A quantitative characteristic is known as a variable. A variable may be either Discrete or Continuous.

• Discrete : When a variable assumes a finite or a countably infinite number of isolated values, it is known as a discrete variable (like the number of misprints in the books of a publisher or the number of road accidents in a locality).

Such numbers are discrete, as there cannot be a fractional number of accidents or misprints (like 2.5).

• Continuous : A variable is known to be continuous if it can assume any value from a given interval (like the height, weight or age of a person). Characteristics such as the gender of a baby, the nationality of a person or the colour of a flower are not variables at all; they are qualitative characteristics, known as attributes.

A continuous variable can take any value between two extremes. For example, suppose the age of eligible applicants for a post must be between 25 and 40. In this case, the age of an eligible applicant is a continuous variable, as it can take any value between 25 and 40.

Classification of Statistical Data : Statistical data may be classified as :

• Primary Data : Data collected for the first time by an investigator or agency are known as primary data (e.g. Prof. X collects data on the age, height and weight of every student in his school).
• Secondary Data : Primary data that have already been collected and are being used by a different person or agency are known as secondary data (e.g. Prof. Y prepares statistical data using the primary data on the age, height and weight of students collected by Prof. X).

## Statistical Primary Data Collection

Collection of data plays a vital role in any statistical analysis.

Criteria for choosing a method of primary data collection

While selecting a method for collecting primary data, the following considerations should be taken into account:

Nature, object and scope of the investigation; available financial resources and time; degree of accuracy and reliability needed; status of the investigator conducting the survey, etc.

Methods of primary data collection

Interview means questioning an individual person and getting answers or feedback to the questions.

• Personal interview : In the personal interview method, the investigator meets the respondents directly and collects the required information from them.

In case of a natural calamity like a super cyclone, an earthquake or an epidemic, we may collect the necessary data much more quickly and accurately by applying this method.

• Indirect Interview : If there are practical problems in meeting the respondents directly (e.g. in case of a rail accident), an indirect interview may be conducted. The investigator collects the essential information from persons associated with the event.
• Telephone interview : The telephone interview method is a quick and rather inexpensive way to collect primary data. The researcher contacts the interviewee over the phone.

A telephone interview, though less consistent than a personal or an indirect interview, has a wide coverage.

The amount of non-response is higher for the telephone interview method.

• Mailed questionnaire : The mailed questionnaire method involves framing a well-drafted and soundly sequenced questionnaire covering all the important aspects of the problem under consideration.

The questionnaire sheets are sent to the respondents along with all the requisite guidelines for filling them in.

Although a wide area can be covered using the mailed questionnaire method, the amount of non-response is likely to be high.

• Observation : In the observation method, data are collected by direct observation or by using instruments. Though it is a good method of data collection, it is time-consuming, laborious and covers only a small area.

Sources of Secondary Statistical Data

• Publications of Universities and Research Institutes : Universities, Research Institutes and other bodies publish the relevant data.
• Newspapers and Journals:  Various newspapers, journals, magazines and dailies etc. provide data.
• Other sources of secondary data : International sources like WHO, IMF, World Bank etc.; government sources like the statistical reports of CSO and Indian Agricultural Statistics by the Ministry of Food and Agriculture; private and quasi-government sources like ISI, ICAR, NCERT etc.; unpublished sources of various research institutes, researchers etc.; and research reports available on the internet.
• Scrutiny & Validation of Data : Statistical data under study, whether collected from a primary or a secondary source, must be scrutinized and validated for consistency and for applicability to the purpose and objective of the statistical analysis for which the data are being used.

## Statistical Error

Error in the science of statistics does not mean a mistake. Statistical errors arise because data collection with complete coverage is impracticable.

Statistical Error Types

• Errors of origin: Errors which arise in the collection of data are known as errors of origin. Errors due to a defective definition of the subject-matter of the investigation, a defective questionnaire, carelessness of the enumerators etc. come under this category.
• Errors of inadequacy: Errors which arise due to incomplete data or due to the smallness of the sample selected are called errors of inadequacy.
• Errors of manipulation: Errors which arise while analysing and explaining the statistical data are called errors of manipulation.

Examples are clerical errors, arithmetical slips, etc. These errors mostly occur while counting, measuring and classifying data. Misuse of averages, percentages, etc. may also result in such errors.

If proper care is taken in the process of analysis, errors of manipulation can be reduced.

• Biased errors: The statistical errors which tend to be in the same direction are said to be biased.

Biased errors arise due to the bias of the informants, bias of the investigator, defects in measuring instruments, defective selection of the sample, bias in interpretation of the results, etc.

• Unbiased errors: Statistical errors which arise in the normal course of investigation are called unbiased errors.

Such errors may arise accidentally, without any bias or prejudice. These errors tend to offset or cancel each other and leave only a nominal effect on the general results. They are also termed compensating errors, accidental errors or random errors.

• Sampling errors: Errors in a sample investigation which are due to sampling itself are called sampling errors.

In a sample investigation, only a small part of the population is studied, and hence the results are bound to differ from the census results by a certain amount of error. Such error is known as sampling error. The term ‘sampling error’ is used only in the case of a sample investigation; it has no significance in the census method.

Sampling errors are also called sampling fluctuations.

• Non-sampling errors: Errors that arise in a sample investigation for reasons other than sampling, and that also appear in a census investigation, are known as non-sampling errors. These errors may be classed as response errors, prestige errors, non-response errors, publication errors, etc.

Non-sampling errors include biases and mistakes. They may occur at any stage of planning and execution of the investigation (sample or census). These errors increase with the number of items. In a sample investigation, non-sampling errors may arise from a defective frame, a faulty choice of sampling units, wrong use of sampling techniques, etc.
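The idea of sampling error described above can be illustrated with a small simulation. This is only a sketch: the population of wages is made up, and `sampling_error` is a hypothetical helper name, not part of any standard method.

```python
import random
import statistics

random.seed(42)  # fixed seed so that the illustration is repeatable

# Hypothetical population: daily wages of 1,000 workers (made-up numbers).
population = [random.randint(200, 800) for _ in range(1000)]
census_mean = statistics.mean(population)

def sampling_error(sample_size):
    """Difference between the mean of a simple random sample and the census mean."""
    sample = random.sample(population, sample_size)
    return statistics.mean(sample) - census_mean

for n in (10, 100, 1000):
    print(n, round(sampling_error(n), 2))
```

When the sample covers the whole population (n = 1000 here), the investigation is a census and the sampling error vanishes, which matches the remark that the term has no significance in the census method.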

## Statistical Data Classification

The process of arranging data into a number of groups or classes, on the basis of the characteristic under consideration and according to the similarities of the observations, is called classification of data.

Objectives of classification of data

• It puts the data in a precise and summarised form.
• It makes comparison possible between various characteristics.

Types of classified data

• Time Series Data : When the data are classified in respect of successive time points or intervals, they are termed time series data.

The number of students who appeared for the ICA final in each of ten years, the yearly production of a factory from 1995 to 2015 etc. are examples of time series data.

• Geographical Data : Data arranged region-wise are known as geographical data.

For example, the numbers of students who appeared for the CA final in the year 2015 in different states form geographical data.

• Qualitative & Quantitative Data : Data may be classified as qualitative or quantitative, according to whether they are classified in respect of an attribute or a variable.
• Qualitative Data : Data classified in respect of an attribute, such as the nationality, gender or drinking habit of a group of individuals, are qualitative data.
• Quantitative Data : When data are classified in respect of a variable, say age, weight, profits, wages etc., they are called quantitative data.
• Frequency & Non-frequency Data : Data may also be classified as frequency data and non-frequency data.
• Frequency Data : Qualitative as well as quantitative data belong to the frequency group.
• Non-frequency Data : Time series data and geographical data belong to the non-frequency group.

## Statistical Data Presentation

Statistical data are presented in a summarized and compact form that yields meaningful information about the data.

Statistical data may be presented by various methods.

Textual Presentation : This method comprises presenting data with the help of a paragraph or a number of paragraphs. The official report of an investigation commission is usually prepared by textual presentation.

The benefit of this mode of presentation lies in its simplicity: even a layman can understand the data. Observations with their exact magnitudes can be presented with the help of textual presentation. Moreover, this type of presentation can be taken as the first step towards the development of other methods of presentation.

Textual presentation is not preferred by statisticians, simply because it is dull and monotonous.

Tabular presentation

Tabular presentation or Tabulation refers to presentation of data with the help of a statistical table having a number of rows and columns and complete with reference number, title, description of rows as well as columns and footnotes, if any.

Tabular Presentation of Data

Tabular Presentation Rules

• A statistical table should be allotted a serial number along with a self-explanatory title.
• The table under consideration should be divided into Caption, Box-head, Stub and Body. The Caption is the upper part of the table, describing the columns and sub-columns, if any.

The Box-head is the entire upper part of the table, which includes the column and sub-column numbers and the unit(s) of measurement along with the caption.

The Stub is the left part of the table, providing the description of the rows. The Body is the main part of the table that contains the numerical figures.

• The table should be well-balanced in length and breadth.
• The data must be arranged in the table in such a way that comparisons between different figures can be made without much labour and time. Also, the row totals, column totals and units of measurement must be shown.
• The data should be arranged intelligently in a well-balanced sequence, and the presentation of data in the table should be appealing to the eye as far as practicable.
• Notes that describe the source of the data or clarify particular rows or columns, known as footnotes, should be shown at the bottom of the table.

## Statistical Data Diagrammatic representation

An attractive representation of statistical data (by charts, diagrams and pictures) can be understood by both the educated and the uneducated sections of society. Any trend present in the given data can be identified from such a representation. Compared to tabulation, however, it is less accurate.

Commonly used Diagrams

• Line Diagram : When the data vary over time, a line diagram is often used. In a simple line diagram, the plotted points are joined successively to form a line, and the resulting chart is known as a line diagram.
• Line diagram construction : Create the X-axis and Y-axis scales according to the range of the data (the two extreme values). Plot the points, then join them.

Line graphs can be used to show how something changes over time. They have an x-axis (horizontal) and a y-axis (vertical), representing the two variables. For example, the x-axis shows the time periods and the y-axis shows the values of what is being measured. Line graphs can be used to show peaks (ups) and valleys (downs) for data collected over a short time period.

• Logarithmic Diagram: When the time series exhibits a wide range of fluctuations, a logarithmic or ratio chart is used (log y_t is plotted instead of y_t). A multiple line chart is used for representing two or more related time series expressed in the same unit, and a multiple-axis chart is used if the variables are expressed in different units.

A log graph (also known as a semi-logarithmic graph) uses a linear scale on one axis and a logarithmic scale on the other axis. It is used for plotting data points of two variables where one of the variables has a much larger range of values than the other, revealing relationships that would not be so obvious when plotted linearly.

Log graph construction:

1. Define a logarithm, e.g. for the equation x = b^y, y = log_b(x).
2. Establish linear and logarithmic scales. For example, a logarithmic scale with a base of 10 would be labeled 10, 100, 1,000 and so on.
3. Map functions on a linear graph. Both the x and y scales measure the same units.
4. Use a lin-log graph. This type of log graph has a y axis with a linear scale and an x axis with a logarithmic scale. The scale of the x axis is therefore compressed by a factor of 10^x in relation to the y axis.
5. Use a log-lin graph. This type of log graph has a y axis with a logarithmic scale and an x axis with a linear scale. The scale of the x axis is therefore expanded by a factor of 10^x in relation to the y axis.
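To see why plotting logarithms helps when a series has a very wide range, the values can be transformed before plotting. The following is a minimal sketch with made-up yearly figures; base 10 is assumed, as in the scale example above.

```python
import math

# Hypothetical yearly sales figures spanning several orders of magnitude.
sales = [12, 150, 1_800, 22_000, 310_000]

# Values that would be plotted on a base-10 logarithmic scale.
log_sales = [math.log10(y) for y in sales]

for year, (y, log_y) in enumerate(zip(sales, log_sales), start=1):
    print(f"year {year}: y = {y:>7}   log10(y) = {log_y:.2f}")

raw_range = max(sales) - min(sales)          # spread on a linear scale
log_range = max(log_sales) - min(log_sales)  # spread on a logarithmic scale
print(raw_range, round(log_range, 2))
```

On the linear scale the first few points would be crowded near zero, whereas the logarithms are spread over only a few units, so every point remains visible on the chart.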

• Bar Diagram: A horizontal bar diagram is used for qualitative data or data varying over space. A vertical bar diagram is used for quantitative data or time series data. Component or sub-divided bar diagrams are used for representing data divided into a number of components. Divided bar charts or percentage bar diagrams are used for comparing different components of a variable and also for relating the components to the whole.

Bar graphs have an x-axis (horizontal) and a y-axis (vertical). For each item marked on the x-axis, plot a point at the height given by its value on the y-axis. Then draw a line or a column from the base point on the x-axis up to that point to represent the value pictorially.

• Pie Chart : Pie diagrams are circular diagrams in which whole area represents the aggregate and the circle is divided into various parts to represent the components of the whole.

## Pie Chart construction Method

The values of various components are expressed as percentage of the whole.

• As a circle represents 360°, the whole is taken to be equal to 360°.
• Since 1% of the total value is 3.6° (360/100), the percentage value of each component is converted into degrees by multiplying it by 3.6.
• A circle of an appropriate radius is drawn when only one whole with its components is to be shown.
• If two or more sets of data are to be shown, the radius of the circles would be proportional to the square roots of their whole or magnitude.
• When the circle is drawn, a radius is drawn which acts as the base line. An angle equal to the degree represented by the first component is drawn on the radius. The line so drawn touches the circumference and it will represent the proportion of first component.

Taking this new line as base, another angle, equal to the degree represented by second component is drawn, which shall represent the proportion of second component. This process is repeated for all the sectors representing different components.

• The different sectors representing component parts may be coloured in different shades or spotted or dotted to distinguish them from one another.
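The conversion of component values into central angles, and the square-root rule for the radii of two circles, can be sketched as follows. The expenditure figures, the second total and the helper name `pie_angles` are made up for illustration.

```python
import math

def pie_angles(components):
    """Convert component values into central angles in degrees.

    360 * value / total is the same as (percentage of the whole) * 3.6.
    """
    total = sum(components.values())
    return {name: 360 * value / total for name, value in components.items()}

# Hypothetical expenditure figures (made-up numbers).
expenditure = {"Agriculture": 250, "Industry": 400, "Services": 350}
angles = pie_angles(expenditure)
for name, angle in angles.items():
    print(f"{name:<12}{angle:6.1f} degrees")

# Two pie charts: radii proportional to the square roots of the totals.
total_1, total_2 = 1000, 4000   # second total is hypothetical
radius_1 = 3.0                  # arbitrary radius chosen for the first circle
radius_2 = radius_1 * math.sqrt(total_2 / total_1)
print(radius_2)
```

With these figures the angles add up to 360°, and the second circle, which represents four times the total, gets twice the radius and therefore four times the area.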

Steps of drawing pie chart

1. Find the central angle for each component as explained.
2. Draw a circle of any radius.
3. Draw a horizontal radius.
4. Starting with the horizontal radius, draw radii, making central angles corresponding to the values of the respective components.
5. Repeat the process for all the components of the given data.
6. These radii divide the whole circle into various sectors.
7. Now, shade the sectors with different colours to denote the various components.

The required pie chart is ready.

## Bar Chart Preparation – Problem

Ex. Draw a simple bar diagram for area used for different crops.
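
As a rough illustration of the idea behind a simple bar diagram, the lengths of the bars can be computed from the values and printed as text. The crop areas, the scale and the helper `bar` are assumptions made for this sketch; they are not the data of the exercise.

```python
# Hypothetical areas (in thousand hectares) under different crops; made-up figures.
crop_area = {"Rice": 40, "Wheat": 30, "Pulses": 16, "Sugarcane": 10}

scale = 2  # one '#' for every 2 thousand hectares (an arbitrary choice)

def bar(value):
    """Return a text bar whose length is proportional to the value."""
    return "#" * round(value / scale)

for crop, area in crop_area.items():
    print(f"{crop:<10}{bar(area)} {area}")
```

Each bar starts from the same base line and its length is proportional to the value it represents, which is all that a simple bar diagram requires.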

## Pie Chart Preparation – Problem

Ex. Draw a suitable pie chart for expenditure incurred in various sectors of economy in the year 2023.

## Pie Chart Preparation – Problem

Ex. Draw a pie chart for investment in various sectors

## Frequency Distribution

Frequency is the number of times an event or value occurs per unit (e.g. per unit of time).

A frequency distribution table is a table containing data grouped into classes, together with the number of cases falling under each class (referred to as frequencies).

A frequency distribution table shows a summarized grouping of data divided into mutually exclusive classes, with the number of occurrences in each class. It is a way of organizing unorganized data.

Discrete Frequency Distribution

A frequency distribution formed by the distinct values of a discrete variable or a continuous variable is termed a discrete frequency distribution.

Discrete frequency distribution table creation steps:

• Make a blank table consisting of three columns with the headings variable, tally marks and frequency.
• Find the smallest observation and the largest observation; write the distinct values in ascending or descending order.
• Read off the observations one by one from the data given and, for each one, record a tally mark (|) against the corresponding value. To help the counting, arrange the tally marks in groups of five, with every fifth mark drawn across the preceding four.
• Count all the tally marks in a row and record their number in the frequency column.
• Record the total frequency in the last row at the bottom.

Class Size

Class size is the difference between the upper limit and the lower limit of a class interval. For classes in inclusive form (e.g. 10-19, 20-29), it is measured between the class boundaries rather than the stated limits.

• Class Limit (CL) : Class limits may be defined as the minimum value and the maximum value that the class interval may contain. The minimum value is called the lower class limit (LCL) and the maximum value is called the upper class limit (UCL).
• Class Interval : It is the size (or width) of each class into which the range of a variable is divided. It is the difference between the upper class limit and the lower class limit.
• Class Boundaries : These are the actual class limits of a class interval. A class boundary is the midpoint of the upper class limit of one class and the lower class limit of the subsequent class.
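
The relation between class limits and class boundaries can be sketched as follows, assuming inclusive classes of made-up weights recorded to the nearest kg. Each boundary is taken halfway between the upper limit of one class and the lower limit of the next.

```python
# Hypothetical inclusive class limits, e.g. weights recorded to the nearest kg.
class_limits = [(40, 49), (50, 59), (60, 69)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = class_limits[1][0] - class_limits[0][1]   # 50 - 49 = 1

# Move each limit outwards by half the gap to get the class boundaries.
boundaries = [(low - gap / 2, high + gap / 2) for low, high in class_limits]
widths = [upper - lower for lower, upper in boundaries]

for limits, bounds, width in zip(class_limits, boundaries, widths):
    print(limits, bounds, width)
```

Here every class has a width of 10 when measured between boundaries, although the stated limits of each class differ by only 9. For classes already in exclusive form the gap is zero and the limits and boundaries coincide.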

Cumulative Frequency

Cumulative frequency is the running total of Frequencies. It is calculated by adding each frequency from a frequency distribution table to the sum of its predecessors.

The number of observations ‘less than or equal to’ a specific value of the variable under study (less than type cumulative frequency), or ‘more than or equal to’ that value (more than type cumulative frequency), in a frequency distribution is called a cumulative frequency.

The sign of equality may be included with either of the above cumulative frequencies.

For a grouped frequency distribution, the number of observations less than the upper class boundary of a given class (i.e. the sum of the frequencies up to and including that class, where the classes are in ascending order) is termed the less than cumulative frequency.

If we add the frequencies from the bottom of the frequency distribution, giving the number of observations ‘more than or equal to’ the lower boundaries of the class-intervals (the classes being in ascending order of magnitude), they are termed ‘more than’ cumulative frequencies.
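
Both types of cumulative frequency are running totals, so they can be computed in a few lines. The class boundaries and frequencies below are made up for illustration.

```python
from itertools import accumulate

# Hypothetical grouped distribution: class boundaries and class frequencies.
boundaries = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [4, 9, 5, 2]

# Less than type: running total from the top of the table.
less_than = list(accumulate(frequencies))

# More than type: running total from the bottom of the table.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for (lower, upper), lt, mt in zip(boundaries, less_than, more_than):
    print(f"less than {upper}: {lt:>2}    {lower} or more: {mt:>2}")
```

The last less than cumulative frequency and the first more than cumulative frequency both equal the total frequency, which is a quick check on the arithmetic.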

## Discrete Frequency Distribution – Problem

Ex. Prepare a frequency distribution table for the typing mistakes found on review of the first 34 pages of a book:
0,1,3,3,2,5,6,0,1,0,6,5,4,1,1,0,2,3,2,5,0,4,5,2,3,2,2,3,3,4,6,1,4,5
Frequency distribution table for typing mistakes.
Since x, the number of typing mistakes, is a discrete variable that assumes the seven values 0, 1, 2, 3, 4, 5, 6, we have 7 classes.
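
The table for this exercise can be built mechanically from the 34 observations. The sketch below uses `collections.Counter` for the counting; the `tally` helper and its text notation (a completed group of five is written as `||||/`) are illustrative choices only.

```python
from collections import Counter

mistakes = [0, 1, 3, 3, 2, 5, 6, 0, 1, 0, 6, 5, 4, 1, 1, 0, 2,
            3, 2, 5, 0, 4, 5, 2, 3, 2, 2, 3, 3, 4, 6, 1, 4, 5]

frequency = Counter(mistakes)  # value of x -> number of pages with x mistakes

def tally(count):
    """Tally marks in groups of five, e.g. 7 -> '||||/ ||'."""
    groups, rest = divmod(count, 5)
    return " ".join(["||||/"] * groups + (["|" * rest] if rest else []))

print(f"{'x':>3}  {'tally':<12}{'f':>3}")
for x in sorted(frequency):
    print(f"{x:>3}  {tally(frequency[x]):<12}{frequency[x]:>3}")
print(f"{'total':<17}{sum(frequency.values()):>3}")
```

The frequencies for x = 0 to 6 come out as 5, 5, 6, 6, 4, 5 and 3, which add up to the 34 pages reviewed.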

## Frequency Polygon

Frequency polygons are a graphical presentation of the shapes of distributions. Like histograms, frequency polygons are used for comparing sets of frequency distributions.

Preparation of Frequency polygon

• The class frequencies are plotted against the mid-values of the classes, and the plotted points are joined by straight lines.
• Alternatively, a histogram (the bar chart of a frequency distribution) is made first. Then the mid-points of the upper horizontal sides of the adjacent rectangles are joined by straight lines. The points at either end may be joined to the base at the mid-points of the two class-intervals of zero frequency lying outside the histogram (although this is not necessary). Such a (closed) figure is called a frequency polygon.

Properties of Frequency polygon

The area of a closed frequency polygon is exactly equal to the area of the histogram when the class-intervals are of equal width, because the triangular strips of the histogram that are left outside the polygon are equal in area to the triangular strips that lie outside the histogram but are included in the polygon.

Frequency polygons are preferred for graphic comparison of frequency distributions, as two or more polygons can be shown on the same graph.
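
The equality of the two areas can be checked numerically. The sketch below assumes equal class widths and a polygon closed on the base line at the mid-values of zero-frequency classes on either side; the mid-values and frequencies are made up.

```python
# Hypothetical grouped distribution with equal class widths.
width = 10
mid_values = [45, 55, 65, 75]
frequencies = [3, 8, 6, 2]

# Histogram: one rectangle of height f and base `width` per class.
histogram_area = sum(f * width for f in frequencies)

# Close the polygon on the base line with zero-frequency classes at both ends.
xs = [mid_values[0] - width] + mid_values + [mid_values[-1] + width]
ys = [0] + frequencies + [0]

# Area under the straight-line segments, one trapezium per segment.
polygon_area = sum((x2 - x1) * (y1 + y2) / 2
                   for x1, x2, y1, y2 in zip(xs, xs[1:], ys, ys[1:]))

print(histogram_area, polygon_area)
```

Both areas come to 190 for these figures, since every frequency contributes half of its rectangle to the segment on each side.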

## Frequency Polygon : Problem

Ex. Draw a frequency polygon from the data on the weights of different students.

## Ogive or Cumulative Frequency Polygon

If the cumulative frequencies are plotted against the class-boundaries and the successive points are joined by straight lines, the resulting graph is known as an Ogive (or cumulative frequency polygon).

Ogive line Types

• Less than type : Cumulative frequencies from below are plotted against the upper class-boundaries. This is known as the less than type, because the ordinate of any point on the curve obtained indicates the frequency of all values less than or equal to the corresponding value of the variable represented by the abscissa of the point.
• Greater than type : Cumulative frequencies from above are plotted against the corresponding lower class-boundaries. This is known as the greater than type.

## Histogram

A histogram is a graphical presentation of a grouped frequency distribution (of a discrete or a continuous variable).

A histogram consists of a set of contiguous rectangles, one over each class-interval, having their areas proportional to the corresponding class frequencies.

The y-axis must start at zero without any scale break. If the class-intervals are in the inclusive form, they are converted into the exclusive form. If only the mid-points and frequencies are known, the exclusive class-intervals are first obtained from them.

Ex. Histogram for the grouped frequency distribution of the daily income of a group of workers.
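
Because the areas of the rectangles, and not their heights, are proportional to the class frequencies, the height of each rectangle is the frequency divided by the class width. A minimal sketch with made-up income classes of unequal width:

```python
# Hypothetical daily-income classes (exclusive form) with unequal widths.
classes = [(100, 200), (200, 300), (300, 500), (500, 1000)]
frequencies = [10, 25, 30, 20]

# Height of each rectangle = frequency per unit of class width.
heights = [f / (upper - lower) for (lower, upper), f in zip(classes, frequencies)]

# Area of each rectangle = height * width, which recovers the frequency.
areas = [h * (upper - lower) for h, (lower, upper) in zip(heights, classes)]

for (lower, upper), f, h in zip(classes, frequencies, heights):
    print(f"{lower}-{upper}: frequency {f}, height {h:.2f}")
```

When all classes have the same width, the heights are simply proportional to the frequencies, so the frequencies themselves can be used as heights.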

## Frequency Curve

A frequency curve is the smooth curve that corresponds to the limiting case of a histogram computed for the frequency distribution of a continuous variable as the number of data points becomes very large.

If, in a grouped frequency distribution, the class-intervals are made smaller and smaller while the number of observations is increased so that the class frequencies do not dwindle, then the histogram and the frequency polygon approach more and more closely a smooth curve in the limit.

Such a limit of the histogram or the frequency polygon is said to be a frequency curve. The frequency curve represents the distribution of continuous measurements more truly.

Frequency Curve Preparation Method

To make a frequency curve for a given frequency distribution, the points plotted at the midpoints of the class intervals are joined smoothly in such a way that the area included is just the same as that of the histogram or polygon.