It is unrealistic to expect that data will be perfect. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process. Values or even entire data objects may be missing. In other cases, there may be spurious or duplicate objects. We shall now discuss some of these errors briefly.
Measurement and data collection errors
Measurement errors refer to any problems resulting from the measurement process. A common problem is that the recorded values differ from the true values to some extent. For continuous attributes, the numerical difference between the measured value and the true value is called the error. The term data collection error refers to errors such as omitting data objects or attribute values.
Noise and artifacts
Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects. Data errors may also be the result of a more deterministic phenomenon, such as a streak in the same place on a set of photographs; such deterministic distortions are often referred to as artifacts.
Precision, Bias and Accuracy
Precision:- The closeness of repeated measurements (of the same quantity) to one another.
Bias:- A systematic variation of measurements from the quantity being measured.
Accuracy:- The closeness of measurements to the true value of the quantity being measured.
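The definitions above can be made concrete with a small numerical sketch. The measurements below are invented for illustration: five hypothetical repeated weighings of a standard 1.000 g mass. Bias is estimated as the deviation of the mean from the true value, and precision is summarized by the sample standard deviation of the repeated measurements.

```python
import statistics

# Hypothetical example: five repeated measurements of a standard
# 1.000 g mass (values are made up for illustration).
true_value = 1.000
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

mean = statistics.mean(measurements)

# Bias: systematic deviation of the measurements from the true value.
bias = mean - true_value

# Precision: closeness of the repeated measurements to one another,
# summarized here by the sample standard deviation.
precision = statistics.stdev(measurements)

print(f"bias = {bias:+.4f} g, precision (std dev) = {precision:.4f} g")
```

Here the mean is 1.001 g, so the estimated bias is +0.001 g; the standard deviation (about 0.013 g) quantifies how tightly the repeated measurements cluster.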
Outliers are either data objects that, in some sense, have characteristics that are different from most of the other data objects in the data set, or values of an attribute that are unusual with respect to the typical values for that attribute.
The difference between outliers and noise is that outliers can be legitimate data objects or values. Thus, unlike noise, outliers may sometimes be of interest. In fraud and network intrusion detection, for example, the goal is to find unusual objects or events from among a large number of normal ones.
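One simple way (by no means the only one) to flag values that are "unusual with respect to the typical values" is a z-score rule: treat a value as an outlier if it lies more than some number of standard deviations from the mean. The data and the threshold of 2 below are illustrative assumptions, not a recommendation.

```python
import statistics

# Hypothetical attribute values with one unusually large entry.
values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag values more than 2 standard deviations from the mean.
# (The threshold 2 is an arbitrary illustrative choice.)
outliers = [v for v in values if abs(v - mean) / stdev > 2]

print(outliers)
```

Note that this rule assumes roughly bell-shaped data and that the mean and standard deviation themselves are not too distorted by the outliers; more robust outlier detection methods exist.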
It is not unusual for an object to be missing one or more attribute values. In some cases, the information was simply not collected; e.g., some people decline to give their age or weight. In other cases, some attributes are not applicable to all objects. Regardless of the cause, missing values need to be taken into account during data analysis.
There are various ways to handle missing values; some of them are discussed here:-
Eliminate data objects or attributes:- A simple and effective strategy is to eliminate objects with missing values. However, even a partially specified data object contains some information, and if many objects have missing values, then a reliable analysis can be difficult. Nonetheless, if a data set has only a few objects with missing values, it may be expedient to omit them. A related strategy is to eliminate attributes that have missing values, but this should be done with caution, as the eliminated attributes may be critical to the analysis.
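The elimination strategy can be sketched in a few lines. The records below are hypothetical, with `None` standing in for a missing value; any object containing at least one missing value is dropped.

```python
# Hypothetical records with None marking a missing attribute value.
objects = [
    {"age": 25, "weight": 70.0},
    {"age": None, "weight": 82.5},   # missing age -> eliminated
    {"age": 41, "weight": 63.2},
]

# Keep only objects with no missing attribute values.
complete = [obj for obj in objects if None not in obj.values()]

print(len(complete))  # 2 complete objects remain
```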
Estimating missing values:- Sometimes missing data can be reliably estimated. If the attribute is continuous, then the average attribute value of the nearest neighbours can be used; if the attribute is categorical, then the most commonly occurring attribute value can be taken.
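A minimal sketch of this estimation strategy is shown below. For simplicity it averages over all observed values rather than only the nearest neighbours (a simplifying assumption); the data itself is invented for illustration.

```python
import statistics

# Hypothetical data with None marking missing values.
heights = [170.0, None, 165.0, 180.0, None]      # continuous attribute
colours = ["red", "blue", None, "red", "red"]    # categorical attribute

# Continuous: substitute the mean of the observed values
# (a simplification of averaging over the nearest neighbours).
fill = statistics.mean([h for h in heights if h is not None])
heights = [h if h is not None else fill for h in heights]

# Categorical: substitute the most commonly occurring value (the mode).
mode = statistics.mode([c for c in colours if c is not None])
colours = [c if c is not None else mode for c in colours]

print(heights, colours)
```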
Ignore the missing values:- Many data mining approaches can be modified to ignore missing values. For example, suppose that data objects are being clustered and the similarity between pairs of objects needs to be calculated. If one or both objects of a pair have missing values for some attributes, then the similarity can be calculated using only the attributes that do not have missing values.
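The idea of comparing objects only on their shared attributes can be sketched with a Euclidean distance computed over the positions where both objects have values. Rescaling the sum by the fraction of attributes actually used is one common convention (an assumption here, not the only possibility); the two objects are hypothetical.

```python
import math

# Hypothetical objects with None marking missing attribute values.
x = [1.0, None, 3.0, 4.0]
y = [2.0, 5.0, None, 6.0]

# Use only the attribute positions where BOTH objects have values.
pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]

# Euclidean distance over the shared attributes, rescaled by the
# fraction of attributes used so that objects with many missing
# values are not unfairly favoured.
d = math.sqrt(sum((a - b) ** 2 for a, b in pairs) * len(x) / len(pairs))

print(round(d, 3))
```

Here only the first and last attributes are compared, giving a squared sum of 5 over 2 of 4 attributes, so the rescaled distance is the square root of 10.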
Data can contain inconsistent values. Consider an address field where both a zip code and a city are listed, but the specified zip code area is not contained in that city.
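An inconsistency of this kind can be detected by checking each record against a reference table. The lookup table and records below are invented purely for illustration.

```python
# Hypothetical lookup table mapping zip codes to their actual city.
zip_to_city = {"10001": "New York", "60601": "Chicago"}

records = [
    {"zip": "10001", "city": "New York"},   # consistent
    {"zip": "60601", "city": "New York"},   # zip code is not in this city
]

# A record is inconsistent if its stated city does not match the
# city the zip code actually belongs to.
inconsistent = [r for r in records
                if zip_to_city.get(r["zip"]) != r["city"]]

print(len(inconsistent))  # 1 inconsistent record found
```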
A data set may include data objects that are duplicates, or almost duplicates, of one another. The term deduplication is often used to refer to the process of dealing with the issues that arise during the removal of duplicates from a data set.
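A very simple form of deduplication normalizes each record to a key and keeps only the first record seen per key. The contact records below are hypothetical; real deduplication usually needs fuzzier matching than the case-folding used here.

```python
# Hypothetical contact records; the second is an exact duplicate of
# the first, and the third differs only in letter case.
records = [
    ("Alice Smith", "alice@example.com"),
    ("Alice Smith", "alice@example.com"),
    ("alice smith", "ALICE@example.com"),
]

# Normalize each record to a lowercase key; keep the first record
# seen for each key (dicts preserve insertion order in Python 3.7+).
seen = {}
for name, email in records:
    key = (name.lower(), email.lower())
    if key not in seen:
        seen[key] = (name, email)

deduplicated = list(seen.values())
print(len(deduplicated))  # 1 record remains
```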