How to deal with missing values from a survey?
Posted: Mon Jan 06, 2025 8:42 am
What do you know about missing values in a survey? When collecting data with a survey, it is very common for some of the questionnaires sent to the target population to come back partially or completely unanswered, which can complicate the final analysis and bias the results of our research. The most common causes include:
– Ambiguous questions in the questionnaire
– Lack of interest in answering the survey
– Compromising questions
Although these causes mainly affect the final result and the representativeness of the selected sample, there are a number of data cleaning and imputation strategies that can reduce bias and improve the final result of our market research.
Techniques for dealing with missing values
Data Cleaning Techniques
The data cleaning process consists of evaluating the quality of the information collected and improving it in order to avoid unreliable analysis. The most commonly used cleaning strategies are:
– List of values: This involves searching the data matrix for values that fall outside the valid response range. These values can be treated as missing, or the correct value can be estimated from other variables (imputation).
Example: In the variable Sex, whose values are 1 = Male and 2 = Female, we find a 3 in the data matrix.
– Filter questions: This involves comparing the number of responses in a filter category with the number of responses to the question that category leads to. If an anomaly is observed that cannot be resolved, the affected answers are treated as missing values.
Example: Filter question A has 11 answers leading to filtered question B and 9 answers leading to filtered question C. However, 14 answers have been given for question B (3 more than expected), so the filter category and the filtered category do not match.
– Logical consistency: Answers that may contradict each other are checked.
Example: Respondents who answered “Single” about their marital status should not have answered the question “Spouse’s activity.”
– Level of representativeness: The number of responses obtained for each variable is counted. If the number of unanswered questions is very high, it can be assumed that non-respondents would have answered in the same way as respondents or, alternatively, the non-responses can be imputed.
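The four checks above can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout, the field names (sex, marital_status, spouse_activity), the helper function names and the sample data are all assumptions made for the example, not part of any particular survey tool. None stands for an unanswered question.

```python
# Hypothetical survey records; None marks an unanswered question.
records = [
    {"sex": 1, "marital_status": "Married", "spouse_activity": "Employed"},
    {"sex": 3, "marital_status": "Single", "spouse_activity": "Student"},
    {"sex": 2, "marital_status": "Single", "spouse_activity": None},
    {"sex": None, "marital_status": "Married", "spouse_activity": None},
]

VALID_SEX = {1, 2}  # 1 = Male, 2 = Female


def clean_out_of_range(rows, field, valid):
    """List of values: treat codes outside the response range as missing."""
    flagged = 0
    for row in rows:
        if row[field] is not None and row[field] not in valid:
            row[field] = None
            flagged += 1
    return flagged


def filter_mismatch(expected_from_filter, observed_in_filtered):
    """Filter questions: surplus (+) or shortfall (-) of answers."""
    return observed_in_filtered - expected_from_filter


def inconsistent_singles(rows):
    """Logical consistency: singles should not report a spouse's activity."""
    return [
        i for i, row in enumerate(rows)
        if row["marital_status"] == "Single"
        and row["spouse_activity"] is not None
    ]


def response_counts(rows):
    """Level of representativeness: (answered, unanswered) per variable."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            answered, missing = counts.get(field, (0, 0))
            if value is None:
                counts[field] = (answered, missing + 1)
            else:
                counts[field] = (answered + 1, missing)
    return counts
```

With the sample records, clean_out_of_range(records, "sex", VALID_SEX) flags the record coded 3 and turns it into a missing value, and inconsistent_singles(records) points at the single respondent who reported a spouse's activity. Whether such answers are then discarded or imputed remains a decision for the analyst.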
I invite you to read: Keys to optimizing the quality of information in your online surveys.
Imputation Techniques
Imputation consists of replacing missing values with estimated valid values or responses. There are three types of imputation:
– Random imputation: This type of imputation assumes that the information is missing by chance, because of the randomness of the sample. To perform the imputation, the probability of each value that appears in the variable is calculated and accumulated. A random number between 0 and 1 is then drawn for each missing value, and the missing value is assigned the first value whose cumulative probability is equal to or greater than that number.
Example: The probability of value A appearing is 0.012 (1.2%), while the probability of value B is 0.357 (35.7%). Therefore, missing values whose random number is equal to or less than 0.012 will be assigned the value A, while those whose random number is greater than 0.012 and at most 0.369 (the sum of probability A, 0.012, and probability B, 0.357) will be assigned the value B.
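As a rough illustration of random imputation, the sketch below builds the cumulative probabilities and draws one random number per missing value. It rests on assumptions of my own: the function name random_impute is hypothetical, None marks a missing value, and the probabilities are estimated from the valid answers only.

```python
import random
from collections import Counter


def random_impute(values, rng=None):
    """Replace None entries by sampling from the observed distribution."""
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to sample from

    # Assumption: probabilities come from the valid answers only.
    freq = Counter(observed)
    total = len(observed)

    # Cumulative thresholds, e.g. A -> 0.012, B -> 0.369, ...
    thresholds = []
    cumulative = 0.0
    for value, count in freq.items():
        cumulative += count / total
        thresholds.append((cumulative, value))

    def draw():
        u = rng.random()  # uniform number in [0, 1)
        for limit, value in thresholds:
            if u <= limit:
                return value
        return thresholds[-1][1]  # guard against floating-point rounding

    return [draw() if v is None else v for v in values]
```

Passing a seeded generator, for example random_impute(answers, random.Random(0)), makes the imputation reproducible, which helps when the analysis has to be rerun.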