How does R handle missing values? (2024)

Version info: Code for this page was tested in R Under development (unstable) (2012-02-22 r58461) On: 2012-03-28 With: knitr 0.4

Like other statistical software packages, R is capable of handling missing values. However, to those accustomed to working with missing values in other packages, the way in which R handles missing values may require a shift in thinking. On this page, we will present first the basics of how missing values are represented in R. Next, for those coming from SAS, SPSS, and/or Stata, we will outline some of the differences between missing values in R and missing values elsewhere. Finally, we will introduce some of the tools for working with missing values in R, both in data management and analysis.

Very basics

Missing data in R appears as NA. NA is not a string or a numeric value, but an indicator of missingness. We can create vectors with missing values.

x1 <- c(1, 4, 3, NA, 7)
x2 <- c("a", "B", NA, "NA")

NA is one of the few non-numbers that we could include in x1 without generating an error (the other exceptions are letters representing numbers or numeric ideas like infinity). In x2, the third value is missing while the fourth value is the character string "NA". To see which values in each of these vectors R recognizes as missing, we can use the is.na function. It will return a TRUE/FALSE vector with as many elements as the vector we provide.

is.na(x1)
## [1] FALSE FALSE FALSE  TRUE FALSE
is.na(x2)
## [1] FALSE FALSE  TRUE FALSE

We can see that R distinguishes between the NA and "NA" in x2: NA is seen as a missing value, "NA" is not.
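Because is.na returns a logical vector, it combines naturally with other base R functions to count or locate missing values. The following is a minimal sketch; the vector scores is a hypothetical example and not part of the data used on this page.

```r
# Hypothetical example vector with two missing values
scores <- c(12, NA, 7, NA, 30)

has_missing   <- anyNA(scores)          # is there at least one NA?
n_missing     <- sum(is.na(scores))     # how many values are missing?
where_missing <- which(is.na(scores))   # at which positions?
share_missing <- mean(is.na(scores))    # what proportion is missing?
```

Here TRUE counts as 1 and FALSE as 0, which is why sum and mean give the count and the proportion of missing values.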

Differences from other packages

NA cannot be used in comparisons: In other packages, a "missing" value is assigned an extreme numeric value, either very high or very low. As a result, 1) values coded as missing can be compared to other values and 2) other values can be compared to missing. In the example SAS code below, we compare the values in y to 0 and to the missing symbol and see that both comparisons are valid (and that the missing symbol is valued at less than zero).

data test;
  input x y;
  datalines;
2 .
3 4
5 1
6 0
;
data test;
  set test;
  lowy = (y < 0);
  missy = (y = .);
run;
proc print data = test;
run;

Obs    x    y    lowy    missy
  1    2    .      1       1
  2    3    4      0       0
  3    5    1      0       0
  4    6    0      0       0

We can try the equivalent in R.

x1 < 0
## [1] FALSE FALSE FALSE    NA FALSE
x1 == NA
## [1] NA NA NA NA NA

Our missing value cannot be compared to 0, and none of our values can be compared to NA, because NA is not assigned a value: it simply is or it isn't.
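One practical consequence is that an NA produced by a comparison carries over into subsetting. The sketch below, which uses a hypothetical vector v, shows two common ways to keep NA out of the result; it is one possible approach rather than the only one.

```r
# Hypothetical values, one of them missing
v <- c(-2, 5, NA, 0, -1)

neg <- v < 0                          # TRUE FALSE NA FALSE TRUE

with_na   <- v[neg]                   # the NA in the index yields an NA element
via_which <- v[which(neg)]            # which() keeps only the TRUE positions
via_isna  <- v[!is.na(v) & v < 0]     # explicitly rule out missing values first
```

Both which and the explicit !is.na(v) condition return only the values known to be negative.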

NA is used for all kinds of missing data: In other packages, missing strings and missing numbers might be represented differently, with empty quotations for strings and periods for numbers. In R, NA represents all types of missing data. We saw a small example of this in x1 and x2. x1 is a "numeric" object and x2 is a "character" object.
Non-NA values cannot be interpreted as missing: Other packages allow you to designate values as "system missing" so that these values will be interpreted in the analysis as missing. In R, you would need to explicitly change these values to NA. The is.na function can also be used to make such a change:

is.na(x1) <- which(x1 == 7)
x1
## [1]  1  4  3 NA NA
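The same recoding is often written with ordinary indexed assignment, which is convenient when several codes stand for "missing". This is a sketch under assumptions: the vector age and the codes -99 and 999 are hypothetical.

```r
# Hypothetical data in which -99 and 999 were used as missing-value codes
age <- c(34, -99, 51, 999, 28)

# %in% always returns TRUE or FALSE (never NA), so it is safe as an index
age[age %in% c(-99, 999)] <- NA
age
```

After the assignment, the second and fourth elements of age are NA and the other values are unchanged.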

NA options in R

We have introduced is.na as a tool for both finding and creating missing values. It is one of several functions built around NA. Most of the other functions for NA are options for na.action.

Just as there are default settings for functions, there are similar underlying defaults for R itself. You can view the current settings with options(). One of these is "na.action", which describes how missing values should be treated. The possible na.action settings within R include:

na.omit and na.exclude: return the object with observations removed if they contain any missing values; differences between omitting and excluding NAs can be seen in some prediction and residual functions
na.pass: returns the object unchanged
na.fail: returns the object only if it contains no missing values

To see the na.action currently set in options, use getOption("na.action"). We can create a data frame with missing values and see how it is treated with each of the above.
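A minimal sketch of that comparison follows. The data frame df and its column names are hypothetical, and the value returned by getOption depends on the session, although "na.omit" is the usual default.

```r
# Hypothetical data frame with one NA in each of two columns
df <- data.frame(id    = 1:4,
                 score = c(10, NA, 7, 3),
                 grp   = c("a", "b", NA, "b"))

getOption("na.action")          # the session default, usually "na.omit"

omitted  <- na.omit(df)         # drops rows 2 and 3
excluded <- na.exclude(df)      # drops the same rows, but tags them "exclude"
passed   <- na.pass(df)         # returned unchanged, NAs included

class(na.action(omitted))       # "omit"
class(na.action(excluded))      # "exclude"

# na.fail() stops with an error when any NA is present,
# so the error is caught here to keep the script running
failed <- tryCatch(na.fail(df), error = function(e) "error raised")
```

The data returned by na.omit and na.exclude are the same; only the recorded na.action attribute differs, and modeling functions use it later when computing residuals and predictions.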

Missing values in analysis

In some R functions, one of the arguments the user can provide is the na.action. For example, if you look at the help for the lm command, you can see that na.action is one of the listed arguments. By default, it will use the na.action specified in the R options. If you wish to use a different na.action for the regression, you can indicate the action in the lm command.

Two common options with lm are the default, na.omit, and na.exclude, which also does not use the missing values but maintains their position in the residuals and fitted values.

## use the famous anscombe data and set a few to NA
anscombe <- within(anscombe, {
    y1[1:3] <- NA
})
anscombe # view
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8    NA 9.14  7.46  6.58
## 2   8  8  8  8    NA 8.14  6.77  5.76
## 3  13 13 13  8    NA 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
model.omit <- lm(y2 ~ y1, data = anscombe, na.action = na.omit)
model.exclude <- lm(y2 ~ y1, data = anscombe, na.action = na.exclude)
## compare effects on residuals
resid(model.omit)
##      4      5      6      7      8      9     10     11 
##  0.727  1.575 -0.799 -0.743 -1.553 -0.425  2.190 -0.971 
resid(model.exclude)
##      1      2      3      4      5      6      7      8      9     10 
##     NA     NA     NA  0.727  1.575 -0.799 -0.743 -1.553 -0.425  2.190 
##     11 
## -0.971 
## compare effects on fitted (predicted) values
fitted(model.omit)
##    4    5    6    7    8    9   10   11 
## 8.04 7.69 8.90 6.87 4.65 9.55 5.07 5.71 
fitted(model.exclude)
##    1    2    3    4    5    6    7    8    9   10   11 
##   NA   NA   NA 8.04 7.69 8.90 6.87 4.65 9.55 5.07 5.71 

Using na.exclude pads the residuals and fitted values with NAs where there were missing values. Other functions do not use the na.action, but instead have a different argument (with some default) for how they will handle missing values. For example, the mean command will, by default, return NA if there are any NAs in the passed object.

mean(x1)
## [1] NA

If you wish to calculate the mean of the non-missing values in the passed object, you can indicate this in the na.rm argument (which is, by default, set to FALSE).

mean(x1, na.rm = TRUE)
## [1] 2.67
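The na.rm argument is shared by several other base summary functions, while some functions use a differently named argument for the same purpose. The sketch below illustrates both cases with hypothetical vectors h and wt; cor and its use argument are one example of the second kind.

```r
# Hypothetical paired measurements; h has one missing value
h  <- c(2, 4, NA, 6)
wt <- c(1, 2, 9, 3)

total  <- sum(h, na.rm = TRUE)   # sum of the non-missing values
spread <- sd(h, na.rm = TRUE)    # standard deviation of the non-missing values

# cor() has no na.rm argument; missing values are controlled through `use`
r_default  <- cor(h, wt)                         # NA with the default setting
r_complete <- cor(h, wt, use = "complete.obs")   # uses complete pairs only
```

Checking the help file for each function, as recommended on this page, is the reliable way to find out which argument applies.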

Two common commands used in data management and exploration are summary and table. The summary command (when used with numeric vectors) returns the number of NAs in a vector, but the table command ignores NAs by default.

summary(x1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.00    3.00    2.67    3.50    4.00       2 
table(x1)
## x1
## 1 3 4 
## 1 1 1 

To see NA among the table output, you can indicate "ifany" or "always" in the useNA argument. The first will show NA in the output only if there is some missing data in the object. The second will include NA in the output regardless.

table(x1, useNA = "ifany")
## x1
##    1    3    4 <NA> 
##    1    1    1    2 
table(1:3, useNA = "always")
## 
##    1    2    3 <NA> 
##    1    1    1    0 

Sorting data containing missing values in R is again different from other packages because NA cannot be compared to other values. By default, sort removes any NA values and can therefore change the length of a vector.

(x1s <- sort(x1))
## [1] 1 3 4
length(x1s)
## [1] 3

The user can specify whether NA should be last or first in the sorted order by indicating TRUE or FALSE for the na.last argument.

sort(x1, na.last = TRUE)
## [1]  1  3  4 NA NA
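For completeness, the sketch below shows the FALSE setting and the related order function, which is commonly used to sort data frames. The vector w is hypothetical. Unlike sort, order keeps missing values by default and places them last.

```r
# Hypothetical vector with one missing value
w <- c(3, NA, 1, 2)

na_first <- sort(w, na.last = FALSE)   # NA placed at the front
idx      <- order(w)                   # positions in sorted order, NA last
by_order <- w[idx]                     # same length as w
```

Indexing with order therefore preserves the length of the vector, which matters when rows of a data frame must stay aligned.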

No matter the goal of your R code, it is wise to both investigate missing values in your data and use the help files for all functions you use. You should either be aware of and comfortable with the default treatments of missing values or specify the treatment of missing values you want for your analysis.

FAQs

How does R handle missing values?

Missing data in R appears as NA. NA is not a string or a numeric value, but an indicator of missingness.

How does R handle blank values?

In R, missing values are represented by the symbol NA (not available). Undefined numeric results (e.g., 0/0) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.
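A short sketch of how the two symbols relate in base R; the variable names are chosen only for illustration.

```r
undefined <- 0 / 0       # NaN: the result is not a number
infinite  <- 1 / 0       # Inf: division of a nonzero number by zero

nan_is_na  <- is.na(NaN)   # TRUE: is.na() also flags NaN
na_is_nan  <- is.nan(NA)   # FALSE: NA is missing, not "not a number"
inf_is_na  <- is.na(Inf)   # FALSE: Inf is an ordinary, non-missing value
```

Because is.na treats NaN as missing, functions with na.rm = TRUE drop NaN values as well.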

How does R treat missing values in regression?

Missing data, coded as NA in R, can be problematic in predictive modeling. By default, most of the regression models in R work with the complete cases of the data; that is, they exclude the cases in which there is at least one NA.

How to handle invalid values in R?

You typically find invalid values by listing out the values, deciding which need fixing, subsetting the variable to isolate the invalid values, and finally replacing the invalid values with valid ones. In base R, you can subset with brackets [] and logical expressions.

Why am I getting NA for mean in R?

If there are missing values, then the mean function returns NA. To drop the missing values from the calculation use na.

What happens if you run a regression with missing values?

As the amount of missing data increases, there can be a substantial reduction in sample size and a resulting loss of power. Equally important, there is a potential for bias in the regression estimates and their standard errors (and therefore the significance tests), depending on which values are missing.

How to replace NA with 0 in R?

Method 1: Using the is.na() Function

The is.na(data) function returns a logical vector containing TRUE for NAs and FALSE for non-missing values inside the vector. This logical vector is then used to index the data vector, and all the NAs are replaced with 0s.
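A minimal sketch of the indexing step described above, assuming a hypothetical numeric vector named data:

```r
# Hypothetical numeric vector with two missing values
data <- c(5, NA, 2, NA)

# is.na(data) is TRUE at the missing positions; assign 0 to exactly those
data[is.na(data)] <- 0
data
```

Whether 0 is a sensible replacement depends on what the variable measures, since it changes means and other summaries.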

How to do imputation in R?

Imputing missing values in R

In R, replace the column's missing value with zero.
Replace the column's missing value with the mean.
Replace the column's missing value with the median.
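The mean and median replacements listed above can be sketched in base R as follows. The data frame survey and its income column are hypothetical, and this simple single-value imputation is only an illustration, not a recommendation for any particular analysis.

```r
# Hypothetical data frame with missing values in one column
survey <- data.frame(income = c(40, NA, 50, 90, NA))

col_mean   <- mean(survey$income, na.rm = TRUE)     # mean of observed values
col_median <- median(survey$income, na.rm = TRUE)   # median of observed values

# Keep observed values and fill only the missing positions
survey$income_mean   <- ifelse(is.na(survey$income), col_mean, survey$income)
survey$income_median <- ifelse(is.na(survey$income), col_median, survey$income)
```

Storing the imputed values in new columns leaves the original column intact, so the missing positions can still be identified later.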

How do I remove rows with missing data in R?

Remove rows with missing values using complete.

complete.cases() returns a logical vector marking which rows of a data frame or matrix (or which elements of a vector) contain no missing values. Indexing with that vector filters out the rows with missing data.
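A short sketch of this filtering, using a hypothetical data frame named records:

```r
# Hypothetical data frame; rows 2 and 3 each contain one NA
records <- data.frame(a = c(1, NA, 3),
                      b = c("x", "y", NA))

ok       <- complete.cases(records)   # TRUE only for rows with no NA
complete <- records[ok, ]             # keep the complete rows
```

For a data frame, na.omit(records) keeps the same rows and additionally records which rows were dropped.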

How do I remove rows with NA in one column in R?

Using drop_na(): drop_na() drops rows that have NA in the specified columns. This approach requires the tidyr package, which must be installed first.
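If tidyr is not available, a similar result can be obtained in base R by indexing on is.na for the column of interest. Note that this sketch is a base R alternative and does not call drop_na(); the data frame visits and its columns are hypothetical.

```r
# Hypothetical data frame; only column a is checked for NA
visits <- data.frame(a = c(1, NA, 3),
                     b = c("x", "y", NA))

# Keep rows where a is observed, regardless of NAs in other columns
kept <- visits[!is.na(visits$a), ]
```

Row 3 is retained even though b is missing there, because only column a is used as the filter.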

How to handle missing data?

When dealing with missing data, data scientists can use two primary methods to solve the error: imputation or data removal. The imputation method substitutes reasonable guesses for missing data. It's most useful when the percentage of missing data is low.

What is the difference between NULL and NA in R?

Both are used to represent missing or undefined values. NULL represents the null object; it is a reserved word. NULL is often returned by expressions and functions whose values are undefined. NA is a logical constant of length 1 that contains a missing value indicator.
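The difference shows up clearly in lengths and in how the two behave inside a vector. The following sketch uses only base R; the variable names are illustrative.

```r
len_null <- length(NULL)       # 0: NULL is an empty object
len_na   <- length(NA)         # 1: NA is a value that occupies a position

with_null <- c(1, NULL, 3)     # NULL disappears when combined
with_na   <- c(1, NA, 3)       # NA holds its place

null_check <- is.null(NULL)    # TRUE
na_on_null <- is.na(NULL)      # logical(0): there is nothing to test
na_class   <- class(NA)        # "logical"
```

This is why NA is used for a missing observation within data, while NULL typically signals that an object or element does not exist at all.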

How to handle invalid values and outliers in R?

Use summary statistics and visualizations to identify outliers and invalid values in the dataset. For example, you can use box plots and histograms to identify extreme values that may be invalid.
Look for values that do not make sense in the context of the data.

How to identify blanks in R?

Luckily, R gives us a special function to detect NAs: is.na(). If you try my_vector == NA, the result is a vector of NAs rather than TRUE/FALSE values, so use is.na() instead.

How to handle missing values in a dataset?

Missing values can be handled by deleting the rows or columns having null values. If columns have more than half of the rows as null then the entire column can be dropped. The rows which are having one or more columns values as null can also be dropped.

How to replace null values with NA in R?

Replacing values with NA

tidyr::replace_na(): Missing values turn into a value (NA -> -99)
naniar::replace_with_na(): A value becomes a missing value (-99 -> NA)

What is the null value in R?

R actually has two such values: NA and NULL. In statistical data sets, we often encounter missing data, which we represent in R with the value NA. NULL, on the other hand, represents that the value in question simply doesn't exist, rather than being existent but unknown.
