R: Filter a Data Frame Based on Values in Two Columns
I started with a data frame that looked like this:
> data <- read.csv("specdata/002.csv")
> # we'll just use a few rows to make it easier to see what's going on
> data[2494:2500,]
Date sulfate nitrate ID
2494 2007-10-30 3.25 0.902 2
2495 2007-10-31 NA NA 2
2496 2007-11-01 NA NA 2
2497 2007-11-02 6.56 1.270 2
2498 2007-11-03 NA NA 2
2499 2007-11-04 NA NA 2
2500 2007-11-05 6.10 0.772 2
We want to only return the rows which have a value in both the ‘sulfate’ and the ‘nitrate’ columns.
I initially tried to use the Filter function but wasn’t very successful:
> smallData <- data[2494:2500,] > Filter(function(x) !is.na(x$sulfate), smallData) Error in x$sulfate : $ operator is invalid for atomic vectors
I’m not sure that Filter is designed to filter data frames – it seems more appropriate for lists or vectors – so I ended up filtering the data frame using what I think is called an extract operation:
> smallData[!is.na(smallData$sulfate) & !is.na(smallData$nitrate),] Date sulfate nitrate ID 2494 2007-10-30 3.25 0.902 2 2497 2007-11-02 6.56 1.270 2 2500 2007-11-05 6.10 0.772 2
The code inside the square brackets returns a collection indicating whether or not we should return each row:
> !is.na(smallData$sulfate) & !is.na(smallData$nitrate) [1] TRUE FALSE FALSE TRUE FALSE FALSE TRUE
which is equivalent to doing this:
> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),] Date sulfate nitrate ID 2494 2007-10-30 3.25 0.902 2 2497 2007-11-02 6.56 1.270 2 2500 2007-11-05 6.10 0.772 2
We put a comma after the list of true/false values to indicate that we want to return all the columns otherwise we’d get this error:
> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)] Error in `[.data.frame`(smallData, c(TRUE, FALSE, FALSE, TRUE, FALSE, : undefined columns selected
We could filter the columns as well if we wanted to:
> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), c(1,2)] Date sulfate 2494 2007-10-30 3.25 2497 2007-11-02 6.56 2500 2007-11-05 6.10
As is no doubt obvious, I don’t know much R so if there’s a better way to do anything please let me know.
The full code is on github if you’re interested in seeing it in context.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)





