Reposted from: Subset Data Frame Rows in R - Datanovia
# Subset Data Frame Rows in R
This tutorial describes how to subset or extract data frame rows based on certain criteria.
In this tutorial, you will learn the following R functions from the dplyr package:
- slice(): Extract rows by position
- filter(): Extract rows that meet certain logical criteria. For example, iris %>% filter(Sepal.Length > 6).
- filter_all(), filter_if() and filter_at(): Filter rows within a selection of variables. These functions replicate the logical criteria over all variables or a selection of variables.
- sample_n(): Randomly select n rows
- sample_frac(): Randomly select a fraction of rows
- top_n(): Select top n rows ordered by a variable
We will also show you how to remove rows with missing values in a given column.
# Required packages
Load the tidyverse packages, which include dplyr:
```r
library(tidyverse)
```
# Demo dataset
We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.
```r
my_data <- as_tibble(iris)
```
# Extract rows by position
- Key R function: slice() [dplyr package]
```r
my_data %>% slice(1:6)
```
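Positions passed to slice() do not have to be a contiguous range. The sketch below is illustrative only (the object names picked, without_head and last_row are not from the original) and shows non-contiguous positions, negative positions, and n() for the last row.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Rows 1, 5 and 10: positions need not be contiguous
picked <- my_data %>% slice(c(1, 5, 10))

# Negative positions drop rows: keep everything except the first 6
without_head <- my_data %>% slice(-(1:6))

# n() is the number of rows, so this keeps only the last row
last_row <- my_data %>% slice(n())
```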
# Filter rows by logical criteria
- Key R function: filter() [dplyr package]. Used to filter rows that meet some logical criteria.
Before continuing, we introduce logical comparisons and operators, which are important to know for filtering data.
# Logical comparisons
The “logical” comparison operators available in R are:
- Logical comparisons
  - <: for less than
  - >: for greater than
  - <=: for less than or equal to
  - >=: for greater than or equal to
  - ==: for equal to each other
  - !=: not equal to each other
  - %in%: group membership. For example, value %in% c(2, 3) means that value can be 2 or 3.
  - is.na(): is NA
  - !is.na(): is not NA
- Logical operators
  - |: means or. For example, value == 2 | value == 3 means that the value equals 2 or 3. value %in% c(2, 3) is an equivalent shortcut. Note that value == 2 | 3 does not work as intended: each side of | must be a complete comparison.
  - &: means and. For example, sex == "female" & age > 25
A common mistake made by beginners in R is to use = instead of == when testing for equality. When you are testing for equality, always use == (not =).
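The operators above can be tried on a plain vector before using them inside filter(). This is a minimal base R sketch; the vectors value and x are made up for illustration.

```r
value <- c(1, 2, 3, 4)

# Explicit OR: each side of | is a complete comparison
by_or <- value == 2 | value == 3

# Membership shortcut
by_in <- value %in% c(2, 3)

# Both give FALSE TRUE TRUE FALSE
identical(by_or, by_in)

# Pitfall: this parses as (value == 2) | 3, and a non-zero number
# counts as TRUE, so every element comes out TRUE
pitfall <- value == 2 | 3

# Comparing with NA yields NA, so use is.na() to detect missing values
x <- c(5, NA)
x == NA
is.na(x)
```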
# Extract rows based on logical criteria
- One-column based criteria: Extract rows where Sepal.Length > 7:
```r
my_data %>% filter(Sepal.Length > 7)
```
- Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:
```r
my_data %>% filter(Sepal.Length > 6.7, Sepal.Width <= 3)
```
- Test for equality (==): Extract rows where Sepal.Length > 6.7 and Species is “versicolor”:
```r
my_data %>% filter(Sepal.Length > 6.7, Species == "versicolor")
```
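- Using OR operator (|): Extract rows where Sepal.Length > 6.5 and Species is either “versicolor” or “virginica”:
Use this:
```r
my_data %>% filter(Sepal.Length > 6.5, Species == "versicolor" | Species == "virginica")
```
Or, equivalently, use this shortcut (%in% operator):
```r
my_data %>% filter(Sepal.Length > 6.5, Species %in% c("versicolor", "virginica"))
```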
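As a quick check of the examples above, the sketch below compares equivalent ways of writing the same filter. Conditions separated by commas in filter() are combined with AND, and the | and %in% forms should return the same rows. The object names (by_comma, by_and, by_or, by_in) are illustrative.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Comma-separated conditions are combined with AND, so these two agree
by_comma <- my_data %>% filter(Sepal.Length > 6.7, Sepal.Width <= 3)
by_and   <- my_data %>% filter(Sepal.Length > 6.7 & Sepal.Width <= 3)
all.equal(by_comma, by_and)

# The OR form and the %in% shortcut select the same rows
by_or <- my_data %>%
  filter(Sepal.Length > 6.5, Species == "versicolor" | Species == "virginica")
by_in <- my_data %>%
  filter(Sepal.Length > 6.5, Species %in% c("versicolor", "virginica"))
all.equal(by_or, by_in)
```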
# Filter rows within a selection of variables
This section presents 3 functions - filter_all(), filter_if() and filter_at() - to filter rows within a selection of variables.
These functions replicate the logical criteria over all variables or a selection of variables.
Create a new demo data set from my_data by removing the grouping column “Species”:
```r
my_data2 <- my_data %>% select(-Species)
```
- Select rows where all variables are greater than 2.4:
```r
my_data2 %>% filter_all(all_vars(. > 2.4))
```
- Select rows where any of the variables is greater than 2.4:
```r
my_data2 %>% filter_all(any_vars(. > 2.4))
```
- Vary the selection of columns on which to apply the filtering criteria. filter_at() takes a vars() specification. The following R code applies the filtering criteria to the columns Sepal.Length and Sepal.Width:
```r
my_data2 %>% filter_at(vars(starts_with("Sepal")), any_vars(. > 2.4))
```
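filter_if() follows the same pattern but chooses columns with a predicate function such as is.numeric. The sketch below is a hedged illustration: the if_all() and if_any() lines assume dplyr 1.0.4 or later, where these helpers are offered as replacements for the scoped filter_*() verbs. The object names are illustrative.

```r
library(dplyr)

my_data <- as_tibble(iris)

# filter_if(): apply the criteria only to columns satisfying a predicate
# (here the numeric columns, so Species can stay in the data)
scoped <- my_data %>% filter_if(is.numeric, all_vars(. > 2.4))

# Same filter written with if_all() (assumes dplyr >= 1.0.4)
modern <- my_data %>% filter(if_all(where(is.numeric), ~ .x > 2.4))

# if_any() plays the role of any_vars()
any_sepal <- my_data %>% filter(if_any(starts_with("Sepal"), ~ .x > 2.4))
```

Because filter_if() selects columns by type, there is no need to drop Species first as was done for my_data2.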
# Remove missing values
We start by creating a data frame with missing values. In R, NA (Not Available) is used to represent missing values:
```r
# Create a data frame with missing data
```
Extract rows where height is NA:
```r
friends_data %>% filter(is.na(height))
```
Exclude (drop) rows where height is NA:
```r
friends_data %>% filter(!is.na(height))
```
In the R code above, !is.na() means that “we don’t want” NAs.
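A minimal, self-contained sketch of the steps above. The contents of friends_data here are hypothetical: the names and heights are illustrative values, not the original data set.

```r
library(dplyr)

# Hypothetical data: names and heights are illustrative only
friends_data <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(180, NA, NA, 169)
)

# Rows where height is missing (B and C)
missing_height <- friends_data %>% filter(is.na(height))

# Rows where height is known (A and D)
known_height <- friends_data %>% filter(!is.na(height))

# filter() keeps only rows where the condition is TRUE, so rows where
# the comparison evaluates to NA are dropped as well (only A remains)
tall <- friends_data %>% filter(height > 170)
```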
# Select random rows from a data frame
It’s possible to select either n random rows with the function sample_n() or a random fraction of rows with sample_frac(). We first use the function set.seed() to initialize the random number generator. This is important so that users can reproduce the analysis.
```r
set.seed(1234)
```
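A short sketch of both sampling functions on the demo data. The sample sizes (5 rows, 10%) are arbitrary choices for illustration. Note that sample_frac() expects a fraction between 0 and 1 unless sampling with replacement.

```r
library(dplyr)

my_data <- as_tibble(iris)

set.seed(1234)
five_rows <- my_data %>% sample_n(5)       # 5 random rows
ten_pct   <- my_data %>% sample_frac(0.1)  # 10% of 150 rows, i.e. 15 rows

# Resetting the seed reproduces the same draw
set.seed(1234)
again <- my_data %>% sample_n(5)
all.equal(five_rows, again)
```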
# Select top n rows ordered by a variable
Select the top 5 rows ordered by Sepal.Length:
```r
my_data %>% top_n(5, Sepal.Length)
```
Group by the column Species and select the top 5 of each group ordered by Sepal.Length:
```r
my_data %>% group_by(Species) %>% top_n(5, Sepal.Length)
```
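One detail worth knowing is that top_n() keeps ties, so it can return more than n rows. The sketch below uses a small made-up tibble (scores) to show this. The slice_max() line assumes dplyr 1.0.0 or later and is shown only as one possible way to get exactly n rows.

```r
library(dplyr)

# Made-up data with a tie for the largest value of x
scores <- tibble(id = c("a", "b", "c", "d"), x = c(1, 2, 3, 3))

# top_n() keeps ties: both rows with x == 3 are returned
with_ties <- scores %>% top_n(1, x)

# slice_max() can break ties (assumes dplyr >= 1.0.0)
no_ties <- scores %>% slice_max(x, n = 1, with_ties = FALSE)
```

top_n() also does not sort its result. Add arrange(desc(x)) if the output order matters.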
# Summary
In this tutorial, we introduced how to filter data frame rows using the dplyr package:
- Filter rows by logical criteria: my_data %>% filter(Sepal.Length > 7)
- Select n random rows: my_data %>% sample_n(10)
- Select a random fraction of rows: my_data %>% sample_frac(0.1)
- Select top n rows by values: my_data %>% top_n(10, Sepal.Length)