Bracket subsetting is handy, but it can be cumbersome and difficultto read, especially for complicated operations. Enterdplyr
. dplyr
is a package for helping with tabular data manipulation. It pairs nicelywith tidyr
which enables you to swiftlyconvert between different data formats for plotting and analysis.
The tidyverse
package is an“umbrella-package” that installs tidyr
,dplyr
, and several other useful packagesfor data analysis, such as ggplot2
,tibble
, etc.
The tidyverse
package tries to address3 common issues that arise when doing data analysis in R:
You should already have installed and loaded thetidyverse
package. If you haven’t alreadydone so, you can type install.packages("tidyverse")
straight into the console. Then, type library(tidyverse)
toload the package.
What are dplyr
andtidyr
?
The package dplyr
provides helper toolsfor the most common data manipulation tasks. It is built to workdirectly with data frames, with many common tasks optimized by beingwritten in a compiled language (C++). An additional feature is theability to work directly with data stored in an external database. Thebenefits of doing this are that the data can be managed natively in arelational database, queries can be conducted on that database, and onlythe results of the query are returned.
This addresses a common problem with R in that all operations areconducted in-memory and thus the amount of data you can work with islimited by available memory. The database connections essentially removethat limitation in that you can connect to a database of many hundredsof GB, conduct queries on it directly, and pull back into R only whatyou need for analysis.
The package tidyr
addresses the commonproblem of wanting to reshape your data for plotting and usage bydifferent R functions. For example, sometimes we want data sets where wehave one row per measurement. Other times we want a data frame whereeach measurement type has its own column, and rows are instead moreaggregated groups (e.g., a time period, an experimental unit like a plotor a batch number). Moving back and forth between these formats isnon-trivial, and tidyr
gives you tools forthis and more sophisticated data manipulation.
To learn more about dplyr
andtidyr
after the workshop, you may want tocheck out this handydata transformation with dplyr
cheatsheet and this oneabout tidyr
.
As before, we’ll read in our data using the read_csv()
function from the tidyverse packagereadr
.
R
surveys <- read_csv("data_raw/portal_data_joined.csv")
OUTPUT
#> Rows: 34786 Columns: 13#> ── Column specification ────────────────────────────────────────────────────────#> Delimiter: ","#> chr (6): species_id, sex, genus, species, taxa, plot_type#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight#> #> ℹ Use `spec()` to retrieve the full column specification for this data.#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
R
## inspect the datastr(surveys)
R
## preview the dataview(surveys)
Next, we’re going to learn some of the most commondplyr
functions:
select()
: subset columnsfilter()
: subset rows on conditionsmutate()
: create new columns by using information fromother columnsgroup_by()
andsummarize()
: create summarystatistics on grouped dataarrange()
: sort resultscount()
: count discrete values
Selecting columns and filtering rows
To select columns of a data frame, use select()
. Thefirst argument to this function is the data frame(surveys
), and the subsequent arguments are the columns tokeep.
R
select(surveys, plot_id, species_id, weight)
To select all columns except certain ones, put a “-” infront of the variable to exclude it.
R
select(surveys, -record_id, -species_id)
This will select all the variables in surveys
exceptrecord_id
and species_id
.
To choose rows based on a specific criterion, usefilter()
:
R
filter(surveys, year == 1995)
Pipes
What if you want to select and filter at the same time? There arethree ways to do this: use intermediate steps, nested functions, orpipes.
With intermediate steps, you create a temporary data frame and usethat as input to the next function, like this:
R
surveys2 <- filter(surveys, weight < 5)surveys_sml <- select(surveys2, species_id, sex, weight)
This is readable, but can clutter up your workspace with lots ofobjects that you have to name individually. With multiple steps, thatcan be hard to keep track of.
You can also nest functions (i.e.one function inside of another),like this:
R
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
This is handy, but can be difficult to read if too many functions arenested, as R evaluates the expression from the inside out (in this case,filtering, then selecting).
The last option, pipes, are a recent addition to R. Pipeslet you take the output of one function and send it directly to thenext, which is useful when you need to do many things to the samedataset. Pipes in R look like %>%
and are made availablevia the magrittr
package, installedautomatically with dplyr
. If you useRStudio, you can type the pipe with Ctrl
- Shift + M if you have a PC or Cmd +Shift + M if you have a Mac.
R
surveys %>% filter(weight < 5) %>% select(species_id, sex, weight)
In the above code, we use the pipe to send the surveys
dataset first through filter()
to keep rows whereweight
is less than 5, then through select()
to keep only the species_id
, sex
, andweight
columns. Since %>%
takes the objecton its left and passes it as the first argument to the function on itsright, we don’t need to explicitly include the data frame as an argumentto the filter()
and select()
functions anymore.
Some may find it helpful to read the pipe like the word “then.” Forinstance, in the example above, we took the data framesurveys
, then we filter
ed for rowswith weight < 5
, then we select
edcolumns species_id
, sex
, andweight
. The dplyr
functionsby themselves are somewhat simple, but by combining them into linearworkflows with the pipe we can accomplish more complex manipulations ofdata frames.
If we want to create a new object with this smaller version of thedata, we can assign it a new name:
R
surveys_sml <- surveys %>% filter(weight < 5) %>% select(species_id, sex, weight)surveys_sml
Note that the final data frame is the leftmost part of thisexpression.
Challenge
Using pipes, subset the surveys
data to include animalscollected before 1995 and retain only the columns year
,sex
, and weight
.
R
surveys %>% filter(year < 1995) %>% select(year, sex, weight)
Mutate
Frequently you’ll want to create new columns based on the values inexisting columns, for example to do unit conversions, or to find theratio of values in two columns. For this we’ll usemutate()
.
To create a new column of weight in kg:
R
surveys %>% mutate(weight_kg = weight / 1000)
You can also create a second new column based on the first new columnwithin the same call of mutate()
:
R
surveys %>% mutate(weight_kg = weight / 1000, weight_lb = weight_kg * 2.2)
If this runs off your screen and you just want to see the first fewrows, you can use a pipe to view the head()
of the data.(Pipes work with non-dplyr
functions, too,as long as the dplyr
ormagrittr
package is loaded).
R
surveys %>% mutate(weight_kg = weight / 1000) %>% head()
The first few rows of the output are full of NA
s, so ifwe wanted to remove those we could insert a filter()
in thechain:
R
surveys %>% filter(!is.na(weight)) %>% mutate(weight_kg = weight / 1000) %>% head()
is.na()
is a function that determines whether somethingis an NA
. The !
symbol negates the result, sowe’re asking for every row where weight is not anNA
.
Challenge
Create a new data frame from the surveys
data that meetsthe following criteria: contains only the species_id
columnand a new column called hindfoot_cm
containing thehindfoot_length
values (currently in mm) converted tocentimeters. In this hindfoot_cm
column, there are noNA
s and all values are less than 3.
Hint: think about how the commands should be orderedto produce this data frame!
R
surveys_hindfoot_cm <- surveys %>% filter(!is.na(hindfoot_length)) %>% mutate(hindfoot_cm = hindfoot_length / 10) %>% filter(hindfoot_cm < 3) %>% select(species_id, hindfoot_cm)
Split-apply-combine data analysis and the summarize()
function
Many data analysis tasks can be approached using thesplit-apply-combine paradigm: split the data into groups, applysome analysis to each group, and then combine the results. Key functionsof dplyr
for this workflow aregroup_by()
and summarize()
.
The group_by()
and summarize()
functions
group_by()
is often used together withsummarize()
, which collapses each group into a single-rowsummary of that group. group_by()
takes as arguments thecolumn names that contain the categorical variables forwhich you want to calculate the summary statistics. So to compute themean weight
by sex:
R
surveys %>% group_by(sex) %>% summarize(mean_weight = mean(weight, na.rm = TRUE))
You may also have noticed that the output from these calls doesn’trun off the screen anymore. It’s one of the advantages oftbl_df
over data frame.
You can also group by multiple columns:
R
surveys %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight, na.rm = TRUE)) %>% tail()
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.
Here, we used tail()
to look at the last six rows of oursummary. Before, we had used head()
to look at the firstsix rows. We can see that the sex
column containsNA
values because some animals had escaped before their sexand body weights could be determined. The resultingmean_weight
column does not contain NA
butNaN
(which refers to “Not a Number”) becausemean()
was called on a vector of NA
valueswhile at the same time setting na.rm = TRUE
. To avoid this,we can remove the missing values for weight before we attempt tocalculate the summary statistics on weight. Because the missing valuesare removed first, we can omit na.rm = TRUE
when computingthe mean:
R
surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.
Here, again, the output from these calls doesn’t run off the screenanymore. If you want to display more data, you can use theprint()
function at the end of your chain with the argumentn
specifying the number of rows to display:
R
surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight)) %>% print(n = 15)
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.
Once the data are grouped, you can also summarize multiple variablesat the same time (and not necessarily on the same variable). Forinstance, we could add a column indicating the minimum weight for eachspecies for each sex:
R
surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight), min_weight = min(weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.
It is sometimes useful to rearrange the result of a query to inspectthe values. For instance, we can sort on min_weight
to putthe lighter species first:
R
surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight), min_weight = min(weight)) %>% arrange(min_weight)
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.
To sort in descending order, we need to add the desc()
function. If we want to sort the results by decreasing order of meanweight:
R
surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight), min_weight = min(weight)) %>% arrange(desc(mean_weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.
Counting
When working with data, we often want to know the number ofobservations found for each factor or combination of factors. For thistask, dplyr
provides count()
.For example, if we wanted to count the number of rows of data for eachsex, we would do:
R
surveys %>% count(sex)
The count()
function is shorthand for something we’vealready seen: grouping by a variable, and summarizing it by counting thenumber of observations in that group. In other words,surveys %>% count()
is equivalent to:
R
surveys %>% group_by(sex) %>% summarize(count = n())
For convenience, count()
provides the sort
argument:
R
surveys %>% count(sex, sort = TRUE)
Previous example shows the use of count()
to count thenumber of rows/observations for one factor (i.e.,sex
). If we wanted to count combination offactors, such as sex
and species
, wewould specify the first and the second factor as the arguments ofcount()
:
R
surveys %>% count(sex, species)
With the above code, we can proceed with arrange()
tosort the table according to a number of criteria so that we have abetter comparison. For instance, we might want to arrange the tableabove in (i) an alphabetical order of the levels of the species and (ii)in descending order of the count:
R
surveys %>% count(sex, species) %>% arrange(species, desc(n))
From the table above, we may learn that, for instance, there are 75observations of the albigula species that are not specified forits sex (i.e.NA
).
Challenge
- How many animals were caught in each
plot_type
surveyed?
R
surveys %>% count(plot_type)
Challenge(continued)
- Use
group_by()
andsummarize()
to find themean, min, and max hindfoot length for each species (usingspecies_id
). Also add the number of observations (hint: see?n
).
R
surveys %>% filter(!is.na(hindfoot_length)) %>% group_by(species_id) %>% summarize( mean_hindfoot_length = mean(hindfoot_length), min_hindfoot_length = min(hindfoot_length), max_hindfoot_length = max(hindfoot_length), n = n() )
Challenge(continued)
- What was the heaviest animal measured in each year? Return thecolumns
year
,genus
,species_id
,andweight
.
R
surveys %>% filter(!is.na(weight)) %>% group_by(year) %>% filter(weight == max(weight)) %>% select(year, genus, species, weight) %>% arrange(year)
Reshaping with pivot_longer and pivot_wider
In the spreadsheetlesson, we discussed how to structure our data leading to the fourrules defining a tidy dataset:
- Each variable has its own column
- Each observation has its own row
- Each value must have its own cell
- Each type of observational unit forms a table
Here we examine the fourth rule: Each type of observational unitforms a table.
In surveys
, the rows of surveys
contain thevalues of variables associated with each record (the unit), values suchas the weight or sex of each animal associated with each record. What ifinstead of comparing records, we wanted to compare the different meanweight of each genus between plots? (Ignoring plot_type
forsimplicity).
We’d need to create a new table where each row (the unit) iscomprised of values of variables associated with each plot. In practicalterms this means the values in genus
would become the namesof column variables and the cells would contain the values of the meanweight observed on each plot.
Having created a new table, it is therefore straightforward toexplore the relationship between the weight of different genera within,and between, the plots. The key point here is that we are stillfollowing a tidy data structure, but we have reshapedthe data according to the observations of interest: average genus weightper plot instead of recordings per date.
The opposite transformation would be to transform column names intovalues of a variable.
We can do both these of transformations with two tidyr
functions, pivot_wider()
andpivot_longer()
.
These may sound like dramatically different data layouts, but thereare some tools that make transitions between these layouts morestraightforward than you might think! The gif below shows how these twoformats relate to each other, and gives you an idea of how we can use Rto shift from one format to the other.
![Manipulating, analyzing and exporting data with tidyverse (1) Manipulating, analyzing and exporting data with tidyverse (1)](https://i0.wp.com/datacarpentry.org/fig/tidyr-pivot_wider_longer.gif)
Pivoting from long to wide format
pivot_wider()
takes three principal arguments:
- the data
- the names_from column variable whose values will become newcolumn names.
- the values_from column variable whose values will fill thenew column variables.
Further arguments include values_fill
which, if set,fills in missing values with the value provided.
Let’s use pivot_wider()
to transform surveys to find themean weight of each genus in each plot over the entire survey period. Weuse filter()
, group_by()
andsummarize()
to filter our observations and variables ofinterest, and create a new variable for themean_weight
.
R
surveys_gw <- surveys %>% filter(!is.na(weight)) %>% group_by(plot_id, genus) %>% summarize(mean_weight = mean(weight))
OUTPUT
#> `summarise()` has grouped output by 'plot_id'. You can override using the#> `.groups` argument.
R
str(surveys_gw)
This yields surveys_gw
where the observations for eachplot are distributed across multiple rows, 196 observations of 3variables. Using pivot_wider()
with the names fromgenus
and with values from mean_weight
thisbecomes 24 observations of 11 variables, one row for each plot.
R
surveys_wide <- surveys_gw %>% pivot_wider(names_from = genus, values_from = mean_weight)str(surveys_wide)
![Manipulating, analyzing and exporting data with tidyverse (2) Manipulating, analyzing and exporting data with tidyverse (2)](https://i0.wp.com/datacarpentry.org/fig/pivot_wider_graphic.png)
We could now plot comparisons between the weight of genera (one iscalled a genus, multiple are called genera) in different plots, althoughwe may wish to fill in the missing values first.
R
surveys_gw %>% pivot_wider(names_from = genus, values_from = mean_weight, values_fill = 0) %>% head()
Pivoting from wide to long format
The opposing situation could occur if we had been provided with datain the form of surveys_wide
, where the genus names arecolumn names, but we wish to treat them as values of a genus variableinstead.
In this situation we are reshaping the column names and turning theminto a pair of new variables. One variable represents the column namesas values, and the other variable contains the values previouslyassociated with the column names.
pivot_longer()
takes four principal arguments:
- the data
- the names_to column variable we wish to create from columnnames.
- the values_to column variable we wish to create and fillwith values.
- cols are the name of the columns we use to make this pivot(or to drop).
To recreate surveys_gw
from surveys_wide
wewould create a names variable called genus
and valuevariable called mean_weight
.
In pivoting longer, we also need to specify what columns to reshape.If the columns are directly adjacent as they are here, we don’t evenneed to list the all out: we can just use the :
operator!
R
surveys_long <- surveys_wide %>% pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)str(surveys_long)
![Manipulating, analyzing and exporting data with tidyverse (3) Manipulating, analyzing and exporting data with tidyverse (3)](https://i0.wp.com/datacarpentry.org/fig/pivot_longer_graphic.png)
Note that now the NA
genera are included in the longformat data frame. Pivoting wider and then longer can be a useful way tobalance out a dataset so that every replicate has the samecomposition
We could also have used a specification for what columns to exclude.In this example, we will use all columns exceptplot_id
for the names variable. By using the minus sign inthe cols
argument, we omit plot_id
from beingreshaped
R
surveys_wide %>% pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id) %>% head()
Challenge
- Reshape the
surveys
data frame withyear
as columns,plot_id
as rows, and the number of genera perplot as the values. You will need to summarize before reshaping, and usethe functionn_distinct()
to get the number of uniquegenera within a particular chunk of data. It’s a powerful function! See?n_distinct
for more.
R
surveys_wide_genera <- surveys %>% group_by(plot_id, year) %>% summarize(n_genera = n_distinct(genus)) %>% pivot_wider(names_from = year, values_from = n_genera)
OUTPUT
#> `summarise()` has grouped output by 'plot_id'. You can override using the#> `.groups` argument.
R
head(surveys_wide_genera)
Challenge(continued)
- Now take that data frame and
pivot_longer()
it, so eachrow is a uniqueplot_id
byyear
combination.
R
surveys_wide_genera %>% pivot_longer(names_to = "year", values_to = "n_genera", cols = -plot_id)
Challenge(continued)
- The
surveys
data set has two measurement columns:hindfoot_length
andweight
. This makes itdifficult to do things like look at the relationship between mean valuesof each measurement per year in different plot types. Let’s walk througha common solution for this type of problem. First, usepivot_longer()
to create a dataset where we have a namescolumn calledmeasurement
and avalue
columnthat takes on the value of eitherhindfoot_length
orweight
. Hint: You’ll need to specify which columnswill be part of the reshape.
R
surveys_long <- surveys %>% pivot_longer(names_to = "measurement", values_to = "value", cols = c(hindfoot_length, weight))
- With this new data set, calculate the average of each
measurement
in eachyear
for each differentplot_type
. Thenpivot_wider()
them into a dataset with a column forhindfoot_length
andweight
. Hint: You only need to specify the namesand values columns forpivot_wider()
.
R
surveys_long %>% group_by(year, measurement, plot_type) %>% summarize(mean_value = mean(value, na.rm=TRUE)) %>% pivot_wider(names_from = measurement, values_from = mean_value)
OUTPUT
#> `summarise()` has grouped output by 'year', 'measurement'. You can override#> using the `.groups` argument.
Certainly! From the passage you provided, let's break down the concepts discussed:
1. dplyr:
- Definition: It's a package in R used for data manipulation.
- Functionality: Offers various functions (like
select
,filter
,mutate
,group_by
,summarize
, etc.) to perform common data manipulation tasks. - Performance: Optimized tasks written in C++ to work directly with data frames.
- Database Compatibility: Allows working with data stored in external databases.
2. tidyr:
- Definition: Another package in R used for data manipulation, particularly for data reshaping.
- Functionality: Provides tools (
pivot_longer
,pivot_wider
, etc.) to reshape data into tidy formats. - Purpose: Aids in switching between different data layouts, facilitating analysis and plotting.
3. tidyverse:
- Definition: An "umbrella-package" that installs
tidyr
,dplyr
, along with other helpful packages for data analysis likeggplot2
,tibble
, etc. - Objective: Addresses common issues faced during data analysis in R.
- Advantages: Encourages tidy data formats, which are more accessible and easier to work with for analysis and visualization.
4. Key dplyr Functions:
select()
: Subsetting columnsfilter()
: Subsetting rows based on conditionsmutate()
: Creating new columns based on existing onesgroup_by()
andsummarize()
: Summarizing data after groupingarrange()
: Sorting resultscount()
: Counting discrete values
5. Pipes in dplyr:
- Definition: Symbolized by
%>%
, allows chaining multiple functions together for efficient data manipulation. - Functionality: Takes the output of one function and passes it directly as input to the next function, enhancing readability and simplicity in code.
6. Data Reshaping with tidyr:
pivot_longer()
: Converts wide data to long data format.pivot_wider()
: Converts long data to wide data format.- Purpose: Facilitates comparison between variables, groups, or observations in different formats.
7. Examples in Data Handling:
- Subset Selection: Using
select()
andfilter()
. - Pivoting: Employing
pivot_longer()
andpivot_wider()
for data format changes. - Summarizing Data: Utilizing
summarize()
after grouping withgroup_by()
.
8. Working with External Data:
- Database Integration:
dplyr
allows working directly with data stored in external databases. - Advantage: Removes limitations on data size due to in-memory operations, facilitating handling of larger datasets.
9. Performance Enhancement:
- Compiled Language Usage: dplyr uses compiled code (C++) for common tasks, enhancing performance and speed in data manipulation.
10. Workflow Optimization:
- Pipes for Workflow: Pipes in R (
%>%
) streamline the workflow, allowing complex data manipulations in a more readable manner.
11. Challenge Exercises:
- These exercises demonstrate practical application of the discussed functions for data manipulation and summarization.
By understanding these concepts, you can efficiently manipulate, reshape, and summarize data using dplyr and tidyr packages in R.