Question
Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?
Data & Example
Given a data frame:
I can get the top and bottom observations from each group using slice, but only with two separate statements:
firstStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1) %>%
  ungroup()
lastStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(n()) %>%
  ungroup()
Can I combine these two statements into one that selects both top and bottom observations?
asked Jul 21, 2015 at 1:41
There is probably a faster way:
df %>% group_by(id) %>% arrange(stopSequence) %>% filter(row_number()==1 | row_number()==n())
answered Jul 21, 2015 at 1:48
rownumber() %in% c(1, n()) would obviate the need to run vector scan twice
Commented Sep 4, 2016 at 12:20
@MichaelChirico I suspect you omitted an _ ? i.e. filter(row_number() %in% c(1, n()))
Commented Oct 11, 2017 at 11:03
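For anyone who wants to run the corrected filter version end to end, here is a minimal sketch. The data frame below is an assumption: the question's original data is not reproduced here, so these values were simply chosen to be consistent with the output posted elsewhere in this thread. Sorting by id and stopSequence up front is also my choice, to make the row order of the result predictable.

```r
library(dplyr)

# Hypothetical data (an assumption, not the asker's original data frame)
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
  stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2),
  stringsAsFactors = FALSE
)

first_last <- df %>%
  arrange(id, stopSequence) %>%                 # sort within each id
  group_by(id) %>%
  filter(row_number() %in% c(1, n())) %>%       # keep first and last row per group
  ungroup()

print(first_last)
```

With three groups of three rows each, this keeps six rows: the lowest and highest stopSequence for every id.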
Just for completeness: You can pass slice a vector of indices:
df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))
  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3
answered Jul 21, 2015 at 17:11
might even be faster than filter - have not tested this, but see here
Commented Jul 3, 2018 at 16:42
@Tjebo Unlike filter, slice can return the same row multiple times, eg mtcars[1, ] %>% slice(c(1, n())) so in that sense the choice between them depends on what you want returned. I'd expect the timings to be close unless n is very large (where slice might be favored), but haven't tested either.
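A small sketch of that difference, using the one-row mtcars example from the comment. The variable names are just illustrative, and wrapping the indices in unique() is one possible way to stop slice from repeating a row, not something proposed in the thread.

```r
library(dplyr)

one_row <- mtcars[1, ]   # a "group" with a single row, so 1 and n() coincide

via_slice  <- one_row %>% slice(c(1, n()))                      # index 1 requested twice
via_filter <- one_row %>% filter(row_number() %in% c(1, n()))   # row either matches or not
via_unique <- one_row %>% slice(unique(c(1, n())))              # de-duplicated indices

nrow(via_slice)
nrow(via_filter)
nrow(via_unique)
```

So if some groups may contain only one observation, it is worth deciding up front whether that observation should appear once or twice in the result.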
Commented Jul 3, 2018 at 16:47
Not dplyr, but it's much more direct using data.table:
library(data.table)
setDT(df)
df[ df[order(id, stopSequence), .(rows = .I[c(1L,.N)]), by=id]$rows ]
#    id stopId stopSequence
# 1:  1      a            1
# 2:  1      c            3
# 3:  2      b            1
# 4:  2      c            4
# 5:  3      b            1
# 6:  3      a            3
More detailed explanation:
# 1) get row numbers of first/last observations from each group
#    * basically, we sort the table by id/stopSequence, then,
#      grouping by id, name the row numbers of the first/last
#      observations for each id; since this operation produces
#      a data.table, the row numbers are pulled out of its
#      rows column afterwards
#    * .I is data.table shorthand for the row number
#    * here, to be maximally explicit, I've named the variable
#      rows to give other readers of my code a clearer
#      understanding of what operation is producing what variable
first_last = df[order(id, stopSequence), .(rows = .I[c(1L,.N)]), by=id]
idx = first_last$rows

# 2) extract rows by number
df[idx]
Be sure to check out the Getting Started wiki to get the data.table basics covered.