Select first and last row from grouped data

Question Using dplyr , how do I select the top and bottom observations/rows of grouped data in one statement? Data & Example Given a data frame:

I can get the top and bottom observations from each group using slice , but using two separate statements:

firstStop % group_by(id) %>% arrange(stopSequence) %>% slice(1) %>% ungroup lastStop % group_by(id) %>% arrange(stopSequence) %>% slice(n()) %>% ungroup

Can I combine these two statements into one that selects both top and bottom observations? 16.7k 6 6 gold badges 20 20 silver badges 53 53 bronze badges asked Jul 21, 2015 at 1:41 8,263 15 15 gold badges 45 45 silver badges 80 80 bronze badges Commented Nov 28, 2018 at 8:04

10 Answers 10

There is probably a faster way:

df %>% group_by(id) %>% arrange(stopSequence) %>% filter(row_number()==1 | row_number()==n())

answered Jul 21, 2015 at 1:48 24.9k 5 5 gold badges 65 65 silver badges 76 76 bronze badges rownumber() %in% c(1, n()) would obviate the need to run vector scan twice Commented Sep 4, 2016 at 12:20 @MichaelChirico I suspect you omitted an _ ? i.e. filter(row_number() %in% c(1, n())) Commented Oct 11, 2017 at 11:03

Just for completeness: You can pass slice a vector of indices:

df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))

 id stopId stopSequence 1 1 a 1 2 1 c 3 3 2 b 1 4 2 c 4 5 3 b 1 6 3 a 3

answered Jul 21, 2015 at 17:11 66.7k 8 8 gold badges 101 101 silver badges 184 184 bronze badges might even be faster than filter - have not tested this, but see here Commented Jul 3, 2018 at 16:42

@Tjebo Unlike filter, slice can return the same row multiple times, eg mtcars[1, ] %>% slice(c(1, n())) so in that sense the choice between them depends on what you want returned. I'd expect the timings to be close unless n is very large (where slice might be favored), but haven't tested either.

Commented Jul 3, 2018 at 16:47

Not dplyr , but it's much more direct using data.table :

library(data.table) setDT(df) df[ df[order(id, stopSequence), .(rows = .I[c(1L,.N)]), by=id]$rows ] # rows stopId stopSequence # 1: 1 a 1 # 2: 1 c 3 # 3: 2 b 1 # 4: 2 c 4 # 5: 3 b 1 # 6: 3 a 3

More detailed explanation:

# 1) get row numbers of first/last observations from each group # * basically, we sort the table by id/stopSequence, then, # grouping by id, name the row numbers of the first/last # observations for each id; since this operation produces # a data.table # * .I is data.table shorthand for the row number # * here, to be maximally explicit, I've named the variable rows # as row_num to give other readers of my code a clearer # understanding of what operation is producing what variable first_last = df[order(id, stopSequence), .(rows = .I[c(1L,.N)]), by=id] idx = first_last$rows # 2) extract rows by number df[idx]

Be sure to check out the Getting Started wiki for getting the data.table basics covered