I am learning how to use the R dplyr ‘join’ functions by doing the exercises from this course: https://github.com/uclouvain-cbio/WSBIM1207 and got stuck on the problem described below.
First, download the example dataframes used for this question:
Load the package:
Then, in R/RStudio, load the data frame files ‘clinical2’ and ‘expression’ by typing:
The task is, firstly:
‘Join the expression and clinical2 tables by the patient reference, using the left_join and the right_join functions.’
I did that in this way:
left_join(expression, clinical2, by = c("patient" = "patientID"))
right_join(expression, clinical2, by = c("patient" = "patientID"))
The second task is to explain why the results are different. I found that the right_join output has 3 more rows than the left_join output. This seems odd to me, given that ‘clinical2’ has 516 rows whereas ‘expression’ has 570 rows. The 3 extra rows in the right_join output all contain multiple NA values, so they presumably represent patients found in ‘clinical2’ but not in ‘expression’. I don’t really understand what is going on here and would be grateful for any help.
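To make the question concrete, here is a small sketch of what I suspect is happening. The data frames `expression_toy` and `clinical_toy` are made up for illustration and are not the course data; I am only assuming that, like the real tables, the expression table can hold several rows per patient and the clinical table can contain a patient with no expression rows. The key column names are the ones from my join calls above.

```r
library(dplyr)

# Made-up toy data (hypothetical stand-ins, NOT the course data sets).
# Several rows per patient, as I assume is the case in 'expression'.
expression_toy <- tibble(
  patient = c("P1", "P1", "P2", "P2", "P3"),
  gene    = c("A", "B", "A", "B", "A"),
  value   = c(1.2, 3.4, 0.5, 2.2, 1.9)
)

# One row per patient; P4 has no rows in expression_toy.
clinical_toy <- tibble(
  patientID = c("P1", "P2", "P3", "P4"),
  age       = c(61, 54, 70, 48)
)

# left_join keeps every row of the first table (expression_toy).
left <- left_join(expression_toy, clinical_toy,
                  by = c("patient" = "patientID"))

# right_join keeps every row of the second table (clinical_toy);
# a patient with several expression rows is repeated once per match,
# and a patient with no match gets a single row filled with NA.
right <- right_join(expression_toy, clinical_toy,
                    by = c("patient" = "patientID"))

nrow(left)   # 5: one row per expression_toy row
nrow(right)  # 6: the 5 matched rows plus 1 NA row for P4

# anti_join lists the clinical rows that have no match in expression_toy.
unmatched <- anti_join(clinical_toy, expression_toy,
                       by = c("patientID" = "patient"))
unmatched
```

If the real tables behave like this, the right_join row count would not follow from `nrow(clinical2)` alone: it would be the number of matched expression rows plus the clinical2 patients that have no match. Running the same `anti_join` on the real `clinical2` and `expression` should show whether the 3 extra rows are such unmatched patients.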