Friday, February 17, 2017

Removing duplicate genes based on the conditions of another column for differential expression and gene ontology pipelines in R

Some differential gene expression tools will reject your input table if the same gene name appears in more than one row. If your data frame contains duplicated values in a single column (e.g. gene name), you can remove the extra rows in R using the following code. Note that duplicated() flags every occurrence of a gene after the first, so this keeps only the first occurrence and removes all later ones.

> table <- read.delim("table.csv", header=TRUE, sep=",")
> table
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
1   1 SEC24B-AS1    3.837647   3.0343409   4.640954  1.5294767     0.61303813
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
4   4      GGACT    5.722972   3.5491983   7.896745  2.2249378     1.15376499
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
8   8     A4GALT  261.976303 281.3563018 242.596304  0.8622387    -0.21384071
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
12 12      GGACT 1000.000000   0.0000000   0.000000         NA             NA
13 13     A4GALT 1000.000000   0.0000000   0.000000         NA             NA
14 14 SEC24B-AS1 1000.000000   0.0000000   0.000000         NA             NA
        pval padj
1  0.4469177    1
2  0.3359994    1
3         NA   NA
4  0.1943902    1
5         NA   NA
6         NA   NA
7         NA   NA
8  0.3760542    1
9         NA   NA
10 0.8189333    1
11 0.6515563    1
12        NA   NA
13        NA   NA
14        NA   NA
> table.2 <- subset(table, !duplicated(table[,2]))
> table.2
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
1   1 SEC24B-AS1    3.837647   3.0343409   4.640954  1.5294767     0.61303813
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
4   4      GGACT    5.722972   3.5491983   7.896745  2.2249378     1.15376499
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
8   8     A4GALT  261.976303 281.3563018 242.596304  0.8622387    -0.21384071
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
        pval padj
1  0.4469177    1
2  0.3359994    1
3         NA   NA
4  0.1943902    1
5         NA   NA
6         NA   NA
7         NA   NA
8  0.3760542    1
9         NA   NA
10 0.8189333    1
11 0.6515563    1
You can see that the last three rows (12, 13 and 14) were removed, because their genes (GGACT, A4GALT and SEC24B-AS1) had already appeared earlier in the table. In some cases, however, you need to decide which of the duplicated rows to keep based on the values in another column.
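If you want to try the first-occurrence behaviour without a table.csv at hand, the following is a minimal sketch on a small data frame. The object names (genes, dup, genes_unique) are made up for illustration, and only a handful of id and baseMean values from the table above are used.

```r
# Small stand-in for table.csv: a few id/baseMean pairs taken from the table above
genes <- data.frame(
  id = c("SEC24B-AS1", "A1BG", "GGACT", "A4GALT", "GGACT", "A4GALT"),
  baseMean = c(3.837647, 1.769987, 5.722972, 261.976303, 1000, 1000),
  stringsAsFactors = FALSE
)

# duplicated() returns TRUE for the second and later occurrences of each id
dup <- duplicated(genes$id)

# Keep the rows that are not flagged, i.e. the first occurrence of each gene
genes_unique <- genes[!dup, ]
genes_unique
```

Indexing by column name (genes$id) instead of by position (table[,2]) should make the code less fragile if the column order of the input changes.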

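If you need to choose which of the duplicated rows survives based on another column, use this rather elegant approach: sort first, then subset. In the example below the table is ordered by id and then by decreasing baseMean, so the first occurrence of each gene, which is the one that duplicated() keeps, is the row with the highest baseMean.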

> table <- read.delim("table.csv", header=TRUE, sep=",")
> table
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
1   1 SEC24B-AS1    3.837647   3.0343409   4.640954  1.5294767     0.61303813
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
4   4      GGACT    5.722972   3.5491983   7.896745  2.2249378     1.15376499
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
8   8     A4GALT  261.976303 281.3563018 242.596304  0.8622387    -0.21384071
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
12 12      GGACT 1000.000000   0.0000000   0.000000         NA             NA
13 13     A4GALT 1000.000000   0.0000000   0.000000         NA             NA
14 14 SEC24B-AS1 1000.000000   0.0000000   0.000000         NA             NA
        pval padj
1  0.4469177    1
2  0.3359994    1
3         NA   NA
4  0.1943902    1
5         NA   NA
6         NA   NA
7         NA   NA
8  0.3760542    1
9         NA   NA
10 0.8189333    1
11 0.6515563    1
12        NA   NA
13        NA   NA
14        NA   NA
> table = table[order(table[,'id'],-table[,'baseMean']),]
> table
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
13 13     A4GALT 1000.000000   0.0000000   0.000000         NA             NA
8   8     A4GALT  261.976303 281.3563018 242.596304  0.8622387    -0.21384071
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
12 12      GGACT 1000.000000   0.0000000   0.000000         NA             NA
4   4      GGACT    5.722972   3.5491983   7.896745  2.2249378     1.15376499
14 14 SEC24B-AS1 1000.000000   0.0000000   0.000000         NA             NA
1   1 SEC24B-AS1    3.837647   3.0343409   4.640954  1.5294767     0.61303813
        pval padj
2  0.3359994    1
3         NA   NA
5         NA   NA
6         NA   NA
7         NA   NA
13        NA   NA
8  0.3760542    1
9         NA   NA
10 0.8189333    1
11 0.6515563    1
12        NA   NA
4  0.1943902    1
14        NA   NA
1  0.4469177    1
> table.2 <- subset(table, !duplicated(table[,2]))
> table.2
    X         id    baseMean   baseMeanA  baseMeanB foldChange log2FoldChange
2   2       A1BG    1.769987   0.7314147   2.808559  3.8398998     1.94106866
3   3       A1CF    0.000000   0.0000000   0.000000         NA             NA
5   5        A2M    0.000000   0.0000000   0.000000         NA             NA
6   6      A2ML1    0.000000   0.0000000   0.000000         NA             NA
7   7      A2MP1    0.000000   0.0000000   0.000000         NA             NA
13 13     A4GALT 1000.000000   0.0000000   0.000000         NA             NA
9   9      A4GNT    0.000000   0.0000000   0.000000         NA             NA
10 10       AAAS  237.463538 240.8107614 234.116315  0.9722004    -0.04067439
11 11       AACS  262.054727 268.0227018 256.086753  0.9554666    -0.06572258
12 12      GGACT 1000.000000   0.0000000   0.000000         NA             NA
14 14 SEC24B-AS1 1000.000000   0.0000000   0.000000         NA             NA
        pval padj
2  0.3359994    1
3         NA   NA
5         NA   NA
6         NA   NA
7         NA   NA
13        NA   NA
9         NA   NA
10 0.8189333    1
11 0.6515563    1
12        NA   NA
14        NA   NA
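
The sort-then-subset steps can also be sketched on a small self-contained data frame. This is only an illustration under assumptions: the object names (genes, genes_sorted, genes_top) are hypothetical, and the values are a few id and baseMean pairs from the table above.

```r
# Small stand-in for table.csv: a few id/baseMean pairs taken from the table above
genes <- data.frame(
  id = c("SEC24B-AS1", "A1BG", "GGACT", "A4GALT", "GGACT", "A4GALT"),
  baseMean = c(3.837647, 1.769987, 5.722972, 261.976303, 1000, 1000),
  stringsAsFactors = FALSE
)

# Sort by id, and within each id by decreasing baseMean
genes_sorted <- genes[order(genes$id, -genes$baseMean), ]

# After sorting, the first row of each id is the one with the highest baseMean,
# so dropping the later occurrences keeps exactly that row
genes_top <- genes_sorted[!duplicated(genes_sorted$id), ]
genes_top
```

To keep the row with the lowest value instead, drop the minus sign in order(). By default order() places NA values last, so a row with NA in the sorting column is kept only when its gene has no row with a non-missing value.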
