Monday, January 23, 2017

Replacing Excel's erroneous conversion of gene names

If you work in Excel with gene names, it is likely that your gene names have been substituted with dates (month-day).

For example, I am looking at the table created by a collaborator:

mpjanic@valkyr:~$ head mastertable.ip 
GENE ip_270_1 ip_270_2 ip_270_3 ip_270_4 ip_270_5 ip_270_6 ip_270_7 ip_270_8
SEC24B-AS1 2 1 4 3 8 5 4 3
A1BG 0 0 0 4 3 2 3 5
A1CF 0 0 0 0 0 0 0 0
GGACT 2 1 3 8 5 11 11 6
A2M 0 0 0 0 0 0 0 0
A2ML1 0 0 0 0 0 0 0 0
A2MP1 0 0 0 0 0 0 0 0
A4GALT 318 306 115 328 259 183 272 397
A4GNT 0 0 0 0 0 0 0 0
It is often good to check if the table has been generated in Excel:

mpjanic@valkyr:~$ grep Sep mastertable.ip 
Sep-15 1070 1220 778 1675 1675 1258 1133 2200
Sep-01 3 1 0 0 0 0 2 0
Sep-10 870 1065 798 1507 1322 839 1075 1469
Sep-11 3238 5737 3216 7827 5064 4548 3512 8151
Sep-12 0 0 0 0 0 0 0 0
Sep-14 0 0 0 0 0 0 0 0
Sep-02 2374 3801 2429 5225 4467 3692 3696 5812
Sep-03 1 1 0 2 1 1 1 1
Sep-04 0 0 0 0 0 0 0 0
Sep-05 198 282 147 338 291 243 290 410
Sep-06 332 434 334 743 462 432 417 629
Sep-08 686 784 502 1094 834 777 625 973
Sep-09 4238 4609 1840 4750 4047 3414 4967 5742
Sep-07 2374 3675 2426 4685 4358 3038 4676 6592
mpjanic@valkyr:~/mastertable$ grep Mar mastertable.ip 
Mar-01 0 1 3 3 3 0 0 4
Mar-02 76 76 52 107 64 57 72 129
Mar-01 0 0 0 0 0 0 0 1
Mar-10 0 1 0 0 0 2 1 0
Mar-11 0 0 0 1 0 0 0 0
Mar-02 224 191 153 220 201 120 313 323
Mar-03 236 223 239 351 313 230 439 421
Mar-04 162 193 126 255 233 157 198 254
Mar-05 239 343 214 469 411 246 360 591
Mar-06 1122 1828 809 2053 1965 1385 1768 2897
Mar-07 106 242 136 335 260 200 137 306
Mar-08 342 383 217 438 351 299 366 384
Mar-09 16 37 27 63 48 36 32 34
mpjanic@valkyr:~/mastertable$ grep Dec mastertable.ip 
Dec-01 0 0 0 0 0 0 0 0
There are 3 genes possibly affected by Excel, SEPT, MARCH and DEC, so for example SEPT1 will be converted to Sep-01, MARCH1 to Mar-01, and DEC1 to Dec-01.

To convert back use sed with regular expressions:

mpjanic@valkyr:~$ sed -E "s/(Sep-|Sep-0)/SEPT/g" mastertable.ip | grep SEPT
SEPT15 1070 1220 778 1675 1675 1258 1133 2200
SEPT1 3 1 0 0 0 0 2 0
SEPT10 870 1065 798 1507 1322 839 1075 1469
SEPT11 3238 5737 3216 7827 5064 4548 3512 8151
SEPT12 0 0 0 0 0 0 0 0
SEPT14 0 0 0 0 0 0 0 0
SEPT2 2374 3801 2429 5225 4467 3692 3696 5812
SEPT3 1 1 0 2 1 1 1 1
SEPT4 0 0 0 0 0 0 0 0
SEPT5 198 282 147 338 291 243 290 410
SEPT6 332 434 334 743 462 432 417 629
SEPT7P2 89 121 55 124 100 74 102 119
SEPT8 686 784 502 1094 834 777 625 973
SEPT9 4238 4609 1840 4750 4047 3414 4967 5742
SEPT7 2374 3675 2426 4685 4358 3038 4676 6592
SEPT7L 0 1 0 0 0 0 0 0
mpjanic@valkyr:~$ sed -E "s/(Mar-|Mar-0)/MARCH/g" mastertable.ip | grep MARCH
MARCH1 0 1 3 3 3 0 0 4
MARCH2 76 76 52 107 64 57 72 129
MARCH1 0 0 0 0 0 0 0 1
MARCH10 0 1 0 0 0 2 1 0
MARCH11 0 0 0 1 0 0 0 0
MARCH2 224 191 153 220 201 120 313 323
MARCH3 236 223 239 351 313 230 439 421
MARCH4 162 193 126 255 233 157 198 254
MARCH5 239 343 214 469 411 246 360 591
MARCH6 1122 1828 809 2053 1965 1385 1768 2897
MARCH7 106 242 136 335 260 200 137 306
MARCH8 342 383 217 438 351 299 366 384
MARCH9 16 37 27 63 48 36 32 34
mpjanic@valkyr:~$ sed -E "s/(Dec-|Dec-0)/DEC/g" mastertable.ip | grep DEC
DEC1 0 0 0 0 0 0 0 0

Sunday, January 15, 2017

rsIDmultipleVCFParser, parse multiple individual vcf files

This is a bash/awk script that will parse multiple individual vcf file and for your input SNP of interest output a table with VCF tags including genotype GT, genotype quality GQ, allele depth AD etc, values for VCF tags, sample names/IDs from the VCF file, input SNP ID. Useful to assess genotype and quality for a particular SNP of interest.

https://github.com/milospjanic/rsIDmultipleVCFParser

If you have a collection of vcf files in one folder (note vcfs need to snp and indel vcfs named *SNP_INDEL.vcf

ls -ltrh *SNP_INDEL.vcf

-rwxrwxrwx 1 mpjanic users  1.2G Jul  7  2015 1410UNHS-0007_1848_1_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users 1018M Jul  7  2015 1410UNHS-0007_2102_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  989M Jul  7  2015 1410UNHS-0007_2228_1_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  1.2G Jul  7  2015 1410UNHS-0007_2305_3_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users 1007M Jul  7  2015 1410UNHS-0007_2435_5_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  974M Jul  7  2015 1410UNHS-0007_2999_1_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  964M Jul  7  2015 1410UNHS-0007_3003_5_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  930M Jul  7  2015 1411KHX-0032_1522-2_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  846M Jul  7  2015 1411KHX-0032_2105-6P_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  818M Jul  7  2015 1411KHX-0032_2139-1_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  835M Jul  7  2015 1411KHX-0032_2463-4_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  1.1G Jul  7  2015 1411KHX-0032_24635-1P_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  836M Jul  7  2015 1411KHX-0032_317155-2_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  898M Jul  7  2015 1503UNHX-0001_2989_3_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  675M Jul  9  2015 1407KHX-0008_3101801-2-3_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  950M Jul  9  2015 1503UNHX-0001_1060602_4_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  956M Jul  9  2015 1503UNHX-0001_1596_5p_SNP_INDEL.vcf
-rwxrwxrwx 1 mpjanic users  830M Jul 10  2015 1411KHX-0032_2135-5_SNP_INDEL.vcf
Use this combined bash/awk script to parse all vcfs in the folder simultaneously, and to output a table with, SNP ID, sample ID, vcf tags and values.

chmod 755 rsIDmultipleVCFParser.sh
./rsIDmultipleVCFParser.sh rs200751313

SNP: rs200751313    Sample: 3101801-2-3 Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:33:33:12:3:0,12
SNP: rs200751313    Sample: 1848_1
SNP: rs200751313    Sample: 2102    Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:132:132:45:3:0,45
SNP: rs200751313    Sample: 2228_1  Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:48:48:17:2:0,17
SNP: rs200751313    Sample: 2305_3  Tags: GT:GQ:GQX:DP:DPF:AD   Values: 0/1:55:55:30:0:5,25
SNP: rs200751313    Sample: 2435_5
SNP: rs200751313    Sample: 2999_1  Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:87:87:30:0:0,30
SNP: rs200751313    Sample: 3003_5  Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:54:54:19:1:0,19
SNP: rs200751313    Sample: 1522-2
SNP: rs200751313    Sample: 2105-6P Tags: GT:GQ:GQX:DP:DPF:AD   Values: 0/1:33:33:43:1:7,36
SNP: rs200751313    Sample: 2135-5  Tags: GT:GQ:GQX:DP:DPF:AD   Values: 0/1:37:37:14:2:2,12
SNP: rs200751313    Sample: 2139-1  Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:90:90:31:3:0,31
SNP: rs200751313    Sample: 2463-4
SNP: rs200751313    Sample: 24635-1P    Tags: GT:GQ:GQX:DP:DPF:AD   Values: 0/1:77:77:24:2:6,18
SNP: rs200751313    Sample: 317155-2    Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:75:75:26:0:0,26
SNP: rs200751313    Sample: 1060602_4   Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:87:87:30:3:0,30
SNP: rs200751313    Sample: 1596_5p Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:66:66:23:2:0,23
SNP: rs200751313    Sample: 2989_3  Tags: GT:GQ:GQX:DP:DPF:AD   Values: 1/1:93:93:32:2:0,32

Thursday, January 12, 2017

combinedVCFParser

This is a bash/awk script that will parse a combined vcf file and for your input SNP of interest output a table of with Het/Ref homo/Alt homo denotation, sample names/IDs, position in vcf, genotype.

Script is useful when you need to quicky assess which sample is Het/Ref homo/Alt homo for a particular SNP of interest.

Download from GitHub
https://github.com/milospjanic/combinedVCFParser

For example, this is the vcf file content that you want to inspect, and SNP of interest is rs115209712

cat file.vcf 

##fileformat=VCFv4.2
##filedate=20160604
##source="beagle.03May16.862.jar (version 4.1)"
##INFO=
##INFO=
##INFO=
##INFO=
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1060602 1522    1596    2105    2139    2161    2228    2305    2435    2463    2477    2999    2989    3003    3101801 317155
1   84183   rs570923646 C   G   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0|0   0|0
1   86028   rs114608975 T   C   .   PASS    .   GT  0|0 0|0 1|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|1|1   0|0
1   86065   rs116504101 G   C   .   PASS    .   GT  0|0 0|0 1|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|1|1   0|0
1   86331   rs115209712 A   G   .   PASS    .   GT  0|0 1|1 0|0 0|0 1|1 0|0 0|0 0|0 0|0 1|1 0|0 0|1 1|1 0|0|0   0|0
1   87021   rs188486692 T   C   .   PASS    .   GT  1|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0|0   0|0
1   88236   rs186918018 C   T   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0|0   0|0
1   88338   rs55700207  G   A   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0|0   0|0
1   566875  rs2185539   C   T   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|1 0|0 0|0 0|0 0|0 0|0 0|0|0   0|0
1   568451  rs78032279  C   T   .   PASS    .   GT  0|0 0|1 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|1 0|0 0|0 0|0 1|0|0   0|0
1   705882  rs72631875  G   A   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|1 1|0|0   0|0
1   713977  rs74512038  C   T   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0|0   0|0
1   714002  rs143219077 C   G   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0|0   0|0
1   714019  rs114983708 A   G   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0|0   0|0

Run combinedVCFParser.sh with SNP ID and vcf file name as two arguments

chmod 755 combinedVCFParser.sh
./combinedVCFParser.sh rs115209712 file.vcf

Heterozygote    2999    21  0|1
Reference homozygote    1060602 10  0|0
Reference homozygote    1596    12  0|0
Reference homozygote    2105    13  0|0
Reference homozygote    2161    15  0|0
Reference homozygote    2228    16  0|0
Reference homozygote    2305    17  0|0
Reference homozygote    2435    18  0|0
Reference homozygote    2477    20  0|0
Reference homozygote    3003    23  0|0|0
Reference homozygote    3101801 24  0|0
Alternative homozygote  1522    11  1|1
Alternative homozygote  2139    14  1|1
Alternative homozygote  2463    19  1|1
Alternative homozygote  2989    22  1|1

Monday, January 9, 2017

Awk script to transform scientific numbers in E-notation

If you have a table with mixed E-notations scientific numbers and numbers in a standard form you can use this code in awk to transform it completely into standard form.

For example: file.txt

cat file.txt
NM_000033 104 100 62 87 111 90
NM_000036 0 0 1 3 0 4e+05
NM_000211 8e+05 0 0 3 13 0
NM_000299 0 2e-07 1 0 0 1
NM_001004711 0 0 6e-07 0 0 0
NM_000035 4 0 0 2 1 0
NM_000036 0 0 1 3 0 0 
NM_000718 5e-07 2e-07 4e-08 1e-08 1 7e-08
NM_000928 1 0 0 5e-06 7 0
NM_001001734 3 0 3e-06 4 1 7
NM_001005356 50 5e-08 120 7e-06 3e-06 38
NM_000039 59 27 31 50 48 40
NM_000040 0 0 0 0 0 0
NM_001011880 0 3 16 2e-07 4 3
NM_001024452 0 1 0 0 2e-06 0
NM_000041 110 74 59 68 59 81
Use awk code, save as an awk script and run with awk

#condition if line matches regex e+, e-, E+ or E- 
{
/[eE][+-]/

#go through fields till NF
  {for(i = 1; i <= NF; i++) 

#if field matches regex ^[0-9]*[eE][+-][0-9]*$ 
#i.e. any repetition of numbers, followed by e+, e-, E+ or E- then any repetition of numbers
#change to integer with printf "%.0f

    {if ($i~/^[0-9]*[eE][+-][0-9]*$/) printf "%.0f""\t", $i;

#if filed does not contain scientific number only print
    else printf $i"\t"} 

#print new line
  printf "\n";next}

#print whole line and new line for line do not matching a condition 
{printf $0"\n"}
}
Script will match those lines that contain regex e+, e-, E+ or E-, and then go through each field of that line. If a field matches ^[0-9]*[eE][+-][0-9]*$, then substitute with an integer. Else only print field. Outside of the condition, print whole line and newline character.

Example output:

awk -f script.awk file.txt 
NM_000033 104 100 62 87 111 90 
NM_000036 0 0 1 3 0 400000 
NM_000211 800000 0 0 3 13 0 
NM_000299 0 0 1 0 0 1 
NM_001004711 0 0 0 0 0 0 
NM_000035 4 0 0 2 1 0 
NM_000036 0 0 1 3 0 0 
NM_000718 0 0 0 0 1 0 
NM_000928 1 0 0 0 7 0 
NM_001001734 3 0 0 4 1 7 
NM_001005356 50 0 120 0 0 38 
NM_000039 59 27 31 50 48 40 
NM_000040 0 0 0 0 0 0 
NM_001011880 0 3 16 0 4 3 
NM_001024452 0 1 0 0 0 0 
NM_000041 110 74 59 68 59 81 
Or use oneliner:

awk  '/e[+-]/{for(i = 1; i <= NF; i++) {if ($i~/^[0-9]*e[+-][0-9]*$/) printf "%.0f""\t", $i; else printf $i"\t"} printf "\n";next} {printf $0"\n"}' file.txt

Thursday, January 5, 2017

Parsing tsv output files from Kallisto

If you have a tsv file from Kallisto (abundance.tsv) that you need to parse,

head abundance.tsv
target_id       length  eff_length      est_counts      tpm
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|    1657    1478    12.1601 0.153207
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|       632     453     0       0
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene|    1351    1172    243.528 3.86933
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|       68      3.2997  0.5     2.8217
ENST00000473358.1|ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|        712     533     0       0
ENST00000469289.1|ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|        535     356     0       0
ENST00000607096.1|ENSG00000274890.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|      138     6.64697 0       0
ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-001|FAM138A|1187|lincRNA| 1187    1008    0       0
ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-002|FAM138A|590|lincRNA|  590     411     0       0
for example, you need to take ENSG gene name from the second column delimited by | and then estimated number of counts from the third column delimited by tab, you can do this in one command. Use find/xargs pipe to process every abundance.tsv, cut -f1,4 columns using default tab as a separator (thus grabing first column and estimated number of counts from the 4th), then sed "s/.*|E/E/g" to delete text till first | (i.e the first column delimited with | ). 

find -name '*abundance.tsv' | xargs -I % sh -c 'cut -f1,4 % | sed "s/.*|E/E/g";' | head
target_id       est_counts
ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|      12.1601
ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-001|DDX11L1|632|transcribed_unprocessed_pseudogene| 0
ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene|      243.528
ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA| 0.5
ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|  0
ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|  0
ENSG00000274890.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|        0
ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-001|FAM138A|1187|lincRNA|   0
ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-002|FAM138A|590|lincRNA|    0
Next, use sed "s/|.*|.*|.*|.*|//g" to delete remaining of the text in the first column, keeping only ENSG gene name.

find -name '*abundance.tsv' | xargs -I % sh -c 'cut -f1,4 % | sed "s/.*|E/E/g";' | head
target_id       est_counts
ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|      12.1601
ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-001|DDX11L1|632|transcribed_unprocessed_pseudogene| 0
ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene|      243.528
ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA| 0.5
ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|  0
ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|  0
ENSG00000274890.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|        0
ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-001|FAM138A|1187|lincRNA|   0
ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-002|FAM138A|590|lincRNA|    0
Finally, substitute .[0-9]* with //, i.e. delete. 

find -name '*abundance.tsv' | xargs -I % sh -c 'cut -f1,4 % | sed "s/.*|E/E/g" | sed "s/|.*|.*|.*|.*|//g" | sed -e "s/\.[0-9]*//g";' | head
target_id       est_counts
ENSG00000223972 12
ENSG00000223972 0
ENSG00000227232 243
ENSG00000278267 0
ENSG00000243485 0
ENSG00000243485 0
ENSG00000274890 0
ENSG00000237613 0
ENSG00000237613 0