Sunday, February 28, 2016

Moving all the files that DO NOT match extensions

To move every file in a folder that does not match the given extensions, use find with the ! operator to negate the test:

mpjanic@valkyr:~/tmp$ ls -ltrh
total 48K
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 2.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 3.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 2.tif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 3.tif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.tif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 2.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 3.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 3.gif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 2.gif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 1.gif
mpjanic@valkyr:~/tmp$ find -type f ! \( -iname '*.gif' -o -iname '*.jpg' \) -exec  mv {} ~/tmp2 \;
mpjanic@valkyr:~/tmp$ ls -ltrh
total 24K
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 2.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 3.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 3.gif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 2.gif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 1.gif
mpjanic@valkyr:~/tmp$ ls -ltrh ~/tmp2
total 24K
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 2.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 3.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 2.tif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 3.tif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.tif
Similarly, remove ! if you want to move files that match the extensions:

mpjanic@valkyr:~/tmp$ find -type f \( -iname '*.gif' -o -iname '*.jpg' \) -exec  mv {} ~/tmp2 \;
mpjanic@valkyr:~/tmp$ ls -ltrh
total 24K
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 2.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 3.tiff
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 2.tif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 3.tif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.tif
mpjanic@valkyr:~/tmp$ ls -ltrh ~/tmp2
total 24K
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:26 1.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 2.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 3.jpg
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 3.gif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 2.gif
-rwxrwxr-x 1 mpjanic mpjanic 3 Feb 28 21:27 1.gif
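
The same idea can be tried safely in a scratch directory first. The sketch below is only an illustration: the directory and file names are made up, and it assumes GNU find and GNU mv (for -maxdepth and mv -t). It restricts the search to the top level of the folder and batches the moves with {} + instead of running mv once per file.

```shell
#!/usr/bin/env bash
# Hypothetical scratch area; nothing here touches real data.
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"
touch "$work/src/1.tiff" "$work/src/2.tif" "$work/src/1.jpg" "$work/src/2.GIF"

# -maxdepth 1 stops find from descending into subfolders.
# -iname matches case-insensitively, so 2.GIF counts as a gif.
# -exec ... {} + hands many file names to a single mv call.
find "$work/src" -maxdepth 1 -type f ! \( -iname '*.gif' -o -iname '*.jpg' \) \
    -exec mv -t "$work/dst" {} +

# Record what ended up where, then clean up.
moved=$(cd "$work/dst" && echo *)
kept=$(cd "$work/src" && echo *)
echo "moved: $moved"
echo "kept:  $kept"
rm -rf "$work"
```

Without -maxdepth 1, find also moves matching files from subfolders, which may or may not be what you want.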

How to remove all characters except ATCG> from a fasta file

If you have a FASTA file that you want to clean up completely, i.e. remove every character except A, T, C, G and >, use:

sed 's/[^ATCG>]*//g' test.fasta
To remove blank lines that may appear, pipe the output to a second sed:

sed 's/[^ATCG>]*//g' test.fasta | sed '/^$/d'
Still, this procedure may leave behind stray A, T, G and C letters from the text you wanted to remove. In the example below, letters from the file header remain at the beginning of the output and have to be cleaned up manually. Likewise, only G, A and > are left from each description line.

sed 's/[^ATCG>]*//g' test.fasta | sed '/^$/d' | head -n50
GATC
G
GC
ATAAA
GGC
>G>AA
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCACCACCATCACCATTACCATT
ACCACAGGTAACGGTGCGGGCTGA
>G>AA
ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTT
GCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCC
GCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCT
TTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTCTGACGGGACTCGCCGCC
GCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTTTCGTCGACCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGATAGCATTAACGCT
GCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCG
CGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTAC
CTCGAATCTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCG
GCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTG
GTACTTGGACGCAACGGTTCCGACTACTCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCC
GATTGTTGCGAGATTTGGACGGACGTTGACGGGGTATATACCTGCGACCCGCGTCAGGTG
CCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTC
GGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGC
CTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGAT
GAAGACGAATTACCGGTCAAGGGCATTTCCAATCTGAATAATATGGCAATGTTCAGCGTT
TCCGGCCCGGGGATGAAAGGAATGGTCGGCATGGCGGCGCGCGTCTTTGCTGCAATGTCA
CGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGTATCAGTTTC
TGCGTTCCGCAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTCTACCTG
GAACTGAAAGAAGGCTTACTGGAGCCGCTGGCGGTGACGGAACGGCTGGCCATTATCTCG
GTGGTAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCGCTG
GCCCGCGCCAATATCAACATTGTCGCTATTGCTCAGGGATCTTCTGAACGCTCAATCTCT
GTCGTGGTAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTC
AATACCGATCAGGTTATCGAAGTGTTTGTGATTGGCGTCGGTGGCGTTGGCGGTGCGCTG
CTGGAGCAACTGAAGCGTCAGCAAAGCTGGTTGAAGAATAAACATATCGACTTACGTGTC
TGCGGTGTTGCTAACTCGAAGGCTCTGCTCACCAATGTGCATGGCCTAAATCTGGAAAAC
TGGCAGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTCGCCTC
GTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCACCTCCAGCCAGGCAGTG
GCGGATCAATATGCCGACTTCCTGCGCGAAGGTTTCCACGTTGTCACGCCGAACAAAAAG
GCCAACACCTCGTCGATGGATTACTACCATCTGTTGCGTCATGCGGCTGAAAAATCGCGG
CGTAAATTCCTCTATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAA
AATCTGCTCAATGCTGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCAGGTTCGCTT
TCTTATATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACTACGCTGGCG
CGGGAAATGGGTTATACCGAACCGGATCCGCGAGATGATCTTTCTGGTATGGATGTAGCG
CGTAAACTATTAATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATTGAA
ATTGAACCTGTGCTGCCCGCAGAGTTTAACGCTGAGGGTGATGTTGCCGCTTTTATGGCG
AATCTGTCACAGCTCGACGATCTCTTTGCCGCGCGCGTGGCGAAGGCCCGTGATGAAGGA
AAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCC
GAAGTGGATGGTAATGATCCGCTGTTCAAAGTGAAAAATGGCGAAAACGCCCTGGCCTTT
TATAGCCACTATTATCAGCCGCTGCCGTTGGTGCTGCGCGGATATGGTGCGGGCAATGAC
GTTACCGCTGCCGGTGTCTTTGCCGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTC 
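
One way to keep the description lines readable is to apply the substitution only to lines that do not start with >. The sketch below is a small illustration on a made-up file (demo.fasta is a hypothetical name), not a general FASTA cleaner: like the command above, it also deletes lowercase bases and ambiguity codes such as N.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Made-up input: one header, two sequence lines with junk, one blank line.
printf '>gene_1 sample header\nATGNNCGT\nacgt-ATTA\n\n' > "$tmp/demo.fasta"

# /^>/! limits the substitution to lines that do NOT start with ">",
# so headers pass through untouched; the second sed drops empty lines.
cleaned=$(sed '/^>/!s/[^ATCG]//g' "$tmp/demo.fasta" | sed '/^$/d')
printf '%s\n' "$cleaned"
rm -rf "$tmp"
```

Report text that precedes the first header is still stripped down to its A, T, C and G letters, so that part needs separate handling.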

How to clean up your fasta file

If you have a FASTA file that contains text that is not supposed to be there, you can get rid of it with a simple grep command:

mpjanic@valkyr:~$ head -n100 test.fasta
GeneMark.hmm PROKARYOTIC (Version 3.25)
Date: Fri Feb 26 19:38:08 2016
Sequence file name: sequence-2.fasta
Model file name: /home/wangt2/MetaGeneMark_v1.mod
RBS: false

Model information: Heuristic_model_for_genetic_code_11_and_GC_51

FASTA definition line: gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
Predicted genes
   Gene    Strand    LeftEnd    RightEnd       Gene     Class
    #                                         Length
    1        +         190         273           84        1
    2        +         354        2816         2463        1
    3        +        2818        3750          933        1
    4        +        3751        5037         1287        1
    5        +        5390        5551          162        1
    6        -        5700        6476          777        1
    7        -        6546        7976         1431        1
    8        +        8255        9208          954        1
    9        +        9323        9910          588        1
   10        -        9945       10511          567        1
   11        -       10660       11373          714        1
   12        -       11399       11803          405        1
   13        +       12180       14096         1917        1
   14        +       14185       15315         1131        1
   15        -       15419       15628          210        2
   16        +       16157       17323         1167        1
   17        +       17383       18288          906        1
   18        -       18326       18751          426        1
   19        -       18959       19168          210        
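Use grep to keep only the rows that contain the letters A, T, C or G, or the > sign that header lines begin with. Any line containing at least one of these characters is kept, which is why a few lines of the report survive: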

mpjanic@valkyr:~$ grep "[>ATCG]" test.fasta | head -n  100
GeneMark.hmm PROKARYOTIC (Version 3.25)
Model file name: /home/wangt2/MetaGeneMark_v1.mod
Model information: Heuristic_model_for_genetic_code_11_and_GC_51
FASTA definition line: gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
   Gene    Strand    LeftEnd    RightEnd       Gene     Class
>gene_1|GeneMark.hmm|84_nt|+|190|273    >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCACCACCATCACCATTACCATT
ACCACAGGTAACGGTGCGGGCTGA
>gene_2|GeneMark.hmm|2463_nt|+|354|2816 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTT
GCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCC
GCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCT
TTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTCTGACGGGACTCGCCGCC
GCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTTTCGTCGACCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGATAGCATTAACGCT
GCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCG
CGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTAC
CTCGAATCTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCG
GCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTG
GTACTTGGACGCAACGGTTCCGACTACTCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCC
GATTGTTGCGAGATTTGGACGGACGTTGACGGGGTATATACCTGCGACCCGCGTCAGGTG
CCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTC
GGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGC
CTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGAT
GAAGACGAATTACCGGTCAAGGGCATTTCCAATCTGAATAATATGGCAATGTTCAGCGTT
TCCGGCCCGGGGATGAAAGGAATGGTCGGCATGGCGGCGCGCGTCTTTGCTGCAATGTCA
CGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGTATCAGTTTC
TGCGTTCCGCAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTCTACCTG
GAACTGAAAGAAGGCTTACTGGAGCCGCTGGCGGTGACGGAACGGCTGGCCATTATCTCG
GTGGTAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCGCTG
GCCCGCGCCAATATCAACATTGTCGCTATTGCTCAGGGATCTTCTGAACGCTCAATCTCT
GTCGTGGTAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTC
AATACCGATCAGGTTATCGAAGTGTTTGTGATTGGCGTCGGTGGCGTTGGCGGTGCGCTG
CTGGAGCAACTGAAGCGTCAGCAAAGCTGGTTGAAGAATAAACATATCGACTTACGTGTC
TGCGGTGTTGCTAACTCGAAGGCTCTGCTCACCAATGTGCATGGCCTAAATCTGGAAAAC
TGGCAGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTCGCCTC
GTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCACCTCCAGCCAGGCAGTG
GCGGATCAATATGCCGACTTCCTGCGCGAAGGTTTCCACGTTGTCACGCCGAACAAAAAG
GCCAACACCTCGTCGATGGATTACTACCATCTGTTGCGTCATGCGGCTGAAAAATCGCGG
CGTAAATTCCTCTATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAA
AATCTGCTCAATGCTGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCAGGTTCGCTT
TCTTATATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACTACGCTGGCG
CGGGAAATGGGTTATACCGAACCGGATCCGCGAGATGATCTTTCTGGTATGGATGTAGCG
CGTAAACTATTAATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATTGAA
ATTGAACCTGTGCTGCCCGCAGAGTTTAACGCTGAGGGTGATGTTGCCGCTTTTATGGCG
AATCTGTCACAGCTCGACGATCTCTTTGCCGCGCGCGTGGCGAAGGCCCGTGATGAAGGA
AAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCC
GAAGTGGATGGTAATGATCCGCTGTTCAAAGTGAAAAATGGCGAAAACGCCCTGGCCTTT
TATAGCCACTATTATCAGCCGCTGCCGTTGGTGCTGCGCGGATATGGTGCGGGCAATGAC
GTTACCGCTGCCGGTGTCTTTGCCGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTC
TGA
>gene_3|GeneMark.hmm|933_nt|+|2818|3750 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTC
GGGGCGGCGGTGACACCCGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGTCG
GCAGAGACATTCAGTCTCAACAACCTCGGACGCTTTGCCGATAAGCTGCCGTCAGAACCA
CGGGAAAATATCGTTTATCAGTGCTGGGAGCGTTTTTGCCAGGAGCTGGGCAAGCAAATT
CCAGTGGCGATGACTCTGGAAAAGAATATGCCGATCGGTTCGGGCTTAGGCTCCAGCGCC
In this example you additionally have to remove the first 5 lines, which tail -n +6 does by starting the output at line 6:


mpjanic@valkyr:~$ grep "[>ATCG]" test.fasta | tail -n +6 | head -n  100
>gene_1|GeneMark.hmm|84_nt|+|190|273    >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCACCACCATCACCATTACCATT
ACCACAGGTAACGGTGCGGGCTGA
>gene_2|GeneMark.hmm|2463_nt|+|354|2816 >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTT
GCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCC
GCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCT
TTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTCTGACGGGACTCGCCGCC
GCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTTTCGTCGACCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGATAGCATTAACGCT
GCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCG
CGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTAC
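
A stricter pattern can avoid the tail step. The sketch below assumes that header lines start with > and that sequence lines contain nothing but A, T, C and G; the sample file is made up for the demonstration. Lines with other characters, including a trailing carriage return from Windows line endings, would be dropped.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Made-up sample that mimics the report text followed by one gene.
printf '%s\n' \
    'GeneMark.hmm PROKARYOTIC (Version 3.25)' \
    'Model information: Heuristic_model' \
    '   Gene    Strand    LeftEnd' \
    '>gene_1|GeneMark.hmm|84_nt|+|190|273' \
    'ATGAAACGC' \
    'ACCACAGG' > "$tmp/test.fasta"

# Keep a line only if it starts with ">" or consists entirely of A/T/C/G.
fasta_only=$(grep -E '^>|^[ATCG]+$' "$tmp/test.fasta")
printf '%s\n' "$fasta_only"
rm -rf "$tmp"
```

Because the pattern is anchored at both ends for sequence lines, report lines that merely contain a capital G or T no longer match.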

Wednesday, February 24, 2016

Principal component analysis using ggplot2 and wesanderson color palette in R

Let's show how to generate a principal component analysis (PCA) plot in R and make it more appealing.

This is the head of the data.frame:

Hcasmc  Hcasmc-pdgfdd   Hcasmc-pdgfbb   Hcasmc-sf       Hcasmc-tgfb1    Hcasmc-pdgfdd   Hcasmc-pdgfbb   Hcasmc-sf       Hcasmc-tgfb1    Athero  Normal  Normal2 Normal3 Athero3
83.7839 49.2443 52.817  58.7663 68.057  44.9314 47.9035 66.4877 62.4438 150.564 173.965 86.8707 121.371 228.689
83.7839 49.2443 52.817  58.7663 68.057  44.9314 47.9035 66.4877 62.4438 150.564 173.965 86.8707 121.371 228.689
83.7839 49.2443 52.817  58.7663 68.057  44.9314 47.9035 66.4877 62.4438 0       173.965 86.8707 121.371 228.689
0       8.37066 13.914  6.40291 11.3867 9.96751 11.6739 11.559  10.0152 0       0       86.8707 121.371 0
0       30.2485 0       0       55.8487 48.8618 0       0       49.8919 0       0       0       0       0
54.9774 30.2485 38.5183 47.3038 55.8487 48.8618 42.1996 68.139  49.8919 0       0       0       0       0
54.9774 30.2485 38.5183 47.3038 55.8487 48.8618 42.1996 68.139  49.8919 0       34.9118 33.9246 33.4813 0
54.9774 30.2485 38.5183 47.3038 55.8487 48.8618 42.1996 68.139  49.8919 0       0       0       0       0
21.3106 51.5006 48.4945 41.1112 49.1787 39.7445 41.0823 31.3953 29.9609 0       0       0       0       0
In R load the data frame with read.delim, transpose it with t and use the prcomp function:

test <- read.delim("unionbedg_with_hcasmc_serum_pdgf_tgf_FINAL_nochrXY_over100_cut_100-2000_no_encode_no0_noatherobadsample_no0",header=T)

test.tr <- t(test)
pca <- prcomp(test.tr, scale=T)

pca.labels <- colnames(test)

plot(pca$x[,2], pca$x[,3],xlab="PCA2", ylab="PCA3",main="PCA for components 2&3", type="p", cex=2, pch=21, col=18, bg=13)
text(pca$x[,2], pca$x[,3],labels=pca.labels, cex= 0.8, pos=3)
This plot is generic and may not be appealing. To plot it with ggplot2 and the wesanderson color palettes instead, use:

library(ggplot2)
library (wesanderson)

PCA<- data.frame(pca$x[,2], pca$x[,3])
colnames(PCA)<-c("PC2","PC3")
PCA$CONDITION<-c("HCASMC SERUM", "HCASMC PDGFDD", "HCASMC PDGFBB", "HCASMC SERUM FREE", "HCASMC TGFB1", "HCASMC PDGFDD", "HCASMC PDGFBB", "HCASMC SERUM FREE", "HCASMC TGFB1", "ATHERO CORONARY", "NORMAL CORONARY", "NORMAL CORONARY", "NORMAL CORONARY","ATHERO CORONARY")

d<-ggplot(PCA, aes(x=PC2, y=PC3, color=CONDITION)) +geom_point(size=6)+scale_color_manual(values = c(wes_palette("Cavalcanti"),wes_palette("GrandBudapest"))) + theme_gray()

pdf("tissue_wesanderson.pdf", width=10, height=6)
d
dev.off()

Tuesday, February 23, 2016

Filtering and counting fields with awk

If you need to eliminate rows from a file based on how many of their fields are nonzero, for example rows where every field is 0, use awk.

Create a variable sum, let i go from 1 to NF, and add 1 or 0 to sum for each field using the awk ternary operator (?:). If $i is true (i.e. $i!=0), sum += adds 1; if it is false (i.e. $i==0), sum += adds 0.

sum can only be 0 if every field is 0, so those rows are removed with if (sum!=0) print.


awk '{sum=0; for (i=1; i<=NF; i++){sum += $i ? 1 : 0} if (sum!=0) print}'
To remove the lines that have exactly N nonzero fields (for example 5), substitute 0 with 5 in if (sum!=0) print.

awk '{sum=0; for (i=1; i<=NF; i++){sum += $i ? 1 : 0} if (sum!=5) print}'
Or, to remove the lines in which exactly 5 fields are 0:

awk '{sum=0; for (i=1; i<=NF; i++){sum += !$i ? 1 : 0} if (sum!=5) print}'

You can modify this code to count how many times 0, or any other number, occurs in each row. To find out how many fields of each row are 0:

awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==0 ? 1 : 0} print sum}'
To find out how many times 5 is repeated in each row:

awk '{sum=0; for (i=1; i<=NF; i++){sum += $i==5 ? 1 : 0} print sum}'
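
To see what these one-liners do, it helps to run them on a tiny input. The sketch below uses three made-up rows and an equivalent if-based form of the counting loop; the variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Three made-up rows: all zeros, one zero, one zero.
rows=$(printf '0 0 0\n1 0 2\n0 5 5\n')

# Keep only rows with at least one nonzero field.
nonzero_rows=$(printf '%s\n' "$rows" |
    awk '{n=0; for (i=1; i<=NF; i++) if ($i != 0) n++; if (n > 0) print}')

# Count the zero fields in every row and join the counts on one line.
zero_counts=$(printf '%s\n' "$rows" |
    awk '{n=0; for (i=1; i<=NF; i++) if ($i == 0) n++; print n}' |
    paste -sd ' ' -)

printf '%s\n' "$nonzero_rows"
echo "zeros per row: $zero_counts"
```

The first command drops the all-zero row and prints the other two; the second reports 3, 1 and 1 zeros for the three rows.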

In R quickly remove column from data matrix

In R, a quick way to remove one column from a data frame is to assign NULL to it.




> data
                                 V2       V3                   V4
Alzheimers_disease          1.92308 36.03875                Other
Asthma                      5.08475 56.58559                Other
Bipolar_disorder            1.91693 48.52667                Brain
Blood_pressure              6.25000 63.01744                Other
BMI                         1.06383 11.44664                Other
Body_mass_index             4.34783 66.66428                Other

data$V4 <- NULL
> data
                                 V2       V3
Alzheimers_disease          1.92308 36.03875
Asthma                      5.08475 56.58559
Bipolar_disorder            1.91693 48.52667
Blood_pressure              6.25000 63.01744
BMI                         1.06383 11.44664
Body_mass_index             4.34783 66.66428

Using scp with folder names that contain spaces

If you use scp to download files from a server and the folder names contain spaces, escape each space with a backslash and wrap the remote path in quotes:

scp mpjanic@server.stanford.edu:'/home/mpjanic/path\ to\ folder\ with\ spaces/file.txt' ./

Tuesday, February 9, 2016

Search with find command and use regex

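To find one or more files in a specific location using regular expressions, use the find command with the -regextype sed option and specify the pattern with -regex. The pattern has to match the whole path, hence the leading and trailing .* parts:

find . -regextype sed -regex ".*[sS]uper[eE]nhancer.*"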


This will find all files whose names contain:
superenhancer
Superenhancer
superEnhancer
SuperEnhancer
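
Inside a bracket expression, | is a literal character rather than alternation, so [sS] is all that is needed to accept either case. The sketch below checks this on a few made-up file names in a scratch directory; it assumes GNU find, since -regextype is a GNU extension.

```shell
#!/usr/bin/env bash
d=$(mktemp -d)
# Made-up file names: three should match, one should not.
touch "$d/superenhancer.bed" "$d/SuperEnhancer.txt" \
      "$d/superEnhancer.tsv" "$d/enhancer.bed"

# -regex is matched against the whole path, so .* is needed on both sides.
hits=$(find "$d" -type f -regextype sed -regex '.*[sS]uper[eE]nhancer.*' | wc -l)
echo "matching files: $hits"
rm -rf "$d"
```

If case should be ignored entirely, GNU find also offers -iregex, which would additionally match names such as SUPERENHANCER.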

How to display all partitions in Unix systems

Here are several ways to check the partitions on your Unix-based system.

Use the df command (short for "disk free", a standard Unix command that displays the amount of available disk space for file systems):

mpjanic@valkyr:/home/towerraid$ df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/valkyr--vg-root   15T   12T  1.6T  89% /
none                         4.0K     0  4.0K   0% /sys/fs/cgroup
udev                         126G  4.0K  126G   1% /dev
tmpfs                         26G  1.5M   26G   1% /run
none                         5.0M     0  5.0M   0% /run/lock
none                         126G     0  126G   0% /run/shm
none                         100M     0  100M   0% /run/user
/dev/sdb2                    237M  145M   80M  65% /boot
//171.65.68.192/diskstation   33T   12T   21T  37% /home/diskstation
//171.65.68.192/usbshare1-2  7.3T  681G  6.7T  10% /home/diskstation2
/dev/sda1                     33T   29T  2.1T  94% /home/towerraid
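
When only one number is needed, for example in a script that warns about a full disk, the df output can be reduced with awk. The sketch below is an illustration under a few assumptions: the target directory is arbitrary, and the POSIX output format (-P) is used so that each file system stays on a single line. File system names that contain spaces would shift the columns.

```shell
#!/usr/bin/env bash
target="."   # hypothetical directory; any path on the file system of interest

# With -P the columns are: Filesystem, 1024-blocks, Used, Available,
# Capacity, Mounted on. Line 2 is the file system that holds $target.
usage=$(df -P "$target" | awk 'NR==2 {print $5}')
echo "File system holding $target is $usage full"
```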


Use lsblk command (lists information about all available or the specified block devices):


mpjanic@valkyr:/home/towerraid$ lsblk
NAME                           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                              8:0    0  32.8T  0 disk
└─sda1                           8:1    0  32.8T  0 part  /home/towerraid
sdb                              8:16   0  14.6T  0 disk
├─sdb1                           8:17   0     1M  0 part
├─sdb2                           8:18   0   244M  0 part  /boot
└─sdb3                           8:19   0  14.6T  0 part
  └─sda3_crypt (dm-0)          252:0    0  14.6T  0 crypt
    ├─valkyr--vg-root (dm-1)   252:1    0  14.3T  0 lvm   /
    └─valkyr--vg-swap_1 (dm-2) 252:2    0   256G  0 lvm
sr0                             11:0    1  1024M  0 rom


Use the lsusb command (displays information about the USB buses in the system and the devices connected to them):


mpjanic@valkyr:/home/towerraid$ lsusb
Bus 002 Device 003: ID 0557:2221 ATEN International Co., Ltd Winbond Hermon
Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 004: ID 152d:0551 JMicron Technology Corp. / JMicron USA Technology Corp.
Bus 001 Device 003: ID 413c:2003 Dell Computer Corp. Keyboard
Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub



Use fdisk (a command-line utility that provides disk partitioning):


mpjanic@valkyr:/home/towerraid$ sudo fdisk -l

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sda: 36006.6 GB, 36006589890560 bytes
256 heads, 63 sectors/track, 4360452 cylinders, total 70325370880 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1  4294967295  2147483647+  ee  GPT

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdb: 15997.9 GB, 15997894131712 bytes
255 heads, 63 sectors/track, 1944966 cylinders, total 31245886976 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.

Disk /dev/mapper/sda3_crypt: 15997.6 GB, 15997633298432 bytes
255 heads, 63 sectors/track, 1944934 cylinders, total 31245377536 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/sda3_crypt doesn't contain a valid partition table

Disk /dev/mapper/valkyr--vg-root: 15722.8 GB, 15722797334528 bytes
255 heads, 63 sectors/track, 1911521 cylinders, total 30708588544 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/valkyr--vg-root doesn't contain a valid partition table

Disk /dev/mapper/valkyr--vg-swap_1: 274.8 GB, 274831769600 bytes
255 heads, 63 sectors/track, 33413 cylinders, total 536780800 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/valkyr--vg-swap_1 doesn't contain a valid partition table


Use the mount command and filter its output for entries that start with /dev to list the mounted block devices:


mpjanic@valkyr:/home/towerraid$ mount | grep "^/dev"
/dev/mapper/valkyr--vg-root on / type ext4 (rw,errors=remount-ro)
/dev/sdb2 on /boot type ext2 (rw)
/dev/sda1 on /home/towerraid type ext4 (rw)