Thursday, March 14, 2019

Python script to check if DNA sequence is palindrome or reverse palindrome

Check if a DNA sequence is palindrome or reverse palindrome:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 12 04:18:04 2019

@author: milospjanic
"""
# Program to check if a string
#  is palindrome or not
def reverse(Pattern):
    revcomp = []
    x = len(Pattern)
    for i in Pattern:
        x = x - 1
        revcomp.append(Pattern[x])
    return ''.join(revcomp)


def compliment(Nucleotide):
    comp = []
    for i in Nucleotide:
        if i == "t":
            comp.append("a")
        if i == "a":
            comp.append("t")
        if i == "g":
            comp.append("c")
        if i == "c":
            comp.append("g")

    return ''.join(comp)


my_str = str(input("Enter a sequence: "))
# change this value for a different output
#my_str = 'aIbohPhoBiA'

# make it suitable for caseless comparison
my_str = my_str.casefold()

# reverse the string
comp = compliment (my_str)

rev = reverse (my_str)

revcomp = compliment (reverse (my_str))

print("Complement:",comp) 

print("Reversed:",rev) 

print("Reversed Complement:",revcomp) 


# check if the string is equal to its reverse

if list(my_str) == list(rev):
   print(my_str, "is palindrome")
else:
   if list(my_str) == list(revcomp):
       print(my_str, "is reverse palindrome")
   
   else:
       print(my_str, "is not a palindrome")
   

Python script for Fibonacci sequence using while loop or recursion

Python script for Fibonacci sequence using while loop or recursion.

Using recursive function:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 12 05:32:45 2019

@author: milospjanic
"""

# Python program to display the Fibonacci sequence up to n-th term using recursive functions

def recur_fibo(n):
   """Recursive function to
   print Fibonacci sequence"""
   if n <= 1:
       return n
   else:
       return(recur_fibo(n-1) + recur_fibo(n-2))

# Change this value for a different result
#nterms = 10

# uncomment to take input from the user
nterms = int(input("How many terms? "))

# check if the number of terms is valid
if nterms <= 0:
   print("Plese enter a positive integer")
else:
   print("Fibonacci sequence:")
   for i in range(nterms):```
       print(recur_fibo(i), end=" , ")


Using while loop:


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 12 05:22:28 2019

@author: milospjanic
"""

# Program to display the Fibonacci sequence up to n-th term where n is provided by the user

# change this value for a different result
#nterms = 10

# uncomment to take input from the user
nterms = int(input("How many terms? "))

# first two terms
n1 = 0
n2 = 1
count = 0

# check if the number of terms is valid
if nterms <= 0:
   print("Please enter a positive integer")
elif nterms == 1:
   print("Fibonacci sequence upto",nterms,":")
   print(n1)
else:
   print("Fibonacci sequence upto",nterms,":")
   while count < nterms:
       print(n1,end=" , ")
       n3 = n1 + n2
       # update values
       n1 = n2
       n2 = n3
       count += 1

Tuesday, July 17, 2018

Plot and compare two scatterplots on the same graph using R and ggplot2

If you have two sets of data located in two data frames in R, and you want to plot their correlation in a scatterplot, use the following ggplot2 code:

ggplot(WT, aes(V1, EF, color=EF))+geom_point(shape = 16, size = 5, show.legend = FALSE, alpha = .4) 
+ theme_minimal() 
+ scale_color_gradient(low = "#0091ff", high = "#f0650e") 
+ theme(axis.text=element_text(size=24),axis.title=element_text(size=26)) 
+ labs(title = "GENE WT", x="", y="Percent") 
+ theme(plot.title = element_text(size = rel(2)))
+ geom_smooth(method=lm)
+ annotate(x=10,y=85,label=paste("R = ", round(cor (WT$V1,WT$EF),2), ", p=0.008869"),geom="text", size=8, col="darkred")




ggplot(KO, aes(V1, EF, color=EF))+geom_point(shape = 16, size = 5, show.legend = FALSE, alpha = .4) 
+ theme_minimal() 
+ scale_color_gradient(low = "#0091ff", high = "#f0650e") 
+ theme(axis.text=element_text(size=24),axis.title=element_text(size=26))
+ labs(title = "GENE KO", x="Day", y="Percent") 
+ theme(plot.title = element_text(size = rel(2)))
+ geom_smooth(method=lm)
+ annotate(x=10,y=85,label=paste("R = ", round(cor (KO$V1,KO$EF),2), ", p=1.655e-10"), geom="text", size=8, col="darkred")







However, a better visual comparison would be if we plot these on the same graph using the code:

ggplot() + 
+ theme_minimal() +
+ theme(axis.text=element_text(size=24),axis.title=element_text(size=26))+ labs(title = "GENE KO vs. WT", x="Day", y="Percent") + theme(plot.title = element_text(size = rel(2)))+
+ geom_point(data = WT, aes(x = V1, y = EF), color = "red",shape = 16, size = 5, show.legend = FALSE, alpha = .4)+
+ geom_point(data = NAT1KO, aes(x = V1, y = EF), color = "blue",shape = 16, size = 5, show.legend = FALSE, alpha = .4)+
+ geom_smooth(data = WT, aes(x = V1, y = EF),method=lm,color="red",fill="red")+
+ geom_smooth(data = NAT1KO, aes(x = V1, y = EF),method=lm,color="blue",fill="blue")+
+ annotate(x=10,y=85,label=paste("R = ", round(cor (WT$V1,WT$EF),2), ", p=0.008869"),geom="text", size=8, col="red")+
+ annotate(x=7,y=47,label=paste("R = ", round(cor (NAT1KO$V1,NAT1KO$EF),2), ", p=1.655e-10"), geom="text", size=8, col="blue")



Saturday, April 7, 2018

A bit of SED and regular expression magic to clear up your tables


Want to keep Gene ID and Fold change from 2nd and 6th column and remove quotations. Use SED and regular expressions.



mpjanic@zoran:~/test$ cat SLC2A.txt

"6033","SLC2A1-AS1",24.8051071979286,27.7198330991446,21.8903812967126,0.789701049729193,-0.340621486785587,0.565579800713107,1
"15656","SLC2A1",21989.7607363,25370.4928449324,18609.0286276675,0.733491018144907,-0.447148795187931,0.0135793982312018,0.274890758345199
"15657","SLC2A10",116.660701272146,109.114399777756,124.207002766537,1.13831907630452,0.186905008681256,0.536100348167181,1
"15658","SLC2A11",153.584023702827,160.671774940703,146.496272464952,0.911773536571793,-0.133252558036828,0.610646274133104,1
"15659","SLC2A12",0,0,0,NA,NA,NA,NA
"15660","SLC2A13",91.7915789084436,88.0137320553238,95.5694257615633,1.08584675970211,0.118820516938777,0.718169926714343,1
"15661","SLC2A14",26.1757135597822,34.6019488604179,17.7494782591466,0.512961808328973,-0.963076678383474,0.0836580834259747,0.696210238779407
"15662","SLC2A2",0.946102725140454,0.450180696019295,1.44202475426161,3.20321321418857,1.67951983083656,0.80556985989175,1
"15663","SLC2A3",1801.59770325241,1801.22866521645,1801.96674128837,1.00040976256161,0.000591041330547107,0.91557362726645,1
"15664","SLC2A4",64.9152045965179,47.8981046043178,81.932304588718,1.71055421222935,0.774463827856209,0.0320174623595299,0.440346477750378
"15665","SLC2A4RG",2839.51257874418,2552.43962703032,3126.58553045805,1.22494005239048,0.292711146586849,0.106250785209328,0.755271734116694
"15666","SLC2A5",0.488574210205611,0.450180696019295,0.526967724391927,1.17056934926712,0.227210408076608,1,1
"15667","SLC2A6",849.67804667658,741.433588192089,957.922505161072,1.29198692966806,0.369591475141396,0.0690179423560097,0.640745754210812
"15668","SLC2A7",0,0,0,NA,NA,NA,NA
"15669","SLC2A8",306.652668694167,317.104693305773,296.20064408256,0.934078398508418,-0.0983844524549534,0.643201071003747,1
"15670","SLC2A9",312.030609161521,297.26095251567,326.800265807372,1.09937165659235,0.136679190181628,0.580291149287337,1


mpjanic@zoran:~/test$ sed -E "s/\"[0-9]*\",\"//g" SLC2A.txt | sed -E "s/\",[0-9.]*,[0-9.]*,[0-9.]*,/ /g" | sed -E "s/,.*//g"
SLC2A1-AS1 0.789701049729193
SLC2A1 0.733491018144907
SLC2A10 1.13831907630452
SLC2A11 0.911773536571793
SLC2A12 NA
SLC2A13 1.08584675970211
SLC2A14 0.512961808328973
SLC2A2 3.20321321418857
SLC2A3 1.00040976256161
SLC2A4 1.71055421222935
SLC2A4RG 1.22494005239048
SLC2A5 1.17056934926712
SLC2A6 1.29198692966806
SLC2A7 NA
SLC2A8 0.934078398508418
SLC2A9 1.09937165659235

Tuesday, March 6, 2018

Code for parsing vcf files

Code to clear out vcf files to get only genotypes calls.

For example, in this vcf file we need to keep only first parameter indicating phased genotypes.

cat rs
0|0:0:1,0,0 0|1:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|1:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|1:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 1|1:2:0,0,1 1|1:2:0,0,1 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 1|1:2:0,0,1 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|1:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 0|1:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 1|1:2:0,0,1 1|0:1:0,1,0 0|1:1:0,1,0 1|0:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0
0|0:0:1,0,0 0|1:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|1:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 1|1:2:0,0,1 1|1:2:0,0,1 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 1|1:2:0,0,1 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|1:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 1|0:1:0,1,0 0|1:1:0,1,0 1|0:1:0,1,0 0|1:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 0|1:1:0,1,0 1|0:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0
0|0:0:1,0,0 0|1:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 0|1:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 1|1:2:0,0,1 1|1:2:0,0,1 0|0:0:1,0,0 0|0:0:1,0,0 1|1:2:0,0,1 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 0|1:1:0,1,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 1|1:2:0,0,1 0|1:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|1:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 1|0:1:0,1,0 1|0:1:0,1,0 0|1:1:0,1,0 1|0:1:0,1,0 0|1:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|1:1:0,1,0 1|0:1:0,1,0 0|1:1:0,1,0 1|0:1:0,1,0 1|0:1:0,1,0 1|0:1:0,1,0
1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|0:1:0,1,0 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 0|1:1:0,1,0 1|1:2:0,0,1 1|1:2:0,0,1 0|1:1:0,1,0 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|0:1:0,1,0 1|1:2:0,0,1 1|1:2:0,0,1 0|1:1:0,1,0 1|1:2:0,0,1 1|1:2:0,0,1 1|1:1.99:0,0.01,0.99 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|0:1:0,1,0 1|1:2:0,0,1 1|0:1:0,1,0 0|1:1:0,1,0 1|1:2:0,0,1 0|1:1:0,1,0 1|1:2:0,0,1 1|1:2:0,0,1 0|1:1:0,1,0 0|1:1:0,1,0 1|1:2:0,0,1 1|1:2:0,0,1 1|0:1.4:0,0.6,0.4 1|1:2:0,0,1 0|1:1:0,1,0 1|0:1:0,1,0 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|1:2:0,0,1 1|0:1:0,1,0

Using sed, we could remove : and a character class [0-9] with any number of repetitions *. However, this did not clear out decimal numbers present in some columns.

sed -E 's/:[0-9]*:[0-9]*,[0-9]*,[0-9]*//g' rs
0|0 0|1 0|0 1|0 0|1 0|0 1|0 0|1 0|0 0|0 0|0 1|1 1|1 0|0 0|0 0|0 0|0 0|1 1|0 1|0 0|0 0|0 1|0 0|0 0|0 0|1 1|0 1|1 0|0 0|0 1|0 0|0 1|0 0|1 1|0 0|0 1|0 0|0 0|0 1|0 0|0 0|1 1|0 0|1 0|0 0|0 0|1 1|0 0|0 0|0 0|0 0|0 1|1 1|0 0|1 1|0 1|0 0|0
0|0 0|1 0|0 0|0 0|1 0|0 1|0 0|1 0|0 0|0 0|0 1|1 1|1 0|0 0|0 0|0 0|0 0|1 1|0 1|0 0|0 0|0 1|0 0|0 0|0 0|1 1|0 1|1 0|0 0|0 1|0 0|0 1|0 0|1 1|0 0|0 1|0 0|0 0|0 1|0 1|0 0|1 1|0 0|1 0|0 0|0 0|1 1|0 0|0 0|0 0|0 0|0 0|1 1|0 0|1 1|0 1|0 0|0
0|0 0|1 0|0 0|0 0|0 0|0 1|0 0|1 0|0 0|0 0|0 1|1 1|1 0|0 0|0 1|1 0|0 0|1 1|0 1|0 0|0 0|0 1|0 0|1 0|0 0|1 1|0 1|1 0|1 0|0 1|0 0|0 1|0 0|1 1|0 0|0 1|0 0|0 0|0 1|0 1|0 0|1 1|0 0|1 0|0 0|0 0|1 1|0 0|0 0|0 0|0 0|0 0|1 1|0 0|1 1|0 1|0 1|0
1|1 1|1 1|1 1|1 1|1 1|0 1|1 1|1 1|1 1|1 1|1 1|1 1|1 0|1 1|1 1|1 0|1 1|1 1|1 1|1 1|1 1|1 1|0 1|1 1|1 0|1 1|1 1|1 1|1:1.99:0,0.01,0.99 1|1 1|1 1|1 1|0 1|1 1|0 0|1 1|1 0|1 1|1 1|1 0|1 0|1 1|1 1|1 1|0:1.4:0,0.6,0.4 1|1 0|1 1|0 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|0

To remove decimal points, use character class [0-9.] with any number of repetitions *.

sed -E 's/:[0-9.]*:[0-9.]*,[0-9.]*,[0-9.]*//g' rs
0|0 0|1 0|0 1|0 0|1 0|0 1|0 0|1 0|0 0|0 0|0 1|1 1|1 0|0 0|0 0|0 0|0 0|1 1|0 1|0 0|0 0|0 1|0 0|0 0|0 0|1 1|0 1|1 0|0 0|0 1|0 0|0 1|0 0|1 1|0 0|0 1|0 0|0 0|0 1|0 0|0 0|1 1|0 0|1 0|0 0|0 0|1 1|0 0|0 0|0 0|0 0|0 1|1 1|0 0|1 1|0 1|0 0|0
0|0 0|1 0|0 0|0 0|1 0|0 1|0 0|1 0|0 0|0 0|0 1|1 1|1 0|0 0|0 0|0 0|0 0|1 1|0 1|0 0|0 0|0 1|0 0|0 0|0 0|1 1|0 1|1 0|0 0|0 1|0 0|0 1|0 0|1 1|0 0|0 1|0 0|0 0|0 1|0 1|0 0|1 1|0 0|1 0|0 0|0 0|1 1|0 0|0 0|0 0|0 0|0 0|1 1|0 0|1 1|0 1|0 0|0
0|0 0|1 0|0 0|0 0|0 0|0 1|0 0|1 0|0 0|0 0|0 1|1 1|1 0|0 0|0 1|1 0|0 0|1 1|0 1|0 0|0 0|0 1|0 0|1 0|0 0|1 1|0 1|1 0|1 0|0 1|0 0|0 1|0 0|1 1|0 0|0 1|0 0|0 0|0 1|0 1|0 0|1 1|0 0|1 0|0 0|0 0|1 1|0 0|0 0|0 0|0 0|0 0|1 1|0 0|1 1|0 1|0 1|0
1|1 1|1 1|1 1|1 1|1 1|0 1|1 1|1 1|1 1|1 1|1 1|1 1|1 0|1 1|1 1|1 0|1 1|1 1|1 1|1 1|1 1|1 1|0 1|1 1|1 0|1 1|1 1|1 1|1 1|1 1|1 1|1 1|0 1|1 1|0 0|1 1|1 0|1 1|1 1|1 0|1 0|1 1|1 1|1 1|0 1|1 0|1 1|0 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|0


Monday, January 29, 2018

Calculate per gene average expression from mastertable using awk/bash scripting

If you want to calculate average expression per each gene across all conditions in a mastertable use awk:

mpjanic@zoran:~$ head mastertable -n25
        TQ6     TQ7     TQ8     TQ9     TQ10    TQ11
lnc-CCDC77-4:1  0       0       0       0       0       0
lnc-COX10-9:1   0       0       0       0       0       0
lnc-MAGEB2-1:1  0       0       0       0       0       0
lnc-TMEM99-2:1  0       0       0       0       0       0
lnc-COX10-9:2   0       0       0       0       0       0
DDN-AS1:2       0       0       0       0       0       0
lnc-TMEM99-2:2  1       0       0       0       0       0
lnc-SPRY4-3:1   0       0       0       0       0       0
DDN-AS1:3       0       0       0       0       0       0
lnc-TMEM99-2:3  0       0       0       0       0       0
DDN-AS1:4       28      32      10      31      2       13
DDN-AS1:5       0       0       0       0       0       0
lnc-ZNF516-4:10 0       0       0       0       0       0
DDN-AS1:6       0       0       0       0       0       0
lnc-ZNF516-4:11 0       0       0       0       0       0
GSEC:2  15      9       16      10      9       34
lnc-AATK-AS1-2:1        15      17      2       19      28      24
lnc-PLCH1-5:1   0       0       0       0       0       0
GSEC:3  0       0       0       0       0       0
lnc-MFSD9-7:1   53      27      22      41      47      41
lnc-PSMC1-1:1   29      30      54      18      23      58
GSEC:4  0       0       0       0       0       0
GSEC:5  0       0       0       0       0       0
lnc-ZNF780B-1:1 124     125     80      123     110     148
Use awk to print the first line, then for each field starting from NF>2 assuming the first field is the gene name, perform {sum=0; for (i=2; i<=NF; i++) sum+=$i; print $1, sum/(NF-1)}.

Then, sort -gr -k2, to sort in reverse order and with -g option (--general-numeric-sort):


mpjanic@zoran:~$ awk 'NR == 1 { print "lncRNA", "Average"; next }    # Print a heading row\
> NF > 2 { sum=0; for (i=2; i<=NF; i++) sum+=$i; print $1, sum/(NF-1) }' mastertable | sort -gr -k2| head -n 20
lnc-SGCE-3:1 157963
lnc-EIF2AK4-6:1 120530
lnc-SLC3A2-6:1 110467
lnc-ATIC-14:1 66120.8
lnc-TRIM69-3:1 59894.5
lnc-TRDMT1-5:2 52209.8
lnc-ANKRD55-6:1 44934.3
lnc-LRRTM4-6:1 44869.3
lnc-LYN-8:1 39859.2
lnc-VGF-4:1 37230.2
lnc-VGF-3:1 32908.2
lnc-VAT1-4:1 27177.7
lnc-SH3D19-2:1 22266.8
IGFBP7-AS1:16 21963.5
lnc-HSD17B7-1:2 21429.5
lnc-BTD-2:1 21348.2
lnc-ARID2-11:1 21063.7
lnc-CBY3-3:2 19925.3
lnc-DYNC2H1-4:1 19492.8
lnc-C6orf120-1:7 18986.2
Then save it in a file expression_average_per_gene:

mpjanic@zoran:~$ awk 'NR == 1 { print "lncRNA", "Average"; next }    # Print a heading row\
NF > 2 { sum=0; for (i=2; i<=NF; i++) sum+=$i; print $1, sum/(NF-1) }' mastertable | sort -gr -k2| head -n 20