Sunday, October 2, 2016

Simple solution to select Position Weight Matrix or ChIP-Seq sites in phase and out of phase

Suppose you have two lists of position weight matrix (PWM) or ChIP-Seq protein binding sites and you paired them with bedtools closest, using option -d or -D to output the distance between the two features. The result looks like this, with the distance in the last column (column 13):


chr1    13747   13753   TGCGTG  1069    +       chr1    14228   14239   AACAGCTGCCC     1440    +       476
chr1    17712   17718   TGCGTG  1069    -       chr1    14228   14239   AACAGCTGCCC     1440    +       3474
chr1    19157   19163   TGCGTG  1069    +       chr1    14228   14239   AACAGCTGCCC     1440    +       -4919
chr1    22861   22867   TGCGTG  1069    -       chr1    28916   28927   AGCAGCTGCGG     1421    +       -6050
chr1    26478   26484   TGCGTG  1069    +       chr1    28916   28927   AGCAGCTGCGG     1421    +       2433
chr1    32585   32591   TGCGTG  1069    -       chr1    32034   32045   AACAGCTGCAG     1579    +       541
chr1    33596   33602   TGCGTG  1069    +       chr1    32034   32045   AACAGCTGCAG     1579    +       -1552
chr1    36978   36984   TGCGTG  1069    +       chr1    32034   32045   AACAGCTGCAG     1579    +       -4934
chr1    41096   41102   TGCGTG  1069    +       chr1    32034   32045   AACAGCTGCAG     1579    +       -9052
chr1    80948   80954   TGCGTG  1069    +       chr1    32034   32045   AACAGCTGCAG     1579    +       -48904
chr1    87253   87259   TGCGTG  1069    +       chr1    131307  131318  AACAGCTGCCA     1433    -       44049
chr1    89356   89362   TGCGTG  1069    -       chr1    131307  131318  AACAGCTGCCA     1433    -       -41946
chr1    92760   92766   TGCGTG  1069    -       chr1    131307  131318  AACAGCTGCCA     1433    -       -38542
chr1    97039   97045   TGCGTG  1069    +       chr1    131307  131318  AACAGCTGCCA     1433    -       34263
chr1    104193  104199  TGCGTG  1069    -       chr1    131307  131318  AACAGCTGCCA     1433    -       -27109
chr1    106687  106693  TGCGTG  1069    -       chr1    131307  131318  AACAGCTGCCA     1433    -       -24615
chr1    109572  109578  TGCGTG  1069    +       chr1    131307  131318  AACAGCTGCCA     1433    -       21730
chr1    140637  140643  TGCGTG  1069    +       chr1    135896  135907  AACAGCTGGGC     1414    -       -4731
chr1    158167  158173  TGCGTG  1069    -       chr1    149357  149368  AACAGCTGCTA     1493    -       8800
chr1    162890  162896  TGCGTG  1069    -       chr1    172532  172543  AGCAGCTGCTG     1453    -       -9637
chr1    165259  165265  TGCGTG  1069    +       chr1    172532  172543  AGCAGCTGCTG     1453    -       7268
chr1    228669  228675  TGCGTG  1069    +       chr1    172533  172544  AGCAGCTGCTG     1453    +       -56126
chr1    231500  231506  TGCGTG  1069    -       chr1    172533  172544  AGCAGCTGCTG     1453    +       58957
chr1    239088  239094  TGCGTG  1069    -       chr1    172533  172544  AGCAGCTGCTG     1453    +       66545
chr1    243374  243380  TGCGTG  1069    +       chr1    172533  172544  AGCAGCTGCTG     1453    +       -70831
chr1    250508  250514  TGCGTG  1069    -       chr1    327446  327457  AACAGCTGGGC     1414    +       -76933
chr1    253002  253008  TGCGTG  1069    -       chr1    327446  327457  AACAGCTGGGC     1414    +       -74439
chr1    255893  255899  TGCGTG  1069    +       chr1    327446  327457  AACAGCTGGGC     1414    +       71548
chr1    395412  395418  TGCGTG  1069    -       chr1    388540  388551  AACAGCTGCAG     1579    -       6862
chr1    404658  404664  TGCGTG  1069    +       chr1    388540  388551  AACAGCTGCAG     1579    -       -16108
chr1    411852  411858  TGCGTG  1069    -       chr1    388540  388551  AACAGCTGCAG     1579    -       23302
chr1    436821  436827  TGCGTG  1069    -       chr1    441048  441059  GACAGCTGCTG     1542    +       -4222
chr1    436923  436929  TGCGTG  1069    -       chr1    441048  441059  GACAGCTGCTG     1542    +       -4120
chr1    437027  437033  TGCGTG  1069    -       chr1    441048  441059  GACAGCTGCTG     1542    +       -4016
chr1    437127  437133  TGCGTG  1069    +       chr1    441048  441059  GACAGCTGCTG     1542    +       3916
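For reference, a table like the one above could be produced roughly as follows. This is only a sketch: the file names (toy_PWM1.bed, toy_PWM2.bed and the files derived from them) and the three toy records are made up for illustration, and the last step only runs if bedtools is installed. Two details are worth keeping in mind: bedtools closest expects position-sorted input, and -D takes an argument (ref, a or b) that defines which side counts as upstream.

```shell
# Toy 6-column BED files (hypothetical names and records, tab-separated).
printf 'chr1\t17712\t17718\tTGCGTG\t1069\t-\nchr1\t13747\t13753\tTGCGTG\t1069\t+\n' > toy_PWM1.bed
printf 'chr1\t14228\t14239\tAACAGCTGCCC\t1440\t+\n' > toy_PWM2.bed

# bedtools closest needs both inputs sorted by chromosome, then by start.
sort -k1,1 -k2,2n toy_PWM1.bed > toy_PWM1.sorted.bed
sort -k1,1 -k2,2n toy_PWM2.bed > toy_PWM2.sorted.bed

if command -v bedtools >/dev/null 2>&1; then
    # -d appends the unsigned distance as the last column (column 13 here).
    bedtools closest -d -a toy_PWM1.sorted.bed -b toy_PWM2.sorted.bed \
        > toy_PWM1_PWM2_bedtools_closest
    # For signed distances, use -D with an argument instead, for example:
    # bedtools closest -D ref -a toy_PWM1.sorted.bed -b toy_PWM2.sorted.bed
else
    echo "bedtools not found; skipping the closest step" >&2
fi
```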
You can use this output to select sites that are in phase with respect to the DNA rotational pitch and sites that are out of phase, i.e. located on opposite sides of the DNA helix. In these examples, phased sites are located no more than about 50 bp apart, and the loops step through the distances in 10 bp increments, approximating one helical turn (about 10.5 bp in B-DNA). For bedtools closest option -d:

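for j in {1..52..10}   # 1, 11, 21, 31, 41, 51
do
echo "$j"
# keep the rows whose distance (column 13) equals $j
awk -F '\t' -v d="$j" '$13 == d' PWM1_PWM2_bedtools_closest > "$j"_distance_even
# keep only chromosome, start and end of the first feature
cut -f 1-3 "$j"_distance_even > "$j"_distance_cut_even
done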

cat *_cut_even > even_merge

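for j in {5..56..10}   # 5, 15, 25, 35, 45, 55
do
echo "$j"
# keep the rows whose distance (column 13) equals $j
awk -F '\t' -v d="$j" '$13 == d' PWM1_PWM2_bedtools_closest > "$j"_distance_odd
# keep only chromosome, start and end of the first feature
cut -f 1-3 "$j"_distance_odd > "$j"_distance_cut_odd
done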

cat *_cut_odd > odd_merge
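
The two loops above can also be written as one awk pass that sorts rows by their distance modulo 10. This is a sketch under the same assumptions as the loops (distance in column 13, distances 1, 11, ..., 51 written to the "even" file and 5, 15, ..., 55 to the "odd" file). The input name toy_closest_d, its rows, and the output names toy_even_merge and toy_odd_merge are invented for the demonstration.

```shell
# Toy stand-in for PWM1_PWM2_bedtools_closest (hypothetical rows, tab-separated).
# Arguments per row: chrom, start, end, start and end of the second site, distance.
printf '%s\t%s\t%s\tTGCGTG\t1069\t+\tchr1\t%s\t%s\tAACAGCTGCCC\t1440\t+\t%s\n' \
    chr1 100 106 116 127 11 \
    chr1 200 206 230 241 25 \
    chr1 300 306 335 346 30 \
    chr1 400 406 466 477 61 > toy_closest_d

# One pass over the file: route columns 1-3 to the matching output file.
awk -F '\t' -v OFS='\t' '
    $13 >= 1 && $13 <= 51 && $13 % 10 == 1 { print $1, $2, $3 > "toy_even_merge" }
    $13 >= 5 && $13 <= 55 && $13 % 10 == 5 { print $1, $2, $3 > "toy_odd_merge" }
' toy_closest_d
```

Unlike the loops, this keeps the rows in input order rather than grouping them by distance, and it leaves no per-distance intermediate files behind.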
For bedtools closest option -D, which reports signed distances (negative values for upstream features) and therefore lets you select sites on both sides:

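for j in {-49..52..10}   # -49, -39, ..., -9, 1, 11, ..., 51
do
echo "$j"
awk -F '\t' -v d="$j" '$13 == d' PWM1_PWM2_bedtools_closest > ./"$j"_distance_even
# the ./ prefix keeps file names that start with "-" from being read as options
cut -f 1-3 ./"$j"_distance_even > ./"$j"_distance_cut_even
done

cat ./*_cut_even > even_merge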

for j in {-45..56..10}   # -45, -35, ..., -5, 5, 15, ..., 55
do
echo "$j"
awk -F '\t' -v d="$j" '$13 == d' PWM1_PWM2_bedtools_closest > ./"$j"_distance_odd
# the ./ prefix keeps file names that start with "-" from being read as options
cut -f 1-3 ./"$j"_distance_odd > ./"$j"_distance_cut_odd
done

cat ./*_cut_odd > odd_merge
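
With signed distances, the same selection can be expressed as a remainder test, but awk's % operator returns a negative remainder for negative numbers, so the value has to be shifted back into the range 0 to 9 first. The sketch below labels each row instead of splitting it into files. The ranges mirror the two -D loops above, while the input name toy_closest_D, its rows, and the output name toy_phase_labels are hypothetical.

```shell
# Toy stand-in for a bedtools closest -D result (hypothetical rows, tab-separated).
# Arguments per row: start, end, start and end of the second site, signed distance.
printf 'chr1\t%s\t%s\tTGCGTG\t1069\t+\tchr1\t%s\t%s\tAACAGCTGCCC\t1440\t+\t%s\n' \
    1000 1006  941  952 -49 \
    2000 2006 1945 1956 -45 \
    3000 3006 3006 3017   1 \
    4000 4006 4060 4071  55 \
    5000 5006 4980 4991 -10 > toy_closest_D

awk -F '\t' -v OFS='\t' '
{
    # Non-negative remainder, so that -49 and 51 both give 1.
    r = (($13 % 10) + 10) % 10
    if      (r == 1 && $13 >= -49 && $13 <= 51) print $1, $2, $3, "even"
    else if (r == 5 && $13 >= -45 && $13 <= 55) print $1, $2, $3, "odd"
}' toy_closest_D > toy_phase_labels
```

Rows that match neither class, such as the one with distance -10, are simply dropped.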
