Sunday, September 18, 2016

Simple solutions of filtering by sum and repeats with awk scripting

In a file lets says you need to eliminate rows that have a certain sum or that have certain number repeated N times, use awk.

Example content of file.name:
cat file.name
0 0 0 0
0 1 1 1
0 3 4 9
5 5 5 5
5 5 5 5
0 0 0 0
1 1 3 4

1. To filter out lines that have all 0s.

Create variable sum, let i go from 1 to NF, and add 0 or 1 to sum depending on the awk ternary operator (?:).  With \$i to mark field content, add to sum 0 or 1, using sum += \$i ? 1 : 0. In brief, if \$i (i.e if \$i!=0) use 1 for sum+=, if false (i.e. if \$i=0) use 0 for sum+=.

In case sum=0, the only time this would happen is if every field is 0, thus remove those fields with if (sum!=0) print.

`awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i ? 1 : 0} if (sum!=0) print}' file.name`
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i ? 1 : 0} if (sum!=0) print}' file.name
0 1 1 1
0 3 4 9
5 5 5 5
5 5 5 5
1 1 3 4

2. To filter out lines that do not contain 0

To filter out only those lines that have non 0 numbers repeated N times (for example 4) substitute 0 with 4 in if (sum!=0) print. This code will basically keep only those lines that contain at least one 0.

`awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i ? 1 : 0} if (sum!=4) print}' file.name`
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i ? 1 : 0} if (sum!=4) print}' file.name
0 0 0 0
0 1 1 1
0 3 4 9
0 0 0 0

To filter out those lines that have non-zero number repeated 3 times use:

awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i ? 1 : 0} if (sum!=3) print}' file.name
0 0 0 0
5 5 5 5
5 5 5 5
0 0 0 0
1 1 3 4

To filter out lines that have non-zero number repeated 3 or 4 times use:

awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i ? 1 : 0} if (sum<3) print}' file.name
0 0 0 0
0 0 0 0

3. To filter out lines that do not contain certain number

Modification of this code is easy, i.e. to keep lines that contain at least one number 3, change the condition in ternary operator (?:) to \$i==3:

`awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==3 ? 1 : 0} if (sum!=0) print}' file.name`
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==3 ? 1 : 0} if (sum!=0) print}' file.name
0 3 4 9
1 1 3 4

4. To filter out lines that contain certain number

`awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==3 ? 1 : 0} if (sum==0) print}' file.name`
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==3 ? 1 : 0} if (sum==0) print}' file.name
0 0 0 0
0 1 1 1
5 5 5 5
5 5 5 5
0 0 0 0

5. To filter out lines that contain ceratin number repeated N times

For example to keep only lines that contain 1 repeated 3 times:

`awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==1 ? 1 : 0} if (sum==3) print}' file.name`
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==1 ? 1 : 0} if (sum==3) print}' file.name
0 1 1 1

6. To filter out lines that do not contain number repeated N times

For example to remove only lines that contain 1 repeated 3 times:

`awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==1 ? 1 : 0} if (sum!=3) print}' file.name`
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==1 ? 1 : 0} if (sum!=3) print}' file.name
0 0 0 0
0 3 4 9
5 5 5 5
5 5 5 5
0 0 0 0
1 1 3 4

7. To count how many times certain number is repeated in each row

You can modify this code to count how many times in a row you have repeated 0, or any other number. To find out how many times 0 is in fields of each row:

`awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==0 ? 1 : 0} print sum}' file.name`
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==0 ? 1 : 0} print sum}' file.name
4
1
1
0
0
4
0

To find out how many times 5 is repeated in each row:

awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==5 ? 1 : 0} print sum}' file.name
0
0
0
4
4
0
0

8. To count how many times certain combination of numbers is repeated in each row

You can modify this code to count how many times in a row you have repeated two or more numbers, for example 1 or 5. Change the code so that if \$i is either 1 or 5 sum will be increased by 1.

`awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==1||\$i==5 ? 1 : 0} print sum}' file.name`
Example:
awk '{sum=0; for (i=1; i<=NF; i++){sum += \$i==1||\$i==5 ? 1 : 0} print sum}' file.name
0
3
0
4
4
0
2