Efficiently Remove Lines From FileA That Contain Strings From FileB

FileA contains lines. FileB contains words. How can I efficiently remove lines from FileA containing words found in FileB? I tried the following, and I'm not even sure if they work b

Solution 1:

I refuse to believe that Python can't at least match the performance of Perl on this one. This is my quick attempt at a more efficient Python version. I'm using sets to optimize the search part of the problem: isdisjoint() returns True when two sets have no elements in common, so a line is kept only if none of its words appears in the set of bad words. The & operator would also work, but it builds a new set of the common elements for every line.

This solution takes 12 seconds to run on my machine for a fileA with 3M lines and a fileB with 200k words; the Perl version takes 9 seconds. The biggest slowdown seems to be re.split, which nevertheless seems to be faster than string.split in this case.

Please comment on this answer if you have any suggestions to improve the speed.

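import re

# Words to filter out, one per line in fileB. Blank lines are skipped so
# that an empty string never ends up in the set.
with open('Downloads/fileB.txt') as fileb:
    bad_words = set(line.strip() for line in fileb if line.strip())

# Raw string, so that \s is not treated as a string escape.
splitter = re.compile(r"\s+")

with open('Downloads/fileA.txt') as filea, open('output.txt', 'w') as output:
    for line in filea:
        line_words = set(splitter.split(line))
        # Keep the line only if it shares no word with bad_words.
        if bad_words.isdisjoint(line_words):
            output.write(line)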
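
To make the set-based check easier to try on small inputs, here is a minimal sketch that wraps the same idea in a function. The name filter_lines and its parameters are my own and not part of the answer, and it uses str.split() rather than re.split, so its timings may differ from the figures quoted above.

```python
def filter_lines(lines, bad_words):
    """Yield the lines that contain none of the words in bad_words.

    filter_lines is a hypothetical helper. A "word" here is a
    whitespace-separated token, as in the script above.
    """
    bad = set(bad_words)
    for line in lines:
        # str.split() with no argument splits on runs of whitespace and
        # never produces empty strings.
        if bad.isdisjoint(line.split()):
            yield line


if __name__ == "__main__":
    sample = [
        "keep this line\n",
        "drop the spam here\n",
        "spammy is not spam-free\n",
    ]
    for kept in filter_lines(sample, ["spam"]):
        print(kept, end="")
```

Because it is a generator, it can be fed an open file object directly and never holds more than one line of fileA in memory.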

Solution 2:

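The commands you have look good, so maybe it's time to try a good scripting language. Try running the following Perl script and see whether it is any faster. Note that the script reads the words from fileA and the lines from fileB; swap the two file names if your files are the other way round.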

#!/usr/bin/perl
#use strict;
#use warnings;

open my $LOOKUP, "<", "fileA" or die "Cannot open lookup file: $!";
open my $MASTER, "<", "fileB" or die "Cannot open Master file: $!";
open my $OUTPUT, ">", "out" or die "Cannot create Output file: $!";

my %words;
my @l;

while (my $word = <$LOOKUP>) {
    chomp($word);
    ++$words{$word};
}

LOOP_FILE_B: while (my $line = <$MASTER>) {
    @l = split /\s+/, $line;
    for my $i (0 .. $#l) {
        if (defined $words{$l[$i]}) {
            next LOOP_FILE_B;
        }
    }
    print $OUTPUT "$line";
}

Solution 3:

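Using grep, with -v to invert the match, -F to treat the patterns as fixed strings, -w to match whole words only, and -f fileB to read the patterns from fileB: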

grep -v -Fwf fileB fileA
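
One thing worth knowing is that grep -w does not define a word the same way as the whitespace splitting in Solutions 1 and 2: it treats letters, digits, and underscores as word characters, so "foo," still counts as containing the word foo. The sketch below approximates that behaviour in Python to show the difference. The name grep_v_fw is hypothetical, empty patterns are simply skipped (an assumption, not grep's rule), and the single large alternation is meant to illustrate the semantics, not to compete on speed.

```python
import re


def grep_v_fw(lines, words):
    """Approximate `grep -v -F -w -f WORDS`: drop lines containing any
    of the given words as a whole word.

    grep_v_fw is a hypothetical name. "Whole word" means the match is
    not directly adjacent to a letter, digit, or underscore.
    """
    words = [w for w in words if w]  # assumption: ignore empty patterns
    if not words:
        return list(lines)
    # re.escape makes every word a literal string, like -F.
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return [line for line in lines if not pattern.search(line)]


if __name__ == "__main__":
    sample = ["foo, bar\n", "foobar\n", "baz\n"]
    # "foo, bar" is dropped here, although a whitespace split would see
    # the token "foo," and keep the line.
    print(grep_v_fw(sample, ["foo"]))
```

If your data has punctuation attached to words, the grep command and the two scripts can therefore produce different output for the same input files.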
