Efficiently Remove Lines From fileA That Contain a String From fileB
Solution 1:
I refuse to believe that Python can't at least match Perl's performance on this one. This is my quick attempt at a more efficient Python version. I'm using sets to optimize the search part of the problem: the & operator returns a new set with the elements common to both sets, and set.isdisjoint tests for an empty intersection without building that set.
This solution takes 12 seconds to run on my machine for a fileA with 3M lines and a fileB with 200k words; the Perl takes 9. The biggest slowdown seems to be re.split, although it is still faster than str.split in this case.
Please comment on this answer if you have any suggestions for improving the speed.
import re

filea = open('Downloads/fileA.txt')
fileb = open('Downloads/fileB.txt')
output = open('output.txt', 'w')

# Every stripped line of fileB is a word to filter on.
bad_words = set(line.strip() for line in fileb)
splitter = re.compile(r"\s")

for line in filea:
    line_words = set(splitter.split(line))
    # Keep the line only if it shares no word with bad_words.
    if bad_words.isdisjoint(line_words):
        output.write(line)

output.close()
fileb.close()
filea.close()
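For reference, the & operator mentioned above can express the same check as isdisjoint; the difference is that & materializes the intersection set, while isdisjoint can stop at the first common element. A minimal sketch with made-up inline data standing in for the two files:

```python
import re

# Hypothetical sample data standing in for fileB and fileA.
bad_words = {"spam", "eggs"}
lines = ["ham and cheese\n", "green spam\n", "plain toast\n"]

splitter = re.compile(r"\s+")

kept = []
for line in lines:
    line_words = set(splitter.split(line.strip()))
    # Equivalent to bad_words.isdisjoint(line_words), but this form
    # builds the intersection set before testing its truthiness.
    if not (bad_words & line_words):
        kept.append(line)

print(kept)  # only lines containing no bad word survive
```

On large inputs isdisjoint is the better choice for exactly this reason: it avoids allocating a throwaway set per line.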
Solution 2:
The commands you have look good, so maybe it's time to try a good scripting language. Try running the following Perl script and see if it reports back any faster.
#!/usr/bin/perl
use strict;
use warnings;

open my $LOOKUP, "<", "fileB" or die "Cannot open lookup file: $!";
open my $MASTER, "<", "fileA" or die "Cannot open Master file: $!";
open my $OUTPUT, ">", "out" or die "Cannot create Output file: $!";
my %words;
my @l;
while (my $word = <$LOOKUP>) {
chomp($word);
++$words{$word};
}
LOOP_FILE_B: while (my $line = <$MASTER>) {
    @l = split /\s+/, $line;
    for my $i (0 .. $#l) {
        if (defined $words{$l[$i]}) {
            # Skip this line as soon as any word matches.
            next LOOP_FILE_B;
        }
    }
    print $OUTPUT "$line";
}
Solution 3:
Using grep:
grep -v -Fwf fileB fileA
Here -F treats the patterns as fixed strings rather than regexes, -w matches only whole words, -f fileB reads the patterns from fileB, and -v inverts the match, so only the lines of fileA containing none of the words are printed.
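A quick way to sanity-check the command is with small throwaway files (the contents here are illustrative):

```shell
# Create tiny sample files to demonstrate the command.
printf 'ham and cheese\ngreen spam\nplain toast\n' > fileA
printf 'spam\neggs\n' > fileB

# -F: fixed strings, -w: whole words, -f: patterns from fileB,
# -v: keep only lines of fileA that match no pattern.
grep -v -Fwf fileB fileA
```

Only "ham and cheese" and "plain toast" should be printed, since "green spam" contains a word from fileB.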