Detect Whether Sequence Is A Multiple Of A Subsequence In Python
Solution 1:
I believe I have an O(n) time solution (actually 2n+r, n is length of tuple, r is sub tuplle) which does not use suffix trees, but uses a string matching algorithm (like KMP, which you should find off-the shelf).
We use the following little known theorem:
If x,y are strings oversome alphabet,
then xy = yx if andonly if x = z^k and y = z^l forsome string z and integers k,l.
I now claim that, for the purposes of our problem, this means that all we need to do is determine if the given tuple/list (or string) is a cyclic shift of itself!
To determine if a string is a cyclic shift of itself, we concatenate it with itself (it does not even have to be a real concat, just a virtual one will do) and check for a substring match (with itself).
For a proof of that, suppose the string is a cyclic shift of itself.
The we have that the given string y = uv = vu. Since uv = vu, we must have that u = z^k and v= z^l and hence y = z^{k+l} from the above theorem. The other direction is easy to prove.
Here is the python code. The method is called powercheck.
def powercheck(lst):
count = 0
position = 0for pos in KnuthMorrisPratt(double(lst), lst):
count += 1
position = pos
ifcount== 2:
breakreturn lst[:position]
def double(lst):
for i in range(1,3):
for elem in lst:
yield elem
def main():
print powercheck([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1])if__name__== "__main__":
main()
And here is the KMP code which I used (due to David Eppstein).
# Knuth-Morris-Pratt string matching# David Eppstein, UC Irvine, 1 Mar 2002defKnuthMorrisPratt(text, pattern):
'''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''# allow indexing into pattern and protect against change during yield
pattern = list(pattern)
# build table of shift amounts
shifts = [1] * (len(pattern) + 1)
shift = 1for pos inrange(len(pattern)):
while shift <= pos and pattern[pos] != pattern[pos-shift]:
shift += shifts[pos-shift]
shifts[pos+1] = shift
# do the actual search
startPos = 0
matchLen = 0for c in text:
while matchLen == len(pattern) or \
matchLen >= 0and pattern[matchLen] != c:
startPos += shifts[matchLen]
matchLen -= shifts[matchLen]
matchLen += 1if matchLen == len(pattern):
yield startPos
For your sample this outputs
[1,0,1,1]
as expected.
I compared this against shx2's code(not the numpy one), by generating a random 50 bit string, then replication to make the total length as 1 million. This was the output (the decimal number is the output of time.time())
1362988461.75
(50, [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1])
1362988465.9650[1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1]1362988487.14
The above method took ~4 seconds, while shx2's method took ~21 seconds!
Here was the timing code. (shx2's method was called powercheck2).
def rand_bitstring(n):
rand = random.SystemRandom()
lst = []
for j in range(1, n+1):
r = rand.randint(1,2)
if r == 2:
lst.append(0)
else:
lst.append(1)
return lst
def main():
lst = rand_bitstring(50)*200000printtime.time()
print powercheck(lst)
printtime.time()
powercheck2(lst)
printtime.time()
Solution 2:
The following solution is O(N^2), but has the advantage of not creating any copies (or slices) of your data, as it is based on iterators.
Depending on the size of your input, the fact you avoid making copies of the data can result in a significant speed-up, but of course, it would not scale as well for huge inputs as algorithms with lower complexity (e.g. O(N*logN)).
[This is the second revision of my solution, the first one is given below. This one is simpler to understand, and is more along the lines of OP's tuple-multiplication, only using iterators.]
from itertools import izip, chain, tee
defiter_eq(seq1, seq2):
""" assumes the sequences have the same len """returnall( v1 == v2 for v1, v2 in izip(seq1, seq2) )
defdup_seq(seq, n):
""" returns an iterator which is seq chained to itself n times """return chain(*tee(seq, n))
defis_reps(arr, slice_size):
iflen(arr) % slice_size != 0:
returnFalse
num_slices = len(arr) / slice_size
return iter_eq(arr, dup_seq(arr[:slice_size], num_slices))
s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
for i inrange(1,len(s)):
if is_reps(s, i):
print i, s[:i]
break
[My original solution]
from itertools import islice
defis_reps(arr, num_slices):
iflen(arr) % num_slices != 0:
returnFalse
slice_size = len(arr) / num_slices
for i in xrange(slice_size):
iflen(set( islice(arr, i, None, num_slices) )) > 1:
returnFalsereturnTrue
s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
for i inrange(1,len(s)):
if is_reps(s, i):
print i, s[:i]
break
You can avoid the call to set()
by using something like:
defis_iter_unique(seq):
""" a faster version of testing len(set(seq)) <= 1 """
seen = set()
for x in seq:
seen.add(x)
iflen(seen) > 1:
returnFalsereturnTrue
and replacing this line:
iflen(set( islice(arr, i, None, num_slices) )) > 1:
with:
if not is_iter_unique(islice(arr, i, None, num_slices)):
Solution 3:
Simplifying Knoothe's solution. His algorithm is right, but his implementation is too complex. This implementation is also O(n).
Since your array is only composed of ones and zeros, what I do is use existing str.find implementation (Bayer Moore) to implement Knoothe's idea. It's suprisingly simpler and amazingly faster at runtime.
deff(s):
s2 = ''.join(map(str, s))
return s[:(s2+s2).index(s2, 1)]
Solution 4:
Here's another solution (competing with my earlier iterators-based solution), leveraging numpy.
It does make a (single) copy of your data, but taking advantage of the fact your values are 0s and 1s, it is super-fast, thanks to numpy's magics.
import numpy as np
defis_reps(arr, slice_size):
iflen(arr) % slice_size != 0:
returnFalse
arr = arr.reshape((-1, slice_size))
return (arr.all(axis=0) | (~arr).all(axis=0)).all()
s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1) * 1000
a = np.array(s, dtype=bool)
for i inrange(1,len(s)):
if is_reps(a, i):
print i, s[:i]
break
Solution 5:
Just a different approach to the problem
I first determine all the factors of the length and then split the list and check if all the parts are same
>>> deff(s):
deffactors(n):
#http://stackoverflow.com/a/6800214/977038returnset(reduce(list.__add__,
([i, n//i] for i inrange(2, int(n**0.5) + 1) if n % i == 0)))
_len = len(s)
for fact inreversed(list(factors(_len))):
compare_set = set(izip(*[iter(s)]*fact))
iflen(compare_set) == 1:
return compare_set
>>> f(t)
set([(1, 0, 1, 1)])
Post a Comment for "Detect Whether Sequence Is A Multiple Of A Subsequence In Python"