Aug 10, 2012

Longest Common Subsequence in Python (version 3.0)

This is my new implementation of the Longest Common Subsequence algorithm, and it's far better than my two previous attempts. This algorithm has less memory usage and is faster than the others because it just compares the common elements.


This algorithm has 3 main variables: the first index of a and b (ia and ib) in wich the comparisson must begin, and the number of elements to compare in each phase (s).

  • Phase 1: The number of elements to compare is equal to 1, the last index of a with the first index of b, and for every iteration the number of elements to compare increases by one. then ia starts with the last index of a and decreases by one in every iteration, while ib is always 0.
  • Phase 2: The number of elements to compare becomes constant and equal to the size of the shortest sequence. if ia is not equal to the first element of a, then decrease ia by one until it becomes equal to 0, and stays constant. Then begin to increase ib by one.
  • Phase 3: For every iteration, decreases by one the number of elements to compare. ia is always equal to the first element of a, while increases ib by one.
The number of iterations is equal to:
$$l = len(a) + len(b) - 1$$
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Longest Common Subsequence, 3 phases comparisson algorithm.
# Copyright (C) 2012  Gonzalo Exequiel Pedone
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with This program. If not, see <http://www.gnu.org/licenses/>.
#
# Email   : hipersayan DOT x AT gmail DOT com
# Web-Site: http://hipersayanx.blogspot.com/

def lcs(a=[], b=[]):
    ja = -1
    jb = -1
    n = 0

    if a == [] or b == []:
        return ja, jb, n

    l = len(a) + len(b) - 1
    ia = len(a) - 1
    ib = 0
    s = 1

    for k in range(l):
        nCur = 0

        for r in range(s):
            if a[ia + r] == b[ib + r]:
                nCur += 1

                if nCur > n:
                    ja = ia + r - nCur + 1
                    jb = ib + r - nCur + 1
                    n = nCur
            else:
                nCur = 0

        if k < min(len(a), len(b)) - 1:
            ia -= 1
            s += 1
        elif k > l - min(len(a), len(b)) - 1:
            ib += 1
            s -= 1
        elif ia > 0:
            ia -= 1
        else:
            ib += 1

    return ja, jb, n


a = 'Andrea always wears black cat to all places she go.'
b = 'She likes her black cat because it is very mischievous and playful.'

ia, ib, n = lcs(list(a), list(b))

print([a[ia: ia + n]])
print([b[ib: ib + n]])

1 comment:

  1. Hi! Great post! I wonder if there is any way to fit this lcs v3.0 into the alignSequences function (http://hipersayanx.blogspot.com.co/2012/08/multiple-sequences-alignment-in-python.html).

    ReplyDelete