Added work from my other class repositories before deletion

2025-11-09 21:51:15 +00:00 · 2017-11-29 10:28:24 -08:00
parent cb0b5f4d25
commit 5ea24c81b5
198 changed files with 739603 additions and 0 deletions
--- a/2/Non-python/fastarch.pdf
+++ b/2/Non-python/fastarch.pdf
--- a/2/Non-python/fastarch.txt
+++ b/2/Non-python/fastarch.txt
@@ -0,0 +1,26 @@
+From an architectural point of view, very little differs between the fast version of kwic and the baseline
+implementation. However, the one architectural difference that does exist matters for speed, which was the point of
+creating this version in the first place. In the baseline kwic, the code acts almost entirely like a single black
+box: data in, data out, all done in pure Python. The fast version's biggest gains came from changing how some of the
+loops are run, so that the iteration is now driven by compiled C code rather than by interpreted Python bytecode. This
+change was the use of the map function, an alternative to the traditional for loop in Python.
+While this is an architecturally significant change, as we're now expanding the number of boxes AND languages, it also
+came with the speed benefits of running code in C. This speed jump was most noticeable in the function that is passed
+as the "key" argument to the alphabetization sort calls. Another place where changes were made that helped
+increase speed, though didn't necessarily change the code architecturally, was in loops where methods were called on
+an object. So, for example, I originally had many loops that called "array_name.append()". This is slow because every
+time the interpreter reaches that line, it has to look up the "append" attribute on "array_name" and build a bound
+method before calling it. By aliasing "array_name.append" as a new variable like so "aliased = array_name.append" the
+interpreter only has to perform that lookup once. Then, by replacing the calls in the loop with my alias, I save that
+lookup time for each iteration. While this didn't result in a massive increase like the use of map did, it did help
+enough to mention. A few other minor changes were also made to help shave a couple tenths of a second off here and
+there. There were many places where I was needlessly making copies of large arrays of data, simply to make the code
+more readable, which cost both time and RAM. By streamlining the use of existing variables, I again managed to save
+little bits of time here and there. Now, once it came to enabling listPairs, the other very slow part of the code, I
+decided to change the way that the words were sanitized, as that's where the most time was being lost. In the
+original version, I called join on an empty string with a generator that dropped unwanted characters one at a time,
+essentially by brute force. For the fast version, I learned of a way to perform the same task using Python's
+built-in translate method. This change nearly halved the time it took for just the listPairs section of the code to
+run. In order to test all of this, I used Python's built-in cProfile tool, which times every user-written function
+call and separately aggregates the times of Python's built-in ones. This made it very easy to see which parts of the
+code needed to be focused on for speed increases.
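The two loop-level changes described in fastarch.txt (aliasing a bound method and replacing a loop with map) can be sketched roughly as follows. This is an illustrative sketch written for Python 3, not the code from this commit: the function names and the `append` alias are hypothetical, and `list(...)` is wrapped around `map` because on Python 3 `map` returns an iterator rather than a list.

```python
def lower_with_loop(words):
    """Plain loop: `result.append` is looked up again on every iteration."""
    result = []
    for word in words:
        result.append(word.lower())
    return result


def lower_with_alias(words):
    """Same loop, but the bound method is looked up once, before the loop."""
    result = []
    append = result.append  # hypothetical alias name
    for word in words:
        append(word.lower())
    return result


def lower_with_map(words):
    """map runs the iteration in C; list() materializes it on Python 3."""
    return list(map(str.lower, words))


sample = ["Design", "is", "HARD"]
# All three variants are meant to produce the same lowercased list.
assert lower_with_loop(sample) == lower_with_alias(sample) == lower_with_map(sample)
```

How much each variant saves depends on the interpreter version and the data, so timing them (for example with the standard `timeit` module) on the target machine is the only reliable way to compare.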
--- a/2/Non-python/testarch.pdf
+++ b/2/Non-python/testarch.pdf
--- a/2/Non-python/testarch.txt
+++ b/2/Non-python/testarch.txt
@@ -0,0 +1,14 @@
+Compared to the original kwic, the testing version of kwic is quite different. The focus of this version was to pull
+much of the code out of the main function into separate functions so that sensitive portions could be tested in
+isolation and changed more easily. From an architectural point of view, the original kwic is very much a traditional
+black box: data and flags go in, one tiny function is called for alphabetization (as I had trouble with it and needed
+to pull it out), and the final data comes out. The testing version, on the other hand, is approximately 25% longer in
+pure code length, and is split into eleven functions (ten helpers plus kwic itself) rather than the original's two.
+Splitting the core features of kwic into these functions made testing the code during development that much easier.
+Rather than having to run through all the code up to the point I wanted to test, I could simply call only the
+functions for the features I was actively testing. Of course, adding all these extra function calls did affect
+performance slightly, though not as much as I was expecting. This version is only marginally slower than the baseline
+implementation. I also considered adding a flag to kwic that would enable debugging print statements, but I decided
+against it as (at least in my personal experience) adding tons of print statements isn't always helpful. Generally,
+you only need printing for a particular section of code, which can easily be added by hand now that the important
+parts of the code are broken out.
--- a/2/Reference/Original
+++ b/2/Reference/Original
--- a/2/Reference/Software
+++ b/2/Reference/Software
--- a/2/Reference/init.py
+++ b/2/Reference/init.py
--- a/2/Reference/mykwic.py
+++ b/2/Reference/mykwic.py
@@ -0,0 +1,58 @@
+def shift(line):
+    return [line[i:] + line[:i] for i in xrange(0,len(line))]
+
+def cleanWord(word):
+    return filter (lambda c: c not in [".",",","?","!",":"], word.lower())
+
+def ignorable(word,ignoreWords):
+    return cleanWord(word) in map(lambda w: w.lower(), ignoreWords)
+
+def splitBreaks(string, periodsToBreaks):
+    if not periodsToBreaks:
+        return string.split("\n")
+    else:
+        line = ""
+        lines = []
+        lastChar1 = None
+        lastChar2 = None
+        breakChars = map(chr, xrange(ord('a'),ord('z')+1))
+        for c in string:
+            if (c == " ") and (lastChar1 == ".") and (lastChar2 in breakChars):
+                lines.append(line)
+                line = ""
+            line += c
+            lastChar2 = lastChar1
+            lastChar1 = c
+        lines.append(line)
+        return lines
+
+
+def kwic(string,ignoreWords=[], listPairs=False, periodsToBreaks=False):
+    lines = splitBreaks(string, periodsToBreaks)
+    splitLines = map(lambda l: l.split(), lines)
+    if listPairs:
+        pairs = {}
+        for l in splitLines:
+            seen = set([])
+            for wu1 in l:
+                wc1 = cleanWord(wu1)
+                if len(wc1) == 0:
+                    continue
+                for wu2 in l:
+                    wc2 = cleanWord(wu2)
+                    if wc1 < wc2:
+                        if (wc1,wc2) in seen:
+                            continue
+                        seen.add((wc1,wc2))
+                        if (wc1, wc2) in pairs:
+                            pairs[(wc1,wc2)] += 1
+                        else:
+                            pairs[(wc1,wc2)] = 1
+    shiftedLines = [map(lambda x:(x,i), shift(splitLines[i])) for i in xrange(0,len(splitLines))]
+    flattenedLines = [l for subList in shiftedLines for l in subList]
+    filteredLines = filter(lambda l: not ignorable(l[0][0], ignoreWords), flattenedLines)
+    if not listPairs:
+        return sorted(filteredLines, key = lambda l: (map(cleanWord, l[0]),l[1]))
+    else:
+        return (sorted(filteredLines, key = lambda l: (map(lambda w:w.lower(), l[0]),l[1])),
+                map(lambda wp: (wp, pairs[wp]), sorted(filter(lambda wp: pairs[wp] > 1, pairs.keys()))))
--- a/2/Reference/requirements.txt
+++ b/2/Reference/requirements.txt
@@ -0,0 +1,20 @@
+- One version that has improved performance, and one to improve testability
+- Performance
+-- Faster by a good margin than the baseline version
+- Testability
+-- Make it easier to control and/or observe the behavior of the system
+-- Ideas
+--- Interfaces for controlling internal variables
+--- Copious assertions
+--- Limiting complexity
+--- Changes that make the code timing more consistent
+
+
+- Files (put in folder named perrenc361assign2.zip)
+-- kwic.py : baseline version
+-- fastkwic.py : better performance version
+-- fastarch.txt : describes changes in architectural terms that made faster, and how you tested this (400 words)
+-- fastarch.pdf : diagram of architecture
+-- testkwic.py : highly testable version of kwic
+-- testarch.txt : describes changes in architectural terms that made it more testable (200 words)
+-- testarch.pdf : diagram of architecture
--- a/Coursework/CS
+++ b/Coursework/CS
--- a/2/fastkwic.py
+++ b/2/fastkwic.py
@@ -0,0 +1,102 @@
+def alphabetized_key(input_data):
+    output_array = map(str.lower, input_data[0])
+    return output_array
+
+
+def kwic(document, listPairs=False, ignoreWords=None, periodsToBreaks=False):
+    if not document and not listPairs:
+        return []
+    elif not document and listPairs:
+        return [], []
+
+    if periodsToBreaks:
+        output_array = []
+        temp_sentence = ""
+        document_length_zero_indexed = len(document) - 1
+        for current_index, current_value in enumerate(document):
+            if current_value == '.':
+                if (current_index == 0) or (current_index == document_length_zero_indexed) or \
+                        (document[current_index - 1].islower() and (document[current_index + 1].isspace() or
+                                                                        (document[current_index + 1] == '\n'))):
+                    temp_sentence += current_value
+                    output_array.append(temp_sentence)
+                    temp_sentence = ""
+            else:
+                if current_value != '\n':
+                    temp_sentence += current_value
+                else:
+                    temp_sentence += " "
+
+        if temp_sentence:
+            output_array.append(temp_sentence)
+        split_into_sentences = output_array
+    else:
+        split_into_sentences = document.split('\n')
+
+    output_array_2 = []
+    index_incrementer = 0
+
+    temp_append = output_array_2.append
+    temp_split = str.split
+    for sentence in split_into_sentences:
+        words_array = temp_split(sentence, " ")
+        words_array = filter(None, words_array)
+        temp_append((words_array, index_incrementer))
+
+        index_incrementer += 1
+
+    split_into_word_tuples = output_array_2
+    circular_shifted_data = []
+
+    temp_append = circular_shifted_data.append
+    for current_tuple in output_array_2:
+        for index, _ in enumerate(current_tuple[0]):
+            temp_array = current_tuple[0][index:] + current_tuple[0][:index]
+            temp_append((temp_array, current_tuple[1]))
+
+    if ignoreWords:
+        lowered_input = []
+
+        for word in ignoreWords:
+            lowered_input.append(word.lower())
+
+        # Build a new list; calling remove() on a list while iterating over it skips elements
+        circular_shifted_data = [current_tuple for current_tuple in circular_shifted_data
+                                 if current_tuple[0][0].lower().strip(".:!?,") not in lowered_input]
+
+    alphabetized_data = sorted(circular_shifted_data, key=alphabetized_key)
+
+    if listPairs:
+        known_pairs = {}
+
+        for sentence_array, _ in split_into_word_tuples:
+            seen_in_sentence = set([])
+
+            for first_word in sentence_array:
+                for second_word in sentence_array:
+
+                    first = first_word.lower().translate(None, ".,?!:")
+                    second = second_word.lower().translate(None, ".,?!:")
+
+                    if (first == second) or (first == "") or (first > second):
+                        continue
+
+                    if (first, second) not in seen_in_sentence:
+                        seen_in_sentence.add((first, second))
+
+                        if (first, second) in known_pairs:
+                            known_pairs[(first, second)] += 1
+                        else:
+                            known_pairs[(first, second)] = 1
+
+        output_list = []
+
+        for key in known_pairs:
+            if known_pairs[key] > 1:
+                output_list.append((key, known_pairs[key]))
+
+        output_list.sort(key=alphabetized_key)
+
+        return alphabetized_data, output_list
+    else:
+        return alphabetized_data
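The ignoreWords step in fastkwic.py filters the shifted lines by their first word. A minimal sketch of why that kind of filtering is usually done by building a new list rather than by calling `remove()` inside a loop over the same list is shown below. The function names and sample data are hypothetical and simplified to plain strings; the sketch runs on Python 3.

```python
def drop_with_remove(items, unwanted):
    """Removes during iteration: the element after each removal is skipped."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if item in unwanted:
            items.remove(item)
    return items


def drop_with_rebuild(items, unwanted):
    """Builds a fresh list, so every element is examined exactly once."""
    return [item for item in items if item not in unwanted]


words = ["the", "the", "cat", "a", "a", "dog"]
unwanted = {"the", "a"}

# Adjacent unwanted entries survive the remove-based version.
assert drop_with_remove(words, unwanted) == ["the", "cat", "a", "dog"]
assert drop_with_rebuild(words, unwanted) == ["cat", "dog"]
```

The rebuild form also avoids the linear scan that each `remove()` call performs, which matters on large inputs.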
--- a/Coursework/CS
+++ b/Coursework/CS
@@ -0,0 +1,115 @@
+def alphabetized_key(input_data):
+    output_array = []
+    for word in input_data[0]:
+        output_array.append(word.lower())
+    return output_array
+
+
+def kwic(document, listPairs=False, ignoreWords=None, periodsToBreaks=False):
+    if not document and not listPairs:
+        return []
+    elif not document and listPairs:
+        return [], []
+
+    if periodsToBreaks:
+        output_array = []
+        temp_sentence = ""
+        document_length_zero_indexed = len(document) - 1
+        for current_index, current_value in enumerate(document):
+            if current_value == '.':
+                if (current_index == 0) or (current_index == document_length_zero_indexed) or \
+                        (document[current_index - 1].islower() and (document[current_index + 1].isspace() or
+                                                                        (document[current_index + 1] == '\n'))):
+                    temp_sentence += current_value
+                    output_array.append(temp_sentence)
+                    temp_sentence = ""
+            else:
+                if current_value != '\n':
+                    temp_sentence += current_value
+                else:
+                    temp_sentence += " "
+
+        if temp_sentence:
+            output_array.append(temp_sentence)
+        split_into_sentences = output_array
+    else:
+        split_into_sentences = document.split('\n')
+
+    output_array = []
+    index_incrementer = 0
+
+    for sentence in split_into_sentences:
+        words_array = sentence.split(" ")
+        words_array = filter(None, words_array)
+        output_array.append((words_array, index_incrementer))
+        index_incrementer += 1
+
+    split_into_word_tuples = output_array
+
+    output_array = []
+
+    for current_tuple in split_into_word_tuples:
+        for index, _ in enumerate(current_tuple[0]):
+            temp_array = current_tuple[0][index:] + current_tuple[0][:index]
+            output_array.append((temp_array, current_tuple[1]))
+
+    circular_shifted_data = output_array
+
+    if ignoreWords:
+        lowered_input = []
+        output_array = []
+
+        for word in ignoreWords:
+            lowered_input.append(word.lower())
+
+        for current_tuple in circular_shifted_data:
+            if current_tuple[0][0].lower().strip(".:!?,") in lowered_input:
+                pass
+            else:
+                output_array.append(current_tuple)
+
+        circular_shifted_data = output_array
+
+    sorted_array = sorted(circular_shifted_data, key=alphabetized_key)
+    alphabetized_data = sorted_array
+
+    if listPairs:
+        known_pairs = {}
+
+        char_set = ".,?!:"
+        for sentence_array, _ in split_into_word_tuples:
+            seen_in_sentence = set([])
+
+            for first_word in sentence_array:
+                for second_word in sentence_array:
+
+                    first = "".join(char for char in first_word.lower() if char not in char_set)
+                    second = "".join(char for char in second_word.lower() if char not in char_set)
+
+                    if first > second:
+                        temp = second
+                        second = first
+                        first = temp
+
+                    if (first == second) or (first == ""):
+                        continue
+
+                    if (first, second) not in seen_in_sentence:
+                        seen_in_sentence.add((first, second))
+
+                        if (first, second) in known_pairs:
+                            known_pairs[(first, second)] += 1
+                        else:
+                            known_pairs[(first, second)] = 1
+
+        output_list = []
+
+        for key in known_pairs:
+            if known_pairs[key] > 1:
+                output_list.append((key, known_pairs[key]))
+
+        output_list.sort(key=alphabetized_key)
+
+        return alphabetized_data, output_list
+    else:
+        return alphabetized_data
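The baseline listPairs branch above strips punctuation with a join over a generator, while fastkwic.py uses `translate`. The sketch below compares the two approaches on Python 3, where the Python 2 call `str.translate(None, chars)` used in this commit is not available and a table from `str.maketrans` is used instead. The constant and function names are assumptions made for illustration.

```python
PUNCTUATION = ".,?!:"

# Python 3 form: the third argument lists the characters to delete.
DELETE_TABLE = str.maketrans("", "", PUNCTUATION)


def sanitize_with_join(word):
    """Baseline approach: test each character and rebuild the string."""
    return "".join(char for char in word.lower() if char not in PUNCTUATION)


def sanitize_with_translate(word):
    """Fast approach: one translate call using a prebuilt deletion table."""
    return word.lower().translate(DELETE_TABLE)


# Both approaches are expected to agree on every input.
for raw in ["Hello,", "buddy.", "Goodbye!", "it's", "...", ""]:
    assert sanitize_with_join(raw) == sanitize_with_translate(raw)
```

Building the table once at module level keeps the per-word cost down to a single method call.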
--- a/Coursework/CS
+++ b/Coursework/CS
@@ -0,0 +1,91 @@
+import time
+import kwic as kwic_original
+import fastkwic as kwic_fast
+import testkwic as kwic_test
+
+num_tests = 1
+
+test_original_kwic = True
+print_original_kwic = False
+
+test_fast_kwic = True
+print_fast_kwic = False
+
+test_test_kwic = True
+print_test_kwic = False
+
+design_words_doc = "Design is hard.\nLet's just implement."
+goodbye_buddy_doc = "Hello there.\nHello there, buddy.\nHello and goodbye, buddy.\nHello is like buddy Goodbye!"
+hello_buddy_periods = "Hello there.  Hello there, buddy.    Hello and goodbye, buddy. Hello is like buddy Goodbye!"
+letters_and_stuff = "It's very nice to be footloose. \nWith just a toothbrush and a comb.\n"
+
+open_file = open("test_documents/chesterton_short.txt", "r")
+file_as_lines = open_file.readlines()
+file_as_string = ""
+
+for line in file_as_lines:
+    file_as_string += line
+
+del file_as_lines
+open_file.close()
+
+input_document = file_as_string
+
+if __name__ == "__main__":
+    original_output = None
+    fast_output = None
+    test_output = None
+
+    original_times = []
+    fast_times = []
+    test_times = []
+
+    for i in range(num_tests):
+        if test_original_kwic:
+            print "\nTesting kwic.py"
+            start_time = time.time()
+            # original_output = kwic_original.kwic(input_document)
+            original_output = kwic_original.kwic(input_document, listPairs=True)
+            if print_original_kwic:
+                print original_output
+            total = time.time() - start_time
+            original_times.append(total)
+            print "kwic.py took " + str(total) + " seconds."
+
+        if test_fast_kwic:
+
+            print "\nTesting fastkwic.py"
+            start_time = time.time()
+            # fast_output = kwic_fast.kwic(input_document)
+            fast_output = kwic_fast.kwic(input_document, listPairs=True)
+            if print_fast_kwic:
+                print fast_output
+            total = time.time() - start_time
+            fast_times.append(total)
+            print "fastkwic.py took " + str(total) + " seconds."
+
+        if test_test_kwic:
+
+            print "\nTesting testkwic.py"
+            start_time = time.time()
+            # test_output = kwic_test.kwic(input_document)
+            test_output = kwic_test.kwic(input_document, listPairs=True)
+            if print_test_kwic:
+                print test_output
+            total = time.time() - start_time
+            test_times.append(total)
+            print "testkwic.py took " + str(total) + " seconds."
+
+    print "\nOriginal == Fast: " + str(original_output == fast_output)
+    print "Original == Test: " + str(original_output == test_output)
+    print "Test == Fast: " + str(test_output == fast_output)
+    print "\n\n"
+    if test_original_kwic:
+        print "Original Avg: " + str(sum(original_times)/ float(len(original_times)))
+
+    if test_fast_kwic:
+        print "Fast Avg: " + str(sum(fast_times) / float(len(fast_times)))
+
+    if test_test_kwic:
+        print "Test Avg: " + str(sum(test_times) / float(len(test_times)))
+
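The harness above measures wall-clock time per run with `time.time()`. fastarch.txt also mentions cProfile for finding which functions dominate; a minimal sketch of that kind of per-function profiling follows. It is written for Python 3, and `hypothetical_workload` is a stand-in for a kwic call, not part of this commit.

```python
import cProfile
import io
import pstats


def hypothetical_workload(n):
    """Stand-in for a kwic call: sort n lowercased strings."""
    return sorted(str(i).lower() for i in range(n))


profiler = cProfile.Profile()
# runcall profiles a single call and passes its return value through.
result = profiler.runcall(hypothetical_workload, 1000)

# Collect the report in memory, sorted by cumulative time, top 5 entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the call chain that is worth optimizing first; sorting by `"tottime"` instead highlights the functions whose own bodies are slow.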
--- a/2/test_documents/chesterton.txt
+++ b/2/test_documents/chesterton.txt
--- a/2/test_documents/chesterton_short.txt
+++ b/2/test_documents/chesterton_short.txt
--- a/2/test_documents/proust.html
+++ b/2/test_documents/proust.html
--- a/2/test_documents/ulysses.txt
+++ b/2/test_documents/ulysses.txt
--- a/2/testkwic.py
+++ b/2/testkwic.py
@@ -0,0 +1,149 @@
+def split_by_periods(document):
+    output_array = []
+    temp_sentence = ""
+    document_length_zero_indexed = len(document) - 1
+    for current_index, current_value in enumerate(document):
+        if current_value == '.':
+            if (current_index == 0) or (current_index == document_length_zero_indexed) or \
+                    (document[current_index-1].islower() and (document[current_index+1].isspace() or
+                                                              (document[current_index+1] == '\n'))):
+                temp_sentence += current_value
+                output_array.append(temp_sentence)
+                temp_sentence = ""
+        else:
+            if current_value != '\n':
+                temp_sentence += current_value
+            else:
+                temp_sentence += " "
+
+    if temp_sentence:
+        output_array.append(temp_sentence)
+    return output_array
+
+
+def split_by_word_as_tuples(sentence_array):
+    output_array = []
+    index_incrementer = 0
+
+    for sentence in sentence_array:
+        words_array = sentence.split(" ")
+        words_array = filter(None, words_array)
+        output_array.append((words_array, index_incrementer))
+        index_incrementer += 1
+
+    return output_array
+
+
+def array_circular_shift(input_array, rotate_val):
+    output_array = input_array[rotate_val:] + input_array[:rotate_val]
+    return output_array
+
+
+def fill_with_circular_shifts_and_original(sentence_array):
+    output_array = []
+
+    for current_tuple in sentence_array:
+        for index, _ in enumerate(current_tuple[0]):
+            output_array.append((array_circular_shift(current_tuple[0], index), current_tuple[1]))
+
+    return output_array
+
+
+def alphabetize_tuple_list(input_array):
+    sorted_array = sorted(input_array, key=alphabetized_key)
+    return sorted_array
+
+
+def alphabetized_key(input_data):
+    output_array = []
+    for word in input_data[0]:
+        output_array.append(word.lower())
+    return output_array
+
+
+def remove_words(input_array, words):
+    lowered_input = []
+    output_array = []
+
+    for word in words:
+        lowered_input.append(word.lower())
+
+    for current_tuple in input_array:
+        if current_tuple[0][0].lower().strip(".:!?,") in lowered_input:
+            pass
+        else:
+            output_array.append(current_tuple)
+
+    return output_array
+
+
+def create_list_pairs(input_array):
+    known_pairs = {}
+
+    for sentence_array, _ in input_array:
+        seen_in_sentence = set([])
+
+        for first_word in sentence_array:
+            for second_word in sentence_array:
+                first, second = return_ordered_words(sanitize_word(first_word), sanitize_word(second_word))
+
+                if (first == second) or (first == ""):
+                    continue
+
+                if (first, second) not in seen_in_sentence:
+                    seen_in_sentence.add((first, second))
+
+                    if (first, second) in known_pairs:
+                        known_pairs[(first, second)] += 1
+                    else:
+                        known_pairs[(first, second)] = 1
+
+    output_list = []
+
+    for key in known_pairs:
+        if known_pairs[key] > 1:
+            output_list.append((key, known_pairs[key]))
+
+    output_list.sort(key=alphabetized_key)
+
+    return output_list
+
+
+def sanitize_word(input_word):
+    char_set = ".,?!:"
+    return "".join(char for char in input_word.lower() if char not in char_set)
+
+
+def return_ordered_words(word_one, word_two):
+
+    if word_one < word_two:
+        return word_one, word_two
+    else:
+        return word_two, word_one
+
+
+def kwic(document, listPairs=False, ignoreWords=None, periodsToBreaks=False):
+    if not document and not listPairs:
+        return []
+    elif not document and listPairs:
+        return [], []
+
+    if periodsToBreaks:
+        split_into_sentences = split_by_periods(document)
+    else:
+        split_into_sentences = document.split('\n')
+
+    split_into_word_tuples = split_by_word_as_tuples(split_into_sentences)
+
+    circular_shifted_data = fill_with_circular_shifts_and_original(split_into_word_tuples)
+
+    if ignoreWords:
+        circular_shifted_data = remove_words(circular_shifted_data, ignoreWords)
+
+    alphabetized_data = alphabetize_tuple_list(circular_shifted_data)
+
+    if listPairs:
+        return alphabetized_data, create_list_pairs(split_into_word_tuples)
+    else:
+        return alphabetized_data
+
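Because testkwic.py breaks the pipeline into small functions, each one can be exercised on its own. The sketch below shows one way that might look with the standard `unittest` module on Python 3. The three helpers are copied from testkwic.py so the block runs by itself; the test class and its cases are hypothetical and were not part of this commit.

```python
import unittest


# Copied from testkwic.py so this sketch is self-contained.
def array_circular_shift(input_array, rotate_val):
    return input_array[rotate_val:] + input_array[:rotate_val]


def sanitize_word(input_word):
    char_set = ".,?!:"
    return "".join(char for char in input_word.lower() if char not in char_set)


def return_ordered_words(word_one, word_two):
    if word_one < word_two:
        return word_one, word_two
    return word_two, word_one


class KwicHelperTests(unittest.TestCase):  # hypothetical test class
    def test_circular_shift(self):
        self.assertEqual(array_circular_shift(["a", "b", "c"], 1), ["b", "c", "a"])
        self.assertEqual(array_circular_shift(["a", "b", "c"], 0), ["a", "b", "c"])

    def test_sanitize_word(self):
        self.assertEqual(sanitize_word("Hello,"), "hello")
        self.assertEqual(sanitize_word("..."), "")

    def test_return_ordered_words(self):
        self.assertEqual(return_ordered_words("hello", "buddy"), ("buddy", "hello"))
        self.assertEqual(return_ordered_words("a", "b"), ("a", "b"))


# Run the suite programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(KwicHelperTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the helpers would be imported from the module under test rather than copied, which the function-per-feature layout makes straightforward.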