Added work from my other class repositories before deletion

2017-11-29 10:28:24 -08:00
parent cb0b5f4d25
commit 5ea24c81b5
198 changed files with 739603 additions and 0 deletions

@@ -0,0 +1,26 @@
Architecturally, very little differs between the fast version of kwic and the baseline implementation. The one
architectural difference that does exist, however, has a significant effect on speed, which was the point of creating
this version in the first place. The baseline kwic behaves almost entirely like a single black box: data in, data out,
all done in pure Python. The fast version's biggest gains came from changing how some of the core loops are executed, so
that the iteration is driven by compiled C code inside the interpreter rather than by Python-level loop statements. This
was done with the built-in map function, an alternative to the traditional for loop in Python.
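
As a rough sketch of the map change described here (not the code from this commit), the two sort-key styles can be
compared side by side. It is written for Python 3, where map is lazy and needs list() around it, whereas the Python 2
code in this commit returns the result of map directly. The sample rows and the names key_with_loop and key_with_map
are illustrative assumptions, and the timings depend on the machine and interpreter.

```python
import timeit

# Hypothetical sample data: (list_of_words, line_index) tuples, the shape the sort key receives.
rows = [(["Design", "is", "hard."], 0), (["Let's", "just", "implement."], 1)] * 1000


def key_with_loop(row):
    """Lowercase each word with an explicit Python-level loop."""
    lowered = []
    for word in row[0]:
        lowered.append(word.lower())
    return lowered


def key_with_map(row):
    """Lowercase each word with map(); list() is needed on Python 3, where map is lazy."""
    return list(map(str.lower, row[0]))


# Both keys must produce the same ordering, otherwise the comparison is meaningless.
assert sorted(rows, key=key_with_loop) == sorted(rows, key=key_with_map)

if __name__ == "__main__":
    for key in (key_with_loop, key_with_map):
        seconds = timeit.timeit(lambda: sorted(rows, key=key), number=20)
        print("%s: %.4f s" % (key.__name__, seconds))
```
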
While this is an architecturally significant change, since it expands both the number of boxes and the number of
languages involved, it also brought the speed benefits of running code in C. The speedup was most noticeable in the
function passed as the "key" argument to the alphabetization sort calls.

Another change that helped speed, though it did not really alter the architecture, was made in loops that called
methods on objects. For example, I originally had many loops that called "array_name.append()". This is slow because
every time the interpreter reaches that line, it has to look up "append" on "array_name" before it can call it. By
aliasing the bound method once before the loop, as in "aliased = array_name.append", the interpreter only performs
that lookup once. Replacing the calls inside the loop with the alias saves the lookup on every iteration. This did not
produce a gain as large as map did, but it helped enough to mention.

A few other minor changes shaved a couple of tenths of a second off here and there. In many places I was needlessly
copying large arrays of data simply to make the code more readable, which cost both time and RAM. Reusing existing
variables instead saved a little more time.

When it came to listPairs, the other very slow part of the code, I changed the way words are sanitized, as that was
where the most time was being lost. The original version brute-forced the task by joining onto an empty string only
those characters that were not in the unwanted set. For the fast version, I learned of a way to perform the same task
using Python's built-in string translate method. This change nearly halved the time taken by the listPairs section
alone.

To test all of this, I used Python's built-in cProfile tool, which reports the aggregate time spent in each
user-written function and, listed separately, in each of Python's built-in functions that were called. This made it
very easy to see which parts of the code needed attention.
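
The method aliasing, the translate-based sanitizing, and the cProfile measurement described above could be tried out
with a sketch along these lines. It is an illustration under assumptions, not the code from this commit: it targets
Python 3, where the deletion table is built with str.maketrans (the Python 2 code in this commit passes the characters
to translate directly), and the names collect_plain, collect_aliased, and profile are invented for the example.

```python
import cProfile
import io
import pstats

PUNCTUATION = ".,?!:"
# Python 3 form of "delete these characters"; built once and reused.
DELETE_TABLE = str.maketrans("", "", PUNCTUATION)


def sanitize_join(word):
    """Baseline style: rebuild the word character by character."""
    return "".join(char for char in word.lower() if char not in PUNCTUATION)


def sanitize_translate(word):
    """Fast style: let str.translate drop the unwanted characters."""
    return word.lower().translate(DELETE_TABLE)


def collect_plain(words):
    """Looks up cleaned.append on every iteration."""
    cleaned = []
    for word in words:
        cleaned.append(sanitize_join(word))
    return cleaned


def collect_aliased(words):
    """Looks up the bound append method once, before the loop."""
    cleaned = []
    append = cleaned.append
    for word in words:
        append(sanitize_translate(word))
    return cleaned


def profile(func, words):
    """Run func(words) under cProfile and return the top of the report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, words)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
    return stream.getvalue()


if __name__ == "__main__":
    words = ["Hello,", "there.", "buddy!", "Goodbye:"] * 5000
    assert collect_plain(words) == collect_aliased(words)
    print(profile(collect_plain, words))
    print(profile(collect_aliased, words))
```

Comparing the two reports shows where the time goes in each variant. The absolute numbers will vary by machine, so
they are best read relative to each other.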

@@ -0,0 +1,14 @@
Compared to the original kwic, the testing version of kwic is quite different. The focus of this version was to pull
much of the code out of the main function and into separate functions, so that sensitive portions could be tested in
isolation and changed more easily. From an architectural point of view, the original kwic is very much a traditional
black box: data and flags go in, one tiny function is called for alphabetization (I had trouble with it and needed to
pull it out), and the final data comes out. The testing version, on the other hand, is approximately 25% longer in
terms of pure code length, and is split into ten separate functions rather than the two in the original kwic. Splitting
the core features of the kwic system into these functions made testing the code during development much easier. Rather
than having to run through all the code up to the point I wanted to test, I could call only the functions for the
features I was actively testing. Of course, adding all these extra function calls did affect performance slightly,
though not as much as I was expecting: this version is only marginally slower than the baseline implementation. I also
considered adding a flag to kwic that would enable debugging print statements, but decided against it because (at least
in my personal experience) adding tons of print statements isn't always helpful. Generally, you only need printing for
a particular section of code, and that can easily be added by hand now that the important parts of the code are broken
out.
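
To illustrate what testing a broken-out function in isolation can look like, here is a minimal sketch. The two helpers
are copied from the testable version in this commit so that the example is self-contained, while the check functions
and their inputs are assumptions added for illustration and are not part of the commit.

```python
def sanitize_word(input_word):
    """Lowercase a word and strip the punctuation that kwic ignores."""
    char_set = ".,?!:"
    return "".join(char for char in input_word.lower() if char not in char_set)


def return_ordered_words(word_one, word_two):
    """Return the two words as a tuple in alphabetical order."""
    if word_one < word_two:
        return word_one, word_two
    else:
        return word_two, word_one


def check_sanitize_word():
    # Punctuation and case are removed; a word made only of punctuation becomes empty.
    assert sanitize_word("Buddy,") == "buddy"
    assert sanitize_word("Goodbye!") == "goodbye"
    assert sanitize_word("?!") == ""


def check_return_ordered_words():
    # The result does not depend on the argument order.
    assert return_ordered_words("hello", "buddy") == ("buddy", "hello")
    assert return_ordered_words("buddy", "hello") == ("buddy", "hello")


if __name__ == "__main__":
    check_sanitize_word()
    check_return_ordered_words()
    print("all helper checks passed")
```

Neither check needs a document, flags, or the rest of the pipeline, which is the property the split into functions was
meant to provide.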

@@ -0,0 +1,58 @@
def shift(line):
    # All circular shifts (rotations) of a list of words.
    return [line[i:] + line[:i] for i in xrange(0,len(line))]


def cleanWord(word):
    # Lowercase the word and drop punctuation (Python 2: filter on a str returns a str).
    return filter (lambda c: c not in [".",",","?","!",":"], word.lower())


def ignorable(word,ignoreWords):
    return cleanWord(word) in map(lambda w: w.lower(), ignoreWords)


def splitBreaks(string, periodsToBreaks):
    if not periodsToBreaks:
        return string.split("\n")
    else:
        # Break after a period that follows a lowercase letter and precedes a space.
        line = ""
        lines = []
        lastChar1 = None
        lastChar2 = None
        breakChars = map(chr, xrange(ord('a'),ord('z')+1))
        for c in string:
            if (c == " ") and (lastChar1 == ".") and (lastChar2 in breakChars):
                lines.append(line)
                line = ""
            line += c
            lastChar2 = lastChar1
            lastChar1 = c
        lines.append(line)
        return lines


def kwic(string,ignoreWords=[], listPairs=False, periodsToBreaks=False):
    lines = splitBreaks(string, periodsToBreaks)
    splitLines = map(lambda l: l.split(), lines)
    if listPairs:
        # Count, for each ordered pair of distinct cleaned words, the number of lines containing both.
        pairs = {}
        for l in splitLines:
            seen = set([])
            for wu1 in l:
                wc1 = cleanWord(wu1)
                if len(wc1) == 0:
                    continue
                for wu2 in l:
                    wc2 = cleanWord(wu2)
                    if wc1 < wc2:
                        if (wc1,wc2) in seen:
                            continue
                        seen.add((wc1,wc2))
                        if (wc1, wc2) in pairs:
                            pairs[(wc1,wc2)] += 1
                        else:
                            pairs[(wc1,wc2)] = 1
    shiftedLines = [map(lambda x:(x,i), shift(splitLines[i])) for i in xrange(0,len(splitLines))]
    flattenedLines = [l for subList in shiftedLines for l in subList]
    filteredLines = filter(lambda l: not ignorable(l[0][0], ignoreWords), flattenedLines)
    if not listPairs:
        return sorted(filteredLines, key = lambda l: (map(cleanWord, l[0]),l[1]))
    else:
        return (sorted(filteredLines, key = lambda l: (map(lambda w:w.lower(), l[0]),l[1])),
                map(lambda wp: (wp, pairs[wp]), sorted(filter(lambda wp: pairs[wp] > 1, pairs.keys()))))

@@ -0,0 +1,20 @@
- One version that has improved performance, and one to improve testability
- Performance
-- Faster by a good margin than the baseline version
- Testability
-- Make it easier to control and/or observe the behavior of the system
-- Ideas
--- Interfaces for controlling internal variables
--- Copious assertions
--- Limiting complexity
--- Changes that make the code timing more consistent
- Files (put in folder named perrenc361assign2.zip)
-- kwic.py : baseline version
-- fastkwic.py : better performance version
-- fastarch.txt : describes the changes, in architectural terms, that made it faster, and how you tested this (400 words)
-- fastarch.pdf : diagram of architecture
-- testkwic.py : highly testable version of kwic
-- testarch.txt : describes changes in architectural terms that made it more testable (200 words)
-- testarch.pdf : diagram of architecture

@@ -0,0 +1,102 @@
def alphabetized_key(input_data):
    # Sort key: the lowercased words of the (words, line_index) tuple, built with map.
    output_array = map(str.lower, input_data[0])
    return output_array


def kwic(document, listPairs=False, ignoreWords=None, periodsToBreaks=False):
    if not document and not listPairs:
        return []
    elif not document and listPairs:
        return [], []

    if periodsToBreaks:
        output_array = []
        temp_sentence = ""
        document_length_zero_indexed = len(document) - 1
        for current_index, current_value in enumerate(document):
            if current_value == '.':
                if (current_index == 0) or (current_index == document_length_zero_indexed) or \
                        (document[current_index - 1].islower() and (document[current_index + 1].isspace() or
                                                                    (document[current_index + 1] == '\n'))):
                    temp_sentence += current_value
                    output_array.append(temp_sentence)
                    temp_sentence = ""
            else:
                if current_value != '\n':
                    temp_sentence += current_value
                else:
                    temp_sentence += " "
        if temp_sentence:
            output_array.append(temp_sentence)
        split_into_sentences = output_array
    else:
        split_into_sentences = document.split('\n')

    # Split each sentence into words, pairing the word list with its line index.
    output_array_2 = []
    index_incrementer = 0
    temp_append = output_array_2.append  # alias the bound method to avoid a lookup per iteration
    temp_split = str.split
    for sentence in split_into_sentences:
        words_array = temp_split(sentence, " ")
        words_array = filter(None, words_array)
        temp_append((words_array, index_incrementer))
        index_incrementer += 1
    split_into_word_tuples = output_array_2

    # Generate every circular shift of every line.
    circular_shifted_data = []
    temp_append = circular_shifted_data.append
    for current_tuple in output_array_2:
        for index, _ in enumerate(current_tuple[0]):
            temp_array = current_tuple[0][index:] + current_tuple[0][:index]
            temp_append((temp_array, current_tuple[1]))

    if ignoreWords:
        lowered_input = [word.lower() for word in ignoreWords]
        # Build a new list instead of calling remove() during iteration, which skips elements.
        circular_shifted_data = [current_tuple for current_tuple in circular_shifted_data
                                 if current_tuple[0][0].lower().strip(".:!?,") not in lowered_input]

    alphabetized_data = sorted(circular_shifted_data, key=alphabetized_key)

    if listPairs:
        known_pairs = {}
        for sentence_array, _ in split_into_word_tuples:
            seen_in_sentence = set([])
            for first_word in sentence_array:
                for second_word in sentence_array:
                    # Python 2 str.translate: a None table plus the characters to delete.
                    first = first_word.lower().translate(None, ".,?!:")
                    second = second_word.lower().translate(None, ".,?!:")
                    if (first == second) or (first == "") or (first > second):
                        continue
                    if (first, second) not in seen_in_sentence:
                        seen_in_sentence.add((first, second))
                        if (first, second) in known_pairs:
                            known_pairs[(first, second)] += 1
                        else:
                            known_pairs[(first, second)] = 1
        output_list = []
        for key in known_pairs:
            if known_pairs[key] > 1:
                output_list.append((key, known_pairs[key]))
        output_list.sort(key=alphabetized_key)
        return alphabetized_data, output_list
    else:
        return alphabetized_data

@@ -0,0 +1,115 @@
def alphabetized_key(input_data):
    output_array = []
    for word in input_data[0]:
        output_array.append(word.lower())
    return output_array


def kwic(document, listPairs=False, ignoreWords=None, periodsToBreaks=False):
    if not document and not listPairs:
        return []
    elif not document and listPairs:
        return [], []

    if periodsToBreaks:
        output_array = []
        temp_sentence = ""
        document_length_zero_indexed = len(document) - 1
        for current_index, current_value in enumerate(document):
            if current_value == '.':
                if (current_index == 0) or (current_index == document_length_zero_indexed) or \
                        (document[current_index - 1].islower() and (document[current_index + 1].isspace() or
                                                                    (document[current_index + 1] == '\n'))):
                    temp_sentence += current_value
                    output_array.append(temp_sentence)
                    temp_sentence = ""
            else:
                if current_value != '\n':
                    temp_sentence += current_value
                else:
                    temp_sentence += " "
        if temp_sentence:
            output_array.append(temp_sentence)
        split_into_sentences = output_array
    else:
        split_into_sentences = document.split('\n')

    output_array = []
    index_incrementer = 0
    for sentence in split_into_sentences:
        words_array = sentence.split(" ")
        words_array = filter(None, words_array)
        output_array.append((words_array, index_incrementer))
        index_incrementer += 1
    split_into_word_tuples = output_array

    output_array = []
    for current_tuple in split_into_word_tuples:
        for index, _ in enumerate(current_tuple[0]):
            temp_array = current_tuple[0][index:] + current_tuple[0][:index]
            output_array.append((temp_array, current_tuple[1]))
    circular_shifted_data = output_array

    if ignoreWords:
        lowered_input = []
        output_array = []
        for word in ignoreWords:
            lowered_input.append(word.lower())
        for current_tuple in circular_shifted_data:
            if current_tuple[0][0].lower().strip(".:!?,") in lowered_input:
                pass
            else:
                output_array.append(current_tuple)
        circular_shifted_data = output_array

    sorted_array = sorted(circular_shifted_data, key=alphabetized_key)
    alphabetized_data = sorted_array

    if listPairs:
        known_pairs = {}
        char_set = ".,?!:"
        for sentence_array, _ in split_into_word_tuples:
            seen_in_sentence = set([])
            for first_word in sentence_array:
                for second_word in sentence_array:
                    first = "".join(char for char in first_word.lower() if char not in char_set)
                    second = "".join(char for char in second_word.lower() if char not in char_set)
                    if first > second:
                        temp = second
                        second = first
                        first = temp
                    if (first == second) or (first == ""):
                        continue
                    if (first, second) not in seen_in_sentence:
                        seen_in_sentence.add((first, second))
                        if (first, second) in known_pairs:
                            known_pairs[(first, second)] += 1
                        else:
                            known_pairs[(first, second)] = 1
        output_list = []
        for key in known_pairs:
            if known_pairs[key] > 1:
                output_list.append((key, known_pairs[key]))
        output_list.sort(key=alphabetized_key)
        return alphabetized_data, output_list
    else:
        return alphabetized_data

@@ -0,0 +1,91 @@
import time
import kwic as kwic_original
import fastkwic as kwic_fast
import testkwic as kwic_test

num_tests = 1
test_original_kwic = True
print_original_kwic = False
test_fast_kwic = True
print_fast_kwic = False
test_test_kwic = True
print_test_kwic = False

design_words_doc = "Design is hard.\nLet's just implement."
goodbye_buddy_doc = "Hello there.\nHello there, buddy.\nHello and goodbye, buddy.\nHello is like buddy Goodbye!"
hello_buddy_periods = "Hello there. Hello there, buddy. Hello and goodbye, buddy. Hello is like buddy Goodbye!"
letters_and_stuff = "It's very nice to be footloose. \nWith just a toothbrush and a comb.\n"

open_file = open("test_documents/chesterton_short.txt", "r")
file_as_lines = open_file.readlines()
file_as_string = ""
for line in file_as_lines:
    file_as_string += line
del file_as_lines
open_file.close()
input_document = file_as_string

if __name__ == "__main__":
    original_output = None
    fast_output = None
    test_output = None
    original_times = []
    fast_times = []
    test_times = []
    for i in range(num_tests):
        if test_original_kwic:
            print "\nTesting kwic.py"
            start_time = time.time()
            # original_output = kwic_original.kwic(input_document)
            original_output = kwic_original.kwic(input_document, listPairs=True)
            if print_original_kwic:
                print original_output
            total = time.time() - start_time
            original_times.append(total)
            print "kwic.py took " + str(total) + " seconds."
        if test_fast_kwic:
            print "\nTesting fastkwic.py"
            start_time = time.time()
            # fast_output = kwic_fast.kwic(input_document)
            fast_output = kwic_fast.kwic(input_document, listPairs=True)
            if print_fast_kwic:
                print fast_output
            total = time.time() - start_time
            fast_times.append(total)
            print "fastkwic.py took " + str(total) + " seconds."
        if test_test_kwic:
            print "\nTesting testkwic.py"
            start_time = time.time()
            # test_output = kwic_test.kwic(input_document)
            test_output = kwic_test.kwic(input_document, listPairs=True)
            if print_test_kwic:
                print test_output
            total = time.time() - start_time
            test_times.append(total)
            print "testkwic.py took " + str(total) + " seconds."
        print "\nOriginal == Fast: " + str(original_output == fast_output)
        print "Original == Test: " + str(original_output == test_output)
        print "Test == Fast: " + str(test_output == fast_output)
    print "\n\n"
    if test_original_kwic:
        print "Original Avg: " + str(sum(original_times)/ float(len(original_times)))
    if test_fast_kwic:
        print "Fast Avg: " + str(sum(fast_times) / float(len(fast_times)))
    if test_test_kwic:
        print "Test Avg: " + str(sum(test_times) / float(len(test_times)))

File diff suppressed because one or more lines are too long

@@ -0,0 +1,149 @@
def split_by_periods(document):
    output_array = []
    temp_sentence = ""
    document_length_zero_indexed = len(document) - 1
    for current_index, current_value in enumerate(document):
        if current_value == '.':
            if (current_index == 0) or (current_index == document_length_zero_indexed) or \
                    (document[current_index-1].islower() and (document[current_index+1].isspace() or
                                                              (document[current_index+1] == '\n'))):
                temp_sentence += current_value
                output_array.append(temp_sentence)
                temp_sentence = ""
        else:
            if current_value != '\n':
                temp_sentence += current_value
            else:
                temp_sentence += " "
    if temp_sentence:
        output_array.append(temp_sentence)
    return output_array


def split_by_word_as_tuples(sentence_array):
    output_array = []
    index_incrementer = 0
    for sentence in sentence_array:
        words_array = sentence.split(" ")
        words_array = filter(None, words_array)
        output_array.append((words_array, index_incrementer))
        index_incrementer += 1
    return output_array


def array_circular_shift(input_array, rotate_val):
    output_array = input_array[rotate_val:] + input_array[:rotate_val]
    return output_array


def fill_with_circular_shifts_and_original(sentence_array):
    output_array = []
    for current_tuple in sentence_array:
        for index, _ in enumerate(current_tuple[0]):
            output_array.append((array_circular_shift(current_tuple[0], index), current_tuple[1]))
    return output_array


def alphabetize_tuple_list(input_array):
    sorted_array = sorted(input_array, key=alphabetized_key)
    return sorted_array


def alphabetized_key(input_data):
    output_array = []
    for word in input_data[0]:
        output_array.append(word.lower())
    return output_array


def remove_words(input_array, words):
    lowered_input = []
    output_array = []
    for word in words:
        lowered_input.append(word.lower())
    for current_tuple in input_array:
        if current_tuple[0][0].lower().strip(".:!?,") in lowered_input:
            pass
        else:
            output_array.append(current_tuple)
    return output_array


def create_list_pairs(input_array):
    known_pairs = {}
    for sentence_array, _ in input_array:
        seen_in_sentence = set([])
        for first_word in sentence_array:
            for second_word in sentence_array:
                first, second = return_ordered_words(sanitize_word(first_word), sanitize_word(second_word))
                if (first == second) or (first == ""):
                    continue
                if (first, second) not in seen_in_sentence:
                    seen_in_sentence.add((first, second))
                    if (first, second) in known_pairs:
                        known_pairs[(first, second)] += 1
                    else:
                        known_pairs[(first, second)] = 1
    output_list = []
    for key in known_pairs:
        if known_pairs[key] > 1:
            output_list.append((key, known_pairs[key]))
    output_list.sort(key=alphabetized_key)
    return output_list


def sanitize_word(input_word):
    char_set = ".,?!:"
    return "".join(char for char in input_word.lower() if char not in char_set)


def return_ordered_words(word_one, word_two):
    if word_one < word_two:
        return word_one, word_two
    else:
        return word_two, word_one


def kwic(document, listPairs=False, ignoreWords=None, periodsToBreaks=False):
    if not document and not listPairs:
        return []
    elif not document and listPairs:
        return [], []
    if periodsToBreaks:
        split_into_sentences = split_by_periods(document)
    else:
        split_into_sentences = document.split('\n')
    split_into_word_tuples = split_by_word_as_tuples(split_into_sentences)
    circular_shifted_data = fill_with_circular_shifts_and_original(split_into_word_tuples)
    if ignoreWords:
        circular_shifted_data = remove_words(circular_shifted_data, ignoreWords)
    alphabetized_data = alphabetize_tuple_list(circular_shifted_data)
    if listPairs:
        return alphabetized_data, create_list_pairs(split_into_word_tuples)
    else:
        return alphabetized_data