
Assignment 2

Subject - Take an input file and find the number of occurrences of each individual word.

Say this is a string from the input file: “Apple is a healthy food but Soda is a Junk food.” The expected output would be a count for each word, along the lines of: apple: 1, is: 2, a: 2, healthy: 1, food: 2, but: 1, soda: 1, junk: 1.

Listed below is a summary of the approaches and an overview of my thinking process as I worked toward solving this problem. All the programs are implemented in the Python 3 programming language.

Approach 1.0

The first obvious approach would be to:

  1. Take a dictionary-like data type (a hashmap).
  2. Read the file character by character until a space is found.
  3. Add the newly found word into the data type, with its value as previous + 1.
  4. Repeat!

This would be a generic approach applicable to any given programming language. But speaking about Python specifically, we can make some modifications to the above approach. So my approach to the problem was the following (sketched in code after the list):

  1. Take a dictionary.
  2. Read the file.
  3. Use text.split(' ') to split the string on spaces and save the result into a list.
  4. Traverse the list, adding each element to the dictionary with its value as previous + 1.
  5. Repeat!
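A minimal sketch of that idea, assuming the whole file fits in memory (count_words and the file handling are illustrative, not the final assignment code):

def count_words(path):
    frequency = {}
    with open(path, 'r') as f:
        text = f.read().lower()              # read the entire file at once
    for word in text.split(' '):             # split the text on single spaces
        frequency[word] = frequency.get(word, 0) + 1
    return frequency

This works for small inputs; the problems below appear as soon as the input grows.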

The problems with this approach are:

  1. Can’t read a big file as a whole into memory.
  2. No thought was given to the storage of the word occurrences.
  3. Given a large enough input file, the dictionary will exhaust memory, raising a MemoryError.
  4. Can’t print out a large volume of output.

Approach 2.0

Trivial errors with Approach 1.0 are removed in this one.

This approach was all about:

  1. Reading from a file in chunks.
  2. Writing the dictionary into a file.

Reading from the file in chunks was the obvious fix, and I also decided to write the output into a file, as printing out all the words and their occurrence counts was not a viable option. A rough sketch of both ideas is below.
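A rough sketch of Approach 2.0, reading in chunks and dumping the counts to a text file (the chunk size and file names are illustrative assumptions):

def make_chunks(f, size=1024 * 1024):
    # generator: yield the file one chunk at a time instead of reading it whole
    while True:
        data = f.read(size)
        if not data:
            break
        yield data

frequency = {}
with open('input.txt', 'r') as f:
    for chunk in make_chunks(f):
        for word in chunk.lower().split(' '):
            frequency[word] = frequency.get(word, 0) + 1

# write the counts out instead of printing them
with open('output.txt', 'w') as out:
    for word, count in frequency.items():
        out.write('{} {}\n'.format(word, count))

Note that the dictionary itself still grows without bound here, which is exactly the problem the next approach tackles.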

However, one problem still persisted: I needed to do something about the MemoryError, as the unbounded growth of the dictionary was a liability and some alternative was needed.

Approach 3.0

To avoid the limited size of the dictionary, what I could do is use the dictionary only as temporary storage and update the output, wherever it was stored, accordingly. Doing this with a plain file was not practical: every time a word occurred in the dictionary, I’d have to search for it in the file, update its value, and overwrite the .txt file, which would quickly become an issue.

So instead of using a normal text file as output, we could use a .csv file or, better still, a database to write the output to. I’ve used the sqlite3 module in Python to do all the database operations.

This was finally a decent approach that I decided to implement in code.

Version 1.0

import sqlite3,time

start=time.time()

conn=sqlite3.connect('test1.db')
c=conn.cursor()

def makechunk(filer,size=25000000):
    # Generator: yield the file in fixed-size chunks instead of reading it whole.
    # Note: a word that straddles a chunk boundary gets counted as two pieces.
    while True:
        data=filer.read(size).lower()
        if data:
            yield data
        else:
            break

def Show_entry(table_name):
    for i in c.execute("SELECT * FROM {}".format(table_name)):
        print(i)

def Create_table(table_name):
    c.execute('CREATE TABLE IF NOT EXISTS {}(Name TEXT UNIQUE, value INT)'.format(table_name))
    conn.commit()
    print('Table Created!')

def Drop_table(table_name):
    # IF EXISTS keeps the first run (when no table exists yet) from raising an error
    c.execute("DROP TABLE IF EXISTS {}".format(table_name))
    conn.commit()
    print('Table Dropped!')

def Update_db(table_name):
    # frequency.items() yields (Name, value) pairs, but the UPDATE below binds
    # (value, Name), so reorder the tuples to match the placeholders.
    param1=[(value, name) for name, value in frequency.items()]
    print("Updating...")
    # Insert any unseen words with a count of 0 first (Name is UNIQUE, so
    # existing rows are ignored), then add this chunk's counts in one batch.
    c.executemany("INSERT OR IGNORE INTO {} (Name, value) VALUES (?, 0)".format(table_name),
                  [(name,) for value, name in param1])
    c.executemany("UPDATE {} SET value=value+? WHERE Name=?".format(table_name), param1)
    conn.commit()

frequency = {}
table_name='Dict'
Drop_table(table_name)
Create_table(table_name)
file_read = open('500mb.txt', 'r')

for d in makechunk(file_read):
    match=d.split(' ')
    for word in match:
        count = frequency.get(word,0)
        frequency[word] = count + 1
    Update_db(table_name)
    frequency.clear()   # empty the dictionary, ready for the next chunk

conn.commit()
#Show_entry('Dict')

file_read.close()
c.close()
conn.close()
print("***** Execution time : {} sec ******".format(time.time()-start))

Execution Flow:

  1. Read a chunk from the file.
  2. Split the chunk using chunk.split(' ') and save the result into a list.
  3. Loop over the list:
    • Update the dictionary for each item.
  4. Make a list of tuples from the dictionary to pass to executemany() (a prepared statement):
    • Update the database.
  5. Empty the dictionary, ready for a new chunk.
  6. Repeat!

We have used prepared statements via executemany() to efficiently run the same parameterized query for many rows at a time.
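As a quick sanity check (not part of the assignment code; the query and LIMIT are illustrative), the resulting database can be inspected with an ordinary SELECT:

import sqlite3

conn = sqlite3.connect('test1.db')
# print the ten most frequent words and their counts
for name, value in conn.execute("SELECT Name, value FROM Dict ORDER BY value DESC LIMIT 10"):
    print(name, value)
conn.close()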