A search engine for FAST-Resources, a repository of study materials. The user enters keywords as text and retrieves all files that contain them.
i. Using the Regular Expressions library to strip trailing whitespace from file text and convert it to lower case
ii. Extracting text from each file, skipping any text-less files, and counting the term frequency of the transformed words
iii. Transforming the text into cleansed words after filtering fillers (stop words / punctuation / three-letter words)
v. Running the driver program to perform all of the above for every file present in the FAST-Resources repository
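The cleaning and counting steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `term_frequencies` and the tiny `STOP_WORDS` set are hypothetical, and "three-letter words" is assumed to mean words of three letters or fewer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "and", "with", "that", "this", "from", "have"}


def term_frequencies(text: str) -> Counter:
    """Return a word -> count mapping for the cleansed words in `text`."""
    # Collapse runs of whitespace, strip both ends, and lowercase.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Keep alphanumeric runs only, which drops punctuation.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Assumption: drop stop words and words of three letters or fewer.
    words = [t for t in tokens if len(t) > 3 and t not in STOP_WORDS]
    return Counter(words)


if __name__ == "__main__":
    print(term_frequencies("  Binary Search Trees: the binary tree, with nodes.  "))
```

The resulting counts are what the later steps treat as the relevance score for each word in a file.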
i. Searching for the query (input) by splitting it into words, each used as an index key to fetch the file paths in its row
ii. If only a single file is found, its keys (file path and relevance score) are fetched from the src dict (DB)
iii. If multiple files are found, their file paths are intersected and the intersection's keys are fetched
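The lookup-and-intersect idea above might look like this on a plain dictionary. The index shape (`word -> {file_path: relevance}`), the function name `search`, and the choice to rank by the summed relevance are all assumptions made for this sketch. Query words missing from the index are simply ignored here.

```python
def search(index, query):
    """Return (file_path, score) pairs for files containing every known query word."""
    words = query.lower().split()
    # One row per query word that exists in the index.
    rows = [index[w] for w in words if w in index]
    if not rows:
        return []
    # Intersect the file paths across all rows.
    common = set(rows[0])
    for row in rows[1:]:
        common &= set(row)
    # Assumption: rank by the sum of per-word relevance scores.
    scored = {path: sum(row[path] for row in rows) for path in common}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    index = {
        "binary": {"a.pdf": 3, "b.pdf": 1},
        "tree": {"a.pdf": 2, "c.pdf": 5},
    }
    print(search(index, "binary tree"))  # [('a.pdf', 5)]
```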
i. Initiating the SQLite database with text columns (Word, Topic, File) and an integer column (Relevance score)
ii. Creating an index at runtime on the Word column and initiating Search Pattern 1, which allows search with Topic
iii. Creating an index at runtime on the Word column and initiating Search Pattern 2, which allows search without Topic
iv. Inserting each Word with its Topic, File Path, and Relevance score (the TF from main.py, used for ordering search results)
v. Using S3 as the front end, DynamoDB instead of SQLite, and Actions to trigger Lambda via API Gateway when a new file is added
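A rough sketch of the SQLite setup described in steps i to iv, using only the standard `sqlite3` module. The table name `word_index`, the index name, the column names, and the sample rows are assumptions for illustration. An in-memory database stands in for the project's database file.

```python
import sqlite3

# In-memory database for the sketch; the project connects to a file path instead.
conn = sqlite3.connect(":memory:")

# Three text columns and one integer column, as described above.
conn.execute(
    "CREATE TABLE IF NOT EXISTS word_index ("
    " word TEXT, topic TEXT, file TEXT, relevance INTEGER)"
)

# Index on the word column so keyword lookups avoid a full table scan.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_word_index_word ON word_index (word)"
)

# Hypothetical sample rows: (word, topic, file path, term frequency).
rows = [
    ("binary", "Data Structures", "DS/trees.pdf", 7),
    ("binary", "Algorithms", "Algo/search.pptx", 3),
]
conn.executemany("INSERT INTO word_index VALUES (?, ?, ?, ?)", rows)
conn.commit()
```

Parameterized `?` placeholders keep the inserted words from being interpreted as SQL.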
Iterate over all the documents and generate an index to make searching possible
- Iterate over all the files (pdf, docx, pptx): main.py
- Extract text: extract_text.py
- Remove stopwords, punctuation, and anything that is not alphanumeric: process_words.py
- Persist in SQLite table: database.py
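The first step, walking the repository for supported files, could be sketched as below. The helper name `iter_documents` and the `SUPPORTED` set are hypothetical and not taken from main.py. Text extraction itself is left out because it depends on third-party parsers.

```python
from pathlib import Path

# File types the indexer handles, per the list above.
SUPPORTED = {".pdf", ".docx", ".pptx"}


def iter_documents(repo_path):
    """Yield every pdf, docx, and pptx file under repo_path, recursively."""
    for path in sorted(Path(repo_path).rglob("*")):
        # Compare suffixes case-insensitively so "SLIDES.PPTX" is also matched.
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


if __name__ == "__main__":
    for document in iter_documents("."):
        print(document)
```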
A function that takes search keywords as input and returns a list of documents that contain those words
- Search by keyword(s)
- Search by keywords and Topic Name
- Example
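As an illustration of both search modes, here is one way such a function could be written against SQLite. Everything named here (`search_documents`, the `word_index` table, its columns, and the sample rows) is an assumption for the sketch. Requiring all keywords via `HAVING COUNT(DISTINCT word)` is one possible design, not necessarily the project's.

```python
import sqlite3


def search_documents(conn, keywords, topic=None):
    """Return (file, score) rows for files containing every keyword."""
    words = sorted({w.lower() for w in keywords.split()})
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT file, SUM(relevance) AS score FROM word_index "
        f"WHERE word IN ({placeholders})"
    )
    params = list(words)
    if topic is not None:
        # Optional topic filter: search by keywords and topic name.
        sql += " AND topic = ?"
        params.append(topic)
    # Keep only files that matched every distinct keyword.
    sql += " GROUP BY file HAVING COUNT(DISTINCT word) = ? ORDER BY score DESC, file"
    params.append(len(words))
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_index (word TEXT, topic TEXT, file TEXT, relevance INTEGER)"
    )
    conn.executemany(
        "INSERT INTO word_index VALUES (?, ?, ?, ?)",
        [
            ("binary", "Data Structures", "DS/trees.pdf", 7),
            ("tree", "Data Structures", "DS/trees.pdf", 5),
            ("binary", "Algorithms", "Algo/search.pptx", 3),
        ],
    )
    print(search_documents(conn, "binary tree"))
    print(search_documents(conn, "binary", topic="Algorithms"))
```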
- Fork repository
- Git Clone Fast Resources Link
- Replace self.repo_path = '/Users/jazib/Desktop/workrepo/FAST-Resources/' in main.py with absolute path to Fast Resources clone
- Replace self.conn = sqlite3.connect('/Users/jazib/Desktop/workrepo/FAST_Resources_Reverse_Indexer/fast_zakhira.db') in database.py with any local directory
- Push your changes
- Create Merge Request