FAST_Resources Reverse_Indexer

Abstract

A search engine for FAST-Resources, a repository of study materials. The user enters keywords as text and retrieves all files that contain them.

Pipeline

1. Extract (extract_text.py)

Uses Python libraries to extract text from each supported file type (PDF/PPT/DOC).

2. Transform (process_words.py)

i. Using the regular expressions library to strip trailing whitespace from the extracted text and converting it to lower case

ii. Using the NLTK library's predefined stop-word list to identify stop words

iii. Producing the cleansed words by filtering out fillers (stop words / punctuation / 3-letter words)
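
The Transform steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of process_words.py: the function name clean_words, the min_length parameter, and the small inline stop-word set are assumptions made so the example runs without NLTK, and it treats "3 Letter Words" as words of three letters or fewer.

```python
import re

# Assumption: a tiny stand-in stop-word set so the sketch runs on its own.
# The project itself relies on NLTK's predefined stop-word list.
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "from", "are"}


def clean_words(text, min_length=4):
    """Lower-case the text, then drop punctuation, stop words, and short words."""
    # Collapse runs of whitespace and trim both ends before lower-casing.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Keep only alphanumeric tokens, which discards punctuation.
    tokens = re.findall(r"[a-z0-9]+", text)
    # min_length=4 removes words of three letters or fewer (an assumption).
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]


print(clean_words("  The Stack and the Queue are linear data structures.  "))
```

Swapping STOP_WORDS for the NLTK list should only require replacing the set, since the filter only tests membership.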

3. Load (main.py)

i. Initializing the FAST-Resources repository and the SQLite database on the local machine

ii. Extracting text from each file, skipping any files that contain no text, and counting the term frequency of the transformed words

iii. Extracting the topic name, that is, the text in the file path before the '/' separator

iv. Creating the absolute link, that is, replacing whitespace with '%20' so the path is formatted as a URL

v. Running the driver program to perform all of the above for every file in the FAST-Resources repository
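
A rough sketch of the per-file work in the Load stage is shown below. The helper names, the BASE_URL value, and the reading of "topic name" as the first component of a repository-relative path are all assumptions for illustration; main.py may structure this differently.

```python
from collections import Counter

# Hypothetical base URL; the real project may build links from another root.
BASE_URL = "https://example.com/FAST-Resources/"


def term_frequency(words):
    """Count how often each cleansed word occurs in one file."""
    return Counter(words)


def topic_name(relative_path):
    """Assumed reading: the text before the first '/' in the relative path."""
    return relative_path.split("/", 1)[0]


def absolute_link(relative_path):
    """Replace spaces with '%20' so the path can be used as a URL."""
    return BASE_URL + relative_path.replace(" ", "%20")


path = "Data Structures/Week 1/Lecture 1.pdf"
print(topic_name(path))
print(absolute_link(path))
print(term_frequency(["stack", "queue", "stack"]))
```

Only spaces are escaped here, matching the step described above; a path with other reserved characters would need fuller percent-encoding, for example with urllib.parse.quote.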

4. Search (search.py)

i. Searching for the query (input) by splitting it into words and using each word as an index key to fetch the file paths stored in its row

ii. If only a single file is found, its keys (file path and relevance score) are fetched from the src dict (DB)

iii. If multiple files are found, their file paths are intersected and the keys of the intersection are fetched
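
The intersection idea can be illustrated with a small in-memory index. Everything here is hypothetical: the INDEX dict stands in for the database rows, and summing the per-word relevance scores to order results is an assumption, not necessarily what search.py does.

```python
# Hypothetical in-memory index: word -> {file path: relevance score}.
INDEX = {
    "stack": {"DS/lecture1.pdf": 5, "DS/lecture2.pdf": 2},
    "queue": {"DS/lecture2.pdf": 4, "OS/scheduling.pptx": 1},
}


def search(query, index=INDEX):
    """Return (file path, score) pairs for files containing every query word."""
    words = query.lower().split()
    postings = [index.get(word, {}) for word in words]
    if not postings:
        return []
    # Intersect the file paths found for each word.
    common = set(postings[0])
    for posting in postings[1:]:
        common &= set(posting)
    # Assumption: combine scores by summing them across the query words.
    scored = [(path, sum(p[path] for p in postings)) for path in common]
    # Highest score first; ties are broken by path for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


print(search("stack"))
print(search("stack queue"))
```

A word that is missing from the index yields an empty posting, so any query containing it returns no results.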

5. CI/CD (database.py --> AWS + GitHub-Actions)

i. Initializing the SQLite database with text columns (Word, Topic, File) and an integer column (Relevance score)

ii. Creating an index at runtime on the Word column and setting up Search Pattern 1, which allows searching with a Topic

iii. Creating an index at runtime on the Word column and setting up Search Pattern 2, which allows searching without a Topic

iv. Inserting each Word with its Topic, File Path, and Relevance score (the term frequency computed in main.py, used for ordering search results)

v. Using S3 as the front end, DynamoDB instead of SQLite, and GitHub Actions to trigger Lambda and API Gateway when a new file is added
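
The SQLite part of this stage might look roughly like the sketch below. The table name reverse_index, the column names, the index name, and the sample rows are assumptions for illustration, and an in-memory database is used so the example leaves no file behind.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: three text columns and one integer relevance score.
conn.execute(
    "CREATE TABLE reverse_index "
    "(word TEXT, topic TEXT, file TEXT, relevance INTEGER)"
)
# Index on the word column, which both search patterns filter on.
conn.execute("CREATE INDEX idx_word ON reverse_index (word)")

rows = [
    ("stack", "DS", "DS/lecture1.pdf", 5),
    ("stack", "DS", "DS/lecture2.pdf", 2),
    ("stack", "OS", "OS/memory.pdf", 3),
]
conn.executemany("INSERT INTO reverse_index VALUES (?, ?, ?, ?)", rows)
conn.commit()


def search_with_topic(word, topic):
    """Search Pattern 1: keyword restricted to one topic."""
    cur = conn.execute(
        "SELECT file, relevance FROM reverse_index "
        "WHERE word = ? AND topic = ? ORDER BY relevance DESC",
        (word, topic),
    )
    return cur.fetchall()


def search_without_topic(word):
    """Search Pattern 2: keyword across all topics."""
    cur = conn.execute(
        "SELECT file, relevance FROM reverse_index "
        "WHERE word = ? ORDER BY relevance DESC",
        (word,),
    )
    return cur.fetchall()


print(search_with_topic("stack", "DS"))
print(search_without_topic("stack"))
```

Parameterized queries (the ? placeholders) keep user-supplied keywords from being interpreted as SQL.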

Need to fix the documentation below and create a separate Contribution Guidelines file:

Generate Reverse Index Flow

Iterate over all the documents and generate an index to make searching possible.

Steps

1. Iterate over all the files (pdf, docx, pptx) (main.py)
2. Extract the text (extract_text.py)
3. Remove stop words, punctuation, and anything that is not alphanumeric (process_words.py)
4. Persist the result in an SQLite table (database.py)

Search Flow (In Progress)

A function that takes search keywords as input and returns a list of documents that contain those words.

Access Patterns

Search by keyword(s)
Search by keywords and Topic Name

Steps

Example

How can you contribute?

1. Fork this repository
2. Clone the FAST-Resources repository with git clone
3. Replace self.repo_path = '/Users/jazib/Desktop/workrepo/FAST-Resources/' in main.py with the absolute path to your FAST-Resources clone
4. Replace self.conn = sqlite3.connect('/Users/jazib/Desktop/workrepo/FAST_Resources_Reverse_Indexer/fast_zakhira.db') in database.py with a path in any local directory
5. Push your changes
6. Create a merge request

mehdirazajaffri/FAST_Resources_Reverse_Indexing
