Pdf parser #49

Open
wants to merge 28 commits into
base: main
52f1777
Initial write up of the project objectives
sarmakdvsr Oct 20, 2021
c336438
update readme
Nov 9, 2021
37a5115
Updated proposal
sarmakdvsr Nov 10, 2021
eeed1b5
Merge pull request #1 from dkrovi2/update_readme
dkrovi2 Nov 10, 2021
6091d2d
architecture diagram
sarmakdvsr Nov 10, 2021
4c28cfb
architecture diagram
sarmakdvsr Nov 10, 2021
45c5f46
basic parsing-engine codebase setup
sarmakdvsr Nov 10, 2021
82fd6e7
progress-report first cut
sarmakdvsr Nov 13, 2021
afcf520
progress report
sarmakdvsr Nov 13, 2021
2326afe
move docs into doc director
sarmakdvsr Nov 13, 2021
3b62e75
Intermediate commit
Nov 13, 2021
e009a59
Move progress report to top-level
sarmakdvsr Nov 14, 2021
fbf8de3
Basic search implementation
Nov 15, 2021
6d5793c
sample input change
Nov 19, 2021
49fac21
Merge branch 'main' into Doc_Scoring
Nov 19, 2021
1a87078
word parsing
Nov 19, 2021
6e3c26f
Merge pull request #3 from dkrovi2/word_parsing
sidmeister Nov 19, 2021
5c71d61
ignore out directory
sarmakdvsr Nov 20, 2021
03bfcb3
ignore build artifact dirs. Add sample inputs
sarmakdvsr Nov 20, 2021
42eca39
PDF parsing added and some code refactoring
sarmakdvsr Nov 21, 2021
19eb9e1
Scoring implementation
Nov 27, 2021
45df681
Merge branch 'main' into Doc_Scoring
Nov 27, 2021
f71b97a
Refined scoring implementation
Nov 27, 2021
7b92d95
Merge pull request #2 from dkrovi2/Doc_Scoring
saxenaj Nov 27, 2021
a5eef23
Sample json file
Nov 27, 2021
3c06bdb
Merge pull request #4 from dkrovi2/Doc_Scoring
saxenaj Nov 27, 2021
64fb859
Merge remote-tracking branch 'origin/main' into pdf-parser
sarmakdvsr Nov 27, 2021
9ff7c38
pdf and doc parsing -> extract skills
sarmakdvsr Nov 28, 2021
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.idea/
out
build
target
54 changes: 52 additions & 2 deletions README.md
@@ -1,3 +1,53 @@
# CourseProject
# Build Experience Profile from Resumes

Please fork this repository and paste the github link of your fork on Microsoft CMT. Detailed instructions are on Coursera under Week 1: Course Project Overview/Week 9 Activities.

1. **What are the names and NetIDs of all your team members? Who is the captain? The captain will have more administrative duties than team members.**

* alokk3@illinois.edu
* dkrovi2@illinois.edu
* jsaxena3@illinois.edu
* rathi9@illinois.edu

2. **What is your free topic? Please give a detailed description. What is the task? Why is it important or interesting? What is your planned approach? What tools, systems or datasets are involved? What is the expected outcome? How are you going to evaluate your work?**

In this project, we use text extraction and retrieval for the following functions:

* Parse resumes in doc and pdf format
* Parse job descriptions in doc and pdf format
* Build an analysis engine to extract experience details of a candidate on various tools and technologies
* Rank the available set of resumes based on the skill set specified in the job description

The keyword-based search used by many job sites is often inaccurate, because it ignores the correlation between a skill and the experience behind it.

For example, for the skill ‘Spark’, instead of just searching for the keyword ‘Spark’ in the resume, we want to know (for scoring purposes):
- whether the candidate worked with Spark for X number of years, and
- whether the candidate has Spark experience across multiple organizations.
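
As a rough illustration of the first question above, the years of experience attached to a skill could be pulled out of plain resume text with a pattern match. This is only a sketch under assumptions: the class name `ExperienceExtractor`, the method `yearsOfExperience`, and the regular expression are hypothetical and are not taken from the project code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExperienceExtractor {

    /**
     * Returns the largest number of years mentioned right before the skill,
     * e.g. "5 years of Spark" or "3+ yrs Spark". Returns 0 if nothing matches.
     * Assumption: the skill name ends in a letter or digit (so "C++" would
     * need a different trailing boundary).
     */
    public static int yearsOfExperience(String resumeText, String skill) {
        // Hypothetical pattern: a 1-2 digit number, optional '+', "year(s)" or
        // "yr(s)", up to three filler words, then the skill as a whole word.
        Pattern p = Pattern.compile(
            "\\b(\\d{1,2})\\+?\\s*(?:years?|yrs?)\\s+(?:\\w+\\s+){0,3}?"
                + Pattern.quote(skill) + "\\b",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(resumeText);
        int best = 0;
        while (m.find()) {
            best = Math.max(best, Integer.parseInt(m.group(1)));
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "Acme Corp: 3 years of experience with Spark. "
                    + "Globex: 2 yrs Spark and Kafka.";
        System.out.println(yearsOfExperience(text, "Spark")); // prints 3
    }
}
```

Taking the maximum rather than the sum is a simplification; a real analysis engine would more likely derive durations from the employment dates of each position.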

We then compute a score for each resume based on the skill set mentioned in the query and rank the resumes by score, from highest to lowest.
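
A minimal sketch of this score-and-rank step is shown here. The proposal does not specify the scoring formula, so the linear weighting, the weights themselves, and all names (`ResumeRanker`, `score`, `rank`) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ResumeRanker {

    // Assumed weights; the proposal does not define them.
    static final double YEARS_WEIGHT = 1.0;
    static final double ORG_WEIGHT = 0.5;

    /**
     * Scores one resume. The evidence map goes from skill name to
     * {years of experience, number of organizations}.
     */
    public static double score(Map<String, int[]> evidence, List<String> requiredSkills) {
        double total = 0.0;
        for (String skill : requiredSkills) {
            int[] e = evidence.get(skill);
            if (e == null) {
                continue; // skill not found in this resume
            }
            total += YEARS_WEIGHT * e[0] + ORG_WEIGHT * e[1];
        }
        return total;
    }

    /** Returns resume names ordered from highest to lowest score. */
    public static List<String> rank(Map<String, Map<String, int[]>> resumes,
                                    List<String> requiredSkills) {
        List<String> names = new ArrayList<>(resumes.keySet());
        names.sort(Comparator.comparingDouble(
            (String n) -> score(resumes.get(n), requiredSkills)).reversed());
        return names;
    }

    public static void main(String[] args) {
        Map<String, int[]> alice = new HashMap<>();
        alice.put("Spark", new int[] {5, 2});   // 5 years, 2 organizations
        Map<String, int[]> bob = new HashMap<>();
        bob.put("Spark", new int[] {1, 1});
        bob.put("Kafka", new int[] {3, 1});     // not requested, so ignored

        Map<String, Map<String, int[]>> resumes = new HashMap<>();
        resumes.put("alice.pdf", alice);
        resumes.put("bob.docx", bob);

        System.out.println(rank(resumes, Arrays.asList("Spark")));
        // prints [alice.pdf, bob.docx]
    }
}
```

Only skills named in the query contribute to the score, which is why the second resume's Kafka experience does not help it when the job description asks for Spark alone.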

3. **Which programming language do you plan to use?**

We will use standard text retrieval tools and programming APIs (MeTA, Python, NumPy, etc.) with a customized algorithm to score each resume.

4. **Please justify that the workload of your topic is at least 20 \* N hours, N being the total number of students in your team. You may list the main tasks to be completed, and the estimated time cost for each task.**

The following are the steps and key milestones for this project:

| Task | Time needed | ETA |
|:--------------------------------------------------------------|-------------:|----------------:|
| Gather representative data set for training and evaluation | 8 hours | Nov 8 |
| Parsing engine to parse resumes and job descriptions | 20 hours | Nov 15 |
| Progress report | 2 hours | Nov 15 |
| Analysis engine to analyze resumes | 30 hours | Nov 22 |
| Scoring engine to match resumes to provided job description | 30 hours | Nov 29 |
| Basic UI to search for resumes matching a job description | 24 hours | Dec 5 |
| Software documentation | 8 hours | Dec 9 |
| **Total** |**122 hours** | |


# Contributors

* alokk3@illinois.edu
* dkrovi2@illinois.edu
* jsaxena3@illinois.edu
* rathi9@illinois.edu
6 changes: 6 additions & 0 deletions code/parsing-engine/.gitattributes
@@ -0,0 +1,6 @@
#
# https://help.github.com/articles/dealing-with-line-endings/
#
# These are explicitly windows files and should use crlf
*.bat text eol=crlf

5 changes: 5 additions & 0 deletions code/parsing-engine/.gitignore
@@ -0,0 +1,5 @@
# Ignore Gradle project-specific cache directory
.gradle

# Ignore Gradle build output directory
build
66 changes: 66 additions & 0 deletions code/parsing-engine/build.gradle
@@ -0,0 +1,66 @@
/*
* This file was generated by the Gradle 'init' task.
*
* This generated file contains a sample Java application project to get you started.
* For more details take a look at the 'Building Java & JVM projects' chapter in the Gradle
* User Manual available at https://docs.gradle.org/6.8.2/userguide/building_java_projects.html
*/

plugins {
    // Apply the application plugin to add support for building a CLI application in Java.
    id 'application'
    // Shadow plugin: builds a single runnable JAR that bundles all dependencies.
    id 'com.github.johnrengelman.shadow' version '6.1.0'
}

repositories {
    mavenCentral()
}

dependencies {

    // General-purpose utilities
    implementation 'org.apache.commons:commons-lang3:3.12.0'
    implementation 'commons-lang:commons-lang:2.6'
    implementation 'commons-io:commons-io:2.11.0'

    // PDF Parsing
    implementation 'org.apache.pdfbox:pdfbox:2.0.24'

    // Indexing and search (Lucene), plus JSON handling
    implementation group: 'org.apache.lucene', name: 'lucene-core', version: '8.1.0'
    implementation group: 'org.apache.lucene', name: 'lucene-queryparser', version: '8.1.0'
    implementation group: 'org.apache.lucene', name: 'lucene-analyzers-common', version: '8.1.0'
    implementation group: 'com.googlecode.json-simple', name: 'json-simple', version: '1.1.1'


    // NLP (Stanford CoreNLP), JSON binding (Jackson), and Word document parsing (Apache POI)
    implementation 'edu.stanford.nlp:stanford-corenlp:4.3.1'
    implementation 'edu.stanford.nlp:stanford-corenlp:4.3.1:models'
    implementation 'com.fasterxml.jackson.core:jackson-databind:2.9.2'
    implementation 'org.apache.poi:poi-ooxml:4.1.2'

    // Use JUnit Jupiter (API and engine) for testing.
    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.6.2'
    testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.6.2'

    // Lombok.
    compileOnly 'org.projectlombok:lombok:1.18.22'
    annotationProcessor 'org.projectlombok:lombok:1.18.22'
    testCompileOnly 'org.projectlombok:lombok:1.18.22'
    testAnnotationProcessor 'org.projectlombok:lombok:1.18.22'

    // Logback
    implementation 'ch.qos.logback:logback-classic:1.2.7'
}

application {
    // Define the main class for the application.
    mainClass = 'edu.illinois.phantom.Main'
}

tasks.named('test') {
    // Use junit platform for unit tests.
    useJUnitPlatform()
}

// Legacy property, presumably kept because Shadow plugin 6.x reads mainClassName
// rather than application.mainClass when writing the JAR manifest.
mainClassName = 'edu.illinois.phantom.Main'
// Build the bundled JAR as part of every regular build.
build.dependsOn shadowJar