tweet cleanup (with potential performance boost) #22
Open: nielstiben wants to merge 16 commits into main from score_improments
Commits:
3017f20 tweet cleaning (nielstiben)
252a948 Trigger notification
2852f89 Merge branch 'main' into score_improments (nielstiben)
05e0062 Merge remote-tracking branch 'origin/score_improments' into score_imp…
e5082e1 Add entity to wandb.
1a64819 model configs.
48ff83f fix, mistakes in configuration
cfe64fe Update default model (denisramiros)
f704483 Merge branch 'score_improments' of https://github.com/nielstiben/MLOP… (denisramiros)
51cf6b5 Merge branch 'main' into score_improments (denisramiros)
e43a08f Merge branch 'main' into score_improments (nielstiben)
7f35896 New model (nielstiben)
5f07c2d add skikit-learn requirement (nielstiben)
3078410 docker fix (nielstiben)
ab47c56 docker fix (nielstiben)
4fdd7af docker fix (nielstiben)
@@ -1,4 +1,4 @@
 [flake8]
 exclude = venv
-ignore = W503 #line break occurred before binary operation
+ignore = W503,W605 #line break occurred before binary operation
 max-line-length = 100
@@ -1,7 +1,4 @@
-lr: 6e-6
-eps: 1e-8
-# model: 'bert'
-# pretrained-model: 'bert-large-uncased'
-model: 'distilbert'
+model: 'bert'
+#model: 'distilbert'
 pretrained-model: 'distilbert-base-uncased'
 num_labels: 2
@@ -1,6 +1,5 @@
-optimizer: Adam
-lr: 0.001
-batch_size: 8
-scheduler:
-  name: ExponentialLR
-  gamma: 0.1
+optimizer: AdamW
+lr: 6e-6
+eps: 1e-8
+batch_size: 16
+epochs: 5
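
Some YAML loaders read exponent-only literals such as `6e-6` as strings rather than floats (PyYAML does, because its float pattern requires a decimal point), so the optimizer may end up receiving a string for `lr` or `eps`. Whether that applies here depends on how the project loads its config, which the diff does not show. The following is a minimal sketch, assuming the config arrives as a plain dict; `coerce_floats` and its `keys` argument are illustrative names, not part of the PR.

```python
def coerce_floats(config, keys=("lr", "eps")):
    """Return a copy of config with the named values converted to float."""
    fixed = dict(config)
    for key in keys:
        if key in fixed:
            # float() accepts numbers as well as strings like "6e-6";
            # a malformed value such as "1e-8," raises ValueError right away.
            fixed[key] = float(fixed[key])
    return fixed


# Hypothetical loaded config in which lr came back as a string.
raw = {"optimizer": "AdamW", "lr": "6e-6", "eps": 1e-8, "batch_size": 16}
print(coerce_floats(raw))
```

Converting at load time means a bad value fails where the config is read, not later inside the optimizer.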
@@ -10,3 +10,4 @@ torch==1.10.1
 transformers==4.15.0
 google-cloud-secret-manager==2.5.0
 wandb==0.12.9
+nltk==3.6.7
@@ -0,0 +1,305 @@
import re

import nltk

replacement_patterns = [
    (r"won\'t", "will not"),
    (r"can\'t", "cannot"),
    (r"i\'m", "i am"),
    (r"ain\'t", "is not"),
    (r"(\w+)\'ll", "\g<1> will"),
    (r"(\w+)n\'t", "\g<1> not"),
    (r"(\w+)\'ve", "\g<1> have"),
    (r"(\w+)\'s", "\g<1> is"),
    (r"(\w+)\'re", "\g<1> are"),
    (r"(\w+)\'d", "\g<1> would"),
]
abbreviations = {
    "$": " dollar ",
    "€": " euro ",
    "4ao": "for adults only",
    "a.m": "before midday",
    "a3": "anytime anywhere anyplace",
    "aamof": "as a matter of fact",
    "acct": "account",
    "adih": "another day in hell",
    "afaic": "as far as i am concerned",
    "afaict": "as far as i can tell",
    "afaik": "as far as i know",
    "afair": "as far as i remember",
    "afk": "away from keyboard",
    "app": "application",
    "approx": "approximately",
    "apps": "applications",
    "asap": "as soon as possible",
    "asl": "age, sex, location",
    "atk": "at the keyboard",
    "ave.": "avenue",
    "aymm": "are you my mother",
    "ayor": "at your own risk",
    "b&b": "bed and breakfast",
    "b+b": "bed and breakfast",
    "b.c": "before christ",
    "b2b": "business to business",
    "b2c": "business to customer",
    "b4": "before",
    "b4n": "bye for now",
    "b@u": "back at you",
    "bae": "before anyone else",
    "bak": "back at keyboard",
    "bbbg": "bye bye be good",
    "bbc": "british broadcasting corporation",
    "bbias": "be back in a second",
    "bbl": "be back later",
    "bbs": "be back soon",
    "be4": "before",
    "bfn": "bye for now",
    "blvd": "boulevard",
    "bout": "about",
    "brb": "be right back",
    "bros": "brothers",
    "brt": "be right there",
    "bsaaw": "big smile and a wink",
    "btw": "by the way",
    "bwl": "bursting with laughter",
    "c/o": "care of",
    "cet": "central european time",
    "cf": "compare",
    "cia": "central intelligence agency",
    "csl": "can not stop laughing",
    "cu": "see you",
    "cul8r": "see you later",
    "cv": "curriculum vitae",
    "cwot": "complete waste of time",
    "cya": "see you",
    "cyt": "see you tomorrow",
    "dae": "does anyone else",
    "dbmib": "do not bother me i am busy",
    "diy": "do it yourself",
    "dm": "direct message",
    "dwh": "during work hours",
    "e123": "easy as one two three",
    "eet": "eastern european time",
    "eg": "example",
    "embm": "early morning business meeting",
    "encl": "enclosed",
    "encl.": "enclosed",
    "etc": "and so on",
    "faq": "frequently asked questions",
    "fawc": "for anyone who cares",
    "fb": "facebook",
    "fc": "fingers crossed",
    "fig": "figure",
    "fimh": "forever in my heart",
    "ft.": "feet",
    "ft": "featuring",
    "ftl": "for the loss",
    "ftw": "for the win",
    "fwiw": "for what it is worth",
    "fyi": "for your information",
    "g9": "genius",
    "gahoy": "get a hold of yourself",
    "gal": "get a life",
    "gcse": "general certificate of secondary education",
    "gfn": "gone for now",
    "gg": "good game",
    "gl": "good luck",
    "glhf": "good luck have fun",
    "gmt": "greenwich mean time",
    "gmta": "great minds think alike",
    "gn": "good night",
    "g.o.a.t": "greatest of all time",
    "goat": "greatest of all time",
    "goi": "get over it",
    "gps": "global positioning system",
    "gr8": "great",
    "gratz": "congratulations",
    "gyal": "girl",
    "h&c": "hot and cold",
    "hp": "horsepower",
    "hr": "hour",
    "hrh": "his royal highness",
    "ht": "height",
    "ibrb": "i will be right back",
    "ic": "i see",
    "icq": "i seek you",
    "icymi": "in case you missed it",
    "idc": "i do not care",
    "idgadf": "i do not give a damn fuck",
    "idgaf": "i do not give a fuck",
    "idk": "i do not know",
    "ie": "that is",
    "i.e": "that is",
    "ifyp": "i feel your pain",
    "ig": "instagram",
    "iirc": "if i remember correctly",
    "ilu": "i love you",
    "ily": "i love you",
    "imho": "in my humble opinion",
    "imo": "in my opinion",
    "imu": "i miss you",
    "iow": "in other words",
    "irl": "in real life",
    "j4f": "just for fun",
    "jic": "just in case",
    "jk": "just kidding",
    "jsyk": "just so you know",
    "l8r": "later",
    "lb": "pound",
    "lbs": "pounds",
    "ldr": "long distance relationship",
    "lmao": "laugh my ass off",
    "lmfao": "laugh my fucking ass off",
    "lol": "laughing out loud",
    "ltd": "limited",
    "ltns": "long time no see",
    "m8": "mate",
    "mf": "motherfucker",
    "mfs": "motherfuckers",
    "mfw": "my face when",
    "mofo": "motherfucker",
    "mph": "miles per hour",
    "mr": "mister",
    "mrw": "my reaction when",
    "ms": "miss",
    "mte": "my thoughts exactly",
    "nagi": "not a good idea",
    "nbc": "national broadcasting company",
    "nbd": "not a big deal",
    "nfs": "not for sale",
    "ngl": "not going to lie",
    "nhs": "national health service",
    "nrn": "no reply necessary",
    "nsfl": "not safe for life",
    "nsfw": "not safe for work",
    "nth": "nice to have",
    "nvr": "never",
    "nyc": "new york city",
    "oc": "original content",
    "og": "original",
    "ohp": "overhead projector",
    "oic": "oh i see",
    "omdb": "over my dead body",
    "omg": "oh my god",
    "omw": "on my way",
    "p.a": "per annum",
    "p.m": "after midday",
    "pm": "prime minister",
    "poc": "people of color",
    "pov": "point of view",
    "pp": "pages",
    "ppl": "people",
    "prw": "parents are watching",
    "ps": "postscript",
    "pt": "point",
    "ptb": "please text back",
    "pto": "please turn over",
    "qpsa": "what happens",  # "que pasa"
    "ratchet": "rude",
    "rbtl": "read between the lines",
    "rlrt": "real life retweet",
    "rofl": "rolling on the floor laughing",
    "roflol": "rolling on the floor laughing out loud",
    "rotflmao": "rolling on the floor laughing my ass off",
    "rt": "retweet",
    "ruok": "are you ok",
    "sfw": "safe for work",
    "sk8": "skate",
    "smh": "shake my head",
    "sq": "square",
    "srsly": "seriously",
    "ssdd": "same stuff different day",
    "tbh": "to be honest",
    "tbs": "tablespoonful",
    "tbsp": "tablespoonful",
    "tfw": "that feeling when",
    "thks": "thank you",
    "tho": "though",
    "thx": "thank you",
    "tia": "thanks in advance",
    "til": "today i learned",
    "tl;dr": "too long i did not read",
    "tldr": "too long i did not read",
    "tmb": "tweet me back",
    "tntl": "trying not to laugh",
    "ttyl": "talk to you later",
    "u": "you",
    "u2": "you too",
    "u4e": "yours for ever",
    "utc": "coordinated universal time",
    "w/": "with",
    "w/o": "without",
    "w8": "wait",
    "wassup": "what is up",
    "wb": "welcome back",
    "wtf": "what the fuck",
    "wtg": "way to go",
    "wtpa": "where the party at",
    "wuf": "where are you from",
    "wuzup": "what is up",
    "wywh": "wish you were here",
    "yd": "yard",
    "ygtr": "you got that right",
    "ynk": "you never know",
    "zzz": "sleeping bored and tired",
}


class RegexpReplacer(object):
    # Applies a list of (regex, replacement) pairs to a text, in order.
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text

        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s


def convert_abbrev(word):
    # Case-insensitive lookup; unknown words are returned unchanged.
    return abbreviations.get(word.lower(), word)


def clean_tweet(text: str) -> str:
    # remove urls
    # text = df.apply(lambda x: re.sub(r'http\S+', '', x))
    text = re.sub(r"http\S+", "", text)

    # replace contractions
    replacer = RegexpReplacer()
    text = replacer.replace(text)

    # split words on word boundaries and hyphens
    text = re.sub(r"\b", " ", text)
    text = re.sub(r"-", " ", text)
    # todo: replace negations with antonyms

    # nltk.download('punkt')
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    tokens = tokenizer.tokenize(text)

    # Replace abbreviations
    tokens = [convert_abbrev(word) for word in tokens]

    # todo: spelling correction
    # replacer = SpellingReplacer()
    # tokens = [replacer.replace(t) for t in tokens]

    # lemmatize (requires the WordNet corpus: nltk.download('wordnet'))
    wnl = nltk.WordNetLemmatizer()
    tokens = [wnl.lemmatize(t) for t in tokens]

    # todo: stemming conflicts with our tokenizer (Bert)
    # porter = nltk.PorterStemmer()
    # tokens = [porter.stem(t) for t in tokens]
    # todo: filter insignificant words (using fastai)
    # todo: swap word phrases

    text = " ".join(tokens)
    return text


def clean_tweet_list(tweet_list: list[str]) -> list[str]:
    return list(map(clean_tweet, tweet_list))
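
Because `clean_tweet` tokenizes with `\w+` before it looks up abbreviations, keys that contain punctuation (for example "$", "w/" or "tl;dr") can never equal a token. The sketch below illustrates this with the standard library only. `ABBREVIATIONS` is a small hypothetical subset of the table above, `unreachable_keys` is an illustrative helper, and `re.findall(r"\w+", ...)` stands in for `nltk.RegexpTokenizer(r"\w+")`, which is assumed to yield the same tokens.

```python
import re

# Hypothetical subset of the abbreviation table, for illustration only.
ABBREVIATIONS = {
    "b4": "before",
    "w/": "with",
    "tl;dr": "too long i did not read",
    "idk": "i do not know",
}


def tokenize(text):
    # Keep only runs of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)


def convert_abbrev(word):
    # Case-insensitive lookup that falls back to the original token.
    return ABBREVIATIONS.get(word.lower(), word)


def unreachable_keys(table):
    # A key can only match a token if tokenizing the key returns it unchanged.
    return sorted(key for key in table if tokenize(key) != [key])


tokens = [convert_abbrev(t) for t in tokenize("IDK, see u b4 lunch w/ tl;dr notes")]
print(" ".join(tokens))
print(unreachable_keys(ABBREVIATIONS))
```

One possible remedy, not something the PR does, would be to substitute the punctuated abbreviations on the raw text before tokenizing.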
otherwise we get complaints about our regexes in replacement_patterns
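
The complaints in question are flake8's W605 (invalid escape sequence), triggered by replacement strings such as "\g<1> will" that are written as ordinary string literals. A minimal sketch, assuming the project is free to edit `replacement_patterns`: writing the replacements as raw strings keeps the behavior the same and would make the W605 exemption unnecessary. The names `PATTERNS` and `expand` are illustrative and do not appear in the PR, and only a few of the patterns are shown.

```python
import re

# Pattern and replacement are both raw strings, so "\g<1>" reaches the regex
# engine verbatim and there is no invalid escape sequence for flake8 to flag.
PATTERNS = [
    (re.compile(r"won't"), "will not"),
    (re.compile(r"can't"), "cannot"),
    (re.compile(r"(\w+)'ll"), r"\g<1> will"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
]


def expand(text):
    # Apply each contraction pattern in order; specific cases come first.
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text


print(expand("they'll say it won't work, but it doesn't matter"))
```

The apostrophe needs no backslash inside a regex, so `r"won't"` matches the same text as `r"won\'t"`.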