✂️ hashformers

Hashtag segmentation is the task of automatically adding spaces between the words in a hashtag.

Hashformers is the current state of the art for hashtag segmentation, as demonstrated in this paper accepted at LREC 2022.

Hashformers is also language-agnostic: you can segment hashtags not only with English models, but with any language model available on the Hugging Face Model Hub.

✂️ Segment hashtags on Hugging Face Spaces

✂️ Get started - Google Colab tutorial

✂️ Read the Docs

Basic usage

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental",
    reranker_model_name_or_path="google/flan-t5-base",
    reranker_model_type="seq2seq"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)

# [ 'we need a national park',
# 'ice cold' ]
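
In the example above, segment takes a list of hashtags and returns one segmentation per input. A minimal sketch of pairing results back with their hashtags follows. It hard-codes the output shown above instead of calling the library, so it runs without downloading any model. It assumes the segmentations come back in the same order as the inputs, and the variable names are illustrative only.

```python
# Inputs and outputs copied from the basic usage example above.
# Assumption: segmentations are returned in the same order as the inputs.
hashtags = ["#weneedanationalpark", "#icecold"]
segmentations = ["we need a national park", "ice cold"]

# Map each original hashtag to its segmented form.
segmented_by_hashtag = dict(zip(hashtags, segmentations))

for hashtag, segmentation in segmented_by_hashtag.items():
    print(f"{hashtag} -> {segmentation}")
```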

To use hashformers without a reranker, set both reranker_model_name_or_path and reranker_model_type to None.

Installation

pip install hashformers

Important: Hashformers is designed to work with Python 3.10.12, the version currently used on Google Colab.
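
If you run hashformers outside Colab, a quick interpreter check can catch a version mismatch early. The sketch below is not part of the library: the helper name is hypothetical, and comparing only the major and minor version (3.10) rather than the exact patch release is an assumption.

```python
import sys

# Version taken from the note above. Comparing only major.minor is an
# assumption; the note names the exact release 3.10.12.
SUPPORTED_PYTHON = (3, 10)


def is_supported_python(version_info=None):
    """Return True when the major.minor version matches the supported one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == SUPPORTED_PYTHON


if not is_supported_python():
    print(
        "Warning: hashformers is designed for Python 3.10.12; "
        f"you are running {sys.version.split()[0]}."
    )
```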

What models can I use?

Visit the Hugging Face Model Hub and choose your models for the WordSegmenter class.

You can use any model supported by the minicons library. Currently hashformers supports the following model types as the segmenter_model_type or reranker_model_type:

`incremental`

Auto-regressive models like GPT-2 and XLNet, or any model that can be loaded with AutoModelForCausalLM. This includes large language models (LLMs) such as Alpaca-LoRA (chainyo/alpaca-lora-7b) and GPT-J (EleutherAI/gpt-j-6b).

ws = WordSegmenter(
    segmenter_model_name_or_path="EleutherAI/gpt-j-6b",
    segmenter_model_type="incremental",
    reranker_model_name_or_path=None,
    reranker_model_type=None
)

`masked`

Masked language models like BERT, or any model that can be loaded with AutoModelForMaskedLM.

`seq2seq`

Seq2Seq models like FLAN-T5 (google/flan-t5-base), or any model that can be loaded with AutoModelForSeq2SeqLM.

Best results are usually achieved by using an incremental model as the segmenter_model_name_or_path and a masked or seq2seq model as the reranker_model_name_or_path.

A segmenter is always required; a reranker is optional.
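
The rules above can be checked before any model is loaded. The following is a minimal sketch, not part of hashformers: build_segmenter_kwargs is a hypothetical helper that validates the model types listed in this section and returns keyword arguments using the parameter names from the basic usage example. Requiring the two reranker arguments to be set together is an assumption, and bert-base-uncased is only an illustrative choice of masked model.

```python
# The three model types listed in this section.
VALID_MODEL_TYPES = {"incremental", "masked", "seq2seq"}


def build_segmenter_kwargs(segmenter_name, segmenter_type,
                           reranker_name=None, reranker_type=None):
    """Validate model choices and return keyword arguments for WordSegmenter.

    Hypothetical helper for illustration; it does not load any model.
    """
    if not segmenter_name:
        raise ValueError("A segmenter model is always required.")
    if segmenter_type not in VALID_MODEL_TYPES:
        raise ValueError(f"Unknown segmenter model type: {segmenter_type!r}")
    # Assumption: the reranker name and type are either both set or both None.
    if (reranker_name is None) != (reranker_type is None):
        raise ValueError("Set both reranker arguments, or leave both as None.")
    if reranker_type is not None and reranker_type not in VALID_MODEL_TYPES:
        raise ValueError(f"Unknown reranker model type: {reranker_type!r}")
    return {
        "segmenter_model_name_or_path": segmenter_name,
        "segmenter_model_type": segmenter_type,
        "reranker_model_name_or_path": reranker_name,
        "reranker_model_type": reranker_type,
    }


# An incremental segmenter with a masked reranker.
kwargs = build_segmenter_kwargs(
    "gpt2", "incremental", "bert-base-uncased", "masked"
)
print(kwargs)
```

With hashformers installed, the resulting dictionary could then be passed on as WordSegmenter(**kwargs).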

Contributing

Pull requests are welcome! Read our paper for more details on the inner workings of our framework.

If you want to develop the library, you can install hashformers directly from this repository (or your fork):

git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .

Relevant Papers

This is a collection of papers that have used the hashformers library in their research.

hashformers v1.3

These papers used hashformers version 1.3 or below.

Blog Posts

15 Datasets for Word Segmentation on the Hugging Face Hub

Citation

@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, 
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
