Bucket argument in fasttext not working as expected? #1765

Closed
@saroufimc1

Description

Hi, this concerns the native FastText implementation in gensim.

My understanding of the hashing trick is that if bucket is smaller than the total number of subwords, there will be collisions and some subwords will be mapped to the same integer. Am I wrong?
However, that is not what I see on a toy example:

import gensim
from gensim.models.fasttext import FastText

# Two tiny "sentences"; with min_count=1 every word is kept in the vocabulary.
sent = [['lol', 'dds', 'sdsf'], ['anticonsti']]
model = FastText(min_count=1, bucket=20)
model.build_vocab(sentences=sent)
model.train(sentences=sent, epochs=1, report_delay=1.0)

# Inspect the ngram -> integer mapping.
model.wv.ngrams

Expected Results

A dictionary mapping each ngram to an integer between 0 and 19 (bucket = 20).

Actual Results

A dictionary mapping each ngram to an integer between 0 and 55 (there are 56 ngrams here).
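
For reference, here is a minimal sketch of what I expect the hashing trick to do on this toy vocabulary. It rests on assumptions: ngram lengths 3 to 6 with `<` and `>` as word boundary markers, and a plain 32-bit FNV-1a hash used only as a stand-in. `char_ngrams` and `fnv1a_32` are illustrative helpers, not gensim functions, and the hash gensim actually uses may differ.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character ngrams of '<word>' for lengths min_n..max_n (assumed defaults)."""
    extended = "<" + word + ">"
    return [
        extended[i:i + n]
        for n in range(min_n, min(len(extended), max_n) + 1)
        for i in range(len(extended) - n + 1)
    ]


def fnv1a_32(text):
    """Plain 32-bit FNV-1a hash; an illustrative stand-in, not gensim's own hash."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


sentences = [["lol", "dds", "sdsf"], ["anticonsti"]]
bucket = 20

# Collect the distinct ngrams of every word in the toy corpus.
ngrams = sorted({ng for sent in sentences for word in sent for ng in char_ngrams(word)})

# Hashing trick: each ngram is mapped to hash(ngram) % bucket.
bucket_of = {ng: fnv1a_32(ng) % bucket for ng in ngrams}

print(len(ngrams))                    # number of distinct ngrams
print(len(set(bucket_of.values())))   # number of distinct indices, at most `bucket`
```

With more distinct ngrams than buckets, at least two ngrams must share an index, which is the collision behaviour I expected to see in model.wv.ngrams.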

Versions

import platform; print(platform.platform())
Windows-10-10.0.14393-SP0
import sys; print("Python", sys.version)
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
import gensim; print("gensim", gensim.__version__)
gensim 3.1.0
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0
