This repository has been archived by the owner on Nov 26, 2024. It is now read-only.

Fastext upstream final #2

Merged
merged 11 commits into from
May 22, 2024

@sburman sburman commented May 22, 2024

The upstream repo is now readonly, so this updates us to the final commit.

Celebio and others added 11 commits April 17, 2023 03:23
Summary: Replace outdated url in the scripts

Reviewed By: piotr-bojanowski

Differential Revision: D43464784

fbshipit-source-id: 51a98a9ad5a0939acd0d578126290909a613938b
Summary:
[Word vectors](https://huggingface.co/facebook/fasttext-en-vectors) for 157 languages and the [language identification model](https://huggingface.co/facebook/fasttext-language-identification) are now hosted on the Hugging Face Hub. (cc ajoulin)

A newer language identification model, [referenced in the NLLB project](https://github.com/facebookresearch/fairseq/blob/nllb/README.md#lid-model), is not mentioned on the official website, so I updated the docs accordingly.

Pull Request resolved: facebookresearch#1335

Reviewed By: Celebio

Differential Revision: D46507563

Pulled By: jmp84

fbshipit-source-id: 64883a6829c68b968acd980ba77a712b8e7a1365
Summary:
fbcode is migrating to LLVM-15 for safer and more up-to-date code and new compiler features. All contbuilds in your directory have passed our build test with LLVM-15, and your directory does not host any packages. This diff will migrate it to LLVM-15.

If you approve of this diff, please use the "Accept & Ship" button. If you have a reason for why it should not build with LLVM 15, please make a comment and send it back to author. Otherwise we will land this on Thursday 06/15/2023.

See the [FAQ post](https://fb.workplace.com/groups/llvm15platform010/posts/749154386769776/)! Please also direct any questions to [this group](https://fb.workplace.com/groups/llvm15platform010).

Reviewed By: meyering

Differential Revision: D46661531

fbshipit-source-id: 7278fbfcadec2392c94efd6deb710bdd5e9280f8
…cs.py

Summary: Python3 makes the use of `(object)` in class inheritance unnecessary. Let's modernize our code by eliminating this.

Reviewed By: itamaro

Differential Revision: D48673901

fbshipit-source-id: 3e0ef05efe886b32a07bb58bd0725fa2ec934c14
Reviewed By: r-barnes

Differential Revision: D49677606

fbshipit-source-id: ec5b375177586c76ecccb83a29b562bc6e9961f6
Summary:
Adds pyproject.toml to comply with PEP 518, which fixes building the library with Poetry (see python-poetry/poetry#6113). This is a copy of facebookresearch#1270, but I have signed the CLA.

Pull Request resolved: facebookresearch#1292

Differential Revision: D51601444

Pulled By: alexkosau

fbshipit-source-id: 357d702281ca3519c3640483eba04d124d0744b4
…1340)

Summary:
Due to [header dependency changes](https://gcc.gnu.org/gcc-13/porting_to.html#header-dep-changes) in GCC 13, we need to include the `<cstdint>` header explicitly.
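
A minimal illustration of the kind of code affected, assuming a source file that uses fixed-width integer types. The `fnv1a` helper is hypothetical and only serves to exercise those types; it is not taken from this change.

```cpp
// Fixed-width types such as std::uint32_t are declared in <cstdint>.
// Older libstdc++ releases often pulled this header in transitively through
// other standard headers; GCC 13 no longer does so in several cases, so the
// include has to be written out.
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only): 32-bit FNV-1a hash of a string.
std::uint32_t fnv1a(const std::string& s) {
  std::uint32_t h = 2166136261u;  // FNV offset basis
  for (char c : s) {
    h ^= static_cast<std::uint32_t>(static_cast<std::uint8_t>(c));
    h *= 16777619u;  // FNV prime
  }
  return h;
}
```

Without the explicit `#include <cstdint>`, a file like this may fail to compile under GCC 13 because `std::uint32_t` is not declared.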

Pull Request resolved: facebookresearch#1340

Reviewed By: jmp84

Differential Revision: D51602433

Pulled By: alexkosau

fbshipit-source-id: cc9bffb276cb00f1db8ec97a36784c484ae4563a
Summary:
I made prediction 1.9x to 4.2x faster than before.

# Motivation
I want to use https://tinyurl.com/nllblid218e and similarly parametrized models to run language classification on petabytes of web data.

# Methodology
The costliest operation is summing the rows for each model input. I've optimized this in three ways:
1. `addRowToVector` was a virtual function call for each row. I've replaced this with one virtual function call per prediction by adding `averageRowsToVector` to `Matrix`.
2. `Vector` and `DenseMatrix` were not 64-byte aligned, so the CPU was doing a lot of unaligned memory access. I've brought in my own `vector` replacement that does 64-byte alignment.
3. I wrote `averageRowsToVector` in intrinsics for common vector sizes. This works on SSE, AVX, and AVX512F.
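
The first change can be pictured roughly as follows. This is a simplified sketch, not the code from the pull request: the `Matrix` and `DenseMatrix` classes below are pared-down stand-ins, and the `averageRowsToVector` signature is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the abstract matrix interface (assumed shape).
class Matrix {
 public:
  virtual ~Matrix() = default;
  // One virtual call per prediction: average all selected rows into `out`.
  virtual void averageRowsToVector(
      std::vector<float>& out,
      const std::vector<std::int32_t>& rows) const = 0;
};

// Simplified dense, row-major implementation.
class DenseMatrix : public Matrix {
 public:
  DenseMatrix(std::size_t rows, std::size_t cols)
      : cols_(cols), data_(rows * cols, 0.0f) {}

  float& at(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }

  void averageRowsToVector(
      std::vector<float>& out,
      const std::vector<std::int32_t>& rows) const override {
    out.assign(cols_, 0.0f);
    if (rows.empty()) return;
    // The per-row loop is ordinary code inside a single virtual call,
    // so there is no dynamic dispatch per row.
    for (std::int32_t r : rows) {
      const float* row = &data_[static_cast<std::size_t>(r) * cols_];
      for (std::size_t j = 0; j < cols_; ++j) out[j] += row[j];
    }
    const float scale = 1.0f / static_cast<float>(rows.size());
    for (float& x : out) x *= scale;
  }

 private:
  std::size_t cols_;
  std::vector<float> data_;
};
```

The inner loops here are plain scalar code; the intrinsics described in the third item would replace them for common vector sizes.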

See the commit history for a breakdown of speed improvement from each change.
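
For the second change, one way to get 64-byte-aligned storage is a custom allocator plugged into `std::vector`. Note that this swaps in a different technique from the pull request, which brings in its own `vector` replacement. `AlignedAllocator`, `AlignedFloats`, and `isAligned64` are hypothetical names, and the sketch assumes C++17 aligned `operator new`.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator that returns storage aligned to `Alignment` bytes.
template <typename T, std::size_t Alignment = 64>
struct AlignedAllocator {
  using value_type = T;

  AlignedAllocator() noexcept = default;
  template <typename U>
  AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

  // Explicit rebind is required because of the non-type template parameter.
  template <typename U>
  struct rebind {
    using other = AlignedAllocator<U, Alignment>;
  };

  T* allocate(std::size_t n) {
    // C++17 aligned allocation.
    return static_cast<T*>(
        ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
  }
  void deallocate(T* p, std::size_t) noexcept {
    ::operator delete(p, std::align_val_t(Alignment));
  }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept {
  return true;
}
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept {
  return false;
}

// Float buffer whose first element sits on a 64-byte boundary.
using AlignedFloats = std::vector<float, AlignedAllocator<float>>;

// Returns true if the pointer is a multiple of 64 bytes.
inline bool isAligned64(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

With storage like this, a row of 16 floats starting at a multiple of 16 elements occupies exactly one 64-byte block, which is the width of an AVX-512 register.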

# Experiments
Test set: [docs1000.txt.gz](https://github.com/facebookresearch/fastText/files/11832996/docs1000.txt.gz), a collection of random documents (https://data.statmt.org/heafield/classified-fasttext/).
CPU: AMD Ryzen 9 7950X 16-Core

Model https://tinyurl.com/nllblid218e with 256-dimensional vectors
Before
real    0m8.757s
user    0m8.434s
sys     0m0.327s

After
real    0m2.046s
user    0m1.717s
sys     0m0.334s

Model https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin with 16-dimensional vectors
Before
real    0m0.926s
user    0m0.889s
sys     0m0.037s

After
real    0m0.477s
user    0m0.436s
sys     0m0.040s
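
The 1.9x to 4.2x range quoted in the summary can be reproduced from the wall-clock (`real`) times above. A small sketch of the arithmetic, where `speedup` is a hypothetical helper name:

```cpp
// Speedup factor: how many times faster the "after" run is than the "before" run.
double speedup(double beforeSeconds, double afterSeconds) {
  return beforeSeconds / afterSeconds;
}

// Wall-clock ("real") times reported above, in seconds:
//   256-dimensional model: speedup(8.757, 2.046) is roughly 4.28
//   16-dimensional model:  speedup(0.926, 0.477) is roughly 1.94
```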

Pull Request resolved: facebookresearch#1341

Reviewed By: graemenail

Differential Revision: D52134736

Pulled By: kpuatfb

fbshipit-source-id: 42067161f4c968c34612934b48a562399a267f3b
Reviewed By: azad-meta

Differential Revision: D53908330

fbshipit-source-id: b2215f0522c32a82cd876633210befefe9317d76
Summary: Pull Request resolved: facebookresearch#1366

Reviewed By: jailby

Differential Revision: D54850920

Pulled By: bigfootjon

fbshipit-source-id: 9a3eec7b7cb42335a786fb247cb16be9ed3c2d59
@sburman sburman enabled auto-merge (squash) May 22, 2024 01:27
@sburman sburman requested a review from markryd May 22, 2024 01:30
@sburman sburman merged commit 04fbfbd into main May 22, 2024
4 checks passed
@sburman sburman deleted the fastext_upstream_final branch May 22, 2024 01:31
9 participants