chain.arun() for summarization no faster than chain() #8391

Closed
3 of 14 tasks
pseudotensor opened this issue Jul 28, 2023 · 8 comments
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature), Ɑ: embeddings (Related to text embedding models module)

Comments


pseudotensor commented Jul 28, 2023

System Info

(h2ogpt) jon@pseudotensor:~/h2ogpt$ pip freeze | grep langchain
langchain==0.0.235
langchainplus-sdk==0.0.20

Python 3.10

(h2ogpt) jon@pseudotensor:~/h2ogpt$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Who can help?

@agola11

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

h2oai/h2ogpt@cc3331d

The commit above shows where I introduced async (the code previously had none).

The text generation inference (TGI) server is configured for high concurrency, but its logs show requests arriving back-to-back.

Expected behavior

I expect the summarization step to run in parallel, as stated here:

https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/chains/combine_documents/map_reduce.py#L210-L213

But perhaps I misunderstand something. Or perhaps it's not really parallel:

#1145 (comment)

There's lots of discussion about hitting rate limits with OpenAI:

#2465
#1643

So I presume this works, but I'm not seeing it. In the OpenAI case it seems to be handled via batching, which is possible with the HF TGI server but not implemented here. Still, I would have thought that all the reduction tasks could run in parallel with asyncio.

#1463 (comment)
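
For context, the call pattern in question is roughly the following (a simplified sketch, not my actual h2oGPT code; the TGI URL is a placeholder and the exact HuggingFaceTextGenInference constructor arguments may differ by langchain version):

import asyncio

from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.llms import HuggingFaceTextGenInference

# Placeholder endpoint; constructor parameters are from memory and may differ by version.
llm = HuggingFaceTextGenInference(inference_server_url="http://localhost:8080", max_new_tokens=512)

# map_reduce summarization: one map prompt per document chunk, then a final reduce prompt.
chain = load_summarize_chain(llm, chain_type="map_reduce")
docs = [Document(page_content=chunk) for chunk in ["chunk 1 ...", "chunk 2 ...", "chunk 3 ..."]]

# arun() is the async entry point; the map step is expected to fan out over the docs concurrently.
summary = asyncio.run(chain.arun(docs))
print(summary)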

@dosubot dosubot bot added the Ɑ: embeddings and 🤖:bug labels on Jul 28, 2023

dosubot bot commented Jul 28, 2023

Answer generated by a 🤖

Answer

I understand that you're experiencing an issue where the chain.arun() method for summarization in LangChain is not running faster than the standard chain() method. You're expecting the chain.arun() method to run the summarization process in parallel, as indicated in the code comments, but you're not observing this behavior.

The chain.arun() method is indeed designed to run the summarization process in parallel. However, the speed of the operation can be influenced by several factors. These include the number of documents being processed, the complexity of the documents, the specific language model being used, and the system resources available.

If you're not seeing a speed improvement when using chain.arun(), it could be due to one or more of these factors. For instance, if the number of documents is small, the overhead of setting up the parallel processing might outweigh the benefits. Alternatively, if your system resources, such as CPU or memory, are already heavily utilized, parallel processing might not provide a speedup.

In the code, self.llm_chain.aapply() is the method that applies the language model to each document in parallel. If this isn't providing the expected speedup, it might be worth investigating the factors I mentioned earlier.

Please note that this is just one potential explanation for the behavior you're observing. There could be other factors at play that are specific to your setup or use case.

If you find this explanation accurate and it's not covered in the existing documentation, I encourage you to make a pull request to update the documentation. This could be beneficial for other users who might encounter the same issue.



This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.


pseudotensor commented Jul 28, 2023

Here's verbose output:


> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
In order to write a concise single-paragraph or bulleted list summary, pay attention to the following text:
"""
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford * 1 Jong Wook Kim * 1 Tao Xu 1 Greg Brockman 1 Christine McLeavey 1 Ilya Sutskever 1
Abstract
We study the capabilities of speech processing
systems trained simply to predict large amounts of
transcripts of audio on the internet. When scaled
to 680,000 hours of multilingual and multitask
supervision, the resulting models generalize well
to standard benchmarks and are often competitive
with prior fully supervised results but in a zero-
shot transfer setting without the need for any fine-
tuning. When compared to humans, the models
approach their accuracy and robustness. We are
releasing models and inference code to serve as
a foundation for further work on robust speech
processing.
1. Introduction
Progress in speech recognition has been energized by the
development of unsupervised pre-training techniques exem-
plified by Wav2Vec 2.0 (Baevski et al., 2020). Since these
methods learn directly from raw audio without the need for
human labels, they can productively use large datasets of un-
labeled speech and have been quickly scaled up to 1,000,000
hours of training data (Zhang et al., 2021), far more than the
1,000 or so hours typical of an academic supervised dataset.
When fine-tuned on standard benchmarks, this approach
has improved the state of the art, especially in a low-data
setting.
These pre-trained audio encoders learn high-quality repre-
sentations of speech, but because they are purely unsuper-
vised they lack an equivalently performant decoder mapping
those representations to usable outputs, necessitating a fine-
tuning stage in order to actually perform a task such as
speech recognition1. This unfortunately limits their use-
fulness and impact as fine-tuning can still be a complex
process requiring a skilled practitioner. There is an addi-
tional risk with requiring fine-tuning. Machine learning
*Equal contribution 1OpenAI, San Francisco, CA 94110, USA.
Correspondence to: Alec Radford <alec@openai.com>, Jong
Wook Kim <jongwook@openai.com>.
1Baevski et al. (2021) is an exciting exception - having devel-
oped a fully unsupervised speech recognition system
methods are exceedingly adept at finding patterns within a
training dataset which boost performance on held-out data
from the same dataset. However, some of these patterns are
brittle and spurious and don’t generalize to other datasets
and distributions. In a particularly disturbing example, Rad-
ford et al. (2021) documented a 9.2% increase in object
classification accuracy when fine-tuning a computer vision
model on the ImageNet dataset (Russakovsky et al., 2015)
without observing any improvement in average accuracy
when classifying the same objects on seven other natural
image datasets. A model that achieves “superhuman” per-
formance when trained on a dataset can still make many
basic errors when evaluated on another, possibly precisely
because it is exploiting those dataset-specific quirks that
humans are oblivious to (Geirhos et al., 2020).
This suggests that while unsupervised pre-training has im-
proved the quality of audio encoders dramatically, the lack
of an equivalently high-quality pre-trained decoder, com-
bined with a recommended protocol of dataset-specific fine-
tuning, is a crucial weakness which limits their usefulness
and robustness. The goal of a speech recognition system
should be to work reliably “out of the box” in a broad range
of environments without requiring supervised fine-tuning of
a decoder for every deployment distribution.
As demonstrated by Narayanan et al. (2018), Likhomanenko
et al. (2020), and Chan et al. (2021) speech recognition sys-
tems that are pre-trained in a supervised fashion across many
datasets/domains exhibit higher robustness and generalize
much more effectively to held-out datasets than models
trained on a single source. These works achieve this by
combining as many existing high-quality speech recogni-
tion datasets as possible. However, there is still only a
moderate amount of this data easily available. SpeechStew
(Chan et al., 2021) mixes together 7 pre-existing datasets
totalling 5,140 hours of supervision. While not insignifi-
cant, this is still tiny compared to the previously mentioned
1,000,000 hours of unlabeled speech data utilized in Zhang
et al. (2021).
Recognizing the limiting size of existing high-quality super-
vised datasets, recent efforts have created larger datasets for
speech recognition. By relaxing the requirement of gold-
standard human-validated transcripts, Chen et al. (2021) and
Galvez et al. (2021) make use of sophisticated automated
"""
Using only the text above, write a condensed and concise summary of key results (preferably as bullet points):

Prompt after formatting:
In order to write a concise single-paragraph or bulleted list summary, pay attention to the following text:
"""
Robust Speech Recognition via Large-Scale Weak Supervision
9
audio. The pub noise represents a more natural noisy envi-
ronment with ambient noise and indistinct chatter typical
in a crowded restaurant or a pub. Among the 14 models,
twelve are pre-trained and/or fine-tuned on LibriSpeech, and
the other two are NVIDIA STT models trained on a mixture
dataset similar to prior work like SpeechStew that includes
LibriSpeech. The level of additive noise corresponding to
a given signal-to-noise ratio (SNR) is calculated based on
the signal power of individual examples. Figure 5 shows
how the ASR performance degrades as the additive noise
becomes more intensive. There are many models that out-
perform our zero-shot performance under low noise (40 dB
SNR), which is unsurprising given those models are trained
primarily on LibriSpeech, but all models quickly degrade as
the noise becomes more intensive, performing worse than
the Whisper model under additive pub noise of SNR below
10 dB. This showcases Whisper’s robustness to noise, es-
pecially under more natural distribution shifts like the pub
noise.
3.8. Long-form Transcription
Whisper models are trained on 30-second audio chunks and
cannot consume longer audio inputs at once. This is not a
problem with most academic datasets comprised of short
utterances but presents challenges in real-world applications
which often require transcribing minutes- or hours-long au-
dio. We developed a strategy to perform buffered transcrip-
tion of long audio by consecutively transcribing 30-second
segments of audio and shifting the window according to the
timestamps predicted by the model. We observed that it
is crucial to have beam search and temperature scheduling
based on the repetitiveness and the log probability of the
model predictions in order to reliably transcribe long audio.
The full procedure is described in Section 4.5.
We evaluate the long-form transcription performance on
seven datasets consisting of speech recordings of various
lengths and recording conditions, to cover as diverse a data
distribution as possible. These include a long-form adapta-
tion of TED-LIUM3 (Hernandez et al., 2018) concatenated
so that each example is a full-length TED talk, a collection
of jargon-laden segments taken from The Late Show with
Stephen Colbert (Meanwhile), sets of videos/podcasts that
has been used as ASR benchmarks in online blogs (Rev16
and Kincaid46), recordings of earnings calls (Del Rio et al.,
2021), and the full-length interviews from the Corpus of
Regional African American Language (CORAAL) (Gunter
et al., 2021). Full details about the long-form datasets can
be found in Appendix A.
We compare the performance with open-source models as
well as 4 commercial ASR services. The results are sum-
marized in Figure 6, showing the distribution of word error
rates from Whisper and the 4 commercial ASR services,
as well as the NVIDIA STT Conformer-CTC Large model
from the NeMo toolkit (Kuchaiev et al., 2019) which per-
formed the best among the open-source models. All com-
mercial ASR services are queried using their default English
transcription settings as of September 1st, 2022, and for
the NVIDIA STT model we used their buffered inference
TED-LIUM3
Meanwhile
Kincaid46
Rev16
Earnings-21
Earnings-22
CORAAL
0
5
10
15
20
25
30
35
40
Word Error Rate (%)
Whisper
Company A
Company B
Company C
Company D
NVIDIA STT (CTC large)
Figure 6. Whisper is competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription. The
distribution of word error rates from six ASR systems on seven long-form datasets are compared, where the input lengths range from a
few minutes to a few hours. The boxes show the quartiles of per-example WERs, and the per-dataset aggregate WERs are annotated
on each box. Our model outperforms the best open source model (NVIDIA STT) on all datasets, and in most cases, commercial ASR
systems as well.
"""
Using only the text above, write a condensed and concise summary of key results (preferably as bullet points):

Prompt after formatting:
In order to write a concise single-paragraph or bulleted list summary, pay attention to the following text:
"""
Robust Speech Recognition via Large-Scale Weak Supervision
17
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B.,
Cubuk, E. D., and Le, Q. V. SpecAugment: A simple data
augmentation method for automatic speech recognition.
arXiv preprint arXiv:1904.08779, 2019.
Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty
of training recurrent neural networks. In International
conference on machine learning, pp. 1310–1318. PMLR,
2013.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga,
L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison,
M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L.,
Bai, J., and Chintala, S. Pytorch: An imperative style,
high-performance deep learning library. In Advances
in Neural Information Processing Systems 32, pp. 8024–
8035, 2019.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-
napeau, D., Brucher, M., Perrot, M., and Duchesnay, E.
Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12:2825–2830, 2011.
Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic
approximation by averaging. SIAM journal on control
and optimization, 30(4):838–855, 1992.
Pratap, V., Sriram, A., Tomasello, P., Hannun, A. Y.,
Liptchinsky, V., Synnaeve, G., and Collobert, R. Mas-
sively multilingual asr: 50 languages, 1 model, 1 billion
parameters. ArXiv, abs/2007.03001, 2020a.
Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., and Collobert,
R. Mls: A large-scale multilingual dataset for speech
research. arXiv preprint arXiv:2012.03411, 2020b.
Press, O. and Wolf, L. Using the output embedding to
improve language models. In Proceedings of the 15th
Conference of the European Chapter of the Associa-
tion for Computational Linguistics: Volume 2, Short
Papers, pp. 157–163, Valencia, Spain, April 2017. As-
sociation for Computational Linguistics. URL https:
//aclanthology.org/E17-2025.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and
Sutskever, I. Language models are unsupervised multitask
learners. 2019.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. Learning transferable
visual models from natural language supervision. arXiv
preprint arXiv:2103.00020, 2021.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring
the limits of transfer learning with a unified text-to-text
transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cor-
nell, S., Lugosch, L., Subakan, C., Dawalatabad, N.,
Heba, A., Zhong, J., Chou, J.-C., Yeh, S.-L., Fu, S.-W.,
Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na,
H., Gao, Y., Mori, R. D., and Bengio, Y. SpeechBrain: A
general-purpose speech toolkit, 2021. arXiv:2106.04624.
Recht, B., Roelofs, R., Schmidt, L., and Shankar, V.
Do ImageNet classifiers generalize to ImageNet?
In
Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceed-
ings of the 36th International Conference on Machine
Learning, volume 97 of Proceedings of Machine Learn-
ing Research, pp. 5389–5400. PMLR, 09–15 Jun 2019.
URL https://proceedings.mlr.press/v97/
recht19a.html.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
M., et al. Imagenet large scale visual recognition chal-
lenge. International journal of computer vision, 115(3):
211–252, 2015.
Schultz, T. and Kirchhoff, K. Multilingual speech process-
ing. Elsevier, 2006.
Seide, F., Li, G., Chen, X., and Yu, D. Feature engineering
in context-dependent deep neural networks for conver-
sational speech transcription. In 2011 IEEE Workshop
on Automatic Speech Recognition & Understanding, pp.
24–29. IEEE, 2011.
Sennrich, R., Haddow, B., and Birch, A. Neural machine
translation of rare words with subword units.
arXiv
preprint arXiv:1508.07909, 2015.
Speer, R. ftfy. Zenodo, 2019. URL https://doi.org/
10.5281/zenodo.2591652. Version 5.5.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. Dropout: a simple way to prevent
neural networks from overfitting. The journal of machine
learning research, 15(1):1929–1958, 2014.
Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to se-
quence learning with neural networks. Advances in neural
information processing systems, 27, 2014.
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B.,
and Schmidt, L.
Measuring robustness to natural
distribution shifts in image classification. In Larochelle,
H., Ranzato, M., Hadsell, R., Balcan, M., and Lin,
H. (eds.), Advances in Neural Information Processing
"""
Using only the text above, write a condensed and concise summary of key results (preferably as bullet points):

> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
In order to write a concise single-paragraph or bulleted list summary, pay attention to the following text:
"""
 Sure! Here's a condensed and concise summary of the key results based on the provided text:

• Robust Speech Recognition via Large-Scale Weak Supervision:

▪ Training speech processing systems on large amounts of unlabeled audio data (680,000 hours) leads to models that generalize well to standard benchmarks and are competitive with fully supervised results.

▪ The models approach human accuracy and robustness when compared to human transcriptions.

▪ The approach does not require fine-tuning, making it more practical and accessible for real-world applications.

▪ The models are robust and generalize well to held-out datasets, outperforming models trained on a single source.

▪ The use of large-scale weak supervision can help overcome the limitations of small-scale supervised datasets and improve the robustness of speech recognition systems.

 Sure! Here is a condensed and concise summary of the key results based on the provided text:

• Robust speech recognition performance under natural noise environments:

▪ Whisper models outperform zero-shot performance under low noise levels (40 dB SNR) but degrade quickly under more intensive noise (<10 dB SNR).
▪ Whisper models demonstrate robustness to natural distribution shifts, such as pub noise.

• Long-form transcription capabilities:

▪ Whisper models can perform buffered transcription of long audio by consecutively transcribing 30-second segments and shifting the window based on timestamps predicted by the model.
▪ Beam search and temperature scheduling based on repetitiveness and log probability of model predictions are crucial for reliable long-form transcription.
▪ Whisper competes with state-of-the-art commercial and open-source ASR systems in long-form transcription, with the best performance on all datasets and in most cases.

 Sure! Here is a condensed and concise summary of the key results based on the provided text:

• Robust Speech Recognition via Large-Scale Weak Supervision:
▪ Developed a simple data augmentation method called SpecAugment that improves the robustness of automatic speech recognition (ASR) models to unseen speakers and environments.
▪ Demonstrated the effectiveness of SpecAugment on several benchmark datasets, achieving state-of-the-art performance in robust ASR.

• On the Difficulty of Training Recurrent Neural Networks:
▪ Analyzed the difficulty of training recurrent neural networks (RNNs) and proposed a new measure of training difficulty based on the energy of the loss landscape.
▪ Showed that RNNs are more difficult to train than other neural network architectures, and that this difficulty can be mitigated by using better initialization methods and regularization techniques.

• Pytorch: An Imperative Style, High-Performance Deep Learning Library:
▪ Introduced PyTorch, an imperative-style deep learning library that provides a dynamic computation graph and automatic differentiation.
▪ Demonstrated the performance and flexibility of PyTorch on several benchmark tasks, including image classification and natural language processing.

• Scikit-learn: Machine Learning in Python:
▪ Presented scikit-learn, an open-source machine learning library for Python that provides a wide range of algorithms for classification, regression, clustering, and other tasks.
▪ Discussed the design principles and implementation of scikit-learn, and demonstrated its use on several benchmark datasets.

• Acceleration of Stochastic Approximation by Averaging:
▪ Proposed a method for accelerating stochastic approximation using averaging, which can improve the convergence rate of optimization algorithms.
▪ Analyzed the convergence properties of the proposed method and demonstrated its effectiveness on several benchmark problems.

• Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters:
▪ Introduced a massively multilingual speech recognition system that uses a single model to recognize 50 languages, with over 1 billion parameters.
▪ Demonstrated the effectiveness of the proposed system on several benchmark datasets, achieving state-of-the-art performance in multilingual ASR.

• Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer:
▪ Investigated the limits of transfer learning with a unified text-to-text transformer, which can perform a variety of natural language processing tasks without task-specific fine-tuning.
▪ Demonstrated the effectiveness of the proposed approach on several benchmark datasets, achieving state-of-the-art performance in a variety of tasks.
"""
Using only the text above, write a condensed and concise summary of key results (preferably as bullet points):


> Finished chain.

> Finished chain.

Actual final answer:

Sure! Here is a condensed and concise summary of the key results based on the provided text:

• Robust Speech Recognition via Large-Scale Weak Supervision:

▪ Trained speech processing systems on large amounts of unlabeled audio data, leading to models that generalize well to standard benchmarks and are competitive with fully supervised results.

▪ Models approach human accuracy and robustness when compared to human transcriptions.

▪ No fine-tuning required, making the approach more practical and accessible for real-world applications.

▪ Models generalize well to held-out datasets, outperforming models trained on a single source.

• Robust speech recognition performance under natural noise environments:

▪ Whisper models outperform zero-shot performance under low noise levels but degrade quickly under more intensive noise.

▪ Whisper models demonstrate robustness to natural distribution shifts, such as pub noise.

• Long-form transcription capabilities:

▪ Whisper models can perform buffered transcription of long audio by consecutively transcribing 30-second segments and shifting the window based on timestamps predicted by the model.

▪ Beam search and temperature scheduling are crucial for reliable long-form transcription.

▪ Whisper competes with state-of-the-art commercial and open-source ASR systems in long-form transcription.

• Other key results:

▪ Developed a simple data augmentation method called SpecAugment that improves the robustness of ASR models to unseen speakers and environments.

▪ Analyzed the difficulty of training RNNs and proposed a new measure of training difficulty based on the energy of the loss landscape.

▪ Introduced PyTorch, an imperative-style deep learning library that provides a dynamic computation graph and automatic differentiation.

▪ Presented scikit-learn, an open-source machine learning library for Python that provides a wide range of algorithms for classification, regression, clustering, and other tasks.

▪ Proposed a method for accelerating stochastic approximation using averaging, which can improve the convergence rate of optimization algorithms.

▪ Introduced a massively multilingual speech recognition system that uses a single model to recognize 50 languages, with over 1 billion parameters.

▪ Investigated the limits of transfer learning with a unified text-to-text transformer, which can perform a variety of natural language processing tasks without task-specific fine-tuning.

@pseudotensor (Author)

I confirmed in PyCharm that when reaching acombine_docs(), len(docs) == 3, so there should be work to parallelize.

@pseudotensor (Author)

If I put debug prints into huggingface_text_gen_inference.py at the entry and exit of _acall(), I see:

> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
[The three "Prompt after formatting:" map prompts are identical to those shown in the verbose output above.]

Enter acall
end acall
Enter acall
end acall
Enter acall
end acall

> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
In order to write a concise single-paragraph or bulleted list summary, pay attention to the following text:
"""

Please note that the original text contains several references to external sources, which I've removed for brevity. Also, please keep in mind that the text may contain some technical jargon or specific terms related to the field of speech recognition, which I'll try my best to simplify or exclude.


* Whisper achieves robust speech recognition performance even under adverse noise conditions, outperforming other models in certain scenarios.
* Whisper models are trained on 30-second audio chunks and can be extended to long-form transcription using a buffered transcription approach.
* The long-form transcription performance of Whisper is competitive with state-of-the-art commercial and open-source ASR systems, with a distribution of word error rates ranging from 4% to 17%.
* Whisper outperforms the best open-source model (NVIDIA STT) on all datasets and in most cases, commercial ASR systems as well.


• Robust speech recognition via large-scale weak supervision
• Simple data augmentation method for automatic speech recognition
• On the difficulty of training recurrent neural networks
• High-performance deep learning library for imperative style
• Machine learning in Python for various tasks
• Acceleration of stochastic approximation by averaging
• Massively multilingual ASR with 50 languages and 1 billion parameters
• Large-scale multilingual dataset for speech research
• Unsupervised multitask learners using language models
• Transferable visual models from natural language supervision
• Exploring the limits of transfer learning with a unified text-to-text transformer
• General-purpose speech toolkit for various tasks
• Do ImageNet classifiers generalize to ImageNet?
• Imagenet large scale visual recognition challenge
• Feature engineering in context-dependent deep neural networks for conversational speech transcription
• Neural machine translation of rare words with subword units
• Simple way to prevent neural networks from overfitting
• Sequence to sequence learning with neural networks
• Measuring robustness to natural distribution shifts in image classification.
"""
Using only the text above, write a condensed and concise summary of key results (preferably as bullet points):

Enter acall
end acall

> Finished chain.

> Finished chain.

i.e. this block is a problem:

Enter acall
end acall
Enter acall
end acall
Enter acall
end acall

It's not running these calls in parallel. I have no custom callbacks, just the default stdout callback, so I don't know what is wrong.
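
To show what "back-to-back" looks like in isolation, here is a minimal standalone sketch with a made-up acall that only sleeps (nothing from LangChain); awaiting in a plain loop produces exactly the serialized Enter/end pattern above:

import asyncio
import time

START = time.monotonic()


async def acall(prompt):
    # Stand-in for the real _acall; pretend the TGI request takes 1 second.
    print(f"Enter acall  t={time.monotonic() - START:.1f}s")
    await asyncio.sleep(1)
    print(f"end acall    t={time.monotonic() - START:.1f}s")
    return "summary"


async def sequential(prompts):
    # Awaiting inside a plain for-loop: each call finishes before the next starts,
    # so the pairs never interleave and the total time is len(prompts) seconds.
    return [await acall(p) for p in prompts]


asyncio.run(sequential(["doc 1", "doc 2", "doc 3"]))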


pseudotensor commented Jul 28, 2023

I don't see how this can work:

https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/base.py#L972-L979

        for prompt in prompts:
            print("prompt", flush=True)
            text = (
                await self._acall(prompt, stop=stop, run_manager=run_manager, **kwargs)
                if new_arg_supported
                else await self._acall(prompt, stop=stop, **kwargs)
            )
            generations.append([Generation(text=text)])
        return LLMResult(generations=generations)

This awaits each prompt in a simple loop, so every call blocks the next one.

It needs to be something like this: https://stackoverflow.com/questions/43215507/how-to-await-method-in-loop

i.e.

tasks = [self.generate_url(url) for url in urls]
await asyncio.wait(tasks)

and it probably needs to control the degree of concurrency, as in: https://stackoverflow.com/a/48486557

e.g.

import asyncio

sem = asyncio.Semaphore(3)


async def safe_download(i):
    async with sem:  # semaphore limits num of simultaneous downloads
        return await download(i)  # download() is the placeholder coroutine from the SO answer


async def main():
    tasks = [
        asyncio.ensure_future(safe_download(i))  # creating task starts coroutine
        for i in range(9)
    ]
    await asyncio.gather(*tasks)  # await moment all downloads done
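
Putting the gather and semaphore patterns together, the sequential loop above could be restructured roughly like this (a sketch only, not the actual LangChain or h2oGPT change; fake_acall stands in for self._acall, and the 2-second delay and concurrency limit are made up):

import asyncio
import time


async def fake_acall(prompt):
    # Stand-in for self._acall; pretend each TGI request takes 2 seconds.
    await asyncio.sleep(2)
    return f"summary of {prompt}"


async def agenerate(prompts, max_concurrency=3):
    sem = asyncio.Semaphore(max_concurrency)  # cap on concurrent requests to the server

    async def bounded(prompt):
        async with sem:
            return await fake_acall(prompt)

    # Schedule all calls at once and await them together instead of one by one.
    return await asyncio.gather(*(bounded(p) for p in prompts))


start = time.monotonic()
print(asyncio.run(agenerate(["doc 1", "doc 2", "doc 3"])))
print(f"elapsed: {time.monotonic() - start:.1f}s")  # ~2s rather than ~6s with the sequential loop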

@pseudotensor (Author)

@agola11 It seems you made the related commit: fe30be6

@pseudotensor (Author)

After fixing, I now get what I expect:

Enter acall
begin gen_text
Enter acall
begin gen_text
Enter acall
begin gen_text
end gen_text
end acall
end gen_text
end acall
end gen_text
end acall

@pseudotensor (Author)

The fixes can be seen here in h2oGPT: h2oai/h2ogpt@6aac088#diff-7e1a68b7db14748467aa4b777c853ca024616237c57f8bf91ebcf792b82869a6R596-R627

The surrounding parts for the semaphore are also needed.
