This folder contains a list of data samples that are used by forte to facilitate test cases.

# List of Data Samples

## audio_reader_test

This directory consists of audio files that are used in a unit test for verifying the AudioReader in `forte/tests/forte/data/readers/audio_reader_test.py`. Currently it contains two `.flac` files excerpted from a HuggingFace dataset called [patrickvonplaten/librispeech_asr_dummy](https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_dummy) for automatic speech recognition.
"2","Great CD","My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?"""
"2","One of the best game music soundtracks - for a game I didn't really play","Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it."
"1","Batteries died within a year ...","I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power."
"2","works fine, but Maha Energy is better","Check out Maha Energy's website. Their Powerex MH-C204F charger works in 100 minutes for rapid charge, with option for slower charge (better for batteries). And they have 2200 mAh batteries."
"2","Great for the non-audiophile","Reviewed quite a bit of the combo players and was hesitant due to unfavorable reviews and size of machines. I am weaning off my VHS collection, but don't want to replace them with DVD's. This unit is well built, easy to setup and resolution and special effects (no progressive scan for HDTV owners) suitable for many people looking for a versatile product.Cons- No universal remote."
"1","DVD Player crapped out after one year","I also began having the incorrect disc problems that I've read about on here. The VCR still works, but hte DVD side is useless. I understand that DVD players sometimes just quit on you, but after not even one year? To me that's a sign on bad quality. I'm giving up JVC after this as well. I'm sticking to Sony or giving another brand a shot."
"1","Incorrect Disc","I love the style of this, but after a couple years, the DVD is giving me problems. It doesn't even work anymore and I use my broken PS2 Now. I wouldn't recommend this, I'm just going to upgrade to a recorder now. I wish it would work but I guess i'm giving up on JVC. I really did like this one... before it stopped working. The dvd player gave me problems probably after a year of having it."
"1","DVD menu select problems","I cannot scroll through a DVD menu that is set up vertically. The triangle keys will only select horizontally. So I cannot select anything on most DVD's besides play. No special features, no language select, nothing, just play."
"2","Unique Weird Orientalia from the 1930's","Exotic tales of the Orient from the 1930's. ""Dr Shen Fu"", a Weird Tales magazine reprint, is about the elixir of life that grants immortality at a price. If you're tired of modern authors who all sound alike, this is the antidote for you. Owen's palette is loaded with splashes of Chinese and Japanese colours. Marvelous."
"1","Not an ""ultimate guide""","Firstly,I enjoyed the format and tone of the book (how the author addressed the reader). However, I did not feel that she imparted any insider secrets that the book promised to reveal. If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help. If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books. For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation. Yet, for those new to the entire affair, this book can definitely clarify the requirements for you."
`DataPack` includes a payload for audio data and metadata for the sample rate. You can set them by calling the `set_audio` method:

```python
from forte.data.data_pack import DataPack

pack: DataPack = DataPack()
pack.set_audio(audio, sample_rate)
```

The input parameter `audio` should be a NumPy array of the raw waveform, and `sample_rate` should be an integer that specifies the sample rate. You can then access these data through `DataPack.audio` and `DataPack.sample_rate`.
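For illustration, here is a minimal sketch (pure NumPy, no forte required) of building a waveform array of the kind `set_audio` expects; the one-second 440 Hz tone and the 16 kHz rate are arbitrary choices for this sketch:

```python
import numpy as np

sample_rate = 16000  # samples per second (arbitrary choice for this sketch)
t = np.linspace(0.0, 1.0, num=sample_rate, endpoint=False)
audio = np.sin(2.0 * np.pi * 440.0 * t)  # one second of a 440 Hz tone

# `audio` is the raw-waveform array and `sample_rate` the integer rate
# that DataPack.set_audio takes as arguments.
print(audio.shape)  # (16000,)
```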
## Audio Reader

`AudioReader` supports reading audio data from files under a specific directory. You can set it as the reader of your forte pipeline whenever you need to process audio files:

```python
from forte.pipeline import Pipeline
from forte.data.readers.audio_reader import AudioReader

Pipeline().set_reader(
    reader=AudioReader(),
    config={"file_ext": ".wav"}
).run(
    "path-to-audio-directory"
)
```

The example above builds a simple pipeline that walks through the specified directory and loads all the files with the `.wav` extension. `AudioReader` creates a `DataPack` for each file with the corresponding audio payload and sample rate.
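Conceptually, a reader like this just needs to walk the directory and collect files by extension; here is a stand-alone sketch of that collection step (an illustration only, not forte's actual implementation):

```python
import os
from typing import Iterator


def collect_audio_files(root: str, file_ext: str = ".wav") -> Iterator[str]:
    """Yield the path of every file under `root` whose name ends in `file_ext`."""
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(file_ext):
                yield os.path.join(dirpath, name)
```

Each collected path would then correspond to one `DataPack` produced by the reader.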
`DataPack.get()` is commonly used to retrieve entries from a datapack. In some cases, we are only interested in entries from a specific range. `DataPack.get()` allows users to set `range_annotation`, which controls the search area of the sub-types. If `DataPack.get()` is called frequently with queries related to the `range_annotation`, you may consider building a coverage index for the related entry types. Users can call `DataPack.build_coverage_for(context_type, covered_type)` to create a mapping between a pair of entry types and the target entries that are covered by the ranges of the outer entries.

For example, if you need to get all the `Token`s from some `Sentence`, you can write your code as:
```python
from ft.onto.base_ontology import Sentence, Token

# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from a sentence
    token_entries = input_pack.get(
        entry_type=Token, range_annotation=sentence
    )
```
However, the snippet above may become a bottleneck if you have many `Sentence` and `Token` entries inside the datapack. To speed up this process, you can build a coverage index first:
```python
# Build coverage index between `Token` and `Sentence`
input_pack.build_coverage_for(
    context_type=Sentence,
    covered_type=Token
)
```
This `DataPack.build_coverage_for(context_type, covered_type)` function builds a mapping from `context_type` to `covered_type`, allowing faster retrieval of inner entries covered by outer entries inside the datapack.
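The speed-up comes from precomputing, for each outer entry, the inner entries whose spans fall inside it. A toy sketch of such a mapping using plain `(begin, end)` offset tuples (forte's internal index structure differs; this only illustrates the idea):

```python
# character-offset spans for two sentences and their tokens
sentences = [(0, 10), (11, 25)]
tokens = [(0, 3), (4, 10), (11, 15), (16, 25)]

# coverage index: sentence span -> token spans it fully covers
coverage = {
    sent: [tok for tok in tokens if sent[0] <= tok[0] and tok[1] <= sent[1]]
    for sent in sentences
}
print(coverage[(0, 10)])  # [(0, 3), (4, 10)]
```

With the index built once, each lookup is a dictionary access instead of a scan over all inner entries.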
We also provide a function called `DataPack.covers(context_entry, covered_entry)` for coverage checking. It returns `True` if the span of `covered_entry` is covered by the span of `context_entry`.
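The coverage check itself is just span containment; a minimal sketch of the semantics on `(begin, end)` tuples (not forte's implementation, which operates on entry objects):

```python
def covers(context_span, covered_span):
    """True when `covered_span` lies entirely within `context_span`."""
    return (context_span[0] <= covered_span[0]
            and covered_span[1] <= context_span[1])


print(covers((0, 10), (4, 10)))   # True
print(covers((0, 10), (8, 12)))   # False: extends past the context
```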
## docs/ontology_generation.md
Let us consider a simple ontology for documents of a pet shop.

```json
        {
            "name": "pet_type",
            "type": "str"
        },
        {
            "name": "price",
            "description": "Price for pet. A 2x2 matrix, whose columns are female/male and rows are juvenile/adult.",
            "type": "NdArray",
            "ndarray_dtype": "float",
            "ndarray_shape": [2, 2]
        }
    ]
},
```
Each entry definition can define a number of attributes (possibly none), mimicking class attributes:

* The `description` keyword is optionally used as the comment to describe the attribute.
* The `type` keyword is used to define the type of the attribute. Currently supported types are:
  * Primitive types - `int`, `float`, `str`, `bool`
  * Composite types - `List`, `Dict`, `NdArray`
  * Entries defined in the `top` module - The attributes can be of the type of base entries (defined in the `forte.data.ontology.top` module) and can be directly referred to by the class name.
* `key_type` and `value_type`: If the `type` of the property is a `Dict`, then these two represent the types of the key and the value of the dictionary; currently, only primitive types are supported as the `key_type`.
* `ndarray_dtype: str` and `ndarray_shape: array`: If the `type` of the property is `NdArray`, then these two represent the data type and the shape of the array. `NdArray` allows storing an N-dimensional array in an entry. For instance, with the simple pet shop ontology above, we can instantiate `Pet` and name it `dog`, then assign a matrix to the attribute `price` by `dog.price.data = [[2.99, 1.99], [4.99, 3.99]]`. Internally, this $2 \times 2$ matrix is stored as a NumPy array. When `ndarray_shape`/`ndarray_dtype` is specified, the shape/data type of any assigned array is verified against it. If both `ndarray_dtype` and `ndarray_shape` are provided, a placeholder is created by `numpy.ndarray(ndarray_shape, dtype=ndarray_dtype)`.
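The placeholder behaviour can be sketched directly with NumPy; the shape and dtype values below mirror the pet shop example above:

```python
import numpy as np

ndarray_shape = [2, 2]
ndarray_dtype = "float"

# when both dtype and shape are given, a placeholder array is created
placeholder = np.ndarray(ndarray_shape, dtype=ndarray_dtype)
print(placeholder.shape)  # (2, 2)

# the price matrix from the example; its shape matches the declaration
price = np.array([[2.99, 1.99], [4.99, 3.99]], dtype=ndarray_dtype)
assert price.shape == tuple(ndarray_shape)
```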
## Major ontology types, Annotations, Links, Groups and Generics
This folder contains a series of tutorial examples that walk through the basics of building audio processing pipelines using forte.

## Introduction

We provide a simple speech processing example here to showcase forte's capability to support a wide range of audio processing tasks. This example consists of two parts: speaker segmentation and automatic speech recognition.
### Speaker Segmentation

Speaker segmentation partitions an input audio stream into acoustically homogeneous segments according to speaker identity, splitting a conversation among one or more speakers into speaker turns. A typical speaker segmentation system finds potential speaker change points using the audio characteristics. In this example, the speaker segmentation is backed by a pretrained Hugging Face model; you can find the details [here](https://huggingface.co/pyannote/speaker-segmentation).
### Automatic Speech Recognition

Automatic Speech Recognition (ASR) develops methodologies and technologies that enable computers to recognize and translate spoken language into text. Here we use a simple example to show how to build a forte pipeline that performs speech transcription. This example is based on a pretrained wav2vec2 model; you can check out the details [here](https://huggingface.co/facebook/wav2vec2-base-960h).
## Run the Example Script

This example requires **Python 3.8 or later**. Before running the script, install the required packages:
```bash
pip install -r requirements.txt
```
Note that some packages (e.g., `soundfile`) depend on a system library called `libsndfile`, which might entail [additional steps](https://pysoundfile.readthedocs.io/en/latest/#installation) for Linux users.
Now you are able to run the example script `speaker_segmentation_pipeline.py`:
```bash
python speaker_segmentation_pipeline.py
```
which will print out the annotated transcription results, including speakers and their corresponding utterances. Each audio segment will be played through your PC speaker. Example output:
```
INFO:speaker_segmentation_pipeline.py:SPEAKER_01: HE JOINS US LIFE FROM THE ALLERT CENTER WITH WHAT VOTERS THINK OF TO NIGHT'S DEBATE MICHAEL
```
We include a `test_audio.wav` extracted from the [VoxConverse speaker diarisation dataset](https://github.com/joonson/voxconverse) in this example. It is a conversation among three speakers speaking in turns. The example script will partition the audio, transcribe the waveform, and play the audio segment for each speaker. The results are not meant to be 100% accurate, but they are still recognizable and reasonable.

## Code Walkthrough

The backbone of the example script is a simple forte pipeline for speech processing:
- [`AudioReader`](https://github.com/asyml/forte/blob/master/forte/data/readers/audio_reader.py) supports reading audio data from files under a specific directory. Use `file_ext` to configure the target file extension to include as input to your pipeline. You can set it as the reader of your forte pipeline whenever you need to process audio files.
- `SpeakerSegmentationProcessor` performs the speaker segmentation task utilizing a pretrained [model](https://huggingface.co/pyannote/speaker-segmentation) from HuggingFace. After partitioning the recording into segments, it creates annotations called [`AudioUtterance`](https://github.com/asyml/forte/blob/master/ft/onto/base_ontology.py#L537) to store the audio span and speaker information for later retrieval.
- `AudioUtteranceASRProcessor` transcribes audio segments into text for each `AudioUtterance` found in the input datapack. It appends the transcribed text to the text payload of the datapack and creates a corresponding [`Utterance`](https://github.com/asyml/forte/blob/master/ft/onto/base_ontology.py#L211) with speaker identity for each segment. To capture the one-to-one correspondence between `AudioUtterance` and `Utterance` within each segment, it adds a [`Link`](https://github.com/asyml/forte/blob/master/forte/data/ontology/top.py#L194) entry for each speech-to-text relationship.

After running the pipeline, you can retrieve the audio and text annotations from each segment by getting all the `Link`s inside the output datapack.
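Conceptually, each `Link` pairs an `AudioUtterance` (the audio span) with the `Utterance` holding its transcription. A stand-alone sketch of that one-to-one pairing using plain dataclasses (hypothetical stand-ins for the forte entry classes, showing only the shape of the data, not the forte API):

```python
from dataclasses import dataclass


@dataclass
class AudioUtterance:  # stand-in: audio span plus speaker identity
    speaker: str
    begin: int
    end: int


@dataclass
class Utterance:  # stand-in: transcribed text plus speaker identity
    speaker: str
    text: str


@dataclass
class Link:  # stand-in: one speech-to-text relationship
    parent: AudioUtterance
    child: Utterance


links = [
    Link(AudioUtterance("SPEAKER_01", 0, 48000),
         Utterance("SPEAKER_01", "HE JOINS US LIFE FROM THE ALLERT CENTER")),
]

for link in links:
    # each link gives direct access to both sides of a segment
    print(f"{link.child.speaker}: {link.child.text}")
```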