BaSma99/Text_Summarization

Text Summarization

1- Introduction:

          - Extractive text summarization uses only words that already appear in the source text.

          - It selects the combination of existing words most relevant to the meaning of the source.

          - We chose articles from several categories and then preprocessed them.

          - We applied two summarization models: BERT and LSA.

Implementation Steps:

   1- Preprocessing:
        - The dataset is divided into label, text, and text summary.

        - After inspecting the data, we found it suitable for our needs.

        - Like most raw data, it contains many characters that are not useful.

        - So we cleaned the data by removing unwanted characters, extra white space, and stop words.

        - We lemmatized and stemmed the text but kept commas and full stops.
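The cleaning step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the `STOP_WORDS` set and the `clean_text` name are assumptions, and lemmatization and stemming are left out because they need an external NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption; the project's list is not given).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}


def clean_text(text: str) -> str:
    """Lowercase, drop unwanted characters and stop words, keep commas and full stops."""
    text = text.lower()
    # Keep letters, digits, whitespace, commas, and full stops; blank out everything else.
    text = re.sub(r"[^a-z0-9\s,.]", " ", text)
    # Detach commas and full stops so they survive tokenization as separate tokens.
    text = re.sub(r"([,.])", r" \1 ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    # Re-attach the punctuation to the preceding word.
    return re.sub(r"\s+([,.])", r"\1", " ".join(tokens))


print(clean_text("Markets fell. Investors are worried."))
```

Keeping the punctuation means sentence boundaries are still visible to the later summarization steps.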
        
  
   2- Splitting the data:

         - We kept only the cleaned text together with its label, then split the data into training and testing sets.
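The later steps refer to training and testing data, so a split along the following lines is assumed. The function name, the 80/20 ratio, and the seed are all hypothetical; the source does not state them.

```python
import random


def split_records(records, test_ratio=0.2, seed=42):
    """Shuffle (cleaned_text, label) records and split them into train and test lists."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (cleaned text, label).
records = [(f"article {i}", "sport" if i % 2 else "tech") for i in range(10)]
train, test = split_records(records)
print(len(train), len(test))
```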
         
   3- Feature Engineering:

            - Feature engineering is essential because the models cannot work on raw text; it converts text sentences into numeric vectors.
            
            1- BOW (Bag of Words), a method often used for document classification:

                    - This method turns text into fixed-length vectors by simply counting the number of times each word appears in a document.

                    - This process is referred to as vectorization.

                    - Because the data was already split, we applied BOW to both the training and testing data.
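A minimal sketch of the counting idea, in plain Python. The function names are illustrative and not taken from the project. The vocabulary is built from the training documents only, and the same vocabulary is then reused for the test documents so that every vector has the same length.

```python
def fit_vocabulary(documents):
    """Map each distinct word in the training documents to a column index."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    return {word: idx for idx, word in enumerate(vocab)}


def bow_vector(document, vocabulary):
    """Count occurrences of each vocabulary word; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


train_docs = ["cat sat mat", "dog sat"]
vocabulary = fit_vocabulary(train_docs)
print(bow_vector("cat cat sat bird", vocabulary))
```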
            
            2- TF-IDF (Term Frequency-Inverse Document Frequency):

                    - Term frequency measures how often a particular term appears relative to the document; inverse document frequency down-weights terms that appear in many documents.

                    - There are multiple ways of defining frequency.

                    - Because the data was already split, we applied TF-IDF to both the training and testing data.
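As an illustration, the sketch below uses one common definition: term frequency as count divided by document length, and a smoothed inverse document frequency. Which variant the project used is not stated, so both the formula choice and the function names are assumptions.

```python
import math
from collections import Counter


def fit_idf(documents):
    """Compute a smoothed IDF for every word seen in the training documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc.split()))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}


def tfidf_vector(document, idf):
    """Return {word: tf * idf} for the words of one document that have a known IDF."""
    words = document.split()
    counts = Counter(words)
    total = len(words) or 1
    return {w: (c / total) * idf[w] for w, c in counts.items() if w in idf}


idf = fit_idf(["cat sat", "cat ran"])
print(tfidf_vector("cat sat", idf))
```

In this toy example "cat" occurs in every document, so it receives a lower weight than "sat", which occurs in only one.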
            
            
    3- Classification:

            1- SVM (Support Vector Machines):
                    - Support vector machines are a set of supervised learning methods used for classification, regression, and outlier detection.

                    - All of these are common tasks in machine learning.
                    
            2- Decision Tree: 
            
                   - A decision tree is a powerful and popular tool for classification and prediction.

                   - It is a flowchart-like tree structure in which each internal node denotes a test on an attribute,

                   - each branch represents an outcome of the test,

                   - and each leaf node (terminal node) holds a class label.

            3- KNN: 
                  
                  - The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm.

                  - It can be used to solve both classification and regression problems.
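To make the idea concrete for the simplest of the three classifiers, here is a minimal KNN sketch over numeric feature vectors such as the BOW or TF-IDF vectors. The Euclidean distance, `k=3`, and the toy labels are illustrative assumptions, not the project's settings.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors and category labels.
vectors = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = ["sport", "sport", "tech", "tech"]
print(knn_predict(vectors, labels, [0.2, 0.3]))
```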
            
     4- Clustering: K-means (centroid-based clustering):
        
            - K-means is an unsupervised machine learning algorithm in which each observation belongs to the cluster with the nearest mean. 
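The "nearest mean" rule can be sketched as the usual two-step loop: assign every point to its closest centroid, then move each centroid to the mean of its points. This is a plain-Python illustration with the initial centroids passed in explicitly to keep it deterministic; the project's initialization and number of clusters are not given.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, n_iter=10):
    """Run a fixed number of assign/update rounds and return (centroids, assignments)."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # New centroid = per-dimension mean of its cluster; an empty cluster keeps its centroid.
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, [nearest_centroid(p, centroids) for p in points]


points = [[0, 0], [0, 2], [10, 10], [10, 12]]
print(kmeans(points, centroids=[[0, 0], [10, 10]]))
```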
            
            
      5- Summarization Techniques:

           1- BERT summarization:

                    - BERT is built from the encoder of the Transformer architecture; it has 12 layers in the base model and 24 layers in the large model.

                    - So, we can take the output of these layers as an embedding vector from the pretrained model.

                    - There are three approaches to the embedding vectors: concatenate the last four layers,

                    - sum the last four layers, or embed the full sentence by taking the mean of the embedding vectors of its tokens.
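The three pooling approaches can be sketched on plain nested lists. Here `hidden_states[layer][token]` is assumed to hold one vector per token per layer; in a real pipeline these values would come from a pretrained BERT model, which is not loaded in this sketch. The function names are illustrative.

```python
def concat_last_four(hidden_states):
    """Per token: concatenate its vectors from the last four layers."""
    last_four = hidden_states[-4:]
    n_tokens = len(last_four[0])
    return [[x for layer in last_four for x in layer[t]] for t in range(n_tokens)]


def sum_last_four(hidden_states):
    """Per token: element-wise sum of its vectors from the last four layers."""
    last_four = hidden_states[-4:]
    n_tokens = len(last_four[0])
    return [
        [sum(vals) for vals in zip(*(layer[t] for layer in last_four))]
        for t in range(n_tokens)
    ]


def mean_sentence_embedding(token_vectors):
    """One vector for the whole sentence: the mean over its token vectors."""
    n = len(token_vectors)
    return [sum(vals) / n for vals in zip(*token_vectors)]


# Toy stand-in for model output: 5 layers, 2 tokens, 2 dimensions.
hidden_states = [[[float(l), float(t)] for t in range(2)] for l in range(5)]
print(concat_last_four(hidden_states)[0])
print(sum_last_four(hidden_states))
print(mean_sentence_embedding(hidden_states[-1]))
```

Concatenation keeps the most information but makes each vector four times longer; summing keeps the original length.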
                    
           2- LSA summarization.
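The source gives no detail on the LSA step. LSA summarizers typically build a term-by-sentence matrix, decompose it with SVD, and pick the sentences that weigh most on the leading latent topics. The sketch below is a deliberately simplified version of that idea, and it is an assumption rather than the project's implementation: it uses only the first latent topic and finds it by power iteration instead of a full SVD.

```python
import math


def lsa_rank_sentences(sentences, n_iter=100):
    """Rank sentence indices by their weight on the first latent topic."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    # Term-by-sentence count matrix A (rows: terms, columns: sentences).
    A = [[toks.count(w) for toks in tokenized] for w in vocab]
    n = len(sentences)
    # G = A^T A; its top eigenvector is the first right singular vector of A.
    G = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    v = [1.0] * n
    for _ in range(n_iter):  # power iteration
        v = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return sorted(range(n), key=lambda i: abs(v[i]), reverse=True)


def lsa_summary(sentences, n_sentences=1):
    """Return the top-ranked sentences in their original order."""
    top = sorted(lsa_rank_sentences(sentences)[:n_sentences])
    return [sentences[i] for i in top]


sentences = [
    "the cat sat on the mat",
    "the cat chased the cat toy on the mat",
    "stocks fell",
]
print(lsa_summary(sentences, n_sentences=1))
```

In practice the input would be the cleaned sentences, so that stop words do not dominate the counts.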
           
       6- Chatbot (question answering system):

               - Extracts the answer from the text summaries produced by BERT and LSA.


       7- Innovation:

               - We created a transformer model to translate the text into Arabic, German, and Chinese.

Conclusion:

     - In this project, we applied data preprocessing, classification techniques, and clustering.

     - Then we applied the LSA and BERT summarization models and compared them.

     - The LSA model achieved better scores, and its summaries were closer to the human summaries than those of the BERT model.

     - After that, we performed error analysis to see what the machine tried to predict.

     - Then we built a simple question answering system to extract the answer from the summary.

     - Finally, we translated the text from English into German, Chinese, and Arabic.
