Skip to content

This mini-assignment demonstrates how to efficiently perform aggregation on a large XML file based on the 'year' tag. It utilizes SAX abstract class for parsing the XML file, ensuring minimal memory usage while handling sizable datasets. After performing the aggregation, the resulting dictionary is plotted for visualization.

License

Notifications You must be signed in to change notification settings

menychtak/SAX-XML_mini_assignment

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

XML Publication Analyzer

This mini-assignment demonstrates how to efficiently perform aggregation and extract tag content from a large XML file to derive meaningful insights. It utilizes SAX abstract class for parsing the XML file, ensuring minimal memory usage while processing large datasets. After completing the aggregation, the resulting dictionary is visualized through a bar-chart using matplotlib.

Folder Structure

The project follows this structure:

.
├── code
│   └── main.py               # Main script to run the XML analyzer
├── dataset
│   └── file.xml              # Place your XML files here (e.g., file.xml)
├── libs
│   ├── xml_parser.py         # XML parsing logic
│   ├── logger_config.py      # Logger setup
│   ├── file_handler.py       # File handling utilities
│   └── plotter.py            # Plotting functions
├── .gitignore
├── README.md
└── requirements.txt

Getting Started

Prerequisites

  • Python 3.x
  • Required packages listed in requirements.txt

To install the necessary packages, run:

pip install -r requirements.txt

Dataset

To replicate the results, download the XML dataset from this link, unzip it, and place the .xml file in the dataset folder. Rename it to file.xml or update the path in the code.

Usage

  1. Add your XML file to the dataset folder.
  2. In main.py, set the tag_name variable to the tag you want to count. For example, to count occurrences of <year> tags, set:
    tag_name = "year"
  3. Run the main script:
    python code/main.py

Output

  • The script will output a dictionary with counts of each year (or specified tag) and display a bar chart of the publication counts.

Example

An example XML file containing <year> tags can be found in the DBLP dataset linked above. The dictionary output will look something like:

{"2020": 100, "2021": 150, "2022": 200}

The program will also display a bar chart with the counts per year.


License

This project is licensed under the MIT License. See the LICENSE file for details.

About

This mini-assignment demonstrates how to efficiently perform aggregation on a large XML file based on the 'year' tag. It utilizes SAX abstract class for parsing the XML file, ensuring minimal memory usage while handling sizable datasets. After performing the aggregation, the resulting dictionary is plotted for visualization.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages