Pure Java solution (zero dependencies) for DOM depth analysis and HTML validation. Optimized O(N) stack-based algorithm.

KevenGustavo/HTML-Analyzer

HtmlAnalyzer - Software Development Intern Challenge

This project is a pure Java solution (no external libraries) designed to analyze the HTML content of a URL and extract the text contained at the deepest level of the structure.
The solution follows the challenge's functional and technical requirements, and it also includes malformed HTML validation (a bonus feature) and connection error handling.

Prerequisites

  • Java JDK 17 installed and configured in the PATH.
  • Internet connection to access the provided URL.

How to Compile and Run

The project is designed to be compiled and executed via the command line, without the need for IDEs or external build tools (Maven/Gradle).

1. Compilation

Navigate to the directory where the HtmlAnalyzer.java file is located and run the following command:

javac HtmlAnalyzer.java

2. Execution

After compilation, run the program passing the desired URL as an argument:

java HtmlAnalyzer http://hiring.axreng.com/internship/example1.html

Implemented Features

  • Deepest Level Extraction: The algorithm traverses the DOM structure and returns the text located at the deepest nesting level.
  • Tie-Breaking Rule: If multiple text snippets exist at the same maximum depth, the program returns the first one found, as per the specification.
  • Malformed HTML Detection (Bonus): The solution identifies structural inconsistencies in the HTML. The program outputs "malformed HTML" if it encounters:
    • Closing tags without a corresponding opening tag (e.g., </div> without an open <div>).
    • Incorrectly crossed tags (e.g., <div><span></div>).
    • Tags that remain open at the end of the file.
  • Error Handling: The program outputs "URL connection error" in cases of network failures, invalid URLs, or timeouts.
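
The depth tracking, tie-breaking, and malformed-HTML checks listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: it assumes each input line holds exactly one opening tag, one closing tag, or one text snippet (the README does not state the input format), and the names DepthSketch and deepestText are hypothetical.

```java
import java.util.List;
import java.util.Stack;

class DepthSketch {
    // Returns the first text found at the maximum depth, or "malformed HTML".
    // Assumption: one opening tag, closing tag, or text snippet per line.
    static String deepestText(List<String> lines) {
        Stack<String> open = new Stack<>(); // names of currently open tags
        String deepest = null;
        int maxDepth = -1;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith("</") && line.endsWith(">")) {
                String name = line.substring(2, line.length() - 1);
                // A closing tag must match the most recently opened tag.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return "malformed HTML";
                }
            } else if (line.startsWith("<") && line.endsWith(">")) {
                open.push(line.substring(1, line.length() - 1));
            } else if (open.size() > maxDepth) {
                // Strict '>' keeps the first snippet when depths tie.
                maxDepth = open.size();
                deepest = line;
            }
        }
        // Tags still open at the end of the input are an error.
        if (!open.isEmpty()) {
            return "malformed HTML";
        }
        return deepest; // null if no text was found (behavior not specified in the README)
    }

    public static void main(String[] args) {
        List<String> page = List.of(
                "<html>", "<body>", "<div>", "first", "</div>",
                "<p>", "second", "</p>", "</body>", "</html>");
        System.out.println(deepestText(page)); // prints "first"
    }
}
```

Both text snippets in the sample sit at depth 3, so the strict comparison returns the earlier one. Each line is pushed or popped at most once, which is where the linear running time comes from.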

Design and Architecture Decisions

To ensure performance and compliance with the "Zero Dependencies" constraint, the following decisions were made:

  • Stack-Based Algorithm: A java.util.Stack data structure was used. The stack size gives the current depth, and comparing each closing tag with the top of the stack validates tag nesting in a single pass, with linear time complexity O(N) in the size of the input. This suits hierarchical structures like HTML.
  • Stream Processing: The use of BufferedReader enables line-by-line processing. This keeps memory usage low, because the page content is never loaded into memory in full before processing.
  • Standard JDK API: The solution uses only standard JDK packages (java.net, java.io, java.util), ensuring portability and complying with the prohibition of third-party libraries or XML/DOM parsing classes.
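
As a rough illustration of the stream-processing and error-handling decisions, the sketch below reads a URL line by line through a BufferedReader and reports any failure as a connection error. It is a sketch under assumptions rather than the project's code: the class FetchSketch, the methods readLines and fetch, the 5-second timeout, and the UTF-8 charset are all hypothetical choices not stated in the README.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLConnection;
import java.util.function.Consumer;

class FetchSketch {
    static final int TIMEOUT_MS = 5000; // assumed value, not from the README

    // Hands each line to the handler as it is read; the full page is never buffered.
    static void readLines(BufferedReader reader, Consumer<String> handler) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            handler.accept(line);
        }
    }

    // Returns true on success, false on an invalid URL, network failure, or timeout.
    static boolean fetch(String address, Consumer<String> handler) {
        try {
            URLConnection conn = new URI(address).toURL().openConnection();
            conn.setConnectTimeout(TIMEOUT_MS);
            conn.setReadTimeout(TIMEOUT_MS);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) { // charset is an assumption
                readLines(reader, handler);
            }
            return true;
        } catch (IOException | URISyntaxException | IllegalArgumentException e) {
            // Malformed addresses, unreachable hosts, HTTP errors, and timeouts all end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // In this sketch a missing argument is treated like a failed connection.
        if (args.length != 1 || !fetch(args[0], System.out::println)) {
            System.out.println("URL connection error");
        }
    }
}
```

In a full program the handler would feed each line to the stack-based analysis instead of printing it, so memory use stays proportional to the nesting depth rather than the page size.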

Author: Keven Gustavo Dos Santos Gomes
