EASTER_EGG_URLS
This project is a pure Java solution (no external libraries) designed to analyze the HTML content of a URL and extract the text contained at the deepest level of the structure.
The solution was developed strictly adhering to the functional and technical requirements, including malformed HTML validation (bonus feature) and robust connection error handling.
- Java JDK 17 installed and configured in the PATH.
- Internet connection to access the provided URL.
The project is designed to be compiled and executed via the command line, without the need for IDEs or external build tools (Maven/Gradle).
Navigate to the directory where the HtmlAnalyzer.java file is located and run the following command:

    javac HtmlAnalyzer.java

After compilation, run the program passing the desired URL as an argument:

    java HtmlAnalyzer http://hiring.axreng.com/internship/example1.html

- Deepest Level Extraction: The algorithm traverses the DOM structure and returns the text located at the deepest nesting level.
- Tie-Breaking Rule: If multiple text snippets exist at the same maximum depth, the program returns the first one found, as per the specification.
- Malformed HTML Detection (Bonus): The solution identifies structural inconsistencies in the HTML. The program will output malformed HTML if it encounters:
  - Closing tags without a corresponding opening tag (e.g., \</div\> without an open \<div\>).
  - Incorrectly crossed tags (e.g., \<div\>\<span\>\</div\>).
  - Tags that remain open at the end of the file.
- Error Handling: Returns URL connection error in cases of network failures, invalid URLs, or timeouts.
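
The extraction, tie-breaking, and malformed-HTML rules listed above can be sketched in a few lines of Java. This is a minimal sketch, not the project's actual source: it assumes each input line holds exactly one opening tag, one closing tag, or one text snippet, and that tags carry no attributes. The class name DeepestTextSketch, the analyze method, and the choice to return an empty string when no text is present are hypothetical, made only for illustration.

```java
import java.util.List;
import java.util.Stack;

class DeepestTextSketch {

    static final String MALFORMED = "malformed HTML";

    /** Returns the first text found at the maximum depth, or MALFORMED. */
    static String analyze(List<String> lines) {
        Stack<String> openTags = new Stack<>();
        String deepestText = null;
        int deepestDepth = -1;

        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no content
            }
            if (line.startsWith("</") && line.endsWith(">")) {
                String name = line.substring(2, line.length() - 1);
                // A closing tag must match the most recently opened tag.
                if (openTags.isEmpty() || !openTags.pop().equals(name)) {
                    return MALFORMED;
                }
            } else if (line.startsWith("<") && line.endsWith(">")) {
                openTags.push(line.substring(1, line.length() - 1));
            } else if (openTags.size() > deepestDepth) {
                // Strict '>' keeps the first snippet when depths tie.
                deepestDepth = openTags.size();
                deepestText = line;
            }
        }
        if (!openTags.isEmpty()) {
            return MALFORMED; // tags left open at the end of the input
        }
        // Assumption: an input without any text yields an empty result.
        return deepestText == null ? "" : deepestText;
    }

    public static void main(String[] args) {
        System.out.println(analyze(List.of(
            "<html>", "<body>", "<div>", "first", "</div>",
            "<p>", "second", "</p>", "</body>", "</html>")));
        System.out.println(analyze(List.of("<div>", "<span>", "</div>")));
    }
}
```

Running the sketch prints first for the well-formed input, because second sits at the same depth but appears later, and malformed HTML for the crossed-tag input.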
To ensure performance and compliance with the "Zero Dependencies" constraint, the following decisions were made:
- Stack-Based Algorithm: A java.util.Stack data structure was used. This allows tracking the current depth and validating correct tag nesting at runtime with linear complexity O(N), ideal for parsing hierarchical structures like HTML.
- Stream Processing: The use of BufferedReader enables line-by-line processing. This optimizes memory usage by avoiding loading the entire page content into memory before processing it.
- Standard JDK API: The solution uses only native libraries (java.net, java.io, java.util), ensuring total portability and complying with the prohibition of third-party libraries or XML/DOM parsing classes.
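
The stream-processing decision, combined with the connection error handling, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's code: the names StreamingFetchSketch, LineHandler, readAll and readUrl are hypothetical, and the 5-second timeout and UTF-8 decoding are arbitrary choices. Only one line is held in memory at a time, and every failure path collapses into a single boolean that the caller turns into the URL connection error message.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLConnection;

class StreamingFetchSketch {

    static final String CONNECTION_ERROR = "URL connection error";
    static final int TIMEOUT_MS = 5_000; // hypothetical timeout value

    /** Hypothetical callback that receives one line at a time. */
    interface LineHandler {
        void handle(String line);
    }

    /** Streams the source line by line; returns false if reading fails. */
    static boolean readAll(Reader source, LineHandler handler) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.handle(line); // the previous line is already discarded
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Opens the URL and streams its body; returns false on any failure. */
    static boolean readUrl(String address, LineHandler handler) {
        try {
            URLConnection connection = new URI(address).toURL().openConnection();
            connection.setConnectTimeout(TIMEOUT_MS);
            connection.setReadTimeout(TIMEOUT_MS);
            // Assumption: the page is encoded as UTF-8.
            return readAll(
                new InputStreamReader(connection.getInputStream(), "UTF-8"),
                handler);
        } catch (IOException | URISyntaxException | IllegalArgumentException e) {
            // Covers invalid URLs, network failures, and timeouts alike.
            return false;
        }
    }

    public static void main(String[] args) {
        // In this sketch a missing argument is treated like an invalid URL.
        if (args.length != 1 || !readUrl(args[0], System.out::println)) {
            System.out.println(CONNECTION_ERROR);
        }
    }
}
```

In a full program, the handler passed to readUrl would feed each line to the stack-based analysis instead of printing it.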
Author: Keven Gustavo Dos Santos Gomes