Commit

unicode decoding, edge case handling & info msg dialog
CyrylSz committed Nov 10, 2024
1 parent 0b49dd2 commit 4f8adb8
Showing 9 changed files with 312 additions and 228 deletions.
Binary file modified CORE-To-Excel/CoreToExcelAggregator.jar
Binary file not shown.
402 changes: 201 additions & 201 deletions CORE-To-Excel/Excel-Output/output.txt

Large diffs are not rendered by default.

Binary file modified CORE-To-Excel/example-Research-Papers-Library.xlsm
Binary file not shown.
14 changes: 7 additions & 7 deletions README.md
@@ -2,20 +2,20 @@
Organize your research papers into a personal data hub! Easily combine complex CSV files sourced from https://core.ac.uk (the world’s largest collection of open-access research papers) into a single TXT file in an Excel-friendly format!

![ss1](/src/screenshots/ss1.png)

## How to Use:
1. [Download the contents of this folder](CORE-To-Excel).
* Note: You can rename the main folder and files as desired, but don't rename the "CORE-Data" and "Excel-Output" folders to enable automatic folder path insertion into the text fields on startup.
* Note: Files will be processed alphabetically, so ensure the newest files are at the end to avoid incorrect checkbox assignments. Files will also be later automatically renamed in numerical order.
2. Place all TXT or CSV files downloaded from https://core.ac.uk into the "CORE-Data" folder.
* Note: The first line of each file should follow this format:
> "workID","oaiID","doi","title","authors","createdDate"
3. Launch the CoreToExcelAggregator.jar file. The default separator is "||", but changes are not recommended if you plan to proceed to the next step. If the file paths are correct, the play button will be enabled. Press it!
3. Launch the CoreToExcelAggregator.jar file. The default separator is "|"; changing it is not recommended if you plan to proceed to the next step. If the file paths are correct, the play button will be enabled. Press it!
* Note: Place this file in the same folder as the "CORE-Data" and "Excel-Output" folders, and do not rename them. This will allow the automatic path insertion into the text fields upon startup.
* Note: You need to have Java installed. You can download it from https://www.oracle.com/java/technologies/downloads/.
4. Open example-Research-Papers-Library.xlsm, and go to: Data → Queries & Connections → double-left-click on "output" → expand APPLIED STEPS click LMB on "Source" → in the equation bar near the top adjust the path (it should end with "Excel-Output\output.txt").
* Note: If you later change location of output.txt you need to repeat this step.
4. Open example-Research-Papers-Library.xlsm, and go to: Data → Queries & Connections → double-click "output" → under "APPLIED STEPS", left-click "Source" → in the formula bar near the top, adjust the path (it should end with "Excel-Output\output.txt").
* Note: If you later change the location of output.txt or your main folder, you will need to repeat this step. The Excel workbook itself isn't location-dependent.
5. Now save and reopen and you're done! The output.txt file should now load automatically upon opening the Excel workbook.
* Note: Checkboxes aren’t automatically populated, but you can easily drag the fill handle to extend them across rows.
* Note: Checkboxes aren’t automatically populated, but you can easily drag the fill handle to extend them down the first column.
* Tip: If you've completed your library, you can disable the automatic loading of output.txt by navigating to: Data → Queries & Connections → right-click on "output" → Properties... → uncheck "Refresh data when opening the file".
* Tip: For a cleaner view, consider hiding the second and third columns.
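
Step 2 of the README requires each input file to start with a specific header line. A minimal sketch of how that could be checked before launching the tool is shown here. `HeaderCheck`, `isExpectedHeader`, and `fileHasExpectedHeader` are hypothetical names that do not appear in the commit, the header string is copied from the `HEADER` constant in `CoreToExcelAggregator.java`, and tolerating a leading byte-order mark or surrounding whitespace is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class HeaderCheck {
    // Copied from the HEADER constant in the commit.
    static final String EXPECTED_HEADER =
            "\"workID\",\"oaiID\",\"doi\",\"title\",\"authors\",\"createdDate\"";

    // Returns true when the given first line matches the expected header.
    // Assumption: a leading byte-order mark and surrounding whitespace are ignored.
    static boolean isExpectedHeader(String firstLine) {
        if (firstLine == null) {
            return false;
        }
        String candidate = firstLine.startsWith("\uFEFF") ? firstLine.substring(1) : firstLine;
        return EXPECTED_HEADER.equals(candidate.trim());
    }

    // Reads only the first line of the file and checks it.
    static boolean fileHasExpectedHeader(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return isExpectedHeader(reader.readLine());
        }
    }

    public static void main(String[] args) {
        System.out.println(isExpectedHeader(EXPECTED_HEADER));
        System.out.println(isExpectedHeader("id,title,authors"));
    }
}
```

Running such a check over the "CORE-Data" folder first would point out files that the aggregator is likely to reject or mis-parse.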

@@ -26,7 +26,7 @@ Organize your research papers into a personal data hub! Easily combine complex C
* Deleting redundant research papers from the list.
* Packaging IDs into links for improved usability.
* Dividing each row into 6 parts with 5 specified separators.
* Removing unnecessary symbols.
* Removing unnecessary symbols and Unicode decoding.
* Excel stuff: checkboxes, conditional formatting, macros...

## Why:
106 changes: 91 additions & 15 deletions src/CoreToExcelAggregator.java
@@ -1,11 +1,15 @@
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.Function;
import java.util.regex.*;

public class CoreToExcelAggregator {
static String separator = "||";
static String separator = "|";
static final String HEADER = "\"workID\",\"oaiID\",\"doi\",\"title\",\"authors\",\"createdDate\"";

public static int duplicateLineCount = 0;
public static int forceFixCount = 0;
public static int lineCount = 0;
public static void process(String dataFolderPath, String outputFolderPath) throws IOException {
File dataFolder = new File(dataFolderPath);
File outputFolder = new File(outputFolderPath);
Expand Down Expand Up @@ -49,6 +53,12 @@ public static void process(String dataFolderPath, String outputFolderPath) throw
System.out.println("The folder is empty or an error occurred.");
}

try (BufferedReader reader = new BufferedReader(new FileReader(outputFile))) {
while (reader.readLine() != null) {
lineCount++;
}
}

renameFilesAlphabetically(dataFolder);
}
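
The new block in `process` counts the lines of the output file so the GUI can report the number of papers. A minimal alternative sketch of the same count is shown here. `LineCounter`, `countLines`, and `paperCount` are hypothetical names, reading as UTF-8 is an assumption, and subtracting one line assumes the output file starts with a header row (the GUI in this commit also subtracts 1).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LineCounter {
    // Counts the lines available from any character source.
    static long countLines(Reader source) {
        BufferedReader reader = new BufferedReader(source);
        return reader.lines().count();
    }

    // Counts the lines of a file; the reader is closed by try-with-resources.
    // Assumption: the file is UTF-8 encoded.
    static long countLines(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return countLines(reader);
        }
    }

    // Assumption: the first line is a header, so it is not counted as a paper.
    static long paperCount(long totalLines) {
        return Math.max(0, totalLines - 1);
    }
}
```

Returning the count instead of incrementing a public static field would also remove the need to reset the counters before each run in the GUI.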
private static void convertFilesToTxt(File dataFolder) {
Expand Down Expand Up @@ -156,11 +166,15 @@ private static File removeDuplicateIDs(File fixedFile, File outputFolder) throws
if (uniqueLines.stream().noneMatch(existingLine -> existingLine.startsWith(uniqueID + ","))) {
uniqueLines.add(line);
}
else System.out.println("Duplicates of ID:" + uniqueID + " removed");
else {
System.out.println("Duplicates of ID:" + uniqueID + " removed");
duplicateLineCount++;
}
}
try (BufferedWriter bw = new BufferedWriter(new FileWriter(outputFile))) {
for (String uniqueLine : uniqueLines) {
bw.write(uniqueLine);
String modifiedLine = uniqueLine.replace(separator, "I");
bw.write(modifiedLine);
bw.newLine();
}
}
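
The duplicate check in this hunk scans every line kept so far (`noneMatch` with `startsWith`) for each incoming line. A possible alternative, not what the commit does, is to remember the IDs already seen in a `HashSet`. The sketch below uses the hypothetical names `DuplicateFilter` and `keepFirstPerId` and assumes the ID is the text before the first comma, since the code that derives `uniqueID` is not visible in this hunk.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateFilter {
    // Keeps the first line for each ID and drops later lines with the same ID.
    // Assumption: the ID is everything before the first comma.
    static List<String> keepFirstPerId(List<String> lines) {
        Set<String> seenIds = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            String id = comma >= 0 ? line.substring(0, comma) : line;
            // Set.add returns false when the ID was already present.
            if (seenIds.add(id)) {
                unique.add(line);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> input = List.of("1,\"a\"", "2,\"b\"", "1,\"c\"");
        List<String> unique = keepFirstPerId(input);
        System.out.println(unique);
        System.out.println("Duplicates removed: " + (input.size() - unique.size()));
    }
}
```

The number of removed duplicates can be derived from the two list sizes, which would make a separate counter such as `duplicateLineCount` unnecessary in this variant.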
@@ -170,6 +184,7 @@ private static File removeDuplicateIDs(File fixedFile, File outputFolder) throws
return outputFile;
}


private static void mainTransformation(File inputFile, File outputFile) throws IOException {

try (BufferedReader reader = new BufferedReader(new FileReader(inputFile));
@@ -180,15 +195,72 @@ private static void mainTransformation(File inputFile, File outputFile) throws I

String line;
while ((line = reader.readLine()) != null) {
// System.out.println("Original line: " + line);

String[] parts = splitOutsideQuotesAndBrackets(line);
// System.out.println("Parts after split: ");
// for (int i=0; i< parts.length; i++)System.out.println(parts[i]);

// System.out.println("parts.lenght=" + parts.length);
if (parts.length < 6) {
System.out.println("Invalid line format, skipping line: " + line + " | parts.lenght=" + parts.length);
continue;
if (parts.length != 6) {
System.out.println("-----------------------");
System.out.println("Invalid line format: " + line + " | parts.length=" + parts.length);
System.out.println("Parts after split: ");
for (String part : parts) System.out.println(part);

// FORCE FIX
Function<String, String[]> forceFixLine = badLine -> {
String validTitlePattern = "(?<=,)(\"[^\"]*\")(?=,)";
Pattern pattern = Pattern.compile(validTitlePattern);
Matcher matcher = pattern.matcher(badLine);

String[] fixedParts = new String[6];
Arrays.fill(fixedParts, "");
fixedParts[3] = "No Title: This line contained unsafe characters for processing and was fixed by force!";
StringBuilder concatenatedStrings = new StringBuilder();

while (matcher.find()) {
String part = matcher.group(1).replaceAll("^\"|\"$", "").replace("\\\"", "\"");
if (!concatenatedStrings.isEmpty()) {
concatenatedStrings.append(" ");
}
concatenatedStrings.append(part.replace(",", " "));
}

fixedParts[4] = !concatenatedStrings.isEmpty() ? concatenatedStrings.toString() : "Nobody";

Pattern numberPattern = Pattern.compile("^([^,]+)(?=,)");
Matcher numberMatcher = numberPattern.matcher(badLine);
if (numberMatcher.find()) {
fixedParts[0] = numberMatcher.group(1);
} else {
fixedParts[0] = "0";
}
Function<String, Boolean> isValidDateTime = dateTime -> dateTime.matches("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}");
Function<String, String> extractDateTime = lineToSearch -> {
String dateTimePattern = "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2})";
Pattern datePattern = Pattern.compile(dateTimePattern);
Matcher dateMatcher = datePattern.matcher(lineToSearch);
if (dateMatcher.find()) {
return dateMatcher.group(1);
}
return null;
};
if (fixedParts[5] != null && !isValidDateTime.apply(fixedParts[5])) {
String dateTime = extractDateTime.apply(badLine);
if (dateTime != null) {
fixedParts[5] = dateTime;
} else {
fixedParts[5] = "0000-00-00T00:00:00";
}
} else if (fixedParts[5] == null) {
fixedParts[5] = "0000-00-00T00:00:00";
}

return fixedParts;
};

parts = forceFixLine.apply(line);
forceFixCount++;

System.out.println("Parts after FORCE FIX:");
for (String part : parts) System.out.println(part);
}

String idNumber = parts[0];
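
Part of the force fix above recovers the `createdDate` field by searching the malformed line for a timestamp and falling back to a placeholder. That step can be sketched in isolation as follows. `TimestampRecovery` and `firstTimestampOrPlaceholder` are hypothetical names, while the pattern and the placeholder value are copied from the commit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimestampRecovery {
    // Same shape as the pattern in the commit: yyyy-MM-ddTHH:mm:ss
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}");

    // Placeholder used by the commit when no timestamp can be found.
    static final String PLACEHOLDER = "0000-00-00T00:00:00";

    // Returns the first timestamp found anywhere in the line, or the placeholder.
    static String firstTimestampOrPlaceholder(String line) {
        Matcher matcher = TIMESTAMP.matcher(line);
        return matcher.find() ? matcher.group() : PLACEHOLDER;
    }

    public static void main(String[] args) {
        System.out.println(firstTimestampOrPlaceholder("123,broken \"title,2021-03-04T05:06:07"));
        System.out.println(firstTimestampOrPlaceholder("123,no date here"));
    }
}
```

Note that the pattern only checks the digit layout, so a value such as `9999-99-99T99:99:99` would also be accepted.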
@@ -203,15 +275,19 @@ private static void mainTransformation(File inputFile, File outputFile) throws I

// Concatenate parts
transformedLine += separator + string0 + separator + string1 + separator + string2 + separator + formattedNames + separator + timestamp;

writer.write(transformedLine);
String decodedLine = decodeUnicodeInString(transformedLine);
writer.write(decodedLine);
writer.newLine();
}

} catch (IOException e) {
e.printStackTrace();
}
}
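public static String decodeUnicodeInString(String input) {
// quoteReplacement keeps a decoded '$' or '\' from being read as replacement syntax
return Pattern.compile("\\\\u([0-9a-fA-F]{4})")
.matcher(input)
.replaceAll(match -> Matcher.quoteReplacement(String.valueOf((char) Integer.parseInt(match.group(1), 16))));
}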
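
One edge case worth noting for the decoder: `Matcher.replaceAll(Function)` interprets `$` and `\` in the returned string as replacement syntax, so an escape that decodes to one of those characters needs quoting. The following self-contained sketch shows a variant that wraps the decoded character in `Matcher.quoteReplacement`. `UnicodeEscapeDecoder` and `decode` are hypothetical names; the regular expression is the one from the commit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDecoder {
    // Matches a literal backslash, the letter u, and four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each escape with the character it encodes.
    static String decode(String input) {
        return ESCAPE.matcher(input).replaceAll(match -> {
            char decoded = (char) Integer.parseInt(match.group(1), 16);
            // Quote the result so '$' and '\' are inserted literally.
            return Matcher.quoteReplacement(String.valueOf(decoded));
        });
    }

    public static void main(String[] args) {
        System.out.println(decode("Schr\\u00f6dinger"));
        System.out.println(decode("cost: \\u002410"));
    }
}
```

Two consecutive escapes that form a surrogate pair are each decoded to one `char`, so the pair ends up adjacent in the output and forms the intended character.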

private static String[] splitOutsideQuotesAndBrackets(String line) {
String fixedLine = line.replace("\\\"", "");
@@ -232,7 +308,7 @@ private static String formatNamesList(String namesList) {
}

String formattedNames = String.join(" ", nameParts);
String trimmed = formattedNames.substring(2, formattedNames.length() - 2);
String trimmed = formattedNames.length() >=4 ? formattedNames.substring(2, formattedNames.length() - 2) : "Nobody";
formattedNames = trimmed.replaceAll("\"\"", ", ");

return formattedNames;
18 changes: 13 additions & 5 deletions src/CoreToExcelAggregatorGUI.java
@@ -33,7 +33,7 @@ public static void main(String[] args) {
JPanel separatorPanel = new JPanel(new FlowLayout(FlowLayout.CENTER));
separatorPanel.setBackground(new Color(0xb6ccc9));
JLabel separatorLabel = new JLabel("Excel Separator:");
JTextField separatorTextField = new JTextField("||", 4);
JTextField separatorTextField = new JTextField(CoreToExcelAggregator.separator, 4);
separatorPanel.add(separatorLabel);
separatorPanel.add(separatorTextField);
gbc.gridx = 0;
Expand Down Expand Up @@ -116,8 +116,18 @@ public static void main(String[] args) {
public void actionPerformed(ActionEvent e) {
CoreToExcelAggregator.separator = separatorTextField.getText().trim();
try {
CoreToExcelAggregator.duplicateLineCount = 0;
CoreToExcelAggregator.forceFixCount = 0;
CoreToExcelAggregator.lineCount = 0;

CoreToExcelAggregator.process(dataFolderPath, outputFolderPath);
JOptionPane.showMessageDialog(frame, "Process completed.");

String message = "<html>" +
"Amount: <b><font size='+1'>" + (CoreToExcelAggregator.lineCount - 1) + " Papers!</font></b><br>" +
"Duplicate Papers Removed: " + CoreToExcelAggregator.duplicateLineCount + "<br>" +
"Line Fixes By Force: " + CoreToExcelAggregator.forceFixCount + "</html>";
JOptionPane.showMessageDialog(frame, message, "Process completed.", JOptionPane.INFORMATION_MESSAGE);

} catch (IOException ex) {
JOptionPane.showMessageDialog(frame, "Error during transformation: " + ex.getMessage());
}
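
The summary dialog text is assembled inline in the action listener. A minimal sketch of building the same HTML in a small helper, which keeps the tags balanced and can be checked without opening a window, is shown here. `SummaryMessage` and `build` are hypothetical names, and subtracting one from the line count assumes the output file begins with a header row.

```java
class SummaryMessage {
    // Builds the HTML body for the completion dialog.
    // Assumption: lineCount includes one header line that is not a paper.
    static String build(int lineCount, int duplicatesRemoved, int forcedFixes) {
        int papers = Math.max(0, lineCount - 1);
        return "<html>"
                + "Amount: <b><font size='+1'>" + papers + " Papers!</font></b><br>"
                + "Duplicate Papers Removed: " + duplicatesRemoved + "<br>"
                + "Line Fixes By Force: " + forcedFixes
                + "</html>";
    }

    public static void main(String[] args) {
        System.out.println(build(11, 2, 1));
    }
}
```

The returned string could then be passed to `JOptionPane.showMessageDialog(frame, message, "Process completed.", JOptionPane.INFORMATION_MESSAGE)` as in the commit.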
@@ -128,9 +138,7 @@ public void actionPerformed(ActionEvent e) {
buttonGbc.gridwidth = 2;
buttonPanel.add(startProgramButton, buttonGbc);

gbc.gridx = 0;
gbc.gridy = 3;
gbc.gridwidth = 2;
gbc.gridy = 4;
frame.add(buttonPanel, gbc);

frame.setVisible(true);
Binary file modified src/screenshots/ss1.png
