Documentation #9

tripleo1 · 2023-02-18T15:37:57Z

tripleo1
Feb 18, 2023
Maintainer

I pulled the old jnode.org website and put the contents here:

https://github.com/jnode-revisited/dataset-jnode.org

There is scripting work to do if anyone is familiar with manipulating text and specifically HTML.

ghost · 2023-03-27T01:15:13Z

ghost
Mar 27, 2023

There is scripting work to do if anyone is familiar with manipulating text and specifically HTML.

First I removed local asset files such as JS and CSS. Then I converted all the HTML files to Markdown. I made a fork where I'm making the necessary changes; don't use this code.

#!/usr/bin/env python3
import os
from bs4 import BeautifulSoup, Doctype

# reset (shell commands, run separately from this Python script): delete local asset files
# find . -name \*.css -type f -delete
# find . -name \*.icon -type f -delete
# find . -name \*.ico -type f -delete
# find . -name \*.js -type f -delete
# find . -name \*.png -type f -delete
# find . -name \*.svg -type f -delete
# find . -name \*.jpeg -type f -delete
# find . -name \*.jpg -type f -delete
# find . -name \*.jfif -type f -delete
# find . -name \*.json -type f -delete
# find . -name \*.gif -type f -delete

directory = './data_sets/'
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.html'):
            fname = os.path.join(root, filename)
            print('Filename: {}'.format(fname))
            with open(fname) as handle:
                soup = BeautifulSoup(handle.read(), 'html.parser')
                for item in soup.contents:
                    if isinstance(item, Doctype):
                        print('Doctype: {}'.format(item))
                        break
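
The commented-out `find` commands above could also be done from Python, which keeps the whole cleanup in one language. The following is a minimal sketch, not the code used in the fork: the function name `delete_assets`, the `dry_run` flag, and the `ASSET_EXTENSIONS` constant are hypothetical names introduced here, and the extension list is copied from the comments above. Matching is by file suffix, which roughly mirrors `find -name '*.ext'`.

```python
from pathlib import Path

# Extensions copied from the commented-out find commands in the post above.
ASSET_EXTENSIONS = {".css", ".icon", ".ico", ".js", ".png", ".svg",
                    ".jpeg", ".jpg", ".jfif", ".json", ".gif"}


def delete_assets(root, extensions=ASSET_EXTENSIONS, dry_run=True):
    """Return asset files under root, deleting them unless dry_run is True."""
    # rglob("*") walks the tree recursively; keep regular files whose
    # suffix (case-sensitive, like find -name) is in the extension set.
    matches = sorted(p for p in Path(root).rglob("*")
                     if p.is_file() and p.suffix in extensions)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `delete_assets('./data_sets/')` only lists what would be removed; passing `dry_run=False` actually deletes the files, so it is worth checking the list first.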

1 reply

ghost Mar 30, 2023

Hello everyone!

First I removed local asset files such as JS and CSS. Then I converted all the HTML files to Markdown. I made a fork where I'm making the necessary changes; don't use this code.

That code makes no sense; a better version could be:

import os
from markdownify import markdownify

directory = './dataset/'
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.html'):
            fname = os.path.join(root, filename)
            # Read the HTML source; the with-block closes the file afterwards.
            with open(fname, "r") as source:
                html = source.read()
            # Convert to Markdown with ATX-style ('#') headings.
            markdown = markdownify(html, heading_style="ATX")
            # Overwrite the original file in place; the name keeps its .html extension.
            with open(fname, "w") as target:
                target.write(markdown)

This code converts every HTML file it finds to a Markdown document. License: https://brianli.com/python-convert-html-markdown/. I answered this here: jnode-revisited/dataset-jnode.org/issues/1 and github.com/jnode-revisited/dataset-jnode.org/pull/2
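
Since the script above writes the Markdown back into files that still end in `.html`, one possible variation is to write a `.md` file next to each source file and leave the HTML untouched. This is only a sketch under assumptions: `convert_tree` is a hypothetical helper name, the UTF-8 encoding with replacement of undecodable bytes is a guess about the dataset, and the converter is passed in as a callable so the walk logic does not depend on any particular library.

```python
import os


def convert_tree(directory, convert):
    """Write a .md file next to every .html file under directory.

    `convert` is any callable that maps an HTML string to a Markdown string.
    Returns the sorted list of .md paths that were written.
    """
    written = []
    for root, _dirnames, filenames in os.walk(directory):
        for filename in filenames:
            if not filename.endswith(".html"):
                continue
            source = os.path.join(root, filename)
            # page.html -> page.md, in the same directory.
            target = os.path.splitext(source)[0] + ".md"
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            with open(source, "r", encoding="utf-8", errors="replace") as handle:
                html = handle.read()
            with open(target, "w", encoding="utf-8") as handle:
                handle.write(convert(html))
            written.append(target)
    return sorted(written)
```

With markdownify installed, this could be called as `convert_tree('./dataset/', lambda html: markdownify(html, heading_style="ATX"))` after `from markdownify import markdownify`, which keeps the same conversion as the script above.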
