-
First I removed local asset files such as JS and CSS, so that I can convert all the HTML files to Markdown. I made a fork where I'm making the necessary changes; don't use this code.
#!/usr/bin/env python3
import os
import html2markdown
from bs4 import BeautifulSoup, Doctype
# reset
# find . -name \*.css -type f -delete
# find . -name \*.icon -type f -delete
# find . -name \*.ico -type f -delete
# find . -name \*.js -type f -delete
# find . -name \*.png -type f -delete
# find . -name \*.svg -type f -delete
# find . -name \*.jpeg -type f -delete
# find . -name \*.jpg -type f -delete
# find . -name \*.jfif -type f -delete
# find . -name \*.json -type f -delete
# find . -name \*.gif -type f -delete
directory = './data_sets/'
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.html'):
            fname = os.path.join(root, filename)
            print('Filename: {}'.format(fname))
            with open(fname) as handle:
                soup = BeautifulSoup(handle.read(), 'html.parser')
                for item in soup.contents:
                    if isinstance(item, Doctype):
                        print('Doctype: {}'.format(item))
                        break
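The commented-out find commands above could also be done from Python, which keeps the cleanup in the same script as the HTML walk. This is only a rough sketch, not what the fork does: the function name `delete_assets`, the `ASSET_EXTENSIONS` tuple, and the `dry_run` flag are made up for illustration, and the extension list is simply copied from the find lines.

```python
import os

# Extensions copied from the commented-out find commands (assumed complete).
ASSET_EXTENSIONS = ('.css', '.icon', '.ico', '.js', '.png', '.svg',
                    '.jpeg', '.jpg', '.jfif', '.json', '.gif')


def delete_assets(directory, extensions=ASSET_EXTENSIONS, dry_run=True):
    """Return the sorted paths of matching files.

    Files are only deleted when dry_run is False (hypothetical safety flag).
    """
    matched = []
    for root, _dirnames, filenames in os.walk(directory):
        for filename in filenames:
            # str.endswith accepts a tuple, so one check covers every extension.
            if filename.endswith(extensions):
                path = os.path.join(root, filename)
                matched.append(path)
                if not dry_run:
                    os.remove(path)
    return sorted(matched)
```

Calling `delete_assets('./data_sets/')` only lists what would go; `delete_assets('./data_sets/', dry_run=False)` actually removes the files, so it is worth checking the list first.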
-
I pulled the old jnode.org website and put the contents here:
https://github.com/jnode-revisited/dataset-jnode.org
There is scripting work to do if anyone is familiar with manipulating text and specifically HTML.