Commit 3f25597

Don't use Tika for now, just extract text from HTML (Nokogiri), PDF, and plaintext.
1 parent 0d56069 commit 3f25597

3 files changed: 33 additions, 14 deletions

Gemfile

Lines changed: 2 additions & 1 deletion
@@ -1,2 +1,3 @@
 source "https://rubygems.org"
-gem 'ruby-filemagic', '0.6.1'
+gem 'nokogiri'
+gem 'ruby-filemagic'

Gemfile.lock

Lines changed: 7 additions & 1 deletion
@@ -1,10 +1,16 @@
 GEM
   remote: https://rubygems.org/
   specs:
+    headless (1.0.2)
+    mini_portile (0.5.3)
+    nokogiri (1.6.1)
+      mini_portile (~> 0.5.0)
     ruby-filemagic (0.6.1)
 
 PLATFORMS
   ruby
 
 DEPENDENCIES
-  ruby-filemagic (= 0.6.1)
+  headless
+  nokogiri
+  ruby-filemagic

docs2csv.rb

Lines changed: 24 additions & 12 deletions
@@ -15,6 +15,8 @@
 require 'optparse'
 require 'uri'
 require 'csv'
+require 'filemagic'
+require 'nokogiri'
 
 # ------------------------------------------- Modules, functions ----------------------------------------
 # text extraction, directory recursion, file matching
@@ -42,6 +44,11 @@ def extractTextFromPDF(filename, options)
   text
 end
 
+# Extract text from specified HTML.
+def extractTextFromHTML(filename)
+  Nokogiri::HTML(File.open(filename).read).text
+end
+
 # OCR a specific file.
 # Requires a tmp path to where the output file will be written (won't be deleted after use)
 # More or less just a tesseract call, but we turn on orientation detection.
@@ -98,15 +105,15 @@ def extractTextTika(filename)
 # extract text from specified file
 # Format dependent
 def extractTextFromFile(filename, options)
-  format = File.extname(filename)
-  if format == ".pdf"
+  mime = FileMagic.mime.file(filename)
+  if mime.start_with?("application/pdf")
     extractTextFromPDF(filename, options)
-  elsif format == ".jpg"
-    ocrImage(filename, options)
-  elsif format == ".txt"
+  elsif mime.start_with?("text/html")
+    extractTextFromHTML(filename)
+  elsif mime.start_with?("text/plain")
     File.open(filename).read
   else
-    extractTextTika(filename)
+    false
   end
 end
 
@@ -163,12 +170,17 @@ def processFile(filename, options)
   # - title, the filename (relative)
   # - url, an http://localhost:8000 URL to the relative path
   if options.process
-    text = cleanText(extractTextFromFile(filename, options))
-    title = filename
-    url = "http://localhost:8000/" + filename
-    uid = Digest::MD5.hexdigest(filename)
-
-    options.csv << [uid, text, title, url]
+    text = extractTextFromFile(filename, options)
+
+    if text
+      uid = Digest::MD5.hexdigest(filename)
+      text = cleanText(text)
+      title = filename
+      url = "http://localhost:8000/" + filename
+      options.csv << [uid, text, title, url]
+    else
+      STDERR.write "Skipping #{filename}\n"
+    end
   end
 rescue => error
   STDERR.write "Error processing #{filename}, skipping.\n"
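
Note on the dispatch introduced above: extractTextFromFile now branches on the detected MIME string rather than the file extension, returns false for anything it does not handle, and processFile skips the file in that case. With the ".jpg" branch removed, images are skipped rather than sent to OCR. The sketch below is one possible way to express the same flow as a lookup table. It is not part of this commit. The names EXTRACTORS, extract_text, and build_row are hypothetical, the MIME string is passed in as an argument (in the commit it comes from FileMagic.mime.file), only the plain-text extractor is filled in so the example runs on the standard library alone, and the cleanText step is left out.

```ruby
require 'digest'

# Hypothetical table mapping a MIME prefix to an extractor lambda.
# Only plain text is implemented here; the HTML and PDF entries would
# call the commit's own helpers and are shown as comments.
EXTRACTORS = {
  "text/plain" => ->(path) { File.read(path) }
  # "text/html"       => ->(path) { extractTextFromHTML(path) },
  # "application/pdf" => ->(path) { extractTextFromPDF(path, options) },
}.freeze

# Returns the extracted text, or false when no extractor matches,
# mirroring the return convention of extractTextFromFile in the diff.
# Prefix matching is used because the detected string can carry a
# suffix such as "; charset=us-ascii".
def extract_text(path, mime, extractors = EXTRACTORS)
  _prefix, extractor = extractors.find { |prefix, _| mime.start_with?(prefix) }
  extractor ? extractor.call(path) : false
end

# Builds the [uid, text, title, url] row written to the CSV,
# or returns nil so the caller can log "Skipping ..." and move on.
def build_row(path, mime)
  text = extract_text(path, mime)
  return nil unless text

  [Digest::MD5.hexdigest(path), text, path, "http://localhost:8000/" + path]
end
```

File.read is used here instead of File.open(filename).read because it closes the file handle once the contents are read. A table like this keeps the list of supported types in one place, at the cost of an extra level of indirection compared with the if/elsif chain in the commit.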
