replace printable for try/except utf-8 #2255

Merged
merged 9 commits into from
Aug 29, 2017
addressing @josenavas comments
antgonza committed Aug 29, 2017
commit e195e07c3e40c283d7444a8d5241dbb028b15b11
15 changes: 4 additions & 11 deletions qiita_db/metadata_template/util.py
@@ -9,6 +9,7 @@
 from __future__ import division
 from future.utils import PY3, viewitems
 from six import StringIO
+from collections import defaultdict

 import pandas as pd
 import numpy as np
@@ -102,24 +103,16 @@ def load_template_to_dataframe(fn, index='sample_name'):
     # Load in file lines
     holdfile = None
     with open_file(fn, mode='U') as f:
-        errors = {}
+        errors = defaultdict(list)
         holdfile = f.readlines()
         # here we are checking for non UTF-8 chars
         for row, line in enumerate(holdfile):
             for col, block in enumerate(line.split('\t')):
                 try:
Contributor comment:

This entire try/except block can be replaced by:

try:
    tblock = block.encode('utf-8')
except UnicodeDecodeError:
    tblock = unicode(block, errors='replace')
    tblock = tblock.replace(u'\ufffd', '🐾')
    if tblock not in errors:
        errors[tblock] = []
    errors[tblock].append('(%d, %d)' % (row, col))

Also, if errors is initialized as a defaultdict(list), the membership check can be dropped:

try:
    tblock = block.encode('utf-8')
except UnicodeDecodeError:
    tblock = unicode(block, errors='replace')
    tblock = tblock.replace(u'\ufffd', '🐾')
    errors[tblock].append('(%d, %d)' % (row, col))

The character u'\ufffd' (U+FFFD, REPLACEMENT CHARACTER) is the official Unicode character used in place of input that can't be decoded. The call to replace swaps it for our "qiita" paws.
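
For readers on Python 3, where `unicode()` no longer exists, the same idea can be sketched as below. This is only an illustration and not the code in this PR: the helper name `find_undecodable_cells` is hypothetical, it takes raw `bytes` rows rather than a file handle, and it decodes explicitly as UTF-8 instead of relying on the interpreter's default encoding.

```python
from collections import defaultdict

REPLACEMENT = '\ufffd'   # U+FFFD, REPLACEMENT CHARACTER
PAW = '\U0001F43E'       # the paw-prints emoji used as the visible marker


def find_undecodable_cells(lines):
    """Map each offending cell (with paws marking the bad bytes) to the
    list of '(row, col)' positions where it occurs.

    `lines` is assumed to be an iterable of tab-separated bytes rows;
    this helper is hypothetical and not part of qiita.
    """
    errors = defaultdict(list)
    for row, line in enumerate(lines):
        for col, block in enumerate(line.split(b'\t')):
            try:
                block.decode('utf-8')
            except UnicodeDecodeError:
                # errors='replace' substitutes U+FFFD for undecodable bytes
                tblock = block.decode('utf-8', errors='replace')
                tblock = tblock.replace(REPLACEMENT, PAW)
                # defaultdict(list) creates the list on first access
                errors[tblock].append('(%d, %d)' % (row, col))
    return errors


rows = [b'sample_name\tdescription', b'S1\tbad\xffvalue']
found = find_undecodable_cells(rows)
```

Because `errors` is a `defaultdict(list)`, an empty result is falsy, so a caller can still guard with `if errors:` before raising.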

                     tblock = block.encode('utf-8')
                 except UnicodeDecodeError:
-                    tblock = []
-                    for c in block:
-                        try:
-                            c.encode('utf-8')
-                            tblock.append(c)
-                        except UnicodeDecodeError:
-                            tblock.append('🐾')
-                    tblock = ''.join(tblock)
-                    if tblock not in errors:
-                        errors[tblock] = []
+                    tblock = unicode(block, errors='replace')
+                    tblock = tblock.replace(u'\ufffd', '🐾')
                     errors[tblock].append('(%d, %d)' % (row, col))
         if bool(errors):
             raise ValueError(