@anarchivist
Created April 19, 2010 16:24
replacement RowDict for kochief's marc parser
class RowDict(dict):
    """
    Subclass of dict that joins sequences and encodes to utf-8 on get.
    Encoding to utf-8 is necessary for Python's csv library because it
    can't handle unicode.
    >>> row = RowDict()
    >>> row['bob'] = ['Montalb\\xe2an, Ricardo', 'Roddenberry, Gene']
    >>> row.get('bob')
    'Montalb\\xc3\\xa1n, Ricardo|Roddenberry, Gene'
    >>> print row.get('bob')
    Montalbán, Ricardo|Roddenberry, Gene
    """
    def get(self, key, *args):
        value = dict.get(self, key, *args)
        if not value:
            return ''
        if hasattr(value, '__iter__'):
            try:
                value = '|'.join([x for x in value if x])
            except UnicodeDecodeError:
                value = '|'.join([x.decode('utf8') for x in value if x])
        #return pymarc.marc8.marc8_to_unicode(value).encode('utf8')
        try:
            return value.encode('utf8')
        except UnicodeDecodeError:
            return value.decode('utf8', 'replace').encode('utf8')
@gsf

gsf commented Apr 19, 2010

Tell me if this works:

class RowDict(dict):
    """
    Subclass of dict that joins sequences and encodes to utf-8 on get.
    Encoding to utf-8 is necessary for Python's csv library because it 
    can't handle unicode.
    >>> row = RowDict()
    >>> row['bob'] = ['Montalb\\xe2an, Ricardo', 'Roddenberry, Gene']
    >>> row.get('bob')
    'Montalb\\xc3\\xa1n, Ricardo|Roddenberry, Gene'
    >>> print row.get('bob')
    Montalbán, Ricardo|Roddenberry, Gene
    """
    def get(self, key, *args):
        value = dict.get(self, key, *args)
        if not value:
            return ''
        if hasattr(value, '__iter__'):
            value = '|'.join([x for x in value if x])
        return value.decode('utf8', 'replace').encode('utf8')

@gsf

gsf commented Apr 19, 2010

Actually, no, try this:

class RowDict(dict):
    """
    Subclass of dict that joins sequences and encodes to utf-8 on get.
    Encoding to utf-8 is necessary for Python's csv library because it 
    can't handle unicode.
    >>> row = RowDict()
    >>> row['bob'] = ['Montalb\\xe2an, Ricardo', 'Roddenberry, Gene']
    >>> row.get('bob')
    'Montalb\\xc3\\xa1n, Ricardo|Roddenberry, Gene'
    >>> print row.get('bob')
    Montalbán, Ricardo|Roddenberry, Gene
    """
    def get(self, key, *args):
        value = dict.get(self, key, *args)
        if not value:
            return ''
        if hasattr(value, '__iter__'):
            return '|'.join([x.decode('utf8', 'replace') for x in
                value if x]).encode('utf8')
        return value.decode('utf8', 'replace').encode('utf8')

@anarchivist
Author

Nope, neither of those works for me; only the first version (the gist above) does.
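The thread does not say why the two suggestions fail. One plausible cause, and this is only an assumption, is that some values are already `unicode` objects: in Python 2, calling `.decode('utf8', 'replace')` on a `unicode` object first encodes it implicitly as ASCII, which raises `UnicodeEncodeError` for non-ASCII text, and the `'replace'` handler does not cover that implicit step. A rough Python 3 sketch of the same idea that checks the type instead of relying on exceptions might look like the following. The helper name `to_text` is hypothetical, and this is not kochief's code:

```python
def to_text(item, encoding="utf-8"):
    """Return item as text; bytes are decoded, bad bytes become U+FFFD."""
    if isinstance(item, bytes):
        return item.decode(encoding, "replace")
    return str(item)


class RowDict(dict):
    """Sketch: join sequences with '|' and return UTF-8 bytes on get."""

    def get(self, key, *args):
        value = dict.get(self, key, *args)
        if not value:
            return b""
        # A single string (text or bytes) is normalized and encoded directly.
        if isinstance(value, (str, bytes)):
            return to_text(value).encode("utf-8")
        # Any other iterable is joined after each non-empty item is normalized.
        return "|".join(to_text(x) for x in value if x).encode("utf-8")


row = RowDict()
row["bob"] = ["Montalb\u00e1n, Ricardo", b"Roddenberry, Gene"]
print(row.get("bob"))
```

Deciding per item whether to decode means a sequence that mixes text and byte strings is handled without a fallback path. Note that Python 3's csv module accepts text directly, so the final encode is kept here only to mirror the behaviour of the original.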
