
Python Convert Html Into Json Using Soup

These are the rules The HTML tags will start with any of the following

,

    or
      The content of the HTML when any of step 1 tags is found will contain on

Solution 1:

I'd use a function to parse each element, not use one huge loop. Select on p and ol tags, and raise an exception in your parsing to flag anything that doesn't match your specific rules:

from bs4 import NavigableString

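def parse(elem):
    # <ol>: one entry per <li>; any other element (here <p>) becomes a 'text' list
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                # several child nodes: keep them as a list
                result.append([parse_text(sub) for sub in li])
            else:
                # a single child node: store it directly
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    return {'text': [parse_text(sub) for sub in elem]}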

def parse_text(elem):
    # plain text node: nothing to decorate
    if isinstance(elem, NavigableString):
        return {'text': elem}

    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':')
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        # any other tag breaks the rules
        raise ValueError('Invalid structure')

    if len(elem) > 1:
        # several child nodes: nest them under 'text'
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        # a single child node: merge its keys into this dictionary
        result.update(parse_text(next(iter(elem))))
    return result
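
The rudimentary style parsing inside parse_text() can be tried on its own. The sketch below lifts that expression into a helper called parse_style (a name made up for this illustration, it is not part of the answer) and shows why the except ValueError clause is there: dict() raises ValueError whenever a declaration does not split into exactly one name and one value.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict.

    parse_style is a hypothetical helper name; the expression itself is
    the one used inside parse_text().
    """
    return dict(
        s.replace(' ', '').split(':')
        for s in style.split(';')
        if s.strip()
    )


# A well-formed style attribute becomes a plain dictionary.
styles = parse_style('text-decoration: underline; color: red')
print(styles)

# A declaration without a value cannot form a key/value pair,
# so dict() raises ValueError.
try:
    parse_style('text-decoration')
except ValueError as exc:
    print('rejected:', exc)
```

Values that themselves contain a colon are rejected the same way, which is stricter than real CSS but matches the intent of flagging anything outside the rules.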

You then parse your document:

for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)

Using pprint, this results in:

{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}

Note that the text nodes are now nested; this lets you consistently re-create the same structure, with correct whitespace and nested text decorations.
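
As a rough illustration of re-creating the structure, the sketch below walks a parsed dictionary and emits HTML again. The function names render and render_block are invented for this example, and the nesting order of the tags (em outside strong outside span) is an assumption: once the flags are merged into one dictionary, the original order is no longer distinguishable from the pprint output.

```python
from html import escape


def render(node):
    """Turn one parsed node (a dict with a 'text' key) back into markup."""
    text = node['text']
    if isinstance(text, list):
        # nested nodes: render each child in order
        inner = ''.join(render(child) for child in text)
    else:
        inner = escape(text)
    # Assumed wrapping order; the parsed dict does not record it.
    if node.get('decoration') == 'underline':
        inner = '<span style="text-decoration: underline">' + inner + '</span>'
    if node.get('bold'):
        inner = '<strong>' + inner + '</strong>'
    if node.get('italics'):
        inner = '<em>' + inner + '</em>'
    return inner


def render_block(block):
    """Render a top-level result of parse(): either an 'ol' or a 'text' block."""
    if 'ol' in block:
        items = []
        for item in block['ol']:
            # an item is a single node or a list of nodes
            parts = item if isinstance(item, list) else [item]
            items.append('<li>' + ''.join(render(p) for p in parts) + '</li>')
        return '<ol>' + ''.join(items) + '</ol>'
    return '<p>' + render(block) + '</p>'


paragraph = {'text': [{'bold': True, 'text': 'moment'},
                      {'italics': True, 'text': 'Africa'},
                      {'text': ' '}]}
ordered = {'ol': [{'text': 'first item'},
                  {'bold': True, 'decoration': 'underline',
                   'italics': True, 'text': 'second item'}]}
print(render_block(paragraph))
print(render_block(ordered))
```

Whitespace-only text nodes such as {'text': ' '} are kept as they are, which is what preserves the spacing between decorated words.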

The structure is also reasonably consistent; a 'text' key will either point at a single string, or at a list of dictionaries. Such a list will never mix types. You could still improve on this: have 'text' only ever point to a string, and use a different key to signify nested data, such as 'contains' or 'nested' or similar, then use just one or the other. All that'd require is changing the 'text' keys in the len(elem) > 1 case and in the parse() function.
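
One way to try that split without editing the parser is to post-process its output. The sketch below is a different route from the one just described: it leaves parse() and parse_text() alone and rewrites the finished dictionaries instead. The function name split_text_and_nested and the key 'nested' are choices made for this example only. It finishes with json.dumps() from the standard library to produce the JSON text.

```python
import json


def split_text_and_nested(node):
    """Return a copy in which 'text' is always a string.

    Lists that used to sit under 'text' move to a 'nested' key
    (an assumed key name, picked for this illustration).
    """
    out = {}
    for key, value in node.items():
        if key == 'text' and isinstance(value, list):
            out['nested'] = [split_text_and_nested(child) for child in value]
        elif key == 'ol':
            # each list item is a single node or a list of nodes
            out['ol'] = [
                [split_text_and_nested(c) for c in item]
                if isinstance(item, list)
                else split_text_and_nested(item)
                for item in value
            ]
        else:
            out[key] = value
    return out


block = {'text': [{'italics': True,
                   'text': [{'text': 'tomorrow'},
                            {'bold': True, 'text': 'USA'}]}]}
converted = split_text_and_nested(block)
print(json.dumps(converted, indent=2))
```

With this shape a consumer can check for 'nested' first and otherwise read 'text' as a plain string, with no type test on the value.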
