
Python Convert Html Into Json Using Soup

These are the rules: the HTML tags will start with any of the following

The content of the HTML when any of step 1 tags is found will contain on

Solution 1:

I'd use a function to parse each element, not use one huge loop. Select on p and ol tags, and raise an exception in your parsing to flag anything that doesn't match your specific rules:

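from bs4 import NavigableString

def parse(elem):
    # An ordered list becomes {'ol': [...]} with one entry per list item.
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                result.append([parse_text(sub) for sub in li])
            else:
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    # Anything else (a p tag) becomes a list of parsed child nodes.
    return {'text': [parse_text(sub) for sub in elem]}

def parse_text(elem):
    # Plain strings are the leaves of the structure.
    if isinstance(elem, NavigableString):
        return {'text': elem}

    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':')
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        raise ValueError('Invalid structure')

    if len(elem) > 1:
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        # A single child is merged into this node, so nested tags collapse.
        result.update(parse_text(next(iter(elem))))
    return result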
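
To see what the "rudimentary parse into a dictionary" does on its own, here is a minimal sketch of that one expression pulled out into a helper. The name parse_style is hypothetical and only used for illustration; the expression itself is the one from the span branch above.

```python
def parse_style(style):
    """Parse an inline style string into a dict (hypothetical helper)."""
    return dict(
        s.replace(' ', '').split(':')
        for s in style.split(';')
        if s.strip()
    )

print(parse_style('text-decoration: underline; color: red'))
# {'text-decoration': 'underline', 'color': 'red'}

try:
    # No colon, so split(':') yields one item instead of a key/value pair
    # and dict() raises ValueError.
    parse_style('underline')
except ValueError:
    print('Invalid structure')
```

A declaration whose value itself contains a colon (a URL, for example) also splits into more than two parts and raises ValueError, which is why the parser treats that exception as an invalid structure.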

You then parse your document:

# soup is assumed to be the BeautifulSoup object built from your HTML
for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)

Using pprint, this results in:

{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}
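
Since the question asks for JSON, note that each result is built from dictionaries, lists, strings and booleans only, so the standard json module should be able to serialize it directly. The sketch below uses a result copied from the output above as a plain dictionary; with real parser output the strings are NavigableString objects, which subclass str, so I'd expect the same call to work, but treat that as an assumption to verify on your data.

```python
import json

# One parsed result, copied from the pprint output above.
result = {'ol': [{'text': 'first item'},
                 {'bold': True,
                  'decoration': 'underline',
                  'italics': True,
                  'text': 'second item'}]}

encoded = json.dumps(result, indent=2)
print(encoded)

# Decoding gives back an equal structure.
decoded = json.loads(encoded)
print(decoded == result)
```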

Note that the text nodes are now nested; this lets you consistently re-create the same structure, with correct whitespace and nested text decorations.
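
As a rough illustration of re-creating markup from the nested structure, here is a minimal sketch. The render function is hypothetical, and the order in which it wraps tags is an assumption: the dictionaries record which decorations apply, not how the original tags were nested.

```python
from html import escape

def render(node):
    """Turn one {'text': ...} dictionary back into an HTML fragment."""
    text = node['text']
    if isinstance(text, list):
        inner = ''.join(render(child) for child in text)
    else:
        inner = escape(text)
    # Wrapping order is an assumption; the dictionary does not record it.
    if node.get('decoration') == 'underline':
        inner = '<span style="text-decoration: underline;">' + inner + '</span>'
    if node.get('bold'):
        inner = '<strong>' + inner + '</strong>'
    if node.get('italics'):
        inner = '<em>' + inner + '</em>'
    return inner

paragraph = {'text': [{'bold': True, 'text': 'moment'},
                      {'italics': True, 'text': 'Africa'},
                      {'text': ' '},
                      {'decoration': 'underline', 'text': 'home'}]}
print('<p>' + render(paragraph) + '</p>')
# <p><strong>moment</strong><em>Africa</em> <span style="text-decoration: underline;">home</span></p>
```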

The structure is also reasonably consistent: a 'text' key points at either a single string or a list of dictionaries, and such a list never mixes types. You could improve on this still: have 'text' only ever point to a string, and use a different key, such as 'contains' or 'nested', to signify nested data, so that a node uses just one or the other. All that would require is changing the 'text' keys in the len(elem) > 1 case and in the parse() function.
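
To show what that alternative shape would look like, here is a small sketch that post-processes an existing result rather than changing the parser itself. The function name split_keys and the choice of 'nested' as the key are hypothetical; 'ol' entries are left untouched here.

```python
def split_keys(node):
    """Return a copy where 'text' is always a string and lists move to 'nested'."""
    converted = {key: value for key, value in node.items() if key != 'text'}
    text = node.get('text')
    if isinstance(text, list):
        converted['nested'] = [split_keys(child) for child in text]
    elif text is not None:
        converted['text'] = text
    return converted

node = {'italics': True,
        'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}
print(split_keys(node))
# {'italics': True, 'nested': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}
```

With this shape a consumer can check for 'nested' first and otherwise read 'text' as a plain string, without inspecting the value's type.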
