Python Convert Html Into Json Using Soup
- The content of the HTML when any of step 1 tags is found will contain on
Solution 1:
I'd use a function to parse each element rather than one huge loop. Select on the p and ol tags, and raise an exception in your parsing to flag anything that doesn't match your specific rules:
from bs4 import NavigableString

def parse(elem):
    # an <ol> becomes a list with one entry per <li>
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                result.append([parse_text(sub) for sub in li])
            else:
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    # anything else (a <p>) becomes a list of parsed children
    return {'text': [parse_text(sub) for sub in elem]}

def parse_text(elem):
    # plain text node: nothing further to parse
    if isinstance(elem, NavigableString):
        return {'text': elem}

    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':')
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        # any other tag breaks the rules
        raise ValueError('Invalid structure')

    if len(elem) > 1:
        # several children: keep them as a nested list
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        # single child: merge its keys into this result
        result.update(parse_text(next(iter(elem))))
    return result
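To see what the "rudimentary parse into a dictionary" step does on its own, here is a minimal sketch of the same expression pulled out into a helper. The name parse_style is hypothetical and not part of the answer's code; the point is only to show that dict() raises ValueError whenever a declaration does not split into exactly one property and one value, which is what the except clause above relies on.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict.

    Hypothetical helper for illustration. Raises ValueError when a
    declaration does not split into exactly two parts on ':'.
    """
    return dict(
        declaration.replace(' ', '').split(':')
        for declaration in style.split(';')
        if declaration.strip()
    )


styles = parse_style('text-decoration: underline; color: red')
print(styles)  # {'text-decoration': 'underline', 'color': 'red'}

try:
    parse_style('text-decoration')  # no colon, so only one part
except ValueError:
    print('Invalid structure')
```

Note that a value containing its own colon, such as a url(...) with a scheme, would also be rejected by this approach, which is why the original comment calls it rudimentary.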
You then parse your document:
for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)
Using pprint, this results in:
{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}
Note that the text nodes are now nested; this lets you consistently re-create the same structure, with correct whitespace and nested text decorations.
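Since the goal is JSON, the dictionaries can be handed to the standard json module. The sketch below uses hand-copied pieces of the output above rather than a live soup, and plain_text is a hypothetical helper added only to show that the nested structure keeps the whitespace intact. With real results the strings are NavigableString objects; that class subclasses str, so I'd expect them to serialize the same way.

```python
import json

# Hand-copied from the output above: the <ol> result and one nested <em>.
ol_result = {'ol': [{'text': 'first item'},
                    {'bold': True, 'decoration': 'underline',
                     'italics': True, 'text': 'second item'}]}
em_result = {'italics': True,
             'text': [{'text': 'his name '},
                      {'decoration': 'underline', 'text': 'moneuet'},
                      {'text': 'please '}]}

# Serialize to JSON and read it back; the structure survives unchanged.
encoded = json.dumps(ol_result, sort_keys=True)
decoded = json.loads(encoded)


def plain_text(node):
    """Join the text of a parse_text()-style node, depth first."""
    text = node['text']
    if isinstance(text, str):
        return text
    return ''.join(plain_text(child) for child in text)


print(decoded == ol_result)
print(repr(plain_text(em_result)))
```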
The structure is also reasonably consistent; a 'text' key will either point at a single string, or a list of dictionaries. Such a list will never mix types. You could improve on this still; have 'text' only point to a string, and use a different key to signify nested data, such as 'contains' or 'nested' or similar, then use just one or the other. All that'd require is changing the 'text' keys in the len(elem) > 1 case and in the parse() function.
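If you'd rather not touch the parser, roughly the same shape can be had by post-processing its output. This is a sketch of that alternative, not the change described above: normalize is a hypothetical name, and 'nested' is just one of the key names suggested. It moves every list found under 'text' to 'nested', so 'text' only ever holds a string.

```python
def normalize(node):
    """Return a copy where 'text' is always a string.

    Lists that the parser stored under 'text' are moved to 'nested'.
    Hypothetical post-processing step for illustration.
    """
    if isinstance(node, list):
        return [normalize(child) for child in node]
    out = {}
    for key, value in node.items():
        if key == 'text' and isinstance(value, list):
            out['nested'] = normalize(value)
        elif key == 'ol':
            out['ol'] = normalize(value)
        else:
            out[key] = value
    return out


# Hand-copied from the output above.
sample = {'italics': True,
          'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]}
print(normalize(sample))
# {'italics': True, 'nested': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]}
```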