Unable To Retrieve Data From Macro Trends Using Selenium And Read_html To Create A Data Frame?
I want to import data from macro trends into a pandas data frame. From the page source of the website, it appears that the data is in a jqxgrid. I have tried using pandas/bea
Solution 1:
Here's an alternative that's quicker than Selenium and keeps the headers as shown on the page. It reads the grid's data straight from the originalData variable embedded in the page source.
import requests
from bs4 import BeautifulSoup as bs
import re
import json
import pandas as pd
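# Fetch the page; the grid's data is embedded in the HTML as a JavaScript variable
r = requests.get('https://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q')
# Capture the JSON array assigned to originalData
p = re.compile(r' var originalData = (.*?);\r\n\r\n\r',re.DOTALL)
data = json.loads(p.findall(r.text)[0])
headers = list(data[0].keys())
headers.remove('popup_icon')  # drop the icon column, which holds no data
result = []
for row in data:
    # field_name holds HTML; keep only the visible label
    soup = bs(row['field_name'], 'html.parser')
    field_name = soup.select_one('a, span').text
    fields = list(row.values())[2:]  # skip field_name and popup_icon
    fields.insert(0, field_name)
    result.append(fields)
df = pd.DataFrame(result, columns = headers)
# option_context only takes effect inside a with block
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.head())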
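To try the extraction step without hitting the site, here is a minimal sketch that runs the same regex-plus-JSON idea against a made-up page fragment. The sample data, the helper names strip_tags and extract_rows, and the slightly looser regex (which accepts either line ending after the semicolon) are assumptions for illustration and are not taken from the answer above. It uses only the standard library, with html.parser standing in for BeautifulSoup.

```python
import json
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the visible text of a small HTML fragment."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


def extract_rows(page_source):
    """Pull the originalData array out of the page and flatten it to rows."""
    # Assumption: the array literal ends with "];" followed by a line break.
    match = re.search(r"var originalData = (\[.*?\]);[ \t]*\r?\n", page_source, re.DOTALL)
    if match is None:
        raise ValueError("originalData not found; the page layout may have changed")
    data = json.loads(match.group(1))
    # Keep every key except the icon column, in the order the page lists them.
    headers = [key for key in data[0] if key != "popup_icon"]
    rows = [
        [strip_tags(row["field_name"])] + [row[key] for key in headers[1:]]
        for row in data
    ]
    return headers, rows


# Hypothetical stand-in for the real page source (values are illustrative).
sample_data = [
    {"field_name": "<a href='/x'>Revenue</a>", "popup_icon": "<div></div>",
     "2008-03-31": "4135", "2007-12-31": "5672"},
    {"field_name": "<span>Gross Profit</span>", "popup_icon": "<div></div>",
     "2008-03-31": "956", "2007-12-31": "1171"},
]
sample_page = "<script>\r\n var originalData = " + json.dumps(sample_data) + ";\r\n\r\n\r\n</script>"

headers, rows = extract_rows(sample_page)
print(headers)
print(rows)
```

The resulting headers and rows can be passed to pd.DataFrame(rows, columns=headers), as in the code above.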
Solution 2:
The problem is that the data is not in a table element but in 'div' elements, so read_html has no table to parse. I'm not an expert on pandas, but you can do it with BeautifulSoup.
Insert this line after your other imports:
from bs4 import BeautifulSoup
then replace your last line with:
soup = BeautifulSoup(grid.get_attribute('outerHTML'), "html.parser")
divList = soup.findAll('div', {'role': 'row'})
data = [[x.text for x in div.findChildren('div', recursive=False)] for div in divList]
df = pd.DataFrame(data)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
This finds all the 'div' elements whose role attribute is 'row'. For each of them, it reads the text of the 'div' elements directly beneath it, descending only one level because some cells contain further nested 'div' elements.
Output:
                                    0  1       2       3       4       5  \
0                             Revenue     $4,135  $5,672  $3,262  $2,886
1                  Cost Of Goods Sold     $3,179  $4,501  $2,500  $2,185
2                        Gross Profit       $956  $1,171    $762    $701
3   Research And Development Expenses       $234    $222    $209    $201
4                       SG&A Expenses       $518    $675    $427    $381
5  Other Operating Income Or Expenses        $-6     $-3     $-3     $-3
...
        6       7       8       9      10      11      12      13      14
0  $3,015  $3,986  $2,307  $2,139  $2,279  $2,977  $1,858  $1,753  $1,902
1  $2,296  $3,135  $1,758  $1,630  $1,732  $2,309  $1,395  $1,303  $1,444
2    $719    $851    $549    $509    $547    $668    $463    $450    $458
3    $186    $177    $172    $167    $146    $132    $121    $106     $92
4    $388    $476    $335    $292    $292    $367    $247    $238    $257
5       -     $-2     $-2     $-3     $-3     $-4    $-40     $-2     $-1
...
However, as you scroll across the grid, the columns on the left are removed from the page source, so not all the data is scraped in one pass.
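One possible way around this, offered as a sketch rather than something from the original answer, is to scrape the grid after each scroll step and merge the partial tables by row label. The function name merge_chunks, the 'Item' header, and the assumption that the row-label column stays visible in every snapshot are all hypothetical; the figures are the ones shown in the output above.

```python
def merge_chunks(chunks):
    """Merge (headers, rows) snapshots taken at different scroll positions.

    Assumes headers[0] names the row-label column and row[0] is the label.
    Columns seen in more than one snapshot keep the first value encountered.
    """
    all_headers = []
    table = {}  # row label -> {column header -> cell text}; dicts keep insertion order
    for chunk_headers, chunk_rows in chunks:
        for header in chunk_headers:
            if header not in all_headers:
                all_headers.append(header)
        for row in chunk_rows:
            cells = table.setdefault(row[0], {})
            for header, value in zip(chunk_headers, row):
                cells.setdefault(header, value)
    # Rows missing a column (never scrolled into view) get an empty string.
    all_rows = [[cells.get(h, "") for h in all_headers] for cells in table.values()]
    return all_headers, all_rows


# Two overlapping snapshots, as if taken before and after one scroll step.
before_scroll = (
    ["Item", "2008-03-31", "2007-12-31"],
    [["Revenue", "$4,135", "$5,672"], ["Gross Profit", "$956", "$1,171"]],
)
after_scroll = (
    ["Item", "2007-12-31", "2007-09-30"],
    [["Revenue", "$5,672", "$3,262"], ["Gross Profit", "$1,171", "$762"]],
)

merged_headers, merged_rows = merge_chunks([before_scroll, after_scroll])
print(merged_headers)
print(merged_rows)
```

In a Selenium loop, each snapshot would come from parsing grid.get_attribute('outerHTML') after a drag, and the merged result could then go into pd.DataFrame(merged_rows, columns=merged_headers).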
Updated in response to comment. To set the column headers use:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.maximize_window()
driver.execute_script(
    "window.location = 'http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q';")
driver.implicitly_wait(2)
# Selenium 4 removed find_element_by_id; find_element(By.ID, ...) replaces it
grid = driver.find_element(By.ID, 'wrapperjqxgrid')
time.sleep(1)
driver.execute_script("window.scrollBy(0, 600);")
scrollbar = driver.find_element(By.ID, 'jqxScrollThumbhorizontalScrollBarjqxgrid')
time.sleep(1)
actions = ActionChains(driver)
time.sleep(1)
# Drag the horizontal scrollbar thumb to the right in five steps
for i in range(1, 6):
    actions.drag_and_drop_by_offset(scrollbar, i * 70, 0).perform()
    time.sleep(1)
soup = BeautifulSoup(grid.get_attribute('outerHTML'), "html.parser")
headersList = soup.findAll('div', {'role': 'columnheader'})
col_names=[h.text for h in headersList]
divList = soup.findAll('div', {'role': 'row'})
data = [[x.text for x in div.findChildren('div', recursive=False)] for div in divList]
df = pd.DataFrame(data, columns=col_names)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
Outputs:
   Quarterly Data | Millions of US $ except per share data  2008-03-31  \
0                                                  Revenue      $4,135
1                                       Cost Of Goods Sold      $3,179
2                                             Gross Profit        $956
...
   2007-12-31  2007-09-30  2007-06-30  2007-03-31  2006-12-31  2006-09-30  \
0      $5,672      $3,262      $2,886      $3,015      $3,986      $2,307
1      $4,501      $2,500      $2,185      $2,296      $3,135      $1,758
...
   2006-06-30  2006-03-31  2005-12-31  2005-09-30  2005-06-30  2005-03-31
0      $2,139      $2,279      $2,977      $1,858      $1,753      $1,902
1      $1,630      $1,732      $2,309      $1,395      $1,303      $1,444
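The row-parsing step used in Solution 2 can be checked without a browser. The sketch below mimics the one-level descent (rows marked role="row", cells taken only from their direct child 'div' elements) with the standard library's html.parser instead of BeautifulSoup. The class name GridRowParser and the sample markup are made up for illustration and only approximate what the grid renders.

```python
from html.parser import HTMLParser


class GridRowParser(HTMLParser):
    """Builds one list of cell texts per <div role="row">.

    A cell is a <div> that is a direct child of the row; text nested deeper
    inside a cell is folded into that cell's text.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self.depth = 0         # current <div> nesting depth
        self.row_depth = None  # depth of the currently open row div, if any

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.depth += 1
        if self.row_depth is None:
            if dict(attrs).get("role") == "row":
                self.row_depth = self.depth
                self.rows.append([])
        elif self.depth == self.row_depth + 1:
            self.rows[-1].append("")  # start a new direct-child cell

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.depth == self.row_depth:
            self.row_depth = None  # the row div itself just closed
        self.depth -= 1

    def handle_data(self, data):
        # Only text inside a cell (below the row div) is collected.
        if self.row_depth is not None and self.depth > self.row_depth:
            self.rows[-1][-1] += data


# Hypothetical markup: the first cell of each row wraps its label in extra tags.
SAMPLE_GRID = (
    '<div id="wrapperjqxgrid">'
    '<div role="row"><div><div><a>Revenue</a></div></div><div>$4,135</div><div>$5,672</div></div>'
    '<div role="row"><div><span>Gross Profit</span></div><div>$956</div><div>$1,171</div></div>'
    '</div>'
)

parser = GridRowParser()
parser.feed(SAMPLE_GRID)
parser.close()
grid_rows = parser.rows
print(grid_rows)
```

Each inner list corresponds to one row of the data frame, so the nested 'div' around the first label does not produce an extra cell.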