How do I remove extra data (HTML tags) from my output?

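I scraped a list of stocks and appended the items to a Python list, but because of the way my bs4 query is written, extra HTML elements get appended as well.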

Here is my reproducible code:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

url = 'https://bullishbears.com/russell-2000-stocks-list/'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(url, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

divTag = soup.find_all("div", {"class": "thrv_wrapper thrv_text_element"})

stock_list = []
for tag in divTag:
    strongTags = tag.find_all("strong")
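    for strong in strongTags:  # renamed so it no longer shadows the outer `tag`
        # Iterating a Tag yields every child: plain strings and nested tags alike
        for x in strong:
            stock_list.append(x)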

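Looking at the resulting list, I'm happy with the format of the stock entries: a list of strings, one ticker per item. As you can see, though, I also get other elements, such as <br/> and <span> tags, that I want to remove.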

stock_list =

[<span data-css="tve-u-17078d9d4a6">RUSSELL 2000 STOCKS LIST</span>,
 <strong><strong><strong><span data-css="tve-u-17031e9c4ac"> We provide you a list of Russell 2000 stocks and companies below</span><span data-css="tve-u-17031e9c4ad">. </span></strong></strong></strong>,
 <strong><strong><span data-css="tve-u-17031e9c4ac"> We provide you a list of Russell 2000 stocks and companies below</span><span data-css="tve-u-17031e9c4ad">. </span></strong></strong>,
 <strong><span data-css="tve-u-17031e9c4ac"> We provide you a list of Russell 2000 stocks and companies below</span><span data-css="tve-u-17031e9c4ad">. </span></strong>,
 <span data-css="tve-u-17031e9c4ac"> We provide you a list of Russell 2000 stocks and companies below</span>,
 <span data-css="tve-u-17031e9c4ad">. </span>,
 'List of Russell 2000 Stocks & Updated Chart',
 'IWM',
 <br/>,
 'SPSM',
 <br/>,
 'VTWO',
 '/RTY',
 <br/>,
 '/M2K',
 'AAN',
 <br/>,
 'AAOI',
 <br/>,
 'AAON',
 <br/>,
 'AAT',
 <br/>,
 'AAWW',
 <br/>,
 'AAXN',
 <br/>,
 'ABCB',
 <br/>,
 'ABEO',
 <br/>,
 'ABG',
 <br/>,
 'ABM',
 <br/>,
 'ABTX',
 <br/>,
 'AC',
 <br/>,
 'ACA',
 <br/>,
 'ACAD',
 <br/>,
 'ACBI',
 <br/>,
 'ACCO',
# More to the list but for brevity I removed the rest.

How can I adjust my bs4 query so that it returns only the list of stocks?
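
One possible direction, as a minimal sketch rather than a definitive fix: keep only the direct string children of each <strong> tag (bs4 `NavigableString` objects) and skip nested tags such as <br/> and <span>. The `SAMPLE_HTML` snippet, the `extract_tickers` helper, and the `TICKER` pattern below are all hypothetical. The snippet only imitates the structure visible in the output above, and the regular expression is an assumed ticker shape used to drop heading text such as 'List of Russell 2000 Stocks & Updated Chart'.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup imitating the structure seen in the output above;
# the real page is not fetched here.
SAMPLE_HTML = """
<div class="thrv_wrapper thrv_text_element">
  <p><strong><span data-css="tve-u-1">RUSSELL 2000 STOCKS LIST</span></strong></p>
  <p><strong>List of Russell 2000 Stocks &amp; Updated Chart</strong></p>
  <p><strong>IWM<br/>SPSM<br/>VTWO</strong></p>
  <p><strong>/RTY<br/>AAN<br/>AAOI</strong></p>
</div>
"""

# Assumed ticker shape: optional leading slash, then 1-6 capitals, digits or dots.
TICKER = re.compile(r"/?[A-Z0-9.]{1,6}")


def extract_tickers(html):
    """Return ticker-like strings found directly inside <strong> tags."""
    soup = BeautifulSoup(html, "html.parser")
    tickers = []
    for div in soup.find_all("div", {"class": "thrv_wrapper thrv_text_element"}):
        for strong in div.find_all("strong"):
            for child in strong.children:
                # Skip nested tags (<br/>, <span>, inner <strong>); keep text nodes only.
                if not isinstance(child, NavigableString):
                    continue
                text = child.strip()
                # Drop empty strings and anything that does not look like a ticker.
                if TICKER.fullmatch(text):
                    tickers.append(text)
    return tickers


print(extract_tickers(SAMPLE_HTML))
```

Because only direct text children are collected, nested <strong> wrappers do not produce duplicates. If the ticker pattern turns out to be too strict or too loose for the real page, the `TICKER.fullmatch(text)` check can be replaced with a plain `if text:` test and the heading rows removed by hand.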
