BeautifulSoup / Selenium Deep dive

6th May, 2020
SAKURA Internet Research Center.
Senior Researcher / Naoto MATSUMOTO

Published in: Technology

  1. BeautifulSoup/Selenium Deep dive. 6th May, 2020. SAKURA Internet, Inc. Research Center, SR / Naoto MATSUMOTO. (C) Copyright 1996-2020 SAKURA Internet Inc
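  2. BeautifulSoup sample code

     Setup (run as root). The slide saves the script as bs4.py, but a script with that name shadows the bs4 package when it is run and the import fails, so it is saved here as bs_grep.py. requests is imported by the script and is installed explicitly as well:

         # apt install python3-pip
         # pip3 install BeautifulSoup4
         # pip3 install lxml
         # pip3 install requests
         # vi bs_grep.py

     Script:

         #!/usr/bin/env python3
         # Print every visible text line of a page that matches a regular expression.
         import re
         import sys

         import requests
         from bs4 import BeautifulSoup as bs4

         url = sys.argv[1]   # page to fetch
         key = sys.argv[2]   # regular expression to search for

         html = requests.get(url)
         soup = bs4(html.content, 'lxml', from_encoding='utf-8')

         # Drop <script> and <style> elements so only visible text remains.
         for script in soup(["script", "style"]):
             script.decompose()

         text = soup.get_text()
         regex = re.compile(key)
         for line in text.splitlines():
             if line:
                 match = regex.search(line)
                 if match:
                     print("%s, %s" % (url, line))

     Run:

         # python3 bs_grep.py http://www.sakura.ad.jp VPS
         http://www.sakura.ad.jp, さくらのVPS

     SOURCE: SAKURA Internet Research Center (2020/05)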
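  3. Selenium sample code

     Setup (run as root):

         # apt install python-pip curl -y
         # apt install -y unzip xvfb libxi6 libgconf-2-4 -y
         # apt install chromium-chromedriver -y
         # pip install selenium
         # pip install pyvirtualdisplay
         # vi se.py

     Script. The slide calls webdriver.Chrome(chrome_options=...) and find_elements_by_class_name(), which recent Selenium releases no longer provide; the options= keyword and By locators are used here instead:

         # encoding: utf-8
         # Print the result snippets of a site-restricted Google search.
         import sys
         import time

         from selenium import webdriver
         from selenium.webdriver.chrome.options import Options
         from selenium.webdriver.common.by import By

         options = Options()
         options.add_argument('--headless')
         options.add_argument('disable-infobars')
         options.add_argument('--no-sandbox')
         driver = webdriver.Chrome(options=options)

         # sys.argv[1] is the site to restrict the search to, sys.argv[2] the keyword.
         word = "site:" + sys.argv[1] + " " + sys.argv[2]
         url = "https://www.google.com/search?q={}&safe=off".format(word)
         driver.get(url)
         time.sleep(1)   # give the page a moment to render

         # The class names "g", "s" and "st" are the ones given on the slide (2020/05).
         for g in driver.find_elements(By.CLASS_NAME, "g"):
             s = g.find_element(By.CLASS_NAME, "s")
             x = s.find_element(By.CLASS_NAME, "st").text
             print(x)

         driver.quit()

     SOURCE: SAKURA Internet Research Center (2020/05)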
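
The filtering step of slide 2 (drop script and style content, then keep the text lines that match a regular expression) can be tried without network access or third-party packages. The sketch below is not the slide's code: it swaps BeautifulSoup for the standard library's html.parser, and the names VisibleTextParser, grep_visible_text and SAMPLE_HTML are hypothetical, introduced only for illustration. Unlike the slide's script, it also strips surrounding whitespace from each line.

```python
import re
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def grep_visible_text(html, pattern):
    """Return the stripped, non-empty visible lines of html that match pattern."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    regex = re.compile(pattern)
    text = "".join(parser.chunks)
    return [line.strip() for line in text.splitlines()
            if line.strip() and regex.search(line)]


# Made-up sample page, for illustration only (not fetched from any site).
SAMPLE_HTML = """<html><head><style>p { color: red; }</style>
<script>var VPS = 1;</script></head>
<body><h1>さくらのVPS</h1>
<p>Cloud hosting</p></body></html>"""

print(grep_visible_text(SAMPLE_HTML, "VPS"))
```

The `VPS` inside the script element is ignored, so only the heading line is reported. For real pages, BeautifulSoup with lxml as on the slide is more tolerant of malformed markup than this minimal parser.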
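
The Selenium script on slide 3 places the search words into the URL with str.format, so the space and colon in the query are sent unencoded. A minimal sketch of building the same URL with URL encoding follows, assuming the standard library's urllib.parse.urlencode is acceptable here; the helper name build_site_search_url is hypothetical and does not appear on the slide.

```python
from urllib.parse import urlencode


def build_site_search_url(site, keyword):
    """Build a Google search URL restricted to `site`, with the query URL-encoded."""
    query = "site:{} {}".format(site, keyword)
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query, "safe": "off"})


print(build_site_search_url("www.sakura.ad.jp", "VPS"))
```

The returned string could be passed to driver.get() in place of the hand-formatted URL, which matters mainly when the keyword contains characters such as `&` or `#`.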
