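I tried crawling for the first time last year, and now, a year later, I'm doing web crawling again...
Here's a write-up of the issues I ran into while writing the code.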

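1. Installing WebDriver

The Chrome browser and the Chrome WebDriver versions really have to match.
Check the version in Chrome's settings, copy it as-is, and search for the WebDriver with that version; it will turn up.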
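A quick way to double-check the two versions before launching anything is to compare the version strings. This is only a minimal sketch: `build_prefix` and `versions_match` are hypothetical helper names, the version strings are made-up examples typed in by hand, and it assumes that agreeing on the first three parts of the version (major.minor.build) is enough, which is how I understand ChromeDriver's versioning.

```python
def build_prefix(version: str) -> tuple:
    """Return the first three numeric parts of a version like '108.0.5359.71'."""
    parts = version.strip().split(".")
    return tuple(int(part) for part in parts[:3])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and chromedriver agree on major.minor.build."""
    return build_prefix(chrome_version) == build_prefix(driver_version)


# Made-up version strings, as if copied from Chrome's settings page
# and from the chromedriver download page.
print(versions_match("108.0.5359.124", "108.0.5359.71"))  # True
print(versions_match("108.0.5359.124", "107.0.5304.62"))  # False
```

If this prints False, the driver most likely needs to be downloaded again for the installed Chrome.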

2. WebDriver options

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# webdriver options
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")
option.add_experimental_option("useAutomationExtension", False)
option.add_experimental_option("excludeSwitches", ['enable-automation'])

# connect Chrome webdriver (Selenium 4 style: the driver path goes through Service)
# The chromedriver version must match the installed Chrome version.
driver = webdriver.Chrome(service=Service('/home/kueyeon/chromedriver'), options=option)

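Option 1. Launches the browser maximized.*
This one turned out to be pretty important.
Options 2 and 3. As far as I remember, these kept the warning bar(?) at the top, the one saying Chrome is being controlled by automated test software, from showing up at launch. I don't remember the details.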

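3. Initial HTML tag access

I was so sad that I couldn't reach the tags right from the start,
but it turned out the web page I wanted to crawl was built with iframes.
So what is that?

iframe

A way of embedding another HTML document inside an HTML document.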

from selenium.webdriver.common.by import By

# move into the frame
driver.switch_to.frame(driver.find_element(By.XPATH, '/html/frameset/frame'))
driver.implicitly_wait(5)

You have to switch to the frame first, and only then access the HTML tags you want.
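To see why the tags are out of reach before switching, it helps to look at what the outer document actually contains. The sketch below uses Python's built-in html.parser instead of Selenium, purely as an illustration; `FrameFinder` is a hypothetical helper and the HTML is made up. The outer page holds only a pointer to another document, so the content you want simply is not there.

```python
from html.parser import HTMLParser


class FrameFinder(HTMLParser):
    """Collects the src of every <frame> or <iframe> tag in a document."""

    def __init__(self):
        super().__init__()
        self.frame_sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag names in lowercase
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.frame_sources.append(src)


# Made-up outer page: no content of its own, only a reference to another document.
outer_html = """
<html>
  <frameset>
    <frame src="/main.html" name="main">
  </frameset>
</html>
"""

finder = FrameFinder()
finder.feed(outer_html)
print(finder.frame_sources)  # ['/main.html']
```

The real tags live in the document behind that src, which is why the driver has to be switched into the frame before looking them up.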

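4. What Selenium and BeautifulSoup4 are each for

I was just confused.
I grabbed code from here and there and stitched it together, but I wasn't clear on what each one was for.

Selenium

Controls the web driver
-> page navigation, clicks, and other dynamic stuff? That's roughly how I understand it.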

# move
driver.get(url+page_endpoint+str(i))
driver.implicitly_wait(5)
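The snippet above sits inside a loop over page numbers. Here is a rough sketch of that loop under some assumptions: `build_page_urls` and `visit_pages` are hypothetical helper names, and the site address and endpoint are placeholders since the real ones are not given here. One thing worth knowing is that implicitly_wait sets a lookup timeout for the whole driver session rather than pausing, so the sketch calls it once up front.

```python
def build_page_urls(url, page_endpoint, last_page):
    """Return the URL of every list page from 1 to last_page."""
    return [url + page_endpoint + str(i) for i in range(1, last_page + 1)]


def visit_pages(driver, page_urls, wait_seconds=5):
    """Open each page in turn with a Selenium-style driver."""
    # implicitly_wait is a session-wide setting, so one call is enough
    driver.implicitly_wait(wait_seconds)
    visited = []
    for page_url in page_urls:
        driver.get(page_url)
        visited.append(page_url)
    return visited


# Placeholder values, not the site from this post.
urls = build_page_urls("https://example.com", "/list?page=", 3)
print(urls)
```

With a real driver this would be called as visit_pages(driver, urls), with the parsing of each page done inside the loop.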

BeautifulSoup

A parsing package that extracts data from HTML
-> lets you scan through a whole page and parse it? That's roughly how I understand it.

from bs4 import BeautifulSoup

# parse the current page's HTML so its tags can be selected
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
imgs = soup.select('div.list img')  # select elements

Pull the page source out of the driver and parse it with bs4, and then each tag becomes accessible.
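Once the img tags are selected, what is actually needed for downloading is each tag's src, and that value can be a relative path. A small sketch of turning them into full URLs, assuming a hypothetical helper `image_urls` and a made-up base URL; plain dicts stand in for the bs4 tags here, since both support .get("src").

```python
from urllib.parse import urljoin


def image_urls(imgs, base_url):
    """Turn the src of each selected img tag into an absolute URL."""
    result = []
    for img in imgs:
        src = img.get("src")  # works for bs4 tags and plain dicts alike
        if src:
            result.append(urljoin(base_url, src))
    return result


# Plain dicts standing in for the tags returned by soup.select('div.list img').
sample_imgs = [
    {"src": "/images/a.jpg"},
    {"src": "https://cdn.example.com/b.png"},
    {"alt": "no src here"},
]
print(image_urls(sample_imgs, "https://example.com/list?page=1"))
# ['https://example.com/images/a.jpg', 'https://cdn.example.com/b.png']
```

Tags without a src are skipped, and a src that is already absolute passes through unchanged.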

5. Accessing a hover tag

The "element not interactable" error in Selenium gave me a really hard time with this one...

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
...
option.add_argument("--start-maximized")
...
actions = ActionChains(driver)
# hover over the menu item first so that its submenu becomes visible
hover_button = driver.find_element(By.XPATH, f'//*[@id="css_gnb_frame"]/div[1]/ul/li[{x[1]}]')
actions.move_to_element(hover_button)

# then click the item that only shows up while hovering
kind_button = driver.find_element(By.XPATH, f'//*[@id="css_gnb_frame"]/div[{x[1]}]/ul/li[{y+1}]')
actions.click(kind_button)
actions.perform()

ActionChains

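As the name suggests, it lets you chain Selenium actions together and run them in order.
I chained moving to the hover tag with clicking the tag that becomes visible through that hover.
Calling perform() at the end runs the chained actions.
A neat and fun feature.

That said, I think what actually fixed it was option.add_argument("--start-maximized").
It worked the moment I added it.. heh.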

6. Saving images

import os
import sys
import urllib.error
import urllib.request

# 'src' is the image path on the web page
req = urllib.request.Request(src, headers={'User-Agent': 'Mozilla/5.0'})
try:
    imgUrl = urllib.request.urlopen(req).read()  # fetch the image bytes from the web page
    if not os.path.exists(path):
        os.makedirs(path)
    with open(saveUrl, "wb") as f:  # open the destination file
        f.write(imgUrl)  # save file
except urllib.error.HTTPError:
    print('error')
    sys.exit(1)  # non-zero exit code to signal the failure

Fetch the image from its path on the web page through urllib,
create the directory if the directory path doesn't exist yet,
and write the image file.
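The snippet above leaves out where path and saveUrl come from. One way it could be wrapped up is sketched below, assuming the file name is simply the last part of the image URL; `save_image` is a hypothetical helper, not the code I actually ran, and the example call at the bottom uses a placeholder address.

```python
import os
import urllib.request
from urllib.parse import urlparse


def save_image(src, path):
    """Download src into the directory path and return where it was saved."""
    # Assumption: the last part of the URL path is a usable file name.
    file_name = os.path.basename(urlparse(src).path)
    save_url = os.path.join(path, file_name)

    req = urllib.request.Request(src, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as response:
        data = response.read()  # raw image bytes

    os.makedirs(path, exist_ok=True)  # create the directory only if it is missing
    with open(save_url, "wb") as f:
        f.write(data)
    return save_url


# Example call (placeholder URL, needs network access):
# save_image("https://example.com/images/a.jpg", "./images")
```

Using exist_ok=True does the same job as checking os.path.exists first, in a single call.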

References

https://kimcoder.tistory.com/259

Attribution, NonCommercial, NoDerivatives

isPowerfulBlog

[Python] Web Crawling: Selenium, BeautifulSoup4
