While some websites are straightforward to scrape with just Selenium, Puppeteer, and similar tools, others that implement advanced security measures such as CAPTCHAs and IP bans can be difficult. To overcome these challenges and make sure you can scrape 99% of websites for free with the scraper you will build in this article, you will integrate a scraping browser, Bright Data's Scraping Browser, into your code.
However, collecting data is just one step; what you do with that data is equally, if not more, important. Often, this means painstakingly sifting through large volumes of information by hand. But what if you could automate that process? By leveraging a large language model (LLM), you can not only collect data but also query it to extract meaningful insights, saving time and effort.
In this guide, you will learn how to combine web scraping with AI to build a powerful tool for collecting and analyzing data at scale for free. Let's dive in!
Before you begin, make sure you have the following:
To follow along with this tutorial, complete the following steps:
Follow these steps to set up your environment and get ready to build the AI-powered scraper.
First, set up a virtual environment to manage your project's dependencies. This ensures you have an isolated space for all the required packages.
Create a new project directory:
Open your terminal (or Command Prompt/PowerShell on Windows) and create a new directory for your project:
mkdir ai-website-scraper
cd ai-website-scraper
Create the virtual environment:
Use the following command to create a virtual environment:
On Windows:
python -m venv venv
On macOS/Linux:
python3 -m venv venv
This creates a venv directory that will store the virtual environment.
Activate the virtual environment to start working inside it:
On Windows:
.\venv\Scripts\activate
On macOS/Linux:
source venv/bin/activate
Your terminal prompt will change to show (venv), confirming that you are now inside the virtual environment.
Now, install the libraries the project needs. Create a requirements.txt file in your project directory and add the following dependencies:
streamlit
selenium
beautifulsoup4
langchain
langchain-ollama
lxml
html5lib
These packages are essential for scraping, processing the data, and building the UI:
streamlit: used to build the interactive user interface.
selenium: to scrape the website content.
beautifulsoup4: to parse and clean the HTML.
langchain and langchain-ollama: to integrate with the Ollama LLM and process text.
lxml and html5lib: for advanced HTML parsing.
Install the dependencies by running the following command:
(Make sure you are in the directory where the file is located before running the command.)
pip install -r requirements.txt
Create a file named ui.py in your project directory. This script will define the UI of your scraper. Use the code below to structure your application:
import streamlit as st
import pathlib
from main import scrape_website

# Function to load CSS from the assets folder
def load_css(file_path):
    with open(file_path) as f:
        st.html(f"<style>{f.read()}</style>")

# Load the external CSS
css_path = pathlib.Path("assets/style.css")
if css_path.exists():
    load_css(css_path)

st.title("AI Scraper")
st.markdown(
    "Enter a website URL to scrape, clean the text content, and display the result in smaller chunks."
)

url = st.text_input(label="", placeholder="Enter the URL of the website you want to scrape")

if st.button("Scrape", key="scrape_button"):
    st.write("scraping the website...")
    result = scrape_website(url)
    st.write("Scraping complete.")
    st.write(result)
You can learn more about Streamlit components in their documentation.
To style your application, create an assets folder in your project directory and add a style.css file. Customize the Streamlit interface with CSS:
.stAppViewContainer {
    background-image: url("https://images.unsplash.com/photo-1732979887702-40baea1c1ff6?q=80&w=2832&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D");
    background-size: cover;
    color: black;
}

.stAppHeader {
    background-color: rgba(0, 0, 0, 0);
}

.st-ae {
    background-color: rgba(233, 235, 234, 0.895);
}

.st-emotion-cache-ysk9xe {
    color: black;
}

.st.info, .stAlert {
    background-color: black;
}

.st-key-scrape_button button {
    display: inline-block;
    padding: 10px 20px;
    font-size: 16px;
    color: #fff;
    background-color: #007bff;
    border: none;
    border-radius: 5px;
    cursor: pointer;
    animation: pulse 2s infinite;
}

.st-key-scrape_button button:hover {
    background-color: #0056b3;
    color: #fff;
}
From your project directory, run the following command:
streamlit run ui.py
This will launch a local server, and you should see a URL in the terminal, usually http://localhost:8501. Open this URL in your browser to interact with the web application.
Next, write the code to extract the HTML content of any web page using Selenium. For the code to work, however, you need a Chrome WebDriver.
Selenium requires a WebDriver to interact with web pages. Here is how to set it up:
After downloading ChromeDriver, extract the archive, copy the application file named "chromedriver", and paste it into your project folder.
Once that is done, create a new file called main.py and implement the code below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Function to scrape HTML from a website
def scrape_website(website_url):
    # Path to WebDriver
    webdriver_path = "./chromedriver"  # Replace with your WebDriver path
    service = Service(webdriver_path)

    driver = webdriver.Chrome(service=service)

    try:
        # Open the website
        driver.get(website_url)

        # Wait for the page to fully load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        # Extract the HTML source
        html_content = driver.page_source
        return html_content
    finally:
        # Ensure the browser is closed after scraping
        driver.quit()
Save and run the code; you should see the full HTML of the page you scraped displayed in your Streamlit application, like this:
While you can now retrieve a website's HTML, the code above will not work for sites with more advanced anti-scraping mechanisms such as CAPTCHA challenges or IP bans. For example, scraping a site like Indeed or Amazon with Selenium may trigger a CAPTCHA page that blocks access. This happens because the website detects that a bot is trying to access its content. If this behavior continues, the site may eventually ban your IP address, preventing further access.
To fix this, integrate Bright Data's Scraping Browser into your script.
Sign up: go to Bright Data's website and create an account.
After logging in, click on "Get Proxy Products".
Click the "Add" button and select "Scraping Browser".
Next, you will be taken to the "Add zone" page, where you will be asked to choose a name for your new scraping browser proxy zone. After that, click "Add".
After this, your proxy zone credentials will be created. You will need these details in your script to bypass the anti-scraping mechanisms used on any website.
You can also check Bright Data's developer documentation for more details about the Scraping Browser.
In your main.py file, replace the code with the following. You will notice that this code is cleaner and shorter than the previous version.
from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

AUTH = '<username>:<password>'
SBR_WEBDRIVER = f'https://{AUTH}@brd.superproxy.io:9515'

# Function to scrape HTML from a website
def scrape_website(website_url):
    print("Connecting to Scraping Browser...")
    sbr_connection = ChromiumRemoteConnection(SBR_WEBDRIVER, "goog", "chrome")

    with Remote(sbr_connection, options=ChromeOptions()) as driver:
        driver.get(website_url)

        print("Waiting captcha to solve...")
        solve_res = driver.execute(
            "executeCdpCommand",
            {
                "cmd": "Captcha.waitForSolve",
                "params": {"detectTimeout": 10000},
            },
        )
        print("Captcha solve status:", solve_res["value"]["status"])

        print("Navigated! Scraping page content...")
        html = driver.page_source
        return html
Replace <username> and <password> with your scraping browser username and password.
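Hardcoding credentials is fine for a quick test, but you may prefer to keep them out of the source file. Here is a minimal sketch that reads them from environment variables instead; the variable names BRIGHTDATA_USERNAME and BRIGHTDATA_PASSWORD are assumptions, not part of the original code:

import os

# Hypothetical environment variable names; export them in your shell before running
AUTH = f"{os.environ['BRIGHTDATA_USERNAME']}:{os.environ['BRIGHTDATA_PASSWORD']}"
SBR_WEBDRIVER = f"https://{AUTH}@brd.superproxy.io:9515"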
After you scrape a website's HTML content, it is often filled with unnecessary elements such as JavaScript, CSS styles, or unwanted tags that do not contribute to the core information you are extracting. To make the data more structured and useful for further processing, you need to clean the DOM content by removing irrelevant elements and organizing the text.
This section explains how to clean the HTML content, extract meaningful text, and split it into smaller chunks for downstream processing. This cleaning step is essential for preparing the data for tasks like natural language processing or content analysis.
Here is the code to add to main.py to handle cleaning the DOM content:
from bs4 import BeautifulSoup

# Extract the body content from the HTML
def extract_body_content(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    body_content = soup.body
    if body_content:
        return str(body_content)
    return ""

# Clean the body content by removing scripts, styles, and other unwanted elements
def clean_body_content(body_content):
    soup = BeautifulSoup(body_content, "html.parser")

    # Remove <script> and <style> tags
    for script_or_style in soup(["script", "style"]):
        script_or_style.extract()

    # Extract cleaned text with each line separated by a newline
    cleaned_content = soup.get_text(separator="\n")
    cleaned_content = "\n".join(
        line.strip() for line in cleaned_content.splitlines() if line.strip()
    )
    return cleaned_content

# Split the cleaned content into smaller chunks for processing
def split_dom_content(dom_content, max_length=5000):
    return [
        dom_content[i : i + max_length]
        for i in range(0, len(dom_content), max_length)
    ]
What the code does: extract_body_content pulls the <body> element out of the HTML, clean_body_content strips out <script> and <style> tags and collapses the remaining text into clean, newline-separated lines, and split_dom_content splits that text into chunks of at most 5,000 characters so it can be processed in batches.
Save your changes and test the application. After scraping a website, you should get output like this:
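If you want to verify the cleaning helpers on their own before wiring them into the UI, a quick standalone check like the one below works; the sample HTML string is made up purely for illustration:

# Quick sanity check for the cleaning helpers in main.py
from main import extract_body_content, clean_body_content, split_dom_content

sample_html = (
    "<html><body><h1>Hello</h1>"
    "<script>var tracking = true;</script>"
    "<p>World</p></body></html>"
)

body = extract_body_content(sample_html)   # keeps only the <body> element
cleaned = clean_body_content(body)         # drops the script, leaving "Hello\nWorld"
chunks = split_dom_content(cleaned)        # a single chunk, since the text is short

print(cleaned)
print(f"{len(chunks)} chunk(s)")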
Once the DOM content is cleaned and prepared, the next step is to parse the information and extract specific details using an LLM, Ollama.
If you don't already have Ollama, download and install it from the official website, or install it with Homebrew on macOS:
brew install ollama
Next, pull any model from the Ollama library:
ollama pull phi3
After installation, you can call that model from your script using LangChain to generate meaningful insights from the data you send to it.
Here is how to set up the functionality that parses the DOM content with the phi3 model.
The following code implements the logic to parse the DOM chunks with Ollama and extract the relevant details. Save it in a new file named llm.py, since the ui.py code further down imports parse_with_ollama from it:
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

# Template to instruct Ollama for parsing
template = (
    "You are tasked with extracting specific information from the following text content: {dom_content}. "
    "Please follow these instructions carefully: \n\n"
    "1. **Extract Information:** Only extract the information that directly matches the provided description: {parse_description}. "
    "2. **No Extra Content:** Do not include any additional text, comments, or explanations in your response. "
    "3. **Empty Response:** If no information matches the description, return an empty string ('')."
    "4. **Direct Data Only:** Your output should contain only the data that is explicitly requested, with no other text."
)

# Initialize the Ollama model
model = OllamaLLM(model="phi3")

# Function to parse DOM chunks with Ollama
def parse_with_ollama(dom_chunks, parse_description):
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model

    parsed_results = []

    for i, chunk in enumerate(dom_chunks, start=1):
        if not chunk.strip():  # Skip empty chunks
            print(f"Skipping empty chunk at batch {i}")
            continue

        try:
            print(f"Processing chunk {i}: {chunk[:100]}...")  # Print a preview
            print(f"Parse description: {parse_description}")

            response = chain.invoke(
                {
                    "dom_content": chunk,
                    "parse_description": parse_description,
                }
            )
            print(f"Response for batch {i}: {response}")
            parsed_results.append(response)
        except Exception as e:
            print(f"Error parsing chunk {i}: {repr(e)}")
            parsed_results.append(f"Error: {repr(e)}")

    return "\n".join(parsed_results)
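To try the parser in isolation, you could call it with a couple of hand-written chunks. This snippet assumes the code above is saved as llm.py, that the Ollama server is running locally, and that the phi3 model has been pulled; the sample chunks and the description are made up for illustration:

# Quick manual test of parse_with_ollama
from llm import parse_with_ollama

chunks = [
    "Product: Widget A - $19.99. In stock.",
    "Product: Widget B - $24.99. Out of stock.",
]

result = parse_with_ollama(chunks, "Extract all product names and their prices")
print(result)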
Add the following code to the ui.py file to let users enter parsing instructions for the LLM and view the results:
from main import scrape_website, extract_body_content, clean_body_content, split_dom_content
from llm import parse_with_ollama

if "dom_content" in st.session_state:
    parse_description = st.text_area(
        "Enter a description to extract specific insights from your scraped data:"
    )

    if st.button("Parse Content", key="parse_button"):
        if parse_description.strip() and st.session_state.get("dom_content"):
            st.info("Parsing the content...")
            dom_chunks = split_dom_content(st.session_state.dom_content)
            parsed_result = parse_with_ollama(dom_chunks, parse_description)
            st.text_area("Parsed Results", parsed_result, height=300)
        else:
            st.error("Please provide valid DOM content and a description to parse.")
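Note that this block only runs once dom_content exists in st.session_state, while the "Scrape" handler shown earlier only displays the raw HTML. For the two pieces to work together, the scrape handler needs to clean the HTML and store it in session state. Here is a minimal sketch of how that earlier handler in ui.py could be rewritten; the status messages and the preview text area are assumptions, not from the original code:

# Revised "Scrape" handler: clean the HTML and keep it in session state
if st.button("Scrape", key="scrape_button"):
    st.write("Scraping the website...")

    html_content = scrape_website(url)
    body_content = extract_body_content(html_content)
    cleaned_content = clean_body_content(body_content)

    # Store the cleaned text so the parsing section can access it
    st.session_state.dom_content = cleaned_content

    st.write("Scraping complete.")
    st.text_area("Cleaned DOM Content", cleaned_content, height=300)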
With this done, the scraper can now answer your prompts based on the scraped data.
Combining web scraping with AI opens up exciting possibilities for data-driven insights. Beyond collecting and storing data, you can now use AI to streamline the process of extracting insights from scraped data. This is useful for marketing and sales teams, data analysts, business owners, and many others.
You can find the complete code for the AI scraper here. Feel free to experiment with it and adapt it to your unique needs. Contributions are also welcome: if you have ideas for improvements, consider opening a pull request!
You could take this even further. Here are a few ideas: