In this tutorial, we walk you through building an enhanced web scraping tool that leverages BrightData's powerful proxy network alongside Google's Gemini API for intelligent data extraction. You'll learn how to structure your Python project, install and import the necessary libraries, and encapsulate scraping logic within a clean, reusable BrightDataScraper class. Whether you're targeting Amazon product pages, bestseller listings, or LinkedIn profiles, the scraper's modular methods demonstrate how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration also shows you how to combine LLM-driven reasoning with real-time scraping, empowering you to pose natural language queries for on-the-fly data analysis.
!pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai
We install all of the key libraries needed for the tutorial in one step: langchain-brightdata for BrightData web scraping, langchain-google-genai and google-generativeai for Google Gemini integration, langgraph for agent orchestration, and langchain-core for the core LangChain framework.
import os
import json
from typing import Dict, Any, Optional
from langchain_brightdata import BrightDataWebScraperAPI
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent
These imports prepare your environment and core functionality: os and json handle system operations and data serialization, while typing provides structured type hints. You then bring in BrightDataWebScraperAPI for BrightData scraping, ChatGoogleGenerativeAI to interface with Google's Gemini LLM, and create_react_agent to orchestrate these components in a ReAct-style agent.
class BrightDataScraper:
    """Enhanced web scraper using BrightData API"""

    def __init__(self, api_key: str, google_api_key: Optional[str] = None):
        """Initialize scraper with API keys"""
        self.api_key = api_key
        self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)

        if google_api_key:
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-2.0-flash",
                google_api_key=google_api_key
            )
            self.agent = create_react_agent(self.llm, [self.scraper])

    def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
        """Scrape Amazon product data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product",
                "zipcode": zipcode
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
        """Scrape Amazon bestsellers"""
        try:
            url = f"https://www.amazon.{region}/gp/bestsellers/"
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
        """Scrape LinkedIn profile data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "linkedin_person_profile"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def run_agent_query(self, query: str) -> None:
        """Run AI agent with natural language query"""
        if not hasattr(self, 'agent'):
            print("Error: Google API key required for agent functionality")
            return

        try:
            for step in self.agent.stream(
                {"messages": query},
                stream_mode="values"
            ):
                step["messages"][-1].pretty_print()
        except Exception as e:
            print(f"Agent error: {e}")

    def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
        """Pretty print results"""
        print(f"\n{'='*50}")
        print(f"{title}")
        print(f"{'='*50}")

        if results["success"]:
            print(json.dumps(results["data"], indent=2, ensure_ascii=False))
        else:
            print(f"Error: {results['error']}")
        print()
The BrightDataScraper class encapsulates all BrightData web-scraping logic and optional Gemini-powered intelligence under a single, reusable interface. Its methods let you easily fetch Amazon product details, bestseller lists, and LinkedIn profiles, handling API calls, error handling, and JSON formatting, and even stream natural-language "agent" queries when a Google API key is provided. A convenient print_results helper ensures your output is always cleanly formatted for inspection.
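Note that every scraping method returns a {"success": ..., "data"/"error": ...} dictionary rather than raising, so callers can branch on the result without a try/except of their own. Here is a minimal consumption sketch; the API key and URL below are placeholders, not working credentials:

# Sketch of consuming the scraper's result contract;
# the key and URL are placeholder values.
scraper = BrightDataScraper(api_key="your-brightdata-api-key")
result = scraper.scrape_amazon_product("https://www.amazon.com/dp/B08L5TNJHG")

if result["success"]:
    data = result["data"]  # structured JSON payload returned by BrightData
    print(type(data))
else:
    print("Scrape failed:", result["error"])  # error message, no exception raised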
def main():
    """Main execution function"""
    BRIGHT_DATA_API_KEY = "Use Your Own API Key"
    GOOGLE_API_KEY = "Use Your Own API Key"

    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)

    print("🛍️ Scraping Amazon India Bestsellers...")
    bestsellers = scraper.scrape_amazon_bestsellers("in")
    scraper.print_results(bestsellers, "Amazon India Bestsellers")

    print("📦 Scraping Amazon Product...")
    product_url = "https://www.amazon.com/dp/B08L5TNJHG"
    product_data = scraper.scrape_amazon_product(product_url, "10001")
    scraper.print_results(product_data, "Amazon Product Data")

    print("👤 Scraping LinkedIn Profile...")
    linkedin_url = "https://www.linkedin.com/in/satyanadella/"
    linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
    scraper.print_results(linkedin_data, "LinkedIn Profile Data")

    print("🤖 Running AI Agent Query...")
    agent_query = """
    Scrape Amazon product data for https://www.amazon.com/dp/B0D2Q9397Y?th=1
    in New York (zipcode 10001) and summarize the key product details.
    """
    scraper.run_agent_query(agent_query)
The main() function ties everything together by setting your BrightData and Google API keys, instantiating the BrightDataScraper, and then demonstrating each feature: it scrapes Amazon India's bestsellers, fetches details for a specific product, retrieves a LinkedIn profile, and finally runs a natural-language agent query, printing neatly formatted results after each step.
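If you'd rather not hardcode credentials, a common variation (not part of the original script) is to read the keys from environment variables and fail fast when they're missing:

import os

# Alternative key handling -- a sketch, assuming you export
# BRIGHT_DATA_API_KEY and GOOGLE_API_KEY before running the script.
bright_key = os.environ.get("BRIGHT_DATA_API_KEY")
google_key = os.environ.get("GOOGLE_API_KEY")  # optional: enables agent features

if not bright_key:
    raise SystemExit("Set BRIGHT_DATA_API_KEY before running the scraper")

scraper = BrightDataScraper(bright_key, google_key)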
if __name__ == "__main__":
    print("Installing required packages...")
    os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")

    os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"

    main()
Finally, this entry-point block ensures that, when run as a standalone script, the required scraping libraries are quietly installed and the BrightData API key is set in the environment. The main function is then executed to kick off all scraping and agent workflows.
In conclusion, by the end of this tutorial, you'll have a ready-to-use Python script that automates tedious data collection tasks, abstracts away low-level API details, and optionally taps into generative AI for advanced query handling. You can extend this foundation by adding support for other dataset types (see the sketch below), integrating additional LLMs, or deploying the scraper as part of a larger data pipeline or web service. With these building blocks in place, you're now equipped to gather, analyze, and present web data more efficiently, whether for market research, competitive intelligence, or custom AI-driven applications.
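For instance, supporting additional dataset types could be as simple as adding a generic passthrough method to the class. The sketch below follows the same invoke/result pattern as the existing methods; it assumes the dataset_type you pass is one the BrightData API actually supports:

    def scrape_dataset(self, url: str, dataset_type: str, **params) -> Dict[str, Any]:
        """Generic helper for any supported BrightData dataset_type.

        A sketch to add inside BrightDataScraper; the caller must pass a
        dataset_type that BrightData actually supports, e.g. "amazon_product".
        """
        try:
            results = self.scraper.invoke({"url": url, "dataset_type": dataset_type, **params})
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}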
Check out the Notebook. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.