
Web Scraping Tutorial with Puppeteer and Playwright

Monday, 29 Dec 2025

Ever needed data from a website that doesn't have an API? Or needed to monitor product prices across several e-commerce sites? Web scraping is the solution.

Puppeteer and Playwright are the two most popular libraries for browser automation and web scraping in Node.js. Both can drive a real browser, render JavaScript, and extract data from modern, dynamic websites.

Web Scraping Use Cases

Before writing any code, understand when web scraping is useful:

Use Case             Example
Price Monitoring     Track product prices on Tokopedia, Shopee
Lead Generation      Collect business data from online directories
Content Aggregation  Gather news from multiple sources
Research & Analysis  Data for market research, sentiment analysis
Testing & QA         E2E testing, visual regression testing
Archiving            Back up website content, historical screenshots

Important! Before scraping, make sure you understand the legal aspects:

  1. Check the Terms of Service - many websites prohibit scraping in their ToS
  2. Respect robots.txt - this file indicates which pages may be crawled
  3. Rate limiting - don't bombard the server with excessive requests
  4. Personal data - be careful with GDPR and other data-privacy regulations
  5. Copyright - scraped content may be protected by copyright
# Check a website's robots.txt
curl https://example.com/robots.txt
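
If you want to run the same check from code, here is a minimal sketch that fetches robots.txt and tests a path against the wildcard (User-agent: *) Disallow rules. It assumes Node 18+ (global fetch) and is deliberately naive, not a full spec-compliant parser:

// filepath: src/utils/check-robots.ts
// Naive robots.txt check - only honors the "User-agent: *" group
export async function isPathDisallowed(origin: string, path: string): Promise<boolean> {
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return false; // no robots.txt means nothing is explicitly disallowed

  let appliesToAll = false;
  for (const raw of (await res.text()).split('\n')) {
    const line = raw.trim();
    if (line.toLowerCase().startsWith('user-agent:')) {
      appliesToAll = line.split(':')[1].trim() === '*';
    } else if (appliesToAll && line.toLowerCase().startsWith('disallow:')) {
      const rule = line.slice('disallow:'.length).trim();
      if (rule && path.startsWith(rule)) return true;
    }
  }
  return false;
}

// Usage: await isPathDisallowed('https://example.com', '/admin')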

As a general rule:

  • Scraping public data for personal use or research is usually fine
  • Scraping for commercial use or re-publishing content can be problematic
  • Always ask for permission when in doubt

Puppeteer vs Playwright: Which One Should You Pick?

Feature               Puppeteer                                     Playwright
Browser Support       Chrome/Chromium (Firefox via WebDriver BiDi)  Chrome, Firefox, Safari (WebKit)
Developer             Google                                        Microsoft
Auto-wait             Manual                                        Built-in smart waiting
Parallel Execution    Basic                                         Isolated browser contexts
Mobile Emulation      Yes                                           Yes, with richer device profiles
Network Interception  Yes                                           Yes, and more powerful
Debugging Tools       DevTools                                      Inspector, Trace Viewer, Codegen
API Style             Promise-based                                 Promise-based, locator-centric

Recommendation:

  • Puppeteer - if you only need Chrome and are already familiar with it
  • Playwright - for cross-browser support, a richer feature set, and new projects

Project Setup

Install Dependencies

mkdir web-scraper && cd web-scraper
npm init -y
npm install puppeteer playwright
npm install typescript ts-node @types/node -D
npx tsc --init

Project Structure

web-scraper/
├── src/
│   ├── scrapers/
│   │   ├── puppeteer-scraper.ts
│   │   └── playwright-scraper.ts
│   ├── utils/
│   │   ├── browser.ts
│   │   └── helpers.ts
│   └── index.ts
├── data/
│   └── output/
├── package.json
└── tsconfig.json

TypeScript Configuration

// filepath: tsconfig.json
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "lib": ["ES2020"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}

Basic Navigation & Selectors

Puppeteer Basic Example

// filepath: src/scrapers/puppeteer-basic.ts
import puppeteer from 'puppeteer';

async function basicScraping() {
  // Launch browser
  const browser = await puppeteer.launch({
    headless: true, // false to watch the browser
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  const page = await browser.newPage();

  // Set viewport
  await page.setViewport({ width: 1280, height: 800 });

  // Navigate to the page
  await page.goto('https://quotes.toscrape.com', {
    waitUntil: 'networkidle2', // wait until the network is idle
  });

  // Get page title
  const title = await page.title();
  console.log('Page Title:', title);

  // Get text content with a selector
  const firstQuote = await page.$eval('.quote .text', (el) => el.textContent);
  console.log('First Quote:', firstQuote);

  // Get multiple elements
  const quotes = await page.$$eval('.quote', (elements) =>
    elements.map((el) => ({
      text: el.querySelector('.text')?.textContent,
      author: el.querySelector('.author')?.textContent,
      tags: Array.from(el.querySelectorAll('.tag')).map((tag) => tag.textContent),
    }))
  );
  console.log('All Quotes:', quotes);

  await browser.close();
}

basicScraping();

Playwright Basic Example

// filepath: src/scrapers/playwright-basic.ts
import { chromium } from 'playwright';

async function basicScraping() {
  // Launch browser
  const browser = await chromium.launch({
    headless: true,
  });

  // Create context (isolated session)
  const context = await browser.newContext({
    viewport: { width: 1280, height: 800 },
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  });

  const page = await context.newPage();

  // Navigate - Playwright auto-waits for the page to load
  await page.goto('https://quotes.toscrape.com');

  // Get page title
  const title = await page.title();
  console.log('Page Title:', title);

  // Get text with a locator (the recommended way)
  const firstQuote = await page.locator('.quote .text').first().textContent();
  console.log('First Quote:', firstQuote);

  // Get all quotes with locators
  const quoteLocators = page.locator('.quote');
  const count = await quoteLocators.count();

  const quotes = [];
  for (let i = 0; i < count; i++) {
    const quote = quoteLocators.nth(i);
    quotes.push({
      text: await quote.locator('.text').textContent(),
      author: await quote.locator('.author').textContent(),
      tags: await quote.locator('.tag').allTextContents(),
    });
  }
  console.log('All Quotes:', quotes);

  await browser.close();
}

basicScraping();

Selector Strategies

Choosing the right selector is the key to robust scraping:

// Playwright selectors - more flexible
const page = await context.newPage();
await page.goto('https://example.com');

// CSS Selector
await page.locator('div.product-card').click();

// Text selector
await page.locator('text=Add to Cart').click();

// Combining selectors
await page.locator('article:has-text("Featured")').click();

// XPath (when CSS isn't enough)
await page.locator('xpath=//div[@data-testid="product"]').click();

// Role selector (accessibility-based)
await page.locator('role=button[name="Submit"]').click();

// Data attributes (the most stable for scraping)
await page.locator('[data-product-id="123"]').click();

Tips for choosing selectors:

  1. Prefer data attributes - [data-testid="x"] is the most stable
  2. Avoid auto-generated classes - .css-1a2b3c can change between builds
  3. Combine selectors - .product-card h2 is more specific
  4. Test with the browser DevTools - paste the selector into the Console, as shown below
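
For the last tip, open the target page, press F12, and try the selector directly in the Console before hardcoding it in your scraper:

// Run in the browser DevTools Console
document.querySelectorAll('.product-card h2').length; // how many elements match?
document.querySelector('.quote .text')?.textContent; // what does the first match contain?

// Chrome DevTools also provides $$ as an array-returning shorthand
$$('.quote .tag').map((el) => el.textContent);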

Extracting Data

Extract Table Data

// filepath: src/scrapers/extract-table.ts
import { chromium } from 'playwright';

interface ProductData {
  name: string;
  price: string;
  stock: string;
  rating: string;
}

async function extractTableData(): Promise<ProductData[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://webscraper.io/test-sites/e-commerce/allinone');

  // Wait for the product cards to be visible
  await page.waitForSelector('.thumbnail');

  // Extract data from the product cards
  const products = await page.evaluate(() => {
    const items: ProductData[] = [];
    const cards = document.querySelectorAll('.thumbnail');

    cards.forEach((card) => {
      items.push({
        name: card.querySelector('.title')?.getAttribute('title') || '',
        price: card.querySelector('.price')?.textContent?.trim() || '',
        stock: card.querySelector('.pull-right')?.textContent?.trim() || '',
        rating: card.querySelectorAll('.glyphicon-star').length.toString(),
      });
    });

    return items;
  });

  console.log(`Extracted ${products.length} products`);
  console.table(products);

  await browser.close();
  return products;
}

extractTableData();

Extract with Pagination

// filepath: src/scrapers/extract-with-pagination.ts
import { chromium, Page } from 'playwright';

interface Quote {
  text: string;
  author: string;
  tags: string[];
}

async function extractQuotesFromPage(page: Page): Promise<Quote[]> {
  return await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.quote')).map((el) => ({
      text: el.querySelector('.text')?.textContent || '',
      author: el.querySelector('.author')?.textContent || '',
      tags: Array.from(el.querySelectorAll('.tag')).map(
        (tag) => tag.textContent || ''
      ),
    }));
  });
}

async function scrapeAllPages() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  let allQuotes: Quote[] = [];
  let currentPage = 1;

  await page.goto('https://quotes.toscrape.com');

  while (true) {
    console.log(`Scraping page ${currentPage}...`);

    // Extract quotes from the current page
    const quotes = await extractQuotesFromPage(page);
    allQuotes = [...allQuotes, ...quotes];

    // Check whether a next button exists
    const nextButton = page.locator('.next > a');
    const hasNext = (await nextButton.count()) > 0;

    if (!hasNext) {
      console.log('No more pages');
      break;
    }

    // Click next and wait for navigation
    await nextButton.click();
    await page.waitForLoadState('networkidle');

    currentPage++;

    // Rate limiting - don't go too fast
    await page.waitForTimeout(1000);
  }

  console.log(`Total quotes extracted: ${allQuotes.length}`);
  await browser.close();

  return allQuotes;
}

scrapeAllPages();

Handling Dynamic Content

Modern websites often load content with JavaScript. Here is how to deal with that:

Wait for Elements

// filepath: src/scrapers/dynamic-content.ts
import { chromium } from 'playwright';

async function handleDynamicContent() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/dynamic-page');

  // Wait for specific element
  await page.waitForSelector('.dynamic-content', {
    state: 'visible',
    timeout: 10000,
  });

  // Wait for an element containing specific text
  await page.waitForSelector('text=Data loaded');

  // Wait for network idle (all requests finished)
  await page.waitForLoadState('networkidle');

  // Wait for function condition
  await page.waitForFunction(() => {
    const items = document.querySelectorAll('.list-item');
    return items.length > 5;
  });

  // Custom wait with polling
  await page.waitForFunction(
    () => {
      const loadingIndicator = document.querySelector('.loading');
      return loadingIndicator === null;
    },
    { polling: 500, timeout: 30000 }
  );

  const data = await page.locator('.dynamic-content').textContent();
  console.log('Dynamic content:', data);

  await browser.close();
}

handleDynamicContent();

Infinite Scroll

// filepath: src/scrapers/infinite-scroll.ts
import { chromium } from 'playwright';

async function handleInfiniteScroll() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/infinite-scroll');

  // Scroll until the target item count or the scroll limit is reached
  const targetItems = 50;
  const maxScrolls = 10;
  let scrollCount = 0;

  while (scrollCount < maxScrolls) {
    // Count current items
    const itemCount = await page.locator('.item').count();
    console.log(`Items loaded: ${itemCount}`);

    if (itemCount >= targetItems) {
      console.log('Target reached!');
      break;
    }

    // Scroll to bottom
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });

    // Wait for new content
    await page.waitForTimeout(2000);

    // Check if we've reached the end
    const newItemCount = await page.locator('.item').count();
    if (newItemCount === itemCount) {
      console.log('No new items loaded - end of content');
      break;
    }

    scrollCount++;
  }

  // Extract all items
  const items = await page.locator('.item').allTextContents();
  console.log(`Total items: ${items.length}`);

  await browser.close();
}

handleInfiniteScroll();

Click to Load More

import { chromium } from 'playwright';

async function handleLoadMore() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/load-more');

  while (true) {
    // Check whether the "Load More" button exists
    const loadMoreButton = page.locator('button:has-text("Load More")');
    const isVisible = await loadMoreButton.isVisible();

    if (!isVisible) {
      console.log('No more items to load');
      break;
    }

    // Click and wait for loading
    await loadMoreButton.click();

    // Wait for the loading indicator to disappear
    await page.waitForSelector('.loading', { state: 'hidden' });

    // Small courtesy delay
    await page.waitForTimeout(500);
  }

  await browser.close();
}

Screenshots and PDFs

Capture Screenshots

// filepath: src/scrapers/screenshots.ts
import { chromium } from 'playwright';
import path from 'path';

async function captureScreenshots() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  // Full page screenshot
  await page.screenshot({
    path: path.join(__dirname, '../../data/output/fullpage.png'),
    fullPage: true,
  });

  // Viewport screenshot only
  await page.screenshot({
    path: path.join(__dirname, '../../data/output/viewport.png'),
  });

  // Element screenshot
  const header = page.locator('header');
  await header.screenshot({
    path: path.join(__dirname, '../../data/output/header.png'),
  });

  // Screenshot with options
  await page.screenshot({
    path: path.join(__dirname, '../../data/output/quality.jpeg'),
    type: 'jpeg',
    quality: 80,
    animations: 'disabled', // stop animations
  });

  // Mobile screenshot
  await page.setViewportSize({ width: 375, height: 667 });
  await page.screenshot({
    path: path.join(__dirname, '../../data/output/mobile.png'),
  });

  await browser.close();
}

captureScreenshots();

Generate PDF

// filepath: src/scrapers/generate-pdf.ts
import { chromium } from 'playwright';
import path from 'path';

async function generatePDF() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/article', {
    waitUntil: 'networkidle',
  });

  // Generate the PDF (note: page.pdf only works in headless Chromium)
  await page.pdf({
    path: path.join(__dirname, '../../data/output/article.pdf'),
    format: 'A4',
    printBackground: true,
    margin: {
      top: '20mm',
      bottom: '20mm',
      left: '15mm',
      right: '15mm',
    },
  });

  // PDF with a custom header/footer
  await page.pdf({
    path: path.join(__dirname, '../../data/output/with-header.pdf'),
    format: 'A4',
    displayHeaderFooter: true,
    headerTemplate: `
      <div style="font-size: 10px; text-align: center; width: 100%;">
        <span>Company Name</span>
      </div>
    `,
    footerTemplate: `
      <div style="font-size: 10px; text-align: center; width: 100%;">
        <span class="pageNumber"></span> / <span class="totalPages"></span>
      </div>
    `,
    margin: {
      top: '40mm',
      bottom: '40mm',
    },
  });

  await browser.close();
}

generatePDF();

Handling Anti-Bot Detection

Modern websites have many ways to detect scrapers. Here are some techniques to work around them:

Basic Evasion

// filepath: src/utils/stealth-browser.ts
import { chromium } from 'playwright';

interface StealthBrowserOptions {
  headless?: boolean;
  proxy?: string;
}

export async function createStealthBrowser(options: StealthBrowserOptions = {}) {
  const browser = await chromium.launch({
    headless: options.headless ?? true,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--disable-features=IsolateOrigins,site-per-process',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--no-first-run',
      '--no-zygote',
      '--disable-gpu',
    ],
  });

  const contextOptions: any = {
    viewport: { width: 1920, height: 1080 },
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    locale: 'en-US',
    timezoneId: 'America/New_York',
    permissions: ['geolocation'],
    geolocation: { latitude: 40.7128, longitude: -74.006 },
    colorScheme: 'light',
  };

  if (options.proxy) {
    contextOptions.proxy = { server: options.proxy };
  }

  const context = await browser.newContext(contextOptions);

  // Inject anti-detection scripts
  await context.addInitScript(() => {
    // Override navigator.webdriver
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined,
    });

    // Override navigator.plugins
    Object.defineProperty(navigator, 'plugins', {
      get: () => [1, 2, 3, 4, 5],
    });

    // Override navigator.languages
    Object.defineProperty(navigator, 'languages', {
      get: () => ['en-US', 'en'],
    });

    // Override chrome runtime
    (window as any).chrome = {
      runtime: {},
    };

    // Override the Permissions API (bind to avoid "Illegal invocation")
    const originalQuery = window.navigator.permissions.query.bind(
      window.navigator.permissions
    );
    window.navigator.permissions.query = (parameters: any) =>
      parameters.name === 'notifications'
        ? Promise.resolve({ state: Notification.permission } as PermissionStatus)
        : originalQuery(parameters);
  });

  return { browser, context };
}

User Agent Rotation

// filepath: src/utils/user-agents.ts
export const userAgents = [
  // Chrome Windows
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // Chrome Mac
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // Firefox Windows
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
  // Firefox Mac
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 14.1; rv:121.0) Gecko/20100101 Firefox/121.0',
  // Safari Mac
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
  // Edge
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
];

export function getRandomUserAgent(): string {
  return userAgents[Math.floor(Math.random() * userAgents.length)];
}
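
To apply it, pass a rotated user agent when creating each context. A small sketch (getRandomUserAgent comes from the file above):

import { chromium } from 'playwright';
import { getRandomUserAgent } from './user-agents';

async function newSessionWithRandomUA() {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    userAgent: getRandomUserAgent(), // a different UA for each session
  });
  return { browser, context };
}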

Human-like Behavior

// filepath: src/utils/human-behavior.ts
import { Page } from 'playwright';

export async function randomDelay(min: number = 500, max: number = 2000) {
  const delay = Math.floor(Math.random() * (max - min + 1)) + min;
  await new Promise((resolve) => setTimeout(resolve, delay));
}

export async function humanType(page: Page, selector: string, text: string) {
  await page.click(selector);
  for (const char of text) {
    await page.keyboard.type(char, { delay: Math.random() * 100 + 50 });
  }
}

export async function humanScroll(page: Page) {
  const scrollHeight = await page.evaluate(() => document.body.scrollHeight);
  let currentPosition = 0;
  const viewportHeight = await page.evaluate(() => window.innerHeight);

  while (currentPosition < scrollHeight) {
    // Random scroll amount
    const scrollAmount = Math.floor(Math.random() * 300) + 100;
    currentPosition += scrollAmount;

    await page.evaluate((y) => window.scrollTo(0, y), currentPosition);
    await randomDelay(200, 800);
  }
}

export async function randomMouseMovement(page: Page) {
  const viewport = page.viewportSize();
  if (!viewport) return;

  for (let i = 0; i < 3; i++) {
    const x = Math.floor(Math.random() * viewport.width);
    const y = Math.floor(Math.random() * viewport.height);
    await page.mouse.move(x, y, { steps: 10 });
    await randomDelay(100, 300);
  }
}
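
A sketch of how these helpers can be combined; the URL and the input selector are placeholders:

import { chromium } from 'playwright';
import { humanType, humanScroll, randomDelay, randomMouseMovement } from './human-behavior';

async function browseLikeAHuman() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/search');
  await randomMouseMovement(page);
  await humanType(page, 'input[name="q"]', 'mechanical keyboards'); // placeholder selector
  await page.keyboard.press('Enter');
  await randomDelay(1000, 3000);
  await humanScroll(page);

  await browser.close();
}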

Proxy Rotation

// filepath: src/utils/proxy-rotation.ts
import { chromium } from 'playwright';

interface ProxyConfig {
  server: string;
  username?: string;
  password?: string;
}

const proxies: ProxyConfig[] = [
  { server: 'http://proxy1.example.com:8080' },
  { server: 'http://proxy2.example.com:8080' },
  { server: 'http://proxy3.example.com:8080', username: 'user', password: 'pass' },
];

export function getRandomProxy(): ProxyConfig {
  return proxies[Math.floor(Math.random() * proxies.length)];
}

export async function createBrowserWithProxy() {
  const proxy = getRandomProxy();

  const browser = await chromium.launch();
  const context = await browser.newContext({
    proxy: {
      server: proxy.server,
      username: proxy.username,
      password: proxy.password,
    },
  });

  return { browser, context };
}
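
Usage follows the same pattern as the other helpers. If every request should go through a single proxy, you can also pass proxy directly to chromium.launch() instead of setting it per context:

import { createBrowserWithProxy } from './proxy-rotation';

async function scrapeThroughProxy() {
  const { browser, context } = await createBrowserWithProxy();
  const page = await context.newPage();
  await page.goto('https://example.com');
  // ...extract data as usual...
  await browser.close();
}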

Running Headless

Headless mode runs the browser without a GUI - faster and more resource-efficient on servers.

// filepath: src/scrapers/headless-example.ts
import { chromium } from 'playwright';

async function runHeadless() {
  // Headless mode (default)
  const browser = await chromium.launch({
    headless: true,
  });

  // Or, for debugging, use headed mode
  // const browser = await chromium.launch({ headless: false });

  // New headless mode (closer to a real browser)
  const browserNew = await chromium.launch({
    headless: true,
    channel: 'chrome', // Use installed Chrome
  });

  const page = await browser.newPage();

  // For CI/CD environments
  const browserCI = await chromium.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-gpu',
      '--single-process', // For containers
    ],
  });

  await browser.close();
  await browserNew.close();
  await browserCI.close();
}

Docker Configuration

# filepath: Dockerfile
FROM node:20-slim

# Install dependencies for Playwright
RUN apt-get update && apt-get install -y \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libdbus-1-3 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libasound2 \
    libpango-1.0-0 \
    libcairo2 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY package*.json ./
RUN npm ci

# Install Playwright browsers
RUN npx playwright install chromium

COPY . .

CMD ["npm", "start"]

Storing Data

Save to JSON

// filepath: src/utils/storage.ts
import fs from 'fs/promises';
import path from 'path';

interface StorageOptions {
  outputDir?: string;
}

export class DataStorage {
  private outputDir: string;

  constructor(options: StorageOptions = {}) {
    this.outputDir = options.outputDir || path.join(process.cwd(), 'data/output');
  }

  async ensureDir() {
    await fs.mkdir(this.outputDir, { recursive: true });
  }

  async saveJSON(filename: string, data: any) {
    await this.ensureDir();
    const filepath = path.join(this.outputDir, filename);
    await fs.writeFile(filepath, JSON.stringify(data, null, 2));
    console.log(`Data saved to ${filepath}`);
  }

  async appendJSON(filename: string, newData: any[]) {
    await this.ensureDir();
    const filepath = path.join(this.outputDir, filename);

    let existingData: any[] = [];
    try {
      const content = await fs.readFile(filepath, 'utf-8');
      existingData = JSON.parse(content);
    } catch {
      // File doesn't exist yet
    }

    const combinedData = [...existingData, ...newData];
    await fs.writeFile(filepath, JSON.stringify(combinedData, null, 2));
  }
}

Save to CSV

// filepath: src/utils/csv-storage.ts
import fs from 'fs/promises';
import path from 'path';

export async function saveToCSV(
  filename: string,
  data: Record<string, any>[],
  outputDir: string = 'data/output'
) {
  if (data.length === 0) return;

  await fs.mkdir(outputDir, { recursive: true });

  // Get headers from first object
  const headers = Object.keys(data[0]);
  const headerRow = headers.join(',');

  // Convert data to CSV rows
  const rows = data.map((item) => {
    return headers
      .map((header) => {
        let value = item[header];

        // Escape per RFC 4180: double any embedded quotes, and quote the
        // field if it contains a comma, quote, or newline
        if (typeof value === 'string') {
          const needsQuoting = /[",\n]/.test(value);
          value = value.replace(/"/g, '""');
          if (needsQuoting) {
            value = `"${value}"`;
          }
        }

        return value ?? '';
      })
      .join(',');
  });

  const csvContent = [headerRow, ...rows].join('\n');
  const filepath = path.join(outputDir, filename);

  await fs.writeFile(filepath, csvContent);
  console.log(`CSV saved to ${filepath}`);
}
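
Usage, with a value that exercises the quoting logic:

import { saveToCSV } from './csv-storage';

async function demo() {
  await saveToCSV('products.csv', [
    { name: 'Laptop', price: '$1,299.00', stock: '12 in stock' }, // the comma in price gets quoted
  ]);
}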

Save to Database (SQLite)
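
This example uses the better-sqlite3 package, which is not part of the earlier setup step:

npm install better-sqlite3
npm install @types/better-sqlite3 -D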

// filepath: src/utils/db-storage.ts
import Database from 'better-sqlite3';

interface ScrapedItem {
  url: string;
  title: string;
  content: string;
  scrapedAt: string;
}

export class DBStorage {
  private db: Database.Database;

  constructor(dbPath: string = 'data/scraper.db') {
    this.db = new Database(dbPath);
    this.init();
  }

  private init() {
    this.db.exec(`
      CREATE TABLE IF NOT EXISTS scraped_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT UNIQUE,
        title TEXT,
        content TEXT,
        scraped_at TEXT
      )
    `);
  }

  insert(item: ScrapedItem) {
    const stmt = this.db.prepare(`
      INSERT OR REPLACE INTO scraped_data (url, title, content, scraped_at)
      VALUES (?, ?, ?, ?)
    `);
    stmt.run(item.url, item.title, item.content, item.scrapedAt);
  }

  insertMany(items: ScrapedItem[]) {
    const insert = this.db.prepare(`
      INSERT OR REPLACE INTO scraped_data (url, title, content, scraped_at)
      VALUES (?, ?, ?, ?)
    `);

    const insertMany = this.db.transaction((items: ScrapedItem[]) => {
      for (const item of items) {
        insert.run(item.url, item.title, item.content, item.scrapedAt);
      }
    });

    insertMany(items);
  }

  getAll(): ScrapedItem[] {
    return this.db.prepare('SELECT * FROM scraped_data').all() as ScrapedItem[];
  }

  close() {
    this.db.close();
  }
}
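
Usage (the data/ directory from the project structure must already exist, since SQLite won't create it):

import { DBStorage } from './db-storage';

const db = new DBStorage();
db.insert({
  url: 'https://example.com/article',
  title: 'Example Article',
  content: '...',
  scrapedAt: new Date().toISOString(),
});
console.log(`Rows stored: ${db.getAll().length}`);
db.close();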

Scheduling Scraping Jobs

Using node-cron
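
node-cron is assumed to be installed first:

npm install node-cron
npm install @types/node-cron -D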

// filepath: src/scheduler.ts
import cron from 'node-cron';
import { chromium } from 'playwright';
import { DataStorage } from './utils/storage';

const storage = new DataStorage();

async function scrapeJob() {
  console.log(`[${new Date().toISOString()}] Starting scrape job...`);

  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  try {
    await page.goto('https://example.com/data');

    const data = await page.evaluate(() => {
      // Extract data logic
      return { timestamp: new Date().toISOString() };
    });

    await storage.saveJSON(`scrape-${Date.now()}.json`, data);
    console.log('Scrape completed successfully');
  } catch (error) {
    console.error('Scrape failed:', error);
  } finally {
    await browser.close();
  }
}

// Example schedules - pick one; registering all three would run
// the job on every schedule

// Every hour
cron.schedule('0 * * * *', scrapeJob);

// Every day at midnight
// cron.schedule('0 0 * * *', scrapeJob);

// Every 15 minutes
// cron.schedule('*/15 * * * *', scrapeJob);

console.log('Scheduler started...');

Using BullMQ (Production)
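
BullMQ needs the bullmq package and a running Redis instance (this example assumes localhost:6379):

npm install bullmq

# Redis via Docker, for example
docker run -d -p 6379:6379 redis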

// filepath: src/jobs/scrape-worker.ts
import { Worker, Queue } from 'bullmq';
import { chromium } from 'playwright';

const connection = {
  host: 'localhost',
  port: 6379,
};

// Create queue
export const scrapeQueue = new Queue('scraping', { connection });

// Worker to process jobs
const worker = new Worker(
  'scraping',
  async (job) => {
    const { url, selector } = job.data;
    console.log(`Processing job ${job.id}: ${url}`);

    const browser = await chromium.launch({ headless: true });
    const page = await browser.newPage();

    try {
      await page.goto(url);
      const data = await page.locator(selector).allTextContents();
      await browser.close();
      return { url, data, scrapedAt: new Date().toISOString() };
    } catch (error) {
      await browser.close();
      throw error;
    }
  },
  {
    connection,
    concurrency: 3, // Process 3 jobs simultaneously
  }
);

worker.on('completed', (job, result) => {
  console.log(`Job ${job.id} completed:`, result);
});

worker.on('failed', (job, err) => {
  console.error(`Job ${job?.id} failed:`, err.message);
});

// Add jobs to the queue
async function addScrapeJobs() {
  const urls = [
    { url: 'https://example.com/page1', selector: '.content' },
    { url: 'https://example.com/page2', selector: '.content' },
    { url: 'https://example.com/page3', selector: '.content' },
  ];

  for (const job of urls) {
    await scrapeQueue.add('scrape', job, {
      attempts: 3,
      backoff: {
        type: 'exponential',
        delay: 1000,
      },
    });
  }
}

addScrapeJobs();

Best Practices

1. Error Handling & Retry

// filepath: src/utils/retry.ts
export async function withRetry<T>(
  fn: () => Promise<T>,
  options: {
    maxRetries?: number;
    delay?: number;
    backoff?: number;
  } = {}
): Promise<T> {
  const { maxRetries = 3, delay = 1000, backoff = 2 } = options;

  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      console.log(`Attempt ${attempt} failed: ${lastError.message}`);

      if (attempt < maxRetries) {
        const waitTime = delay * Math.pow(backoff, attempt - 1);
        console.log(`Waiting ${waitTime}ms before retry...`);
        await new Promise((resolve) => setTimeout(resolve, waitTime));
      }
    }
  }

  throw lastError;
}

// Usage (assumes an existing Playwright context, inside an async function)
const data = await withRetry(
  async () => {
    const page = await context.newPage();
    await page.goto('https://example.com');
    return await page.locator('.data').textContent();
  },
  { maxRetries: 3, delay: 2000 }
);

2. Rate Limiting

// filepath: src/utils/rate-limiter.ts
export class RateLimiter {
  private queue: (() => Promise<void>)[] = [];
  private processing = false;
  private requestsThisSecond = 0;
  private lastReset = Date.now();

  constructor(private requestsPerSecond: number = 1) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push(async () => {
        try {
          const result = await fn();
          resolve(result);
        } catch (error) {
          reject(error);
        }
      });

      this.processQueue();
    });
  }

  private async processQueue() {
    if (this.processing) return;
    this.processing = true;

    while (this.queue.length > 0) {
      const now = Date.now();

      // Reset counter every second
      if (now - this.lastReset >= 1000) {
        this.requestsThisSecond = 0;
        this.lastReset = now;
      }

      // Wait if rate limit reached
      if (this.requestsThisSecond >= this.requestsPerSecond) {
        await new Promise((resolve) =>
          setTimeout(resolve, 1000 - (now - this.lastReset))
        );
        continue;
      }

      const fn = this.queue.shift();
      if (fn) {
        this.requestsThisSecond++;
        await fn();
      }
    }

    this.processing = false;
  }
}
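
Usage, in the same style as the retry example (assumes an existing Playwright context and a urls array, inside an async function). The limiter spaces the page loads out to two per second even though they are all queued at once:

const limiter = new RateLimiter(2); // at most 2 requests per second

const titles = await Promise.all(
  urls.map((url) =>
    limiter.execute(async () => {
      const page = await context.newPage();
      await page.goto(url);
      const title = await page.title();
      await page.close();
      return title;
    })
  )
);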

3. Logging

// filepath: src/utils/logger.ts
import fs from 'fs/promises';
import path from 'path';

type LogLevel = 'info' | 'warn' | 'error' | 'debug';

class Logger {
  private logFile: string;

  constructor(logDir: string = 'logs') {
    this.logFile = path.join(logDir, `scraper-${Date.now()}.log`);
  }

  private async log(level: LogLevel, message: string, meta?: any) {
    const timestamp = new Date().toISOString();
    const logEntry = {
      timestamp,
      level,
      message,
      ...(meta && { meta }),
    };

    const logLine = JSON.stringify(logEntry);

    // Console output
    console.log(`[${timestamp}] [${level.toUpperCase()}] ${message}`);

    // File output
    try {
      await fs.mkdir(path.dirname(this.logFile), { recursive: true });
      await fs.appendFile(this.logFile, logLine + '\n');
    } catch (error) {
      console.error('Failed to write log:', error);
    }
  }

  info(message: string, meta?: any) {
    this.log('info', message, meta);
  }

  warn(message: string, meta?: any) {
    this.log('warn', message, meta);
  }

  error(message: string, meta?: any) {
    this.log('error', message, meta);
  }

  debug(message: string, meta?: any) {
    this.log('debug', message, meta);
  }
}

export const logger = new Logger();

4. Complete Scraper Example

// filepath: src/scrapers/complete-scraper.ts
import { Browser, BrowserContext } from 'playwright';
import { createStealthBrowser } from '../utils/stealth-browser';
import { withRetry } from '../utils/retry';
import { RateLimiter } from '../utils/rate-limiter';
import { DataStorage } from '../utils/storage';
import { logger } from '../utils/logger';
import { randomDelay } from '../utils/human-behavior';

interface ScrapedProduct {
  name: string;
  price: string;
  url: string;
  scrapedAt: string;
}

class ProductScraper {
  private browser: Browser | null = null;
  private context: BrowserContext | null = null;
  private rateLimiter: RateLimiter;
  private storage: DataStorage;

  constructor() {
    this.rateLimiter = new RateLimiter(2); // 2 requests per second
    this.storage = new DataStorage();
  }

  async init() {
    const { browser, context } = await createStealthBrowser({
      headless: true,
    });
    this.browser = browser;
    this.context = context;
    logger.info('Browser initialized');
  }

  async scrapeProduct(url: string): Promise<ScrapedProduct | null> {
    if (!this.context) throw new Error('Browser not initialized');

    return this.rateLimiter.execute(async () => {
      return withRetry(
        async () => {
          const page = await this.context!.newPage();

          try {
            logger.info(`Scraping: ${url}`);
            await page.goto(url, { waitUntil: 'domcontentloaded' });
            await randomDelay(500, 1500);

            const name = await page.locator('h1.product-title').textContent();
            const price = await page.locator('.price').textContent();

            return {
              name: name?.trim() || '',
              price: price?.trim() || '',
              url,
              scrapedAt: new Date().toISOString(),
            };
          } finally {
            await page.close();
          }
        },
        { maxRetries: 3, delay: 2000 }
      );
    });
  }

  async scrapeMultiple(urls: string[]): Promise<ScrapedProduct[]> {
    const results: ScrapedProduct[] = [];

    for (const url of urls) {
      try {
        const product = await this.scrapeProduct(url);
        if (product) {
          results.push(product);
          logger.info(`Scraped: ${product.name}`);
        }
      } catch (error) {
        logger.error(`Failed to scrape ${url}`, { error: (error as Error).message });
      }
    }

    // Save results
    await this.storage.saveJSON(`products-${Date.now()}.json`, results);

    return results;
  }

  async close() {
    await this.browser?.close();
    logger.info('Browser closed');
  }
}

// Main execution
async function main() {
  const scraper = new ProductScraper();

  try {
    await scraper.init();

    const urls = [
      'https://example.com/product/1',
      'https://example.com/product/2',
      'https://example.com/product/3',
    ];

    const products = await scraper.scrapeMultiple(urls);
    console.log(`Successfully scraped ${products.length} products`);
  } catch (error) {
    logger.error('Scraping failed', { error: (error as Error).message });
  } finally {
    await scraper.close();
  }
}

main();

Conclusion

Web scraping with Puppeteer and Playwright gives you powerful capabilities for extracting data from modern websites:

  1. Puppeteer - a solid choice for Chrome-only scraping
  2. Playwright - more powerful, with multi-browser support and modern features
  3. Always respect ToS and rate limits - scrape responsibly
  4. Handle dynamic content - use proper waiting strategies
  5. Anti-bot evasion - user agents, proxies, human-behavior simulation
  6. Production readiness - error handling, retry logic, logging, scheduling

Start with basic scraping, get to know the target website, then scale up with advanced techniques as needed.

Resources