Web Scraping Tutorial with Puppeteer and Playwright
Monday, Dec 29, 2025
Ever needed data from a website but there’s no API? Or need to monitor product prices from multiple e-commerce sites? Web scraping is the solution.
Puppeteer and Playwright are the two most popular libraries for browser automation and web scraping in Node.js. Both can control real browsers, render JavaScript, and extract data from modern dynamic websites.
Web Scraping Use Cases
Before we start coding, let’s understand when web scraping is useful:
| Use Case | Example |
|---|---|
| Price Monitoring | Track product prices on Amazon, eBay |
| Lead Generation | Collect business data from online directories |
| Content Aggregation | Gather news from various sources |
| Research & Analysis | Data for market research, sentiment analysis |
| Testing & QA | E2E testing, visual regression testing |
| Archiving | Backup website content, historical screenshots |
Legal & Ethical Considerations
Important! Before scraping, make sure you understand the legal aspects:
- Check Terms of Service - Many websites prohibit scraping in their ToS
- Respect robots.txt - This file indicates which pages can be crawled
- Rate limiting - Don’t bombard servers with excessive requests
- Personal data - Be careful with GDPR and data privacy regulations
- Copyright - Scraped content may be protected by copyright
# Check website's robots.txt
curl https://example.com/robots.txt
As a general rule:
- Scraping public data for personal/research purposes is usually safe
- Scraping for commercial purposes or re-publishing content can be problematic
- Always ask for permission if in doubt
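Checking robots.txt can be automated before each scraping run. The sketch below is a minimal parser for the wildcard user-agent's Disallow rules; it's an assumption-level helper, not a full implementation of the spec (real projects should use a dedicated library such as the robots-parser npm package, which also handles Allow rules, wildcards, and crawl-delay):

```typescript
// Minimal robots.txt check (sketch): collects Disallow rules that apply
// to "User-agent: *" and tests whether a path is allowed.
export function isPathAllowed(robotsTxt: string, path: string): boolean {
  const lines = robotsTxt.split('\n').map((l) => l.trim());
  let appliesToUs = false;
  const disallowed: string[] = [];
  for (const line of lines) {
    const [rawKey, ...rest] = line.split(':');
    const key = rawKey.toLowerCase();
    const value = rest.join(':').trim();
    if (key === 'user-agent') {
      // A new User-agent line starts a new rule group
      appliesToUs = value === '*';
    } else if (key === 'disallow' && appliesToUs && value) {
      disallowed.push(value);
    }
  }
  // Allowed unless the path starts with a disallowed prefix
  return !disallowed.some((prefix) => path.startsWith(prefix));
}
```

Feed it the body of `https://example.com/robots.txt` and the path you intend to scrape before launching the browser.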
Puppeteer vs Playwright: Which to Choose?
| Feature | Puppeteer | Playwright |
|---|---|---|
| Browser Support | Chrome/Chromium (experimental Firefox) | Chrome, Firefox, Safari (WebKit) |
| Developer | Google | Microsoft |
| Auto-wait | Manual | Built-in smart waiting |
| Parallel Execution | Basic | Browser contexts isolation |
| Mobile Emulation | Yes | Yes + better device profiles |
| Network Interception | Yes | Yes + more powerful |
| Debugging Tools | DevTools | Inspector, Trace Viewer, Codegen |
| API Style | Promise-based | Promise-based with auto-waiting locators |
Recommendation:
- Puppeteer - If you only need Chrome and are already familiar with its API
- Playwright - For cross-browser support, a more complete feature set, and new projects
Project Setup
Install Dependencies
mkdir web-scraper && cd web-scraper
npm init -y
npm install puppeteer playwright
npm install typescript ts-node @types/node -D
npx tsc --init
Project Structure
web-scraper/
├── src/
│ ├── scrapers/
│ │ ├── puppeteer-scraper.ts
│ │ └── playwright-scraper.ts
│ ├── utils/
│ │ ├── browser.ts
│ │ └── helpers.ts
│ └── index.ts
├── data/
│ └── output/
├── package.json
└── tsconfig.json
TypeScript Configuration
// filepath: tsconfig.json
{
"compilerOptions": {
"target": "ES2020",
"module": "commonjs",
"lib": ["ES2020"],
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"resolveJsonModule": true
},
"include": ["src/**/*"],
"exclude": ["node_modules"]
}
Basic Navigation & Selectors
Puppeteer Basic Example
// filepath: src/scrapers/puppeteer-basic.ts
import puppeteer from 'puppeteer';
async function basicScraping() {
// Launch browser
const browser = await puppeteer.launch({
headless: true, // false to see the browser
args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
const page = await browser.newPage();
// Set viewport
await page.setViewport({ width: 1280, height: 800 });
// Navigate to page
await page.goto('https://quotes.toscrape.com', {
waitUntil: 'networkidle2', // at most 2 network connections for 500 ms
});
// Get page title
const title = await page.title();
console.log('Page Title:', title);
// Get text content with selector
const firstQuote = await page.$eval('.quote .text', (el) => el.textContent);
console.log('First Quote:', firstQuote);
// Get multiple elements
const quotes = await page.$$eval('.quote', (elements) =>
elements.map((el) => ({
text: el.querySelector('.text')?.textContent,
author: el.querySelector('.author')?.textContent,
tags: Array.from(el.querySelectorAll('.tag')).map((tag) => tag.textContent),
}))
);
console.log('All Quotes:', quotes);
await browser.close();
}
basicScraping();
Playwright Basic Example
// filepath: src/scrapers/playwright-basic.ts
import { chromium } from 'playwright';
async function basicScraping() {
// Launch browser
const browser = await chromium.launch({
headless: true,
});
// Create context (isolated session)
const context = await browser.newContext({
viewport: { width: 1280, height: 800 },
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
});
const page = await context.newPage();
// Navigate - Playwright auto-waits for load
await page.goto('https://quotes.toscrape.com');
// Get page title
const title = await page.title();
console.log('Page Title:', title);
// Get text with locator (recommended way)
const firstQuote = await page.locator('.quote .text').first().textContent();
console.log('First Quote:', firstQuote);
// Get all quotes with locator
const quoteLocators = page.locator('.quote');
const count = await quoteLocators.count();
const quotes = [];
for (let i = 0; i < count; i++) {
const quote = quoteLocators.nth(i);
quotes.push({
text: await quote.locator('.text').textContent(),
author: await quote.locator('.author').textContent(),
tags: await quote.locator('.tag').allTextContents(),
});
}
console.log('All Quotes:', quotes);
await browser.close();
}
basicScraping();
Selector Strategies
Choosing the right selector is key to robust scraping:
// Playwright selectors - more flexible
const page = await context.newPage();
await page.goto('https://example.com');
// CSS Selector
await page.locator('div.product-card').click();
// Text selector
await page.locator('text=Add to Cart').click();
// Combining selectors
await page.locator('article:has-text("Featured")').click();
// XPath (if CSS isn't enough)
await page.locator('xpath=//div[@data-testid="product"]').click();
// Role selector (accessibility-based)
await page.locator('role=button[name="Submit"]').click();
// Data attributes (most stable for scraping)
await page.locator('[data-product-id="123"]').click();
Tips for choosing selectors:
- Prioritize data attributes - [data-testid="x"] is most stable
- Avoid auto-generated classes - .css-1a2b3c can change
- Use combinations - .product-card h2 is more specific
- Test with browser DevTools - paste the selector into the Console
Extracting Data
Extract Table Data
// filepath: src/scrapers/extract-table.ts
import { chromium } from 'playwright';
interface ProductData {
name: string;
price: string;
stock: string;
rating: string;
}
async function extractTableData(): Promise<ProductData[]> {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://webscraper.io/test-sites/e-commerce/allinone');
// Wait for product cards to render
await page.waitForSelector('.thumbnail');
// Extract data from product cards
const products = await page.evaluate(() => {
const items: ProductData[] = [];
const cards = document.querySelectorAll('.thumbnail');
cards.forEach((card) => {
items.push({
name: card.querySelector('.title')?.getAttribute('title') || '',
price: card.querySelector('.price')?.textContent?.trim() || '',
stock: card.querySelector('.pull-right')?.textContent?.trim() || '',
rating: card.querySelectorAll('.glyphicon-star').length.toString(),
});
});
return items;
});
console.log(`Extracted ${products.length} products`);
console.table(products);
await browser.close();
return products;
}
extractTableData();
Extract with Pagination
// filepath: src/scrapers/extract-with-pagination.ts
import { chromium, Page } from 'playwright';
interface Quote {
text: string;
author: string;
tags: string[];
}
async function extractQuotesFromPage(page: Page): Promise<Quote[]> {
return await page.evaluate(() => {
return Array.from(document.querySelectorAll('.quote')).map((el) => ({
text: el.querySelector('.text')?.textContent || '',
author: el.querySelector('.author')?.textContent || '',
tags: Array.from(el.querySelectorAll('.tag')).map(
(tag) => tag.textContent || ''
),
}));
});
}
async function scrapeAllPages() {
const browser = await chromium.launch();
const page = await browser.newPage();
let allQuotes: Quote[] = [];
let currentPage = 1;
await page.goto('https://quotes.toscrape.com');
while (true) {
console.log(`Scraping page ${currentPage}...`);
// Extract quotes from current page
const quotes = await extractQuotesFromPage(page);
allQuotes = [...allQuotes, ...quotes];
// Check if there's a next button
const nextButton = page.locator('.next > a');
const hasNext = (await nextButton.count()) > 0;
if (!hasNext) {
console.log('No more pages');
break;
}
// Click next and wait for navigation
await nextButton.click();
await page.waitForLoadState('networkidle');
currentPage++;
// Rate limiting - don't go too fast
await page.waitForTimeout(1000);
}
console.log(`Total quotes extracted: ${allQuotes.length}`);
await browser.close();
return allQuotes;
}
scrapeAllPages();
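The project layout above includes a data/output directory, so it's natural to persist results there. This helper is an assumption (the name saveToJson and its location in src/utils/helpers.ts are not prescribed by either library):

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';

// Hypothetical helper: writes scraped records to a directory as
// pretty-printed JSON, creating the directory if needed, and returns
// the path of the file it wrote.
export async function saveToJson(
  records: unknown[],
  outDir: string,
  name: string
): Promise<string> {
  await fs.mkdir(outDir, { recursive: true });
  const file = path.join(outDir, `${name}-${Date.now()}.json`);
  await fs.writeFile(file, JSON.stringify(records, null, 2), 'utf-8');
  return file;
}
```

In scrapeAllPages, the final step would become something like `await saveToJson(allQuotes, 'data/output', 'quotes')`.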
Handling Dynamic Content
Modern websites often load content with JavaScript. Here’s how to handle it:
Wait for Elements
// filepath: src/scrapers/dynamic-content.ts
import { chromium } from 'playwright';
async function handleDynamicContent() {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/dynamic-page');
// Wait for specific element
await page.waitForSelector('.dynamic-content', {
state: 'visible',
timeout: 10000,
});
// Wait for element with specific text
await page.waitForSelector('text=Data loaded');
// Wait for network idle (all requests complete)
await page.waitForLoadState('networkidle');
// Wait for function condition
await page.waitForFunction(() => {
const items = document.querySelectorAll('.list-item');
return items.length > 5;
});
const data = await page.locator('.dynamic-content').textContent();
console.log('Dynamic content:', data);
await browser.close();
}
handleDynamicContent();
Infinite Scroll
// filepath: src/scrapers/infinite-scroll.ts
import { chromium } from 'playwright';
async function handleInfiniteScroll() {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/infinite-scroll');
// Scroll until item count is reached or limit
const targetItems = 50;
const maxScrolls = 10;
let scrollCount = 0;
while (scrollCount < maxScrolls) {
// Count current items
const itemCount = await page.locator('.item').count();
console.log(`Items loaded: ${itemCount}`);
if (itemCount >= targetItems) {
console.log('Target reached!');
break;
}
// Scroll to bottom
await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
// Wait for new content
await page.waitForTimeout(2000);
// Check if we've reached the end
const newItemCount = await page.locator('.item').count();
if (newItemCount === itemCount) {
console.log('No new items loaded - end of content');
break;
}
scrollCount++;
}
// Extract all items
const items = await page.locator('.item').allTextContents();
console.log(`Total items: ${items.length}`);
await browser.close();
}
handleInfiniteScroll();
Anti-Bot Evasion Techniques
Stealth Mode
// filepath: src/utils/stealth-browser.ts
import { chromium, Browser, BrowserContext } from 'playwright';
export async function createStealthBrowser(options: {
headless?: boolean;
proxy?: string;
} = {}): Promise<{ browser: Browser; context: BrowserContext }> {
const browser = await chromium.launch({
headless: options.headless ?? true,
args: [
'--disable-blink-features=AutomationControlled',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
],
});
const context = await browser.newContext({
viewport: { width: 1920, height: 1080 },
userAgent:
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
locale: 'en-US',
timezoneId: 'America/New_York',
permissions: ['geolocation'],
geolocation: { latitude: 40.7128, longitude: -74.006 },
...(options.proxy && {
proxy: { server: options.proxy },
}),
});
// Override navigator.webdriver
await context.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined,
});
// Override plugins
Object.defineProperty(navigator, 'plugins', {
get: () => [1, 2, 3, 4, 5],
});
// Override languages
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
});
return { browser, context };
}
Human-like Behavior
// filepath: src/utils/human-behavior.ts
import { Page } from 'playwright';
export async function randomDelay(min: number, max: number): Promise<void> {
const delay = Math.random() * (max - min) + min;
await new Promise((resolve) => setTimeout(resolve, delay));
}
export async function humanType(
page: Page,
selector: string,
text: string
): Promise<void> {
await page.click(selector);
for (const char of text) {
await page.keyboard.type(char);
await randomDelay(50, 150); // Random delay between keystrokes
}
}
export async function humanScroll(page: Page): Promise<void> {
const scrollAmount = Math.floor(Math.random() * 500) + 200;
await page.evaluate((amount) => {
window.scrollBy({
top: amount,
behavior: 'smooth',
});
}, scrollAmount);
await randomDelay(500, 1500);
}
Best Practices
1. Error Handling & Retry
// filepath: src/utils/retry.ts
export async function withRetry<T>(
fn: () => Promise<T>,
options: {
maxRetries?: number;
delay?: number;
backoff?: number;
} = {}
): Promise<T> {
const { maxRetries = 3, delay = 1000, backoff = 2 } = options;
let lastError: Error | undefined;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
console.log(`Attempt ${attempt} failed: ${lastError.message}`);
if (attempt < maxRetries) {
const waitTime = delay * Math.pow(backoff, attempt - 1);
console.log(`Waiting ${waitTime}ms before retry...`);
await new Promise((resolve) => setTimeout(resolve, waitTime));
}
}
}
throw lastError;
}
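Using withRetry looks like this; the sketch below inlines a compact copy of the helper so it runs standalone, and the flaky operation is simulated rather than a real page fetch:

```typescript
// Compact copy of withRetry (same logic as src/utils/retry.ts above):
// retries with exponential backoff, rethrowing the last error on exhaustion.
async function withRetry<T>(
  fn: () => Promise<T>,
  { maxRetries = 3, delay = 1000, backoff = 2 } = {}
): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      if (attempt < maxRetries) {
        await new Promise((r) => setTimeout(r, delay * Math.pow(backoff, attempt - 1)));
      }
    }
  }
  throw lastError;
}

// Simulated flaky scrape: fails twice, then succeeds on the third attempt.
let attempts = 0;
async function flakyScrape(): Promise<string> {
  attempts++;
  if (attempts < 3) throw new Error(`transient failure #${attempts}`);
  return 'page content';
}

withRetry(flakyScrape, { maxRetries: 5, delay: 10 }).then((result) => {
  console.log(`Succeeded after ${attempts} attempts: ${result}`);
});
```

In a real scraper you would wrap the navigation itself, e.g. `withRetry(() => page.goto(url), { maxRetries: 3 })`.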
2. Rate Limiting
// filepath: src/utils/rate-limiter.ts
export class RateLimiter {
private queue: (() => Promise<void>)[] = [];
private processing = false;
private requestsThisSecond = 0;
private lastReset = Date.now();
constructor(private requestsPerSecond: number = 1) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
return new Promise((resolve, reject) => {
this.queue.push(async () => {
try {
const result = await fn();
resolve(result);
} catch (error) {
reject(error);
}
});
this.processQueue();
});
}
private async processQueue() {
if (this.processing) return;
this.processing = true;
while (this.queue.length > 0) {
const now = Date.now();
// Reset counter every second
if (now - this.lastReset >= 1000) {
this.requestsThisSecond = 0;
this.lastReset = now;
}
// Wait if rate limit reached
if (this.requestsThisSecond >= this.requestsPerSecond) {
await new Promise((resolve) =>
setTimeout(resolve, 1000 - (now - this.lastReset))
);
continue;
}
const fn = this.queue.shift();
if (fn) {
this.requestsThisSecond++;
await fn();
}
}
this.processing = false;
}
}
Conclusion
Web scraping with Puppeteer and Playwright gives you powerful capabilities to extract data from modern websites:
- Puppeteer - Solid choice for Chrome-only scraping
- Playwright - More powerful with multi-browser support and modern features
- Always respect ToS and rate limits - Responsible scraping
- Handle dynamic content - Use proper waiting strategies
- Anti-bot evasion - User agents, proxies, human behavior simulation
- Production-ready - Error handling, retry logic, logging, scheduling
Start with basic scraping, understand the target website, then scale up with advanced techniques as needed.
Resources
- Puppeteer Documentation
- Playwright Documentation
- Web Scraping Best Practices
- Playwright Codegen - Generate code from browser interactions