Advanced Concepts

This guide covers advanced topics for building robust and reliable scrapers.

Stream Protection and Proxying

Modern streaming services use various protection mechanisms.

Common Protections

  1. Referer Checking - URLs only work from specific domains
  2. CORS Restrictions - Prevent browser access from unauthorized origins
  3. Geographic Blocking - IP-based access restrictions
  4. Time-Limited Tokens - URLs expire after short periods
  5. User-Agent Filtering - Only allow specific browsers/clients
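
Most of these checks are passed by sending the headers the service expects with each request (a proxy can also help with geographic blocking, and CORS is addressed in the sections below). A minimal sketch, assuming your fetcher accepts per-request headers; the host and header values here are placeholders:

// Fetch a protected page with the headers the service checks for
const embedHtml = await ctx.proxiedFetcher<string>('https://protected-cdn.example.com/embed', {
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  },
});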

Handling Protected Streams

Convert HLS playlists to data URLs:

Data URLs embed content directly in the URL using base64 encoding. Because the player never makes an HTTP request to another origin to load the playlist, CORS restrictions on the playlist are bypassed entirely. This is often the most effective solution for HLS streams.

import { convertPlaylistsToDataUrls } from '@/utils/playlist';

// Original playlist URL from streaming service
const playlistUrl = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required to access the playlist
const headers = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

// Convert playlist and all variants to data URLs
const dataUrl = await convertPlaylistsToDataUrls(
  ctx.proxiedFetcher,
  playlistUrl,
  headers
);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: dataUrl, // Self-contained data URL
    flags: [flags.CORS_ALLOWED], // No CORS issues with data URLs
    captions: []
  }]
};

Why data URLs work for CORS bypass:

  • Initial proxied request WITH HEADERS: the playlists are fetched through the proxy with the required headers, so the service's protection checks pass
  • Fewer external requests: playlist content is embedded directly in the URL as base64
  • Same-origin: browsers treat data URLs as same-origin content
  • No playlist requests at playback: loading the playlist triggers no network request, so no CORS preflight checks
  • Self-contained playlists: the master playlist and all variant playlists are embedded in a single data URL (video segments are still fetched from the CDN)

How the conversion works:

  1. Fetches the master playlist using provided headers
  2. For each quality variant, fetches the variant playlist
  3. Converts all playlists to base64-encoded data URLs
  4. Returns a master data URL containing all embedded variants
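
Conceptually, converting a single playlist works like the sketch below. This is an illustration of the idea, not the library's internals; btoa is assumed to be available (browsers, and recent Node versions):

// Fetch the playlist text through the proxy with the required headers...
const playlistText = await ctx.proxiedFetcher<string>(playlistUrl, { headers });

// ...then embed it as a base64 data URL that the player can load without any network request
const manualDataUrl = `data:application/vnd.apple.mpegurl;base64,${btoa(playlistText)}`;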

When to use data URLs vs M3U8 proxy:

  • Use data URLs when you can fetch all playlist data upfront
  • Use M3U8 proxy when playlists are too large or change frequently
  • Data URLs are preferred for most HLS streams due to simplicity and reliability
  • Data URLs do not help when segments are origin- or header-locked: the conversion applies headers only when fetching the playlists, not when the player fetches the segments

Use M3U8 proxy for HLS streams:

The createM3U8ProxyUrl function routes the playlist through the configured M3U8 proxy, which sends the required headers with requests for the playlist and all of its segments.

When the target is browser-extension or native, createM3U8ProxyUrl returns the URL unproxied; it ONLY rewrites the URL for vanilla browser targets, where the proxy is needed to bypass CORS. This is why you must also include the headers item in the stream response, rather than relying on the function alone.

import { createM3U8ProxyUrl } from '@/utils/proxy';

// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required by the streaming service
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: createM3U8ProxyUrl(originalPlaylist, ctx.features, streamHeaders), // Include headers for proxy usage
    headers: streamHeaders, // Include headers for extension/native usage
    flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
    captions: []
  }]
};

Stream Validation Bypass

Streams are checked with a proxied fetcher before being returned. However, some streams may be blocked by proxies, so you may need to bypass automatic stream validation in valid.ts:

// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];

// Sources listed here are always proxied, so we don't need to validate through a proxy; instead, validation should fetch them natively
// NOTE: all m3u8 proxy urls are automatically processed using this method, so no need to add them here manually
const UNPROXIED_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];

Why this is needed:

  • By default, all streams are validated by attempting to fetch metadata
  • The validation uses proxiedFetcher to check if streams are playable
  • If the stream blocks the fetcher, validation fails
  • The proxied URL may still work in the actual player context
  • Adding your scraper ID to the skip list bypasses validation and returns the URL directly, without checking it

When to skip validation:

  • Validation consistently fails but streams work in browsers
  • The stream may be origin or IP locked
  • The stream blocks the extension or proxy

Use setupProxy for MP4 streams: When adding headers to a stream response, the extension or native target is usually needed to send those headers with the request. For targets that cannot, setupProxy routes the stream through a proxy.

import { setupProxy } from '@/utils/proxy';

let stream = {
  id: 'primary',
  type: 'file',
  flags: [],
  qualities: {
    '1080p': { url: 'https://protected-cdn.example.com/video.mp4' }
  },
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0...'
  },
  captions: []
};

// setupProxy will handle proxying if needed
stream = setupProxy(stream);

return { stream: [stream] };

Performance Optimization

Efficient Data Extraction

Use targeted selectors:

// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');

// ❌ Bad - searches entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');

Cache expensive operations:

// Module-level cache so the config is only parsed once across calls
let cachedConfig: Record<string, unknown> | undefined;

function getConfig(configString: string) {
  if (!cachedConfig) {
    cachedConfig = JSON.parse(configString);
  }
  return cachedConfig;
}

Minimize HTTP Requests

Combine operations when possible:

// ✅ Good - single request with full processing
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);

// ❌ Bad - multiple requests for same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);

Security Considerations

Input Validation

Validate external data:

// Validate URLs before using them
const isValidUrl = (url: string) => {
  try {
    new URL(url);
    return url.startsWith('http://') || url.startsWith('https://');
  } catch {
    return false;
  }
};

if (!isValidUrl(streamUrl)) {
  throw new Error('Invalid stream URL received');
}

Sanitize regex inputs:

// Be careful with dynamic regex
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');
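
The escaped pattern can then be matched safely against scraped data. A short usage example, where searchResults is a hypothetical list of items extracted from a search page:

// Find the entry whose title matches the requested media, case-insensitively
const match = searchResults.find((result) => titleRegex.test(result.title));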

Safe JSON Parsing

// Handle malformed JSON gracefully
let config;
try {
  config = JSON.parse(configString);
} catch (error) {
  throw new Error('Invalid configuration format');
}

// Validate expected structure
if (!config || typeof config !== 'object' || !config.streams) {
  throw new Error('Invalid configuration structure');
}

Testing and Debugging

Debug Logging

// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);

Test Edge Cases

  • Content with special characters in titles
  • Very new releases (might not be available)
  • Old content (might have different URL patterns)
  • Different regions (geographic restrictions)
  • Different quality levels

Common Debugging Steps

  1. Verify URLs are correct
  2. Check HTTP status codes
  3. Inspect response headers
  4. Validate extracted data structure
  5. Test with different content types
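
Steps 1-4 can be covered with a single full-response fetch. A minimal sketch, assuming your fetcher exposes a .full() variant returning the status code, headers, and body; like other debug logging, remove it before submitting:

const response = await ctx.proxiedFetcher.full<string>(requestUrl);
console.log('Status code:', response.statusCode);
console.log('Response headers:', response.headers);
console.log('Body preview:', response.body.slice(0, 200));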

Best Practices Summary

  1. Always use ctx.proxiedFetcher for external requests
  2. Throw NotFoundError for content-not-found scenarios
  3. Update progress at meaningful milestones
  4. Use appropriate flags for stream capabilities
  5. Handle protected streams with proxy functions
  6. Validate external data before using it
  7. Test thoroughly with diverse content
  8. Document your implementation in pull requests

Next Steps

With these advanced concepts in hand:

  1. Review Sources vs Embeds for architectural patterns
  2. Study existing scrapers in src/providers/ for real examples
  3. Test your implementation thoroughly
  4. Submit pull requests with detailed testing documentation