Advanced Concepts

This guide covers advanced topics for building robust and reliable scrapers.

Stream Protection and Proxying

Modern streaming services use various protection mechanisms.

Common Protections

  1. Referer Checking - URLs only work from specific domains
  2. CORS Restrictions - Prevent browser access from unauthorized origins
  3. Geographic Blocking - IP-based access restrictions
  4. Time-Limited Tokens - URLs expire after short periods
  5. User-Agent Filtering - Only allow specific browsers/clients
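
Most of these checks are passed by sending the headers the service expects with each request (a proxy can also help with geographic blocking, and CORS is addressed in the sections below). A minimal sketch, assuming your fetcher accepts per-request headers; the host and header values here are placeholders:

// Fetch a protected page with the headers the service checks for
const embedHtml = await ctx.proxiedFetcher<string>('https://protected-cdn.example.com/embed', {
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  },
});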

Handling Protected Streams

Convert HLS playlists to data URLs:

Data URLs embed content directly in the URL using base64 encoding. Because the player never makes an HTTP request to another origin to load the playlist, CORS restrictions on the playlist are bypassed entirely. This is often the most effective solution for HLS streams.

import { convertPlaylistsToDataUrls } from '@/utils/playlist';

// Original playlist URL from streaming service
const playlistUrl = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required to access the playlist
const headers = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

// Convert playlist and all variants to data URLs
const dataUrl = await convertPlaylistsToDataUrls(
  ctx.proxiedFetcher,
  playlistUrl,
  headers
);

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: dataUrl, // Self-contained data URL
    flags: [flags.CORS_ALLOWED], // No CORS issues with data URLs
    captions: []
  }]
};

Why data URLs work for CORS bypass:

  • Initial proxied request WITH HEADERS: the playlists are fetched through the proxy with the required headers, so the service's protection checks pass
  • Fewer external requests: playlist content is embedded directly in the URL as base64
  • Same-origin: browsers treat data URLs as same-origin content
  • No playlist requests at playback: loading the playlist triggers no network request, so no CORS preflight checks
  • Self-contained playlists: the master playlist and all variant playlists are embedded in a single data URL (video segments are still fetched from the CDN)

How the conversion works:

  1. Fetches the master playlist using provided headers
  2. For each quality variant, fetches the variant playlist
  3. Converts all playlists to base64-encoded data URLs
  4. Returns a master data URL containing all embedded variants
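
Conceptually, converting a single playlist works like the sketch below. This is an illustration of the idea, not the library's internals; btoa is assumed to be available (browsers, and recent Node versions):

// Fetch the playlist text through the proxy with the required headers...
const playlistText = await ctx.proxiedFetcher<string>(playlistUrl, { headers });

// ...then embed it as a base64 data URL that the player can load without any network request
const manualDataUrl = `data:application/vnd.apple.mpegurl;base64,${btoa(playlistText)}`;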

When to use data URLs vs M3U8 proxy:

  • Use data URLs when you can fetch all playlist data upfront
  • Use M3U8 proxy when playlists are too large or change frequently
  • Data URLs are preferred for most HLS streams due to simplicity and reliability
  • Data URLs do not help when segments are origin- or header-locked: the conversion applies headers only when fetching the playlists, not when the player fetches the segments

Use M3U8 proxy for HLS streams:

The createM3U8ProxyUrl function routes the playlist through the configured M3U8 proxy, which sends the required headers with requests for the playlist and all of its segments.

When the target is browser-extension or native, createM3U8ProxyUrl returns the URL unproxied; it ONLY rewrites the URL for vanilla browser targets, where the proxy is needed to bypass CORS. This is why you must also include the headers item in the stream response, rather than relying on the function alone.

import { createM3U8ProxyUrl } from '@/utils/proxy';

// Extract the original stream URL
const originalPlaylist = 'https://protected-cdn.example.com/playlist.m3u8';

// Headers required by the streaming service
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: createM3U8ProxyUrl(originalPlaylist, ctx.features, streamHeaders), // Include headers for proxy usage
    headers: streamHeaders, // Include headers for extension/native usage
    flags: [flags.CORS_ALLOWED], // Proxy enables CORS for all targets
    captions: []
  }]
};

Stream Validation Bypass

Streams are checked with a proxied fetcher before being returned. However, some streams may be blocked by proxies, so you may need to bypass automatic stream validation in valid.ts:

// In src/utils/valid.ts, add your scraper ID to skip validation
const SKIP_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];

// Sources listed here are always proxied, so we don't need to validate through a proxy; instead, validation should fetch them natively
// NOTE: all m3u8 proxy urls are automatically processed using this method, so no need to add them here manually
const UNPROXIED_VALIDATION_CHECK_IDS = [
  // ... existing IDs
  'your-scraper-id', // Add your scraper ID here
];

Why this is needed:

  • By default, all streams are validated by attempting to fetch metadata
  • The validation uses proxiedFetcher to check if streams are playable
  • If the stream blocks the fetcher, validation fails
  • The proxied URL may still work in the actual player context
  • Adding your scraper ID to the skip list bypasses validation and returns the URL directly, without checking it

When to skip validation:

  • Validation consistently fails but streams work in browsers
  • The stream may be origin or IP locked
  • The stream blocks the extension or proxy

Use setupProxy for MP4 streams: When adding headers to a stream response, the extension or native target is usually needed to send those headers with the request. For targets that cannot, setupProxy routes the stream through a proxy.

import { setupProxy } from '@/utils/proxy';

let stream = {
  id: 'primary',
  type: 'file',
  flags: [],
  qualities: {
    '1080p': { url: 'https://protected-cdn.example.com/video.mp4' }
  },
  headers: {
    'Referer': 'https://player.example.com/',
    'User-Agent': 'Mozilla/5.0...'
  },
  captions: []
};

// setupProxy will handle proxying if needed
stream = setupProxy(stream);

return { stream: [stream] };

Performance Optimization

Efficient Data Extraction

Use targeted selectors:

// ✅ Good - specific selector
const embedUrl = $('iframe[src*="turbovid"]').attr('src');

// ❌ Bad - searches entire document
const embedUrl = $('*').filter((_, el) => $(el).attr('src')?.includes('turbovid')).attr('src');

Cache expensive operations:

// Module-level cache so the config is only parsed once across calls
let cachedConfig: Record<string, unknown> | undefined;

function getConfig(configString: string) {
  if (!cachedConfig) {
    cachedConfig = JSON.parse(configString);
  }
  return cachedConfig;
}

Minimize HTTP Requests

Combine operations when possible:

// ✅ Good - single request with full processing
const embedPage = await ctx.proxiedFetcher(embedUrl);
const streams = extractAllStreams(embedPage);

// ❌ Bad - multiple requests for same page
const page1 = await ctx.proxiedFetcher(embedUrl);
const config = extractConfig(page1);
const page2 = await ctx.proxiedFetcher(embedUrl); // Duplicate request
const streams = extractStreams(page2);

Security Considerations

Input Validation

Validate external data:

// Validate URLs before using them
const isValidUrl = (url: string) => {
  try {
    new URL(url);
    return url.startsWith('http://') || url.startsWith('https://');
  } catch {
    return false;
  }
};

if (!isValidUrl(streamUrl)) {
  throw new Error('Invalid stream URL received');
}

Sanitize regex inputs:

// Be careful with dynamic regex
const safeTitle = ctx.media.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const titleRegex = new RegExp(safeTitle, 'i');
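
The escaped pattern can then be matched safely against scraped data. A short usage example, where searchResults is a hypothetical list of items extracted from a search page:

// Find the entry whose title matches the requested media, case-insensitively
const match = searchResults.find((result) => titleRegex.test(result.title));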

Safe JSON Parsing

// Handle malformed JSON gracefully
let config;
try {
  config = JSON.parse(configString);
} catch (error) {
  throw new Error('Invalid configuration format');
}

// Validate expected structure
if (!config || typeof config !== 'object' || !config.streams) {
  throw new Error('Invalid configuration structure');
}

Testing and Debugging

Debug Logging

// Add temporary debug logs (remove before submitting)
console.log('Request URL:', requestUrl);
console.log('Response headers:', response.headers);
console.log('Extracted data:', extractedData);

Test Edge Cases

  • Content with special characters in titles
  • Very new releases (might not be available)
  • Old content (might have different URL patterns)
  • Different regions (geographic restrictions)
  • Different quality levels

Common Debugging Steps

  1. Verify URLs are correct
  2. Check HTTP status codes
  3. Inspect response headers
  4. Validate extracted data structure
  5. Test with different content types
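
Steps 1-4 can be covered with a single full-response fetch. A minimal sketch, assuming your fetcher exposes a .full() variant returning the status code, headers, and body; like other debug logging, remove it before submitting:

const response = await ctx.proxiedFetcher.full<string>(requestUrl);
console.log('Status code:', response.statusCode);
console.log('Response headers:', response.headers);
console.log('Body preview:', response.body.slice(0, 200));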

Best Practices Summary

  1. Always use ctx.proxiedFetcher for external requests
  2. Throw NotFoundError for content-not-found scenarios
  3. Update progress at meaningful milestones
  4. Use appropriate flags for stream capabilities
  5. Handle protected streams with proxy functions
  6. Validate external data before using it
  7. Test thoroughly with diverse content
  8. Document your implementation in pull requests

Next Steps

With these advanced concepts in hand:

  1. Review Sources vs Embeds for architectural patterns
  2. Study existing scrapers in src/providers/ for real examples
  3. Test your implementation thoroughly
  4. Submit pull requests with detailed testing documentation