Building Scrapers

This guide covers the technical details of implementing scrapers, from basic structure to advanced patterns.

The Combo Scraper Pattern

The most common and recommended pattern is the "combo scraper" that handles both movies and TV shows with a single function. This reduces code duplication and ensures consistent behavior.

Basic Structure

import { SourcererEmbed, SourcererOutput, makeSourcerer } from '@/providers/base';
import { MovieScrapeContext, ShowScrapeContext } from '@/utils/context';
import { NotFoundError } from '@/utils/errors';

// Main scraping function that handles both movies and TV shows
async function comboScraper(ctx: ShowScrapeContext | MovieScrapeContext): Promise<SourcererOutput> {
  // 1. Build the appropriate URL based on media type
  const embedUrl = `https://embed.su/embed/${
    ctx.media.type === 'movie' 
      ? `movie/${ctx.media.tmdbId}` 
      : `tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`
  }`;

  // 2. Fetch the embed page using proxied fetcher
  const embedPage = await ctx.proxiedFetcher<string>(embedUrl, {
    headers: {
      Referer: 'https://embed.su/',
      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    },
  });

  // 3. Extract and decode configuration
  const vConfigMatch = embedPage.match(/window\.vConfig\s*=\s*JSON\.parse\(atob\(`([^`]+)/i);
  const encodedConfig = vConfigMatch?.[1];
  if (!encodedConfig) throw new NotFoundError('No encoded config found');

  // 4. Process the data (decode, decrypt, etc.)
  const decodedConfig = JSON.parse(await stringAtob(encodedConfig));
  if (!decodedConfig?.hash) throw new NotFoundError('No stream hash found');

  // 5. Update progress to show we're making progress
  ctx.progress(50);

  // 6. Build the final result
  const embeds: SourcererEmbed[] = secondDecode.map((server) => ({
    embedId: 'viper',  // ID of the embed scraper to handle this URL
    url: `https://embed.su/api/e/${server.hash}`,
  }));

  ctx.progress(90);

  return { embeds };
}

// Export the scraper configuration
export const embedsuScraper = makeSourcerer({
  id: 'embedsu',              // Unique identifier
  name: 'embed.su',           // Display name
  rank: 165,                  // Priority rank (must be unique)
  disabled: false,            // Whether the scraper is disabled
  flags: [],                  // Feature flags (see Advanced Concepts)
  scrapeMovie: comboScraper,  // Function for movies
  scrapeShow: comboScraper,   // Function for TV shows
});

Alternative: Separate Functions

For complex cases where movie and TV show logic differs significantly. However, its best to use combo scraper!

async function scrapeMovie(ctx: MovieScrapeContext): Promise<SourcererOutput> {
  // Movie-specific logic
  const movieUrl = `${baseUrl}/movie/${ctx.media.tmdbId}`;
  // ... movie processing
}

async function scrapeShow(ctx: ShowScrapeContext): Promise<SourcererOutput> {
  // TV show-specific logic  
  const showUrl = `${baseUrl}/tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`;
  // ... show processing
}

export const myScraper = makeSourcerer({
  id: 'my-scraper',
  name: 'My Scraper',
  rank: 150,
  disabled: false,
  flags: [],
  scrapeMovie: scrapeMovie,    // Separate functions
  scrapeShow: scrapeShow,
});

Return Types

A SourcererOutput can return two types of data. Understanding when to use each is crucial:

1. Embeds Array (Most Common)

Use when your scraper finds embed players that need further processing:

return { 
  embeds: [
    {
      embedId: 'turbovid',  // Must match an existing embed scraper ID
      url: 'https://turbovid.com/embed/abc123'
    },
    {
      embedId: 'mixdrop',   // Backup option
      url: 'https://mixdrop.co/embed/def456'
    }
  ]
};

When to use:

  • Your scraper finds embed player URLs
  • You want to leverage existing embed scrapers
  • The site uses common players (turbovid, mixdrop, etc.)
  • You want to provide multiple server options

2. Stream Array (Direct Streams)

Use when your scraper finds direct video streams that are ready to play:

import { flags } from '@/entrypoint/utils/targets';

// For HLS streams
return {
  embeds: [], // Can be empty when returning streams
  stream: [
    {
      id: 'primary',
      type: 'hls',
      playlist: streamUrl,
      flags: [flags.CORS_ALLOWED],
      captions: [], // Subtitle tracks (optional)
    }
  ]
};

// For MP4 files with a single quality
return {
    embeds: [],
    stream: [
      {
        id: 'primary',
        captions,
        qualities: {
          unknown: {
            type: 'mp4',
            url: streamUrl,
          },
        },
        type: 'file',
        flags: [flags.CORS_ALLOWED],
      },
    ],
  };

  // For MP4 files with multiple qualities:
  // It's recommended to return it using a function similar to this:

  const streams = Object.entries(data.streams).reduce((acc: Record<string, string>, [quality, url]) => {
    let qualityKey: number;
    if (quality === 'ORG') {
      // Only add unknown quality if it's an mp4 (handle URLs with query parameters)
      const urlPath = url.split('?')[0]; // Remove query parameters
      if (urlPath.toLowerCase().endsWith('.mp4')) {
        acc.unknown = url;
      }
      return acc;
    }
    if (quality === '4K') {
      qualityKey = 2160;
    } else {
      qualityKey = parseInt(quality.replace('P', ''), 10);
    }
    if (Number.isNaN(qualityKey) || acc[qualityKey]) return acc;
    acc[qualityKey] = url;
    return acc;
  }, {});

  // Filter qualities based on provider type
  const filteredStreams = Object.entries(streams).reduce((acc: Record<string, string>, [quality, url]) => {
    // Skip unknown for cached provider
    if (provider.useCacheUrl && quality === 'unknown') {
      return acc;
    }

    acc[quality] = url;
    return acc;
  }, {});
  
  // Returning each quality like so
  return {
    stream: [
      {
        id: 'primary',
        captions: [],
        qualities: {
          ...(filteredStreams[2160] && {
            '4k': {
              type: 'mp4',
              url: filteredStreams[2160],
            },
          }),
          ...(filteredStreams[1080] && {
            1080: {
              type: 'mp4',
              url: filteredStreams[1080],
            },
          }),
          ...(filteredStreams[720] && {
            720: {
              type: 'mp4',
              url: filteredStreams[720],
            },
          }),
          ...(filteredStreams[480] && {
            480: {
              type: 'mp4',
              url: filteredStreams[480],
            },
          }),
          ...(filteredStreams[360] && {
            360: {
              type: 'mp4',
              url: filteredStreams[360],
            },
          }),
          ...(filteredStreams.unknown && {
            unknown: {
              type: 'mp4',
              url: filteredStreams.unknown,
            },
          }),
        },
        type: 'file',
        flags: [flags.CORS_ALLOWED],
      },
    ],
  };

When to use:

  • Your scraper can extract direct video URLs
  • The site provides its own player technology
  • You need fine control over stream handling
  • The streams don't require complex embed processing

Context and Utilities

The scraper context (ctx) provides everything you need for implementation:

Media Information

// Basic media info (always available)
ctx.media.title          // "Spirited Away"
ctx.media.type           // "movie" | "show"
ctx.media.tmdbId         // 129
ctx.media.releaseYear    // 2001
ctx.media.imdbId         // "tt0245429" (when available)

// For TV shows only (check ctx.media.type === 'show')
ctx.media.season.number  // 1
ctx.media.season.tmdbId  // Season TMDB ID
ctx.media.episode.number // 5
ctx.media.episode.tmdbId // Episode TMDB ID

HTTP Client

// Always use proxiedFetcher for external requests to avoid CORS
const response = await ctx.proxiedFetcher<string>('https://example.com/api', {
  method: 'POST',
  headers: {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://example.com'
  },
  body: JSON.stringify({ key: 'value' })
});

// For API calls with base URL
const data = await ctx.proxiedFetcher('/search', {
  baseUrl: 'https://api.example.com',
  query: { q: ctx.media.title, year: ctx.media.releaseYear }
});

Progress Updates

// Update the loading indicator (0-100)
ctx.progress(25);  // Found media page
// ... processing ...
ctx.progress(50);  // Extracted embed links
// ... more processing ...
ctx.progress(90);  // Almost done

Common Patterns

1. URL Building

// Handle different media types
const buildUrl = (ctx: ShowScrapeContext | MovieScrapeContext) => {
  const apiUrl = ctx.media.type === 'movie'
    ? `${baseUrl}/movie/${ctx.media.tmdbId}`
    : `${baseUrl}/tv/${ctx.media.tmdbId}/${ctx.media.season.number}/${ctx.media.episode.number}`;
    
  return apiUrl;
};

2. Data Extraction

import { load } from 'cheerio';

// Scraping with Cheerio
const $ = load(embedPage);
const embedUrls = $('iframe[src*="turbovid"]')
  .map((_, el) => $(el).attr('src'))
  .get()
  .filter(Boolean);

// Regex extraction
const configMatch = embedPage.match(/window\.playerConfig\s*=\s*({.*?});/s);
if (configMatch) {
  const config = JSON.parse(configMatch[1]);
  // Process config...
}

3. Error Handling

import { NotFoundError } from '@/utils/errors';

// Throw NotFoundError for content not found
if (!embedUrls.length) {
  throw new NotFoundError('No embed players found');
}

// Throw generic Error for other issues
if (!apiResponse.success) {
  throw new Error(`API request failed: ${apiResponse.message}`);
}

4. Protected Streams

There are several ways to bypass protections on streams.

Using the M3U8 proxy:

import { createM3U8ProxyUrl } from '@/utils/proxy';

// For streams that require special headers
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0...'
};

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: createM3U8ProxyUrl(originalPlaylist, ctx.features, streamHeaders),
    headers: streamHeaders, // Include headers in the createM3U8ProxyUrl function and here for native and extension targets
    flags: [flags.CORS_ALLOWED], // createM3U8ProxyUrl (or the extension) bypasses cors so we say it's allowed to play in a browser
    captions: []
  }]
};

Using the browser extension:

// For streams that require special headers
const streamHeaders = {
  'Referer': 'https://player.example.com/',
  'Origin': 'https://player.example.com',
  'User-Agent': 'Mozilla/5.0...'
};

return {
  stream: [{
    id: 'primary',
    type: 'hls',
    playlist: originalPlaylist,
    headers: streamHeaders,
    flags: [], // Use the extension becuase it can pass headers, include no flag for extension or native
    captions: []
  }]
};

Building Embed Scrapers

Embed scrapers follow a simpler pattern since they only handle one URL type:

import { makeEmbed } from '@/providers/base';

export const myEmbedScraper = makeEmbed({
  id: 'my-embed',
  name: 'My Embed Player',
  rank: 120,
  async scrape(ctx) {
    // ctx.url contains the embed URL from a source
    
    // 1. Fetch the embed page
    const embedPage = await ctx.proxiedFetcher(ctx.url);
    
    // 2. Extract the stream URL (example with regex)
    const streamMatch = embedPage.match(/src:\s*["']([^"']+\.m3u8[^"']*)/);
    if (!streamMatch) {
      throw new NotFoundError('No stream found in embed');
    }
    
    // 3. Return the stream
    return {
      stream: [{
        id: 'primary',
        type: 'hls',
        playlist: streamMatch[1],
        flags: [flags.CORS_ALLOWED],
        captions: []
      }]
    };
  },
});

Testing Your Scrapers

1. Basic Testing

# Test your scraper with CLI
pnpm cli --source-id my-scraper --tmdb-id 11527

# Test different content types
pnpm cli --source-id my-scraper --tmdb-id 94605 --season 1 --episode 1  # TV show

2. Real CLI Output Examples

Testing a source that returns embeds:

pnpm cli --source-id catflix --tmdb-id 11527
{
  embeds: [
    {
      embedId: 'turbovid',
      url: 'https://turbovid.eu/embed/DjncbDBEmbLW'
    }
  ]
}

Testing an embed that returns streams:

pnpm cli --source-id turbovid --url "https://turbovid.eu/embed/DjncbDBEmbLW"
{
  stream: [
    {
      type: 'hls',
      id: 'primary',
      playlist: 'https://proxy.fifthwit.net/m3u8-proxy?url=https%3A%2F%2Fqueenselti.pro%2Fwrofm%2Fuwu.m3u8&headers=%7B%22referer%22%3A%22https%3A%2F%2Fturbovid.eu%2F%22%2C%22origin%22%3A%2F%2Fturbovid.eu%22%7D',
      flags: [flags.CORS_ALLOWED],
      captions: []
    }
  ]
}

Notice: The playlist URL shows how createM3U8ProxyUrl() creates proxied URLs to handle protected streams.

3. Comprehensive Testing

Test with various content:

  • Popular movies (The Shining: 11527, Spirited Away: 129, Avatar: 19995)
  • Recent releases (check current popular movies)
  • TV shows with multiple seasons
  • Anime series (different episode numbering)
  • Different languages/regions

4. Debug Mode

# Add debug logging to your scraper
console.log('Fetching URL:', embedUrl);
console.log('Response status:', response.status);
console.log('Extracted data:', extractedData);

Next Steps

Once you've built your scraper:

  1. Test thoroughly with multiple content types
  2. Check Advanced Concepts for flags and error handling
  3. Register in all.ts with a unique rank
  4. Submit a pull request with testing documentation
Always test your scrapers with both movies and TV shows, and include multiple examples in your pull request description.