API Reverse Engineering and Web Scraping
There are many articles written on Web Scraping and Reverse Engineering of Web APIs. Unfortunately, many of them are misguided, resort to browser automation, or are trying to sell you a product to solve a contrived problem.
My goal is to filter out the noise: to clearly illustrate, at varying levels of abstraction, the process of reverse engineering a Web API for the purposes of efficiently extracting data, while also explaining the common problems faced and their solutions.
Misconceptions/Avoiding Common Mistakes
- Browser automation is rarely required to extract data. Using tools like Selenium, Puppeteer, or Playwright should always be a last resort, especially if efficiency is a priority.
- Be wary of paid/freemium services that offer innovative approaches to web scraping. You can almost always do it yourself for free.
- Paid proxy services should be a last resort. Many free alternatives exist such as VPN automation, proxying through cloud functions (within free-tier limits), Tor, etc.
- If you encounter CAPTCHAs, do not immediately assume you need to start paying for CAPTCHA solving services. Following the guidelines detailed in this article will allow you to avoid CAPTCHAs in many cases.
Anatomy of a Network Request
SSL Client Hello
Sites may use various fingerprinting methods at different levels of the network stack to identify a client.
One data point that can be used to identify a client is the JA3 fingerprint, which is an MD5 hash of selected fields of the SSL/TLS Client Hello packet (version, cipher suites, extensions, elliptic curves, and elliptic curve point formats).
There exist many software libraries, such as curl_cffi and tls-client, which allow us to spoof this hash such that it matches a common browser's fingerprint.
See the following Salesforce engineering blogs open-sourcing-ja3 and tls-fingerprinting-with-ja3-and-ja3s for more information.
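To make the fingerprint less abstract, here is a minimal sketch of how a JA3 value is derived once the Client Hello fields have already been parsed. The function name and argument layout are my own; the field order and the exclusion of GREASE values follow the JA3 description in the Salesforce posts.

```python
import hashlib

# GREASE values (0x0a0a, 0x1a1a, ... 0xfafa) are randomised by browsers and excluded from JA3.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3(version: int, ciphers: list[int], extensions: list[int],
        curves: list[int], point_formats: list[int]) -> tuple[str, str]:
    """Return the JA3 string and its MD5 digest for already-parsed Client Hello fields."""
    def join(values: list[int]) -> str:
        # decimal values joined by "-", with GREASE entries dropped
        return "-".join(str(v) for v in values if v not in GREASE)

    ja3_string = ",".join([
        str(version), join(ciphers), join(extensions), join(curves), join(point_formats),
    ])
    return ja3_string, hashlib.md5(ja3_string.encode()).hexdigest()
```

Two clients that offer the same ciphers and extensions in a different order produce different hashes, which is why simply changing headers is not enough to look like a browser at this layer.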
Headers
The user-agent header should always be changed to a generic/common browser user agent.
This is because libraries in Python, Node, and many other languages often use default user agent headers that uniquely identify themselves.
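As a small illustration using only the standard library, the sketch below reads the default user agent that urllib would send and builds a request with a browser-like value instead. The helper name browser_request and the specific browser string are just examples.

```python
import urllib.request

# The default opener advertises itself as "Python-urllib/<version>".
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]

# A common desktop browser string (any current, widely used value would do).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
)


def browser_request(url: str) -> urllib.request.Request:
    """Build a request that carries a browser-like user agent instead of the default."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
```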
Sometimes headers contain special data. Below is an example of creating "signed" headers which are required to make AWS API requests:
import datetime
import hmac
from functools import partial
from hashlib import sha256
from urllib.parse import quote, urlsplit

hmac_sha256 = partial(hmac.new, digestmod=sha256)
SIG_HEADERS_BLACKLIST = ['expect', 'user-agent', 'x-amzn-trace-id']
EMPTY_SHA256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'  # SHA-256 of the empty string


def sigv4(
    *,
    secret_key: str,
    access_key: str,
    region: str,
    service: str,
    method: str,
    url: str,
    headers: dict = None,
    params: dict = None,
    data: str = "",
    version: str = "AWS4",
    aws4_hash: str = "AWS4-HMAC-SHA256",
    rtype: str = "aws4_request",
) -> dict:
    headers = headers or {}
    params = params or {}
    # canonical query string: URI-encode keys and values, then sort by key
    params = dict(sorted((quote(k, safe='-_.~'), quote(str(v), safe='-_.~')) for k, v in params.items()))
    x_amz_date = headers.get('x-amz-date') or datetime.datetime.now(datetime.timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    timestamp = x_amz_date.split("T")[0]  # date portion (YYYYMMDD) used in the credential scope
    scope = f"{timestamp}/{region}/{service}/{rtype}"
    headers |= {'x-amz-date': x_amz_date, 'host': urlsplit(url).netloc}
    # canonical headers: lowercased names, excluding headers that should not be signed
    headers = {kl: v for k, v in sorted(headers.items()) if (kl := k.lower().strip()) not in SIG_HEADERS_BLACKLIST}
    ch = "\n".join(f"{k}:{v}" for k, v in headers.items())
    sh = ";".join(headers)
    cp = "&".join(f"{k}={v}" for k, v in params.items())
    hp = sha256(data.encode('utf8')).hexdigest() if data else EMPTY_SHA256
    # canonical request; note that the canonical URI is fixed to "/" here
    cr = f"{method}\n/\n{cp}\n{ch}\n\n{sh}\n{hp}"
    hcr = sha256(cr.encode('utf8')).hexdigest()
    # derive the signing key: date -> region -> service -> terminator
    v = hmac_sha256(f'{version}{secret_key}'.encode('utf8'), timestamp.encode('utf8')).digest()
    r = hmac_sha256(v, region.encode('utf8')).digest()
    s = hmac_sha256(r, service.encode('utf8')).digest()
    k = hmac_sha256(s, rtype.encode('utf8')).digest()
    sig = hmac_sha256(k, f"{aws4_hash}\n{x_amz_date}\n{scope}\n{hcr}".encode('utf8')).digest().hex()
    return {
        'x-amz-date': x_amz_date,
        'authorization': f'{aws4_hash} Credential={access_key}/{scope}, SignedHeaders={sh}, Signature={sig}',
    }
A much simpler scenario is when one or more headers contain values that must match one another.
Here, we investigate the x-csrf-token header and ct0 cookie required when making requests to certain endpoints within Twitter's GraphQL API:
GET https://api.twitter.com/graphql/piUHOePH_uDdwbD9GkquJA/UserTweets?{various params}
authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA
cookie: ct0=abc123; auth_token=kldsd2fIUyilhufs89;
x-twitter-auth-type: OAuth2Session
x-csrf-token: abc123
Through trial and error, we determine the following conditions:
- The x-csrf-token and ct0 values must match. If they don't, the request will error out.
- We need to supply an authorization header with a valid bearer token (in this case, it's a public token).
- If we are logged in, another cookie auth_token is needed, which in turn maps to a different x-csrf-token.
- We also must supply an x-twitter-auth-type header with the value OAuth2Session if we are using the auth_token, otherwise we set it to an empty string.
- Any other combination or permutation of these headers will result in an error.
Cookies
Cookies are used to store state. This state can be used to identify some form of authenticated session, or "guest" session. It is important to understand what triggers the creation of cookies, and how they are used in subsequent requests.
In simple cases, generating cookies boils down to making specifically crafted network requests, to specific pages, in a specific order.
Using expired/invalid cookies may lead to rate limiting, CAPTCHAs, 4xx errors, limited data, limited endpoints, etc.
Payloads, Params, and Paths
These are the attributes of a network request that are commonly altered/fuzzed to see if more data can be returned or sent per-request.
Common scenarios include:
- GraphQL introspection queries to determine the API schema.
- Pluralization of a documented parameter (e.g. user vs users) to see if a batched endpoint exists.
- Trying common endpoint/parameter names from a wordlist (e.g. user=123, /settings, /user_data, etc.)
- Modifying a parameter that indicates the amount of data per result (e.g. includeMetaData=true, includeSubProperties=true, etc.)
- Modifying a parameter that indicates the number of results returned per page (e.g. count=9999, offset=123, etc.)
- Modifying a parameter that indicates the type of data returned (e.g. format=json, format=xml, etc.)
Look for endpoints that contain words like "beta", "staging", "internal", "test", etc. Since these usually aren't intended to be public-facing, their rate limits may be unbounded, finer granularity of data may be available, other special endpoints may be exposed, etc.
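A simple way to explore parameters systematically is to generate one URL per candidate value and compare the responses. The sketch below assumes a hypothetical endpoint and hypothetical parameter names; only the URL construction is shown, not the requests themselves.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def variants(url: str, overrides: dict[str, list[str]]) -> list[str]:
    """Return one URL per (param, value) override, leaving the other params untouched."""
    parts = urlsplit(url)
    base = dict(parse_qsl(parts.query))
    out = []
    for key, values in overrides.items():
        for value in values:
            query = urlencode({**base, key: value})
            out.append(urlunsplit(parts._replace(query=query)))
    return out
```

Changing one parameter at a time makes it easier to attribute a difference in the response (more results, extra fields, a new error) to a specific change.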
Authentication
There exist many forms of Authentication Flows. Some are trivial to replicate, while others require knowledge of cryptography and hashing.
The best case scenario is that the authentication flow is simply a chain of message passing without hashing or cryptographic functions.
An example of an easily replicated step in a login-flow can be seen in my Twitter API library:
def flow_password(client: Client) -> Client:
    return update_token(client, 'flow_token', 'https://api.twitter.com/1.1/onboarding/task.json', json={
        "flow_token": client.cookies.get('flow_token'),
        "subtask_inputs": [{
            "subtask_id": "LoginEnterPassword",
            "enter_password": {"password": client.cookies.get('password'), "link": "next_link"}}]
    })
An example of a more complex step in the login-flow from subzeroid's instagrapi. The server expects the user's password to be encrypted in this specific way:
class PasswordMixin:
    def password_encrypt(self, password):
        publickeyid, publickey = self.password_publickeys()
        session_key = get_random_bytes(32)
        iv = get_random_bytes(12)
        timestamp = str(int(time.time()))
        decoded_publickey = base64.b64decode(publickey.encode())
        recipient_key = RSA.import_key(decoded_publickey)
        cipher_rsa = PKCS1_v1_5.new(recipient_key)
        rsa_encrypted = cipher_rsa.encrypt(session_key)
        cipher_aes = AES.new(session_key, AES.MODE_GCM, iv)
        cipher_aes.update(timestamp.encode())
        aes_encrypted, tag = cipher_aes.encrypt_and_digest(password.encode("utf8"))
        size_buffer = len(rsa_encrypted).to_bytes(2, byteorder="little")
        payload = base64.b64encode(
            b"".join(
                [
                    b"\x01",
                    publickeyid.to_bytes(1, byteorder="big"),
                    iv,
                    size_buffer,
                    rsa_encrypted,
                    tag,
                    aes_encrypted,
                ]
            )
        )
        return f"#PWD_INSTAGRAM:4:{timestamp}:{payload.decode()}"
Another snippet of a more complex step in the login-flow from my Proton API. Here the password is never sent to the server; instead, the client proves knowledge of it using the SRP-6a protocol:
def process_challenge(self, bytes_s: bytes, bytes_server_challenge: bytes) -> bytes | None:
    """ Returns M or None if SRP-6a safety check is violated """
    self.bytes_s = bytes_s
    self.B = b2l(bytes_server_challenge)
    # SRP-6a safety check
    if (self.B % self.N) == 0:
        return None
    self.u = hash_custom(self.hash_class, self.A, self.B)
    # SRP-6a safety check
    if self.u == 0:
        return None
    self.x = calc_x(self.hash_class, self.bytes_s, self.p, self.N)
    self.v = pow(self.g, self.x, self.N)
    self.S = pow((self.B - self.k * self.v), (self.a + self.u * self.x), self.N)
    self.K = l2b(self.S, SRP_LEN_BYTES)
    self.M = calc_client_proof(self.hash_class, self.A, self.B, self.K)  # noqa
    self.expected_server_proof = calc_server_proof(self.hash_class, self.A, self.M, self.K)
    return self.M
If the authentication flow is too difficult to reverse, we then fall back to manually generating cookies by logging in through the browser and copying the cookies generated from the login-flow.
Strings
Sometimes there are — for lack of a better term — "special values", "tokens", "keys", etc. that are required to make a request.
For example, you may notice long unique-looking strings in the network request, minified/obfuscated JavaScript, or hidden somewhere in the DOM. These values may be generated server-side or client-side via JavaScript. You will need to investigate when and where the first instances of these values occurred to determine how they were generated. This can be done inside the browser's developer tools, or by exporting and grepping the HAR file.
Some examples of how these values may be generated:
- Hard-coded in the source, either base64 encoded, encrypted, or simply in plain text.
- Calculated by the server, based on specific components of the network request (e.g. server hashes params => add timestamp => b64 encode).
- Client runs its own obfuscated JavaScript which generates these values.
- Client loads an external library which helps generate these values.
Rate Limits
Before jumping straight into rotating proxies, setting up lambdas on AWS, using a VPN, etc. you should first determine the root cause of your rate-limit issues.
Some questions you can ask yourself are:
- What is the target's connection limit per IP?
- Am I sending too many requests per IP?
- Am I sending too many requests per endpoint?
- Am I sending too many requests per user (is the rate limit linked to a session cookie, auth header, etc.)?
- Are connections being closed appropriately?
Throttling/Retrying Requests
There exist many approaches to throttling and retrying requests, usually involving an exponential function. See exponential-backoff-and-jitter from AWS's architecture blog for more information.
A simple implementation of this in Python:
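import asyncio
import logging
import random
from typing import Any, Callable

logger = logging.getLogger(__name__)


async def backoff(fn: Callable, *args, m: int = 20, b: int = 2, max_retries: int = 8, **kwargs) -> Any:
    for i in range(max_retries + 1):
        try:
            r = await fn(*args, **kwargs)
            r.raise_for_status()
            return r
        except Exception as e:
            if i == max_retries:
                logger.warning(f'max retries exceeded\t{e}')
                return
            # random delay that grows exponentially with the attempt number, capped at m seconds
            t = min(random.random() * (b ** i), m)
            logger.info(f'retrying in {t:.2f} seconds\t{e}')
            await asyncio.sleep(t)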
Dynamic Data
HLS (HTTP Live Streaming)
A widely used protocol, HLS chunks audio/video data and streams it to your device. You can identify if a website is using HLS by looking for requests to .m3u8 files in a HAR dump or the Network tab of your browser's developer tools.
In the simplest case, these .m3u8 files contain links to .ts files, which are the actual video data chunks you need to download and concatenate to get the full video. You will need to parse the .m3u8 file and filter for the specific quality variant that you wish to download (1080, 720, etc.)
The file can be complex enough to warrant the use of a parser library, such as m3u8, created by the Brazilian media network Globo — come to Brazil.
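For the simple case, the parsing can be sketched with the standard library alone. This is a minimal illustration, not a full HLS parser: it only handles EXT-X-STREAM-INF entries in a master playlist and plain segment URIs in a media playlist, and the function names are my own.

```python
import re
from urllib.parse import urljoin

# KEY=value pairs, where value may be a quoted string that contains commas
ATTR = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_master(text: str, base_url: str) -> list[dict]:
    """Return one dict of attributes (plus an absolute 'uri') per variant stream."""
    variants, pending = [], None
    for line in map(str.strip, text.splitlines()):
        if line.startswith("#EXT-X-STREAM-INF:"):
            pending = {k: v.strip('"') for k, v in ATTR.findall(line.split(":", 1)[1])}
        elif line and not line.startswith("#") and pending is not None:
            pending["uri"] = urljoin(base_url, line)  # the URI follows its tag line
            variants.append(pending)
            pending = None
    return variants


def pick_variant(variants: list[dict], height: int) -> dict:
    """Pick the variant with the requested height, else the highest BANDWIDTH."""
    for v in variants:
        if v.get("RESOLUTION", "").endswith(f"x{height}"):
            return v
    return max(variants, key=lambda v: int(v.get("BANDWIDTH", 0)))


def parse_segments(text: str, base_url: str) -> list[str]:
    """Return absolute segment URLs from a media playlist, in playback order."""
    lines = (line.strip() for line in text.splitlines())
    return [urljoin(base_url, line) for line in lines if line and not line.startswith("#")]
```

Playlists that use encryption keys, byte ranges, or discontinuities need more care, which is where a dedicated parser library earns its keep.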
HTTP Live Streaming (Datatracker)
DASH (Dynamic Adaptive Streaming over HTTP)
Similar to HLS, DASH streams chunks which are described in an MPD. This contains data relating to timing, URLs, resolution, bitrate, etc.
The process of extracting DASH video data is similar to HLS. Manifest files must be parsed and relevant fields need to be extracted to link to the chunked audio/video data.
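As a rough sketch of that parsing step, the snippet below lists the Representation elements in an MPD using the standard library's XML parser. It assumes the common urn:mpeg:dash:schema:mpd:2011 namespace and a BaseURL child per representation; manifests that use SegmentTemplate or SegmentList need additional handling, and the function name is mine.

```python
import xml.etree.ElementTree as ET

NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}


def list_representations(mpd_xml: str) -> list[dict]:
    """Return id, mime type, bandwidth, height, and BaseURL for each Representation."""
    root = ET.fromstring(mpd_xml)
    reps = []
    for aset in root.iterfind(".//mpd:AdaptationSet", NS):
        for rep in aset.iterfind("mpd:Representation", NS):
            reps.append({
                "id": rep.get("id"),
                # mimeType may be declared on the AdaptationSet instead of the Representation
                "mime": rep.get("mimeType") or aset.get("mimeType"),
                "bandwidth": int(rep.get("bandwidth", 0)),
                "height": int(rep.get("height", 0)),
                "base_url": rep.findtext("mpd:BaseURL", default="", namespaces=NS),
            })
    return reps
```

Audio and video usually live in separate adaptation sets, so the highest-quality video and the preferred audio track are selected independently and muxed afterwards.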
Dynamic Adaptive Streaming over HTTP (ISO/IEC 23009-1:2022)
Dynamic Adaptive Streaming over HTTP (Wiki)
Static Data
Static data (images, videos, audio, text, etc. as single files) lacks excitement. The core ideas relating to streaming, laziness, and concurrency remain the same.
Range Requests
In some cases, static data can be downloaded much faster by use of the range header. One should always determine if the server accepts range requests before trying to optimize other aspects of the download process.
Using range requests can be very important when downloading large files because it allows us to request chunks of the file within certain byte-ranges, allowing us to download multiple chunks of a file concurrently.
Below we see an example of how to generate ranges for a list of URLs, splitting the content into chunks of size sz:
async def generate_ranges(urls: list[str], sz: int) -> list[tuple[str, tuple[int, int]]]:
    # head requests to get content length (`head` is a helper, not shown here, that sends the HEAD requests)
    res = await head(urls)
    ranges = {}
    for r in res:
        l = int(r.headers['content-length'])
        # inclusive (start, end) byte offsets; the last chunk is clipped to the file size
        ranges[str(r.url)] = [(start, min(start + sz - 1, l - 1)) for start in range(0, l, sz)]
    ranges = [(k, rng) for k, v in ranges.items() for rng in v]
    return ranges
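The (start, end) tuples produced by generate_ranges still need to be turned into request headers, and the downloaded chunks need to be stitched back together. A minimal sketch of those two steps follows; the helper names are mine, and note that some servers honour range requests without advertising Accept-Ranges, so a missing header is a hint rather than proof.

```python
def supports_ranges(headers: dict[str, str]) -> bool:
    """True when a HEAD response advertises byte-range support."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("accept-ranges", "").strip().lower() == "bytes"


def range_header(start: int, end: int) -> dict[str, str]:
    """Header for an inclusive byte range, e.g. bytes=0-1023 for the first KiB."""
    return {"range": f"bytes={start}-{end}"}


def assemble(chunks: dict[tuple[int, int], bytes]) -> bytes:
    """Concatenate downloaded chunks in byte order, regardless of arrival order."""
    return b"".join(chunks[key] for key in sorted(chunks))
```

A successful range request returns status 206 (Partial Content); a 200 response means the server ignored the header and sent the whole file.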
Concurrency
There are many ways to download data efficiently. The simplest way is to use libraries like aiohttp or httpx coupled with asyncio and aiofiles. Rather than explaining concurrency in detail, I will provide a simple example of what it might look like in practice.
The following code illustrates a stripped-down, extremely fragile, yet simple solution to efficiently make network requests and write data to disk asynchronously.
Key characteristics:
- An exponential backoff strategy to handle rate limits and network errors.
- Use of a semaphore to limit the number of concurrent connections.
- An async file I/O library to ensure writes to disk are non-blocking.
- Data is represented as generators where applicable to reduce memory usage.
- All functions are asynchronous and yield generators of partially applied functions where applicable.
- A more efficient event loop is used via the uvloop library.
- Data is asynchronously streamed to disk as it is received, rather than waiting for the entire response to be received.
import asyncio
import logging
import platform
import random
from asyncio import Semaphore
from functools import partial
from pathlib import Path
from typing import Generator

import aiofiles
import orjson
from httpx import AsyncClient, Response, Limits
from tqdm.asyncio import tqdm_asyncio

# allow nested event loops when running inside IPython/Jupyter
try:
    get_ipython()
    import nest_asyncio
    nest_asyncio.apply()
except Exception:
    ...

# uvloop is not available on Windows
if platform.system() != 'Windows':
    try:
        import uvloop
        uvloop.install()
    except Exception:
        ...

logger = logging.getLogger('logger')
logger.addHandler(logging.StreamHandler())
logger.addHandler(logging.FileHandler('log.log'))
logger.setLevel(logging.INFO)

USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3.1 Safari/605.1.1',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.1',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.',
]


async def backoff(fn: callable, sem: Semaphore, *args, m: int = 20, b: int = 2, max_retries: int = 8, **kwargs) -> Response | None:
    for i in range(max_retries + 1):
        try:
            # the semaphore bounds the number of in-flight requests
            async with sem:
                r = await fn(*args, **kwargs)
                r.raise_for_status()
                return r
        except Exception as e:
            if i == max_retries:
                logger.info(f'max retries exceeded\t{e}')
                return
            t = min(random.random() * (b ** i), m)
            logger.info(f'retrying in {t:.2f} seconds\t{e}')
            await asyncio.sleep(t)


def send(reqs: list[dict], out: str = 'data', sz: int = None, stream: bool = False, fname_fn: callable = None, **kwargs) -> Generator:
    async def fn(client: AsyncClient, sem: Semaphore, req: dict):
        url = req.get('url') or kwargs.get('url')
        fname = url.split('/')[-1] if not fname_fn else fname_fn(req | kwargs)
        try:
            async with aiofiles.open(_out / fname, 'wb') as fp:
                if stream:
                    # write chunks to disk as they arrive (no retries on this path)
                    async with sem:
                        async with client.stream(**req, **kwargs) as r:
                            async for chunk in r.aiter_bytes(sz):
                                await fp.write(chunk)
                else:
                    r = await backoff(client.request, sem, **req, **kwargs)
                    async for chunk in r.aiter_bytes(sz):
                        await fp.write(chunk)
                return r
        except Exception as e:
            logger.info(f'download failed: {url = }\t{e}')

    _out = Path(out)
    _out.mkdir(parents=True, exist_ok=True)
    # lazily yield partially applied coroutines; client and semaphore are supplied by process()
    return (partial(fn, req=req) for req in reqs)


async def process(fns: Generator, max_connections: int = 1000, desc: str = None, **kwargs):
    client_defaults = {
        'cookies': kwargs.pop('cookies', None),
        'headers': {'user-agent': random.choice(USER_AGENTS)} | kwargs.pop('headers', {}),
        'timeout': kwargs.pop('timeout', 30.0),
        'verify': kwargs.pop('verify', False),
        'http2': kwargs.pop('http2', True),
        'follow_redirects': kwargs.pop('follow_redirects', True),
        'limits': kwargs.pop('limits', Limits(
            max_connections=max_connections,
            max_keepalive_connections=None,
            keepalive_expiry=5.0,
        ))
    }
    sem = Semaphore(max_connections)
    # follow_redirects is already part of client_defaults; passing it again raises a TypeError
    async with AsyncClient(**client_defaults) as client:
        tasks = (fn(client=client, sem=sem) for fn in fns)
        if desc:
            return await tqdm_asyncio.gather(*tasks, desc=desc)
        return await asyncio.gather(*tasks)


if __name__ == '__main__':
    urls = []
    headers = {}
    res = asyncio.run(process(send([{'url': url, 'headers': headers} for url in urls], method='GET')))
    ...  # further processing
WebSockets
WebSockets are a protocol that allows for full-duplex communication between a client and a server. They are often used for real-time applications, such as chat, games, and live-streaming.
Often, the WebSocket connection must be kept alive by sending messages at regular intervals, indicating that the connection is still active, and should be maintained. These are also known as "heartbeat" signals.
An example of using WebSockets to capture live transcript data from Twitter Spaces:
import aiofiles
import orjson
import websockets
from httpx import URL  # assumption: any URL type that exposes .host works here


async def capture(self, endpoint: str, access_token: str, room_id: str):
    # aiofiles.open returns an async context manager, so it must be entered with `async with`
    async with aiofiles.open('chat.jsonl', 'ab') as fp:
        async with websockets.connect(f"wss://{URL(endpoint).host}/chatapi/v1/chatnow") as ws:
            # authenticate
            await ws.send(orjson.dumps({
                "payload": orjson.dumps({"access_token": access_token}).decode(),
                "kind": 3
            }).decode())
            # join the room
            await ws.send(orjson.dumps({
                "payload": orjson.dumps({
                    "body": orjson.dumps({
                        "room": room_id,
                    }).decode(),
                    "kind": 1
                }).decode(),
                "kind": 2
            }).decode())
            while 1:
                msg = await ws.recv()
                tmp = orjson.loads(msg)
                kind = tmp.get('kind')
                if kind == 1:
                    signature = tmp.get('signature')
                    # payload and body are JSON strings nested inside JSON
                    payload = orjson.loads(tmp.get('payload'))
                    payload['body'] = orjson.loads(payload.get('body'))
                    await fp.write(orjson.dumps({
                        'kind': kind,
                        'payload': payload,
                        'signature': signature,
                    }) + b'\n')
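The capture example above does not send any heartbeat of its own. Where a server expects application-level keep-alive messages, one way to add them is a background task that runs alongside the receive loop. The sketch below is library-agnostic: send is any coroutine function that transmits a message (for example ws.send), and the message content and interval are assumptions that depend entirely on the target service.

```python
import asyncio
import contextlib
from typing import Any, Awaitable, Callable


async def heartbeat(send: Callable[[str], Awaitable[None]], message: str, interval: float) -> None:
    """Send `message` every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        await send(message)


async def with_heartbeat(send: Callable[[str], Awaitable[None]], message: str,
                         interval: float, work: Awaitable[Any]) -> Any:
    """Run `work` while a heartbeat task keeps the connection alive, then stop the heartbeat."""
    task = asyncio.create_task(heartbeat(send, message, interval))
    try:
        return await work
    finally:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

Protocol-level ping/pong frames are a separate mechanism and are often handled by the WebSocket library itself; this sketch only covers heartbeats that the application protocol defines as ordinary messages.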
DRM (Digital Rights Management)
Detailing the process of dealing with DRM protected content is beyond the scope of this article. Regardless of the methods and tools used, the discovery process remains fairly consistent:
- Identification of the DRM technology used (e.g. Apple FPS, Google Widevine, Microsoft PlayReady, Adobe Primetime)
- Determining if the highest offered quality at the lowest security level meets your criteria.
- Further investigation into the DRM technology used, determining if it is worth the effort to decrypt content protected by the highest security level.
E.g. in the case of Google Widevine, there are three levels of security which limit the quality of the content you can extract "easily":
- L3: Decryption and processing are purely in software without TEE, content at fixed resolution.
- L2: Cryptographic operations happen inside a TEE, but video processing happens outside it in software or dedicated hardware, with content at fixed resolution.
- L1: Decryption and processing occur entirely within a Trusted Execution Environment (TEE), enabling content in original resolution.
Parsing
The first step in reverse engineering a Web API is always to see if you can identify API endpoints that return some form of reliable structured data.
Many modern web apps make requests to a REST or GraphQL API which returns JSON data. This is our best case scenario, as JSON data is easy to work with, and can be easily parsed and expanded into tables for further analysis. If the data is returned in a different format, such as XML or HTML, we will need to parse it. Common tools exist such as selectolax, BeautifulSoup, lxml, etc.
It is easy to waste a lot of time writing CSS/XPath selectors to extract data from HTML. On top of this, the site structure will inevitably change, continuously breaking your code. You should be rabidly looking for API endpoints that return JSON, and never settle with scraping/parsing HTML unless you have no other choice.
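To show why JSON is the convenient case, here is a small sketch that expands a nested JSON response into a flat, table-friendly row with dotted column names. The function name and the dotted-key convention are my own choices; note that empty objects and lists are dropped.

```python
def flatten(obj, prefix: str = "") -> dict:
    """Flatten nested dicts/lists into a single dict keyed by dotted paths."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}{index}."))
    else:
        out[prefix[:-1]] = obj  # drop the trailing dot added by the parent level
    return out
```

Applying this to each item in an API response yields rows that can be written straight to CSV or loaded into a dataframe, with no selectors to maintain when the page layout changes.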
Mobile Apps
A detailed explanation of reverse engineering mobile apps is beyond the scope of this article. To reverse engineer a mobile app (only for the purposes of identifying API endpoints), you will need to make use of various tools to intercept and analyze the network traffic. At a bare minimum, you will need to set up a proxy and deal with SSL pinning to allow for the interception of HTTPS traffic.
Some tools/software that can help you with this are:
Obfuscation Techniques
There are varying degrees of obfuscation that may be employed by a server to make reverse engineering more difficult. Most involve mangling of the data in some way, such as creating confusing control flow, encoding data in a non-standard way, and renaming of variables/functions.
TikTok is an example of a service that employs complex VM-based obfuscation techniques to make reverse engineering difficult. More information can be found in the following blog posts: