API Reverse Engineering and Web Scraping
There are many articles written on Web Scraping and Reverse Engineering of Web APIs. Unfortunately, many of them are misguided and immediately resort to browser automation, or are trying to sell you a product to solve a contrived problem.
My goal is to filter out the noise and clearly show simple examples of "reverse engineering" a Web API for the purpose of efficiently extracting data.
Misconceptions/Avoiding Common Mistakes
- Browser automation is not required. It has the potential to be useful, but only in highly specific situations. This means tools like Selenium, Puppeteer, or Playwright should always be a last resort. They are expensive and fragile to run at scale.
- Be wary of paid services that offer "innovative" approaches to web scraping. You can always do it yourself for free.
- Paid proxy services should be a last resort. Depending on your use case, simple VPN automation or cloud resources (Cloudflare Workers, AWS Lambdas, etc.) may be adequate.
- If you encounter CAPTCHAs, do not immediately assume you need to start paying for CAPTCHA-solving services. Following the guidelines detailed in this article will allow you to avoid CAPTCHAs in many cases.
- The barrier to entry for web scraping is extremely low. 99% of "scraping" packages are slop and boil down to using the same handful of techniques and packages. Emojis in the readme + fake benchmark results != value.
Anatomy of a Network Request
SSL Client Hello
Sites may use various fingerprinting methods at different levels of the network stack to identify a client. One such data point is the JA3 fingerprint, an MD5 hash computed from fields of the TLS Client Hello (protocol version, cipher suites, extensions, elliptic curves, and point formats). Many software libraries allow us to send a Client Hello whose fingerprint matches that of a common browser.
See the following Salesforce engineering blogs open-sourcing-ja3 and tls-fingerprinting-with-ja3-and-ja3s for more information.
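For example, the curl_cffi library (Python bindings to curl-impersonate) can send a Client Hello that matches a real browser. A minimal sketch; the target URL is a placeholder and the available impersonation profiles depend on the library version:

    from curl_cffi import requests

    # "impersonate" selects a browser TLS/HTTP2 profile (e.g. "chrome", "safari", "chrome120").
    r = requests.get("https://example.com", impersonate="chrome")
    print(r.status_code)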
Headers
The user-agent header should always be changed to a generic/common browser user agent.
This is because libraries in Python, Node, and many other languages often use default user agent headers that uniquely identify themselves.
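For example, with httpx (the user-agent string below is just an illustrative value; any current, common browser UA works):

    import httpx

    # Without this, the default header would be something like "python-httpx/0.27.0".
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    r = httpx.get("https://example.com", headers={"user-agent": ua})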
Sometimes headers contain special data. Below is an example of creating "signed" headers which are required to make AWS API requests:
import datetime
import hmac
from functools import partial
from hashlib import sha256
from urllib.parse import quote, urlsplit

hmac_sha256 = partial(hmac.new, digestmod=sha256)
SIG_HEADERS_BLACKLIST = ['expect', 'user-agent', 'x-amzn-trace-id']
EMPTY_SHA256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'

def sigv4(
    *,
    secret_key: str,
    access_key: str,
    region: str,
    service: str,
    method: str,
    url: str,
    headers: dict = None,
    params: dict = None,
    data: str = "",
    version: str = "AWS4",
    aws4_hash: str = "AWS4-HMAC-SHA256",
    rtype: str = "aws4_request",
) -> dict:
    headers = headers or {}
    params = params or {}
    # canonical query string: percent-encode and sort the params
    params = dict(sorted((quote(k, safe='-_.~'), quote(str(v), safe='-_.~')) for k, v in params.items()))
    x_amz_date = headers.get('x-amz-date') or datetime.datetime.now(datetime.timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    timestamp = x_amz_date.split("T")[0]
    scope = f"{timestamp}/{region}/{service}/{rtype}"
    # canonical headers: lowercased, sorted, minus headers excluded from signing
    headers |= {'x-amz-date': x_amz_date, 'host': urlsplit(url).netloc}
    headers = {kl: v for k, v in sorted(headers.items()) if (kl := k.lower().strip()) not in SIG_HEADERS_BLACKLIST}
    ch = "\n".join(f"{k}:{v}" for k, v in headers.items())
    sh = ";".join(headers)
    cp = "&".join(f"{k}={v}" for k, v in params.items())
    hp = sha256(data.encode('utf8')).hexdigest() if data else EMPTY_SHA256
    # canonical request -> hashed -> signed with a key derived from the secret key
    cr = f"{method}\n/\n{cp}\n{ch}\n\n{sh}\n{hp}"
    hcr = sha256(cr.encode('utf8')).hexdigest()
    v = hmac_sha256(f'{version}{secret_key}'.encode('utf8'), timestamp.encode('utf8')).digest()
    r = hmac_sha256(v, region.encode('utf8')).digest()
    s = hmac_sha256(r, service.encode('utf8')).digest()
    k = hmac_sha256(s, rtype.encode('utf8')).digest()
    sig = hmac_sha256(k, f"{aws4_hash}\n{x_amz_date}\n{scope}\n{hcr}".encode('utf8')).digest().hex()
    return {
        'x-amz-date': x_amz_date,
        'authorization': f'{aws4_hash} Credential={access_key}/{scope}, SignedHeaders={sh}, Signature={sig}',
    }
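A hypothetical usage sketch (placeholder credentials and endpoint; note the canonical URI above is hard-coded to "/", so the signed request must target the root path):

    import httpx

    url = "https://abc123.execute-api.us-east-1.amazonaws.com/"  # placeholder endpoint
    params = {"limit": 10}
    signed = sigv4(
        secret_key="<secret>", access_key="<key id>",
        region="us-east-1", service="execute-api",
        method="GET", url=url, params=params,
    )
    r = httpx.get(url, params=params, headers=signed)  # sends x-amz-date + authorization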
A much simpler scenario is when one or more headers contain values that must match one another.
Here, we investigate the x-csrf-token header and ct0 cookie required when making requests to certain endpoints within Twitter's GraphQL API:
GET https://api.twitter.com/graphql/piUHOePH_uDdwbD9GkquJA/UserTweets?{various params}
authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA
cookie: ct0=abc123; auth_token=kldsd2fIUyilhufs89;
x-twitter-auth-type: OAuth2Session
x-csrf-token: abc123
Through trial and error, we determine the following conditions:
- The x-csrf-token and ct0 values must match. If they don't, the request will error out.
- We need to supply an authorization header with a valid bearer token (in this case, it's a public token).
- If we are logged in, another cookie auth_token is needed, which in turn maps to a different x-csrf-token.
- We also must supply an x-twitter-auth-type header with the value OAuth2Session if we are using the auth_token, otherwise we set it to an empty string.
- Any other combination or permutation of these headers will result in an error.
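A minimal sketch of satisfying the matching-token condition for a guest session (the token value is arbitrary; the endpoint still needs its usual query params and bearer token):

    import secrets
    import httpx

    csrf = secrets.token_hex(16)  # arbitrary, it only has to equal the ct0 cookie
    client = httpx.Client(
        headers={
            "authorization": "Bearer AAAA...",  # the public bearer token shown above
            "x-csrf-token": csrf,
        },
        cookies={"ct0": csrf},
    )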
Cookies
Cookies are used to store state. This state can be used to identify some form of authenticated session, or "guest" session. It is important to understand what triggers the creation of cookies, and how they are used in subsequent requests.
In simple cases, generating cookies boils down to making specifically crafted network requests, to specific pages, in a specific order.
Using expired/invalid cookies may lead to rate limiting, CAPTCHAs, 4xx errors, limited data, limited endpoints, etc.
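A minimal sketch of the idea, using a persistent client so cookies set by earlier responses are replayed on later ones (placeholder URLs):

    import httpx

    with httpx.Client(follow_redirects=True) as client:
        client.get("https://example.com/")              # this page issues the "guest" cookies via Set-Cookie
        r = client.get("https://example.com/api/feed")  # the cookie jar is sent automatically from here on
        print(dict(client.cookies))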
Payloads, Params, and Paths
These are the attributes of a network request that are commonly altered/fuzzed to see if more data can be returned or sent per-request.
Common scenarios include:
- GraphQL introspection queries to determine the API schema.
- Pluralization of a documented parameter (e.g. user vs users) to see if a batched endpoint exists.
- Trying common endpoint/parameter names from a wordlist (e.g. user=123, /settings, /user_data, etc.)
- Modifying a parameter that indicates the amount of data per result (e.g. includeMetaData=true, includeSubProperties=true, etc.)
- Modifying a parameter that indicates the number of results returned per page (e.g. count=9999, offset=123, etc.)
- Modifying a parameter that indicates the type of data returned (e.g. format=json, format=xml, etc.)
Look for endpoints that contain words like "beta", "staging", "internal", "test", etc. Since these usually aren't intended to be public-facing, their rate limits may be unbounded, finer granularity of data may be available, other special endpoints may be exposed, etc.
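A sketch of what this probing might look like (hypothetical endpoint and parameter names):

    import httpx

    base = "https://example.com/api/search"
    probes = [
        {"q": "foo", "count": 20},                   # documented default
        {"q": "foo", "count": 9999},                 # is the page size capped?
        {"q": "foo", "count": 20, "format": "xml"},  # alternate formats?
        {"qs": ["foo", "bar"]},                      # batched/pluralized variant?
    ]
    for params in probes:
        r = httpx.get(base, params=params)
        print(params, r.status_code, len(r.content))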
Authentication
There exist many forms of Authentication Flows. Some are trivial to replicate, while others require knowledge of cryptography and hashing.
The best case scenario is that the authentication flow is simply a chain of message passing without hashing or cryptographic functions.
An example of an easily replicated step in a login-flow can be seen in my Twitter API library:
def flow_password(client: Client) -> Client:
    return update_token(client, 'flow_token', 'https://api.twitter.com/1.1/onboarding/task.json', json={
        "flow_token": client.cookies.get('flow_token'),
        "subtask_inputs": [{
            "subtask_id": "LoginEnterPassword",
            "enter_password": {"password": client.cookies.get('password'), "link": "next_link"}}]
    })
An example of a more complex step in a login-flow comes from subzeroid's instagrapi. The server expects the user's password to be encrypted in this specific way:
class PasswordMixin:
    def password_encrypt(self, password):
        publickeyid, publickey = self.password_publickeys()
        session_key = get_random_bytes(32)
        iv = get_random_bytes(12)
        timestamp = str(int(time.time()))
        decoded_publickey = base64.b64decode(publickey.encode())
        recipient_key = RSA.import_key(decoded_publickey)
        cipher_rsa = PKCS1_v1_5.new(recipient_key)
        rsa_encrypted = cipher_rsa.encrypt(session_key)
        cipher_aes = AES.new(session_key, AES.MODE_GCM, iv)
        cipher_aes.update(timestamp.encode())
        aes_encrypted, tag = cipher_aes.encrypt_and_digest(password.encode("utf8"))
        size_buffer = len(rsa_encrypted).to_bytes(2, byteorder="little")
        payload = base64.b64encode(
            b"".join(
                [
                    b"\x01",
                    publickeyid.to_bytes(1, byteorder="big"),
                    iv,
                    size_buffer,
                    rsa_encrypted,
                    tag,
                    aes_encrypted,
                ]
            )
        )
        return f"#PWD_INSTAGRAM:4:{timestamp}:{payload.decode()}"
Another snippet of a more complex step comes from the login-flow in my Proton API library. Here, the server expects the client to process the password through an SRP-6a exchange:
def process_challenge(self, bytes_s: bytes, bytes_server_challenge: bytes) -> bytes | None:
    """ Returns M or None if SRP-6a safety check is violated """
    self.bytes_s = bytes_s
    self.B = b2l(bytes_server_challenge)
    # SRP-6a safety check
    if (self.B % self.N) == 0:
        return None
    self.u = hash_custom(self.hash_class, self.A, self.B)
    # SRP-6a safety check
    if self.u == 0:
        return None
    self.x = calc_x(self.hash_class, self.bytes_s, self.p, self.N)
    self.v = pow(self.g, self.x, self.N)
    self.S = pow((self.B - self.k * self.v), (self.a + self.u * self.x), self.N)
    self.K = l2b(self.S, SRP_LEN_BYTES)
    self.M = calc_client_proof(self.hash_class, self.A, self.B, self.K)  # noqa
    self.expected_server_proof = calc_server_proof(self.hash_class, self.A, self.M, self.K)
    return self.M
If the authentication flow is too difficult to reverse, we then fall back to manually generating cookies by logging in through the browser and copying the cookies generated from the login-flow.
Strings
Sometimes there are, for lack of a better term, "special values", "strings", "keys", etc. that are required to make a request.
For example, you may notice long unique-looking strings in the network request, in minified/obfuscated JavaScript/WASM, or hidden somewhere in the DOM. These values may be generated server-side or client-side via JavaScript/WASM. You will need to investigate when and where the first instances of these values occurred to determine how they were generated. This can be done dynamically inside the browser's developer tools by setting breakpoints, or by exporting and grepping the HAR file.
Some examples of how these values may be generated:
- Hard-coded in the source, either base64 encoded, encrypted, or simply in plain text.
- Calculated by the server, based on specific components of the network request (e.g. server hashes params => add timestamp => xor, b64, etc.).
- Client runs obfuscated JavaScript/WASM which derives these values.
- Client loads an obvious external resource which helps derive these values.
- Client loads a seemingly innocuous external resource which helps derive these values.
Rate Limits
Before jumping straight into rotating proxies, setting up lambdas on AWS, using a VPN, etc. you should first determine the root cause of your rate-limit issues.
Some questions you can ask yourself are:
- What is the target's connection limit per IP?
- Am I sending too many requests per IP?
- Am I sending too many requests per endpoint?
- Am I sending too many requests per user (is the rate limit linked to a session cookie, auth header, etc.)?
- Are connections being closed appropriately?
Throttling/Retrying Requests
There exist many approaches to throttling and retrying requests, usually involving an exponential function. See exponential-backoff-and-jitter from AWS's architecture blog for more information.
A simple implementation of this in Python:
async def backoff(fn: callable, *args, m: int = 20, b: int = 2, max_retries: int = 8, **kwargs) -> any:
    for i in range(max_retries + 1):
        try:
            r = await fn(*args, **kwargs)
            r.raise_for_status()
            return r
        except Exception as e:
            if i == max_retries:
                logger.warning(f'max retries exceeded\t{e}')
                return
            # full jitter: sleep a random fraction of an exponentially growing window, capped at m seconds
            t = min(random.random() * (b ** i), m)
            logger.info(f'retrying in {t:.2f} seconds\t{e}')
            await asyncio.sleep(t)
Dynamic Data
HLS (HTTP Live Streaming)
A widely used protocol, HLS chunks audio/video data and streams it to your device. You can identify if a website is using HLS by looking for requests to .m3u8 files in a HAR dump or the Network tab of your browser's developer tools.
In the simplest case, these .m3u8 files contain links to .ts files, which are the actual video data chunks you need
to download and concatenate to get the full video. You will need to parse the .m3u8 file and filter for the specific
quality variant that you wish to download (1080, 720, etc.)
The file can be complex enough to warrant the use of a parser library, such as m3u8, created by the Brazilian media network Globo — come to Brazil.
HTTP Live Streaming (Datatracker)
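A minimal sketch using the m3u8 parser, assuming an unencrypted stream (no EXT-X-KEY) and a placeholder manifest URL:

    import m3u8
    import httpx

    master = m3u8.load("https://example.com/video/master.m3u8")              # placeholder
    variant = max(master.playlists, key=lambda p: p.stream_info.bandwidth)   # pick the best quality
    media = m3u8.load(variant.absolute_uri)

    with open("video.ts", "wb") as fp:
        for seg in media.segments:                                           # download + concatenate chunks
            fp.write(httpx.get(seg.absolute_uri).content)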
DASH (Dynamic Adaptive Streaming over HTTP)
Similar to HLS, DASH streams chunks which are described in an MPD. This contains data relating to timing, URLs, resolution, bitrate, etc.
The process of extracting DASH video data is similar to HLS. Manifest files must be parsed and relevant fields need to be extracted to link to the chunked audio/video data.
Dynamic Adaptive Streaming over HTTP (ISO/IEC 23009-1:2022)
Dynamic Adaptive Streaming over HTTP (Wiki)
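A minimal sketch that only lists the representations described in an MPD (placeholder URL; a real downloader also has to expand SegmentTemplate/SegmentList into chunk URLs):

    import xml.etree.ElementTree as ET
    import httpx

    NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}
    root = ET.fromstring(httpx.get("https://example.com/video/manifest.mpd").text)

    for aset in root.iter(f"{{{NS['mpd']}}}AdaptationSet"):
        for rep in aset.findall("mpd:Representation", NS):
            print(rep.get("id"), aset.get("mimeType") or rep.get("mimeType"),
                  rep.get("bandwidth"), rep.get("width"), rep.get("height"))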
Static Data
Static data (images, videos, audio, text, etc. as single files) lacks excitement. The core ideas relating to streaming, laziness, and concurrency remain the same.
Range Requests
In some cases, static data can be downloaded much faster by use of the range header. One should always
determine if the server accepts range requests before trying to optimize other aspects of the download process.
Range requests are especially useful for large files: they let us request specific byte ranges of a file, so multiple chunks can be downloaded concurrently.
Below we see an example of how to generate ranges for a list of URLs, splitting the content into chunks of size sz:
async def generate_ranges(urls: list[str], sz: int) -> list[tuple[str, tuple[int, int]]]:
    # head requests to get content length
    res = await head(urls)
    ranges = {}
    for r in res:
        l = int(r.headers['content-length'])
        ranges[str(r.url)] = [(start, min(start + sz - 1, l - 1)) for start in range(0, l, sz)]
    ranges = [(k, rng) for k, v in ranges.items() for rng in v]
    return ranges
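Each range can then be requested with a Range header and written at its offset (a minimal sketch; assumes the server advertised accept-ranges: bytes):

    import asyncio
    from httpx import AsyncClient

    async def fetch_range(client: AsyncClient, url: str, rng: tuple[int, int], fd):
        start, end = rng
        r = await client.get(url, headers={"range": f"bytes={start}-{end}"})
        r.raise_for_status()   # expect 206 Partial Content
        fd.seek(start)
        fd.write(r.content)

    async def download(url: str, ranges: list[tuple[int, int]], path: str):
        with open(path, "wb") as fd:
            async with AsyncClient() as client:
                await asyncio.gather(*(fetch_range(client, url, rng, fd) for rng in ranges))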
Concurrency
There are many ways to download data efficiently. The simplest way is to use libraries like aiohttp or httpx
coupled with asyncio or anyio. Rather than explaining concurrency in detail, I will provide a simple example
of what it might look like in practice.
The following simplified code illustrates one way to efficiently make network requests and write data to disk.
Key characteristics:
- Each response is streamed to disk rather than storing the entire response in memory.
- Exp. backoff + jitter strategy to handle rate limits and network errors.
- Semaphore to limit the number of concurrent connections.
- Async file I/O to ensure writes to disk are non-blocking for large files.
- All functions are asynchronous and yield generators where applicable.
- A more efficient event loop is used via the uvloop library.
(toy example)
import asyncio
import logging
import random
import traceback
from asyncio import Semaphore
from functools import partial
from pathlib import Path
from typing import Generator

import anyio
import uvloop
from httpx import AsyncClient, Response
from tqdm.asyncio import tqdm_asyncio

logger = logging.getLogger('logger')
logger.addHandler(logging.StreamHandler())
logger.addHandler(logging.FileHandler('log.log'))
logger.setLevel(logging.INFO)

async def backoff(fn: callable, sem: Semaphore, *args, m: int = 20, b: int = 2, max_retries: int = 8, **kwargs) -> Response:
    for i in range(max_retries + 1):
        try:
            async with sem:
                r = await fn(*args, **kwargs)
                r.raise_for_status()
                return r
        except Exception as e:
            if i == max_retries:
                logger.info(f'max retries exceeded\t{e}')
                return
            t = min(random.random() * (b ** i), m)
            logger.info(f'retrying in {t:.2f} seconds\t{e}')
            await asyncio.sleep(t)

def send(reqs: list[dict], sz: int = None, **kwargs) -> Generator:
    async def fn(client: AsyncClient, sem: Semaphore, req: dict) -> Response:
        try:
            async with await anyio.open_file(req['url'].strip('/').split('/')[-1], "wb") as fd:
                r = await backoff(client.request, sem, **req, **kwargs)
                async for chunk in r.aiter_bytes(sz):
                    await fd.write(chunk)
                return r
        except Exception as e:
            logger.info(f'download failed: {req = }\t{e}\t{traceback.format_exc()}')
    return (partial(fn, req=req) for req in reqs)

async def process(fns: Generator, max_connections: int = 512, client_defaults: dict | None = None):
    client_defaults = client_defaults or {}
    sem = Semaphore(max_connections)
    async with AsyncClient(**client_defaults) as client:
        return await tqdm_asyncio.gather(*(fn(client=client, sem=sem) for fn in fns), desc="Simple Downloader")

if __name__ == '__main__':
    urls = [...]
    headers = {...}
    client_defaults = {...}
    reqs = [{'method': 'GET', 'url': url, 'headers': headers} for url in urls]
    res = uvloop.run(process(send(reqs), client_defaults=client_defaults))
    ...  # further processing
Exercises for the reader
- Assuming 90% of the resources are expected to be between 512 KiB and 5 MiB, how might we optimize our program to achieve better throughput?
- If we were to rewrite this, how might we utilize asyncio.create_task() or loop.run_in_executor()? When should we await?
- What would an efficient response caching implementation look like?
- Given response sizes between 2 GiB and 32 GiB, how might we design our response processing function? How about response sizes in the range of 512 KiB to 1 MiB?
- Try alternative concurrency patterns, program structure, etc. What are the tradeoffs of each design decision w.r.t. your specific use case?
- What are the advantages of using this toy example over scrapy? What are the advantages of scrapy?
WebSockets
WebSockets are a protocol that allows for full-duplex communication between a client and a server. They are often used for real-time applications, such as chat, games, and live-streaming.
Often, the WebSocket connection must be kept alive by sending messages at regular intervals, indicating that the connection is still active, and should be maintained. These are also known as "heartbeat" signals.
An example of using WebSockets to capture live transcript data from Twitter Spaces:
async def capture(self, endpoint: str, access_token: str, room_id: str):
    async with aiofiles.open('chat.jsonl', 'ab') as fp:
        async with websockets.connect(f"wss://{URL(endpoint).host}/chatapi/v1/chatnow") as ws:
            # authenticate, then join the room
            await ws.send(orjson.dumps({
                "payload": orjson.dumps({"access_token": access_token}).decode(),
                "kind": 3
            }).decode())
            await ws.send(orjson.dumps({
                "payload": orjson.dumps({
                    "body": orjson.dumps({
                        "room": room_id,
                    }).decode(),
                    "kind": 1
                }).decode(),
                "kind": 2
            }).decode())
            while 1:
                msg = await ws.recv()
                tmp = orjson.loads(msg)
                kind = tmp.get('kind')
                if kind == 1:
                    signature = tmp.get('signature')
                    payload = orjson.loads(tmp.get('payload'))
                    payload['body'] = orjson.loads(payload.get('body'))
                    await fp.write(orjson.dumps({
                        'kind': kind,
                        'payload': payload,
                        'signature': signature,
                    }) + b'\n')
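A generic sketch of a heartbeat task (the message shape and interval are placeholders; each service defines its own keep-alive format):

    import asyncio
    import json
    import websockets

    async def heartbeat(ws, interval: float = 30.0):
        while True:
            await asyncio.sleep(interval)
            await ws.send(json.dumps({"kind": "ping"}))   # whatever keep-alive the service expects

    async def listen(url: str):
        async with websockets.connect(url) as ws:
            hb = asyncio.create_task(heartbeat(ws))
            try:
                async for msg in ws:
                    print(msg)
            finally:
                hb.cancel()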
DRM (Digital Rights Management)
Detailing the process of dealing with DRM protected content is beyond the scope of this article. Regardless of the methods and tools used, the discovery process remains fairly consistent:
- Identification of the DRM technology used (e.g. Apple FPS, Google Widevine, Microsoft PlayReady, Adobe Primetime)
- Determining if the highest offered quality at the lowest security level meets your criteria.
- Further investigation into the DRM technology used, determining if it is worth the effort to decrypt content protected by the highest security level.
E.g. in the case of Google Widevine, there are three levels of security which limit the quality of the content you can extract "easily":
- L3: Decryption and processing are purely in software without TEE, content at fixed resolution.
- L2: Cryptographic operations are performed within a TEE, but content processing happens in software or dedicated hardware, with content at fixed resolution.
- L1: Decryption and processing occur entirely within a Trusted Execution Environment (TEE), enabling content in original resolution.
Parsing
The first step in reverse engineering a Web API is always to see if you can identify API endpoints that return some form of reliable structured data.
Many modern web apps make requests to a REST or GraphQL API which returns JSON data. This is our best case scenario,
as JSON data is easy to work with, and can be easily parsed and expanded into tables for further analysis. If the data is returned in a different format,
such as XML, or HTML, we will need to parse it. Common tools exist such as selectolax, BeautifulSoup, lxml, etc.
It is easy to waste a lot of time writing CSS/XPath selectors to extract data from HTML. On top of this, the site structure will inevitably change, continuously breaking your code. You should be rabidly looking for API endpoints that return JSON, and never settle with scraping/parsing HTML unless you have no other choice.
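For example, nested JSON responses flatten straight into a table (made-up data):

    import pandas as pd

    data = [
        {"id": 1, "user": {"name": "a", "stats": {"followers": 10}}},
        {"id": 2, "user": {"name": "b", "stats": {"followers": 25}}},
    ]
    df = pd.json_normalize(data)   # columns: id, user.name, user.stats.followers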
Mobile Apps
A detailed explanation of reverse engineering mobile apps is beyond the scope of this article. To reverse engineer a mobile app (only for the purposes of identifying API endpoints), you will need to make use of various tools to intercept and analyze the network traffic. At a bare minimum, you will need to set up a proxy and deal with SSL pinning to allow for the interception of HTTPS traffic.
Some tools/software that can help you with this are:
- mitmproxy, Burp Suite, or Charles Proxy, to proxy and inspect HTTP(S) traffic from the device.
- Frida (and objection), to hook the running app and bypass SSL pinning.
- apktool and jadx, to unpack and decompile Android APKs when you need to read the client code.
Obfuscation Techniques
There are varying degrees of obfuscation that may be employed by a server to make reverse engineering more difficult. Most involve mangling of the data in some way, such as creating confusing control flow, encoding data in a non-standard way, and renaming of variables/functions.
TikTok is an example of a service that employs complex VM-based obfuscation techniques to make reverse engineering difficult. More information can be found in the following blog posts: