Chapter 7: Reconnaissance and Open-Source Intelligence

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” – attributed to Abraham Lincoln; for the penetration tester, reconnaissance is the sharpening.


Chapter 6 established the methodology, ethics, and legal framework of penetration testing. With written authorization in hand, the engagement’s first technical phase begins, and it is the one that quietly determines the success of all the others: reconnaissance. Before scanning a single port or sending a single exploit, the tester learns everything possible about the target, because the quality of that intelligence shapes every later decision. This chapter covers reconnaissance and open-source intelligence; the active scanning it leads into is the subject of Chapter 8.

Learning Objectives

After completing this chapter, you will be able to:

  1. Define reconnaissance and distinguish passive from active reconnaissance.

  2. Explain footprinting and the role of open-source intelligence (OSINT) in an engagement.

  3. Use search-engine reconnaissance (“Google dorking”) with advanced operators.

  4. Query WHOIS and the Regional Internet Registries (RIRs) to map an organization’s network ranges.

  5. Perform DNS reconnaissance, including record types, lookups, zone-transfer attempts, and subdomain enumeration.

  6. Describe email harvesting, document metadata analysis, and social-media profiling as intelligence sources.

  7. Select appropriate reconnaissance tools and explain how to defend against reconnaissance.

Key Terms

  • Reconnaissance (recon): the systematic gathering of information about a target before attacking.

  • Footprinting: building a profile (a “footprint”) of a target’s systems, people, and exposure.

  • OSINT (Open-Source Intelligence): intelligence gathered from publicly available sources.

  • Passive reconnaissance: information gathering that does not interact with the target’s systems.

  • Active reconnaissance: information gathering that does touch the target (and so may be detected).

  • WHOIS: a protocol/service for querying registration data for domains and IP ranges.

  • RIR (Regional Internet Registry): bodies (ARIN, RIPE NCC, APNIC, LACNIC, AFRINIC) that allocate IP address space.

  • DNS (Domain Name System): the naming system that maps names to addresses (Chapter 3).

  • Zone transfer: a bulk copy of a DNS zone’s records (AXFR), valuable to an attacker if misconfigured.

  • Google dorking: using advanced search-engine operators to find exposed information.

  • EASM (External Attack Surface Management): continuously discovering and monitoring an organization’s internet-facing exposure.

7.1 Why Reconnaissance Comes First

Reconnaissance is the first phase of the attack methodology introduced in Chapter 6, and skilled attackers and testers alike spend a disproportionate share of their effort here, because everything downstream depends on it. A scan is only as good as the target list that drives it; an exploit is only relevant if it matches a service that was discovered; and a social-engineering pretext (Chapter 4) is only convincing if it is built on real details about the organization and its people. Reconnaissance is, in short, the intelligence-gathering that turns a blind assault into a targeted operation.

The discipline rests on a simple but consequential distinction between passive and active reconnaissance. Passive reconnaissance gathers information without ever touching the target’s systems, relying instead on third parties and public sources, so it leaves no trace on the target and is generally lawful even before active testing is authorized. Active reconnaissance interacts with the target directly, querying its DNS servers, grabbing service banners, or probing hosts, which yields richer detail but can be logged and detected, and which, crucially, requires authorization because it touches systems the tester does not own. The professional sequence is to exhaust passive sources first, building as complete a picture as possible invisibly, and only then to move to active techniques within the scope of the rules of engagement.

This chapter follows that order. It begins with open-source intelligence and the passive sources that reveal an organization’s people, technologies, and exposure, then moves through search-engine reconnaissance, WHOIS and registry lookups, and DNS reconnaissance, which straddles the passive-active line. The active scanning that reconnaissance feeds (port and service discovery, vulnerability scanning, and enumeration) is developed in Chapter 8. Throughout, remember the defender’s mirror image: every source an attacker uses is one a defender can monitor and minimize, a theme the chapter closes on.

In practice, reconnaissance follows a loose but deliberate workflow that the rest of the chapter elaborates: begin with the organization’s own public presence, expand outward to its people and their exposure, map its domains and IP ranges through registries and DNS, extend the search to cloud and code platforms, and correlate everything into a validated picture of the attack surface. Each loop can feed the next: a discovered subdomain prompts a new WHOIS lookup, and a leaked document reveals an employee whose social media yields a project name. Reconnaissance is therefore iterative rather than strictly linear, and good testers keep structured notes (as in Chapter 6) so that findings accumulate rather than scatter.

A note on legality and etiquette: although passive reconnaissance against public sources is generally lawful, professionals still operate within their rules of engagement and avoid actions that could be construed as unauthorized access (for example, logging into a found portal with leaked credentials, even “just to check,” crosses the line into the access that the Computer Fraud and Abuse Act governs, as Chapter 6 detailed). The safe posture is to observe and report during reconnaissance, and to act only within the explicitly authorized scope and phase.

7.2 Footprinting: Passive and Active

The goal of reconnaissance is footprinting: assembling a detailed profile, a footprint, of the target’s technical environment, organization, and people. Because the previous section drew the line between passive and active gathering, we can now organize footprinting along it. The diagram summarizes the landscape.

graph TD
    R[Reconnaissance / Footprinting] --> P[Passive: no contact with target]
    R --> A[Active: touches the target -- needs authorization]
    P --> P1[OSINT: website, social media, job boards]
    P --> P2[WHOIS / RIR registry data]
    P --> P3[Search-engine dorking]
    P --> P4[Document metadata, breach data, EDGAR]
    A --> A1[DNS queries and zone-transfer attempts]
    A --> A2[Banner grabbing]
    A --> A3[Network range / traceroute]
    A --> A4[Leads into scanning -- Chapter 8]

A complete footprint answers practical questions that drive the rest of the engagement: What domains and IP ranges does the organization own? What technologies, operating systems, and services does it run? Who works there, in what roles, and what do their public profiles and the company’s job postings reveal? What documents, code, and credentials has it inadvertently exposed? What is its email format and who are its executives? Each answer narrows the attack to what is real and likely, and each is also, from the defender’s seat, a piece of exposure to be measured and reduced. The sections that follow work through the sources that answer these questions, beginning with the richest and most passive of them all, open-source intelligence.

A practical footprinting checklist helps ensure completeness. For infrastructure: registered domains and subdomains, owned IP ranges, name servers, mail servers, cloud assets, and the technologies and versions in use. For people: names, roles, email-address format, phone numbers, and social-media presence of employees, especially those in IT, security, finance, and executive roles. For exposure: indexed documents and their metadata, public code and secrets, leaked credentials, and any system a search engine or device-search engine already reveals. Working through such a checklist turns reconnaissance from ad hoc browsing into a repeatable process whose output is a structured profile ready for the active phases.

7.3 Open-Source Intelligence (OSINT)

The single most productive passive technique is open-source intelligence (OSINT), defined as information collected from publicly available sources and used in an intelligence context; in the intelligence community, “open” means overt, publicly available sources, as opposed to covert or clandestine ones. The modern organization leaks an astonishing amount of useful information into the open, and the tester’s task is to collect and correlate it.

Several categories are especially fruitful. The organization’s own website reveals technologies, naming conventions, staff, locations, and often documents and email formats. Social media is a gold mine: employees on LinkedIn, Facebook, and similar platforms inadvertently disclose roles, projects, technologies in use, and relationships, which feed both technical targeting and the social-engineering pretexts of Chapter 4. Job postings are a notorious source, because a listing for, say, a “senior administrator with five years of experience in a specific firewall and a specific database version” hands the attacker a map of the internal technology stack. People-search services profile individuals, and the U.S. Securities and Exchange Commission’s EDGAR database exposes a public company’s financial situation and strategic direction. Document metadata, the hidden authorship, software-version, and sometimes username and path information embedded in published files, can reveal internal details the publisher never intended to share. And breach-data repositories show which employee credentials have appeared in past leaks, feeding credential-stuffing and convincing pretexts.

The defining property of OSINT is that all of this is gathered passively, without ever touching the target’s systems, so it is invisible to the target and forms the safe foundation of any engagement. It is also why the corresponding defense, digital footprint management, matters so much: an organization cannot stop the open web from being searched, but it can control what it and its employees place there. The next sections examine the specific techniques and tools that harvest these sources efficiently, starting with the search engine that indexes much of the open web.

A particularly potent OSINT source is breach and credential data. Past data breaches are aggregated into searchable services (the best known being Have I Been Pwned), and large credential dumps circulate in criminal and research channels. For the authorized tester, checking whether a target’s employee emails appear in known breaches reveals likely-reused passwords and feeds both password attacks (Chapter 9) and convincing pretexts (Chapter 4); for the defender, the same check drives forced password resets and multi-factor enforcement. Another rich vein is public code and infrastructure leakage: developers routinely publish source code, configuration, and sometimes secrets to public repositories such as GitHub, where automated scanners (and attackers) hunt for committed API keys, passwords, and internal hostnames. Specialized tools (truffleHog, gitleaks, and GitHub’s own secret scanning) find these exposures, and the lesson, revisited in the cloud section below, is that the modern attack surface extends well beyond an organization’s own servers into the third-party platforms its people use.
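The kind of pattern matching that secret scanners perform can be sketched in a few lines. This is only an illustration: the two patterns (the commonly documented AKIA prefix format of AWS access key IDs and a generic quoted password assignment) and the function name find_secrets are assumptions made for this example, not the rule sets that truffleHog or gitleaks actually ship.

```python
import re

# Illustrative patterns only; real scanners use far larger, tuned rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*[:=]\s*['\"]([^'\"]{4,})['\"]"),
}

def find_secrets(text):
    """Return (line_number, rule_name) pairs for lines that match a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits

# Hypothetical file contents; the key is the placeholder used in AWS documentation.
sample = '''db_host = "db.internal.example.com"
aws_key = "AKIAIOSFODNN7EXAMPLE"
password = "hunter2-demo"
'''

for lineno, rule in find_secrets(sample):
    print(f"line {lineno}: possible {rule}")
```

Reporting only the line number and rule name, rather than echoing the matched value, keeps the secret itself out of the tester's notes.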

Several specialized OSINT techniques deserve mention. Geolocation and image OSINT extracts location from geotagged photos (Chapter 4) and identifies places from visual cues in images, a skill central to many investigations. The Wayback Machine and other web archives preserve old versions of a site, often revealing pages, directories, and information the organization has since removed but never truly erased. Username enumeration across platforms (with tools such as Sherlock) maps an individual’s presence on dozens of sites from a single handle. And public records and data-broker sources aggregate personal and corporate information at scale. Each technique is passive and legal against public data, yet collectively they can assemble a startlingly complete picture of an organization and its people, which is exactly why the defensive counterpart, minimizing and monitoring one’s public footprint, matters so much.

7.4 Search-Engine Reconnaissance (“Google Dorking”)

Because search engines have already crawled and indexed much of the open web, they are the fastest OSINT tool available, provided one knows how to ask precise questions. Google dorking (also called Google hacking) uses advanced search operators to surface exposed files, login portals, and information that generic searches miss. The technique is passive, since it queries the search engine rather than the target, and it is remarkably powerful.

The key operators, which combine freely, include: site: to restrict results to a domain (for example site:example.com); filetype: to find a particular file type (filetype:xls or filetype:pdf, useful for spreadsheets and documents that should not be public); inurl: to match text in the URL (inurl:admin); intitle: to match text in a page title (intitle:"index of", which often reveals open directory listings); link: historically to find pages linking to a target; and the minus sign to exclude terms. Combinations are where the power lies: a query such as site:example.com filetype:pdf confidential hunts for confidential PDFs on a specific domain. Curated collections of high-yield queries are cataloged in the public Google Hacking Database (GHDB).
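Because the operators combine freely, a battery of queries is easy to generate programmatically. The sketch below only assembles query strings; the helper name build_dork and its parameters are hypothetical, and it does not submit anything to a search engine.

```python
def build_dork(site=None, filetype=None, inurl=None, intitle=None, terms=(), exclude=()):
    """Assemble a search query from the operators described above."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if intitle:
        # Quote multi-word titles so they are matched as an exact phrase.
        parts.append(f'intitle:"{intitle}"' if " " in intitle else f"intitle:{intitle}")
    parts.extend(terms)
    parts.extend(f"-{term}" for term in exclude)  # minus sign excludes a term
    return " ".join(parts)

print(build_dork(site="example.com", filetype="pdf", terms=["confidential"]))
print(build_dork(site="example.com", intitle="index of", exclude=["blog"]))
```

The first call reproduces the example query from the text, site:example.com filetype:pdf confidential.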

Knowledge Check

  1. Write a single search query that would find PowerPoint files exposed on the domain example.org.

  2. Is Google dorking passive or active reconnaissance, and why does that matter legally?

Answers: (1) site:example.org filetype:ppt (or filetype:pptx). (2) Passive: it queries the search engine, not the target’s systems, so it leaves no trace on the target and does not, by itself, require the target’s authorization, though acting on what is found may.

The same operators exist on other engines, and specialized “hacker search engines” such as Shodan and Censys index internet-connected devices and services rather than web pages, letting a tester find exposed servers, databases, industrial systems, and misconfigurations by banner, port, certificate, or software version. These device search engines blur into active-adjacent reconnaissance and are revisited in the tools section and the chapter’s Current News box.

Beyond the core operators, useful additions include intext: (match text in the body), ext: (a synonym for filetype:), and quotation marks for exact phrases; cache: was long used to view Google’s cached copy of a page, but Google has since retired that feature. The GHDB organizes thousands of vetted queries into categories such as “files containing passwords,” “sensitive directories,” “login portals,” and “vulnerable servers,” giving testers (and defenders auditing their own sites) a ready catalog of high-yield searches. A disciplined approach is to run a battery of these against the authorized target domain, triage anything sensitive, and feed confirmed findings into the footprint.

7.5 WHOIS and the Regional Internet Registries

Search engines reveal content; to map the infrastructure an organization owns, the tester turns to registration data. WHOIS is a protocol and service that queries registries and returns registration details for a domain or an IP address block: the network IP range, the domain’s ownership, the registrant address and phone number, and the authoritative DNS name servers. A whois client ships with, or is easily installed on, most Linux distributions (one can run whois example.com), while Windows users typically use a third-party tool or website.

Domain WHOIS reveals who registered a domain and which name servers it uses (though privacy services now mask many personal details). IP WHOIS is queried against the Regional Internet Registries (RIRs), the five bodies that allocate IP address space by region: ARIN (North America), RIPE NCC (Europe and the Middle East), APNIC (Asia-Pacific), LACNIC (Latin America), and AFRINIC (Africa). By looking up a target’s web-server IP address (found during DNS reconnaissance) at the appropriate registry, for example ARIN at arin.net, the tester can determine the organization’s allocated network range, which becomes the scope of later scanning. Establishing the legitimate IP ranges an organization owns is essential both to focus testing and to avoid straying outside the authorized scope onto third-party address space.
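Once the allocated ranges are known, checking every candidate address against them before any active step is a cheap safeguard against straying out of scope. The sketch below uses Python's standard ipaddress module; the ranges are reserved documentation networks standing in for a hypothetical authorized scope, and in_scope is an illustrative helper name.

```python
import ipaddress

# Hypothetical authorized scope, written with reserved documentation ranges.
AUTHORIZED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]

def in_scope(address, ranges=AUTHORIZED_RANGES):
    """Return True if the address falls inside any authorized network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ranges)

for candidate in ["192.0.2.15", "198.51.100.200", "203.0.113.9"]:
    verdict = "in scope" if in_scope(candidate) else "OUT OF SCOPE"
    print(f"{candidate}: {verdict}")
```

Note that 198.51.100.200 is rejected even though it looks close to an authorized range: a /25 covers only the lower half of that /24.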

Two refinements make registry reconnaissance more powerful. Historical WHOIS services preserve past registration records, which can reveal an organization’s previous providers, contacts, and infrastructure even after privacy masking is applied to current records. And Certificate Transparency (CT) logs, the public, append-only logs of publicly trusted TLS certificates (introduced in Chapter 2), are one of the most effective subdomain-discovery sources available: because every certificate names the hostnames it covers, searching CT logs (through services such as crt.sh) for a target domain often reveals dozens of subdomains, including internal-sounding hosts like dev, staging, and vpn, that the organization never intended to advertise. CT-log mining is passive, fast, and frequently more complete than brute-force subdomain guessing, which is why it has become a staple of both offensive recon and defensive attack-surface discovery.
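To make the CT-log step concrete, the following sketch extracts hostnames from a JSON export of certificate entries. It assumes each entry carries a name_value field holding one or more newline-separated hostnames, which is how crt.sh's JSON output is commonly described; the sample data is invented, and subdomains_from_ct is a hypothetical helper.

```python
import json

def subdomains_from_ct(json_text, domain):
    """Collect unique hostnames under `domain` from CT-log style JSON entries."""
    names = set()
    for entry in json.loads(json_text):
        # Assumed field: one or more hostnames separated by newlines.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # a wildcard certificate still reveals its parent name
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)

# Invented sample data in the assumed format.
sample = json.dumps([
    {"name_value": "www.example.com\nexample.com"},
    {"name_value": "*.dev.example.com"},
    {"name_value": "VPN.example.com"},
    {"name_value": "www.example.org"},
])

print(subdomains_from_ct(sample, "example.com"))
```

Filtering on the domain suffix matters because certificates frequently cover several unrelated names, and only those under the authorized domain belong in the footprint.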

It is worth distinguishing the two WHOIS layers clearly because they answer different questions. Domain WHOIS, queried against the domain registrar, answers “who registered this name and where does it point?”, returning registrant details (often masked by privacy services today), creation and expiry dates, and the authoritative name servers. IP WHOIS, queried against the appropriate Regional Internet Registry, answers “who owns this address block?”, returning the organization that holds the range, its size, and abuse contacts. A tester typically pivots between them: domain WHOIS yields name servers, DNS yields the web server’s address, and IP WHOIS at the RIR then yields the owning organization’s full network range, which scopes the active phase. Recognizing which layer to query for which fact is a small but constant practical skill.
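The WHOIS protocol itself is simple enough to sketch with the standard library: a client opens a TCP connection to port 43, sends the query followed by a carriage return and line feed, and reads until the server closes the connection. In the sketch below, whois_query and parse_whois are hypothetical helpers, the sample reply is invented rather than real registry output, and real replies vary in layout from registry to registry.

```python
import socket

def whois_query(query, server, port=43, timeout=10):
    """Send one WHOIS query and return the reply text (not called in this example)."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def parse_whois(text):
    """Group 'Key: value' lines into a dict of lists, skipping comment lines."""
    fields = {}
    for line in text.splitlines():
        if line.startswith(("%", "#")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            fields.setdefault(key, []).append(value)
    return fields

# Invented reply for illustration only.
sample_reply = """% illustrative reply, not real registry output
Domain Name: EXAMPLE.COM
Name Server: NS1.EXAMPLE.COM
Name Server: NS2.EXAMPLE.COM
"""

record = parse_whois(sample_reply)
print(record["Name Server"])
```

Storing values as lists preserves repeated keys such as the name servers, which are exactly the pivot the paragraph above describes.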

7.6 DNS Reconnaissance

The Domain Name System, introduced in Chapter 3, is one of the richest reconnaissance sources, because it must be at least partly public to function, yet it often reveals far more than intended. DNS reconnaissance (DNS enumeration) is the process of locating all available DNS information about a target, identifying internal and external DNS servers and looking up records that expose hostnames, addresses, mail servers, and sometimes internal naming schemes.

A tester should know the common DNS record types and what each leaks:

Record     Name                What it reveals
---------  ------------------  -------------------------------------------------------
A / AAAA   Address             Maps a hostname to an IPv4 / IPv6 address
NS         Name Server         The authoritative name servers for the zone
MX         Mail Exchange       The organization’s mail servers (useful for phishing)
CNAME      Canonical Name      Aliases, often exposing third-party services in use
SOA        Start of Authority  The primary name server and zone administration details
PTR        Pointer             Reverse mapping from an IP back to a hostname
TXT        Text                Arbitrary text; often SPF, DKIM, and verification tokens
SRV        Service             Hostname and port of servers for specific services

The basic lookup tools are nslookup (cross-platform) and the more powerful dig (Unix-like systems), along with frameworks such as dnsenum, dnsrecon, and fierce. The most prized misconfiguration is the unrestricted zone transfer (AXFR): a zone transfer is meant to replicate a zone between authoritative name servers, but if a server permits transfers to anyone, an attacker can request a complete copy of the zone’s records at once and so map every host the zone names. Zone-transfer abuse is one form of DNS harvesting, which also draws on WHOIS and traceroute data. The defense is straightforward: restrict zone transfers to authorized secondary servers and, more broadly, sign records with DNSSEC, which authenticates DNS answers to prevent spoofing (it does not hide records). Subdomain enumeration, that is, discovering hosts such as vpn, mail, dev, and admin under a domain, rounds out DNS recon and is done by brute-forcing names, scraping certificate-transparency logs, and querying passive-DNS databases. The code cell later in this section demonstrates basic, authorized DNS lookups.

Two further DNS techniques complete the picture. Reverse-DNS sweeps query the PTR records across a target’s IP range to discover hostnames, sometimes revealing naming schemes and forgotten hosts. Passive DNS databases (such as those offered by SecurityTrails and similar services) record historical name-to-address mappings observed across the internet, letting a tester see a domain’s past IP addresses, subdomains, and infrastructure changes without ever querying the target’s own servers, a powerful, fully passive complement to live lookups and Certificate Transparency mining.

# Chapter 7 -- Basic DNS reconnaissance (standard library; run only against authorized targets)
import socket

def dns_recon(domain):
    print(f"=== DNS recon for {domain} ===")
    # Forward lookup (A record)
    try:
        ips = socket.gethostbyname_ex(domain)
        print("A record(s):", ips[2])
    except socket.gaierror as e:
        print("A lookup failed:", e)
    # Reverse lookup (PTR) on the first IP
    try:
        ip = socket.gethostbyname(domain)
        host = socket.gethostbyaddr(ip)
        print("PTR (reverse):", host[0])
    except Exception as e:
        print("PTR lookup failed:", e)

dns_recon("example.com")

# With the dnspython library you can query specific record types, e.g.:
#   import dns.resolver
#   for rr in dns.resolver.resolve("example.com", "MX"):   print("MX:", rr.exchange, rr.preference)
#   for rr in dns.resolver.resolve("example.com", "TXT"):  print("TXT:", rr.strings)
# Command-line equivalents:
#   dig example.com MX +short
#   dig axfr @ns1.example.com example.com      # zone-transfer attempt (authorized testing only)
#   nslookup -type=MX example.com
print("\nUse dig/nslookup or dnspython for MX, NS, TXT, and zone-transfer attempts (with authorization).")
=== DNS recon for example.com ===
A record(s): ['104.20.23.154', '172.66.147.243']
PTR lookup failed: [Errno 4] No address associated with name

Use dig/nslookup or dnspython for MX, NS, TXT, and zone-transfer attempts (with authorization).
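The reverse-DNS sweep described earlier in this section starts from the list of PTR names to query, which the standard ipaddress module can generate for any range. This sketch only builds the names and sends no queries; ptr_names is a hypothetical helper, and the range is a reserved documentation network.

```python
import ipaddress

def ptr_names(cidr):
    """Return the in-addr.arpa (or ip6.arpa) name for each usable host in a range."""
    network = ipaddress.ip_network(cidr)
    return [ip.reverse_pointer for ip in network.hosts()]

# A /30 has two usable host addresses, so two PTR names are produced.
for name in ptr_names("192.0.2.0/30"):
    print(name)
```

Each name can then be resolved with dig -x, or with socket.gethostbyaddr as in the cell above, strictly within the authorized range.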

7.7 Email Harvesting, Metadata, and Social-Media Profiling

People are part of the attack surface, so reconnaissance also targets the human and document layers, which feed directly into the social engineering of Chapter 4. Email harvesting is the bulk gathering of email addresses associated with an organization, from its website, search results, breach data, and social media, which reveals the organization’s email-address format (such as first.last@example.com) and a roster of valid recipients for phishing. Document metadata analysis extracts the hidden information embedded in published files: tools such as exiftool and the metadata-focused FOCA can reveal author names, internal usernames, software and version numbers, file paths, and even printer or device names, all from documents an organization posted publicly without scrubbing. Social-media profiling correlates employees’ public posts to map roles, relationships, technologies, travel, and routines, the raw material for spear phishing and pretexting.
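Inferring an email-address format from harvested addresses is largely a counting exercise. The sketch below is a simplified illustration: the regular expression is a pragmatic approximation rather than a full address grammar, the three format labels are assumptions made for this example, and the page text is invented.

```python
import re
from collections import Counter

# Pragmatic approximation of an email address, not a complete grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def harvest_emails(text, domain):
    """Return the unique, lower-cased addresses at `domain` found in the text."""
    found = {match.lower() for match in EMAIL_RE.findall(text)}
    return sorted(addr for addr in found if addr.endswith("@" + domain))

def guess_format(emails):
    """Guess the dominant local-part shape among the harvested addresses."""
    def shape(local_part):
        if "." in local_part:
            return "first.last"
        if "_" in local_part:
            return "first_last"
        return "single-token"
    counts = Counter(shape(addr.split("@")[0]) for addr in emails)
    return counts.most_common(1)[0][0] if counts else None

# Invented page text for illustration.
page = ("Contact Jane.Doe@example.com or john.smith@example.com; "
        "press: media@example.com, spam@other.test")

emails = harvest_emails(page, "example.com")
print(emails)
print("Likely format:", guess_format(emails))
```

Role addresses such as media@ are counted separately from personal ones, which is why the majority shape, not the first address seen, drives the guess.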

From the defender’s perspective, each of these is a controllable exposure. Organizations reduce email harvesting by limiting unnecessary public addresses and training staff; they prevent metadata leakage by scrubbing documents before publication; and they manage social-media risk through clear policy and awareness, since, as Chapter 4 stressed, the personal information employees scatter online is the fuel for targeted attacks. The recurring lesson is that reconnaissance against people is defeated less by technology than by disciplined information hygiene.

7.8 Reconnaissance of Cloud and Modern Infrastructure

The classic sources above assume an organization runs its own servers, but most now depend on cloud platforms, software-as-a-service, and sprawling supply chains, and reconnaissance has expanded to match, so a modern treatment must cover this ground before listing tools. Cloud asset discovery looks for an organization’s footprint across providers: storage buckets (Amazon S3, Azure Blob, Google Cloud Storage), which are frequently left publicly readable and have leaked enormous volumes of data; cloud-hosted applications and APIs; and the IP ranges and certificates that tie cloud assets back to the organization. Tools enumerate likely bucket names and probe for public access, and search engines and CT logs reveal cloud endpoints. Source-code and CI/CD reconnaissance targets the development pipeline: public repositories, container registries, and package indexes can expose code, secrets, and internal architecture, as the breach-data discussion above noted.
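The name-enumeration half of bucket discovery can be illustrated without touching any provider. The sketch below only generates candidate names from an organization name and a few keywords; the keyword list, separators, and the helper bucket_candidates are assumptions made for this example, and any probing of the resulting names belongs inside the authorized scope and the provider's testing rules.

```python
from itertools import product

def bucket_candidates(org, keywords=("backup", "dev", "logs"), separators=("-", "")):
    """Generate plausible storage-bucket names; this does not contact any service."""
    org = org.lower()
    names = {org}
    for keyword, sep in product(keywords, separators):
        names.add(f"{org}{sep}{keyword}")   # e.g. example-backup, examplebackup
        names.add(f"{keyword}{sep}{org}")   # e.g. backup-example, backupexample
    return sorted(names)

candidates = bucket_candidates("Example")
print(len(candidates), "candidates, for instance:", candidates[:4])
```

Defenders can run the same generation against their own naming conventions to see which names an outsider would guess first.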

The broader point is that the attack surface is now distributed. An organization may have hardened its own perimeter while a forgotten public storage bucket, a developer’s committed API key, a misconfigured cloud function, or a third-party SaaS integration quietly exposes the same data. This is precisely why External Attack Surface Management (Section 7.12) treats discovery as continuous and provider-spanning rather than a one-time inventory of owned servers, and why supply-chain risk (Chapter 5) and cloud’s shared responsibility model are inseparable from reconnaissance. For the tester, cloud recon must stay strictly within the authorized scope and respect each provider’s testing rules; for the defender, it is often the fastest way to find the exposure an attacker will find next.

Knowledge Check

  1. Why are Certificate Transparency logs such an effective way to discover subdomains?

  2. Name two modern, non-server exposures that classic “scan the company’s IP range” reconnaissance would miss.

Answers: (1) Every issued TLS certificate is logged publicly with the hostnames it covers, so searching the logs reveals subdomains, including internal-sounding ones, without brute force or touching the target. (2) Publicly readable cloud storage buckets and secrets (API keys, passwords) committed to public code repositories; also misconfigured SaaS integrations or cloud functions.

7.9 The Intelligence Cycle and Organizing Findings

Reconnaissance is not just collection; it is intelligence production, and borrowing the classic intelligence cycle keeps the effort focused and useful. The cycle has five recurring stages. Planning and direction sets the questions the engagement must answer, driven by the scope and goals from Chapter 6. Collection gathers raw data from the OSINT, registry, DNS, and cloud sources of this chapter. Processing normalizes that raw data into a usable form, deduplicating hosts, resolving names to addresses, and organizing documents and people. Analysis turns processed data into intelligence by correlating it, linking an employee to a technology to an exposed host, and identifying the most promising avenues. Dissemination records the result in a structured footprint that drives scanning (Chapter 8) and feeds the eventual report.
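The processing stage is the easiest to show in code, since it is mostly normalization and deduplication. The sketch below assumes hostnames arrive in the inconsistent forms that different sources produce; normalize_host and process are hypothetical helpers, and the port-stripping step is deliberately simple and would not handle IPv6 literals.

```python
def normalize_host(raw):
    """Reduce a URL or hostname string to a bare lower-case hostname."""
    host = raw.strip().lower()
    if "://" in host:
        host = host.split("://", 1)[1]  # drop the scheme
    host = host.split("/", 1)[0]        # drop any path
    host = host.split(":", 1)[0]        # drop a port (not IPv6-safe)
    return host.rstrip(".")             # drop a trailing root dot

def process(raw_hosts):
    """Normalize and deduplicate raw collection output."""
    return sorted({normalize_host(item) for item in raw_hosts if item.strip()})

# Hypothetical raw output from several collection sources.
raw = [
    "WWW.Example.com",
    "https://www.example.com/login",
    "vpn.example.com.",
    "vpn.example.com:443",
    " dev.example.com ",
]
print(process(raw))
```

Five raw findings collapse to three hosts, which is the kind of reduction that keeps the later analysis and scanning stages focused.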

The practical payoff is organization. A reconnaissance effort that scatters findings across browser tabs and notes loses value, whereas one that consolidates them, often in a link-analysis tool such as Maltego or a structured notes repository, reveals relationships no single source showed. Good testers maintain a living target profile throughout the engagement, updating it as active phases discover more, so that the intelligence picture continually sharpens. This disciplined, cyclical approach is also what distinguishes professional reconnaissance from idle searching, and it mirrors the continuous, iterative nature of security work emphasized throughout this book.

Going Deeper (graduate/research): the attack surface as a formal object

Reconnaissance can be framed rigorously as attack-surface discovery. An organization’s external attack surface is the set of all reachable entry points, each with attributes (host, port, service, version, owner, exposure). Discovery is the problem of enumerating this set as completely as possible from outside, and it is fundamentally incomplete and adversarial: the defender and attacker race to enumerate the same surface, and assets appear and disappear continuously (ephemeral cloud instances, rotating certificates, new subdomains). This framing motivates continuous, automated External Attack Surface Management rather than point-in-time scans, and it connects to active research on internet-wide scanning (the data behind Shodan and Censys), graph-based asset correlation, and machine-learning approaches to prioritizing which exposed assets are most likely to be exploited. It also raises measurement questions (how do we quantify attack-surface size and reduction over time?) that tie reconnaissance back to the risk metrics of Chapter 5.
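Treating the attack surface as a set makes both its size and its change over time directly computable. The following sketch models an entry point with three of the attributes named above and compares two snapshots; the EntryPoint class, the diff_surface helper, and the snapshot data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen makes instances hashable, so they can live in sets
class EntryPoint:
    host: str
    port: int
    service: str

def diff_surface(previous, current):
    """Report which entry points appeared or disappeared between two snapshots."""
    return {"appeared": current - previous, "disappeared": previous - current}

# Two hypothetical discovery snapshots taken a few days apart.
monday = {
    EntryPoint("www.example.com", 443, "https"),
    EntryPoint("vpn.example.com", 443, "https"),
}
friday = {
    EntryPoint("www.example.com", 443, "https"),
    EntryPoint("staging.example.com", 8080, "http"),
}

changes = diff_surface(monday, friday)
print("Surface size:", len(monday), "->", len(friday))
print("Appeared:", changes["appeared"])
print("Disappeared:", changes["disappeared"])
```

The size is unchanged here while the composition is not, which suggests why a raw count alone is a weak measure of attack-surface reduction.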

7.10 Reconnaissance Tools

The sources above can be gathered by hand, but mature practice relies on tools that automate collection and correlation, so a tester should know the standard toolkit. theHarvester aggregates emails, subdomains, hosts, and employee names from many public sources in one command. Maltego is a graphical link-analysis tool that visually maps relationships among domains, people, email addresses, and infrastructure. Recon-ng and SpiderFoot are automation frameworks that run dozens of OSINT modules and consolidate the results. OWASP Amass (Appendix A) performs in-depth subdomain and attack-surface discovery across many data sources. The device search engines Shodan and Censys index internet-connected systems, letting a tester find a target’s exposed servers, databases, and misconfigurations by banner, port, certificate, or software version, an external view of the attack surface that the Current News box illustrates.

Several lightweight, purpose-built scripts complement these. The author’s open-source Links-Extractor extracts all internal and external links from a URL (useful for mapping a site’s structure and discovering endpoints), and SEO-Analysis gathers insights about a domain and a set of keywords; both are listed in Appendix F, and a tutorial on using the Google Search application programming interface for automated OSINT is referenced there. The unifying caution, repeated from Chapter 6, is that some of these techniques (particularly Shodan or Censys lookups acted upon, and any active query against the target) must stay within the authorized scope, and that all of them have a defensive counterpart in the external attack-surface management discussed in Section 7.12.

A few concrete examples show the tools in action (against authorized targets only). On Shodan, a query such as org:"Example Inc" port:3389 surfaces the organization’s exposed Remote Desktop hosts, and filters like ssl:"example.com", product:, and vuln: find systems by certificate, software, or known vulnerability. theHarvester is run simply as theHarvester -d example.com -b all. Amass enumerates subdomains with amass enum -d example.com. FOCA specializes in extracting metadata from an organization’s published documents. And Recon-ng runs modular workflows that pull from many sources and store results in a database for analysis. Mastery is less about memorizing flags than about knowing which tool answers which footprinting question and how to feed one tool’s output into the next.

# Chapter 7 -- Reconnaissance funnel: from broad public sources to a focused target list
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
fig, ax = plt.subplots(figsize=(7.5, 4.6))
stages = [
    ("Public internet (everything)", 0.0, "#d6eaf8"),
    ("OSINT: domains, people, tech, docs", 0.16, "#aed6f1"),
    ("WHOIS / RIR: owned IP ranges", 0.34, "#7fb3d5"),
    ("DNS: hosts, mail, subdomains", 0.52, "#5499c7"),
    ("Validated target list -> scanning (Ch.8)", 0.72, "#2e86c1"),
]
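n = len(stages)
top = 10.0
half_width = 2.5  # the axes span x = 0..5, so each side can be inset by at most 2.5
for i, (label, inset, color) in enumerate(stages):
    # vertical extent of this band
    y1 = top - i * (top / n)
    y2 = top - (i + 1) * (top / n)
    # top edge: scale the inset by the half-width so the left and right edges never cross
    x1 = inset * half_width
    x2 = 5 - inset * half_width
    # bottom edge matches the next band's top edge; the last band tapers to a width of 1
    nx1 = (stages[i + 1][1] * half_width) if i + 1 < n else 2.0
    nx2 = (5 - stages[i + 1][1] * half_width) if i + 1 < n else 3.0
    ax.add_patch(Polygon([(x1, y1), (x2, y1), (nx2, y2), (nx1, y2)], facecolor=color, edgecolor="white"))
    ax.text(2.5, (y1 + y2) / 2, label, ha="center", va="center", fontsize=9)
ax.set_xlim(0, 5)
ax.set_ylim(0, 10)
ax.axis("off")
ax.set_title("The Reconnaissance Funnel (passive -> active -> targets)", fontsize=12, fontweight="bold")
plt.tight_layout()
plt.savefig("ch07_recon_funnel.png", dpi=130, bbox_inches="tight")
plt.close()
print("Saved ch07_recon_funnel.png")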
Reconnaissance funnel narrowing from the public internet through OSINT, WHOIS, and DNS to a target list

7.11 Passive Fingerprinting#

Before active scanning, an attacker can often infer a target’s systems merely by observing traffic, which is the most passive form of host identification. Passive operating-system fingerprinting tools such as p0f analyze the subtle characteristics of TCP/IP packets (initial window sizes, time-to-live values, and option orderings) to guess the operating system, browser, and even the network media behind a connection, from as little as a single SYN packet and without sending anything to the target. The network security monitor Zeek (formerly Bro), part of the Security Onion distribution and discussed further in Chapter 12, produces rich logs of what software, versions, and services are present on a network by watching its traffic. Because these techniques only observe, they leave no trace on the target, which is exactly why defenders use the same tools to inventory their own networks. Passive fingerprinting marks the boundary between passive reconnaissance and the active scanning of Chapter 8, where the tester begins to send packets to the target directly.

The value of passive fingerprinting is that it provides a first, risk-free estimate of the target environment that focuses later active work. If passive observation suggests a particular operating system or web stack, the tester can prioritize the relevant probes and exploits in Chapter 8 and Chapter 9 rather than testing blindly. The same data, in the defender’s hands, populates an asset inventory and a baseline of normal traffic against which the anomalies of Chapter 12 stand out. Passive fingerprinting is therefore both the quietest reconnaissance technique and a routine defensive monitoring practice, differing only in who is watching and why.
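To make the idea concrete, the sketch below applies the simplest of these signals, the time-to-live value, to packets that have already been observed. It is a deliberately reduced illustration and not how p0f works internally: the hint table is a hypothetical mapping of commonly cited default values, real systems can be configured differently, and a serious fingerprint combines TTL with window size and option ordering.

```python
# Simplified passive OS hint from an observed TTL (illustrative table, not p0f's database).
COMMON_INITIAL_TTLS = (32, 64, 128, 255)

# Hypothetical hint table based on commonly cited defaults; many hosts deviate.
TTL_HINTS = {
    32: "older or embedded stack",
    64: "Linux/Unix-like or macOS (typical default)",
    128: "Windows (typical default)",
    255: "network device or some Unix variants",
}

def guess_initial_ttl(observed_ttl: int) -> int:
    """Round an observed TTL up to the nearest common initial value."""
    if not 0 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 0 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial
    return 255  # unreachable because of the range check; kept for clarity

def describe(observed_ttl: int) -> str:
    """Summarize the inferred initial TTL, approximate hop count, and OS hint."""
    initial = guess_initial_ttl(observed_ttl)
    hops = initial - observed_ttl
    return f"observed={observed_ttl} initial~{initial} hops~{hops} hint={TTL_HINTS[initial]}"

# TTL values as they might appear in captured packets (invented examples)
for ttl in (57, 116, 243):
    print(describe(ttl))
```

Because each router on the path decrements the TTL, the gap between the observed value and the inferred initial value also gives a rough hop count, which is why the function reports both.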

7.12 Defending Against Reconnaissance#

Every source in this chapter has a defensive mirror image, and a mature security program manages its own exposure as deliberately as an attacker probes it. The umbrella discipline is External Attack Surface Management (EASM): continuously discovering and monitoring everything the organization exposes to the internet (domains, IP ranges, certificates, services, and cloud assets), using the very tools (Shodan, Censys, Amass, and OSINT frameworks) that attackers use, so that defenders find the forgotten server or misconfigured database first.

Concrete defenses follow the chapter’s structure. Against OSINT, organizations practice digital-footprint management: minimizing unnecessary public information, scrubbing document metadata before publication, and training employees on safe social-media use and on what job postings should not reveal. Against search-engine exposure, they audit their sites with the same dorks attackers use, remove sensitive indexed files, and use robots.txt and access controls appropriately (recognizing that robots.txt hides nothing from a determined searcher). Against WHOIS and registry exposure, they use registration privacy where appropriate. Against DNS reconnaissance, they restrict zone transfers to authorized servers, avoid leaking internal naming in public DNS, and deploy DNSSEC, which protects the integrity of answers rather than hiding records and should use NSEC3 so that signing the zone does not make it trivially walkable. Against email harvesting and profiling, they limit exposed addresses and run awareness programs. The unifying principle is that an organization cannot prevent the open web from being searched, but it can control what it places there and continuously watch what it exposes, shrinking its own footprint at the source. This sets up the next chapter, where, having mapped the target passively, the tester turns to active scanning to confirm which hosts and services are actually live.

A more proactive defensive measure is deception. Defenders can plant honeytokens (fake credentials, documents, or DNS records) and canary tokens that silently alert when an attacker accesses them, so the very act of reconnaissance against a decoy reveals the intruder. Combined with monitoring of who queries the organization’s DNS, who triggers many reverse lookups, and what appears in Certificate Transparency logs and breach dumps, deception turns reconnaissance from a purely one-sided advantage into an early-warning opportunity for the defender, complementing the honeypots developed in Chapter 12.
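The honeytoken idea can be sketched in a few lines: create a decoy identifier that no legitimate user should ever touch, then treat any appearance of it in a log as an alert. The account-name prefix, the log format, and the function names below are hypothetical, and a production deployment would more likely rely on a dedicated canary-token service or detection rules in a SIEM.

```python
import secrets

def make_honeytoken(prefix: str = "svc-backup") -> str:
    """Create a unique decoy account name that no legitimate user should ever use."""
    return f"{prefix}-{secrets.token_hex(4)}"

def find_token_hits(token: str, log_lines: list[str]) -> list[str]:
    """Return the log lines that mention the decoy; any hit deserves investigation."""
    return [line for line in log_lines if token in line]

token = make_honeytoken()

# Invented authentication log; the second line simulates an attacker trying the decoy
sample_log = [
    "2025-01-10T09:14:02Z login ok user=alice src=198.51.100.7",
    f"2025-01-10T09:15:40Z login failed user={token} src=203.0.113.50",
    "2025-01-10T09:16:11Z login ok user=bob src=198.51.100.9",
]

hits = find_token_hits(token, sample_log)
for line in hits:
    print("ALERT:", line)
```

The random suffix matters: a decoy that could collide with a real account name would generate false alarms, while a unique one makes every hit a high-confidence signal.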

Defenders should also recognize that some reconnaissance is effectively impossible to prevent, only to manage. The organization cannot stop search engines from indexing public pages, registries from publishing allocation data, or Certificate Transparency logs from listing certificates; what it can do is ensure that nothing sensitive is in those places, that its public footprint is deliberate, and that it watches the same sources attackers do so it learns of exposure first. This reframes defense from the impossible goal of “hiding” to the achievable one of minimizing and monitoring the attack surface, the essence of External Attack Surface Management. A practical program assigns ownership of this continuous discovery, integrates it with the vulnerability- and risk-management processes of Chapter 5, and treats every newly discovered exposure as an incident to triage, closing the loop between how attackers see the organization and how the organization defends itself.
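Treating every newly discovered exposure as something to triage implies comparing what is visible today with what was visible before. The sketch below shows one minimal way to do that, assuming each discovery run yields a set of (IP, port, service) tuples; the data is invented and the function name is hypothetical.

```python
# Offline sketch: compare two exposure snapshots (invented data, no network calls).
# Each exposure is an (ip, port, service) tuple.
yesterday = {
    ("203.0.113.10", 443, "https"),
    ("203.0.113.11", 25, "smtp"),
    ("203.0.113.14", 21, "ftp"),
}
today = {
    ("203.0.113.10", 443, "https"),
    ("203.0.113.11", 25, "smtp"),
    ("203.0.113.12", 9200, "elasticsearch"),
}

def diff_exposures(previous, current):
    """Return (newly exposed, no longer exposed) as sorted lists."""
    return sorted(current - previous), sorted(previous - current)

new_items, closed_items = diff_exposures(yesterday, today)
for ip, port, service in new_items:
    print(f"TRIAGE  new exposure: {ip}:{port} ({service})")
for ip, port, service in closed_items:
    print(f"VERIFY  no longer seen: {ip}:{port} ({service})")
```

Set difference keeps the comparison cheap enough to run after every discovery cycle, and the closed list is worth a look too, since a service that vanished may have moved rather than been retired.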

7.13 AI-Assisted Reconnaissance and Modern WHOIS#

Two shifts have reshaped reconnaissance since the classic techniques above. The first is the use of artificial intelligence to scale and sharpen open-source intelligence. Large language models can extract entities such as names, roles, email patterns, and technologies from unstructured text like job postings, conference talks, and code repositories, summarize a target’s footprint, and draft tailored pretexts, compressing work that once took an analyst hours. Commercial exposure-management and external attack surface platforms automate continuous discovery of an organization’s internet-facing assets, certificates, and leaked credentials, giving both defenders and attackers a near-real-time map. The same tools that help a security team find forgotten assets help an adversary find soft targets, so defenders should run this discovery against themselves first.

The second shift is in domain registration data. Classic WHOIS returned rich registrant details, but since the GDPR took effect in 2018 most registrars redact personal information from public WHOIS, so contact names and emails are often replaced with privacy-proxy values. At the same time the protocol itself is being replaced: the Registration Data Access Protocol (RDAP) supersedes WHOIS with structured JSON output, standardized queries, internationalization, and tiered access that can give authorized parties such as law enforcement more detail than the public sees. For the analyst this means relying less on registrant names and more on technical artifacts such as name servers, certificate transparency logs, passive DNS, and historical records to connect infrastructure.
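Because RDAP returns structured JSON, the technical artifacts mentioned above can be pulled out with ordinary parsing instead of the screen-scraping that free-text WHOIS required. The sketch below works offline on an invented, heavily trimmed record; the field names (ldhName, nameservers, events, eventAction, eventDate) follow the RDAP JSON response format as commonly documented, but real responses are larger and vary by registry, so treat this as an assumption to verify against actual output.

```python
import json

# Invented, trimmed RDAP-style domain record (no network call is made)
sample_rdap = """{
  "objectClassName": "domain",
  "ldhName": "example.com",
  "status": ["client transfer prohibited"],
  "nameservers": [
    {"objectClassName": "nameserver", "ldhName": "NS1.EXAMPLE.NET"},
    {"objectClassName": "nameserver", "ldhName": "NS2.EXAMPLE.NET"}
  ],
  "events": [
    {"eventAction": "registration", "eventDate": "2010-05-01T00:00:00Z"},
    {"eventAction": "expiration", "eventDate": "2030-05-01T00:00:00Z"}
  ]
}"""

def summarize_rdap(text: str) -> dict:
    """Extract the infrastructure-linking fields from an RDAP domain record."""
    record = json.loads(text)
    nameservers = sorted(
        ns["ldhName"].lower()
        for ns in record.get("nameservers", [])
        if "ldhName" in ns
    )
    events = {
        e["eventAction"]: e["eventDate"]
        for e in record.get("events", [])
        if "eventAction" in e and "eventDate" in e
    }
    return {
        "domain": record.get("ldhName", "").lower(),
        "status": record.get("status", []),
        "nameservers": nameservers,
        "events": events,
    }

summary = summarize_rdap(sample_rdap)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Shared name servers and closely spaced registration dates are the kind of artifact that can link domains to one another when registrant names are redacted.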

Lab: Reading Exposure Data the Way Shodan and Censys Present It#

Internet-wide scanners such as Shodan and Censys index banners and metadata from exposed services so an analyst can find, for example, every host of an organization that exposes a database or an outdated web server. Run real queries only against assets you own or are authorized to assess. The cell below works offline on a sample of Shodan-style host records to practice turning raw exposure data into a prioritized list of findings.

# Offline practice: triage Shodan/Censys-style host records (no network calls)
hosts = [
    {"ip": "203.0.113.10", "port": 3389, "product": "Microsoft RDP", "transport": "tcp",
     "tls": False, "tags": ["remote-access"]},
    {"ip": "203.0.113.11", "port": 443,  "product": "nginx 1.18", "transport": "tcp",
     "tls": True,  "tags": ["web"]},
    {"ip": "203.0.113.12", "port": 9200, "product": "Elasticsearch", "transport": "tcp",
     "tls": False, "tags": ["database", "no-auth"]},
    {"ip": "203.0.113.13", "port": 22,   "product": "OpenSSH 7.4", "transport": "tcp",
     "tls": False, "tags": ["remote-access"]},
]

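# simple risk heuristic: exposed datastores and remote access without TLS rank highest
# (the "tls" flag is a crude proxy: SSH is encrypted without TLS, so read its score as a prompt to review)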
def risk(h):
    score = 0
    if "database" in h["tags"]:            score += 5
    if "no-auth" in h["tags"]:             score += 4
    if "remote-access" in h["tags"] and not h["tls"]: score += 3
    if h["port"] in (3389, 23, 21):        score += 2
    return score

for h in sorted(hosts, key=risk, reverse=True):
    print(f"{h['ip']}:{h['port']:<5} {h['product']:<16} risk={risk(h)}  tags={','.join(h['tags'])}")

Chapter Summary#

This chapter opened the technical phases of offensive security with reconnaissance, the systematic gathering of information about a target before any attack. The central distinction is between passive reconnaissance, which never touches the target and is built on open-source intelligence (OSINT) from websites, social media, job postings, financial filings, document metadata, and breach data, and active reconnaissance, which queries the target directly and so requires authorization. The goal is footprinting: a complete profile of the target’s infrastructure, technologies, and people. Search-engine reconnaissance (“Google dorking”) with operators like site:, filetype:, inurl:, and intitle: surfaces exposed files, while device search engines (Shodan, Censys) index exposed systems. WHOIS and the Regional Internet Registries reveal owned domains and IP ranges, and DNS reconnaissance, through record types, nslookup/dig, zone-transfer attempts, and subdomain enumeration, maps hosts and services. Email harvesting, metadata analysis, and social-media profiling target the human layer that feeds social engineering. Standard tools (theHarvester, Maltego, Recon-ng, SpiderFoot, Amass) automate collection, and passive fingerprinting (p0f, Zeek) identifies systems without sending a packet. Every source has a defensive counterpart under External Attack Surface Management. With the target mapped, the next chapter turns to active scanning and enumeration, sending packets to confirm which hosts and services are live and what vulnerabilities they expose.

Why This Matters#

Reconnaissance is where engagements are won or lost. A tester who invests in thorough, mostly passive intelligence gathering arrives at the active phases with a precise map (the target’s domains and IP ranges, its technologies and versions, its people and their exposure, and its inadvertently published secrets), while one who rushes ahead wastes the engagement’s limited time scanning blindly. The same skills serve the defender in reverse: understanding exactly how attackers profile an organization is the prerequisite for shrinking that profile. And because so much reconnaissance is passive and lawful, drawing only on public sources, it is also the phase where the line between everyday research and an engagement is thinnest, which is precisely why the authorization and scope discipline of Chapter 6 must travel with the tester into every technique here.

News in Focus: Mass Scraping of Public Profiles (2021)#

A vivid illustration of open-source intelligence at scale came in 2021, when datasets containing the information of hundreds of millions of LinkedIn users, reportedly on the order of 700 million, were advertised for sale online. According to public reporting and LinkedIn’s own statements, the data was not the result of a system breach but was scraped from publicly visible profiles and combined with other sources, exactly the kind of aggregation this chapter describes. The exposed fields (names, email addresses, phone numbers, job titles, employers, and locations) are precisely the raw material for large-scale spear phishing and pretexting (Chapter 4).

Through this chapter’s lens, the episode is instructive on two counts. First, it shows that public does not mean harmless: information each user chose to share became, in aggregate, a powerful targeting database, demonstrating why OSINT is the foundation of modern attacks. Second, it highlights the limits of the “breach versus public data” distinction, since the harm is similar whether data is stolen or scraped, which is one reason data-protection regulators have taken increasing interest in scraping. For defenders and individuals, the lesson is the chapter’s central defensive theme: manage your digital footprint deliberately, because what is posted publicly can and will be collected and correlated. (Figures and characterizations per public reporting and the platform’s statements.)

Finding Exposed Devices with Shodan, and Defending Them#

Search engines for devices rather than web pages reveal Internet-connected systems that were never meant to be public. Shodan (shodan.io) continuously scans the Internet and indexes service banners, so queries can surface exposed databases, industrial controllers, and especially IP cameras and other Internet-of-Things devices. Comparable tools include Censys and ZoomEye, while GreyNoise offers a complementary view by cataloguing the hosts that scan the Internet rather than the devices exposed on it.

Typical authorized, passive queries that surface camera feeds and exposed devices use filters such as port:554 (the Real-Time Streaming Protocol, RTSP, port), product or banner strings for common camera firmware (for example Hikvision, webcamXP, or Network Camera), has_screenshot:true, and country:, city:, or org: to scope results to assets you are authorized to assess. Shodan’s command line (shodan search ..., shodan host <ip>) returns the same data as the web interface, and Censys and ZoomEye offer equivalent filters. Because the results come from banners the device itself advertises, this is passive reconnaissance: you learn what a device tells the world without sending traffic to the target yourself.

That same view is the defender’s checklist. To keep cameras and IoT devices out of these results: never expose camera, RTSP, or management ports (554, 80, 8080, 37777, and similar) directly to the Internet; reach devices only through a VPN or a zero-trust broker (Chapter 11); change default credentials and disable default accounts immediately, since default-password exposure is the single most common cause of public-camera incidents; keep firmware patched; segment cameras onto their own VLAN (virtual local area network) away from production (Chapter 11); disable UPnP (Universal Plug and Play) on the gateway so devices cannot auto-open ports; and periodically search Shodan and Censys for your own IP ranges and organization name to find exposures before attackers do.
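The last item on that checklist, searching for your own exposure, reduces to comparing exposure records against the ports that should never be reachable. The sketch below assumes the records have already been exported from an authorized search of your own ranges; the camera inventory, the record layout, and the function name are hypothetical, and the port set is taken from the list above.

```python
# Offline sketch: flag camera-related ports that appear in your own exposure records.
# Ports come from the checklist above; inventory and records are invented examples.
CAMERA_PORTS = {554, 80, 8080, 37777}

camera_ips = {"203.0.113.21", "203.0.113.22"}  # hypothetical camera inventory

exposures = [
    {"ip": "203.0.113.21", "port": 554},
    {"ip": "203.0.113.22", "port": 37777},
    {"ip": "203.0.113.30", "port": 80},   # a public web server, not a camera
    {"ip": "203.0.113.21", "port": 22},   # exposed, but outside this particular check
]

def camera_findings(records, cameras, ports):
    """Return records where a known camera exposes a camera or management port."""
    return [r for r in records if r["ip"] in cameras and r["port"] in ports]

findings = camera_findings(exposures, camera_ips, CAMERA_PORTS)
for f in findings:
    print(f"REMEDIATE {f['ip']}:{f['port']} is reachable from the Internet")
```

Scoping the check to a device inventory avoids flagging every legitimate web server on port 80, though it also means the check is only as good as the inventory behind it.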

Current News: the internet is a searchable attack surface (2025)

Device search engines have made external reconnaissance nearly effortless, and 2025 reporting underscored the scale of accidental exposure. Researchers using Shodan and Censys repeatedly found large numbers of internet-facing systems left open by misconfiguration, including thousands of servers exposing configuration files, database backups, and possible credentials, and, in one widely reported case, more than a thousand publicly accessible artificial-intelligence “agent” instances, many requiring no authentication at all and exposing application keys, conversation histories, and in some cases shell access. The lesson for both attacker and defender is the same one this chapter teaches: a great deal of an organization’s attack surface is discoverable passively, without ever sending a packet to the organization itself, simply by querying systems that have already indexed the internet. This is why external attack-surface management, continuously searching for your own exposure before someone else does, has become a core defensive practice. (Figures and incidents per security-vendor and press reporting; specifics evolve, so consult current sources.)

Review Questions (MCQ)#

Q1. Which best distinguishes passive from active reconnaissance?

  A. Passive is illegal
  B. Passive does not interact with the target’s systems
  C. Active uses only Google
  D. There is no difference

Q2. OSINT stands for:

  A. Open Systems Internet
  B. Open-Source Intelligence
  C. Operational Security Intel
  D. Online Scanning Interface

Q3. The Google operator that restricts results to one website is:

  A. filetype:
  B. intitle:
  C. site:
  D. inurl:

Q4. Which database queries return an organization’s allocated IP network range?

  A. EDGAR
  B. WHOIS / the RIRs
  C. MX records
  D. robots.txt

Q5. An unrestricted DNS zone transfer (AXFR) is dangerous because it:

  A. Encrypts traffic
  B. Lets an attacker copy all of a zone’s records at once
  C. Blocks scanning
  D. Hides hosts

Q6. Which DNS record identifies an organization’s mail servers?

  A. A
  B. CNAME
  C. MX
  D. PTR

Q7. Shodan and Censys are best described as:

  A. Password crackers
  B. Search engines for internet-connected devices and services
  C. Firewalls
  D. DNS servers

Q8. Job postings are valuable to an attacker because they:

  A. Contain passwords
  B. Reveal the internal technology stack and skills in use
  C. Are encrypted
  D. Block recon

Q9. Document metadata can leak all of the following EXCEPT:

  A. Author and usernames
  B. Software versions and file paths
  C. The document’s live database password by default
  D. Internal device names

Q10. “Google dorking” is which kind of reconnaissance?

  A. Active
  B. Passive
  C. Physical
  D. Internal

Q11. A primary defense against DNS reconnaissance is to:

  A. Disable DNS entirely
  B. Restrict zone transfers to authorized servers and deploy DNSSEC
  C. Publish all records
  D. Use HTTP

Q12. theHarvester is used to:

  A. Crack hashes
  B. Aggregate emails, subdomains, and hosts from public sources
  C. Exploit buffers
  D. Sniff Wi-Fi

Q13. p0f performs:

  A. Active port scanning
  B. Passive OS fingerprinting from observed packets
  C. Password spraying
  D. Zone transfers

Q14. External Attack Surface Management (EASM) is:

  A. An attacker-only technique
  B. Continuously discovering and monitoring one’s own internet exposure
  C. A type of malware
  D. A firewall rule

Q15. The professional ordering of reconnaissance is to:

  A. Scan first, then read
  B. Exhaust passive sources before active techniques
  C. Skip recon
  D. Only use Shodan


Answer Key#

1: B 2: B 3: C 4: B 5: B 6: C 7: B 8: B 9: C 10: B 11: B 12: B 13: B 14: B 15: B

Lab Assignment#

Lab 7.1 (beginner) - Footprint yourself. Perform passive OSINT on your own name and a domain you own: search engines, social media, and a WHOIS lookup. Document what an attacker could learn, and list three concrete steps to reduce your exposure.

Lab 7.2 (beginner/intermediate) - Google dorking (authorized). On a domain you own or are authorized to test, construct five dork queries using site:, filetype:, inurl:, and intitle: to find exposed documents or directories. Record findings and remediation. Do not act on results for domains you do not control.

Lab 7.3 (intermediate) - DNS and registry mapping. For an authorized target, use whois, dig/nslookup (A, MX, NS, TXT), and a subdomain-enumeration tool to build a map of hosts and mail servers. Attempt a zone transfer against the authoritative servers and report whether it is (correctly) refused.

Lab 7.4 (advanced/research) - Build a recon pipeline. Using theHarvester or Recon-ng (or the author’s Links-Extractor and SEO-Analysis from Appendix F), assemble an automated passive-recon report for an authorized target, then write the matching defensive plan: what EASM tooling and policies would detect and shrink each exposure you found.

References#

  1. EC-Council. Certified Ethical Hacker (CEH) v13, Domain 2: Reconnaissance Techniques.

  2. National Institute of Standards and Technology. Technical Guide to Information Security Testing and Assessment, NIST SP 800-115, 2008.

  3. OSINT Framework. https://osintframework.com

  4. Exploit Database. Google Hacking Database (GHDB). https://www.exploit-db.com/google-hacking-database

  5. American Registry for Internet Numbers (ARIN) and the other RIRs (RIPE NCC, APNIC, LACNIC, AFRINIC).

  6. Shodan (https://www.shodan.io) and Censys (https://censys.io) device search engines.

  7. OWASP. Amass Project (attack-surface mapping).

Related work by the author (see Appendices E and F):

  • Companion tools: Links-Extractor (extract internal/external links) and SEO-Analysis (domain and keyword insights), with a tutorial on the Google Search API for OSINT. (See Appendix F.)