What one may find in robots.txt

17 May 2015

Introduction

During the reconnaissance stage of a web application test, the tester (or attacker) usually brute-forces the server with a list of known subdirectories to find hidden resources, such as the lists shipped with Skipfish or wfuzz. If any extra contextual information is available, it is added to the list. Most importantly, once a test is concluded (or command execution has been achieved) and a directory listing or server configuration is disclosed, the list is updated with any missing entries.

I recently lost my personal list and decided to rebuild one from scratch. Such a list is not timeless: depending on the uptake of certain web technologies, it needs to be refreshed on a regular basis. I decided to base mine on one of the most valuable assets websites expose: robots.txt. During this exploration, I came across some interesting discoveries.

Disclaimer: This whole exercise has been performed using the internet's publicly accessible resources. At worst, only internet etiquette has been breached.

robots.txt

These files are usually placed at the root of the web server and indicate which parts of the site web crawlers (e.g., Google's robots) are allowed to index. They are meant to be parsed by machines but are human readable. For instance:

User-agent: *
Disallow: /admin/
Disallow: /stats/
Disallow: /internaljobs/
Disallow: /internaljobsbyorganization/
Disallow: /internaljobsearch/

As you can see, the Disallow directive gives an attacker precious knowledge of what may be worth looking at. Additionally, a path that is disallowed on one site is worth checking on others.

Crawling

In order to have a relevant set of robots.txt files, it is necessary to start with a list of hostnames, ideally hostnames that cover a large part of the internet rather than an arbitrary subset (the opposite of a focused crawler).

Most commercial search engines no longer offer free-of-charge solutions. Fortunately, the Common Crawl project provides recent crawling results via S3.

The crawl is split into 1 GB archives (4 GB after decompression); for the February 2015 crawl, there are more than 33,000 of them. I wrote a script which takes one of these decompressed WARC files and extracts any hostname found:

require 'set'

URL_PATTERN = %r{
    (?:(?:https?)://)
    (?:
       (?:(?:[a-z0-9][a-z0-9\-]*)?[a-z0-9]+)
       (?:\.(?:[a-z0-9\-])*[a-z0-9]+)*
       (?:\.(?:[a-z]{2,}))
    )
    (?::\d{1,5})?
}xin

# TLD whitelist (the IANA tlds-alpha-by-domain.txt list)
tlds = File.open("tlds-alpha-by-domain.txt").each_line.collect {|w| w.chomp.downcase}

s = Set.new

begin
  # Scan the decompressed WARC file line by line for URL matches
  File.open(ARGV[0], 'rb').each_line do |line|
    s.merge(line.scan(URL_PATTERN).select {|f|
        # Keep only matches whose TLD is in the whitelist (ignore an optional port)
        h, _, tail = f.rpartition(":")
        w = (h.include?"/") ? h : tail
        _, _, tld = w.rpartition(".")
        tlds.include? tld.downcase
      })
  end
rescue Interrupt
end
puts s.to_a.join("\n")

This gives us a list of hostnames (the TLD is matched against a whitelist to avoid false positives). I then used that list with Burst to download the robots.txt files:

import socket
import random
import threading

from burst.all import *
from burst.exception import BurstException
from burst.utils import chunks

nthreads = 4
slice_sz = 200
switch_session('hosts')

hosts = open("hosts").read().splitlines()
random.shuffle(hosts)

def worker(s):
  for h in s:
    r = create(h + "/robots.txt")
    # Skip hosts we have already visited: the else branch only runs
    # when no previous request matched this hostname
    for hr in history:
      if hr.hostname == r.hostname:
        break
    else:
      try:
        r()
      except (BurstException, socket.error):
        pass

for ch in chunks(hosts, slice_sz):
  slices = chunks(ch, nthreads)
  jobs = []
  for s in slices:
    t = threading.Thread(target=worker, kwargs={"s": s})
    jobs.append(t)
    t.start()
  for j in jobs:
    # join with a timeout so the main thread stays responsive to Ctrl-C
    while j.is_alive():
      j.join(1)
  # Persist the session after each chunk so results can be reviewed early
  save()

The full set is split into chunks and the session is saved once a chunk is done. With this, it is possible to start reviewing the results while the crawling is still in progress.

By hand, using the Burst interface, it is possible to look up specific patterns:

$ burst -rs hosts
hosts >>> history.responded().filter(lambda x: "Disallow: /admin/" in x.response.content)
{200:1563 | get.kim, rockville.wusa9.com, mars.nasa.gov, ... }

Some statistics from this crawl: of the 59,558 sites crawled, 59,436 sent us a response. This is a good indicator of the freshness of the Common Crawl results. Of these, 37,431 responded with an HTTP status of 200, and 35,376 of those returned something that looked like a proper robots.txt (i.e., matching at least one standard directive).
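For reference, these numbers can be pulled out of the saved session with the same history, responded() and filter() calls used above. Here is a minimal sketch; the directive check is a simplified stand-in for the full matching done in the next section, and counting by iterating the filtered sets is my own assumption.

import socket

from burst.all import *

switch_session('hosts')

# Hosts that answered at all
n_responded = sum(1 for _ in history.responded())

# Of these, the HTTP 200 responses
n_ok = sum(1 for _ in history.responded().filter(
    lambda x: x.response.status == "200"))

# Rough "looks like a robots.txt" check: at least one common directive
DIRECTIVES = ("user-agent", "disallow", "allow", "sitemap")
n_proper = sum(1 for _ in history.responded().filter(
    lambda x: x.response.status == "200" and
              any(d in x.response.content.lower() for d in DIRECTIVES)))

print "responded: %d, status 200: %d, robots-like: %d" % (
        n_responded, n_ok, n_proper)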

Unexpected directives

The robots.txt specification is rather small. Let's exclude the known directives and see what remains:

import re
from collections import defaultdict

from burst.all import *  # switch_session() and history, as in the crawler above

RE_COMMENT = re.compile(r'(^|\n|\r)\s*#', re.I)
headers = [r'Disallow', "Allow", "User-Agent", "Noindex", "Crawl-delay",
           "Sitemap", "Request-Rate", "Host"]
all_reg = [ re.compile(r'(^|\n|\r)[ \t]*' + h + r'[ \t]*:?', re.I) for h in headers]

switch_session('hosts')

krs = history.responded().filter(
        lambda x: x.response.status == "200" and
                  len(x.response.content) > 0)

words = defaultdict(int)
for r in krs:
  # At least one directive recognised (avoid html answers)
  for reg in all_reg:
    if reg.findall(r.response.content):
      break
  else:
    continue
  # Dump anything not matching an expected regex
  for l in r.response.content.splitlines():
    for reg in all_reg + [RE_COMMENT,]:
      if reg.findall(l):
        break
    else:
      words[l] += 1

print "\n".join(
        [ str(v) + ":" + k for k,v in
            sorted(words.items(), key=lambda x: x[1], reverse=True)
        ]
      )

Here is an excerpt of the output:

35:ACAP-crawler: *
34:ACAP-disallow-crawl: /public/search/
32:ACAP-disallow-crawl: /article_email/*
11:Clean-param: filter&ref&sq&sqc&sqm&sst&s&id&pm&do&source&method&size&query&sorting
[...]
12:<!-- WP Super Cache is installed but broken. The constant WPCACHEHOME must be set in the file wp-config.php and point at the WP Super Cache plugin directory. -->
1:/* CUSTOMIZATIONS */
[...]
8:Diasllow: authcallback.aspx
1:Disalllow: /search.php
1:Disalow: /dnd/
1:Dissalow: /catalog/product/gallery/

There are multiple findings here. First, the use of non-standard directives (e.g., ACAP-*). Then, some HTML and CSS comments are present (robots.txt comments start with #). Finally, we note numerous misspellings of Disallow and other directive names. This will be useful later on: when extracting the Disallow directives, we can be more relaxed about the exact spelling.
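To tolerate these variants, the pattern used for extraction can be loosened. For instance, a possible sketch (the character classes are my own approximation, tuned to the misspellings listed above):

import re

# Matches Disallow as well as Diasllow, Disalllow, Disalow, Dissalow, ...
RE_DISALLOW = re.compile(r'^\s*di[as]{1,3}l{1,3}ow\s*:\s*(\S+)', re.I)

for l in ["Disallow: /admin/", "Diasllow: authcallback.aspx",
          "Dissalow: /catalog/product/gallery/"]:
  m = RE_DISALLOW.match(l)
  if m:
    print m.group(1)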

One of my favourite unexpected directives is this confusion with an Apache configuration, from www.hanoverian.co.nz:

User-agent: Googlebot
Disallow: /*.gif$

User-agent: Googlebot
Disallow: /*.jpg$

RewriteEngine on
#Options +FollowSymlinks

Comments

By modifying the script above, we can inspect the comments:

241:# block against duplicate content
241:# block MSIE from abusing cache request
[...]
26:#Disallow: /video/index.jsp*
[...]
3:#  --> fuck off.  
2:# dotbot is wreckless.
2:# KSCrawler - we don't need help from you
1:# Huge pig crawlers are eating bandwidth

We observe that some comments are common; they likely come from a single template or a copy/paste. Some directives are also commented out. Again, we will reuse that knowledge when extracting the Disallow directives. Finally, the good old internet rage is present.

Hiding pattern

In my first attempt to extract the Disallow directives, I extracted the first part of each path and incremented a counter each time that path was found. Although the list looked similar to what I was expecting, the first entry caught my eye:

9995:plenum
4603:search
4038:user
3398:wp-admin
3135:en
2774:admin

While the other entries certainly look familiar, I had never heard of an application or server subpath called "plenum". Querying our hosts set:

hosts >>> history.responded().filter(lambda x: "plenum" in x.response.content)
{200:1 | www.knesset.gov.il}

Only one hit, for the website of the Knesset, the Israeli parliament. Why would such a website have so many Disallow directives? Here is an excerpt of their robots.txt:

User-agent: *
Disallow: /plenum/data/5510903.doc
Disallow: /plenum/data/5697203.doc
Disallow: /plenum/data/07822203.doc
Disallow: /plenum/data/07822803.doc
Disallow: /plenum/data/7111903.doc
Disallow: /plenum/data/7714403.doc
Disallow: /plenum/data/7714903.doc
Disallow: /plenum/data/7715303.doc
Disallow: /plenum/data/7118203.doc
Disallow: /plenum/data/7118303.doc

About 10,000 of these documents are explicitly required not to be crawled. Further investigation is left as an exercise for the reader.

Disallow

We can now focus back on our original objective and extract the first component of each path. Each match is counted only once per file, similar to a document frequency. Here is my top ten:

5806:wp-admin
5398:search
3793:admin
3206:includes
2805:cgi-bin
2663:modules
2389:xmlrpc
2384:scripts
2333:user
2153:cron

The exact results will depend on which characters you include in your regular expression and on your objectives (e.g., extracting only directory names).
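As a rough sketch of that extraction, reusing the saved session, the relaxed Disallow pattern from earlier and an optional leading '#' for commented-out directives (the character class for a path component is my own choice and, as noted, changes the exact counts):

import re
from collections import defaultdict

from burst.all import *

switch_session('hosts')

# Relaxed Disallow: tolerate misspellings and commented-out directives,
# and capture the first component of the path
RE_DIS = re.compile(r'^\s*#?\s*di[as]{1,3}l{1,3}ow\s*:\s*/?([a-z0-9_.~-]+)', re.I)

df = defaultdict(int)
for r in history.responded().filter(lambda x: x.response.status == "200"):
  words = set()
  for l in r.response.content.splitlines():
    m = RE_DIS.match(l)
    if m:
      words.add(m.group(1).lower())
  for w in words:  # each word is counted at most once per file
    df[w] += 1

print "\n".join(str(v) + ":" + k for k, v in
                sorted(df.items(), key=lambda x: x[1], reverse=True)[:10])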

List quality

The same experiment was performed with another subset from Common Crawl. After extracting the hostnames, the result was compared with the first subset. The similarity between the two lists was measured using the Jaccard index, which lies in the range [0, 1], where 1 means the sets are identical and 0 means they share no element (disjoint). For the two hostname lists, an index of 0.27 was found. This is roughly what one would expect: a majority of different websites, with some large websites shared between the two sets (the regular expression captures all URLs found, including those in the response content, not only the crawled URLs).
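For reference, the Jaccard index of two wordlists can be computed as follows (a small helper, not part of the original scripts):

def jaccard(a, b):
  # |A intersection B| / |A union B|
  a, b = set(a), set(b)
  return len(a & b) / float(len(a | b))

print jaccard(["admin", "stats"], ["wp-admin", "search"])  # disjoint: 0.0
print jaccard(["admin", "stats"], ["stats", "admin"])      # identical: 1.0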

The same similarity measure was performed against the Disallow wordlists extracted from both sets. Below is a plot of the result:

[Figure: List similarity — Jaccard index of the top-N Disallow words from both sets]

The top N words of both lists are plotted against their Jaccard index. Initially, the top 100 words share a high level of similarity (~0.88); this level decreases as the size of the subset grows. The three vertical lines represent thresholds on the number of occurrences within the first set. For words found at least ten times, both lists still share a relatively high level of similarity.
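The curve itself can be reproduced along these lines, given the two ranked wordlists (one word per line, most frequent first; the filenames are hypothetical):

# ranked_a.txt / ranked_b.txt: hypothetical dumps of the Disallow words
# from each subset, sorted by decreasing document frequency
list_a = open("ranked_a.txt").read().splitlines()
list_b = open("ranked_b.txt").read().splitlines()

for n in (100, 500, 1000, 5000, 10000):
  a, b = set(list_a[:n]), set(list_b[:n])
  print "%d\t%.2f" % (n, len(a & b) / float(len(a | b)))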

How many words to keep?

Based on the results above, only a certain number of the top words may be kept for the final list. In practice, there is a trade-off between how many requests one is able to send (time/network constraints) and the benefit of finding a hidden resource.

Personally, I prefer to run an almost unlimited list for the full time frame rather than take the chance of missing a resource. (If some SSH keys are accessible within the ~huang directory, I do not want to pass them by.)

Limitations

Here are some limitations worth noting for this exercise. First, the algorithms used are not optimised at all. Obvious improvements exist, but one objective of this post was to show that such an experiment requires less than one hundred lines of scripting, mostly reusing available tools.

Secondly, the hostnames visited depend heavily on the crawling strategy of Common Crawl. Since it is based on the open-source project Nutch, it is unlikely that a known bias exists, although I did not read the Nutch source code to confirm this assumption.

Thirdly, some irrelevant duplicates may appear and artificially boost the score of some words. This is especially true for domains with many subdomains. At this stage, I simply blacklisted the biggest ones (*.blogspot.*, *.wordpress.com, etc.).

Fourthly, I only focused on the first component of the path for extraction. The remaining components are obviously of high interest as well; for instance, one may build a Markov chain to estimate the likelihood of the next subpath given the current one (a rough sketch follows at the end of this section).

Finally, only the Disallow directives were parsed. Other useful information may be generated from the aggregation of the other directives.
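As an illustration of the Markov chain idea mentioned above, a first-order model could be as simple as counting transitions between consecutive path components (the sample paths below are made up):

from collections import defaultdict

# Hypothetical disallowed paths collected from robots.txt files
paths = ["/admin/config/backup", "/admin/config/export", "/admin/users"]

transitions = defaultdict(lambda: defaultdict(int))
for p in paths:
  parts = [c for c in p.split("/") if c]
  for cur, nxt in zip(parts, parts[1:]):
    transitions[cur][nxt] += 1

# Estimate P(next component | current component) from the counts
for cur, nxts in transitions.items():
  total = float(sum(nxts.values()))
  for nxt, n in nxts.items():
    print "%s -> %s: %.2f" % (cur, nxt, n / total)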

Conclusion

As we have seen, the use of robots.txt is not without consequences. In the simplest cases, it reveals restricted paths and the technology used by your servers. With further investigation, one may even find an acknowledgment that some content should not be there at all.

From a defender's perspective, two common fallacies remain: first, that robots.txt somehow acts as an access control mechanism; secondly, that its content will only ever be read by search engines and not by humans. I hope this post has shown why these assumptions are wrong and what impact they might have.