Very short introduction to Google web indexing
Google uses a process called crawling (or fetching) to index new or updated pages. The program responsible for the crawling is called Googlebot (also known as a robot, bot, or spider). Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. Googlebot uses two types of crawling:
- Deep crawl - when Googlebot fetches a page, it culls all the links appearing on the page and adds them to the queue for subsequent crawling. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. Because of their massive scale, deep crawls can reach almost every page on the web. Due to the number of pages existing on the web, this can take some time, so some pages may be crawled only once a month.
- Fresh crawl - to keep the index current, Google continuously rescans popular and frequently changing web pages at a rate roughly proportional to how often the pages change. Newspaper pages are downloaded daily, while pages with stock quotes are downloaded much more frequently. Fresh crawls return fewer pages than deep crawls.
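The link-harvesting loop of a deep crawl can be sketched in a few lines. This is only an illustration of the queueing idea described above, not Googlebot's actual implementation. The `TOY_WEB` dictionary, the `deep_crawl` function, and the `example` hostnames are hypothetical stand-ins for real page fetching.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the links found on it.
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["b.example/files", "a.example/news"],
    "b.example/files": [],
}


def deep_crawl(seed, fetch_links):
    """Return pages in the order they are fetched, starting from seed."""
    queue = deque([seed])   # pages waiting to be fetched
    seen = {seed}           # pages already queued, so nothing is fetched twice
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Harvest every link on the page and queue the ones not seen before.
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


crawl_order = deep_crawl("a.example/", lambda page: TOY_WEB.get(page, []))
print(crawl_order)
```

Starting from a single seed page, the queue grows with every page fetched, which is why a deep crawl can cover a large part of the web but takes a long time to finish.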
What Exactly is Google Hacking?
Google Hacking is a technique that uses Google’s search engine to find vulnerable or sensitive data. To help refine search results, you can use Advanced Search Operators and Special Search Characters. Advanced operators use the following syntax:
operator:search_term
Operator | Purpose | Mixes with other Operators? | Can be used alone? |
intitle | Search page title | yes | yes |
allintitle | Search page title | yes | yes |
inurl | Search URL | yes | yes |
allinurl | Search URL | no | yes |
filetype | Search specific files | yes | no |
allintext | Search text of page only | yes | yes |
site | Search specific site | yes | yes |
link | Search for links to pages | no | yes |
inanchor | Search links anchor text | yes | yes |
numrange | Search numbers within a desired range. | yes | yes |
daterange | Search in date range | yes | no |
Character | Purpose |
+ | forced inclusion of something common |
- | exclude a search term |
“ ” | use quotes around search phrases |
. | single-character wildcard |
* | any word |
| | Boolean ‘OR’ |
(“master card” | mastercard) | Parentheses group queries |
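To show how the operator syntax and the special characters combine, here is a small sketch that assembles a query string. The `build_query` helper is hypothetical and written only for this illustration. It does not contact Google; it just formats text you could paste into the search box.

```python
def build_query(*parts, exclude=()):
    """Join search parts into one query string.

    Each part is either a plain term (str) or an (operator, term) pair.
    Terms containing spaces are wrapped in double quotes.
    """
    def quote(term):
        return f'"{term}"' if " " in term else term

    pieces = []
    for part in parts:
        if isinstance(part, tuple):
            operator, term = part
            pieces.append(f"{operator}:{quote(term)}")  # operator:search_term
        else:
            pieces.append(quote(part))
    # A leading "-" excludes a term from the results.
    pieces.extend(f"-{quote(term)}" for term in exclude)
    return " ".join(pieces)


query = build_query(("intitle", "index.of"), "parent directory")
print(query)  # intitle:index.of "parent directory"
```

Operators that cannot be used alone, such as filetype, need at least one more part, for example `build_query(("site", "example.com"), ("filetype", "txt"))`.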
Examples of Google Hacking
So what exactly can you find in Google and why is it vulnerable? Let's take a look at a few examples.
Directory Listings
Directory listings provide a list of files and directories in a browser window instead of the typical text-and-graphics mix generally associated with web pages. Directory listings are often placed on web servers on purpose to allow visitors to browse and download files from a directory tree. Often, however, directory listings are not intentional, and there’s a good chance that an attacker may find something interesting inside one. Query:
intitle:index.of
This basic query returns a large number of false positives. The following queries return more interesting results: Query:
intitle:index.of "parent directory"
or Query:
intitle:index.of name size
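The extra terms cut down false positives because a real listing usually contains more than the words "index of" in its title. The same idea can be turned around to check your own pages. The sketch below is a rough heuristic under that assumption. The `looks_like_directory_listing` function and both sample pages are made up for illustration.

```python
import re

TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_directory_listing(html):
    """Heuristic: title says "Index of" and the page mentions a parent directory."""
    match = TITLE.search(html)
    if match is None or "index of" not in match.group(1).lower():
        return False
    return "parent directory" in html.lower()


# Hypothetical pages, written to resemble the two cases.
LISTING = (
    "<html><head><title>Index of /backup</title></head><body>"
    '<h1>Index of /backup</h1><a href="/">Parent Directory</a>'
    '<a href="db.sql">db.sql</a></body></html>'
)
ARTICLE = (
    "<html><head><title>Index of our 2015 articles</title></head>"
    "<body><p>Welcome to the archive.</p></body></html>"
)

print(looks_like_directory_listing(LISTING))   # True
print(looks_like_directory_listing(ARTICLE))   # False
```

The second page would match intitle:index.of but not the stricter query, which is exactly the kind of false positive the additional terms filter out.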
Web Server Detection
Directory listings often end with a line that names the web server and its version. A security tester can use this information to determine the version of the web server, or to search Google for vulnerable targets. It also indicates whether the web server is well maintained. Query:
intitle:index.of server.at
- This query focuses on the term “index of” in the title and “server at” appearing at the bottom of the directory listing.
intitle:index.of "Apache/2.4.7 Server at"
- This query will find servers with directory listings enabled that are running Apache version 2.4.7.
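Once you have such a footer line, the version can be pulled out with a regular expression. This is a minimal sketch that assumes the footer follows the "product/version ... Server at host Port n" shape seen in Apache listings. The `parse_server_footer` name and the sample host are hypothetical.

```python
import re

FOOTER = re.compile(
    r"(?P<product>[A-Za-z_-]+)/(?P<version>\d+(?:\.\d+)*)"  # e.g. Apache/2.4.7
    r".*?Server at (?P<host>\S+) Port (?P<port>\d+)"
)


def parse_server_footer(text):
    """Return product, version, host and port from a listing footer, or None."""
    match = FOOTER.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["port"] = int(info["port"])
    return info


footer = "Apache/2.4.7 (Ubuntu) Server at files.example.com Port 80"
print(parse_server_footer(footer))
```

Running this against the listings of your own servers is a quick way to see which version strings they reveal to anyone, including Google.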
Files containing usernames and / or passwords
Yes, it's possible to find files containing logins and passwords which still work! Query:
xamppdirpasswd.txt filetype:txt
Returns password files for XAMPP servers.
Query:
site:github.com inurl:sftp-config.json
Returns FTP login and password credentials stored on github.com.
Query:
filetype:password jmxremote
Passwords for Java Management Extensions (JMX Remote) used by jconsole.
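Query:
"# Dumping data for table" (user | username | pass | password)
Finds SQL database dumps that mention tables or columns likely to hold usernames or passwords.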
Sensitive Directories
Query:
inurl:8080 intitle:"Dashboard [Jenkins]"
Gives access to Jenkins dashboards. At first you won’t see much, but if you dig deeper you may find more interesting material.
Sample screen of one of the latest builds.
Query:
".git" intitle:"Index of"
Shows access to publicly browsable .git directories.
Various Online Devices
Query:
inurl:system_device.xml
Displays the public status page of Konica Minolta printers.
As you can see, there’s nothing unusual so far, but from here, you can go to the login screen, and switch to an administrator account.
At this point, you will still need a password. On the previous page, you could have seen the specific printer model, so maybe the default password is going to work? You can always try asking Google. You don’t need a sophisticated query to do so, and the result is:
Now, let’s put it to the test.
As you can see, in this case it worked.
How to remain safe?
I’ve already talked about several examples of finding vulnerable data using Google. Now, let’s take a look at what you can do to avoid falling victim to those methods:
- Disable directory browsing on the web server. Enable it only for web folders that you want anyone on the Internet to be able to access.
- Don’t put critical or sensitive information on servers without a proper authentication system. Otherwise, it is directly accessible to anyone on the Internet.
- Always install the latest security patches for your applications and for the operating system on your servers.
- Disable anonymous access over the Internet to restricted system directories on the web server.
- If you find links to your restricted servers or sites in Google search results, have them removed.
- Google has also taken some steps to monitor suspicious searches for vulnerable data.
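The first point can be checked automatically. The sketch below assumes an Apache httpd server, where directory browsing is controlled by the Indexes flag of the Options directive; other web servers use different settings. The `find_enabled_indexes` function and the sample configuration are hypothetical, and the check deliberately ignores broader settings such as Options All.

```python
def find_enabled_indexes(config_text):
    """Return (line_number, line) pairs whose Options directive enables Indexes."""
    findings = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if line.startswith("#"):          # skip comment lines
            continue
        tokens = line.split()
        if not tokens or tokens[0].lower() != "options":
            continue
        flags = {token.lower() for token in tokens[1:]}
        # "Indexes" or "+Indexes" turns browsing on, "-Indexes" turns it off.
        if "indexes" in flags or "+indexes" in flags:
            findings.append((number, line))
    return findings


# Hypothetical configuration fragment used only for this example.
SAMPLE_CONFIG = """\
<Directory /var/www/html>
    Options Indexes FollowSymLinks
</Directory>
<Directory /var/www/private>
    Options -Indexes
</Directory>
"""

print(find_enabled_indexes(SAMPLE_CONFIG))
```

Any line reported here is a directory that a query such as intitle:index.of could expose once Google has crawled it.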
Conclusion
Google hacking can be a very useful tool in penetration testing. Tools like Metasploit and Nmap now have automated scripts that search Google for useful information related to a particular site or organisation. Google hacking is also used in social engineering attacks and phishing campaigns. Although Google hacking is an old technique, it remains effective to this day. This is because newly misconfigured servers, online devices, and vulnerable websites appear on the internet every day, and Google indexes them all.
Want to know more?
Feel free to take a look at these articles and links:
- How Google's Site Crawlers Index Your Site - Google - more detailed information about indexing by Google.
- Google Hacking Database (GHDB) - Google Dorks, OSINT, Recon (exploit-db.com), still updated.
- Presentation “Google Hacking for Penetration Testers - Using Google as a Security Testing Tool” by Johnny Long
- Presentation “The Google Hacker’s Guide Understanding and Defending Against the Google Hacker” by Johnny Long
- Shodan Search Engine - the world's first search engine for Internet-connected devices.