## Wednesday, 21 March 2012

I recently started work on a very exciting project called GooDork. In its most basic function, this Python script allows you to run Google dorks straight from your command line.

Its real power, though, lies in what it allows you to do with the results of a Google dork.
Regex
Many of you may already know what regular expressions are, but for those of you who don't, I'll briefly run through a short description:
In computing, a regular expression provides a concise and flexible means to "match" (specify and recognize) strings of text, such as particular characters, words, or patterns of characters --- http://en.wikipedia.org/wiki/Regular_expression
Basically, regular expressions are compact programs used to specify how strings should be matched. They provide incredible flexibility and power and should be part of every hacker/programmer's arsenal!

If you already know about regex, then the "Google dorking syntax" or search directive syntax will look somewhat familiar to you. This is because it's most likely parsed to regex or relates to a regex "machine" in some weird way.

What does this have to do with Google Dorking?
Though quite powerful, Google search directives have many limitations! For instance, some search directives cannot be combined, they don't provide all the control we would like, and others throw CAPTCHAs at you like crazy, e.g. inurl! How do we get around this?

Well, let's say we don't want to use inurl in our Google search but still want results that have been matched according to their URLs. What we can do is run a normal Google search and then match the URLs that Google returns against a certain value. This is where regex comes in; there is nothing more powerful at matching strings than regex! You do the dorking, GooDork does the regex-ing!
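
To make the idea concrete, here is a minimal sketch of "search first, then filter by URL". It is not GooDork's actual code: the function name `filter_by_url` and the sample URLs are made up for illustration, and the list simply stands in for whatever a search would return.

```python
import re

def filter_by_url(urls, pattern):
    """Keep only the URLs in which the regex matches somewhere."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]

# Made-up result URLs standing in for what a search would return.
results = [
    "http://example.edu/students/index.html",
    "http://example.edu/admin/login.php",
    "http://example.org/courses/list.htm",
]

# Roughly what inurl:login would give us, but done locally on the results.
for url in filter_by_url(results, r"login"):
    print(url)
```

Because the filtering happens on your side, the search query itself can stay simple, and the pattern can be as elaborate as you like.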

So essentially what GooDork does is allow you to:

1. Run any dork you wish: intext/site/intitle etc.
2. Match a regex against the attributes of the results. The following are implemented at the moment:
• URL strings --- behaves like inurl
• Displayable text in the body of the page --- behaves like intext
• href values in anchors --- behaves like inanchor
• Title tags of pages --- behaves like intitle
Just these four options prove incredibly powerful, since you can do everything a dork can do and more! If you know your regex, that is!
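
As a rough sketch of how the four attributes above could be pulled out of a page and matched, here is one way to do it with Python's standard `html.parser`. This is an assumption about the general approach, not GooDork's implementation; the class `PageAttributes`, the function `page_matches` and the sample HTML are all hypothetical.

```python
import re
from html.parser import HTMLParser

class PageAttributes(HTMLParser):
    """Collect the title, anchor hrefs and visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self.text = []
        self._stack = []  # open <title>/<script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        if tag in ("title", "script", "style"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current == "title":
            self.title += data
        elif current is None and data.strip():
            # Not inside title/script/style, so treat it as displayable text.
            self.text.append(data.strip())

def page_matches(url, html, url_re=None, title_re=None, body_re=None, anchor_re=None):
    """Return True if every regex that was supplied matches its attribute."""
    page = PageAttributes()
    page.feed(html)
    page.close()
    checks = [
        (url_re, [url]),
        (title_re, [page.title]),
        (body_re, [" ".join(page.text)]),
        (anchor_re, page.hrefs),
    ]
    return all(
        any(re.search(pattern, candidate) for candidate in candidates)
        for pattern, candidates in checks
        if pattern is not None
    )

# Made-up page used only to exercise the sketch.
sample = (
    "<html><head><title>Student Portal</title></head>"
    "<body><p>Welcome, Students!</p><a href='/login.php'>Log in</a></body></html>"
)
print(page_matches("http://example.edu/", sample, title_re=r"Portal", body_re=r"Students"))
```

Each supplied regex has to match for the page to be kept, which is what lets several filters be combined on one result.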

GooDork crash course
GooDork as it stands is pretty straightforward to use. It accepts the following arguments:
1. [dork] -- A search query to run with Google
2. Any/All of the following (these options operate on the results of the search query [dork]):
• -t [regex] --- matches the regex in the title of the results
• -b [regex] --- matches the regex in the displayable text or body of the results
• -a [regex] --- matches the regex in the <a> tags (the href values) of the results
• -u [regex]  --- matches the regex in the URL of the results
./GooDork.py site:.edu -b Students

The above example will run the dork "site:.edu" (which returns URLs under the .edu domain), and then GooDork will try to match the word "Students" in the body of each result and return to you only those that match.

Much, much more is possible! But first I need to show you how to use the basics ;)
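
For readers who want to see the shape of such a command line in code, the sketch below models the arguments listed above with Python's `argparse`. It is only an illustration: GooDork's own argument handling may well be done differently, and the `dest` names (`title`, `body`, `anchor`, `url`) are invented here.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented options, not GooDork's own."""
    parser = argparse.ArgumentParser(
        description="Run a dork, then filter the results with regexes (sketch)"
    )
    parser.add_argument("dork", help="search query to run")
    parser.add_argument("-t", dest="title", metavar="REGEX", help="match in the title")
    parser.add_argument("-b", dest="body", metavar="REGEX", help="match in the body text")
    parser.add_argument("-a", dest="anchor", metavar="REGEX", help="match in <a> href values")
    parser.add_argument("-u", dest="url", metavar="REGEX", help="match in the URL")
    return parser

# Same arguments as the example above: site:.edu -b Students
args = build_parser().parse_args(["site:.edu", "-b", "Students"])
print(args.dork, args.body)
```

Options that are not given stay `None`, so a filter is only applied when its flag appears on the command line.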
GooDork Basics
So I'm going to show you some basic dorks. To start off with, let's try using 'site' as a dork with no other arguments. You should have the following typed in on your command line:

And when you press enter you should see the following start filling your screen:
The first part of the output should look familiar to people who know a little about HTTP: these are the headers that the Google server sent back after running the query. I included these in the output so you can tell what's really going on if something goes wrong with the search query. The results in full glory are printed out after that.

You can run just about any dork and combine it with a regex. As an example, let's run the same dork, but say we only want the URLs that point to HTML pages (meaning not php/asp/aspx etc.). We do this by matching on the URL with the -u option, which behaves like inurl. We need to build a regex; the one I will use is this one:
(html|htm).*?$
This is how the regex works:
• (html|htm)
  • () --- these brackets group everything inside as a single regex; we do this here so that the special characters after it operate on the entire group
  • html --- this is just as it appears, this little regex string will match anything with 'html' appearing in it
  • htm --- same as the one above, so we can also match those pages that are HTML but have this extension
• .*?$
  • . --- this special character matches any character except the newline
  • * --- this is the greedy operator, it matches as many repetitions of the preceding regex as possible
  • ? --- this matches either 1 or 0 of the preceding regex
  • $ --- this special char specifies that whatever regex you specified is to be matched at the end of the string, or just before the newline at the end of the string
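Why then this weird combination? Well, the combination of special chars '.*?' matches as little of the preceding regex as possible (the '?' placed after '*' makes it non-greedy). The idea is to match URLs that end in '.html' or '.htm', not '.hhhhttttmmmmmlll' or 'httttttttttm'. Note that when the regex is searched for anywhere in the URL (as Python's re.search does), '.*?' can still stretch all the way to the end of the string, so this pattern also accepts URLs that merely contain 'htm' somewhere; a stricter pattern such as '\.html?$' only accepts URLs that really end in '.htm' or '.html'. Read more about this at http://docs.python.org/library/re.html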
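
If you want to poke at these special characters yourself, the short sketch below shows greedy versus non-greedy matching and how the `$` anchor behaves. It assumes the regex is applied with `re.search`; how GooDork applies it internally is not shown here, and the URLs are made up.

```python
import re

# Greedy vs non-greedy: '.*' grabs as much as it can, '.*?' as little as it can.
greedy = re.search(r"<.*>", "<a><b>").group()
lazy = re.search(r"<.*?>", "<a><b>").group()
print(greedy, lazy)

# Made-up URLs to compare the article's pattern with a stricter one.
urls = [
    "http://example.edu/index.html",
    "http://example.edu/page.htm",
    "http://example.edu/html5/login.php",
]

article_pattern = re.compile(r"(html|htm).*?$")
strict_pattern = re.compile(r"\.html?$")

for url in urls:
    # Print whether each pattern is found somewhere in the URL.
    print(url, bool(article_pattern.search(url)), bool(strict_pattern.search(url)))
```

The third URL is the interesting one: it contains 'html' but ends in '.php', so the two patterns disagree on it.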

The results we get look as follows:

As you can see, it only returned the HTML pages from the results list. This is a basic example of what you can do with GooDork; the following articles will focus more on applying GooDork to vulnerability hunting in web applications.

Hope this helped, please comment for more clarification ;)

I completely forgot to add a link to the project lols, you can get GooDork here: https://github.com/k3170makan/GooDork