Robots Detail

What Is Robots.txt?
A robots.txt file tells search engine crawlers which pages or files they can or can't request from your site. It is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.

Introduction to Robots.txt

The robots.txt file is a very simple text file placed in the root directory of your site,
for example www.yourdomain.com/robots.txt. This file tells search engines and other robots
which areas of your site they are allowed to visit and index.
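Because the file always sits at the root of a host, its address can be derived from any page URL on that host. The following is a minimal Python sketch of that idea; the function name robots_txt_url, the https scheme, and the shop subdomain are illustrative assumptions rather than anything defined by the protocol.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain and port), replace
    # the path with /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://www.yourdomain.com/blog/post?id=7"))
print(robots_txt_url("https://shop.yourdomain.com/cart/"))  # hypothetical subdomain
```

Each host resolves to its own file, so a subdomain is never covered by the robots.txt of the main domain.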
The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct
robots (typically search engine robots) how to crawl and index pages on their website.
The Robots Exclusion Protocol (REP) is a group of web standards that regulate web robot
behavior and search engine indexing. The REP consists of the following:
• The original REP from 1994, extended in 1997, defining crawler directives for robots.txt. Some
search engines support extensions like URI patterns (wildcards).
• Its extension from 1996 defining indexer directives (REP tags) for use in the robots meta
element, also known as the "robots meta tag." Search engines also support additional
REP tags with an X-Robots-Tag, which webmasters can apply in the HTTP header of
non-HTML resources like PDF documents or images.
• The microformat rel-nofollow from 2005 defining how search engines should handle links
where the A element's REL attribute contains the value "nofollow."

The Format
The format and semantics of the "/robots.txt" file are as follows:
The file consists of one or more records separated by one or more blank lines
(terminated by CR, CR/NL, or NL). Each record contains lines of the form
"<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

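The record format described above can be illustrated with a small parser. This is a rough Python sketch under simplifying assumptions, not a complete implementation of the standard: the function name parse_records is hypothetical, comments are stripped from "#" onward, and lines without a colon are skipped.

```python
def parse_records(text):
    """Split robots.txt text into records, each a list of (field, value) pairs."""
    records, current = [], []
    for raw in text.splitlines():              # accepts CR, CR/NL and NL endings
        line = raw.split("#", 1)[0].strip()    # drop comments and outer whitespace
        if not line:
            # A truly blank line closes the current record; a line that held
            # only a comment does not.
            if not raw.strip() and current:
                records.append(current)
                current = []
            continue
        if ":" not in line:
            continue                           # not a "<field>:<value>" line
        field, value = line.split(":", 1)
        # Field names are case insensitive, so normalise them to lower case.
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records

sample = "# note\nUser-Agent: *\nDisallow: /temp/\n\nuser-agent:Googlebot\r\nDISALLOW:   /images/  \r\n"
for record in parse_records(sample):
    print(record)
```

The sample deliberately mixes line endings, field-name casing, and optional spaces to show that all of them reduce to the same two records.
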
Robots Exclusion Protocol Tags
Applied to a URI, REP tags (noindex, nofollow, unavailable_after) steer particular tasks of
indexers, and in some cases (nosnippet, noarchive, noodp) even query engines at runtime
of a search query. Unlike crawler directives, REP tags are interpreted differently by each
search engine. For example, Google wipes out even URL-only listings and ODP references
on their SERPs when a resource is tagged with "noindex," but Bing sometimes lists such
external references to forbidden URLs on their SERPs. Since REP tags can be supplied in
META elements of X/HTML contents as well as in HTTP headers of any web object, the
consensus is that contents of X-Robots-Tags should overrule conflicting directives found in
META elements.

Microformats
Indexer directives put as microformats will overrule page settings for particular HTML
elements. For example, when a page's X-Robots-Tag states "follow" (there's no "nofollow"
value), the rel-nofollow directive of a particular A element (link) wins.
Although robots.txt lacks indexer directives, it is possible to set indexer directives for groups
of URIs with server-side scripts acting at site level that apply X-Robots-Tags to requested
resources. This method requires programming skills and a good understanding of web
servers and HTTP.
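One way such a server-side script might look is a small WSGI wrapper that adds the header to every response whose path ends in a given suffix. This is only a sketch under assumptions: the name add_x_robots_tag, the ".pdf" suffix, and the "noindex, nofollow" value are illustrative choices, and a real site would more likely set the header in its web server configuration.

```python
def add_x_robots_tag(app, suffixes=(".pdf",), value="noindex, nofollow"):
    """Wrap a WSGI app so that matching paths get an X-Robots-Tag header."""
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def tagging_start_response(status, headers, exc_info=None):
            # Only tag the group of URIs we care about (here: by file suffix).
            if path.lower().endswith(suffixes):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, tagging_start_response)
    return wrapped
```

Wrapping an existing application as `app = add_x_robots_tag(app)` leaves HTML responses untouched while PDF responses carry the indexer directive in their HTTP headers.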

Pattern Matching

Google and Bing both honor two pattern-matching characters (not full regular expressions)
that can be used to identify pages or sub-folders that an SEO wants excluded. These two
characters are the asterisk (*) and the dollar sign ($).
* - which is a wildcard that represents any sequence of characters
$ - which matches the end of the URL
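To make the two characters concrete, the sketch below translates a robots.txt pattern into a Python regular expression and tests a few paths against it. The helper names pattern_to_regex and matches are hypothetical, and the code illustrates the matching idea rather than the actual matcher of any search engine.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")     # a trailing $ pins the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern, path):
    # Patterns are compared from the start of the path, hence re.match.
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/*.php$", "/index.php"))               # True
print(matches("/*.php$", "/index.php?x=1"))           # False: $ requires the URL to end in .php
print(matches("/private*/", "/private-files/a.html")) # True
```

Without a trailing $, a pattern behaves as a prefix, which is why "/temp/" also covers "/temp/file.html".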

Public Information
Be aware that a robots.txt file is publicly available. Anyone
can see which sections of a server the webmaster has blocked the engines from. This
means that if an SEO has private user information that they don’t want publicly searchable,
they should use a more secure approach, such as password protection, to keep visitors
from viewing any confidential pages they don't want indexed.

Important Rules
In most cases, meta robots with parameters "noindex, follow" should be employed as a way
to restrict indexation.
It is important to note that malicious crawlers are likely to completely ignore robots.txt and
as such, this protocol does not make a good security mechanism.

Only one "Disallow:" line is allowed for each URL.
Each subdomain on a root domain uses separate robots.txt files.
Google and Bing accept two specific pattern-matching characters for exclusion (* and $).
The filename of robots.txt is case sensitive. Use "robots.txt", not "Robots.TXT".
Spacing is not an accepted way to separate query parameters. For example, "/category/
/product page" would not be honored by robots.txt.
Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you
would like them not to visit. Robots.txt is by no means mandatory for search engines, but
generally search engines obey what they are asked not to do. It is important to clarify that
robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a
firewall or a kind of password protection). Putting up a robots.txt file is
something like putting a note “Please, do not enter” on an unlocked door: you cannot
prevent thieves from coming in, but the good guys will not open the door and enter. That is
why we say that if you have really sensitive data, it is too naïve to rely on robots.txt to
protect it from being indexed and displayed in search results.
Structure of a Robots.txt File
The structure of a robots.txt is pretty simple (and barely flexible): it is an endless list of
user agents and disallowed files and directories. Basically, the syntax is as follows:
User-agent:
Disallow:
“User-agent:” names the search engines' crawlers, and “Disallow:” lists the files and directories to be
excluded from indexing. In addition to “User-agent:” and “Disallow:” entries, you can include
comment lines; just put the # sign at the beginning of the line:
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
The Traps of a Robots.txt File
When you start making complicated files (i.e. you decide to allow different user agents
access to different directories), problems can start if you do not pay special attention to the
traps of a robots.txt file. Common mistakes include typos and contradicting directives.
Typos are misspelled user-agents, directories, missing colons after User-agent and
Disallow, etc. Typos can be tricky to find but in some cases validation tools help.
The more serious problem is with logical errors. For instance:
User-agent: *
Disallow: /temp/
User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
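The first record in the above example allows all agents to access everything on the
site except the /temp directory. The second record then sets stricter rules for Googlebot.
Which record a crawler obeys depends on how it resolves the overlap: a crawler that picks
the most specific matching record follows only the Googlebot rules, while one that stopped
at the first matching record would still crawl /images/ and /cgi-bin/.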
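A quick way to check how a given parser reads such a file is to feed it the rules and ask about specific paths. The sketch below uses Python's standard urllib.robotparser module; it shows that parser's interpretation only, which may differ from what an individual search engine does, and "ExampleBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical crawler that only matches the * record.
for agent in ("Googlebot", "ExampleBot"):
    for path in ("/images/logo.png", "/temp/cache.html", "/index.html"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:10} {path:18} {verdict}")
```

With this parser, Googlebot is judged by its own record (so /images/ is blocked for it), while the other crawler falls back to the * record and may fetch /images/.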
Robots.txt Dos and Don'ts
There are many good reasons to stop the search engines from indexing certain
directories on a website while allowing others for SEO purposes. Let's look at some dos and don'ts.
Here's what you should do with robots.txt:
• Take a look at all of the directories in your website. Most likely, there are
directories that you'd want to disallow the search engines from indexing, including
directories like /cgi-bin/, /wp-admin/, /cart/, /scripts/, and others that might include
sensitive data.
• Stop the search engines from indexing certain directories of your site that might
include duplicate content. For example, some websites have "print versions" of
web pages and articles that allow visitors to print them easily. You should only
allow the search engines to index one version of your content.
• Make sure that nothing stops the search engines from indexing the main content
of your website.
• Look for certain files on your site that you might want to disallow the search
engines from indexing, such as certain scripts, or files that might contain email
addresses, phone numbers, or other sensitive data.
Here's what you should not do with robots.txt:
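• Don't use comments in your robots.txt file.
• Don't list all your files in the robots.txt file. Listing the files allows people to find
files that you don't want them to find.
• The original robots.txt standard has no "Allow" command, so in most cases there's no
need to add it to the robots.txt file. Major engines such as Google and Bing do honor
"Allow:", which is useful only for carving an exception out of a broader Disallow rule.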
Warning about robots.txt files
Your robots.txt file should never have more than 200 Disallow lines. Start with as few
as possible and add to it when needed.
Once Google removes links referenced in your robots.txt file, it could take up to 3 months
before Google re-indexes the previously disallowed links if you want them added back.
Google pays serious attention to robots.txt files and treats them as an authoritative set
of URLs not to crawl. Keep in mind, though, that a disallowed URL can still show up in
search results as a URL-only listing when other pages link to it, so a Disallow line
alone does not guarantee that the URL disappears from Google.
The big idea to take away is to use robots.txt only for hard disallows of content that you
know you never want crawled. The disallowed pages will not be fetched by search engines,
meaning the links and content on those pages will not be used for indexing or for ranking.
So, use the robots.txt file only for disallowing links that you want search engines to stay
away from entirely. Use the robots meta tag to specify everything that is allowed, and use
the rel='nofollow' attribute of the a link element when the restriction is temporary or you
still want the link to be indexed but not followed.
Things you should always block
There are some parts of every WP blog that should always be blocked: the “cgi-bin”
directory and the standard WP directories.
The “cgi-bin” directory is present on many web servers; it is the place where CGI
scripts can be installed and then run. Nowadays, some servers don’t even allow access
to this directory, but it won’t do you any harm to include it in the Disallow
directives inside the robots.txt file.
There are 3 standard WP directories (wp-admin, wp-content, wp-includes). You should
block them because, essentially, there’s nothing there that search engines might
find interesting.
But there’s one exception. The wp-content directory has a subdirectory called “uploads”.
It’s the place where everything you upload using the WP media upload feature gets put.
The standard approach here is to leave it unblocked.
Here are the directives to get the above done:
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/
Files to block separately
WordPress uses a number of different files to display the content. Most of these don’t
need to be accessible via the search engines.
The list most often includes PHP files, JS files, INC files, and CSS files. You can block them as follows:
Disallow: /index.php # separate directive for the main script file of WP
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Robots Meta Tag For SEO
Sometimes you want to exclude only a single page, but you don't want to actually
mention that page in robots.txt.
The best way, of course, is to implement real file-level security on the file. But failing
that, you have 2 other options:
1. Create a folder and put the files you wish to exclude in the folder. Then exclude
that folder in the robots.txt file. This will automatically exclude everything under it,
but you will not need to actually name the files.
2. Use the robots metatag on the page(s) in question. What will happen is that the
search engine will look at your robots.txt, see that the page is not mentioned, and go to
it. But then it will see that robots are actually excluded from the page in the robots
metatag, and it will then obey it.
I've gone into the robots.txt file in detail elsewhere, so let's talk about the robots metatag.
Robots Meta Tag Code Example
<meta name="robots" content="noindex,nofollow">
<meta name="robots" content="none">
Either tag tells all robots not to index this page and not to follow links from this page.
Attributes and Directives
There are other options, of course:
content = all | none | directives
all = "ALL"
none = "NONE"
directives = directive ["," directives]
directive = index | follow
index = "INDEX" | "NOINDEX"
follow = "FOLLOW" | "NOFOLLOW"
This results in the following choices:
• <meta name="robots" content="index,follow">
• <meta name="robots" content="noindex,follow">
• <meta name="robots" content="index,nofollow">
• <meta name="robots" content="noindex,nofollow">
Plus the two attributes:
• <meta name="robots" content="all">
• <meta name="robots" content="none">
ALL is the equivalent of "index, follow"
NONE is the equivalent of "noindex, nofollow"
You can find more information here: http://www.robotstxt.org/wc/meta-user.html.
These attributes are NOT CASE SENSITIVE - you can use upper or lower case.
Important Note: Robots will treat the absence of a specific disallowing tag
("noindex" and/or "nofollow") as permission. Therefore if you do not use "noindex"
for example, the robots will assume you mean "index".
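The defaulting rule above can be written down as a short Python sketch. The function name resolve_robots_content is hypothetical, and letting a restrictive value win when conflicting values appear in one tag is an assumption of this sketch, not something stated in the text.

```python
def resolve_robots_content(content):
    """Return (index, follow) booleans for a robots meta content value."""
    # Values are not case sensitive and are separated by commas.
    tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
    index, follow = True, True           # absence of a tag means permission
    if "none" in tokens:
        index, follow = False, False     # NONE is the same as noindex, nofollow
    if "noindex" in tokens:
        index = False
    if "nofollow" in tokens:
        follow = False
    return index, follow

print(resolve_robots_content("NOINDEX"))         # (False, True)
print(resolve_robots_content("index,nofollow"))  # (True, False)
print(resolve_robots_content("all"))             # (True, True)
```

Note that "index", "follow", and "all" never change the result, which is why spelling them out adds nothing to a page.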
For this reason, it's usually a waste of time to use the "index,follow" or "all"
directives, since they are assumed already as a default. These are only useful when
you wish to specify, for example, that all robots are to noindex,nofollow, but a
specific robot (i.e. Googlebot) is allowed to index and follow. In this case, you would
put in something like this:
<meta name="robots" content="noindex,nofollow">
<meta name="googlebot" content="index,follow">
If you did not specifically tell Googlebot to index and follow, it would in this case
assume that you did not want it to. This is the only scenario where the index and
follow (and ALL) directives are used. Engines differ in how they resolve conflicting tags,
so check each engine's documentation before relying on this. SEOs who misuse metatags are
in danger of losing both respect from their peers and savvy clients who are checking their
basic SEO abilities before hiring.
It won't hurt you to use the ALL or index,follow robots meta directives by themselves, but it
makes you look bad. You should know how to use your own tools.
Engine Specific Commands
All 4 major search engines (Google, Yahoo, MSN, Ask) also support the
NOARCHIVE attribute.
NOARCHIVE: Google uses this to prevent archiving (caching) of a page.
See http://www.google.com/bot.html
Google will follow this directive when it is addressed to all robots, but if you want it
to apply to Google only, the best method is to specify "googlebot", rather than "robots",
in the name attribute:
<meta name="googlebot" content="noarchive">
You can also combine it:
<meta name="googlebot" content="nofollow,noarchive">
<meta name="googlebot" content="noindex,nofollow,noarchive">
NOSNIPPET: A snippet is a text excerpt that appears below a page's title in Google's
search results and describes the content of the page.
To prevent Google from displaying snippets for your page, place this tag in the
<HEAD> section of your page:
<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">
Note: removing snippets also removes cached pages (acts as
NOARCHIVE,NOSNIPPET)
Yahoo also obeys the NOARCHIVE attribute.
Ask/Teoma
Ask obeys the NOARCHIVE attribute.
MSN Search
MSN Search obeys the NOARCHIVE attribute, and if you use
the NOCACHE attribute, it is treated as the same thing.
Conclusion
The robots metatag is a simple method of telling robots what you want and don't want
indexed or followed on your site. Like the robots.txt file, it is NOT secure, nor is it a
requirement. Rogue robots can ignore it if they wish to. Further, a human is under no
restrictions at all.
The robots metatag has the advantage of being very granular - it allows you to restrict
the spidering behavior of a single page. That same granularity is also a disadvantage -
it's a real pain to put one on a lot of pages, and to keep track of what pages you used it
on. It's best used for small numbers of specific pages.
Rule of thumb: If you want to restrict robots from entire websites and directories, use
the robots.txt file. If you want to restrict robots from a single page, use the robots
metatag. If you are looking to restrict the spidering of a single link, you would use the
link "nofollow" attribute.
