What is a robots.txt file?
The robots.txt file is an important channel of communication between a site and search engine spiders. Through it, a site declares which content it does not want included in search engines, or restricts a search engine to indexing only specific sections. Note that you only need a robots.txt file if your site contains content that you do not want included in search engines; if you want search engines to include all the content on the site, do not create a robots.txt file. At present, the robots.txt file set in the Lead system allows all content to be included by search engines.
The format of the robots.txt file
The robots.txt file is placed in the root directory of the site and contains one or more records, separated by blank lines (with CR, CR/NL, or NL as the line terminator). Each record has the following format:
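In general, a record is a sequence of `field: value` lines. A minimal sketch of the shape (the angle-bracketed parts are placeholders, not literal values):

```
# Comments start with "#" and run to the end of the line
User-agent: <robot name>
Disallow: <path prefix>
Allow: <path prefix>
```

The individual fields are described in detail below.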
Within this file, "#" can be used for comments, following the same convention as in UNIX. A record usually starts with one or more User-agent lines, followed by several Disallow and Allow lines. The details are as follows:
User-agent: The value of this field names the search engine robot that the record applies to. If the "robots.txt" file contains multiple User-agent records, multiple robots are restricted by it; the file must contain at least one User-agent record. If the value is set to *, the record applies to every robot, and the file may contain only one "User-agent: *" record. If you add "User-agent: SomeBot" followed by several Disallow and Allow lines to the "robots.txt" file, then the robot named "SomeBot" is only subject to the Disallow and Allow lines that follow "User-agent: SomeBot".
Disallow: The value of this field describes a group of URLs that you do not want to be accessed. The value can be a complete path or a non-empty path prefix; any URL beginning with the value of the Disallow field will not be accessed by robots. For example, "Disallow: /help" prohibits robots from accessing /help.html, /helpabc.html, and /help/index.html, while "Disallow: /help/" lets robots access /help.html and /helpabc.html but not /help/index.html. "Disallow:" with an empty value indicates that robots are allowed to access all URLs of the website. There must be at least one Disallow record in the "/robots.txt" file. If "/robots.txt" does not exist or is an empty file, the website is open to all search engine robots.
Allow: The value of this field describes a set of URLs that you do want to be accessed. As with Disallow, the value can be a complete path or a path prefix; any URL beginning with the value of the Allow field is allowed to be accessed by robots. For example, "Allow: /hibaidu" allows robots to access /hibaidu.htm, /hibaiducom.html, and /hibaidu/com.html. All URLs of a website are allowed by default, so Allow is usually used together with Disallow to permit access to some web pages while prohibiting access to all other URLs.
Use \"*\"and\"$\":Baiduspider supports the use of wildcards \"*\" and \"$\" to match fuzzy URLs.
\"*\" matches 0 or more arbitrary characters
\"$\" matches the end-of-line character.
One last note: Baidu strictly abides by the robots protocol, so pay attention to the letter case of the directories you do not want crawled or included. Baidu matches the paths written in robots.txt against the directories you want excluded exactly; if the case differs, the robots rules will not take effect.

Commonly used ways to write a robots.txt file
1. Allow all search engines to access
Note that the most direct way is to create an empty file named "robots.txt" and place it in the root directory of the website.
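An empty file works, but the same effect can also be spelled out explicitly; an empty Disallow value allows everything:

```
User-agent: *
Disallow:
```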
2. Prohibit access to all search engines
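A sketch of this rule; "Disallow: /" blocks every URL for every robot:

```
User-agent: *
Disallow: /
```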
3. Prohibit all search engines from accessing certain parts of the website; here I use the directories a, b, and c as stand-ins
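Using the placeholder directories a, b, and c from above, the rules would look like this:

```
User-agent: *
Disallow: /a/
Disallow: /b/
Disallow: /c/
```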
4. Prohibit a particular search engine from accessing the site; I use w as a stand-in for its name
Adding /d/*.htm after "Disallow:" means that access to every URL with the ".htm" suffix under the /d/ directory is prohibited, including subdirectories.
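A sketch covering both forms: blocking the robot w from the whole site, and the /d/*.htm variant described above:

```
# Block the robot named w from the entire site
User-agent: w
Disallow: /

# Variant: block w only from .htm URLs under /d/, including subdirectories
# User-agent: w
# Disallow: /d/*.htm
```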
5. Only allow a certain search engine to access the site; I use e as a stand-in
Leaving the value after "Disallow:" empty means that only e is allowed to access the website.
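A sketch of this pattern: the record for e allows everything, while the "*" record blocks every other robot:

```
User-agent: e
Disallow:

User-agent: *
Disallow: /
```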
6. Use \"$\" to restrict access to url
This means that only URLs with the ".htm" suffix can be accessed.
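A sketch of this rule: Allow admits URLs ending in ".htm", and the Disallow line blocks everything else:

```
User-agent: *
Allow: /*.htm$
Disallow: /
```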
7. Prohibit access to all dynamic pages in the website
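Dynamic pages typically carry a "?" in the URL, so a common sketch is:

```
User-agent: *
Disallow: /*?*
```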
8. Forbid search engine F from crawling all pictures on the website
This means that F is only allowed to crawl web pages and is forbidden from crawling any pictures (strictly speaking, it is forbidden from crawling pictures in jpg, jpeg, gif, png, and bmp format).
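A sketch of this rule, blocking each image format by suffix for the robot F:

```
User-agent: F
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$
```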
9. Only search engine E is allowed to crawl web pages and .gif format pictures
This means that E is allowed to crawl web pages and gif-format pictures, while pictures in other formats are not allowed.
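A sketch of this rule: Allow admits .gif pictures, and the remaining image formats are blocked by suffix for the robot E:

```
User-agent: E
Allow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$
Disallow: /*.bmp$
```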
Most search engine robots abide by the rules in robots.txt files, and the above covers the common ways to write them. One reminder: the robots.txt file must be written correctly. If you do not know how to write it, study it first before writing, so as not to cause trouble for the site's inclusion in search engines. To set the robots.txt file in the Lead system:
Step 1: Log in to the Lead system and do the following:
Step 2: Set the robots.txt file as shown in the figure below and save it;
Step 3: Save and publish to take effect.
If a single page on the website should not be included, you can add a robots meta tag to the source code of that page:
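A typical sketch of such a tag; "noindex" is the common value for keeping a page out of search results (the exact content value depends on what you want to block):

```html
<meta name="robots" content="noindex">
```

The tag goes inside the page's &lt;head&gt; section; "nofollow" can be added to the content value to also stop robots from following the page's links.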
This needs to be added by Lead system staff; if you have this need, please contact QQ: 2417402658.