Overview
IOER uses four solutions: one to check for dead links, one to check for malicious links, one to check for inappropriate links, and one to scan files uploaded by users for viruses. Dead links are detected by a custom solution that checks each link in a multi-phase, multi-threaded process to determine whether it is still valid. We use the Google Safe Browsing API to check for malicious sites, and a HOSTS file from http://winhelp2002.mvps.org/hosts.htm to check for inappropriate sites. Finally, we use an anti-virus scanner to check files uploaded by users for viruses.
Dead Link Checker
The IOER link checker periodically checks the URL of each active resource in the IOER system to verify that it is still a good, valid resource.
Resources are checked in a multi-phase process to allow for hosts that are temporarily down or busy, temporary DNS problems, and other transient failures.
When a resource is found to be bad, it is removed from the Elastic Search index and from any libraries it belongs to, and is flagged as deleted in the system.
Definitions
- Deleted – The resource is removed from Elastic Search and flagged as deleted in the Link Checker database.
- FQDN - Fully Qualified Domain Name. For example, ioer.ilsharedlearning.org is a fully qualified domain name. This is sometimes also called "Host Name."
- Host Name - See FQDN
Multi-Threaded Process
The IOER link checker uses very few CPU cycles, and its database, Elastic Search, and disk I/O are also minimal. The bottleneck is mostly
internet latency as each resource is checked, so a significant increase in speed can be obtained by running checks in multiple threads.
For example, a single-threaded check of 40,000 resources took approximately 9 hours, 32 minutes to complete; splitting the same 40,000
resources among 4 threads running in parallel reduced the time to 2 hours, 21 minutes. Because most of the time is spent in Phase 1 checks,
multi-threaded processing is implemented only for Phase 1.
The number of threads and the number of resources to check in Phase 1 are configurable through the link checker's .config file.
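As a rough illustration, here is a minimal sketch of how that configuration might drive the parallelism. The appSettings key names (Phase1ThreadCount, Phase1BatchSize) and the CheckResource method are hypothetical, not the actual identifiers from the link checker:

    using System;
    using System.Collections.Generic;
    using System.Configuration;
    using System.Linq;
    using System.Threading.Tasks;

    static class Phase1Runner
    {
        // Hypothetical key names; the real ones live in the link checker's .config file.
        static readonly int ThreadCount = int.Parse(ConfigurationManager.AppSettings["Phase1ThreadCount"]);
        static readonly int BatchSize = int.Parse(ConfigurationManager.AppSettings["Phase1BatchSize"]);

        public static void Run(IList<string> urls)
        {
            // Take one batch and deal its URLs out to ThreadCount slices, round-robin.
            var slices = urls.Take(BatchSize)
                .Select((url, i) => new { url, i })
                .GroupBy(x => x.i % ThreadCount, x => x.url);

            // Check each slice on its own thread; since most of the wall-clock time
            // is network latency, the threads overlap almost perfectly.
            Parallel.ForEach(slices, slice =>
            {
                foreach (var url in slice)
                    CheckResource(url);
            });
        }

        static void CheckResource(string url)
        {
            // Phase 1 check for a single resource (not shown).
        }
    }

Round-robin dealing keeps the slices roughly equal in size, so no thread finishes long before the others.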
Phase 1 Processing
The number of resources checked each period is configurable through the link checker's configuration file. The resources for each Phase 1 run
are selected in least-recently-checked order, with never-checked (new) resources going to the front of the line. Some conditions cause a
resource to be deleted immediately; for others, a counter is incremented, and the resource receives additional checks during Phase 2 processing.
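A minimal sketch of that selection order, assuming each resource record carries a nullable last-checked timestamp (never-checked resources have none, so they sort to the front; the Resource shape is an assumption):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Resource
    {
        public int Id;
        public string Url;
        public DateTime? LastChecked; // null = never checked
    }

    static class Phase1Selection
    {
        // Pick the next batch: never-checked resources first, then least recently checked.
        public static List<Resource> NextBatch(IEnumerable<Resource> active, int batchSize)
        {
            return active
                .OrderBy(r => r.LastChecked ?? DateTime.MinValue)
                .Take(batchSize)
                .ToList();
        }
    }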
The link checker currently supports two protocols: HTTP and HTTPS (which are treated as one), and FTP.
HTTP/HTTPS
Rules which require staff intervention
- Too many redirects - if a page redirects more than 10 levels deep, the URL is logged and no action is taken. It is up to administrators
to decide what to do with the link. This prevents the link checker from getting stuck in an infinite loop of redirects and never finishing
(a sketch of enforcing this limit appears after this list).
- Unknown protocols - if an unknown protocol is encountered, the URL is logged and no action is taken. This allows technical staff to examine
the resource's protocol, and write code which will check the resource's validity.
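Here is a sketch of how the redirect-depth rule could be enforced by disabling automatic redirects and following Location headers by hand. The limit of 10 comes from the rule above and the 15-second timeout from the conditions described later; the class and method names are illustrative:

    using System;
    using System.Net;

    static class RedirectChecker
    {
        const int MaxRedirects = 10;

        // Follows redirects by hand so the depth can be counted. Returns the final
        // response, or null once the redirect limit is exceeded (log, take no action).
        // 4xx/5xx responses raise WebException, which the caller handles in this sketch.
        public static HttpWebResponse Fetch(string url)
        {
            for (int depth = 0; depth <= MaxRedirects; depth++)
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.AllowAutoRedirect = false;  // count the redirects ourselves
                request.Timeout = 15000;            // the 15-second timeout described below

                var response = (HttpWebResponse)request.GetResponse();
                int code = (int)response.StatusCode;
                if (code < 300 || code >= 400)
                    return response;                // not a redirect: done

                // A real checker would also resolve relative Location values.
                url = response.Headers["Location"];
                response.Close();
            }
            Console.WriteLine("Too many redirects: " + url);
            return null;
        }
    }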
Code-based rules
Code-based rules are used where equality, substring, or regular-expression checks are not possible, or where they would be too inefficient
and slow the link checker down. Examples of code-based rules are:
- Checking for redirects via meta-refresh - Some pages redirect to a URL using a <meta> tag whose http-equiv attribute indicates a
refresh. This type of redirect is normally handled client-side by the browser, but since the link checker is not a browser, it has code
specifically for this. When such a redirect is detected, the link checker checks the page being redirected to (a detection sketch follows this list).
- The body of the page contains only a <noscript></noscript> tag. We consider this to be a black hat technique, and delete the
resource from our system.
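As an example, meta-refresh detection could look something like the following sketch; the regex and names are illustrative rather than the production code:

    using System.Text.RegularExpressions;

    static class MetaRefreshDetector
    {
        // Matches e.g. <meta http-equiv="refresh" content="0; url=http://example.com/">.
        // Assumes http-equiv appears before content, which is the common case.
        static readonly Regex MetaRefresh = new Regex(
            @"<meta[^>]+http-equiv\s*=\s*[""']?refresh[""']?[^>]*content\s*=\s*[""'][^""']*url\s*=\s*([^""'>\s]+)",
            RegexOptions.IgnoreCase);

        // Returns the redirect target if the page body contains a meta-refresh, else null.
        public static string GetRedirectTarget(string html)
        {
            var match = MetaRefresh.Match(html);
            return match.Success ? match.Groups[1].Value : null;
        }
    }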
Conditions Which Result in Immediate Deletion
- Known Bad Protocol – In the early days of the link checker, resources were found whose URLs did not have a valid protocol.
The decision was made to simply remove these resources. These known bad protocols are:
- IOER maintains a table of known 404 pages. Each row in the table is flagged so that the link checker knows whether to treat the URL as an
exact match or as a regular expression (see the sketch after this list). If a resource URL matches a row in this table, whether by exact
match or by regular expression, or is redirected to a page which matches a rule in this table, the resource is immediately deleted from the system.
- IOER maintains a table of known bad titles. Each row in the table is flagged so that the link checker knows whether to check for simple
equality of a title or to treat the title as a regular expression, and each row specifies which FQDN the rule applies to. If a resource's
web page contains a title which matches a rule, whether exactly or by regular expression, the resource is immediately deleted from the system.
- IOER maintains a table of known bad content. All rows in this table are assumed to be regular expressions, and apply to a specific FQDN,
or to all hosts. Applying the rule to all hosts is indicated by putting “all” in the HostName column. If the content of a web page pointed
to by a resource has a match to the Content field of a row in this table, and the HostName matches, the resource is immediately deleted from
the system.
- If a page returns a 404 Not Found, 403 Forbidden, or 410 Gone HTTP status, or is redirected to a page which returns any of these
statuses, the resource is immediately deleted from the system.
- If an “Invalid URI” exception occurs, the resource is immediately deleted from the system.
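The rule-table conditions above share an exact-match-or-regular-expression pattern. A sketch of how one flagged row might be applied; the Known404Page shape is an assumption, and the real schema may differ:

    using System;
    using System.Text.RegularExpressions;

    class Known404Page
    {
        public string UrlPattern;   // an exact URL or a regular expression
        public bool IsRegex;        // flag: how UrlPattern should be applied
    }

    static class RuleMatcher
    {
        // Applies one Known 404 Pages row to a resource URL.
        public static bool Matches(Known404Page rule, string url)
        {
            return rule.IsRegex
                ? Regex.IsMatch(url, rule.UrlPattern, RegexOptions.IgnoreCase)
                : string.Equals(url, rule.UrlPattern, StringComparison.OrdinalIgnoreCase);
        }
    }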
Conditions Which Result in Additional Checks Being Done
Each condition outlined below has an individual, configurable threshold that, once exceeded, results in the resource being deleted from the
system. Any resource with a non-zero count for any of these conditions is flagged for Phase 2 checking (a sketch of this counter-and-threshold
mechanic follows the list).
- The request times out – a timeout occurs if the link checker does not get a response back within 15 seconds. A connection was established
with the host, but it did not return a response before the timeout elapsed.
- Unable to connect – If the link checker is unable to establish a connection to the host. This is a timeout of a different sort. In this
case, a connection to the host could not be established before the timeout elapsed. This can also be caused by a host actively refusing the
connection.
- DNS Errors – A problem occurred when resolving the Host Name to an IP address. These are usually caused by timeouts when contacting the
domain's DNS server, an FQDN that cannot be found, or an expired domain.
- Connection to the server was closed, Receive Failure, or Send Failure are all treated as Unable to connect exceptions.
- 400 Bad Request – Sometimes something is wrong with the host (or the request itself) and the issue is later fixed. If it is not fixed
before the threshold is exceeded, the resource ends up getting deleted.
- 500 Internal Server Error – These can happen for any reason, from a null reference exception to the server attempting to divide by zero.
It's a cryptic response that usually means something is wrong with the page, so the page should be tried again later, after developers
have had time to fix it.
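A sketch of the counter-and-threshold mechanic described above. The condition names and threshold values here are placeholders; the real thresholds come from the link checker's configuration:

    using System.Collections.Generic;

    static class FailureTracker
    {
        // Hypothetical per-condition thresholds; the real values are configurable.
        static readonly Dictionary<string, int> Thresholds = new Dictionary<string, int>
        {
            { "Timeout", 3 },
            { "UnableToConnect", 3 },
            { "DnsError", 2 },
            { "BadRequest400", 2 },
            { "ServerError500", 3 },
        };

        // Increments the counter for a failed condition and reports whether the
        // threshold has been exceeded (delete) or not (flag for Phase 2).
        public static bool ShouldDelete(IDictionary<string, int> counts, string condition)
        {
            int n;
            counts.TryGetValue(condition, out n);
            counts[condition] = n + 1;
            return counts[condition] > Thresholds[condition];
        }
    }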
Table-based rules
Many of the rules for detecting whether a link is bad are stored in tables. By updating a rule in a table, we avoid having to recompile the
link checker every time a new rule is added. The rules are read from the tables each time the link checker starts up, so if you add or update
a rule while running the link checker interactively, you must stop and restart the link checker for the new rule to be read in and used.
Rules in tables are updated on a regular basis. There are three tables currently used for storing the rules.
- Known Bad Content table, where rules are placed when looking for a specific piece of content that may indicate that the resource is not valid or
inappropriate for our audience. An example of a resource that is not valid would be a page that contains the words "page requested
is not found." An example of a resource that may be inappropriate for our audience is a page that contains the words "online casino."
- Known Bad Title table, where rules are placed when looking for a specific title that may indicate the page is not valid.
- Known 404 Pages table, where rules are placed that identify a URL as one that should be treated as a page not found. The rules in this table
differ slightly in that they may apply to all links, or only to links which redirect to a link that matches the rule. This is useful for
resources that were once good but now all redirect to a site's home page: we want to treat the home page itself as a valid resource, but
anything that redirects to it as invalid.
All rule tables can leverage the power of regular expressions for determining whether a given resource matches a rule indicating that the
resource should be deleted from the system. Discussing regular expressions is beyond the scope of this document; an excellent tutorial
on how to use them can be found at http://www.regular-expressions.info.
You can view the current rule sets by clicking the links below to download .csv files containing the rules.
- Download Bad Content Rules
- Download Bad Title Rules
- Download 404 Pages Rules
FTP
Conditions Which Result in Immediate Deletion
The following conditions result in the resource being immediately deleted from the system:
- 550 Not found
- 530 Not logged in
Conditions Which Result in Additional Checks Being Done
Each condition outlined below has an individual, configurable threshold that, once exceeded, results in the resource being deleted from the
system. Any resource with a non-zero count for any of these conditions is flagged for Phase 2 checking.
- DNS errors - see DNS Errors in the HTTP/HTTPS section.
- Unable to Connect - see Unable to connect in the HTTP/HTTPS section.
- Invalid URI - These are sometimes resolved later, so they are passed to Phase 2.
Phase 2 Checking
Phase 2 checking uses the exact same rules as Phase 1, except that it runs only against resources flagged during Phase 1 (those with a
non-zero failure count). All resources flagged for Phase 2 checking are checked each time a Phase 2 check is run. Because the number of
resources needing a Phase 2 check is much smaller than the number checked in Phase 1, this process has not been converted to a
multi-threaded process.
Reporting
At the end of Phase 1 and Phase 2 link checking, a report containing the findings of the Link Checker is generated.
Miscellaneous Utilities
The link checker has four utilities built into it. These are generally run by developers to test changes to the code or to the rules in
the tables, or to make corrections to Elastic Search to reflect the state of the database.
Phase 1 for ID Range
Each resource within IOER is assigned an ID by the system. This utility allows developers to run a link check for a specific resource ID
or a range of IDs. The same rules are applied as in Phase 1 and Phase 2.
Delete Resources from Elastic Search
This utility issues a delete query to Elastic Search for each resource in the link checker database that is flagged as deleted. After every
100 queries, the utility sleeps for 20 seconds in order to play nice with Elastic Search. It runs against the entire Link Checker database,
looking for resources that are flagged as deleted.
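A sketch of that batching behavior, using Elastic Search's REST delete endpoint directly; the index name, document type, and host are assumptions, not the real deployment values:

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading;

    static class ElasticCleanup
    {
        // Issues one delete per deleted resource, pausing 20 seconds after every
        // 100 deletes so Elastic Search is not overwhelmed.
        public static void DeleteResources(IEnumerable<int> deletedResourceIds)
        {
            var http = new HttpClient();
            int issued = 0;
            foreach (var id in deletedResourceIds)
            {
                // "resources" and "resource" are illustrative, not the real names.
                http.DeleteAsync("http://localhost:9200/resources/resource/" + id).Wait();

                if (++issued % 100 == 0)
                    Thread.Sleep(TimeSpan.FromSeconds(20)); // play nice with Elastic Search
            }
        }
    }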
Delete Single Resource from Elastic Search
This utility issues a delete query to Elastic Search for a single resource in the link checker database, provided that resource is flagged as
deleted within the database.
Phase 1 for Host Name
This utility performs a Phase 1 check for all resources that have a given FQDN.
Bad Link Checking Rules
In addition to the too-many-redirects and unknown-protocols rules described under "Rules which require staff intervention" above, one more
URL-level rule applies:
- Known bad protocols - If a page's URL starts with a known bad protocol, the URL is logged and the resource is marked as deleted.
Google Safe Browsing API
Per the Google Safe Browsing API's documentation,
IOER caches the blacklists from the API, and then checks incoming links, both on import and when a user tags a new resource, to see if the
entered link corresponds to a site that Google believes to contain viruses or other malware. If the site appears on a blacklist, our system
rejects the link.
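The document does not detail the lookup itself, and the real Safe Browsing lists are hash-prefix based, so the following is a deliberately simplified sketch assuming the cached blacklist has been loaded into a set of host names:

    using System;
    using System.Collections.Generic;

    static class SafeBrowsingCheck
    {
        // Hypothetical cache, populated elsewhere from the Safe Browsing blacklists.
        static readonly HashSet<string> BlacklistedHosts =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Returns true if the entered link should be rejected.
        public static bool IsMalicious(string url)
        {
            Uri uri;
            if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
                return false; // malformed URLs are handled by other rules
            return BlacklistedHosts.Contains(uri.Host);
        }
    }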
HOSTS file
The HOSTS file used to look for inappropriate sites is periodically downloaded from
http://winhelp2002.mvps.org/hosts.htm and imported into a staging table;
from there, the table containing the list of blacklisted hosts is updated using a stored procedure. Incoming host names are then compared
against this table, and hosts which match a name on the list are excluded from IOER.
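A sketch of extracting host names from the downloaded HOSTS file for the staging table. Standard HOSTS format is an IP address, whitespace, and a host name, with # starting a comment; the MVPS file maps blocked hosts to 0.0.0.0:

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class HostsFileImport
    {
        // Parses a HOSTS file into the host names destined for the staging table,
        // e.g. "0.0.0.0 ads.example.com" yields "ads.example.com".
        public static IEnumerable<string> ParseHosts(string path)
        {
            foreach (var raw in File.ReadLines(path))
            {
                var line = raw.Split('#')[0].Trim();        // strip comments
                if (line.Length == 0) continue;             // skip blank lines

                var parts = line.Split(new[] { ' ', '\t' },
                                       StringSplitOptions.RemoveEmptyEntries);
                if (parts.Length < 2) continue;             // need "<ip> <host>"
                if (parts[1] == "localhost") continue;      // keep the loopback entry out

                yield return parts[1].ToLowerInvariant();
            }
        }
    }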
Checking Uploads for viruses
IOER checks files uploaded to our site for viruses and malware using ClamAV, a free, open-source virus scanner. Our site calls a .NET
wrapper for ClamAV, which passes the file on to ClamAV for scanning.
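The document does not name the wrapper, so as an illustration, here is a minimal sketch that talks to a clamd daemon directly using its INSTREAM command, assuming clamd is listening on localhost:3310:

    using System;
    using System.IO;
    using System.Net;
    using System.Net.Sockets;
    using System.Text;

    static class ClamAvScan
    {
        // Streams a file to clamd via INSTREAM and returns the raw scan result,
        // e.g. "stream: OK" for a clean file or "stream: ... FOUND" for a hit.
        public static string Scan(string filePath)
        {
            using (var client = new TcpClient("localhost", 3310))
            using (var stream = client.GetStream())
            {
                var cmd = Encoding.ASCII.GetBytes("zINSTREAM\0");
                stream.Write(cmd, 0, cmd.Length);

                var buffer = new byte[2048];
                using (var file = File.OpenRead(filePath))
                {
                    int read;
                    while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        // Each chunk is prefixed with its length in network byte order.
                        var size = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(read));
                        stream.Write(size, 0, 4);
                        stream.Write(buffer, 0, read);
                    }
                }
                stream.Write(new byte[4], 0, 4); // zero-length chunk ends the stream

                using (var reader = new StreamReader(stream, Encoding.ASCII))
                    return reader.ReadToEnd().TrimEnd('\0');
            }
        }
    }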
Disclaimer: While IOER uses these methods to provide reasonable assurances that files uploaded to our site
by users are free of viruses and malware, we cannot guarantee that they do not contain some sort of malware or virus. It is prudent for users
to scan files downloaded from IOER, or any other source, with their own anti-virus software before using them. IOER
does not guarantee that content or files linked to by us but hosted elsewhere are malware- or virus-free.