
Web sites change. They exist one moment and then, when you least expect it, “poof”, they disappear. The thing is, they are only likely to be missed if they contained something interesting in the first place, and if they did, there is likely to be a human being somewhere who has linked to them in a document or on a web page. And there’s the rub: who proactively checks the validity of the links on their site or in their documents? Let’s face it, we all have a million better things to be doing, don’t we?
So imagine my surprise this week when I learned that a large proportion of the humans for whom we are developing a shiny new SharePoint document management system spend a significant portion of their working day manually going through metadata resources they have authored, ensuring that the URLs in those resources still take them where they expect to go. This troubled me, and so in a one-hour coding frenzy on the train one evening I tried to address the issue.
The Solution
The configuration file for this code controls the following:
- The delimited list of document libraries that are checked by the service.
- The delimited list of XML fields that are checked (assumes InfoPath XML).
- Whether or not to unpublish documents that are found to contain invalid urls.
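To make those three settings concrete, here is a minimal sketch of how they might be held in memory once read from the configuration file. The class name, the property names and the semicolon delimiter are all assumptions made for illustration; the real configuration file may look quite different.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical settings holder: the names and the ';' delimiter are assumptions.
public sealed class LinkCheckerSettings
{
    public IList<string> DocumentLibraries { get; private set; }
    public IList<string> XmlFields { get; private set; }
    public bool UnpublishOnInvalidUrl { get; private set; }

    public LinkCheckerSettings(string libraries, string fields, string unpublish)
    {
        DocumentLibraries = SplitList(libraries);
        XmlFields = SplitList(fields);

        // Anything that does not parse as "true" leaves documents published.
        bool flag;
        UnpublishOnInvalidUrl = bool.TryParse(unpublish, out flag) && flag;
    }

    // Splits a delimited list, dropping empty entries and surrounding whitespace.
    private static IList<string> SplitList(string raw)
    {
        var result = new List<string>();
        if (string.IsNullOrEmpty(raw))
        {
            return result;
        }

        foreach (string part in raw.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string trimmed = part.Trim();
            if (trimmed.Length > 0)
            {
                result.Add(trimmed);
            }
        }
        return result;
    }
}
```

Defaulting the unpublish flag to false when the value is missing or malformed seemed the safer choice here, since unpublishing a document by accident is more disruptive than leaving it alone.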
The code first enumerates each of the document libraries in the specified SharePoint site and opens each of the files within them that has been modified since the last execution of the service.
It examines the status of each file in the document library, along with the version history of those files, to identify the current major (published) version. Once located, its contents are read into an XmlDocument object.
The contents of the fields specified in the configuration file are then concatenated, and a regular expression is used to identify any URLs contained within those fields.
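The extraction step could be sketched roughly as follows. The class name, the method name and the regular expression itself are assumptions (the pattern is deliberately loose and only looks for http and https links), so treat it as an illustration of the idea rather than the expression the service actually uses.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public static class UrlExtractor
{
    // Assumed pattern: http or https, then everything up to whitespace,
    // a quote or an angle bracket.
    private static readonly Regex UrlPattern =
        new Regex(@"https?://[^\s""'<>]+", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static IList<string> ExtractUrls(XmlDocument doc, IEnumerable<string> fieldNames)
    {
        // Concatenate the text of every configured field.
        var text = new StringBuilder();
        foreach (string name in fieldNames)
        {
            // Match on the local name so the namespace prefix (e.g. "my:") does not matter.
            foreach (XmlNode node in doc.SelectNodes("//*[local-name()='" + name + "']"))
            {
                text.Append(node.InnerText).Append(' ');
            }
        }

        // Collect each distinct URL, minus any trailing sentence punctuation.
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(text.ToString()))
        {
            string url = m.Value.TrimEnd('.', ',', ';', ')');
            if (!urls.Contains(url))
            {
                urls.Add(url);
            }
        }
        return urls;
    }
}
```

Trimming trailing punctuation is a judgment call: it avoids false alarms when a link ends a sentence, at the cost of mangling the rare URL that really does end in a full stop or a bracket.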
Each URL is then checked in turn to ensure that a) it is syntactically correct, b) suitable DNS entries can be located for it, and finally c) it returns an OK HTTP response code when it is accessed.
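Checks a) and b) can be run before any HTTP request is made. The sketch below is one way of doing that, assuming only http and https links are of interest; the enum and class names are hypothetical.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public enum UrlCheckResult { Ok, BadSyntax, DnsFailure }

public static class UrlPreCheck
{
    public static UrlCheckResult Check(string url)
    {
        // a) Syntax: must parse as an absolute http or https URI.
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
        {
            return UrlCheckResult.BadSyntax;
        }

        // A literal IP address needs no DNS lookup.
        if (uri.HostNameType != UriHostNameType.Dns)
        {
            return UrlCheckResult.Ok;
        }

        // b) DNS: the host name must resolve to at least one address.
        try
        {
            IPHostEntry entry = Dns.GetHostEntry(uri.DnsSafeHost);
            return entry.AddressList.Length > 0 ? UrlCheckResult.Ok : UrlCheckResult.DnsFailure;
        }
        catch (SocketException)
        {
            return UrlCheckResult.DnsFailure;
        }
    }
}
```

Separating these two checks from the HTTP request means the report can say why a link failed, for example distinguishing a typo in the address from a domain that no longer exists.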
In the event that one of the URLs contained within the document is invalid, the document may (according to the config settings) be unpublished, and a new SPListItem is added to a dedicated SharePoint list containing all the metadata necessary to identify the problem document, the problem URL and the author.
The code is then deployed as a SharePoint timer job and executed weekly.
The beauty of this solution is that, as developers, we only really need to worry about populating the InvalidDocs list, because the method of alerting users to which documents need to be reviewed can be determined by the users themselves, e.g. email alerts, list view web parts or review workflows.
Anyway, enough talking, here’s the code:
```csharp
// Requires: using System; using System.Net;
// 'userAgent' is a static string field defined elsewhere in the class.
private static bool ValidateUrl(string url, out string message)
{
    bool retVal = true;
    message = string.Empty;

    Uri uri;
    if (Uri.TryCreate(url, UriKind.Absolute, out uri))
    {
        // Try to fetch the page; any failure is reported through 'message'.
        try
        {
            // Created inside the try block so that a non-HTTP scheme (which
            // makes the cast fail) is reported rather than thrown.
            HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create(uri);

            if (!String.IsNullOrEmpty(userAgent))
            {
                objRequest.UserAgent = userAgent;
            }

            // Cookies? Supply a container in case the target site expects one.
            objRequest.CookieContainer = new CookieContainer();
            objRequest.AllowAutoRedirect = true;
            objRequest.MaximumAutomaticRedirections = 5;

            // Dispose the response so the underlying connection is released.
            using (HttpWebResponse objResponse = (HttpWebResponse)objRequest.GetResponse())
            {
                // Anything other than 200 OK counts as an invalid link.
                if (objResponse.StatusCode != HttpStatusCode.OK)
                {
                    retVal = false;
                    message = objResponse.StatusDescription;
                }
            }
        }
        catch (Exception ex)
        {
            message = "An error occurred while querying the specified url: " + ex.Message;
            retVal = false;
        }
    }
    else
    {
        message = "The URL was not well formatted.";
        retVal = false;
    }

    return retVal;
}
```

