|
| What is Screen Scraping and How to do it? |
 |
Thu, 20 Dec 2007 11:11:48 +000 |
In very simple terms, screen scraping means making an HTTP request for a web
page and reading the data out of the response. The simplest way to make an HTTP
request is to use the WebClient class in System.Net, but it has its own
drawbacks; for example, it can fail when it's behind a proxy.
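For readers who still want to try WebClient behind a proxy, here is a minimal sketch that sets the proxy explicitly through the WebClient.Proxy property. The class name WebClientProxySketch, the method name CreateClient, and the proxy address used in the usage note are made up for illustration, and whether this resolves a particular proxy problem depends on how that proxy is set up.

```csharp
using System;
using System.Net;

public static class WebClientProxySketch
{
    // Builds a WebClient that sends its requests through the given proxy.
    // Pass null to keep the default (system) proxy settings.
    public static WebClient CreateClient(string proxyAddress)
    {
        WebClient client = new WebClient();
        if (proxyAddress != null)
        {
            WebProxy proxy = new WebProxy(proxyAddress);
            // Assumption: the proxy accepts the current user's credentials.
            proxy.UseDefaultCredentials = true;
            client.Proxy = proxy;
        }
        return client;
    }
}
```

A call such as WebClientProxySketch.CreateClient("http://proxy.example.local:8080").DownloadString("http://www.google.com") would then fetch the page as a string, assuming that hypothetical proxy exists.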
Then comes the HttpWebRequest class, which has many advanced features and
handles proxies quite well.
Let me explain this alongside some code.
First, create an instance of the HttpWebRequest class -
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
Then let's create an HttpWebResponse object, which will contain the response
returned from the GetResponse() method on our request object. GetResponse()
returns a WebResponse, so it needs a cast -
HttpWebResponse res = (HttpWebResponse)req.GetResponse();
The HttpWebResponse class provides a method called GetResponseStream, which
gives us access to the response body as a stream -
StreamReader sr = new StreamReader(res.GetResponseStream());
The whole method will look like this -
public void Test_Scraping()
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.google.com");
    using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
    {
        StreamReader sr = new StreamReader(res.GetResponseStream());
        Response.Write(sr.ReadToEnd());
    }
}
*In this case I'm just reading the response stream into a string and writing it
to the page response.
Just run the simple code above and you will see the content of the Google home
page. As simple as that.
Now let's see one more feature of this object. Consider the following line -
req.Timeout = 2000;
Here req is the HttpWebRequest object.
In this case we are setting a timeout for the request. If there is no response
from the remote server within 2 seconds (2000 milliseconds), a WebException is
raised. Setting a timeout is always a good idea, as many things can go wrong
when contacting a remote server; in the worst case the remote server may not
exist at all.
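As a rough sketch of how the timeout and the WebException fit together, the helper below wraps the request in a try/catch and returns null on failure. The names TimeoutSketch and TryFetch are made up for this example, and returning null is just one possible way to report the error.

```csharp
using System;
using System.IO;
using System.Net;

public static class TimeoutSketch
{
    // Fetches a page as a string, or returns null when the request
    // times out or fails for another network reason.
    public static string TryFetch(string url, int timeoutMs)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.Timeout = timeoutMs; // applies to GetResponse()
        try
        {
            using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
            using (StreamReader sr = new StreamReader(res.GetResponseStream()))
            {
                return sr.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            // ex.Status is WebExceptionStatus.Timeout when the timeout expired;
            // other values cover cases such as a host name that cannot be resolved.
            Console.Error.WriteLine("Request failed: " + ex.Status);
            return null;
        }
    }
}
```

Calling TimeoutSketch.TryFetch("http://www.google.com", 2000) mirrors the 2 second timeout described above.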
Hope this helps!
|
| Re: What is Screen Scraping and How to do it? |
 |
Thu, 20 Dec 2007 13:39:10 +000 |
That is a nice post above and here is another (similar) sample, FWIW, which was
found somewhere on the web (I forget where) and modified to suit my needs. HTH.
public const int DefaultUrlLengthMin = 16;
public const int DefaultStartTokenLengthMin = 1;
public const int DefaultEndTokenLengthMin = 1;
public const string DefaultUrl = @"http://www.Google.com";
public const string DefaultStartToken = @"";
public const string DefaultEndToken = @"";
private string ScreenScrapeNow(string targetUrl, string startToken, string endToken)
{
    string myReturnValue = "";

    // Validate URL (appending "" turns a null argument into an empty string).
    targetUrl = targetUrl + "";
    targetUrl = targetUrl.Trim();
    if (targetUrl.Length < DefaultUrlLengthMin)
    {
        throw new System.NotSupportedException("The URL is not valid.");
    }

    // Validate start.
    startToken = startToken + "";
    startToken = startToken.Trim();
    if (startToken.Length < DefaultStartTokenLengthMin)
    {
        throw new System.NotSupportedException("The start-token is not valid.");
    }

    // Validate end.
    endToken = endToken + "";
    endToken = endToken.Trim();
    if (endToken.Length < DefaultEndTokenLengthMin)
    {
        throw new System.NotSupportedException("The end-token is not valid.");
    }

    // Use a WebRequest object to fetch the URL.
    WebRequest myWebRequest = WebRequest.Create(targetUrl);
    string myContent;
    // The WebResponse object gets the request's response (the HTML).
    // The StreamReader decodes it as UTF-8 by default, so pages in other
    // encodings may not be read correctly.
    using (WebResponse myWebResponse = myWebRequest.GetResponse())
    using (StreamReader myStreamReader = new StreamReader(myWebResponse.GetResponseStream()))
    {
        // And dump the StreamReader into a string...
        myContent = myStreamReader.ReadToEnd();
    }

    // Get a working RegEx. The tokens are inserted as-is, so any regex
    // special characters in them are treated as pattern syntax.
    Regex myRegEx = new Regex(startToken + "((.|\n)*?)" + endToken, RegexOptions.IgnoreCase);
    // Here we apply our regular expression to our string using the Match object.
    Match myMatch = myRegEx.Match(myContent);
    // Bam! We return the value from our Match (tokens included), and we're in business.
    myReturnValue = myMatch.Value;
    return myReturnValue;
}
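The token-matching step can be tried without any network access. The sketch below is a variation, not the method above: it escapes the tokens with Regex.Escape so they are matched literally, uses RegexOptions.Singleline so that "." also matches newlines, and returns only the text between the tokens rather than the whole match. The names TokenExtractSketch and Between are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenExtractSketch
{
    // Returns the text strictly between startToken and endToken,
    // or an empty string when no match is found.
    public static string Between(string content, string startToken, string endToken)
    {
        // Escape the tokens so characters such as "(" or "." are matched literally.
        string pattern = Regex.Escape(startToken) + "(.*?)" + Regex.Escape(endToken);
        Match m = Regex.Match(content, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : "";
    }
}
```

For example, Between("<title>Hello</title>", "<title>", "</title>") picks out the page title from a downloaded HTML string.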
|
| Re: What is Screen Scraping and How to do it? |
 |
Fri, 21 Dec 2007 06:20:25 +000 |
Thanks for the nice code. I meant to keep mine simple, so that a novice reader
can understand it.
By the way, do you know of any way to bypass the server proxy while doing this
and still get to the required URL? I don't mean setting the flag that bypasses
the proxy for local addresses.
|
| Re: What is Screen Scraping and How to do it? |
 |
Fri, 21 Dec 2007 13:55:33 +000 |
slsp:
By the way, do u know of any way in which we can ByPass the server proxy while
doing this, and still get to the required url ? I don't mean setting the flag
which bypasses the proxy for local addresses.
I am sorry, but I have no idea.
I have not faced that as an issue.
If I do, and if I find a solution, I will post it here.
Thank you.
-- Mark Kamoski
|
| Re: What is Screen Scraping and How to do it? |
 |
Fri, 21 Dec 2007 14:15:01 +000 |
Mark...Take a quick look at your http://adam.weblogicarts.com/ site, it's
crashing with a webhost4life MDF error.
|