I need some help here: I just got a notification from my web administrator that this site is pulling 1 GB of bandwidth per day. My account allows 3 GB per month -- I can't afford this sort of thing. I will probably be moving the site soon, but even the plan I found only allows a little under 15 GB of bandwidth usage a month. Any suggestions as to what could be causing this (beyond the rampant popularity of Tim Blair's site)? I was thinking DNS attacks and things like that. Also, is there anything I can do to make this site less resource-intensive, if that would do anything about the problem?
Update: I took a look at the logfile for Tim's site for just today, and one thing I did notice was that something called QuepasaCreep v0.9.14 hit the site hundreds of times. (I lost count after 110, and I was only a fraction of the way into the logfile.) It seems to be some sort of search engine for Spanish websites -- I don't know why it's hitting this site, since it is not in Spanish. I found instances of other search engines hitting the site today, but they only seemed to do it once or twice. I wonder if that could be the culprit.
Update 2: this is another interesting page full of info. I need to learn how to set up an .htaccess file or edit my robots.txt or something, that is for sure.
Update 3: I've had a communiqué (I use that term 'cos it's neat) with the web administrator over at Cornerhost. He set up a bad-bot-blocking thing in the .htaccess file, so maybe that will help; I also cut down the number of entries showing on Tim Blair's main page, because it was getting pretty huge and just viewing it was sucking bandwidth. Question: would it help if I put some sort of file compression on the blog? Something like the various methods described on this site. I saw the magic phrase "curb your bandwidth usage," so I think I will try out one or the other of these.
Posted by Andrea Harris at July 11, 2003 02:41 AM
Andrea,
Give the guys at WRPN a holler. I saw their ad on K5 some time back and bookmarked them. I believe they offer reasonable hosting packages with unlimited bandwidth.
No, they don't host me, but I'm keeping their bookmark just in case I ever need it.
Posted by: roscoe at July 11, 2003 at 03:24 AM
Can you get at your access logs? Mine contain lines like:
216.39.50.94 - - [10/Jul/2003:02:52:45 -0400] "GET /denarius/images/erf_rp2012.jpg HTTP/1.0" 200 42172 "http://sekweta.pair.com/denarius/alexgallienus.htm" "Scooter/3.3.vscooter"
where 42172 is the size, in bytes, of the data transferred in that particular request. Your provider may arrange the data differently, but if you can get the data, maybe import it to a spreadsheet, and sort it on that size field, you should be able to see what's chewing up bandwidth.
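The spreadsheet idea above can also be sketched as a short script. This is only a rough illustration, assuming the log uses the same layout as the sample line in this comment (Apache's combined format); the function name `bandwidth_by_agent` is made up for this example, and a provider that arranges the fields differently would need a different pattern.

```python
import re
from collections import defaultdict

# Pattern for the layout shown in the sample line above:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def bandwidth_by_agent(lines):
    """Return {user_agent: [hits, total_bytes]} for every line that parses."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are in some other format
        size = match.group("size")
        entry = totals[match.group("agent")]
        entry[0] += 1                               # one more hit
        entry[1] += 0 if size == "-" else int(size)  # "-" means no body sent
    return dict(totals)


# The sample line quoted in this comment, used as test input.
sample = [
    '216.39.50.94 - - [10/Jul/2003:02:52:45 -0400] '
    '"GET /denarius/images/erf_rp2012.jpg HTTP/1.0" 200 42172 '
    '"http://sekweta.pair.com/denarius/alexgallienus.htm" "Scooter/3.3.vscooter"',
]

report = bandwidth_by_agent(sample)
# Biggest bandwidth consumers first.
for agent, (hits, total) in sorted(report.items(), key=lambda kv: -kv[1][1]):
    print(f"{agent}: hits={hits} bytes={total}")
```

To run it against a real log, pass an open file instead of `sample`; the file name would depend on the host. Grouping by user agent is one choice among several: grouping by requested path instead would show which pages or images cost the most.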
Posted by: Ed Flinn at July 11, 2003 at 06:27 AM
Is there a discrepancy between (hits * pagesize) and consumed bandwidth?
Say 100K for an average page size; that would still mean on the order of 10K hits/day to reach the alleged 1 GB/day of traffic.
most of us mortal bloggers get
Posted by: Dog of Flanders at July 11, 2003 at 07:35 AM
And don't be surprised if Tim is pulling three-quarters of that bandwidth all by his lonesome.
I'm paying out the nose for 25 GB a month, about eighteen times what I need, on the off-chance that some day I'll need it. (Suspenders, meet belt.)
Posted by: CGHill at July 11, 2003 at 08:25 AM
I don't know much about this stuff so I might not know what I'm talking about but I think I remember reading somewhere that spam-bots eat up a lot of bandwidth and that there's a way to block them. I know that would be more helpful if I knew HOW but that's all I've got.
Posted by: Lynn S at July 11, 2003 at 08:34 AM
I do have access to my logfiles; I'll check them out. Yeah, it's Tim's site that's pulling all the bandwidth, but it sure seems like a lot of bandwidth, even for a popular site.
Posted by: Andrea Harris at July 11, 2003 at 08:34 AM
Andrea,
Since you are currently without a job, let me see if I can offer some assistance. Feel free to contact me via my email address above.
I have my own web hosting services with a decent connection to the backbone. I can do, I am sure, most of what you need . . . Just to give very basic numbers, and I have no idea if these would work for you, but $300/yr for 2 GB Space/20 GB Traffic or $400/yr for 3 GB Space/30 GB traffic.
We would really need to discuss this more formally . . . plus I can float the cost for a while . . .
Chris
Posted by: Chris at July 11, 2003 at 01:27 PM
I would very much recommend going to WebHostingTalk.com and either searching through the various offers made there, or posting your own request. You'll need to do some research, because you'll find/get lots of good offers, not all of them as good as they sound (i.e. quality of host). But there are quite a few reputable hosts that work that board hard. If you search the board on various hosts, you'll tend to figure out which ones have been around for a while and are reputable.
In your situation, I would look for a cpanel host with WHM reselling (so you can set up your own accounts), whatever amount of bandwidth you think adequate, and a premium net connection (i.e. premium NAC bandwidth or HE.net would work, in the price range you're probably wanting). Some people also like Rackshack-based servers, which tend to be a little cheaper (but quality of b/w doesn't compare to NAC or HE IMO).
Also, you can check gzip compression here:
http://leknor.com/code/gziped.php
It can make a huge difference if you're gzip enabled.
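As a rough illustration of why gzip can matter for a blog page, here is a small sketch that compresses a repetitive block of HTML and compares the sizes. The page content is a made-up stand-in, not Tim's actual index, and this only shows the compression effect itself; it does not show how a web server would be configured to do it.

```python
import gzip


def compression_report(html: str):
    """Return (uncompressed_bytes, gzip_bytes) for a page of HTML text."""
    raw = html.encode("utf-8")
    packed = gzip.compress(raw)
    return len(raw), len(packed)


# Hypothetical stand-in for a blog index: many entries with similar markup.
page = "<html><body>\n" + "".join(
    f'<div class="entry"><h3>Entry {i}</h3>'
    f"<p>Some text for entry number {i}.</p></div>\n"
    for i in range(200)
) + "</body></html>"

raw_size, gz_size = compression_report(page)
saved = 100 * (1 - gz_size / raw_size)
print(f"uncompressed: {raw_size} bytes, gzip: {gz_size} bytes, saved {saved:.0f}%")
```

Real pages are less repetitive than this sample, so the savings will differ. Images such as JPEGs are already compressed and will not shrink much this way; the gain is mostly on HTML, CSS, and other text.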
Posted by: Kevin Whited at July 11, 2003 at 02:34 PM
This:
http://diveintomark.org/archives/2003/07/09/bandwidthsaving_tip_of_the_day.html
is the technology you need. Unfortunately, you'll need your hosting provider to install it, but it's fairly easy to do and I understand you have a good relationship with them. I'm about to implement it myself on a website I'm running to save bandwidth.
Regards, etc...
David
The mod to the .htaccess file should do some good to cut down on the crawlers.
As far as hosting - avoid sites that advertise 'unlimited bandwidth'. It normally isn't an issue for someone running, say, a small-time mom-and-pop business card storefront, getting around 10 or 15 hits/day - but once you start actually using bandwidth, you'd probably find yourself dropped like a hot potato, with some citation of a usage violation.
At SR we are using AN Hosts - we get 500MB storage, 35GB bandwidth/month, and the use of php, MySql, unlimited sub-domains, and we're free to play with our own cgi-bin. Only restriction being, if we put a bad script up and start hogging processor time, we'd get shut off real quick - saw it happen to another user on our machine.
Link to them is on the SR frontpage.
While I wouldn't recommend a package this robust for the occasional or very low volume blogger, it handles MT great, the server farm has multiple OC3 connectivity, and is fully power protected. Data center is in the Chicago area.
Drawback is that they are primarily an E-biz oriented host - their tech guys will make funny faces and tilt their heads to the right on cue if you start talking blog specifics to them.
Posted by: Wind Rider at July 11, 2003 at 04:18 PM
Oh, and I forgot to mention the price - with a year contract, it works out to $6.95/month ($95.00 up front - but that gets you 14 months of service)
Posted by: Wind Rider at July 11, 2003 at 04:19 PM
Someone, somewhere at blogger is getting a big laugh out of this. I'm getting a bit of a smirk myself. Make fun of blogspot if you want, but they did offer reasonably priced hosting (free for most people)
Posted by: Jeremy at July 11, 2003 at 04:40 PM
David already linked to Mark Pilgrim's article about GZip. A few months back he provided an in-depth how-to for blocking all manner of spambots and crawlers. It sounds like your host has already set something like that up for you, but if you want something that explains what's going on, have a look here:
http://diveintomark.org/archives/2003/02/26/how_to_block_spambots_ban_spybots_and_tell_unwanted_robots_to_go_to_hell.html
Posted by: Andrew Duncalfe at July 11, 2003 at 04:48 PM
Make sure your picture files are all located in an identifiable subdirectory, then put two lines in your robots.txt file like:
User-agent: *
Disallow: /picture_directory
...then watch your log files! There are lots of bots out there, and those that don't behave should get this kind of entry:
User-agent: Crappola-Bot 1.0
Disallow: /
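One way to sanity-check rules like the ones above before uploading them is to feed them to Python's standard `urllib.robotparser` and ask what a given bot would be allowed to fetch. This is a sketch, not a guarantee: "FriendlyBot" is a made-up name standing in for any well-behaved crawler, and an individual crawler may interpret the file differently from this parser, or ignore robots.txt altogether.

```python
from urllib.robotparser import RobotFileParser

# The two rule groups from this comment, combined into one robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /picture_directory

User-agent: Crappola-Bot 1.0
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if these rules let the named user agent fetch the path."""
    return parser.can_fetch(agent, path)


# "FriendlyBot" is hypothetical; the paths are examples only.
checks = [
    ("FriendlyBot", "/index.html"),
    ("FriendlyBot", "/picture_directory/cat.jpg"),
    ("Crappola-Bot 1.0", "/index.html"),
]
for agent, path in checks:
    print(agent, path, allowed(agent, path))
```

Since robots.txt is only a request, bots that ignore it still have to be refused at the server, which is what the .htaccess blocking mentioned elsewhere in this thread is for.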
Regarding our email conversation last week: I will see if I can kick up the idle server I mentioned. As far as costs go, I have no idea at the moment, and location may be problematic (the server is located in Brisbane, Australia), but it would have shitloads of spare capacity.
Rgds
PMB
Thanks everyone for the emails of support and all the suggestions. I had to drive a friend to the Daytona Beach airport so I haven't been able to do a great deal today. (Except bite my nails...) Anyway, the short solution is to compress the site, which I am going to try to get done over the weekend. The longterm solution is to move the site to a hosting service with more bandwidth capacity -- unless I find out that all this was because some troglodyte in his ma's basement decided to try putting a DOS attack on the site.
To Jeremy: thanks for the vote of confidence. Yeah. Blowspot was great so long as you didn't mind the fact that three-quarters of the time (and I am being generous) no one could access Tim's site and Tim couldn't post to his own blog. But other than that, it was spectacular!
Posted by: Andrea Harris at July 12, 2003 at 01:22 AM
When I started blocking bots via .hataccess, my hits did a positive nosedive. That will help you a lot, I assure you!
Posted by: Kathy K at July 12, 2003 at 08:39 AM
I highly recommend the article at DiveIntoMark that was mentioned earlier. Vast bandwidth like that is almost certainly due to bots grabbing your whole site repeatedly. On Monday I'm going to talk to my colleague about looking into this as I wouldn't be in the least surprised if the websites we're deploying are coming under a similar attack. mod_rewrite is not for the faint-hearted, but you can do some awe-inspiring things with it.
Posted by: David Gillies at July 12, 2003 at 12:13 PM
QuepasaCreep, btw, is probably an e-mail harvester.
Another one I've found causing problems is 'grub' -- it purports to be a 'distributed' search engine, where people run it on their machine as they surf around -- but if you get a couple of people a day that are running it, they'll both download your entire site on you. Bad puppy. I've got it banned.
Oh, yeah, and that is '.htaccess' not '.hataccess' (preview is my friend).
Posted by: Kathy K at July 12, 2003 at 02:32 PM
Ralf D. Kloth - DL4TA
www.kloth.net > internet > bottrap
How to build a Bot Trap and keep bad bots away from a web site
Block spam bots and other bad bots from accessing and scanning your web site
If my host is on a NT box, does any of this htaccess apply to us?
Posted by: Mrs. du Toit at July 12, 2003 at 10:15 PM
Andrea, Just a point of reference... We're using about 30 GB a month. Most of that is Kim's site. We had to upgrade to a 60GB a month plan at the beginning of the year.
Posted by: Mrs. du Toit at July 12, 2003 at 10:17 PM
I don't know anything about NT servers. I'm used to Apache. In any case, I'm in the midst of moving the site, so stay tuned!
Posted by: Andrea Harris at July 12, 2003 at 10:22 PM
And I have no idea what happened to the line breaks on this post.
Posted by: Andrea Harris at July 13, 2003 at 05:49 AM
Okay, for some reason "convert line breaks" was off on this post. Weird.
Posted by: Andrea Harris at July 13, 2003 at 10:34 AM