Archive for the 'Technical' Category

mkdir too many links

Tuesday, November 11th, 2008

Picture the scene: day three of my Vegas vacation, and I'm hungover, jet-lagged and suffering from Vegas culture shock all at once. I'm so spaced out that I'm struggling to work out the tip on my meal.

Then I get the message that Fab users can't upload photos. I stagger back to the hotel room and, feeling like a technical Hunter S. Thompson, fire up ssh to investigate. It turns out that mkdir is failing with the error 'too many links', and a quick Google search reveals that there is a limit of around 32K subdirectories per directory on ext3, which is way less than I thought.

Fortunately it's not that difficult to resolve: I wrote a migration script to move the album directories around so that each parent directory holds no more than 10,000 subdirectories. Only a couple of helper functions are used to locate photos for display, so I didn't need to change much to get things working again.
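For the record, the bucketing scheme is roughly the idea sketched below. This is only an illustration, not the real migration script: the photos path, the album id and the 10,000-per-bucket constant are example values.

    # Sketch only: bucket album directories by id so that no parent directory
    # ever holds more than 10,000 entries (well under the ext3 ~32K limit).
    # e.g. album 1234567 ends up under photos/0123/1234567/
    album_id=1234567
    bucket=$(printf '%04d' $(( album_id / 10000 )))
    mkdir -p "photos/$bucket/$album_id"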

Web accelerator problems

Friday, October 10th, 2008

I noticed that the cache hit rate on the assets server was pretty low, and as a result we were getting about 30 hits per second for images on the main application server. What's supposed to happen is that although the images start out on the application server, they should all end up cached on the assets server, which acts as a web accelerator, leaving virtually no image hits on the apps server.

It turns out that somehow I'd messed up the Cache-Control headers so they were missing, which I guess made the cache virtually useless. My problem was that I'd forgotten the 'ExpiresActive on' directive, which resulted in all my ExpiresByType directives failing silently.
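The working configuration looks roughly like the sketch below (a minimal example, assuming mod_expires is loaded; the one-week lifetime and the image types are illustrative, not the site's actual values):

    ExpiresActive on
    ExpiresByType image/jpeg "access plus 1 week"
    ExpiresByType image/gif  "access plus 1 week"
    ExpiresByType image/png  "access plus 1 week"

With ExpiresActive on, each ExpiresByType line emits both an Expires header and a Cache-Control max-age, which is what the accelerator needs in order to cache the images.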

Note to self: lynx -head http://www.fabswingers.com/ is a great way to verify that the headers are correct.

While I was at it, I also adjusted the settings for mod_deflate to make sure that all the right pages were being zipped. On Debian, the default is:

AddOutputFilterByType DEFLATE text/html text/plain text/xml

Which I changed to:

AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css

So my (fairly lengthy) stylesheets are now compressed too, which should make a fractional difference to speed and bandwidth.

Server move complete (more or less)

Sunday, August 17th, 2008

I’ve finished the migration more or less. The great news is that the new configuration has fixed the performance problems and it looks like there is now plenty of headroom for growth.

I now have one server running MySQL and the application server, which serves the pages of the site. The other server handles the photos and outgoing mail. The photos are uploaded to the application server but are accessed via a caching proxy (Apache) on the other server.

At the moment the load on the MySQL/application server is no more than about 0.50, but the photo server is double that. Part of the problem is that I haven't yet managed to get mod_expires working, so the caching headers are missing.

However I am now pretty happy that this configuration will take me to at least 2 million pageviews a day. It’s also just a huge relief not to have the site on the edge of collapse!

PS. I have also signed up with rsync.net for backup. The Planet do have backup options but rsync.net is a bit cheaper and also I prefer to have the backup with a completely different provider.

Faster photos

Monday, August 4th, 2008

I've had some more thoughts about how to handle the photo/assets server. Rather than trying to rsync the images from the application server, I'm going to set up a caching proxy (probably using lighttpd, though I may use Apache instead).

This will be massively simpler than periodically rsyncing images over and will require no change to the application. The only negative is that it won’t provide an online backup but I can address that separately.

While I'm migrating the application server to the new hardware, I'll use rinetd to forward requests until the DNS changes have fully propagated.
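rinetd only needs a couple of lines of configuration; something along these lines (the new server's address is a placeholder, not the real IP):

    # /etc/rinetd.conf on the old box: bindaddress bindport connectaddress connectport
    0.0.0.0 80   203.0.113.10 80
    0.0.0.0 443  203.0.113.10 443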

UPDATE: Rather than lighttpd or Apache I think I shall use Squid as the proxy.
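For reference, a minimal Squid reverse-proxy (accelerator) setup is something like the sketch below, assuming Squid 2.6 or later; the origin server's address is a placeholder:

    http_port 80 accel defaultsite=www.fabswingers.com
    cache_peer 203.0.113.10 parent 80 0 no-query originserver name=apps
    acl our_site dstdomain www.fabswingers.com
    http_access allow our_site
    cache_peer_access apps allow our_site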

Rsync problems

Sunday, July 20th, 2008

I’ve had some problems with backup running very slowly — in fact by the time backup of the images has finished, it’s pretty much time to begin again.

However it turns out to be a cock-up on my part. I was using rsync which is only supposed to copy over files that have changed. But if you forget to use the -t or --times option then the modification dates are not copied over which means that every single file needs to be copied over the network to find out whether or not it has changed. Whoops!

As I now have two servers at The Planet which are in different data centers (Dallas and Houston), my backup plan is just to rsync the files between them.
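That boils down to a single cron'd command along these lines (the paths and hostname are examples, not the real layout); -a implies --times, so unchanged files are skipped by the quick size-and-mtime check:

    rsync -az /var/www/photos/ backup.example.com:/backup/photos/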

I am also considering using rsync to allow my applications to upload photos to the assets server; however, I may use Samba instead. NFS is out, though!

UPDATE: After a great deal of thought over the past few weeks I’ve reached a decision. I thought about dozens of alternatives but I needed something very secure, robust (even when there is a network outage) and asynchronous with the file uploads so that the user isn’t kept hanging around when they are uploading their images.

The solution is that photos will be uploaded to the application server and initially served from there, which gives fast, reliable uploads. A cron job will then rsync the photos over to the assets server every few minutes and update the photos table to show that each photo has moved, so future pages will fetch it from the assets server. I think this should work pretty well!
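In outline, the cron job would be something like the sketch below. This is only an illustration: the paths, hostname, database name and the photo table's columns are all assumptions rather than the real schema (and it assumes MySQL credentials come from ~/.my.cnf).

    #!/bin/sh
    # Record a cutoff so we only flag photos that existed before the sync began.
    CUTOFF=$(date '+%Y-%m-%d %H:%M:%S')
    rsync -az /var/www/photos/ assets.example.com:/var/www/photos/ && \
      mysql fabdb -e "UPDATE photo SET on_assets_server = 1
                      WHERE on_assets_server = 0 AND uploaded < '$CUTOFF';"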

It is also quite future-proof: if I switched to using a Content Delivery Network (and I note that Peer1 allow adult sites on their CDN) then this approach would work great for uploading the images to it.

1,000,000 page views a day

Monday, July 7th, 2008

Yesterday we passed 1,000,000 page views in a day for the first time, and hits on the assets server (which is presently on the same box) peaked at 100 hits per second. So, not bad for a single, fairly low-powered server.

From what I've read, The Planet have now upgraded the bandwidth on their virtual rack product to 1 Gbps. This makes it much more useful and I now plan to get two new servers in a virtual rack, one running the application server and one running the database. I'll leave the present server doing mail (because it is such a pain to change the IP address of a mail server) and also serving the assets.

For the new servers I’m not going to bother so much about CPU but I will be looking at memory and the speed of the hard drive as the main issue has been IO.

Turning Lighttpd access log off for performance

Saturday, June 28th, 2008

I took a look at the Lighttpd access log, which is growing by more than 1 GB a day. This is completely useless: it's eating disk space and it's far too big for me to ever run an analysis program on. Not only that, but it's only going to get worse!

As I've studied the performance of the server generally, it's become pretty clear that it is very IO-bound (although things can go a bit slow sometimes, the processors are always mostly idle). So writing a 1 GB log each day isn't going to help.

I do want some stats, so I enabled the statistics module, which tells me that lighttpd is averaging about 60 hits per second (although right now it's 75).
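In lighttpd the relevant config is only a few lines; roughly the sketch below (assuming lighttpd 1.4, where the statistics come from mod_status; the status URL and log path are examples):

    server.modules += ( "mod_status" )
    status.status-url = "/server-status"
    # access logging commented out to save disk IO:
    # accesslog.filename = "/var/log/lighttpd/access.log"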

Since I made the change I think that performance has improved a bit but it’s not been massively significant.

I am also considering moving Lighttpd off to another server completely: although it uses almost no CPU, I think it is using up IO bandwidth. I really need another server anyway, but I'll wait until the July promotions start at The Planet.

MyISAM merge tables to improve performance

Sunday, June 22nd, 2008

I use a MyISAM login table to record all logins to the site (IP address, user agent, username, etc.), which is pretty handy for spotting scammers and people with multiple accounts.

Anyway, the problem is that this table gets very large, very quickly. My solution had simply been to run a delete every month to remove older entries. But this led to two problems:

  1. The table was locked for ages while the delete happened.
  2. The table was then left fragmented (which prevents concurrent inserts on my configuration).

My first thought was to use DELETE … LIMIT 100; in a loop so that the table wouldn't be locked for too long. However that was still not an ideal solution, as it was using a lot of resources and still fragmenting the table.

What I finally settled on was using a new login table for each month and then using a MyISAM merge table to join them all together. Then, I can simply drop the old tables as they go out of date and reconfigure the merge table.
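The monthly rotation looks roughly like the SQL below. This is a sketch with hypothetical table and column names, not the live schema:

    -- One underlying MyISAM table per month:
    CREATE TABLE login_2008_06 (
        username   VARCHAR(30)  NOT NULL,
        ip_address VARCHAR(15)  NOT NULL,
        user_agent VARCHAR(255) NOT NULL,
        created    DATETIME     NOT NULL,
        KEY idx_username (username)
    ) ENGINE = MyISAM;

    CREATE TABLE login_2008_07 LIKE login_2008_06;

    -- The merge table that the application actually reads and writes:
    CREATE TABLE login (
        username   VARCHAR(30)  NOT NULL,
        ip_address VARCHAR(15)  NOT NULL,
        user_agent VARCHAR(255) NOT NULL,
        created    DATETIME     NOT NULL,
        KEY idx_username (username)
    ) ENGINE = MERGE UNION = (login_2008_06, login_2008_07) INSERT_METHOD = LAST;

    -- Each month: point the UNION at the current tables, then drop the expired one.
    ALTER TABLE login UNION = (login_2008_07);
    DROP TABLE login_2008_06;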

I can now delete millions of rows instantly with no performance impact or fragmentation. Result!

Various bits of MySQL performance improvement

Tuesday, June 17th, 2008

It's slightly embarrassing, but I've fallen for the old problem of diving into optimizations without properly measuring first. To be fair, part of the problem is that MySQL does not have the greatest reporting tools, so it is not as easy to spot the problems as it is with other databases.

Anyway, I identified that one of the problems was that some users just had too many messages, so I've implemented some pruning so that users can keep only 600 messages in their inbox/sent mail. This is all the UI allowed them to see anyway, so there will be no noticeable effect for them.
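The pruning itself can be done per user with plain SQL along the lines of the sketch below (hypothetical table and column names, and the user id is just an example); the inner query has to be wrapped in a derived table because MySQL won't otherwise let you delete from a table you are selecting from in a subquery:

    DELETE FROM message
     WHERE to_person_id = 42
       AND id NOT IN (
           SELECT id FROM (
               SELECT id
                 FROM message
                WHERE to_person_id = 42
                ORDER BY created DESC
                LIMIT 600
           ) AS keep_newest
       );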

My improved logging also identified a missing index from my bounce table (which contains all bounced email) so I was able to speed up queries there very substantially.

I've also done a review of the MySQL status variables and it turned out I needed to increase the table cache a bit. After boosting it from the default of 64 to 128 and then 192, I am now seeing that 'Opened_tables' sits at a constant value.
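For reference, the checks are roughly these (the 192 is just what happened to work for me; on MySQL 5.0 the variable is called table_cache, and you would persist the setting in my.cnf as well):

    -- How often has MySQL had to open a table because none was free in the cache?
    SHOW GLOBAL STATUS LIKE 'Opened_tables';
    -- Raise the cache at runtime; add table_cache = 192 to my.cnf to make it stick.
    SET GLOBAL table_cache = 192;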

The major problem that I am left with is the messaging system. It is full of joins, it gets a lot of write traffic, it’s a very large table and is the main contributor to the slow query log. Unfortunately I don’t really have a very clear idea of how to sort this out yet!

Denormalizing the tables

Monday, June 9th, 2008

My latest project is to denormalize some of the heavily used tables. The core 'person' table, which is used for displaying profiles, is fine, but the message, wink and friend-invite tables usually need to be joined with the person table.

I've found that joins in MySQL are very expensive; the slowdown is in the region of hundreds or thousands of times.

So I am going to amend the message table so that instead of simply using a foreign key to reference the person who sent or received the message, it will actually contain the basic details of that person. This will remove the need for the join.

This will give me a problem if people change their username, person type or name, but presently only the last of those can be changed through the web interface. If people do change their name they will just have to accept that messages they have already sent may carry their previous name, which is just what happens with ordinary email anyway.
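The change amounts to something like the SQL sketched below. The column names are hypothetical (the real table no doubt differs); the idea is just to copy the sender's basic details into the message row and backfill the existing data once:

    ALTER TABLE message
        ADD COLUMN from_username    VARCHAR(30) NOT NULL DEFAULT '',
        ADD COLUMN from_person_type VARCHAR(20) NOT NULL DEFAULT '';

    -- One-off backfill from the person table, after which the inbox
    -- queries can read these columns instead of joining on person.
    UPDATE message m, person p
       SET m.from_username    = p.username,
           m.from_person_type = p.person_type
     WHERE m.from_person_id   = p.id;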

At the same time I have also decided to re-organize the photos directory. At the moment there is a directory for every member with uploaded photos at the top level, which means I have a single directory containing 16,000 photo directories. To be honest the performance of this on ext3 isn't bad, but it's clearly not going to scale. Under the new regime there will be a maximum of 10,000 entries per directory.

I hope this works out as well as the summary tables have. Since I made that change, the member home page has disappeared completely from the log of slow pages (slow = more than 400 ms).