Large scale databases with PHP/Mysql

by @ 8:38 am on December 20, 2007.  Filed under Mysql, PHP

I’m currently faced with a paradox. I have a huge database (one table is more than 84 million records) that requires daily updates of about 2 m records and probably daily inserts of about 500k or more. Currently, I am trying to achieve this with Mysql however I am finding that the server I’m hosting the application is far from ideal. So my dilemma is, do I spend my hard earned money on a beefy server (>$400 a month) or do I build one and keep it here at my house on my Verizon Fios fiber connection. If I do the first option, it eats into my hard earned cash. If I do the second option, I have to figure out how to connect my home box to my live website in a reliable manner.

The downside to the fios connection is, I do not get a static IP. Also, I have a suspicion that they are blocking incoming ports.

I’ve decided to take the latter route, build a machine for home that will host the database on some RAID1 drives. I have a few SCSI 320 drives as well, so a RAID 10 setup would be optimal. But my first attempt is going to be on an SATA Raid card made by Silicon Image. The complexities of this application are too numerous to go into, but the general goals are to parse large text files, analyze them and store the results in the database. The issue lies in the difficulty of applying these massive updates to Mysql in a way that won’t take half a day to process. It’s a two part equation. The server that is doing the data analysis has to be respectable in terms of hardware/cpu/memory, however, the live web server which also hosts a “copy” of the database for the website to actually use. Uploading the daily updates to the live server will be the bottleneck in terms of system performance. Tests will tell if my RAID on SATA will be of any benefit. If not, I will switch to my SCSI RAID and work from there. I’ve been searching Google for answers on the topic of large scale data warehouses and data mining but I am coming up empty handed. Got any ideas?

Get the most out of your Mysql or Postgresql database server

by @ 1:34 am on December 2, 2007.  Filed under Mysql, Postgresql

If you run your own dedicated server, you have probably been faced with, or at least you have considered, the problem of one website getting so much traffic it brings down all the other sites. There’s also the possibility that a poorly written piece of code will cause the webserver to go into an endless loop. In some places I’ve seen companies use a VPS system for customer websites, and a standalon database server using Postgresql. The concern was that a single customer site getting heavy traffic, or worse yet, attacked, could bring down everyone because of the single point of failure that existed in the database server.

The solution that was proposed was to configure multiple instances of Postgresql, one for each client, running on separate ports. This same idea would work for Mysql as well. This adds a level of security as well, since you could then give the client super human access to their instance, and there’s no way they could see or manipulate databases from other clients. Another reason for doing this is to fine tune each customer’s database server to meet their demands. A lightly loaded server would need fewer processes/memory than a heavy duty one. In a VPS environment, if you ran mysql or Postgresql, you would in effect have an isolated environment anyway, so this is a moot point. It’s the shared hosting environments that benefit greatly.

I’d like to mention that you can easily setup Apache this way too. This would give customers a completely isolated process space for their website on a single dedicated “shared hosting” machine. Granted, there are some caveats, namely, if you’re using CPanel or other hosting control panels, they of course would not know how to manage these custom configurations. Can you think of any other negatives here?

MySQL Database format wars - InnoDB vs MyISAM

by @ 5:07 pm on December 1, 2007.  Filed under Mysql

I’ve read some articles recently on the speed differences between InnoDB and MyISAM database formats under MySQL. The articles suggest that InnoDB is faster in many respects than MyISAM. On the other hand, many people will tell you that MyISAM is not a relational database format, and can’t be compared to InnoDB. So, which database format is better, and why should you care?

Well, if you’re doing much with MySQL such as hosting your forums, blog, or other dynamically published information on your website, you’ll want to know which is more appropriate for your site. MySQL’s default database engine is called MyISAM which is a flat file, B-tree indexed database engine. It is very fast (if indexed properly). So fast, in fact, it would be hard to find any relational database system thats faster in basic queries. The InnoDB database engine is a transactional, relational database engine that supports all of the SQL buzzwords like stored procedures, views, transactions, ACID compliance, etc. So which is right for you? That’s a loaded question with many answers. But unless you have full control over your server, chances are your ISP doesn’t have MySQL configured to use InnoDB, so the question might be pointless. However, if you run your own server, you can probably reconfigure MySQL to use InnoDB. The question is, should you?

For anything but raw power and speed necessities, I would recommend InnoDB for one simple reason: Fault tolerant. I have been personally burned on more than one occasion by a corrupt MySQL table. This may happen if your hard drive filles up or your server crashes unexpectedly, leaving the database in an unsafe state. InnoDB uses a transaction log and can create checkpoints which help to keep your database healthy in times of troubled weather. MyISAM doesn’t have this feature.

But recently I had the need to load a very large dataset and de-dupe the tables. The source files totaled over 200 million lines, which needed to be parsed and imported. I tried Microsoft SQL Server 2005 with this project, and it took all night to import it using the integration tools. I created some custom integration package with some VB script to parse the files. Microsoft SQL Server is of course a better choice for large scale data warehouses because it can take full advantage of Windows’ superior threading, but for this project I needed something fast, simple, and portable.

I imported one of the files which contains about 2 million records. I created the table using the InnoDB engine. I was very disappointed at the results. I’ll let you be the judge:

LOAD DATA INFILE ‘/work/nameservers.csv’ INTO TABLE nameservers FIELDS TERMINATED BY ‘,’ OPTIONALLY ENCLOSED BY ” (nameserver,nameserverid);

The table structure is simply nameserverid int, nameserver varchar(80);

InnoDB results

Query OK, 1856807 rows affected (1 hour 14 min 48.78 sec)
Records: 1856807 Deleted: 0 Skipped: 0 Warnings: 0

MyISAM results:

mysql> LOAD DATA INFILE ‘/work/nameservers.csv’ INTO TABLE nameservers FIELDS TERMINATED BY ‘,’ OPTIONALLY ENCLOSED BY ” (nameserver,nameserverid);
Query OK, 1856807 rows affected (18.66 sec)
Records: 1856807 Deleted: 0 Skipped: 0 Warnings: 0

Furthermore, indexing is also much faster with MyISAM:

InnoDB:
ysql> create index ix_ns on nameservers2(nameserver);
Query OK, 1856807 rows affected (10 min 16.35 sec)
Records: 1856807 Duplicates: 0 Warnings: 0

MyISAM:
mysql> create index ix_ns on nameservers(nameserver);
Query OK, 1856807 rows affected (36.57 sec)
Records: 1856807 Duplicates: 0 Warnings: 0

Am I missing something? I might be wrong in my overall assessment, but for shear speed, it’s hard to beat MyISAM. What’s your experience with large scale MySQL databases?

web·pit n. a place of discovery; a repository of information; where coders come for enlightenment; a programmers diary.

internal links:

categories:

search:

archives:

August 2008
M T W T F S S
« Mar    
 123
45678910
11121314151617
18192021222324
25262728293031

other:

Great Places

  • websiterepairguys.com
    Millions of websites need fixing. Here's the man for the job!
  • mydnsmanager.com
    Free DNS hosting, domain parking and domain name research tools
  • phpfreaks.com
    Excellent resource for all things PHP
  • boonex.com
    Boonex makes great, FREE, social networking software called Dolphin Smart Community Builder

About Me

My name is Mark Williamson and I am somewhat of a PHP freelancer and all around hacker. I have many years of experience in software development, covering a wide variety of languages, databases, and functional areas. From website content management systems, medical billing, website load testing, billing automation, and sales automation, I can pretty much say I've seen a few million lines of code. I am usually available for after-hours contract work (I have a full time day job). If you need assistance implementing ideas for your website, contact me and I'll give you an honest opinion on what it would take. You can see my other website at WebsiteRepairGuys .com (link at the top of this column).
All content Copyright (c) 2008, Mark Williamson