Thursday, January 10, 2013

How to implement really small and fast ORM with PHP (Part 7: IDE)

Fork me on GitHub

Queries are gaining more and more complexity, data is getting bigger and bigger. Most optimizations in database technology are done in the database server. This is an approach to optimize queries on the client side.

With this ORM, queries ...
  • don't select more data than needed
  • contain less joins when data is expected to be consistent
  • can be written manually in pure SQL
  • are not written in a new query language

We need a good API, so ...

  • it should be easy to learn
  • method names must be short and intuitive
  • the goal is to map datasets and relations to objects
  • the API should offer method chaining
  • special features like auto-increments should be included
  • the code should be small, no getters and setters
  • the database schema is created before writing PHP code
  • relationships should be defined in the database, not in the code
  • we get low latencies combined with low memory usage

To make things easier, we make some restrictions:

  • only UTF-8
  • only MySQL/MariaDB (mysqli)
  • only PHP 5.4.0+
  • only buffered queries

Our ORM should have the same efficiency as handwritten SQL. So the following statements should produce only 1 query:

  • DBo::Guestbook(42)->Comments()->delete(); // "Guestbook" and "Comments" are tables
  • DBo::Guestbook(42)->Comments()->update('hidden', 1);
  • DBo::Guestbook(42)->Comments()->count();
  • DBo::Student()->Attend()->Lecture()->Uses()->Book(); // "Uses", "Book", etc. are tables

The following statements should only update 1 column:

  • DBo::Guestbook(42)->Comments()->update('hidden', 1);
  • DBo::Guestbook(42)->update('title', 'hello');

The following statements should only select 1 column (that's the hard one):

  • foreach (DBo::Guestbook() as $gb) echo $gb->title;

This gets implemented with a two-pass method: We store the columns used in the first run in DBo::usage_col[code position] and reuse it in the second run. The "code position" is the line where the constructor was called (returned from debug_backtrace()).

Meta data like columns or indexes should be fetched statically into the code (not during runtime). To create a join out of "Guestbook()->Comments()", the relevant columns are chosen during runtime. The syntax is:
  • table.primary_key_field = other_table.table_primary_key_field
  • e.g. sale.id = salepos.sale_id

To export the schema, we use:

// export schema to schema.php
DBo::conn(new mysqli('127.0.0.1', 'root', 'some_pw'));
DBo::exportSchema();

To keep control over the queries, we allow normal queries and debugging:

$entries = DBo::query('SELECT * FROM guestbook WHERE id=?', [42]); // Iterator
foreach ($entries as $obj) {...}

$id = DBo::query('INSERT INTO guestbook VALUES (...)'); // LastInsert ID
// 42

$affected = DBo::query('UPDATE guestbook SET active=0'); // Affected rows
// 10

$subject = DBo::value('SELECT subject FROM guestbook WHERE id=42'); // String
// 'Hello World'

$categories = DBo::values('SELECT DISTINCT categories FROM guestbook'); // Array
// ['Sports', 'Movies', 'Music']

$row = DBo::one('SELECT * FROM guestbook WHERE id=42'); // Array
// [id=>42, title=>'hello']

$row = DBo::keyValue('SELECT id,title FROM guestbook'); // Array
// [42=>'hello', 43=>'world']

$row = DBo::keyValues('SELECT id,title,subject FROM guestbook'); // Array
// [42=>['title'=>'hello', 'subject'=>'world'], 43=>['title'=>...]

echo DBo::Guestbook(42)->Comments(); // SQL String
// SELECT a.* FROM Comments a, Guestbook b WHERE ...

echo DBo::Guestbook(42)->Comments()->explain(); // explain SQL string
// EXPLAIN SELECT ... id | select_type | table | type ...

DBo::Guestbook(42)->Comments()->print_r(); // print_r related comments
// Array( id=... )

We also allow transactions:

DBo::begin();
$dbo = DBo::Guestbook(42);
$dbo->Comments()->delete(); // DELETE FROM Comments ...
$dbo->update('comments_count', 0); // UPDATE Guestbook SET ...
DBo::commit();

Here are some examples how the ORM should work:

// create a new entry in table "Guestbook"
// set attribute values to "hello" and "world"
// finally print out the primary key (auto-generated by auto-increment)
// - gives INSERT INTO Guestbook SET subject='hello', details='world'
$obj = DBo::Guestbook();
$obj->subject = 'hello';
$obj->details = 'world';
echo $obj->insert(); // 43 (from auto_increment)
echo $obj->id; // 43
// or $_POST = ['subject'=>'hello', ...]
echo DBo::Guestbook()->insert($_POST); // 43

// map entry in table "Guestbook" with primary key "42" to "$obj"
// - gives SELECT * FROM Guestbook WHERE id=42
$obj = DBo::Guestbook(42);
if (!$obj->exists()) {...}
echo $obj->subject;
// or
echo DBo::Guestbook(42)->subject;

// update antry in table "Guestbook" with primary key "42", set "hidden" to "1"
// - gives UPDATE Guestbook SET hidden=1 WHERE id=42
$obj = DBo::Guestbook(42);
$obj->hidden = 1;
$obj->update();
// or
DBo::Guestbook(42)->update('hidden', 1);
// or $_POST = ['hidden'=>'1', ...]
DBo::Guestbook(42)->update($_POST);

// delete entry in table "Guestbook" with primary key "42"
// - gives DELETE FROM Guestbook WHERE id=42
DBo::Guestbook(42)->delete();

// increment a field in table "Guestbook" with primary key "42"
// - gives UPDATE Guestbook SET likes=likes+1 WHERE id=42
DBo::Guestbook(42)->update('likes=likes+1');

Doing 1:n and n:m relations should be also very easy:

// get all Comments for Guestbook entry with primary key 42
// join a 1:n relationship (table.id = table2.table_id)
// - gives SELECT * FROM Comments WHERE guestbook_id=42
$comments = DBo::Guestbook(42)->Comments();
foreach ($comments as $comment) {...}

// update comments
// - gives UPDATE Comments SET hidden=1 WHERE guestbook_id=24
// - note that traditional ORMs do one update statement for each dataset
DBo::Guestbook(42)->Comments()->update('hidden', 1);

// update comments with where predicate
// - gives UPDATE Comments SET active=0 WHERE active=1 AND guestbook_id=24
DBo::Guestbook(42)->Comments('active=1')->update('active', 0);

// deleting comments works in the same way
// - gives DELETE FROM Comments WHERE guestbook_id=24
DBo::Guestbook(42)->Comments()->delete();
// n:m Students attend Lectures
// - gives SELECT a.* FROM Lecture a, Attend b WHERE b.student_id=21 AND
//   b.lecture_id = a.id
$lectures = DBo::Student(21)->Attend()->Lecture();
foreach ($lectures as $lecture) {...}

// n:m Students attend Lectures, Lecture uses Books
// - gives SELECT a.* FROM Book a, Uses b, Lecture c, Attend d
//          WHERE d.student_id = 21 AND d.lecture_id = c.id
//            AND c.id = b.lecture_id AND b.book_id = a.id
$books = DBo::Student(21)->Attend()->Lecture()->Uses()->Book();
foreach ($books as $book) {...}

Sometimes it is better to avoid normalization and store multiple values inside a string. This reduces the number of tables, relations and costly joins, e.g. using a string value like "100,101,102" instead of a join. The data can be also encoded as a JSON string with '[100,101,102]' or '["100","101","102"]'. To do the encoding and decoding automatically, the names of the members can be amended with "_arr" and "_json". Here is an example:

// automatic encoding and decoding of values
// - gives UPDATE Guestbook SET tags_arr='sport,music,tv' WHERE id=42
//         UPDATE Guestbook SET tags_json='{"a":"b","c":"d"}' WHERE id=42
$obj = DBo::Guestbook(42);
$obj->update('tags_arr', ['sport','music','tv']); // field tags (Varchar)
// or
$obj->update('tags_json', ['a'=>'b', 'c'=>'d']); // field tags2 (Varchar)

$obj = DBo::Guestbook(42);
print_r($obj->tags_arr); // Array([0] => sport\n [1] => music\n [2] => tv)
// or
print_r($obj->tags_json); // Array([a] => b\n [c] => d)

Predicates can be defined in several ways:

// select one dataset
// - gives SELECT * FROM Guestbook WHERE id=10
DBo::Guestbook('id=10');
DBo::Guestbook('id=?', 10);
DBo::Guestbook(['id'=>10]);
DBo::Guestbook(10); // id is a numeric primary key

// select multiple datasets
// - gives SELECT * FROM Guestbook WHERE id IN (10,11,12)
DBo::Guestbook('id in (10,11,12)');
DBo::Guestbook('id in ?', [10,11,12]);
DBo::Guestbook(['id'=>[10,11,12]]);
DBo::Guestbook([10,11,12]); // id is a primary key

// select multiple primary keys
// - gives SELECT * FROM Guestbook WHERE (id,id2) IN ((10,11))
DBo::Guestbook('id=10 and id2=11');
DBo::Guestbook('(id,id2) in ?', [10,11]);
DBo::Guestbook('id=? and id2=?', 10, 11);
DBo::Guestbook(['id'=>10, 'id2'=>11]);
DBo::Guestbook([[10,11]]); // id and id2 are a primary key

// select multiple datasets with multiple primary keys
// - gives SELECT * FROM Guestbook WHERE (id,id2) IN ((10,1),(11,2))
DBo::Guestbook('(id,id2) in ((10,1), (11,2))');
DBo::Guestbook('(id,id2) in ?', [[10,1], [11,2]]);
DBo::Guestbook([[10,1], [11,2]]); // id and id2 are a primary key

// additional predicates
DBo::Guestbook(10, "active=1");
DBo::Guestbook("active=1 and foo=?", "bar");

// limit
foreach (DBo::Order("status=open")->limit(100) as $obj) {...

Joins are skipped when data is expected to be consistent:

foreach (DBo::guestbook(42)->comments() as $comment) {...}
// gives SELECT * FROM comments WHERE guestbook_id=42

Custom SQL can be used:

$comments = DBo::object("SELECT * FROM Comments WHERE guestbook_id=?", [42]);
foreach ($comments as $comment) {...}

Aggregate functions can be used:

// SELECT count(*) FROM order where customer_id=42
DBo::Customer(42)->Order()->count(); // 40

// SELECT sum(price) FROM order where customer_id=42
DBo::Customer(42)->Order()->sum('price'); // 600

// SELECT avg(*) FROM order where customer_id=42
DBo::Customer(42)->Order()->avg('price'); // 23

// SELECT stddev(*) FROM order where customer_id=42
DBo::Customer(42)->Order()->stddev('price'); // 8

Data can be archived:

// archiving is not yet implemented

// create archive table
// CREATE TABLE Customer_archive LIKE Customer;
// # remove primary & auto_increment, add index and timestamp
// ALTER TABLE Customer_archive DROP primary key, MODIFY id int,
//   ADD index(id), ADD ts timestamp;

// archive record before updating
// Customer.id = primary key, Customer_archive.id = index
// INSERT INTO test.Customer_archive
//   SELECT a.*, now() FROM test.Customer a WHERE a.id=42
// UPDATE test.Customer a SET a.hidden=1 WHERE a.id=42
DBo::Customer(42)->archive()->update('hidden', 1);

// copy record to table
// INSERT INTO somedb.sometable
//   SELECT a.* FROM test.Customer a WHERE a.id=42
DBo::Customer(42)->copyTo('somedb.sometable');

Custom classes can also be used:

class DBo_Sales extends DBo {
  public function buildData($insert=false) {
    // custom validation
    $data = parent::buildData();
    if (empty($data["some_val"])) throw new Exception(...);
    return $data;
  }

  public function insert($arr=null) {
    // execute some pre trigger
    parent::insert($arr);
    // execute some post trigger
  }

  public function delete() {
    if ($this->status != 'draft') throw new Exception(...);
    parent::delete();
  }

  public function completed() {
    return DBo::query('SELECT * FROM sales WHERE completed=1');
  }

  public function get_age() {
    return date_diff(new DateTime($this->birthdate), new DateTime())->y;
  }
}

// DBo_{table} is automatically used as class
print_r(DBo::Sales());
=> DBo_Sales Object (...

print_r(DBo::Sales()->completed());
=> returns completed() from Dbo_Sales

print_r(DBo::object("SELECT * FROM Sales")->completed());
=> returns completed() from Dbo_Sales

// get_{field}() is automatically used when the member not exists
print_r(DBo::Sales()->age);
=> returns get_age()

The database connection should be opened when the first query is being executed. We use:

class mysqli_lazy extends mysqli {
  public function query($query) {
    if (!@$this->host_info) parent::connect('127.0.0.1', 'root', '', 'db');
    // or persistent connection: 'p:127.0.0.1'
    return parent::query($query);
  }
}
DBo::conn(new mysqli_lazy, 'db');

// instead of
DBo::conn(new mysqli('127.0.0.1', 'root', '', 'db'), 'db');
// or persistent connection: 'p:127.0.0.1'

To get all queries on stdout, we use:

class mysqli_log extends mysqli {
  public function query($query) {
    echo $query."\n";
    return parent::query($query);
  }
}
DBo::conn(new mysqli_log('127.0.0.1', 'root', '', 'db'), 'db');

The fastest way to get data is reading it from a hash table in the main memory. With the APC extension, we can persist data between many requests. The syntax for caching looks like this:

// caching is not yet implemented

// cache categories for 60 seconds
$payments = DBo::Categories()->cache(60);
// [{id=>0, name=>Sports}, {id=>1, name=>Movies}, ...]

$payments = DBo::Categories()->oarray(60);
// [[id=>0, name=>Sports], [id=>1, name=>Movies], ...]

$payments = DBo::Categories()->ovalues('col_name', 60);
// [Sports, Movies, ...]

$payments = DBo::Categories()->okeyValue('col_id', 'col_name', 60);
// [0=>Sports, 1=>Movies, ...]

$payments = DBo::Categories()->count(60); // 42

Broken queries or connection errors can be handled with try-catch:

try {
  DBo::conn(new mysqli('127.0.0.1', 'root', '', 'db'), 'db');
  DBo::query('select * from invalid');
}
catch (mysqli_sql_exception $e) {
  echo $e->getMessage();
  exit(1);
}

If column names are ambiguous, we need to prefix them with "@":

// SELECT a.* FROM app a, os b WHERE a.os_id=b.id AND id=42 AND id=13
DBo::os('id=42')->app('id=13');

// SELECT a.* FROM app a, os b WHERE a.os_id=b.id AND b.id=42 AND a.id=13
DBo::os('@id=42')->app('@id=13');

Security:

Scalar parameters are escaped automatically if the previous parameter is a string. When using scalar parameters without predicates, you need to cast them manually:

DBo::Guestbook('id=?', $_GET['id']); // automatic
DBo::Guestbook((int)$_GET['id']); // manual!
DBo::Guestbook((array)$_GET['ids']); // manual!

Other parameters and field names are automatically escaped:

DBo::Guestbook([ $_GET['id'], $_GET['val'] ]); // automatic
DBo::Guestbook([ 'id' => $_GET['id'] ]); // automatic

DBo::Guestbook()->setFrom($_POST)->insert(); // automatic
DBo::Guestbook()->limit($_POST['limit']); // automatic

IDE integration: (e.g. PhpStorm)

// auto-generation of PHPdoc hints is not yet implemented

// type hint
$sales = DBo::Sales(); /* @var $sales DBo_Sales */

// or class hint
/**
 * @method static DBo_Sales Sales some description
 */
class DBo implements IteratorAggregate {...}

// or extended class hint
/**
 * @method static DBo_Sales Sales some description
 */
class HDBo extends DBo {}

$sales = HDBo::Sales();

/**
 * @property int id
 * @property decimal_6_2 price
 * @property varchar_40 desc
 */
class DBo_Sales extends DBo {}





Keyboard shortcuts:
- auto-complete: Ctrl+space
- documentation: Ctrl+q
- variable info: Ctrl+mouseover

The implementation in detail:

Running DBo::Student()->Attend()->Lecture()->array() as a single query can be implemented by using a stack. The query itself presents a chain of tables being joined by some predicates. Attend() and others are handled by PHP's magic __call(). Every table in the chain pushes a new element on the stack containing the name of the table, the primary key and the required predicate(s).

Even with the best ORM, you can still write bad code:

foreach (DBo::os(10)->app() as $app) {
  if (!$app->active) continue; // bad
  ...
}
foreach (DBo::os(10)->app('active=1') as $app) {...} // good, less data

foreach (DBo::os(10)->app() as $app) {
  foreach ($app->compontent() as $compontent) {...} // bad, many queries
}
foreach (DBo::os(10)->app()->compontent() as $compontent) {...} // good

foreach (DBo::os(10)->app() as $app) $app->update('active', 1); // bad
DBo::os(10)->app()->update('active', 1); // good, 1 query

foreach (DBo::sale(10)->salepos() as $salepos) {
  if (DBo::logistics($salepos->id, 'shipped=1')->exists()) {
    $salepos->update('complete', 1); // bad
  }
}
DBo::sale(10)->salepos()->logistics('shipped=1')
  ->salepos()->update('complete', 1); // good

In general, it is better to select as little data as possible from the database. Joins are often expensive, but if the database handles them correctly, it is much more efficient than performing joins directly in PHP.

Changelog:


The code is available at: https://github.com/thomasbley/DBo (~500 loc, test coverage 100%)


5 comments:

  1. Would it be possible to see the ORM class(es)? It's hard to realize the examples above without the class.

    ReplyDelete
  2. The code and the article are not yet final, but I think they will be until Christmas.

    ReplyDelete
  3. Thanks Thomas - looking forward to it!

    ReplyDelete
  4. Hi Thomas,

    Thanks again for a great article and class.

    I really like your blogs and read and used some of your code (Router) and I also agree with your way of thinking. Small well written classes even if limited in definition (eg MySQLi) are the way to go. Keep the footprint small so you can oversee and know the ins and outs of your lib.

    What I was wondering is if you have already been using this class in production and have some updates about how stable it is.

    I'll be be digging into it it shortly as I've build an ORM like this an want to update it to the new possibilities of php.

    Grtz!

    ReplyDelete

Labels

performance (23) benchmark (6) MySQL (5) architecture (5) coding style (5) memory usage (5) HHVM (4) C++ (3) Java (3) Javascript (3) MVC (3) SQL (3) abstraction layer (3) framework (3) maintenance (3) Go (2) Golang (2) HTML5 (2) ORM (2) PDF (2) Slim (2) Symfony (2) Zend Framework (2) Zephir (2) firewall (2) log files (2) loops (2) quality (2) real-time (2) scrum (2) streaming (2) AOP (1) Apache (1) Arrays (1) C (1) DDoS (1) Deployment (1) DoS (1) Dropbox (1) HTML to PDF (1) HipHop (1) OCR (1) OOP (1) Objects (1) PDO (1) PHP extension (1) PhantomJS (1) SPL (1) SQLite (1) Server-Sent Events (1) Silex (1) Smarty (1) SplFixedArray (1) Unicode (1) V8 (1) analytics (1) annotations (1) apc (1) archiving (1) autoloading (1) awk (1) caching (1) code quality (1) column store (1) common mistakes (1) configuration (1) controller (1) decisions (1) design patterns (1) disk space (1) dynamic routing (1) file cache (1) garbage collector (1) good developer (1) html2pdf (1) internationalization (1) invoice (1) just-in-time compiler (1) kiss (1) knockd (1) legacy code (1) legacy systems (1) logtop (1) memcache (1) memcached (1) micro framework (1) ncat (1) node.js (1) openssh (1) pfff (1) php7 (1) phpng (1) procedure models (1) ramdisk (1) recursion (1) refactoring (1) references (1) regular expressions (1) search (1) security (1) sgrep (1) shm (1) sorting (1) spatch (1) ssh (1) strange behavior (1) swig (1) template engine (1) threads (1) translation (1) ubuntu (1) ufw (1) web server (1) whois (1)