Wednesday, October 17, 2012

Automated Foreign Key Cached Dictionary Generation in SubSonic 3 : Part 2

In part one, we tackled the problem of what Foreign Key (FK) relationships in a database could look like when translated into the object world.
We made an initial, and not terribly successful, attempt to provide these methods.

At this stage, several design points became obvious:

  • there was a need to cache the entire set of objects
  • this would best be done using a static list / dictionary
  • the least confusing way to provide the FK lookups would be for each cached table to provide lookup dictionaries to itself, indexed by FK column value (rather than one table storing data about another)
  • we might also want lookups for any uniquely indexed column, not just the columns involved in an FK
  • it would be great to have the option to cache only some tables/objects and not others, and have the object seamlessly detect and deal with this - retrieve cached values if present, or query the database otherwise

Cached Lists

The cached lists were declared in a normal class, which could later be declared as the static member of an application.
This is a sample of the objects for the Order table.
public class DataCache {
    public List<Order> Order_BaseList = null;                                // all rows, loaded once
    public Dictionary<int, Order> Order_By_OrderID = null;                   // unique-index lookup
    public Dictionary<string, List<Order>> Order_GroupBy_CustomerID = null;  // FK child lists
    public Dictionary<int, List<Order>> Order_GroupBy_EmployeeID = null;
    public Dictionary<int, List<Order>> Order_GroupBy_ShipVia = null;

    public DataCache() {
        Order_BaseList = Order.All().ToList();
        foreach (IHasDataCache hdc in Order_BaseList) { hdc.CachedData = this; }
        Order_By_OrderID = Order.OrderID_CreateLookupList(Order_BaseList);
        Order_GroupBy_CustomerID = Order.CustomerID_CreateFkChildList(Order_BaseList);
        Order_GroupBy_EmployeeID = Order.EmployeeID_CreateFkChildList(Order_BaseList);
        Order_GroupBy_ShipVia = Order.ShipVia_CreateFkChildList(Order_BaseList);
    }
}
The 'Base' list could be populated using LINQ from the database.

The 'X_By_XID' dictionary, Order_By_OrderID, serves to retrieve an Order object given an OrderID.
Any database column with a unique index (which clearly includes a single-column PK) gets a dictionary like this.

The 'X_GroupBy_YID' dictionaries hold lists of Order objects corresponding to a particular FK column value. For example, Order_GroupBy_EmployeeID[employeeId] contains a List<Order> of all the orders associated with that particular employeeID.

You can also see the initialisation code from the constructor. The static methods being called were designed only to be used in the initial creation of the dictionaries from the base list.
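Those generated helpers are simple LINQ over the base list. Here's a minimal sketch of what they might look like as static members of the Order partial class (the real bodies come from the T4 template; this illustration is mine, and it skips null FK values since Dictionary keys can't be null):

public partial class Order {
    // Sketch: unique-index lookup - one Order per OrderID.
    public static Dictionary<int, Order> OrderID_CreateLookupList(List<Order> baseList) {
        return baseList.ToDictionary(o => o.OrderID);
    }

    // Sketch: FK child lists - many Orders per CustomerID.
    // CustomerID is nullable in Northwind, so null keys are skipped.
    public static Dictionary<string, List<Order>> CustomerID_CreateFkChildList(List<Order> baseList) {
        return baseList.Where(o => o.CustomerID != null)
                       .GroupBy(o => o.CustomerID)
                       .ToDictionary(g => g.Key, g => g.ToList());
    }
}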

Don't worry about the second line, the 'foreach', yet.

It may not be clear yet, but all the functionality we created in Part 1 can easily be derived from these lists. As a bonus, we get this bunch of handy lookup lists to use for any other purpose we please.

Sharing the Cached Data Between Objects

In order to make the cached objects able to access other cached objects, the objects needed to contain a reference to the cache.
This was done by creating an IHasDataCache interface and modifying all data access objects to implement that interface:

public interface IHasDataCache {
    DataCache CachedData { get; set; }
}

public partial class Order: IActiveRecord, IHasDataCache {
    ...

    private DataCache _dataCache = null;

    [SubSonicIgnore]
    public DataCache CachedData {
        get { return _dataCache; }
        set { _dataCache = value; }
    }

    ...
}

public partial class Product: IActiveRecord, IHasDataCache {
    ...
etc
Now we can see the reason for that 'foreach' line back in the DataCache class constructor. Remember:
foreach (IHasDataCache hdc in Order_BaseList ) { hdc.CachedData = this; } 
That line loops through each of the just-loaded cached objects and sets a reference to the parent DataCache object. Now the table classes could access the other related classes as they needed to.

Instantiating the Cache

I usually implement a static class 'Gbl' where I put anything that's global to my application, such as cached data ...
public static class Gbl {
    public static DataCache NorthwindDataCache = new DataCache();
    ...
}
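One thing to consider: a static initialiser like this runs the full cache load the first time Gbl is touched. If you'd rather defer that potentially expensive load until first use, a Lazy<T> wrapper (available since .NET 4) does the job - this is a sketch of my own, not part of the generated code:

public static class Gbl {
    // Sketch: defer the cache load until first access.
    // Lazy<T> also makes the one-time initialisation thread-safe by default.
    private static readonly Lazy<DataCache> _northwindDataCache =
        new Lazy<DataCache>(() => new DataCache());

    public static DataCache NorthwindDataCache {
        get { return _northwindDataCache.Value; }
    }
}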
The Access Methods

It's time to unveil the actual members used to access the FK-related objects. The naming conventions are a lot less cryptic. Any single FK parent object is accessed via a property starting with 'FkParent_...', and any list of FK child objects via a property starting with 'FkList_...'.
Here are the properties for the Product class:
List<Order_Detail> orderDetail = p.FkList_Order_Detail_ProductID;
List<Product_Category_Map> pcm = p.FkList_Product_Category_Map_ProductID;
Category category = p.FkParent_CategoryID;
Supplier supplier = p.FkParent_SupplierID;
Under the Hood

The beauty of this resides in the caching system, and the implementation of the above properties gives insight into how it actually works:
public List<Order_Detail> FkList_Order_Detail_ProductID {
    get {
        if (_FkList_Order_Detail_ProductID == null) {
            if (_dataCache != null && _dataCache.Order_Detail_GroupBy_ProductID!=null) {
                if (!_dataCache.Order_Detail_GroupBy_ProductID.TryGetValue(_ProductID, out _FkList_Order_Detail_ProductID)) {
                    // deal with the case where there are no related records and hence no list
                    _FkList_Order_Detail_ProductID = new List<Order_Detail>();
                }
            } else {
                _FkList_Order_Detail_ProductID = (from items in Order_Detail.All()
                    where items.ProductID == _ProductID
                    select items).ToList();
            }
        }
        return _FkList_Order_Detail_ProductID;
    }
}
 
public Category FkParent_CategoryID {
    get {
       if (_FkParent_CategoryID == null) { 
           if (_dataCache != null && _dataCache.Category_By_CategoryID!=null) {
               _FkParent_CategoryID = _dataCache.Category_By_CategoryID[this.CategoryID]; 
           } else {
               _FkParent_CategoryID = Category.SingleOrDefault(x => x.CategoryID == this.CategoryID);
           }
       }
       return _FkParent_CategoryID;
    }
}
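One caveat with the parent lookup: the dictionary indexer throws a KeyNotFoundException if the parent row isn't in the cache - an orphaned FK, say, or a nullable FK column. If that's possible in your data, the cached branch can be made defensive. This variant is my tweak, not the template's:

// Defensive variant (sketch): leave the result null rather than throwing
// when the parent row is missing from the cached dictionary.
Category parent;
if (_dataCache.Category_By_CategoryID.TryGetValue(this.CategoryID, out parent)) {
    _FkParent_CategoryID = parent;
}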
So it goes something like this:
  • lazy loading means the FK object or list is only fetched once
  • if the object was created in the DataCache class and was populated using the code in the constructor, then it contains a valid reference to the DataCache object
  • a valid DataCache reference is used to load the wanted information from the cached data dictionaries, if present
  • if no cache is present, the data is loaded from the database via LINQ
This system is pretty flexible. It allows ad-hoc mixing of cached and non-cached objects in code: cached objects get optimally efficient access, and everything else falls back to a database fetch.
Not only that, but we can use LINQ to query the cached objects without generating a database call.

As mentioned in Part 1, the caching took the run-time of the first program I used this with from 3.5 minutes to sub-second - literally finishing before the mouse button had come back up from the click that started it. (All the tables there needed to be cached, but initially I cached just the main ones and let LINQ query the rest.)

Here are some snippets:
foreach (Supplier supp in Gbl.NorthwindDataCache.Supplier_BaseList) {
    foreach (Product product in supp.FkList_Product_SupplierID) {
        ...
    }
}

if (product.FkParent_SupplierID.Country != "Australia") { overseasSupplier = true; }

tablesAlphaOrder = Gbl.NorthwindDataCache.Product_BaseList
    .Where(x => x.SupplierID > 0 && x.CategoryID != 25)
    .OrderBy(x => x.ProductName).ToList();
The Code

So where can you get your hands on this little beauty ?
The best place is probably in my SubSonic templates branch here.
You only need the T4 Templates.

Automated Foreign Key List Generation in SubSonic 3 : Part 1

This is another post in the series of SubSonic enhancements.
I'm always tinkering with the class templates, trying to get just that little bit more mileage out of them.
This post is going to be a bit long, and a bit complex, but well worth your while.

The Problem

This post deals with a commonly heard request: 
  using foreign key links to connect to related lists of objects
For example, in the good old Northwind database, a product has a single supplier, but can belong to many orders. Wouldn't it be great if we could write something like the following:
Product p = Products.SingleOrDefault(productId);
if (p!=null) {
    string companyName = p.GetSupplier().CompanyName;
    List<Order> orderList = p.GetOrderList();
    ... do something ...
}
Another complication is that often there is a need to cache data before performing complex operations.
For example, we might want to iterate over all orders and carry out some business logic to do with suppliers, which requires stepping through the product level.

We could use a single SQL dataset to handle all three levels at the same time, but then we denormalise the data and lose all the advantages of object-orientedness. 

We could iterate through the orders one at a time and query (by LINQ or SQL) all the related data for each row, which would be programmatically easy, but highly inefficient, generating multiple queries for each row (RBAR, anyone ?).
Time and time again, I'd find myself fetching all three full result sets as lists and writing something like the following:
foreach (Supplier s in supplierList) {
    foreach (Product p in productList) {
        if (p.SupplierID == s.SupplierID) {
            foreach (Order o in orderList) {
                if (o.ProductID == p.ProductID) {
                    ... do something with supplier orders ...
                }
            }
        }
    }
}
At least if you're going to do wildly inefficient reiteration - do it in memory !

But wouldn't it be great if somehow we could automatically set up and pre-cache, with just three queries to the database, all of the related objects in their relationships to the other objects ?
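To make that concrete, here's the shape of the idea in plain LINQ - three queries, then purely in-memory lookups. (A sketch of the concept only; it assumes non-nullable FK columns and follows this post's simplification that Order carries a ProductID.)

// Three database hits, total:
List<Supplier> suppliers = Supplier.All().ToList();
List<Product> products = Product.All().ToList();
List<Order> orders = Order.All().ToList();

// Build the FK lookups once; ILookup returns an empty sequence for missing keys.
ILookup<int, Product> productsBySupplier = products.ToLookup(p => p.SupplierID);
ILookup<int, Order> ordersByProduct = orders.ToLookup(o => o.ProductID);

// Iterate without touching the database again.
foreach (Supplier s in suppliers) {
    foreach (Product p in productsBySupplier[s.SupplierID]) {
        foreach (Order o in ordersByProduct[p.ProductID]) {
            // ... do something with supplier orders ...
        }
    }
}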

Read on ....

Stage 1: Creating LINQ lookups

Let's state the goal clearly at the outset. 
A foreign key is a one-to-many relationship.
In the Northwind example above, the supplier is on the one/parent side of the relationship (a product can have only one supplier), and the order is on the many/child side (a product can be in many orders).
So
  • for each parent FK relationship, we want to provide a single object 
  • for each child FK relationship, we want to provide a generic list of objects
In the first draft of the solution, this was achieved by providing two sets of properties. There was no attempt at caching.
For example:
Product p = Products.SingleOrDefault(productId);
Supplier s = p.FK_SupplierID_Supplier;
foreach (Order ord in p.FK_Order_ProductID) {
    ... do something ....
}
The naming convention was a little cryptic, but FK links need to be named using both the table and the column involved in the FK relationship (remember - there can be two differently named FK columns in a table that link to the same external table).
The order of the table names was swapped for the one-to-many and the many-to-one member.

The implementation was like this (this is the end result: the code was actually done in the T4 template to auto generate what you see below):
private Supplier _FK_SupplierID_Supplier;
public Supplier FK_SupplierID_Supplier {
  get {
      if (_FK_SupplierID_Supplier == null) { 
           _FK_SupplierID_Supplier = Supplier.SingleOrDefault(
             x => x.SupplierID == this.SupplierID);
      }
      return _FK_SupplierID_Supplier;
  }
}

private List<Order> _FK_Order_ProductID;
public List<Order> FK_Order_ProductID {
  get {
    if (_FK_Order_ProductID == null) {
        _FK_Order_ProductID = (from items in Order.All()
            where items.ProductID == _ProductID
            select items).ToList();
    }
    return _FK_Order_ProductID;
  }
}
Problems with Stage 1

Programmatically, this solved the problem.  But performance was terrible.
While this approach used lazy loading to avoid repeat queries, it still generated a query to the database for each related record or set of records when first requested.
When I set it up, I was under the impression that LINQ would in fact query the database at load-time rather than run-time. Not so.
To give us a benchmark: the program I first tried this out on ran a procedure involving object hierarchies that took three and a half minutes, hitting the database steadily during the whole process.
This was, indeed, a textbook example of RBAR.

Once I implemented the caching in the following steps, the same program ran in a fraction of a second. I didn't even bother to measure it, because it was so fast.

See Part 2 of 2



SubSonic 3 Automated Enum Generation

As I've said before, I love the SubSonic Project for generating data access classes from my database.

Probably 12 months ago, I wrote an automated enum generator for it, since that was one feature that was missing.
It's a drop-in - it doesn't affect any other functionality or require modifications to existing code.
I thought that it had been incorporated into the SubSonic 3 Templates trunk, but whoops - it hasn't !
So the two template files (Enums.tt and Enums.ttinclude) can be downloaded at my GitHub branch.

I don't think there's any point rewriting the comments at the start of Enums.tt regarding usage, so here they are:
----------------------------------------------------------------------------------------------------
INSTRUCTIONS
----------------------------------------------------------------------------------------------------

Enum Generator Features
-----------------------
 - auto generates enum values from the row data in the tables
 - will generate regular enums for integer values or an 'enum-like' struct for string values
 - a single setting will generate enums for all lookup tables with a standard prefix, with default enum
   name based on the table name
 - the enum name, and the value and description columns used to create the enum can be customised per-table
 - multiple enums can be generated from the same table 
 - a MULTI mode allows automated enum generation from a MUCK (Massively Unified Code-Key) general purpose lookup table
   (BTW MUCK tables are NOT a good idea, but in the tradition of SubSonic, we let you make the choice)

Typical 'integer valued' table:

  CategoryID  CategoryName   
  int         nvarchar(50)   
  ----------- ---------------
  1           Beverages       
  2           Condiments      
  3           Confections     
  4           Dairy Products  
  5           Grains/Cereals  

Typical 'string valued' table:

  State_Str     State
  nvarchar(10)  nvarchar(50)
  ------------  ----------------------------
  ACT           Australian Capital Territory
  NSW           New South Wales
  NT            Northern Territory
  QLD           Queensland
  SA            South Australia
  TAS           Tasmania
  VIC           Victoria
  WA            Western Australia

Typical 'MUCK' table:

  LookupKey                                          LookupVal    LookupDescLong
  nvarchar(50)                                       nvarchar(20) nvarchar(100)
  -------------------------------------------------- ----------   --------------------------
  AssignStatusStr                                    F            Fully
  AssignStatusStr                                    P            Partly
  AssignStatusStr                                    U            Not
  AssignStatusStr                                    X            n/a
  BatchAutoGenModeStr                                E            Assign to existing batch
  BatchAutoGenModeStr                                N            Make new batch
  BatchAutoGenModeStr                                X            Do not assign to batch
  BatchPackStatusStr                                 C            Cancelled
  BatchPackStatusStr                                 L            Locked
  BatchPackStatusStr                                 P            Packing
  BatchPackStatusStr                                 T            Complete


EnumSettings contains a list of enum generation settings.
NOTE: enum Generation uses CleanUp() from Settings.ttinclude to sanitize names so make sure it's up to scratch

FORMAT:   [table name regexp]:[enum name]:[id column name]:[descr column name]:[sql where clause]

 - all params are optional except the first. if omitting an earlier parameter but using a later parameter then 
   still include the ':' as a placeholder

  [table name regexp] = regular expression matching the table name.  Can be just the table name, but it is advisable
      to use the end and/or start RegEx markers.

  [enum name] = the name to use for the enum (default=table name + 'Enum')
   - if the enum name is in the format MULTI=[KeyColName] then the key column values will be used to name 
     the enum and to match the blocks of row values to be used for each enum

  [id column name] = the name of the column to use for the enum value (default=PK col)

  [descr column name] = the name of the column to use for the enum description (default=first text col)

  [sql where clause] = where clause to use when retrieving the enum values (default=empty)

EXAMPLES
string[] EnumSettings = new string[]{
 "lk_.+",
 - generates enums from all tables in the database starting with 'lk_' using default names and columns

 "tblLookupVals:AssignStatusEnumStr:LookupVal:LookupDescLong:where [LookupKey]='AssignStatusStr'",
 - generates the named enum from the designated table, using the designated columns and WHERE

 "tblLookupVals:MULTI=LookupKey:LookupVal:LookupDescLong",
 - generates multiple enums from the 'tblLookupVals' MUCK table; one enum for each block of values in column 'LookupKey'

 "lk_State:StateShortEnum:State_Str:State_Str",
 - generates an enum of 'short' state values only

 "lk_State:StateLongEnum:State:State",
 - generates an enum of 'long' state values only

};
Samples of generated enums are shown below. Note that the tool can generate string 'enums', which are a struct of a type I put together after browsing the many proposals on StackOverflow.
namespace MyNamespace { 

 // string enum derived from database rows: libm_ColType.DotNetSystemTypeName, libm_ColType.DotNetSystemTypeName
 public struct DataTypeEnum {
  public const string Int64 = "Int64";
  public const string Boolean = "Boolean";
  public const string Byte = "Byte";
  public const string Decimal = "Decimal";
  public const string DateTime = "DateTime";
  public const string Double = "Double";

  public string Value { get; set; }
  public override string ToString() { return Value; }
 }

 // enum derived from database rows: libm_DataView.ViewDescr, libm_DataView.DataViewID
 public enum DataViewEnum {
  Table_Default = -81,
  Default = -18,
  None = -16
 }
}
The string 'enum' can be used in two ways. You can declare your variable as string, and simply use the enum const values:
string s = DataTypeEnum.Decimal;
Or you can declare your variable as DataTypeEnum, and use the .Value property to get and set values. This doesn't actually restrict the values used to those in the enum unless you write some more code (you can add it to the template so it autogenerates - see the sketch below), but it conforms to the style of enums, and if you only ever assign the matching enum constant values, it's equivalent to an enum.
DataTypeEnum x = new DataTypeEnum();
x.Value = DataTypeEnum.Decimal;
string x1 = x.Value;
string x2 = x.ToString();
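If you do want assignments restricted to the defined constants, a validating setter along these lines could be auto-generated into the template - a sketch of mine, not what the shipped template produces:

public struct DataTypeEnum {
    public const string Int64 = "Int64";
    public const string Boolean = "Boolean";
    // ... remaining constants ...

    // Sketch: the template could also emit the set of valid values.
    private static readonly HashSet<string> _validValues =
        new HashSet<string> { Int64, Boolean /* , ... */ };

    private string _value;
    public string Value {
        get { return _value; }
        set {
            // Reject anything that isn't one of the defined constants.
            if (!_validValues.Contains(value))
                throw new ArgumentException("Not a valid DataTypeEnum value: " + value);
            _value = value;
        }
    }
    public override string ToString() { return Value; }
}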
Up to you if it's important enough to add that feature. NOTE: I've also added automatic enum generation to SubSonic 2. I'm an admin of that GitHub project, so the enum generation IS part of the main code branch.

Sunday, October 14, 2012

The Death of the Hard Drive - and the Database Server ?

I'll start this post with a prediction: Hard drives are doomed

Actually, not much to argue with there for the informed observer. The new solid state drives can run rings around the mechanical version. Oh, they're not quite there yet - they're smaller in capacity and more expensive - but ten years from now you can bet that will all have changed.
And solid state hard drives are actually just memory - the only reason that they are packaged as hard drives is because that is the paradigm that we are locked into.

They're not the same memory as we buy for our motherboards. That's dynamic, or volatile, RAM, where the data disappears with the power. The drives use static, or nonvolatile, memory (flash, a descendant of EEPROM). Its write cycles take orders of magnitude longer, and there is a limit to the number of write cycles per bit/byte/unit of storage, so solid state drives have management units to spread the writing load around so as not to exhaust this limit.
Managing the write cycle limit is the one compelling reason to keep this memory as an independent unit, but I suspect that soon someone will find a way around this limitation, and also a way to greatly decrease the write cycle time.
And it's just a matter of time before the capacity increases beyond anything mechanical drives can match.

Once these things change, there will be no reason not to connect static memory directly to the processor. Why pipe it through a slow interface when a 64 bit processor can directly address 18 million terabytes ?
The static memory will replace the role of the hard drive and probably some uses of the dynamic memory, and the dynamic memory will remain the fast, volatile work area it is now.

But the IDE/SATA/SCSI-connected box we call a hard drive will be gone.

And this brings us to the corollary of the first prediction:
Database servers as we know them are doomed

And that's because database servers are designed specifically to take memory based structures and store them in detached persistent storage - i.e. a hard drive.

Now let's backtrack a fraction.
Database servers perform a lot of valuable and complex tasks, including not just storing the data, but also navigating it and keeping it consistent - transactions and query optimisation being two great examples.
So there's no way I'm predicting that these functions are going to disappear.

ACID is mandatory for any reliable system. In fact, here's a great litmus test for how any database system performs: assuming competent database design and indexing, can it handle an ad-hoc, highly complex query on millions of rows and still return results with low latency and good throughput ?
In the case of MS SQL Server and Oracle, the answer is yes (and server clustering is another game altogether, we won't cover it here). But many other database systems fail this test dismally.

But I do strongly believe that there will be a paradigm shift in how the data is actually stored. What concerns me is that none of the standard vendors appear to be gearing up for this.
Most of the serious RDBMSs model the hard disk topology at a very low level, so as to squeeze the maximum performance out of the disk.

So what happens when the hard drive disappears ? Well, the data will be stored persistently, in place, in memory. But we'll still need transactions. We'll still need a query optimiser.
And we'll still need to house the data store on a separate server and communicate with it somehow, whether by API or by SQL.
How is that going to look, and who's planning for this ? As far as I can see, as the Johnny Cash song goes: nobody.