Some questions

On Thu, Apr 23, 2009 at 1:45 PM, Adam Jacob adam@opscode.com wrote:

On Thu, Apr 23, 2009 at 12:45 AM, David Lee david.lee@kanji.com.au wrote:

I'm not sure I understand how the search indexes help solve Miguel's
problem?

I was just curious about how difficult the implementation is likely to be.
An ActiveRecord, with polymorphic joins etc, would likely require less work
than one using native database interaction.

But I gather it would have to be built directly on top of couch; I have
little sense of the difficulty involved at this point, and I'm curious.

CouchDB is particularly ill-suited to ad-hoc queries of that sort -
while you would be easily able to pull out single objects, you really
don't have the ability to string together arbitrary queries in the way
you are thinking. This is a side-effect of being schema-free and
document oriented, and it's why something like a full text index for
the CouchDB documents is necessary.

We're chatting about ways to make this better in the long term - we
would love to hear your thoughts on the matter.

Adam

I'll throw in my thoughts on this. For the amount of free formed data being
stored, I think couchdb is overkill. It's another moving part that imo is
not really adding much value. If you were storing gigabytes of data and
doing map/reduce over distributed data sets I could see the point, but right
now couchdb is more a point of frustration for doing lots of stuff that
would be easier with datamapper/activerecord.

Why not just serialize the node attributes and stick them in a text column?
Index them, even keep them all in memory if you want, it's not a large
amount of data even with hundreds of servers. You could store it as just
one large hash. Another thing is that couchdb plus an sql db would end up
being too much, it really needs to be one or the other. Which one will add
the most value overall? I'm guessing datamapper/activerecord will make chef
easier to work with, get more people contributing, and result in more people
using it.

Chris

Yeah, I don't see what benefits couchdb has over sqlite3 at the scale we are
talking either. Can anyone clarify?


On Thu, Apr 23, 2009 at 4:48 PM, snacktime snacktime@gmail.com wrote:

I'll throw in my thoughts on this. For the amount of free formed data being
stored, I think couchdb is overkill. It's another moving part that imo is
not really adding much value. If you were storing gigabytes of data and
doing map/reduce over distributed data sets I could see the point, but right
now couchdb is more a point of frustration for doing lots of stuff that
would be easier with datamapper/activerecord.

CouchDB really isn't the point of frustration here. We're storing
fairly large semi-structured JSON data on the server, which we want
clients to be able to query via a RESTful API. In our case, CouchDB
is providing the back-end storage, and that query API is coming from
Ferret. The query API is the part that sucks here, not really
CouchDB.

If you were to move this to a relational database, or to something
like sqlite, you'll be essentially creating a trivial table with two
columns: key and a BLOB. (Any other structure is a road to ruin -
check out iClassify for an example) At which point any benefits SQL
bought you are pretty well moot.
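
A minimal sketch of the key/BLOB layout described above, assuming a plain Hash stands in for the two-column table so the example stays self-contained. `KeyBlobStore` and its methods are hypothetical names, not part of Chef or any library. It illustrates the point being made: lookups by key are cheap, but any query on the document contents has to parse every blob, so the SQL engine has nothing to work with.

```ruby
require "json"

# Hypothetical stand-in for a two-column table (key, blob).
# A Hash plays the role of the table; each value is an opaque JSON string.
class KeyBlobStore
  def initialize
    @rows = {}
  end

  # Serialize the whole node document into a single blob.
  def put(key, document)
    @rows[key] = JSON.generate(document)
  end

  # Fetching by key is one row lookup plus one parse.
  def get(key)
    blob = @rows[key]
    blob && JSON.parse(blob)
  end

  # Querying on contents must parse every blob, because the storage
  # layer only ever sees opaque strings.
  def select
    @rows.each_with_object([]) do |(key, blob), matches|
      matches << key if yield(JSON.parse(blob))
    end
  end
end

store = KeyBlobStore.new
store.put("web1", { "role" => "web", "kernel" => { "release" => "2.6.24" } })
store.put("db1",  { "role" => "db",  "kernel" => { "release" => "2.6.27" } })

web_nodes = store.select { |doc| doc["role"] == "web" }
```

With a real database the `select` step would become a full table scan followed by deserialization in the application, which is roughly what the paragraph above is warning about.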

Why not just serialize the node attributes and stick them in a text column?
Index them, even keep them all in memory if you want, it's not a large
amount of data even with hundreds of servers. You could store it as just
one large hash. Another thing is that couchdb plus an sql db would end up
being too much, it really needs to be one or the other. Which one will add
the most value overall? I'm guessing datamapper/activerecord will make chef
easier to work with, get more people contributing, and result in more people
using it.

You'll get very little to no benefit from DM/AR, since that won't ever
be the interface you are exposing to the clients themselves, unless
you want every system on your network connecting directly to the
database to handle your queries. The Chef internals are pretty simple
and clear with CouchDB in place. I would be open to having switchable
data storage layers, but really, I don't think it will solve the
problem you think it will.

Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-4759 E: adam@opscode.com

On Thu, Apr 23, 2009 at 4:58 PM, David Balatero dbalatero@gmail.com wrote:

Yeah, I don't see what benefits couchdb has over sqlite3 at the scale we are
talking either. Can anyone clarify?

At a significant size, sqlite3 will have all sorts of lock contention
and other issues. You'll want to get to replication, caching,
redundancy, etc. eventually for the data storage layer. Those issues
are all solved and trivial with CouchDB.

You could solve them with MySQL or PostgreSQL as well, but the reality
is we would be using those databases in the most trivial manner
possible. Once the search interface is improved, I think this will
become significantly less of an issue.

Adam


The query API is the part that sucks here

Hmmm. A JSON equivalent of xpath, anyone?

I've written some code for work that will traverse a JSON object returned by
JSON.parse, through an interface such as:

json.node("path/to/my/json/node") => single object
json.node_set("path/to/an/array") => [obj1, obj2, obj3]

If that's possibly useful, it could be incorporated (although I'm not yet
enough of an expert on Chef to know whether that would be useful or not, as
I'm still getting up and running at the moment...)
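
A rough sketch of what such a path interface could look like, assuming slash-separated paths and numeric segments for array indexes. This is not the code referred to above; `JsonPath`, `node` and `node_set` here are hypothetical helpers written only to illustrate the idea.

```ruby
require "json"

# Hypothetical helper: walks a parsed JSON document along a
# slash-separated path such as "network/interfaces/1/name".
module JsonPath
  module_function

  # Returns the value at the path, or nil if any segment is missing.
  def node(document, path)
    path.split("/").reduce(document) do |current, segment|
      case current
      when Hash
        current[segment]
      when Array
        # Only purely numeric segments are treated as array indexes.
        segment.match?(/\A\d+\z/) ? current[segment.to_i] : nil
      end
    end
  end

  # Like node, but always returns an Array (empty when nothing matches).
  def node_set(document, path)
    result = node(document, path)
    return [] if result.nil?
    result.is_a?(Array) ? result : [result]
  end
end

doc = JSON.parse('{"network": {"interfaces": [{"name": "eth0"}, {"name": "eth1"}]}}')

second_name = JsonPath.node(doc, "network/interfaces/1/name")
interfaces  = JsonPath.node_set(doc, "network/interfaces")
missing     = JsonPath.node(doc, "network/missing/name")
```

A missing segment yields nil rather than raising, which keeps lookups against loosely structured node data simple.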

- david


On Thu, Apr 23, 2009 at 5:19 PM, Steven Parkes smparkes@smparkes.net wrote:

The query API is the part that sucks here

Hmmm. A JSON equivalent of xpath, anyone?

BINGO

I think doing JSONQuery as a CouchDB view server would get us the
rockin' ad-hoc query-ness.

Adam


I believe David Lee's recent patch for CHEF-243 [1] should make
searches from Ferret much more usable.

Essentially, it gives the ability for the hash returned by search()
and Ferret to be used like a real, nested hash.

e.g.:

data = {
  "parent_child_0" => "pc0",
  "parent_child_grandchild" => "gc",
  "parent_child_1" => "pc",
  "parent_child_2" => "pc2"
}

search() => h(data)

h["parent_child_0"] # => "pc0"
h["parent"]["child_0"] # => "pc0"
h["parent"]["child"]["grandchild"] # => "gc"
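
To make the lookups above concrete, here is a small sketch of one way such a wrapper could behave. It is not the CHEF-243 patch itself; `NestedView` is a hypothetical class, and the rule it uses (an exact flat key wins, otherwise descend when some flat key continues with an underscore) is an assumption chosen to reproduce the three lookups shown.

```ruby
# Hypothetical read-only view that lets flat, underscore-joined keys be
# addressed as if they were nested.
class NestedView
  def initialize(flat, prefix = "")
    @flat = flat
    @prefix = prefix
  end

  def [](key)
    full = @prefix + key
    # An exact match on a flat key wins.
    return @flat[full] if @flat.key?(full)

    # Otherwise descend if some flat key continues past this segment.
    child_prefix = full + "_"
    if @flat.keys.any? { |k| k.start_with?(child_prefix) }
      NestedView.new(@flat, child_prefix)
    end
  end
end

data = {
  "parent_child_0" => "pc0",
  "parent_child_grandchild" => "gc",
  "parent_child_1" => "pc",
  "parent_child_2" => "pc2"
}

h = NestedView.new(data)
```

Because underscores can appear inside real key names, the split points are ambiguous; resolving them lazily at lookup time sidesteps having to guess the nesting up front.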

Mad props to David =)

Regards,

AJ

[1] http://tickets.opscode.com/browse/CHEF-243


On Thu, Apr 23, 2009 at 5:13 PM, Adam Jacob adam@opscode.com wrote:

On Thu, Apr 23, 2009 at 4:48 PM, snacktime snacktime@gmail.com wrote:

I'll throw in my thoughts on this. For the amount of free formed data being
stored, I think couchdb is overkill. It's another moving part that imo is
not really adding much value. If you were storing gigabytes of data and
doing map/reduce over distributed data sets I could see the point, but right
now couchdb is more a point of frustration for doing lots of stuff that
would be easier with datamapper/activerecord.

CouchDB really isn't the point of frustration here. We're storing
fairly large semi-structured JSON data on the server, which we want
clients to be able to query via a RESTful API. In our case, CouchDB
is providing the back-end storage, and that query API is coming from
Ferret. The query API is the part that sucks here, not really
CouchDB.

I was more thinking of all the other parts of chef that could benefit from
having an sql db, but can't as long as you have couchdb, unless you want to
run both an sql server and couchdb. And what if people want to extend chef
for their own particular needs, or you expand the UI, or any number of other
things that can and will be added to chef in the future? Sql won't make
querying attributes any easier, that's true.

Chris

On Thu, Apr 23, 2009 at 6:31 PM, Arjuna Christensen aj@junglist.gen.nz wrote:

I believe David Lee's recent patch for CHEF-243 [1] should make searches
from Ferret much more usable.

Essentially, it gives the ability for the hash returned by search() and
Ferret to be used like a real, nested hash.

Oh that's awesome. I was totally giving my search code that recreates
hashes from the flattened results an evil eye recently.

On 24/04/2009, at 1:40 PM, snacktime wrote:

CouchDB really isn't the point of frustration here. We're storing
fairly large semi-structured JSON data on the server, which we want
clients to be able to query via a RESTful API. In our case, CouchDB
is providing the back-end storage, and that query API is coming from
Ferret. The query API is the part that sucks here, not really
CouchDB.

I was more thinking of all the other parts of chef that could
benefit from having an sql db, but can't as long as you have
couchdb, unless you want to run both an sql server and couchdb. And
what if people want to extend chef for their own particular needs,
or you expand the UI, or any number of other things that can and
will be added to chef in the future? Sql won't make querying
attributes any easier, that's true.

As the chef server is distributed as a slice, expanding the UI becomes
trivial. Bolting additional pieces (datamapper, activerecord) into a
host merb app (merb-auth, for example) and then mounting the slice
just about covers "expanding the UI".

Additional (potentially read-only) data storage mechanisms for node
data (attributes) are definitely something I believe is on the cards,
if only to ease the transition from older configuration management
systems, infrastructure inventory systems and such.

Care to clarify the things you'd like to do that you believe aren't
possible with CouchDB?

Regards,

AJ

On Thu, Apr 23, 2009 at 6:40 PM, snacktime snacktime@gmail.com wrote:

I was more thinking of all the other parts of chef that could benefit from
having an sql db, but can't as long as you have couchdb, unless you want to
run both an sql server and couchdb. And what if people want to extend chef
for their own particular needs, or you expand the UI, or any number of other
things that can and will be added to chef in the future? Sql won't make
querying attributes any easier, that's true.

I totally hear what you're saying - it's true that more people are
familiar with using SQL databases for these kinds of applications.
Having built a similar web UI (iClassify) with Rails, Active Record and
acts_as_solr, I found the difference between using CouchDB and a
SQL database for this sort of application pretty huge. It does
take a bit more learning to understand, but not much - in truth, you
are much closer to the point of view you care about most: the objects
you deal with in Ruby.

This is particularly true with tools like CouchRest, which will likely
find its way into Chef, rather than our own Chef::CouchDB interface
(which gives you a lot of the AR-like API). It's not that CouchDB is
better for every application - but for Chef, where you have a pretty
clean set of objects, that happens to speak JSON as its own REST API,
CouchDB is pretty great.

In case I wasn't totally clear, I would absolutely accept patches to
enable the use of multiple back end data stores, including a SQL one
(and all that would really be required is that things inflate to the
right objects - it probably wouldn't be a ton of work.) I chose
CouchDB because, in my opinion, it fit the model perfectly, and has
great scaling characteristics. (Think about read performance, and
what a single varnish instance could accomplish.)
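
As an illustration of the "things inflate to the right objects" requirement, here is a minimal sketch of a swappable storage layer. Every name in it (`Node`, `MemoryStore`, `NodeRepository`, `write`, `read`) is hypothetical and not Chef's actual API; the assumption is only that a back end needs to save and return a serialized blob by key.

```ruby
require "json"

# Hypothetical domain object that knows how to serialize itself and how
# to be inflated back from a stored blob.
Node = Struct.new(:name, :attributes) do
  def to_blob
    JSON.generate("name" => name, "attributes" => attributes)
  end

  def self.inflate(blob)
    data = JSON.parse(blob)
    new(data["name"], data["attributes"])
  end
end

# One possible back end. A CouchDB- or SQL-backed class would only need
# to offer the same two methods.
class MemoryStore
  def initialize
    @blobs = {}
  end

  def write(key, blob)
    @blobs[key] = blob
  end

  def read(key)
    @blobs.fetch(key)
  end
end

# The rest of the code talks to the repository, never to the store.
class NodeRepository
  def initialize(store)
    @store = store
  end

  def save(node)
    @store.write(node.name, node.to_blob)
  end

  def find(name)
    Node.inflate(@store.read(name))
  end
end

repo = NodeRepository.new(MemoryStore.new)
repo.save(Node.new("web1", { "role" => "web" }))
found = repo.find("web1")
```

The sketch deliberately leaves out querying, which is the part of the problem that a storage swap does not address.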

Adam


Well, my immediate thought is to use e.g. sqlite3 + activerecord to store
metadata for a node like this, keyed on the node/recipe name or other
unique ID.

This would be pretty easy, and make implementing these features a snap.
I'm not sure if the additional dependencies would be welcomed, though. It'd
also mean 2 separate data stores, which I don't think I like the sound of.

Otherwise, it seems there are a few couch-backed ORMs turning up. Any of
those decent?

Writing complex queries directly is so 2001 ...


--

David Lee

Application Development Coordinator
Kanji Group Pty Ltd

02 8272 9483
david.lee@kanji.com.au


This message and any attachment are confidential and may be privileged
or otherwise protected from disclosure. If you are not the intended
recipient, please telephone or email the sender and delete this message
and any attachment from your system. If you are not the intended
recipient you must not copy this message or attachment or disclose the
contents to any other person.

This email message does not constitute legal, financial or any other
kind of advice and reliance must not be placed on its contents. Any
advice will be prefixed with a notice to that effect - and unless such a
notice is affixed all liability for the contents of this email is
disclaimed. The integrity of this email, its contents or any attachments
is not certified in any way by the sender.

Liability limited by a scheme approved under Professional Standards
Legislation

Actually, before I go offering any more opinions, can I please ask what
we're actually doing? In reading the responses, I realised I don't know
enough about what's going on yet to offer very constructive input.

Please correct me if I'm wrong, but it seems like we're basically
wanting to:

  1. store node data records, which are a fairly large, arbitrarily
    structured nested hash (ruby / json), with string key/value pairs

  2. find node records which match some very simple criteria, and return
    the entire node data structure for matches

  3. store some additional metadata about nodes themselves, recipes, and
    other Chef classes / objects; these bits of metadata would be
    lightweight and possibly act like polymorphically associated
    ActiveRecord objects.

Is this a reasonable summary?

If it is, would a decent native Ruby object database be a pretty
reasonable thing to use as a backend? Does such a thing exist?
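
Points 1 and 2 of the summary above can be sketched in a few lines, assuming nested keys are flattened with underscores (as the search results discussed elsewhere in this thread are) and that "very simple criteria" means equality on flattened keys. `flatten_keys` and `find_nodes` are hypothetical helpers, not Chef functions.

```ruby
# Flatten a nested hash into underscore-joined keys,
# e.g. {"kernel" => {"release" => "x"}} becomes {"kernel_release" => "x"}.
def flatten_keys(hash, prefix = nil)
  hash.each_with_object({}) do |(key, value), flat|
    full = prefix ? "#{prefix}_#{key}" : key.to_s
    if value.is_a?(Hash)
      flat.merge!(flatten_keys(value, full))
    else
      flat[full] = value
    end
  end
end

# Return the entire node structure for every node whose flattened
# attributes equal all of the given criteria.
def find_nodes(nodes, criteria)
  nodes.select do |node|
    flat = flatten_keys(node)
    criteria.all? { |key, expected| flat[key] == expected }
  end
end

nodes = [
  { "name" => "web1", "platform" => "ubuntu", "kernel" => { "release" => "2.6.24" } },
  { "name" => "db1",  "platform" => "centos", "kernel" => { "release" => "2.6.18" } }
]

matches = find_nodes(nodes, { "platform" => "ubuntu", "kernel_release" => "2.6.24" })
```

This scans every node on each query, which is fine for illustration but is exactly the work a search index is meant to avoid.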



Thanks Adam for the thorough reply. Personally I can live with couchdb, as
long as I don't have to go writing custom views for everything that
resembles an sql join :) I wouldn't mind contributing code to enable the use
of multiple backends. I was looking at the couchdb interface last night
and it looks pretty straightforward. But I really didn't want to drag
this whole thread off topic, so I'll post in another if I have questions on
this.

Chris

In any case I didn't mean to drag this thread off topic:)

On Thu, Apr 23, 2009 at 9:46 PM, David Lee david.lee@kanji.com.au wrote:

Actually, before I go offering any more opinions, can I please ask what
we're actually doing? In reading the responses, I realised I don't know
enough about what's going on yet to offer very constructive input.

Please correct me if I'm wrong, but it seems like we're basically wanting
to:

  1. store node data records, which are a fairly large, arbitrarily
    structured nested hash (ruby / json), with string key/value pairs

  2. find node records which match some very simple criteria, and return the
    entire node data structure for matches

  3. store some additional metadata about nodes themselves, recipes, and
    other Chef classes / objects; these bits of metadata would be lightweight
    and possibly act like polymorphically associated ActiveRecord objects.

Is this a reasonable summary?

I think that eventually you will want to attach attributes to a variety of
chef objects, such as cookbooks, recipes, etc.

If it is, would a decent native Ruby object database be a pretty reasonable
thing to use as a backend? Does such a thing exist?

Not that I know of.

The problem with multiple backend stores is the lowest common denominator.
If you are going to mix sql/non sql backends, you need a reference
implementation to go by. It wouldn't be much work to drop in activerecord,
but the minute I start using activerecord functionality that doesn't exist
in the other backends, I've just broken chef for everyone not using
activerecord.

Off the top of my head, I would probably pick one of the full-featured ORMs
as a reference, and then define a subset of that ORM's functionality that
can be used in chef core. The goal being to use a reference implementation
that has enough functionality to carry you forward. I suspect activerecord,
datamapper, or couchrest would be good candidates.

The only other approach I can think of is to use couchdb OR sql, but not
both, and then just pick the best orm and stick with it.

Since chef already uses couchdb, I think just using the best couchdb orm is
the saner approach. Multiple backends of completely different types will
lead to chaos.

If others care to chime in and come to some consensus, I'd be happy to start
coding it up.

Chris

I dunno, "couch orm" seems like a misnomer to me; there's no "r" in
couchdb (it's not relational). And yeah, multiple backends sounds like
overkill. I'd much rather see new features in chef than a science
project around a (seemingly gratuitous) abstraction for that. In
general, I think the most common case of access to the data is for composed
objects (nodes, cookbooks, etc.); the de/marshalling overhead of ORM'ing the
data isn't necessary for those cases, so Adam's decision to use a
document store makes a lot of sense. For simply querying by metadata
elements, it seems to me that full text search (ferret, solr, etc.)
should be sufficient. However, if it's really important to have relational,
transactional access, I would suggest considering a minimal number of
rdbms tables to support that, and treating the couch documents as the moral
equivalent of a denormalized table for accessing the composed documents
(similar to what friendfeed does,
http://bret.appspot.com/entry/how-friendfeed-uses-mysql, though they
explicitly relax transactional consistency to address web request
latency concerns).
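
A rough sketch of that arrangement, assuming one indexed attribute and using in-memory Hashes in place of both the document store and the small index table. `IndexedDocuments` and its methods are hypothetical; a real setup would keep the index in an rdbms table and the blobs in couch.

```ruby
require "json"

# Documents stay as opaque JSON blobs; a separate small index maps the
# value of one chosen field to the ids of the documents that carry it.
class IndexedDocuments
  def initialize(indexed_field)
    @indexed_field = indexed_field
    @blobs = {}                                     # id => JSON blob
    @index = Hash.new { |h, value| h[value] = [] }  # field value => [ids]
  end

  def put(id, document)
    remove_from_index(id) if @blobs.key?(id)
    @blobs[id] = JSON.generate(document)
    value = document[@indexed_field]
    @index[value] << id unless value.nil?
  end

  # Index lookup first, then inflate only the matching documents.
  def find_by(value)
    @index.fetch(value, []).map { |id| JSON.parse(@blobs[id]) }
  end

  private

  # Keep the index consistent when a document is overwritten.
  def remove_from_index(id)
    old = JSON.parse(@blobs[id])[@indexed_field]
    @index[old].delete(id) unless old.nil?
  end
end

docs = IndexedDocuments.new("role")
docs.put("web1", { "name" => "web1", "role" => "web" })
docs.put("web2", { "name" => "web2", "role" => "web" })
docs.put("db1",  { "name" => "db1",  "role" => "db" })

web_before = docs.find_by("web").map { |d| d["name"] }
docs.put("web2", { "name" => "web2", "role" => "db" })
web_after = docs.find_by("web").map { |d| d["name"] }
```

In this sketch the index and the blob are updated together; the linked article describes relaxing that coupling, which is a separate design decision.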
-Ian


--
Ian Kallen
blog: What's That Noise?! [Ian Kallen's Weblog]
tweetz: http://twitter.com/spidaman
vox: 415.505.5208