# Copyright 2006-2008 The FLWOR Foundation.
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
# http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#


                       Zorba Data Flow
                       ---------------
                        May 26, 2007
                        Revision 1.0

                        Dana Florescu
                        Donald Kossmann
                        Tim Kraska
                        Paul Pedersen










Table of Contents
-----------------

    - Introduction
    - Outline
    - Parsing
    - Normalization
    - Codegen
    - Iterator eval
		- Complete example
    

    
    





Introduction
-------------

The zorba architecture is a collaborative effort of the FLWOR 
foundation.  Participants in the design have been: Dana Florescu, 
Donald Kossmann, Tim Kraska, John Cowan, and Paul Pedersen.
(add anyone else you think belongs here..)

Zorba follows the following specifications:

  XQuery 1.0: An XML Query Language       
    [http://www.w3.org/TR/xquery/]
  XQuery 1.0 and XPath 2.0 Data Model (XDM)
    [http://www.w3.org/TR/xpath-datamodel/]
  XML Schema 1.1 Part 2: Datatypes        
    [http://www.w3.org/TR/xmlschema11-2/]
  XQuery 1.0 and XPath 2.0 Formal Semantics
    [http://www.w3.org/TR/xquery-semantics/]
  XQuery 1.0 and XPath 2.0 Functions and Operators
    [http://www.w3.org/TR/xpath-functions/]
  XQuery Update Facility                   
    [http://www.w3.org/TR/xqupdate/]
  XQuery 1.0 and XPath 2.0 Full-Text       
    [http://www.w3.org/TR/xquery-full-text/]

    





    
    
    

Outline
-------

This document describes the program flow during the evaluation
of the most basic XQuery function, namely,
  
  fn:doc($uri as xs:string?) as document-node()?
  

The zorba query evaluation data flow consists of steps:
 
          ___________________
         |                   |
         |   query source    |
         |___________________|
                  |
                  | <parse>
                  |
          ________v__________
         |                   |
         |    syntax tree    |
         |___________________|
                  |
                  |
                  | <normalize>
                  |      __________
                  |     |          |
          ________v_____v____      |
         |                   |     | <optimize>
         |  expression tree  |     |
         |___________________|     |
                  |     |          |
                  |     |__________|
                  |
                  | <code gen>
                  |
          ________v__________
         |                   |
         |   iterator plan   |
         |___________________|
                  |
                  | <evaluate>
                  |
          ________v__________
         |                   |
         |   XDM instance    |
         |___________________|
                  |
                  | <serialize>
                  |
          ________v__________
         |                   |
         |    XML output     |
         |___________________|











Parsing
-------

The input expression is handed the an 'xquery_driver' class
which invokes the flex/bison scanner/parser subsystem:

	  // zorba query driver (file 'parser/xquery_driver.h')
		xquery_driver driver(log_stream);
	
		driver.trace_parsing = true;
		driver.trace_scanning = true;
	
		driver.parse(expression_string);
		parsenode* n_p = driver.get_expr();


Currently the 'xquery_driver' class holds a simple log stream and 
minimal disgnostic options (shown above).  The driver expects either 
an expression string or an input stream where it can read the 
expression.  In the cases of a parse error, the driver will emit a log 
message and exit.  In the absence of error the driver concludes the 
parse holding a local parse tree consisting of heap-allocated parse 
nodes: 

	  // zorba syntax tree (file 'parser/parsenodes.h')
	  class parsenode : public rcobject
	  {
	  protected:
	    yy::location loc;  // source code line number
	  
	  public:
	    parsenode(yy::location const& _loc) : loc(_loc) { }
	    ~parsenode() { }
	  
	  public:
	    yy::location get_location() const { return loc; }
	    virtual std::ostream& put(std::ostream&) const;
	    virtual void accept(parsenode_visitor&) const = 0;
	  };


For the specific example of the expression

		fn:doc("sample.xml")

the syntax tree has the following structure:

		MainModule[
		  QueryBody[
		    Expr[
		      FunctionCall[
		        QName[ fn:doc ]
		        ArgList[
		          StringLiteral[ strval="sample.xml" ]
		        ]
		      ]
		    ]
		  ]
		]


The parser actual performs about 50 reductions.   This is caused by 
the natural precedence nesting, where the parser follows the chain of 
command all the way up to the top of the precedence hierarchy. For 
example in parsing the string literal "sample.xml", the parser 
triggers a cascade of reductions:

		StringLiteral [ ]
		Literal [string]
		PrimaryExpr [literal]
		FilterExpr [primary]
		StepExpr [filter]
		RelativePathExpr [step]
		PathExpr [relative]
		ValueExpr [path]
		UnaryExpr [value]
		CastExpr [unary]
		CastableExpr [cast]
		TreatExpr [castable]
		InstanceofExpr [treat]
		IntersectExceptExpr [instanceof]
		UnionExpr [interexcept]
		MultiplicativeExpr [union]
		AdditiveExpr [mult]
		RangeExpr [add]
		FTContainsExpr [range]
		ComparisonExpr [ftcontains]
		AndExpr [comp]
		OrExpr [and]
		ExprSingle [OrExpr]

All these reductions, apart from the first, are harmless, they cause 
no allocations and no actions. 











Normalization
-------------

Normalizations specified by the XQuery semantics document
XQuery 1.0 and XPath 2.0 Formal Semantics
[http://www.w3.org/TR/xquery-semantics/]

Zorba uses a 'visitor' design pattern to collect the methods used in 
transforming a parse tree into an expression tree. The visitor pattern 
separates algorithms from object structure. A practical result of this 
separation is the ability to add new operations to existing object 
structures without modifying those structures. 


		// zorba normalization (file exprtree/normalize_visitor.h)
    class normalize_visitor : public parsenode_visitor
    {
    public:
      typedef rchandle<expr> exprref;
    
    protected:
      zorba* zorp;
      std::stack<exprref> nodestack;
      std::stack<exprref> argstack;
      std::stack<exprref> pstack;
      fxcharheap sheap;
    
    public:
      normalize_visitor(zorba*);
      ~normalize_visitor() {}
    
    public:
      exprref pop_nodestack();
      void clear_argstack();
      void clear_pstack();
    
    public:
     /*..........................................
       :  begin visit                            :
       :.........................................*/
      bool begin_visit(parsenode const&);
      bool begin_visit(AbbrevForwardStep const&);
      bool begin_visit(AnyKindTest const&);
      bool begin_visit(AposAttrContentList const&);
      bool begin_visit(AposAttrValueContent const&);
      bool begin_visit(ArgList const&);
      bool begin_visit(AtomicType const&);
      ...
    
     /*..........................................
       :  end visit                              :
       :.........................................*/
      void end_visit(parsenode const&);
      void end_visit(AbbrevForwardStep const&);
      void end_visit(AnyKindTest const&);
      void end_visit(AposAttrContentList const&);
      void end_visit(AposAttrValueContent const&);
      void end_visit(ArgList const&);
      void end_visit(AtomicType const&);
      ...
    };


For example, the XQuery grammar specifies an function call
expression as: 

		[93]  FunctionCall  ::=  QName "(" (ExprSingle ("," ExprSingle)*)? ")"


and the XQuery semantics document specifies that:

    statEnv |-  QName of func expands to expanded-QName
    statEnv.typeDefn(expanded-QName) = define type QName_2 AtomicTypeDerivation
    statEnv.funcType(expanded-QName,n) =
						declare function expanded-QName(Type_1, ..., Type_n) as Type
    --------------------------------------------------------------------------
    statEnv |-  [QName( Expr1, ..., Exprn )]_Expr =
							     QName( [Expr1]_FuncArg(Type_1), ..., [Exprn]_FuncArg(Type_n) )

              ==

    QName( [Expr1]_FuncArg(Type1), ..., [Exprn]_FuncArg(Typen) )


Approximately, this statement says that the normalization of the 
function call syntax corresponds to the funcall expression for the 
given function QName, with the vector of normalized argument 
expressions wrapped in the appropriate type promotions.

The 'normalize_visitor' rules for function call currently look
like this:


		bool normalize_visitor::begin_visit(ArgList const& v)
		{
			nodestack.push(NULL); // mark the stack
			return true;
		}

		void normalize_visitor::end_visit(ArgList const& v)
		{
			clear_argstack();
			while (true) { // pop stack to mark
				expr_t e_h = pop_nodestack();
				if (e_h==NULL) break;
				argstack.push(e_h);	// save args
			}
		}
		

		bool normalize_visitor::begin_visit(FunctionCall const& v)
		{
			rchandle<QName> qn_h = v.get_fname();
			string prefix = qn_h->get_prefix();
			string fname = qn_h->get_localname();
			string uri = zorp->resolve_namespace(prefix);
			qnamekey_t funkey = zorba_qname(uri,prefix,fname).qnamekey();
			yy::location loc = v.get_location();
			rchandle<fo_expr> fo_h = new fo_expr(loc);
			fo_h->set_func(dctx_p->get_function(funkey));
			nodestack.push(&*fo_h);
			return true;
		}

		void normalize_visitor::end_visit(FunctionCall const& v)
		{
			rchandle<fo_expr> fo_h = dynamic_cast<fo_expr*>(&*nodestack.top());
			if (fo_h==NULL) return;
			while (!argstack.empty()) {
				expr_t e_h = argstack.top();
				argstack.pop();
				if (e_h==NULL) continue;
				// >>can add treat-as wrappers here<<
				fo_h->add(e_h);
			}
		}


The data flow through normalize_visitor for the query 'fn:doc("sample.xml")'
looks like this:

		 -normalize_visitor.cpp:899::begin_visit: Expr
		|  -normalize_visitor.cpp:930::begin_visit: FunctionCall
		| |  -normalize_visitor.cpp:87::begin_visit: ArgList
		| | |  -normalize_visitor.cpp:1061::begin_visit: StringLiteral
		| | |  -normalize_visitor.cpp:2268::end_visit: StringLiteral
		| |  -normalize_visitor.cpp:1440::end_visit: ArgList
		|  -normalize_visitor.cpp:2125::end_visit: FunctionCall : argstack.size() = 1
		 -normalize_visitor.cpp:2087::end_visit: Expr

with output expression tree:

		fo_expr[xs_qname([http://www.w3.org/2005/xpath-functions]fn:fn_doc)
			string[ "sample.xml" ]
    ]

The normalization replaces an expression list of size one with the
inner expression.











Codegen
-------

The zorba expression tree is compiled into an execution plan
which consists of a compositon of iterators.  Evaluation is
kept as lazy as possible within the constraints of the XQueryP
semantics.  The types 'item_t' and 'iterator_t' are declared as
The iterator interface:

    class basic_iterator : public rcobject
    {
    protected:
      dynamic_context* dctx_p;
    
    public:
      item_iterator() : dctx_p(NULL) {}
      item_iterator(dynamic_context* _dctx_p) : dctx_p(_dctx_p) {}
      item_iterator(const item_iterator& it) : dctx_p(it.dctx_p) {}
      virtual ~item_iterator() {}
    
    public:  // iterator interface
      virtual void open() = 0;
      virtual void close() = 0;
      virtual item_t next() = 0;
      virtual bool done() const = 0;
    
    public:  // C++ interface
      virtual item_t operator*() const = 0;
      virtual item_iterator& operator++() = 0;
      virtual item_iterator& operator=(const item_iterator&) = 0;
    };


The methods open/close are used to acquire and release resources used 
by the iterator when it runs.   An iterator can be allocated and 
passed around before being run. 

For the example of fn:doc, then normalized expression tree gets
transformed by the 'plan_visitor' as follows:

		 -plan_visitor.cpp:87::begin_visit: fo_expr
		|  -plan_visitor.cpp:160::begin_visit: literal_expr
		|  -plan_visitor.cpp:347::end_visit: literal_expr
		 -plan_visitor.cpp:274::end_visit: fo_expr


The literal_expr visit creates a singleton iterator:


		void plan_visitor::end_visit(const literal_expr& v)
		{
		  switch (v.get_type()) {
		  case literal_expr::lit_string: {
		    iterator_t it = new singleton_iterator(
													new stringValue(xs_string,v.get_sval()));
		    itstack.push(it);
		    break;
		  }
		  case literal_expr::lit_integer: {
		    iterator_t it = new singleton_iterator(
													new numericValue(xs_integer,v.get_ival()));
		    itstack.push(it);
		    break;
		  }
		  case literal_expr::lit_decimal: {
		    iterator_t it = new singleton_iterator(
													new numericValue(xs_decimal,v.get_decval()));
		    itstack.push(it);
		    break;
		  }
		  case literal_expr::lit_double: {
		    iterator_t it = new singleton_iterator(
													new numericValue(xs_double,v.get_dval()));
		    itstack.push(it);
		    break;
		  }
		  case literal_expr::lit_bool: {
		    iterator_t it = new singleton_iterator(
													new booleanValue(v.get_bval()));
		    itstack.push(it);
		    break;
		  }}
		}
		

The fo_expr visit uses the function codegen method 'operator()' to 
create an iterator for the previously generated argument iterators. 

		void plan_visitor::end_visit(const fo_expr& v)
		{
			const function* func_p = v.get_func();
			assert(func_p!=NULL);
			vector<iterator_t> argv;
			while (true) {
				iterator_t it_h = pop_itstack();
				if (it_h==NULL) break;
				argv.push_back(it_h);
			}
			const function& func = *func_p;
			itstack.push(&*func(zorp,argv));
		}


For the query expression 'fn:doc("sample.xml")' the codegen tree
looks like:

		doc_iterator
			|
		singleton_iterator("sample.xml")
	









Iterator eval
-------------

All the work of the doc_iterator planis done during '_open':

		void doc_iterator::_open()
		{
			uri_resolver* urires_p = new zorba_uri_resolver();
			arg->open();
			string path = arg->next()->str(zorp);
			rchandle<source> src_h = urires_p->resolve(path);
			istream* is_p = src_h->get_input(zorp);
			assert(is_p!=NULL);
		
			ostringstream oss;
			string line;
			while (!is_p->eof()) {
				getline(*is_p,line);
				oss << line << endl;
			}
			string bufs = oss.str();
			size_t n = bufs.length();
			char buf[n+1];
			strcpy(buf, bufs.c_str());
		
			xml_scanner* scanner_p = new xml_scanner();
			dom_xml_handler* xhandler_p = new dom_xml_handler(zorp,"/",path);
			scanner_p->scan(buf, n, dynamic_cast<scan_handler*>(xhandler_p));
			doc_node = dynamic_cast<dom_document_node*>(xhandler_p->context_node());
		
			delete xhandler_p;
		}
		  
		item_t doc_iterator::_next()
		{
			document_node* result = doc_node;
			doc_node = NULL;
			return result;
		}
		
		void doc_iterator::_close()
		{
		}

The uri_resolver takes a URI and returns as input stream. Some 
alternatives are possible.   The uri_rsolver could return an 
xml_istream, or it could return an abstract document node. 

Given an istream from the uri_resolver, we buffer all the input
characters and then process them using the zorba native XML
scanner.  The result initializes the doc_iterator state to
hold a single document_node (in the native DOM model).

The _next method returns this document_node and sets the state
to 'done'.

The _close method has no specific actions (everything is done
in the generic 'done' method of 'basic_iterator').









  
Complete Example
----------------

We use the input document "sample.xml":

		<sample>
			<item xmlns:x="http://a.b.c" ata="1.1" atb="1.2">one
				<x:sub-item>one_one</x:sub-item>
				<x:sub-item>one_two</x:sub-item>
				<x:sub-item>one_three</x:sub-item>
			</item>
			<item xmlns:y="http://d.e.f" ata="2.1" atb="2.2">two
				<y:sub-item>two_one</y:sub-item>
				<y:sub-item>two_two</y:sub-item>
			</item>
			<item ata="3.1" atb="3.2">
				<sub-item>three_one</sub-item>
				three
			</item>
			<item at="4">four</item>
			<item at="5">five</item>
		</sample>


Tracing the complete evaluation of 'fn:doc"sample.xml")':


		library.cpp:114::init : fn_doc_key = 3375376522981177514
		StringLiteral [ ]
		Literal [string]
		PrimaryExpr [literal]
		FilterExpr [primary]
		StepExpr [filter]
		RelativePathExpr [step]
		PathExpr [relative]
		ValueExpr [path]
		UnaryExpr [value]
		CastExpr [unary]
		CastableExpr [cast]
		TreatExpr [castable]
		InstanceofExpr [treat]
		IntersectExceptExpr [instanceof]
		UnionExpr [interexcept]
		MultiplicativeExpr [union]
		AdditiveExpr [mult]
		RangeExpr [add]
		FTContainsExpr [range]
		ComparisonExpr [ftcontains]
		AndExpr [comp]
		OrExpr [and]
		ExprSingle [OrExpr]
		ArgList [single]
		FunctionCall [arglist]
		PrimaryExpr [funcall]
		
		FilterExpr [primary]
		StepExpr [filter]
		RelativePathExpr [step]
		PathExpr [relative]
		ValueExpr [path]
		UnaryExpr [value]
		CastExpr [unary]
		CastableExpr [cast]
		TreatExpr [castable]
		InstanceofExpr [treat]
		IntersectExceptExpr [instanceof]
		UnionExpr [interexcept]
		MultiplicativeExpr [union]
		AdditiveExpr [mult]
		RangeExpr [add]
		FTContainsExpr [range]
		ComparisonExpr [ftcontains]
		AndExpr [comp]
		OrExpr [and]
		ExprSingle [OrExpr]
		Expr [single]
		QueryBody [expr]
		MainModule [querybody]
		Module [main]
		
		Syntax tree:
		  MainModule[
		    QueryBody[
		      Expr[
		        FunctionCall[
		          QName[ fn:doc          ]
		
		          ArgList[
		            StringLiteral[strval="sample.xml"
		            ]
		          ]
		        ]
		      ]
		    ]
		  ]
		
		Expression tree:
		 -normalize_visitor.cpp:899::begin_visit: Expr
		|  -normalize_visitor.cpp:930::begin_visit: FunctionCall
		| |  -normalize_visitor.cpp:87::begin_visit: ArgList
		| | |  -normalize_visitor.cpp:1061::begin_visit: StringLiteral
		| | |  -normalize_visitor.cpp:2268::end_visit: StringLiteral
		| |  -normalize_visitor.cpp:1440::end_visit: ArgList
		|  -normalize_visitor.cpp:2125::end_visit: FunctionCall : argstack.size() = 1
		 -normalize_visitor.cpp:2087::end_visit: Expr
		
		  fo_expr[
		    xs_qname([http://www.w3.org/2005/xpath-functions]fn:fn_doc)
		      string[sample.xml      ]
		
		    ]
		
		Codegen:
		 -plan_visitor.cpp:87::begin_visit: fo_expr
		|  -plan_visitor.cpp:160::begin_visit: literal_expr
		|  -plan_visitor.cpp:347::end_visit: literal_expr
		 -plan_visitor.cpp:274::end_visit: fo_expr
		
	
Iterator run:
	
		[10]<sample>
		[7]<item xmlns:x="http://a.b.c" ata="1.1" atb="1.2" >one
			[1]<x[http://a.b.c]:sub-item>one_one</x[http://a.b.c]:sub-item>
			[1]<x[http://a.b.c]:sub-item>one_two</x[http://a.b.c]:sub-item>
			[1]<x[http://a.b.c]:sub-item>one_three</x[http://a.b.c]:sub-item>
		</item>
		[5]<item xmlns:y="http://d.e.f" ata="2.1" atb="2.2" >two
			[1]<y[http://d.e.f]:sub-item>two_one</y[http://d.e.f]:sub-item>
			[1]<y[http://d.e.f]:sub-item>two_two</y[http://d.e.f]:sub-item>
		</item>
		[3]<item ata="3.1" atb="3.2" >
			[1]<sub-item>three_one</sub-item>
			three
		</item>
		[1]<item at="4" >four</item>
		[1]<item at="5" >five</item></sample>
		[7]<item xmlns:x="http://a.b.c" ata="1.1" atb="1.2" >one
			[1]<x[http://a.b.c]:sub-item>one_one</x[http://a.b.c]:sub-item>
			[1]<x[http://a.b.c]:sub-item>one_two</x[http://a.b.c]:sub-item>
			[1]<x[http://a.b.c]:sub-item>one_three</x[http://a.b.c]:sub-item>
		</item>
		[5]<item xmlns:y="http://d.e.f" ata="2.1" atb="2.2" >two
			[1]<y[http://d.e.f]:sub-item>two_one</y[http://d.e.f]:sub-item>
			[1]<y[http://d.e.f]:sub-item>two_two</y[http://d.e.f]:sub-item>
		</item>
		[3]<item ata="3.1" atb="3.2" >
			[1]<sub-item>three_one</sub-item>
			three
		</item>
		[1]<item at="4" >four</item>
		[1]<item at="5" >five</item>
		</sample>
	
		dom_xml_handler.cpp:272::etag stack pop
		done
	

The notations are:

  	[10]<sample>	means that <sample> has 10 children.
		x[http://a.b.c]:sub-item  means that x:sub-item has namespace 'http://a.b.c'


	
		

