Mediatum dev:Webservice REST

Web Service REST


REST (REpresentational State Transfer architecture):

Resource-oriented architectural style for web applications, based on the doctoral dissertation of R. T. Fielding (2000).

Seems to be well suited to resource-oriented systems like ordering systems, resource servers, catalogues, ...

For interaction-oriented systems like online games, collaboration etc., SOAP/XML-RPC seem to find favour.

Some Characteristics

a) Resource oriented

Data are seen as resources which can be accessed via a unique URI, without side effects and repeatably ("idempotent"). The resources may be served in different representations or formats (XML, JSON, CSV, Excel, HTML, ...).

The resources are accessed via HTTP (GET, POST, PUT, DELETE, ...).

REST is not a standard but a style that makes use of established standards such as HTTP and URI.

  • + access to resources can be organized or shared via links or bookmarks
  • + security: access may be limited more easily with a firewall, and access may be more transparent than with SOAP/XML-RPC
  • + the unique correspondence between resource and URI/URL may simplify search engine optimization
  • + performance: less XML has to be parsed compared to SOAP
  • - URI design is crucial
  • - more tool kits seem to be available for SOAP (especially for Java)

b) Hypermedia usage

A resource may contain links to other resources (possibly in other systems).

This may allow application states and asynchronous communication.


Example: A user adds a new resource (a PDF file) via HTTP PUT or POST (../docs/123), and the file has to undergo a time-consuming OCR process. In response, the user immediately receives a resource with a link to a processing state resource for the process started at 2011-01-26T15:41:54 (.../processes/docs/123/job-2011-01-26_15-41-54), where s/he can follow the progress of the document's processing (comparable to a tracking link for an online order). In this manner the communication stays stateless for the server, while the client receives the information it needs for a stateful interaction.
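The tracking-link pattern described above can be sketched in a few lines of Python. This is only an in-memory illustration and not mediatum code: the functions submit_document and get_status, the jobs dictionary and the use of HTTP status 202 (Accepted) are assumptions made for the example; only the layout of the status URI follows the text.

```python
from datetime import datetime

# Hypothetical in-memory store: status URI -> processing state.
jobs = {}


def submit_document(doc_id, now):
    """Accept a document and return at once with a link to its processing state."""
    job_id = now.strftime("job-%Y-%m-%d_%H-%M-%S")
    status_uri = "/processes/docs/{}/{}".format(doc_id, job_id)
    jobs[status_uri] = {"state": "running", "progress": 0}
    # 202 is an assumption: the request was accepted, but processing is not finished.
    return {"status": 202, "links": {"status": status_uri}}


def get_status(status_uri):
    """Return the processing state the client can poll, like a tracking link."""
    return jobs[status_uri]


response = submit_document(123, datetime(2011, 1, 26, 15, 41, 54))
print(response["links"]["status"])
print(get_status(response["links"]["status"]))
```

The server keeps no per-client session here; everything the client needs to continue is contained in the returned link.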

REST and mediatum:

The interfaces xmlsearch and jssearch are already RESTful.


During startup core.webconfig.loadServices() will be called to load web services conformant to these prerequisites:

For each service, a subfolder in web/services/ (or in a plug-in, as in test_plugin/services/) containing the service's files is expected. The service must define a function request_handler(request), which will be called by the web server.

The name of this subfolder will be used as base context for this web service. For example, a web service residing in web/services/static01/ will be visible under http://servername/services/static01. This base context can be overridden in mediatum.cfg in the section [services] as in static01.basecontext=testcontext to make the service visible under http://servername/services/testcontext. The prefix services for all service contexts is configured in core.webconfig.CONTEXTPREFIX .

Further configuration in mediatum.cfg:

activate=true  # true/false to activate or deactivate all services
example01.activate=false  # true/false to deactivate a single service
example01.basecontext=newcontext01 # override standard base context
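How these settings combine can be sketched as follows. This is a minimal sketch using Python's standard configparser and not the code of core.webconfig: the value "services" for CONTEXTPREFIX, the helper names load_cfg and service_context, and the rule that a service without an explicit setting counts as active are assumptions.

```python
import configparser

CONTEXTPREFIX = "services"  # assumed value of core.webconfig.CONTEXTPREFIX

SAMPLE_CFG = """
[services]
activate = true
example01.activate = false
example01.basecontext = newcontext01
static01.basecontext = testcontext
"""


def load_cfg(text):
    """Parse configuration text in the style of mediatum.cfg."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def service_context(cfg, name):
    """Return the URL path of a service, or None if it is deactivated."""
    section = cfg["services"]
    # Global switch first, then the switch of the single service.
    if not section.getboolean("activate", fallback=True):
        return None
    if not section.getboolean(name + ".activate", fallback=True):
        return None
    # The subfolder name is the base context unless it is overridden.
    base = section.get(name + ".basecontext", fallback=name)
    return "/{}/{}".format(CONTEXTPREFIX, base)


cfg = load_cfg(SAMPLE_CFG)
for service in ("static01", "example01", "export"):
    print(service, service_context(cfg, service))
```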

Basic Services


The following snippet from web/services/export/ lists the contexts that are served, and the respective handlers.

collections_id = tree.getRoot('collections').id


feed_info_stachus = {'title': 'keyword stachus', 'description': "resources with keyword 'stachus' in metadata or fulltext"}

urls = [
        # format:
        # [HTTP_method, visible_pattern, handler_for_visible_pattern, (rewrite_pattern, query_for_rewrite_pattern, dictionary_for_match_groups_of_target_pattern), url_type_flag]

        # Examples for simple url alias (will be removed from production server)
        ["GET", "/stachus.rss", handlers.get_node_allchildren, ("/node/"+collections_id+"/allchildren/", {"format": "rss",  "q": "stachus", "acceptcached": 3600*6, "feed_info": feed_info_stachus}, {'id': collections_id}), SERVICES_URL_SIMPLE_REWRITE, None],
        ["GET", "/index.html$", handlers.serve_file, ("/static/index.html", {}, {'filepath': 'index.html'}), SERVICES_URL_SIMPLE_REWRITE, None],
        ["GET", "/$",           handlers.serve_file, ("/static/index.html", {}, {'filepath': 'index.html'}), SERVICES_URL_SIMPLE_REWRITE, None],

        # Patterns served directly by a handler; raw strings keep the regex backslashes intact
        ["GET", r"/node/(?P<id>\d+)/{0,1}$", handlers.get_node_single, None, SERVICES_URL_HAS_HANDLER, None],
        ["GET", r"/node/(?P<id>\d+)/children/{0,1}$", handlers.get_node_children, None, SERVICES_URL_HAS_HANDLER, None],
        ["GET", r"/node/(?P<id>\d+)/allchildren/{0,1}$", handlers.get_node_allchildren, None, SERVICES_URL_HAS_HANDLER, None],
        ["GET", r"/node/(?P<id>\d+)/parents/{0,1}$", handlers.get_node_parents, None, SERVICES_URL_HAS_HANDLER, None],

        ["GET", r"/static/(?P<filepath>.*)$", handlers.serve_file, None, SERVICES_URL_HAS_HANDLER, None],
]


The handler functions from the handlers module will only be called directly for urls of type SERVICES_URL_HAS_HANDLER. For urls of type SERVICES_URL_SIMPLE_REWRITE the handler for the rewrite_pattern will be called, and the dictionary query_for_rewrite_pattern will be added to the dictionary req.params of the request req. If an rss feed is served via a url alias, the title and description of the channel have to be passed in a dictionary, as in the 'stachus' example above. The dictionary entries will be used to fill the title and description of the channel.

Remark: For rss feeds generated from requests on patterns with handlers, the path and query of the request will be used to build a default title, and no description will be supplied.

The aliases have been introduced to allow for more illustrative patterns for often used requests.
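The difference between the two url types can be illustrated with a small dispatcher. This is a simplified sketch and not the request_handler of the export service: the numeric values of the two flags, the handler signature (params plus the match groups as keyword arguments), the node id 42 and the function dispatch itself are assumptions made for the example.

```python
import re

# Assumed values; the real constants are defined in mediatum.
SERVICES_URL_HAS_HANDLER = 1
SERVICES_URL_SIMPLE_REWRITE = 2


def get_node_single(params, id):
    # Hypothetical stand-in for handlers.get_node_single
    return ("single", id, dict(params))


def get_node_allchildren(params, id):
    # Hypothetical stand-in for handlers.get_node_allchildren
    return ("allchildren", id, dict(params))


urls = [
    # alias: visible as /stachus.rss, served by the handler of the rewrite pattern
    ["GET", r"/stachus\.rss$", get_node_allchildren,
     ("/node/42/allchildren/", {"format": "rss", "q": "stachus"}, {"id": "42"}),
     SERVICES_URL_SIMPLE_REWRITE, None],
    ["GET", r"/node/(?P<id>\d+)/{0,1}$", get_node_single,
     None, SERVICES_URL_HAS_HANDLER, None],
    ["GET", r"/node/(?P<id>\d+)/allchildren/{0,1}$", get_node_allchildren,
     None, SERVICES_URL_HAS_HANDLER, None],
]


def dispatch(method, path, params):
    """Find the first matching pattern and call the responsible handler."""
    for http_method, pattern, handler, rewrite, url_type, _unused in urls:
        if method != http_method:
            continue
        match = re.match(pattern, path)
        if match is None:
            continue
        if url_type == SERVICES_URL_SIMPLE_REWRITE:
            rewrite_path, extra_query, _groups = rewrite
            # The alias query is added to the request parameters,
            # then the rewritten path is dispatched like a normal request.
            merged = dict(params)
            merged.update(extra_query)
            return dispatch(method, rewrite_path, merged)
        return handler(params, **match.groupdict())
    return None


print(dispatch("GET", "/stachus.rss", {}))
print(dispatch("GET", "/node/7", {"format": "json"}))
```

In this sketch the alias simply re-enters the dispatcher with the rewritten path, so the match groups are taken from the target pattern rather than from the third element of the rewrite tuple.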

What the handlers do:


A request on /node/{id} will return a representation of the node specified by the id. The output format (xml being default) may be specified in the query parameter format (see below).


A request on /node/{id}/parents will return a representation of a list with the data of the direct predecessors of the specified node.


A request on /node/{id}/children will return a representation of a list with the data of the direct descendants of the specified node.


A request on /node/{id}/allchildren will return a representation of a list with the data of all descendants of the specified node.

The result lists can be manipulated using the following query parameters:

format: specifies the response format. Currently supported: xml (default), json, csv, rss.
The xml data contains the id, name and type of the node,
the attributes and information about the files of the node.
Only xml and json give the full data of the node. The csv response is limited to node id, name, type and the attributes.

The csv format is best viewed in OpenOffice Calc with semicolon as separator and (") as string delimiter.
Excel will not detect the utf-8 encoding unless a byte order mark is requested with 'bom' (see below).

The rss item of the response can be configured in an export mask named <code>rss</code> for the metadatatype of the node.
If no such export mask is present, the fields of the <code>nodesmall</code> mask of the metadatatype of the node will be used to fill the rss item.
The surrounding <code><item/></code> tag will be added by the handler.
The handler will always add the node type as <code><category/></code> to the item.

A compressed response may be requested for all formats by adding &gzip or &deflate to the query.
This will reduce the payload and may be an option over slow network connections.
The compression rates for gzip and deflate are virtually the same (the formats differ
only by a few bytes of header and trailer).
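The size claim can be checked with Python's standard library. This is a sketch under the assumption that "deflate" means the zlib format commonly used for HTTP and that both variants use the same compression level; the sample payload is made up and how mediatum compresses internally is not shown here.

```python
import gzip
import zlib

# Made-up, highly repetitive payload standing in for an xml response.
payload = b"<node id='1' type='document'/>" * 1000

deflated = zlib.compress(payload, 9)               # zlib wrapper around deflate data
gzipped = gzip.compress(payload, compresslevel=9)  # gzip wrapper around deflate data

print("uncompressed:", len(payload))
print("deflate:     ", len(deflated))
print("gzip:        ", len(gzipped))
print("difference:  ", len(gzipped) - len(deflated))
```

Both variants carry the same deflate stream; only the wrapper around it differs, so the difference stays constant and becomes negligible for larger responses.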

The xml and json responses will contain information on the processing time of different steps of response generation.
They also contain a shortlist of the result nodelist of the request containing only node id and type if the query parameter 'add_shortlist' is added to the request.
This shortlist is not affected by the query parameters <code>start</code> and <code>limit</code> (see below).
It has been introduced to reduce network traffic when the web service is used for browsing, overview or similar purposes.

start: specifies the index of the first node to be returned
limit: specifies the maximal number of nodes to be returned

type: only nodes of the specified type (regular expressions are allowed) are returned 
(/node/{id}/children?type=directory will return a list of the subdirectories of the node)

i0, i1: specify the slice of the shortlist to be returned like ''shortlist[i0:i1]'' in Python (xml and json only)

sortfield: specifies a comma-separated attribute list for sorting the result.
Besides node attributes the following sortfields may be used: "", "", "node.type", "node.orderpos".
The sorting is lexical and ascending by default. Descending order for a sortfield can be chosen by prefixing it with a minus sign '-'.
The default lexical sorting may be overridden with the parameter "sortformat", which takes one flag per sortfield: "s" (string, i.e. lexical), "i" (integer) or "f" (float).
For example, with 4 sortfields of which the second should be read as integer and the fourth as float, "sortformat=sisf"
should be added to the query.

Remark: When csv is used as format, the sortfield(s) are advanced to the first column(s) after the node id, type and name
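The interplay of sortfield and sortformat can be sketched like this. It is an illustration of the described semantics only, not the handler code: nodes are assumed to be plain dictionaries that contain every sort attribute, and the function name sort_nodes is hypothetical.

```python
CONVERTERS = {"s": str, "i": int, "f": float}


def sort_nodes(nodes, sortfield, sortformat=""):
    """Sort dictionaries by several fields; '-' prefix means descending."""
    fields = [f.strip() for f in sortfield.split(",") if f.strip()]
    result = list(nodes)
    # Python's sort is stable, so sorting by the last field first and the
    # first field last yields the combined multi-field order.
    for index in reversed(range(len(fields))):
        field = fields[index]
        descending = field.startswith("-")
        name = field[1:] if descending else field
        # Missing flags fall back to lexical ("s") sorting.
        flag = sortformat[index] if index < len(sortformat) else "s"
        convert = CONVERTERS[flag]
        result = sorted(result, key=lambda node: convert(node[name]), reverse=descending)
    return result


nodes = [
    {"title": "b", "year": "9"},
    {"title": "a", "year": "10"},
    {"title": "c", "year": "10"},
]
print([n["title"] for n in sort_nodes(nodes, "year,-title", "is")])
print([n["title"] for n in sort_nodes(nodes, "year,-title")])
```

With "sortformat=is" the year is compared as a number, so 9 sorts before 10; without it the comparison is lexical and "10" sorts before "9".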

q: allows specifying a search below the given node to generate the result
/node/{id}/allchildren/?q=tea%20AND%20coffee     full search (metadata and fulltext) for ''tea'' and ''coffee''
/node/{id}/allchildren/?format=rss&q=year=2011   search in metadata attribute ''year'' for value ''2011''
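On the client side such request urls can be assembled with the standard library, which also takes care of the percent-encoding. A minimal sketch: the helper node_url, the base url http://servername/services/export and the node id 42 are assumptions made for the example.

```python
from urllib.parse import quote, urlencode


def node_url(base, node_id, relation="", **query):
    """Build a request url such as <base>/node/<id>/allchildren/?format=rss&q=..."""
    path = "{}/node/{}/".format(base.rstrip("/"), node_id)
    if relation:
        path += relation + "/"
    if not query:
        return path
    # quote_via=quote encodes spaces as %20 instead of '+'
    return path + "?" + urlencode(query, quote_via=quote)


BASE = "http://servername/services/export"  # assumed base context of the export service

print(node_url(BASE, 42, "allchildren", q="tea AND coffee"))
print(node_url(BASE, 42, "children", type="directory", start=0, limit=10))
```

Note that urlencode also encodes the '=' inside a value such as year=2011, which is equivalent for a server that decodes the query string.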

attrreg: allows specifying a node attribute (metadata) that has to match a given regular expression.
"attrreg" allows much faster results than "q" for this special case of matching an attribute.
Note that only regular expressions that do not violate the url syntax will work.
If one is only interested in nodes where the attribute "author" matches ".*(H|h)uber.*", use

acceptcached: query parameter to indicate the age (measured in seconds) of a cached query that the client would accept as a result.
The default value is 0.
Remark: Only the unsliced and unsorted python data of the result may be cached - the slicing,
sorting  and rendering to output formats will - at least in this version - always be computed.

sep: field separator used for csv format
delimiter: string delimiter used for csv format
bom: (no value needed) adds a utf-8 byte order mark to the output to make it more Excel-friendly

A translation dictionary for the values for sep and delimiter can be configured in the function handlers.struct2csv(...)
    trans = {
             'none': u(''),
             'tab':  u('\t'),
             'quote':  u("'"),
             'dquote':  u('"'),
            }
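How sep, delimiter and bom act together can be sketched with Python's csv module. This is not handlers.struct2csv: the function rows_to_csv, the reduced translation table, the choice to quote all non-numeric fields and the defaults (semicolon and double quote) are assumptions for illustration.

```python
import csv
import io

# Reduced, assumed translation table for symbolic parameter values.
TRANS = {"tab": "\t", "quote": "'", "dquote": '"'}


def rows_to_csv(rows, sep=";", delimiter='"', bom=False):
    """Render rows as utf-8 encoded csv bytes."""
    sep = TRANS.get(sep, sep)
    delimiter = TRANS.get(delimiter, delimiter)
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=sep,            # field separator ('sep' in the query)
        quotechar=delimiter,      # string delimiter ('delimiter' in the query)
        quoting=csv.QUOTE_NONNUMERIC,
        lineterminator="\n",
    )
    writer.writerows(rows)
    text = buffer.getvalue()
    if bom:
        # U+FEFF encodes to the three bytes EF BB BF in utf-8.
        text = "\ufeff" + text
    return text.encode("utf-8")


print(rows_to_csv([["id", "name"], [1, "Müller"]]))
print(rows_to_csv([["id", "name"]], sep="tab", delimiter="quote", bom=True))
```

Note the naming clash: the query parameter delimiter corresponds to quotechar in the csv module, while sep corresponds to the module's delimiter argument.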

mimetype: override the default mime type of the response
Excel-friendly output that opens directly in an Excel sheet in Internet Explorer can be generated this way:
Remark: Use mimetype=application/vnd.oasis.opendocument.spreadsheet to make the browser open OpenOffice or LibreOffice Calc.

disposition: set the disposition string in the header of the response
(directly opens the browser's download dialog)
(opens in the browser, but offers the given filename when the user saves)

Remark 1: Firefox is much faster to render large xml responses in this format than Internet Explorer.

Remark 2: The first queries after a system restart may take quite a long time.

Remark 3: To reduce traffic in case of limited bandwidth or slow network connection the query parameter "gzip" or "deflate" (no values required) may be added to the request. The server will then send compressed data.

Remark 4: For compatibility with Mediatum_dev:JavaScriptExport the json response will evaluate the following parameters:

mask: 'none', 'apa', 'default' (-> 'none' being the default value)
The html markup of the default shortview mask content for each node is added to the response if the query has 'mask=default'. With 'mask=apa' the output delivers html markup in the APA format. Each field is wrapped in a <span> tag with its own css class. If needed, other masks can be selected with mask=<mediatum mask name>.

maskcache: 'none', 'deep', 'shallow' (-> 'deep' being the default value)
For performance reasons the masks switched on by 'mask=default' will be cached.
This cache may be switched off with 'maskcache=none', or switched to a shallower, less performant cache
using 'maskcache=shallow'. These caches will be flushed when a mask is edited.
The hit statistics of the deep mask cache may be found under

attrspec: 'none', 'all', 'default_mask' (-> 'default_mask' being the default value)
A single node may have more than a hundred attributes. The output of all or none of them may be triggered
with 'attrspec=all' or 'attrspec=none'. With 'attrspec=default_mask' only the attributes for the fields
of the default shortview mask of the node's metadata type will be sent.

The set of attributes chosen with 'attrspec' may be enriched by the 
(comma separated) attribute list in 'attrlist'.

Remark: If fields are specified in 'mediatum_config' in the JavaScript export, 
'attrspec=all' will be added to the query run by 'mediatum_load'. 

Some examples to help to illustrate this:

all attributes plus mask                          &mask=default&attrspec=all
mask plus mask attributes                         &mask=default
no mask, only mask attributes                     (this is the default)
only mask, no attributes                          &mask=default&attrspec=none
no mask, only 1 attribute (title)                 &mask=none&attrspec=none&attrlist=title
no mask, 3 attributes                             &mask=none&attrspec=none&attrlist=year,title,author.fullname
no mask, mask attributes + update-,creationtime   &mask=none&attrlist=updatetime,creationtime
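The attribute selection behind these examples can be sketched as a small function. It mirrors the described behaviour of attrspec and attrlist only; the function select_attributes, the plain-dictionary node model and the decision to skip attributes the node does not have are assumptions, and the mask markup itself is left out.

```python
def select_attributes(attrs, mask_fields, attrspec="default_mask", attrlist=""):
    """Choose which node attributes are sent.

    attrs:       all attributes of the node (name -> value)
    mask_fields: field names of the default shortview mask
    """
    if attrspec == "all":
        names = list(attrs)
    elif attrspec == "none":
        names = []
    elif attrspec == "default_mask":
        names = [name for name in mask_fields if name in attrs]
    else:
        raise ValueError("unknown attrspec: {!r}".format(attrspec))
    # attrlist enriches the chosen set with further attributes.
    for name in attrlist.split(","):
        name = name.strip()
        if name and name in attrs and name not in names:
            names.append(name)
    return {name: attrs[name] for name in names}


node_attrs = {"title": "T", "year": "2011", "author.fullname": "A", "updatetime": "u"}
shortview_fields = ["title", "year"]

print(select_attributes(node_attrs, shortview_fields))
print(select_attributes(node_attrs, shortview_fields, "none", "year,title,author.fullname"))
print(select_attributes(node_attrs, shortview_fields, attrlist="updatetime,creationtime"))
```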

Using Export Service with JavaScript

moved to Mediatum_dev:JavaScriptExport