Roadmap
User Interface
The current user interface is effective but we can
do much better in some areas. There is plenty of
low-hanging fruit to be picked there, not only in
the looks-nice department but also ergonomically.
We will roll out enhancements progressively as new
releases come.
- Reorganize the "tsinfo" view with tabs and a generally less confusing layout. The History and Edition views should be easily reachable from there, each in its own section.
- Replace the "cockpit" page with something more immediately useful (that could be the search page). Also provide a way to jump easily from one view to another without juggling with uris or going back to the cockpit.
- A web UI for the basket feature (stored dynamic search queries), allowing the definition, editing, deletion, listing and use of baskets.
- Allow the user to configure a default horizon and a time zone for the views showing series (tsinfo, quickview, history, edition), with date + time and predefined horizons. On tsinfo we probably want the ability to browse back and forth using the selected horizon.
- Overhaul the series Edition view to allow pasting from Excel or defining values from simple rules plus a granularity indication.
- Provide a zero-code dashboarding system with sophisticated presentation capabilities, giving complete autonomy to end users.
- Provide a revamped Excel client. The current one works as designed but has a number of limitations and shortcomings (some due to Excel itself). We at least want time zone support.
- Formula editor: add undo/redo and copy/cut/paste capabilities to the edition tree.
Data management
Data management is the core business of the tool;
we will complete the existing tool set to achieve
brilliance in this area, as time and budget
permit.
- Add support for pandas 2.x. This will open interesting performance possibilities in the future, especially with the PyArrow backend. PyArrow allows interoperability with other Arrow-based data transformation tools.
- Better tools to help spot missing data (e.g. associate a series with a granularity - by inference or user choice - when showing it on the info or quickview views). Maybe permit the use of bars rather than just lines.
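The granularity inference mentioned above can be sketched in a few lines of plain Python (the helper names are hypothetical, not the refinery API): take the most common delta between consecutive timestamps as the granularity, then walk the series to list the holes.

```python
# Sketch only: infer a series granularity as the most common delta
# between consecutive timestamps, then enumerate the missing stamps.
from collections import Counter
from datetime import datetime, timedelta

def infer_granularity(stamps):
    deltas = [b - a for a, b in zip(stamps, stamps[1:])]
    return Counter(deltas).most_common(1)[0][0]

def missing_stamps(stamps):
    step = infer_granularity(stamps)
    holes = []
    for a, b in zip(stamps, stamps[1:]):
        t = a + step
        while t < b:
            holes.append(t)
            t += step
    return holes

# hourly series with one hole at 03:00
stamps = [datetime(2023, 1, 1, h) for h in (0, 1, 2, 4, 5)]
assert infer_granularity(stamps) == timedelta(hours=1)
assert missing_stamps(stamps) == [datetime(2023, 1, 1, 3)]
```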
- Provide a data monitoring tool. We want to know if a series has not been updated (as per its expected update rules), and if there is missing data.
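A minimal sketch of such a staleness check, assuming a made-up rule format (one expected update period per series; none of these names exist in the refinery):

```python
# Toy staleness check: a series is flagged when its last update is
# older than its expected update period.
from datetime import datetime, timedelta

def stale_series(last_updates, expected_period, now):
    return sorted(
        name for name, last in last_updates.items()
        if now - last > expected_period[name]
    )

now = datetime(2023, 9, 1, 12)
last = {
    'prices.daily': datetime(2023, 8, 30),       # two days behind
    'meteo.hourly': datetime(2023, 9, 1, 11, 30) # fresh enough
}
period = {
    'prices.daily': timedelta(days=1),
    'meteo.hourly': timedelta(hours=1),
}
assert stale_series(last, period, now) == ['prices.daily']
```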
- Provide means to browse a series' reverse dependencies.
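One way to implement this - purely illustrative, not the actual formula-dependency code - is to invert the formula-to-inputs mapping and walk it transitively:

```python
# Toy reverse-dependency index: formulas map to their input series;
# invert that mapping to answer "who consumes this series?".
def reverse_deps(deps):
    rdeps = {}
    for formula, inputs in deps.items():
        for name in inputs:
            rdeps.setdefault(name, set()).add(formula)
    return rdeps

def consumers(rdeps, name):
    # transitive closure: direct and indirect consumers of `name`
    seen, stack = set(), [name]
    while stack:
        for parent in rdeps.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

deps = {'spread': ['gas.price', 'power.price'], 'signal': ['spread']}
rdeps = reverse_deps(deps)
assert consumers(rdeps, 'gas.price') == {'spread', 'signal'}
```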
- Allow history rewriting, as in: inserting versions of a series older than the current one.
- Allow data supervision on formulas. Right now only stored series can receive supervised corrections.
- Associate series with classification schemes (or folder-like hierarchies). Searching by name / metadata is powerful, but people are used to browsing folders to find items; this can help the discoverability of series.
- Data-mesh: move the sources definition to the database (it currently lives in configuration files) and provide an API point and a web UI to edit it.
- Data-mesh: provide an active mode, where events like series editing, renaming and deletion are propagated to their consumers for good effect (most notably the maintenance of the whole system's integrity across instances).
- Have a notification hub for various events. People should be able to ask to be notified of task failures, series (non-)updates, etc., and the hub will use whatever preferred and available method (email, chat) to inform them.
- A better data model for market data. It is possible to represent it with naked time series, but with a number of downsides. A richer data model is being worked on to make this easy and reasonable.
- An API for model management. We want a lightweight formalism to express the inputs/outputs of a model, to make it easily trackable. This will also make model backtesting easier.
Storage
The current postgres-based storage system works
very well and provides fast transactional storage
with good density. It will remain the default for
a long time. However, we have a plan for something
with better performance (less storage space, lower
latency) and big-data scalability.
- High-performance storage on the filesystem: have an alternative to postgres for the time series themselves. The catalog, metadata, etc. will still reside within postgres. However, with the time series on the filesystem, we will get lower latencies (on reads/writes), much faster dump/restore times (which can become a bottleneck with big postgres databases) and potentially infinite scalability if used in conjunction with a distributed filesystem such as CephFS.
- An alternative to postgres (with or without the high-performance item above) could be to use sqlite, for a lightweight personal refinery-on-my-laptop deployment, without the hassle of deploying postgres there.
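To give a feel for the idea, here is a minimal sqlite-backed series store using only the standard library. The schema and function names are illustrative; this is not the actual tshistory storage layout.

```python
# Sketch of a tiny sqlite series store for a laptop deployment.
import sqlite3

cx = sqlite3.connect(':memory:')
cx.execute('create table series (name text, stamp text, value real, '
           'primary key (name, stamp))')

def update(cx, name, points):
    # upsert: new stamps are inserted, existing stamps are overwritten
    cx.executemany(
        'insert into series values (?, ?, ?) '
        'on conflict (name, stamp) do update set value = excluded.value',
        [(name, stamp, value) for stamp, value in points.items()]
    )

def get(cx, name):
    rows = cx.execute(
        'select stamp, value from series where name = ? order by stamp',
        (name,)
    )
    return dict(rows)

update(cx, 'temp.paris', {'2023-01-01': 4.5, '2023-01-02': 5.0})
update(cx, 'temp.paris', {'2023-01-02': 5.5})  # override one point
assert get(cx, 'temp.paris') == {'2023-01-01': 4.5, '2023-01-02': 5.5}
```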
Security
As of today we provide few security features: it
is possible to configure http basic auth out of
the box. While this general leniency with respect
to security is not an issue in practice in many
organizations, it is a real concern in others, and
that will be addressed. We want to speak the
standard security protocols (OAuth2 / OIDC) and to
have a basic permissioning system.
- Support the OAuth2 / OIDC authentication stack, with concrete implementations for Cognito (aws) and Okta. Be open to other providers. Investigate Active Directory.
- Once we can identify people, we can have a permissioning system. The basic permissions could be read, update, rename and delete for time series, and read, launch and delete for tasks. It should be possible to define roles having a given permission set and to associate people with roles, through the API and the Web UI.
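A toy sketch of such a role/permission model, using the permission names from the text above (the role names and functions are made up):

```python
# Toy role-based permission check: roles grant permission sets per
# domain (series, tasks); a user may hold several roles.
ROLES = {
    'reader':  {'series': {'read'}, 'tasks': {'read'}},
    'analyst': {'series': {'read', 'update', 'rename'},
                'tasks': {'read', 'launch'}},
    'admin':   {'series': {'read', 'update', 'rename', 'delete'},
                'tasks': {'read', 'launch', 'delete'}},
}

def allowed(user_roles, domain, permission):
    # granted as soon as one of the user's roles carries the permission
    return any(
        permission in ROLES[role].get(domain, ())
        for role in user_roles
    )

assert allowed(['analyst'], 'series', 'update')
assert not allowed(['analyst'], 'tasks', 'delete')
assert allowed(['reader', 'admin'], 'tasks', 'delete')
```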
Version 0.8 (shipped 2023-09-07)
Developed between January and August 2023, this
release contains a number of powerful new features
and also some internal changes. In this report we
will separate these clearly.
API points
- A new, very powerful search facility has been added, spelled .find. Find takes a lisp-expression string that combines filters on the series name, metadata keys and values, and for the latter permits writing inequalities. The query returns a list of "series descriptors": lightweight objects that provide the series name and source, and optionally their metadata. The documentation can be found in the API doc. This will probably replace the older .catalog API point, which is less convenient to use than .find.
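To give a feel for what .find computes, here is a toy illustration over in-memory series descriptors. The real API point takes a lisp-expression string; we approximate it here with plain Python predicates, and all descriptor names and values are made up.

```python
# Toy version of find: combine a name filter with a metadata
# inequality over lightweight series descriptors.
descriptors = [
    {'name': 'power.fr.hourly', 'source': 'local',  'meta': {'horizon': 48}},
    {'name': 'power.de.hourly', 'source': 'remote', 'meta': {'horizon': 24}},
    {'name': 'gas.fr.daily',    'source': 'local',  'meta': {'horizon': 48}},
]

def find(descriptors, *predicates):
    # keep the descriptors matching every filter, return their names
    return [d['name'] for d in descriptors if all(p(d) for p in predicates)]

found = find(
    descriptors,
    lambda d: 'power' in d['name'],        # filter on the series name
    lambda d: d['meta']['horizon'] >= 48,  # inequality on a metadata value
)
assert found == ['power.fr.hourly']
```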
- To complement the find API point, we added a number of endpoints to facilitate the management (creation, listing, deletion) of persistent named search queries, known as baskets.
- The filtering feature is also made available to the formula system. It is e.g. possible to add series over a findseries dynamic query.
- The .update_metadata method now actually only updates the metadata rather than replacing it. For the replacement behaviour, we introduced .replace_metadata.
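The difference between the two behaviours can be shown on plain dicts; this mirrors the semantics described above, not the actual implementation:

```python
# update merges the patch into the existing metadata;
# replace discards everything not in the patch.
def update_metadata(current, patch):
    merged = dict(current)  # keep existing keys
    merged.update(patch)
    return merged

def replace_metadata(current, patch):
    return dict(patch)      # drop keys absent from the patch

meta = {'unit': 'MW', 'country': 'fr'}
assert update_metadata(meta, {'unit': 'GW'}) == {'unit': 'GW', 'country': 'fr'}
assert replace_metadata(meta, {'unit': 'GW'}) == {'unit': 'GW'}
```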
- Formulas got a number of new operators: the trigonometric operators, ** (exponentiation), sub (series subtraction), abs, and round (rounding).
User Interface changes
- In the tasks manager list, it is possible to filter the service, input and status columns. This is very convenient to find failed tasks, tasks concerned by a given input, etc.
- The task manager list now uses lazy loading for its initial display. This helps a lot when a very long task tail is kept there.
- In the tsinfo view, we show the series source - it will be marked "local" if it comes from the main source.
Technical changes
- The minimal Python version is now 3.10, and pandas 1.5 support has been added.
- The data model has been simplified for series and groups to enable the filtering features.
- The tshistory.cfg configuration file now contains everything needed. The refinery.cfg file is deprecated.
- Data-mesh: the handling of an unavailable (for whatever reason) secondary source instance is now smoother. We get the local series and those of the currently available secondary sources.
- Storage: the max-bucket-size value has been lowered from 250 to 150. This provides a more optimal (on average more compact) chunking strategy.
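Naively sketched, a smaller max-bucket-size just means more, smaller buckets; the function below is illustrative only, not the actual storage code:

```python
# Split a run of points into buckets of at most `max_bucket_size`
# entries; the last bucket holds the remainder.
def chunk(points, max_bucket_size=150):
    return [
        points[i:i + max_bucket_size]
        for i in range(0, len(points), max_bucket_size)
    ]

points = list(range(400))
assert [len(b) for b in chunk(points)] == [150, 150, 100]
assert [len(b) for b in chunk(points, 250)] == [250, 150]
```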
- Core packages now have a __version__ attribute. The deployed version is also stored in the database and is checked at startup time against the package versions.
- Migrations: a generic tsh migrate command has been added. It uses the deployed package versions and what is stored in the database to determine the exact migration steps to run, and actually runs them.
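A hypothetical sketch of the step-selection logic (the version numbers and step names below are made up): run every registered migration strictly newer than the version recorded in the database, up to the deployed package version.

```python
# Pick migration steps between the database's recorded version and
# the deployed package version.
MIGRATIONS = {
    '0.6': ['add-bucket-index'],
    '0.7': ['simplify-series-model'],
    '0.8': ['store-version-in-db'],
}

def steps_to_run(db_version, package_version):
    # lexicographic order happens to be fine for these toy versions
    return [
        step
        for version in sorted(MIGRATIONS)
        if db_version < version <= package_version
        for step in MIGRATIONS[version]
    ]

assert steps_to_run('0.6', '0.8') == ['simplify-series-model',
                                      'store-version-in-db']
assert steps_to_run('0.8', '0.8') == []  # nothing to do
```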
The prehistory
Genesis of a time series cache
The TimeSeries Refinery started in 2017 as an
experiment to plug a logistical hole between, on
one hand, a "big data" enterprise time series
silo, and on the other hand, people doing analysis
using Excel (and sometimes Python).
The Excel sheets could receive data from the big
data silo, but there were a number of downsides to
working like this:
- manually overridden values in Excel were overwritten at pull-from-the-silo time
- the Excel sheets tended to grow too big: they performed poorly and tended to become intractable
- the Excel sheet calculations and expert overrides were not easily shareable
- the Excel client would regularly blow up and lose its settings
- at peak hours (in the morning) the big data system tended to answer slowly and randomly crashed
So we designed a "simple cache" for the silo's
versioned time series and another Excel client to
talk to our cache. The benefits quickly became
clear:
- the "cache" would regularly update itself from the silo, hence the morning latencies went down drastically
- it also provided the ability to effectively save the analysts' data overrides as a new version that was kept from one refresh to the next (unless upstream values had changed)
- it suddenly allowed the analysts to save their own hand-made series (either "expert values" or outputs of computations or models) directly from Excel, and of course from Python too
- everything in the "cache" was easily shareable, including from Python, since we devised a simple but effective Python API to use it
- it soon became possible to quickly build reports and dashboards from what was in the "cache"
- we provided simple but effective ways to browse and display the series catalog and show the curves - which was either clumsy or impossible with the central-IT-managed time series silos
Adding computations
That initial success opened the road to the next
step: moving computations out of Excel. After
months of observing analysts' workflows with Excel,
it became clear that a notion of computed series
had to be added to the "cache". When that was done,
around 2019, using an elegant domain-specific
language for time series under a clean and simple
API, the Time Series Refinery was truly born.
We chose the simplest syntax available for the
formula language: Lisp.
This was immediately picked up by analysts (it is
after all simpler than the Excel formula language
or the ubiquitous VBA) and they started to build
very sophisticated formulas made of formulas
... down to stored series of course. We added
features to track formula dependencies and show
and edit formulas from within the browser. A
low-code platform was born.
Lastly, we also coupled the time series (stored
and computed) system with a task manager fit for
the purpose of managing scraping and model
tasks. Simple and lightweight, it provides the
maintainers of the Information System with a great
deal of insight into the health of the system and,
again, a lot of autonomy.
Towards a Universal Time Series Information System
At that point in time though, people from other
commodities or topical activities (e.g. hydrology,
meteorology) had started to use it by setting up
their own "cache". Quickly enough, it was
understood that some "caches" would be interested
in the data of another (the meteo time series are
typically a cross-interest item as they are used
as inputs in a variety of forecast models), and
that duplicating data would be a bad idea. We came
up with a straightforward implementation of the
"data mesh" concept and soon we had a web of
connected refinery instances. It then turned out
it would be possible to aggregate many of them
into some kind of "data hub" for further
downstream usage. This is the basis of the current
EnergyScan
commercial offering. While doing so we also made a
number of things better:
- the catalog browsing provided nifty ways to filter series on name and metadata
- it was possible to have a detailed view of each series (either stored or computed)
- the API had grown two operating modes: direct mode with a postgres connection string, and http mode using an http uri; both behaved exactly the same way
- a number of subtle and generally difficult issues, pertaining to naive vs timezone-aware time series and to correctness and performance in the formula interpreter, had been ironed out
- a promising (albeit still experimental) time series groups feature was brewing, serving the needs of non-deterministic meteo forecasts
- a powerful online auto-documented formula editor was built to boost analysts' productivity
- a (very important) protocol to use the formula system as a proxy for third-party time series silos was devised and put in production (what we call the "autotrophic operators")
It is on top of these robust foundations and years
of hard work on the ground that we are confidently
bringing the TimeSeries Refinery to the commercial
open source sector. Its purpose is to reduce the
often tiresome analysts / IT back-and-forth
communications by giving the maximum possible
autonomy to the former, while discharging the
latter from many chores. We hope it will be a
resounding success!