Roadmap
User Interface
The user interface was thoroughly overhauled in
version 0.9, and we will keep working on it to
provide the best possible UI experience. Two "big
ticket" items stand out.
-
Provide a revamped Excel client. The current one
works as designed but has a number of limitations
and shortcomings (some due to Excel itself). At the
very least we want time zone support.
-
Formula editor: add undo/redo as well as
copy/cut/paste capabilities to the editing
tree.
Data management
Data management is the core business of the tool;
we will complete the existing tool set to achieve
brilliance in this area, as time and budget
permit.
-
Extend the Python API to provide seamless
tracking of model inputs and outputs. That
will significantly improve the traceability of
relationships between series and models, and
make model backtesting easier.
-
Now that Pandas 2.x is supported, interesting
performance possibilities open up, especially
with the PyArrow backend. PyArrow allows
interoperability with other Arrow-based data
transformation tools such as Polars (see the
sketch after this list).
-
Provide means to browse a series' reverse
dependencies.
-
Allow history rewriting, as in: inserting versions
of a series older than the current one.
-
Allow data supervision on formulas. Right now
only stored series can receive supervised
corrections.
-
Associate series with classification schemes (or
folder-like hierarchies). Searching by name /
metadata is powerful, but people are used to
browsing folders to find items; this can help the
discoverability of series.
-
Data-mesh: move the sources definition to the
database (it currently lives in configuration
files) and provide an api point and a web UI to
edit it.
-
Data-mesh: provide an active mode, in which
events such as series edition, renaming and
deletion are propagated to their consumers to
good effect (most notably maintaining the
whole system's integrity across instances).
-
Have a notification hub for various events. People
should be able to ask to be notified of task
failures, series (non-)updates, etc., and the hub
will use whatever preferred and available method
(email, chat) to inform them.
-
A better data model for Market Data. It is possible
to represent market data with naked time series,
but this has a number of downsides. A richer data
model is being worked on to make it easy and
reasonable.
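To make the Arrow interoperability point concrete,
here is a minimal sketch, independent of the
refinery itself, of a PyArrow-backed pandas series
being handed over to Polars (all names are
illustrative):

    # Illustrative only: pandas 2.x Arrow-backed data
    # flowing into Polars without a numpy round-trip.
    import pandas as pd
    import polars as pl

    ts = pd.Series(
        [1.0, 2.0, 3.0],
        index=pd.date_range('2024-01-01', periods=3, freq='D'),
        dtype='float64[pyarrow]',  # Arrow-backed dtype, new in pandas 2.x
    )
    # Polars is Arrow-native, so the hand-over can avoid copies.
    df = pl.from_pandas(ts.rename('value').reset_index())
    print(df)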
Storage
The current postgres-based
storage system works very well and provides fast
transactional storage with good density. It will
remain the default for a long time. However, we
have a plan for something with better performance
(less storage space, lower latency) and big-data
scalability.
-
High performance storage on the filesystem:
have an alternative to postgres for the time
series themselves. The catalog, metadata,
etc. will still reside within
postgres. However, with the time series on the
filesystem we will get lower latencies (on
reads/writes), much faster dump/restore times
(these can become a bottleneck with big
postgres databases) and potentially unbounded
scalability if used in conjunction with a
distributed filesystem such as
CephFS.
-
An alternative to postgres (with or without
the high performance item above) could be
sqlite, for a lightweight personal
refinery-on-my-laptop deployment, without the
hassle of deploying postgres there.
Version 0.9 (shipped 2024-06-15)
Developed between September 2023 and June 2024,
this release contains a large number of
features and significant fixes.
Whole new packages will also be provided with the
"Pro" (commercial customers only) version: a tool
named "tswatch" to monitor the time series status
(we often want to know whether series are in a
reasonable state of freshness) and a separate,
complete dashboarding tool with a slick low-code
UI to create the graphs.
Experimental support for authentication using
OAuth2 + OpenID Connect has been added, covering
both the API and the UI. This is also part of the
commercial offering.
API points
-
Added support for Pandas 2.2. Compatibility
with the older Pandas line (1.0.5 to 1.5.3)
is still guaranteed.
-
A new inferred_freq api point has been
added. It provides the median time delta
between points and a quality indicator (a usage
sketch follows this list).
-
The get and edited api points
have been extended with two parameters: a flag
to ask for the series resampled by its "inferred
frequency" (see above), exposing potential
missing data; and, on the http/json path only,
a horizon expression that can be provided as an
alternative to a "from_value_date" /
"to_value_date" pair. The latter is an
experimental feature developed chiefly for
the UI (to ease browsing of series back and
forth). A sketch follows this list.
-
The find search expressions now allow
filtering on internal metadata.
-
The Swagger UI has been completed with more
information to help third-party implementors
of non-Python clients. Documentation, input
types, etc. have been largely improved.
-
The experimental authentication added around
the web UI and the API supports three flows:
the application flow (the normal one, with a
browser), the machine-to-machine flow (for
robots running server-side) and PKCE (for
users accessing the refinery from a local
Jupyter notebook, for instance). A preliminary
simple security model based on readers and
writers is provided. It may evolve into
something more sophisticated, depending on
feedback.
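Here is a minimal sketch of the inferred_freq api
point through the Python client; the instance uri
and series name are invented, and the exact call
signature may differ:

    # Hypothetical sketch: uri and series name are invented;
    # the api point name is per this release note.
    from tshistory.api import timeseries

    tsa = timeseries('https://refinery.example.com/api')
    freq = tsa.inferred_freq('power.demand.fr.hourly')
    # expected: the median time delta between points,
    # plus a quality indicator
    print(freq)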
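And a sketch of the extended get call, continuing
the snippet above; the flag name is an assumption
drawn from this note (check the API documentation
for the exact spelling):

    # Hypothetical sketch: the parameter name is an assumption.
    ts = tsa.get(
        'power.demand.fr.hourly',
        inferred_freq=True,  # resample by the inferred frequency,
    )                        # so that missing points show up as holes
    # On the http/json path only, a horizon expression may
    # replace the from_value_date/to_value_date pair.
    print(ts)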
User Interface changes
This release is chock-full of changes. First of
all, the landing (home) page has been completely
redesigned and now provides much more clarity and
crisp information about where to go from there. We
have added a general menu on the left, which
considerably eases navigation amongst views.
Lastly, a number of views have been enhanced,
redesigned and sometimes completely rewritten,
with drastic effects on performance and
interaction quality.
-
The formula editor has been rewritten from the
ground up. The original version was a
"successful POC" that was difficult to
maintain. We did a complete redesign of the
internals; this already shows in a number of
visual enhancements: the code editor shows
errors/problems in a much clearer way, and the
tree editor is slicker and now allows
selecting series from the catalog for the
"series" operator. More features will come in
the next releases.
-
The independent "tshistory" view (very
powerful to look at the versions in a
synthetic way, down to the versions of an
individual point in time) has been rewritten
and included as a mode of the "tsinfo" (time
series info) view. It is now slicker and much
more reactive than before.
-
The independant "tseditor", allowing to edit
primary series and also see the values of the
underlying series of a formula, has been
rewritten and is now slicker, much more
reactive and agreeable to use. It is now
possible to paste values from Excel and
identify missing data with the "inferred
frequency" option.
-
The formula "batch upload" and "all formulas"
views have been rewritten and now provides the
same look and feel as the other views.
-
A new "basket editor" has been fleshed out. It
allows to list, create and edit baskets
(pre-recorded filters to get series list from
the catalog). Baskets can be used in the
catalog searches as well as in formulas.
-
In the task manager we now have a new "plan"
tab that shows the tasks that will be scheduled
in the next hours. In the tasks tab, the lazy
scrolling view has been fixed, and an action
to fetch everything at once has been
added. This makes the filtering on colum much
more usable without hampering the UI
responsiveness.
-
In the catalog (search) view, we can now
filter on the "tzawareness" of series. The
metadata filter is more intuitive and reactive
than before and should be much more convenient
to use.
New Tools
The commercial offering now comes with two new
powerful packages: a time series "monitoring" tool
and a full-fledged dashboard system. While only
distributed to commercial subscribers, these
packages are Open Source.
-
Time series monitoring has many angles, and one of
them is the issue of freshness. The "tswatch"
package provides a simple web UI to specify the
series (primary or formula) you want to track. It
prominently displays those that are lagging behind
(because a scraper has been failing for some time,
or for some other reason). In the case of
formulas, we can see their decomposition in terms
of primary series to diagnose which part of the
formula is stale.
-
A time series management system eventually winds
up collecting and producing data meant to be shown
to people. At the end of the production pipeline,
a dashboard is the most straightforward way to
expose what's going on.
Our graph library is linked to a web UI
that guides, restricts and documents the graph
building process through a Python-coded
specification. This UI is open-ended: a
developer can add any graph type to the
current collection, with its own
specification. Our graph collection includes
the "table" format (time series joined on
their indexes), a "scatter-plot" useful for
statistical exploration and analysis, a
"group" plot for weather (or other)
scenarios and a "balance" plot to follow any
process with inflows vs outflows. More graph
types are on the way. The balances are fully
instantiable through the web UI. The dashboard
layout is also built through a compact and
easy-to-use UI to offer the best productivity
to the users.
Technical Changes
-
Through the REST API, the /series/state (get)
route now accepts a "timezone" parameter on
the json path (a sketch follows this list).
-
The "refinery.cfg" configuration file is no
longer used. Everything is set in the
"tshistory.cfg" file.
-
An annoying bug in the task scheduler (some
scheduled tasks not firing) has been
fixed. We no longer use APScheduler.
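A sketch of the new "timezone" parameter through a
raw http call; the uri and series name are
invented, and the "format" parameter is an
assumption for selecting the json path:

    # Hypothetical sketch of the REST call.
    import requests

    resp = requests.get(
        'https://refinery.example.com/api/series/state',
        params={
            'name': 'power.demand.fr.hourly',
            'format': 'json',            # assumed json-path selector
            'timezone': 'Europe/Paris',  # new in this release
        },
    )
    print(resp.json())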
Version 0.8 (shipped 2023-09-07)
Developed between January and August 2023, this
release contains a number of powerful new features
and also some internal changes. In this report we
will separate these clearly.
API points
-
A new, very powerful search facility has been
added, spelled .find. Find takes a
lisp-expression string that combines filters
on the series name, metadata keys and values,
and for the latter permits writing
inequalities. The query returns a list of
"series descriptors": lightweight objects that
provide the series name and source, and
optionally their metadata. The documentation
can be found there: API doc. This will
probably replace the older .catalog api point,
which is less convenient to use than .find. A
hedged sketch follows this list.
-
To complement the find api point, we added a
number of api points to facilitate the
management (creation, listing, deletion) of
persistent named search queries, known as
baskets (see the sketch after this list).
-
The filtering feature is also made available
to the formula system. It is possible, for
instance, to add series over a findseries
dynamic query (sketched after this list).
-
The .update_metadata method now
actually only updates the metadata rather than
replacing it. For the replacement behaviour,
we introduced .replace_metadata (see
the sketch after this list).
-
Formulas got a number of new operators:
trigonometric operators, ** (exponentiation),
sub (series subtraction), abs and round
(rounding). Examples follow this list.
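Here is a hedged sketch of a find call; the
instance uri, series names and the exact filter
operator spellings are illustrative (the API doc
has the authoritative grammar):

    # Hypothetical sketch: the filter expression is illustrative.
    from tshistory.api import timeseries

    tsa = timeseries('https://refinery.example.com/api')
    for desc in tsa.find('(by.and (by.name "demand") (by.metakey "zone"))'):
        # descriptors carry the name and source, per this note
        print(desc.name, desc.source)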
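Continuing the sketch, basket management might look
like this (the method names follow the wording of
this note and may differ slightly):

    # Hypothetical sketch: saving and reusing a named query.
    tsa.register_basket(
        'demand-series',
        '(by.and (by.name "demand") (by.metakey "zone"))',
    )
    print(tsa.list_baskets())           # enumerate the saved queries
    print(tsa.basket('demand-series'))  # run the saved query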
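A formula over a dynamic query could look like the
following; the formula name, the filter and the
exact nesting of findseries are invented for
illustration:

    # Hypothetical sketch: summing all series matched by a query.
    tsa.register_formula(
        'demand.total',
        '(add (findseries (by.name "demand")))',
    )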
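The new metadata semantics, sketched with invented
series name and keys:

    # update_metadata now merges; replace_metadata replaces wholesale.
    tsa.update_metadata('demand.fr', {'unit': 'MW'})   # other keys are kept
    tsa.replace_metadata('demand.fr', {'unit': 'MW'})  # metadata becomes exactly this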
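Finally, two formulas using the new operators; the
formula and series names are invented, and the
exact operator signatures are documented with the
formula language:

    # Hypothetical sketches of the new operators in action.
    tsa.register_formula(
        'price.spread.abs',
        '(abs (sub (series "price.a") (series "price.b")))',
    )
    tsa.register_formula(
        'price.squared',
        '(** (series "price.a") 2)',
    )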
User Interface changes
-
In the task manager list, it is possible to
filter on the service, input and status
columns. This is very convenient for finding
failed tasks, tasks concerned by a given
input, etc.
-
The task manager list now uses lazy loading
for its initial display. This helps a lot when
a very long task tail is kept there.
-
In the tsinfo view, we show the series source; it
is marked "local" if the series comes from the
main source.
Technical changes
-
The minimal Python version is now set to 3.9,
and pandas 1.5 support has been added.
-
The data model has been simplified for series
and groups to enable the filtering features.
-
The tshistory.cfg configuration file
now contains everything
needed. The refinery.cfg file is
deprecated.
-
Data-mesh: the handling of an unavailable (for
whatever reason) secondary source instance is
now smoother. We get the local series and
those of the currently available secondary
sources.
-
Storage: the max-bucket-size value has been
lowered from 250 to 150. This provides a more
optimal (on average, more compact) chunking
strategy.
-
Core packages now have a __version__
attribute. The deployed version is also stored
in the database and checked at startup time
against the package versions.
-
Migrations: a generic tsh migrate command
has been added. It uses the deployed package
versions and the information stored in the
database to determine the exact migration
steps to run, and actually runs them.
The prehistory
Genesis of a time series cache
The TimeSeries Refinery started in 2017 as an
experiment to plug a logistical hole between, on
one hand, a "big data" enterprise time series
silo and, on the other hand, people doing
analysis using Excel (and sometimes Python).
The Excel sheets could receive data from the big
data silo, but there were a number of downsides
to working like this:
-
manually overridden values in Excel were
overwritten at pull-from-the-silo time
-
the Excel sheets tended to grow too big: they
performed poorly and tended to become intractable
-
the Excel sheet calculations and expert
overrides were not easily shareable
-
the Excel client would regularly blow up and
lose its settings
-
at peak hours (in the morning) the big data
system tended to answer slowly and to crash
randomly
So we designed a "simple cache" for the silo's
versioned time series and another Excel client to
talk to our cache. The benefits became quickly
clear:
-
the "cache" would regularly update itself from
the silo, hence the morning latencies went
down drastically
-
it also provided the ability to effectively
save the analyst's data overrides as a new
version that was kept from one refresh to the
next (unless upstream values had been changed)
-
it suddenly allowed the analysts to save their
own hand-made series (either "expert values"
or outputs of computations or models) directly
from Excel, and of course from Python too
-
everything in the "cache" was easily
shareable, including from Python since we
devised a simple but effective Python API to
use it
-
it was soon possible to quickly build reports
and dashboards from what was in the "cache"
-
we provided simple but effective ways to
browse and display the series catalog and show
the curves - which was either clumsy or
impossible with the central, IT-managed time
series silos
Adding computations
That initial success opened the road to
the next step: moving computations out of
Excel. After months of observing analysts'
workflows with Excel, it became clear that a
notion of computed series had to be added to the
"cache". When that was done, around 2019, using an
elegant domain-specific language for time series
under a clean and simple API, the Time Series
Refinery was truly born.
We chose the simplest syntax available for the
formula language:
Lisp.
This was immediately picked up by analysts (it is,
after all, simpler than the Excel formula language
or the ubiquitous VBA) and they started to build
very sophisticated formulas made of formulas
... down to stored series, of course. We added
features to track formula dependencies and to show
and edit formulas from within the browser. A
low-code platform was born.
Lastly, we also coupled the time series (stored
and computed) system with a task manager fit for
the purpose of managing scraping and model
tasks. Simple and lightweight, it gives the
maintainers of the information system a broad
overview of the health of the system and, again,
a lot of autonomy.
Towards a Universal Time Series Information System
At that point in time, though, people from other
commodities or topical activities (e.g. hydrology,
meteorology) had started to use it by setting up
their own "cache". Quickly enough, it was
understood that some "caches" would be interested
in the data of another (the meteo time series are
typically a cross-interest item, as they are used
as inputs in a variety of forecast models), and
that duplicating data would be a bad idea. We came
up with a straightforward implementation of the
"data mesh" concept, and soon we had a web of
connected refinery instances. It turned out it
would then be possible to aggregate many of them
into some kind of "data hub" for further
downstream usage. This is the basis of the current
EnergyScan
commercial offering. While doing so we also made a
number of things better:
-
the catalog browsing provided nifty ways to filter
series on name and metadata
-
it was possible to have a detailed view of each
series (either stored or computed)
-
the API had grown two operating modes: direct mode
with a postgres connection string, and http mode
using an http uri; both behaved exactly the same way
-
a number of subtle and generally difficult
issues pertaining to naive vs timezone-aware
time series, as well as correctness and
performance issues in the formula interpreter,
were ironed out
-
a promising (albeit still experimental) time
series groups feature was brewing, serving
the needs of non-deterministic meteo forecasts
-
a powerful online, auto-documented formula
editor was built to boost analysts' productivity
-
a (very important) protocol to use the formula
system as a proxy for third-party time series
silos was devised and put into production (what
we call the "autotrophic operators")
-
a cache system on top of the formula system
was devised to help with compute-heavy
formulas
It is on top of these robust foundations and years
of hard work on the ground that we are confidently
bringing the TimeSeries Refinery to the commercial
open source sector. Its purpose is to reduce the
often tiresome analysts / IT back-and-forth by
giving the former the maximum possible autonomy,
while relieving the latter of many chores. We hope
it will be a resounding success!