DATAFLOWS

D A T A F L O W S
=================

DataFlows is a novel and intuitive way of building data processing flows in Python.

- Get started quickly, easy to scale up when needed
- Low footprint, few dependencies
- Built for processing
- Made for creating portable, reusable and standards-based data packages

Set up from the command line in seconds:

    $ pip install dataflows

FEATURES
--------

- Trivial to set up and use on a local machine
- Load data from nearly anywhere, put data wherever you need it
- Easily integrates with other frameworks and tools
- Validate input data with little effort (non-zero length, right structure, etc.)
- Supports caching data from the source and even between steps
- Standard building blocks for common data manipulation tasks, promoting maintainable and reusable code
- Designed for stream processing to control memory consumption

DATAFLOWS IS GREAT FOR
----------------------

- EXTRACTing data from various sources
- TRANSFORMing the data using reusable building blocks and custom code
- LOADing data into various targets (files, databases, etc.)
- MODELling data using the Data Package and Table Schema standards

It's not an analysis tool (but it will get the data ready for analysis in your tool of choice).
It works with any orchestration tool (cron, AirFlow or our own datapackage-pipelines).

GETTING STARTED
---------------


# Install from PyPI
$ pip install dataflows

# Inspect a remote CSV file
$ dataflows init https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv
Writing processing code into academy_csv.py
Running academy_csv.py
academy:
#     Year           Ceremony  Award                                 Winner  Name                            Film
      (string)      (integer)  (string)                            (string)  (string)                        (string)
----  ----------  -----------  --------------------------------  ----------  ------------------------------  -------------------
1     1927/1928             1  Actor                                         Richard Barthelmess             The Noose
2     1927/1928             1  Actor                                      1  Emil Jannings                   The Last Command
3     1927/1928             1  Actress                                       Louise Dresser                  A Ship Comes In
4     1927/1928             1  Actress                                    1  Janet Gaynor                    7th Heaven
5     1927/1928             1  Actress                                       Gloria Swanson                  Sadie Thompson
6     1927/1928             1  Art Direction                                 Rochus Gliese                   Sunrise
7     1927/1928             1  Art Direction                              1  William Cameron Menzies         The Dove; Tempest
...

# dataflows creates a local package of the data and a reusable processing script that you can tinker with
$ tree
.
├── academy_csv
│   ├── academy.csv
│   └── datapackage.json
└── academy_csv.py

1 directory, 3 files

# The resulting 'Data Package' is easy to use in Python
$ python
Python 3.6.1 (default, Mar 27 2017, 00:25:54)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datapackage import Package
>>> pkg = Package('academy_csv/datapackage.json')
>>> it = pkg.resources[0].iter(keyed=True)
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': None, 'Name': 'Richard Barthelmess', 'Film': 'The Noose'}
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': '1', 'Name': 'Emil Jannings', 'Film': 'The Last Command'}

# You can now run `academy_csv.py` to repeat the process,
# and modify it to add data-processing steps