Getting Started

Installation

From PyPI:

$ pip install pytubes

From source:

$ pip install -r build_requirements.txt
$ python setup.py install

Usage

Using tubes takes four steps:

  1. Import tubes.
  2. Create an input tube (currently either tubes.Each or tubes.Count) to get some data into the tube.
  3. Call methods on the input tube to build up each step of the processing (e.g. read_files().split().json()…).
  4. Iterate over the tube to generate the data, by either:
    • calling list(tube)
    • looping over it in a for-loop: for item in tube:
    • or calling x = iter(tube), and then next(x) repeatedly.
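
The three ways of consuming a tube listed in step 4 can be sketched as follows. This is a minimal illustration, not library documentation: it uses only Count and first, both of which appear in the examples below, and make_tube is a hypothetical helper that builds a fresh tube for each style. The itertools fallback is a stand-in added here so that the sketch still runs where pytubes is not installed.

```python
import itertools

try:
    from tubes import Count  # pytubes, if it is installed

    def make_tube():
        # Hypothetical helper: a fresh tube yielding 0, 1, 2.
        return Count().first(3)
except ImportError:
    def make_tube():
        # Stand-in (not part of pytubes) so the sketch runs without the library.
        return itertools.islice(itertools.count(), 3)

# Style 1: materialize every item at once.
as_list = list(make_tube())

# Style 2: pull items one at a time in a for-loop.
looped = []
for item in make_tube():
    looped.append(item)

# Style 3: drive the iterator by hand with next().
x = iter(make_tube())
manual = [next(x), next(x), next(x)]

print(as_list, looped, manual)
```

All three styles rely only on Python's iterator protocol, so they should produce the same items; list() is the most convenient for small results, while the for-loop and next() avoid holding everything in memory.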

Some Examples

>>> from tubes import Each, Count
>>> list(Count().first(5))
[0, 1, 2, 3, 4]
>>> from urllib.request import urlopen
>>> response = urlopen("https://dumps.wikimedia.org/other/pageviews/2019/2019-07/pageviews-20190716-140000.gz")
>>> dict(Each([response]).read_fileobj().gunzip(stream=True)  # Stream the response and gunzip it
        .tsv(sep=" ", skip_empty_rows=True)                   # Parse as a TSV file (with spaces not tabs)
        .skip_unless(lambda x: x.get(0).to(bytes).equals(b"en")) # EN wikipedia only
        .skip_unless(lambda x: x.get(2).to(int).gt(10_000))   # Only include pages with viewcount > 10,000
        .first(3)                                             # Get the first 3 only
        .multi(lambda x: (                                    # Extract column 1 (page title) and column 2 (view count)
            x.get(1).to(str),
            x.get(2).to(int))
        )
    )
{'-': 31066, 'Main_Page': 709331, 'Special:Search': 49869}
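
To make the column logic of the pipeline above easier to follow, here is a rough plain-Python equivalent that runs on a few in-memory lines instead of the downloaded file. It is only a sketch of the same filtering steps, not how tubes works internally. The sample lines and the popular_en_pages helper are hypothetical, made up for illustration; they assume the space-separated layout of language code, page title, view count, response size.

```python
from itertools import islice

# Hypothetical sample lines: language code, page title, view count, response size.
sample_lines = [
    b"de Wikipedia:Hauptseite 52000 0",
    b"en - 31066 0",
    b"en Main_Page 709331 0",
    b"",
    b"en Obscure_Page 12 0",
    b"en Special:Search 49869 0",
    b"en Another_Page 20000 0",
]

def popular_en_pages(lines, min_views=10_000, limit=3):
    """Illustrative stand-in for the tube pipeline, using plain generators."""
    rows = (line.split(b" ") for line in lines if line)     # split on spaces, skip empty rows
    en_rows = (r for r in rows if r[0] == b"en")            # column 0: EN wikipedia only
    busy = (r for r in en_rows if int(r[2]) > min_views)    # column 2: view count filter
    pairs = ((r[1].decode("utf-8"), int(r[2])) for r in busy)  # (title, view count)
    return dict(islice(pairs, limit))                       # keep the first `limit` matches

result = popular_en_pages(sample_lines)
print(result)
```

Like the tube version, each stage is lazy, so nothing after the third matching row is parsed.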