.. _performance:

Pytubes' Performance
====================

To assess performance, a number of sample workloads have been implemented,
both in native Python and using pytubes, and their results compared.

.. image:: _static/perf_graph.png

Dataset: PyPI download stats
----------------------------

PyPI provides package download logs via Google BigQuery. These tables can be
downloaded in a number of formats, including gzipped, line-separated JSON
files.

One day's worth of download data, for the 14th December 2017, was taken.
Google provided this data as 38 gzip-compressed files, totalling 1.2 GB
(9.3 GB uncompressed).

How many records?::

    import tubes, glob

    print(list(
        tubes.Each(glob.glob("*.jsonz"))
        .read_files()
        .gunzip(stream=True)
        .chunk(1)
        .split()
        .enumerate()
        .slot(0)
    )[-1])

which prints::

    15,612,859

15 million package downloads happened on the 14th December 2017!

Each row looks similar to this::

    {
        "timestamp": "2017-12-14 00:42:55 UTC",
        "country_code": "US",
        "url": "/packages/02/ee/b6e02dc6529e82b75bb06823ff7d005b141037cb1416b10c6f00fc419dca/Pygments-2.2.0-py2.py3-none-any.whl",
        "file": {
            "filename": "Pygments-2.2.0-py2.py3-none-any.whl",
            "project": "pygments",
            "version": "2.2.0",
            "type": "bdist_wheel"
        },
        "details": {
            "installer": {
                "name": "pip",
                "version": "9.0.1"
            },
            "python": "3.4.3",
            "implementation": {
                "name": "CPython",
                "version": "3.4.3"
            },
            "distro": {
                "name": "Amazon Linux AMI",
                "version": "2017.03",
                "id": "n/a",
                "libc": {
                    "lib": "glibc",
                    "version": "2.17"
                }
            },
            "system": {
                "name": "Linux",
                "release": "4.4.35-33.55.amzn1.x86_64"
            },
            "cpu": "x86_64",
            "openssl_version": "OpenSSL 1.0.1k-fips 8 Jan 2015"
        },
        "tls_protocol": "TLSv1.2",
        "tls_cipher": "ECDHE-RSA-AES128-GCM-SHA256"
    }

So: many fields, with a nested structure. (The nesting doesn't actually help
pytubes' performance, so it seems fair to keep it in the tests.)

Extracting one field
~~~~~~~~~~~~~~~~~~~~

`Notebook 1 <_static/perf1.html>`_

Let's say our analysis requires just a single field of this dataset, for
example the country code, to examine which countries download the most.

The Python version (``FILES`` is the list of downloaded file names)::

    import gzip
    import json

    result = []
    for file_name in FILES:
        with gzip.open(file_name, "rt") as fp:
            for line in fp:
                data = json.loads(line)
                result.append(data.get("country_code"))

With pytubes::

    list(tubes.Each(FILES)
         .read_files()
         .gunzip(stream=True)
         .split(b'\n')
         .chunk(1)
         .json()
         .get("country_code", "null"))

Results:

+----------+-------------+---------+---------+
| Version  | Pure Python | pytubes | Speedup |
+----------+-------------+---------+---------+
| Time (s) | 254         | 19.6    | 12.9x   |
+----------+-------------+---------+---------+

About half of the pytubes time is spent gunzipping 9.3 GB of data.

Extracting one field without gunzip
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Doing the same thing as before, but with pre-expanded data, gives a different
picture:

`Notebook 2 <_static/perf2.html>`_

Python version::

    import json

    result = []
    for file_name in FILES:
        with open(file_name, "rt") as fp:
            for line in fp:
                data = json.loads(line)
                result.append(data.get("country_code"))

Pytubes version::

    list(tubes.Each(FILES)
         .read_files()
         .split(b'\n')
         .json()
         .get("country_code", "null"))

Results:

+----------+-------------+---------+---------+
| Version  | Pure Python | pytubes | Speedup |
+----------+-------------+---------+---------+
| Time (s) | 208         | 7.78    | 26.7x   |
+----------+-------------+---------+---------+
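Whichever version is used, the output is a plain Python list of country
codes, so the original question (which countries download the most) can be
answered with standard-library tooling. As a minimal sketch (not part of the
benchmark; it assumes ``result`` holds the codes produced by either version
above)::

    from collections import Counter

    # Tally downloads per country and show the ten most common.
    # Missing country codes appear as None in the pure-Python version,
    # and as "null" in the pytubes version (the default passed to .get()).
    for country, downloads in Counter(result).most_common(10):
        print(country, downloads)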
Extracting multiple fields
~~~~~~~~~~~~~~~~~~~~~~~~~~

Rather than just a single field, it may be more useful to extract multiple
fields from each record. In this test, the following set of 12 fields is
pulled from each record::

    timestamp
    country_code
    url
    file → filename
    file → project
    details → installer → name
    details → python
    details → system
    details → system → name
    details → cpu
    details → distro → libc → lib
    details → distro → libc → version

Each record is flattened into a tuple. The result is discarded, rather than
collected into a list, because the memory pressure of holding a dataset this
large would complicate the comparison. The code can be seen in
`Notebook 3 <_static/perf3.html>`_.

The performance improvement here is smaller, as the time is dominated by
Python object-allocation overheads.

+----------+-------------+---------+---------+
| Version  | Pure Python | pytubes | Speedup |
+----------+-------------+---------+---------+
| Time (s) | 355         | 87      | 4x      |
+----------+-------------+---------+---------+

Multiple fields, filtered
~~~~~~~~~~~~~~~~~~~~~~~~~

If the dataset can be filtered while it is loaded, much of the performance
benefit returns: for every filtered-out record, the allocation overhead is
avoided entirely. Loading a similar set of fields::

    timestamp
    country_code
    url
    file → filename
    file → project
    details → installer → name
    details → python
    details → system → name
    details → cpu
    details → distro → libc → lib
    details → distro → libc → version

but only for records where the ``country_code`` is ``GB`` gives:

+----------+-------------+---------+---------+
| Version  | Pure Python | pytubes | Speedup |
+----------+-------------+---------+---------+
| Time (s) | 523         | 7.43    | 70.4x   |
+----------+-------------+---------+---------+

Code here: `Notebook 4 <_static/perf4.html>`_
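For a concrete picture of what this test involves, here is a minimal
pure-Python sketch of the filtered extraction (an illustration rather than
the exact notebook code; it assumes ``FILES`` names the pre-expanded files,
as in Notebook 2)::

    import json

    for file_name in FILES:
        with open(file_name, "rt") as fp:
            for line in fp:
                data = json.loads(line)
                # Apply the filter first, so non-GB records are skipped
                # before any further work is done on them.
                if data.get("country_code") != "GB":
                    continue
                details = data.get("details") or {}
                file_info = data.get("file") or {}
                libc = (details.get("distro") or {}).get("libc") or {}
                row = (
                    data.get("timestamp"),
                    data.get("country_code"),
                    data.get("url"),
                    file_info.get("filename"),
                    file_info.get("project"),
                    (details.get("installer") or {}).get("name"),
                    details.get("python"),
                    (details.get("system") or {}).get("name"),
                    details.get("cpu"),
                    libc.get("lib"),
                    libc.get("version"),
                )
                # As in the multi-field test, the tuple is built and then
                # discarded rather than accumulated in a list.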