Introduction to Data Engineering workshop, learn to build a data pipeline with Luigi!
This repository contains the files and data from the workshop as well as resources around Data Engineering. For the workshop (and after) we will use a Discord chatroom to keep the conversation going: https://discord.gg/86cYcgU.
Please also feel free to reach out to me directly via email at [email protected] or on Twitter at @memoryphoneme.
The presentation can be found on Slideshare here or in this repository (presentation.pdf). Video can be found here.
Throughout this workshop, you will learn how to make a scalable and sustainable data pipeline in Python with Luigi.
Prior experience with Python and the scientific Python stack is beneficial. The workshop will focus on using the Luigi framework, but will have code from the following libraries as well:
pip install -r requirements.txt
luigid --background --logdir logs
[port] is the port the luigid server has started on (luigid defaults to port 8082)
python ml-pipeline.py EvaluateModel --input-dir text --lam 0.8
python ml-pipeline.py BuildModels --input-dir text --num-topics 10 --lam 0.8
For parallelism, set --workers (note this is Task parallelism):
python ml-pipeline.py BuildModels --input-dir text --num-topics 10 --lam 0.8 --workers 4
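To make "Task parallelism" concrete: each worker picks up a whole task, and independent tasks run side by side, while the work inside a single task is not split up. The sketch below is not Luigi code. It is a standard-library analogy in which a thread pool stands in for Luigi's workers, and `count_words` and `run_tasks` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """One self-contained 'task': count whitespace-separated words."""
    return len(text.split())


def run_tasks(inputs, workers=4):
    """Run one task per input, with at most `workers` tasks in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, even if tasks finish out of order
        return list(pool.map(count_words, inputs))


if __name__ == "__main__":
    docs = ["only DOS files", "please e-mail", "thanx in advance"]
    print(run_tasks(docs, workers=4))  # prints [3, 2, 3]
```

Raising `workers` only helps when there are several tasks ready to run at the same time; a pipeline that is a single chain of dependent tasks gains nothing from it.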
hadoop fs -mkdir /tmp/text
hadoop fs -put ./data/text /tmp/text
hadoop fs -getmerge /tmp/text-count/2012-06-01 ./counts.txt
docker run -it -v /LOCAL/PATH/TO/REPO/data-engineering-101:/root/workshop clearspandex/pydata-seattle bash
pip2 install flask
text/                  20newsgroups text files
topmodel/              Stripe's topmodel evaluation library
example_luigi.py       example scaffold of a luigi pipeline
hadoop_word_count.py   example luigi pipeline using Hadoop
ml-pipeline.py         luigi pipeline covered in workshop
app.py                 Flask server to deploy a scikit-learn model
LICENSE                Details of rights of use and distribution
presentation.pdf       lecture slides from presentation
readme.md              this file!
The data (in the text/ folder) is from the 20 newsgroups dataset, a standard benchmarking dataset for machine learning and NLP. Each file in text corresponds to a single 'document' (or post) from one of two selected newsgroups (such as alt.atheism). The first line gives the group the document is from, and everything thereafter is the body of the post.
comp.sys.ibm.pc.hardware
I'm looking for a better method to back up files. Currently using a MaynStream 250Q that uses DC 6250 tapes. I will need to have a capacity of 600 Mb to 1Gb for future backups. Only DOS files.
I would be VERY appreciative of information about backup devices or manufacturers of these products. Flopticals, DAT, tape, anything.
If possible, please include price, backup speed, manufacturer (phone #?), and opinions about the quality/reliability.
Please E-Mail, I'll send summaries to those interested.
Thanx in advance,
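Given that layout (newsgroup on the first line, post body after it), a file can be split into a label and a text with a few lines of Python. This is a minimal sketch, not the code used in ml-pipeline.py; `parse_document` is a hypothetical helper name.

```python
def parse_document(raw):
    """Split the raw contents of one file into (newsgroup, body).

    Assumes the layout described above: the first line names the
    newsgroup and every following line belongs to the post body.
    """
    first_line, _, rest = raw.partition("\n")
    return first_line.strip(), rest.strip()


if __name__ == "__main__":
    sample = (
        "comp.sys.ibm.pc.hardware\n"
        "I'm looking for a better method to back up files.\n"
        "Only DOS files.\n"
    )
    group, body = parse_document(sample)
    print(group)  # the label, usable as a classification target
    print(body)   # the text, usable as model input
```

Keeping the label and body separate this way makes it straightforward to feed the bodies to a text vectorizer and the labels to a classifier.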
Copyright 2015 Jonathan Dinu.
All files and content licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License