Daniel Rodriguez danielfrg

danielfrg / merge-files-hdfs-count-pipeline.py
Last active October 15, 2018 16:18
Luigi pipeline that: 1. Reads a set of TDF files from local storage and creates one large JSON file in HDFS 2. Uses a Hadoop MapReduce job to count the words (the word is a field on each JSON object)
import json
import luigi
import luigi.hdfs
import luigi.hadoop
import numpy
import pandas as pd

# Ship numpy and pandas with the Hadoop job so they can be imported on the worker nodes.
luigi.hadoop.attach(numpy, pd)
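The counting step described above can be sketched without a cluster. The following is a minimal, in-memory illustration of a mapper/reducer pair over JSON lines, not the gist's actual job; the field name `word` and the helper `run_local` are assumptions made for the example, and the grouping step only imitates what Hadoop's shuffle would do.

```python
import json
from collections import defaultdict


def mapper(line, field="word"):
    """Emit (field value, 1) for one JSON line. The field name is hypothetical."""
    record = json.loads(line)
    yield record[field], 1


def reducer(key, values):
    """Sum the counts collected for one key."""
    yield key, sum(values)


def run_local(lines, field="word"):
    """Imitate the shuffle in memory: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        for key, value in mapper(line, field):
            grouped[key].append(value)

    counts = {}
    for key in sorted(grouped):
        for out_key, total in reducer(key, grouped[key]):
            counts[out_key] = total
    return counts


# Illustrative input: one JSON object per line, as in the merged file.
sample = [
    json.dumps({"id": 1, "word": "luigi"}),
    json.dumps({"id": 2, "word": "hdfs"}),
    json.dumps({"id": 3, "word": "luigi"}),
]
word_counts = run_local(sample)
```

In a Luigi Hadoop job the same two functions would typically become the `mapper` and `reducer` methods of the job task, with Hadoop doing the grouping between them.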
danielfrg / clean-html-solr-pipeline.py
Last active May 6, 2019 12:45
Luigi pipeline that: 1. Reads a TDF file with pandas, where the 'content' column holds HTML, and creates another TDF file containing only the text of that HTML (extracted with BeautifulSoup) 2. Indexes the text into a Solr collection using mysolr
import re
import json
import luigi
import pandas as pd
from mysolr import Solr
from bs4 import BeautifulSoup
class InputText(luigi.ExternalTask):
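The HTML-cleaning step from the description can be illustrated in isolation. The gist uses BeautifulSoup for this; the sketch below deliberately swaps in the standard library's `html.parser` so that it runs without extra packages, so treat it as an approximation of the idea rather than the gist's code. The names `html_to_text` and `_TextExtractor` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces and collapse runs of whitespace.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def html_to_text(html):
    """Return the plain text of an HTML string (stdlib stand-in for BeautifulSoup)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello <b>Solr</b></p><script>x()</script></body></html>"
)
clean = html_to_text(page)
```

Applied to a pandas column, a function like this could be passed to `Series.apply` to produce the text-only column before indexing.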