Why and How to Use Pandas with Large Data by Admond Lee

Pandas is one of the most popular data science tools in the Python programming language for data wrangling and analysis. Real-world data is unavoidably messy, and Pandas is a game changer when it comes to cleaning, transforming, manipulating and analyzing it. In simple terms, Pandas helps to clean up the mess.

My Story of NumPy & Pandas

When I first started learning Python, I was naturally introduced to NumPy (Numerical Python). It is the fundamental package for scientific computing with Python and provides an abundance of useful features for operations on n-dimensional arrays and matrices. In addition, the library provides vectorization of mathematical operations on the NumPy array type, which significantly speeds up execution.

NumPy is cool. But there is still a need for higher-level data analysis tools. And this is where Pandas....
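To make the point about NumPy vectorization concrete, here is a minimal sketch. The array contents and variable names are made up for illustration and do not come from the article; the sketch only shows that one array expression can replace an element-by-element Python loop.

```python
import numpy as np

# Illustrative data: one million floats (an assumption, not from the article).
values = np.arange(1_000_000, dtype=np.float64)

# Element-by-element version: a Python-level loop over each value.
squared_loop = np.empty_like(values)
for i in range(values.size):
    squared_loop[i] = values[i] * values[i]

# Vectorized version: a single expression, executed in compiled code.
squared_vec = values * values

# Both approaches produce the same result.
print(np.array_equal(squared_loop, squared_vec))
```

The vectorized form is usually much faster because the loop runs inside NumPy rather than in the Python interpreter, though the exact speedup depends on the machine and the array size.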


March 6, 2019

1 Comment


Adrian Bool

5 years ago

Hi,

It looks like a large amount of the process you describe can be performed within read_csv itself, saving your computer from having to combine the data tables at the end, which is probably doubling your RAM requirement. It would be interesting to know if something like the following would permit you to load your dataset in one go:

    df_final = pd.read_csv(
        r'../input/data.csv',
        # Use the low memory option to tell Pandas to use the
        # chunk size concept internally to the read_csv function
        low_memory=True,
        # Only load the columns that are required to reduce memory
        # …
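The suggestion in the comment above can be sketched roughly as follows. This is not the commenter's full code, which is cut off. The in-memory CSV, the column names, and the `dtype` mapping are assumptions added for illustration; `low_memory`, `usecols` and `dtype` are real `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../input/data.csv so the sketch is self-contained.
csv_text = (
    "user_id,rating,comment\n"
    "1,4.5,good\n"
    "2,3.0,ok\n"
    "3,5.0,great\n"
)

df_final = pd.read_csv(
    io.StringIO(csv_text),
    # Parse the file in internal chunks (this is the pandas default).
    low_memory=True,
    # Only load the columns that are required (hypothetical column names).
    usecols=["user_id", "rating"],
    # Assumption: narrower dtypes than the int64/float64 defaults fit the data.
    dtype={"user_id": "int32", "rating": "float32"},
)

print(df_final.dtypes)
print(df_final.memory_usage(deep=True).sum(), "bytes")
```

Whether this loads a given large file in one go still depends on how much the selected columns and dtypes shrink it relative to available RAM.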
